by James Casey
7 February, 2019 - 11 minute read

Winton has a voracious appetite for data, with many areas of the business storing large amounts of it in SQL Server databases.

The great flexibility of a relational database like SQL Server is the ability to write arbitrary SQL statements to pull your data out in whatever shape is required. When presented with an SQL query, the “optimiser” comes into play: it parses the query and constructs a plan to extract the data in an efficient way. In theory, that is. The flip side of this flexibility is that the job of the optimiser is very hard, and it is not uncommon for even quite simple queries to perform slowly.

The optimiser needs information to do its job; the query alone is not enough. When we build our database, we add keys defining the relationships between the data tables, indexes to make searching the data easier, and statistics on the distribution of data within the columns.

By looking at all this information, the optimiser can make informed decisions about which order to join the tables in our query together, what methods to use to join them together, which indexes to use, when and where it needs to perform operations like sorting and so on. It constructs a tree of operations, which is then executed from the leaves up to the root, where the final result is returned.

Often a performance issue will be down to the optimiser needing more information than we have provided. A typical example is a missing index. Even if there are indexes on a table, if the columns in the index do not cover the columns our query is filtering on, then the entire table may need to be scanned.

This sort of performance issue can be relatively easy to spot and fix. If we look at the execution plan and see that it is scanning a large table when we know it only needs a small slice of the data, the indexing should be the first thing we check.
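As a rough diagnostic aid, SQL Server also records the indexes the optimiser wished it had while compiling queries. The sketch below reads those suggestions from the missing-index dynamic management views; treat the output as a hint to investigate, not a list of indexes to create blindly:

SELECT mid.statement AS table_name,
       mid.equality_columns,
       mid.inequality_columns,
       mid.included_columns,
       migs.user_seeks
FROM sys.dm_db_missing_index_details AS mid
INNER JOIN sys.dm_db_missing_index_groups AS mig
	ON mig.index_handle = mid.index_handle
INNER JOIN sys.dm_db_missing_index_group_stats AS migs
	ON migs.group_handle = mig.index_group_handle;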

But sometimes we will have suitable indexes in place and still get poor performance, such as in the example that follows. Through it, we will delve into how the optimiser makes some of its decisions, how those decisions sometimes go wrong, and how to nudge it back on track. This is a simplified version of a real performance issue we fixed recently, which involved more complex queries with more tables.

Our performance issue

Originally we had a query like this:

SELECT *
FROM SomeTable 
INNER JOIN Skewed ON Skewed.SomeTableId = SomeTable.SomeTableId
WHERE SkewedData = 1;

However, the developer did not like that hard-coded constant of “1”. What does that magic number mean? So, they created a variable with a meaningful name and used that instead:

DECLARE @meaningfullyNamedVariable INT = 1;

SELECT *
FROM SomeTable 
INNER JOIN Skewed ON Skewed.SomeTableId = SomeTable.SomeTableId
WHERE SkewedData = @meaningfullyNamedVariable;

This seems like a reasonable approach. The queries are logically equivalent and return the same data, but on making the change the developer discovered that the performance deteriorated. Let’s look at why that might be.

Setup

(As an aside, we feel it is worth emphasising that we have no tables with such generic names in the real world: these are simply examples we have created for the purpose of demonstration.)

We set them up as follows: two tables linked by a foreign key, with an index covering the query we want to perform:

CREATE TABLE dbo.SomeTable
(
	SomeTableId INT NOT NULL CONSTRAINT pk_SomeTable PRIMARY KEY IDENTITY(1,1),
	SomeData INT NOT NULL
);

CREATE TABLE dbo.Skewed 
(
	SkewedId INT NOT NULL CONSTRAINT pk_Skewed PRIMARY KEY IDENTITY(1,1), 
	SomeTableId INT NOT NULL CONSTRAINT fk_Skewed_SomeTable FOREIGN KEY REFERENCES dbo.SomeTable(SomeTableId), 
	SkewedData INT NOT NULL
);

CREATE INDEX idx_Skewed ON Skewed(SkewedData, SomeTableId);

Now, let’s fill the tables with some data. To reproduce the issue we are demonstrating, a reasonably large dataset is required.

Into the first table we insert a million sequential numbers. While there are many ways to do this, the approach used here is efficient and does not rely on the existence of any tables from which to draw numbers:

;WITH e1(n) AS
(
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
	SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), -- 10
e2(n) AS (SELECT 1 FROM e1 CROSS JOIN e1 AS b), -- 10*10
e3(n) AS (SELECT 1 FROM e1 CROSS JOIN e2 AS b), -- 10*100
e4(n) AS (SELECT 1 FROM e3 CROSS JOIN e3 AS b)  -- 1000*1000
INSERT INTO dbo.SomeTable(SomeData)
SELECT ROW_NUMBER() OVER (ORDER BY n) FROM e4;

As the name suggests, we want the Skewed table’s data to be skewed. As we will see, an uneven data distribution can affect how a query is executed. In this example, we make it very uneven to demonstrate the difference it can make. We put a million rows in this table too, but we populate half the table with the value “1” and the other half with different values. We can leverage the first table’s data to generate this skewed data and link it to the first table at the same time:

-- 500000 rows where Skewed.SkewedData = 1
INSERT INTO dbo.Skewed(SomeTableId, SkewedData)
SELECT SomeTableId, 1
FROM dbo.SomeTable
WHERE SomeTableId <= 500000;

-- 500000 different values 
INSERT INTO dbo.Skewed(SomeTableId, SkewedData)
SELECT SomeTableId, SomeData
FROM dbo.SomeTable
WHERE SomeTableId > 500000;

Investigation

Let’s examine the execution plans to see what is happening. Put both queries into SSMS, click “Include Actual Execution Plan” and run both queries together in a batch:

Execution plans

First, note that even though the second query is slower, the execution plans indicate that the first query is going to take 100% of the batch cost. This is because even when we get the “actual” rather than “estimated” plans, the figures are still the estimated costs.

Sometimes, the reason we have a slow query is precisely because the estimates are inaccurate, which makes this especially unhelpful when it comes to performance tuning. This also goes for the percentages shown for individual nodes. The plans are useful for seeing how the query is being performed, but be mindful that the numbers are only estimates and may not be very accurate.
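Because the plan percentages are estimates, it is often better to measure the real work directly. One way, shown here as a sketch using the two queries from earlier, is to switch on I/O and timing statistics before running the batch:

SET STATISTICS IO, TIME ON;

SELECT *
FROM SomeTable 
INNER JOIN Skewed ON Skewed.SomeTableId = SomeTable.SomeTableId
WHERE SkewedData = 1;

DECLARE @meaningfullyNamedVariable INT = 1;
SELECT *
FROM SomeTable 
INNER JOIN Skewed ON Skewed.SomeTableId = SomeTable.SomeTableId
WHERE SkewedData = @meaningfullyNamedVariable;

SET STATISTICS IO, TIME OFF;

The Messages tab then reports the logical reads and CPU time actually incurred by each statement, which are measured rather than estimated.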

Both queries are using the same indexes; the only difference is how the two tables are joined together. The original query produced a merge join, but the introduction of the variable has changed it to a nested loop join.

Why is that? It comes down to the information available to the query optimiser when deciding how to perform the query. After deciding what indexes to use, it has to decide on the most efficient method of joining them together. The most salient information that the optimiser uses to make this decision relates to how much data it is expecting to find. To estimate this, it looks at the statistics on the first column of the chosen indexes.

Let’s take a look at these statistics:

Statistics; statistics detail
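For readers following along, the same information can be printed with DBCC SHOW_STATISTICS, using the statistics object that SQL Server created alongside the index:

DBCC SHOW_STATISTICS ('dbo.Skewed', idx_Skewed);

The output has three result sets: a header, a density vector, and the histogram shown above.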

Here we see the optimiser has produced a histogram of the distribution of data within the table for the first column in the index: “SkewedData”, in our example. Notice that the numbers are not quite exact, because it does not look at every row when producing this histogram, only a sample. We could “fix” this by running:

UPDATE STATISTICS dbo.Skewed WITH FULLSCAN;

This would produce a more precise histogram. But since that is not actually going to change any behaviour in this instance, we leave this as an additional exercise for interested readers to pursue.

With the hard-coded query, the optimiser is able to look up the hard-coded value in the histogram:

Estimated rows for hard-coded query

The optimiser is expecting to find a lot of rows in the SkewedData column matching the value “1” and so it decides that it would be quickest to zipper the tables together with a merge join.

With the second query, there is a variable, so the optimiser will try to create a plan that works for an arbitrary value for the variable. While it might seem odd not to use the current value of the variable to obtain the best plan for this current query, SQL Server is thinking about the bigger picture. Compiling an execution plan is relatively expensive, so it aims to create a plan that can be reused if it receives a similar query in future.
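This plan reuse can be observed directly: the plan cache DMVs show how many times each cached plan has been executed. A minimal sketch, filtering to statements that mention our table:

SELECT cp.usecounts, cp.objtype, st.text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE st.text LIKE '%Skewed%';

Running the variable-based query twice and re-running this check should show its usecounts increase, confirming the plan was reused rather than recompiled.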

Unlike many other languages, T-SQL has no concept of constants. So even though we know it is a constant, the optimiser does not: it sees a variable. Instead of a firm number, the optimiser will therefore calculate a reasonable expectation for the number of rows it will find for an arbitrary value of the variable. The details of how it calculates this cardinality estimate are beyond the scope of this article, but in this case, it comes up with an estimate of just under 4:

Estimated rows for query with variable

Expecting to only find four rows, it decides that the best way to do the join is by looping through those four values and looking each one up in the other table. When it runs, it actually finds 500,000 rows, and looping through those is what kills the performance. When looking at execution plans, a large disparity between the estimated and actual number of rows is a red flag indicating a potential problem relating to statistics. Similarly inefficient plans arise when statistics are stale and need to be refreshed.
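Where does a number like that come from? For an unknown variable value the optimiser falls back on the density information stored with the statistics (roughly, the column's density multiplied by the table's row count, though the exact arithmetic depends on the cardinality estimator version in use). The density vector can be inspected on its own with:

DBCC SHOW_STATISTICS ('dbo.Skewed', idx_Skewed) WITH DENSITY_VECTOR;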

Possible solutions

Having found the root of the problem, what can we do to fix it?

We could force the join to be done using a merge by adding a query hint:

SELECT *
FROM SomeTable 
INNER MERGE JOIN Skewed ON Skewed.SomeTableId = SomeTable.SomeTableId
WHERE SkewedData = @meaningfullyNamedVariable;

As a rule, however, it is best to avoid this type of hint. It might work initially, but if the distribution of data changes over time, a merge may no longer be the best method. It also forces the join order, so if the query involves additional tables, this could also cause suboptimal performance.

Another option might be to force a recompilation each time the query is run:

SELECT *
FROM SomeTable 
INNER JOIN Skewed ON Skewed.SomeTableId = SomeTable.SomeTableId
WHERE SkewedData = @meaningfullyNamedVariable
OPTION (RECOMPILE);

This recompilation will use the current values of the variables and get us the merge we want. But compilation does not come for free, and if this query were run multiple times, we would not be able to benefit from cached execution plans.

Better would be to revert to the hard-coded value, adding a comment explaining what it means. This gets us the performance benefits, and the comment does much the same job as the variable name:

SELECT *
FROM SomeTable 
INNER JOIN Skewed ON Skewed.SomeTableId = SomeTable.SomeTableId
WHERE SkewedData = 1; -- Comment explaining this mystery constant

This is our default approach to handling constants, and this was how the real issue was handled.

But what if we have some reason to keep it as a variable? Perhaps it is not a true constant. For example, it could be read from some other place into the variable, and could just usually take the value “1”. We do have an option in that case, in the form of another query hint:

DECLARE @meaningfullyNamedVariable INT = 1;
SELECT *
FROM SomeTable 
INNER JOIN Skewed ON Skewed.SomeTableId = SomeTable.SomeTableId
WHERE SkewedData = @meaningfullyNamedVariable
OPTION (OPTIMIZE FOR (@meaningfullyNamedVariable = 1));

This does what we might expect: it plugs the value “1” into the variable for the purposes of running the optimiser, and gives the desired merge join as if the value was hard-coded.

It is worth noting the relationship to “parameter sniffing”, where the optimiser will use the first parameters passed into a stored procedure to inform its execution plans. If our query was inside a procedure, to which @meaningfullyNamedVariable was a parameter, parameter sniffing would pick up its value and achieve our desired result automatically, provided that the value “1” was passed the first time the procedure was called.

Unfortunately, if some other value is passed first, the stored plan will be based on that value instead, giving poor performance when 1 is subsequently passed. This negative context is often how people first become aware of parameter sniffing, and it can therefore wrongly be characterised as a problem rather than a feature.
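To illustrate, here is a minimal sketch of the query wrapped in a procedure; the procedure name is our own invention for demonstration purposes:

CREATE PROCEDURE dbo.GetBySkewedData
	@meaningfullyNamedVariable INT
AS
BEGIN
	SELECT *
	FROM SomeTable 
	INNER JOIN Skewed ON Skewed.SomeTableId = SomeTable.SomeTableId
	WHERE SkewedData = @meaningfullyNamedVariable;
END;
GO

-- The first execution compiles the plan using the sniffed value,
-- and subsequent calls reuse that plan:
EXEC dbo.GetBySkewedData @meaningfullyNamedVariable = 1;

Had the first call passed a rarer value instead, the cached plan would likely be a nested loop join, and a later call with “1” would inherit it.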

To summarise the lessons from this exercise: