HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlCritical

Huge slowdown to SQL Server query on adding wildcard (or top)

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
wildcardtopsqlslowdownqueryaddinghugeserver

Problem

I've got a zoo of 20 million animals which I track on my SQL Server 2005 database. About 1% of them are black and about 1% of them are swans. I wanted to get details of all the black swans and so, not wanting to swamp the results page I did:

select top 10 * 
from animal 
where colour like 'black'  
and species like 'swan'


(Yes, inadvisedly those fields are freetext but they are both indexed). It turns out we have no such animals as the query returns an empty set in about 300 milliseconds. It would have been about twice as fast if I'd used '=' rather than 'like' but I have a premonition the latter is about to save me some typing.

It turns out the head zookeeper thinks he may have entered some of the swans as 'blackish' so I modify the query accordingly:

select top 10 * 
from animal  
where colour like 'black%' 
and species like 'swan'


Turns out there are none of those either (and in fact there are no 'black%' animals except 'black' ones) but the query now takes about 30 seconds to return empty.

It seems it's only the combination of 'top' and 'like %' causing trouble though because

select count(*) 
from animal  
where colour like 'black%' 
and species like 'swan'


returns 0 very quickly, and even

select * 
from animal 
where colour like 'black%' 
and species like 'swan'


returns empty in a fraction of a second.

Does any one have any idea why 'top' and '%' should conspire to cause such a dramatic loss of performance, especially in an empty result set?

EDIT: Just to clarify, I'm not using any FreeText indexes, I just meant that the fields are freetext at the point of entry, i.e. not normalized in the database. Sorry for the confusion, poor wording on my part.

Solution

This is a decision of the cost based optimiser.

The estimated costs used in this choice are incorrect as it assumes statistical independence between values in different columns.

It is similar to the issue described in Row Goals Gone Rogue where the even and odd numbers are negatively correlated.

It is easy to reproduce.

CREATE TABLE dbo.animal(
    id int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    colour varchar(50) NOT NULL,
    species varchar(50) NOT NULL,
    Filler char(10) NULL
);

/*Insert 20 million rows with 1% black and 1% swan but no black swans*/
WITH T
     AS (SELECT TOP 20000000 ROW_NUMBER() OVER (ORDER BY @@SPID) AS RN
         FROM   master..spt_values v1,
                master..spt_values v2,
                master..spt_values v3)
INSERT INTO dbo.animal
            (colour,
             species)
SELECT CASE
         WHEN RN % 100 = 1 THEN 'black'
         ELSE CAST(RN % 100 AS VARCHAR(3))
       END,
       CASE
         WHEN RN % 100 = 2 THEN 'swan'
         ELSE CAST(RN % 100 AS VARCHAR(3))
       END
FROM   T 

/*Create some indexes*/
CREATE NONCLUSTERED INDEX ix_species ON  dbo.animal(species);
CREATE NONCLUSTERED INDEX ix_colour ON  dbo.animal(colour);


Now try

SELECT TOP 10 *
FROM   animal
WHERE  colour LIKE 'black'
       AND species LIKE 'swan'


This gives the plan below which is costed at 0.0563167.

The plan is able to perform a merge join between the results of the two indexes on the id column. (More details of merge join algorithm here).

Merge join requires both inputs to be ordered by the joining key.

The nonclustered indexes are ordered by (species, id) and (colour, id) respectively (nonunique non clustered indexes always have the row locator added in to the end of the key implicitly if not added explicitly). The query without any wildcards is performing an equality seek into species = 'swan' and colour ='black'. As each seek is only retrieving one exact value from the leading column the matching rows will be ordered by id therefore this plan is possible.

Query plan operators execute from left to right. With the left operator requesting rows from its children, which in turn request rows from their children (and so on until the leaf nodes are reached). The TOP iterator will stop requesting any more rows from its child once 10 have been received.

SQL Server has statistics on the indexes that tell it that 1% of the rows match each predicate. It assumes that these statistics are independent (i.e. not correlated either positively or negatively) so that on average once it has processed 1,000 rows matching the first predicate it will find 10 matching the second and can exit. (the plan above actually shows 987 rather than 1,000 but close enough).

In fact as the predicates are negatively correlated the actual plan shows that all 200,000 matching rows needed to be processed from each index but this is mitigated to some extent because the zero joined rows also means zero lookups were actually needed.

Compare with

SELECT TOP 10 *
FROM   animal
WHERE  colour LIKE 'black%'
       AND species LIKE 'swan'


Which gives the plan below which is costed at 0.567943

The addition of the trailing wildcard has now caused an index scan. The cost of the plan is still quite low though for a scan on a 20 million row table.

Adding querytraceon 9130 shows some more information

SELECT TOP 10 *
FROM   animal
WHERE  colour LIKE 'black%'
       AND species LIKE 'swan'       
OPTION (QUERYTRACEON 9130)


It can be seen that SQL Server reckons it will only need to scan around 100,000 rows before it finds 10 matching the predicate and the TOP can stop requesting rows.

Again this makes sense with the independence assumption as 10 100 100 = 100,000

Finally lets try and force an index intersection plan

SELECT TOP 10 *
FROM   animal WITH (INDEX(ix_species), INDEX(ix_colour))
WHERE  colour LIKE 'black%'
       AND species LIKE 'swan'


This gives a parallel plan for me with estimated cost of 3.4625

The main difference here is that the colour like 'black%' predicate can now match multiple different colours. This means the matching index rows for that predicate are no longer guaranteed to be sorted in order of id.

For example the index seek on like 'black%' might return the following rows

+------------+----+
|   Colour   | id |
+------------+----+
| black      | 12 |
| black      | 20 |
| black      | 23 |
| black      | 25 |
| blackberry |  1 |
| blackberry | 50 |
+------------+----+


Within each colour the ids are ordered but the ids across different colours may well not be.

As a result SQL Server can no longer perform a merge join index intersection (without adding a blocking sort operator) and it opts to perform a hash join instead. Hash Join is blocking on the build input so now the cost reflects the fact that all matching rows will need to be processed from the build input rather than assuming it will only have to

Code Snippets

CREATE TABLE dbo.animal(
    id int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    colour varchar(50) NOT NULL,
    species varchar(50) NOT NULL,
    Filler char(10) NULL
);

/*Insert 20 million rows with 1% black and 1% swan but no black swans*/
WITH T
     AS (SELECT TOP 20000000 ROW_NUMBER() OVER (ORDER BY @@SPID) AS RN
         FROM   master..spt_values v1,
                master..spt_values v2,
                master..spt_values v3)
INSERT INTO dbo.animal
            (colour,
             species)
SELECT CASE
         WHEN RN % 100 = 1 THEN 'black'
         ELSE CAST(RN % 100 AS VARCHAR(3))
       END,
       CASE
         WHEN RN % 100 = 2 THEN 'swan'
         ELSE CAST(RN % 100 AS VARCHAR(3))
       END
FROM   T 

/*Create some indexes*/
CREATE NONCLUSTERED INDEX ix_species ON  dbo.animal(species);
CREATE NONCLUSTERED INDEX ix_colour ON  dbo.animal(colour);
SELECT TOP 10 *
FROM   animal
WHERE  colour LIKE 'black'
       AND species LIKE 'swan'
SELECT TOP 10 *
FROM   animal
WHERE  colour LIKE 'black%'
       AND species LIKE 'swan'
SELECT TOP 10 *
FROM   animal
WHERE  colour LIKE 'black%'
       AND species LIKE 'swan'       
OPTION (QUERYTRACEON 9130)
SELECT TOP 10 *
FROM   animal WITH (INDEX(ix_species), INDEX(ix_colour))
WHERE  colour LIKE 'black%'
       AND species LIKE 'swan'

Context

StackExchange Database Administrators Q#55198, answer score: 76

Revisions (0)

No revisions yet.