
How can this larger subquery be (much) faster?

Submitted by: @import:stackexchange-dba

Problem

In my organisation, Ricoh is the supplier of printers and copiers. Papercut is used to register which copies are made and on which account. Another application working with Ricoh registers copies in a different database, but the two are somewhat linked.

My organisation wanted a central place to retrieve the total cost, so I wrote some queries. Most work fine, but something weird is going on with the one below and I'm unable to figure out why. What could be the reason for the following issue?

It's hard to make a db<>fiddle as the databases are quite complicated. I'll try to explain the issue here.

In essence there are two main tables. A table maintained by Ricoh contains client identifiers (ClientId), a price (the cost of the copy work), print timestamps and a job identifier. The table maintained by Papercut also has a timestamp, a printed flag and a reference to the corresponding JobId in its job_comment column.

Now consider the following query, which gives the total print cost for jobs printed through the separate Ricoh application.

SELECT ricoh.ClientId,
       SUM(ricoh.price) AS cost
FROM
  (SELECT a.SystemId AS ClientId,
          a.ProcessInternalUid AS JobId,
          a.price
   FROM   ricoh.printhistory AS a
   WHERE  a.Exportdate >= '2020-10-01'
     AND  a.Exportdate <= '2020-12-12') AS ricoh  -- upper bound reconstructed; this line was garbled in the snippet
  INNER JOIN
  (SELECT substring(b.job_comment, 16, 36) AS JobId
   FROM   papercut.usage_log AS b                 -- Papercut table name not preserved in the snippet; placeholder
   WHERE  b.usage_date >= '2020-10-01'
     AND  b.usage_date <= '2020-12-12'
     AND  b.printed = 'Y'
   GROUP BY substring(b.job_comment, 16, 36)) AS papercut ON (papercut.JobId = ricoh.JobId)
GROUP BY ricoh.ClientId


If I run this query I have instant results, where the subqueries ricoh and papercut have 96,321 and 9,354 rows respectively.

But if I change the lower bound of the date range to 2020-11-01, the query takes 49 minutes (!), while the subqueries have 46,223 and 4,547 records respectively.

How can this be? How can a join with larger subqueries be so much faster? I'm really puzzled by this. What could be a reason for this increase in time? (Then I can try to debug further.)

Solution

Execution Plans

Different search predicates (aka WHERE clauses) can lead to different execution plans and/or different execution durations. This is what you have already observed.

When you change the WHERE clause and if the query isn't parameterised, then SQL Server's Query Optimizer will try and create a new best execution plan based on the information provided in the query.

To find the best execution plan, the Query Optimizer will take a peek at the statistics of the indexes/columns that could possibly be involved in the future execution plan. Based on the information (data histogram) available in those statistics, the Query Optimizer will choose one index or another.

That is a very short explanation on how SQL Server goes about creating an optimal execution plan based on the statistics.
Now what can impact the creation of a not so optimal execution plan?

If the data in the statistics is old or outdated, or hasn't been automatically updated because the trigger value for an automatic statistics update hasn't been reached, then the Query Optimizer may assume that one index is better than another and make a bad choice.

Trigger values for automatic statistics updates are reached when the following number of rows has changed:

  • older than SQL Server 2016: 20% of total rows + 500 rows

  • SQL Server 2016 or newer: Square Root (1000 * number of rows in the table)
Reference: Statistics (Microsoft | SQL Docs)
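
For a hypothetical table of 1,000,000 rows, the two trigger values work out as follows (plain arithmetic, no table needed):

```sql
-- Rows that must change before an automatic statistics update is triggered,
-- for a hypothetical table of 1,000,000 rows
SELECT 1000000 * 0.20 + 500   AS trigger_pre_2016,   -- 200,500 rows
       SQRT(1000.0 * 1000000) AS trigger_2016_plus;  -- ~31,623 rows
```

On large tables the newer formula fires much earlier, which is why large tables tend to have fresher statistics on SQL Server 2016 and later.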
Your Execution Plans

I took the liberty of having a look at your execution plans in SentryOne's Plan Explorer (free as in free beer, highly recommended, not affiliated). The advantage is that the tool displays the number of rows for each branch of the execution plan.
Good Plan

(plan screenshot not preserved in this snippet)

Bad Plan

(plan screenshot not preserved in this snippet)

Observations

The execution plans are basically the same. It's just that the Clustered Index Scan on [dbo].[RCH_DM_Printhistory] in the bad/slow plan is at the beginning, while in the good/fast plan this step is performed later on.
Good Plan

If you have a look at the red numbers just above the lines, you will note that the good plan retrieves:

  • 2'874 rows from the Nested Loop part

  • 14'808 rows from the Clustered Index Scan

The join only has to compare 2'874 rows with 14'808 rows, which is pretty fast.
Bad Plan

If you have a look at the numbers for the bad plan, the query has to retrieve:

  • 5'759 rows from the Clustered Index Scan

  • 4'048'577 rows from the Nested Loop part

In the bad plan the join has to compare 5'759 rows with over 4 million rows, which is going to be slow no matter what.

The Query Optimizer normally creates a plan that places tables/indexes with lower row counts at the beginning, in order to reduce the amount of data it has to compare with other tables later on in the execution plan.

Generally the execution plans seem to be the best possible execution plans for your statements.
Estimated Rows and Statistics

As pointed out by others, the estimated number of rows is always 1 compared to the actual number of rows retrieved. This is bad: it means the Query Optimizer is relying on statistics that contain wrong information about the distribution of values in the indexes, or on statistics that only sampled a small portion of the table(s).
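
A shorter check than the full script at the end of the post, using the table name from the execution plans, shows when each statistic on the table was last updated and how much of it was sampled:

```sql
-- Last update time and sampling info for every statistic on the table
SELECT s.name                              AS statistic,
       STATS_DATE(s.object_id, s.stats_id) AS last_updated,
       sp.rows,
       sp.rows_sampled,
       sp.modification_counter
FROM   sys.stats AS s
       CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE  s.object_id = OBJECT_ID('dbo.RCH_DM_PRINTHISTORY');
```

A large gap between rows and rows_sampled, or a high modification_counter with an old last_updated, points to exactly the stale-statistics problem described above.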
Possible Solutions

(See script at end of post for retrieving pre-formatted statements and information on trigger level for status updates)

- Update the statistics for the Clustered Index ORDERBY on the table [dbo].[RCH_DM_PRINTHISTORY]. You can retrieve the statements to update the statistics from my query at the end of the post.

- Ensure Auto Update Statistics has been turned on for your database(s).

SELECT sdb.name, 
       sdb.is_auto_create_stats_on, 
       sdb.is_auto_update_stats_on, 
       sdb.is_auto_update_stats_async_on 
FROM sys.databases AS sdb


The value for is_auto_update_stats_on should be 1.
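
If the value is 0, it can be enabled as follows ([YourDatabase] is a placeholder; evaluate the impact before changing this in production):

```sql
-- Enable automatic statistics updates for a database
ALTER DATABASE [YourDatabase] SET AUTO_UPDATE_STATISTICS ON;
```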

- Ensure Auto Create Statistics has been turned on for your database(s).

SELECT sdb.name, 
       sdb.is_auto_create_stats_on, 
       sdb.is_auto_update_stats_on, 
       sdb.is_auto_update_stats_async_on 
FROM sys.databases AS sdb


The value for is_auto_create_stats_on should be 1.
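
Again, if the value is 0 it can be enabled ([YourDatabase] is a placeholder):

```sql
-- Enable automatic creation of missing column statistics
ALTER DATABASE [YourDatabase] SET AUTO_CREATE_STATISTICS ON;
```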

- Create the missing indexes mentioned in the execution plans. Be wary though: sometimes the so-called "Missing Indexes" are already available. Test in a non-productive environment.

- Re-write the query using temporary tables or CTEs.
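
A sketch of the temp-table variant, reusing the column names from the question (the Papercut table name is a placeholder, as it isn't preserved in this snippet). Materialising the small Papercut side first gives the optimizer an exact row count for the join input:

```sql
-- Materialise the small Papercut side first, so the optimizer gets
-- exact cardinality for the join input
SELECT substring(b.job_comment, 16, 36) AS JobId
INTO   #papercut
FROM   papercut.usage_log AS b          -- placeholder table name
WHERE  b.usage_date >= '2020-11-01'
  AND  b.usage_date <= '2020-12-12'
  AND  b.printed = 'Y'
GROUP BY substring(b.job_comment, 16, 36);

SELECT a.SystemId    AS ClientId,
       SUM(a.price)  AS cost
FROM   ricoh.printhistory AS a
       JOIN #papercut AS p
         ON p.JobId = a.ProcessInternalUid
WHERE  a.Exportdate >= '2020-11-01'
  AND  a.Exportdate <= '2020-12-12'
GROUP BY a.SystemId;
```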

Statement to Retrieve Statistics Information

SELECT 'DBCC SHOW_STATISTICS ([' + [sch].[NAME] + '.' +[so].[NAME] + '] , [' + [ss].[NAME] + ']) WITH STAT_HEADER'       AS [SHOW_STATISTICS],
       'update statistics ' + [sch].[name] + '.' + [so].[name] + ' ' + [ss].[name] + ' WITH FULLSCAN'                    AS [UPDATE_STATISTICS], -- PAGECOUNT=100, ROWCOUNT=100 | FULLSCAN
       'DBCC UPDATEUSAGE(' + DB_NAME() + ', ''' + [sch].[NAME] + '.' +[so].[NAME] + ''')'                                AS [UPDATE_USAGE],
       [sch].[name] + '.' + [so].[name]                        AS [TableName],
       [ss].[name]                                             AS [Statistic],
       [sp].[last_updated]                                     AS [StatsLastUpdated],
       [sp].[rows]                                             AS [RowsInTable],
       [sp].[rows_sampled]                                     AS [RowsSampled],
       [sp].[modification_counter]                             AS [RowModifications],
       100 / (1.0 * [sp].[rows]) * [sp].[modification_counter] AS [PercentChanged],
       SQRT(1000 * [sp].[rows])                                AS [> 2014 Algorithm Change Value],
       CASE 
            WHEN SQRT(1000 * [sp].[rows]) < [sp].[modification_counter] THEN 1
            ELSE 0
       END                                                     AS [Auto Update > 2014 Triggered],
       [sp].[rows] * 1.0 / 100 * 20 + 500                      AS [<=2014 Algorithm Change Value],
       CASE 
            WHEN [sp].[rows] * 1.0 / 100 * 20 + 500 < [sp].[modification_counter] THEN 1
            ELSE 0
       END                                                     AS [Auto Update <= 2014 Triggered]
FROM   [sys].[stats] [ss]
       JOIN [sys].[objects] [so]
            ON  [ss].[object_id] = [so].[object_id]
       JOIN [sys].[schemas] [sch]
            ON  [so].[schema_id] = [sch].[schema_id]
       OUTER APPLY [sys].[dm_db_stats_properties]([so].[object_id], [ss].[stats_id]) sp
WHERE  1 = 1
       AND [so].[type] = 'U'
           -- AND [sp].[modification_counter] > 0
           -- AND 100/(1.0*[sp].[rows])*[sp].[modification_counter] < 10.0     -- maximum percentage change (certain tables have a high volatility)
           -- AND 100/(1.0*[sp].[rows])*[sp].[modification_counter] > 0.001    -- minimum percentage change (we aren't going to be looking at statistics with a very low percentage of change)
           -- AND [sp].[rows] > 1000000                                        -- only look at statistics which contain more than 1'000'000 rows.
           -- AND [sp].[last_updated] < dateadd(hh,-1,getdate())               -- only look at statistics which have been updated more than an hour ago
       AND [sch].[name] = 'dbo'
       AND [so].[name] = 'RCH_DM_PRINTHISTORY'
       -- AND [ss].[name] NOT LIKE '_WA_Sys%'                                  -- Exclude automatically created statistics
       -- AND [ss].[name] NOT LIKE '_dta_stat%'                                -- Exclude statistics created by the Database Engine Tuning Advisor

Context

StackExchange Database Administrators Q#281565, answer score: 4
