COUNT(*) OVER ()
Problem
I have a COUNT(*) OVER () as part of my query to fetch a large amount of data, with about 15 joins to tables (some big, some small). What's the best approach for getting the count here?

- A separate query to find the count.
- COUNT(*) OVER () as part of the query that fetches the data (even though it seems to calculate the count for every row?!).
- A query to get the count, UNIONed with the query that fetches the data. (Of course, I would have to put NULL in all of the other columns that the data query fetches.)
Solution
COUNT(*) OVER () is one of those operations that sounds like it ought to be cheap for the query optimizer. After all, SQL Server already knows how many rows are returned by the query; you're just asking it to project that value into the result set. Unfortunately, COUNT(*) OVER () can be an expensive operation depending on the query that you're adding it to.

For a simple test I'll put a moderate amount of data into a table. Really any table will do, but here is my sample data for those following along at home:

```sql
SELECT
  CAST(REPLICATE('A', 100) AS VARCHAR(100)) ID1
, CAST(REPLICATE('Z', 100) AS VARCHAR(100)) ID2
INTO #JUNK_DATA
FROM master..spt_values t1
CROSS JOIN master..spt_values t2;
```

This query takes 2.5 seconds when discarding result sets:

```sql
SELECT ID1, ID2
FROM #JUNK_DATA;
```

This query takes 2.7 seconds when run serially, but it can parallelize:

```sql
SELECT COUNT(*)
FROM #JUNK_DATA
OPTION (MAXDOP 1);
```

The query with the COUNT(*) OVER () window aggregate takes 47.2 seconds to complete on my machine:

```sql
SELECT ID1, ID2, COUNT(*) OVER () CNT
FROM #JUNK_DATA;
```

Perhaps my machine has issues, but it's definitely doing more work. Here's the query plan:
SQL Server is loading all of the column values into a table spool then reading that data twice and joining it together. Why is it doing all of that work when it already knows how many rows are in the result set?
SQL Server query plans process data one row at a time (more or less) in a streaming fashion, so if the plan calculated the number of rows as it went, it might look like this:
╔═════╦═════╦══════════════╗
║ ID1 ║ ID2 ║ COUNT_SO_FAR ║
╠═════╬═════╬══════════════╣
║ A ║ B ║ 1 ║
║ C ║ D ║ 2 ║
║ E ║ F ║ 3 ║
║ ... ║ ... ║ ... ║
║ ZZZ ║ ZZZ ║ 6431296 ║
╚═════╩═════╩══════════════╝

That operation doesn't require a separate copy of the data, but how can the final value of 6431296 be applied to all of the previous rows? Sometimes this operation is implemented as the double spool that we saw in the query plan. It is possible to imagine more efficient internal algorithms that we'd like SQL Server to use, but we obviously don't have direct control over that.
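To make the difference concrete, here is a small Python sketch (illustrative only, not how SQL Server is actually implemented): a running count can be emitted row by row in one streaming pass, but stamping the final total onto every row forces all rows to be buffered first, which is the moral equivalent of the table spool.

```python
def running_count(rows):
    """Streaming: emit (row, count_so_far) one row at a time.

    No buffering is needed, because each output value depends only
    on the rows seen so far -- this mirrors ROW_NUMBER().
    """
    n = 0
    for row in rows:
        n += 1
        yield row, n

def total_count(rows):
    """COUNT(*) OVER (): every row must carry the final total.

    The total is unknown until the last row arrives, so the rows
    must be buffered (like the table spool) before any are emitted.
    """
    buffered = list(rows)          # the "spool"
    total = len(buffered)          # second pass over the spool
    return [(row, total) for row in buffered]

rows = ["A", "B", "C"]
print(list(running_count(rows)))   # [('A', 1), ('B', 2), ('C', 3)]
print(total_count(rows))           # [('A', 3), ('B', 3), ('C', 3)]
```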
As I see it, here are your options to solve your problem:

- Fix the application. SQL Server already knows how many rows the query will return. Using @@ROWCOUNT or some other client-side method is by far the most preferable.
- Run a separate query to get the count. This can be a good choice if you're returning a large result set to the client and your data is guaranteed not to change between the two queries. The separate COUNT may benefit from parallelism and from having the relevant tables already loaded into the buffer cache. Depending on the indexes on the tables, SQL Server may be able to do less IO to get the count than to produce the full result set.
- Add the COUNT(*) OVER () aggregate to the query. This can be a good choice if your typical result set is small; that way you don't load as much data into the spool.

The rest of this answer contains information about some tricks that can be done with window functions. Not all of them are relevant to your version of SQL Server, 2008.
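As an aside, the first two options can be sketched with a quick script. This is an illustrative sketch only, using Python's built-in SQLite in place of SQL Server, with a made-up junk_data table:

```python
import sqlite3

# Hypothetical stand-in table, loosely modeled on #JUNK_DATA above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE junk_data (id1 TEXT, id2 TEXT)")
conn.executemany(
    "INSERT INTO junk_data VALUES (?, ?)",
    [("A" * 10, "Z" * 10)] * 1000,
)

# Option: a separate query just for the count. Note the caveat from
# the list above -- the data must not change between the two queries
# (or both must run under one consistent snapshot/transaction).
(count,) = conn.execute("SELECT COUNT(*) FROM junk_data").fetchone()

# Option: fix the application. Fetch the data once and count the
# rows as the client consumes them; no extra work for the server.
rows = conn.execute("SELECT id1, id2 FROM junk_data").fetchall()

print(count, len(rows))  # 1000 1000
```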
A simple ROW_NUMBER() addition does not require the result set to be put into a table spool. If you think about the earlier example, that makes sense: SQL Server can just calculate the value as it goes, and it never needs to apply a value to a previous row. This query ran in 3.1 seconds on my machine:

```sql
SELECT
  ID1
, ID2
, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM #JUNK_DATA;
```

We can see this in the query plan:
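Consuming that RN column from the client could look like the following Python sketch (hypothetical client code; the (id1, id2, rn) tuples stand in for whatever your data-access layer returns for the query above):

```python
def consume_with_row_number(fetched_rows):
    """Client-side: keep the data and remember the last RN seen.

    fetched_rows is an iterable of (id1, id2, rn) tuples, as produced
    by the ROW_NUMBER() query.  Because RN increases by one per row,
    the RN of the last row *is* the total row count -- no second
    query and no COUNT(*) OVER () spool needed.
    """
    data = []
    total = 0
    for id1, id2, rn in fetched_rows:
        data.append((id1, id2))
        total = rn  # monotonically increasing; final value = count
    return data, total

rows = [("A", "Z", 1), ("B", "Y", 2), ("C", "X", 3)]
data, total = consume_with_row_number(rows)
print(total)  # 3
```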
If you change your application to simply take the maximum value of RN as the count of rows, then this could be a good option for you.

You might be tempted to use ROW_NUMBER() along with a UNION ALL to get the value that you're looking for in a final dummy row, similar to what you proposed in your question. For example:

```sql
SELECT
  t.ID1
, t.ID2
, CASE WHEN ROW_COUNT_ROW = 'Y' THEN -1 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) ELSE NULL END ROW_COUNT
FROM
(
    SELECT
      ID1
    , ID2
    , 'N' ROW_COUNT_ROW
    FROM #JUNK_DATA

    UNION ALL

    SELECT
      NULL ID1
    , NULL ID2
    , 'Y' ROW_COUNT_ROW
) t;
```

The above query finished in 3.4 seconds and returned the correct results for me. However, because it relies on an unspecified join order, it can also return the wrong results. Looking at the plan:
The part of the query in blue has to be the top part of the concatenation, and the part of the query in red has to be the bottom half. The query optimizer is free to rearrange the query plan or to implement the UNION ALL with a different physical operator, so you may not always get correct results.

As of SQL Server 2012 it's possible to use the LEAD window function to accomplish something similar to the above example. LEAD returns NULL on the last row (since there is no following row), so only the last row carries the count:

```sql
SELECT
  t.ID1
, t.ID2
, CASE WHEN LD IS NULL THEN RN ELSE NULL END ROW_COUNT
FROM
(
    SELECT
      ID1
    , ID2
    , ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
    , LEAD(ID1) OVER (ORDER BY (SELECT NULL)) LD
    FROM #JUNK_DATA
) t;
```
Context
StackExchange Database Administrators Q#164493, answer score: 11