HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Replace long GROUP BY list with a subquery

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
groupwithlongreplacesubquerylist

Problem

This is a repost of my question on Stack Overflow. They suggested to ask it here:

I found an online article from 2005, where the author claims, that many devs use GROUP BY wrong, and that you should better replace it with a subquery.

I've tested it on one of my queries, where I need to sort the result of a search by the number of joined entries from another table (more common ones should appear first). My original, classic approach was to join both tables on a common ID, group by each field in the select list and order the result by the count of the sub table.

Now, Jeff Smith from the linked blog claims, that you should better use a subselect, which does all the grouping, and than join to that subselect. Checking the execution plans of both approaches, SSMS states, that the large group by requires 52% of the time and the subselect one 48%, so from a technical standpoint, it seems, that the subselect approach is actually marginally faster. However, the "improved" SQL command seems to generate a more complicated execution plan (in terms of nodes)

What do you think? Can you give me some detail about how to interpret the execution plans in this specific case and which one is generally the preferable option?

```
SELECT
a.ID,
a.ID_AddressType,
a.Name1,
a.Name2,
a.Street,
a.Number,
a.ZipCode,
a.City,
a.Country
FROM dbo.[Address] a
INNER JOIN CONTAINSTABLE(
dbo.[Address],
FullAddress,
'"ZIE*"',
5
) s ON a.ID = s.[KEY]
LEFT JOIN dbo.Haul h ON h.ID_DestinationAddress = a.ID
GROUP BY
a.ID,
a.ID_AddressType,
a.Name1,
a.Name2,
a.Street,
a.Number,
a.ZipCode,
a.City,
a.Country,
s.RANK
ORDER BY s.RANK DESC, COUNT(*) DESC;

SELECT
a.ID,
a.ID_AddressType,
a.Name1,
a.Name2,
a.Street,
a.Number,
a.ZipCode,
a.City,
a.Country
FROM dbo.[Address] a
INNER JOIN CONTAINSTABLE(
dbo.[Address],
FullAddress,
'"ZIE*"',
5
) s ON a.ID = s.[KEY]
LEFT JOIN

Solution

If you change the left join with dbo.Haul to a subquery, it will calculate these distinct values of ID_DestinationAddress (Stream Aggregate) and Count them (Compute scalar) directly after getting the data from the scan.

This is what you are seeing in the execution plan:

While, when using the GROUP BY method it is only doing the grouping after data passed through the left join between dbo.Haul and dbo.[Address].

How much better it is will depend on the unique value ratio of dbo.Haul. Less unique values means a better outcome for the second execution plan, since the left join has to process less values.

The other positive result of the second query is that only the uniqueness of ID_DestinationAddress is calculated, not the uniqueness of all the columns as a whole in the group by.

Again, you should test & validate the results for your query, dataset & indexes. One of the ways to test if you are not familiar with execution plans is setting SET STATISTICS IO, TIME ON; before executing the queries and making these runtime stats more readable by pasting them in a tool such as statisticsparser.

Testing

A small test to show what differences in data can do for these queries.

If the dbo.Haul table does not have many matches with the 5 records returned by the FULLTEXT index filtering, the difference is not so big:

Group by query plan

Subquery Query plan

1000 rows could get filtered earlier, but the execution time is around 15ms for both queries anyway on my machine.

Now, if I change my data so these 5 records have many more matches with dbo.Haul on the left join:

The difference between the group by query



And the Subquery becomes more clear

and the stats:

Code Snippets

<QueryTimeStats CpuTime="1564" ElapsedTime="1566" />
<QueryTimeStats CpuTime="680" ElapsedTime="690"/>

Context

StackExchange Database Administrators Q#250975, answer score: 8

Revisions (0)

No revisions yet.