HiveBrain v1.2.0
Get Started
← Back to all entries
snippetMinor

How can I make this aggregation query more efficient?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
thiscanquerymakemoreefficienthowaggregation

Problem

I have a query that's taking quite a long time on my machine (7 minutes) to execute and I was wondering whether I could make it (significantly) quicker:

SELECT
    rec.[Id] AS RecordId,
    MIN(rec.[CreationDate]) AS RecordCreationDate,
    MIN(rec.[LastModified]) AS RecordLastModified,
    MIN(rec.[AssetType]) AS RecordAssetType,
    MIN(rec.[MasterFilename]) AS RecordMasterFilename,
    MIN(rec.[GameName]) AS RecordGameName,
    usr.[OrganizationName],
    COUNT(hist.[Id]) AS TimesDownloaded
FROM
(
    SELECT
        innerRec.Id,
        MIN(innerRec.CreationDate) AS CreationDate,
        MIN(innerRec.LastModified) AS LastModified,
        MIN(innerRec.AssetType) AS AssetType,
        MIN(innerRec.MasterFilename) AS MasterFilename,
        MIN(innerRec.GameName) AS GameName
    FROM
        [dbo].[Record] innerRec INNER JOIN [dbo].[RecordClassificationLink] innerLnk ON innerRec.Id = innerLnk.RecordId
    -- WHERE (classification ID is foo or bar)
    GROUP BY
        innerRec.Id
    -- HAVING COUNT(innerLnk.ClassificationId) = (number of specified classifications)
) rec
CROSS JOIN
    [dbo].[AdamUser] usr
LEFT JOIN
    (SELECT * FROM [dbo].[MaintenanceJobHistory] WHERE [CreatedOn] > '2016-01-01 00:00:00' AND [CreatedOn] < '2016-12-01 00:00:00') hist ON usr.Name = hist.AccessingUser AND rec.Id = hist.RecordId
GROUP BY
    rec.Id, usr.OrganizationName


What it's doing is extracting data to be put in an Excel spreadsheet report (whether a spreadsheet is a good presentation of this data is outside the scope of this question :-) )

The first subquery pulls out records optionally filtered by a list of classification IDs. These are then cross joined with the user table, because each user table row actually contains the info we really need for this: the user's organization name. I then left join the maintenance job history table (storing an entry for each record download) in order to create multiple rows if a record has been accessed multiple times, then group

Solution

The stuff that is probably taking a lot of time is a large number of Sort operations in your query plan. You can preempt those by sorting the data yourself, in the form of indexes.

Here are some index suggestions that I think would get you started:

CREATE INDEX IX_RecordId
    ON dbo.RecordClassificationLink (RecordId);

CREATE INDEX IX_MaintenanceJobHistory_ByUser
    ON dbo.MaintenanceJobHistory (AccessingUser, RecordId);

CREATE UNIQUE INDEX UIX_AdamUser_OrgnName
    ON dbo.AdamUser (OrganizationName, Name);


Then, you could modify your query a little bit to help the optimizer make some smart choices, like aggregating certain data streams before they're joined and create a much larger product that will take more time to aggregate:

SELECT
    rec.[Id] AS RecordId,
    MIN(rec.[CreationDate]) AS RecordCreationDate,
    MIN(rec.[LastModified]) AS RecordLastModified,
    MIN(rec.[AssetType]) AS RecordAssetType,
    MIN(rec.[MasterFilename]) AS RecordMasterFilename,
    MIN(rec.[GameName]) AS RecordGameName,
    usr.[OrganizationName],
    SUM(ISNULL(hist._count, 0)) AS TimesDownloaded
FROM
(
    SELECT
        innerRec.Id,
        MIN(innerRec.CreationDate) AS CreationDate,
        MIN(innerRec.LastModified) AS LastModified,
        MIN(innerRec.AssetType) AS AssetType,
        MIN(innerRec.MasterFilename) AS MasterFilename,
        MIN(innerRec.GameName) AS GameName
    FROM [dbo].[Record] innerRec
    INNER JOIN [dbo].[RecordClassificationLink] innerLnk ON innerRec.Id = innerLnk.RecordId
    -- WHERE (classification ID is foo or bar)
    GROUP BY
        innerRec.Id
    -- HAVING COUNT(innerLnk.ClassificationId) = (number of specified classifications)
) AS rec
CROSS JOIN [dbo].[AdamUser] AS usr
LEFT JOIN
    (SELECT AccessingUser, RecordId, COUNT(*) AS _count
     FROM [dbo].[MaintenanceJobHistory]
     WHERE [CreatedOn] > '2016-01-01 00:00:00' AND
           [CreatedOn] < '2016-12-01 00:00:00'
     GROUP BY AccessingUser, RecordId
    ) AS hist ON
        usr.Name = hist.AccessingUser AND
        rec.Id = hist.RecordId
GROUP BY
    rec.Id, usr.OrganizationName;


I've done the following:

  • the hist subquery is aggregated on AccessingUser, RecordId, and I've created a COUNT(*) AS _count. This query uses the new index IX_MaintenanceJobHistory_ByUser to perform very efficiently without any memory grant or hash tables.



  • Because of this, I've replaced COUNT(hist.Id) with SUM(ISNULL(hist._count, 0)) AS TimesDownloaded



  • The new index on dbo.RecordClassificationLink helps perform a smooth join with the Record table, but if you add your WHERE and HAVING, that index won't help you.



  • The new index on dbo.AdamUser also improves performance by eliminating a Sort operator - because you aggregate on the OrganizationName column, might as well have your data sorted on this from the get-go.



In my mind, that should give you the same result, but it's late here, so you'll have to verify the results yourself. :)

Here's my query plan:

EDIT: You could also simplify the rec part a bit - probably be a bit easier to read:

SELECT
    rec.[Id] AS RecordId,
    rec.[CreationDate] AS RecordCreationDate,
    rec.[LastModified] AS RecordLastModified,
    rec.[AssetType] AS RecordAssetType,
    rec.[MasterFilename] AS RecordMasterFilename,
    rec.[GameName] AS RecordGameName,
    usr.[OrganizationName],
    SUM(ISNULL(hist._count, 0)) AS TimesDownloaded
FROM
(
    SELECT
        Id,
        CreationDate,
        LastModified,
        AssetType,
        MasterFilename,
        GameName
    FROM [dbo].[Record]
    WHERE Id IN (
        SELECT RecordId
        FROM [dbo].[RecordClassificationLink]
        --- WHERE ClassificationId=...
        )
) AS rec
CROSS JOIN [dbo].[AdamUser] AS usr
LEFT JOIN
    (SELECT AccessingUser, RecordId, COUNT(*) AS _count
     FROM [dbo].[MaintenanceJobHistory]
     WHERE [CreatedOn] > '2016-01-01 00:00:00' AND
           [CreatedOn] < '2016-12-01 00:00:00'
     GROUP BY AccessingUser, RecordId
    ) AS hist ON
        usr.Name = hist.AccessingUser AND
        rec.Id = hist.RecordId
GROUP BY
    usr.OrganizationName,
    rec.[Id],
    rec.[CreationDate],
    rec.[LastModified],
    rec.[AssetType],
    rec.[MasterFilename],
    rec.[GameName],
    usr.[OrganizationName];


... and the plan looks a tiny bit better as well (look for the red Lazy spool at the bottom of the original plan, that is now gone.)

Code Snippets

CREATE INDEX IX_RecordId
    ON dbo.RecordClassificationLink (RecordId);

CREATE INDEX IX_MaintenanceJobHistory_ByUser
    ON dbo.MaintenanceJobHistory (AccessingUser, RecordId);

CREATE UNIQUE INDEX UIX_AdamUser_OrgnName
    ON dbo.AdamUser (OrganizationName, Name);
SELECT
    rec.[Id] AS RecordId,
    MIN(rec.[CreationDate]) AS RecordCreationDate,
    MIN(rec.[LastModified]) AS RecordLastModified,
    MIN(rec.[AssetType]) AS RecordAssetType,
    MIN(rec.[MasterFilename]) AS RecordMasterFilename,
    MIN(rec.[GameName]) AS RecordGameName,
    usr.[OrganizationName],
    SUM(ISNULL(hist._count, 0)) AS TimesDownloaded
FROM
(
    SELECT
        innerRec.Id,
        MIN(innerRec.CreationDate) AS CreationDate,
        MIN(innerRec.LastModified) AS LastModified,
        MIN(innerRec.AssetType) AS AssetType,
        MIN(innerRec.MasterFilename) AS MasterFilename,
        MIN(innerRec.GameName) AS GameName
    FROM [dbo].[Record] innerRec
    INNER JOIN [dbo].[RecordClassificationLink] innerLnk ON innerRec.Id = innerLnk.RecordId
    -- WHERE (classification ID is foo or bar)
    GROUP BY
        innerRec.Id
    -- HAVING COUNT(innerLnk.ClassificationId) = (number of specified classifications)
) AS rec
CROSS JOIN [dbo].[AdamUser] AS usr
LEFT JOIN
    (SELECT AccessingUser, RecordId, COUNT(*) AS _count
     FROM [dbo].[MaintenanceJobHistory]
     WHERE [CreatedOn] > '2016-01-01 00:00:00' AND
           [CreatedOn] < '2016-12-01 00:00:00'
     GROUP BY AccessingUser, RecordId
    ) AS hist ON
        usr.Name = hist.AccessingUser AND
        rec.Id = hist.RecordId
GROUP BY
    rec.Id, usr.OrganizationName;
SELECT
    rec.[Id] AS RecordId,
    rec.[CreationDate] AS RecordCreationDate,
    rec.[LastModified] AS RecordLastModified,
    rec.[AssetType] AS RecordAssetType,
    rec.[MasterFilename] AS RecordMasterFilename,
    rec.[GameName] AS RecordGameName,
    usr.[OrganizationName],
    SUM(ISNULL(hist._count, 0)) AS TimesDownloaded
FROM
(
    SELECT
        Id,
        CreationDate,
        LastModified,
        AssetType,
        MasterFilename,
        GameName
    FROM [dbo].[Record]
    WHERE Id IN (
        SELECT RecordId
        FROM [dbo].[RecordClassificationLink]
        --- WHERE ClassificationId=...
        )
) AS rec
CROSS JOIN [dbo].[AdamUser] AS usr
LEFT JOIN
    (SELECT AccessingUser, RecordId, COUNT(*) AS _count
     FROM [dbo].[MaintenanceJobHistory]
     WHERE [CreatedOn] > '2016-01-01 00:00:00' AND
           [CreatedOn] < '2016-12-01 00:00:00'
     GROUP BY AccessingUser, RecordId
    ) AS hist ON
        usr.Name = hist.AccessingUser AND
        rec.Id = hist.RecordId
GROUP BY
    usr.OrganizationName,
    rec.[Id],
    rec.[CreationDate],
    rec.[LastModified],
    rec.[AssetType],
    rec.[MasterFilename],
    rec.[GameName],
    usr.[OrganizationName];

Context

StackExchange Database Administrators Q#157602, answer score: 3

Revisions (0)

No revisions yet.