HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlModerate

Query to count interactions between users

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
queryinteractionsbetweencountusers

Problem

The source data represents interaction between people. Some are internal, others are external.

The internals are recorded in the Users table (represented as the Users CTE in this demonstration).

Each Entries record is identified by an ID (ItemID), the time of the interaction (Sequence) and who performed the interaction (UserID).

The goal is to have a single line per ItemID with the following columns:

  • ItemID - self explanatory



  • FirstSequence - Sequence value of first interaction (remember in reality this is a time-stamp)



  • firstInternal - UserID of first user that IsInternal



  • FirstInternalSequence - Sequence of firstInternal



  • CountPerFirstInternal - A count of all interactions by firstInternal user



  • CountAllInternal - A count of all interaction with any IsInternal user



  • CountAll - Count of all interactions for ItemID



  • LastSequence - The last interaction for ItemID - allows to measure the 'age' of the interaction.



```
----- Demo source data BEGIN
WITH Users AS (
SELECT 'A' AS UserID UNION ALL
SELECT 'B' AS UserID
)
, Entries AS (
SELECT 10001 ItemID, 'X' AS UserID, 101 AS Sequence UNION ALL
SELECT 10001 ItemID, 'A' AS UserID, 102 AS Sequence UNION ALL
SELECT 10001 ItemID, 'X' AS UserID, 103 AS Sequence UNION ALL
SELECT 10001 ItemID, 'B' AS UserID, 104 AS Sequence UNION ALL
SELECT 10001 ItemID, 'X' AS UserID, 105 AS Sequence UNION ALL
SELECT 10001 ItemID, 'A' AS UserID, 106 AS Sequence
UNION ALL
SELECT 10020 ItemID, 'Y' AS UserID, 201 AS Sequence UNION ALL
SELECT 10020 ItemID, 'Y' AS UserID, 202 AS Sequence UNION ALL
SELECT 10020 ItemID, 'B' AS UserID, 203 AS Sequence UNION ALL
SELECT 10020 ItemID, 'Y' AS UserID, 204 AS Sequence UNION ALL
SELECT 10020 ItemID, 'A' AS UserID, 205 AS Sequence UNION ALL
SELECT 10020 ItemID, 'Y' AS UserID, 206 AS Sequence UNION ALL
SELECT 10020 ItemID, 'B' AS UserID, 207 AS Sequence UNION ALL
SELECT 10020

Solution

Before we begin with the code...

I just want to address one thing regarding test cases with sample data. To get the best out of a performance review of your queries, try to provide a sample that's as close as possible to your real data. You stated:

Difference between the Demo code and real life:

  • Items are in tables and not UNION ALL CTEs



  • The list represented by Entries in the demo, in reality is 800K rows, and takes 8 seconds to retrieve.



  • Sequence column is actually a date.



While (2) would be difficult to replicate on a small scale, (1) and (3) are fairly simple. I modified your sample data in the following ways to match your real life data more closely:

  • Created temp tables #Users and #Entries including keys and indexes (clustered indexes created automatically on primary key constraints



  • N/A - cannot produce 800K rows of demo data



  • Changed sequence column to DATETIME type and seeded demo data using a RAND() formula with DATEADD(). While not completely identical, it should be "close enough".



New demo data:

----- Demo source data BEGIN
IF OBJECT_ID('tempdb..#Users') IS NOT NULL DROP TABLE #Users;
IF OBJECT_ID('tempdb..#Entries') IS NOT NULL DROP TABLE #Entries;
GO
CREATE TABLE #Users (
    UserID VARCHAR(100) NOT NULL,
    CONSTRAINT PK_#Users PRIMARY KEY (UserID)
);
CREATE TABLE #Entries (
    ItemID INT NOT NULL,
    UserID VARCHAR(100) NULL,
    Sequence DATETIME NOT NULL,
    CONSTRAINT PK_#Entries PRIMARY KEY (ItemID, Sequence),
    CONSTRAINT FK_#Users FOREIGN KEY (UserID) REFERENCES #Users(UserID)
);
GO
INSERT INTO #Users (UserID)
    SELECT 'A' UNION ALL
    SELECT 'B' ;
INSERT INTO #Entries (ItemID, UserID, Sequence)
    SELECT 10001 ItemID, 'X' AS UserID, DATEADD(HOUR, (RAND() * 1000), GETDATE()) AS Sequence UNION ALL
    SELECT 10001 ItemID, 'A' AS UserID, DATEADD(HOUR, (RAND() * 1000), GETDATE()) AS Sequence UNION ALL
    -- etc.
    SELECT 10300 ItemID, 'A' AS UserID, DATEADD(HOUR, (RAND() * 1000), GETDATE()) AS Sequence ;
GO
----- Demo source data END


For reference to others looking at this, the result set after running the whole query with sample data is as follows:

Performance

This being the meat of your question, let's start by looking at our execution plan, which I ran based on the above sample data. I added markers 1-4 which caught my attention and will address individually.

Note: I will make changes mainly in formatting as we go along.

  1. Duplicated Index Scans



Both of those identical scans come from the Src CTE, which is called from 2 the other 2 CTEs separately. I was looking for a way to eliminate the left join in favor of an existence check, however due to needing the u.UserID in your IsInternal field we will have to keep this join. One possibility, if this kind of operation (checking whether a user is internal) is something that is done frequently in your code base, you may consider adding an IsInternal boolean/bit column in Entries so you could eliminate this join altogether from your code base when you need to check if an entry is internal.

I cannot tell you exactly how to optimize that CTE otherwise, but since you are scanning the same source data sets twice, perhaps consider storing the result set inside a temp table, which presumably might be a smaller set than the entire two original tables.

WITH Src AS (
    SELECT
            Src_Ent.ItemID
        ,   Src_Ent.UserID
        ,   Src_Ent.Sequence
        /*If the user for this item sequence is not found in Users, we mark it as Internal.*/
        ,   (CASE WHEN Src_Usr.UserID IS NULL THEN 0 ELSE 1 END) AS IsInternal
    FROM #Entries AS Src_Ent
    LEFT JOIN #Users AS Src_Usr
        ON Src_Usr.UserID = Src_Ent.UserID
)


Improvements to formatting: Changed table aliases to make query (and execution plan) easier to read. #Entries AS Src_Ent was e and #Users AS Src_Usr was u. I also wrapped the CASE expression in round brackets to help isolate it visually from its alias. I added a bit of documentation to the CASE statement.

  1. Sort #1 - Src_UserID_Sub



This Sort results from the GROUP BY clause of Src_UserID_Sub subquery. Unfortunately it's not possible to eliminate this expensive sort, as rows must be sorted prior to being grouped. It's possible that this would be less expensive if you used a temp table as suggested in step (1) if you had a clustered index, for example by making an artificial primary key such as a RowID INT IDENTITY(1,1) column on the temp table.
  1. Sort #2 - Src_UserID with ROW_NUMBER()



This Sort is also impossible to eliminate with your current logic, otherwise the following error will be raised: The function 'ROW_NUMBER' must have an OVER clause with ORDER BY. It might be possible to do away with ROW_NUMBER(), but that might also be more harmful than beneficial, as it would likely require a loop or some other construct that is not very "SQL-ish", and in the end, the query optimizer can probably work better with this

Code Snippets

----- Demo source data BEGIN
IF OBJECT_ID('tempdb..#Users') IS NOT NULL DROP TABLE #Users;
IF OBJECT_ID('tempdb..#Entries') IS NOT NULL DROP TABLE #Entries;
GO
CREATE TABLE #Users (
    UserID VARCHAR(100) NOT NULL,
    CONSTRAINT PK_#Users PRIMARY KEY (UserID)
);
CREATE TABLE #Entries (
    ItemID INT NOT NULL,
    UserID VARCHAR(100) NULL,
    Sequence DATETIME NOT NULL,
    CONSTRAINT PK_#Entries PRIMARY KEY (ItemID, Sequence),
    CONSTRAINT FK_#Users FOREIGN KEY (UserID) REFERENCES #Users(UserID)
);
GO
INSERT INTO #Users (UserID)
    SELECT 'A' UNION ALL
    SELECT 'B' ;
INSERT INTO #Entries (ItemID, UserID, Sequence)
    SELECT 10001 ItemID, 'X' AS UserID, DATEADD(HOUR, (RAND() * 1000), GETDATE()) AS Sequence UNION ALL
    SELECT 10001 ItemID, 'A' AS UserID, DATEADD(HOUR, (RAND() * 1000), GETDATE()) AS Sequence UNION ALL
    -- etc.
    SELECT 10300 ItemID, 'A' AS UserID, DATEADD(HOUR, (RAND() * 1000), GETDATE()) AS Sequence ;
GO
----- Demo source data END
WITH Src AS (
    SELECT
            Src_Ent.ItemID
        ,   Src_Ent.UserID
        ,   Src_Ent.Sequence
        /*If the user for this item sequence is not found in Users, we mark it as Internal.*/
        ,   (CASE WHEN Src_Usr.UserID IS NULL THEN 0 ELSE 1 END) AS IsInternal
    FROM #Entries AS Src_Ent
    LEFT JOIN #Users AS Src_Usr
        ON Src_Usr.UserID = Src_Ent.UserID
)
, Src_UserID AS (
    SELECT
            Src_UserID_Sub.ItemID
        ,   Src_UserID_Sub.IsInternal
        ,   Src_UserID_Sub.UserID
        ,   Src_UserID_Sub.CountPerUser
        ,   Src_UserID_Sub.FirstUserSequence
        ,   ROW_NUMBER() OVER (
                PARTITION BY 
                    Src_UserID_Sub.ItemID
                  , Src_UserID_Sub.IsInternal
                ORDER BY 
                    Src_UserID_Sub.FirstUserSequence
            ) AS [RowCount]
        FROM (
            /*This subquery is used to get the number of entries per user, 
            as well as the earliest sequence related to said entries*/
            SELECT
                    Src.ItemID
                ,   Src.IsInternal
                ,   Src.UserID
                ,   COUNT(*) AS CountPerUser
                ,   MIN(Src.Sequence) AS FirstUserSequence
                FROM Src
                GROUP BY Src.ItemID, Src.IsInternal, Src.UserID
        ) AS Src_UserID_Sub
SELECT
        items.ItemID
    ,   items.FirstSequence
    ,   internal.firstInternal
    ,   internal.FirstInternalSequence
    ,   internal.CountPerFirstInternal
    ,   items.CountAllInternal
    ,   items.CountAll
    ,   items.LastSequence
FROM Src_Items AS items
JOIN Src_FirstInternal AS internal
    ON internal.ItemID = items.ItemID
;

Context

StackExchange Code Review Q#114250, answer score: 10

Revisions (0)

No revisions yet.