patternsqlModerate
Query to count interactions between users
Viewed 0 times
queryinteractionsbetweencountusers
Problem
The source data represents interaction between people. Some are internal, others are external.
The internals are recorded in the
Each
The goal is to have a single line per
```
----- Demo source data BEGIN
WITH Users AS (
SELECT 'A' AS UserID UNION ALL
SELECT 'B' AS UserID
)
, Entries AS (
SELECT 10001 ItemID, 'X' AS UserID, 101 AS Sequence UNION ALL
SELECT 10001 ItemID, 'A' AS UserID, 102 AS Sequence UNION ALL
SELECT 10001 ItemID, 'X' AS UserID, 103 AS Sequence UNION ALL
SELECT 10001 ItemID, 'B' AS UserID, 104 AS Sequence UNION ALL
SELECT 10001 ItemID, 'X' AS UserID, 105 AS Sequence UNION ALL
SELECT 10001 ItemID, 'A' AS UserID, 106 AS Sequence
UNION ALL
SELECT 10020 ItemID, 'Y' AS UserID, 201 AS Sequence UNION ALL
SELECT 10020 ItemID, 'Y' AS UserID, 202 AS Sequence UNION ALL
SELECT 10020 ItemID, 'B' AS UserID, 203 AS Sequence UNION ALL
SELECT 10020 ItemID, 'Y' AS UserID, 204 AS Sequence UNION ALL
SELECT 10020 ItemID, 'A' AS UserID, 205 AS Sequence UNION ALL
SELECT 10020 ItemID, 'Y' AS UserID, 206 AS Sequence UNION ALL
SELECT 10020 ItemID, 'B' AS UserID, 207 AS Sequence UNION ALL
SELECT 10020
The internals are recorded in the
Users table (represented as the Users CTE in this demonstration).Each
Entries record is identified by an ID (ItemID), the time of the interaction (Sequence) and who performed the interaction (UserID).The goal is to have a single line per
ItemID with the following columns:ItemID- self explanatory
FirstSequence-Sequencevalue of first interaction (remember in reality this is a time-stamp)
firstInternal-UserIDof first user thatIsInternal
FirstInternalSequence-SequenceoffirstInternal
CountPerFirstInternal- A count of all interactions byfirstInternaluser
CountAllInternal- A count of all interaction with anyIsInternaluser
CountAll- Count of all interactions forItemID
LastSequence- The last interaction forItemID- allows to measure the 'age' of the interaction.
```
----- Demo source data BEGIN
WITH Users AS (
SELECT 'A' AS UserID UNION ALL
SELECT 'B' AS UserID
)
, Entries AS (
SELECT 10001 ItemID, 'X' AS UserID, 101 AS Sequence UNION ALL
SELECT 10001 ItemID, 'A' AS UserID, 102 AS Sequence UNION ALL
SELECT 10001 ItemID, 'X' AS UserID, 103 AS Sequence UNION ALL
SELECT 10001 ItemID, 'B' AS UserID, 104 AS Sequence UNION ALL
SELECT 10001 ItemID, 'X' AS UserID, 105 AS Sequence UNION ALL
SELECT 10001 ItemID, 'A' AS UserID, 106 AS Sequence
UNION ALL
SELECT 10020 ItemID, 'Y' AS UserID, 201 AS Sequence UNION ALL
SELECT 10020 ItemID, 'Y' AS UserID, 202 AS Sequence UNION ALL
SELECT 10020 ItemID, 'B' AS UserID, 203 AS Sequence UNION ALL
SELECT 10020 ItemID, 'Y' AS UserID, 204 AS Sequence UNION ALL
SELECT 10020 ItemID, 'A' AS UserID, 205 AS Sequence UNION ALL
SELECT 10020 ItemID, 'Y' AS UserID, 206 AS Sequence UNION ALL
SELECT 10020 ItemID, 'B' AS UserID, 207 AS Sequence UNION ALL
SELECT 10020
Solution
Before we begin with the code...
I just want to address one thing regarding test cases with sample data. To get the best out of a performance review of your queries, try to provide a sample that's as close as possible to your real data. You stated:
Difference between the Demo code and real life:
While (2) would be difficult to replicate on a small scale, (1) and (3) are fairly simple. I modified your sample data in the following ways to match your real life data more closely:
New demo data:
For reference to others looking at this, the result set after running the whole query with sample data is as follows:
Performance
This being the meat of your question, let's start by looking at our execution plan, which I ran based on the above sample data. I added markers 1-4 which caught my attention and will address individually.
Note: I will make changes mainly in formatting as we go along.
Both of those identical scans come from the
I cannot tell you exactly how to optimize that CTE otherwise, but since you are scanning the same source data sets twice, perhaps consider storing the result set inside a temp table, which presumably might be a smaller set than the entire two original tables.
Improvements to formatting: Changed table aliases to make query (and execution plan) easier to read.
This Sort results from the
This Sort is also impossible to eliminate with your current logic, otherwise the following error will be raised:
I just want to address one thing regarding test cases with sample data. To get the best out of a performance review of your queries, try to provide a sample that's as close as possible to your real data. You stated:
Difference between the Demo code and real life:
- Items are in tables and not UNION ALL CTEs
- The list represented by Entries in the demo, in reality is 800K rows, and takes 8 seconds to retrieve.
- Sequence column is actually a date.
While (2) would be difficult to replicate on a small scale, (1) and (3) are fairly simple. I modified your sample data in the following ways to match your real life data more closely:
- Created temp tables
#Usersand#Entriesincluding keys and indexes (clustered indexes created automatically on primary key constraints
- N/A - cannot produce 800K rows of demo data
- Changed sequence column to
DATETIMEtype and seeded demo data using aRAND()formula withDATEADD(). While not completely identical, it should be "close enough".
New demo data:
----- Demo source data BEGIN
IF OBJECT_ID('tempdb..#Users') IS NOT NULL DROP TABLE #Users;
IF OBJECT_ID('tempdb..#Entries') IS NOT NULL DROP TABLE #Entries;
GO
CREATE TABLE #Users (
UserID VARCHAR(100) NOT NULL,
CONSTRAINT PK_#Users PRIMARY KEY (UserID)
);
CREATE TABLE #Entries (
ItemID INT NOT NULL,
UserID VARCHAR(100) NULL,
Sequence DATETIME NOT NULL,
CONSTRAINT PK_#Entries PRIMARY KEY (ItemID, Sequence),
CONSTRAINT FK_#Users FOREIGN KEY (UserID) REFERENCES #Users(UserID)
);
GO
INSERT INTO #Users (UserID)
SELECT 'A' UNION ALL
SELECT 'B' ;
INSERT INTO #Entries (ItemID, UserID, Sequence)
SELECT 10001 ItemID, 'X' AS UserID, DATEADD(HOUR, (RAND() * 1000), GETDATE()) AS Sequence UNION ALL
SELECT 10001 ItemID, 'A' AS UserID, DATEADD(HOUR, (RAND() * 1000), GETDATE()) AS Sequence UNION ALL
-- etc.
SELECT 10300 ItemID, 'A' AS UserID, DATEADD(HOUR, (RAND() * 1000), GETDATE()) AS Sequence ;
GO
----- Demo source data ENDFor reference to others looking at this, the result set after running the whole query with sample data is as follows:
Performance
This being the meat of your question, let's start by looking at our execution plan, which I ran based on the above sample data. I added markers 1-4 which caught my attention and will address individually.
Note: I will make changes mainly in formatting as we go along.
- Duplicated Index Scans
Both of those identical scans come from the
Src CTE, which is called from 2 the other 2 CTEs separately. I was looking for a way to eliminate the left join in favor of an existence check, however due to needing the u.UserID in your IsInternal field we will have to keep this join. One possibility, if this kind of operation (checking whether a user is internal) is something that is done frequently in your code base, you may consider adding an IsInternal boolean/bit column in Entries so you could eliminate this join altogether from your code base when you need to check if an entry is internal.I cannot tell you exactly how to optimize that CTE otherwise, but since you are scanning the same source data sets twice, perhaps consider storing the result set inside a temp table, which presumably might be a smaller set than the entire two original tables.
WITH Src AS (
SELECT
Src_Ent.ItemID
, Src_Ent.UserID
, Src_Ent.Sequence
/*If the user for this item sequence is not found in Users, we mark it as Internal.*/
, (CASE WHEN Src_Usr.UserID IS NULL THEN 0 ELSE 1 END) AS IsInternal
FROM #Entries AS Src_Ent
LEFT JOIN #Users AS Src_Usr
ON Src_Usr.UserID = Src_Ent.UserID
)Improvements to formatting: Changed table aliases to make query (and execution plan) easier to read.
#Entries AS Src_Ent was e and #Users AS Src_Usr was u. I also wrapped the CASE expression in round brackets to help isolate it visually from its alias. I added a bit of documentation to the CASE statement.- Sort #1 -
Src_UserID_Sub
This Sort results from the
GROUP BY clause of Src_UserID_Sub subquery. Unfortunately it's not possible to eliminate this expensive sort, as rows must be sorted prior to being grouped. It's possible that this would be less expensive if you used a temp table as suggested in step (1) if you had a clustered index, for example by making an artificial primary key such as a RowID INT IDENTITY(1,1) column on the temp table.- Sort #2 -
Src_UserIDwithROW_NUMBER()
This Sort is also impossible to eliminate with your current logic, otherwise the following error will be raised:
The function 'ROW_NUMBER' must have an OVER clause with ORDER BY. It might be possible to do away with ROW_NUMBER(), but that might also be more harmful than beneficial, as it would likely require a loop or some other construct that is not very "SQL-ish", and in the end, the query optimizer can probably work better with thisCode Snippets
----- Demo source data BEGIN
IF OBJECT_ID('tempdb..#Users') IS NOT NULL DROP TABLE #Users;
IF OBJECT_ID('tempdb..#Entries') IS NOT NULL DROP TABLE #Entries;
GO
CREATE TABLE #Users (
UserID VARCHAR(100) NOT NULL,
CONSTRAINT PK_#Users PRIMARY KEY (UserID)
);
CREATE TABLE #Entries (
ItemID INT NOT NULL,
UserID VARCHAR(100) NULL,
Sequence DATETIME NOT NULL,
CONSTRAINT PK_#Entries PRIMARY KEY (ItemID, Sequence),
CONSTRAINT FK_#Users FOREIGN KEY (UserID) REFERENCES #Users(UserID)
);
GO
INSERT INTO #Users (UserID)
SELECT 'A' UNION ALL
SELECT 'B' ;
INSERT INTO #Entries (ItemID, UserID, Sequence)
SELECT 10001 ItemID, 'X' AS UserID, DATEADD(HOUR, (RAND() * 1000), GETDATE()) AS Sequence UNION ALL
SELECT 10001 ItemID, 'A' AS UserID, DATEADD(HOUR, (RAND() * 1000), GETDATE()) AS Sequence UNION ALL
-- etc.
SELECT 10300 ItemID, 'A' AS UserID, DATEADD(HOUR, (RAND() * 1000), GETDATE()) AS Sequence ;
GO
----- Demo source data ENDWITH Src AS (
SELECT
Src_Ent.ItemID
, Src_Ent.UserID
, Src_Ent.Sequence
/*If the user for this item sequence is not found in Users, we mark it as Internal.*/
, (CASE WHEN Src_Usr.UserID IS NULL THEN 0 ELSE 1 END) AS IsInternal
FROM #Entries AS Src_Ent
LEFT JOIN #Users AS Src_Usr
ON Src_Usr.UserID = Src_Ent.UserID
), Src_UserID AS (
SELECT
Src_UserID_Sub.ItemID
, Src_UserID_Sub.IsInternal
, Src_UserID_Sub.UserID
, Src_UserID_Sub.CountPerUser
, Src_UserID_Sub.FirstUserSequence
, ROW_NUMBER() OVER (
PARTITION BY
Src_UserID_Sub.ItemID
, Src_UserID_Sub.IsInternal
ORDER BY
Src_UserID_Sub.FirstUserSequence
) AS [RowCount]
FROM (
/*This subquery is used to get the number of entries per user,
as well as the earliest sequence related to said entries*/
SELECT
Src.ItemID
, Src.IsInternal
, Src.UserID
, COUNT(*) AS CountPerUser
, MIN(Src.Sequence) AS FirstUserSequence
FROM Src
GROUP BY Src.ItemID, Src.IsInternal, Src.UserID
) AS Src_UserID_SubSELECT
items.ItemID
, items.FirstSequence
, internal.firstInternal
, internal.FirstInternalSequence
, internal.CountPerFirstInternal
, items.CountAllInternal
, items.CountAll
, items.LastSequence
FROM Src_Items AS items
JOIN Src_FirstInternal AS internal
ON internal.ItemID = items.ItemID
;Context
StackExchange Code Review Q#114250, answer score: 10
Revisions (0)
No revisions yet.