patternsqlMinor
Top 10 editors partitioned by year
Viewed 0 times
partitionedtopyeareditors
Problem
I wanted to get some stats on who has done the most edits on Code Review, so I wrote the following query on SE Data Explorer. All improvement suggestions are welcome.
Note that I used a cursor to partition top users by year because at first there was a fundamental flaw in my logic trying to get the top users overall and then grouping them by years, which gave incorrect results.
The execution plan doesn't look too bad, considering the cursor only iterates a total of 6 times (2011-2016), but if it can be improved without degrading readability too much, I'd like to know.
```
/*
Aggregate the top 10 users with the most post edits
for each year, along with relevant statistics.
*/
/ Filter for Edit events: /
declare @EditPostHistoryTypes table (Id int);
insert into @EditPostHistoryTypes
select Id
from PostHistoryTypes
where Name like 'Edit%';
/ All edits on record: /
declare @CountAllEdits int = (
select count(Id)
from PostHistory
where
PostHistoryTypeId in (select Id from @EditPostHistoryTypes)
and PostId in (
select Id from Posts
where DeletionDate is null
)
);
/ Hold top editors per year for main query: /
declare @TopEditorsPerYear table (
[Year] int,
UserId int,
CountByYear int,
TotalCount int
);
/ Get all years with edit activity: /
declare _years cursor for
select distinct datepart(year, CreationDate)
from PostHistory
where PostHistoryTypeId in (
select Id
from PostHistoryTypes
where Name like 'Edit%'
);
declare @currentYear int;
/ Cursor to split top editors into yearly partitions: /
open _years;
fetch next from _years into @currentYear;
while @@fetch_status = 0
begin;
insert into @TopEditorsPerYear ([Year], UserId, CountByYear, TotalCount)
select top 10
datepart(year, ph.CreationDate) as [Year],
ph.UserId,
count(ph.Id),
(select count(Id) from PostHistory a
Note that I used a cursor to partition top users by year because at first there was a fundamental flaw in my logic trying to get the top users overall and then grouping them by years, which gave incorrect results.
The execution plan doesn't look too bad, considering the cursor only iterates a total of 6 times (2011-2016), but if it can be improved without degrading readability too much, I'd like to know.
```
/*
Aggregate the top 10 users with the most post edits
for each year, along with relevant statistics.
*/
/ Filter for Edit events: /
declare @EditPostHistoryTypes table (Id int);
insert into @EditPostHistoryTypes
select Id
from PostHistoryTypes
where Name like 'Edit%';
/ All edits on record: /
declare @CountAllEdits int = (
select count(Id)
from PostHistory
where
PostHistoryTypeId in (select Id from @EditPostHistoryTypes)
and PostId in (
select Id from Posts
where DeletionDate is null
)
);
/ Hold top editors per year for main query: /
declare @TopEditorsPerYear table (
[Year] int,
UserId int,
CountByYear int,
TotalCount int
);
/ Get all years with edit activity: /
declare _years cursor for
select distinct datepart(year, CreationDate)
from PostHistory
where PostHistoryTypeId in (
select Id
from PostHistoryTypes
where Name like 'Edit%'
);
declare @currentYear int;
/ Cursor to split top editors into yearly partitions: /
open _years;
fetch next from _years into @currentYear;
while @@fetch_status = 0
begin;
insert into @TopEditorsPerYear ([Year], UserId, CountByYear, TotalCount)
select top 10
datepart(year, ph.CreationDate) as [Year],
ph.UserId,
count(ph.Id),
(select count(Id) from PostHistory a
Solution
This can be simplified down to a single query :-)
You calculate COUNTs using different WHERE-conditions and aggregation levels, but the base tables are always the same:
In a case like this you can utilize conditional counts (
To be able to calculate both Edits only and All actions I use a
Aggregation starts with the lowest level,
This results in this base query:
And then you just have to filter for the top 10 and do the percentage calculations:
I noticed you didn't use exactly the same WHERE-conditions all the time, in your
It was hard for me to infer the actual meaning from the column names.
And the calculation of [% Total/Site] doesn't make sense for me, you compare all actions of a user to all edit-actions.
First I would expect indicators like:
And additionally maybe:
You calculate COUNTs using different WHERE-conditions and aggregation levels, but the base tables are always the same:
- CountByYear: Edits per user & year
- TotalCount: All actions of a user
- CountAllEdits: All Edits of all users (including UserId NULL)
In a case like this you can utilize conditional counts (
COUNT(CASE...)) and Windowed Aggregate Functions (SUM() OVER). To be able to calculate both Edits only and All actions I use a
LEFT JOIN PostHistoryTypes ON Name like 'Edit%' which returns NULL for non-edit-actions. This enables both counts using:count(PostHistory.PostHistoryTypeId) = all actions
count(PostHistoryTypes.id) = edits onlyAggregation starts with the lowest level,
GROUP BY UserId, YEAR. And because OVER is processed after aggregation you get the higher aggregation levels using:sum(count(...)) over (partition by UserId) = -- count by user
sum(count(...)) over () = -- count allThis results in this base query:
select
[Year Rank] = row_number()
over (partition by datepart(year, CreationDate)
-- (excluding NULL UserId)
order by count(case when UserId is not null then PHT.Id end) desc),
[Year] = datepart(year, CreationDate),
UserId,
CountByYear = count(pht.id), -- Edits per user & year
TotalCount = sum(count(UserId)) over (partition by UserId), -- All actions of a user
CountAllEdits = sum(count(pht.Id)) over () -- All Edits (including NULL UserId)
from
PostHistory as ph
left join PostHistoryTypes as pht
on ph.PostHistoryTypeId = pht.Id
and pht.Name like 'Edit%'
where PostId in (select Id from Posts where DeletionDate is null)
group by
[UserId],
datepart(year, CreationDate)And then you just have to filter for the top 10 and do the percentage calculations:
/*
Aggregate the top 10 users with the most post edits
for each year, along with relevant statistics.
*/
with cte as
(
select
[Year Rank] = row_number()
over (partition by datepart(year, CreationDate)
-- (excluding NULL UserId)
order by count(case when UserId is not null then PHT.Id end) desc),
[Year] = datepart(year, CreationDate),
UserId,
CountByYear = count(pht.id), -- Edits per user & year
TotalCount = sum(count(UserId)) over (partition by UserId), -- All actions of a user
CountAllEdits = sum(count(pht.Id)) over () -- All Edits (including NULL UserId)
from
PostHistory as ph
left join PostHistoryTypes as pht
on ph.PostHistoryTypeId = pht.Id
and pht.Name like 'Edit%'
where PostId in (select Id from Posts where DeletionDate is null)
group by
[UserId],
datepart(year, CreationDate)
)
select
[Year Rank],
[Year],
[User Link] = UserId,
-- number of edits by user & year
-- [Edits by User & Year]?
[Annual Total] = CountByYear,
-- % of edits vs. all actions by user & year
[% Year/User] = cast(100.00 * CountByYear / TotalCount as decimal(5,2)),
-- % of edits by user & year vs. all edits
[% Year/Site] = cast(100.00 * CountByYear / CountAllEdits as decimal(5,2)),
-- % of actions by user vs. all edits
[% Total/Site] = cast(100.00 * TotalCount / CountAllEdits as decimal(5,2)),
-- number of all actions by user
[Grand Total] = TotalCount
from cte
where [Year Rank] 0
order by
[Year Rank] asc,
[Year] asc;
;I noticed you didn't use exactly the same WHERE-conditions all the time, in your
TotalCount calculation you didn't check for DeletionDate is null, but in fact there's no NULL at all in that column (thus it could be removed from the query).It was hard for me to infer the actual meaning from the column names.
And the calculation of [% Total/Site] doesn't make sense for me, you compare all actions of a user to all edit-actions.
First I would expect indicators like:
- edit count of this user (per year & globally)
- % edits of this user vs all edits (per year & globally)
And additionally maybe:
- all actions count of this user (per year & globally)
- % edits vs all actions of this user (per year & globally)
Code Snippets
count(PostHistory.PostHistoryTypeId) = all actions
count(PostHistoryTypes.id) = edits onlysum(count(...)) over (partition by UserId) = -- count by user
sum(count(...)) over () = -- count allselect
[Year Rank] = row_number()
over (partition by datepart(year, CreationDate)
-- (excluding NULL UserId)
order by count(case when UserId is not null then PHT.Id end) desc),
[Year] = datepart(year, CreationDate),
UserId,
CountByYear = count(pht.id), -- Edits per user & year
TotalCount = sum(count(UserId)) over (partition by UserId), -- All actions of a user
CountAllEdits = sum(count(pht.Id)) over () -- All Edits (including NULL UserId)
from
PostHistory as ph
left join PostHistoryTypes as pht
on ph.PostHistoryTypeId = pht.Id
and pht.Name like 'Edit%'
where PostId in (select Id from Posts where DeletionDate is null)
group by
[UserId],
datepart(year, CreationDate)/*
Aggregate the top 10 users with the most post edits
for each year, along with relevant statistics.
*/
with cte as
(
select
[Year Rank] = row_number()
over (partition by datepart(year, CreationDate)
-- (excluding NULL UserId)
order by count(case when UserId is not null then PHT.Id end) desc),
[Year] = datepart(year, CreationDate),
UserId,
CountByYear = count(pht.id), -- Edits per user & year
TotalCount = sum(count(UserId)) over (partition by UserId), -- All actions of a user
CountAllEdits = sum(count(pht.Id)) over () -- All Edits (including NULL UserId)
from
PostHistory as ph
left join PostHistoryTypes as pht
on ph.PostHistoryTypeId = pht.Id
and pht.Name like 'Edit%'
where PostId in (select Id from Posts where DeletionDate is null)
group by
[UserId],
datepart(year, CreationDate)
)
select
[Year Rank],
[Year],
[User Link] = UserId,
-- number of edits by user & year
-- [Edits by User & Year]?
[Annual Total] = CountByYear,
-- % of edits vs. all actions by user & year
[% Year/User] = cast(100.00 * CountByYear / TotalCount as decimal(5,2)),
-- % of edits by user & year vs. all edits
[% Year/Site] = cast(100.00 * CountByYear / CountAllEdits as decimal(5,2)),
-- % of actions by user vs. all edits
[% Total/Site] = cast(100.00 * TotalCount / CountAllEdits as decimal(5,2)),
-- number of all actions by user
[Grand Total] = TotalCount
from cte
where [Year Rank] <= 10
and CountByYear > 0
order by
[Year Rank] asc,
[Year] asc;
;Context
StackExchange Code Review Q#125106, answer score: 3
Revisions (0)
No revisions yet.