HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Top 10 editors partitioned by year

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
partitionedtopyeareditors

Problem

I wanted to get some stats on who has done the most edits on Code Review, so I wrote the following query on SE Data Explorer. All improvement suggestions are welcome.

Note that I used a cursor to partition top users by year because at first there was a fundamental flaw in my logic trying to get the top users overall and then grouping them by years, which gave incorrect results.

The execution plan doesn't look too bad, considering the cursor only iterates a total of 6 times (2011-2016), but if it can be improved without degrading readability too much, I'd like to know.

```
/*
Aggregate the top 10 users with the most post edits
for each year, along with relevant statistics.
*/

/ Filter for Edit events: /
declare @EditPostHistoryTypes table (Id int);
insert into @EditPostHistoryTypes
select Id
from PostHistoryTypes
where Name like 'Edit%';

/ All edits on record: /
declare @CountAllEdits int = (
select count(Id)
from PostHistory
where
PostHistoryTypeId in (select Id from @EditPostHistoryTypes)
and PostId in (
select Id from Posts
where DeletionDate is null
)
);
/ Hold top editors per year for main query: /
declare @TopEditorsPerYear table (
[Year] int,
UserId int,
CountByYear int,
TotalCount int
);
/ Get all years with edit activity: /
declare _years cursor for
select distinct datepart(year, CreationDate)
from PostHistory
where PostHistoryTypeId in (
select Id
from PostHistoryTypes
where Name like 'Edit%'
);
declare @currentYear int;
/ Cursor to split top editors into yearly partitions: /
open _years;
fetch next from _years into @currentYear;
while @@fetch_status = 0
begin;
insert into @TopEditorsPerYear ([Year], UserId, CountByYear, TotalCount)
select top 10
datepart(year, ph.CreationDate) as [Year],
ph.UserId,
count(ph.Id),
(select count(Id) from PostHistory a

Solution

This can be simplified down to a single query :-)

You calculate COUNTs using different WHERE-conditions and aggregation levels, but the base tables are always the same:

  • CountByYear: Edits per user & year



  • TotalCount: All actions of a user



  • CountAllEdits: All Edits of all users (including UserId NULL)



In a case like this you can utilize conditional counts (COUNT(CASE...)) and Windowed Aggregate Functions (SUM() OVER).

To be able to calculate both Edits only and All actions I use a LEFT JOIN PostHistoryTypes ON Name like 'Edit%' which returns NULL for non-edit-actions. This enables both counts using:

count(PostHistory.PostHistoryTypeId) = all actions 
count(PostHistoryTypes.id)           = edits only


Aggregation starts with the lowest level, GROUP BY UserId, YEAR. And because OVER is processed after aggregation you get the higher aggregation levels using:

sum(count(...)) over (partition by UserId) = -- count by user
sum(count(...)) over ()                    = -- count all


This results in this base query:

select
      [Year Rank] = row_number() 
                    over (partition by datepart(year, CreationDate)
                                         -- (excluding NULL UserId)
                          order by count(case when UserId is not null then PHT.Id end) desc),
      [Year] = datepart(year, CreationDate),
      UserId,
      CountByYear = count(pht.id), -- Edits per user & year
      TotalCount = sum(count(UserId)) over (partition by UserId), -- All actions of a user
      CountAllEdits = sum(count(pht.Id)) over () -- All Edits (including NULL UserId)
   from 
       PostHistory as ph
   left join PostHistoryTypes as pht
     on ph.PostHistoryTypeId = pht.Id
    and pht.Name like 'Edit%'
   where PostId in (select Id from Posts where DeletionDate is null)
   group by 
       [UserId], 
       datepart(year, CreationDate)


And then you just have to filter for the top 10 and do the percentage calculations:

/*
  Aggregate the top 10 users with the most post edits 
  for each year, along with relevant statistics.
*/

with cte as 
 (
   select
      [Year Rank] = row_number() 
                    over (partition by datepart(year, CreationDate)
                                         -- (excluding NULL UserId)
                          order by count(case when UserId is not null then PHT.Id end) desc),
      [Year] = datepart(year, CreationDate),
      UserId,
      CountByYear = count(pht.id), -- Edits per user & year
      TotalCount = sum(count(UserId)) over (partition by UserId), -- All actions of a user
      CountAllEdits = sum(count(pht.Id)) over () -- All Edits (including NULL UserId)
   from 
       PostHistory as ph
   left join PostHistoryTypes as pht
     on ph.PostHistoryTypeId = pht.Id
    and pht.Name like 'Edit%'
   where PostId in (select Id from Posts where DeletionDate is null)
   group by 
       [UserId], 
       datepart(year, CreationDate)
 )
select
   [Year Rank],
   [Year],
   [User Link] = UserId,

   -- number of edits by user & year
   -- [Edits by User & Year]?
   [Annual Total] = CountByYear,

   -- % of edits vs. all actions by user & year
   [% Year/User]  = cast(100.00 * CountByYear / TotalCount    as decimal(5,2)),

   -- % of edits by user & year vs. all edits 
   [% Year/Site]  = cast(100.00 * CountByYear / CountAllEdits as decimal(5,2)),

   -- % of actions by user vs. all edits 
   [% Total/Site] = cast(100.00 * TotalCount  / CountAllEdits as decimal(5,2)),

   -- number of all actions by user
   [Grand Total] = TotalCount

from cte
where [Year Rank]  0
order by 
    [Year Rank] asc, 
    [Year] asc;
;


I noticed you didn't use exactly the same WHERE-conditions all the time, in your TotalCount calculation you didn't check for DeletionDate is null, but in fact there's no NULL at all in that column (thus it could be removed from the query).

It was hard for me to infer the actual meaning from the column names.
And the calculation of [% Total/Site] doesn't make sense for me, you compare all actions of a user to all edit-actions.

First I would expect indicators like:

  • edit count of this user (per year & globally)



  • % edits of this user vs all edits (per year & globally)



And additionally maybe:

  • all actions count of this user (per year & globally)



  • % edits vs all actions of this user (per year & globally)

Code Snippets

count(PostHistory.PostHistoryTypeId) = all actions 
count(PostHistoryTypes.id)           = edits only
sum(count(...)) over (partition by UserId) = -- count by user
sum(count(...)) over ()                    = -- count all
select
      [Year Rank] = row_number() 
                    over (partition by datepart(year, CreationDate)
                                         -- (excluding NULL UserId)
                          order by count(case when UserId is not null then PHT.Id end) desc),
      [Year] = datepart(year, CreationDate),
      UserId,
      CountByYear = count(pht.id), -- Edits per user & year
      TotalCount = sum(count(UserId)) over (partition by UserId), -- All actions of a user
      CountAllEdits = sum(count(pht.Id)) over () -- All Edits (including NULL UserId)
   from 
       PostHistory as ph
   left join PostHistoryTypes as pht
     on ph.PostHistoryTypeId = pht.Id
    and pht.Name like 'Edit%'
   where PostId in (select Id from Posts where DeletionDate is null)
   group by 
       [UserId], 
       datepart(year, CreationDate)
/*
  Aggregate the top 10 users with the most post edits 
  for each year, along with relevant statistics.
*/

with cte as 
 (
   select
      [Year Rank] = row_number() 
                    over (partition by datepart(year, CreationDate)
                                         -- (excluding NULL UserId)
                          order by count(case when UserId is not null then PHT.Id end) desc),
      [Year] = datepart(year, CreationDate),
      UserId,
      CountByYear = count(pht.id), -- Edits per user & year
      TotalCount = sum(count(UserId)) over (partition by UserId), -- All actions of a user
      CountAllEdits = sum(count(pht.Id)) over () -- All Edits (including NULL UserId)
   from 
       PostHistory as ph
   left join PostHistoryTypes as pht
     on ph.PostHistoryTypeId = pht.Id
    and pht.Name like 'Edit%'
   where PostId in (select Id from Posts where DeletionDate is null)
   group by 
       [UserId], 
       datepart(year, CreationDate)
 )
select
   [Year Rank],
   [Year],
   [User Link] = UserId,

   -- number of edits by user & year
   -- [Edits by User & Year]?
   [Annual Total] = CountByYear,

   -- % of edits vs. all actions by user & year
   [% Year/User]  = cast(100.00 * CountByYear / TotalCount    as decimal(5,2)),

   -- % of edits by user & year vs. all edits 
   [% Year/Site]  = cast(100.00 * CountByYear / CountAllEdits as decimal(5,2)),

   -- % of actions by user vs. all edits 
   [% Total/Site] = cast(100.00 * TotalCount  / CountAllEdits as decimal(5,2)),

   -- number of all actions by user
   [Grand Total] = TotalCount

from cte
where [Year Rank] <= 10
  and CountByYear > 0
order by 
    [Year Rank] asc, 
    [Year] asc;
;

Context

StackExchange Code Review Q#125106, answer score: 3

Revisions (0)

No revisions yet.