HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Who are these lurkers?

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
arewholurkersthese

Problem

Last night in The 2nd Monitor there was a discussion about postless users - I wanted to see how many there are, whether they're one-timers that showed up once, registered and never came back, and whether they vote at all.

So I whipped up this query on SEDE, showing voters and their post count, as well as other information:

with sub as (

  select distinct
    Users.Id
   ,Users.DisplayName
   ,Users.CreationDate
   ,Users.LastAccessDate
   ,datediff(day, cast(Users.CreationDate as date), cast(Users.LastAccessDate as date)) MembershipDays
   ,case when cast(Users.CreationDate as date) = cast(Users.LastAccessDate as date) then 1 else 0 end IsOneTimer
   ,sum(case when Posts.Id is null then 0 else 1 end) over (partition by Users.Id) PostCount
   ,Users.Reputation
   ,Users.UpVotes + Users.DownVotes TotalVotes
   ,Users.UpVotes
   ,Users.DownVotes
  from Users
  left join Posts on Users.Id = Posts.OwnerUserId
  where Users.Id > 0

), agg as (

  select
     Id
    ,DisplayName
    ,CreationDate
    ,LastAccessDate
    ,IsOneTimer
    ,MembershipDays
    ,case when MembershipDays = 0 then TotalVotes else TotalVotes / cast(MembershipDays as decimal) end VotesPerDay
    ,PostCount
    ,Reputation
    ,TotalVotes
    ,UpVotes
    ,DownVotes
  from sub
  where TotalVotes > 0)

select
     Id
    ,DisplayName
    ,CreationDate
    ,LastAccessDate
    ,IsOneTimer
    ,MembershipDays
    ,round(VotesPerDay, 3) VotesPerDay
    ,PostCount
    ,Reputation
    ,TotalVotes
    ,UpVotes
    ,DownVotes
from agg
order by
  PostCount
 ,cast(VotesPerDay as int) desc
 ,cast(LastAccessDate as date) desc
 ,TotalVotes desc
 ,Reputation
 ,IsOneTimer
 ,CreationDate


Returns 26740 rows returned in 211 ms (cached). Is there any way it could be improved, performance or otherwise?

Postless Users query on SEDE.

Solution

Output

It would be nice if the usernames were clickable. You can achieve that using the magic [User Link] column. It might make the query fail to complete on a large site like Stack Overflow, though.

It's weird that you cast(VotesPerDay as int) and cast(LastAccessDate as date) for sorting. Why not sort by VotesPerDay (without rounding) and LastAccessDate directly?

Implementation

The names of your CTEs, sub and agg, don't mean anything to me.

select distinct should be replaced with a smarter join wherever possible. If you look at the execution plan, you'll see that it estimates that there are 3 million rows, when there are actually just 26740 Code Review users. The estimated subtree cost is 1048. In comparison, my query below has an estimated subtree cost of 5.7 and 27580 rows for Code Review, and those numbers scale depending on the size of the site.

Your agg CTE has a TotalVotes > 0 as a WHERE condition. It would probably be more efficient to test for (UpVotes > 0 OR DownVotes > 0), because that has a possibility to take advantage of an index (though we don't know whether the Users table actually has such an index in SEDE).

Suggested solution

WITH FilteredUsers AS (
    SELECT *
         , DATEDIFF(day, CAST(CreationDate AS DATE), CAST(LastAccessDate AS DATE)) AS MembershipDays
        FROM Users
        WHERE
            Id > 0 AND
            (UpVotes > 0 OR DownVotes > 0)
), PostCounts AS (
    SELECT OwnerUserId
         , COUNT(*) AS PostCount
        FROM Posts
        GROUP BY OwnerUserId
)
SELECT Id
     , DisplayName
     , CreationDate
     , LastAccessDate
     , CASE WHEN MembershipDays = 0 THEN 1 ELSE 0 END AS IsOneTimer
     , MembershipDays
     , ROUND((UpVotes + DownVotes) / CASE WHEN MembershipDays = 0 THEN 1 ELSE CAST(MembershipDays AS DECIMAL) END, 3) AS VotesPerDay
     , COALESCE(PostCount, 0) AS PostCount
     , Reputation
     , UpVotes + DownVotes AS TotalVotes
     , UpVotes
     , DownVotes
    FROM FilteredUsers
        LEFT OUTER JOIN PostCounts
            ON FilteredUsers.Id = PostCounts.OwnerUserId
    ORDER BY PostCount
           , VotesPerDay DESC
           , LastAccessDate DESC
           , TotalVotes DESC
           , Reputation
           , IsOneTimer
           , CreationDate;

Code Snippets

WITH FilteredUsers AS (
    SELECT *
         , DATEDIFF(day, CAST(CreationDate AS DATE), CAST(LastAccessDate AS DATE)) AS MembershipDays
        FROM Users
        WHERE
            Id > 0 AND
            (UpVotes > 0 OR DownVotes > 0)
), PostCounts AS (
    SELECT OwnerUserId
         , COUNT(*) AS PostCount
        FROM Posts
        GROUP BY OwnerUserId
)
SELECT Id
     , DisplayName
     , CreationDate
     , LastAccessDate
     , CASE WHEN MembershipDays = 0 THEN 1 ELSE 0 END AS IsOneTimer
     , MembershipDays
     , ROUND((UpVotes + DownVotes) / CASE WHEN MembershipDays = 0 THEN 1 ELSE CAST(MembershipDays AS DECIMAL) END, 3) AS VotesPerDay
     , COALESCE(PostCount, 0) AS PostCount
     , Reputation
     , UpVotes + DownVotes AS TotalVotes
     , UpVotes
     , DownVotes
    FROM FilteredUsers
        LEFT OUTER JOIN PostCounts
            ON FilteredUsers.Id = PostCounts.OwnerUserId
    ORDER BY PostCount
           , VotesPerDay DESC
           , LastAccessDate DESC
           , TotalVotes DESC
           , Reputation
           , IsOneTimer
           , CreationDate;

Context

StackExchange Code Review Q#147339, answer score: 4

Revisions (0)

No revisions yet.