HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

SE Data Explorer Query: Average score for questions and answers, by tag

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
taganswersexplorerqueryquestionsaveragescoreforanddata

Problem

Being curious to see which tags on this site get the most attention, I developed a Stack Exchange Data Explorer query:

```
WITH CanonicalTags AS (
SELECT Id, TagName, Id AS AliasId, TagName AS Alias
FROM Tags
WHERE NOT EXISTS (
SELECT Id
FROM TagSynonyms
WHERE SourceTagName=TagName AND ApprovedByUserId IS NOT NULL
)
UNION
SELECT Tags.Id, Tags.TagName, Alias.Id AS AliasId, Alias.TagName
FROM Tags
INNER JOIN TagSynonyms
ON TargetTagName=TagName
INNER JOIN Tags AS Alias
ON Alias.TagName=SourceTagName
WHERE ApprovedByUserId IS NOT NULL
), TagQuestionStats AS (
SELECT Tags.Id AS Id
, Tags.TagName AS TagName
, AVG(CAST(Questions.Score AS DECIMAL)) AS AvgScore
, STDEV(Questions.Score) AS StdDevScore
, COUNT(Questions.Id) AS N
FROM CanonicalTags AS Tags
INNER JOIN PostTags
ON PostTags.TagId=Tags.AliasId
INNER JOIN Posts AS Questions
ON Questions.Id=PostTags.PostId
GROUP BY Tags.Id, Tags.TagName
), TagAnswerStats AS (
SELECT Tags.Id AS Id
, Tags.TagName AS TagName
, AVG(CAST(Answers.Score AS DECIMAL)) AS AvgScore
, STDEV(Answers.Score) AS StdDevScore
, COUNT(Answers.Id) AS N
FROM CanonicalTags AS Tags
INNER JOIN PostTags
ON PostTags.TagId=Tags.AliasId
INNER JOIN Posts AS Questions
ON Questions.Id=PostTags.PostId
INNER JOIN Posts AS Answers
ON Answers.ParentId=Questions.Id
WHERE
Answers.PostTypeId=2
GROUP BY Tags.Id, Tags.TagName
)
SELECT '' AS [TagName]
, ROUND(QStats.AvgScore, 2) AS [Avg Qst Score]
, ROUND(QStats.StdDevScore, 2) AS [Qst Score Std Dev]
, QStats.N AS [# Qst]
, ROUND(AStats.AvgScore, 2) AS [Avg Ans Score]
, ROUND(AStats.St

Solution

Your code is, for the most part, consistently styled, and predictably structured, which is nice. I am not a fan of the comma-on-the-new-line style of continuation. I prefer the comma at the end. But, at least you are consistent with this.

One inconsistency which I see is that for smaller selects you have multiple selected columns on a single line.

SELECT Tags.Id, Tags.TagName, Alias.Id AS AliasId, Alias.TagName
    ....


To be consistent with the other selects, as much as these are short statements, you should maintain consistency:

SELECT Tags.Id
     , Tags.TagName
     , Alias.Id AS AliasId
     , Alias.TagName
   ....


There is only one example of it in your code, but I really dislike multiple conditions on a single line:

WHERE SourceTagName=TagName AND ApprovedByUserId IS NOT NULL


This really should be:

WHERE SourceTagName=TagName
  AND ApprovedByUserId IS NOT NULL


The UNION condition is also very buried in there, and it has too much indentation ;-)

Finally, white-space is cheap in SQL, I really don't like the space-less arg=val expressions, spacing them is easy arg = val.

Right, that's the nit-picky stuff done.

Talking about the UNION, the statement could be rewritten as:

SELECT IsNull(Main.TagName, Alias.TagName) AS TagName
     , IsNull(Main.Id,      Alias.Id     ) AS Id
     , Alias.Id                            AS AliasId
    FROM Tags AS Alias
    LEFT JOIN TagSynonyms 
            ON SourceTagName = Alias.TagName
    LEFT JOIN Tags AS Main
            ON TargetTagName = Main.TagName


I prefer the more compact result, although it is debatable as to whether the "Outer" Join is a better solution than the UNION. Outer joins can lead to confusion. So can UNIONS and I prefer the outer join.

In the above query I have removed the Alias TagName since it is not used outside the sub-select.

I also found the final 'assimilation' of all the data in the final select to have too many casts and data conversions. The actual logic was lost in the detail. I moved the complicating CASTS to the previous with clauses.

Finally, the indexed ORDER BY has caught a number of people out in the past, as it is too easy to change columns and suddenly the data ordering is off, even though the query still runs. I always recommend named-column ordering.

Putting this all together, I have the SQL:

WITH CanonicalTags AS (
    SELECT IsNull(Main.TagName, Alias.TagName) AS TagName
         , IsNull(Main.Id,      Alias.Id     ) AS Id
--         , Alias.TagName                       AS AliasName
         , Alias.Id                            AS AliasId
        FROM Tags AS Alias
            LEFT JOIN TagSynonyms 
                ON SourceTagName = Alias.TagName
            LEFT JOIN Tags AS Main
                ON TargetTagName = Main.TagName
), TagQuestionStats AS (
    SELECT Tags.Id AS Id
         , Tags.TagName AS TagName
         , CAST(AVG(CAST(Questions.Score AS DECIMAL)) AS DECIMAL(8,2)) AS AvgScore
         , CAST(STDEV(Questions.Score) AS DECIMAL(8,2))  AS StdDevScore
         , CAST(COUNT(Questions.Id) AS DECIMAL(8,2))  AS N
        FROM CanonicalTags AS Tags
            INNER JOIN PostTags
                ON PostTags.TagId=Tags.AliasId
            INNER JOIN Posts AS Questions
                ON Questions.Id=PostTags.PostId
        GROUP BY Tags.Id, Tags.TagName
), TagAnswerStats AS (
    SELECT Tags.Id AS Id
         , Tags.TagName AS TagName
         , CAST(AVG(CAST(Answers.Score AS DECIMAL)) AS DECIMAL(8,2)) AS AvgScore
         , CAST(STDEV(Answers.Score) AS DECIMAL(8,2)) AS StdDevScore
         , CAST(COUNT(Answers.Id) AS DECIMAL(8,2)) AS N
        FROM CanonicalTags AS Tags
            INNER JOIN PostTags
                ON PostTags.TagId=Tags.AliasId
            INNER JOIN Posts AS Questions
                ON Questions.Id=PostTags.PostId
        INNER JOIN Posts AS Answers
            ON Answers.ParentId=Questions.Id
        WHERE
            Answers.PostTypeId=2
        GROUP BY Tags.Id, Tags.TagName
)
SELECT '' AS [TagName]
     , QStats.AvgScore AS [Avg Qst Score]
     , QStats.StdDevScore AS [Qst Score Std Dev]
     , QStats.N AS [# Qst]
     , AStats.AvgScore AS [Avg Ans Score]
     , AStats.StdDevScore AS [Ans Score Std Dev]
     , AStats.N AS [# Ans]
     , ROUND(AStats.N / QStats.N, 2) AS [Avg # Ans/Qst]
    FROM
        TagQuestionStats AS QStats
            LEFT OUTER JOIN TagAnswerStats AS AStats
                ON AStats.Id=QStats.Id
    ORDER BY [Avg Ans Score] DESC
           , [Avg Qst Score] DESC
           , [# Ans]
           , [# Qst]
           , [TagName];


This query supports this fork of your original question

Code Snippets

SELECT Tags.Id, Tags.TagName, Alias.Id AS AliasId, Alias.TagName
    ....
SELECT Tags.Id
     , Tags.TagName
     , Alias.Id AS AliasId
     , Alias.TagName
   ....
WHERE SourceTagName=TagName AND ApprovedByUserId IS NOT NULL
WHERE SourceTagName=TagName
  AND ApprovedByUserId IS NOT NULL
SELECT IsNull(Main.TagName, Alias.TagName) AS TagName
     , IsNull(Main.Id,      Alias.Id     ) AS Id
     , Alias.Id                            AS AliasId
    FROM Tags AS Alias
    LEFT JOIN TagSynonyms 
            ON SourceTagName = Alias.TagName
    LEFT JOIN Tags AS Main
            ON TargetTagName = Main.TagName

Context

StackExchange Code Review Q#39410, answer score: 4

Revisions (0)

No revisions yet.