patternsqlMinor
SEDE query to find users with exactly one post in a particular tag
Viewed 0 times
tagwithqueryuserspostonefindparticularsedeexactly
Problem
I'm doing some research for a meta post on Gardening.SE (where I'm a pro tempore moderator). I'm using the Stack Exchange Data Explorer (SEDE) to find information about first time users who post questions in our [identification] tag, and how they subsequently interact with the people who respond to their question, whether that's comments on the post, comments on answers, editing the post to provide more information, and so on.
I have the following query which gives me the list of people with no answers and one question that includes the [identification] tag. I plan to refine it by adding extra criteria, e.g. to find instances where the OP leaves a comment replying to someone else.
My concern is that this SQL is far from optimal. I'm looking especially at the repeated inner joins between the Users and Posts tables, and the similarity between the sub-queries to find users with one question and users with at least one answer.
Gar
I have the following query which gives me the list of people with no answers and one question that includes the [identification] tag. I plan to refine it by adding extra criteria, e.g. to find instances where the OP leaves a comment replying to someone else.
-- Find users who have exactly one post
-- with that post being a question
-- and the question has the [identification] tag
SELECT
Users.Id AS [User Link], Posts.Id AS [Post Link]
FROM
Users INNER JOIN Posts ON Posts.OwnerUserId = Users.Id
INNER JOIN PostTags ON Posts.Id = PostTags.PostId
INNER JOIN Tags ON Tags.Id = PostTags.TagId
WHERE
Tags.TagName LIKE 'identification' AND
Posts.ParentId IS NULL AND -- ParentId is NULL if post is a question
Users.Id in (
-- Users with exactly one question
SELECT Users.Id
FROM
Users INNER JOIN Posts ON Posts.OwnerUserId = Users.Id
WHERE
Posts.ParentId IS NULL
GROUP BY
Users.id
HAVING
Count(Posts.Id) = 1
) AND
Users.Id NOT IN (
-- Users with 1 or more answers
SELECT Users.Id
FROM
Users INNER JOIN Posts ON Posts.OwnerUserId = Users.Id
WHERE
Posts.ParentId IS NOT NULL
GROUP BY
Users.id
HAVING
Count(Posts.Id) > 0
)
ORDER BY
Users.Id DESC -- newest users first.My concern is that this SQL is far from optimal. I'm looking especially at the repeated inner joins between the Users and Posts tables, and the similarity between the sub-queries to find users with one question and users with at least one answer.
Gar
Solution
Half of the tricks with optimizing SQL is knowing the database schema.
The schema documentation can be found in this Meta post: Database schema documentation for the public data dump and SEDE.
There are four functional issues I see with your query, and a couple of suggestions after that:
As for the suggestions:
Using these suggestions I restructured your query to do a few things:
I forked your query here... ... for me it ran through in 33 milliseconds
The schema documentation can be found in this Meta post: Database schema documentation for the public data dump and SEDE.
There are four functional issues I see with your query, and a couple of suggestions after that:
- Use PostTypeId - it describes what the Post is (Question == 1, Answer == 2) and is well indexed.
- you are querying closed posts (which have a non-null
ClosedDate)
- you join to
Tagsfor a single record. This can be resolved independantly
- you have Posts three times in your query, and it can be there only twice if you do double-duty in the 'Only One Question' part of the query....
As for the suggestions:
- I like parametrize things, and being able to change the tag you are interested makes sense, so I would parametrize that.
- The SQL Server syntax allows for 'with' sections in a query, and that makes the code a lot more readable than the query-in-from-clause syntax
Using these suggestions I restructured your query to do a few things:
- add a User-Name sort column (SEDE does not sort [User Link] columns well)
- parametrized the tag, with a default value of
identification
- made the group-by and having a much smaller component in the query.
- added the closed-query support (and community-wiki - if there are).
I forked your query here... ... for me it ran through in 33 milliseconds
-- Find users who have exactly one post
-- with that post being a question
-- and the question has the [xxxx] tag (default identification)
declare @tag as nvarchar(25) = ##tag:string?identification##
;
declare @tagid as int
;
-- get the tag id here so it is not needed as a join table
select @tagid = Id
from Tags
where TagName = lower(@tag)
;
print 'Tag for ' + @tag + ' is ' + Convert(NVarchar(10), @tagid)
;
with Answerers as (
select distinct OwnerUserId as [UserId]
from Posts p
where PostTypeId = 2 -- answers
and CommunityOwnedDate is null -- not CW
and ClosedDate is null -- not closed
), SingleQ as (
Select OwnerUserId as [UserId],
Min(Id) as [FirstPost],
Sum(Score) as [Score],
count (*) as [QCount]
from Posts p
where PostTypeId = 1 -- question
and CommunityOwnedDate is null -- not CW
and ClosedDate is null -- not closed
and not exists (
select UserId
from Answerers
where UserId = OwnerUserId) -- active answerers.
group by OwnerUserId
having count(*) = 1
)
SELECT
ROW_NUMBER() OVER (order by Users.Id Desc) as [Recent],
Users.Id AS [User Link],
Users.DisplayName AS [Sort By Name],
SingleQ.FirstPost AS [Post Link],
SingleQ.Score AS [Score]
FROM Users
INNER JOIN SingleQ ON Users.Id = SingleQ.UserId
INNER JOIN PostTags ON SingleQ.FirstPost = PostTags.PostId
WHERE PostTags.TagId = @tagid
ORDER BY
Users.Id DESC -- newest users first.Code Snippets
-- Find users who have exactly one post
-- with that post being a question
-- and the question has the [xxxx] tag (default identification)
declare @tag as nvarchar(25) = ##tag:string?identification##
;
declare @tagid as int
;
-- get the tag id here so it is not needed as a join table
select @tagid = Id
from Tags
where TagName = lower(@tag)
;
print 'Tag for ' + @tag + ' is ' + Convert(NVarchar(10), @tagid)
;
with Answerers as (
select distinct OwnerUserId as [UserId]
from Posts p
where PostTypeId = 2 -- answers
and CommunityOwnedDate is null -- not CW
and ClosedDate is null -- not closed
), SingleQ as (
Select OwnerUserId as [UserId],
Min(Id) as [FirstPost],
Sum(Score) as [Score],
count (*) as [QCount]
from Posts p
where PostTypeId = 1 -- question
and CommunityOwnedDate is null -- not CW
and ClosedDate is null -- not closed
and not exists (
select UserId
from Answerers
where UserId = OwnerUserId) -- active answerers.
group by OwnerUserId
having count(*) = 1
)
SELECT
ROW_NUMBER() OVER (order by Users.Id Desc) as [Recent],
Users.Id AS [User Link],
Users.DisplayName AS [Sort By Name],
SingleQ.FirstPost AS [Post Link],
SingleQ.Score AS [Score]
FROM Users
INNER JOIN SingleQ ON Users.Id = SingleQ.UserId
INNER JOIN PostTags ON SingleQ.FirstPost = PostTags.PostId
WHERE PostTags.TagId = @tagid
ORDER BY
Users.Id DESC -- newest users first.Context
StackExchange Code Review Q#46780, answer score: 2
Revisions (0)
No revisions yet.