patternsqlMinor
Cross-posts from Stack Overflow to CR
Viewed 0 times
fromstackcrossoverflowposts
Problem
What this code does it explained in detail in the top portion of the query.
I have left out part of the list of language tags as I felt it was just noise. You can see the full collection here.
You can run this query on SEDE and play with it as you wish.
Note that I left plenty of wiggle room for others to modify and aggregate their own data starting from the simple query at the bottom.
1089 rows returned in 5886 ms
How could I improve this with regards to readability, performance, etc.?
```
/**
* The objective of this query is to gather data related to cross-posts from
* Stack Overflow (SO) to Code Review (CR). A cross-post as defined in this context is
* a question which has first been asked on SO and then a short time later asked
* again on CR (albeit often slightly modified in the way it is titled or phrased).
* Querying from 2 or more sites requires cross-database queries, and the following
* 2 databases are used here. All relevant tables are in the [dbo] schema.
* - Stack Overflow DB: [StackOverflow]
* - Code Review DB: [StackExchange.Codereview]
* 2 temporary tables are used in order to compensate for the physsical limitations
* of SEDE which otherwise will often time out before the query is completed.
* param @minutesFromSoPostToCrPost int not null : The number of minutes allowed between the original
* SO question and its cross-post on CR. Default 120 minutes.
* param @maximumCharacterCountDifferenceAllowed int not null : The maximum number
* of characters difference between the body of the question.
* NOTE: The higher the number, the more likely that it's not actually a cross-post.
*/
if object_id('tempdb..#LanguageTags') is not null
drop table #LanguageTags;
if object_id('tempdb..#CrossPosts') is not null
drop table #CrossPosts;
go
create table #LanguageTags (
TagName varchar(35) collate SQL_Latin1_General_CP1_CS_AS
, constraint pk_#LanguageTags primary key (TagName)
);
go
insert into #LanguageT
I have left out part of the list of language tags as I felt it was just noise. You can see the full collection here.
You can run this query on SEDE and play with it as you wish.
Note that I left plenty of wiggle room for others to modify and aggregate their own data starting from the simple query at the bottom.
1089 rows returned in 5886 ms
How could I improve this with regards to readability, performance, etc.?
```
/**
* The objective of this query is to gather data related to cross-posts from
* Stack Overflow (SO) to Code Review (CR). A cross-post as defined in this context is
* a question which has first been asked on SO and then a short time later asked
* again on CR (albeit often slightly modified in the way it is titled or phrased).
* Querying from 2 or more sites requires cross-database queries, and the following
* 2 databases are used here. All relevant tables are in the [dbo] schema.
* - Stack Overflow DB: [StackOverflow]
* - Code Review DB: [StackExchange.Codereview]
* 2 temporary tables are used in order to compensate for the physsical limitations
* of SEDE which otherwise will often time out before the query is completed.
* param @minutesFromSoPostToCrPost int not null : The number of minutes allowed between the original
* SO question and its cross-post on CR. Default 120 minutes.
* param @maximumCharacterCountDifferenceAllowed int not null : The maximum number
* of characters difference between the body of the question.
* NOTE: The higher the number, the more likely that it's not actually a cross-post.
*/
if object_id('tempdb..#LanguageTags') is not null
drop table #LanguageTags;
if object_id('tempdb..#CrossPosts') is not null
drop table #CrossPosts;
go
create table #LanguageTags (
TagName varchar(35) collate SQL_Latin1_General_CP1_CS_AS
, constraint pk_#LanguageTags primary key (TagName)
);
go
insert into #LanguageT
Solution
Style
I like your style. I don't mind reading keywords in lowercase. I
think it's easier on the eyes, but then again, I'm fairly used to
case-sensitive languages with lowercase keywords and syntax
highlighting.
I'm also very fond of the prefix column aliases as opposed to the
postfix
Bug if I would have to nitpick: the hanging comma in the
statement messes with the alignment of the column names. There :)
Filter-join on lookup table
The joins on both
that for each and every
unfortunate m-to-n match-and-filter), and then for each of those, it
will join with the
The
pre-lookup all the tags that we want to use, by adding their
respective
follows:
Now we can use this table to filter tags that we want to see by
table now.
SEDE and cross database queries
It seems that SEDE is not really optimized to join the large database
tables to temp tables. So when we put the join described above inside
the fetching query, we get a timeout. The solution: store the
values in the temp table:
and then filter-and-distinct on the
query:
Turns out that this is quite a bit faster, as the query now returns
1130 rows returned in 4166 ms
I like your style. I don't mind reading keywords in lowercase. I
think it's easier on the eyes, but then again, I'm fairly used to
case-sensitive languages with lowercase keywords and syntax
highlighting.
I'm also very fond of the prefix column aliases as opposed to the
postfix
AS Alias style that is mainstream.Bug if I would have to nitpick: the hanging comma in the
create tablestatement messes with the alignment of the column names. There :)
Filter-join on lookup table
The joins on both
Tags tables is very deeply nested. Which means,that for each and every
Post, SQL will join with each and everyPostTag (which is unavoidable, because they are needed in anunfortunate m-to-n match-and-filter), and then for each of those, it
will join with the
Tags table.The
Tags table is actually not needed in this join. We canpre-lookup all the tags that we want to use, by adding their
respective
TagId values to the #LanguageTags temp table asfollows:
create table #LanguageTags (
TagName varchar(35) collate SQL_Latin1_General_CP1_CS_AS
, CrTagId int
, SoTagId int
, constraint pk_#LanguageTags primary key (TagName)
);
go
insert into #LanguageTags (TagName)
values
('applescript'),
('asp.net-mvc-3'),
('bash'),
('brainfuck'),
('c'),
('c#'),
('c++'),
/*SNIP...*/
('sql'),
('swift'),
('wolfram-mathematica'),
('xslt');
go
update Langs
set CrTagId = CrTags.Id
, SoTagId = SoTags.Id
from #LanguageTags Langs
inner join [StackExchange.CodeReview].dbo.Tags as CrTags
on CrTags.TagName = Langs.TagName
inner join [StackOverflow].dbo.Tags as SoTags
on SoTags.TagName = Langs.TagName;Now we can use this table to filter tags that we want to see by
TagId, and we can also join the posts from both sites to a singletable now.
SEDE and cross database queries
It seems that SEDE is not really optimized to join the large database
tables to temp tables. So when we put the join described above inside
the fetching query, we get a timeout. The solution: store the
TagIdvalues in the temp table:
select
-- SNIP
, [Tags] = CrPosts.Tags
, [CrTagId] = CrPT.TagId
, [SoTagId] = SoPT.TagIdand then filter-and-distinct on the
#LanguageTags table in the finalquery:
select distinct
CP.[Primary Stack]
, CP.[Primary User]
, CP.[SO Original]
, CP.[CR Xpost]
, CP.[CharCountDiff]
, CP.[SO Score]
, CP.[CR Score]
, CP.[SO Status]
, CP.[CR Status]
, CP.[DugaComments?]
, CP.[SO Answers]
, CP.[CR Answers]
, CP.[SO Accept?]
, CP.[CR Accept?]
, CP.[SO Created]
, CP.[Minutes to Xpost]
, CP.[Tags]
, CP.[CrTagId]
, CP.[SoTagId]
from #CrossPosts CP
/*Match at least one language tag from CR->SO per post.*/
inner join #LanguageTags as Langs
on Langs.SoTagId = CP.SoTagId
and Langs.CrTagId = CP.CrTagId
order by [SO Created] desc;Turns out that this is quite a bit faster, as the query now returns
1130 rows returned in 4166 ms
Code Snippets
create table #LanguageTags (
TagName varchar(35) collate SQL_Latin1_General_CP1_CS_AS
, CrTagId int
, SoTagId int
, constraint pk_#LanguageTags primary key (TagName)
);
go
insert into #LanguageTags (TagName)
values
('applescript'),
('asp.net-mvc-3'),
('bash'),
('brainfuck'),
('c'),
('c#'),
('c++'),
/*SNIP...*/
('sql'),
('swift'),
('wolfram-mathematica'),
('xslt');
go
update Langs
set CrTagId = CrTags.Id
, SoTagId = SoTags.Id
from #LanguageTags Langs
inner join [StackExchange.CodeReview].dbo.Tags as CrTags
on CrTags.TagName = Langs.TagName
inner join [StackOverflow].dbo.Tags as SoTags
on SoTags.TagName = Langs.TagName;select
-- SNIP
, [Tags] = CrPosts.Tags
, [CrTagId] = CrPT.TagId
, [SoTagId] = SoPT.TagIdselect distinct
CP.[Primary Stack]
, CP.[Primary User]
, CP.[SO Original]
, CP.[CR Xpost]
, CP.[CharCountDiff]
, CP.[SO Score]
, CP.[CR Score]
, CP.[SO Status]
, CP.[CR Status]
, CP.[DugaComments?]
, CP.[SO Answers]
, CP.[CR Answers]
, CP.[SO Accept?]
, CP.[CR Accept?]
, CP.[SO Created]
, CP.[Minutes to Xpost]
, CP.[Tags]
, CP.[CrTagId]
, CP.[SoTagId]
from #CrossPosts CP
/*Match at least one language tag from CR->SO per post.*/
inner join #LanguageTags as Langs
on Langs.SoTagId = CP.SoTagId
and Langs.CrTagId = CP.CrTagId
order by [SO Created] desc;Context
StackExchange Code Review Q#114406, answer score: 4
Revisions (0)
No revisions yet.