HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Cross-posts from Stack Overflow to CR

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
fromstackcrossoverflowposts

Problem

What this code does it explained in detail in the top portion of the query.

I have left out part of the list of language tags as I felt it was just noise. You can see the full collection here.

You can run this query on SEDE and play with it as you wish.

Note that I left plenty of wiggle room for others to modify and aggregate their own data starting from the simple query at the bottom.


1089 rows returned in 5886 ms

How could I improve this with regards to readability, performance, etc.?

```
/**
* The objective of this query is to gather data related to cross-posts from
* Stack Overflow (SO) to Code Review (CR). A cross-post as defined in this context is
* a question which has first been asked on SO and then a short time later asked
* again on CR (albeit often slightly modified in the way it is titled or phrased).
* Querying from 2 or more sites requires cross-database queries, and the following
* 2 databases are used here. All relevant tables are in the [dbo] schema.
* - Stack Overflow DB: [StackOverflow]
* - Code Review DB: [StackExchange.Codereview]
* 2 temporary tables are used in order to compensate for the physsical limitations
* of SEDE which otherwise will often time out before the query is completed.
* param @minutesFromSoPostToCrPost int not null : The number of minutes allowed between the original
* SO question and its cross-post on CR. Default 120 minutes.
* param @maximumCharacterCountDifferenceAllowed int not null : The maximum number
* of characters difference between the body of the question.
* NOTE: The higher the number, the more likely that it's not actually a cross-post.
*/
if object_id('tempdb..#LanguageTags') is not null
drop table #LanguageTags;
if object_id('tempdb..#CrossPosts') is not null
drop table #CrossPosts;
go
create table #LanguageTags (
TagName varchar(35) collate SQL_Latin1_General_CP1_CS_AS
, constraint pk_#LanguageTags primary key (TagName)
);
go
insert into #LanguageT

Solution

Style

I like your style. I don't mind reading keywords in lowercase. I
think it's easier on the eyes, but then again, I'm fairly used to
case-sensitive languages with lowercase keywords and syntax
highlighting.

I'm also very fond of the prefix column aliases as opposed to the
postfix AS Alias style that is mainstream.

Bug if I would have to nitpick: the hanging comma in the create table
statement messes with the alignment of the column names. There :)

Filter-join on lookup table

The joins on both Tags tables is very deeply nested. Which means,
that for each and every Post, SQL will join with each and every
PostTag (which is unavoidable, because they are needed in an
unfortunate m-to-n match-and-filter), and then for each of those, it
will join with the Tags table.

The Tags table is actually not needed in this join. We can
pre-lookup all the tags that we want to use, by adding their
respective TagId values to the #LanguageTags temp table as
follows:

create table #LanguageTags (
    TagName varchar(35) collate SQL_Latin1_General_CP1_CS_AS
    , CrTagId int
    , SoTagId int
    , constraint pk_#LanguageTags primary key (TagName)
);
go
insert into #LanguageTags (TagName)
values
  ('applescript'),
  ('asp.net-mvc-3'),
  ('bash'),
  ('brainfuck'),
  ('c'),
  ('c#'),
  ('c++'),

  /*SNIP...*/

  ('sql'),
  ('swift'),
  ('wolfram-mathematica'),
  ('xslt');
go
update Langs
   set CrTagId = CrTags.Id
     , SoTagId = SoTags.Id
  from #LanguageTags Langs
 inner join [StackExchange.CodeReview].dbo.Tags as CrTags
         on CrTags.TagName = Langs.TagName
 inner join [StackOverflow].dbo.Tags as SoTags
         on SoTags.TagName = Langs.TagName;


Now we can use this table to filter tags that we want to see by
TagId, and we can also join the posts from both sites to a single
table now.

SEDE and cross database queries

It seems that SEDE is not really optimized to join the large database
tables to temp tables. So when we put the join described above inside
the fetching query, we get a timeout. The solution: store the TagId
values in the temp table:

select
  -- SNIP
  , [Tags] = CrPosts.Tags
  , [CrTagId] = CrPT.TagId
  , [SoTagId] = SoPT.TagId


and then filter-and-distinct on the #LanguageTags table in the final
query:

select distinct
       CP.[Primary Stack]
     , CP.[Primary User]
     , CP.[SO Original]
     , CP.[CR Xpost]
     , CP.[CharCountDiff]
     , CP.[SO Score]
     , CP.[CR Score]
     , CP.[SO Status]
     , CP.[CR Status]
     , CP.[DugaComments?]
     , CP.[SO Answers]
     , CP.[CR Answers]
     , CP.[SO Accept?]
     , CP.[CR Accept?]
     , CP.[SO Created]
     , CP.[Minutes to Xpost]
     , CP.[Tags]
     , CP.[CrTagId]
     , CP.[SoTagId]
  from #CrossPosts CP
    /*Match at least one language tag from CR->SO per post.*/
 inner join #LanguageTags as Langs
         on Langs.SoTagId = CP.SoTagId
        and Langs.CrTagId = CP.CrTagId
 order by [SO Created] desc;


Turns out that this is quite a bit faster, as the query now returns


1130 rows returned in 4166 ms

Code Snippets

create table #LanguageTags (
    TagName varchar(35) collate SQL_Latin1_General_CP1_CS_AS
    , CrTagId int
    , SoTagId int
    , constraint pk_#LanguageTags primary key (TagName)
);
go
insert into #LanguageTags (TagName)
values
  ('applescript'),
  ('asp.net-mvc-3'),
  ('bash'),
  ('brainfuck'),
  ('c'),
  ('c#'),
  ('c++'),

  /*SNIP...*/

  ('sql'),
  ('swift'),
  ('wolfram-mathematica'),
  ('xslt');
go
update Langs
   set CrTagId = CrTags.Id
     , SoTagId = SoTags.Id
  from #LanguageTags Langs
 inner join [StackExchange.CodeReview].dbo.Tags as CrTags
         on CrTags.TagName = Langs.TagName
 inner join [StackOverflow].dbo.Tags as SoTags
         on SoTags.TagName = Langs.TagName;
select
  -- SNIP
  , [Tags] = CrPosts.Tags
  , [CrTagId] = CrPT.TagId
  , [SoTagId] = SoPT.TagId
select distinct
       CP.[Primary Stack]
     , CP.[Primary User]
     , CP.[SO Original]
     , CP.[CR Xpost]
     , CP.[CharCountDiff]
     , CP.[SO Score]
     , CP.[CR Score]
     , CP.[SO Status]
     , CP.[CR Status]
     , CP.[DugaComments?]
     , CP.[SO Answers]
     , CP.[CR Answers]
     , CP.[SO Accept?]
     , CP.[CR Accept?]
     , CP.[SO Created]
     , CP.[Minutes to Xpost]
     , CP.[Tags]
     , CP.[CrTagId]
     , CP.[SoTagId]
  from #CrossPosts CP
    /*Match at least one language tag from CR->SO per post.*/
 inner join #LanguageTags as Langs
         on Langs.SoTagId = CP.SoTagId
        and Langs.CrTagId = CP.CrTagId
 order by [SO Created] desc;

Context

StackExchange Code Review Q#114406, answer score: 4

Revisions (0)

No revisions yet.