HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Top Python badged users from NYC

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
topusersbadgedpythonfromnyc

Problem

I forked this StackExchange Data Explorer (SEDE) query from another one, and made it a bit more complex. I'm looking for feedback on my style, as well as answers to other questions (see below).

-- top users: NYC
-- forked from Avinash Raj's query for Chennai
select
    row_number() over(order by u.Reputation desc) as [#], 
    u.Id as [User Link], 
    u.Reputation,
    u.Location 
from -- denormalize the Users with their Badges, i.e., flatten into a single table.
    Users u
  join 
    Badges b
  on
    u.Id = b.UserId
where
    b.name = N'python'
  and
    b.class = 3 -- 1 is gold, 2: silver, 3: bronze
  and
    u.Reputation >= 1000 
  and 
    (
      lower(u.Location) like '%nyc%'
      or upper(Location) like '%NY, NY%'
      or lower(Location) like '%staten%'
      or lower(Location) like '%bronx%'
      or lower(Location) like '%queens%'
      or lower(Location) like '%new york, ny%'
      or lower(Location) like '%manhattan%'
      or lower(Location) like '%brooklyn%'
    )
order by
    u.Reputation desc;


Questions:

  • Would this sql work with all relational DB's that accept SQL?



  • If not, what is non-standard? How would I standardize it?



  • I don't like all-uppercase queries - I find them more difficult to read, so I used the lowercase, is that problematic?



  • Is it possible, in a standard way, to factor pieces out of this query? Or nonstandard way specific to SEDE (or other system)? For example, could I avoid doing lower(Location) repeatedly? What about aliasing my joined table outside the code block for usage inside of it? I have order by u.Reputation desc twice, for example, can I factor that out?



I have just got some feedback on using table identifiers so that people don't have to guess, so I'll try to do that from now on. Am I doing it anywhere that's silly?

I was asked about allowing user input for the badge name - Yes, I intended to do that too - as well as specify type of badge (gold, silver, bronze) but I ran out of time, and don

Solution

Questions

First to address some of your questions:

  • Would this sql work with all relational DB's that accept SQL?



No, it would not. In fact it's quite unrealistic (if not impossible) to write SQL queries that would port over to multiple DBMS, as they all have different syntax and different approaches. For example, row_number() over(order by u.Reputation desc) only works with Transact-SQL language, it would fail with syntax error on every other DBMS that I know of.

The best way to write queries that can port to multiple DBMS is to use a library/framework that supports it. For example, Python has SQLAlchemy and Java has Hibernate. These can adapt queries to the database being used.

  • If not, what is non-standard? How would I standardize it?



See (1) above. There's not really a "standard" per-se, it really depends on the DBMS and on the application code that uses it.

  • I don't like all-uppercase queries - I find them more difficult to read, so I used the lowercase, is that problematic?



No, it is not problematic. Just be consistent with whatever style you use.

  • Is it possible, in a standard way, to factor pieces out of this query? Or nonstandard way specific to SEDE (or other system)? For example, could I avoid doing lower(Location) repeatedly? What about aliasing my joined table outside the code block for usage inside of it? I have order by u.Reputation desc twice, for example, can I factor that out?



The nature of SQL makes it somewhat difficult to refactor parts of queries out. Most SQL engines support the creation of user-defined functions and/or procedures, which can help to simplify queries. In your case those calls to lower() and upper() are needed because SEDE has case-sensitivity enabled.

Review

from -- denormalize the Users with their Badges, i.e., flatten into a single table.
    Users u
  join 
    Badges b


This from clause is a bit ugly. The aliases u and b are not very good names. It's a good habit to use meaningful names for aliases, which are meant to be identifiers rather than just a way to shorten the references. Here with having such short table names, an alias seems like overkill. It is also a good habit to use the as keyword and to specify the schema (although in the case of SEDE they are all the same schema, some other databases have tables organized in multiple schemes and it can become ambiguous).

from 
    dbo.Users as usr
  -- denormalize the Users with their Badges, i.e., flatten into a single table.
  join dbo.Badges as bdg
    on usr.Id = bdg.UserId


[Rank] would make a better column name than [#].

Your order by clause at the end is redundant, you are already sorting in your row_number() clause, so no need to sort them again. orber by is expensive so you would want to limit the amount of ordering as much as you can.

We can extract the badge name and minimum reputation into variables:

declare @targetBadgeName nvarchar(50) = N'python';
declare @minimumReputation int = 1000;


And just use those in the query. You could also search multiple badges if you used a table variable and then just joined it.

I think it would be elegant to include the Gold, Silver and Bronze into the query if you want that as part of your result set. Note that this will decrease performance due to the min() aggregate and resulting group by clauses:

case min(bdg.Class)
  when 1 then 'Gold'
  when 2 then 'Silver'
  when 3 then 'Bronze'
  else 'unknown' end as [Highest Badge],


Now your result set will look like this:

Rank User Link               Reputation Highest Badge Location                    
---- ----------------------- ---------- ------------- --------------------------- 
1    cletus                  368937     Bronze        New York, NY                
2    Triptych                96145      Gold          NYC                         
3    Claudiu                 81412      Gold          New York, NY                
4    Ben Hoffstein           59650      Bronze        New York, NY                
5    Yuji 'Tomita' Tomita    59318      Gold          New York, NY                
6    danben                  42686      Bronze        New York, NY                
7    David Robinson          42489      Gold          New York, NY                
8    Larry Lustig            33544      Bronze        New York, NY                
9    Aaron Hall              32035      Gold          New York, NY, United States 
10   chown                   31348      Silver        New York, NY


Finally, we can use a bit of SEDE magic to parametrize the query:

-- badgeName: Badge to search for:
declare @targetBadgeName nvarchar(50) = N##badgeName:string?python##;
-- minimumRep: Minimum reputation:
declare @minimumReputation int = ##minimumRep:int?1000##;


This will let you search other badges and different rep thresholds.

Everything combined (demo):

```
-- top users in NYC region for desired tag (default Python)
-- forked from Avinash

Code Snippets

from -- denormalize the Users with their Badges, i.e., flatten into a single table.
    Users u
  join 
    Badges b
from 
    dbo.Users as usr
  -- denormalize the Users with their Badges, i.e., flatten into a single table.
  join dbo.Badges as bdg
    on usr.Id = bdg.UserId
declare @targetBadgeName nvarchar(50) = N'python';
declare @minimumReputation int = 1000;
case min(bdg.Class)
  when 1 then 'Gold'
  when 2 then 'Silver'
  when 3 then 'Bronze'
  else 'unknown' end as [Highest Badge],
Rank User Link               Reputation Highest Badge Location                    
---- ----------------------- ---------- ------------- --------------------------- 
1    cletus                  368937     Bronze        New York, NY                
2    Triptych                96145      Gold          NYC                         
3    Claudiu                 81412      Gold          New York, NY                
4    Ben Hoffstein           59650      Bronze        New York, NY                
5    Yuji 'Tomita' Tomita    59318      Gold          New York, NY                
6    danben                  42686      Bronze        New York, NY                
7    David Robinson          42489      Gold          New York, NY                
8    Larry Lustig            33544      Bronze        New York, NY                
9    Aaron Hall              32035      Gold          New York, NY, United States 
10   chown                   31348      Silver        New York, NY

Context

StackExchange Code Review Q#117852, answer score: 7

Revisions (0)

No revisions yet.