patternsqlMinor
Top Python badged users from NYC
Viewed 0 times
topusersbadgedpythonfromnyc
Problem
I forked this StackExchange Data Explorer (SEDE) query from another one, and made it a bit more complex. I'm looking for feedback on my style, as well as answers to other questions (see below).
Questions:
I have just got some feedback on using table identifiers so that people don't have to guess, so I'll try to do that from now on. Am I doing it anywhere that's silly?
I was asked about allowing user input for the badge name - Yes, I intended to do that too - as well as specify type of badge (gold, silver, bronze) but I ran out of time, and don
-- top users: NYC
-- forked from Avinash Raj's query for Chennai
select
row_number() over(order by u.Reputation desc) as [#],
u.Id as [User Link],
u.Reputation,
u.Location
from -- denormalize the Users with their Badges, i.e., flatten into a single table.
Users u
join
Badges b
on
u.Id = b.UserId
where
b.name = N'python'
and
b.class = 3 -- 1 is gold, 2: silver, 3: bronze
and
u.Reputation >= 1000
and
(
lower(u.Location) like '%nyc%'
or upper(Location) like '%NY, NY%'
or lower(Location) like '%staten%'
or lower(Location) like '%bronx%'
or lower(Location) like '%queens%'
or lower(Location) like '%new york, ny%'
or lower(Location) like '%manhattan%'
or lower(Location) like '%brooklyn%'
)
order by
u.Reputation desc;Questions:
- Would this sql work with all relational DB's that accept SQL?
- If not, what is non-standard? How would I standardize it?
- I don't like all-uppercase queries - I find them more difficult to read, so I used the lowercase, is that problematic?
- Is it possible, in a standard way, to factor pieces out of this query? Or nonstandard way specific to SEDE (or other system)? For example, could I avoid doing
lower(Location)repeatedly? What about aliasing my joined table outside the code block for usage inside of it? I haveorder by u.Reputation desctwice, for example, can I factor that out?
I have just got some feedback on using table identifiers so that people don't have to guess, so I'll try to do that from now on. Am I doing it anywhere that's silly?
I was asked about allowing user input for the badge name - Yes, I intended to do that too - as well as specify type of badge (gold, silver, bronze) but I ran out of time, and don
Solution
Questions
First to address some of your questions:
No, it would not. In fact it's quite unrealistic (if not impossible) to write SQL queries that would port over to multiple DBMS, as they all have different syntax and different approaches. For example,
The best way to write queries that can port to multiple DBMS is to use a library/framework that supports it. For example, Python has SQLAlchemy and Java has Hibernate. These can adapt queries to the database being used.
See (1) above. There's not really a "standard" per-se, it really depends on the DBMS and on the application code that uses it.
No, it is not problematic. Just be consistent with whatever style you use.
The nature of SQL makes it somewhat difficult to refactor parts of queries out. Most SQL engines support the creation of user-defined functions and/or procedures, which can help to simplify queries. In your case those calls to
Review
This
Your
We can extract the badge name and minimum reputation into variables:
And just use those in the query. You could also search multiple badges if you used a table variable and then just joined it.
I think it would be elegant to include the Gold, Silver and Bronze into the query if you want that as part of your result set. Note that this will decrease performance due to the
Now your result set will look like this:
Finally, we can use a bit of SEDE magic to parametrize the query:
This will let you search other badges and different rep thresholds.
Everything combined (demo):
```
-- top users in NYC region for desired tag (default Python)
-- forked from Avinash
First to address some of your questions:
- Would this sql work with all relational DB's that accept SQL?
No, it would not. In fact it's quite unrealistic (if not impossible) to write SQL queries that would port over to multiple DBMS, as they all have different syntax and different approaches. For example,
row_number() over(order by u.Reputation desc) only works with Transact-SQL language, it would fail with syntax error on every other DBMS that I know of.The best way to write queries that can port to multiple DBMS is to use a library/framework that supports it. For example, Python has SQLAlchemy and Java has Hibernate. These can adapt queries to the database being used.
- If not, what is non-standard? How would I standardize it?
See (1) above. There's not really a "standard" per-se, it really depends on the DBMS and on the application code that uses it.
- I don't like all-uppercase queries - I find them more difficult to read, so I used the lowercase, is that problematic?
No, it is not problematic. Just be consistent with whatever style you use.
- Is it possible, in a standard way, to factor pieces out of this query? Or nonstandard way specific to SEDE (or other system)? For example, could I avoid doing lower(Location) repeatedly? What about aliasing my joined table outside the code block for usage inside of it? I have order by u.Reputation desc twice, for example, can I factor that out?
The nature of SQL makes it somewhat difficult to refactor parts of queries out. Most SQL engines support the creation of user-defined functions and/or procedures, which can help to simplify queries. In your case those calls to
lower() and upper() are needed because SEDE has case-sensitivity enabled.Review
from -- denormalize the Users with their Badges, i.e., flatten into a single table.
Users u
join
Badges bThis
from clause is a bit ugly. The aliases u and b are not very good names. It's a good habit to use meaningful names for aliases, which are meant to be identifiers rather than just a way to shorten the references. Here with having such short table names, an alias seems like overkill. It is also a good habit to use the as keyword and to specify the schema (although in the case of SEDE they are all the same schema, some other databases have tables organized in multiple schemes and it can become ambiguous).from
dbo.Users as usr
-- denormalize the Users with their Badges, i.e., flatten into a single table.
join dbo.Badges as bdg
on usr.Id = bdg.UserId[Rank] would make a better column name than [#].Your
order by clause at the end is redundant, you are already sorting in your row_number() clause, so no need to sort them again. orber by is expensive so you would want to limit the amount of ordering as much as you can.We can extract the badge name and minimum reputation into variables:
declare @targetBadgeName nvarchar(50) = N'python';
declare @minimumReputation int = 1000;And just use those in the query. You could also search multiple badges if you used a table variable and then just joined it.
I think it would be elegant to include the Gold, Silver and Bronze into the query if you want that as part of your result set. Note that this will decrease performance due to the
min() aggregate and resulting group by clauses:case min(bdg.Class)
when 1 then 'Gold'
when 2 then 'Silver'
when 3 then 'Bronze'
else 'unknown' end as [Highest Badge],Now your result set will look like this:
Rank User Link Reputation Highest Badge Location
---- ----------------------- ---------- ------------- ---------------------------
1 cletus 368937 Bronze New York, NY
2 Triptych 96145 Gold NYC
3 Claudiu 81412 Gold New York, NY
4 Ben Hoffstein 59650 Bronze New York, NY
5 Yuji 'Tomita' Tomita 59318 Gold New York, NY
6 danben 42686 Bronze New York, NY
7 David Robinson 42489 Gold New York, NY
8 Larry Lustig 33544 Bronze New York, NY
9 Aaron Hall 32035 Gold New York, NY, United States
10 chown 31348 Silver New York, NYFinally, we can use a bit of SEDE magic to parametrize the query:
-- badgeName: Badge to search for:
declare @targetBadgeName nvarchar(50) = N##badgeName:string?python##;
-- minimumRep: Minimum reputation:
declare @minimumReputation int = ##minimumRep:int?1000##;This will let you search other badges and different rep thresholds.
Everything combined (demo):
```
-- top users in NYC region for desired tag (default Python)
-- forked from Avinash
Code Snippets
from -- denormalize the Users with their Badges, i.e., flatten into a single table.
Users u
join
Badges bfrom
dbo.Users as usr
-- denormalize the Users with their Badges, i.e., flatten into a single table.
join dbo.Badges as bdg
on usr.Id = bdg.UserIddeclare @targetBadgeName nvarchar(50) = N'python';
declare @minimumReputation int = 1000;case min(bdg.Class)
when 1 then 'Gold'
when 2 then 'Silver'
when 3 then 'Bronze'
else 'unknown' end as [Highest Badge],Rank User Link Reputation Highest Badge Location
---- ----------------------- ---------- ------------- ---------------------------
1 cletus 368937 Bronze New York, NY
2 Triptych 96145 Gold NYC
3 Claudiu 81412 Gold New York, NY
4 Ben Hoffstein 59650 Bronze New York, NY
5 Yuji 'Tomita' Tomita 59318 Gold New York, NY
6 danben 42686 Bronze New York, NY
7 David Robinson 42489 Gold New York, NY
8 Larry Lustig 33544 Bronze New York, NY
9 Aaron Hall 32035 Gold New York, NY, United States
10 chown 31348 Silver New York, NYContext
StackExchange Code Review Q#117852, answer score: 7
Revisions (0)
No revisions yet.