HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

GROUP BY and ORDER BY problem

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
andproblemordergroup

Problem

I have two tables like this:

CREATE TABLE cmap5 (
   name     varchar(2000),
   lexemes  tsquery
);


and

CREATE TABLE IF NOT EXISTS synonyms_all_gin_tsvcolumn (
   cid       int NOT NULL,  -- REFERENCES pubchem_compounds_index(cid)
   name      varchar(2000) NOT NULL,  
   synonym   varchar(2000) NOT NULL,
   tsv_syns  tsvector,
   PRIMARY KEY (cid, name, synonym)
);


My current query is:

SELECT s.cid, s.synonym, c.name, ts_rank(s.tsv_syns,c.lexemes,16) 
FROM synonyms_all_gin_tsvcolumn s, cmap5 c
WHERE c.lexemes @@ s.tsv_syns


And the output is:

cid     |  synonym                              | name (query)              | rank
5474706 | 10-Methoxyharmalan                    | 10-methoxyharmalan        | 0.0901673
1416    | (+/-)12,13-EODE                       | 12,13-EODE                | 0.211562
5356421 | LEUKOTOXIN B (12,13-EODE)             | 12,13-EODE                | 0.211562
 180933 | 1,4-Chrysenequinone                   | 1,4-chrysenequinone       | 0.211562
5283035 | 15-Deoxy-delta-12,14-prostaglandin J2 | 15-delta prostaglandin J2 | 0.304975
5311211 | 15-deoxy-delta 12 14-prostaglandin J2 | 15-delta prostaglandin J2 | 0.304975
5311211 | 15-deoxy-Delta(12,14)-prostaglandin J2| 15-delta prostaglandin J2 | 0.304975
5311211 | 15-Deoxy-delta-12,14-prostaglandin J2 | 15-delta prostaglandin J2 | 0.304975
5311211 | 15-Deoxy-delta 12, 14-Prostaglandin J2| 15-delta prostaglandin J2 | 0.304975


I would like to return the name matches of all rows in cmap5 in my main table ranked by the ts_rank() function but for each row in cmap5 I want to:

  • select only the best X cids to each query (group by cid)



  • or ORDER BY my results as 1+ts_rank/count(cid)



To get the best match I tried to add select distinct on c.name, but when the rank is the same I want to get the cid with more matches to the query. I have tried adding a simple group by at the end of the query but I get an error, how could I do this?

Added comments:

Solution

First off, your PRIMARY KEY spanning two varchar(2000) columns seems extremely expensive. If you use your PK for anything else I suggest a surrogate PK (use a serial column) and add a UNIQUE constraint to enforce uniqueness on (cid, name, synonym).

If one of your varchar columns actually uses the maximum length you would exceed the maximum size for an index entry. See:

  • Character varying index overhead & length limit



I guess what you want is this, because it would make sense:

SELECT DISTINCT ON (c.name)
       c.name, min(s.synonym) AS min_synonym, s.cid
     , ts_rank(s.tsv_syns, c.lexemes, 16) AS rnk
     , count(*) AS ct
FROM   synonyms_all_gin_tsvcolumn s
JOIN   cmap5                      c ON c.lexemes @@ s.tsv_syns
GROUP  BY c.name, rnk, s.cid
ORDER  BY c.name, rnk DESC, ct DESC;


-
I use explicit [INNER] JOIN with attached join condition replacing your CROSS JOIN plus WHERE clause. It's generally considered superior (easier to read and debug). I also use rnk as column name to avoid the basic function name rank as identifier.

-
Group the results per c.name that have the same rnk and s.cid, take min(s.synonym) (for lack of definition in the question), and count(*) the peers per group.

-
Reduce to one row per c.name with DISTINCT ON (Postgres specific extension of SQL standard DISTINCT), taking the highest rank first and, within same rank, the highest peer count. See:

-
Select first row in each GROUP BY group?

Combining GROUP BY and DISTINCT ON this way in one query level is possible since DISTINCT or DISTINCT ON are applied after GROUP BY.

Code Snippets

SELECT DISTINCT ON (c.name)
       c.name, min(s.synonym) AS min_synonym, s.cid
     , ts_rank(s.tsv_syns, c.lexemes, 16) AS rnk
     , count(*) AS ct
FROM   synonyms_all_gin_tsvcolumn s
JOIN   cmap5                      c ON c.lexemes @@ s.tsv_syns
GROUP  BY c.name, rnk, s.cid
ORDER  BY c.name, rnk DESC, ct DESC;

Context

StackExchange Database Administrators Q#35681, answer score: 9

Revisions (0)

No revisions yet.