patternsqlMinor
GROUP BY and ORDER BY problem
Viewed 0 times
andproblemordergroup
Problem
I have two tables like this:
and
My current query is:
And the output is:
I would like to return the name matches of all rows in
To get the best match I tried to add
Added comments:
CREATE TABLE cmap5 (
name varchar(2000),
lexemes tsquery
);and
CREATE TABLE IF NOT EXISTS synonyms_all_gin_tsvcolumn (
cid int NOT NULL, -- REFERENCES pubchem_compounds_index(cid)
name varchar(2000) NOT NULL,
synonym varchar(2000) NOT NULL,
tsv_syns tsvector,
PRIMARY KEY (cid, name, synonym)
);My current query is:
SELECT s.cid, s.synonym, c.name, ts_rank(s.tsv_syns,c.lexemes,16)
FROM synonyms_all_gin_tsvcolumn s, cmap5 c
WHERE c.lexemes @@ s.tsv_synsAnd the output is:
cid | synonym | name (query) | rank
5474706 | 10-Methoxyharmalan | 10-methoxyharmalan | 0.0901673
1416 | (+/-)12,13-EODE | 12,13-EODE | 0.211562
5356421 | LEUKOTOXIN B (12,13-EODE) | 12,13-EODE | 0.211562
180933 | 1,4-Chrysenequinone | 1,4-chrysenequinone | 0.211562
5283035 | 15-Deoxy-delta-12,14-prostaglandin J2 | 15-delta prostaglandin J2 | 0.304975
5311211 | 15-deoxy-delta 12 14-prostaglandin J2 | 15-delta prostaglandin J2 | 0.304975
5311211 | 15-deoxy-Delta(12,14)-prostaglandin J2| 15-delta prostaglandin J2 | 0.304975
5311211 | 15-Deoxy-delta-12,14-prostaglandin J2 | 15-delta prostaglandin J2 | 0.304975
5311211 | 15-Deoxy-delta 12, 14-Prostaglandin J2| 15-delta prostaglandin J2 | 0.304975I would like to return the name matches of all rows in
cmap5 in my main table ranked by the ts_rank() function but for each row in cmap5 I want to:- select only the best X
cids to each query (group by cid)
- or
ORDER BY my results as 1+ts_rank/count(cid)
To get the best match I tried to add
select distinct on c.name, but when the rank is the same I want to get the cid with more matches to the query. I have tried adding a simple group by at the end of the query but I get an error, how could I do this?Added comments:
Solution
First off, your
If one of your
I guess what you want is this, because it would make sense:
-
I use explicit
-
Group the results per
-
Reduce to one row per
-
Select first row in each GROUP BY group?
Combining
PRIMARY KEY spanning two varchar(2000) columns seems extremely expensive. If you use your PK for anything else I suggest a surrogate PK (use a serial column) and add a UNIQUE constraint to enforce uniqueness on (cid, name, synonym).If one of your
varchar columns actually uses the maximum length you would exceed the maximum size for an index entry. See:- Character varying index overhead & length limit
I guess what you want is this, because it would make sense:
SELECT DISTINCT ON (c.name)
c.name, min(s.synonym) AS min_synonym, s.cid
, ts_rank(s.tsv_syns, c.lexemes, 16) AS rnk
, count(*) AS ct
FROM synonyms_all_gin_tsvcolumn s
JOIN cmap5 c ON c.lexemes @@ s.tsv_syns
GROUP BY c.name, rnk, s.cid
ORDER BY c.name, rnk DESC, ct DESC;-
I use explicit
[INNER] JOIN with attached join condition replacing your CROSS JOIN plus WHERE clause. It's generally considered superior (easier to read and debug). I also use rnk as column name to avoid the basic function name rank as identifier.-
Group the results per
c.name that have the same rnk and s.cid, take min(s.synonym) (for lack of definition in the question), and count(*) the peers per group.-
Reduce to one row per
c.name with DISTINCT ON (Postgres specific extension of SQL standard DISTINCT), taking the highest rank first and, within same rank, the highest peer count. See:-
Select first row in each GROUP BY group?
Combining
GROUP BY and DISTINCT ON this way in one query level is possible since DISTINCT or DISTINCT ON are applied after GROUP BY.Code Snippets
SELECT DISTINCT ON (c.name)
c.name, min(s.synonym) AS min_synonym, s.cid
, ts_rank(s.tsv_syns, c.lexemes, 16) AS rnk
, count(*) AS ct
FROM synonyms_all_gin_tsvcolumn s
JOIN cmap5 c ON c.lexemes @@ s.tsv_syns
GROUP BY c.name, rnk, s.cid
ORDER BY c.name, rnk DESC, ct DESC;Context
StackExchange Database Administrators Q#35681, answer score: 9
Revisions (0)
No revisions yet.