patternsqlMajor
SELECT DISTINCT on multiple columns
Viewed 0 times
selectdistinctcolumnsmultiple
Problem
Supposing we have a table with four columns
Is it possible to select all distinct values within the data in the columns and return them as a single column or do I have to create a function to achieve this?
(a,b,c,d) of the same data type.Is it possible to select all distinct values within the data in the columns and return them as a single column or do I have to create a function to achieve this?
Solution
Update: Tested all 5 queries in SQLfiddle with 100K rows (and 2 separate cases, one with few (25) distinct values and another with lots (around 25K values).
A very simple query would be to use
Another would be to first
The other answers have provided with more options using array functions or the
Actually Jack's query results are a bit better than shown above (if we remove the
Finally, if - and only if - the distinct values of the 4 columns are relatively few, you can use the
SQLfiddle
To summarize, when the distinct values are few, the recursive query is the absolute winner while with lots of values, my 2nd one, Jack's (improved version below) and AndriyM's queries are the best performers.
Late additions, a variation on the 1st query which despite the extra distinct operations, performs much better than the original 1st and only slightly worse than the 2nd:
and Jack's improved:
A very simple query would be to use
UNION DISTINCT. I think it would be most efficient if there is a separate index on each of the four columns It would be efficient with a separate index on each of the four columns, if Postgres had implemented Loose Index Scan optimization, which it hasn't. So this query will not be efficient as it requires 4 scans of the table (and no index is used):-- Query 1. (334 ms, 368ms)
SELECT a AS abcd FROM tablename
UNION -- means UNION DISTINCT
SELECT b FROM tablename
UNION
SELECT c FROM tablename
UNION
SELECT d FROM tablename ;Another would be to first
UNION ALL and then use DISTINCT. This will also require 4 table scans (and no use of indexes). Not bad efficiency when the values are few, and with more values becomes the fastest in my (not extensive) test:-- Query 2. (87 ms, 117 ms)
SELECT DISTINCT a AS abcd
FROM
( SELECT a FROM tablename
UNION ALL
SELECT b FROM tablename
UNION ALL
SELECT c FROM tablename
UNION ALL
SELECT d FROM tablename
) AS x ;The other answers have provided with more options using array functions or the
LATERAL syntax. Jack's query (187 ms, 261 ms) has reasonable performance but AndriyM's query seems more efficient (125 ms, 155 ms). Both of them do one sequential scan of the table and do not use any index.Actually Jack's query results are a bit better than shown above (if we remove the
order by) and can be further improved by removing the 4 internal distinct and leaving only the external one.Finally, if - and only if - the distinct values of the 4 columns are relatively few, you can use the
WITH RECURSIVE hack/optimization described in the above Loose Index Scan page and use all 4 indexes, with remarkably fast result! Tested with the same 100K rows and approximately 25 distinct values spread across the 4 columns (runs in only 2 ms!) while with 25K distinct values it's the slowest with 368 ms:-- Query 3. (2 ms, 368ms)
WITH RECURSIVE
da AS (
SELECT min(a) AS n FROM observations
UNION ALL
SELECT (SELECT min(a) FROM observations
WHERE a > s.n)
FROM da AS s WHERE s.n IS NOT NULL ),
db AS (
SELECT min(b) AS n FROM observations
UNION ALL
SELECT (SELECT min(b) FROM observations
WHERE b > s.n)
FROM db AS s WHERE s.n IS NOT NULL ),
dc AS (
SELECT min(c) AS n FROM observations
UNION ALL
SELECT (SELECT min(c) FROM observations
WHERE c > s.n)
FROM dc AS s WHERE s.n IS NOT NULL ),
dd AS (
SELECT min(d) AS n FROM observations
UNION ALL
SELECT (SELECT min(d) FROM observations
WHERE d > s.n)
FROM db AS s WHERE s.n IS NOT NULL )
SELECT n
FROM
( TABLE da UNION
TABLE db UNION
TABLE dc UNION
TABLE dd
) AS x
WHERE n IS NOT NULL ;SQLfiddle
To summarize, when the distinct values are few, the recursive query is the absolute winner while with lots of values, my 2nd one, Jack's (improved version below) and AndriyM's queries are the best performers.
Late additions, a variation on the 1st query which despite the extra distinct operations, performs much better than the original 1st and only slightly worse than the 2nd:
-- Query 1b. (85 ms, 149 ms)
SELECT DISTINCT a AS n FROM observations
UNION
SELECT DISTINCT b FROM observations
UNION
SELECT DISTINCT c FROM observations
UNION
SELECT DISTINCT d FROM observations ;and Jack's improved:
-- Query 4b. (104 ms, 128 ms)
select distinct unnest( array_agg(a)||
array_agg(b)||
array_agg(c)||
array_agg(d) )
from t ;Code Snippets
-- Query 1. (334 ms, 368ms)
SELECT a AS abcd FROM tablename
UNION -- means UNION DISTINCT
SELECT b FROM tablename
UNION
SELECT c FROM tablename
UNION
SELECT d FROM tablename ;-- Query 2. (87 ms, 117 ms)
SELECT DISTINCT a AS abcd
FROM
( SELECT a FROM tablename
UNION ALL
SELECT b FROM tablename
UNION ALL
SELECT c FROM tablename
UNION ALL
SELECT d FROM tablename
) AS x ;-- Query 3. (2 ms, 368ms)
WITH RECURSIVE
da AS (
SELECT min(a) AS n FROM observations
UNION ALL
SELECT (SELECT min(a) FROM observations
WHERE a > s.n)
FROM da AS s WHERE s.n IS NOT NULL ),
db AS (
SELECT min(b) AS n FROM observations
UNION ALL
SELECT (SELECT min(b) FROM observations
WHERE b > s.n)
FROM db AS s WHERE s.n IS NOT NULL ),
dc AS (
SELECT min(c) AS n FROM observations
UNION ALL
SELECT (SELECT min(c) FROM observations
WHERE c > s.n)
FROM dc AS s WHERE s.n IS NOT NULL ),
dd AS (
SELECT min(d) AS n FROM observations
UNION ALL
SELECT (SELECT min(d) FROM observations
WHERE d > s.n)
FROM db AS s WHERE s.n IS NOT NULL )
SELECT n
FROM
( TABLE da UNION
TABLE db UNION
TABLE dc UNION
TABLE dd
) AS x
WHERE n IS NOT NULL ;-- Query 1b. (85 ms, 149 ms)
SELECT DISTINCT a AS n FROM observations
UNION
SELECT DISTINCT b FROM observations
UNION
SELECT DISTINCT c FROM observations
UNION
SELECT DISTINCT d FROM observations ;-- Query 4b. (104 ms, 128 ms)
select distinct unnest( array_agg(a)||
array_agg(b)||
array_agg(c)||
array_agg(d) )
from t ;Context
StackExchange Database Administrators Q#102677, answer score: 27
Revisions (0)
No revisions yet.