patternsqlMinor
Aggregate function to return any (first) encountered value
Viewed 0 times
returnanyfunctionencounteredvaluefirstaggregate
Problem
How do I tell PostgreSQL to return first encountered value instead of an aggregate column?
Query grouping by object_id and data2 will make hash aggregate, what I want to avoid.
I found
Moreover, there is no such function for any datatype. What I want to do is to get any of the values from data2 column so that engine don't have to iterate all rows.
What about data column which is integer?
Table "public.cache"
Column | Type | Modifiers | Storage | Stats target | Description
-----------+---------+-----------+---------+--------------+-------------
user_id | integer | | plain | |
object_id | integer | | plain | |
data | integer | | plain | |
data2 | boolean | | plain | |
Indexes:
"cache_object_id_user_id_key" UNIQUE CONSTRAINT, btree (object_id, user_id)
"cache_user_id_object_id_key" UNIQUE CONSTRAINT, btree (user_id, object_id)
Has OIDs: noQuery grouping by object_id and data2 will make hash aggregate, what I want to avoid.
SELECT object_id, data2 FROM cache GROUP BY object_id, data2;I found
bool_or() but it will scan all values in bad case.SELECT object_id, bool_or(data2) FROM cache GROUP BY object_id;Moreover, there is no such function for any datatype. What I want to do is to get any of the values from data2 column so that engine don't have to iterate all rows.
What about data column which is integer?
Solution
There are various ways to do this, with different performance, depending on the distribution of data (number of distinct
The easiest query to write - but not necessary the most efficient, is of course using an aggregate,
If you have an index on
Another way would be using
If, compared to the table size, there is a small number of
The same index would be needed. The
But you'd lose some efficiency, especially in older versions. In that case, you could replace this subquery with a complicated recursive query that traverses the
Read also these great answers by Erwin in the related questions:
-
How do I efficiently get “the most recent corresponding row”?.
-
Optimize GROUP BY query to retrieve latest record per user
-
Best performance in sampling repeated value from a grouped column
And last but not least, the main reason for the problems is this:
I have denormalized data and I want to avoid ... (I know that all values in that group are equal booleans).
Normalizing the table would lead to much more efficient queries.
object_id values, etc).The easiest query to write - but not necessary the most efficient, is of course using an aggregate,
MIN() or MAX():SELECT object_id, MIN(data2) AS data2
FROM cache
GROUP BY object_id ;If you have an index on
(object_id, data2) this will not be too bad in recent versions of Postgres that can use index-only-scan for the execution plan.Another way would be using
DISTINCT ON syntax. The same index as above would help:SELECT DISTINCT ON (object_id)
object_id, data2
FROM cache
ORDER BY object_id ;If, compared to the table size, there is a small number of
object_id values, a different approach would be much more efficient. Provided you also have another tables (say objects) that has object_id as its primary key:SELECT o.object_id, c.data2
FROM objects AS o
CROSS JOIN LATERAL
( SELECT data2
FROM cache AS c
WHERE c.object_id = o.object_id
ORDER BY c.data2
LIMIT 1
) AS c ;The same index would be needed. The
ORDER BY is not required but with the index in place, it won't hurt efficiency. If you don't have an objects table then that part would have to be replaced with:---
FROM ( SELECT DISTINCT object_id FROM cache) AS o
CROSS JOIN LATERAL
---But you'd lose some efficiency, especially in older versions. In that case, you could replace this subquery with a complicated recursive query that traverses the
object_id index efficiently. See the Posgres docs for more details: Loose Index Scan.Read also these great answers by Erwin in the related questions:
-
How do I efficiently get “the most recent corresponding row”?.
-
Optimize GROUP BY query to retrieve latest record per user
-
Best performance in sampling repeated value from a grouped column
And last but not least, the main reason for the problems is this:
I have denormalized data and I want to avoid ... (I know that all values in that group are equal booleans).
Normalizing the table would lead to much more efficient queries.
Code Snippets
SELECT object_id, MIN(data2) AS data2
FROM cache
GROUP BY object_id ;SELECT DISTINCT ON (object_id)
object_id, data2
FROM cache
ORDER BY object_id ;SELECT o.object_id, c.data2
FROM objects AS o
CROSS JOIN LATERAL
( SELECT data2
FROM cache AS c
WHERE c.object_id = o.object_id
ORDER BY c.data2
LIMIT 1
) AS c ;---
FROM ( SELECT DISTINCT object_id FROM cache) AS o
CROSS JOIN LATERAL
---Context
StackExchange Database Administrators Q#138218, answer score: 7
Revisions (0)
No revisions yet.