patternsqlModerate
GROUP BY one column, while sorting by another in PostgreSQL
Viewed 0 times
postgresqlsortingwhilegroupcolumnoneanother
Problem
How can I
I'm trying to do the following:
I want to select the last /n/ items from FileItems, in descending order, with the rows filtered by
Note this is basically the gestalt of the below question, with a lot of the extraneous details removed for clarity.
Original question
I have a system I'm migrating from sqlite3 to PostgreSQL, because I've largely outgrown sqlite:
Basically, I want to select the last n
Unfortunately, the above query, while it works fine in sqlite, errors out in PostgreSQL with the error
Unfortunately, while adding
GROUP BY one column, while sorting only by another.I'm trying to do the following:
SELECT dbId,retreivalTime
FROM FileItems
WHERE sourceSite='something'
GROUP BY seriesName
ORDER BY retreivalTime DESC
LIMIT 100
OFFSET 0;I want to select the last /n/ items from FileItems, in descending order, with the rows filtered by
DISTINCT values of seriesName. The above query errors out ERROR: column "fileitems.dbid" must appear in the GROUP BY clause or be used in an aggregate function. I need the dbid value in order to then take the output of this query, and JOIN it on the source table to get the rest of the columns I wasn.Note this is basically the gestalt of the below question, with a lot of the extraneous details removed for clarity.
Original question
I have a system I'm migrating from sqlite3 to PostgreSQL, because I've largely outgrown sqlite:
SELECT
d.dbId,
d.dlState,
d.sourceSite,
[snip a bunch of rows]
d.note
FROM FileItems AS d
JOIN
( SELECT dbId
FROM FileItems
WHERE sourceSite='{something}'
GROUP BY seriesName
ORDER BY MAX(retreivalTime) DESC
LIMIT 100
OFFSET 0
) AS di
ON di.dbId = d.dbId
ORDER BY d.retreivalTime DESC;Basically, I want to select the last n
DISTINCT items in the database, where the distinct constraint is on one column, and the sorting order is on a different column.Unfortunately, the above query, while it works fine in sqlite, errors out in PostgreSQL with the error
psycopg2.ProgrammingError: column "fileitems.dbid" must appear in the GROUP BY clause or be used in an aggregate function.Unfortunately, while adding
dbId to the GROUP BY clause fixes the issue (e.g. GROUP BY seriesName,dbId), it means the distinct filtering on the query results no longer work, Solution
Consistent rows
The important question which does not seem to be on your radar yet:
From each set of rows for the same
Your answer does the latter, you combine the maximum
To get consistent rows, use
Details for
Aside: should probably be
Better Performance with rCTE
Since we are dealing with a big table here, we'd need a query that can use an index, which is not the case for the above query (except for
On closer inspection, your problem seems to be a special case of a loose index scan. Postgres does not support loose index scans natively, but it can be emulated with a recursive CTE. There is a code example for the simple case in the Postgres Wiki.
See:
Your case is more complex, though. I think I found a variant to make it work for you. Building on this index (without
Or (with
The first index can be used for both queries, but is not fully efficient with the additional
I added
This query with a recursive CTE makes use of the index. I tested with Postgres 9.3 and it works for me: no sequential scan, all index-only scans:
You need to include
Explanation
-
The non-recursive query starts with the latest row.
-
The recursive query adds the next-latest row with a
-
Essential parts are the
-
Or a "fat" mv with whole rows to return. Faster, yet. Needs to be refreshed more often, obviously.
Details in the manual here and here.
The important question which does not seem to be on your radar yet:
From each set of rows for the same
seriesName, do you want the columns of one row, or just any values from multiple rows (which may or may not come from the same row)?Your answer does the latter, you combine the maximum
dbid with the maximum retreivaltime, which may come from a different row.To get consistent rows, use
DISTINCT ON and wrap it in a subquery to order the result differently:SELECT * FROM (
SELECT DISTINCT ON (seriesName)
dbid, seriesName, retreivaltime
FROM FileItems
WHERE sourceSite = 'mk'
ORDER BY seriesName, retreivaltime DESC NULLS LAST -- latest retreivaltime
) sub
ORDER BY retreivaltime DESC NULLS LAST
LIMIT 100;Details for
DISTINCT ON:- Select first row in each GROUP BY group?
Aside: should probably be
retrievalTime, or better yet: retrieval_time. Unquoted mixed case identifiers are a common source of confusion in Postgres.Better Performance with rCTE
Since we are dealing with a big table here, we'd need a query that can use an index, which is not the case for the above query (except for
WHERE sourceSite = 'mk')On closer inspection, your problem seems to be a special case of a loose index scan. Postgres does not support loose index scans natively, but it can be emulated with a recursive CTE. There is a code example for the simple case in the Postgres Wiki.
See:
- SELECT DISTINCT is slower than expected on my table in PostgreSQL
- Optimize GROUP BY query to retrieve latest record per user
Your case is more complex, though. I think I found a variant to make it work for you. Building on this index (without
WHERE sourceSite = 'mk')CREATE INDEX mi_special_full_idx ON MangaItems
(retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST, dbid)Or (with
WHERE sourceSite = 'mk')CREATE INDEX mi_special_granulated_idx ON MangaItems
(sourceSite, retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST, dbid)The first index can be used for both queries, but is not fully efficient with the additional
WHERE condition. The second index is of very limited use for the first query. Since you have both variants of the query, consider creating both indexes.I added
dbid at the end to allow index-only scans.This query with a recursive CTE makes use of the index. I tested with Postgres 9.3 and it works for me: no sequential scan, all index-only scans:
WITH RECURSIVE cte AS (
(
SELECT dbid, seriesName, retreivaltime, 1 AS rn, ARRAY[seriesName] AS arr
FROM MangaItems
WHERE sourceSite = 'mk'
ORDER BY retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST
LIMIT 1
)
UNION ALL
SELECT i.dbid, i.seriesName, i.retreivaltime, c.rn + 1, c.arr || i.seriesName
FROM cte c
, LATERAL (
SELECT dbid, seriesName, retreivaltime
FROM MangaItems
WHERE (retreivaltime, seriesName) ALL(c.arr)
ORDER BY retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST
LIMIT 1
) i
WHERE c.rn < 101
)
SELECT dbid
FROM cte
ORDER BY rn;You need to include
seriesName in ORDER BY, since retreivaltime is not unique. "Almost" unique is still not unique.Explanation
-
The non-recursive query starts with the latest row.
-
The recursive query adds the next-latest row with a
seriesName that's not in the list, yet etc., until we have 100 rows.-
Essential parts are the
JOIN condition (b.retreivaltime, b.seriesName)
-
Collecting seriesName in an array to rule out duplicates. The cost for b.seriesName <> ALL(c.foo_arr) grows progressively with the number of rows, but for just 100 rows it is still cheap.
-
Just returning dbid as clarified in the comments.
Alternative with partial indexes:
We have been dealing with similar problems before. Here is a highly optimized complete solution based on partial indexes and a looping function:
- Can spatial index help a "range - order by - limit" query
Probably the fastest way (except for a materialized view) if done right. But more complex.
Materialized View
Since you do not have a lot of write operations and they are not performance-critical as stated in the comments (should be in the question), save the top n pre-computed rows in a materialized view and refresh it after relevant changes to the underlying table. Base your performance-critical queries on the materialized view instead.
-
Could just be a "thin" mv of the latest 1000 dbid` or so. In the query, join to the original table. For instance, if content is sometimes updated, but the top n rows can remain unchanged.-
Or a "fat" mv with whole rows to return. Faster, yet. Needs to be refreshed more often, obviously.
Details in the manual here and here.
Code Snippets
SELECT * FROM (
SELECT DISTINCT ON (seriesName)
dbid, seriesName, retreivaltime
FROM FileItems
WHERE sourceSite = 'mk'
ORDER BY seriesName, retreivaltime DESC NULLS LAST -- latest retreivaltime
) sub
ORDER BY retreivaltime DESC NULLS LAST
LIMIT 100;CREATE INDEX mi_special_full_idx ON MangaItems
(retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST, dbid)CREATE INDEX mi_special_granulated_idx ON MangaItems
(sourceSite, retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST, dbid)WITH RECURSIVE cte AS (
(
SELECT dbid, seriesName, retreivaltime, 1 AS rn, ARRAY[seriesName] AS arr
FROM MangaItems
WHERE sourceSite = 'mk'
ORDER BY retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST
LIMIT 1
)
UNION ALL
SELECT i.dbid, i.seriesName, i.retreivaltime, c.rn + 1, c.arr || i.seriesName
FROM cte c
, LATERAL (
SELECT dbid, seriesName, retreivaltime
FROM MangaItems
WHERE (retreivaltime, seriesName) < (c.retreivaltime, c.seriesName)
AND sourceSite = 'mk' -- repeat condition!
AND seriesName <> ALL(c.arr)
ORDER BY retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST
LIMIT 1
) i
WHERE c.rn < 101
)
SELECT dbid
FROM cte
ORDER BY rn;Context
StackExchange Database Administrators Q#74773, answer score: 12
Revisions (0)
No revisions yet.