snippetsqlCritical
How do I efficiently get "the most recent corresponding row"?
Viewed 0 times
efficientlytherecentgethowrowcorrespondingmost
Problem
I have a query pattern that must be very common, but I don't know how to write an efficient query for it. I want to look up the rows of a table that correspond to "the most recent date not after" the rows of another table.
I have a table,
and a table, "price" say, which holds the price of a good on a given day
How can I efficiently get the "most recent" price for each row of the inventory table, i.e.
I know one way of doing this:
and then join this query again to inventory. For large tables even doing the first query (without joining again to inventory) is very slow. However, the same problem is quickly solved if I simply use my programming language to issue one
Is there a standard way to do this efficiently? It feels like it
I have a table,
inventory say, which represents the inventory I hold on a certain day.date | good | quantity
------------------------------
2013-08-09 | egg | 5
2013-08-09 | pear | 7
2013-08-02 | egg | 1
2013-08-02 | pear | 2and a table, "price" say, which holds the price of a good on a given day
date | good | price
--------------------------
2013-08-07 | egg | 120
2013-08-06 | pear | 200
2013-08-01 | egg | 110
2013-07-30 | pear | 220How can I efficiently get the "most recent" price for each row of the inventory table, i.e.
date | pricing date | good | quantity | price
----------------------------------------------------
2013-08-09 | 2013-08-07 | egg | 5 | 120
2013-08-09 | 2013-08-06 | pear | 7 | 200
2013-08-02 | 2013-08-01 | egg | 1 | 110
2013-08-02 | 2013-07-30 | pear | 2 | 220I know one way of doing this:
select inventory.date, max(price.date) as pricing_date, good
from inventory, price
where inventory.date >= price.date
and inventory.good = price.good
group by inventory.date, goodand then join this query again to inventory. For large tables even doing the first query (without joining again to inventory) is very slow. However, the same problem is quickly solved if I simply use my programming language to issue one
max(price.date) ... where price.date <= date_of_interest ... order by price.date desc limit 1 query for each date_of_interest from the inventory table, so I know there is no computational impediment. I would, however, prefer to solve the whole problem with a single SQL query, because it would allow me to do further SQL processing on the result of the query.Is there a standard way to do this efficiently? It feels like it
Solution
It very much depends on circumstances and exact requirements. Consider my comment.
Simple solution
With
Returned rows are ordered. See:
Or with
Same result, but with arbitrary sort order - unless you add
Depending on data distribution, exact requirements and indices, either one of these may be faster. See:
With only few rows per good,
Solutions with subqueries to compute max / min values are typically slower. Variants with CTEs are generally slower, yet. (CTEs improved with Postgres 12.)
Plain views (like proposed by another answer) do not help performance at all in Postgres.
db<>fiddle here
Old sqlfiddle
Proper solution
Strings and collation
First of all, your table layout is a sub-optimal. It may seem trivial, but normalizing your schema can go a long way.
Sorting by character types (
This makes sorting and index look-ups slower. The longer your strings (names of goods) the worse. If you do not actually care for collation rules in your output (or the sort order), this can be faster with
Note the added collation in two places.
Twice as fast in my test with 20k rows each and very basic names ('good123').
Index
If your query is supposed to use an index, columns with character data have to use a matching collation (
Read the last two chapters of the related answer I linked above.
You can even have multiple indexes with different collations on the same columns - if you also need goods sorted according to another (or the default) collation in other queries.
Normalize
Redundant strings (name of good) bloat tables and indexes, which makes everything slower. A proper table layout can avoid most of the problem. Could look like this:
The primary keys automatically provide (almost) all indices we need.
Depending on missing details, a multicolumn index on
Again, the collation must match your query (see above).
Since Postgres 9.2 "covering indices" for index-only scans can help some more - especially if tables hold additional columns, making the table substantially bigger than the index.
These resulting queries are much faster:
db<>fiddle here
OLD sqliddle
Faster solutions
If that still is not fast enough, there may be faster solutions.
Recursive CTE /
Especially for data distributions with many prices per good:
Materialized view
If you need to run this often and fast, I suggest you create a materialized view. I think it is safe to assume, that prices and inventor
Simple solution
With
DISTINCT ON in Postgres:SELECT DISTINCT ON (i.good, i.the_date)
i.the_date, p.the_date AS pricing_date, i.good, p.price
FROM inventory i
LEFT JOIN price p ON i.good = p.good AND i.the_date >= p.the_date
ORDER BY i.good, i.the_date, p.the_date DESC;Returned rows are ordered. See:
- Select first row in each GROUP BY group?
Or with
NOT EXISTS in standard SQL (works with every RDBMS I know):SELECT i.the_date, p.the_date AS pricing_date, i.good, i.quantity, p.price
FROM inventory i
LEFT JOIN price p ON p.good = i.good AND p.the_date p.the_date
);Same result, but with arbitrary sort order - unless you add
ORDER BY.Depending on data distribution, exact requirements and indices, either one of these may be faster. See:
- How do I (or can I) SELECT DISTINCT on multiple columns?
With only few rows per good,
DISTINCT ON is typically faster and you get a sorted result on top of it. But for certain cases other query techniques are (much) faster, yet. See below.Solutions with subqueries to compute max / min values are typically slower. Variants with CTEs are generally slower, yet. (CTEs improved with Postgres 12.)
Plain views (like proposed by another answer) do not help performance at all in Postgres.
db<>fiddle here
Old sqlfiddle
Proper solution
Strings and collation
First of all, your table layout is a sub-optimal. It may seem trivial, but normalizing your schema can go a long way.
Sorting by character types (
text, varchar, ...) is done according to current COLLATION. Typically, your DB would use some local set of rules, like in my case: de_AT.UTF-8. Find out with:SHOW lc_collate;This makes sorting and index look-ups slower. The longer your strings (names of goods) the worse. If you do not actually care for collation rules in your output (or the sort order), this can be faster with
COLLATE "C":SELECT DISTINCT ON (i.good COLLATE "C", i.the_date)
i.the_date, p.the_date AS pricing_date, i.good, p.price
FROM inventory i
LEFT JOIN price p ON i.good = p.good AND i.the_date >= p.the_date
ORDER BY i.good COLLATE "C", i.the_date, p.the_date DESC;Note the added collation in two places.
Twice as fast in my test with 20k rows each and very basic names ('good123').
Index
If your query is supposed to use an index, columns with character data have to use a matching collation (
good in the example):CREATE INDEX inventory_good_date_desc_collate_c_idx
ON price(good COLLATE "C", the_date DESC);Read the last two chapters of the related answer I linked above.
You can even have multiple indexes with different collations on the same columns - if you also need goods sorted according to another (or the default) collation in other queries.
Normalize
Redundant strings (name of good) bloat tables and indexes, which makes everything slower. A proper table layout can avoid most of the problem. Could look like this:
CREATE TABLE good (
good_id serial PRIMARY KEY
, good text NOT NULL
);
CREATE TABLE inventory (
good_id int REFERENCES good (good_id)
, the_date date NOT NULL
, quantity int NOT NULL
, PRIMARY KEY(good_id, the_date)
);
CREATE TABLE price (
good_id int REFERENCES good (good_id)
, the_date date NOT NULL
, price numeric NOT NULL
, PRIMARY KEY(good_id, the_date));The primary keys automatically provide (almost) all indices we need.
Depending on missing details, a multicolumn index on
price with descending order on the second column may improve performance:CREATE INDEX price_good_date_desc_idx ON price(good, the_date DESC);Again, the collation must match your query (see above).
Since Postgres 9.2 "covering indices" for index-only scans can help some more - especially if tables hold additional columns, making the table substantially bigger than the index.
These resulting queries are much faster:
DISTINCT ONSELECT DISTINCT ON (i.the_date)
i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM inventory i
JOIN good g USING (good_id)
LEFT JOIN price p ON p.good_id = i.good_id AND p.the_date <= i.the_date
ORDER BY i.the_date, p.the_date DESC;NOT EXISTSSELECT i.the_date, p.the_date AS pricing_date, g.good, i.quantity, p.price
FROM inventory i
JOIN good g USING (good_id)
LEFT JOIN price p ON p.good_id = i.good_id AND p.the_date p.the_date
);db<>fiddle here
OLD sqliddle
Faster solutions
If that still is not fast enough, there may be faster solutions.
Recursive CTE /
JOIN LATERAL / correlated subqueryEspecially for data distributions with many prices per good:
- Optimize GROUP BY query to retrieve latest record per user
Materialized view
If you need to run this often and fast, I suggest you create a materialized view. I think it is safe to assume, that prices and inventor
Code Snippets
SELECT DISTINCT ON (i.good, i.the_date)
i.the_date, p.the_date AS pricing_date, i.good, p.price
FROM inventory i
LEFT JOIN price p ON i.good = p.good AND i.the_date >= p.the_date
ORDER BY i.good, i.the_date, p.the_date DESC;SELECT i.the_date, p.the_date AS pricing_date, i.good, i.quantity, p.price
FROM inventory i
LEFT JOIN price p ON p.good = i.good AND p.the_date <= i.the_date
WHERE NOT EXISTS (
SELECT FROM price p1
WHERE p1.good = p.good
AND p1.the_date <= i.the_date
AND p1.the_date > p.the_date
);SHOW lc_collate;SELECT DISTINCT ON (i.good COLLATE "C", i.the_date)
i.the_date, p.the_date AS pricing_date, i.good, p.price
FROM inventory i
LEFT JOIN price p ON i.good = p.good AND i.the_date >= p.the_date
ORDER BY i.good COLLATE "C", i.the_date, p.the_date DESC;CREATE INDEX inventory_good_date_desc_collate_c_idx
ON price(good COLLATE "C", the_date DESC);Context
StackExchange Database Administrators Q#49540, answer score: 67
Revisions (0)
No revisions yet.