patternsqlMinor
Store millions of rows of denomalized data or some SQL magic?
Viewed 0 times
rowsmillionssqlmagicstoresomedenomalizeddata
Problem
My DBA experience doesn't go much further than simple storage + retrieval of CMS style data - so this may be a silly question, I don't know!
I have a problem whereby I need to lookup or calculate holiday prices for a certain group size and a certain number of days within a certain time period. E.g.:
How much is a hotel room for 2 people for 4 nights anytime in January?
I have pricing and availability data for, say, 5000 hotels stored like so:
With this table, I can do a query like so:
results
The
So far so good.
However, is there a way to find out all the 4 night periods in January where there is space available? We could repeat the query 27 times - incrementing the dates each time, which does seem a little bit awkward. Or another way around could be to store all possible combinations in a lookup table like so:
```
Hotel ID | total price pp | num_people | num_nights | start_date
----------------------------------------------------------------
123 | 400 | 2 | 4 | Jan1
123 | 400 | 2 | 4 | Jan2
123 | 400
I have a problem whereby I need to lookup or calculate holiday prices for a certain group size and a certain number of days within a certain time period. E.g.:
How much is a hotel room for 2 people for 4 nights anytime in January?
I have pricing and availability data for, say, 5000 hotels stored like so:
Hotel ID | Date | Spaces | Price PP
-----------------------------------
123 | Jan1 | 5 | 100
123 | Jan2 | 7 | 100
123 | Jan3 | 5 | 100
123 | Jan4 | 3 | 100
123 | Jan5 | 5 | 100
123 | Jan6 | 7 | 110
456 | Jan1 | 5 | 120
456 | Jan2 | 1 | 120
456 | Jan3 | 4 | 130
456 | Jan4 | 3 | 110
456 | Jan5 | 5 | 100
456 | Jan6 | 7 | 90With this table, I can do a query like so:
SELECT hotel_id, sum(price_pp)
FROM hotel_data
WHERE
date >= Jan1 and date = 2
GROUP BY hotel_id
HAVING count(*) = 4;results
hotel_id | sum
----------------
123 | 400The
HAVING clause here makes sure that there is an entry for every single day between my desired dates that has the spaces available. ie. Hotel 456 had 1 space available on Jan2, the HAVING clause would return 3, so we don't get a result for hotel 456.So far so good.
However, is there a way to find out all the 4 night periods in January where there is space available? We could repeat the query 27 times - incrementing the dates each time, which does seem a little bit awkward. Or another way around could be to store all possible combinations in a lookup table like so:
```
Hotel ID | total price pp | num_people | num_nights | start_date
----------------------------------------------------------------
123 | 400 | 2 | 4 | Jan1
123 | 400 | 2 | 4 | Jan2
123 | 400
Solution
You can do much with window functions. Presenting two solutions: one with and one without materialized view.
Test case
Building on this table:
Days per
Multicolumn index for base table:
Note the reversed order as compared to the PK. You will probably need both indexes, for the following query, the 2nd index is essential. Detailed explanation:
Direct query without
Also see @ypercube's variant with
How?
-
In the subquery, only consider days within your time frame ("in January" means, the last day is included in the time frame).
-
The frame for the window functions spans the current row plus the next
-
I carefully drafted all window functions in the subquery to reuse the same window, using a single sort step.
-
The resulting price sum_price
Alternative
Test case
Building on this table:
CREATE TABLE hotel_data (
hotel_id int
, day date -- using "day", not "date"
, spaces int
, price int
, PRIMARY KEY (hotel_id, day) -- provides essential index automatically
);Days per
hotel_id must be unique (enforced by PK here), or the rest is invalid.Multicolumn index for base table:
CREATE INDEX mv_hotel_mult_idx ON mv_hotel (day, hotel_id);Note the reversed order as compared to the PK. You will probably need both indexes, for the following query, the 2nd index is essential. Detailed explanation:
- Is a composite index also good for queries on the first field?
- Working of indexes in PostgreSQL
Direct query without
MATERIALIZED VIEWSELECT hotel_id, day, sum_price
FROM (
SELECT hotel_id, day, price, spaces
, sum(price) OVER w * 2 AS sum_price
, min(spaces) OVER w AS min_spaces
, last_value(day) OVER w - day AS day_diff
, count(*) OVER w AS day_ct
FROM hotel_data
WHERE day BETWEEN '2014-01-01'::date AND '2014-01-31'::date
AND spaces >= 2
WINDOW w AS (PARTITION BY hotel_id ORDER BY day
ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING) -- adapt to nights - 1
) sub
WHERE day_ct = 4
AND day_diff = 3 -- make sure there is not gap
AND min_spaces >= 2
ORDER BY sum_price, hotel_id, day;
-- LIMIT 1 to get only 1 winner;Also see @ypercube's variant with
lag(), which can replace day_ct and day_diff with a single check.How?
-
In the subquery, only consider days within your time frame ("in January" means, the last day is included in the time frame).
-
The frame for the window functions spans the current row plus the next
num_nights - 1 (4 - 1 = 3) rows (days). Calculate the difference in days , the count of rows and the minimum of spaces to make sure the range is long enough, gapless and always has enough spaces.- Unfortunately, the frame clause of window functions does not accept dynamic values, so
ROWS BETWEEN CURRENT ROW AND 3 FOLLOWINGcannot be parameterized for a prepared statement.
-
I carefully drafted all window functions in the subquery to reuse the same window, using a single sort step.
-
The resulting price sum_price
is already multiplied by the number of spaces requested.
With MATERIALIZED VIEW
To avoid inspecting many rows without chance of success, save only the columns you need plus three redundant, calculated values from the base table. Be sure the MV is up to date. If you are not familiar with the concept, read the manual first.
CREATE MATERIALIZED VIEW mv_hotel AS
SELECT hotel_id, day
, first_value(day) OVER (w ORDER BY day) AS range_start
, price, spaces
,(count(*) OVER w)::int2 AS range_len
,(max(spaces) OVER w)::int2 AS max_spaces
FROM (
SELECT *
, day - row_number() OVER (PARTITION BY hotel_id ORDER BY day)::int AS grp
FROM hotel_data
) sub1
WINDOW w AS (PARTITION BY hotel_id, grp);
-
range_start stores the first day of each continuous range for two purposes:
- to mark a set of rows as members of a common range
- to show the start of the range for possible other purposes.
-
range_len is the number of days in the gapless range.
max_spaces is the maximum of open spaces in the range.
- Both columns are used to exclude impossible rows from the query immediately.
-
I cast both to smallint ( max. 32768 should be plenty for both) to optimize storage: only 52 bytes per row (incl. heap tuple header and item identifier). Details:
- Configuring PostgreSQL for read performance
Multicolumn index for MV:
CREATE INDEX mv_hotel_mult_idx ON mv_hotel (range_len, max_spaces, day);
Query based on MV
SELECT hotel_id, day, sum_price
FROM (
SELECT hotel_id, day, price, spaces
, sum(price) OVER w * 2 AS sum_price
, min(spaces) OVER w AS min_spaces
, count(*) OVER w AS day_ct
FROM mv_hotel
WHERE day BETWEEN '2014-01-01'::date AND '2014-01-31'::date
AND range_len >= 4 -- exclude impossible rows
AND max_spaces >= 2 -- exclude impossible rows
WINDOW w AS (PARTITION BY hotel_id, range_start ORDER BY day
ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING) -- adapt to $nights - 1
) sub
WHERE day_ct = 4
AND min_spaces >= 2
ORDER BY sum_price, hotel_id, day;
-- LIMIT 1 to get only 1 winner;
This is faster than the query on the table because more rows can be eliminated immediately. Again, the index is essential. Since partitions are gapless here, checking day_ct is enough.
SQL Fiddle demonstrating both.
Repeated use
If you use it a lot, I would create an SQL function and only pass parameters. Or a PL/pgSQL function with dynamic SQL and EXECUTE` to allow adapting the frame clause.Alternative
Code Snippets
CREATE TABLE hotel_data (
hotel_id int
, day date -- using "day", not "date"
, spaces int
, price int
, PRIMARY KEY (hotel_id, day) -- provides essential index automatically
);CREATE INDEX mv_hotel_mult_idx ON mv_hotel (day, hotel_id);SELECT hotel_id, day, sum_price
FROM (
SELECT hotel_id, day, price, spaces
, sum(price) OVER w * 2 AS sum_price
, min(spaces) OVER w AS min_spaces
, last_value(day) OVER w - day AS day_diff
, count(*) OVER w AS day_ct
FROM hotel_data
WHERE day BETWEEN '2014-01-01'::date AND '2014-01-31'::date
AND spaces >= 2
WINDOW w AS (PARTITION BY hotel_id ORDER BY day
ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING) -- adapt to nights - 1
) sub
WHERE day_ct = 4
AND day_diff = 3 -- make sure there is not gap
AND min_spaces >= 2
ORDER BY sum_price, hotel_id, day;
-- LIMIT 1 to get only 1 winner;CREATE MATERIALIZED VIEW mv_hotel AS
SELECT hotel_id, day
, first_value(day) OVER (w ORDER BY day) AS range_start
, price, spaces
,(count(*) OVER w)::int2 AS range_len
,(max(spaces) OVER w)::int2 AS max_spaces
FROM (
SELECT *
, day - row_number() OVER (PARTITION BY hotel_id ORDER BY day)::int AS grp
FROM hotel_data
) sub1
WINDOW w AS (PARTITION BY hotel_id, grp);CREATE INDEX mv_hotel_mult_idx ON mv_hotel (range_len, max_spaces, day);Context
StackExchange Database Administrators Q#68405, answer score: 6
Revisions (0)
No revisions yet.