Pattern: sql (Moderate)
Optimise a LATERAL JOIN query on a big table
Tags: query, join, lateral, optimise, big-table
Problem
I'm using Postgres 9.5. I have a table that records page hits from several web sites. This table contains about 32 million rows spanning from Jan 1, 2016 to June 30, 2016.

```
CREATE TABLE event_pg (
    timestamp_          timestamp without time zone NOT NULL,
    person_id           character(24),
    location_host       varchar(256),
    location_path       varchar(256),
    location_query      varchar(256),
    location_fragment   varchar(256)
);
```

I'm trying to tune a query that counts the number of people who performed a given sequence of page hits. The query is meant to answer questions like "how many people viewed the home page, then went to the help site, and then viewed the thank-you page?" The result looks like this:

╔════════════╦════════════╦═════════════╗
║ home-page  ║ help site  ║  thank-you  ║
╠════════════╬════════════╬═════════════╣
║   10000    ║    9800    ║    1500     ║
╚════════════╩════════════╩═════════════╝

Notice the numbers are decreasing, which makes sense: of the 10000 people who viewed the home page, 9800 went on to the help site, and of those, 1500 went on to hit the thank-you page.
The SQL for a 3-step sequence uses lateral joins as follows:

```
SELECT sum(view_homepage) AS view_homepage
     , sum(use_help)      AS use_help
     , sum(thank_you)     AS thank_you
FROM  (
   -- Get the first time each user viewed the homepage.
   SELECT X.person_id
        , 1 AS view_homepage
        , min(timestamp_) AS view_homepage_time
   FROM   event_pg X
   WHERE  X.timestamp_ BETWEEN '2016-04-23 00:00:00.0' AND timestamp '2016-04-30 23:59:59.999'
   AND    X.location_host LIKE '2015.testonline.ca'
   GROUP  BY X.person_id
   ) e1
LEFT JOIN LATERAL (
   SELECT Y.person_id
        , 1 AS use_help
        , timestamp_ AS use_help_time
   FROM   event_pg Y
   WHERE  Y.person_id = e1.person_id
   AND    location_host = 'helpcentre.testonline.ca'
   AND    timestamp_ BETWEEN view_homepage_time AND timestamp '2016-04-30 23:59:59.999'
   ORDER  BY timestamp_
   LIMIT  1
   ) e2 ON true
LEFT JOIN LATERAL (
   -- Third step; truncated in the source, reconstructed here to mirror e2.
   SELECT 1 AS thank_you
   FROM   event_pg Z
   WHERE  Z.person_id = e1.person_id
   AND    location_fragment = '/file/thank-you'
   AND    timestamp_ BETWEEN use_help_time AND timestamp '2016-04-30 23:59:59.999'
   ORDER  BY timestamp_
   LIMIT  1
   ) e3 ON true;
```
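The funnel the query computes can be sketched in plain Python (a hypothetical in-memory model with made-up sample rows, not the actual table): for each person, find the first home-page hit in the window, then the first help-site hit after it, then the first thank-you hit after that.

```python
from datetime import datetime

# Hypothetical stand-in for event_pg: (person_id, timestamp, host, fragment)
events = [
    ("p1", datetime(2016, 4, 23, 9),  "2015.testonline.ca", None),
    ("p1", datetime(2016, 4, 23, 10), "helpcentre.testonline.ca", None),
    ("p1", datetime(2016, 4, 23, 11), "2015.testonline.ca", "/file/thank-you"),
    ("p2", datetime(2016, 4, 24, 9),  "2015.testonline.ca", None),
    ("p2", datetime(2016, 4, 24, 10), "helpcentre.testonline.ca", None),
    ("p3", datetime(2016, 4, 25, 9),  "2015.testonline.ca", None),
]

def first_hit(person, pred, not_before):
    """Earliest event for `person` matching `pred` at or after `not_before`."""
    times = [ts for pid, ts, host, frag in events
             if pid == person and pred(host, frag) and ts >= not_before]
    return min(times, default=None)

def funnel(lo, hi):
    view_homepage = use_help = thank_you = 0
    for person in {pid for pid, *_ in events}:
        hp = first_hit(person, lambda h, f: h == "2015.testonline.ca", lo)
        if hp is None or hp >= hi:
            continue
        view_homepage += 1                       # step 1: viewed home page
        hlp = first_hit(person, lambda h, f: h == "helpcentre.testonline.ca", hp)
        if hlp is None or hlp >= hi:
            continue
        use_help += 1                            # step 2: help site after step 1
        tnk = first_hit(person, lambda h, f: f == "/file/thank-you", hlp)
        if tnk is not None and tnk < hi:
            thank_you += 1                       # step 3: thank-you after step 2
    return view_homepage, use_help, thank_you

print(funnel(datetime(2016, 4, 23), datetime(2016, 5, 1)))  # (3, 2, 1)
```

Each step only searches forward from the previous step's timestamp, which is exactly what the chained LATERAL subqueries express in SQL.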
Solution
Preliminary notes
- You are using odd data types. character(24)? char(n) is an outdated type and almost always the wrong choice. You have indexes on person_id and join on it repeatedly; integer would be much more efficient for multiple reasons. (Or bigint, if you plan to burn more than 2 billion rows over the lifetime of the table.) Related:
  - Would index lookup be noticeably faster with char vs varchar when all values are 36 chars
- LIKE is pointless without wildcards. Use = instead. Faster.
  x.location_host LIKE '2015.testonline.ca' becomes x.location_host = '2015.testonline.ca'
- Use count(e1.*) or count(*) instead of adding a dummy column with the value 1 for each subquery. (Except for the last (e3), where you don't need any actual data.)
- You're inconsistent in sometimes casting the string literal to timestamp and sometimes not (timestamp '2016-04-30 23:59:59.999'). Either the cast makes sense, then do it all the time, or it doesn't, then don't do it. It doesn't: when compared to a timestamp column, a string literal is coerced to timestamp anyway, so you don't need an explicit cast.
- The Postgres data type timestamp has up to 6 fractional digits. Your BETWEEN expressions leave corner cases. I replaced them with less error-prone expressions.
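The corner case with BETWEEN and a '23:59:59.999' upper bound can be shown with plain Python datetimes (an illustration of the general point, not Postgres itself): a timestamp with more fractional digits slips past the closed upper bound, while a half-open range catches it.

```python
from datetime import datetime

# An event logged with microsecond precision, still on April 30:
ts = datetime(2016, 4, 30, 23, 59, 59, 999500)

lo = datetime(2016, 4, 23)
hi_closed = datetime(2016, 4, 30, 23, 59, 59, 999000)  # BETWEEN ... '23:59:59.999'
hi_open = datetime(2016, 5, 1)                          # half-open upper bound

# The BETWEEN-style check misses the event...
print(lo <= ts <= hi_closed)  # False
# ...while the half-open range  lo <= ts < hi  includes it.
print(lo <= ts < hi_open)     # True
```

This is why the rewritten queries use `timestamp_ >= lo AND timestamp_ < hi` instead of BETWEEN.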
Indexes

Important: to optimize performance, create multicolumn indexes.

For the first subquery (hp):

```
CREATE INDEX event_pg_location_host_timestamp__idx
ON event_pg (location_host, timestamp_);
```

Or, if you can get index-only scans out of it, append person_id to the index:

```
CREATE INDEX event_pg_location_host_timestamp__person_id_idx
ON event_pg (location_host, timestamp_, person_id);
```

For very large time ranges spanning most or all of the table, this index should be preferable - it also supports the hlp subquery, so create it either way:

```
CREATE INDEX event_pg_location_host_person_id_timestamp__idx
ON event_pg (location_host, person_id, timestamp_);
```

For tnk:

```
CREATE INDEX event_pg_location_fragment_timestamp__idx
ON event_pg (location_fragment, person_id, timestamp_);
```
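The column order above matters: the equality column (location_host) comes first so the range condition on timestamp_ scans one contiguous slice of the index. The same principle can be demonstrated with SQLite's EXPLAIN QUERY PLAN (a small sketch with made-up rows, not the Postgres planner):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE event_pg (
    timestamp_    TEXT NOT NULL,
    person_id     TEXT,
    location_host TEXT)""")
# Equality column first, range column second:
con.execute("""CREATE INDEX event_pg_location_host_timestamp__idx
               ON event_pg (location_host, timestamp_)""")
con.executemany("INSERT INTO event_pg VALUES (?, ?, ?)",
                [(f"2016-04-{d:02d} 12:00:00", f"p{d}", "2015.testonline.ca")
                 for d in range(1, 31)])

plan = con.execute("""EXPLAIN QUERY PLAN
    SELECT person_id FROM event_pg
    WHERE location_host = '2015.testonline.ca'
      AND timestamp_ >= '2016-04-23' AND timestamp_ < '2016-05-01'""").fetchall()
# The plan reports a search using the multicolumn index, not a full table scan.
print(plan[0][-1])
```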
Optimized with partial indexes

If your predicates on location_host and location_fragment are constants, we can use much cheaper partial indexes instead, especially since your location_* columns seem big:

```
CREATE INDEX event_pg_hp_person_id_ts_idx ON event_pg (person_id, timestamp_)
WHERE location_host = '2015.testonline.ca';
CREATE INDEX event_pg_hlp_person_id_ts_idx ON event_pg (person_id, timestamp_)
WHERE location_host = 'helpcentre.testonline.ca';
CREATE INDEX event_pg_tnk_person_id_ts_idx ON event_pg (person_id, timestamp_)
WHERE location_fragment = '/file/thank-you';
```

Consider:
- Multicolumn index and performance
- Is a composite index also good for queries on the first field?

Again, all of these indexes are substantially smaller and faster with integer or bigint for person_id.

Generally, you need to ANALYZE the table after creating a new index - or wait till autovacuum kicks in to do it for you. Related:
- Index that is not used, yet influences query
- PostgreSQL partial index unused when created on a table with existing data

To get index-only scans, your table has to be VACUUM'ed enough. Test immediately after VACUUM as proof of concept. Read the linked Postgres Wiki page for details if you are unfamiliar with index-only scans.
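A partial index only stores rows matching its WHERE clause, so it is much smaller, and the planner uses it when the query repeats that predicate. SQLite supports the same construct, so the effect can be sketched there (illustrative sample data, not the real table):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE event_pg (
    timestamp_ TEXT NOT NULL, person_id TEXT, location_host TEXT)""")
# Partial index: only home-page rows are indexed.
con.execute("""CREATE INDEX event_pg_hp_person_id_ts_idx
               ON event_pg (person_id, timestamp_)
               WHERE location_host = '2015.testonline.ca'""")
con.executemany("INSERT INTO event_pg VALUES (?, ?, ?)",
                [("2016-04-23 12:00:00", f"p{i}",
                  "2015.testonline.ca" if i % 2 else "helpcentre.testonline.ca")
                 for i in range(100)])

# The query repeats the index's predicate, so the partial index is applicable.
plan = con.execute("""EXPLAIN QUERY PLAN
    SELECT timestamp_ FROM event_pg
    WHERE location_host = '2015.testonline.ca'
      AND person_id = 'p1'""").fetchall()
print(plan[0][-1])
```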
Basic Query

Implementing what I discussed. Query for small ranges (few rows per person_id):

```
-- The middle of this query is garbled in the source; the hp / hlp / tnk
-- subquery bodies are reconstructed from the surviving lines and the
-- predicates used elsewhere in this answer.
SELECT count(*)::int          AS view_homepage
     , count(hlp.hlp_ts)::int AS use_help
     , count(tnk.yes)::int    AS thank_you
FROM  (
   SELECT DISTINCT ON (person_id)
          person_id, timestamp_ AS hp_ts
   FROM   event_pg
   WHERE  timestamp_ >= '2016-04-23'
   AND    timestamp_ <  '2016-05-01'
   AND    location_host = '2015.testonline.ca'
   ORDER  BY person_id, timestamp_
   ) hp
LEFT JOIN LATERAL (
   SELECT timestamp_ AS hlp_ts
   FROM   event_pg
   WHERE  person_id = hp.person_id
   AND    timestamp_ > hp.hp_ts
   AND    timestamp_ < '2016-05-01'
   AND    location_host = 'helpcentre.testonline.ca'
   ORDER  BY timestamp_
   LIMIT  1
   ) hlp ON true
LEFT JOIN LATERAL (
   SELECT true AS yes
   FROM   event_pg
   WHERE  person_id = hp.person_id
   AND    timestamp_ > hlp.hlp_ts  -- this introduces a dependency on hlp anyway
   AND    timestamp_ < '2016-05-01'
   AND    location_fragment = '/file/thank-you'
   ORDER  BY timestamp_
   LIMIT  1
   ) tnk ON true;
```

DISTINCT ON is often cheaper for few rows per person_id. Detailed explanation:
- Select first row in each GROUP BY group?

If you have many rows per person_id (more likely for bigger time ranges), the recursive CTE discussed in this answer in chapter 1a can be (much) faster:
- Optimize GROUP BY query to retrieve latest record per user

See it integrated below.
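What `DISTINCT ON (person_id) ... ORDER BY person_id, timestamp_` keeps is the first row per person. The equivalent in Python, using a sort plus itertools.groupby (a sketch on made-up pairs, not the actual data):

```python
from itertools import groupby
from operator import itemgetter

# rows: (person_id, timestamp) pairs, e.g. home-page hits in the window
rows = [
    ("p1", "2016-04-24 10:00:00"),
    ("p2", "2016-04-23 08:00:00"),
    ("p1", "2016-04-23 09:00:00"),
    ("p2", "2016-04-25 11:00:00"),
]

# DISTINCT ON (person_id) ... ORDER BY person_id, timestamp_ :
# sort by (person_id, timestamp), then keep the first row of each group.
rows.sort(key=itemgetter(0, 1))
first_hits = {person: next(group)[1]
              for person, group in groupby(rows, key=itemgetter(0))}
print(first_hits)  # {'p1': '2016-04-23 09:00:00', 'p2': '2016-04-23 08:00:00'}
```

The sort corresponds to the ORDER BY, and taking the head of each group corresponds to DISTINCT ON.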
Optimize & automate best query

It's the old conundrum: one query technique is best for a smaller set, another for a larger set. In your particular case we have a very good indicator from the start - the length of the given time period - which we can use to decide.

We wrap it all in a PL/pgSQL function. My implementation switches from DISTINCT ON to the recursive CTE when the given time period is longer than a set threshold:

```
CREATE OR REPLACE FUNCTION f_my_counts(_ts_low_inc timestamp, _ts_hi_excl timestamp)
  RETURNS TABLE (view_homepage int, use_help int, thank_you int) AS
$func$
BEGIN
   CASE
   WHEN _ts_hi_excl = _ts_low_inc  -- (the rest of the function is truncated in the source)
```
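The dispatch idea behind the function - pick the query technique up front based on the length of the requested period - can be sketched in Python (the helper names and the threshold value are illustrative assumptions, not from the answer; the real threshold must be found by testing):

```python
from datetime import datetime, timedelta

# Hypothetical threshold: beyond this range length, assume many rows per
# person_id and switch techniques.
THRESHOLD = timedelta(days=14)

def count_with_distinct_on(lo, hi):
    # Stand-in for the DISTINCT ON query (cheap for few rows per person).
    return f"DISTINCT ON plan for [{lo:%Y-%m-%d}, {hi:%Y-%m-%d})"

def count_with_rcte(lo, hi):
    # Stand-in for the recursive-CTE query (cheap for many rows per person).
    return f"recursive-CTE plan for [{lo:%Y-%m-%d}, {hi:%Y-%m-%d})"

def f_my_counts(ts_low_inc, ts_hi_excl):
    # Same decision the PL/pgSQL function makes: the period length is known
    # before running anything, so choose the cheaper technique immediately.
    if ts_hi_excl - ts_low_inc <= THRESHOLD:
        return count_with_distinct_on(ts_low_inc, ts_hi_excl)
    return count_with_rcte(ts_low_inc, ts_hi_excl)

print(f_my_counts(datetime(2016, 4, 23), datetime(2016, 5, 1)))  # short range
print(f_my_counts(datetime(2016, 1, 1), datetime(2016, 7, 1)))   # long range
```

Note the half-open parameter names (_ts_low_inc, _ts_hi_excl) mirror the inclusive-lower / exclusive-upper convention established earlier.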
Context
StackExchange Database Administrators Q#143044, answer score: 14