HiveBrain v1.2.0
pattern · sql · Moderate

Optimise a LATERAL JOIN query on a big table

Submitted by: @import:stackexchange-dba
Tags: query, join, lateral, optimise, bigtable

Problem

I'm using Postgres 9.5. I have a table that records page hits from several web sites. This table contains about 32 million rows spanning from Jan 1, 2016 to June 30, 2016.

CREATE TABLE event_pg (
   timestamp_        timestamp without time zone NOT NULL,
   person_id         character(24),
   location_host     varchar(256),
   location_path     varchar(256),
   location_query    varchar(256),
   location_fragment varchar(256)
);


I'm trying to tune a query that counts the number of people who performed a given sequence of page hits. The query is meant to answer questions like "how many people viewed the home page, then went to the help site, and then viewed the thank-you page?" The result looks like this:

╔═══════════╦═══════════╦═══════════╗
║ home-page ║ help site ║ thank-you ║
╠═══════════╬═══════════╬═══════════╣
║ 10000     ║ 9800      ║ 1500      ║
╚═══════════╩═══════════╩═══════════╝


Notice the numbers are decreasing, which makes sense: of the 10000 who viewed the home page, 9800 went on to the help site, and of those, 1500 went on to hit the thank-you page.

The SQL for a 3 step sequence uses lateral joins as follows:

```
SELECT sum(view_homepage) AS view_homepage
     , sum(use_help)      AS use_help
     , sum(thank_you)     AS thank_you
FROM  (
   -- Get the first time each user viewed the homepage.
   SELECT X.person_id
        , 1 AS view_homepage
        , min(timestamp_) AS view_homepage_time
   FROM   event_pg X
   WHERE  X.timestamp_ BETWEEN '2016-04-23 00:00:00.0' AND timestamp '2016-04-30 23:59:59.999'
   AND    X.location_host LIKE '2015.testonline.ca'
   GROUP  BY X.person_id
   ) e1
LEFT  JOIN LATERAL (
   SELECT Y.person_id
        , 1 AS use_help
        , timestamp_ AS use_help_time
   FROM   event_pg Y
   WHERE  Y.person_id = e1.person_id
   AND    location_host = 'helpcentre.testonline.ca'
   AND    timestamp_ BETWEEN view_homepage_time AND timestamp '2016-04-30 23:59:59.999'
   ORDER  BY timestamp_
   LIMIT  1
   ) e2 ON true
LEFT  JOIN LATERAL (
   -- Third step: first thank-you hit after the help-site visit, same pattern.
   SELECT 1 AS thank_you
   FROM   event_pg Z
   WHERE  Z.person_id = e1.person_id
   AND    Z.location_fragment = '/file/thank-you'
   AND    timestamp_ BETWEEN use_help_time AND timestamp '2016-04-30 23:59:59.999'
   ORDER  BY timestamp_
   LIMIT  1
   ) e3 ON true;
```

Solution

Preliminary notes

- You are using odd data types. character(24)? char(n) is an outdated type and almost always the wrong choice. You have indexes on person_id and join on it repeatedly; integer would be much more efficient for multiple reasons. (Or bigint, if you plan to burn more than 2 billion rows over the lifetime of the table.) Related:

  • Would index lookup be noticeably faster with char vs varchar when all values are 36 chars



- LIKE is pointless without wildcards. Use = instead; it's faster.

  x.location_host LIKE '2015.testonline.ca'  -- before
  x.location_host =    '2015.testonline.ca'  -- after

- Use count(e1.*) or count(*) instead of adding a dummy column with the value 1 for each subquery. (Except for the last one (e3), where you don't need any actual data.)

- You're inconsistent in sometimes casting the string literal to timestamp and sometimes not (timestamp '2016-04-30 23:59:59.999'). Either the cast makes sense, in which case do it every time, or it doesn't, in which case drop it.

  It doesn't: when compared to a timestamp column, a string literal is coerced to timestamp anyway, so the explicit cast is redundant.

- The Postgres data type timestamp has up to 6 fractional digits, so an upper bound like '... 23:59:59.999' misses rows in the last millisecond of the day. Such BETWEEN expressions leave corner cases; I replaced them with less error-prone expressions.
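Concretely, the half-open range used in the queries below covers the whole period regardless of fractional seconds:

```sql
-- Error-prone: silently excludes timestamps such as '2016-04-30 23:59:59.9995'
WHERE timestamp_ BETWEEN '2016-04-23 00:00:00.0' AND '2016-04-30 23:59:59.999'

-- Robust half-open range: lower bound inclusive, upper bound exclusive
WHERE timestamp_ >= '2016-04-23'
AND   timestamp_ <  '2016-05-01'
```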

Indexes

Important: to optimize performance create multicolumn indexes.

For the first subquery hp:

CREATE INDEX event_pg_location_host_timestamp__idx
ON event_pg (location_host, timestamp_);


Or, if you can get index-only scans out of it, append person_id to the index:

CREATE INDEX event_pg_location_host_timestamp__person_id_idx
ON event_pg (location_host, timestamp_, person_id);


For very large time ranges spanning most or all of the table, this index should be preferable - it also supports the hlp subquery, so create it either way:

CREATE INDEX event_pg_location_host_person_id_timestamp__idx
ON event_pg (location_host, person_id, timestamp_);


For tnk:

CREATE INDEX event_pg_location_fragment_person_id_timestamp__idx
ON event_pg (location_fragment, person_id, timestamp_);


Optimized with partial indexes

If your predicates on location_host and location_fragment are constants, we can use much cheaper partial indexes instead, especially since your location_* columns seem big:

CREATE INDEX event_pg_hp_person_id_ts_idx ON event_pg (person_id, timestamp_)
WHERE  location_host = '2015.testonline.ca';

CREATE INDEX event_pg_hlp_person_id_ts_idx ON event_pg (person_id, timestamp_)
WHERE  location_host = 'helpcentre.testonline.ca';

CREATE INDEX event_pg_tnk_person_id_ts_idx ON event_pg (person_id, timestamp_)
WHERE  location_fragment = '/file/thank-you';


Consider:

  • Multicolumn index and performance



  • Is a composite index also good for queries on the first field?



Again, all of these indexes are substantially smaller and faster with integer or bigint for person_id.
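A hypothetical migration sketch for switching person_id to integer, assuming ids can be mapped through a one-time lookup table (the names person, legacy_id, and person_id_int are illustrative, not from the original schema):

```sql
-- Build a surrogate-key lookup once (serial works on Postgres 9.5).
CREATE TABLE person (
   person_id serial PRIMARY KEY
 , legacy_id char(24) UNIQUE NOT NULL
);

INSERT INTO person (legacy_id)
SELECT DISTINCT person_id FROM event_pg;

-- Add the integer column and backfill it from the lookup table.
ALTER TABLE event_pg ADD COLUMN person_id_int integer;

UPDATE event_pg e
SET    person_id_int = p.person_id
FROM   person p
WHERE  p.legacy_id = e.person_id;
```

After backfilling and validating, the old column and its indexes can be dropped and the new column renamed.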

Generally, you need to ANALYZE the table after creating a new index - or wait till autovacuum kicks in to do it for you.

  • Index that is not used, yet influences query



  • PostgreSQL partial index unused when created on a table with existing data



To get index-only scans, your table has to be VACUUM'ed enough. Test immediately after VACUUM as proof of concept. Read the linked Postgres Wiki page for details if you are unfamiliar with index-only scans.
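As a proof-of-concept sequence (standard Postgres maintenance and diagnostic commands; whether the planner actually chooses an index-only scan depends on your data and settings):

```sql
VACUUM ANALYZE event_pg;   -- update the visibility map and planner statistics

EXPLAIN (ANALYZE, BUFFERS) -- look for "Index Only Scan" in the resulting plan
SELECT person_id, min(timestamp_)
FROM   event_pg
WHERE  location_host = '2015.testonline.ca'
GROUP  BY person_id;
```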

Basic Query

Implementing what I discussed. Query for small ranges (few rows per person_id):

SELECT count(*)::int          AS view_homepage
     , count(hlp.hlp_ts)::int AS use_help
     , count(tnk.yes)::int    AS thank_you
FROM  (
   SELECT DISTINCT ON (person_id) person_id, timestamp_ AS hp_ts
   FROM   event_pg
   WHERE  location_host = '2015.testonline.ca'
   AND    timestamp_ >= '2016-04-23' AND timestamp_ < '2016-05-01'
   ORDER  BY person_id, timestamp_
   ) hp
LEFT  JOIN LATERAL (
   SELECT timestamp_ AS hlp_ts
   FROM   event_pg
   WHERE  person_id = hp.person_id
   AND    location_host = 'helpcentre.testonline.ca'
   AND    timestamp_ >= hp.hp_ts AND timestamp_ < '2016-05-01'
   ORDER  BY timestamp_ LIMIT 1
   ) hlp ON true
LEFT  JOIN LATERAL (
   SELECT true AS yes
   FROM   event_pg
   WHERE  person_id = hp.person_id
   AND    location_fragment = '/file/thank-you'
   AND    timestamp_ >= hlp.hlp_ts AND timestamp_ < '2016-05-01'
   ORDER  BY timestamp_ LIMIT 1
   ) tnk ON true;


DISTINCT ON is often cheaper for few rows per person_id. Detailed explanation:

  • Select first row in each GROUP BY group?
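In isolation, DISTINCT ON keeps exactly one row per person_id, the first in the given sort order, which is what the hp subquery relies on:

```sql
-- One row per person_id: the earliest hit, because rows are sorted
-- by person_id first and timestamp_ second.
SELECT DISTINCT ON (person_id)
       person_id, timestamp_ AS first_hit
FROM   event_pg
ORDER  BY person_id, timestamp_;
```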



If you have many rows per person_id (more likely for bigger time ranges), the recursive CTE discussed in this answer in chapter 1a can be (much) faster:

  • Optimize GROUP BY query to retrieve latest record per user



See it integrated below.

Optimize & automate best query

It's the old conundrum: one query technique is best for a smaller set, another for a larger set. In your particular case we have a very good indicator from the start - the length of the given time period - which we can use to decide.

We wrap it all in a PL/pgSQL function. My implementation switches from DISTINCT ON to rCTE when the given time period is longer than a set threshold:

```
CREATE OR REPLACE FUNCTION f_my_counts(_ts_low_inc timestamp, _ts_hi_excl timestamp)
  RETURNS TABLE (view_homepage int, use_help int, thank_you int) AS
$func$
BEGIN
CASE
WHEN _ts_hi_excl - _ts_low_inc > interval '5 days' THEN  -- threshold: tune to your data
   RETURN QUERY
   ...  -- rCTE variant for long time ranges (see linked answer, chapter 1a)
ELSE
   RETURN QUERY
   ...  -- DISTINCT ON variant: the basic query above
END CASE;
END
$func$  LANGUAGE plpgsql;
```

Code Snippets

CREATE INDEX event_pg_location_host_timestamp__idx
ON event_pg (location_host, timestamp_);
CREATE INDEX event_pg_location_host_timestamp__person_id_idx
ON event_pg (location_host, timestamp_, person_id);
CREATE INDEX event_pg_location_host_person_id_timestamp__idx
ON event_pg (location_host, person_id, timestamp_);
CREATE INDEX event_pg_location_fragment_person_id_timestamp__idx
ON event_pg (location_fragment, person_id, timestamp_);
CREATE INDEX event_pg_hp_person_id_ts_idx ON event_pg (person_id, timestamp_)
WHERE  location_host = '2015.testonline.ca';

CREATE INDEX event_pg_hlp_person_id_ts_idx ON event_pg (person_id, timestamp_)
WHERE  location_host = 'helpcentre.testonline.ca';

CREATE INDEX event_pg_tnk_person_id_ts_idx ON event_pg (person_id, timestamp_)
WHERE  location_fragment = '/file/thank-you';

Context

StackExchange Database Administrators Q#143044, answer score: 14
