HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

VERY slow lateral join on relatively small database

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
joinslowlateraldatabasesmallveryrelatively

Problem

My database consists of apps and their reviews (schema below). I'm trying to answer the following question:

Given a series of dates from the earliest reviews.review_date to the latest reviews.review_date (incrementing by a day), for each date, D, which apps had the most reviews if the app's earliest review was on or later than D?

This is the query that I've come up with to try and answer that question:
select
review_windows.review_window_start,
id,
slug,
review_total,
earliest_review
from
(
select
date_trunc('day', review_windows.review_windows) :: date as review_window_start
from
generate_series(
(
SELECT
min(reviews.review_date)
FROM
reviews
),
(
SELECT
max(reviews.review_date)
FROM
reviews
),
'1 day'
) review_windows
order by
1 desc
) review_windows
left join lateral (
SELECT
apps.id,
apps.slug,
count(reviews.*) as review_total,
min(reviews.review_date) as earliest_review
FROM
reviews
INNER JOIN apps ON apps.id = reviews.app_id
where
reviews.review_date >= review_windows.review_window_start
group by
1,
2
having
min(reviews.review_date) >= review_windows.review_window_start
order by
3 desc,
4 desc
limit
2
) apps_most_reviews on true;



It is extremely slow and I'm not sure why. If I want any kinds of results I use week instead of day in the generate_series call and even then that might take a minute or even longer.

Where should I start when debugging a performance issue like this?

Visualized query plan here

There are ~5K rows in apps and ~400K rows in reviews so it's a mystery to me why this is taking so long.

Running the individual subquery that is run for each entry in the lateral join given a single date only takes 161 ms (below) and the subquery for `ge

Solution

The main reason for the slowness is that you aggregate over the big table from scratch for every iteration of the lateral sibquery. Compute earliest review & current total count per app in a CTE once and base the lateral subquery on it. I discussed that and some other optizations under your predating related question on SO:

  • Get apps with the highest review count since a dynamic series of days



One difference: In this question you also return an additional attribute of the app (slug), so we need to join to table apps after all. Literally: join to apps after aggregating reviews. That's cheaper.

Another difference: earliest_review as additioal tiebreaker. That's just a gradual improvement, though, as there can stiill be 0-n winners per review_window_start. Instead of picking the (arbitrary) top two ( limit 2), select the first one, with ties. (Wording already hints at the new technique in Postgres 13; see below.)
WITH cte AS MATERIALIZED (
SELECT a.id, a.slug, r.earliest_review, r.review_total
FROM apps a
JOIN (
SELECT app_id, min(review_date) AS earliest_review, count(*)::int AS review_total
FROM reviews
GROUP BY 1
) r ON r.app_id = a.id
)
SELECT *
FROM (
SELECT generate_series(min(review_date)
, max(review_date)
, '1 day')::date
FROM reviews
) d(review_window_start)
LEFT JOIN LATERAL (
SELECT app_id, slug, review_total, earliest_review
FROM (
SELECT app_id, slug, review_total, earliest_review
, rank() OVER (ORDER BY review_total DESC, earliest_review DESC) AS rnk
FROM cte c
WHERE c.earliest_review >= d.review_window_start
) sub
WHERE rnk = 1
) a ON true;


Like in the related answer on SO, it will be cheaper yet in Postgres 13 using WITH TIES. See:

  • Greater than or equal to ALL() and equal to MAX() speed



...
LEFT JOIN LATERAL (
SELECT app_id, slug, review_total, earliest_review
FROM cte c
WHERE earliest_review >= d.review_window_start
ORDER BY review_total DESC, earliest_review DESC
FETCH FIRST 1 ROWS WITH TIES -- new!
) a ON true;

Context

StackExchange Database Administrators Q#273452, answer score: 3

Revisions (0)

No revisions yet.