HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Postgres poor performance because of bad query plan based on bad row estimates

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
estimatespostgresquerybadpoorbecauseplanperformancebasedrow

Problem

Context

I am using Postgres v13 hosted in GCP Cloud SQL. The database partitions most tables by a tenant id.

Structure

The query is more complex than this but I can create a minimal reproducible example from the essence of these three tables:
create table student (
id text not null primary key,
grade_id int not null
);

create table student_snapshot (
student_id text not null,
created_at timestamp not null,
grade_id int not null,
primary key (student_id, created_at)
);

create table test (
id text not null primary key,
student_id text not null,
student_snapshot_date timestamp not null,
foreign key (student_id, student_snapshot_date) references student_snapshot (student_id, created_at)
);


There are students and when they take a test we take a snapshot of the student to understand their demographics at the time the test was taken. There isn't an FK back to student because we want to keep the historic data when the student is deleted. The real example uses partitioned tables but this example uses normal ones for simplicity.

Query

I need to join the 3 tables together in a query like the following:
select t.*, st.grade_id as current_grade_id, snp.grade_id as snapshot_grade_id
from student st
join test t
on t.student_id = st.student_id
join student_snapshot snp
on snp.student_id = t.student_id
and snp.created_at = t.student_snapshot_date


Issue

The query was taking CREATE STATISTICS "test_st_snp_pk" ON student_id, student_snapshot_date FROM "test";
CREATE STATISTICS "st_snp_pk" ON student_id, created_at FROM "student_snapshot";


I tried increasing the stats collection on the relevant columns:
alter table "test" alter student_id set statistics 500;
alter table "test" alter student_snapshot_date set statistics 500;
alter table "student_snapshot" alter student_id set statistics 500;
alter table "student_snapshot" alter created_at set statistics 500;


I tried clustering on a parallel indexes for test and student_snapshot to see

Solution

Your diligent research made clear that the problem is a correlation between your two join conditions. Since PostgreSQL doesn't have cross-table correlation statistics (now that would be a game changer!), you'll probably have to force the hand of the optimizer:

WITH cte AS MATERIALIZED (
   select r.*, st.grade_id as current_grade_id, snp.grade_id as snapshot_grade_id,
          snp.created_at, r.student_snapshot_date
   from student st
   join rating r
     on r.student_id = st.student_id
   join student_snapshot snp
     on snp.student_id = r.student_id
)
SELECT *
FROM cte
WHERE created_at = student_snapshot_date;


You probably don't want to SELECT *, but I don't know the columns of rating, so you will have to fix the query accordingly. Also, if some column names occur in more than one table, you might have to use aliases where necessary.

Code Snippets

WITH cte AS MATERIALIZED (
   select r.*, st.grade_id as current_grade_id, snp.grade_id as snapshot_grade_id,
          snp.created_at, r.student_snapshot_date
   from student st
   join rating r
     on r.student_id = st.student_id
   join student_snapshot snp
     on snp.student_id = r.student_id
)
SELECT *
FROM cte
WHERE created_at = student_snapshot_date;

Context

StackExchange Database Administrators Q#317203, answer score: 2

Revisions (0)

No revisions yet.