patternsqlMinor
Debug query on big table that is sometimes slow
Viewed 0 times
sometimesdebugqueryslowbigthattable
Problem
I have a web API that is backed by a Postgres database, and the performance is generally very good. I monitor the performance of both the database and the application as a whole. Most of my queries (and API calls for that matter) complete in less than 100ms, however occasionally there's an outlier.
Just today I received an alert that an API call took longer than 5,000ms and was thusly terminated by a watchdog. From digging into the logs the underlying Postgres query took over 13 seconds to complete (everything is asynchronous, so even though the API request was terminated the SQL query continued).
This is very atypical, and even when I run the offending query manually I am unable to reproduce such atrocious timing. It runs to completion in 985ms (according to explain analyze) for me.
My Question
I am not sure what else to look at next to try and formulate theories on why this occurred. It doesn't happen that often, only once or twice a day out of thousands of similar events a day, but it does occur often enough that it's annoying. What am I missing? What next steps should I go down to debug this? I'm not from a DBA background, so this probably is a silly question.
Some quick background & what I've tried
This is all hosted on Amazon's RDS, running Postgres 9.4 on a m3.xlarge, Provisioned IOPS (2,000).
One of my tables, let's call it "details" is rather large, containing nearly 5 millions rows and growing at a rate of 25,000 records a day. This table is never updated or deleted, just inserts and selects, but represents the "core" of my application - most nearly anything of interest is read from this table.
In this specific case, I do know that this query has some parameters (e.g. the dates at the bottom and the ids) such that it's looking at a rather large dataset. I've already developed a much improved version of this query that takes this specific scenario down from 985ms down to 20. However, I am concerned that there's something else "at play" here that a
Just today I received an alert that an API call took longer than 5,000ms and was thusly terminated by a watchdog. From digging into the logs the underlying Postgres query took over 13 seconds to complete (everything is asynchronous, so even though the API request was terminated the SQL query continued).
This is very atypical, and even when I run the offending query manually I am unable to reproduce such atrocious timing. It runs to completion in 985ms (according to explain analyze) for me.
My Question
I am not sure what else to look at next to try and formulate theories on why this occurred. It doesn't happen that often, only once or twice a day out of thousands of similar events a day, but it does occur often enough that it's annoying. What am I missing? What next steps should I go down to debug this? I'm not from a DBA background, so this probably is a silly question.
Some quick background & what I've tried
This is all hosted on Amazon's RDS, running Postgres 9.4 on a m3.xlarge, Provisioned IOPS (2,000).
One of my tables, let's call it "details" is rather large, containing nearly 5 millions rows and growing at a rate of 25,000 records a day. This table is never updated or deleted, just inserts and selects, but represents the "core" of my application - most nearly anything of interest is read from this table.
In this specific case, I do know that this query has some parameters (e.g. the dates at the bottom and the ids) such that it's looking at a rather large dataset. I've already developed a much improved version of this query that takes this specific scenario down from 985ms down to 20. However, I am concerned that there's something else "at play" here that a
Solution
Table definition
After trimming the noise:
A low hanging fruit: The
Query
A couple of things for your query:
-
Use
-
Why join to
Then I would ask for the definition of
I've already developed a much improved version of this query that takes this specific scenario down from 985ms down to 20.
... I might be just wasting time. Rather have a separate question based on your improved version of the query - if you still need it.
Query plan gone wrong
Most of the total time is spent in one of the nested loop steps, where Postgres bases the plan on seriously underestimated row counts in both query plans:
-> Nested Loop (cost=5.19..23499.04 rows=33 width=8)
(actual time=1.964..929.479 rows=7166 loops=1)
-> Nested Loop (cost=5.19..23499.35 rows=33 width=8)
(actual time=0.275..277.738 rows=8413 loops=1)
That's the join between
If Postgres would know to expect that many rows, it would probably choose a merge join or hash join instead.
The number of actual rows has grown in your second query plan, while the estimate has not. This would indicate that you are not running
I would guess you add newer rows all the time, your
What's more,
Allow inserts, not only updates and deletes, to trigger vacuuming
activity in autovacuum (Laurenz Albe, Darafei Praliaskouski)
Previously, insert-only activity would trigger auto-analyze but not
auto-vacuum, on the grounds that there could not be any dead tuples to
remove. However, a vacuum scan has other useful side-effects such as
setting page-all-visible bits, which improves the efficiency of
index-only scans. Also, allowing an insert-only table to receive
periodic vacuuming helps to spread out the work of “freezing” old
tuples, so that there is not suddenly a large amount of freezing work
to do when the entire table reaches the anti-wraparound threshold all
at once.
Statistics for your big table
5M rows, 25K new rows per day. After 10 days,
Only 5 % new rows, my default
The manual:
Specifies a fraction of the table size to add to
only be set in the
but the setting can be overridden for individual tables by changing
table storage parameters.
Bold emphasis mine.
Read about the other
I suggest more aggressive settings - at least for your all-important table
It is possible to run
specific columns of a table, so the flexibility exists to update some
statistics more frequently than others if your application requires it.
Use Storage Parameters for your table. Example:
Related:
After trimming the noise:
CREATE TABLE public.details (
value numeric,
created_at timestamptz NOT NULL,
updated_at timestamptz NOT NULL,
effective_date timestamptz,
submission_date timestamptz,
id uuid NOT NULL,
form_id uuid,
order_id uuid,
invoice_id uuid,
customer_id uuid,
seq int8 NOT NULL DEFAULT nextval('details_seq_seq'::regclass),
client_day date,
CONSTRAINT details_pkey PRIMARY KEY (id),
CONSTRAINT details_id_key UNIQUE (id) -- total waste
);A low hanging fruit: The
UNIQUE constraint details_id_key adds nothing useful over the existing PK details_pkey. Drop it.Query
A couple of things for your query:
-
Use
json_agg(t) instead of array_to_json(array_agg(t))?-
Why join to
orders at all? Can you join invoices to details directly?JOIN invoices j2 ON j2.id = t1.invoice_idThen I would ask for the definition of
count_pages() (in your question) and for other details. But given your statement:I've already developed a much improved version of this query that takes this specific scenario down from 985ms down to 20.
... I might be just wasting time. Rather have a separate question based on your improved version of the query - if you still need it.
Query plan gone wrong
Most of the total time is spent in one of the nested loop steps, where Postgres bases the plan on seriously underestimated row counts in both query plans:
-> Nested Loop (cost=5.19..23499.04 rows=33 width=8)
(actual time=1.964..929.479 rows=7166 loops=1)
-> Nested Loop (cost=5.19..23499.35 rows=33 width=8)
(actual time=0.275..277.738 rows=8413 loops=1)
That's the join between
details and orders. (I am not even convinced you need orders in the query at all.) The low estimates also build up in every step that leads up to this nested loop.If Postgres would know to expect that many rows, it would probably choose a merge join or hash join instead.
The number of actual rows has grown in your second query plan, while the estimate has not. This would indicate that you are not running
ANALYZE enough or that you just had big INSERTs adding more data and that's not reflected in the table statistics, yet. These two predicates in your query are typical suspects for such a situation:AND t1.effective_date >= '2016-01-28T14:56:31.000Z'
AND t1.updated_at >= '2016-02-07T21:29:50.000Z'I would guess you add newer rows all the time, your
autovacuum isn't keeping up with the latest stats on the newest rows while your query focuses on just those rows. This can escalate over time or after particularly big INSERTs.What's more,
INSERT did not trigger automatic vacuuming at all (thereby hampering possible index-only scans) before Postgres 13, where this was changed. Quoting the release notes:Allow inserts, not only updates and deletes, to trigger vacuuming
activity in autovacuum (Laurenz Albe, Darafei Praliaskouski)
Previously, insert-only activity would trigger auto-analyze but not
auto-vacuum, on the grounds that there could not be any dead tuples to
remove. However, a vacuum scan has other useful side-effects such as
setting page-all-visible bits, which improves the efficiency of
index-only scans. Also, allowing an insert-only table to receive
periodic vacuuming helps to spread out the work of “freezing” old
tuples, so that there is not suddenly a large amount of freezing work
to do when the entire table reaches the anti-wraparound threshold all
at once.
Statistics for your big table
autovacuum is based on a percentage of rows inserted / updated / deleted. Default settings can be a bit of a trap for your special situation: a big table growing at a constant rate. Especially when mostly the latest additions are relevant for your query.5M rows, 25K new rows per day. After 10 days,
autovacuum "reckons":Only 5 % new rows, my default
autovacuum_analyze_scale_factor is 10 %. No need to run ANALYZEThe manual:
autovacuum_analyze_scale_factor (floating point)Specifies a fraction of the table size to add to
autovacuum_analyze_threshold when deciding whether to trigger anANALYZE. The default is 0.1 (10% of table size). This parameter canonly be set in the
postgresql.conf file or on the server command line;but the setting can be overridden for individual tables by changing
table storage parameters.
Bold emphasis mine.
Read about the other
autovacuum setting, too!I suggest more aggressive settings - at least for your all-important table
details. Per-table settings are possible:It is possible to run
ANALYZE on specific tables and even justspecific columns of a table, so the flexibility exists to update some
statistics more frequently than others if your application requires it.
Use Storage Parameters for your table. Example:
ALTER TABLE public.details SET (autovacuum_analyze_scale_factor = 0.001);Related:
- When is Postgres autovacuum execut
Code Snippets
CREATE TABLE public.details (
value numeric,
created_at timestamptz NOT NULL,
updated_at timestamptz NOT NULL,
effective_date timestamptz,
submission_date timestamptz,
id uuid NOT NULL,
form_id uuid,
order_id uuid,
invoice_id uuid,
customer_id uuid,
seq int8 NOT NULL DEFAULT nextval('details_seq_seq'::regclass),
client_day date,
CONSTRAINT details_pkey PRIMARY KEY (id),
CONSTRAINT details_id_key UNIQUE (id) -- total waste
);JOIN invoices j2 ON j2.id = t1.invoice_idAND t1.effective_date >= '2016-01-28T14:56:31.000Z'
AND t1.updated_at >= '2016-02-07T21:29:50.000Z'ALTER TABLE public.details SET (autovacuum_analyze_scale_factor = 0.001);CREATE INDEX details_order_id_timestamps ON public.details
(order_id, effective_date DESC NULLS LAST, updated_at DESC NULLS LAST);Context
StackExchange Database Administrators Q#129015, answer score: 9
Revisions (0)
No revisions yet.