snippetsqlMinor
How to avoid a subquery in FILTER clause?
Viewed 0 times
avoidsubqueryfilterhowclause
Problem
Schema:
Notes:
Value of
Query:
Goal:
My goal is to see total number of applications vs number of first applications per day in particular country.
By first application I mean a number of applications on particular day, which were first time registered and had no applications before.
Problem:
Query performance. The number of rows is growing and the performance now is not on the a good level.
Data sample: Here (
CREATE TABLE "applications" (
"id" SERIAL NOT NULL PRIMARY KEY,
"country" VARCHAR(2) NOT NULL,
"created" TIMESTAMP WITH TIME ZONE NOT NULL,
"is_preliminary" BOOLEAN NOT NULL,
"first_name" VARCHAR(128) NOT NULL,
"last_name" VARCHAR(128) NOT NULL,
"birth_number" VARCHAR(11) NULL
);
CREATE TABLE "persons" (
"id" UUID NOT NULL PRIMARY KEY,
"created" TIMESTAMP WITH TIME ZONE NOT NULL,
"modified" TIMESTAMP WITH TIME ZONE NOT NULL
);
ALTER TABLE "applications" ADD COLUMN "physical_person_id" UUID NULL;
CREATE INDEX "physical_person_id_idx" ON "applications" ("physical_person_id");
ALTER TABLE "applications" ADD CONSTRAINT "physical_person_id_fk" FOREIGN KEY ("physical_person_id") REFERENCES "persons" ("id") DEFERRABLE INITIALLY DEFERRED;
CREATE INDEX "country_created" ON "applications" (country, created);Notes:
Value of
persons.created should be the same as first application.created for this person, regardless the is_preliminary value.Query:
SELECT
to_char(created, 'YYYY-MM-DD') AS "Date",
COUNT(*) AS "Total",
COALESCE(
COUNT(*) FILTER(
WHERE applications.is_preliminary = false
AND NOT EXISTS(
SELECT 1
FROM applications A
WHERE A.physical_person_id = applications.physical_person_id
AND A.created = '2017-01-01'::TIMESTAMP AND created < '2017-07-01'::TIMESTAMP
AND country = 'CZ'
GROUP BY 1
ORDER BY 1Goal:
My goal is to see total number of applications vs number of first applications per day in particular country.
By first application I mean a number of applications on particular day, which were first time registered and had no applications before.
Problem:
Query performance. The number of rows is growing and the performance now is not on the a good level.
Data sample: Here (
xz compressed output of pg_dumpSolution
Several minor improvements:
-
-
The way you had it, you group and sort by the text representation of your
The simplest way to achieve this is a plain cast to date:
-
Do not use
-
I also shortened the syntax by trimming noise double quotes (all identifiers are legal anyway) and using a strategic table alias (
Indices
Since you are concerned with optimizing read performance ...
Your multicolumn index
But you can easily improve the
To allow index-only scans (only if your write patterns allow it!):
The appended columns
I get two index-only scans after adding the last index, which is massively faster for big tables.
More on index-only scans:
Alternative solution
Your last comment opens up new options:
when the application is created for the first time a new person is also created with the same created value.
(The earlier statement in the question was too ambiguous to work with it.)
If this is enforced reliably (and
For perfect read performance with two index-only scans you'd have the index
SELECT created::date AS the_date
, COUNT(*) AS total
, COUNT(*) FILTER( WHERE is_preliminary = false
AND NOT EXISTS (
SELECT 1
FROM applications
WHERE physical_person_id = a.physical_person_id
AND created = '2017-01-01'::timestamptz
AND created < '2017-07-01'::timestamptz
AND country = 'CZ'
GROUP BY created::date
ORDER BY created::date;-
COALESCE( count(...), 0) is always redundant noise since count() never returns NULL to begin with. Just remove it. Related:- Query optimization or missing indexes?
-
The way you had it, you group and sort by the text representation of your
timestamptz column created, which happens to work just fine. But it's more expensive than grouping and sorting by an actual date (4-byte integer value internally). Sorting by an actual date or timestamp is also typically more reliable, though it does not make any difference in this particular query.The simplest way to achieve this is a plain cast to date:
created::date. You can still format the output if you desire: to_char(created::date, 'YYYY-MM-DD') AS date. Same result, but since we GROUP BY created::date, you have to repeat the grouped expression.-
Do not use
BETWEEN like has been advised. Your filter with >= and = and
-
You are aware that the date derived from a timestamptz value (as well as the cast to timestamptz without specifying a time zone) always depend on the current time zone setting, right? You can place the query in a selected time zone explicitly if you want to eliminate this sneaky error source. Basics:
- Ignoring timezones altogether in Rails and PostgreSQL
- Disadvantages of using date_trunc
-
There may be logic error(s) in the computation of is_first_app. This is just speculation on my side, though:
You are checking whether any row in applications for the same person predates the current one. But while you only allow is_preliminary = false for the current row, you do not enforce the same predicate for the rows to compare to. Typically you would want to compare to rows that are also is_preliminary = false. I added a commented line in the query above.
Also, since you form groups per day, do you really want to count rows that have a previous entry on the same day, too? Maybe so, but maybe you really want to check for rows predating the day with created -
I also shortened the syntax by trimming noise double quotes (all identifiers are legal anyway) and using a strategic table alias (
applications a) in the outer SELECT.Indices
Since you are concerned with optimizing read performance ...
Your multicolumn index
country_created seems ideal for the outer SELECT. But read on ...But you can easily improve the
EXISTS subquery with another multicolumn index:CREATE INDEX app_person_created_idx ON applications (physical_person_id, created);To allow index-only scans (only if your write patterns allow it!):
CREATE INDEX app_country_created_person_preliminary_idx
ON applications (country, created, physical_person_id, is_preliminary);The appended columns
physical_person_id and is_preliminary only make sense if you get index-only scans out of it.I get two index-only scans after adding the last index, which is massively faster for big tables.
More on index-only scans:
- Slow index scans in large table
Alternative solution
Your last comment opens up new options:
when the application is created for the first time a new person is also created with the same created value.
(The earlier statement in the question was too ambiguous to work with it.)
If this is enforced reliably (and
created is never updated in either table), there is a simpler, faster query which also happens to "avoid a subquery in the FILTER clause" - by using a LEFT [OUTER] JOIN instead:SELECT a.created::date AS date
, COUNT(*) AS total
, COUNT(p.id) AS is_first_app -- count only counts non-null values
FROM applications a
LEFT JOIN persons p ON a.is_preliminary = false
AND p.id = a.physical_person_id -- FK enforces max. 1 match
AND p.created = a.created
WHERE a.created >= '2017-01-01'::timestamptz
AND a.created < '2017-07-01'::timestamptz
AND a.country = 'CZ'
GROUP BY a.created::date
ORDER BY a.created::date;For perfect read performance with two index-only scans you'd have the index
app_country_created_person_preliminary_idx from above. Plus this one on persons:CREATE INDEX pers_id_created ON persons (id, created);Code Snippets
SELECT created::date AS the_date
, COUNT(*) AS total
, COUNT(*) FILTER( WHERE is_preliminary = false
AND NOT EXISTS (
SELECT 1
FROM applications
WHERE physical_person_id = a.physical_person_id
AND created < a.created
-- AND created < a.created::date -- alternative? see below
-- AND is_preliminary = false -- omission? see below
-- AND country = 'CZ' -- not sure. see below
LIMIT 1
)
) AS is_first_app
FROM applications a
WHERE created >= '2017-01-01'::timestamptz
AND created < '2017-07-01'::timestamptz
AND country = 'CZ'
GROUP BY created::date
ORDER BY created::date;WHERE created >= '2017-01-01'::timestamptz
AND created < '2017-07-01'::timestamptzCREATE INDEX app_person_created_idx ON applications (physical_person_id, created);CREATE INDEX app_country_created_person_preliminary_idx
ON applications (country, created, physical_person_id, is_preliminary);SELECT a.created::date AS date
, COUNT(*) AS total
, COUNT(p.id) AS is_first_app -- count only counts non-null values
FROM applications a
LEFT JOIN persons p ON a.is_preliminary = false
AND p.id = a.physical_person_id -- FK enforces max. 1 match
AND p.created = a.created
WHERE a.created >= '2017-01-01'::timestamptz
AND a.created < '2017-07-01'::timestamptz
AND a.country = 'CZ'
GROUP BY a.created::date
ORDER BY a.created::date;Context
StackExchange Database Administrators Q#179629, answer score: 6
Revisions (0)
No revisions yet.