HiveBrain v1.2.0
Get Started
← Back to all entries
snippetsqlMinor

How to avoid a subquery in FILTER clause?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
avoidsubqueryfilterhowclause

Problem

Schema:

CREATE TABLE "applications" (
  "id"             SERIAL                   NOT NULL PRIMARY KEY,
  "country"        VARCHAR(2)               NOT NULL,
  "created"        TIMESTAMP WITH TIME ZONE NOT NULL,
  "is_preliminary" BOOLEAN                  NOT NULL,
  "first_name"     VARCHAR(128)             NOT NULL,
  "last_name"      VARCHAR(128)             NOT NULL,
  "birth_number"   VARCHAR(11)              NULL
);

CREATE TABLE "persons" (
  "id"       UUID                     NOT NULL PRIMARY KEY,
  "created"  TIMESTAMP WITH TIME ZONE NOT NULL,
  "modified" TIMESTAMP WITH TIME ZONE NOT NULL
);

ALTER TABLE "applications" ADD COLUMN "physical_person_id" UUID NULL;
CREATE INDEX "physical_person_id_idx" ON "applications" ("physical_person_id");

ALTER TABLE "applications" ADD CONSTRAINT "physical_person_id_fk" FOREIGN KEY ("physical_person_id") REFERENCES "persons" ("id") DEFERRABLE INITIALLY DEFERRED;
CREATE INDEX "country_created" ON "applications" (country, created);


Notes:
Value of persons.created should be the same as first application.created for this person, regardless the is_preliminary value.

Query:

SELECT
  to_char(created, 'YYYY-MM-DD') AS "Date",
  COUNT(*) AS "Total",
  COALESCE(
    COUNT(*) FILTER(
      WHERE applications.is_preliminary = false
      AND NOT EXISTS(
        SELECT 1
        FROM applications A
        WHERE A.physical_person_id = applications.physical_person_id
          AND A.created = '2017-01-01'::TIMESTAMP AND created < '2017-07-01'::TIMESTAMP
  AND country = 'CZ'
GROUP BY 1
ORDER BY 1


Goal:
My goal is to see total number of applications vs number of first applications per day in particular country.
By first application I mean a number of applications on particular day, which were first time registered and had no applications before.

Problem:
Query performance. The number of rows is growing and the performance now is not on the a good level.

Data sample: Here (xz compressed output of pg_dump

Solution

Several minor improvements:

SELECT created::date AS the_date
     , COUNT(*) AS total
     , COUNT(*) FILTER( WHERE is_preliminary = false
                        AND   NOT EXISTS (
                           SELECT 1
                           FROM   applications
                           WHERE  physical_person_id = a.physical_person_id
                           AND    created = '2017-01-01'::timestamptz
AND    created <  '2017-07-01'::timestamptz
AND    country = 'CZ'
GROUP  BY created::date
ORDER  BY created::date;


-
COALESCE( count(...), 0) is always redundant noise since count() never returns NULL to begin with. Just remove it. Related:

  • Query optimization or missing indexes?



-
The way you had it, you group and sort by the text representation of your timestamptz column created, which happens to work just fine. But it's more expensive than grouping and sorting by an actual date (4-byte integer value internally). Sorting by an actual date or timestamp is also typically more reliable, though it does not make any difference in this particular query.
The simplest way to achieve this is a plain cast to date: created::date. You can still format the output if you desire: to_char(created::date, 'YYYY-MM-DD') AS date. Same result, but since we GROUP BY created::date, you have to repeat the grouped expression.

-
Do not use BETWEEN like has been advised. Your filter with >= and = and

-
You are aware that the date derived from a
timestamptz value (as well as the cast to timestamptz without specifying a time zone) always depend on the current time zone setting, right? You can place the query in a selected time zone explicitly if you want to eliminate this sneaky error source. Basics:

  • Ignoring timezones altogether in Rails and PostgreSQL



  • Disadvantages of using date_trunc



-
There may be logic error(s) in the computation of
is_first_app. This is just speculation on my side, though:
You are checking whether any row in
applications for the same person predates the current one. But while you only allow is_preliminary = false for the current row, you do not enforce the same predicate for the rows to compare to. Typically you would want to compare to rows that are also is_preliminary = false. I added a commented line in the query above.

Also, since you form groups per day, do you really want to count rows that have a previous entry on the same day, too? Maybe so, but maybe you really want to check for rows predating the day with
created

-
I also shortened the syntax by trimming noise double quotes (all identifiers are legal anyway) and using a strategic table alias (applications a) in the outer SELECT.

Indices

Since you are concerned with optimizing read performance ...

Your multicolumn index country_created seems ideal for the outer SELECT. But read on ...

But you can easily improve the EXISTS subquery with another multicolumn index:

CREATE INDEX app_person_created_idx ON applications (physical_person_id, created);


To allow index-only scans (only if your write patterns allow it!):

CREATE INDEX app_country_created_person_preliminary_idx
ON applications (country, created, physical_person_id, is_preliminary);


The appended columns physical_person_id and is_preliminary only make sense if you get index-only scans out of it.

I get two index-only scans after adding the last index, which is massively faster for big tables.

More on index-only scans:

  • Slow index scans in large table



Alternative solution

Your last comment opens up new options:


when the application is created for the first time a new person is also created with the same created value.

(The earlier statement in the question was too ambiguous to work with it.)

If this is enforced reliably (and created is never updated in either table), there is a simpler, faster query which also happens to "avoid a subquery in the FILTER clause" - by using a LEFT [OUTER] JOIN instead:

SELECT a.created::date AS date
     , COUNT(*)        AS total
     , COUNT(p.id)     AS is_first_app  -- count only counts non-null values
FROM   applications a
LEFT   JOIN persons p ON a.is_preliminary = false
                     AND p.id = a.physical_person_id  -- FK enforces max. 1 match
                     AND p.created = a.created
WHERE  a.created >= '2017-01-01'::timestamptz
AND    a.created <  '2017-07-01'::timestamptz
AND    a.country = 'CZ'
GROUP  BY a.created::date
ORDER  BY a.created::date;


For perfect read performance with two index-only scans you'd have the index app_country_created_person_preliminary_idx from above. Plus this one on persons:

CREATE INDEX pers_id_created ON persons (id, created);

Code Snippets

SELECT created::date AS the_date
     , COUNT(*) AS total
     , COUNT(*) FILTER( WHERE is_preliminary = false
                        AND   NOT EXISTS (
                           SELECT 1
                           FROM   applications
                           WHERE  physical_person_id = a.physical_person_id
                           AND    created < a.created
                        -- AND    created < a.created::date  -- alternative? see below
                        -- AND    is_preliminary = false     -- omission? see below
                        -- AND    country = 'CZ'             -- not sure. see below
                           LIMIT  1
                           )
                        ) AS is_first_app
FROM   applications a
WHERE  created >= '2017-01-01'::timestamptz
AND    created <  '2017-07-01'::timestamptz
AND    country = 'CZ'
GROUP  BY created::date
ORDER  BY created::date;
WHERE  created >= '2017-01-01'::timestamptz
AND    created <  '2017-07-01'::timestamptz
CREATE INDEX app_person_created_idx ON applications (physical_person_id, created);
CREATE INDEX app_country_created_person_preliminary_idx
ON applications (country, created, physical_person_id, is_preliminary);
SELECT a.created::date AS date
     , COUNT(*)        AS total
     , COUNT(p.id)     AS is_first_app  -- count only counts non-null values
FROM   applications a
LEFT   JOIN persons p ON a.is_preliminary = false
                     AND p.id = a.physical_person_id  -- FK enforces max. 1 match
                     AND p.created = a.created
WHERE  a.created >= '2017-01-01'::timestamptz
AND    a.created <  '2017-07-01'::timestamptz
AND    a.country = 'CZ'
GROUP  BY a.created::date
ORDER  BY a.created::date;

Context

StackExchange Database Administrators Q#179629, answer score: 6

Revisions (0)

No revisions yet.