HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

PostgreSQL performance with (col = value or col is NULL)

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
postgresqlwithnullcolvalueperformance

Problem

This question deals with PostgreSQL 9.5 query performance.

The table is:

CREATE TABLE big_table
(
   id integer NOT NULL,
   flag bigint NOT NULL,
   time timestamp with timezone NOT NULL,
   val int,
   primary key (id)
)


Consider the query:

SELECT * 
FROM big_table 
WHERE time > '0666-06-06 00:00:00+00' AND 
      flag & 2 = 2 AND 
      (val = 5 or val IS NULL) 
ORDER by time desc


The goal is to make this query performance as good as possible while not using a union. I achieved the best performance using the index:

CREATE INDEX ind 
ON big_table USING btree (time DESC, val)
WHERE (flag & 2::smallint = 2::smallint);


which makes the query use the index twice and use bitmap or between them. This is still much worse than if I could do val = ANY('{NULL,5}') which is not possible.

This makes me consider to put some special value such as -1 instead of NULL which will allow me to do val = ANY('{-1,5}') to run a full index scan with better performance.

So is using some special value instead of NULL better in such situations? Or does PostgreSQL has some optimization for situation where you want to do an "OR" between a non-NULL value and a NULL?

I read this great article but didn't see there a solution for what I just asked without using a union: https://www.cybertec-postgresql.com/en/avoid-or-for-better-performance/

I also searched for such question on Stack Overflow and here but the related questions either don't talk about performance or don't talk about non-NULL together with NULL. This similar question (and the answer it received) spoke about SQL Server and not PostgreSQL: Best way to write SQL Query that checks a column for non-NULL value or NULL

I also considered doing coalesce to get rid of null but have read that it makes the performance much worse.

The table has more than 4 million records. From them only 56 records have NULL in val column.

UPDATE:
Original explain:

```
"Limit (cost=1112.32..1112.32 rows=2 width=519)"

Solution

Replacing NULL with a surrogate value may or may not help a little. NULL handling is a bit more expensive in indexes but, on the other hand, it's typically smaller in storage, which also impacts performance.

I expect a much greater impact from this: flip the order of index expressions:

CREATE INDEX ON big_table (val, time DESC)
WHERE flag & 2::smallint = 2::smallint;


Rule of thumb: index for equality first — then for ranges. See:

  • Multicolumn index and performance



To your consolation: val = ANY('{-1,5}') burns down to being syntax shorthand for (val = -1 OR val = 5), which is hardly better than (val IS NULL OR val = 5). (The more important factor is number of rows for NULL vs. -1 - the same in your case if you just replace NULL with -1.).

Also consider upgrading to a current version of Postgres. 9.5 is getting old, there have been various performance improvements for big tables.

To return only few columns, an index-only scan would be a possible optimization, but you need to return most of your 21 columns according to comments.

To save 8 bytes per row needlessly lost to alignment padding, reorder the columns of your demo table like this:

CREATE TABLE big_table (
   flag bigint NOT NULL,
   time timestamptz NOT NULL,
   id int PRIMARY KEY,
   val int
);


Smaller is faster overall. Now, that's obviously just a demo table, but the same principles apply to your actual table. See:

  • Configuring PostgreSQL for read performance



  • Calculating and saving space in PostgreSQL



Sort

For a single val, Postgres can return pre-sorted data from the index directly. But for more than one value it has to merge equally many sorted sub-sets (one sorted set for val IS NULL, another one for val = 5 in your example), so another sort step on top of index access is inevitable. Pre-sorted sets from the index still can make it cheaper - and you need sorted index tuples in any case. The actual query plan also depends on the chosen index access method. It's trivial to return pre-sorted data from an index scan (or index-only scan). Not so much for a bitmap index scan.
Special case: very few NULL values, used all the time

Since you mentioned only a handful of NULL values for millions of rows I would add a tailored index for that special case:

CREATE INDEX ON big_table (time DESC)
WHERE flag & 2::smallint = 2::smallint
AND   val IS NULL;


Maybe even append all other columns of interest to this very tiny special index to get an index-only scan. (Depends on preconditions.) Even more so in Postgres 11 or later with a true covering index using the INCLUDE clause. Results from this and the other index are merged in a BitmapOr node, just like you see for multiple subset from the same index now. Postgres has precise estimates for the special case and it becomes completely irrelevant whether the special case is NULL or -1 or whatever. (Not that it mattered all that much to begin with.) See:

  • Can Postgres use an index-only scan for this query with joined tables?



  • Index that is not used, yet influences query

Code Snippets

CREATE INDEX ON big_table (val, time DESC)
WHERE flag & 2::smallint = 2::smallint;
CREATE TABLE big_table (
   flag bigint NOT NULL,
   time timestamptz NOT NULL,
   id int PRIMARY KEY,
   val int
);
CREATE INDEX ON big_table (time DESC)
WHERE flag & 2::smallint = 2::smallint
AND   val IS NULL;

Context

StackExchange Database Administrators Q#226416, answer score: 4

Revisions (0)

No revisions yet.