HiveBrain v1.2.0
Get Started
← Back to all entries
snippetsqlMinor

How to get index scan for OR'ed time-range predicates?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
predicatesscanrangetimegetforhowindex

Problem

I have table events with fields:

id
user_id
time_start
time_end
...


And have b-tree index on (time_start, time_end).

SELECT user_id
FROM events
WHERE ((time_start = '2021-08-24T15:30:00+00:00') OR
       (time_start = '2021-08-24T15:59:00+00:00'))
GROUP BY user_id);


Group  (cost=243735.42..243998.32 rows=1103 width=4) (actual time=186.533..188.244 rows=166 loops=1)
  Group Key: user_id
  Buffers: shared hit=224848
  ->  Gather Merge  (cost=243735.42..243992.80 rows=2206 width=4) (actual time=186.532..188.199 rows=176 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=224848
        ->  Sort  (cost=242735.39..242738.15 rows=1103 width=4) (actual time=184.121..184.126 rows=59 loops=3)
              Sort Key: user_id
              Sort Method: quicksort  Memory: 27kB
              Worker 0:  Sort Method: quicksort  Memory: 27kB
              Worker 1:  Sort Method: quicksort  Memory: 28kB
              Buffers: shared hit=224848
              ->  Partial HashAggregate  (cost=242668.62..242679.65 rows=1103 width=4) (actual time=184.065..184.085 rows=59 loops=3)
                    Group Key: user_id
                    Buffers: shared hit=224834
                    ->  Parallel Seq Scan on events  (cost=0.00..242553.74 rows=45952 width=4) (actual time=104.085..183.994 rows=64 loops=3)
                          Filter: (((time_start = '2021-08-24 15:30:00+00'::timestamp with time zone)) OR ((time_start = '2021-08-24 15:59:00+00'::timestamp with time zone)))
                          Rows Removed by Filter: 708728
                          Buffers: shared hit=224834
Planning Time: 0.169 ms
Execution Time: 188.294 ms


Postgres uses Seq Scan with filter:

Filter: (((time_start = '2021-08-24 15:30:00+00'::timestamp with time zone)) OR ((time_start = '2021-08-24 15:59:00+00'::timestamp with time zone)))


But when I leave one condition for time_start and time_end it starts to use Index Scan.

Solution

Expression index

A GiST or (even better) SP-GiST expression index on an inclusive timestamp range should work wonders.

CREATE INDEX events_right_idx ON events USING spgist (tsrange(time_start, time_end, '[]'));


Rewrite your query with the "range contains" operator @> and match the indexed expression (exactly equivalent to your original):

SELECT user_id
FROM   events
WHERE  tsrange(time_start, time_end, '[]') @> timestamp '2021-08-24 15:30:00'
    OR tsrange(time_start, time_end, '[]') @> timestamp '2021-08-24 15:59:00'
GROUP  BY user_id;


You'll get a query plan like this:
HashAggregate (cost=9.90..10.00 rows=10 width=4)
Group Key: user_id
-> Bitmap Heap Scan on events (cost=2.57..9.88 rows=10 width=4)
Recheck Cond: ((tsrange(time_start, time_end, ''[]''::text) @> ''2021-08-24 15:30:00''::timestamp without time zone) OR (tsrange(time_start, time_end, ''[]''::text) @> ''2021-08-24 15:59:00''::timestamp without time zone))
-> BitmapOr (cost=2.57..2.57 rows=10 width=0)
-> Bitmap Index Scan on events_right_expr_idx (cost=0.00..1.28 rows=5 width=0)
Index Cond: (tsrange(time_start, time_end, ''[]''::text) @> ''2021-08-24 15:30:00''::timestamp without time zone)
-> Bitmap Index Scan on events_right_expr_idx (cost=0.00..1.28 rows=5 width=0)
Index Cond: (tsrange(time_start, time_end, ''[]''::text) @> ''2021-08-24 15:59:00''::timestamp without time zone)


Should be much faster.

Range types assume inclusive lower and exclusive upper bound unless instructed otherwise. tsrange(time_start, time_end) is the same as tsrange(time_start, time_end), '[)').

Since you operate with >= and

  • Optimizing queries on a range of timestamps (two columns)



Alternatively, store range column in table

Should be a bit faster, yet, as plain (not expression) index.

You can add a timestamp range column to your table, like:

ALTER TABLE event ADD COLUMN ts_range tsrange GENERATED ALWAYE AS (tsrange(time_start, time_end, '[]')) STORED;


See:

  • Computed / calculated / virtual / derived columns in PostgreSQL



Or, more radically, replace
time_start and time_end with the range column. Then index and query are a bit simpler:

CREATE INDEX events_right_idx ON events USING spgist (ts_range);

SELECT user_id
FROM   events
WHERE  ts_range @> timestamp '2021-08-24T15:30:00'
    OR ts_range @> timestamp '2021-08-24T15:59:00'
GROUP  BY user_id;


But the
tsrange column occupies more space than two timestamp columns. Weigh cost and benefits.
Asides

Postgres 14 (currently beta) even allows a covering SP-GiST index. The release notes:

Allow SP-GiST to use INCLUDE'd columns (Pavel Borisov)

But I don't think you can get index-only scans for your particular query.

If you have to make do with your B-tree index for some reason, this fixed
UNION query shouldn't be too bad:

SELECT user_id
FROM   events
WHERE  '2021-08-24T15:30:00' BETWEEN time_start AND time_end
UNION
SELECT user_id
FROM   events
WHERE  '2021-08-24T15:59:00' BETWEEN time_start AND time_end


Notably, no
GROUP BY. UNION already does all the work.

And simplify with
BETWEEN (no effect on performance).

Also, you seem to have a wild mix of
timestamp without time zone and timestamp with time zone. And name it "time" to add to the confusion. Typically timestamptz is the better choice. See:

  • Ignoring time zones altogether in Rails and PostgreSQL



Last, but not least, this indicates inaccurate column statistics, leading to a sub-optimal query plan:
-> Parallel Seq Scan on events (cost=0.00..242553.74 rows=45952 width=4)
(actual time=104.085..183.994 rows=64 loops=3)

Run

ANALYZE events;


And retry. Your original query can use a plain B-tree index. It's just not as efficient as the suggested SP-GiST index.

And then maybe tune your
autovacuum` and statistics settings to avoid bad statistics in the future. See:

  • Aurora PostgreSQL database using a slower query plan than a normal PostgreSQL for an identical query?



  • Keep PostgreSQL from sometimes choosing a bad query plan

Code Snippets

CREATE INDEX events_right_idx ON events USING spgist (tsrange(time_start, time_end, '[]'));
SELECT user_id
FROM   events
WHERE  tsrange(time_start, time_end, '[]') @> timestamp '2021-08-24 15:30:00'
    OR tsrange(time_start, time_end, '[]') @> timestamp '2021-08-24 15:59:00'
GROUP  BY user_id;
ALTER TABLE event ADD COLUMN ts_range tsrange GENERATED ALWAYE AS (tsrange(time_start, time_end, '[]')) STORED;
CREATE INDEX events_right_idx ON events USING spgist (ts_range);

SELECT user_id
FROM   events
WHERE  ts_range @> timestamp '2021-08-24T15:30:00'
    OR ts_range @> timestamp '2021-08-24T15:59:00'
GROUP  BY user_id;
SELECT user_id
FROM   events
WHERE  '2021-08-24T15:30:00' BETWEEN time_start AND time_end
UNION
SELECT user_id
FROM   events
WHERE  '2021-08-24T15:59:00' BETWEEN time_start AND time_end

Context

StackExchange Database Administrators Q#298625, answer score: 6

Revisions (0)

No revisions yet.