snippetsqlMinor
Filter on array text[] and sort on timestamp
Viewed 0 times
arraytextfiltertimestampandsort
Problem
Description
PostgreSQL 9.6 on Linux, size of
I need to retrieve data with filter on
If not, could you suggest other ideas?
Query example
SQL code:
Explain result:
`Limit (cost=233469.63..233469.65 rows=5 width=116) (actual time=801.482..801.483 rows=5 loops=1)
Output: id, tags, maker_date, value
-> Sort (cost=233469.63..234714.22 rows=497833 width=116) (actual time=801.481..801.481 rows=5 loops=1)
Output: id, tags, maker_date, value
Sort Key: tags_tmp.maker_date DESC
Sort Method: top-N heapsort Memory: 25kB
-> Bitmap Heap Scan on public.tags_tmp (cost=6486.58..225200.81 rows=497833 width=116) (actual time=212.982..696.650 rows=366392 loops=1)
Output: id, tags, maker_date, value
Recheck Cond: (tags_tmp.tags && '{funny,inspiration}'::text[])
Heap Blocks: exact=120034
-> Bitmap Index Scan on idx1 (cost=0.00..6362.12 rows=497882 width=0) (actual time=171.742..171.742 rows=722612 loops=1)
I
PostgreSQL 9.6 on Linux, size of
tags_tmp table ~ 30 GB (10 million rows), tags is a text[] and has only 6 values. tags_tmp(id int, tags text[], maker_date timestamp, value text)id tags maker_date value
1 {a,b,c} 2016-11-09 This is test
2 {a} 2016-11-08 This is test
3 {b,c} 2016-11-07 This is test
4 {c} 2016-11-06 This is test
5 {d} 2016-11-05 This is test
I need to retrieve data with filter on
tags as well as order by on maker_date desc. Can I create an index on both tags & maker_date desc columns? If not, could you suggest other ideas?
Query example
select id, tags, maker_date, value
from tags_tmp
where tags && array['a','b']
order by maker_date desc
limit 5 offset 0SQL code:
create index idx1 on tags_tmp using gin (tags);
create index idx2 on tags_tmp using btree(maker_date desc);
explain (analyse on, costs on, verbose)
select id, tags, maker_date, value
from tags_tmp
where tags && array['funny','inspiration']
order by maker_date desc
limit 5 offset 0 ;Explain result:
`Limit (cost=233469.63..233469.65 rows=5 width=116) (actual time=801.482..801.483 rows=5 loops=1)
Output: id, tags, maker_date, value
-> Sort (cost=233469.63..234714.22 rows=497833 width=116) (actual time=801.481..801.481 rows=5 loops=1)
Output: id, tags, maker_date, value
Sort Key: tags_tmp.maker_date DESC
Sort Method: top-N heapsort Memory: 25kB
-> Bitmap Heap Scan on public.tags_tmp (cost=6486.58..225200.81 rows=497833 width=116) (actual time=212.982..696.650 rows=366392 loops=1)
Output: id, tags, maker_date, value
Recheck Cond: (tags_tmp.tags && '{funny,inspiration}'::text[])
Heap Blocks: exact=120034
-> Bitmap Index Scan on idx1 (cost=0.00..6362.12 rows=497882 width=0) (actual time=171.742..171.742 rows=722612 loops=1)
I
Solution
General considerations
Index optimization always depends on the complete picture. Table size, row size, cardinalities, value frequencies, selectivity of typical queries, Postgres version, typical access patterns, etc.
Your case is particularly difficult for two reasons:
-
Different columns used in
-
Filter on array is most efficient with GIN or GiST index, but neither index type produces sorted output. The manual:
Of the index types currently supported by PostgreSQL, only B-tree can
produce sorted output — the other index types return matching rows in
an unspecified, implementation-dependent order.
You can create a multicolumn GIN index on
And it's not going to help for the
One more clarification:
Solution for added specifications
You clarified: only 6 distinct tags possible. That's crucial.
Your table is big and your table definition leaves room for improvements. Size matters for big tables. Your numbers (30 GB, 10 million rows) also suggest a big avg. row size of ~ 3 KB. Either you have more columns than you show or table bloat and need a
The order of columns is relevant because of alignment padding. Details:
More importantly, this:
Arrays have a sizeable overhead of 24 byte (similar to a row), plus actual items.
So a
And 6 distinct values can be represented by only 6 bits of information (assuming that sort order of the array is irrelevant).
We could compress even more, but I chose the convenient
Your tags map to bits in a bitmap,
So the tag array
To encapsulate the logic I suggest two auxiliary functions, easily expandable:
tags --> integer:
integer --> tags:
Basics here:
For convenience, you can add a
Now your basic query can be:
Using the binary
The above query should be faster already, for the smaller size and cheaper operators alone, but nothing revolutionary, yet. To make it fast we need indices, and the above cannot use an index, yet.
One partial index for every tag:
```
CREATE INDEX foo_tag_a_idx ON tags_tmp(maker_date DESC
Index optimization always depends on the complete picture. Table size, row size, cardinalities, value frequencies, selectivity of typical queries, Postgres version, typical access patterns, etc.
Your case is particularly difficult for two reasons:
-
Different columns used in
WHERE and ORDER BY.-
Filter on array is most efficient with GIN or GiST index, but neither index type produces sorted output. The manual:
Of the index types currently supported by PostgreSQL, only B-tree can
produce sorted output — the other index types return matching rows in
an unspecified, implementation-dependent order.
You can create a multicolumn GIN index on
(tags, maker_date) or even more columns (the order of index columns is irrelevant for GIN indexes). But you need to have the additional module btree_gin installed. Instructions:- Inner join using an array column
And it's not going to help for the
ORDER BY component of your problem.One more clarification:
OFFSET m LIMIT n is typically almost as expensive as LIMIT m+n.Solution for added specifications
You clarified: only 6 distinct tags possible. That's crucial.
Your table is big and your table definition leaves room for improvements. Size matters for big tables. Your numbers (30 GB, 10 million rows) also suggest a big avg. row size of ~ 3 KB. Either you have more columns than you show or table bloat and need a
VACUUM FULL run (or similar) or your value column is big and TOASTed, which would make my improvements even more effective since the main relation is cut down to half its size or less with this:CREATE TABLE tags_tmp (
id int PRIMARY KEY -- assuming PK
, tags int NOT NULL -- also assuming NOT NULL
, value text
, maker_date timestamp NOT NULL -- NOT NULL!
);The order of columns is relevant because of alignment padding. Details:
- Configuring PostgreSQL for read performance
More importantly, this:
tags int. Why?Arrays have a sizeable overhead of 24 byte (similar to a row), plus actual items.
- Calculating and saving space in PostgreSQL
So a
text[] with 1-6 items like you demonstrate ('funny', 'inspiration', ...) occupies ~ 56 bytes on avg.And 6 distinct values can be represented by only 6 bits of information (assuming that sort order of the array is irrelevant).
We could compress even more, but I chose the convenient
integer type (occupies 4 bytes), which provides space for up to 31 distinct tags. That leaves room for later additions without changes to the table schema. Detailed rationale:- Should I use the PostgreSQL bit string?
Your tags map to bits in a bitmap,
'a' being the least significant bit (to the right):tag: a | b | c | d | e | f
position: 0 | 1 | 2 | 3 | 4 | 5
int value: 1 | 2 | 4 | 8 | 16 | 32
So the tag array
'{a,d,f}' maps to 41. You can use any arbitrary, distinct strings instead of 'a'-'f', doesn't matter.To encapsulate the logic I suggest two auxiliary functions, easily expandable:
tags --> integer:
CREATE OR REPLACE FUNCTION f_tags2int(text[])
RETURNS int
LANGUAGE sql IMMUTABLE PARALLEL SAFE AS
$func$
SELECT bit_or(CASE x
WHEN 'a' THEN 1
WHEN 'b' THEN 2
WHEN 'c' THEN 4
WHEN 'd' THEN 8
WHEN 'e' THEN 16
WHEN 'f' THEN 32
-- more?
END)
FROM unnest ($1) x
$func$;integer --> tags:
CREATE OR REPLACE FUNCTION f_int2tags(int)
RETURNS text[]
LANGUAGE SQL IMMUTABLE PARALLEL SAFE AS
$func$
SELECT array_remove(ARRAY [CASE WHEN $1 & 1 > 0 THEN 'a' END
, CASE WHEN $1 & 2 > 0 THEN 'b' END
, CASE WHEN $1 & 4 > 0 THEN 'c' END
, CASE WHEN $1 & 8 > 0 THEN 'd' END
, CASE WHEN $1 & 16 > 0 THEN 'e' END
, CASE WHEN $1 & 32 > 0 THEN 'f' END], NULL)
-- more?
$func$;PARALLEL SAFE in Postgres 9.6 or later.Basics here:
- Can I convert a bunch of boolean columns to a single bitmap in PostgreSQL?
For convenience, you can add a
VIEW to display tags as text array like you had it:CREATE VIEW tags_tmp_pretty AS
SELECT id, tags
, f_int2tags(tags) AS tags_pretty
, maker_date, value
FROM tags_tmp;Now your basic query can be:
SELECT id, tags, maker_date, value
FROM tags_tmp
WHERE tags & f_tags2int('{a,b}') > 0 -- any of the tags matched
ORDER by maker_date DESC
LIMIT 5;Using the binary
AND operator &. There are more operators to manipulate the column. get_bit() and set_bit() from the binary string operators are also convenient.The above query should be faster already, for the smaller size and cheaper operators alone, but nothing revolutionary, yet. To make it fast we need indices, and the above cannot use an index, yet.
One partial index for every tag:
```
CREATE INDEX foo_tag_a_idx ON tags_tmp(maker_date DESC
Code Snippets
CREATE TABLE tags_tmp (
id int PRIMARY KEY -- assuming PK
, tags int NOT NULL -- also assuming NOT NULL
, value text
, maker_date timestamp NOT NULL -- NOT NULL!
);CREATE OR REPLACE FUNCTION f_tags2int(text[])
RETURNS int
LANGUAGE sql IMMUTABLE PARALLEL SAFE AS
$func$
SELECT bit_or(CASE x
WHEN 'a' THEN 1
WHEN 'b' THEN 2
WHEN 'c' THEN 4
WHEN 'd' THEN 8
WHEN 'e' THEN 16
WHEN 'f' THEN 32
-- more?
END)
FROM unnest ($1) x
$func$;CREATE OR REPLACE FUNCTION f_int2tags(int)
RETURNS text[]
LANGUAGE SQL IMMUTABLE PARALLEL SAFE AS
$func$
SELECT array_remove(ARRAY [CASE WHEN $1 & 1 > 0 THEN 'a' END
, CASE WHEN $1 & 2 > 0 THEN 'b' END
, CASE WHEN $1 & 4 > 0 THEN 'c' END
, CASE WHEN $1 & 8 > 0 THEN 'd' END
, CASE WHEN $1 & 16 > 0 THEN 'e' END
, CASE WHEN $1 & 32 > 0 THEN 'f' END], NULL)
-- more?
$func$;CREATE VIEW tags_tmp_pretty AS
SELECT id, tags
, f_int2tags(tags) AS tags_pretty
, maker_date, value
FROM tags_tmp;SELECT id, tags, maker_date, value
FROM tags_tmp
WHERE tags & f_tags2int('{a,b}') > 0 -- any of the tags matched
ORDER by maker_date DESC
LIMIT 5;Context
StackExchange Database Administrators Q#154681, answer score: 7
Revisions (0)
No revisions yet.