Pattern: sql · Minor
Is there a faster way to count JSONB tags?
Problem
I'm trying to squeeze some more performance out of this query in Postgres 9.5. I'm running it over 400,000 rows.
In playing around with it, I've noticed that the CASE statements are adding quite a bit to the query cost - if I replace them with simply summing some existing column, it halves the execution time. Is there a more efficient way to calculate these sums?

SELECT sum("tag1"), sum("tag2"), sum("total_tags")
FROM (
  SELECT people.data->'recruiter_id' AS recruiter_id,
         (CASE WHEN people.data->'tags' ? 'tag1' THEN 1 END) AS "tag1",
         (CASE WHEN people.data->'tags' ? 'tag2' THEN 1 END) AS "tag2",
         ((CASE WHEN people.data->'tags' ? 'tag1' THEN 1 ELSE 0 END) +
          (CASE WHEN people.data->'tags' ? 'tag2' THEN 1 ELSE 0 END)) AS total_tags
  FROM people
  WHERE people.data->'tags' ?| ARRAY['tag1','tag2']
) AS target
GROUP BY recruiter_id

Output of EXPLAIN ANALYSE:

HashAggregate  (cost=1076.30..1078.22 rows=550 width=202) (actual time=7043.115..7043.208 rows=449 loops=1)
  Group Key: (people.data -> 'recruiter_id'::text)
  ->  Bitmap Heap Scan on people  (cost=12.85..1072.72 rows=550 width=202) (actual time=13.908..2619.878 rows=48492 loops=1)
        Recheck Cond: ((data -> 'tags'::text) ?| '{tag1,tag2}'::text[])
        Heap Blocks: exact=26114
        ->  Bitmap Index Scan on index_people_on_data_tags  (cost=0.00..12.82 rows=550 width=0) (actual time=9.219..9.219 rows=48493 loops=1)
              Index Cond: ((data -> 'tags'::text) ?| '{tag1,tag2}'::text[])
Planning time: 0.139 ms
Execution time: 7043.291 ms

Running on:
PostgreSQL 9.5.5 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2, 64-bit
The inner query and outer query are generated by separate portions of the application. Is it possible to optimise without restructuring?
Solution
This should be faster and simpler overall:
SELECT *, tag1 + tag2 AS total_tags
FROM (
   SELECT (data->>'recruiter_id')::int AS recruiter_id  -- cheaper to group by int
        , count(*) FILTER (WHERE data->'tags' ? 'tag1') AS tag1
        , count(*) FILTER (WHERE data->'tags' ? 'tag2') AS tag2
   FROM   people
   WHERE  data->'tags' ?| ARRAY['tag1','tag2']
   GROUP  BY 1
   ) target;

- Assuming that recruiter_id is an integer entity, it's cheaper to group by the integer value than by the jsonb object containing an integer value. I also assume you'd rather get the integer value in the result anyway.
- Only count once in the subquery, then add counts for the total sum in the outer SELECT.
- Use the aggregate FILTER clause for conditional counts: Return counts for multiple ranges in a single SELECT statement
- If you want shorter syntax, this gets you the same result and performance:
  count(data->'tags' ? 'tag1' OR NULL) AS tag1

Index and missing statistics for jsonb

Typically, indexes are the deciding factor for performance with big tables. But since your query retrieves 48,493 of 400,000 rows, i.e. >12 %, the index is not helping this query at all.

Why the poor decision? The query planner has no statistics for values inside a json / jsonb object and has to choose a query plan based on generic selectivity estimates. It expects to find rows=550 while the query actually finds ~90x as many (rows=48493). The query plan using bitmap index scans is a poor decision; a sequential scan would be faster (not using the index at all).

The index may still be instrumental for less frequent tags (if you have those); an expression index on data->'tags' should serve best. Maybe even a jsonb_path_ops index, combined with an adapted query. More: Index for finding an element in a JSON array

However, for this and other reasons, while working with common tags, a plain Postgres array or a fully normalized schema will beat the performance of your jsonb object by a long shot.

This discussion in the postgresql-performance list is about your problem exactly:

- https://www.postgresql.org/message-id/54C738E1.8080405@agliodbs.com
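The expression index and jsonb_path_ops index mentioned above could be sketched as follows. This is a hypothetical sketch, not part of the original answer: the index names are invented, and it assumes the tags are stored as a JSON array inside data.

```sql
-- Hypothetical sketch: index names are made up, and "tags" is assumed to be a JSON array.

-- Expression GIN index on just the tags object; smaller than indexing all of "data",
-- and it supports the existence operators ?, ?| and ?& used in the query above:
CREATE INDEX people_data_tags_gin_idx ON people USING gin ((data->'tags'));

-- Alternative: a jsonb_path_ops index is smaller and faster still, but it supports
-- only the containment operator @>, so the query has to be adapted accordingly:
CREATE INDEX people_data_pathops_idx ON people USING gin (data jsonb_path_ops);

-- Adapted query shape for the jsonb_path_ops index:
SELECT count(*)
FROM   people
WHERE  data @> '{"tags": ["tag1"]}';
```

Whether either index pays off depends on tag selectivity; as noted above, for tags matching more than ~12 % of the table, a sequential scan wins regardless.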
Code Snippets

SELECT *, tag1 + tag2 AS total_tags
FROM (
   SELECT (data->>'recruiter_id')::int AS recruiter_id  -- cheaper to group by int
        , count(*) FILTER (WHERE data->'tags' ? 'tag1') AS tag1
        , count(*) FILTER (WHERE data->'tags' ? 'tag2') AS tag2
   FROM   people
   WHERE  data->'tags' ?| ARRAY['tag1','tag2']
   GROUP  BY 1
   ) target;

count(data->'tags' ? 'tag1' OR NULL) AS tag1

Context
StackExchange Database Administrators Q#159968, answer score: 5
Revisions (0)
No revisions yet.