patternsqlMinor
Index on JSONB not improving query speed
Viewed 0 times
queryspeedindexnotjsonbimproving
Problem
table structure
Column | Type | Collation | Nullable | Default
------------+--------------------------------+-----------+----------+------------------------------------
id | integer | | not null | nextval('events_id_seq'::regclass)
event_name | character varying(255) | | not null |
data | jsonb | | not null |
Indexes:
"events_pkey" PRIMARY KEY, btree (id)
"events_expr_idx" btree ((data -> 'program_id'::text))
"idxginp" gin (data)
what 'data' might look like
{
"label":"Take Survey Button",
"sent_at":"2018-07-25",
"user_id":123456,
"order_id":7654321,
"event_time":"2018-07-25 00:09:41",
"program_id":4,
"user_agent":"Mozilla/5.0 ...",
"destination":"http://www.google.com",
"communication_kind":"email",
"communication_name":"welcome"
}
Query that I need to run:
select label, count(*) as clicks
from (select distinct "data" ->> 'label' as "label", "data" ->> 'user_id' as "user_id"
from "events"
where "event_name" = 'communication_link_clicked'
and data @> '{"communication_kind": "email"}'
and data @> '{"communication_name": "program_welcome"}'
and CAST ((data ->> 'sent_at') as date ) between '2018-07-25' and '2019-05-18'
and CAST ((data ->> 'program_id') as int) = 4) as sub
group by "label";
This query on my small VM takes around 1 minute.
Records on this table: ~4 million.
Records where data ->> program_id = 4 (~1 million.)
I've tried creating individual indexes for the columns included in the where condition but I din't see major improvements.
Should I convert the json attributes used in the WHERE clause to recular columns? or Am I just missing more indexes and tune my query?
EXPLAIN:
EXPLAIN (ANALYZE, BUFFERS) select label, count(*) as clicks
from (select distinct "data" ->> 'label' as "label", "data" ->> 'user_id' as "user_id
Column | Type | Collation | Nullable | Default
------------+--------------------------------+-----------+----------+------------------------------------
id | integer | | not null | nextval('events_id_seq'::regclass)
event_name | character varying(255) | | not null |
data | jsonb | | not null |
Indexes:
"events_pkey" PRIMARY KEY, btree (id)
"events_expr_idx" btree ((data -> 'program_id'::text))
"idxginp" gin (data)
what 'data' might look like
{
"label":"Take Survey Button",
"sent_at":"2018-07-25",
"user_id":123456,
"order_id":7654321,
"event_time":"2018-07-25 00:09:41",
"program_id":4,
"user_agent":"Mozilla/5.0 ...",
"destination":"http://www.google.com",
"communication_kind":"email",
"communication_name":"welcome"
}
Query that I need to run:
select label, count(*) as clicks
from (select distinct "data" ->> 'label' as "label", "data" ->> 'user_id' as "user_id"
from "events"
where "event_name" = 'communication_link_clicked'
and data @> '{"communication_kind": "email"}'
and data @> '{"communication_name": "program_welcome"}'
and CAST ((data ->> 'sent_at') as date ) between '2018-07-25' and '2019-05-18'
and CAST ((data ->> 'program_id') as int) = 4) as sub
group by "label";
This query on my small VM takes around 1 minute.
Records on this table: ~4 million.
Records where data ->> program_id = 4 (~1 million.)
I've tried creating individual indexes for the columns included in the where condition but I din't see major improvements.
Should I convert the json attributes used in the WHERE clause to recular columns? or Am I just missing more indexes and tune my query?
EXPLAIN:
EXPLAIN (ANALYZE, BUFFERS) select label, count(*) as clicks
from (select distinct "data" ->> 'label' as "label", "data" ->> 'user_id' as "user_id
Solution
So it is using the jsonb index. But your work_mem is not large enough to hold the full bitmap, so it goes "lossy" resulting in extra work, see:
Rows Removed by Index Recheck: 1434250
Heap Blocks: exact=42839 lossy=99688
Increasing the size of your work_mem setting should improve this.
Your functional index cannot be used, because of a mismatch between how you define it and how you try to use it.
((data -> 'program_id'::text))
CAST ((data ->> 'program_id') as int) = 4
One uses "->>", one uses "->". And one casts its output to int, and the other leaves it as jsonb. You should define the index over the same expression as you use in the query, if you want it to be usable. But since 1/4 of the table meets that criterion, the index is probably not going to be helpful in the first place.
To maximize the efficiency here, you might want a compound index over event_name and program_id:
That way you get the 1/4 selectivity for free if you are already using the index on event_name. It is hard to say if that is worthwhile or not, as we don't know how selective the
But the best solution is probably not to use JSONB in the first place. Your JSON structure appears to be flat (based on one example), couldn't you just map those fields into real PostgreSQL columns? You kind of suggest that, but what is the cost in doing so in terms of application complexity? Then the planner would have better statistics to work with. Top level JSON attributes do not get the same kind of statistics on that as real columns do.
Rows Removed by Index Recheck: 1434250
Heap Blocks: exact=42839 lossy=99688
Increasing the size of your work_mem setting should improve this.
Your functional index cannot be used, because of a mismatch between how you define it and how you try to use it.
((data -> 'program_id'::text))
CAST ((data ->> 'program_id') as int) = 4
One uses "->>", one uses "->". And one casts its output to int, and the other leaves it as jsonb. You should define the index over the same expression as you use in the query, if you want it to be usable. But since 1/4 of the table meets that criterion, the index is probably not going to be helpful in the first place.
To maximize the efficiency here, you might want a compound index over event_name and program_id:
create index ON events (event_name, CAST ((data ->> 'program_id') as int));That way you get the 1/4 selectivity for free if you are already using the index on event_name. It is hard to say if that is worthwhile or not, as we don't know how selective the
event_name = 'communication_link_clicked' is. Your current query plan is not using that index, but maybe improving the selectivity by 4 fold would make it worthwhile.But the best solution is probably not to use JSONB in the first place. Your JSON structure appears to be flat (based on one example), couldn't you just map those fields into real PostgreSQL columns? You kind of suggest that, but what is the cost in doing so in terms of application complexity? Then the planner would have better statistics to work with. Top level JSON attributes do not get the same kind of statistics on that as real columns do.
Code Snippets
create index ON events (event_name, CAST ((data ->> 'program_id') as int));Context
StackExchange Database Administrators Q#238806, answer score: 4
Revisions (0)
No revisions yet.