HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Abysmal postgres query performance when joining by jsonb_exists (question mark operator)

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
postgresjsonb_existsoperatorqueryperformancewhenabysmalquestionjoiningmark

Problem

Our application recently changed how some data was modeled, changing the schema to add a jsonb array column, and populating it from data in another column.

We're finding that the new queries joining on this column are unacceptably slow against our full data set, and are in a tight spot as we're forced to still use the old queries until this is resolved.

The main table we're querying on is defined like so:

CREATE TABLE public.visit (
  visit_key CHARACTER VARYING PRIMARY KEY NOT NULL,
  store_key CHARACTER VARYING NOT NULL,
  launch_key CHARACTER VARYING NOT NULL,
  when_ TIMESTAMP WITHOUT TIME ZONE NOT NULL,
  page CHARACTER VARYING NOT NULL,
  ip CHARACTER VARYING NOT NULL,
  useragent CHARACTER VARYING,
  referrer CHARACTER VARYING,
  visitor CHARACTER VARYING NOT NULL,
  country CHARACTER VARYING,
  region CHARACTER VARYING,
  city CHARACTER VARYING,
  vars JSONB,
  stops JSONB,
  FOREIGN KEY (launch_key) REFERENCES public.launch (launch_key)
  MATCH SIMPLE ON UPDATE NO ACTION ON DELETE NO ACTION,
  FOREIGN KEY (store_key) REFERENCES public.store (store_key)
  MATCH SIMPLE ON UPDATE NO ACTION ON DELETE NO ACTION
);
CREATE INDEX visit_store_key_idx ON visit USING BTREE (store_key);
CREATE INDEX visit_launch_key_idx ON visit USING BTREE (launch_key);
CREATE INDEX visit_visitor_idx ON visit USING BTREE (visitor);
CREATE INDEX visit_stops_idx ON visit USING GIN (stops);
CREATE INDEX visit_when__idx ON visit USING BTREE (when_);


Our old "fast" query and corresponding plan looks like this:

```
EXPLAIN (ANALYZE, VERBOSE)
SELECT
date_trunc('day', "public"."visit"."when_" :: TIMESTAMP WITH TIME ZONE AT TIME ZONE 'America/Los_Angeles'),
count("public"."visit"."visitor"),
count(DISTINCT "public"."visit"."visitor")
FROM "public"."visit" join "public"."launch" ON ("public"."visit"."launch_key" = "public"."launch"."launch_key")
WHERE (
"public"."visit"."store_key" = 'ahBzfmdlYXJsYXVuY2gtaHVicg8LEgVTdG9yZRi53Ku8Cgw':: VARCHAR AND
"public"."visi

Solution

Data was migrated such that if visit.launch_key contained 'abc', visit.stops now contains ['abc'].

This seems like a really bad idea. Why do you want to do that? Is it too late to change your mind?

PostgreSQL can get a pretty good estimate of how selective visit.store_key = parameter is. It has much less accurate information on how selective visit.stops ? parameter.

When the ? is only used as a filter, this probably isn't going to be so bad. But when it is used as a join operator, this leads the planner to make very poor choices.

This is an acknowledged shortcoming and there has been some discussion of trying to improve this situation for a future version of PostgreSQL, but so far I don't think it has moved beyond discussion so I wouldn't be terrible optimistic. But even if it were fixed, I would still expect a scalar which is a scalar is going to work better than a scalar dressed up as an array.

By the way, you asked why it slows down your sort. It doesn't. The time reported for the sort operation includes the time needed to generate the data that is to be sorted. The sort itself is very quick.

Context

StackExchange Database Administrators Q#154519, answer score: 2

Revisions (0)

No revisions yet.