HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Optimize query matching first N items of an array

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
arrayqueryfirstitemsoptimizematching

Problem

I'm working with a Postgres DB full of chess game data right now, where each game is a row in a 'records' table. The players' moves and the (optional) computer evaluations of those moves each have their own column and are stored as arrays.

I've written a query to retrieve all evaluations of a specified opening sequence of moves. (You'd think the computer evaluation would be consistent - but it's not.) The length of the opening sequence is arbitrary - it could be one move, it could be thirty.

Here's an example query, which finds all games starting with the same ten-move opening sequence and then, for each game with an evaluation, returns the computer's evaluation of that point in the game -

SELECT evaluation[10]
FROM records
WHERE moves[1:10]::text[] = ARRAY['b4', 'e5', 'Bb2', 'd6', 'Nf3', 'Nf6', 'g3', 'Bg4', 'Bg2', 'h5']::text[]
AND evaluation IS NOT NULL;


I'm not certain it's relevant, but the move data is always an alphanumeric string from 2-6 characters, and the computer evaluations are largely decimals (both positive and negative) but do include the occasional special character (forced checkmates have an octothorpe for a prefix).

Here's the relevant snippet of the table description -

Column      |              Type              | 
-----------------+--------------------------------+-
 id              | bigint                         | 
 moves           | character varying(255)[]       |  
 evaluation      | character varying(255)[]       | 
    "records_pkey" PRIMARY KEY, btree (id)
Access method: heap


And here's the query plan from EXPLAIN ANALYZE:

```
Gather (cost=1000.00..736354.70 rows=905 width=516) (actual time=28251.267..28253.139 rows=0 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on records (cost=0.00..735264.20 rows=377 width=516) (actual time=28243.233..28243.234 rows=0 loops=3)
Filter: ((evaluation IS NOT NULL) AND ((moves[1:10])::text[] = '{b4,e5,Bb2,d6,Nf3,Nf6,g3,Bg4,Bg2,h5}'::text

Solution

You need index support to be fast. Tricky without re-designing your table. The following solution should perform excellently (microseconds instead of seconds), but requires some skill. Buckle up.
Expression index on IMMUTABLE function

Just take a few leading array elements, say 8. That should be very selective already. More would just make the index bigger, for little extra filtering.

Convert to a string. No separator. That allows false positives, but unlikely enough to matter. Eventually, we filter for exact results anyway.

Only IMMUTABLE functions are allowed in index expressions. But array_to_string() is only STABLE, not IMMUTABLE, because it takes anyarray and some element types don't have an immutable text representation. We only deal with text (well, varchar(255) for no good reason, but all the same), and that is, in fact, immutable. But array_to_string() doesn't know that.

So we could "fake" an immutable wrapper function. That's possible for any user who can create functions:

CREATE OR REPLACE FUNCTION public.f_8moves(text[])
  RETURNS text
  LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT AS
$func$
SELECT array_to_string($1[1:8], '')
$func$;


Or, better yet, define an honestly IMMUTABLE variant of array_to_string() for only text[] input, directly based on the underlying C function. Faster and cleaner. Let's call it array_to_string_immutable() to be clear. That requires superuser privileges:

-- SET ROLE postgres;  -- you must be superuser

CREATE OR REPLACE FUNCTION public.array_to_string_immutable(text[], text)
 RETURNS text
 LANGUAGE internal IMMUTABLE PARALLEL SAFE STRICT AS 'array_to_text';


The rest works without superuser privileges.

CREATE OR REPLACE FUNCTION public.f_8moves(text[])
  RETURNS text
  LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT AS
$func$
SELECT public.array_to_string_immutable($1[1:8], '')
$func$;


Related:

  • Does PostgreSQL support "accent insensitive" collations?



Either way, we now have a function public.f_8moves(text[]) to be used in the following index:

CREATE INDEX records_8moves_idx ON records (public.f_8moves(moves) COLLATE "C");


COLLATE "C" is exactly what we need to allow left-anchored (your expressed requirement) LIKE expressions. See:

  • Is there a difference between text_pattern_ops and COLLATE "C"?



  • Is unique index better than unique constraint when an index with an operator class is required



If a major percentage of rows has evaluation IS NOT NULL, add that filter to the index, to make it a partial index on top:

CREATE INDEX records_8moves_idx ON records (public.f_8moves(moves) COLLATE "C")
WHERE evaluation IS NOT NULL;


Query

For queries with 10 array elements or more:

SELECT evaluation[10]
FROM   records
WHERE  public.f_8moves(moves) = public.f_8moves('{b4,e5,Bb2,d6,Nf3,Nf6,g3,Bg4,Bg2,h5}') COLLATE "C"
AND    moves[1:10] = '{b4,e5,Bb2,d6,Nf3,Nf6,g3,Bg4,Bg2,h5}'
AND    evaluation IS NOT NULL;


COLLATE "C" is required to match the index.

Use the expression public.f_8moves(moves) to match the functional index. The right side of the expression can be given as string directly. But I use the same function for convenience.

Then add original exact filters to narrow the results down to exact matches:

AND    moves[1:10] = '{b4,e5,Bb2,d6,Nf3,Nf6,g3,Bg4,Bg2,h5}'


Looks redundant, is logically redundant, but allows to use the index to great effect.

For arrays with less than 8 elements (our indexed lead), or generally, use LIKE with a left-anchored pattern:

SELECT evaluation[10]
FROM   records
WHERE  public.f_8moves(moves) LIKE (public.f_8moves('{b4,e5}') || '%') COLLATE "C"  -- COLLATE "C" is optional for LIKE
AND    moves[1:2] = '{b4,e5}'
AND    evaluation IS NOT NULL;


db<>fiddle here

Similar, with more explanation:

  • Index max row size error



  • Full-text search in Postgres or CouchDB?



Cluster table rows

It will generally help performance to physically sort your table rows, so that each query can read results from one or few data pages, instead of fetching from all over the place.

Use CLUSTER or the non-blocking alternatives pg_repack or pg_squeeze. More in the link above.

Code Snippets

CREATE OR REPLACE FUNCTION public.f_8moves(text[])
  RETURNS text
  LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT AS
$func$
SELECT array_to_string($1[1:8], '')
$func$;
-- SET ROLE postgres;  -- you must be superuser

CREATE OR REPLACE FUNCTION public.array_to_string_immutable(text[], text)
 RETURNS text
 LANGUAGE internal IMMUTABLE PARALLEL SAFE STRICT AS 'array_to_text';
CREATE OR REPLACE FUNCTION public.f_8moves(text[])
  RETURNS text
  LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT AS
$func$
SELECT public.array_to_string_immutable($1[1:8], '')
$func$;
CREATE INDEX records_8moves_idx ON records (public.f_8moves(moves) COLLATE "C");
CREATE INDEX records_8moves_idx ON records (public.f_8moves(moves) COLLATE "C")
WHERE evaluation IS NOT NULL;

Context

StackExchange Database Administrators Q#299039, answer score: 7

Revisions (0)

No revisions yet.