HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

PostgreSQL storing vectors and computing dot product

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
postgresqlvectorsdotproductandstoringcomputing

Problem

I am working on a small NLP project in Python that involves performing text similarity on documents(news articles, blog posts etc) and I am thinking of storing each document features vector in Postgres, for instance say I have a document with the content Breakthrough on cancer research..., assuming after all the Python preprocessing it would become something like [(1, 0.225), (2, 0.1), (5, 0.11),...] where the first item in the tuple represents the index of the "feature" (in this case a word), and the second item represents its "score/magnitude".

Now if I have a second document with a similar data structure, say query_vector = [(1, 0.525), (6, 0.9), (7, 0.1221),...] I need to run a query like SELECT FROM vectors_table WHERE dot_product(vectors_table.features_vector, query_vector) > 0.5. What dot_product will do is multiple the "score/magnitude" of both vectors, which in the example above its gonna be SUM(0.5250.225, 0.90, 0.12210, ...) (0 because the other vector don't have any score for the feature with index 6 or 7).

I have a working solution but it is way too slow. My most straightforward solution is to store the vectors as jsonb so it'll become something like {1: 0.225, 2: 0.1, 5:0.11, ...} and I also have just the keys on the jsonb in another column (for faster array intersection computation later on) then I have an UDF as such:

CREATE OR REPLACE FUNCTION dot_product_overlap(query jsonb, query_keys anyarray, target jsonb, target_keys anyarray)
RETURNS float AS $
DECLARE 
    row integer;
    result float := 0;
BEGIN
       FOR row IN SELECT * FROM array_intersect(query_keys, target_keys)
       LOOP
          result:=result+((query->>row)::float*(target->>row)::float);
       END LOOP;
    RETURN result;
END;
$  LANGUAGE plpgsql;


WHERE array_intersect is another UDF to compute the intersection between 2 array (something like & in intarray). What this UDF did was first compute the intersection between two array of keys and

Solution

You should write the function in C if you want it to run as fast as possible. Then it might also be a good idea to write your own data type that is stored with a hash table.

If that is too hard, and you don't need that last little bit of performance, you could try to rewrite your function in Perl or Python — both can be run inside the database.

Context

StackExchange Database Administrators Q#248701, answer score: 3

Revisions (0)

No revisions yet.