PostgreSQL storing vectors and computing dot product
Problem
I am working on a small NLP project in Python that computes text similarity between documents (news articles, blog posts, etc.), and I am thinking of storing each document's feature vector in Postgres. For instance, say I have a document with the content "Breakthrough on cancer research...". After all the Python preprocessing it would become something like [(1, 0.225), (2, 0.1), (5, 0.11), ...], where the first item in each tuple is the index of the "feature" (in this case a word) and the second item is its "score/magnitude".
Now if I have a second document with a similar data structure, say query_vector = [(1, 0.525), (6, 0.9), (7, 0.1221), ...], I need to run a query like SELECT * FROM vectors_table WHERE dot_product(vectors_table.features_vector, query_vector) > 0.5. What dot_product will do is multiply the "score/magnitude" values of both vectors feature by feature and sum the products, which in the example above is SUM(0.525*0.225, 0.9*0, 0.1221*0, ...) (0 because the other vector doesn't have any score for the features with index 6 or 7).
I have a working solution, but it is way too slow. My most straightforward solution is to store the vectors as jsonb, so each one becomes something like
{"1": 0.225, "2": 0.1, "5": 0.11, ...}
I also keep just the keys of the jsonb in another column (for faster array intersection later on). Then I have a UDF like this:

CREATE OR REPLACE FUNCTION dot_product_overlap(query jsonb, query_keys anyarray, target jsonb, target_keys anyarray)
RETURNS float AS $$
DECLARE
    k integer;
    result float := 0;
BEGIN
    -- Only keys present in both vectors contribute to the dot product.
    FOR k IN SELECT * FROM array_intersect(query_keys, target_keys)
    LOOP
        -- jsonb object keys are text, so cast the integer key before the lookup.
        result := result + ((query->>(k::text))::float * (target->>(k::text))::float);
    END LOOP;
    RETURN result;
END;
$$ LANGUAGE plpgsql;

array_intersect is another UDF that computes the intersection of two arrays (something like & in intarray). What this UDF did was first compute the intersection between the two arrays of keys and ...
Solution
You should write the function in C if you want it to run as fast as possible. Then it might also be a good idea to write your own data type that is stored with a hash table.
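Why a hash-table layout helps can be illustrated outside the database. The following is only a rough Python sketch, not the C data type suggested above. It assumes each sparse vector is held as a dict mapping feature index to score (Python dicts are hash tables), and sparse_dot is a hypothetical name. The sample values are the entries shown in the question, with the elided ones left out.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {feature_index: score} dicts."""
    # Walk the smaller dict and probe the larger one, so the cost depends on
    # the smaller vector's size rather than on the full feature space.
    if len(a) > len(b):
        a, b = b, a
    # Features missing from either vector contribute 0 and are simply skipped.
    return sum(score * b[idx] for idx, score in a.items() if idx in b)


# Entries taken from the question; the "..." parts are not filled in.
doc_vector = {1: 0.225, 2: 0.1, 5: 0.11}
query_vector = {1: 0.525, 6: 0.9, 7: 0.1221}

result = sparse_dot(doc_vector, query_vector)
print(result)
```

Only feature 1 appears in both vectors here, so the single product 0.525 * 0.225 is the whole sum.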
If that is too hard, and you don't need that last little bit of performance, you could try to rewrite your function in Perl or Python — both can be run inside the database.
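As one possible shape for the Python route, here is a hedged sketch. It assumes the plpython3u extension is installed, that the vectors are stored as jsonb objects such as {"1": 0.225, ...}, and that no jsonb transform is configured, in which case PL/Python receives jsonb arguments as JSON text. The names dot_product_py and dot_product_body are made up for illustration. The function body is written as plain Python so it can be tried outside the database first; the CREATE FUNCTION wrapper appears only as a comment.

```python
import json

# Hypothetical wrapper (assumes the plpython3u extension is available):
#
#   CREATE OR REPLACE FUNCTION dot_product_py(query jsonb, target jsonb)
#   RETURNS float AS $$
#       import json
#       q = json.loads(query)
#       t = json.loads(target)
#       if len(q) > len(t):
#           q, t = t, q
#       return sum(v * t[k] for k, v in q.items() if k in t)
#   $$ LANGUAGE plpython3u;


def dot_product_body(query, target):
    """Same logic as the function body above, callable from ordinary Python."""
    q = json.loads(query)   # assumption: jsonb arrives as a JSON string
    t = json.loads(target)
    # Iterate over the smaller object and look keys up in the larger one.
    if len(q) > len(t):
        q, t = t, q
    # JSON object keys are strings, so they can be compared directly.
    return sum(v * t[k] for k, v in q.items() if k in t)


# Entries taken from the question; the "..." parts are not filled in.
stored = '{"1": 0.225, "2": 0.1, "5": 0.11}'
query = '{"1": 0.525, "6": 0.9, "7": 0.1221}'

score = dot_product_body(stored, query)
print(score)
```

This version needs neither the separate keys column nor array_intersect, since the key lookup happens inside the parsed dicts. Whether it is actually faster than the PL/pgSQL loop would have to be measured on real data.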
Context
StackExchange Database Administrators Q#248701, answer score: 3