HiveBrain v1.2.0
Get Started
← Back to all entries
patternMajor

Quick nearest neighbor search in the 150-dimensional space

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
neighborthespacesearchnearestquick150dimensional

Problem

I want to create a database using any of the possible RDBMS. It will have a table with approximately 150 columns. The objective is to perform nearest neighbor search of some other objects. So it's a NNS in the 150-dimensional space.

I already tried to use some obvious methods like L1 or L2 distances but of course it takes a lot of time for tables with many rows. Also I tried to look at the KD-tree (note I did not test it) and PG-Strom but they are not a good solution for data with a many dimensions.

Can I somehow improve the speed of the described search using math methods (like KD-tree) or tech methods (like PG-Strom)?

I will try to use any RDBMS which allow to improve speed of the NNS. But MySQL and PostgreSQL are the most appropriate DBMS for me.

Solution

PostgreSQL 9.6 using cube

First install the cube extension

CREATE EXTENSION cube;


Now we will create some n-dimensional space with 100,000 points in 50 dimensions. In addition we'll add a GIST index.

CREATE TEMP TABLE space_nd
AS
  SELECT i, cube(array_agg(random()::float)) AS c
  FROM generate_series(1,1e5) AS i
  CROSS JOIN LATERAL generate_series(1,50)
    AS x
  GROUP BY i;

CREATE INDEX ON space_nd USING gist ( c );
ANALYZE space_nd;


Now we will generate a single point and use the ` operater to find the nearest point using Eucledian distance.

WITH points AS (
  SELECT cube(array_agg(random()::float)) AS c
  FROM generate_series(1,50)
    AS x
)
SELECT i,
  pg_typeof(space_nd.c),
  pg_typeof(points.c),
  cube_distance(space_nd.c, points.c)
FROM space_nd
CROSS JOIN points
ORDER BY space_nd.c  points.c
LIMIT 5;


PostgreSQL 9.6+ supports other distance operators over
cube`. All of which can use the GIST index we created. Namely,

a  b float8  Euclidean distance between a and b.
a  b float8  Taxicab (L-1 metric) distance between a and b.
a  b float8  Chebyshev (L-inf metric) distance between a and b.


That said there is one caveat,


To make it harder for people to break things, there is a limit of 100 on the number of dimensions of cubes. This is set in cubedata.h if you need something bigger.

You ask for 150 dimensions. That may present a minor complication.

Code Snippets

CREATE EXTENSION cube;
CREATE TEMP TABLE space_nd
AS
  SELECT i, cube(array_agg(random()::float)) AS c
  FROM generate_series(1,1e5) AS i
  CROSS JOIN LATERAL generate_series(1,50)
    AS x
  GROUP BY i;

CREATE INDEX ON space_nd USING gist ( c );
ANALYZE space_nd;
WITH points AS (
  SELECT cube(array_agg(random()::float)) AS c
  FROM generate_series(1,50)
    AS x
)
SELECT i,
  pg_typeof(space_nd.c),
  pg_typeof(points.c),
  cube_distance(space_nd.c, points.c)
FROM space_nd
CROSS JOIN points
ORDER BY space_nd.c <-> points.c
LIMIT 5;
a <-> b float8  Euclidean distance between a and b.
a <#> b float8  Taxicab (L-1 metric) distance between a and b.
a <=> b float8  Chebyshev (L-inf metric) distance between a and b.

Context

StackExchange Database Administrators Q#163207, answer score: 21

Revisions (0)

No revisions yet.