HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

What is the best known data structure for online dk-NNG?

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
thewhatonlinenngknownstructurefordatabest

Problem

I want to build the distance constrained k-Nearest Neighbours Graph, i.e. for every point $p$ in the data cloud $P$ there are at most $k$ undirected edges to points that are closest among all possible vertices using $L_2$ metric with additional constraint that are no further than $d$.

There are only insertions (in batches of $10^6$ points) and neighbourhood retrieval queries used. The results have to be exact - I cannot use approximate nearest neighbours or reduce dimensions.

The setting is following:

There are 50 dimensions, not correlated. The $k=9$ is number of neighbours, $d$ is given distance. The data is sparse in every dimension, mean and median are meaningful, data is random and unstructured in any way (but for sure not from uniform or Gaussian distribution).

So far I have tried to build parallel kd-tree with partial rebalancing, the ball trees, m-trees, r-trees, Sparse ktree (octree with more dimensions) partitioning and two-projection hashes. But I am facing problem, that either insertion time is far too long or k-nng queries are too long. I think this happens because the structures are either static or not really prepared to maintain full neighbourhood graph.

I have tried linear scan, it performs good for very small data, worse than z-curve localised version or sparse ktree. Up to some $n$ it outperforms any treelike structure. I made one modification to naïve algorithm: used one AVL (right threaded) per dimension keeping nodes connected among trees and checked only points within distance $d$ at each dimension - this one performs well, but due to its memory consumption and lack of multidimensional structure scales poorly.

What data structure would be the best in practice for dk-NNG?

If this is not enough, I would like to optimize the number of neighbours checked, for localised data the number of bins traversed.

Solution

I'm not familiar with dk-NNG, but have you tried a PH-Tree? It has excellent insertion times and very good kNN search.
I tested several scenarios with up to 40 dimension and $10^6$ and $10^7$ points. It usually competes with R*Tree (R-Star-Tree) for kNN-search, but has much better update performance.

You could also try an R*tree, they are almost always faster than a standard R-Tree. An implementation in Java can be found here.

Context

StackExchange Computer Science Q#77520, answer score: 2

Revisions (0)

No revisions yet.