Efficient point grouping algorithm

Tags: grouping, algorithm, efficient, point

Problem

I'm trying to find a good solution to the following problem:
Given a set of 3D real-valued coordinates inside a cube, I want to group the points into containers. Whenever two elements are closer than a predefined distance, they must end up in the same container; each container therefore holds all elements that are within that distance of at least one other element in it.
Note: two elements that are further apart may still end up in the same container, e.g. if they are both close to a third element.

The most straightforward solution I can think of is to go through all points and compare each one with all points that haven't been visited yet to check whether they are in proximity, but that seems rather inefficient.

I tried to find a solution online, but (maybe because I'm missing the correct keywords) I couldn't find any. I stumbled upon k-d trees and similar concepts, but as far as I understand them, they don't help much here. Are there good algorithms that can take care of this, or does anybody have a suggestion for what I could do?

Below is a (badly hand-drawn) illustration of a possible 2D case, with the reference distance d marked in blue and the two groups I want to end up with in red. Thanks in advance for any help.

Edit: as pointed out in the comments, this problem can be thought of as finding the connected components of a graph. The graph's nodes are the points of the data set, and two nodes are adjacent if their distance is smaller than the reference value. The problem I see is that establishing the edges seems to be $O(n^2)$, because I don't see a way that avoids doing pairwise checks on all points of the set. Is that correct?
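For reference, here is a minimal sketch of that brute-force approach (the function name and the union-find helpers are purely illustrative): it checks every pair of points and merges the close ones, which makes the quadratic cost explicit.

from itertools import combinations
import math

def group_points_naive(points, d):
    """Brute-force grouping: check all O(n^2) pairs and merge close
    points with a simple union-find. Illustrative sketch only."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps the trees flat
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i, j in combinations(range(len(points)), 2):  # the O(n^2) part
        if math.dist(points[i], points[j]) < d:
            union(i, j)

    groups = {}  # collect the containers by their union-find root
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(points[i])
    return list(groups.values())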

Solution

My Solutions

I will try to elaborate on the solutions I found, their implementations, and those implementations' running times.
As I don't have any real computer science background, please point out any mistakes or unclear points you find.

First: The problem, in general terms, is to partition a set of points
$A \subset \mathbb{R}^d$ into groups such that every point in a group is $\epsilon$-reachable
from every other point in the group.
Reachable, for $x_1, x_n \in A$, means there is a sequence of points $x_1, \dots, x_n \in A$ s.t. $d(x_i, x_{i+1}) < \epsilon$ for all $i < n$.

DBSCAN: This is essentially what the clustering algorithm DBSCAN computes. DBSCAN calls $x \in A$ a core point if at least $N$ points of $A$ (counting $x$ itself) lie within distance $\epsilon$ of it, and defines:

  • $x \in A$ is directly reachable from $y \in A$ if both are core points and $d(x,y) < \epsilon$

  • $x \in A$ is reachable from $y \in A$ if there exists a sequence $x_1 = x, \dots, x_n = y \in A$ s.t. $x_{i+1}$ is directly reachable from $x_i$

  • non-reachable points are outliers

Now, as you might see, for $N = 2$ (if you include every point itself when counting) and $d(\cdot, \cdot)$ the usual
Euclidean metric, this defines exactly the problem I stated above. DBSCAN clusters together core points deterministically,
but also adds non-core points to clusters. So if $x \in A$ is reachable from core point $y$, it might
end up in the same cluster without being a core point itself. I say might, because a non-core point
may be reachable from different clusters; its assignment is then determined by the order in which
the algorithm runs through the clusters. This ambiguity, however, resolves for $N = 2$: every point
that is reachable from another is then automatically a core point, so reachability as stated
above becomes a symmetric relation and the clusters are unambiguous.
With an indexing structure that answers neighborhood queries in $O(\log n)$, the runtime complexity is
apparently $O(n \log n)$, but don't take my word for it.
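To make the $N = 2$ case concrete, here is a tiny sketch (the data and eps value are made up for illustration) using scikit-learn's DBSCAN, discussed below:

import numpy as np
from sklearn.cluster import DBSCAN

# 1d toy data: consecutive gaps of 0.8 chain the first three points into
# one cluster even though the endpoints are 1.6 > eps apart; the last
# point has no neighbor within eps and is marked as noise (label -1).
pts = np.array([[0.0], [0.8], [1.6], [5.0]])
labels = DBSCAN(eps=1.0, min_samples=2).fit(pts).labels_
print(labels)  # [ 0  0  0 -1]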

I used scikit-learn's implementation of DBSCAN (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html).
I failed to find out what runtime complexity this implementation has, but the docs state a memory complexity
of $O(n \cdot d)$.

Intersection graph:
As has been proposed in a different answer, one can also think of the problem as a graph-theoretical one.
The graph's vertices are the points in $A$, with two vertices being adjacent iff their
points are within $\epsilon$-distance of each other. On this graph, the problem translates to finding the
connected components.
The paper
Bentley, Jon L.; Stanat, Donald F.; Williams, E. Hollins, Jr. (1977), "The complexity of finding fixed-radius near neighbors", Information Processing Letters, 6 (6): 209–212
provides an algorithm for constructing this graph that runs in linear time and is essentially identical to the one proposed
by @D.W. Finding the connected components can also be done in linear time, e.g. by breadth-first or depth-first search.
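To spell out that last step, here is a minimal breadth-first-search sketch of connected components over a plain adjacency list (not the code I used below, which relies on networkx; the edge list could come from cKDTree.query_pairs):

from collections import deque

def connected_components(n, edges):
    """BFS over an adjacency list: O(n + m) for n nodes and m edges."""
    adj = [[] for _ in range(n)]
    for i, j in edges:  # undirected graph: insert both directions
        adj[i].append(j)
        adj[j].append(i)

    seen = [False] * n
    components = []
    for start in range(n):
        if seen[start]:
            continue  # already part of an earlier component
        seen[start] = True
        comp, queue = [], deque([start])
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in adj[u]:
                if not seen[v]:
                    seen[v] = True
                    queue.append(v)
        components.append(comp)
    return components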

Python Code:
For the code, I mainly used the DBSCAN implementation of scikit-learn, networkx for graphs and finding
connected components (it is open source, but I couldn't find any reference on its runtime complexity),
and scipy.spatial's cKDTree implementation (which I do not understand in detail).

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs  # the samples_generator module path is deprecated
import networkx as nx
import scipy.spatial as sp

def cluster(data, epsilon, N):  # DBSCAN, Euclidean distance
    db     = DBSCAN(eps=epsilon, min_samples=N).fit(data)
    labels = db.labels_  # cluster labels of the points; -1 marks noise
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # number of clusters
    clusters   = [data[labels == i] for i in range(n_clusters)]  # list of clusters
    return clusters, n_clusters


This is my minimal implementation of the previously described goal. The method returns the clusters and their number.
For the graph method, I wrote a little class. The class describes a graph whose vertices are the indices
of the corresponding points in $A$. Using the cKDTree data structure, finding pairs of
reachable points is lightning fast. You can then slice the elements out via the indices.

class IGraph:
    def __init__(self, nodelst=None, radius=1):
        # nodelst is an array of coordinate tuples; the graph stores indices as nodes
        self.igraph  = nx.Graph()
        self.radius  = radius
        self.nodelst = nodelst if nodelst is not None else []  # avoid a mutable default argument
        self._make_edges()

    def _make_edges(self):
        # query_pairs returns all index pairs within the given radius
        self.igraph.add_edges_from(sp.cKDTree(self.nodelst).query_pairs(r=self.radius))

    def get_conn_comp(self):
        # components of size 1 are isolated points and are dropped as outliers
        ind = [list(x) for x in nx.connected_components(self.igraph) if len(x) > 1]
        return [self.nodelst[indlist] for indlist in ind]  # fancy indexing: nodelst must be a numpy array

def graph_cluster(data, epsilon):
    graph = IGraph(nodelst=data, radius=epsilon)
    clusters = graph.get_conn_comp()
    return clusters, len(clusters)


I also did a little benchmark, but it is in no way representative. I create the test data (three Gaussian blobs of N samples in total) with:

centers = [[1, 1, 1], [-1, -1, 1], [1, -1, 1]]
X, _ = make_blobs(n_samples=N, centers=centers, cluster_std=0.4,
                  random_state=0)
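Both methods can then be run on the same data, e.g. like this (the parameter values are arbitrary choices for this sketch; note that N above is the sample count, not the min_samples argument of cluster):

eps = 0.3  # assumed reference distance, purely illustrative

clusters_db, k_db = cluster(X, eps, 2)     # DBSCAN route with min_samples=2
clusters_gr, k_gr = graph_cluster(X, eps)  # kd-tree + connected-components route
print(k_db, k_gr)  # both routes should report the same number of groups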


Context

StackExchange Computer Science Q#85929, answer score: 4
