HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Finding closest points without duplicates and given distance threshold

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
withoutpointsdistancefindingthresholdclosestandgivenduplicates

Problem

Given a bunch of latitudes and longitudes stored as Points, I would like to determine the points that are closest to each point given some maximum threshold in meters. This is my first stab at this:

DECLARE @ThresholdInMeters FLOAT = 1000000.0;

IF OBJECT_ID('tempdb..#Points1') IS NOT NULL
    DROP TABLE #Points1
CREATE TABLE #Points1
    (
      PointId INT IDENTITY(1, 1) ,
      Point GEOGRAPHY NULL
    ) 

DECLARE @LowerLatitude FLOAT;
DECLARE @UpperLatitude FLOAT;
DECLARE @RandomLatitude FLOAT;
DECLARE @LowerLongitude FLOAT;
DECLARE @UpperLongitude FLOAT;
DECLARE @RandomLongitude FLOAT;
DECLARE @Counter INT;

SET @LowerLatitude = -90 
SET @UpperLatitude = 90 
SET @LowerLongitude = -180
SET @UpperLongitude = 180

SET @Counter = 100;
WHILE ( @Counter > 0 )
    BEGIN
        SELECT  @RandomLatitude = ( ( @UpperLatitude - @LowerLatitude - 1 )
                                    * RAND() + @LowerLatitude );
        SELECT  @RandomLongitude = ( ( @UpperLongitude - @LowerLongitude - 1 )
                                     * RAND() + @LowerLongitude );

        INSERT  INTO #Points1
                SELECT  GEOGRAPHY::Point(@RandomLatitude, @RandomLongitude,
                                         4326)

        SET @Counter = @Counter - 1;
    END 

SELECT  
    Points1.PointId AS StartPointId,
    Points2.PointId AS EndPointId,
    Points1.Point.STDistance(Points2.Point) AS Distance
FROM #Points1 AS Points1
INNER JOIN #Points1 AS Points2 ON Points1.PointId <= Points2.PointId
WHERE Points1.Point.STDistance(Points2.Point) < @ThresholdInMeters


Any ideas on how to potentially improve this, especially in performance? All indices are already set.

Solution

Style

Remove unneccessary parentheses

You don't need parentheses around the condition for the WHILE
loop. You can just write WHILE @counter > 0.

Also, the extra sets of parentheses around the random numbers are
superfluous. If you leave them off, the line will be a bit less
daunting.

Don't break the line in the middle of a concept

That is, in the SELECT @RandomLatitude line, the statement breaks
just before the * operator. As it has a higher precedence than
the +, this can be a bit misleading. Either don't break, use
parentheses (if really unavoidable) or break before the +:

SELECT @RandomLatitude = ( @UpperLatitude - @LowerLatitude - 1 ) * RAND()
                         + @LowerLatitude;
SELECT @RandomLongitude = ( @UpperLongitude - @LowerLongitude - 1 ) * RAND()
                          + @LowerLongitude;


Merge variable declaration and initialisation

Where possible, declare a variable just before it is used, and
initialise it there as well:

DECLARE @LowerLatitude FLOAT = -90;
DECLARE @UpperLatitude FLOAT = 90;
DECLARE @LowerLongitude FLOAT = -180;
DECLARE @UpperLongitude FLOAT = 180;

DECLARE @Counter INT = 1000;
WHILE @Counter > 0
  BEGIN
    DECLARE @RandomLatitude FLOAT = ( @UpperLatitude - @LowerLatitude - 1 ) * RAND()
                                    + @LowerLatitude
    DECLARE @RandomLongitude FLOAT = ( @UpperLongitude - @LowerLongitude - 1 ) * RAND()
                                     + @LowerLongitude
    -- ...


Performance

You mention that the indices are already set. I assume that means that
it is not possible to add a Spatial Index, or a primary key.
Although I would strongly suggest adding a primary key constraint to
the temp table in this example:

CREATE TABLE #Points1
    ( PointId INT IDENTITY(1, 1) PRIMARY KEY
    , Point GEOGRAPHY NULL
    );


Unfortunately, without a Spatial Index or changes to the result
requirements (you'd need a TOP n clause and an ORDER BY on the
STDistance column), there is no choice but to check each point against
all others, which will give a large join of which I don't know how to
improve the performance.

Small point: you may gain a smaller result (only in number of rows) if
you leave out the distance from each point to itself:

INNER JOIN Points1 AS Points2 ON Points1.PointId < Points2.PointId


but that might not fit your requirements.

Code Snippets

SELECT @RandomLatitude = ( @UpperLatitude - @LowerLatitude - 1 ) * RAND()
                         + @LowerLatitude;
SELECT @RandomLongitude = ( @UpperLongitude - @LowerLongitude - 1 ) * RAND()
                          + @LowerLongitude;
DECLARE @LowerLatitude FLOAT = -90;
DECLARE @UpperLatitude FLOAT = 90;
DECLARE @LowerLongitude FLOAT = -180;
DECLARE @UpperLongitude FLOAT = 180;

DECLARE @Counter INT = 1000;
WHILE @Counter > 0
  BEGIN
    DECLARE @RandomLatitude FLOAT = ( @UpperLatitude - @LowerLatitude - 1 ) * RAND()
                                    + @LowerLatitude
    DECLARE @RandomLongitude FLOAT = ( @UpperLongitude - @LowerLongitude - 1 ) * RAND()
                                     + @LowerLongitude
    -- ...
CREATE TABLE #Points1
    ( PointId INT IDENTITY(1, 1) PRIMARY KEY
    , Point GEOGRAPHY NULL
    );
INNER JOIN Points1 AS Points2 ON Points1.PointId < Points2.PointId

Context

StackExchange Code Review Q#102166, answer score: 2

Revisions (0)

No revisions yet.