HiveBrain v1.2.0

Clustering points on a sphere

Submitted by: @import:stackexchange-codereview
Tags: clustering, points, sphere

Problem

I have written a short Python program which does the following: it loads a large data file ($10^9+$ rows) where each row is a point on a sphere. The code then loads a pre-determined triangular grid on the sphere and counts the number of points in each triangle. I have optimised it as best I could; however, I'd like to see whether it can be optimised even further (at the moment it takes more than an hour to go through the entire file).

```python
import numpy as np
import math
import csv
import time
import pandas

PI = 3.141592653589793115997963468544
error = 0.000001

def GalacticToCartesian(starPolar, starCartesian):
    starCartesian[:, 0] = np.sin(starPolar[:, 1] + PI/2.0) * np.cos(starPolar[:, 0])
    starCartesian[:, 1] = np.sin(starPolar[:, 1] + PI/2.0) * np.sin(starPolar[:, 0])
    starCartesian[:, 2] = np.cos(starPolar[:, 1] + PI/2.0)

    # Zero out near-zero coordinates. The iterator must be opened
    # read-write, and the write must go through coord[...], or the
    # assignment silently does nothing.
    for coord in np.nditer(starCartesian, op_flags=['readwrite']):
        if np.abs(coord) < error:
            coord[...] = 0

def RayTriangleIntersection(star, triangle):
    a = np.empty((len(star), 3, 3))
    a[..., 0] = star
    a[..., 1] = (triangle[1] - triangle[0])[None, :]
    a[..., 2] = (triangle[2] - triangle[0])[None, :]

    b = np.tile(-triangle[0], (len(star), 1))

    solution = np.linalg.solve(a, b)

    # The comparison operators were eaten by the HTML scrape; reconstructed:
    # the ray misses the triangle if t < 0, u < 0, v < 0, or u + v > 1.
    return np.where(np.logical_or.reduce((solution[:, 0] < 0.0,
                                          solution[:, 1] < 0.0,
                                          solution[:, 2] < 0.0,
                                          solution[:, 1] + solution[:, 2] > 1.0)),
                    False, True)

grid = 1

gridFacesFile = "triangles.dat"

# Text mode for csv in Python 3 ('rb' was Python 2).
csv_file = csv.reader(open(gridFacesFile, 'r', newline=''), delimiter='\t')

triangles = []

for row in csv_file:
    triangles.append(np.array([float(elem) for elem in row]).reshape((3, 3)))

starCountPerTriangle = np.zeros((len(triangles)))

dataFile = "data.csv"

chunkSize = 10

data = pandas.read_csv(dataFile, chunksize=chunkSize)

count = 0

t0 = time.perf_counter()  # time.clock() was removed in Python 3.8
for chunk in data:
    currentChunkSize = len(chunk)

    print("Processing stars " + str(count * chunkSize + 1) + " to "
          + str(count * chunkSize + currentChunkSize) + "...")

    count += 1
    starsPolar = np.asarray(chunk)[:, 1:3]

    starsCartesian = np.zeros((currentChunkSize, 3))
    # (The rest of the loop was truncated in the scrape; a plausible
    # reconstruction from the functions defined above:)
    GalacticToCartesian(starsPolar, starsCartesian)
    for i, triangle in enumerate(triangles):
        starCountPerTriangle[i] += np.count_nonzero(
            RayTriangleIntersection(starsCartesian, triangle))
```

Solution

I don't pretend to understand all of the code (some comments might help, particularly comments which indicate the types of variables and arguments), but I don't think it's really clustering. In fact, I strongly suspect that it's classifying the star locations into an icosahedral dissection of the sphere. If I'm correct then that points to an algorithm change which seems likely to give a good speed-up.

Linear algebra is, in any case, rather heavyweight. The ray-triangle intersection calculation is unnecessary if all you want is to know whether the ray intersects the triangle. A spherical triangle is the intersection of three hemispheres: testing whether a point is in a hemisphere is a simple dot product and sign test, and testing whether it's in the intersection of three is three dot products and sign tests and two Boolean ANDs. That's worst case sixty dot products per star, and if the ANDs short-circuit then it will be fewer.
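A minimal sketch of that hemisphere test (the helper names are mine, not from the answer; it assumes the triangle's vertices are unit vectors listed counter-clockwise as seen from outside the sphere, so the cross products below are the inward great-circle normals):

```python
import numpy as np

def triangle_normals(a, b, c):
    """Inward normals of the three bounding great circles.

    Computed once per triangle; vertices a, b, c are unit vectors,
    counter-clockwise as seen from outside the sphere.
    """
    return np.array([np.cross(a, b), np.cross(b, c), np.cross(c, a)])

def point_in_spherical_triangle(p, normals):
    """True if unit vector p lies in the spherical triangle whose
    great-circle normals are the rows of `normals`.

    Three dot products and sign tests; `and` short-circuits on the
    first hemisphere that excludes p.
    """
    return (np.dot(p, normals[0]) >= 0 and
            np.dot(p, normals[1]) >= 0 and
            np.dot(p, normals[2]) >= 0)

# Example: the positive octant of the unit sphere.
octant = triangle_normals(np.array([1.0, 0.0, 0.0]),
                          np.array([0.0, 1.0, 0.0]),
                          np.array([0.0, 0.0, 1.0]))
```

The normals are precomputed per triangle, so the per-star cost really is just the dot products and sign tests.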

However, if I'm correct about the icosahedron then it has a very useful property. It's a Voronoi triangulation. You can classify each star into the corresponding triangle by converting its coordinates to Cartesian, taking the dot product with the centre of each triangle, and picking the triangle which gives the largest dot product. That's twenty dot products per star, but further optimisation would be possible by exploiting the structure of the icosahedron to create a binary decision diagram. I haven't calculated whether it can be reduced to the information-theoretically optimal 4.33 (average) dot products per star, but I certainly expect it could be as few as six or seven.
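A sketch of that nearest-centre classification (function and array names are hypothetical; it assumes the face centres have already been normalised to unit vectors):

```python
import numpy as np

def classify_stars(stars, centres):
    """Assign each star (rows of unit vectors, shape (n, 3)) to the face
    whose unit-vector centre gives the largest dot product.

    For a regular icosahedron the face centres are the Voronoi sites of
    its triangulation of the sphere, so the largest dot product (i.e. the
    smallest angular distance) picks out the containing triangle.
    """
    # (n, 3) @ (3, k) -> one (n, k) matrix of dot products per chunk;
    # argmax over the k faces gives the face index for each star.
    return np.argmax(stars @ centres.T, axis=1)

def count_per_face(stars, centres):
    """Histogram of stars per face: replaces the per-triangle
    ray-intersection loop with one matrix product and a bincount."""
    return np.bincount(classify_stars(stars, centres),
                       minlength=len(centres))
```

Batching the dot products into a single matrix product also lets NumPy do the work in one vectorised pass per chunk, rather than one `np.linalg.solve` call per triangle.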

Context

StackExchange Code Review Q#155157, answer score: 2
