Clustering points on a sphere
Problem
I have written a short Python program which does the following: it loads a large data file (\$10^9+\$ rows) in which each row is a point on a sphere, loads a pre-determined triangular grid on the sphere, and counts the number of points in each triangle. I have optimised it as best I can, but I'd like to see whether it can be optimised further: at the moment it takes more than an hour to go through the entire file.
```
import csv
import time

import numpy as np
import pandas

PI = 3.141592653589793115997963468544
error = 0.000001

def GalacticToCartesian(starPolar, starCartesian):
    # starPolar: (N, 2) array of angular coordinates; starCartesian: (N, 3) output.
    starCartesian[:, 0] = np.sin(starPolar[:, 1] + PI/2.0) * np.cos(starPolar[:, 0])
    starCartesian[:, 1] = np.sin(starPolar[:, 1] + PI/2.0) * np.sin(starPolar[:, 0])
    starCartesian[:, 2] = np.cos(starPolar[:, 1] + PI/2.0)
    # Snap near-zero components to exactly zero.  (The original looped with
    # np.nditer and assigned to the loop variable, which never writes back,
    # and its `>` comparison would have zeroed everything but the noise.)
    starCartesian[np.abs(starCartesian) < error] = 0.0

def RayTriangleIntersection(star, triangle):
    # Solve star*t + (v1-v0)*u + (v2-v0)*v = -v0 for each star direction,
    # then test the barycentric bounds.  (The inequalities on the original
    # line were lost in transcription; the standard bounds u >= 0, v >= 0,
    # u + v <= 1 are reconstructed here.)
    a = np.empty((len(star), 3, 3))
    a[..., 0] = star
    a[..., 1] = (triangle[1] - triangle[0])[None, :]
    a[..., 2] = (triangle[2] - triangle[0])[None, :]
    b = np.tile(-triangle[0], (len(star), 1))
    solution = np.linalg.solve(a, b)
    return ~np.logical_or.reduce((solution[:, 1] < 0.0,
                                  solution[:, 2] < 0.0,
                                  solution[:, 1] + solution[:, 2] > 1.0))

grid = 1
gridFacesFile = "triangles.dat"
with open(gridFacesFile, 'r') as f:   # 'rb' is Python 2; csv.reader wants text
    csv_file = csv.reader(f, delimiter='\t')
    triangles = [np.array([float(elem) for elem in row]).reshape((3, 3))
                 for row in csv_file]

starCountPerTriangle = np.zeros(len(triangles))
dataFile = "data.csv"
chunkSize = 10
data = pandas.read_csv(dataFile, chunksize=chunkSize)
count = 0
t0 = time.perf_counter()              # time.clock() was removed in Python 3.8
for chunk in data:
    currentChunkSize = len(chunk)
    print("Processing stars " + str(count*chunkSize + 1) + " to "
          + str(count*chunkSize + currentChunkSize) + "...")
    count += 1
    starsPolar = np.asarray(chunk)[:, 1:3]
    starsCartesian = np.zeros((currentChunkSize, 3))
    GalacticToCartesian(starsPolar, starsCartesian)
    # (The listing was cut off here; presumably each chunk is then tested
    # against every triangle and the per-triangle counts accumulated.)
    for i, triangle in enumerate(triangles):
        starCountPerTriangle[i] += np.count_nonzero(
            RayTriangleIntersection(starsCartesian, triangle))
```
Solution
I don't pretend to understand all of the code (some comments might help, particularly comments which indicate the types of variables and arguments), but I don't think it's really clustering. In fact, I strongly suspect that it's classifying the star locations into an icosahedral dissection of the sphere. If I'm correct then that points to an algorithm change which seems likely to give a good speed-up.
Linear algebra is, in any case, rather heavyweight. The ray-triangle intersection calculation is unnecessary if all you want is to know whether the ray intersects the triangle. A spherical triangle is the intersection of three hemispheres: testing whether a point is in a hemisphere is a simple dot product and sign test, and testing whether it's in the intersection of three is three dot products and sign tests and two Boolean ANDs. That's worst case sixty dot products per star, and if the ANDs short-circuit then it will be fewer.
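The hemisphere test above can be sketched as follows (the names here are illustrative, not from the original code). Each edge of a spherical triangle lies on a great circle; the inward normal of that circle is the cross product of the edge's two vertices, and a star is inside the triangle precisely when its dot product with all three normals is non-negative:

```python
import numpy as np

def triangle_normals(v0, v1, v2):
    """Inward normals of the three great circles bounding a spherical
    triangle whose unit-vector vertices are ordered counter-clockwise
    as seen from outside the sphere."""
    return np.array([np.cross(v0, v1), np.cross(v1, v2), np.cross(v2, v0)])

def stars_in_triangle(stars, normals):
    """stars: (N, 3) unit vectors.  A star is inside iff it lies on the
    positive side of all three bounding great circles: three dot
    products and sign tests per star, done as one matrix product here."""
    return np.all(stars @ normals.T >= 0.0, axis=1)

# Example: the octant triangle with vertices on the x, y and z axes.
v0, v1, v2 = np.eye(3)
normals = triangle_normals(v0, v1, v2)
stars = np.array([[1.0, 1.0, 1.0],    # inside (all-positive octant)
                  [-1.0, 1.0, 1.0]])  # outside (negative x component)
stars /= np.linalg.norm(stars, axis=1, keepdims=True)
print(stars_in_triangle(stars, normals))  # → [ True False]
```

The normals can be precomputed once per triangle, so per chunk the whole test is a single `(N, 3) @ (3, 3)` product per triangle instead of a batched `np.linalg.solve`.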
However, if I'm correct about the icosahedron then it has a very useful property: it's a Voronoi triangulation, meaning each point on the sphere is closest to the centre of the face that contains it. You can classify each star into the corresponding triangle by converting its coordinates to Cartesian, taking the dot product with the centre of each triangle, and picking the triangle which gives the largest dot product. That's twenty dot products per star, but further optimisation would be possible by exploiting the structure of the icosahedron to create a binary decision diagram. I haven't calculated whether it can be reduced to the information-theoretically optimal \$\log_2 20 \approx 4.33\$ (average) dot products per star, but I certainly expect it could be as few as six or seven.
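A sketch of that classification (illustrative names; it assumes the faces are available as a `(T, 3, 3)` array of vertex triples, as in the question's `triangles` list):

```python
import numpy as np

def triangle_centres(triangles):
    """Unit vectors through the vertex centroid of each face.
    triangles: (T, 3, 3) array, one 3x3 block of vertices per face."""
    centres = triangles.mean(axis=1)
    return centres / np.linalg.norm(centres, axis=1, keepdims=True)

def classify(stars, centres):
    """Assign each star ((N, 3) unit vectors) to the face whose centre
    gives the largest dot product: T dot products per star, vectorised
    as one (N, 3) @ (3, T) matrix product."""
    return np.argmax(stars @ centres.T, axis=1)

def count_per_triangle(stars, centres):
    """Histogram of assignments: replaces the per-triangle
    intersection loop with a single argmax and bincount per chunk."""
    return np.bincount(classify(stars, centres), minlength=len(centres))

# Toy example with two opposite octant "faces".
triangles = np.array([[[1, 0, 0], [0, 1, 0], [0, 0, 1]],
                      [[-1, 0, 0], [0, -1, 0], [0, 0, -1]]], dtype=float)
centres = triangle_centres(triangles)
stars = np.array([[0.0, 0.0, 1.0],    # nearer the first face's centre
                  [0.0, 0.0, -1.0]])  # nearer the second
print(count_per_triangle(stars, centres))  # → [1 1]
```

Note this nearest-centre rule is exact only when the faces really do form a Voronoi partition of the sphere, as the faces of a regular icosahedron do; for a general triangulation it is only an approximation.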
Context
StackExchange Code Review Q#155157, answer score: 2