patternpythonMinor
K-means algorithm very slow
Viewed 0 times
algorithmslowverymeans
Problem
I'm a beginner in Python but I have tried to implement K-means algorithm in python and it's working... but it's too slow... Instead of few seconds I can spend hours to finish it and I don't know why... something it's wrong... Could anyone of you give me some advice please?
Code:
```
import math
from random import randint
from copy import deepcopy
from chart import chart_centroids, save_image
centroid = dict(x=0, y=0, points_x=[], points_y=[])
list_centroids = []
x = []
y = []
k = 0
def distance(x1, y1, x2, y2):
x = (int(x1) - int(x2)) ** 2
y = (int(y1) - int(y2)) ** 2
sum = x + y
sqr = math.sqrt(sum)
return sqr
def generate_centroids(range_x, range_y):
list_centroids.clear()
k = 3
for i in range(0, k):
centroid["x"] = randint(1, range_x)
centroid["y"] = randint(1, range_y)
list_centroids.append(deepcopy(centroid))
def choose_points_for_centroids(x, y):
for i in range(len(list_centroids)):
list_centroids[i]["points_x"].clear()
list_centroids[i]["points_y"].clear()
distances = []
for j in range(len(x)):
for i in range(len(list_centroids)):
dist = distance(x[j], y[j], list_centroids[i]["x"], list_centroids[i]["y"])
distances.append(dist)
minim = min(float(s) for s in distances)
index = distances.index(minim)
list_centroids[index]["points_x"].append(x[j])
list_centroids[index]["points_y"].append(y[j])
distances.clear()
def move_centroids():
sum_x = 0
sum_y = 0
for cent in list_centroids:
for j in range(len(cent["points_x"])):
sum_x += cent["points_x"][j]
sum_y += cent["points_y"][j]
if len(cent["points_x"]) > 0 and len(cent["points_y"]) > 0:
avg_x = sum_x / len(cent["points_x"])
avg_y = sum_y / len(cent["points_y"])
cent["x"] = avg_x
cent["y"] = avg_y
def run():
generate_centroids(300, 300)
read_f
Code:
```
import math
from random import randint
from copy import deepcopy
from chart import chart_centroids, save_image
centroid = dict(x=0, y=0, points_x=[], points_y=[])
list_centroids = []
x = []
y = []
k = 0
def distance(x1, y1, x2, y2):
x = (int(x1) - int(x2)) ** 2
y = (int(y1) - int(y2)) ** 2
sum = x + y
sqr = math.sqrt(sum)
return sqr
def generate_centroids(range_x, range_y):
list_centroids.clear()
k = 3
for i in range(0, k):
centroid["x"] = randint(1, range_x)
centroid["y"] = randint(1, range_y)
list_centroids.append(deepcopy(centroid))
def choose_points_for_centroids(x, y):
for i in range(len(list_centroids)):
list_centroids[i]["points_x"].clear()
list_centroids[i]["points_y"].clear()
distances = []
for j in range(len(x)):
for i in range(len(list_centroids)):
dist = distance(x[j], y[j], list_centroids[i]["x"], list_centroids[i]["y"])
distances.append(dist)
minim = min(float(s) for s in distances)
index = distances.index(minim)
list_centroids[index]["points_x"].append(x[j])
list_centroids[index]["points_y"].append(y[j])
distances.clear()
def move_centroids():
sum_x = 0
sum_y = 0
for cent in list_centroids:
for j in range(len(cent["points_x"])):
sum_x += cent["points_x"][j]
sum_y += cent["points_y"][j]
if len(cent["points_x"]) > 0 and len(cent["points_y"]) > 0:
avg_x = sum_x / len(cent["points_x"])
avg_y = sum_y / len(cent["points_y"])
cent["x"] = avg_x
cent["y"] = avg_y
def run():
generate_centroids(300, 300)
read_f
Solution
I see a lot of little things slowing you down, but I don't know what your
Let's look at one of your two frequently-called functions:
In the first paragraph, you "clear" a bunch of data. But I'm not sure why you have your centroids structured this way. Every time you access something, there's an index, a key lookup, and maybe another index. That's way too much work for getting at something you'll be addressing frequently!
In fact, the whole idea of accessing
On the other hand, if you were to combine your
In the next section, you iterate over all your points (here you go again, segregating ordinates from abscissas!) computing a distance metric. You store the distances in a list.
After creating the distances list, you then find the min value.
After finding the min value, you then try to map back to the index of that value.
After finding the index, you use that to figure out what centroid was closest to the point, and tie the point to the centroid.
You overlook the
In your case, you can replace all that code by judicious use of a lambda-expression:
And if you recode your distance function to take tuples, you don't need to do even that much work:
This does three things for you. First, it eliminates a lot of bytecode. And that means it eliminates a lot of things that the computer was doing, which should save you time.
Second, it converts some bytecode into builtins. Using the builtins as much as possible means that your code might be running in C, instead of bytecode. This makes for better performance.
Third, it eliminates extra data structures. Which eliminates allocation, deallocation, garbage collection, data structure maintenance, etc. All that storage translates into performance, either directly (thrashing) or indirectly (code).
Now, speaking of your
Try something like this, again using the points-as-tuples approach:
(Putting the lookup of
chart_centroids and save_image functions do, so I have no idea if they are part of the problem or not. Let's look at one of your two frequently-called functions:
def choose_points_for_centroids(x, y):
for i in range(len(list_centroids)):
list_centroids[i]["points_x"].clear()
list_centroids[i]["points_y"].clear()
distances = []
for j in range(len(x)):
for i in range(len(list_centroids)):
dist = distance(x[j], y[j], list_centroids[i]["x"], list_centroids[i]["y"])
distances.append(dist)
minim = min(float(s) for s in distances)
index = distances.index(minim)
list_centroids[index]["points_x"].append(x[j])
list_centroids[index]["points_y"].append(y[j])
distances.clear()In the first paragraph, you "clear" a bunch of data. But I'm not sure why you have your centroids structured this way. Every time you access something, there's an index, a key lookup, and maybe another index. That's way too much work for getting at something you'll be addressing frequently!
In fact, the whole idea of accessing
list_centroids[i]["x"] and list_centroids[i]["y"] is kind of silly. I don't see any value to separating the x and y coordinates, here. On the other hand, if you were to combine your
x and y coordinates into a tuple, you would have a constant object that can be hashed. And hashed items can be stored in a dictionary. Centroid = { ... }
for c in Centroid:
Centroid[c] = [] # Reset list of points to emptyIn the next section, you iterate over all your points (here you go again, segregating ordinates from abscissas!) computing a distance metric. You store the distances in a list.
After creating the distances list, you then find the min value.
After finding the min value, you then try to map back to the index of that value.
After finding the index, you use that to figure out what centroid was closest to the point, and tie the point to the centroid.
You overlook the
min function's key= argument. The key is a function (or lambda-expression) that returns a value. Given the input, the min function determines what to compare by calling the key function. If the key function is not provided, then a simple identity function ( f(x) = x ) is used.In your case, you can replace all that code by judicious use of a lambda-expression:
# This should be your global Point store, not x[] and y[]
Points = [ (_x, _y) for _x, _y in zip(x,y) ]
for p in Points:
x,y = p
nearoid = min(Centroid, key=lambda c: distance(x,c[0],y,c[1]))
Centroid[nearoid].append(p)And if you recode your distance function to take tuples, you don't need to do even that much work:
for p in Points:
nearoid = min(Centroid, key=lambda c: distance(p, c))
Centroid[nearoid].append(p)This does three things for you. First, it eliminates a lot of bytecode. And that means it eliminates a lot of things that the computer was doing, which should save you time.
Second, it converts some bytecode into builtins. Using the builtins as much as possible means that your code might be running in C, instead of bytecode. This makes for better performance.
Third, it eliminates extra data structures. Which eliminates allocation, deallocation, garbage collection, data structure maintenance, etc. All that storage translates into performance, either directly (thrashing) or indirectly (code).
Now, speaking of your
distance function, I see you are calling int a bunch of times. But the inputs are, if I understand correctly, already integers. So those are a bunch of name lookups, and function calls, that are entirely redundant.Try something like this, again using the points-as-tuples approach:
def distance(a, b, sqrt=math.sqrt):
"""Return the distance between (x,y) tuples a and b"""
return sqrt( (a[0] - b[0])**2 + (a[1] - b[1])**2)(Putting the lookup of
math.sqrt into the constants table is a bit of a cheat. But anything for speed, eh?)Code Snippets
def choose_points_for_centroids(x, y):
for i in range(len(list_centroids)):
list_centroids[i]["points_x"].clear()
list_centroids[i]["points_y"].clear()
distances = []
for j in range(len(x)):
for i in range(len(list_centroids)):
dist = distance(x[j], y[j], list_centroids[i]["x"], list_centroids[i]["y"])
distances.append(dist)
minim = min(float(s) for s in distances)
index = distances.index(minim)
list_centroids[index]["points_x"].append(x[j])
list_centroids[index]["points_y"].append(y[j])
distances.clear()Centroid = { ... }
for c in Centroid:
Centroid[c] = [] # Reset list of points to empty# This should be your global Point store, not x[] and y[]
Points = [ (_x, _y) for _x, _y in zip(x,y) ]
for p in Points:
x,y = p
nearoid = min(Centroid, key=lambda c: distance(x,c[0],y,c[1]))
Centroid[nearoid].append(p)for p in Points:
nearoid = min(Centroid, key=lambda c: distance(p, c))
Centroid[nearoid].append(p)def distance(a, b, sqrt=math.sqrt):
"""Return the distance between (x,y) tuples a and b"""
return sqrt( (a[0] - b[0])**2 + (a[1] - b[1])**2)Context
StackExchange Code Review Q#158661, answer score: 3
Revisions (0)
No revisions yet.