patternpythonMinor

K-means algorithm very slow

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

algorithmslowverymeans

Problem

I'm a beginner in Python but I have tried to implement K-means algorithm in python and it's working... but it's too slow... Instead of few seconds I can spend hours to finish it and I don't know why... something it's wrong... Could anyone of you give me some advice please?

Code:

```
import math
from random import randint
from copy import deepcopy

from chart import chart_centroids, save_image

centroid = dict(x=0, y=0, points_x=[], points_y=[])

list_centroids = []
x = []
y = []
k = 0

def distance(x1, y1, x2, y2):
x = (int(x1) - int(x2)) ** 2
y = (int(y1) - int(y2)) ** 2
sum = x + y
sqr = math.sqrt(sum)
return sqr

def generate_centroids(range_x, range_y):
list_centroids.clear()
k = 3
for i in range(0, k):
centroid["x"] = randint(1, range_x)
centroid["y"] = randint(1, range_y)
list_centroids.append(deepcopy(centroid))

def choose_points_for_centroids(x, y):
for i in range(len(list_centroids)):
list_centroids[i]["points_x"].clear()
list_centroids[i]["points_y"].clear()

distances = []
for j in range(len(x)):
for i in range(len(list_centroids)):
dist = distance(x[j], y[j], list_centroids[i]["x"], list_centroids[i]["y"])
distances.append(dist)

minim = min(float(s) for s in distances)
index = distances.index(minim)
list_centroids[index]["points_x"].append(x[j])
list_centroids[index]["points_y"].append(y[j])
distances.clear()

def move_centroids():
sum_x = 0
sum_y = 0

for cent in list_centroids:
for j in range(len(cent["points_x"])):
sum_x += cent["points_x"][j]
sum_y += cent["points_y"][j]

if len(cent["points_x"]) > 0 and len(cent["points_y"]) > 0:
avg_x = sum_x / len(cent["points_x"])
avg_y = sum_y / len(cent["points_y"])
cent["x"] = avg_x
cent["y"] = avg_y

def run():
generate_centroids(300, 300)
read_f

Solution

I see a lot of little things slowing you down, but I don't know what your chart_centroids and save_image functions do, so I have no idea if they are part of the problem or not.

Let's look at one of your two frequently-called functions:

def choose_points_for_centroids(x, y):
    for i in range(len(list_centroids)):
        list_centroids[i]["points_x"].clear()
        list_centroids[i]["points_y"].clear()

    distances = []
    for j in range(len(x)):
        for i in range(len(list_centroids)):
            dist = distance(x[j], y[j], list_centroids[i]["x"], list_centroids[i]["y"])
            distances.append(dist)

        minim = min(float(s) for s in distances)
        index = distances.index(minim)
        list_centroids[index]["points_x"].append(x[j])
        list_centroids[index]["points_y"].append(y[j])
        distances.clear()

In the first paragraph, you "clear" a bunch of data. But I'm not sure why you have your centroids structured this way. Every time you access something, there's an index, a key lookup, and maybe another index. That's way too much work for getting at something you'll be addressing frequently!

In fact, the whole idea of accessing list_centroids[i]["x"] and list_centroids[i]["y"] is kind of silly. I don't see any value to separating the x and y coordinates, here.

On the other hand, if you were to combine your x and y coordinates into a tuple, you would have a constant object that can be hashed. And hashed items can be stored in a dictionary.

Centroid = { ... }

for c in Centroid:
    Centroid[c] = []    # Reset list of points to empty

In the next section, you iterate over all your points (here you go again, segregating ordinates from abscissas!) computing a distance metric. You store the distances in a list.

After creating the distances list, you then find the min value.

After finding the min value, you then try to map back to the index of that value.

After finding the index, you use that to figure out what centroid was closest to the point, and tie the point to the centroid.

You overlook the min function's key= argument. The key is a function (or lambda-expression) that returns a value. Given the input, the min function determines what to compare by calling the key function. If the key function is not provided, then a simple identity function ( f(x) = x ) is used.

In your case, you can replace all that code by judicious use of a lambda-expression:

# This should be your global Point store, not x[] and y[]
Points = [ (_x, _y) for _x, _y in zip(x,y) ]

for p in Points:
    x,y = p
    nearoid = min(Centroid, key=lambda c: distance(x,c[0],y,c[1]))
    Centroid[nearoid].append(p)

And if you recode your distance function to take tuples, you don't need to do even that much work:

for p in Points:
    nearoid = min(Centroid, key=lambda c: distance(p, c))
    Centroid[nearoid].append(p)

This does three things for you. First, it eliminates a lot of bytecode. And that means it eliminates a lot of things that the computer was doing, which should save you time.

Second, it converts some bytecode into builtins. Using the builtins as much as possible means that your code might be running in C, instead of bytecode. This makes for better performance.

Third, it eliminates extra data structures. Which eliminates allocation, deallocation, garbage collection, data structure maintenance, etc. All that storage translates into performance, either directly (thrashing) or indirectly (code).

Now, speaking of your distance function, I see you are calling int a bunch of times. But the inputs are, if I understand correctly, already integers. So those are a bunch of name lookups, and function calls, that are entirely redundant.

Try something like this, again using the points-as-tuples approach:

def distance(a, b, sqrt=math.sqrt):
    """Return the distance between (x,y) tuples a and b"""
    return sqrt( (a[0] - b[0])**2 + (a[1] - b[1])**2)

(Putting the lookup of math.sqrt into the constants table is a bit of a cheat. But anything for speed, eh?)

Code Snippets

def choose_points_for_centroids(x, y):
    for i in range(len(list_centroids)):
        list_centroids[i]["points_x"].clear()
        list_centroids[i]["points_y"].clear()

    distances = []
    for j in range(len(x)):
        for i in range(len(list_centroids)):
            dist = distance(x[j], y[j], list_centroids[i]["x"], list_centroids[i]["y"])
            distances.append(dist)

        minim = min(float(s) for s in distances)
        index = distances.index(minim)
        list_centroids[index]["points_x"].append(x[j])
        list_centroids[index]["points_y"].append(y[j])
        distances.clear()

Centroid = { ... }

for c in Centroid:
    Centroid[c] = []    # Reset list of points to empty

# This should be your global Point store, not x[] and y[]
Points = [ (_x, _y) for _x, _y in zip(x,y) ]

for p in Points:
    x,y = p
    nearoid = min(Centroid, key=lambda c: distance(x,c[0],y,c[1]))
    Centroid[nearoid].append(p)

for p in Points:
    nearoid = min(Centroid, key=lambda c: distance(p, c))
    Centroid[nearoid].append(p)

def distance(a, b, sqrt=math.sqrt):
    """Return the distance between (x,y) tuples a and b"""
    return sqrt( (a[0] - b[0])**2 + (a[1] - b[1])**2)

Context

StackExchange Code Review Q#158661, answer score: 3

Revisions (0)

No revisions yet.