patternpythonMinorCanonical

K-Means clustering in Python2

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

clusteringpython2means

Problem

I've implemented the K-Means clustering algorithm in Python2, and I wanted to know what remarks you guys could make regarding my code. I've included a small test set with 2D-vectors and 2 classes, but it works with higher dimensions and more classes. Here, it should sort all the elements starting with the same letters in the same classes (except ak, with is quite in between).

What I'm interested in:

Did I do something unpythonic?

Are there unclear or unnecessary parts?

What changes could drastically improve the code's performance?

```
from __future__ import print_function
from operator import add, sub
from _collections import defaultdict
import random as rd
import math

class Element:
def __init__(self, idf, data, cls='ukn'):
self.idf = idf
self.data = data
self.cls = cls

def __str__(self):
return '%s: %s -> %s' % (str(self.idf), str(self.data), str(self.cls))

def eucl_dist_to(self, other):
diff = map(sub, self.data, other.data)
norm2 = math.sqrt(sum([math.pow(coord, 2) for coord in diff]))
return norm2

def main():
test_data_tuple = [('aa', 1, 7), ('ab', 1, 8), ('ac', 1, 9), ('ba', 2, 1), ('bb', 2, 3), ('ad', 2, 8), ('ae', 2, 9),
('bc', 3, 2), ('bd', 3, 4), ('af', 3, 7), ('be', 4, 2), ('bf', 4, 3), ('ag', 4, 8),
('bg', 5, 1), ('bh', 5, 3), ('ah', 5, 6), ('bi', 6, 2), ('bj', 6, 4), ('ai', 6, 6), ('aj', 6, 7),
('bk', 7, 0), ('bl', 7, 2), ('bm', 7, 3), ('ak', 7, 5)]
test_data_elt = list()
for t in test_data_tuple:
test_data_elt.append(Element(t[0], tuple(t[1:])))
# test_data_elt.sort(key=lambda e: e.idf)
results = k_means(test_data_elt, 2, 5)

def k_means(elts, nb_classes, nb_steps):
k = nb_classes
#init
centroids = list()
cls_content = defaultdict(list)
for cls_nb in range(k):
cls_name = 'class_'+str(cls_nb)
centroids.append(Element(cls_name, rd.choice(elts).data, cls=

Solution

something unpythonic?

Consider using plural identifiers: test_data_tuples, test_data_elts. Later, for example, you use a perfectly lovely for elt in elts: idiom.

In Element __init__, I wouldn't mind seeing cls='unknown' spelled out, but I am really missing a docstring that explains what the heck idf stands for. Also, as near as I can tell, data really is a coord, e.g. (x, y).

Kudos on k = nb_classes. I feel you did a good job of making the public API very clear, and then within the body k is conventional.

I can't say I'm a fan of import random as rd. It needlessly obscured the fairly obvious meaning of rd.choice(elts).data.

The for itr ... loop exhibits admirable clarity.

Testing if len(elts_in) == 0:, plus the map, seems like it's maybe a bit clunky. I'm just not understanding why empty is a special case, perhaps we could use sum and lambda, or a helper function? It seems like we could always find something sensible to assign to c.data, even in the empty case. Or maybe what I'm really looking for is an if that rapidly moves on to a continue.

The print('\n'+str(c)) is not very pythonic. I'd prefer print('\n%s' % c). Ok, that's the last of the tiny nits I will pick.

Overall, it looks like fairly solid code.

changes could drastically improve the code's performance?

Sorry, don't see anything obvious that would help.

Context

StackExchange Code Review Q#151882, answer score: 2

Revisions (0)

No revisions yet.