patternpythonMinorCanonical
K-Means clustering in Python2
Viewed 0 times
clusteringpython2means
Problem
I've implemented the K-Means clustering algorithm in Python2, and I wanted to know what remarks you guys could make regarding my code. I've included a small test set with 2D-vectors and 2 classes, but it works with higher dimensions and more classes. Here, it should sort all the elements starting with the same letters in the same classes (except ak, with is quite in between).
What I'm interested in:
```
from __future__ import print_function
from operator import add, sub
from _collections import defaultdict
import random as rd
import math
class Element:
def __init__(self, idf, data, cls='ukn'):
self.idf = idf
self.data = data
self.cls = cls
def __str__(self):
return '%s: %s -> %s' % (str(self.idf), str(self.data), str(self.cls))
def eucl_dist_to(self, other):
diff = map(sub, self.data, other.data)
norm2 = math.sqrt(sum([math.pow(coord, 2) for coord in diff]))
return norm2
def main():
test_data_tuple = [('aa', 1, 7), ('ab', 1, 8), ('ac', 1, 9), ('ba', 2, 1), ('bb', 2, 3), ('ad', 2, 8), ('ae', 2, 9),
('bc', 3, 2), ('bd', 3, 4), ('af', 3, 7), ('be', 4, 2), ('bf', 4, 3), ('ag', 4, 8),
('bg', 5, 1), ('bh', 5, 3), ('ah', 5, 6), ('bi', 6, 2), ('bj', 6, 4), ('ai', 6, 6), ('aj', 6, 7),
('bk', 7, 0), ('bl', 7, 2), ('bm', 7, 3), ('ak', 7, 5)]
test_data_elt = list()
for t in test_data_tuple:
test_data_elt.append(Element(t[0], tuple(t[1:])))
# test_data_elt.sort(key=lambda e: e.idf)
results = k_means(test_data_elt, 2, 5)
def k_means(elts, nb_classes, nb_steps):
k = nb_classes
#init
centroids = list()
cls_content = defaultdict(list)
for cls_nb in range(k):
cls_name = 'class_'+str(cls_nb)
centroids.append(Element(cls_name, rd.choice(elts).data, cls=
What I'm interested in:
- Did I do something unpythonic?
- Are there unclear or unnecessary parts?
- What changes could drastically improve the code's performance?
```
from __future__ import print_function
from operator import add, sub
from _collections import defaultdict
import random as rd
import math
class Element:
def __init__(self, idf, data, cls='ukn'):
self.idf = idf
self.data = data
self.cls = cls
def __str__(self):
return '%s: %s -> %s' % (str(self.idf), str(self.data), str(self.cls))
def eucl_dist_to(self, other):
diff = map(sub, self.data, other.data)
norm2 = math.sqrt(sum([math.pow(coord, 2) for coord in diff]))
return norm2
def main():
test_data_tuple = [('aa', 1, 7), ('ab', 1, 8), ('ac', 1, 9), ('ba', 2, 1), ('bb', 2, 3), ('ad', 2, 8), ('ae', 2, 9),
('bc', 3, 2), ('bd', 3, 4), ('af', 3, 7), ('be', 4, 2), ('bf', 4, 3), ('ag', 4, 8),
('bg', 5, 1), ('bh', 5, 3), ('ah', 5, 6), ('bi', 6, 2), ('bj', 6, 4), ('ai', 6, 6), ('aj', 6, 7),
('bk', 7, 0), ('bl', 7, 2), ('bm', 7, 3), ('ak', 7, 5)]
test_data_elt = list()
for t in test_data_tuple:
test_data_elt.append(Element(t[0], tuple(t[1:])))
# test_data_elt.sort(key=lambda e: e.idf)
results = k_means(test_data_elt, 2, 5)
def k_means(elts, nb_classes, nb_steps):
k = nb_classes
#init
centroids = list()
cls_content = defaultdict(list)
for cls_nb in range(k):
cls_name = 'class_'+str(cls_nb)
centroids.append(Element(cls_name, rd.choice(elts).data, cls=
Solution
something unpythonic?
Consider using plural identifiers:
In Element
Kudos on
I can't say I'm a fan of
The
Testing
The
Overall, it looks like fairly solid code.
changes could drastically improve the code's performance?
Sorry, don't see anything obvious that would help.
Consider using plural identifiers:
test_data_tuples, test_data_elts. Later, for example, you use a perfectly lovely for elt in elts: idiom.In Element
__init__, I wouldn't mind seeing cls='unknown' spelled out, but I am really missing a docstring that explains what the heck idf stands for. Also, as near as I can tell, data really is a coord, e.g. (x, y).Kudos on
k = nb_classes. I feel you did a good job of making the public API very clear, and then within the body k is conventional.I can't say I'm a fan of
import random as rd. It needlessly obscured the fairly obvious meaning of rd.choice(elts).data.The
for itr ... loop exhibits admirable clarity.Testing
if len(elts_in) == 0:, plus the map, seems like it's maybe a bit clunky. I'm just not understanding why empty is a special case, perhaps we could use sum and lambda, or a helper function? It seems like we could always find something sensible to assign to c.data, even in the empty case. Or maybe what I'm really looking for is an if that rapidly moves on to a continue.The
print('\n'+str(c)) is not very pythonic. I'd prefer print('\n%s' % c). Ok, that's the last of the tiny nits I will pick.Overall, it looks like fairly solid code.
changes could drastically improve the code's performance?
Sorry, don't see anything obvious that would help.
Context
StackExchange Code Review Q#151882, answer score: 2
Revisions (0)
No revisions yet.