K-means clustering in Python
Problem
The following code uses scikit-learn to carry out K-means clustering with $K = 4$ on a wine-marketing example from the book Data Smart. The book uses Excel, but I wanted to learn Python (including NumPy and SciPy), so I implemented the example in Python. The clustering itself is done by scikit-learn; for now I am mainly interested in getting the data into my program and getting the answer out.
I'm new to Python so any advice on style or ways to write my code in a more idiomatic way would be appreciated.
The CSV files needed (in the same directory as the program code) can be produced by downloading "Chapter 2" from the book link above and saving the first and second sheets of the resulting Excel file as CSV.
```
# -*- coding: utf-8 -*-
"""
A program to carry out Kmeans clustering where K=4
on data relating to wine marketing from book
"Data Smart: Using Data Science to Transform Information into Insight"
Requires csv input file OfferInfo.csv with headings
'Campaign', 'Varietal', 'Minimum Qty (kg)', 'Discount (%)', 'Origin', 'Past Peak'
and input file Transactions.csv with headings
'Customer Last Name', 'Offer #'
"""
#make more similar to Python 3
from __future__ import print_function, division, absolute_import, unicode_literals
#other stuff we need to import
import csv
import numpy as np
from sklearn.cluster import KMeans
#beginning of main program
#read in OfferInfo.csv
csvf = open('OfferInfo.csv','rU')
rows = csv.reader(csvf)
offer_sheet = [row for row in rows]
csvf.close()
#read in Transactions.csv
csvf = open('Transactions.csv','rU')
rows = csv.reader(csvf)
transaction_sheet = [row for row in rows]
csvf.close()
#first row of each spreadsheet is column headings, so we remove them
offer_sheet_data = offer_sheet[1:]
transaction_sheet_data = transaction_sheet[1:]
K=4 #four clusters
num_deals = len(offer_sheet_data) #assume listed offers are distinct
#find the sorted list of customer last names
customer_names =
```
Solution
One obvious improvement would be to break the code up a bit more - identify standalone pieces of functionality and put them into functions, e.g.:
```
def read_data(filename):
    csvf = open(filename,'rU')
    rows = csv.reader(csvf)
    data = [row for row in rows]
    csvf.close()
    return data

offer_sheet = read_data('OfferInfo.csv')
transaction_sheet = read_data('Transactions.csv')
```
This reduces duplication and, therefore, possibilities for errors. It allows easier development, as you can create and test each function separately before connecting it all together. It also makes it easier to improve the functionality, in this case by adopting the `with` context manager:
```
def read_data(filename):
    with open(filename, 'rU') as csvf:
        return [row for row in csv.reader(csvf)]
```
You make that change in only one place and everywhere that calls it benefits.
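The point about testing each function on its own, before wiring it into the rest of the program, could look something like the sketch below. It assumes Python 3, where `open(filename, newline='')` is the form the `csv` documentation recommends and stands in for the Python 2 `'rU'` mode used above. The `check_read_data` helper and the sample rows are made up purely for illustration.

```python
import csv
import os
import tempfile


def read_data(filename):
    """Return all rows of a CSV file as a list of lists of strings."""
    # Assumption: Python 3. newline='' lets the csv module handle line
    # endings itself, replacing the Python 2 'rU' mode.
    with open(filename, newline='') as csvf:
        return list(csv.reader(csvf))


def check_read_data():
    """Hypothetical helper: write a tiny CSV to a temp file and read it back."""
    handle, path = tempfile.mkstemp(suffix='.csv')
    try:
        with os.fdopen(handle, 'w', newline='') as tmp:
            # Made-up sample rows using the Transactions.csv headings.
            csv.writer(tmp).writerows([
                ['Customer Last Name', 'Offer #'],
                ['Smith', '2'],
            ])
        return read_data(path)
    finally:
        # Clean up the temporary file whether or not reading succeeded.
        os.remove(path)


rows = check_read_data()
```

Because `read_data` takes the filename as a parameter, it can be exercised against a throwaway file like this without touching the real spreadsheets.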
I would also have as little code as possible at the top level. Instead, move it inside an enclosing function, and only call that function if we're running the file directly:
```
def analyse(offer_file, transaction_file):
    offer_sheet = read_data(offer_file)
    transaction_sheet = read_data(transaction_file)
    ...

if __name__ == "__main__":
    analyse('OfferInfo.csv', 'Transactions.csv')
```
This makes it easier to `import` the code you develop elsewhere without running the test/demo code.
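As a rough illustration of how the guard keeps the module importable, here is one way the pieces might fit together. It assumes Python 3; the `main` wrapper and its usage message are hypothetical additions, and `analyse` here only strips the header rows, as a stand-in for the real analysis.

```python
import csv
import sys


def read_data(filename):
    """Return all rows of a CSV file as a list of lists of strings."""
    with open(filename, newline='') as csvf:
        return list(csv.reader(csvf))


def analyse(offer_file, transaction_file):
    """Load both sheets and return their data rows without the headers.

    Placeholder only: the real clustering work would go here.
    """
    offer_rows = read_data(offer_file)[1:]
    transaction_rows = read_data(transaction_file)[1:]
    return offer_rows, transaction_rows


def main(argv):
    """Hypothetical command-line wrapper; returns an exit status."""
    if len(argv) != 3:
        print('usage: <script> OFFER_CSV TRANSACTION_CSV')
        return 2
    offers, transactions = analyse(argv[1], argv[2])
    print('{} offers, {} transactions'.format(len(offers), len(transactions)))
    return 0


# Runs only when the file is executed directly, not when it is imported.
if __name__ == "__main__":
    main(sys.argv)
```

Importing this module from another script or an interactive session defines `read_data`, `analyse` and `main` but reads no files, so each function can be called with whatever paths the caller chooses.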
Context
StackExchange Code Review Q#52029, answer score: 7