
Decision tree for binary classification

Submitted by: @import:stackexchange-codereview

Problem

I want to become a good Python programmer, so I'd like to know which of my coding practices I can improve. Overall I feel like a pretty solid programmer, but writing this code felt very "Java", so I am probably still not following idiomatic Python practices.

```
__author__ = "arthur"

import pandas as pd

pd.set_option('display.max_rows', 1000)

def indices_next(ls):
    indices = []
    for i, element in enumerate(ls):
        if i != 0:
            if element != ls[i-1]:
                indices.append(i)
    return indices

def summed_list(ls):
    for i, elt in enumerate(ls):
        if i != 0:
            ls[i] += ls[i-1]
    return ls

class TreeNode(object):
    class_counter = 0
    def __init__(self):
        self.name = TreeNode.class_counter
        TreeNode.class_counter += 1
        self.split_gini = -1000
        self.data = pd.DataFrame()
        self.node_type = "Node"
        self.node_gini = 1.0
        self.split_value = -1000
        self.split_attribute = ""
        self.parent = None
        self.left_child_node = None
        self.left_child_complete = False
        self.split_dict = dict()

        self.right_child_node = None
        self.right_child_complete = False
        self.level = 0

    def compute_gini_new_node(self):
        split_dict = self.data["OK"].value_counts().to_dict()
        self.split_dict = split_dict
        if len(split_dict) == 2:
            # print "Size of split_dict is 2"
            zero_count = float(split_dict[0])
            one_count = float(split_dict[1])
            gini = 1 - (zero_count/(zero_count+one_count))**2 - (one_count/(zero_count+one_count))**2
            self.node_gini = gini

class DecisionTree(object):
    def __init__(self):
        self.root = TreeNode()
        self.attributes = []
        self.used_attributes = set()

    def is_leaf_node(self, node):
        result = False
        data_ct = len(node.data)
        if len(node.split_dict) == 1:
            result

Solution

Disclaimer: I'm less concerned about how Pythonic the code is than about how understandable it is.

A Few Observations

- There are almost 400 lines of code. It is not easy to see how it all fits together.

- There are comments, but none of them paint the big picture. Many are just dead code, which impedes clarity.

- Tests are mixed in with production code. This adds bulk and makes the main logic more difficult to understand, not less.
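
As a rough illustration of the last observation, here is one way the Gini calculation could be pulled into a small pure function, with its checks kept in a `unittest` case instead of inline. The names `gini_impurity` and `GiniImpurityTest` are hypothetical and do not appear in the original code; in a real project the test class would live in its own test module rather than next to the production logic.

```python
import unittest


def gini_impurity(zero_count, one_count):
    """Gini impurity of a binary node: 1 - p0**2 - p1**2.

    Hypothetical helper; the original computes this inline in TreeNode.
    """
    total = float(zero_count + one_count)
    if total == 0:
        raise ValueError("cannot compute Gini impurity of an empty node")
    p_zero = zero_count / total
    p_one = one_count / total
    return 1.0 - p_zero ** 2 - p_one ** 2


# In a real project this class would live in a separate test module.
class GiniImpurityTest(unittest.TestCase):
    def test_pure_node_has_zero_impurity(self):
        self.assertAlmostEqual(gini_impurity(10, 0), 0.0)

    def test_even_split_has_maximum_impurity(self):
        self.assertAlmostEqual(gini_impurity(5, 5), 0.5)
```

With the calculation isolated like this, the tree-building code only has to call the function, and the tests document the expected behaviour without adding bulk to the main logic.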

A Few Suggestions

- Make the code more modular by removing tests from the production logic.

- Delete dead code. Consider using version control to maintain historical investigations instead.

- Consider writing an overall description of the program, so that anyone reading it, including yourself a week from now, can more quickly understand what the code does and how it hangs together to do it.

- Consider better names.

  - What does `tuple_dict` represent, i.e. what are these tuples of?

  - `zero` might be better labelled as `zero_coefficient`, unless it is 0.

  - Why `TreeNode` instead of `Node`? The rest of the program uses `node` to refer to nodes.

- Why does `TreeNode` contain so many magic numbers? Perhaps these should be passed in as parameters, for example read from a file?
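
One possible way to act on the naming and magic-number points is sketched below. It assumes that the `-1000` values in `TreeNode` are placeholders meaning "not computed yet" and that `"OK"` is the name of the target column; both are guesses based on the posted excerpt. `Node`, `target_column`, and `has_split` are illustrative names, not part of the original code.

```python
class Node(object):
    """Hypothetical simplified node: values are passed in, not hard-coded."""

    def __init__(self, name, data, target_column="OK", level=0, parent=None):
        self.name = name
        self.data = data
        # The target column was hard-coded as "OK"; here the caller chooses it.
        self.target_column = target_column
        self.level = level
        self.parent = parent
        # None says "not computed yet" more clearly than a sentinel like -1000.
        self.split_value = None
        self.split_attribute = None
        self.split_gini = None
        self.left_child = None
        self.right_child = None

    def has_split(self):
        """Return True once a split attribute has been chosen for this node."""
        return self.split_attribute is not None


root = Node(name=0, data=[])
child = Node(name=1, data=[], level=1, parent=root)
```

Passing values through the constructor keeps the defaults visible in one place, and using `None` avoids the risk of a sentinel such as `-1000` being mistaken for a real split value.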

Context

StackExchange Code Review Q#82957, answer score: 3
