Random Forest Code Optimization
Problem
I am new to Python. I have built a random forest model in Python, but I don't think my code is optimized. Please look at my code and suggest where I have deviated from best practices.
Overview of the data I have:
The data has response columns and predictor columns. There is also a column 'TestOrTrainingDataRandom' which marks each row as test or training data. (There are also columns like index, Timestamp, etc. which have to be removed.) The predictor columns start at '3000' and end at '3680' in steps of 5 (i.e. 137 predictor columns in total). Some predictor columns are missing, so the missing predictor columns have been interpolated.
Here is my code. Please give feedback.
```
from sklearn.ensemble import RandomForestClassifier #Package for random forest classification
import pandas as pd
from sklearn.metrics import classification_report,confusion_matrix
data = pd.read_csv("combined_spectra_and_gas_params_3.csv") #Reading the file
#creating response column
data["response"] = None
s = pd.Series(["verylow","low","medium","high","veryhigh"], dtype="category")
n=0
for row in data["H2S"]:
    if row <120:
        data['response'].iloc[n] = s[0]
        n=n+1
    elif row >=120 and row <500:
        data['response'].iloc[n] = s[1]
        n=n+1
    elif row >=500 and row <1000:
        data['response'].iloc[n] = s[2]
        n=n+1
    elif row >=1000 and row <1300:
        data['response'].iloc[n] = s[3] #Assign 'high' if H2S concentration is between 1000 and 1300 and also do indexing
        n=n+1
    else:
        data['response'].iloc[n] = s[4] #Assign 'veryhigh' if H2S concentration is greater than 1300 and do indexing
        n=n+1
#create the training & test sets
b=len(data)
a=data.TestOrTrainingDataRandom [data.TestOrTrainingDataRandom == 1].count() #Count the number of training data
new_data=data.drop(data.columns[[0,1,2,3,4,5,6,7,8,121,120,119,118]], axis=1)
colnames=list(new_data)
len_column = len(new_data.columns)
len_iteration=len_column-1
j=3000;i=0;k=0
new_col=pd.DataFrame(index=range(0,b),columns=['temp'])
# To insert missing columns
while i < len_iteration:
    if int(colnames[i])== j:
        i=i+1
        j=j+5;
    el
```
Solution
I have notes on style and redundancies, but you should read the Python Style Guide, PEP 8. It has a lot of good info on how to format your code to be clear and readable for yourself and others. I'll miss pointing out some of its recommendations, so do read it too.
You have a lot of unnecessary inline comments. `from package import Class` makes it pretty clear that you're importing `Class` from `package`. Python is designed to be readable so that you don't need to explicitly tell people these things. I'd remove most of them. In particular, with your long `if` and `elif` chains it'd be better to cut down on repetition, both by having only one comment at the top and by putting `n=n+1` at the end. You can also rewrite that as `n += 1`, and you should put spaces around operators, which makes them easier to read:

```
for row in data["H2S"]:
    #Assign response based on H2S concentration and do indexing
    if row < 120:
        data['response'].iloc[n] = s[0]
    elif row >= 120 and row < 500:
        data['response'].iloc[n] = s[1]
    elif row >= 500 and row < 1000:
        data['response'].iloc[n] = s[2]
    elif row >= 1000 and row < 1300:
        data['response'].iloc[n] = s[3]
    else:
        data['response'].iloc[n] = s[4]
    n += 1
```

But actually you could just use `enumerate` for your `for` loop instead. It does the job you're using `n` for: it gives you the number of the iteration you're on. So you can save a line and just do this:

```
for n, row in enumerate(data["H2S"]):
    #Assign response based on H2S concentration and do indexing
    if row < 120:
        data['response'].iloc[n] = s[0]
    ...
    else:
        data['response'].iloc[n] = s[4]
```

No need for the `n += 1` any more.

You never use `len_column` except to assign to `len_iteration`, so just assign directly to `len_iteration`:

```
len_iteration = len(new_data.columns) - 1
```

You shouldn't use the semicolon line separator; it's rarely a good idea and often just looks unPythonic. Anyway, Python lets you assign multiple values at once by comma separating them:

```
j, i, k = 3000, 0, 0
```

Having them out of alphabetical order is a bit confusing though. You'd be better giving these meaningful names since their usage isn't simple.

In your `while` loop you increment `i` and `j` in both conditions, so you should just do that at the end and only run the middle block if your condition is `False`:

```
while i < len_iteration:
    if int(colnames[i]) != j:
        for k in range(0, b):
            new_col.iloc[k] = (new_data.iloc[k,i-1]+new_data.iloc[k,i+1])/2
        new_data.insert(i,str(j),new_col)
        colnames = list(new_data)
        len_iteration += 1
    j += 5
    i += 1
```

That block could really do with some comments though. Especially this line:

```
new_col.iloc[k] = (new_data.iloc[k, i - 1] + new_data.iloc[k, i + 1]) / 2
```

I have no idea what that's doing.
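If the intent of that line is to fill a missing wavenumber column from its neighbours (an assumption; the question only says the missing columns have been interpolated), the whole loop might be replaced by a `reindex` followed by a row-wise `interpolate`. This is a minimal sketch with made-up data, not the asker's method; `spectra`, `expected` and `filled` are illustrative names that do not appear in the original code:

```python
import pandas as pd

# Hypothetical spectra: the 3005 and 3015 columns are missing.
spectra = pd.DataFrame(
    {"3000": [1.0, 2.0], "3010": [3.0, 4.0], "3020": [5.0, 8.0]}
)

# Use integer column labels so they can be matched against the expected grid.
spectra.columns = spectra.columns.astype(int)

# Expected grid, shortened here from the question's 3000..3680 range.
expected = range(3000, 3021, 5)

# reindex adds the missing columns as NaN; interpolate(axis=1) then fills
# them linearly along each row from the nearest present columns.
filled = spectra.reindex(columns=expected).interpolate(axis=1)

# Restore string labels to match the original column naming.
filled.columns = filled.columns.astype(str)
```

For a single missing column this gives the midpoint of the two adjacent present columns. Columns missing at the very start of the range are left as NaN by the default settings and would need separate handling.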
You also use `m`, `n` and `p` for indexing. If you're not going to give them more meaningful names, you can at least re-use `i`, `j` and `k`, which are more commonly used for looping over indices.

Code Snippets
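As an alternative to the loop-based binning, the response column could probably be built in one vectorized step with `pd.cut`, which also avoids the chained `data['response'].iloc[n] = ...` assignment. This is a minimal sketch, assuming the thresholds from the `if`/`elif` chain; the sample values are made up and `labels` and `edges` are illustrative names:

```python
import pandas as pd

# Hypothetical sample; the real values come from the CSV in the question.
data = pd.DataFrame({"H2S": [50, 120, 499.9, 500, 1000, 1299, 1300, 2000]})

labels = ["verylow", "low", "medium", "high", "veryhigh"]

# Bin edges taken from the if/elif chain. With right=False each bin is
# closed on the left and open on the right, e.g. [120, 500).
edges = [-float("inf"), 120, 500, 1000, 1300, float("inf")]

# Produces a categorical column without an explicit Python loop.
data["response"] = pd.cut(data["H2S"], bins=edges, labels=labels, right=False)
```

One behavioural difference to be aware of: a missing `H2S` value falls through to the `else` branch in the loop version and becomes 'veryhigh', whereas `pd.cut` leaves it as NaN.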
```
for row in data["H2S"]:
    #Assign response based on H2S concentration and do indexing
    if row < 120:
        data['response'].iloc[n] = s[0]
    elif row >= 120 and row < 500:
        data['response'].iloc[n] = s[1]
    elif row >= 500 and row < 1000:
        data['response'].iloc[n] = s[2]
    elif row >= 1000 and row < 1300:
        data['response'].iloc[n] = s[3]
    else:
        data['response'].iloc[n] = s[4]
    n += 1
```

```
for n, row in enumerate(data["H2S"]):
    #Assign response based on H2S concentration and do indexing
    if row < 120:
        data['response'].iloc[n] = s[0]
    ...
    else:
        data['response'].iloc[n] = s[4]
```

```
len_iteration = len(new_data.columns) - 1
```

```
j, i, k = 3000, 0, 0
```

```
while i < len_iteration:
    if int(colnames[i]) != j:
        for k in range(0, b):
            new_col.iloc[k] = (new_data.iloc[k,i-1]+new_data.iloc[k,i+1])/2
        new_data.insert(i,str(j),new_col)
        colnames = list(new_data)
        len_iteration += 1
    j += 5
    i += 1
```

Context
StackExchange Code Review Q#106231, answer score: 3