Random forest and machine learning

Problem

I am quite new to using Python for machine learning. I come from a background of programming in Fortran, so as you may imagine, Python is quite a leap. I work in chemistry and have become involved in cheminformatics (applying data science techniques to chemistry). As such, the application of Python's extensive machine learning libraries is important. I also need my code to be efficient. I have written code which runs and seems to work OK. What I would like to know is:

- How best to improve it or make it more efficient.
- Any suggestions on alternative formulations to those I have used and, if possible, a reason why another route may be superior?

I tend to work with continuous data and regression models.

Edit:

Thank you for all of your comments so far. Apologies for the indentation error; this was a copy mistake.

To give a few more details, I am aiming to use the code to make predictions of chemical properties such as toxicity, melting point, solubility etc. These sorts of properties are the focus of research efforts in academia and industry to provide pre-screening of target molecules for certain properties.

The data I am providing as input is a CSV file. The first column is a label (molecule name). The last column is the target value from experiment or quantum chemical calculation. The columns in between are descriptors calculated from some molecular structure format (2D SMILES, 3D crystal structure etc). An example of a subset of the data is below:

Typically there would be 100 to 150 descriptors which provide information, and between 100 and several thousand examples. These examples need to be split into training and test sets.

End Edit

```
import scipy
import math
import numpy as np
import pandas as pd
import plotly.plotly as py
import os.path
import sys

from time import time
from sklearn import preprocessing, metrics, cross_validation
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklear
```

Solution

Disclaimer: I'm not familiar with the work you're doing, so I'll limit my comments to the structure of your code. Generally, I'd make the following changes:

  • Split data and code
  • Write more functions
  • Don't repeat yourself

Split data and code

You have a lot of lines that contain both hardcoded information to display to the user and calculations that are being performed. At the very least, I could imagine splitting this into two classes: a calculation class and a reporter. Put all the logic for extracting information into the calculation class, then pass its results to the reporter to display to the user. The calculation class can initialise all of your empty arrays and contain methods for operating on data.

Perhaps later you won't want to print everything, you'll want to save it to a logfile, put on a website, etc. Perhaps you'll want to report everything the same way, but you want to compare two different algorithms. Separating the logic and how you report the results will make this much simpler.

Much of the information on what to do with what data is tied up in lines like this:

print round(RFpreds[i],2),'\t', round(RFpreds[i+1],2),'\t', round(RFpreds[i+2],2),'\t', round(RFpreds[i+3],2),'\t', round(RFpreds[i+4],2)
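
One way to separate the two concerns, as a minimal sketch: one function arranges the numbers and another decides how they are displayed. The names `chunk_rows` and `format_rows` are hypothetical, a plain list stands in for `RFpreds`, and the code is written for Python 3.

```python
def chunk_rows(values, width=5, ndigits=2):
    """Round the values and group them into rows of at most `width` items."""
    rounded = [round(float(v), ndigits) for v in values]
    return [rounded[i:i + width] for i in range(0, len(rounded), width)]


def format_rows(rows, sep="\t"):
    """Turn rows of numbers into separated text; no calculation happens here."""
    return "\n".join(sep.join(str(v) for v in row) for row in rows)


# Stand-in data; in the original code this would be the RFpreds array.
rf_preds = [1.231, 2.342, 3.453, 4.564, 5.671, 6.782, 7.893]
report = format_rows(chunk_rows(rf_preds))
print(report)
```

Sending the output to a logfile instead of the screen then only changes what is done with `report`. The chunking also copes with a number of predictions that is not a multiple of five, which the indexed `print` line does not.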


Write more functions

An excellent piece of advice I've heard is that if you write a block of code and put a comment at the top, you've probably just written something that should be a function. Something like the collection of if checks related to ytestdim seems like it could be abstracted out into its own function.

This allows you to review what your code is doing at each level of abstraction. Ultimately this makes debugging a lot easier, and it makes it especially easy to compare a mathematical technique you're familiar with in the real world with your code implementation of it.
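
The `ytestdim` checks themselves are not reproduced here, so the following is only a hypothetical illustration of the idea: a block of shape checks on a target array pulled out into one named function. The name `as_1d_target` and the specific checks are assumptions, not the original logic.

```python
import numpy as np


def as_1d_target(y):
    """Return y as a 1-D float array, or raise if it has more than one column."""
    arr = np.asarray(y, dtype=float)
    if arr.ndim == 1:
        return arr
    if arr.ndim == 2 and arr.shape[1] == 1:
        # A single-column 2-D array is flattened to 1-D.
        return arr.ravel()
    raise ValueError("expected a 1-D target, got shape %s" % (arr.shape,))


# Hypothetical test-set targets stored as a column vector.
y_test = as_1d_target([[0.5], [1.5], [2.5]])
print(y_test.shape)
```

The calling code then reads as a single step ("make the target one-dimensional"), and the checks can be tested on their own.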

Don't repeat yourself

I also see plenty of opportunities to combine collections of statements. For example:

print("n_estimators = %d " % RfGridSearch.best_params_['n_estimators'])
ne = RfGridSearch.best_params_['n_estimators']
print("max_features = %s " % RfGridSearch.best_params_['max_features'])
mf = RfGridSearch.best_params_['max_features']
print("max_depth = %d " % RfGridSearch.best_params_['max_depth'])
md = RfGridSearch.best_params_['max_depth']


This is really a single operation performed 3 times. In situations like this, make a list ["n_estimators", "max_features", "max_depth"], and iterate over the list.

Making changes like this prevents simple spelling errors from creeping in between one line and the next, and makes it easier to pull behavior into classes and methods.
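
A minimal sketch of that loop, assuming Python 3 and using a plain dict as a stand-in for `RfGridSearch.best_params_` (the values shown are made up for illustration):

```python
# Stand-in for RfGridSearch.best_params_, which scikit-learn exposes as a dict.
best_params = {"n_estimators": 10, "max_features": "sqrt", "max_depth": 7}

param_names = ["n_estimators", "max_features", "max_depth"]

# One loop replaces the three print/assign pairs.
lines = ["%s = %s" % (name, best_params[name]) for name in param_names]
print("\n".join(lines))

# If the individual values are still needed, unpack them in the same order.
ne, mf, md = (best_params[name] for name in param_names)
```

Each parameter name is now typed once, so a misspelt key fails in one obvious place.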

Miscellaneous additional items:

If you're using the with open(...) as f: notation, you don't need to call close(); the file is closed automatically when the block ends.
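
As a small illustration (the file name and contents are hypothetical, and a temporary directory is used so nothing is left in the working directory):

```python
import os
import tempfile

# Hypothetical output path, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "rf_results.txt")

with open(path, "w") as handle:
    handle.write("max_depth = 7\n")

# The file is already closed here; this also holds if the block raises.
print(handle.closed)
```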

With the Python 2 print statement, a trailing comma suppresses the newline, so you can print multiple things on the same line with this notation:

print a,
print b,
print c
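
Under Python 3, where `print` is a function, the trailing-comma form is not available. A rough equivalent passes several arguments or sets the `end` parameter; the variables below are placeholders.

```python
a, b, c = 1.0, 2.5, "done"

# Several arguments are printed on one line, separated by `sep` (a space by default).
print(a, b, c)

# `end` replaces the trailing newline, so consecutive calls stay on one line.
print(a, end=" ")
print(b, end=" ")
print(c)
```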


It's easier to read dictionaries that are defined the following way:

rfparamgrid = {
    "n_estimators": [10],
    "max_features": ["auto", "sqrt", "log2"],
    "max_depth": [5,7],
}

Context

StackExchange Code Review Q#133965, answer score: 7
