MNIST Deep Neural Network in TensorFlow

Submitted by: @import:stackexchange-codereview

Problem

I have been working on this code for a while, and it gave me a lot of headaches before I got it to work. It uses the MNIST dataset to classify handwritten digits. I am not using the prepackaged MNIST in TensorFlow because I want to learn to preprocess the data myself and to gain a deeper understanding of TensorFlow.

It's finally working, but I would love it if someone with expertise could take a look at it and tell me what they think, and if the results it's producing are actually real stats or if it's overfitting.

It's giving me accuracy between 80% and 91%. The dataset I'm using is from here.

```
import numpy as np
import tensorflow as tf
sess = tf.Session()
from sklearn import preprocessing
import matplotlib.pyplot as plt
with tf.Session() as sess:
# lets load the file
train_file = 'mnist_train.csv'
test_file = 'mnist_test.csv'
#train_file = 'mnist_train_small.csv'
#test_file = 'mnist_test_small.csv'

train = np.loadtxt(train_file, delimiter=',')
test = np.loadtxt(test_file, delimiter=',')

x_train = train[:,1:785]
y_train = train[:,:1]

x_test = test[:,1:785]
y_test = test[:,:1]
print(x_test.shape)

# lets normalize the data
def normalize(input_data):
    minimum = input_data.min(axis=0)
    maximum = input_data.max(axis=0)
    #normalized = (input_data - minimum) / ( maximum - minimum )
    normalized = preprocessing.normalize(input_data, norm='l2')
    return normalized

# convert to a onehot array
def one_hot(input_data):
    one_hot = []
    for item in input_data:
        if item == 0.:
            one_h = [1.,0.,0.,0.,0.,0.,0.,0.,0.,0.]
        elif item == 1.:
            one_h = [0.,1.,0.,0.,0.,0.,0.,0.,0.,0.]
        elif item == 2.:
            one_h = [0.,0.,1.,0.,0.,0.,0.,0.,0.,0.]
        elif item == 3.:
            one_h = [0.,0.,0.,1.,0.,0.,0.,0.,0.,0.]
        elif item == 4.:
            one_h = [0.
```

Solution

Although it is unlikely to have a large impact, I found a potential source of overfitting in your code:

```
# lets normalize the data
def normalize(input_data):
    minimum = input_data.min(axis=0)
    maximum = input_data.max(axis=0)
    #normalized = (input_data - minimum) / ( maximum - minimum )
    normalized = preprocessing.normalize(input_data, norm='l2')
    return normalized
```


When training a model, you should always consider the complete pipeline. Wherever dataset properties are used to adapt the pipeline, only the training data should be used.

The preprocessing step (the normalization) needs to be trained as well. You would therefore have to fit it on the training data and then use it to transform the test data, without using the min and max of the test data.

Data leakage, that is, using properties of the test data in your model, can result in overfitting.
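To make the fit-on-train, transform-on-test idea concrete, here is a minimal sketch using scikit-learn's `MinMaxScaler`. The helper name `fit_and_apply_min_max` and the tiny demo arrays are my own assumptions for illustration, not part of your code. As far as I can tell, the active `preprocessing.normalize(..., norm='l2')` call rescales each row on its own, so the fitting concern mainly applies to the commented-out min-max variant.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler


def fit_and_apply_min_max(x_train, x_test):
    """Fit min-max scaling on the training set and reuse it for the test set.

    Hypothetical helper for illustration; not part of the original code.
    """
    scaler = MinMaxScaler()
    # Learn the per-feature minimum and maximum from the training data only.
    x_train_scaled = scaler.fit_transform(x_train)
    # Reuse the training statistics; the test data is never used for fitting.
    x_test_scaled = scaler.transform(x_test)
    return x_train_scaled, x_test_scaled, scaler


# Tiny stand-in arrays (made-up values, not MNIST).
demo_train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
demo_test = np.array([[2.5, 40.0]])
demo_train_scaled, demo_test_scaled, demo_scaler = fit_and_apply_min_max(
    demo_train, demo_test
)
```

Because the scaler only knows the training range, test values outside that range can end up outside [0, 1]. That is expected and is not a sign that something went wrong.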

See Medium and datascience.stackexchange for details, such as:

Most practitioners — including myself — typically drop their full
dataset into the same collection and normalize it all at once before
splitting the data into test and evaluation. While the code for this
approach will be cleaner, this breaks fundamental assumptions about
data leakage. Most importantly, we are using information from data
that will appear in both the test and training data. This is because
our mean and standard deviation will be based on the full dataset, not
just the training data.

Some practitioners will normalize the two
datasets separately, using different means and standard deviations.
This is also incorrect since it breaks the assumption that the data is
drawn from the same distribution.

Mr. Guts tells us that in order to
remedy this, we must first separate our data into training and test
sets. Then, once we normalize the training set, we apply the mean and
standard deviation to the normalization of the test set. This is a
very subtle source of data leakage that most are apt to miss, but
important to creating the best machine learning model possible.
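The procedure in the quote (split first, compute the mean and standard deviation on the training set, then apply them to the test set) could look roughly like this in plain NumPy. The function name `standardize_with_train_stats`, the `eps` guard for constant features, and the random toy data are assumptions added for illustration.

```python
import numpy as np


def standardize_with_train_stats(x_train, x_test, eps=1e-8):
    """Standardize both sets using the training mean and standard deviation.

    Hypothetical helper for illustration; not part of the original code.
    """
    # Statistics come from the training data only.
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)
    # Assumed safeguard: avoid dividing by zero for features that never
    # change in the training set (e.g. always-blank border pixels).
    std = np.where(std < eps, 1.0, std)
    return (x_train - mean) / std, (x_test - mean) / std


# Random toy data standing in for the real train/test split.
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
toy_train[:, 2] = 0.0  # a constant feature in the training set
toy_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

toy_train_std, toy_test_std = standardize_with_train_stats(toy_train, toy_test)
```

The test set will generally not have exactly zero mean and unit variance after this step, since it is scaled with the training statistics rather than its own.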

Context

StackExchange Code Review Q#159660, answer score: 2
