Imputing values with non-negative matrix factorization

Problem

X is a DataFrame with about 90% missing values and around 10% actual values. My goal is to use NMF in a successive imputation loop to predict the actual values I have hidden. The mask, msk, selects a random 80% of the actual values (that is, 80% of the 10% that are present). I initialize everything except these 80% to 0 and begin to impute them.

The line that defines msk looks odd because I couldn't find a direct way to get a random 80% (the train set) of the values that aren't np.nan. It relies on the fact that adding a number to np.nan leaves np.nan: after I add random noise to X.values and subtract X.values back off, the only entries that are affected are the non-null ones, so the comparison can only be True for non-null entries. This gives me a random 80% of the non-null values.

import pandas as pd
from pandas import DataFrame
import numpy as np

from sklearn.decomposition import ProjectedGradientNMF

# toy example data, actual data is ~500 by ~ 250
customers = range(20)
features = range(15)

toy_vals = np.random.random(20*15).reshape((20,15))
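toy_mask = toy_vals < 0.9
toy_vals[toy_mask] = np.nan

X = DataFrame(toy_vals, index=customers, columns=features)

# hide values to test imputation
X_imputed = X.copy()
msk = (X.values + np.random.randn(*X.shape) - X.values) < 0.8
X_imputed.values[~msk] = 0

nmf_model = ProjectedGradientNMF(n_components=5)
W = nmf_model.fit_transform(X_imputed.values)
H = nmf_model.components_

while nmf_model.reconstruction_err_**2 > 10:
    nmf_model.fit_transform(X_imputed.values)
    W = nmf_model.fit_transform(X_imputed.values)
    H = nmf_model.components_
    X_imputed.values[~msk] = W.dot(H)[~msk]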


I'm pretty sure this can be written in fewer lines but I'm not sure how to do it.
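
The NaN-propagation trick described above can be checked in isolation. The following is a minimal sketch using only NumPy, not the asker's exact code: the seed, the array shape, and the names msk_trick, observed and msk_explicit are assumptions made for illustration. It also shows a more explicit way to build the same kind of mask with np.isnan. Note that comparing standard-normal noise against 0.8 keeps roughly 79% of the observed cells rather than exactly 80%, whereas a uniform draw below 0.8 keeps 80% in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Toy matrix with roughly 90% missing entries, mirroring the setup above.
vals = rng.random((20, 15))
vals[vals < 0.9] = np.nan

noise = rng.standard_normal(vals.shape)

# The trick from the question: NaN + noise - NaN is still NaN, and NaN < 0.8
# evaluates to False, so a missing cell can never end up in the mask.
with np.errstate(invalid="ignore"):
    msk_trick = (vals + noise - vals) < 0.8

# A more explicit spelling: keep a cell if it is observed AND wins a
# uniform coin flip with probability 0.8.
observed = ~np.isnan(vals)
msk_explicit = observed & (rng.random(vals.shape) < 0.8)

print("observed cells:        ", observed.sum())
print("kept by NaN trick:     ", msk_trick.sum())
print("kept by explicit mask: ", msk_explicit.sum())
```

Both masks are False wherever the data is missing; the explicit version simply states that condition directly instead of relying on how NaN behaves in arithmetic and comparisons.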

Solution

- In the while loop, the first call to nmf_model.fit_transform() is superfluous and can be removed: its result is never used. The next line, W = nmf_model.fit_transform(X_imputed.values), does all the work. Removing the first call halves the number of model fits and speeds things up roughly twofold.

- You don't need to assign H outside/before the while loop.

- If minimizing the number of code lines is the goal, you can skip temporary variables and put the defining expression directly in the line that uses it. I did that for H in the while loop. It is more compact but probably harder to understand.

- You don't seem to need the full pandas module, so that import can be removed.

- I didn't change this in my code below, but why are you squaring nmf_model.reconstruction_err_? According to the docs, this error is the Frobenius norm of the difference matrix (X - WH), so it is never negative even without squaring.
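
To illustrate the last point, here is a small sketch that computes the Frobenius norm of X - WH directly with NumPy. The helper name frobenius_error and the random toy matrices are hypothetical and stand in for a fitted model. Because the norm is non-negative, testing err**2 > 10 selects the same cases as testing err > sqrt(10), so the squaring only rescales the threshold.

```python
import numpy as np


def frobenius_error(X, W, H):
    """Frobenius norm of the residual X - WH (hypothetical helper)."""
    return np.linalg.norm(X - W @ H, ord="fro")


rng = np.random.default_rng(0)  # fixed seed, for reproducibility only

# Random stand-ins for a data matrix and its two non-negative factors.
X = rng.random((6, 4))
W = rng.random((6, 2))
H = rng.random((2, 4))

err = frobenius_error(X, W, H)
threshold = 10.0

# Squaring a non-negative number is monotonic, so these two tests agree.
squared_test = err**2 > threshold
plain_test = err > np.sqrt(threshold)

print(err, squared_test, plain_test)
```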

This is a tad more compact and significantly faster (because of item 1):

from pandas import DataFrame
import numpy as np
from sklearn.decomposition import ProjectedGradientNMF

# Example data matrix X
nrows, ncols = 200, 150
toy_vals = np.random.random(nrows*ncols).reshape((nrows, ncols))
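toy_vals[toy_vals < 0.9] = np.nan
X = DataFrame(toy_vals, index=range(nrows), columns=range(ncols))

# Hiding values to test imputation
X_imputed = X.copy()
msk = (X.values + np.random.randn(*X.shape) - X.values) < 0.8
X_imputed.values[~msk] = 0

# Initializing model
nmf_model = ProjectedGradientNMF(n_components=5)
nmf_model.fit(X_imputed.values)

# iterate model
while nmf_model.reconstruction_err_**2 > 10:
    W = nmf_model.fit_transform(X_imputed.values)
    X_imputed.values[~msk] = W.dot(nmf_model.components_)[~msk]
    print(nmf_model.reconstruction_err_)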
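
For readers on a current toolchain, the same loop can be sketched with sklearn.decomposition.NMF, since ProjectedGradientNMF was, as far as I know, removed from later scikit-learn releases. This is a variation under stated assumptions, not the code from the answer: it works on a plain NumPy array instead of a DataFrame, builds the mask explicitly with np.isnan, and adds a hypothetical max_rounds cap so the loop cannot run forever if the error threshold is never reached. The seed, init="random" and max_iter=500 are likewise illustrative choices.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only

# Example data matrix with roughly 90% missing values.
nrows, ncols = 200, 150
vals = rng.random((nrows, ncols))
vals[vals < 0.9] = np.nan

# Training mask: about 80% of the observed cells.
observed = ~np.isnan(vals)
msk = observed & (rng.random(vals.shape) < 0.8)

# Everything outside the training mask starts at 0.
X_imputed = np.where(msk, vals, 0.0)

model = NMF(n_components=5, init="random", random_state=0, max_iter=500)

max_rounds = 25  # hypothetical safety cap, not part of the original code
threshold = 10.0  # same squared-error threshold as in the loop above

for round_no in range(max_rounds):
    # One fit per round; its result is used directly.
    W = model.fit_transform(X_imputed)
    X_imputed[~msk] = (W @ model.components_)[~msk]
    if model.reconstruction_err_ ** 2 <= threshold:
        break

# Score the imputation on the observed values that were hidden from the model.
hidden = observed & ~msk
rmse = np.sqrt(np.mean((X_imputed[hidden] - vals[hidden]) ** 2))
print("rounds run:", round_no + 1, "held-out RMSE:", rmse)
```

The held-out RMSE at the end is one way to check whether the imputed values resemble the hidden ones, which is the stated goal of the question; the reconstruction error alone only measures how well WH fits the current, partly imputed matrix.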

Code Snippets

from pandas import DataFrame
import numpy as np
from sklearn.decomposition import ProjectedGradientNMF

# Example data matrix X
nrows, ncols = 200, 150
toy_vals = np.random.random(nrows*ncols).reshape((nrows, ncols))
toy_vals[toy_vals < 0.9] = np.nan
X = DataFrame(toy_vals, index=range(nrows), columns=range(ncols))

# Hiding values to test imputation
X_imputed = X.copy()
msk = (X.values + np.random.randn(*X.shape) - X.values) < 0.8
X_imputed.values[~msk] = 0

# Initializing model
nmf_model = ProjectedGradientNMF(n_components=5)
nmf_model.fit(X_imputed.values)

# iterate model
while nmf_model.reconstruction_err_**2 > 10:
    W = nmf_model.fit_transform(X_imputed.values)
    X_imputed.values[~msk] = W.dot(nmf_model.components_)[~msk]
    print(nmf_model.reconstruction_err_)

Context

StackExchange Code Review Q#96725, answer score: 5