
Conditional removal of columns in sparse matrix

Submitted by: @import:stackexchange-codereview

Problem

I have a large dataset (78k instances x 490k features) that is loaded as a scipy.sparse.csr_matrix. From this dataset I want to remove the features (i.e. columns) for which all values fall below a certain threshold.

Loading the dataset as a dense matrix is not an option, nor did I find a sparse matrix operation that does the job (please correct me if I am wrong on the latter). So I took a column-iteration approach for each feature group using multiprocessing:

  • Divide the total column indices into n = n_cores roughly equal groups.

  • For every index group, spawn a process that iterates over each column and uses the built-in .all() to check the comparison condition. Collect all indices that should be deleted in a list (order does not matter).

  • Drop the columns from the full dataset matrix X based on the list of indices.

On a 48-core@2.50GHz machine this takes 42 minutes on my dataset. I suspect that the .all() conditional check in .get_filtered_cols in particular could be optimized. Any other recommendations are certainly welcome.

Code with smaller simulated dataset:

```
import numpy as np
from scipy.sparse import csr_matrix
import multiprocessing

# Initiate simulated random sparse csr matrix as dataset X. Actual use case is 78k x 490k.
N = 780; M = 4900
X = np.random.choice([0, 1, 2, 3, 4], size=(N,M), p=[0.99, 0.005, 0.0025, 0.0015, 0.001]) # this is a rough
# simulation of the type of data in the use case (of course upperbound of some features is much higher)
X = csr_matrix(X, dtype=np.float32) # the real-use svmlight dataset can only be loaded as sparse.csr_matrix

# The settings of the feature groups to be filtered. Contains the range of the feature group in the dataset and the
# threshold value.
ngram_fg_dict = {"featuregroup_01": {"threshold": 3, "start_idx": 0, "end_idx": 2450},
                 "featuregroup_02": {"threshold": 4, "start_idx": 2451, "end_idx": 4900}}
n_cores = 3

def list_split(lst, n):
    '''Split a list into roughly equal n groups'''
```

Solution

Assuming the threshold is positive, you can use the >= operator to construct a sparse Boolean matrix indicating which entries are greater than or equal to the threshold:

```
# m is your dataset in sparse matrix representation
above_threshold = m >= v["threshold"]
```


and then you can use the max method to get the maximum entry in each column:

```
cols = above_threshold.max(axis=0)
```


This will be 1 for columns that have any value greater than or equal to the threshold, and 0 for columns where all values are below the threshold. So cols is a mask for the columns you want to keep. (If you need a Boolean array, then use cols == 1.)
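
As a small illustration of the two steps above, here is a sketch on a tiny hand-made matrix. The matrix values, the `threshold` variable, and the final conversion to a flat NumPy mask named `keep` are assumptions made for this example and are not part of the original answer.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Tiny stand-in for the real dataset (3 rows x 4 columns).
m = csr_matrix(np.array([
    [0, 3, 0, 1],
    [1, 0, 0, 0],
    [0, 4, 2, 0],
], dtype=np.float32))
threshold = 3  # assumed positive, so implicit zeros never satisfy the comparison

# Sparse Boolean matrix: True only where a stored value is >= threshold.
above_threshold = m >= threshold

# Column-wise maximum: a 1 x n_cols sparse result, truthy where any row matched.
cols = above_threshold.max(axis=0)

# Flatten to a 1-D Boolean mask (True = keep the column).
keep = cols.toarray().ravel().astype(bool)
```

Only the second column contains a value of at least 3, so it is the only column marked for keeping. Because the threshold is positive, the comparison never creates new stored entries, so the intermediate matrix stays as sparse as the input.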

(Updated after discussion in comments. I had some more complicated suggestions, but simpler is better.)

Code Snippets

```
# m is your dataset in sparse matrix representation
above_threshold = m >= v["threshold"]
cols = above_threshold.max(axis=0)
```
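
The snippet leaves `m` and `v` undefined. One way to put it to work, as a rough sketch: assume `v` is one entry of a feature-group settings dict like the one in the question and `m` is the matching column slice of `X`. The helper name `columns_to_keep`, the small `feature_groups` dict, and the half-open reading of `start_idx`/`end_idx` (end excluded) are all assumptions made for this example.

```python
import numpy as np
from scipy.sparse import csr_matrix

def columns_to_keep(m, threshold):
    """Return a 1-D Boolean mask, True for columns with any value >= threshold.

    Hypothetical helper; assumes threshold > 0.
    """
    above_threshold = m >= threshold
    cols = above_threshold.max(axis=0)
    return cols.toarray().ravel().astype(bool)

# Small stand-in dataset (3 rows x 6 columns).
X = csr_matrix(np.array([
    [0, 3, 0, 0, 1, 0],
    [1, 0, 0, 4, 0, 0],
    [0, 0, 2, 0, 0, 5],
], dtype=np.float32))

# Hypothetical settings; ranges are treated as half-open [start_idx, end_idx).
feature_groups = {
    "group_a": {"threshold": 3, "start_idx": 0, "end_idx": 3},
    "group_b": {"threshold": 4, "start_idx": 3, "end_idx": 6},
}

# Columns outside every group are kept by default.
keep = np.ones(X.shape[1], dtype=bool)
for v in feature_groups.values():
    start, end = v["start_idx"], v["end_idx"]
    m = X[:, start:end]  # column slice for this feature group
    keep[start:end] = columns_to_keep(m, v["threshold"])

# Select the surviving columns by integer index.
X_filtered = X[:, np.flatnonzero(keep)]
```

Column slicing is generally cheaper on a CSC matrix, so converting once with `X.tocsc()` before the loop may be worth trying on the full dataset.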

Context

StackExchange Code Review Q#138842, answer score: 4
