Conditional removal of columns in sparse matrix

Submitted by: @import:stackexchange-codereview · Mar 10, 2026

Problem

I have a large dataset (78k instances x 490k features) that is loaded as a scipy.sparse.csr_matrix. From this dataset I want to remove the features (i.e. columns) for which all values fall below a certain threshold.

Loading the dataset as a dense matrix is not an option, nor did I find a sparse matrix operation that does the job (please correct me if I am wrong on the latter). So I took a column-iteration approach for each feature group using multiprocessing:

1. Divide the column indices into n = n_cores roughly equal groups.

2. For every index group, spawn a process that iterates over each column and uses the built-in .all() to check the comparison condition. Collect all indices that should be deleted in a list (order does not matter).

3. Drop the columns from the full dataset matrix X based on the list of indices.

On a 48-core 2.50 GHz machine, this takes 42 minutes on my dataset. I feel that the .all() conditional check in .get_filtered_cols in particular should be optimized. Any other recommendations are certainly welcome.

Code with smaller simulated dataset:

```
import numpy as np
from scipy.sparse import csr_matrix
import multiprocessing

# Initiate simulated random sparse csr matrix as dataset X. Actual use case is 78k x 490k.
N = 780; M = 4900
X = np.random.choice([0, 1, 2, 3, 4], size=(N,M), p=[0.99, 0.005, 0.0025, 0.0015, 0.001]) # this is a rough
# simulation of the type of data in the use case (of course upperbound of some features is much higher)
X = csr_matrix(X, dtype=np.float32) # the real-use svmlight dataset can only be loaded as sparse.csr_matrix

# The settings of the feature groups to be filtered. Contains the range of the feature group in the dataset and the
# threshold value.
ngram_fg_dict = {"featuregroup_01": {"threshold": 3, "start_idx": 0, "end_idx": 2450},
                 "featuregroup_02": {"threshold": 4, "start_idx": 2451, "end_idx": 4900}}
n_cores = 3

def list_split(lst, n):
    '''Split a list into roughly equal n groups'''
```

Solution

Assuming that the threshold is positive, you can use the >= operator to construct a sparse Boolean matrix indicating which entries are greater than or equal to the threshold:

```
# m is your dataset in sparse matrix representation
above_threshold = m >= v["threshold"]
```

and then you can use the max method to get the maximum entry in each column:

```
cols = above_threshold.max(axis=0)
```

This will be 1 for columns that have any value greater than or equal to the threshold, and 0 for columns where all values are below the threshold. So cols is a mask for the columns you want to keep. (If you need a Boolean array, then use cols == 1.)
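To make the two steps above concrete, here is a minimal sketch that applies them per feature group and then drops the unwanted columns. The tiny matrix, the `feature_groups` list, and its half-open `[start, end)` column ranges are illustrative assumptions of mine, not the layout used in the question, and every threshold is assumed to be positive.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Small hand-made matrix (illustrative data, not taken from the question).
X = csr_matrix(np.array([
    [0, 1, 0, 4, 0],
    [0, 0, 0, 0, 2],
    [3, 0, 0, 1, 0],
], dtype=np.float32))

# Hypothetical feature groups: half-open column ranges [start, end),
# each with its own positive threshold.
feature_groups = [
    {"start": 0, "end": 3, "threshold": 3},
    {"start": 3, "end": 5, "threshold": 4},
]

keep = np.zeros(X.shape[1], dtype=bool)
for group in feature_groups:
    block = X[:, group["start"]:group["end"]]
    # Sparse Boolean matrix marking entries >= threshold.
    above_threshold = block >= group["threshold"]
    # Column-wise maximum: nonzero where at least one entry passed.
    cols = above_threshold.max(axis=0)
    keep[group["start"]:group["end"]] = cols.toarray().ravel() != 0

# Keep only the columns that have at least one value >= their threshold.
X_filtered = X[:, np.flatnonzero(keep)]
print(keep)
print(X_filtered.shape)
```

With this toy data only columns 0 and 3 survive, since they are the only ones holding a value that reaches their group's threshold. Converting the mask to integer indices with `np.flatnonzero` is one way to do the final column selection; the rest of the matrix stays sparse throughout.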

(Updated after discussion in comments. I had some more complicated suggestions, but simpler is better.)
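The positive-threshold assumption matters because, for a threshold of zero or below, the implicit zeros would also satisfy the comparison and the Boolean result would no longer be sparse. As a variation on the approach above (not part of the original answer), one could take the column-wise maximum of the data itself and compare the resulting dense vector against the threshold. The helper name `keep_mask_via_column_max` and the toy matrix are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix

def keep_mask_via_column_max(X, threshold):
    """Return a Boolean mask of columns whose largest value is >= threshold.

    The sparse max takes implicit zeros into account, so the comparison is
    done on a dense vector with one entry per column rather than on the
    full matrix.
    """
    col_max = X.max(axis=0).toarray().ravel()  # length n_cols
    return col_max >= threshold

# Illustrative data, not taken from the question.
X = csr_matrix(np.array([
    [0, 1, 0, 4, 0],
    [0, 0, 0, 0, 2],
    [3, 0, 0, 1, 0],
], dtype=np.float32))

mask = keep_mask_via_column_max(X, 3)
X_filtered = X[:, np.flatnonzero(mask)]
print(mask)
```

The only dense object here is a vector with one entry per feature, which should stay small even for 490k columns.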

Code Snippets

```
# m is your dataset in sparse matrix representation
above_threshold = m >= v["threshold"]

cols = above_threshold.max(axis=0)
```

Context

StackExchange Code Review Q#138842, answer score: 4
