Conditional removal of columns in sparse matrix
Problem
I have a large dataset (78k instances x 490k features) that is loaded as a scipy.sparse.csr_matrix. From this dataset I want to filter out certain features (i.e. columns) for which all values fall below a certain threshold.

Loading the dataset as a dense matrix is not an option, nor did I find a sparse matrix operation that does the job (please correct me if I am wrong on the latter). So I took a column-iteration approach for each feature group using multiprocessing:

- Divide the total column indices into n = n_cores roughly equal groups.
- For every index group, spawn a process that iterates over each column and uses the built-in .all() to check the comparison condition. Collect all indices that should be deleted in a list (order does not matter).
- Drop the columns in the full dataset matrix X based on the indices list.

On a 48-core@2.50GHz machine this takes 42 minutes on my dataset. I feel that especially the .all() conditional check in .get_filtered_cols should be optimized. Any other recommendations are certainly welcome.

Code with smaller simulated dataset:

```
import numpy as np
from scipy.sparse import csr_matrix
import multiprocessing

# Initiate simulated random sparse csr matrix as dataset X. Actual use case is 78k x 490k.
N = 780; M = 4900
X = np.random.choice([0, 1, 2, 3, 4], size=(N, M), p=[0.99, 0.005, 0.0025, 0.0015, 0.001])  # this is a rough
# simulation of the type of data in the use case (of course upperbound of some features is much higher)
X = csr_matrix(X, dtype=np.float32)  # the real-use svmlight dataset can only be loaded as sparse.csr_matrix

# The settings of the feature groups to be filtered. Contains the range of the feature group in the dataset and the
# threshold value.
ngram_fg_dict = {"featuregroup_01": {"threshold": 3, "start_idx": 0, "end_idx": 2450},
                 "featuregroup_02": {"threshold": 4, "start_idx": 2451, "end_idx": 4900}}
n_cores = 3

def list_split(lst, n):
    '''Split a list into roughly equal n groups'''
```
Solution
Assuming that the threshold is positive, you can use the >= operator to construct a sparse Boolean array indicating which points are above or equal to the threshold:

    # m is your dataset in sparse matrix representation
    above_threshold = m >= v["threshold"]

and then you can use the max method to get the maximum entry in each column:

    cols = above_threshold.max(axis=0)

This will be 1 for columns that have any value greater than or equal to the threshold, and 0 for columns where all values are below the threshold. So cols is a mask for the columns you want to keep. (If you need a Boolean array, then use cols == 1.)

(Updated after discussion in comments. I had some more complicated suggestions, but simpler is better.)
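
To make the mask idea concrete, here is a minimal sketch that applies it per feature group and then drops the unwanted columns. It is an illustration under stated assumptions, not the asker's actual code: the helper name columns_to_keep, the tiny 3 x 6 matrix, and the groups dictionary are all hypothetical, end_idx is treated as exclusive, and the Boolean result is cast to int8 so the column maximum is literally 0 or 1.

```python
import numpy as np
from scipy.sparse import csr_matrix


def columns_to_keep(block, threshold):
    """Boolean mask over the columns of `block`: True if any value >= threshold.

    Only meant for threshold > 0: implicit zeros then never pass the test,
    so the comparison result stays sparse.
    """
    above = (block >= threshold).astype(np.int8)  # sparse 0/1 matrix
    col_max = above.max(axis=0)                   # sparse, shape (1, n_cols)
    return col_max.toarray().ravel() == 1


# Tiny stand-in for the real data (hypothetical values): 3 rows x 6 columns.
X = csr_matrix(np.array([[0, 3, 0, 1, 0, 4],
                         [1, 0, 0, 2, 0, 0],
                         [0, 0, 0, 0, 5, 0]], dtype=np.float32))

# Hypothetical feature groups; end_idx is assumed to be exclusive here.
groups = {"group_a": {"threshold": 3, "start_idx": 0, "end_idx": 3},
          "group_b": {"threshold": 4, "start_idx": 3, "end_idx": 6}}

# Start by keeping everything, then overwrite the slice of each group.
keep = np.ones(X.shape[1], dtype=bool)
for v in groups.values():
    lo, hi = v["start_idx"], v["end_idx"]
    keep[lo:hi] = columns_to_keep(X[:, lo:hi], v["threshold"])

# Drop the columns whose values all fall below their group's threshold.
X_filtered = X[:, np.flatnonzero(keep)]
print(X_filtered.shape)  # (3, 3)
```

For a matrix as wide as the real dataset, converting once with X.tocsc() before slicing columns may be worth trying, since CSC stores the data column by column.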
Code Snippets
    # m is your dataset in sparse matrix representation
    above_threshold = m >= v["threshold"]
    cols = above_threshold.max(axis=0)

Context
StackExchange Code Review Q#138842, answer score: 4