patternMinor

Classification algorithm for high dimensional data which is uniquely definable in a very small sub-space

Submitted by: @import:stackexchange-cs·Mar 10, 2026·

Viewed 0 times

machine-learning classification cs stackoverflow

spacewhichhighsubdefinablealgorithmsmallforuniquelyvery

Problem

I am new to machine learning, so forgive me if I am doing something absolutely absurd.

I have a classification task (~100 classes) and have about 2 million training data points in a 2000-dimensional space. Coordinates of data points are integers (discrete). All points have non-zero coordinates only for < 10 dimensions. That is, each point can be uniquely defined in < 10 dimensional sub-space.

If I use a Gaussian Mixture Model (GMM) for each class, I will end up with ~100 GMMs in a 2000-dimensional space. I feel that given the fact that each point is uniquely definable in less than 10 dimensional space, there can possibly be a better way of doing it.

What am I missing here?

Solution

Since your data are extremely sparse, using GMMs or a traditional SVM will result in an over-fit model. By employing methods that exploit the sparsity of the structure, you should get much better results. Regression methods typically add some penalty function as a measure of the amount of non-zero values. This is usually referred to as "regularization". Doing this exactly (under the $L_0$ norm) is difficult, so relaxations are used (lasso: $L_1$, ridge: $L_2$).

If you are comfortable with Python, SciKit Learn has implementations for several approaches that utilize either lasso or ridge penalties. Here is an example of Lasso penalty for an SVM model. I would try that and see how well it works. It may require a parameter to adjust the regularization amount. You can do cross-validation to find what parameter works best.

Context

StackExchange Computer Science Q#22217, answer score: 3

Revisions (0)

No revisions yet.