HiveBrain v1.2.0

Is it a problem that "successful" machine learning algorithms have large VC dimension?

Submitted by: @import:stackexchange-cs

Problem

In my limited exposure, it appears that "successful" machine learning algorithms tend to have very large VC dimension. For example, XGBoost is famous for being used to win the Higgs Boson Kaggle competition, and Deep Learning has made many headlines. Both paradigms are based on models that can be scaled to shatter any dataset, and both can incorporate boosting, which increases VC dimension.

According to VC dimension analysis, a large dimension is supposedly a bad thing: it allows models to overfit or memorize the data instead of generalizing. For example, if my model shatters every dataset (say, by placing a rectangle around every point), then it cannot extrapolate outside the dataset. My grid of rectangles tells me nothing about points outside the grid. The larger the VC dimension, the more likely the model is to shatter a dataset instead of generalizing, and thus to perform badly once exposed to new data outside the training set.
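To make the memorization failure concrete, here is a minimal sketch (the `Memorizer` class and the random data are hypothetical, not from any real library): a lookup-table classifier that can shatter any training set, yet can only guess on points it has never seen.

```python
import random

# A "memorizer" with unbounded VC dimension: it stores every training
# point exactly (so it can shatter any dataset), but it encodes nothing
# about points outside the table.
class Memorizer:
    def fit(self, X, y):
        self.table = {tuple(x): label for x, label in zip(X, y)}
        return self

    def predict(self, x):
        # Perfect recall on seen points, a fixed guess (0) otherwise.
        return self.table.get(tuple(x), 0)

random.seed(0)
X_train = [(random.random(), random.random()) for _ in range(100)]
y_train = [random.randint(0, 1) for _ in range(100)]  # pure noise labels
X_test = [(random.random(), random.random()) for _ in range(100)]
y_test = [random.randint(0, 1) for _ in range(100)]

model = Memorizer().fit(X_train, y_train)
train_acc = sum(model.predict(x) == y for x, y in zip(X_train, y_train)) / 100
test_acc = sum(model.predict(x) == y for x, y in zip(X_test, y_test)) / 100
print(train_acc)  # 1.0 -- it shatters even random labels
print(test_acc)   # near chance on new points
```

Training accuracy is perfect even on random labels, which is exactly what "shattering" means; test accuracy collapses to chance because the model carries no structure beyond the table.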

Back to the original point, many of the most "successful" machine learning algorithms have this common trend of having a large VC dimension. Yet, according to machine learning theory this is a bad thing.

So, I am left confused by this significant discrepancy between theory and practice. I know the saying "In theory there is no difference between theory and practice; in practice there is," and practitioners tend to brush aside such discrepancies if they get the results they want. A similar question was asked regarding Deep Learning, and the consensus was that it does have large VC dimension, but that doesn't matter because it scores very well on benchmark datasets.

But it is also said that "there is nothing more practical than a good theory," which suggests such a large discrepancy has significance for practical application.

My question, then: is it true that the only thing that really matters is a low error score on test datasets, even when the theoretical analysis of the algorithm says it will generalize poorly? Is overfitting simply not a problem in practice, despite what VC theory predicts?

Solution

When left with a discrepancy between theory and data, data is king. Theory is intended to be predictive -- to make predictions about the world -- but when its predictions disagree with what we actually observe and experience, there is obviously something lacking in the theory.

In this case, VC theory just isn't adequate to understand modern practice in machine learning.

Unfortunately, VC theory ignores methods like regularization. Regularization is widely used in practice, so that's a pretty important gap in VC theory. VC theory counts the number (size, dimension) of possible models and treats all of them as equally valid/likely.
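As a rough numerical illustration of why counting-based guarantees go vacuous at modern model sizes, here is one common textbook form of the VC generalization bound (constants vary between sources; the `vc_gap` helper and the specific numbers are illustrative assumptions, not a result from the question):

```python
import math

# One standard form of the VC bound: with probability 1 - delta,
#   test_error <= train_error + gap, where
#   gap = sqrt( (d * (ln(2n/d) + 1) + ln(4/delta)) / n )
# for VC dimension d and n training samples (valid for n >= d).
def vc_gap(n, d, delta=0.05):
    return math.sqrt((d * (math.log(2 * n / d) + 1) + math.log(4 / delta)) / n)

n = 50_000  # training set size
print(vc_gap(n, d=10))      # small d: a tight, meaningful guarantee
print(vc_gap(n, d=40_000))  # d comparable to n: gap exceeds 1, vacuous
```

With a small hypothesis class the bound says something useful; once the VC dimension approaches the dataset size, the guaranteed gap exceeds 1, i.e. the theory permits anything, which is why it can't distinguish the methods that work from the ones that don't.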

When we train a model with regularization, we depart from that paradigm. Regularization implicitly encodes the assumption that "all else being equal, simpler models (explanations) are more likely to be correct". In other words, regularization is essentially an application of Occam's Razor. In effect, regularization encodes some kind of prior on the distribution of likely models: not all models are equally likely; simpler models are more likely to be right. Classical VC theory doesn't take that into account, and thus can't make useful predictions about the behavior of machine learning methods that use regularization.
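A minimal sketch of that "prior on models" idea, assuming simple 1-D ridge regression (the `ridge_1d` helper and the toy data are made up for illustration): the L2 penalty is equivalent to MAP estimation under a Gaussian prior on the weight, i.e. a built-in preference for simpler models that VC-style counting ignores.

```python
# Ridge regression in one dimension, solved in closed form.
# Minimizing  sum((y - w*x)^2) + lam * w^2  gives
#   w = sum(x*y) / (sum(x^2) + lam)
# The penalty lam * w^2 acts as a Gaussian prior centered at 0:
# larger lam = stronger belief that small (simple) weights are likely.
def ridge_1d(xs, ys, lam):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.0]  # roughly y = 2x plus noise

print(ridge_1d(xs, ys, lam=0.0))   # plain least squares, ~2.02
print(ridge_1d(xs, ys, lam=30.0))  # strong prior shrinks w toward 0
```

The two fits use the exact same hypothesis class, so they are identical from a VC-counting point of view; the regularizer changes which hypotheses are *preferred*, and that distinction is precisely what classical VC analysis leaves out.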

Practitioners aren't "brushing aside" the theory. Rather, VC dimension simply doesn't seem to be super-relevant to practice. It is too limited.

It's still an open question to understand why techniques like deep learning work so well. VC dimension was an early attempt to understand machine learning -- a powerful, beautiful, valiant attempt, one that may still be of some interest -- but ultimately one that doesn't seem to give us the whole picture, perhaps partly because it does not account for things like regularization and our priors on the model.

Context

StackExchange Computer Science Q#75397, answer score: 5
