patternCritical

Why is deep learning hyped despite bad VC dimension?

Submitted by: @import:stackexchange-cs·Mar 10, 2026·

Viewed 0 times

machine-learning vc-dimension cs stackoverflow

whydespitebaddimensionhypedlearningdeep

Problem

The Vapnik–Chervonenkis (VC)-dimension formula for neural networks ranges from $O(E)$ to $O(E^2)$, with $O(E^2V^2)$ in the worst case, where $E$ is the number of edges and $V$ is the number of nodes. The number of training samples needed to have a strong guarantee of generalization is linear with the VC-dimension.

This means that for a network with billions of edges, as in the case of successful deep learning models, the training dataset needs billions of training samples in the best case, to quadrillions in the worst case. The largest training sets currently have about a hundred billion samples. Since there is not enough training data, it is unlikely deep learning models are generalizing. Instead, they are overfitting the training data. This means the models will not perform well on data that is dissimilar to the training data, which is an undesirable property for machine learning.

Given the inability of deep learning to generalize, according to VC dimensional analysis, why are deep learning results so hyped? Merely having a high accuracy on some dataset does not mean much in itself. Is there something special about deep learning architectures that reduces the VC-dimension significantly?

If you do not think the VC-dimension analysis is relevant, please provide evidence/explanation that deep learning is generalizing and is not overfitting. I.e. does it have good recall AND precision, or just good recall? 100% recall is trivial to achieve, as is 100% precision. Getting both close to 100% is very difficult.

As a contrary example, here is evidence that deep learning is overfitting. An overfit model is easy to fool since it has incorporated deterministic/stochastic noise. See the following image for an example of overfitting.

Also, see lower ranked answers to this question to understand the problems with an overfit model despite good accuracy on test data.

Some have responded that regularization solves the problem of a large VC dimension. See this question for

Solution

"If the map and the terrain disagree, trust the terrain."

It's not really understood why deep learning works as well as it does, but certainly old concepts from learning theory such as VC dimensions appear not to be very helpful.

The matter is hotly debated, see e.g.:

H. W. Lin, M. Tegmark, D. Rolnick, Why does deep and cheap learning work so well?

C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, Understanding Deep Learning Requires Rethinking Generalization.

D. Krueger, B. Ballas, S. Jastrzebski, D. Arpit, M. S. Kanwal, T. Maharaj, E. Bengio, A. Fischer, A. Courville, Deep Nets Dont Learn Via Memorization.

Regarding the issue of adversarial examples, the problem was discovered in:

C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions.

It is further developed in:

I. Goodfellow, J. Shlens, C. Szegedy, Explaining And Harnessing Adversarial Examples.

There is a lot of follow-on work.

Update March 2020. A new hypothesis that appears to explain some of the mismatch between clear over-parameterisation of modern (feed-forward) NNs and good recognition performance is Frankle and Carbin's Lottery Ticket Hypothesis from 2018:

J. Frankle, M. Carbin, The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks.

The claim is that a "randomly-initialised, dense [feed-forward] neural
network contains a subnetwork that is initialised such that when
trained in isolation it can match the test accuracy of the original
network after training for at most the same number of iterations." Regarding the original question, the Lottery Ticket Hypothesis might be understood as saying that:

-
Training by stochastic gradient descent searches for small subnetworks that work well and deemphasises the rest of the overparameterised network's learning capacity.

-
The bigger the original network, the more likely it is to contain a small subnetwork with good performance on the task at hand.

This has found empirical support, e.g. in

H. Zhou, J. Lan, R. Liu, J. Yosinski, Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask.

and theoretical support in:

E. Malach, G. Yehudai, S. Shalev-Shwartz, O. Shamir, Proving the Lottery Ticket Hypothesis: Pruning is All You Need.

As far as I'm aware, it has not yet been possible to generalise the Lottery Ticket Hypothesis to recurrent NNs.

Update March 2021. The original 2016 paper Understanding Deep Learning Requires Rethinking Generalization has been updated. Here is the new version:

C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, Understanding Deep Learning (Still) Requires Rethinking Generalization.

Context

StackExchange Computer Science Q#75327, answer score: 96

Revisions (0)

No revisions yet.