patternMinor
How much does regularization prevent overfitting?
Viewed 0 times
preventoverfittingmuchregularizationdoeshow
Problem
I've been curious why machine learning algorithms with high VC dimension, such as XGBoost and deep learning, do well in practice. The answer appears to be regularization significantly restricting the parameter space, but the only justifications I've seen are references to Occam's razor.
Is there quantitative/theoretical understanding of how much regularization can prevent overfitting in a model?
Background:
I've taken a course in machine learning and covered the theory behind regularization and a couple techniques, such as lasso and ridge regression. I understand the principle that regularization can reduce the VC dimension by minimizing or zeroing weights in the model.
But, that principle does not clarify in my mind whether regularization is adequate to counteract the high VC dimension in models used by XGBoost and deep learning.
I am asking for some sort of quantitative theory that provides justification that even with high VC dimension regularization can reduce the dimensionality enough to provide a decent guarantee of generalization.
Alternatively, providing a method I can use to figure this out myself is also acceptable.
Is there quantitative/theoretical understanding of how much regularization can prevent overfitting in a model?
Background:
I've taken a course in machine learning and covered the theory behind regularization and a couple techniques, such as lasso and ridge regression. I understand the principle that regularization can reduce the VC dimension by minimizing or zeroing weights in the model.
But, that principle does not clarify in my mind whether regularization is adequate to counteract the high VC dimension in models used by XGBoost and deep learning.
I am asking for some sort of quantitative theory that provides justification that even with high VC dimension regularization can reduce the dimensionality enough to provide a decent guarantee of generalization.
Alternatively, providing a method I can use to figure this out myself is also acceptable.
Solution
Moritz Hardt's excellent blog outlines extensive research into the idea that stability in machine learning methods is related to the idea of generalization. Interestingly, we find that regularization implies stability.
Formally, given observations $S = (z_1, \dotsc, z_n)$ and algorithm $A$, stability is defined as the expected difference between the risk and empirical risk $$\mathbb{E}[R - R_e] = \Delta,$$ where $R_e = \ell(z, S) = \frac{1}{n} \sum_i \ell(z_i, A(S))$ is the empirical risk and $R = \mathbb{E}_{z \sim D}[\ell(z, A(S))]$ is the risk.
The post details how to obtain bounds on $\Delta$ and its implications much better than I ever can. I highly recommend reading that post along with all of his other posts.
A much more practical argument is that regularization constrains the problem space to a very specific subset of models; namely, regularization methods can be recast as constrained optimization where $\|\beta\|_p \leq c$ for some value of $p$. This is equivalent to using prior knowledge about the problem and forcing the solution to be within some constrained space.
Formally, given observations $S = (z_1, \dotsc, z_n)$ and algorithm $A$, stability is defined as the expected difference between the risk and empirical risk $$\mathbb{E}[R - R_e] = \Delta,$$ where $R_e = \ell(z, S) = \frac{1}{n} \sum_i \ell(z_i, A(S))$ is the empirical risk and $R = \mathbb{E}_{z \sim D}[\ell(z, A(S))]$ is the risk.
The post details how to obtain bounds on $\Delta$ and its implications much better than I ever can. I highly recommend reading that post along with all of his other posts.
A much more practical argument is that regularization constrains the problem space to a very specific subset of models; namely, regularization methods can be recast as constrained optimization where $\|\beta\|_p \leq c$ for some value of $p$. This is equivalent to using prior knowledge about the problem and forcing the solution to be within some constrained space.
Context
StackExchange Computer Science Q#75457, answer score: 4
Revisions (0)
No revisions yet.