In the linear regression context, subsetting means choosing a subset from available variables to include in the model, thus reducing its dimensionality.
Shrinkage, on the other hand, means reducing the size of the coefficient estimates. When some estimates are shrunk all the way to zero, shrinkage effectively performs a kind of subsetting as well.
Shrinkage and selection aim at improving upon ordinary least squares regression. It may not be immediately obvious why constraining the coefficient estimates should improve the fit, but it turns out that shrinking them can significantly reduce their variance.
There are two main shrinkage techniques:
Ridge regression
It is very similar to least squares, except that the coefficients are estimated by minimizing a slightly different quantity:

RSS + λ Σj βj²,

where λ ≥ 0 is a tuning parameter, to be determined separately.
As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small. The second term, λ Σj βj², called a shrinkage penalty, is small when the coefficients β1, ..., βp are close to zero, and so it has the effect of shrinking the estimates of βj towards zero.
The tuning parameter λ serves to control the relative impact of these two terms on the regression coefficient estimates. When λ = 0, the penalty term has no effect, and ridge regression will produce the least squares estimates. However, as λ → ∞, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach zero.
Ridge regression’s advantage over least squares is rooted in the bias-variance trade-off. As λ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias. At the least squares coefficient estimates, which correspond to ridge regression with λ = 0, the variance is high but there is no bias. But as λ increases, the shrinkage of the ridge coefficient estimates leads to a substantial reduction in the variance of the predictions, at the expense of a slight increase in bias.
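The shrinking effect of λ can be seen directly from the closed-form ridge solution, β̂ = (XᵀX + λI)⁻¹ Xᵀy. Below is a minimal numpy sketch (the simulated data and variable names are illustrative, not from the text): as λ grows, the norm of the coefficient vector shrinks, and λ = 0 recovers the least squares fit.

```python
import numpy as np

# simulate a small regression problem (hypothetical data for illustration)
rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

def ridge(X, y, lam):
    # closed-form ridge estimate: (X'X + lam * I)^(-1) X'y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

for lam in [0.0, 1.0, 100.0]:
    b = ridge(X, y, lam)
    print(f"lambda={lam:>6}: ||beta||_2 = {np.linalg.norm(b):.3f}")
```

Running this shows the coefficient norm decreasing monotonically as λ increases, illustrating the bias-variance trade-off discussed above.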
Lasso regression
The LASSO is a regression method that involves penalizing the absolute size of the regression coefficients.
By penalizing the absolute values of the coefficients, you end up in a situation where some of the parameter estimates may be exactly zero. The larger the penalty applied, the further the estimates are shrunk towards zero.
This is convenient when we want some automatic feature/variable selection, or when dealing with highly correlated predictors, where standard regression will usually have regression coefficients that are 'too large'.
As with ridge regression, the lasso shrinks the coefficient estimates towards zero. However, in the case of the lasso, the ℓ1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large. Hence, much like best subset selection, the lasso performs variable selection. As a result, models generated from the lasso are generally much easier to interpret than those produced by ridge regression.
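The exact-zero behavior of the ℓ1 penalty can be demonstrated with a simple coordinate-descent solver using the soft-thresholding operator. This is a minimal sketch, not the author's implementation; the data and parameter values are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, g):
    # soft-thresholding: shrinks z toward zero, clipping small values to exactly 0
    return np.sign(z) * max(abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # coordinate descent for (1/2)||y - X beta||^2 + lam * ||beta||_1
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with feature j's contribution removed
            r = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ r
            beta[j] = soft_threshold(z, lam) / (X[:, j] @ X[:, j])
    return beta

# simulated data where two predictors are truly irrelevant
rng = np.random.default_rng(1)
n, p = 50, 5
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta_hat = lasso_cd(X, y, lam=10.0)
print(beta_hat)  # estimates for irrelevant predictors are driven to exactly zero
```

Unlike ridge, which only shrinks coefficients toward zero, the soft-threshold step sets small coefficients to exactly zero, which is precisely the automatic variable selection described above.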