Regularization is a vital tool in machine learning for preventing overfitting and improving generalization. This chapter introduces the concept of regularization and discusses common regularization techniques in more depth.
§15.01: Introduction
Regularisation refers to methods that add inductive bias to a model, usually some kind of “low complexity” prior (e.g. shrinkage or sparsity), in order to reduce overfitting and achieve a better bias-variance tradeoff. Broad categories include explicit regularisation (L1/L2 penalties), implicit regularisation (early stopping, dropout) and regularisation via structural knowledge (e.g. the group Lasso).
Increasing the dataset size can help but is often not feasible in practice. Another option is to reduce model complexity, for example by starting with a constant model and adding one feature at a time in linear regression. We could also “optimise less”, i.e. use early stopping.
Regularised Empirical Risk Minimisation balances two conflicting goals: maximising the fit to the data while at the same time keeping model complexity low. This is done by adding a penalty term that is a function of model complexity to the Empirical Risk Minimisation objective: $$\mathcal{R}_{\text{reg}}(\theta) = \mathcal{R}_{\text{emp}}(\theta) + \lambda \cdot J(\theta),$$ where $J(\theta)$ measures complexity and $\lambda > 0$ controls the strength of the penalty.
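As a small illustration, here is a minimal sketch of such a penalised objective for a linear model (the function name, signature and penalty choices are my own, not from the chapter):

```python
import numpy as np

def regularised_risk(theta, X, y, lam, penalty="l2"):
    """Empirical risk (mean squared error) plus lam times a complexity penalty."""
    residuals = y - X @ theta
    emp_risk = np.mean(residuals ** 2)
    if penalty == "l2":
        J = np.sum(theta ** 2)        # ridge-style penalty
    elif penalty == "l1":
        J = np.sum(np.abs(theta))     # lasso-style penalty
    else:
        raise ValueError(f"unknown penalty: {penalty}")
    return emp_risk + lam * J

# lam = 0 recovers plain empirical risk minimisation;
# larger lam puts more weight on "low complexity".
```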
§15.02: Ridge Regression
In linear regression, if the number of features $p$ is large and the number of observations $n$ is small, linear regression can also overfit. Moreover, the OLS estimator requires a full-rank design matrix, and with highly correlated features OLS becomes sensitive to random errors, resulting in large variance. Ridge regression counteracts this by adding the $L2$ penalty $\lambda \|\theta\|_2^2$ to the least-squares objective, which shrinks the coefficients towards zero.
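The following sketch uses the standard ridge closed form $(X^\top X + \lambda I)^{-1} X^\top y$ on two nearly collinear features; the data and settings are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)       # nearly collinear with x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge estimate (intercept omitted for simplicity)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print("OLS  :", ridge(X, y, 0.0))   # can be unstable / high variance under near-collinearity
print("ridge:", ridge(X, y, 1.0))   # shrunk towards zero, more stable
```

Even a small $\lambda > 0$ stabilises the solution because $X^\top X + \lambda I$ is always invertible.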
§15.03: LASSO Regression
Another shrinkage method is the so-called LASSO (least absolute shrinkage and selection operator), which uses the L1 penalty on $\theta$: $$\hat{\theta}_{\text{lasso}} = \arg\min_\theta \sum_{i=1}^n \big(y^{(i)} - \theta^\top x^{(i)}\big)^2 + \lambda \|\theta\|_1.$$
Optimisation becomes much harder because the objective is no longer differentiable.
For $p > n$, the LASSO selects at most $n$ features.
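One standard way to handle the non-differentiable $L1$ term is proximal gradient descent (ISTA) with the soft-thresholding operator. The implementation below is my own sketch, not code from the chapter:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (sets small entries exactly to zero)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, iters=5000):
    """Proximal gradient descent for 0.5 * ||y - X theta||^2 + lam * ||theta||_1."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the smooth part
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ theta - y)         # gradient of the smooth squared-error part
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta
```

The soft-thresholding step is what sets small coefficients exactly to zero, which is where the sparsity comes from.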
§15.04: LASSO vs Ridge Regression
Usually the intercept is not regularised, so that the “infinitely regularised” model ($\lambda \to \infty$) is the constant model.
Unregularised linear regression has rescaling equivariance: if you rescale a variable, its coefficient changes accordingly, but the predictions and the risk do not change. For example, if a variable encoded in cm is converted to mm, its coefficient is simply divided by 10. For the regularised version, however, the risk of the rescaled model is generally different, so rescaling equivariance is lost. Therefore: standardise features in regularised linear regression!
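A quick numerical check of this point (toy data and helper function are my own): rescaling a feature from cm to mm leaves the OLS predictions untouched but changes the ridge fit unless the feature is standardised first:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
height_cm = rng.normal(170, 10, size=n)
y = 0.5 * height_cm + rng.normal(size=n)

def ridge_predictions(x, y, lam):
    """Ridge fit on a single feature (no intercept), returning fitted values."""
    X = x[:, None]
    theta = np.linalg.solve(X.T @ X + lam * np.eye(1), X.T @ y)
    return X @ theta

lam = 1e5   # deliberately large penalty so the effect is easy to see
pred_cm = ridge_predictions(height_cm, y, lam)
pred_mm = ridge_predictions(10 * height_cm, y, lam)   # same variable, now in mm

print(np.allclose(pred_cm, pred_mm))   # False: the regularised fits differ
# With lam = 0 (plain OLS) the two sets of predictions coincide.
```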
Suppose two variables are highly correlated. Ridge regression then tends to show a “grouping effect”, i.e. it assigns similar coefficients to both variables, whereas the LASSO arbitrarily selects one of them. If two features are identical, $x_j = x_k$, then Ridge yields $\hat{\theta}_j = \hat{\theta}_k$ because the $L2$ penalty is strictly convex. Since the $L1$ penalty is not strictly convex, the sum of the two coefficients can be arbitrarily allocated across both features, i.e. any split with $\hat{\theta}_j + \hat{\theta}_k$ fixed (and equal signs) gives the same objective value.
§15.05: Elastic Net and Regularization for GLMs
The Elastic Net is a “compromise” between the L1 and L2 penalties: $$\hat{\theta}_{\text{elnet}} = \arg\min_\theta \sum_{i=1}^n \big(y^{(i)} - \theta^\top x^{(i)}\big)^2 + \lambda_1 \|\theta\|_1 + \lambda_2 \|\theta\|_2^2.$$
Correlated features tend to be selected or zeroed out together, and the selection of more than $n$ features is possible even if $p > n$.
Unlike pure Ridge, the Elastic Net performs feature selection, and unlike the LASSO, which sometimes drops relevant correlated features, the Elastic Net tends to keep them in the model.
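A small scikit-learn sketch of the grouping effect (assuming scikit-learn is available; the data and hyperparameters are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.001 * rng.normal(size=n)      # nearly a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=n)

# The lasso tends to put all weight on one of the two twins,
# while the elastic net tends to spread it over both.
print("lasso      :", Lasso(alpha=0.1).fit(X, y).coef_)
print("elastic net:", ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)
```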
§15.06: Other Types of Regularisation
Although Ridge and Lasso have many desirable properties, they are biased estimators.
Besides $L1$ and $L2$, we can also consider the $Lq$ penalty: $\|\theta\|_q^q = \sum_j |\theta_j|^q$. For $q < 1$ the penalty becomes non-convex, and for $q > 1$ sparsity is no longer achieved. Non-convex penalties have the so-called oracle property: consistent and asymptotically unbiased estimation together with feature selection. However, the non-convexity makes the optimisation problem considerably harder.
The $L0$ “norm” $\|\theta\|_0 = \sum_j \mathbf{1}[\theta_j \neq 0]$ simply counts the number of non-zero parameters. It induces sparsity more aggressively than $L1$ but does not shrink the non-zero parameters. AIC and BIC are special cases of $L0$ regularisation (for particular choices of $\lambda$), and the resulting optimisation problem is NP-hard.
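For intuition, an $L0$-style best-subset search scored by AIC can be written down directly; the brute-force loop over all $2^p$ subsets below (my own sketch) is exactly why the exact problem is only feasible for very small $p$:

```python
import itertools
import numpy as np

def aic_linear(X, y, subset):
    """AIC (up to an additive constant) of an OLS fit on a feature subset."""
    n = len(y)
    if subset:
        Xs = X[:, subset]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = np.sum((y - Xs @ beta) ** 2)
    else:
        rss = np.sum(y ** 2)
    return n * np.log(rss / n) + 2 * len(subset)

def best_subset(X, y):
    """Exhaustive L0-style search: score every subset and keep the best one."""
    p = X.shape[1]
    subsets = itertools.chain.from_iterable(
        itertools.combinations(range(p), k) for k in range(p + 1))
    return min(subsets, key=lambda s: aic_linear(X, y, list(s)))
```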
§15.07: Non-Linear Models and Structural Risk Minimisation
We can summarise regularised risk minimisation using this equation: $$\min_{\theta} \; \sum_{i=1}^n L\big(y^{(i)}, f(x^{(i)} \mid \theta)\big) + \lambda \cdot J(\theta).$$
The hypothesis space of $f$ controls how features influence predictions, the loss function $L$ measures how errors are treated, and the regulariser $J(\theta)$ encodes our inductive bias.
Structural Risk Minimisation (SRM) assumes that our hypothesis space $\mathcal{H}$ can be decomposed into a sequence of increasingly complex hypothesis spaces, e.g. indexed by the degree of a polynomial or the size of a hidden layer: $\mathcal{H}_1 \subseteq \mathcal{H}_2 \subseteq \dots \subseteq \mathcal{H}$.
SRM chooses the smallest $k$ such that the optimal model from $\mathcal{H}_k$ is not significantly outperformed by a model from some $\mathcal{H}_m$ with $m > k$. In simple terms: if models from the higher-complexity hypothesis spaces do not significantly outperform the current complexity level, choose the optimal model from the current level.
We can also interpret Regularised Risk Minimisation through the lens of SRM by treating each value of the regularisation constant $\lambda$ as defining a complexity level (larger $\lambda$ corresponds to a less complex model class).
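A rough sketch of the SRM recipe with nested hypothesis spaces given by polynomial degree (the function name and the tolerance-based notion of “not significantly outperformed” are my own simplifications):

```python
import numpy as np

def srm_poly_degree(x_tr, y_tr, x_val, y_val, max_degree=10, tol=1e-3):
    """Pick the smallest degree whose validation error is within tol of the best."""
    val_errors = []
    for d in range(1, max_degree + 1):
        coefs = np.polyfit(x_tr, y_tr, deg=d)                       # fit a model from H_d
        val_errors.append(np.mean((np.polyval(coefs, x_val) - y_val) ** 2))
    best = min(val_errors)
    for d, err in enumerate(val_errors, start=1):
        if err <= best + tol:       # not significantly outperformed by larger d
            return d
    return max_degree
```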
§15.08: Bayesian Priors
We have previously established an equivalence between MLE and Empirical Risk Minimisation. We can do the same for MAP and Regularised Risk Minimisation. Assume we have a parametrised distribution $p(\mathcal{D} \mid \theta)$ for our data and a prior distribution $q(\theta)$ over our parameter space; then, by Bayes' rule: $$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, q(\theta)}{p(\mathcal{D})} \propto p(\mathcal{D} \mid \theta)\, q(\theta).$$
The Maximum A Posteriori (MAP) estimator is then given by: $$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \big[\log p(\mathcal{D} \mid \theta) + \log q(\theta)\big] = \arg\min_\theta \big[-\log p(\mathcal{D} \mid \theta) - \log q(\theta)\big].$$
In this formulation we can identify the negative log-likelihood $-\log p(\mathcal{D} \mid \theta)$ as our loss function and the negative log-prior $-\log q(\theta)$ as our regulariser.
$L2$ regularisation is equivalent to a MAP estimator with a Gaussian prior, and $L1$ regularisation is equivalent to a Laplace prior. For $L2$ regularisation we can show that the larger the variance of the Gaussian prior, the weaker the regularisation, i.e. the shrinkage strength $\lambda$ is inversely proportional to the prior variance.
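As a short derivation sketch of this last claim: assume a Gaussian likelihood with noise variance $\sigma^2$ and an isotropic Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2 I)$. Minimising the negative log-posterior then gives $$\hat{\theta}_{\text{MAP}} = \arg\min_\theta \Big[ \frac{1}{2\sigma^2} \sum_{i=1}^n \big(y^{(i)} - \theta^\top x^{(i)}\big)^2 + \frac{1}{2\tau^2} \|\theta\|_2^2 \Big] = \arg\min_\theta \Big[ \sum_{i=1}^n \big(y^{(i)} - \theta^\top x^{(i)}\big)^2 + \frac{\sigma^2}{\tau^2} \|\theta\|_2^2 \Big],$$ i.e. an $L2$-penalised least-squares problem with $\lambda = \sigma^2 / \tau^2$: the larger the prior variance $\tau^2$, the smaller $\lambda$.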
§15.09: Weight Decay and L2
Consider $L2$ regularisation of the empirical risk: $$\mathcal{R}_{\text{reg}}(\theta) = \mathcal{R}_{\text{emp}}(\theta) + \frac{\lambda}{2} \|\theta\|_2^2.$$
If we minimise this using gradient descent, the gradient is $$\nabla_\theta \mathcal{R}_{\text{reg}}(\theta) = \nabla_\theta \mathcal{R}_{\text{emp}}(\theta) + \lambda \theta.$$
Thus we get the following update with step size $\alpha$: $$\theta^{[t+1]} = \theta^{[t]} - \alpha \big(\nabla_\theta \mathcal{R}_{\text{emp}}(\theta^{[t]}) + \lambda \theta^{[t]}\big) = (1 - \alpha\lambda)\, \theta^{[t]} - \alpha\, \nabla_\theta \mathcal{R}_{\text{emp}}(\theta^{[t]}).$$
Usually both $\alpha$ and $\lambda$ are small, so $\alpha\lambda \ll 1$. Essentially, the old parameter vector is decayed in magnitude (multiplied by $1 - \alpha\lambda$) before the usual gradient step is performed. Thus the well-known technique of weight decay is simply $L2$ regularisation in disguise. Note that this equivalence only holds for (stochastic) gradient descent and not for adaptive optimisers such as Adam.
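A quick numerical check of this equivalence (toy numbers, my own):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(32, 5))
y = rng.normal(size=32)
theta = rng.normal(size=5)
alpha, lam = 0.01, 0.1

grad_emp = 2 * X.T @ (X @ theta - y) / len(y)                # gradient of the MSE

step_l2 = theta - alpha * (grad_emp + lam * theta)            # explicit L2-penalty gradient step
step_decay = (1 - alpha * lam) * theta - alpha * grad_emp     # decay the weights, then plain step

print(np.allclose(step_l2, step_decay))                       # True: the two updates coincide
```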
§15.10: Geometry of L2 Regularisation
§15.11: Geometry of L1 Regularisation
§15.12: Early Stopping
Early stopping is effective and simple to implement. It involves simply stopping optimisation when the validation error stops decreasing.
For linear regression with squared loss, optimised by gradient descent with the initial parameters set to 0, early stopping has an exact correspondence with $L2$ regularisation, with the stopping iteration relating to the penalty roughly as $T \approx 1/(\alpha\lambda)$: a small $\lambda$ (low regularisation) corresponds to a larger $T$ (more model complexity).
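A toy comparison (setup entirely my own) of early-stopped gradient descent against the unregularised solution, illustrating that few iterations act like strong shrinkage while many iterations approach OLS:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=100)

def gd_early_stop(X, y, alpha=1e-3, T=100):
    """Gradient descent on the squared loss, initialised at 0, stopped after T steps."""
    theta = np.zeros(X.shape[1])
    for _ in range(T):
        theta -= alpha * 2 * X.T @ (X @ theta - y)
    return theta

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

print("T=10    :", np.round(gd_early_stop(X, y, T=10), 2))      # heavily shrunk, ridge-like
print("T=10000 :", np.round(gd_early_stop(X, y, T=10000), 2))   # close to the OLS solution
print("OLS     :", np.round(ols(X, y), 2))
```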