Linear Regression with Ridge

Having discussed how OLS works, we now turn to another popular regression method widely used in Machine Learning – Ridge. In this article, we discuss what Ridge Regression is, how it differs from OLS, why we need this difference, and how we can calculate the Ridge estimator.

Intro

In OLS, we aim to minimize the sum of the squared differences between the predicted values and the actual values of the dependent variable, as shown by the following equation:

    \begin{equation*} \min ||\mathbf{y} - \mathbf{X}\boldsymbol{\beta}||^2 \end{equation*}

where \mathbf{y} is the vector of observed values, \mathbf{X} is the matrix of predictors, and \boldsymbol{\beta} is the vector of coefficients that we want to estimate. However, when the number of predictors is large, OLS can lead to overfitting, which means that the model is too complex and doesn’t generalize well to new data.

To address this issue, Ridge Regression adds a penalty term to the OLS loss function, which aims to shrink the coefficients towards zero. This penalty term is controlled by a hyperparameter \lambda, which can be tuned to find the optimal balance between model complexity and performance. The Ridge Regression objective function can be written as:

    \begin{equation*} \min ||\mathbf{y} - \mathbf{X}\boldsymbol{\beta}||^2 + \lambda ||\boldsymbol{\beta}||^2 \end{equation*}

where ||\cdot||^2 denotes the squared Euclidean norm. By adding the penalty term, Ridge Regression is able to reduce the size of the coefficients and avoid overfitting, while still maintaining good performance on the training data.
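
To make the two objective functions concrete, here is a minimal NumPy sketch; the function names (ols_loss, ridge_loss) and the toy data are our own, chosen purely for illustration.

    import numpy as np

    def ols_loss(beta, X, y):
        """Sum of squared residuals ||y - X beta||^2 (the OLS objective)."""
        residuals = y - X @ beta
        return residuals @ residuals

    def ridge_loss(beta, X, y, lam):
        """OLS loss plus the Ridge penalty lam * ||beta||^2."""
        return ols_loss(beta, X, y) + lam * (beta @ beta)

    # Toy data: 50 observations, 3 predictors, known coefficients plus noise.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    beta_true = np.array([2.0, -1.0, 0.5])
    y = X @ beta_true + rng.normal(scale=0.5, size=50)

    print(ols_loss(beta_true, X, y))            # only the noise contributes
    print(ridge_loss(beta_true, X, y, lam=10))  # the penalty is added on top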

Ridge Estimators

To calculate the Ridge estimators, we need to solve the following optimization problem:

    \[\hat{\boldsymbol{\beta}}_{\text{Ridge}} = \arg\min_{\boldsymbol{\beta}} \left\{ (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}\]

where \mathbf{X} is the n \times p matrix of predictors, \mathbf{y} is the n \times 1 vector of response variables, \boldsymbol{\beta} is the p \times 1 vector of coefficients to be estimated, and \lambda is the penalty parameter that controls the amount of shrinkage applied to the coefficients.

Using the matrix notation, we can write the objective function as:

    \begin{equation*} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda \boldsymbol{\beta}^\top \boldsymbol{\beta} \end{equation*}

To find the Ridge estimators, we need to take the derivative of the objective function with respect to the coefficient vector \boldsymbol{\beta} and set it equal to zero:

    \begin{equation*} \frac{\partial}{\partial \boldsymbol{\beta}} \left[ (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda \boldsymbol{\beta}^\top \boldsymbol{\beta} \right] = -2\mathbf{X}^\top (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + 2\lambda \boldsymbol{\beta} = \mathbf{0} \end{equation*}

Solving for \boldsymbol{\beta}, we get:

    \begin{equation*} \boldsymbol{\beta}_{\text{Ridge}} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y} \end{equation*}

where \mathbf{I} is the p \times p identity matrix. This expression shows that the Ridge coefficients are obtained by applying the inverse of \mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I} to \mathbf{X}^\top \mathbf{y}; setting \lambda = 0 recovers the familiar OLS solution (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y}.

It is important to note that the matrix \mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I} is always invertible for \lambda > 0 (it is positive definite even when \mathbf{X}^\top \mathbf{X} itself is singular), which ensures that the Ridge estimator exists and is unique. Moreover, the Ridge estimator can be computed efficiently with standard numerical linear algebra, for example via the Cholesky, QR, or singular value decomposition (SVD), without ever forming the matrix inverse explicitly.
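
Since the closed-form solution is nothing more than a linear system, it is easy to implement directly. The sketch below (variable names and data are our own) solves the Ridge normal equations with numpy.linalg.solve rather than an explicit inverse; as an optional cross-check, scikit-learn's Ridge with fit_intercept=False minimizes the same objective.

    import numpy as np

    def ridge_estimator(X, y, lam):
        """Solve (X^T X + lam * I) beta = X^T y without forming an explicit inverse."""
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 5))
    beta_true = rng.normal(size=5)
    y = X @ beta_true + rng.normal(scale=0.3, size=100)

    print(ridge_estimator(X, y, lam=2.0))

    # Optional cross-check (same objective, no intercept):
    # from sklearn.linear_model import Ridge
    # print(Ridge(alpha=2.0, fit_intercept=False).fit(X, y).coef_)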

Reducing Coefficient Sizes and Avoiding Overfitting

To see why this shrinks the estimated coefficients and avoids overfitting, let’s focus on the expression for the Ridge estimator:

    \begin{equation*} \boldsymbol{\beta}_{\text{Ridge}} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y} \end{equation*}

The matrix (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} acts as a shrinkage operator: relative to OLS, it scales the estimate down by an amount that depends on the value of \lambda. This is easiest to see in the special case of orthonormal predictors, where \mathbf{X}^\top \mathbf{X} = \mathbf{I} and each Ridge coefficient is simply the corresponding OLS coefficient divided by a constant factor:

    \[\hat{\beta}_{\text{Ridge}, j} = \frac{\hat{\beta}_{\text{OLS}, j}}{1 + \lambda}\]

where \hat{\beta}_{\text{OLS}, j} is the OLS estimate for the j-th coefficient. Since the denominator 1 + \lambda is greater than 1 for any \lambda > 0, the Ridge coefficients are always smaller in magnitude than the OLS coefficients. For a general design matrix the shrinkage is not uniform across coefficients (directions in which the predictors carry little variance are shrunk the most), but the qualitative effect is the same: the coefficients are pulled towards zero.
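
We can verify the orthonormal-design formula numerically. In this sketch (our own construction) we obtain orthonormal columns from a QR decomposition and check that each Ridge coefficient equals the corresponding OLS coefficient divided by 1 + \lambda.

    import numpy as np

    rng = np.random.default_rng(2)
    # Orthonormal design: Q has orthonormal columns, so Q^T Q = I.
    Q, _ = np.linalg.qr(rng.normal(size=(200, 4)))
    beta_true = np.array([3.0, -2.0, 1.0, 0.5])
    y = Q @ beta_true + rng.normal(scale=0.1, size=200)

    lam = 4.0
    beta_ols = Q.T @ y                                              # OLS when Q^T Q = I
    beta_ridge = np.linalg.solve(Q.T @ Q + lam * np.eye(4), Q.T @ y)

    # Every Ridge coefficient is the OLS coefficient divided by 1 + lam.
    print(beta_ridge)
    print(beta_ols / (1 + lam))   # matches up to floating-point error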

To see how this helps to avoid overfitting, consider the case where the number of predictors is large relative to the number of observations in the training data. In this scenario, OLS may produce a model with coefficients that have large absolute values, resulting in a complex model that is susceptible to overfitting. Ridge Regression, on the other hand, will produce a model with smaller coefficient values, because of the penalty term. This reduction in the size of the coefficients reduces the complexity of the model and makes it less likely to overfit.

    \begin{equation*} \boldsymbol{\beta}_{\text{Ridge}} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y} \end{equation*}

This expression shows that the Ridge coefficients depend on the data through the same quantities as the OLS solution (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}, but with the penalty term \lambda \mathbf{I} added to \mathbf{X}^\top\mathbf{X} before inverting. The penalty term shrinks the coefficients towards zero, and the amount of shrinkage is controlled by the value of \lambda. As \lambda increases, the size of the coefficients decreases, and the model becomes simpler.

Intuitively, the Ridge penalty imposes a trade-off between goodness of fit and model complexity: by penalizing the squared size of the coefficients, it encourages them to have smaller absolute values, which helps to avoid overfitting. When the penalty parameter \lambda is large, the Ridge coefficients will be close to zero, which means that the model is simpler and less prone to overfitting. On the other hand, when \lambda is small, the Ridge coefficients will be closer to the OLS estimates, which means that the model is more complex and can fit the training data more closely, but may be more prone to overfitting.
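
The role of \lambda is easy to see numerically: computing the Ridge solution over a grid of penalty values (an arbitrary toy setup of our own) shows the norm of the coefficient vector shrinking as \lambda grows.

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(60, 8))
    y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=60)

    for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
        beta = np.linalg.solve(X.T @ X + lam * np.eye(8), X.T @ y)
        # lam = 0 recovers OLS; larger lam shrinks the coefficient vector.
        print(f"lambda = {lam:7.1f}   ||beta||_2 = {np.linalg.norm(beta):.4f}")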

Biasedness and Consistency

The Ridge estimator is biased.

This is because the addition of the penalty term in Ridge Regression results in a bias in the estimated coefficients, compared to the OLS estimates.

To see this, consider the expected value of the Ridge estimator:

    \begin{equation*} \begin{aligned} \mathbb{E}(\boldsymbol{\beta}^{\text{Ridge}}) &= \mathbb{E}\left[(\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}\right]\ \\ &= \mathbb{E}\left[(\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top(\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon})\right]\ \\ &= \mathbb{E}\left[(\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}\right] + \mathbb{E}\left[(\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top\boldsymbol{\epsilon}\right] \end{aligned} \end{equation*}

where \boldsymbol{\epsilon} is the error term, assumed to have mean zero and constant variance.

The first term on the right-hand side of the equation is non-random once we treat \mathbf{X} as fixed, so taking the expectation leaves it unchanged:

    \begin{equation*} \mathbb{E}\left[(\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}\right] = (\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} \end{equation*}

The second term on the right-hand side of the equation involves the errors and has expectation zero:

    \begin{equation*} \begin{aligned} \mathbb{E}\left[(\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top\boldsymbol{\epsilon}\right] &= (\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top\mathbb{E}(\boldsymbol{\epsilon})\ \\ &= (\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top\mathbf{0}\ \\ &= \mathbf{0} \end{aligned} \end{equation*}

Therefore, the expected value of the Ridge estimator is:

    \begin{equation*} \mathbb{E}(\boldsymbol{\beta}^{\text{Ridge}}) = (\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} \end{equation*}

which is a biased estimator of the true regression coefficients \boldsymbol{\beta}. The bias is a function of the tuning parameter \lambda and the true regression coefficients \boldsymbol{\beta}.
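
Subtracting \boldsymbol{\beta} from this expectation gives an explicit expression for the bias, using only the quantities already defined above:

    \begin{equation*} \mathbb{E}(\boldsymbol{\beta}^{\text{Ridge}}) - \boldsymbol{\beta} = \left[(\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top\mathbf{X} - \mathbf{I}\right]\boldsymbol{\beta} = -\lambda(\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I})^{-1}\boldsymbol{\beta} \end{equation*}

The bias is zero when \lambda = 0 (the OLS case) and grows in magnitude as \lambda increases.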

However, even though the Ridge estimator is biased, it can still lead to better prediction performance by reducing the variance of the estimates, as explained earlier. The amount of bias introduced by Ridge Regression is typically small compared to the reduction in variance, especially when the number of predictors is large relative to the sample size.
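
To illustrate the bias-variance trade-off numerically, the following Monte Carlo sketch (our own setup, with an arbitrary correlated design and an arbitrary value of \lambda) repeatedly redraws the noise and compares the average squared estimation error of OLS and Ridge. In settings like this one the variance reduction usually outweighs the bias, although that is not guaranteed for every design or every \lambda.

    import numpy as np

    rng = np.random.default_rng(4)
    n, p, lam, n_sims = 40, 10, 5.0, 2000

    # A fixed, correlated design and a fixed true coefficient vector.
    X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
    beta_true = rng.normal(size=p)

    ols_err = ridge_err = 0.0
    for _ in range(n_sims):
        y = X @ beta_true + rng.normal(scale=2.0, size=n)
        b_ols = np.linalg.solve(X.T @ X, X.T @ y)
        b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
        ols_err += np.sum((b_ols - beta_true) ** 2) / n_sims
        ridge_err += np.sum((b_ridge - beta_true) ** 2) / n_sims

    # Ridge is biased, but its lower variance typically gives a smaller total error here.
    print(f"mean squared estimation error, OLS:   {ols_err:.4f}")
    print(f"mean squared estimation error, Ridge: {ridge_err:.4f}")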

The Ridge estimator is consistent.

In statistics, consistency means that as the sample size n approaches infinity, the estimator converges in probability to the true value of the parameter being estimated.

For Ridge Regression, the penalty term keeps the estimated coefficients from growing unbounded even when the number of predictors is large, but consistency additionally requires that the penalty become negligible relative to the sample size. In what follows we assume a fixed number of predictors p, a well-behaved design, and a penalty that grows more slowly than the sample size, so that \lambda / n \rightarrow 0.

The consistency of the Ridge estimator can be shown mathematically by demonstrating that as the sample size n approaches infinity, the estimated coefficients converge in probability to the true regression coefficients. Specifically, under some regularity conditions, it can be shown that:

    \begin{equation*} \lim_{n \rightarrow \infty} \mathbb{P}(|\boldsymbol{\beta}^{\text{Ridge}} - \boldsymbol{\beta}|_2 > \epsilon) = 0 \end{equation*}

for any positive value of \epsilon, where \boldsymbol{\beta}^{\text{Ridge}} is the Ridge estimator and \boldsymbol{\beta} is the true regression coefficients.

To prove this, we first define the Ridge estimator as:

    \begin{equation*} \boldsymbol{\beta}^{\text{Ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y} \end{equation*}

where \mathbf{X} is the n \times p design matrix, \mathbf{y} is the n \times 1 response vector, \lambda is the tuning parameter, and \mathbf{I} is the p \times p identity matrix.

Next, we express the difference between the Ridge estimator and the true coefficients as:

    \begin{align*} \boldsymbol{\beta}^{\text{Ridge}} - \boldsymbol{\beta} &= (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y} - \boldsymbol{\beta} \\ &= (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}) - \boldsymbol{\beta} \\ &= (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\boldsymbol{\epsilon} - \lambda(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\boldsymbol{\beta} \end{align*}

where \boldsymbol{\epsilon} is the n \times 1 vector of errors. The first term is stochastic, while the second term is the deterministic bias from the previous section.

To analyze these two terms, it is convenient to rescale by the sample size and rewrite the difference as:

    \begin{equation*} \boldsymbol{\beta}^{\text{Ridge}} - \boldsymbol{\beta} = \left(\frac{1}{n}\mathbf{X}^T\mathbf{X} + \frac{\lambda}{n}\mathbf{I}\right)^{-1}\frac{1}{n}\mathbf{X}^T\boldsymbol{\epsilon} - \frac{\lambda}{n}\left(\frac{1}{n}\mathbf{X}^T\mathbf{X} + \frac{\lambda}{n}\mathbf{I}\right)^{-1}\boldsymbol{\beta} \end{equation*}

First, by the strong law of large numbers, \frac{1}{n}\mathbf{X}^T\mathbf{X} \rightarrow \boldsymbol{\Sigma} almost surely as n \rightarrow \infty, where \boldsymbol{\Sigma} is the p \times p second-moment matrix of the predictor variables (their covariance matrix if they are centered), which we assume to be positive definite and therefore invertible.

Next, since the errors have mean zero, finite variance, and are independent of the predictors, the strong law of large numbers also gives \frac{1}{n}\mathbf{X}^T\boldsymbol{\epsilon} \rightarrow \mathbf{0} almost surely.

Finally, we use the assumption that the penalty grows more slowly than the sample size, so that \lambda/n \rightarrow 0. Then \frac{1}{n}\mathbf{X}^T\mathbf{X} + \frac{\lambda}{n}\mathbf{I} \rightarrow \boldsymbol{\Sigma}, and since matrix inversion is continuous at any invertible matrix, \left(\frac{1}{n}\mathbf{X}^T\mathbf{X} + \frac{\lambda}{n}\mathbf{I}\right)^{-1} \rightarrow \boldsymbol{\Sigma}^{-1}.

Putting everything together, we have:

    \begin{equation*} \boldsymbol{\beta}^{\text{Ridge}} - \boldsymbol{\beta} \rightarrow \boldsymbol{\Sigma}^{-1}\mathbf{0} - 0 \cdot \boldsymbol{\Sigma}^{-1}\boldsymbol{\beta} = \mathbf{0} \end{equation*}

almost surely, and hence also in probability, so that \mathbb{P}(|\boldsymbol{\beta}^{\text{Ridge}} - \boldsymbol{\beta}|_2 > \epsilon) \rightarrow 0 for every \epsilon > 0.

Therefore, we have shown that the Ridge estimator is consistent, which means that as the sample size n gets larger and larger, the estimator will converge to the true coefficients with high probability.
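
Consistency can also be checked empirically. In the sketch below (our own construction) we keep \lambda fixed while the sample size grows, so that \lambda/n \rightarrow 0, and watch the distance between the Ridge estimate and the true coefficients shrink.

    import numpy as np

    rng = np.random.default_rng(5)
    p, lam = 5, 10.0
    beta_true = rng.normal(size=p)

    for n in [50, 500, 5000, 50000]:
        X = rng.normal(size=(n, p))
        y = X @ beta_true + rng.normal(size=n)
        beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
        # lam is fixed, so lam / n -> 0 and the estimate approaches beta_true.
        print(f"n = {n:6d}   ||beta_ridge - beta_true||_2 = "
              f"{np.linalg.norm(beta_ridge - beta_true):.4f}")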

