Biased Estimators - Ridge and Lasso regressions

library(knitr)
library(captioner)
fig_nums <- captioner()
cod_nums <- captioner( prefix = "Code" )

Some days ago I wrote some notes about unbiased estimators, which are the estimators that we commonly use with linear models and with statistical analyses such as ANOVA, linear regression, etc.

Unbiased estimators try to estimate the unknown but fixed (non-random) \(\beta\) coefficients of the independent variables (i.v.) that explain the observed dependent variable \(y\) (d.v.), minimising the error between the values predicted by the linear model (\(\hat{y}\)) and the observed data (\(y\)).

In other words, they try to reduce the error between \(y\) and \(\hat{y}\) with different methodologies. We can also say that they look for the values of the coefficients that best “explain the variance”.
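As a minimal sketch (not part of the original notes, and using the same simulated data that will appear again later in this post), the ordinary least squares fit computed by lm() is exactly this kind of unbiased estimator:

## simulated data: beta0 = 1, beta1 = 2, plus Gaussian noise
set.seed( 1 )
x <- 1:10
y <- 1 + x * 2 + rnorm( n = 10 )

## lm() looks for the coefficients that minimise the squared error between y and yhat
fit_ols <- lm( y ~ x )
coef( fit_ols )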

Problems with the Unbiased Estimators

That’s great! However, a model that fits a specific sample of data from a statistical population too closely is often not very useful in predicting new data, because it has not really captured the values of the coefficients at the population level. This is called overfitting.

Overfitting: the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably

This may also be caused by the use of an excessive number of i.v. in the analysis.
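A minimal sketch of this problem (with made-up data, not part of the original notes): a model with an excessive number of terms reproduces the training sample almost perfectly, but tends to predict new data much worse than a simpler model.

## simulated training data and new data from the same linear process
set.seed( 2 )
x_train <- 1:20
y_train <- x_train * 0.5 + rnorm( 20 )
x_new   <- 1:20 + 0.5
y_new   <- x_new * 0.5 + rnorm( 20 )

fit_simple  <- lm( y_train ~ x_train )
fit_complex <- lm( y_train ~ poly( x_train , 15 ) ) # excessive number of i.v.

## in-sample error: the complex model looks better
c( simple  = sum( residuals( fit_simple )^2 ),
   complex = sum( residuals( fit_complex )^2 ) )

## out-of-sample error: the complex model tends to be (much) worse
c( simple  = sum( ( y_new - predict( fit_simple , newdata = data.frame( x_train = x_new ) ) )^2 ),
   complex = sum( ( y_new - predict( fit_complex , newdata = data.frame( x_train = x_new ) ) )^2 ) )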

In order to prevent this problem, biased estimators have been proposed.

Biased estimators: algorithms that estimate the values of the coefficients while adding a penalisation term, so that the estimated coefficients do not allow a perfect correspondence between \(\hat{y}\) and \(y\).

Two concepts that are often used to understand when a statistical model is a good model are the bias and the variance. Please note that, in this context, bias has no relation with cognitive bias, and variance is not the dispersion statistic that we all know.

The bias is the error of the model in fitting the observed data \(y\). A large bias means that the model underfits the sample of data.

x = 1:10
y = x^3/50 + rnorm(10, sd = 1)

plot(x,y)
abline(lm(y~x), col="red")
points(x , x^3/50, type="l", col="green")
legend("top", col = c("red" , "green"), lty = 1,
       legend = c("Underfit" , "Correct fit"))

Figure 1: Graphical representation of a hypothetical distribution fitted by two models, one with correct fit (y ~ x^3) and one that underfits the data (y ~ x)

The variance is the error of the model in predicting new data. A large variance means that the model overfits the original sample of data \(y\).

## data following a cubic trend plus a periodic component
x = seq(from = 1, to = 5, by = 0.1)
y = x^3/100 + cos(x*10)^2

## a denser grid, used to draw the overfitted curve
x1 = seq(from = 1, to = 5, by = 0.001)
y1 = x1^3/100 + cos(x1*10)^2

plot(x,y)
## curve passing through every observation: overfit
points(x1, y1, type="l", col = "red")
## smooth cubic trend: correct fit
points(x, x^3/100 + mean(cos(x*10)^2), type = "l", col = "green")
legend("topleft", col = c("red" , "green"), lty = 1,
       legend = c("Overfit" , "Correct fit"))

Figure 2: Graphical representation of a hypothetical distribution fitted by two models, one with correct fit (y ~ x^3) and one that overfits the data (y ~ x^3 + cos(x * 10)^2)

(to be honest, the overfitted curve is obtained by using exactly y ~ x^3 + cos(x * 10)^2, but for simplicity let’s pretend that the cosine part is just random variation)

A good model should have low bias and low variance, finding the trade-off between them.

x = 1:100
y = x^5 / 1e10

plot( x = 1:100 , y , type = "l" , col = "red")
points( x = 100:1 , y = y , type = "l", col = "blue")
points( x = 1:100 , y = y + y[100:1] + 0.1 , type = "l" , col = "green")
legend("top", col = c("red" , "blue" , "green"), lty = 1,
       legend = c("Bias" , "Variance" , "Error trade-off"))

Figure 3: Graphical representation of the bias and variance errors of the model, and a hypothetical trade-off curve. Please note that the trade-off curve has been displaced for better visualisation

The most famous biased estimators are:

  • Ridge regression (a.k.a. L2 regularization)
  • Lasso regression (a.k.a. L1 regularization)

Ridge regression

Ridge regression tries to find the best coefficients (\(\beta\)s) that describe our observed sample of data (\(y\)), penalising the size of the coefficients through a parameter \(\lambda\).

Therefore, the idea is to find the \(\beta\)s of the model that best fit the data, with an additional penalising term \(\lambda\).

\(\beta = \underset{\beta}{\operatorname{arg min}}~S(\beta, \lambda)\)

where

\(S(\beta, \lambda) = \sum_{i=1}^{n} \frac{1}{2} ( y_i - \sum_{j=1}^p x_{i,j} \beta_{j})^2+\lambda \sum_{j=1}^p \beta_j^2\)

\(\lambda\) is the complexity parameter, and penalises the \(\beta\)s, limiting overfitting and shrinking large \(\beta\)s.

In order to better understand how ridge regression works, we can use an iterative, inefficient function, so simplified as to be simply wrong, just to get an idea of what the estimator does.

First we create some random, normally distributed data coming from a linear model with intercept \(\beta_0 = 1\) and slope \(\beta_1 = 2\).

These are the same data used in the notes on unbiased estimators.

We set our \(\lambda\) to 0.01.

## creating the data
set.seed( 1 )
# independent variable
x <- 1:10
#   beta0 + i.v. * beta1 + epsilon
y <- 1 + x * 2 + rnorm( n = 10 )

find_ridge_least_squares <- function( y , x , beta0 , beta1 , lambda ) {
  
  # variable for the pair of betas (intercept and slope) that best fits the data
  best_beta <- c( NA , NA )
  # variable for memorising the smallest penalised error found so far
  best_err  <- Inf
  for( int in beta0 ){
    for( bet in beta1 ){
      yhat <- int + x * bet
      # penalised least-squares criterion S(beta, lambda)
      err  <- ( sum( ( y - yhat )^2 ) / 2 ) + ( lambda * ( int^2 + bet^2 ) )
      best_beta[1] <- ifelse( best_err > err,
                           int,
                           best_beta[1])
      best_beta[2] <- ifelse( best_err > err,
                           bet,
                           best_beta[2])
      best_err  <- ifelse( best_err > err,
                           err,
                           best_err)
    }
  }
  
  return( best_beta )
}

print(
  find_ridge_least_squares( y ,
                            x,
                            beta0 = seq(from = 0, to = 3, by = 0.01),
                            beta1 = seq(from = 1, to = 3, by = 0.01),
                            lambda = 0.01 )
)
## [1] 0.80 2.06

Code 1: Example code for a ridge regression

The estimates are not so far from the true values. However, let’s see how the real function, from the glmnet library, works:

library("glmnet")
## Loading required package: Matrix
## Loaded glmnet 4.0-2
fit <- glmnet(y = y, x = cbind(1,x),
              lambda = 0.01, alpha = 0)

print( coef( fit ) )
## 3 x 1 sparse Matrix of class "dgCMatrix"
##                    s0
## (Intercept) 0.8501507
##             .        
## x           2.0512822

Code 2: Use of glmnet for ridge regression

The minimisation problem has a unique solution:

\(\beta = (X^TX + \lambda I)^{-1}X^Ty\)

X = cbind( 1 , x )

solve( t(X) %*% X + 0.01 * diag(2) ) %*% t(X) %*% y
##        [,1]
##   0.8286793
## x 2.0550354

Code 3: Matrix computation for Ridge regression
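As a side note (not in the original post), the same closed-form solution also makes the shrinking effect of \(\lambda\) visible: the larger \(\lambda\) gets, the more the coefficients are pulled towards zero.

## a minimal sketch: closed-form ridge solutions for increasingly large lambdas
ridge_beta <- function( lambda , X , y ) {
  solve( t(X) %*% X + lambda * diag( ncol(X) ) ) %*% t(X) %*% y
}

## each column contains the intercept and the slope for one value of lambda
sapply( c( 0 , 0.01 , 1 , 10 , 100 ), ridge_beta , X = X , y = y )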

Comparing Code 2 and Code 3, we can see that the glmnet function and the plain matrix computation give slightly different results.

glmnet is a complex function that allows computing Ridge regression (alpha = 0), Lasso regression (alpha = 1), and Elastic Net regression (alpha \(\in\) [0, 1]), and it probably uses algorithms that are more advanced than the classical ones that I am using here. Part of the difference is also likely due to the fact that, by default, glmnet standardises the predictors and scales the loss by the number of observations, so its \(\lambda\) is not on exactly the same scale as the one used in the formulas above.
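As a quick, purely illustrative sketch (not part of the original analysis), the three can be compared on the same data just by changing alpha:

## fit the same data with Ridge (alpha = 0), Elastic Net (alpha = 0.5),
## and Lasso (alpha = 1), keeping lambda fixed
for( a in c( 0 , 0.5 , 1 ) ){
  fit_a <- glmnet( y = y , x = cbind( 1 , x ),
                   lambda = 0.01 , alpha = a )
  cat( "alpha =", a, "\n" )
  print( drop( as.matrix( coef( fit_a ) ) ) )
}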

Lasso regression

The Least Absolute Shrinkage and Selection Operator (Lasso) regression is very similar to the Ridge regression.

In fact, it uses the \(\lambda\) coefficient of penalisation already seen in the Ridge regression.

Therefore, the minimisation problem is still:

\(\beta = \underset{\beta}{\operatorname{arg min}}~S(\beta, \lambda)\)

but the \(S\) function changes:

\(S(\beta, \lambda) = \frac{1}{n}\sum_{i=1}^{n} ( y_i - \sum_{j=1}^p x_{i,j} \beta_{j})^2+\frac{\lambda}{2} \sum_{j=1}^p |\beta_j|\)

find_lasso_least_squares <- function( y , x , beta0 , beta1 , lambda ) {
  
  # variable for the pair of betas (intercept and slope) that best fits the data
  best_beta <- c( NA , NA )
  # variable for memorising the smallest penalised error found so far
  best_err  <- Inf
  for( int in beta0 ){
    for( bet in beta1 ){
      yhat <- int + x * bet
      # penalised least-squares criterion S(beta, lambda), with an absolute-value penalty
      err  <- ( sum( ( y - yhat )^2 ) / ( length(yhat) ) ) +
        ( lambda / 2 * ( abs( int ) + abs( bet ) ) )
      best_beta[1] <- ifelse( best_err > err,
                           int,
                           best_beta[1])
      best_beta[2] <- ifelse( best_err > err,
                           bet,
                           best_beta[2])
      best_err  <- ifelse( best_err > err,
                           err,
                           best_err)
    }
  }
  
  return( best_beta )
}

print(
  find_lasso_least_squares( y ,
                            x,
                            beta0 = seq(from = 0, to = 3, by = 0.01),
                            beta1 = seq(from = 1, to = 3, by = 0.01),
                            lambda = 0.01 )
)
## [1] 0.80 2.06

Code 4: Example code for a Lasso regression

# alpha is not specified: it defaults to 1, i.e. Lasso regression
fit <- glmnet(y = y, x = cbind(1,x),
              lambda = 0.01)

print( coef( fit ) )
## 3 x 1 sparse Matrix of class "dgCMatrix"
##                   s0
## (Intercept) 0.850325
##             .       
## x           2.051251

Code 5: Use of glmnet for Lasso regression

The minimisation problem can be divided into two steps:

  1. finding the \(\beta\)s with OLS (\(\beta = (X^TX)^{-1}X^Ty\))
  2. applying the Lasso penalisation (\((1+N \frac{\lambda}{2})^{-1}\))

\(\beta = (1+N \frac{\lambda}{2})^{-1}[(X^TX)^{-1}X^Ty]\)

X = cbind( 1 , x )

(1 + nrow(X) * 0.01 / 2)^(-1) *
solve( t(X) %*% X ) %*% t(X) %*% y
##        [,1]
##   0.7915966
## x 1.9568877

In this case too, the results are different from those of the glmnet function. I suspect that the glmnet function uses a more advanced version of the Lasso algorithm (and, as noted above, a different scaling of \(\lambda\)) than the simple two-step shrinkage used here.
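As a closing, purely illustrative sketch (not part of the original analysis), the “Selection Operator” part of the Lasso can be seen by letting glmnet compute a whole path of \(\lambda\) values on some simulated data with several predictors: as \(\lambda\) grows, some coefficients are set exactly to zero.

## simulated data with 5 predictors, only the first of which is truly related to y
set.seed( 3 )
X_many <- matrix( rnorm( 100 * 5 ), ncol = 5 )
y_many <- X_many[ , 1 ] * 2 + rnorm( 100 )

## Lasso path: one curve per coefficient, as a function of log(lambda)
fit_path <- glmnet( y = y_many , x = X_many , alpha = 1 )
plot( fit_path , xvar = "lambda" , label = TRUE )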