Linear regression is not difficult: basically, we are trying to estimate the line that best fits our data.
But how does it work?
Independent and Dependent variables
Linear regression tries to estimate the parameters \(\beta\) of the independent variables that best “explain the variance” of the observations (the dependent variable).
The dependent variable is the observation that we postulate is shaped / modulated / caused by our independent variables.
This means that we are not looking for correlations between the dependent and independent variables, because we assume that the dependent variable depends (sorry…) on the independent variables. Conversely, correlation does not assume that one variable causes the change in the other. In fact, in the correlation case, it is an error to speak of dependent and independent variables.
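To make this distinction concrete, here is a minimal sketch (with hypothetical toy data, not the data used below): the correlation between two variables is the same in both directions, while the regression slope changes depending on which variable we treat as dependent.
## correlation is symmetric, regression is not
set.seed( 123 )
# hypothetical toy data, just for this illustration
a <- rnorm( 100 )
b <- 0.5 * a + rnorm( 100 )
cor( a , b )              # the same value...
cor( b , a )              # ...in both directions: no causal direction is assumed
coef( lm( b ~ a ) )["a"]  # slope when b is treated as the dependent variable
coef( lm( a ~ b ) )["b"]  # a different slope when a is treated as the dependent variable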
The linear regression formula
Therefore, we try to estimate the unknown vector of parameters \(\beta\) in the formula \(y = X \beta + \epsilon\), trying to minimize \(\epsilon\), which is our error term.
The \(\beta\)s may also be called coefficients, fixed effects, regressors, etc…
In linear models the error term \(\epsilon\) is assumed to be normally distributed with mean 0 and constant variance \(\sigma^2\), that is \(\epsilon \sim N(0, \sigma^2)\).
The \(X\) matrix is the design (or model) matrix, where all our independent variables are stored: covariates and factors.
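As a quick sketch (using a small hypothetical data frame, not the data used below), R builds this design matrix automatically from a formula, and model.matrix() lets us inspect it:
## a hypothetical data frame with one covariate and one factor
dat <- data.frame( age   = c(21, 35, 48, 52),
                   group = factor( c("ctrl", "ctrl", "treat", "treat") ) )
## the design matrix X: an intercept column, the covariate, and the dummy-coded factor
model.matrix( ~ age + group , data = dat )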
Estimation of the unknown vector of \(\beta\)s
Usually the estimation of the \(\beta\)s can be computed with one of these three algorithms:
- Ordinary Least Squares estimation (OLS)
- Maximum Likelihood estimation (ML)
- Restricted (or Residual) Maximum Likelihood estimation (REML)
Ordinary Least Squares estimation (OLS)
The idea is to find the coefficients describing the line for which the difference between the fitted / predicted data (i.e. the estimates of the observed data obtained from the fitted model) and the observed / dependent data is minimal.
Namely, \(\beta = \underset{\beta}{\operatorname{arg min}}~S(\beta)\), where \(S(\beta) = \sum_{i=1}^{n} | y_i - \sum_{j=1}^p x_{i,j} \beta_{j}|^2\), which is basically the sum of the squared errors of the prediction, i.e. the squared differences between the observed (\(y\)) and the predicted (\(\hat{y}\)) data.
Therefore, among all the following lines:
## creating the data
set.seed( 1 )
# independent variable
x <- 1:10
# beta0 + i.v. * beta1 + epsilon
y <- 1 + x * 2 + rnorm( n = 10 )
## plotting the data
plot( x , y , ylim = c(0, 30) , xlim = c(0, 30) )
abline(a = 1, b = 2 , col = "red")
abline(a = 1, b = 1.5, col = "green")
abline(a = 1, b = 2.5, col = "blue")
abline(a = 1, b = 1, col = "purple")
abline(a = 1, b = 3, col = "yellow")
OLS will choose the one with the smallest sum of squared errors \(S(\beta)\).
## creating the data
set.seed( 1 )
# independent variable
x <- 1:10
# beta0 + i.v. * beta1 + epsilon
y <- 1 + x * 2 + rnorm( n = 10 )
## plotting the data
plot( x , y , ylim = c(0, 30) , xlim = c(0, 30) )
abline(a = 1, b = 2 , col = "red")
abline(a = 1, b = 1.5, col = "green")
abline(a = 1, b = 2.5, col = "blue")
abline(a = 1, b = 1, col = "purple")
abline(a = 1, b = 3, col = "yellow")
## absolute errors |observed - predicted| for each candidate line
d1 <- abs( y - (1 + x * 2) )    # red,    beta1 = 2
d2 <- abs( y - (1 + x * 1.5) )  # green,  beta1 = 1.5
d3 <- abs( y - (1 + x * 2.5) )  # blue,   beta1 = 2.5
d4 <- abs( y - (1 + x * 3) )    # yellow, beta1 = 3
d5 <- abs( y - (1 + x) )        # purple, beta1 = 1
## draw, for each candidate line, the literal "squares" of its errors
errors  <- list( d1 , d2 , d3 , d4 , d5 )
colours <- c( "red" , "green" , "blue" , "yellow" , "purple" )
for( i in seq_along( errors ) ){
  d <- errors[[ i ]]
  segments( x0 = x ,     y0 = y ,     x1 = x ,     y1 = y + d , col = colours[ i ] )
  segments( x0 = x ,     y0 = y ,     x1 = x + d , y1 = y ,     col = colours[ i ] )
  segments( x0 = x + d , y0 = y ,     x1 = x + d , y1 = y + d , col = colours[ i ] )
  segments( x0 = x ,     y0 = y + d , x1 = x + d , y1 = y + d , col = colours[ i ] )
}
By looking at the graph, we can say that it is the red one, which has coefficients \(\beta_0 = 1\) and \(\beta_1 = 2\).
To check analytically whether this matches what we observed, we can start with an iterative, inefficient function:
find_least_squares <- function( y , x , beta0 , beta1 ) {
  # variable for the pair of betas that best fits the data
  best_beta <- c( NA , NA )
  # variable for memorizing the smallest sum of squared errors found so far
  best_err <- Inf
  for( int in beta0 ){
    for( bet in beta1 ){
      # predicted values and their sum of squared errors
      yhat <- int + x * bet
      err <- sum( abs( y - yhat )^2 )
      # keep this pair of betas if it improves on the best error so far
      best_beta[1] <- ifelse( best_err > err , int , best_beta[1] )
      best_beta[2] <- ifelse( best_err > err , bet , best_beta[2] )
      best_err <- ifelse( best_err > err , err , best_err )
    }
  }
  return( best_beta )
}
print( find_least_squares( y , x, beta0 = seq(from = 0, to = 3, by = 0.1),
beta1 = seq(from = 1, to = 3, by = 0.1) ) )
## [1] 0.6 2.1
Let’s try to estimate these values with the standard R function lm (which does not solve the least-squares problem through the normal equations, but through a QR decomposition [thanks Marco Tullio Liuzza for the correction]):
mdl <- lm(y ~ x)
print( mdl )
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 0.8312 2.0547
As you can see, the results are very similar.
plot( x , y )
abline(mdl , col= "red")
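For completeness, the OLS estimates also have a closed-form solution, \(\hat{\beta} = (X^\top X)^{-1} X^\top y\) (the normal equations). Here is a minimal sketch, reusing the x and y simulated above:
## design matrix: a column of 1s for the intercept, plus x
X <- cbind( 1 , x )
## normal equations: beta_hat = (X'X)^-1 X'y
beta_hat <- solve( t(X) %*% X ) %*% t(X) %*% y
print( beta_hat )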
Maximum Likelihood estimation (ML)
In this case we want to maximize a likelihood function describing the data. Therefore, we want to find the values of \(\beta\) that maximize the probability of describing the observed dependent variable \(y\), namely the values of \(\beta\) that, if used in our model, maximize the probability of obtaining the observed data.
Namely, \(\beta = \underset{\beta}{\operatorname{arg max}}~\hat{L}_n(\beta;y)\), where \(\hat{L}_n(\beta;y)\) is the likelihood function, which is also an index of the goodness of fit of the model (the log-likelihood is often used for model selection).
To better understand how it works, here is another iterative, inefficient function (remember that we postulate that our coefficients are distributed along normal distributions):
# density of a normal distribution with mean true_value and sd sigma, evaluated at estimate
my_gauss <- function( true_value , sigma , estimate ) {
  1 / ( sigma * sqrt(2 * pi) ) * exp( -(true_value - estimate)^2 / (2 * sigma^2) )
}
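This hand-written density is just the normal probability density function, so, as a quick check, it should return the same value as R’s built-in dnorm():
my_gauss( true_value = 2 , sigma = 1 , estimate = 2.3 )
dnorm( 2.3 , mean = 2 , sd = 1 )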
Thanks to the above function, I will get the likelihood of each parameter, given its true value and the proposed estimate value. We know that our data were created with an angular coefficient of 2, therefore this is our true value. We will try different estimate values to see which one has the highest likelihood.
find_ML <- function( beta0 , beta1 ) {
  # variable for the pair of betas that best fits the data
  best_betas <- c( NA , NA )
  # variable for memorizing the highest likelihood found so far
  best_lik <- -Inf
  for( int in beta0 ){
    for( bet in beta1 ){
      # likelihood of the proposed intercept and slope, given the true values 1 and 2
      current_lik <- my_gauss( 1 , sigma = 1 , int ) *
        my_gauss( 2 , sigma = 1 , bet )
      # keep this pair of betas if it improves on the best likelihood so far
      best_betas[1] <- ifelse( current_lik > best_lik , int , best_betas[1] )
      best_betas[2] <- ifelse( current_lik > best_lik , bet , best_betas[2] )
      best_lik <- ifelse( current_lik > best_lik , current_lik , best_lik )
    }
  }
  return( best_betas )
}
print( find_ML( beta0 = seq(from = 0, to = 3, by = 0.1),
beta1 = seq(from = 1, to = 3, by = 0.1) ) )
## [1] 1 2
As we can see, in this case as well the results are 1 for the intercept and 2 for the angular coefficient.
This methodology can be used in both linear and multilevel linear models.
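Note that in practice the likelihood is computed from the data rather than from the known true values. Here is a minimal data-based sketch, assuming (as in the simulated data above) a residual standard deviation of 1, and reusing x and y:
## log-likelihood of the data under the model y ~ N(beta0 + beta1 * x, sd = 1)
loglik <- function( beta0 , beta1 ) {
  sum( dnorm( y , mean = beta0 + beta1 * x , sd = 1 , log = TRUE ) )
}
## evaluate it over the same grid of candidate values used before
grid <- expand.grid( beta0 = seq(from = 0, to = 3, by = 0.1),
                     beta1 = seq(from = 1, to = 3, by = 0.1) )
grid$ll <- mapply( loglik , grid$beta0 , grid$beta1 )
## the pair of betas with the highest log-likelihood
print( grid[ which.max( grid$ll ) , ] )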
Restricted (or Residual) Maximum Likelihood (REML)
This method is used in particular in multilevel / mixed linear models, which take into consideration both population- and group-level effects.
As in ML, we want to maximize the likelihood of the model by finding the best \(\beta\) parameters. Remember that in these models the \(\beta\)s are assumed to be distributed along a normal distribution characterized by a \(\mu\) and a \(\sigma\) parameter:
\(\beta \sim N(\mu,\sigma)\)
With this methodology, the computation of the likelihood of the model is split into two parts, maximized separately (a short sketch with a mixed-model fit follows the list below):
- The Likelihood of \(\mu\) and \(\sigma\) parameters for all group- and population-level effects. \(\leftarrow\) to maximize the likelihood of the \(\mu\) parameter.
- The Residual Likelihood of only the \(\sigma\) parameters. \(\leftarrow\) to maximize the likelihood of the \(\sigma\) parameter.
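As a sketch of how this plays out in practice, assuming the lme4 package and a hypothetical grouping factor g (neither is part of the data used above), a REML fit and a plain ML fit can be compared directly:
## a hypothetical multilevel data set with 5 groups (not the data used above)
library( lme4 )
set.seed( 2 )
g  <- factor( rep( 1:5 , each = 10 ) )                                 # hypothetical grouping factor
x2 <- rep( 1:10 , times = 5 )
y2 <- 1 + 2 * x2 + rnorm( 5 , sd = 2 )[ as.integer(g) ] + rnorm( 50 )  # group intercepts + residual error
m_reml <- lmer( y2 ~ x2 + (1 | g) , REML = TRUE )                      # default in lmer: REML
m_ml   <- lmer( y2 ~ x2 + (1 | g) , REML = FALSE )                     # plain maximum likelihood
## REML typically gives less biased estimates of the variance components
print( VarCorr( m_reml ) )
print( VarCorr( m_ml ) )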