Basics in Linear Models (Part 1)


Linear Regression Model

Given a dataset with \(n\) observations and \(d\) features, \(\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n\), where \(\mathbf{x}_i \in \mathbb{R}^d\) and \(y_i \in \mathbb{R}\), a linear regression model can be expressed as: \begin{equation} y_i = \mathbf{x}_i^{\top} \boldsymbol{\beta} + \epsilon_i, \nonumber \end{equation} where \(\boldsymbol{\beta} \in \mathbb{R}^d\) is the vector of regression coefficients and \(\epsilon_i\) is the noise term. We may also express the model in matrix form: \begin{equation} \label{eq:linear_model} \mathbf{y} = X \boldsymbol{\beta} + \boldsymbol{\epsilon}, \end{equation} where \(\mathbf{y} \in \mathbb{R}^n\), \(X \in \mathbb{R}^{n \times d}\), and \(\boldsymbol{\epsilon} \in \mathbb{R}^n\). Here, \(X\) is the design matrix, where each row corresponds to an observation and each column corresponds to a feature, i.e., \(X = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n]^{\top}\). In this blog post, we will focus on the case where \(n > d\), which means that we have more observations than features, or equivalently, the linear system is overdetermined.

The standard assumptions in linear models are usually called the Gauss-Markov assumptions:

  1. Linearity. The model is linear in terms of \(\boldsymbol{\beta}\).
  2. No multicollinearity. The design matrix \(X\) has full column rank, meaning that the columns of \(X\) are linearly independent.
  3. Independence. \(\epsilon_i\) are independent and have mean zero, i.e., \(\mathbb{E}[\epsilon_i] = 0\) for all \(i\).
  4. Homoscedasticity. The error terms have constant variance, i.e., \(\text{Var}(\epsilon_i) = \sigma^2\) for all \(i\).
  5. Normality. The error terms are normally distributed, i.e., \(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\) for all \(i\).

Please note that the normality assumption is not necessary for the Gauss-Markov theorem to hold, but it is required for exact statistical inference, such as hypothesis testing and confidence intervals.
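To make these assumptions concrete, here is a minimal NumPy sketch that simulates a dataset satisfying them. The sizes \(n = 200\), \(d = 3\), the noise level \(\sigma = 0.5\), and the coefficient values are purely illustrative choices, not quantities prescribed by the model above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 3, 0.5                # illustrative sizes, chosen for this example

X = rng.normal(size=(n, d))              # rows are observations x_i; full column rank (a.s.)
beta_true = np.array([1.5, -2.0, 0.7])   # a made-up "true" coefficient vector
eps = rng.normal(0.0, sigma, size=n)     # i.i.d. N(0, sigma^2) errors: assumptions 3-5
y = X @ beta_true + eps                  # linear in beta: assumption 1
```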

Ordinary Least Squares (OLS)

The most common method to estimate the regression coefficients \(\boldsymbol{\beta}\) is the Ordinary Least Squares (OLS) method, which minimizes the sum of squared residuals: \begin{equation} \hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^n (y_i - \mathbf{x}_i^{\top} \boldsymbol{\beta})^2. \nonumber \end{equation}

In matrix form, the above equation can be expressed as: \begin{equation} \label{eq:ols_problem} \hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \lVert\mathbf{y} - X \boldsymbol{\beta}\rVert_2^2. \end{equation}

We can verify that problem \eqref{eq:ols_problem} is a convex optimization problem, since its Hessian matrix \(2 X^{\top} X\) is positive definite when \(X\) has full column rank (assumption 2). Therefore, the unique solution can be obtained by setting the gradient to zero: \begin{equation} \nabla_{\boldsymbol{\beta}} \lVert\mathbf{y} - X \boldsymbol{\beta}\rVert_2^2 = -2X^{\top}(\mathbf{y} - X \boldsymbol{\beta}) = 0. \nonumber \end{equation} Solving for \(\boldsymbol{\beta}\) gives the OLS estimator: \begin{equation} \label{eq:ols_estimator} \hat{\boldsymbol{\beta}} = (X^{\top} X)^{-1} X^{\top} \mathbf{y}. \end{equation}
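As a quick sanity check (continuing the simulated data above), the closed-form estimator \eqref{eq:ols_estimator} can be computed directly; in practice it is numerically preferable to call a least-squares solver rather than forming \((X^{\top} X)^{-1}\) explicitly.

```python
# Continuing the simulated (X, y) from above.
# Textbook normal equations: beta_hat = (X^T X)^{-1} X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferable: solve the least-squares problem directly.
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_hat, beta_hat_lstsq))   # True
print(beta_hat)                                # close to beta_true for this sample
```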

Gauss-Markov Theorem

Theorem (Gauss-Markov Theorem). Under the Gauss-Markov assumptions (1-4), the OLS estimator \(\hat{\boldsymbol{\beta}}\) is the Best Linear Unbiased Estimator (BLUE) of the true regression coefficients \(\boldsymbol{\beta}\). That is,

  1. The OLS estimator is unbiased, i.e., \(\mathbb{E}[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}\).
  2. Among all linear unbiased estimators, the OLS estimator has the minimum variance, i.e., for any other linear unbiased estimator \(\tilde{\boldsymbol{\beta}}\), we have: \(\text{Var}(\hat{\boldsymbol{\beta}}) \preceq \text{Var}(\tilde{\boldsymbol{\beta}}).\)

Proof. Based on the OLS estimator \eqref{eq:ols_estimator}, we can express it as: \(\hat{\boldsymbol{\beta}} = (X^{\top} X)^{-1} X^{\top} (X \boldsymbol{\beta} + \boldsymbol{\epsilon}) = \boldsymbol{\beta} + (X^{\top} X)^{-1} X^{\top} \boldsymbol{\epsilon}.\)

At first, we can show that the OLS estimator is unbiased by taking the expectation

\[\begin{aligned} \mathbb{E}[\hat{\boldsymbol{\beta}}] & = \mathbb{E}\left[\boldsymbol{\beta} + (X^{\top} X)^{-1} X^{\top} \boldsymbol{\epsilon}\right] \\ & = \boldsymbol{\beta} + (X^{\top} X)^{-1} X^{\top} \mathbb{E}[\boldsymbol{\epsilon}] \\ & = \boldsymbol{\beta}. \end{aligned}\]

The last equality holds because \(\mathbb{E}[\boldsymbol{\epsilon}] = \mathbf{0}\).

Next, we can show that the OLS estimator has the minimum variance among all linear unbiased estimators. Let \(\tilde{\boldsymbol{\beta}} = C \mathbf{y}\) be any linear unbiased estimator, where \(C \in \mathbb{R}^{d \times n}\) is a matrix. Since \(\tilde{\boldsymbol{\beta}}\) is unbiased, we have

\[\mathbb{E}[\tilde{\boldsymbol{\beta}}] = C \mathbb{E}[\mathbf{y}] = C X \boldsymbol{\beta} = \boldsymbol{\beta}.\]

The above equation should hold for any \(\boldsymbol{\beta}\), so we can conclude that \(C X = \mathbf{I}_d.\)

We can decompose the matrix \(C\) as

\[C = (X^{\top} X)^{-1} X^{\top} + D,\]

where \(D \in \mathbb{R}^{d \times n}\). Since \(C X = \mathbf{I}_d\), we have \(D X = 0\), i.e., every row of \(D\) lies in the left null space of \(X\).

On the other hand, we can express the variance of \(\tilde{\boldsymbol{\beta}}\) as follows:

\[\text{Var}(\tilde{\boldsymbol{\beta}}) = C \text{Var}(\mathbf{y}) C^{\top} = C \sigma^2 \mathbf{I}_n C^{\top} = \sigma^2 C C^{\top}.\]

Substituting the expression of \(C\), we have

\[\begin{aligned} &\quad \text{Var}(\tilde{\boldsymbol{\beta}}) \\ & = \sigma^2 \left((X^{\top} X)^{-1} X^{\top} + D\right) \left((X^{\top} X)^{-1} X^{\top} + D\right)^{\top} \\ & = \sigma^2 \left((X^{\top} X)^{-1} X^{\top} X (X^{\top} X)^{-1} + (X^{\top} X)^{-1} X^{\top} D^{\top} + D X (X^{\top} X)^{-1} + D D^{\top}\right) \\ & = \sigma^2 \left((X^{\top} X)^{-1} + D D^{\top}\right) \\ & \succeq \sigma^2 (X^{\top} X)^{-1} \\ & = \text{Var}(\hat{\boldsymbol{\beta}}). \end{aligned}\]

The third equality holds because \(D X = 0\), and the last inequality holds because \(D D^{\top} \succeq 0\). This completes the proof.
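A small Monte Carlo sketch (continuing the simulated design above, with an arbitrarily chosen number of replications) illustrates the unbiasedness part of the theorem: averaging the OLS estimate over many fresh noise draws recovers \(\boldsymbol{\beta}\) closely.

```python
# Monte Carlo check of unbiasedness: hold X fixed, redraw the noise many times,
# and average the resulting OLS estimates; the mean should approach beta_true.
n_reps = 2000
XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)      # (X^T X)^{-1} X^T, reused each draw
estimates = np.empty((n_reps, d))
for rep in range(n_reps):
    y_rep = X @ beta_true + rng.normal(0.0, sigma, size=n)
    estimates[rep] = XtX_inv_Xt @ y_rep

print(estimates.mean(axis=0))                   # approximately beta_true
print(beta_true)
```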

Geometric Interpretation

Let’s recall some basic linear algebra concepts before we proceed with the geometric interpretation of OLS.

Column Space and Null Space

Given a matrix \(A \in \mathbb{R}^{m \times n}\), the column space of \(A\), also known as the range or image of \(A\), denoted as \(\text{Col}(A)\), is the subspace spanned by the columns of \(A\). More formally, it is defined as:

\[\text{Col}(A) = \{ \mathbf{y} \in \mathbb{R}^m : \mathbf{y} = A \mathbf{x} \text{ for some } \mathbf{x} \in \mathbb{R}^n \}.\]

The null space of \(A\), also known as the kernel of \(A\), denoted as \(\text{Ker}(A)\), is the subspace of all vectors \(\mathbf{x} \in \mathbb{R}^n\) such that \(A \mathbf{x} = \mathbf{0}\). More formally, it is defined as:

\[\text{Ker}(A) = \{ \mathbf{x} \in \mathbb{R}^n : A \mathbf{x} = \mathbf{0} \}.\]

Note that \(\text{Col}(A)\) is a subspace of \(\mathbb{R}^m\) while \(\text{Ker}(A)\) is a subspace of \(\mathbb{R}^n\), so in general they live in different spaces. The space that pairs naturally with the null space is the row space \(\text{Col}(A^{\top})\): their intersection is trivial, i.e., \(\text{Col}(A^{\top}) \cap \text{Ker}(A) = \{\mathbf{0}\}\), and in fact \(\mathbb{R}^n\) decomposes into the orthogonal direct sum

\[\mathbb{R}^n = \text{Col}(A^{\top}) \oplus \text{Ker}(A).\]

According to the Rank-Nullity Theorem, we have:

\[\text{rank}(A) + \text{nullity}(A) = n,\]

where \(\text{rank}(A)\) is the dimension of the column space of \(A\) (which equals the dimension of the row space), and \(\text{nullity}(A)\) is the dimension of the null space of \(A\). This can be understood from the orthogonal decomposition above, since the row space is the orthogonal complement of the null space in \(\mathbb{R}^n\).
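Here is a tiny numerical illustration of the rank-nullity relation, using a made-up matrix whose third column is the sum of the first two.

```python
import numpy as np

# A 4x3 example: the third column equals the sum of the first two,
# so rank(A) = 2 and nullity(A) = 1.
A = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 2.],
              [2., 1., 3.]])

rank = np.linalg.matrix_rank(A)
nullity = A.shape[1] - rank                  # rank-nullity: rank + nullity = n
print(rank, nullity)                         # 2 1

x = np.array([1., 1., -1.])                  # a basis vector of Ker(A)
print(np.allclose(A @ x, 0))                 # True: A x = 0
```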

Projection

Definition (Projection Operator) A projection on a vector space \(V\) is a linear operator \(P: V \to V\) such that \(P^2 = P\). If \(P^2 = P = P^{\top}\), then \(P\) is called an orthogonal projection.

In other words, applying the projection operator twice yields the same result as applying it once. This fundamental property is known as idempotence.

Proposition If \(P: V \to V\) is a projection, then the eigenvalues of \(P\) must be 0 or 1.

Proof. Suppose \(\lambda\) is an eigenvalue of \(P\) with eigenvector \(\mathbf{v}\), so that \(P\mathbf{v} = \lambda \mathbf{v}\). Applying \(P\) to both sides, we get

\[P^2 \mathbf{v} = \lambda P \mathbf{v} = \lambda^2 \mathbf{v}.\]

Since \(P^2 = P\), the left-hand side equals \(P\mathbf{v} = \lambda \mathbf{v}\), so the above equation becomes

\[(\lambda^2 - \lambda) \mathbf{v} = \mathbf{0}.\]

Since an eigenvector satisfies \(\mathbf{v} \neq \mathbf{0}\), we conclude that \(\lambda = 0\) or \(\lambda = 1\). This completes the proof.
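The sketch below builds an orthogonal projection onto the span of two random vectors (an arbitrary example of my own) and checks idempotence, symmetry, and the 0/1 eigenvalue pattern numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 2))                   # basis of a 2-d subspace of R^5
P = B @ np.linalg.solve(B.T @ B, B.T)         # orthogonal projection onto Col(B)

print(np.allclose(P @ P, P))                  # idempotent
print(np.allclose(P, P.T))                    # symmetric
print(np.round(np.linalg.eigvalsh(P), 8))     # eigenvalues: three ~0 and two ~1
```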

Moore–Penrose Inverse

Definition (Moore–Penrose Inverse) For a matrix \(A \in \mathbb{R}^{m \times n}\), the Moore–Penrose inverse, also known as the pseudo-inverse, is a matrix \(A^{\dagger} \in \mathbb{R}^{n \times m}\) that satisfies the following four properties: (1) \(AA^{\dagger}A = A\) (2) \(A^{\dagger}AA^{\dagger} = A^{\dagger}\) (3) \((AA^{\dagger})^{\top} = AA^{\dagger}\) (4) \((A^{\dagger}A)^{\top} = A^{\dagger}A\).

Proposition (Unique Existence of Moore-Penrose Inverse) For any matrix \(A\), its Moore-Penrose inverse must exist and is unique.

We omit the proof here, but you can find it in many linear algebra textbooks.

For a matrix \(A\) with full column rank, the Moore-Penrose inverse can be computed as:

\[A^{\dagger} = (A^{\top} A)^{-1} A^{\top}.\]

For a matrix \(A\) with full row rank, the Moore-Penrose inverse can be computed as:

\[A^{\dagger} = A^{\top} (A A^{\top})^{-1}.\]

One can verify that the above two formulas satisfy the four properties of the Moore-Penrose inverse.
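For intuition, the four Penrose conditions and the full-column-rank formula can be checked numerically with NumPy's `np.linalg.pinv` on a random matrix (an illustrative example, not a proof).

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 3))                   # full column rank (almost surely)
A_pinv = np.linalg.pinv(A)

# The four defining properties of the Moore-Penrose inverse.
print(np.allclose(A @ A_pinv @ A, A))               # (1)
print(np.allclose(A_pinv @ A @ A_pinv, A_pinv))     # (2)
print(np.allclose((A @ A_pinv).T, A @ A_pinv))      # (3)
print(np.allclose((A_pinv @ A).T, A_pinv @ A))      # (4)

# Full-column-rank formula agrees with pinv.
print(np.allclose(A_pinv, np.linalg.solve(A.T @ A, A.T)))
```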

Projection onto Column Space

Given a matrix \(A \in \mathbb{R}^{m \times n}\), the orthogonal projection onto the column space of \(A\) can be expressed as:

\begin{equation} \label{eq:col_projection} P_A = A A^{\dagger}. \nonumber \end{equation}

To see that this is indeed an orthogonal projection, we can verify that \(P_A\) is idempotent and symmetric. Symmetry of \(P_A\) follows directly from the property (3) of the Moore-Penrose inverse. To see that \(P_A\) is idempotent, we use the property (1) of the Moore-Penrose inverse:

\[P_A^2 = (A A^{\dagger})(A A^{\dagger}) = (A A^{\dagger} A) A^{\dagger} = A A^{\dagger} = P_A.\]

Therefore, \(P_A\) is an orthogonal projection onto \(\text{Col}(A)\). Meanwhile, the orthogonal projection onto the orthogonal complement of \(\text{Col}(A)\) in \(\mathbb{R}^m\), namely the left null space \(\text{Ker}(A^{\top})\), can be expressed as: \begin{equation} \label{eq:null_projection} P_{A^{\perp}} = \mathbf{I}_m - P_A = \mathbf{I}_m - A A^{\dagger}. \nonumber \end{equation}

One can verify that \(P_{A^{\perp}}\) is also an orthogonal projection.

Geometric Interpretation of OLS

Now we go back to the geometric interpretation of OLS. Based on the OLS estimator \eqref{eq:ols_estimator}, we can express the fitted values \(\hat{\mathbf{y}}\) as:

\begin{equation} \label{eq:fitted_values} \hat{\mathbf{y}} = X \hat{\boldsymbol{\beta}} = X (X^{\top} X)^{-1} X^{\top} \mathbf{y} = H \mathbf{y}, \end{equation}

where \(H = X (X^{\top} X)^{-1} X^{\top} = X X^{\dagger}\) is known as the hat matrix.

From equation \eqref{eq:fitted_values}, we can see that the fitted values \(\hat{\mathbf{y}}\) are obtained by projecting the response vector \(\mathbf{y}\) onto \(\text{Col}(X)\). Consequently, the residuals \(\mathbf{r} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I}_n - H)\mathbf{y}\) live in \(\text{Ker}(X^{\top})\), the orthogonal complement of \(\text{Col}(X)\) in \(\mathbb{R}^n\). In particular, the residuals are orthogonal to every vector in \(\text{Col}(X)\), i.e.,

\[X^{\top} \mathbf{r} = 0.\]

This orthogonality condition is a key property of the OLS estimator and appears in the normal equations.

Here are some useful properties of the hat matrix \(H\):

  1. \(H\) and \(\mathbf{I}_n - H\) are both orthogonal projections, i.e., they are both idempotent and symmetric.
  2. The eigenvalues of \(H\) and \(\mathbf{I}_n - H\) are either 0 or 1.
  3. \(\text{tr}(H) = \text{rank}(X) = \text{dim}\left(\text{Col}(X)\right) = d\) and \(\text{tr}(\mathbf{I}_n - H) = \text{dim}\left(\text{Ker}(X^{\top})\right) = n - d\).
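These properties are straightforward to check numerically; the sketch below continues the simulated regression example from earlier.

```python
# Continuing the regression example: build the hat matrix and check its properties.
H = X @ np.linalg.solve(X.T @ X, X.T)               # H = X (X^T X)^{-1} X^T

print(np.allclose(H @ H, H), np.allclose(H, H.T))   # idempotent and symmetric
print(np.isclose(np.trace(H), d))                   # tr(H) = d
print(np.isclose(np.trace(np.eye(n) - H), n - d))   # tr(I - H) = n - d

y_hat = H @ y
resid = y - y_hat
print(np.allclose(X.T @ resid, 0))                  # residuals orthogonal to Col(X)
```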

Sum of Squares: SST, SSR, SSE

In regression analysis, the total variability in the response variable \(\mathbf{y}\) can be decomposed into two components: the variability explained by the regression model and the unexplained variability (residuals). This decomposition is often expressed in terms of sum of squares.

The total sum of squares (SST) is defined as follows:

\begin{equation} \label{eq:sst} \text{SST} = \sum_{i=1}^n (y_i - \bar{y})^2 = \lVert \mathbf{y} - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^{\top} \mathbf{y} \rVert_2^2 = \mathbf{y}^{\top} \left(\mathbf{I}_n - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^{\top}\right) \mathbf{y}, \end{equation}

where \(\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i\) is the mean of the response variable, and \(\mathbf{1}_n\) is an \(n\)-dimensional vector of ones. Note that the centering matrix \(\mathbf{I}_n - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^{\top}\) is itself an orthogonal projection matrix.

The sum of squared errors (SSE), also called the residual sum of squares, is defined as follows:

\begin{equation} \label{eq:sse} \text{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \lVert \mathbf{y} - \hat{\mathbf{y}} \rVert_2^2 = \mathbf{y}^{\top} (\mathbf{I}_n - H) \mathbf{y}. \end{equation}

The sum of squares due to regression (SSR) is defined as follows:

\begin{equation} \label{eq:ssr} \text{SSR} = \sum_{i=1}^n (\hat{y}_i - \bar{\hat{y}})^2 = \lVert \hat{\mathbf{y}} - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^{\top} \hat{\mathbf{y}} \rVert_2^2 = \hat{\mathbf{y}}^{\top} \left(\mathbf{I}_n - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^{\top}\right) \hat{\mathbf{y}} = \mathbf{y}^{\top} \left(H - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^{\top}\right) \mathbf{y}, \end{equation}

where \(\bar{\hat{y}} = \frac{1}{n} \sum_{i=1}^n \hat{y}_i\) is the mean of the fitted values. The last equality holds because \(\hat{\mathbf{y}} = H \mathbf{y}\) and \(H \mathbf{1}_n = \mathbf{1}_n\), which requires the model to include an intercept term so that \(\mathbf{1}_n\) lies in the column space of \(X\).

From equations \eqref{eq:sst}, \eqref{eq:sse}, and \eqref{eq:ssr}, together with the identity \(\mathbf{I}_n - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^{\top} = (\mathbf{I}_n - H) + \left(H - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^{\top}\right)\), we can derive the fundamental decomposition of total variability:

\[\text{SST} = \text{SSR} + \text{SSE}.\]

This decomposition states that the total variation in the response variable equals the sum of the variation explained by the regression model plus the unexplained variation (residuals).

Coefficient of Determination (R²) is defined as the proportion of the variance in the response variable that is predictable from the features. It is calculated as: \begin{equation} \label{eq:r_squared} R^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}. \nonumber \end{equation}

Another useful property of SSE is that its expectation can be expressed as:

\[\begin{aligned} \mathbb{E}[\text{SSE}] & = \mathbb{E}[\mathbf{y}^{\top} (\mathbf{I}_n - H) \mathbf{y}] \\ & = \mathbb{E}[\text{tr}((\mathbf{I}_n - H) \mathbf{y} \mathbf{y}^{\top})] \\ & = \text{tr}((\mathbf{I}_n - H) \mathbb{E}[\mathbf{y} \mathbf{y}^{\top}]) \\ & = \text{tr}((\mathbf{I}_n - H) (\sigma^2 \mathbf{I}_n + X \boldsymbol{\beta} \boldsymbol{\beta}^{\top} X^{\top})) \\ & = \sigma^2 \text{tr}(\mathbf{I}_n - H) + \text{tr}((\mathbf{I}_n - H) X \boldsymbol{\beta} \boldsymbol{\beta}^{\top} X^{\top}) \\ & = \sigma^2 (n - d) + 0 \\ & = (n - d) \sigma^2. \end{aligned}\]

Therefore, an unbiased estimator of the noise variance \(\sigma^2\) can be obtained as:

\begin{equation} \label{eq:variance_estimator} \hat{\sigma^2} = \frac{\text{SSE}}{n - d} = \frac{1}{n - d} \lVert \mathbf{y} - X \hat{\boldsymbol{\beta}} \rVert_2^2. \end{equation}
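The sketch below (continuing the simulated example, with an intercept column prepended so that \(\mathbf{1}_n \in \text{Col}(X)\)) checks the decomposition numerically and computes \(R^2\) and the unbiased variance estimate.

```python
# Continuing the example. The SST = SSR + SSE identity needs an intercept,
# so refit with a column of ones prepended to the design matrix.
X1 = np.column_stack([np.ones(n), X])
beta1, *_ = np.linalg.lstsq(X1, y, rcond=None)
y_hat1 = X1 @ beta1

sst = np.sum((y - y.mean()) ** 2)
sse = np.sum((y - y_hat1) ** 2)
ssr = np.sum((y_hat1 - y_hat1.mean()) ** 2)
print(np.isclose(sst, ssr + sse))             # True: SST = SSR + SSE

r_squared = 1.0 - sse / sst
sigma2_hat = sse / (n - X1.shape[1])          # unbiased estimate of sigma^2
print(r_squared, sigma2_hat)                  # sigma2_hat should be near 0.25
```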

Maximum Likelihood Estimation (MLE)

Under the normality assumption of the error terms, i.e., \(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\) for all \(i\), we can derive the Maximum Likelihood Estimation (MLE) of the regression coefficients \(\boldsymbol{\beta}\) and the noise variance \(\sigma^2\).

The likelihood function of the observed data \(\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n\) is given by:

\begin{equation} L(\boldsymbol{\beta}, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(y_i - \mathbf{x}_i^{\top} \boldsymbol{\beta})^2}{2 \sigma^2}\right). \nonumber \end{equation}

We usually work with the log-likelihood function, which is given by:

\begin{equation} \ell(\boldsymbol{\beta}, \sigma^2) = -\frac{n}{2} \log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \lVert \mathbf{y} - X \boldsymbol{\beta} \rVert_2^2. \nonumber \end{equation}

To find the MLE of \(\boldsymbol{\beta}\) and \(\sigma^2\), we can maximize the log-likelihood function with respect to both parameters. We can do this by taking the partial derivatives and setting them to zero.

Taking the partial derivative with respect to \(\boldsymbol{\beta}\) and \(\sigma^2\), we have:

\[\begin{aligned} \frac{\partial \ell(\boldsymbol{\beta}, \sigma^2)}{\partial \boldsymbol{\beta}} &= \frac{1}{\sigma^2} X^{\top} (\mathbf{y} - X \boldsymbol{\beta}) \\ \frac{\partial \ell(\boldsymbol{\beta}, \sigma^2)}{\partial \sigma^2} &= -\frac{n}{2 \sigma^2} + \frac{1}{2 (\sigma^2)^2} \lVert \mathbf{y} - X \boldsymbol{\beta} \rVert_2^2. \end{aligned}\]

Setting the above partial derivatives to zero, we can solve for the MLE of \(\boldsymbol{\beta}\) and \(\sigma^2\):

\[\begin{aligned} \hat{\boldsymbol{\beta}} &= (X^{\top} X)^{-1} X^{\top} \mathbf{y} \\ \hat{\sigma^2} &= \frac{1}{n} \lVert \mathbf{y} - X \hat{\boldsymbol{\beta}} \rVert_2^2. \end{aligned}\]

The MLE of \(\boldsymbol{\beta}\) is the same as the OLS estimator, while the MLE of \(\sigma^2\) is slightly different from the unbiased estimator in equation \eqref{eq:variance_estimator}. The MLE of \(\sigma^2\) is biased, but it is consistent as \(n \to \infty\).
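A short numerical comparison of the two variance estimators, continuing the simulated example without an intercept (so the design has \(d\) columns, matching the formulas above):

```python
# MLE of sigma^2 divides the SSE by n; the unbiased estimator divides by n - d.
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
sse = np.sum(resid ** 2)
sigma2_mle = sse / n
sigma2_unbiased = sse / (n - d)
print(sigma2_mle, sigma2_unbiased)            # both close to sigma**2 = 0.25
```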




