Basics in Linear Models (Part 2)
Inference in Linear Models
In the previous post, we showed that linear models can be fitted using the ordinary least squares (OLS) method. Recall that a linear model can be written as
\[\mathbf{y} = X \boldsymbol{\beta} + \boldsymbol{\epsilon},\]where \(X \in \mathbb{R}^{n \times d}\) is a full column-rank design matrix, \(\mathbf{y} \in \mathbb{R}^n\) is the response vector, and \(\boldsymbol{\epsilon}\) is the noise vector. The OLS estimate is given by
\[\begin{equation} \label{eq:ols} \hat{\boldsymbol{\beta}} = (X^{\top} X)^{-1} X^{\top} \mathbf{y}. \end{equation}\]
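As a quick sanity check, here is a minimal NumPy sketch (the design matrix, coefficients, and noise level below are made up for illustration) that computes the OLS estimate by solving the normal equations \(X^{\top} X \hat{\boldsymbol{\beta}} = X^{\top} \mathbf{y}\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
# Design matrix with an intercept column and two random predictors (illustrative values)
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
beta = np.array([1.0, 2.0, -0.5])           # true coefficients, chosen arbitrarily
sigma = 0.7                                 # true noise standard deviation
y = X @ beta + rng.normal(scale=sigma, size=n)

# Solve the normal equations instead of explicitly inverting X^T X
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                             # should be close to the true beta
```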
Test for a Single Coefficient (t-test)

The OLS estimate is unbiased even without assuming normality of the noise, i.e., \(\mathbb{E}[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}\). If we further assume that the noise is normally distributed, i.e., \(\boldsymbol{\epsilon} \sim \mathcal{N}(0, \sigma^2 \mathbf{I}_n)\), then we can derive the distribution of the OLS estimates.
The OLS estimates \(\hat{\boldsymbol{\beta}}\) are normally distributed with mean \(\boldsymbol{\beta}\) and variance \(\sigma^2 (X^{\top} X)^{-1}\), i.e.,
\[\hat{\boldsymbol{\beta}} \sim \mathcal{N} \left(\boldsymbol{\beta}, \sigma^2 (X^{\top} X)^{-1}\right).\]To see this, we can rewrite the OLS estimates \eqref{eq:ols} as
\[\begin{aligned} \hat{\boldsymbol{\beta}} &= (X^{\top} X)^{-1} X^{\top} \mathbf{y} \\ &= (X^{\top} X)^{-1} X^{\top} (X \boldsymbol{\beta} + \boldsymbol{\epsilon}) \\ &= (X^{\top} X)^{-1} X^{\top} X \boldsymbol{\beta} + (X^{\top} X)^{-1} X^{\top} \boldsymbol{\epsilon} \\ &= \boldsymbol{\beta} + (X^{\top} X)^{-1} X^{\top} \boldsymbol{\epsilon}. \end{aligned}\]Since \(\boldsymbol{\epsilon} \sim \mathcal{N}(0, \sigma^2 \mathbf{I}_n)\), we have
\[\begin{aligned} (X^{\top} X)^{-1} X^{\top} \boldsymbol{\epsilon} &\sim \mathcal{N} \left(0, \sigma^2 (X^{\top} X)^{-1} X^{\top} \mathbf{I}_n X (X^{\top} X)^{-1}\right) \\ &= \mathcal{N} \left(0, \sigma^2 (X^{\top} X)^{-1}\right). \end{aligned}\]It is now clear that the OLS estimates \(\hat{\boldsymbol{\beta}}\) are normally distributed with mean \(\boldsymbol{\beta}\) and variance \(\sigma^2 (X^{\top} X)^{-1}\).
Or equivalently, we have
\[\frac{1}{\sigma} \left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \sim \mathcal{N} \left(0, (X^{\top} X)^{-1}\right).\]Recall the unbiased estimator of the variance \(\sigma^2\) is given by
\[\begin{equation} \label{eq:sigma2} \hat{\sigma^2} = \frac{1}{n - d} \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{1}{n - d} \lVert \mathbf{y} - X \hat{\boldsymbol{\beta}} \rVert_2^2. \end{equation}\]
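As an illustration of why the denominator is \(n - d\) rather than \(n\), the following sketch (simulated data with arbitrarily chosen parameters) checks empirically that the estimator above is unbiased for \(\sigma^2\):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma = 50, 3, 0.7
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
beta = np.array([1.0, 2.0, -0.5])

sigma2_hats = []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    sigma2_hats.append(resid @ resid / (n - d))   # divide by n - d, not n

print(np.mean(sigma2_hats), sigma**2)             # the average should be close to sigma^2
```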
We also want to derive the distribution of \(\hat{\sigma^2}\). We can rewrite the residuals as
\[\begin{aligned} \mathbf{y} - X \hat{\boldsymbol{\beta}} &= \mathbf{y} - H \mathbf{y} \\ &= (\mathbf{I}_n - H) \mathbf{y} \\ &= (\mathbf{I}_n - H) (X \boldsymbol{\beta} + \boldsymbol{\epsilon}) \\ &= (\mathbf{I}_n - H) \boldsymbol{\epsilon}, \end{aligned}\]where \(H = X (X^{\top} X)^{-1} X^{\top}\) is the hat matrix and the last equality uses \(H X = X\). To derive the distribution of the resulting quadratic form, we need the following proposition.

Proposition. Let \(\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_p)\) and let \(A\) be an orthogonal projection matrix with \(\text{rank}(A) = r \leq p\). Then the quadratic form \(Y = \mathbf{x}^{\top} A \mathbf{x}\) follows a chi-squared distribution with \(r\) degrees of freedom, that is, \(Y \sim \chi^2_r\).
Proof. Since \(A\) is an orthogonal projection matrix, \(A\) is symmetric and idempotent, i.e., \(A = A^{\top}\) and \(A^2 = A\). Due to the symmetry of \(A\), we can diagonalize \(A\) as
\[A = Q \Lambda Q^{\top},\]where \(Q\) is an orthogonal matrix and \(\Lambda\) is a diagonal matrix containing the eigenvalues of \(A\). Since \(A\) is idempotent, its eigenvalues are either 0 or 1. Without loss of generality, order them so that \(\lambda_1 = \cdots = \lambda_r = 1\) are the non-zero eigenvalues and \(\lambda_{r+1} = \cdots = \lambda_p = 0\) are the zero eigenvalues.
Define \(\mathbf{z} = Q^{\top} \mathbf{x}\). Then \(\mathbf{z} \sim \mathcal{N}(\mathbf{0}, Q^{\top} \mathbf{I}_p Q) = \mathcal{N}(\mathbf{0}, \mathbf{I}_p)\), a consequence of the rotational (unitary) invariance of the standard normal distribution.
Thus, we can rewrite the quadratic form as
\[Y = \mathbf{x}^{\top} A \mathbf{x} = \mathbf{z}^{\top} \Lambda \mathbf{z} = \sum_{i=1}^r z_i^2,\]where \(z_i\) are the components of the vector \(\mathbf{z}\). Since \(z_i \sim \mathcal{N}(0, 1)\), we have \(z_i^2 \sim \chi^2_1\). Therefore,
\[Y = \sum_{i=1}^r z_i^2 \sim \chi^2_r,\]since the sum of \(r\) independent \(\chi^2_1\) random variables follows a \(\chi^2_r\) distribution. This completes the proof.
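The proposition is easy to check numerically. The sketch below (with an arbitrary rank-4 projection in \(\mathbb{R}^{10}\)) builds a projection matrix onto a random subspace and compares the empirical distribution of the quadratic form with \(\chi^2_r\):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
p, r = 10, 4
# Projection onto the column space of a random p x r matrix B (rank r almost surely)
B = rng.normal(size=(p, r))
A = B @ np.linalg.inv(B.T @ B) @ B.T          # symmetric and idempotent

x = rng.normal(size=(20000, p))               # rows are draws from N(0, I_p)
Y = np.einsum("ij,jk,ik->i", x, A, x)         # quadratic forms x^T A x

print(Y.mean())                               # should be close to r = 4 (mean of chi^2_r)
print(stats.kstest(Y, stats.chi2(df=r).cdf))  # should not reject chi^2_r
```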
Now we can apply the proposition to derive the distribution of \(\hat{\sigma^2}\). Since \(\mathbf{I}_n - H\) is an orthogonal projection matrix with rank \(n - d\) and \(\boldsymbol{\epsilon} / \sigma \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_n)\), we have
\[\begin{aligned} \frac{(n-d)\hat{\sigma^2}}{\sigma^2} &= \frac{1}{\sigma^2} \lVert \mathbf{y} - X \hat{\boldsymbol{\beta}} \rVert_2^2 \\ &= \frac{1}{\sigma^2} \lVert (\mathbf{I}_n - H) \boldsymbol{\epsilon} \rVert_2^2 \\ &= \frac{1}{\sigma^2} \boldsymbol{\epsilon}^{\top} (\mathbf{I}_n - H)^{\top} (\mathbf{I}_n - H) \boldsymbol{\epsilon} \\ &= \frac{1}{\sigma^2} \boldsymbol{\epsilon}^{\top} (\mathbf{I}_n - H) \boldsymbol{\epsilon} \\ &\sim \chi^2_{n - d}, \end{aligned}\]where the second-to-last equality uses the symmetry and idempotence of \(\mathbf{I}_n - H\). We also want to show that \(\frac{1}{\sigma} \left(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)\) is independent of \(\frac{(n-d)\hat{\sigma^2}}{\sigma^2}\). To show this, it suffices to show that \(\mathbf{y} - X\hat{\boldsymbol{\beta}}\) is independent of \(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\). We have \(\mathbf{y} - X\hat{\boldsymbol{\beta}} = (\mathbf{I}_n - H) \boldsymbol{\epsilon}\) and \(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} = (X^{\top} X)^{-1} X^{\top} \boldsymbol{\epsilon}\).
Then we can compute the covariance between \(\hat{\boldsymbol{\beta}}-\boldsymbol{\beta}\) and \(\mathbf{y}-X\hat{\boldsymbol{\beta}}\) as follows:
\[\begin{aligned} & \mathbb{E} \left[ \left( \hat{\boldsymbol{\beta}}-\boldsymbol{\beta} \right) \left( \mathbf{y}-X\hat{\boldsymbol{\beta}} \right)^{\top} \right] \\ &= \mathbb{E} \left[ (X^{\top} X)^{-1} X^{\top} \boldsymbol{\epsilon} \boldsymbol{\epsilon}^{\top} (\mathbf{I}_n - H) \right] \\ &= (X^{\top} X)^{-1} X^{\top} \, \mathbb{E}\left[ \boldsymbol{\epsilon} \boldsymbol{\epsilon}^{\top} \right] (\mathbf{I}_n - H) \\ &= \sigma^2 (X^{\top} X)^{-1} X^{\top} (\mathbf{I}_n - H) \\ &= 0, \end{aligned}\]where the last equality holds because \(X^{\top} (\mathbf{I}_n - H) = 0\). Since \(\hat{\boldsymbol{\beta}}-\boldsymbol{\beta}\) and \(\mathbf{y}-X\hat{\boldsymbol{\beta}}\) are jointly normally distributed, zero covariance implies that they are independent. Therefore, \(\hat{\boldsymbol{\beta}}-\boldsymbol{\beta}\) and \(\frac{(n-d)\hat{\sigma^2}}{\sigma^2}\) are independent.
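As a quick empirical check consistent with this independence (a simulation sketch with arbitrary parameters, not part of the derivation), the sample correlation between \(\hat{\beta}_k\) and \(\hat{\sigma^2}\) across repeated datasets should be close to zero:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, sigma = 50, 3, 0.7
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
beta = np.array([1.0, 2.0, -0.5])

beta_hats, sigma2_hats = [], []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    beta_hats.append(beta_hat[1])                 # track one coefficient, e.g. k = 1
    sigma2_hats.append(resid @ resid / (n - d))

# The correlation between beta_hat_k and sigma2_hat should be approximately zero
print(np.corrcoef(beta_hats, sigma2_hats)[0, 1])
```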
Definition (Student’s t-distribution) The Student’s t-distribution with \(\nu\) degrees of freedom is defined as
\[T = \frac{Z}{\sqrt{V/\nu}} \sim t_{\nu},\]where \(Z \sim \mathcal{N}(0, 1)\) and \(V \sim \chi^2_{\nu}\) are independent.
Let \(Z = \frac{\hat{\beta}_k - \beta_k}{\sigma \sqrt{{(X^{\top} X)^{-1}}_{k,k}}}\), where \(\hat{\beta}_k\) is the \(k\)-th component of the OLS estimate \(\hat{\boldsymbol{\beta}}\) and \({(X^{\top} X)^{-1}}_{k,k}\) is the \(k\)-th diagonal element of the matrix \((X^{\top} X)^{-1}\). Since \(\hat{\beta}_k \sim \mathcal{N}\left(\beta_k, \sigma^2 {(X^{\top} X)^{-1}}_{k,k}\right)\), we have \(Z \sim \mathcal{N}(0, 1)\). Let \(V = \frac{(n-d)\hat{\sigma^2}}{\sigma^2}\). We have shown that \(Z \sim \mathcal{N}(0, 1)\) and \(V \sim \chi^2_{n-d}\) are independent. Therefore, we can conclude that
\[\begin{equation} \label{eq:student} T = \frac{Z}{\sqrt{V/(n-d)}} = \frac{\hat{\beta}_k - \beta_k}{\sqrt{\hat{\sigma^2} {(X^{\top} X)^{-1}}_{k,k}}} \sim t_{n-d}. \end{equation}\]We usually call the denominator \(\sqrt{\hat{\sigma^2} {(X^{\top} X)^{-1}}_{k,k}}\) the standard error of the OLS estimate \(\hat{\beta}_k\), denoted as \(\text{se}(\hat{\beta}_k)\).
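The following sketch (again on simulated data with made-up parameters) verifies the pivotal distribution \eqref{eq:student} empirically, comparing simulated \(T\) statistics for one coefficient with the \(t_{n-d}\) distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, d, sigma, k = 30, 3, 0.7, 1
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
beta = np.array([1.0, 2.0, -0.5])
XtX_inv = np.linalg.inv(X.T @ X)

T_stats = []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - d)
    se_k = np.sqrt(sigma2_hat * XtX_inv[k, k])        # standard error of beta_hat_k
    T_stats.append((beta_hat[k] - beta[k]) / se_k)

print(stats.kstest(T_stats, stats.t(df=n - d).cdf))   # should not reject t_{n-d}
```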
Now we can perform a hypothesis test for the \(k\)-th coefficient \(\beta_k\). The null hypothesis and the alternative hypothesis are given by
\[\begin{aligned} H_0: \beta_k &= c, \\ H_a: \beta_k &\neq c. \end{aligned}\]The test statistic is given by
\[T = \frac{\hat{\beta}_k - c}{\text{se}(\hat{\beta}_k)},\]which follows a \(t_{n-d}\) distribution under \(H_0\). We reject the null hypothesis at significance level \(\alpha\) if \(|T| > t_{n-d, \alpha/2}\), where \(t_{n-d, \alpha/2}\) is the upper \(\alpha/2\) quantile of the \(t_{n-d}\) distribution. The p-value can be computed as \(p = 2\, \mathbb{P}(t_{n-d} > |T_{\text{obs}}|)\), where \(T_{\text{obs}}\) is the observed value of the test statistic.
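Putting everything together, here is a minimal sketch of the t-test for each coefficient on simulated data (all parameters below are illustrative), using SciPy's \(t\) distribution for the p-value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, d = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
beta = np.array([1.0, 2.0, 0.0])               # the last coefficient is truly zero
y = X @ beta + rng.normal(scale=0.7, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - d)
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))    # standard errors of all coefficients

c = 0.0                                        # H_0: beta_k = c = 0 for each coefficient
T = (beta_hat - c) / se
p_values = 2 * stats.t.sf(np.abs(T), df=n - d) # two-sided p-values
print(np.column_stack([beta_hat, se, T, p_values]))
```

In practice one would typically rely on an existing implementation, such as the coefficient table produced by statsmodels' OLS or R's lm, which reports the same estimates, standard errors, t statistics, and p-values.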
Test for Multiple Coefficients (F-test)