Linear Regression in Vector-Matrix Form

A Concise and General Approach

Vladislav Morozov

Introduction

Learning Outcomes

This lecture is about a vector-matrix approach to linear regression


By the end, you should be able to

  • Represent linear regression in scalar, vector, and matrix forms
  • Derive the explicit form of the OLS estimator

Textbook References

  • Linear algebra refresher:
    • Appendix D in Wooldridge (2020)
    • Easier treatment with more examples: chapters 1-2 in Strang (2016)
    • Quicker discussion: Kaye and Wilson (1998)
  • Vector treatment of linear regression: E1 (except E-1a) in Wooldridge (2020)

Motivation

Reminder: Multivariate Regression

Reminder: Linear Regression Model

\[ Y_i = \beta_1 + \beta_2 X_{i2} + \beta_3 X_{i3} + \dots + \beta_k X_{ik} + U_i, \] where

  • \(Y_i\) — outcome or dependent variable
  • \(X_{i2}, \dots, X_{ik}\) — regressors, covariates, or independent variables
  • \(U_i\) — error, shock, or innovation
  • \(\beta_1\) — intercept term
  • \(\beta_2, \dots, \beta_k\) — slope parameters

Reminder: OLS Estimator

Ordinary Least Squares (OLS): \[ \begin{aligned} (\hat{\beta}_1, \dots, \hat{\beta}_k) = \argmin_{b_1, b_2, \dots, b_k} \sum_{i=1}^N \left( Y_i - b_1 - b_2 X_{i2} - \dots - b_k X_{ik}\right)^2 \end{aligned} \]

In words: find the coefficients which minimize the sum of squared differences between \(Y_i\) and linear combinations of \(1, X_{i2}, \dots, X_{ik}\)
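A minimal sketch of this definition in code (simulated data; all variable names below are illustrative): treat OLS literally as minimizing the sum of squared differences with a numerical optimizer.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N = 200
x2 = rng.normal(size=N)
y = 1.0 + 2.0 * x2 + rng.normal(size=N)           # true coefficients (1, 2)

def sum_of_squares(b):
    # sum over i of (Y_i - b_1 - b_2 X_i2)^2
    return np.sum((y - b[0] - b[1] * x2) ** 2)

result = minimize(sum_of_squares, x0=np.zeros(2))
print(result.x)                                   # numerical minimizer, close to (1, 2)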

Reminder: When is the OLS Estimator Defined?

OLS estimator well-defined if

  1. No strict multicollinearity
  2. Every \(X_{ij}\) varies in sample for \(j>1\)

Remember: can always define OLS estimator even if \(Y_i\) does not depend linearly on covariates. OLS is just a method

Scalar Representation and Its Issues

Scalar Representation

Definition 1 Linear regression model \[ Y_i = \beta_1 + \beta_2 X_{i2} + \beta_3 X_{i3} + \dots + \beta_k X_{ik} + U_i, \] is said to be in scalar form.

“Scalar” means

  • No matrices
  • Every covariate is written out individually

Pros of Scalar Representation

Scalar representation is mainly used when we care about the individual regressors

Usual case: when presenting estimated equations.

Example: regression of wages on education and experience: \[ \widehat{\log(wage)} = \underset{(0.094)}{0.284} + \underset{(0.032)}{0.092}\times Education + \dots \] Standard errors in parentheses below the estimates

Cons of Scalar Representation

Scalar representation has downsides:

  1. Unnecessarily long: if you just have “anonymous” \((X_{i2}, \dots, X_{ik})\), why bother writing them out?
  2. No explicit formula for OLS estimator
  3. Inconvenient to program

Is There a Solution?

Yes! The vector form \[ Y_i = \bX_i'\bbeta + U_{i} \] and the matrix form \[ \bY = \bX\bbeta + \bU \]

This lecture is about working with these forms and using them to derive the OLS estimator.

Vector and Matrix Forms of Regression

Vector Form

Model

Our model in this lecture: \[ \begin{aligned} Y_{i} & = \beta_1 X_{i1} + \dots + \beta_k X_{ik} + U_{i}\\ & = \sum_{j=1}^k \beta_j X_{ij} + U_{i} \end{aligned} \]

Here \(X_{i1}\) may be \(X_{i1} = 1\) if you want to include an intercept

Vector of Covariates and Coefficients

Our first problem: writing out \(X_{i1}, \dots, X_{ik}\) every time


Why not collect the covariates into a single vector \(\bX_i\) and the coefficients into a vector \(\bbeta\)? We write both as column \(k\)-vectors (\(k\times 1\) matrices):

\[ \bX_i = \begin{pmatrix} X_{i1} \\ X_{i2} \\ \vdots \\ X_{ik} \end{pmatrix}, \quad \bbeta = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix} \]

Reminder: Transposition

Recall the transposition operator: \[ \bX_i = \begin{pmatrix} X_{i1} \\ X_{i2} \\ \vdots \\ X_{ik} \end{pmatrix}, \quad \bX'_i = \begin{pmatrix} X_{i1}, X_{i2}, \dots, X_{ik} \end{pmatrix} \] \(\bX'_i\) is read “\(\bX_i\) transpose”. Sometimes also labeled as \(\bX_i^T\).

Combining \(\bX_i\) and \(\bbeta\)

Need to combine \(\bX_i\) and \(\bbeta\) to obtain \(\sum_{j=1}^k \beta_j X_{ij}\)

Using rules of matrix multiplication, obtain exactly \[ \bX_i'\bbeta = \sum_{j=1}^k \beta_j X_{ij} \] Note that \(\bX_i'\bbeta = \bbeta'\bX_i\) (why?)

Vectors are column vectors by default; we will transpose when necessary — standard approach. Careful with Wooldridge (2020): he mixes row and column vectors to avoid transposes.
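A small numerical illustration (the numbers are made up): the inner product \(\bX_i'\bbeta\) is the sum \(\sum_{j=1}^k \beta_j X_{ij}\), and swapping the factors gives the same scalar.

import numpy as np

x_i = np.array([1.0, 38.0, 4.52])      # one observation, e.g. (1, AGE, INCOME)
beta = np.array([0.5, 0.1, 2.0])       # hypothetical coefficients

print(x_i @ beta)                      # X_i' beta
print(beta @ x_i)                      # beta' X_i, the same scalar
print(np.sum(beta * x_i))              # the explicit sum over j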

Vector Form of Linear Regression

Can now combine \(\bX_i'\bbeta\) with \(U_i\) to get \(Y_i\)

Definition 2 Linear regression model \[ Y_i = \bX_i'\bbeta + U_i \] is said to be in vector form.

Matrix Form

Stacking Observations

Vector form — convenient representation of model for a single observation \(i\): \[ Y_i = \bX_i'\bbeta + U_i \]

What if we want to represent the model for the whole sample?

Tabular Data Form

Why and how? Think of how we store data, e.g., in pandas:

import pandas as pd
import statsmodels.api as sm

credit_card_df = sm.datasets.ccard.load_pandas().data
credit_card_df.head(2)
AVGEXP AGE INCOME INCOMESQ OWNRENT
0 124.98 38.0 4.52 20.4304 1.0
1 9.85 33.0 2.42 5.8564 0.0

Each row is an observation \(i\), variables stored in different columns — data as a table

Stacking Observations I

(Flat) tabular form

  • Usually convenient in practice
  • Also convenient in theory!

To replicate this structure, put the outcomes in a vector \(\bY\): \[ \bY = \begin{pmatrix} Y_1\\ \vdots\\ Y_N \end{pmatrix} \]

Stacking Observations II

Recall vector form: \(Y_i = \bX_i'\bbeta + U_i\). Stacking left-hand sides also stacks right-hand sides: \[ \begin{pmatrix} Y_1\\ \vdots\\ Y_N \end{pmatrix} = \begin{pmatrix} \bX_1'\bbeta + U_1\\ \vdots \\ \bX_N'\bbeta + U_N \end{pmatrix} \tag{1}\] Here \(N\) is the number of observations (rows)

Matrix of Covariates I

Next, define the \(N\times k\) matrix \(\bX\) (note: no \(i\) index) \[ \bX = \begin{pmatrix} \bX_1'\\ \bX_2'\\ \vdots\\ \bX_N' \end{pmatrix} = \begin{pmatrix} X_{11} & X_{12} & \cdots & X_{1k} \\ X_{21} & X_{22} & \cdots & X_{2k}\\ \vdots & \vdots & \ddots & \vdots\\ X_{N1} & X_{N2} & \cdots & X_{Nk} \end{pmatrix} \]

Matrix of Covariates II

\(\bX\) — precisely a table of covariates:

  1. A row is an individual observation: the \(i\)th row contains the values of the covariates for the \(i\)th observation (it corresponds to \(\bX_i\)!)
  2. A column is all observations of a given variable: \(j\)th column is \((X_{1j}, X_{2j}, \dots X_{Nj})\)

statsmodels often uses the term exog for \(\bX\). scikit-learn simply uses X
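As a sketch (continuing the credit card data loaded above; taking AVGEXP as the outcome and AGE, INCOME as covariates is purely illustrative), \(\bY\) and \(\bX\) can be built directly from the dataframe:

import statsmodels.api as sm

# assumes credit_card_df from the earlier slide is in memory
y_vec = credit_card_df["AVGEXP"]                            # outcome vector Y
X_mat = sm.add_constant(credit_card_df[["AGE", "INCOME"]])  # X: intercept column + covariates

print(X_mat.shape)    # (N, k): one row per observation, one column per covariate
X_mat.head(2)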

Vector of Residuals

Define the column vector \(\bU\) of errors through its transpose \[ \bU' = \begin{pmatrix} U_1 & U_2 & \dots & U_N \end{pmatrix} \]

Books and articles often define column vectors using their transposes — this saves vertical space

Splitting The Right-Hand Side

Can go back to Equation 1. Using matrix multiplication rules and \(\bU\), can write \[ \begin{pmatrix} \bX_1'\bbeta + U_1\\ \vdots \\ \bX_N'\bbeta + U_N \end{pmatrix} = \begin{pmatrix} \bX_1'\bbeta \\ \vdots \\ \bX_N'\bbeta \end{pmatrix} + \bU = \bX\bbeta + \bU \]

Check this! Write out each part of the above equalities in detail

Matrix Form of Linear Regression

Adding back the left-hand side gives us:

Definition 3 Linear regression model \[ \bY = \bX\bbeta + \bU \] is said to be in matrix form. Here \(\bX\) is an \(N\times k\) matrix, \(\bbeta\) is a \(k\)-vector (\(k\times 1\) matrix), and \(\bY\) and \(\bU\) are \(N\)-vectors (\(N\times 1\) matrices)

Matrix form — combines all sample points into one equation. Solves our third problem: a convenient form for computers
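A tiny numerical check (hypothetical numbers): stacking the rows \(\bX_i'\) into \(\bX\) means that \(\bX\bbeta\) computes every \(\bX_i'\bbeta\) at once.

import numpy as np

X = np.array([[1.0, 38.0, 4.52],
              [1.0, 33.0, 2.42]])      # N = 2 observations, k = 3 covariates
beta = np.array([0.5, 0.1, 2.0])       # hypothetical coefficients
U = np.array([1.0, -1.0])              # hypothetical errors

Y = X @ beta + U                       # matrix form: Y = X beta + U
print(Y)
print(X[0] @ beta + U[0], X[1] @ beta + U[1])   # same values, row by row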

The OLS Estimator

Derivation

Estimation Problem

Our last remaining goal: convenient expression for the OLS estimator \(\hat{\bbeta}\): \[ \hat{\bbeta} = (\hat{\beta}_1, \hat{\beta}_2, \dots, \hat{\beta}_k)' \]

Recall that \(\hat{\bbeta}\) is defined through optimization \[ \hat{\bbeta} = \argmin_{b_1, b_2, \dots, b_k}\sum_{i=1}^N \left( Y_i - \sum_{j=1}^k b_j X_{ij}\right)^2 \]

Approach to Optimization

Objective function is differentiable in \((b_1, \dots, b_k)\)

\(\Rightarrow\) our strategy:

  1. Differentiate with respect to the arguments. Set the derivative equal to 0 (first order conditions)
  2. Hope that
    • There is a unique solution to the first order conditions
    • The solution actually minimizes the function (instead of maximizing or being a saddle point)

Derivative With Respect to \(b_j\)

Derivative with respect to \(b_j\) is given by \[ \begin{aligned} & -2\sum_{i=1}^N X_{ij} \left(Y_i - \sum_{l=1}^k b_l X_{il}\right)\\ & = -2\sum_{i=1}^N X_{ij} \left(Y_i - \bX_i'\bb\right) \end{aligned} \]

First Order Conditions (FOCs)

Any optimizer \(\hat{\bbeta} = (\hat{\beta}_1, \dots, \hat{\beta}_k)\) must satisfy the FOC system \[ \begin{cases} -2\sum_{i=1}^N X_{i1} \left(Y_i - \bX_i'\hat{\bbeta} \right) = 0\\ \vdots \\ -2\sum_{i=1}^N X_{ik} \left(Y_i - \bX_i'\hat{\bbeta} \right) = 0 \end{cases} \]

In words, all partial derivatives of the objective function must be zero at \(\hat{\bbeta}\)

Normal Equations

Notice (check this!) that the FOCs can be compactly written as \[ \bX'(\bY - \bX\hat{\bbeta}) = 0 \tag{2}\]

Definition 4 Equation 2 is called the normal equations.

Normal Equations as a Linear System

Can write Equation 2 as \[ (\bX'\bX)\hat{\bbeta} = \bX'\bY \]

A standard linear system of equations of the form \(\bA\hat{\bbeta} = \bc\) with \(\bA = \bX'\bX\) and \(\bc= \bX'\bY\)

  • \(\bX'\bX\) is a \(k\times k\) matrix (what are its elements?)
  • System has \(k\) unknowns

Solving the Normal Equations

\(k\) equations in \(k\) unknowns — the system has one or infinitely many solutions, depending on the rank of \(\bX'\bX\)

  • Maximum rank = unique solution
  • Less than maximum rank = infinitely many solutions

Recall: if \(\bA\) is a square matrix with maximal rank, \(\bA^{-1}\) exists
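A minimal sketch on the credit card data (the choice of columns is illustrative): np.linalg.solve handles the \(k\times k\) system \((\bX'\bX)\hat{\bbeta} = \bX'\bY\) directly, without forming an inverse.

import numpy as np
import statsmodels.api as sm

df = sm.datasets.ccard.load_pandas().data
X = sm.add_constant(df[["AGE", "INCOME"]]).to_numpy()
y = df["AVGEXP"].to_numpy()

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solve (X'X) b = X'Y
print(beta_hat)                                # one coefficient per column of X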

Objective Function and \(\bX'\bX\)

Vector Formula for OLS Estimator

Proposition 1 (OLS Estimator Formula) Let \(\bX'\bX\) have rank \(k\). Then \((\bX'\bX)^{-1}\) exists and the OLS estimator is given by \[ \hat{\bbeta} = (\bX'\bX)^{-1}\bX'\bY \]

Exercise: check that the above formula agrees with the simple formula involving sample covariance and variance when \(k=2\) and \(X_{i1}=1\)
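As a quick sanity check (same illustrative credit card specification as above), the closed-form expression matches what statsmodels' OLS routine reports:

import numpy as np
import statsmodels.api as sm

df = sm.datasets.ccard.load_pandas().data
X = sm.add_constant(df[["AGE", "INCOME"]])
y = df["AVGEXP"]
Xn, yn = X.to_numpy(), y.to_numpy()

beta_formula = np.linalg.inv(Xn.T @ Xn) @ Xn.T @ yn   # (X'X)^{-1} X'Y
beta_sm = sm.OLS(y, X).fit().params.to_numpy()        # statsmodels estimates

print(beta_formula)
print(beta_sm)        # identical up to floating point error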

Extra Material

Vector and Matrix Forms of Objective Function

Can do the derivation more efficiently: \[ \begin{aligned} & \sum_{i=1}^N \left( Y_i - \sum_{j=1}^k b_j X_{ij}\right)^2 = \sum_{i=1}^N \left( Y_i -\bX_i'\bb\right)^2\\ & = (\bY-\bX\bb)'(\bY-\bX\bb) = \bY'\bY - 2\bY'\bX\bb + \bb'\bX'\bX\bb \end{aligned} \] where \(\bb=(b_1, \dots, b_k)'\)

Matrix First Order Conditions

You can differentiate with respect to the whole vector \(\bb\); see Wikipedia

Vector form of first order condition \[ \dfrac{\partial (\bY'\bY - 2\bY'\bX\bb + \bb'\bX'\bX\bb)}{\partial \bb} = -2\bX'\bY + 2\bX'\bX\bb \]

Setting the derivative equal to zero and rearranging again gives \[ \bX'\bX\hat{\bbeta} = \bX'\bY \]
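As a sketch, the matrix first order condition can also be checked numerically: at \(\hat{\bbeta}\), the vector derivative should be zero (credit card data again; the specification is illustrative).

import numpy as np
import statsmodels.api as sm

df = sm.datasets.ccard.load_pandas().data
X = sm.add_constant(df[["AGE", "INCOME"]]).to_numpy()
y = df["AVGEXP"].to_numpy()

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
gradient = -2 * X.T @ y + 2 * X.T @ X @ beta_hat   # -2 X'Y + 2 X'X b at b = beta_hat
print(gradient)                                    # zeros up to floating point error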

Recap and Conclusions

Recap

In this lecture we

  1. Introduced vector and matrix forms of linear regression
  2. Derived the OLS estimator in vector and matrix forms

Next Questions

  1. When is \(\bX'\bX\) invertible (= full rank)?
  2. What are the statistical properties of \(\hat{\bbeta}\)?
  3. How to extend the vector approach to other familiar estimators, such as IV with several covariates and instruments?

References

Kaye, Richard, and Robert Wilson. 1998. Linear Algebra. Oxford Science Publications. Oxford: Oxford University Press.
Strang, Gilbert. 2016. Introduction to Linear Algebra. 5th edition. Wellesley, MA: Wellesley-Cambridge Press.
Wooldridge, Jeffrey M. 2020. Introductory Econometrics: A Modern Approach. 7th edition. Boston, MA: Cengage.