A Concise and General Approach
This lecture is about a vector-matrix approach to linear regression
By the end, you should be able to write the linear regression model in vector and matrix form and use these forms to derive the OLS estimator
\[ Y_i = \beta_1 + \beta_2 X_{i2} + \beta_3 X_{i3} + \dots + \beta_k X_{ik} + U_i, \] where \(Y_i\) is the outcome variable, \(X_{i2}, \dots, X_{ik}\) are the covariates (regressors), \(\beta_1, \dots, \beta_k\) are the coefficients (with \(\beta_1\) the intercept), and \(U_i\) is the unobserved error term
Ordinary Least Squares (OLS): \[ \begin{aligned} (\hat{\beta}_1, \dots, \hat{\beta}_k) = \argmin_{b_1, b_2, \dots, b_k} \sum_{i=1}^N \left( Y_i - b_1 - b_2 X_{i2} - \dots - b_k X_{ik}\right)^2 \end{aligned} \]
In words: find the coefficients which minimize the sum of squared differences between \(Y_i\) and linear combinations of \(1, X_{i2}, \dots, X_{ik}\)
OLS estimator well-defined if the minimization problem has a unique solution, which requires that no regressor be a perfect linear combination of the others
Remember: can always define OLS estimator even if \(Y_i\) does not depend linearly on covariates. OLS is just a method for fitting a linear approximation
Definition 1 Linear regression model \[ Y_i = \beta_1 + \beta_2 X_{i2} + \beta_3 X_{i3} + \dots + \beta_k X_{ik} + U_i, \] is said to be in scalar form.
“Scalar” means that each regressor and coefficient is written out as an individual (scalar) term, rather than collected into vectors or matrices
Scalar representation used only when we care about the individual regressors
Usual case: when presenting estimated equations.
Example: regression of wages on education and experience: \[ \widehat{\log(wage)} = \underset{(0.094)}{0.284} + \underset{(0.032)}{0.092}\times Education + \dots \] Standard errors in parentheses below the estimates
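To see where such numbers come from in code, here is a minimal sketch using the statsmodels formula API on the ccard data that appears later in this lecture; the choice of regressing AVGEXP on AGE and INCOME is ours, purely for illustration (it is not the wage regression above).

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# load the credit card expenditure data and fit a small regression
ccard = sm.datasets.ccard.load_pandas().data
fit = smf.ols("AVGEXP ~ AGE + INCOME", data=ccard).fit()

print(fit.params)  # point estimates (the numbers above the parentheses)
print(fit.bse)     # standard errors (the numbers inside the parentheses)
```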
Scalar representation has downsides: (1) writing out all the regressors every time is tedious, (2) it describes only one observation at a time rather than the whole sample, and (3) it is not a convenient form for computer implementations
Is there a more compact representation? Yes! The vector form \[ Y_i = \bX_i'\bbeta + U_{i} \] and the matrix form \[ \bY = \bX\bbeta + \bU \]
This lecture is about working with these forms and using them to derive the OLS estimator.
Our model in this lecture: \[ \begin{aligned} Y_{i} & = \beta_1 X_{i1} + \dots + \beta_k X_{ik} + U_{i}\\ & = \sum_{j=1}^k \beta_j X_{ij} + U_{i} \end{aligned} \]
Here \(X_{i1}\) may be \(X_{i1} = 1\) if you want to include an intercept
Our first problem: writing out \(X_{i1}, \dots, X_{ik}\) every time
Why not combine the covariates into a single vector \(\bX_i\) and coefficients into a vector \(\bbeta\)? Combine them into column \(k\)-vectors (\(k\times 1\) matrices):
\[ \bX_i = \begin{pmatrix} X_{i1} \\ X_{i2} \\ \vdots \\ X_{ik} \end{pmatrix}, \quad \bbeta = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix} \]
Recall the transposition operator: \[ \bX_i = \begin{pmatrix} X_{i1} \\ X_{i2} \\ \vdots \\ X_{ik} \end{pmatrix}, \quad \bX'_i = \begin{pmatrix} X_{i1}, X_{i2}, \dots, X_{ik} \end{pmatrix} \] \(\bX'_i\) is read “\(\bX_i\) transpose”. Sometimes also labeled as \(\bX_i^T\).
Need to combine \(\bX_i\) and \(\bbeta\) to obtain \(\sum_{j=1}^k \beta_j X_{ij}\)
Using rules of matrix multiplication, obtain exactly \[ \bX_i'\bbeta = \sum_{j=1}^k \beta_j X_{ij} \] Note that \(\bX_i'\bbeta = \bbeta'\bX_i\) (why?)
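A quick numeric sanity check of this identity; the numbers below are made up for illustration.

```python
import numpy as np

x_i = np.array([1.0, 38.0, 4.52])   # one observation's covariates, with X_{i1} = 1
beta = np.array([0.5, -0.2, 3.0])   # an arbitrary coefficient vector

print(x_i @ beta)              # X_i' beta as an inner product
print(np.sum(beta * x_i))      # the same sum written out termwise
print(beta @ x_i)              # beta' X_i gives the identical number
```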
Vectors are column vectors by default; we will transpose when necessary — standard approach. Careful with Wooldridge (2020): he mixes row and column vectors to avoid transposes.
Definition 2 Linear regression model \[ Y_i = \bX_i'\bbeta + U_i \] is said to be in vector form.
Vector form — convenient representation of model for a single observation \(i\): \[ Y_i = \bX_i'\bbeta + U_i \]
What if we want to represent the model for the whole sample?
Why and how? Think of how we store data, e.g., in pandas:
```python
import pandas as pd
import statsmodels.api as sm

# credit card expenditure data bundled with statsmodels
credit_card_df = sm.datasets.ccard.load_pandas().data
credit_card_df.head(2)
```
|   | AVGEXP | AGE  | INCOME | INCOMESQ | OWNRENT |
|---|--------|------|--------|----------|---------|
| 0 | 124.98 | 38.0 | 4.52   | 20.4304  | 1.0     |
| 1 | 9.85   | 33.0 | 2.42   | 5.8564   | 0.0     |
Each row is an observation \(i\), variables stored in different columns — data as a table
Link to the docs on the `ccard` dataset
(Flat) tabular form
To replicate this tabular structure, put the outcomes in a vector \(\bY\): \[ \bY = \begin{pmatrix} Y_1\\ \vdots\\ Y_N \end{pmatrix} \]
Recall vector form: \(Y_i = \bX_i'\bbeta + U_i\). Stacking left hand sides also stacks right hand sides: \[ \begin{pmatrix} Y_1\\ \vdots\\ Y_N \end{pmatrix} = \begin{pmatrix} \bX_1'\bbeta + U_1\\ \vdots \\ \bX_N'\bbeta + U_N \end{pmatrix} \tag{1}\] Here \(N\) is the number of observations (rows)
Next, define the \(N\times k\) matrix \(\bX\) (note: no \(i\) index) \[ \bX = \begin{pmatrix} \bX_1'\\ \bX_2'\\ \vdots\\ \bX_N' \end{pmatrix} = \begin{pmatrix} X_{11} & X_{12} & \cdots & X_{1k} \\ X_{21} & X_{22} & \cdots & X_{2k}\\ \vdots & \vdots & \ddots & \vdots\\ X_{N1} & X_{N2} & \cdots & X_{Nk} \end{pmatrix} \]
\(\bX\) — precisely a table of covariates
statsmodels often uses the term `exog` for \(\bX\); scikit-learn simply uses `X`
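As an illustration, one way to build \(\bY\) and \(\bX\) from credit_card_df; the choice of AVGEXP as outcome and of AGE and INCOME as regressors is ours, purely for illustration.

```python
import statsmodels.api as sm

# outcome vector Y and covariate matrix X, with a column of ones for the intercept
Y = credit_card_df["AVGEXP"].to_numpy()
X = sm.add_constant(credit_card_df[["AGE", "INCOME"]]).to_numpy()

print(Y.shape, X.shape)  # (N,) and (N, k) with k = 3 here
```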
Define the column vector \(\bU\) of error terms through its transpose \[ \bU' = \begin{pmatrix} U_1 & U_2 & \dots & U_N \end{pmatrix} \]
Books and articles often define column vectors using their transposes — this saves vertical space
Can go back to Equation 1. Using matrix multiplication rules and \(\bU\), can write \[ \begin{pmatrix} \bX_1'\bbeta + U_1\\ \vdots \\ \bX_N'\bbeta + U_N \end{pmatrix} = \begin{pmatrix} \bX_1'\bbeta \\ \vdots \\ \bX_N'\bbeta \end{pmatrix} + \bU = \bX\bbeta + \bU \]
Check this! Write out each part of the above equalities in detail
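Not a substitute for writing the equalities out, but here is a numeric spot-check on made-up numbers that stacking the products \(\bX_i'\bbeta\) gives the same result as the single matrix product \(\bX\bbeta\).

```python
import numpy as np

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(4, 3))        # N = 4 observations, k = 3 regressors
beta_demo = np.array([1.0, -2.0, 0.5])

row_by_row = np.array([X_demo[i] @ beta_demo for i in range(4)])
print(row_by_row)
print(X_demo @ beta_demo)               # identical vector
```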
Definition 3 Linear regression model \[ \bY = \bX\bbeta + \bU \] is said to be in matrix form. Here \(\bX\) is an \(N\times k\) matrix, \(\bY\) and \(\bU\) are \(N\)-vectors (\(N\times 1\) matrices), and \(\bbeta\) is a \(k\)-vector (\(k\times 1\) matrix)
Matrix form — combine all sample points into one equation. Solves our third problem: convenient form for computers
Our last remaining goal: convenient expression for the OLS estimator \(\hat{\bbeta}\): \[ \hat{\bbeta} = (\hat{\beta}_1, \hat{\beta}_2, \dots, \hat{\beta}_k)' \]
Recall that \(\hat{\bbeta}\) is defined through optimization \[ \hat{\bbeta} = \argmin_{b_1, b_2, \dots, b_k}\sum_{i=1}^N \left( Y_i - \sum_{j=1}^k b_j X_{ij}\right)^2 \]
Objective function is differentiable in \((b_1, \dots, b_k)\)
\(\Rightarrow\) our strategy: differentiate with respect to each \(b_j\), set all partial derivatives equal to zero, and solve the resulting system for \(\hat{\bbeta}\)
Derivative with respect to \(b_j\) is given by \[ \begin{aligned} & -2\sum_{i=1}^N X_{ij} \left(Y_i - \sum_{l=1}^k b_l X_{il}\right)\\ & = -2\sum_{i=1}^N X_{ij} \left(Y_i - \bX_i'\bb\right) \end{aligned} \]
Any optimizer \(\hat{\bbeta} = (\hat{\beta}_1, \dots, \hat{\beta}_k)\) must satisfy the FOC system \[ \begin{cases} -2\sum_{i=1}^N X_{i1} \left(Y_i - \bX_i'\hat{\bbeta} \right) = 0\\ \vdots \\ -2\sum_{i=1}^N X_{ik} \left(Y_i - \bX_i'\hat{\bbeta} \right) = 0 \end{cases} \]
In words, all partial derivatives of the objective function must be zero at \(\hat{\bbeta}\)
Notice (check this!) that the FOCs can be compactly written as \[ \bX'(\bY - \bX\hat{\bbeta}) = 0 \tag{2}\]
Definition 4 Equation 2 is called the normal equations.
Can write Equation 2 as \[ (\bX'\bX)\hat{\bbeta} = \bX'\bY \]
A standard linear system of equations of the form \(\bA\hat{\bbeta} = \bc\) for \(\bA = \bX'\bX\) and \(\bc= \bX'\bY\)
\(k\) equations in \(k\) unknowns — the system has one or infinitely many solutions, depending on the rank of \(\bX'\bX\)
Recall: if \(\bA\) is a square matrix with maximal rank, \(\bA^{-1}\) exists
Proposition 1 (OLS Estimator Formula) Let \(\bX'\bX\) have rank \(k\). Then \((\bX'\bX)^{-1}\) exists and the OLS estimator is given by \[ \hat{\bbeta} = (\bX'\bX)^{-1}\bX'\bY \]
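A sketch of Proposition 1 in code, reusing the \(\bY\) and \(\bX\) built from credit_card_df above (that variable choice was our own illustration); solving the normal equations with `np.linalg.solve` is numerically preferable to forming the inverse explicitly.

```python
import numpy as np
import statsmodels.api as sm

Y = credit_card_df["AVGEXP"].to_numpy()
X = sm.add_constant(credit_card_df[["AGE", "INCOME"]]).to_numpy()

# solve (X'X) beta_hat = X'Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

print(X.T @ (Y - X @ beta_hat))    # normal equations hold up to floating-point error
print(beta_hat)
print(sm.OLS(Y, X).fit().params)   # statsmodels gives the same numbers
```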
Exercise: check that the above formula agrees with the simple formula involving sample covariance and variance when \(k=2\) and \(X_{i1}=1\)
Can do the derivation more efficiently: \[ \begin{aligned} & \sum_{i=1}^N \left( Y_i - \sum_{j=1}^k b_j X_{ij}\right)^2 = \sum_{i=1}^N \left( Y_i -\bX_i'\bb\right)^2\\ & = (\bY-\bX\bb)'(\bY-\bX\bb) = \bY'\bY - 2\bY'\bX\bb + \bb'\bX'\bX\bb \end{aligned} \] where \(\bb=(b_1, \dots, b_k)'\)
You can differentiate with respect to the whole vector \(\bb\), see Wikipedia
Vector form of first order condition \[ \dfrac{\partial (\bY'\bY - 2\bY'\bX\bb + \bb'\bX'\bX\bb)}{\partial \bb} = -2\bX'\bY + 2\bX'\bX\bb \]
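A finite-difference check of this gradient formula on small made-up data, purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 2))   # N = 5, k = 2
Y_demo = rng.normal(size=5)
b = np.array([0.3, -1.0])

def ssr(b_vec):
    # sum of squared residuals (Y - Xb)'(Y - Xb)
    resid = Y_demo - X_demo @ b_vec
    return resid @ resid

analytic = -2 * X_demo.T @ Y_demo + 2 * X_demo.T @ X_demo @ b

# central finite differences, one coordinate of b at a time
eps = 1e-6
numeric = np.array([(ssr(b + eps * e) - ssr(b - eps * e)) / (2 * eps) for e in np.eye(2)])

print(analytic)
print(numeric)   # agrees with the analytic gradient up to ~1e-6
```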
Setting the derivative equal to zero and rearranging again gives \[ \bX'\bX\hat{\bbeta} = \bX'\bY \]
In this lecture we introduced the scalar, vector, and matrix forms of the linear regression model and used the matrix form to derive the OLS estimator \(\hat{\bbeta} = (\bX'\bX)^{-1}\bX'\bY\)
A Deeper Look at Linear Regression: Vector Approach