Limit Distribution of the OLS Estimator

Key Step Towards Inference

Vladislav Morozov

Introduction

Lecture Info

Learning Outcomes

This lecture is about the asymptotic distribution of the OLS estimator


By the end, you should be able to

  • Discuss why we need the asymptotic distribution of the OLS estimator
  • Provide the definition of convergence in distribution
  • State the central limit theorem and Slutsky’s theorem
  • Derive the asymptotic distribution of the OLS estimator

Textbook References

  • Refresher on probability:
    • Your favorite probability textbook (e.g. chapter 5 in Bertsekas and Tsitsiklis (2008))
    • Sections B-C in Wooldridge (2020)
  • Asymptotic theory for the OLS estimator
    • 5.2 and E4 in Wooldridge (2020)
    • Or 7.3 in Hansen (2022)

Motivation

Motivating Empirical Example

Setting: Linear Causal Model


We’ll continue to work in the linear causal model with potential outcomes: \[ Y_i^\bx = \bx'\bbeta + U_i \tag{1}\]

Motivating Empirical Example: Variables

  • \(Y_i\) — hourly log wage
  • \(\bx\) — education and job experience in years
  • \(U_i\) — unobserved characteristics (skill, health, etc.), assumed to satisfy \(\E[U_i|\bX_i]=0\)
  • Sample: some suitably homogeneous group (e.g. married white women)

Motivating Empirical Example: Potential Outcomes

\[ \begin{aligned}[] & [\ln(\text{wage}_i)]^{\text{(education, experience)}} \\ & = \beta_1 + \beta_2 \times \text{education} \\ & \quad + \beta_3 \times \text{experience} + \beta_4 \times \dfrac{\text{experience}^2}{100} + U_i \end{aligned} \]

  • Can write model in terms of realized variables, but above emphasizes causal assumption
  • We divide experience\(^2\) by 100 for numerical reasons

Motivating Empirical Example: Parameters of Interest


Our parameters of interest:

  1. \(100\beta_2\) — (more or less) average effect of an additional year of education, in percent
  2. \(100\beta_3 + 20 \beta_4\) — average effect (in percent) of an additional year of experience for individuals with 10 years of experience, since \(\partial \E[\ln(\text{wage})]/\partial\, \text{experience} = \beta_3 + 2\beta_4 \times \text{experience}/100\)
  3. \(-50\beta_3/\beta_4\) — experience level which maximizes expected log wage

Motivating Empirical Example: Data

  • cps09mar — a selection from the March 2009 US Current Population Survey:
  • Can be obtained from the website for Hansen (2022)
  • Sample: married white women with present spouses


Expand for full data preparation code
import numpy as np
import pandas as pd
import statsmodels.api as sm

from statsmodels.regression.linear_model import OLS

# Read in the data
data_path = ("https://github.com/pegeorge/Econ521_Datasets/"
             "raw/refs/heads/main/cps09mar.csv")
cps_data = pd.read_csv(data_path)

# Generate variables
cps_data["experience"] = cps_data["age"] - cps_data["education"] - 6
cps_data["experience_sq_div"] = cps_data["experience"]**2/100
cps_data["wage"] = cps_data["earnings"]/(cps_data["week"]*cps_data["hours"] )
cps_data["log_wage"] = np.log(cps_data['wage'])

# Retain only married white women with present spouses
select_data = cps_data.loc[
    (cps_data["marital"] <= 2) & (cps_data["race"] == 1) & (cps_data["female"] == 1), :
]

# Construct X and y for regression 
exog = select_data.loc[:, ['education', 'experience', 'experience_sq_div']]
exog = sm.add_constant(exog)
endog = select_data.loc[:, "log_wage"]

Motivating Empirical Example: Estimation Results

results = OLS(endog, exog).fit(cov_type='HC0') # Robust covariance matrix estimator
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               log_wage   R-squared:                       0.226
Model:                            OLS   Adj. R-squared:                  0.226
Method:                 Least Squares   F-statistic:                     862.5
Date:                Mon, 19 May 2025   Prob (F-statistic):               0.00
Time:                        17:37:49   Log-Likelihood:                -8152.9
No. Observations:               10402   AIC:                         1.631e+04
Df Residuals:                   10398   BIC:                         1.634e+04
Df Model:                           3                                         
Covariance Type:                  HC0                                         
=====================================================================================
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                 0.9799      0.040     24.675      0.000       0.902       1.058
education             0.1114      0.002     50.185      0.000       0.107       0.116
experience            0.0229      0.002     12.257      0.000       0.019       0.027
experience_sq_div    -0.0347      0.004     -8.965      0.000      -0.042      -0.027
==============================================================================
Omnibus:                     4380.404   Durbin-Watson:                   1.833
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           134722.859
Skew:                          -1.401   Prob(JB):                         0.00
Kurtosis:                      20.406   Cond. No.                         219.
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC0)
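
The three target parameters can be read off the fitted coefficients. A minimal sketch, reusing the results object from the estimation code above:

# Point estimates of the three target parameters from the fitted model
params = results.params

# 1. Effect of an additional year of education, in percent
print(100 * params["education"])

# 2. Effect of an additional year of experience at 10 years of experience, in percent
print(100 * params["experience"] + 20 * params["experience_sq_div"])

# 3. Experience level that maximizes expected log wage, in years
print(-50 * params["experience"] / params["experience_sq_div"])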

Empirical Questions


  1. How certain are we of our estimates of target parameters?
  2. Does education matter at all? (up to our statistical confidence)
  3. Is the log-wage-maximizing level of experience equal to 15 years? (up to our statistical confidence)

Translating to Theory

Goal: Inference


Recall:

Inference is about answering questions about the population based on the finite sample


All of our questions — examples of inference

Challenge: Randomness

Key challenge:

We only see a random sample instead of the full population

In other words: our estimated values are also random and do not perfectly reflect the underlying population values

  • How do we quantify whether we are close to or far from the true parameter values? (\(\Rightarrow\) confidence intervals)
  • Are the values we obtained compatible with our hypotheses? (\(\Rightarrow\) hypothesis testing)

Necessary Object: Distribution of Estimator


To answer these questions, we need the distribution of the estimator for a given sample size


How do you get this distribution if you only have one sample?

Possible Approach: Distributional Assumptions

  • It is possible to impose exact distributional assumptions on the data (i.e., on \((\bX_i, U_i)\))
  • Example: assuming \(U_i|\bX_i\sim N(\cdot, \cdot)\) (e.g. chapter 4 in Wooldridge (2020))

Such approaches are usually problematic:

  • How do you justify such assumptions?
  • If these distributions have unknown parameters, how do you estimate them and quantify uncertainty about those parameters?

Other Approaches

  • Nonasymptotic/finite-sample analysis based on “high-probability bounds”:
    • Usually require making assumptions about tails of the data (e.g. that \(U_i\) has bounded support)
    • See Mohri, Rostamizadeh, and Talwalkar (2018) and Wainwright (2019) for examples
  • Large-sample approximations using tools like the central limit theorem(s) — topic of this lecture

Probability Background

Definitions

Convergence in Distribution

Definition 1 Let \(\bX_1, \bX_2, \dots\) and \(\bX\) be random vectors in \(\R^q\). Let \(\bX_N\) have CDF \(F_N(\cdot)\) and \(\bX\) have CDF \(F(\cdot)\). \(\curl{\bX_N}\) converges in distribution (converges weakly) to \(\bX\) if

\[ \lim_{N\to\infty} F_N(\bx) = F(\bx) \] for every \(\bx\in\R^q\) such that \(F(\cdot)\) is continuous at \(\bx\)

Convergence in distribution is labeled \(\bX_N\Rightarrow \bX\) or \(\bX_N\xrightarrow{d} \bX\)
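
As a concrete illustration, here is a minimal simulation sketch (assuming Uniform\([0,1]\) data): if \(X_N\) is the maximum of \(N\) IID Uniform\([0,1]\) draws, then \(N(1 - X_N)\xrightarrow{d} \text{Exponential}(1)\), and the empirical CDF approaches the limit CDF at every point:

import numpy as np

rng = np.random.default_rng(0)
n_obs, n_reps = 500, 100_000

# Z_N = N * (1 - max of N uniforms) converges in distribution to Exponential(1)
draws = rng.uniform(size=(n_reps, n_obs))
z = n_obs * (1 - draws.max(axis=1))

# Empirical CDF of Z_N vs. the limiting CDF F(x) = 1 - exp(-x)
for x in [0.5, 1.0, 2.0]:
    print(f"x = {x}: empirical {np.mean(z <= x):.3f} vs. limit {1 - np.exp(-x):.3f}")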

Convergence in Distribution vs. in Probability

Proposition 1  

  1. If \(\bX_N\xrightarrow{p} \bX\), then \(\bX_N\xrightarrow{d} \bX\)
  2. If \(\bX_N\xrightarrow{d} \bX\) and \(\bX\) is not random (=constant), then \(\bX_N\xrightarrow{p} \bX\)

Convergence in probability is generally stronger than convergence in distribution

Tools for Working with Convergence in Distribution

Working with Vectors: Mean and Variance

In this class we work with random vectors (such as \(\bX_iU_i\))


Some notation for vectors:

  • Mean of random vector \(\bZ\) is the vector \(\bmu = \E[\bZ]\): the coordinates of \(\bmu\) are the means of the coordinates of \(\bZ\)
  • Variance-covariance matrix of \(\bZ\) is the matrix
    \[ \var(\bZ) = \E[(\bZ-\bmu)(\bZ-\bmu)'] \]
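
A small numerical sketch of these definitions (with simulated bivariate normal data): the averaged outer products match numpy's built-in covariance estimator up to the \(N\) vs. \(N-1\) denominator:

import numpy as np

rng = np.random.default_rng(0)
true_vcov = np.array([[1.0, 0.5], [0.5, 2.0]])
Z = rng.multivariate_normal(mean=[0.0, 2.0], cov=true_vcov, size=100_000)

# Sample analog of E[(Z - mu)(Z - mu)']: average of outer products
mu_hat = Z.mean(axis=0)
vcov_hat = (Z - mu_hat).T @ (Z - mu_hat) / Z.shape[0]

print(vcov_hat)      # close to true_vcov
print(np.cov(Z.T))   # numpy's estimator: same up to N/(N-1)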

Key Tool: (Multivariate) CLT

Proposition 2 Let \(\bZ_1, \bZ_2, \dots\) be a sequence of random vectors with \(\bmu=\E[\bZ_i]\). Suppose that

  1. \(\bZ_i\) are independently and identically distributed (IID)
  2. \(\E[\norm{\bZ_i}^2]<\infty\)

Then \[ \small \sqrt{N}\left( \dfrac{1}{N}\sum_{i=1}^N \bZ_i - \bmu \right) \xrightarrow{d} N(0, \var(\bZ_i) ) \]

Visualizing The CLT
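
A minimal simulation sketch of the CLT in action (assuming skewed Exponential(1) data): the distribution of the standardized sample mean approaches \(N(0,1)\) as \(N\) grows.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_reps = 100_000

# Standardized sample means of Exponential(1) data (mean 1, variance 1)
for n_obs in [5, 50, 500]:
    samples = rng.exponential(scale=1.0, size=(n_reps, n_obs))
    t_stat = np.sqrt(n_obs) * (samples.mean(axis=1) - 1.0)
    # Kolmogorov-Smirnov distance to the N(0, 1) limit shrinks with N
    print(n_obs, stats.kstest(t_stat, "norm").statistic)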

Practical Interpretation: Scalar Case

The CLT (Proposition 2) for scalar data states that for sufficiently large \(N\) \[ \small P\left(\dfrac{N^{-1}\sum_{i=1}^N Z_i - \E[Z_i] }{\sqrt{ \var(Z_i)/N }}\leq x \right) \approx \Phi(x) \] where \(\Phi(\cdot)\) is the standard normal CDF. In other words, the sample mean is approximately distributed as (“\(\overset{a}{\sim}\)”) \[\small \bar{Z} \overset{a}{\sim} N\left( \E[Z_i], \dfrac{\var(Z_i)}{N} \right) \tag{2}\]

Tool: Continuous Mapping Theorem

Proposition 3 Let \(\bZ_N\xrightarrow{d}\bZ\), and let \(f(\cdot)\) be continuous in some neighborhood of all the possible values of \(\bZ\).

Then
\[ f(\bZ_N) \xrightarrow{d} f(\bZ) \]

In words: convergence in distribution is also preserved under continuous transformations

CMT: Scalar Example

Consider the scalar case with \(\sqrt{N}(\bar{X}-\mu)\Rightarrow Z\) for \(Z\sim N(0, \var(X_i))\). Then \[ N(\bar{X}-\mu)^2 \xrightarrow{d} Z^2 \]

Note: the above is not the same as looking at \[ \sqrt{N}([\bar{X}]^2 - \mu^2) \tag{3}\] The function \(f\) is applied to the elements of the convergent sequence
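
A simulation sketch of this example (assuming Exponential(1) data, so \(\mu = 1\) and \(\var(X_i) = 1\)): by the CMT, \(N(\bar{X}-\mu)^2\) should behave like \(Z^2 \sim \chi^2_1\):

import numpy as np

rng = np.random.default_rng(0)
n_obs, n_reps = 1_000, 100_000

samples = rng.exponential(scale=1.0, size=(n_reps, n_obs))
w = n_obs * (samples.mean(axis=1) - 1.0) ** 2

# The chi-squared(1) limit has mean 1 and 95th percentile 3.841
print(w.mean())             # close to 1
print(np.mean(w <= 3.841))  # close to 0.95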

CMT: Vector Example

In the vector case, label

  • \(\bZ_N= \sqrt{N}\left( \dfrac{1}{N}\sum_{i=1}^N \bX_i - \E[\bX_i] \right)\) with \(q\) coordinates
  • \(\bZ\sim N(0, \var(\bX_i) )\)
  • \(\bA\) — some \(q\times q\) matrix

Then

\[ \bZ_N'\bA\bZ_N\xrightarrow{d} \bZ'\bA\bZ \]

Tool: Slutsky’s Theorem

Tool for combining sequences that converge in probability and in distribution

Proposition 4 Let \(\bZ_N\xrightarrow{d}\bZ\), \(\bV_N\xrightarrow{p} \bv\), \(\bA_N\xrightarrow{p}\bA\). Then (provided invertibility and compatible sizes)

  1. \(\bZ_N + \bV_N \xrightarrow{d} \bZ + \bv\)
  2. \(\bA_N\bZ_N \xrightarrow{d} \bA\bZ\)
  3. \(\bZ_N'\bA_N^{-1}\bZ_N\xrightarrow{d} \bZ'\bA^{-1}\bZ\)
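
A classic application (a simulation sketch, assuming Exponential(1) data): the studentized mean replaces the unknown \(\var(Z_i)\) by a consistent estimator; by Slutsky's theorem the \(N(0,1)\) limit is unaffected:

import numpy as np

rng = np.random.default_rng(0)
n_obs, n_reps = 1_000, 100_000

samples = rng.exponential(scale=1.0, size=(n_reps, n_obs))
sd_hat = samples.std(axis=1, ddof=1)  # converges in probability to sd(Z_i) = 1

# Slutsky: sqrt(N) * (mean - mu) / sd_hat -> N(0, 1)
t_stat = np.sqrt(n_obs) * (samples.mean(axis=1) - 1.0) / sd_hat
print(np.mean(np.abs(t_stat) <= 1.96))  # close to 0.95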

Warning: \(\bv\) and \(\bA\) Must Be Constant


Important

Slutsky’s theorem no longer holds if \(\bV_N\) or \(\bA_N\) converges to a random limit!

Asymptotic Normality of the OLS Estimator

Estimator and Model

Let’s return to the OLS estimator for \(\bbeta\) in model (1) \[ \hat{\bbeta} = \left( \dfrac{1}{N}\sum_{i=1}^N \bX_i\bX_i' \right)^{-1}\left( \dfrac{1}{N}\sum_{i=1}^N \bX_i Y_i \right) \]

We retain these assumptions used for consistency:

  • IID sample
  • \(\E[\bX_i\bX_i']\) invertible
  • \(\E[\bX_iU_i] =0\)

Towards Normality

Write the OLS estimator in sampling error form: \[ \hat{\bbeta} = \bbeta + \left( \dfrac{1}{N}\sum_{i=1}^N \bX_i\bX_i' \right)^{-1}\left( \dfrac{1}{N}\sum_{i=1}^N \bX_i U_i \right) \]

Notice

  • \(N^{-1}\sum_{i=1}^N \bX_i U_i\) — sample average of IID random vectors with \(\E[\bX_i U_i]=0\).
  • Potential application of the CLT!
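
The decomposition is purely algebraic and easy to verify numerically; a minimal sketch with simulated data:

import numpy as np

rng = np.random.default_rng(0)
n_obs, beta = 10_000, np.array([1.0, 2.0])

X = np.column_stack([np.ones(n_obs), rng.normal(size=n_obs)])
U = rng.normal(size=n_obs)
Y = X @ beta + U

# Direct OLS vs. the sampling error form beta + (X'X/N)^{-1} (X'U/N)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
beta_decomposed = beta + np.linalg.solve(X.T @ X / n_obs, X.T @ U / n_obs)

print(np.allclose(beta_hat, beta_decomposed))  # True: identical up to rounding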

The Four Steps of Showing Normality


  1. Show asymptotic normality of \(N^{-1}\sum_{i=1}^N \bX_i U_i\)
  2. Handle \(\left(N^{-1}\sum_{i=1}^N \bX_i\bX_i' \right)^{-1}\)
  3. Combine first two steps together
  4. Figure out the variance of the limit

Step 1: Asymptotic Normality of \(N^{-1}\sum_{i=1}^N \bX_i U_i\)

By the CLT (Proposition 2) applied to \(\bX_iU_i\), and the assumption \(\E[\bX_iU_i]=0\),

\[ \dfrac{1}{\sqrt{N}}\sum_{i=1}^N \bX_i U_i \xrightarrow{d} N\left( 0, \E[U_i^2\bX_i\bX_i'] \right) \]

We need to assume that \(\E[\norm{\bX_i U_i}^2] <\infty\)

Step 2: Handling \(\left(N^{-1}\sum_{i=1}^N \bX_i\bX_i' \right)^{-1}\)

Exactly as for consistency:

  • Law of large numbers: \(N^{-1}\sum_{i=1}^N \bX_i\bX_i'\xrightarrow{p} \E[\bX_i\bX_i']\)
  • CMT + assumption that \(\left(\E[\bX_i\bX_i']\right)^{-1}\) exists \(\Rightarrow\) \(N^{-1}\sum_{i=1}^N \bX_i\bX_i'\) is invertible with probability approaching 1
  • CMT: \(\left(\frac{1}{N} \sum_{i=1}^N \bX_i\bX_i' \right)^{-1} \xrightarrow{p} \left( \E[\bX_i\bX_i']\right)^{-1}\)

Step 3: OLS as Product of Two Sequences

The OLS estimator is a product of two sequences:

  • Vectors \(\frac{1}{\sqrt{N}}\sum_{i=1}^N \bX_i U_i\) converge in distribution
  • Matrices \(\left(N^{-1}\sum_{i=1}^N \bX_i\bX_i' \right)^{-1}\) converge in probability to constant limit


Can apply Slutsky’s theorem (Proposition 4)!

Step 3: Applying Slutsky’s Theorem

Slutsky’s theorem gives \[ \begin{aligned} & \sqrt{N}(\hat{\bbeta}-\bbeta)\\ & = \left(N^{-1}\sum_{i=1}^N \bX_i\bX_i' \right)^{-1} \dfrac{1}{\sqrt{N}}\sum_{i=1}^N \bX_i U_i \\ & \xrightarrow{d} \left( \E[\bX_i\bX_i']\right)^{-1} \bZ, \end{aligned} \tag{4}\] for \(\bZ\sim N\left( 0, \E[U_i^2\bX_i\bX_i'] \right)\)

Step 4: Properties of Variances of Vectors

Recall: if \(\var(X)= \sigma^2\), then \(\var(aX) = a^2\sigma^2\)


Similar story for vectors:

Proposition 5 Let \(\bZ\) be a \(k\)-vector with variance-covariance matrix \(\var(\bZ)\). Let \(\bA\) be a \(q\times k\) matrix. Then \[ \var(\bA\bZ) = \bA\var(\bZ)\bA' \]
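
A quick Monte Carlo sketch of Proposition 5 (with an arbitrary matrix \(\bA\) and simulated \(\bZ\)):

import numpy as np

rng = np.random.default_rng(0)
V = np.array([[2.0, 0.3], [0.3, 1.0]])   # var(Z)
A = np.array([[1.0, -1.0], [0.5, 2.0]])  # arbitrary 2 x 2 matrix

Z = rng.multivariate_normal(mean=[0.0, 0.0], cov=V, size=500_000)

print(np.cov((Z @ A.T).T))  # sample variance of AZ
print(A @ V @ A.T)          # theoretical A var(Z) A'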

Step 4: Variance

We now apply Proposition 5 to Equation 4 to get \[ \begin{aligned} & \var\left( \left( \E[\bX_i\bX_i']\right)^{-1} \bZ \right) \\ & = \left( \E[\bX_i\bX_i']\right)^{-1} \E[U_i^2\bX_i\bX_i']\left( \E[\bX_i\bX_i']\right)^{-1} \end{aligned} \]

We have used (where?) that

  • \(\E[\bX_i\bX_i']\) is symmetric (why?)
  • Inverses of symmetric matrices are symmetric

Combined Result

Proposition 6 Let

  1. \((\bX_i, Y_i)\) be IID
  2. \(\E[\norm{\bX_i\bX_i'}]<\infty\), \(\E[\norm{\bX_iU_i}^2]<\infty\)
  3. \(\E[\bX_i\bX_i']\) be invertible

Then \[\scriptsize \sqrt{N}(\hat{\bbeta}-\bbeta) \xrightarrow{d} N\left(0, \underbrace{\left( \E[\bX_i\bX_i']\right)^{-1} \E[U_i^2\bX_i\bX_i']\left( \E[\bX_i\bX_i']\right)^{-1} }_{\tiny \text{Called the asymptotic variance: } \avar(\hat{\bbeta}) }\right) \]

Discussion of Asymptotic Variance \(\avar(\hat{\bbeta})\)

  • Asymptotic variance expression in Proposition 6 is sometimes called the “sandwich” form
  • Expression compatible with heteroskedasticity
  • If \(\E[U_i^2|\bX_i]=\sigma^2\) (conditional homoskedasticity), the expression simplifies (this corresponds to nonrobust standard errors). In practice conditional homoskedasticity is rarely credible, so the robust form is preferred
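
To connect Proposition 6 back to the empirical example: the HC0 standard errors reported above are exactly the plug-in sample analog of the sandwich formula. A sketch, reusing exog and results from the estimation code:

import numpy as np

X = np.asarray(exog, dtype=float)
u_hat = np.asarray(results.resid)  # residuals as estimates of U_i
n_obs = X.shape[0]

# Plug-in sample analogs of the pieces of Avar(beta_hat)
bread = np.linalg.inv(X.T @ X / n_obs)          # (E[X X'])^{-1}
meat = (X * u_hat[:, None] ** 2).T @ X / n_obs  # E[U^2 X X']
avar_hat = bread @ meat @ bread

# Standard errors: sqrt of the diagonal of Avar(beta_hat) / N
print(np.sqrt(np.diag(avar_hat / n_obs)))  # matches the HC0 "std err" column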

Practical Usefulness: Foundation for Inference

Similarly to (2), Proposition 6 can be interpreted as saying that

\[ \hat{\bbeta} \overset{a}{\sim} N\left(\bbeta, \dfrac{1}{N}\avar(\hat{\bbeta}) \right) \]

  • Gives us an idea of how much variation/precision we can expect from \(\hat{\bbeta}\)
  • If a hypothesis \(H_0\) restricts \(\bbeta\), we can check whether the observed \(\hat{\bbeta}\) is compatible with \(H_0\)

Recap and Conclusions

Recap

In this lecture we

  1. Formulated our first practical questions in inference
  2. Reviewed convergence in distribution
  3. Derived the asymptotic distribution of the OLS estimator

Next Questions


How to use Proposition 6 for inference?

  • Estimating \(\avar(\hat{\bbeta})\) to get standard errors
  • Building tests and confidence intervals
  • Answering the questions of our empirical example

References

Bertsekas, Dimitri P., and John N. Tsitsiklis. 2008. Introduction to Probability. 2nd ed. Optimization and Computation Series. Belmont: Athena Scientific.
Hansen, Bruce. 2022. Econometrics. Princeton University Press.
Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar. 2018. Foundations of Machine Learning. The MIT Press. https://doi.org/10.5555/3360093.
Wainwright, Martin J. 2019. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. https://doi.org/10.1017/9781108627771.
Wooldridge, Jeffrey M. 2020. Introductory Econometrics: A Modern Approach. Seventh edition. Boston, MA: Cengage.