Consistency of the OLS Estimator

Convergence as Sample Size Grows

Vladislav Morozov

Introduction

Lecture Info

Learning Outcomes

This lecture is about consistency in general and consistency of the OLS estimator in particular


By the end, you should be able to

  • Provide definitions of convergence in probability and consistency
  • Handle the question of invertibility of \(\bX'\bX\)
  • Derive consistency results for the OLS estimator

Textbook References

  • Refresher on probability:
    • Your favorite probability textbook (e.g. chapter 5 in Bertsekas and Tsitsiklis (2008))
    • Sections B-C in Wooldridge (2020)
  • Consistency of the OLS estimator
    • 7.1-7.2 in Hansen (2022)
    • Or 5.1 and E4 in Wooldridge (2020)

Consistency as Basic Requirement

Want estimators with good properties


Consistency is a minimal requirement for a “sensible” estimator


Informally:

An estimation procedure is consistent if it gets the target parameter right as the sample size grows infinitely large

Probability Background

Definitions

Reminder: Convergence of a Deterministic Sequence

Recall:

Definition 1 Let \(\bx_1, \bx_2, \dots\) be a sequence of vectors in \(\R^p\). Then \(\bx_N\) converges to some \(\bx\in \R^p\) if for any \(\varepsilon>0\) there exists an \(N_0\) such that for all \(N\geq N_0\) \[ \norm{ \bx_N - \bx } < \varepsilon \]

Here \(\norm{\cdot}\) is the Euclidean norm on \(\R^p\): if \(\by = (y_1, \dots, y_p)\), then \(\norm{\by} = \sqrt{ \sum_{i=1}^p y_i^2 }\)

Towards Formalizing Convergence

  • Let \(\theta\in \R^p\)
  • Sample of size \(N\) is \((X_1, \dots, X_N)\)
  • Recall: estimators \(\curl{\hat{\theta}_N}_{N=N_{\min}}^{\infty}\) are a sequence of functions: the \(N\)th estimator maps \((X_1, \dots, X_N)\) to \(\R^p\) to produce an estimate


Sample is random \(\Rightarrow\) each \(\hat{\theta}_N(X_1, \dots, X_N)\) is random

\(\Rightarrow\) How to formalize convergence?

Convergence in Probability: Definition


Definition 2 Let \(\bX_1, \bX_2, \dots\) be a sequence of random matrices in \(\R^{k\times p}\). Then \(\bX_N\) converges to some \(\bX\in \R^{k\times p}\) in probability if for any \(\varepsilon>0\) it holds that \[ \lim_{N\to\infty} P(\norm{\bX_N - \bX}>\varepsilon) = 0 \]

Convergence in Probability: Discussion


  • The limit \(\bX\) can be random or deterministic
  • Convergence in probability is written \(\bX_N\xrightarrow{p}\bX\)
  • \(\bX_N\xrightarrow{p} \bX\) is the same as \(\bX_N-\bX\xrightarrow{p} 0\)

Two Important Characterizations

Proposition 1 Let \(\bA_N, \bA\) be \(m \times n\) matrices with \((i, j)\)th elements \(a_{ij, N}\) and \(a_{ij}\). Then \[ \bA_N\xrightarrow{p}\bA \Leftrightarrow a_{ij, N} \xrightarrow{p} a_{ij} \text{ for all } i, j \]


Proposition 2 \(\bX_N\xrightarrow{p}\bX\) if and only if \(P(\bX_N\in U)\to 1\) for any open set \(U\) that contains \(\bX\)

Definition of Consistency

Definition 3 The estimator (sequence) \(\hat{\theta}_N\) is consistent for \(\theta\) if as \(N\to\infty\) \[ \hat{\theta}_N(X_1, \dots, X_N) \xrightarrow{p} \theta \]

Note: we usually use the word “estimator” to refer to the whole sequence \(\curl{\hat{\theta}_N}\)

Tools for Showing Consistency

Two Approaches To Showing Consistency

  1. Qualitative: just shows that convergence happens
  • Relies on laws of large numbers (LLNs) and related results
  • Approach in this course
  2. Quantitative: shows that convergence happens and how fast
  • Usually based on concentration inequalities
  • Check out chapter 2 in Wainwright (2019)

Tool: (Weak) Law of Large Numbers

Proposition 3 Let \(\bX_1, \bX_2, \dots\) be a sequence of random vectors such that

  1. \(\bX_i\) are independently and identically distributed (IID)
  2. \(\E[\norm{\bX_i}]<\infty\)

Then \[ \small \dfrac{1}{N} \sum_{i=1}^N \bX_i \xrightarrow{p} \E[\bX_i] \]

Tool: Continuous Mapping Theorem

Proposition 4 Let \(\bX_N\xrightarrow{p}\bX\), and let \(f(\cdot)\) be continuous in some neighborhood of all the possible values of \(\bX\).

Then
\[ f(\bX_N) \xrightarrow{p} f(\bX) \]

In words: convergence in probability is preserved under continuous transformations

CMT Examples

Simple examples

  1. If \(X_N\) is scalar and \(X_N\xrightarrow{p}X\), then
    • \(X_N^2 \xrightarrow{p} X^2\)
    • \(\max\curl{0, X_N}\xrightarrow{p} \max\curl{0, X}\)
  2. If \(\bX_N\xrightarrow{p} \bX\), \(\bX_N\in \R^p\) and \(\bA_N\xrightarrow{p}\bA\), \(\bA_N\in\R^{k\times p}\), then \[ \bA_N\bX_N\xrightarrow{p} \bA\bX \]

Visualizing the Law of Large Numbers
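
A minimal simulation sketch of the WLLN (Proposition 3). Assumptions not from the lecture: NumPy is available and the draws are Exponential(1), so the population mean is 1. The running sample mean settles near 1 as \(N\) grows.

```python
# Sketch: running sample means of IID Exponential(1) draws approach the
# population mean E[X_i] = 1 (illustrative distribution and seed).
import numpy as np

rng = np.random.default_rng(0)
draws = rng.exponential(scale=1.0, size=10_000)

# Running means: (1/N) * sum_{i=1}^{N} X_i for N = 1, ..., 10_000
running_means = np.cumsum(draws) / np.arange(1, draws.size + 1)

for n in (10, 100, 1_000, 10_000):
    print(f"N = {n:>6}: sample mean = {running_means[n - 1]:.4f}")
```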

Consistency of the OLS Estimator

Returning to the OLS Estimator

Let’s go back to the OLS estimator based on IID sample \((\bX_1, Y_1), \dots, (\bX_N, Y_N)\)


Is the OLS estimator consistent?


Consistent for what?

Convergence Without a Causal Model

Invertibility of \(\bX'\bX\)

First an unpleasant technical issue

How to handle non-invertible \(\bX'\bX\)?


Use some known fall-back value \(\bc\): \[ \hat{\bbeta} = \begin{cases} (\bX'\bX)^{-1}\bX'\bY, & \bX'\bX \text{ invertible}\\ \bc, & \bX'\bX \text{ not invertible} \end{cases} \] Now \(\hat{\bbeta}\) is always defined, but does \(\bc\) matter?
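
A minimal code sketch of this piecewise definition (assuming NumPy; the function name and the fallback argument are illustrative, not from the lecture):

```python
# Piecewise OLS estimator: the usual formula when X'X is invertible,
# a pre-specified fallback value c otherwise.
import numpy as np

def beta_hat(X, Y, c):
    XtX = X.T @ X
    if np.linalg.matrix_rank(XtX) < XtX.shape[0]:   # X'X not invertible
        return c
    return np.linalg.solve(XtX, X.T @ Y)            # (X'X)^{-1} X'Y
```

As the next slides show, the particular choice of \(\bc\) does not matter asymptotically.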

Representation in Terms of Averages

First lecture shows that \[ \bX'\bX = \sum_{i=1}^N \bX_i\bX_i', \quad \bX'\bY = \sum_{i=1}^N \bX_i Y_i \]

Then we can write (under invertibility) \[ (\bX'\bX)^{-1}\bX'\bY = \left(\textcolor{teal}{\dfrac{1}{N}} \sum_{i=1}^N \bX_i\bX_i'\right)^{-1} \left( \textcolor{teal}{\dfrac{1}{N}} \sum_{i=1}^N \bX_i Y_i\right) \]

Limits of Averages

We know how to handle averages. If

  • \(\E[\norm{\bX_i\bX_i'}]<\infty\)
  • \(\E[\norm{\bX_iY_i}]<\infty\)

then by the WLLN (Proposition 3) \[ \begin{aligned} \dfrac{1}{N} \sum_{i=1}^N \bX_i\bX_i' \xrightarrow{p} \E[\bX_i\bX_i'], \quad \dfrac{1}{N} \sum_{i=1}^N \bX_iY_i \xrightarrow{p} \E[\bX_i Y_i] \end{aligned} \]

Handling the Inverse of \(\bX'\bX\)

Two facts:

  • The inverse function \(\bA\to \bA^{-1}\) is continuous on the space of invertible matrices
  • The set of invertible matrices is open

So if \(\E[\bX_i\bX_i']\) is invertible, then by Proposition 2 and the CMT (Proposition 4)

  • \(P(\frac{1}{N} \sum_{i=1}^N \bX_i\bX_i' \text{ is invertible})\to 1\)
  • \((\frac{1}{N}\sum_{i=1}^N \bX_i\bX_i')^{-1} \xrightarrow{p} \left(\E[\bX_i\bX_i']\right)^{-1}\)

\(\bc\) Does Not Matter

Since \(\frac{1}{N} \sum_{i=1}^N \bX_i\bX_i'\) is invertible with probability approaching 1 (w.p.a.1), it holds w.p.a.1 that \[ \hat{\bbeta} = (\bX'\bX)^{-1}\bX'\bY = \left( \dfrac{1}{N} \sum_{i=1}^N \bX_i\bX_i'\right)^{-1} \left( \dfrac{1}{N} \sum_{i=1}^N \bX_i Y_i\right) \]

It follows that if \(\bc\neq \E[\bX_i\bX_i']^{-1}\E[\bX_iY_i]\), then \[ P(\hat{\bbeta}= \bc) \to 0 \]

Combining Together

Proposition 5 Let

  1. \((\bX_i, Y_i)\) be IID
  2. \(\E[\norm{\bX_i\bX_i'}]<\infty\), \(\E[\norm{\bX_iY_i}]<\infty\)
  3. \(\E[\bX_i\bX_i']\) be invertible

Then \[ \hat{\bbeta} \xrightarrow{p} \left( \E[\bX_i\bX_i'] \right)^{-1} \E[\bX_iY_i] \]
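
A simulation sketch of Proposition 5 (assuming NumPy; the data-generating process below, which is nonlinear in \(X_i\), is purely illustrative). With \(x\sim N(0,1)\) and \(Y = 1 + 0.5x + 0.3x^2 + \varepsilon\), the population projection of \(Y_i\) on \((1, X_i)\) has coefficients \((1.3, 0.5)\), and the OLS estimates approach them:

```python
# Sketch: OLS converges to the population projection coefficients (Proposition 5),
# even though no linear causal model is assumed for the DGP (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def ols(N):
    x = rng.normal(size=N)
    y = 1.0 + 0.5 * x + 0.3 * x**2 + rng.normal(size=N)
    X = np.column_stack([np.ones(N), x])           # regressors: intercept and x
    return np.linalg.solve(X.T @ X, X.T @ y)       # (X'X)^{-1} X'Y

for N in (100, 10_000, 1_000_000):
    print(N, ols(N))
# With x ~ N(0,1): E[X_i X_i'] = I and E[X_i Y_i] = (1.3, 0.5),
# so the limit is (1.3, 0.5).
```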

Quicker Way of Writing

It is common and acceptable (also on the exam) to directly write \[ \hat{\bbeta} = (\bX'\bX)^{-1}\bX'\bY \] provided that you

  1. Are talking about asymptotic properties (consistency, asymptotic distributions, asymptotic confidence intervals)
  2. Make the assumption that \(\E[\bX_i\bX_i']\) is invertible and say that \(\bX'\bX\) is invertible w.p.a.1

Convergence of Estimator and Objective Function

Discussion


  • Proposition 5: no causal framework
  • OLS just measures covariances in general
  • The limit \(\left( \E[\bX_i\bX_i'] \right)^{-1} \E[\bX_iY_i]\) is called the “population projection coefficient of \(Y_i\) on \(\bX_i\)”

Convergence under Causal Model with Exogeneity and Homogeneous Effects

Potential Outcomes Framework

Let’s go back to our causal framework to add a causal part to Proposition 5

  • Treatment \(\bX_i\) with possible values \(\bx\)
  • Potential outcome under \(\bX_i=\bx\): \[ Y_i^{\bx} = \bx'\bbeta + U_i \tag{1}\]

Observed outcomes \(Y_i\) satisfy \[ Y_i = \bX_i'\bbeta + U_i \]

Sampling Error Representation

Can now substitute the equation for \(Y_i\) (+invertibility assumptions) to get
\[ \begin{aligned} \hat{\bbeta} & = \left( \dfrac{1}{N}\sum_{i=1}^N \bX_i\bX_i' \right)^{-1} \dfrac{1}{N}\sum_{i=1}^N \bX_iY_i\\ & = \bbeta + \left( \dfrac{1}{N}\sum_{i=1}^N \bX_i\bX_i' \right)^{-1} \dfrac{1}{N}\sum_{i=1}^N \bX_i U_i \end{aligned} \] Last line — sampling error form
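
Applying the WLLN (Proposition 3) and the CMT (Proposition 4) to the sampling error form then gives the limit directly, provided the moment and invertibility conditions of the next proposition hold and \(\E[\bX_iU_i]=0\): \[ \hat{\bbeta} = \bbeta + \left( \dfrac{1}{N}\sum_{i=1}^N \bX_i\bX_i' \right)^{-1} \dfrac{1}{N}\sum_{i=1}^N \bX_i U_i \xrightarrow{p} \bbeta + \left(\E[\bX_i\bX_i']\right)^{-1} \E[\bX_iU_i] = \bbeta \]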

Consistency of the OLS Estimator

Proposition 6 Let

  1. \((\bX_i, U_i)\) be IID and model (1) hold
  2. \(\E[\norm{\bX_i\bX_i'}]<\infty\), \(\E[\norm{\bX_iU_i}]<\infty\), \(\E[\bX_iU_i]=0\)
  3. \(\E[\bX_i\bX_i']\) be invertible

Then \[ \hat{\bbeta} \xrightarrow{p} \bbeta \]

Discussion of Assumptions


Need assumptions for causal interpretation:

  • Assumptions on the assignment mechanism (IID \(\bX_i\) and exogeneity/orthogonality)
  • Same assumptions as for identification of \(\bbeta\) (why can you use Proposition 6 to prove identification?)

Orthogonality vs Strict Exogeneity

Recall:

If \(\E[U_i|\bX_i]=0\), then \(\E[\bX_iU_i]=0\)

  • \(\E[U_i|\bX_i]=0\) is stronger than \(\E[\bX_iU_i]=0\)
  • If \(\E[U_i|\bX_i]=0\) holds, then \(\E[\hat{\bbeta}] = \bbeta\) (unbiased)
  • If \(\E[U_i|\bX_i]\neq 0\) but \(\E[\bX_i U_i]=0\), then \(\hat{\bbeta}\) may be biased in finite samples, but it still converges to the correct \(\bbeta\) (see the example after this list)
  • What if both fail?
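
An example where orthogonality holds but strict exogeneity fails (a sketch, not from the lecture): let \(X_i\sim N(0, 1)\) and \(U_i = X_i^2 - 1\). Then \[ \E[U_i|X_i] = X_i^2 - 1 \neq 0, \qquad \E[X_iU_i] = \E[X_i^3] - \E[X_i] = 0 \] If both conditions fail, the sampling error form shows that \(\hat{\bbeta}\) converges to \(\bbeta + \left(\E[\bX_i\bX_i']\right)^{-1}\E[\bX_iU_i]\), which generally differs from the causal \(\bbeta\).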

Convergence under Causal Model with Heterogeneous Effects

Allowing Heterogeneous Causal Effects

Before concluding, let’s go back to model (1)

  • Causal effects are homogeneous (same for everyone)
  • Might not be realistic

More general potential outcome equation \[ Y_i^\bx = \bx'\bbeta_{\textcolor{teal}{i}} + U_i \tag{2}\] Causal effect of a shift from \(\bx_1\) to \(\bx_2\) for unit \(i\): \[ (\bx_2-\bx_1)'\bbeta_i \]

Parameters of Interest In Model (2)

  1. Average \(\E[\bbeta_i]\) — enough to compute all average treatment effects
  2. Variance, other moments of \(\bbeta_i\)
  3. Distribution of \(\bbeta_i\)
  4. Other objects?

OLS Under Model (2)

We can still apply OLS under (2), but the representation in terms of \(U_i\) and \(\bbeta_i\) is different: \[ \small \begin{aligned} \hat{\bbeta} & = \left( \dfrac{1}{N} \sum_{i=1}^N \bX_i\bX_i' \right)^{-1}\dfrac{1}{N} \sum_{i=1}^N \bX_i Y_i \\ & = \left( \dfrac{1}{N} \sum_{i=1}^N \bX_i\bX_i' \right)^{-1}\dfrac{1}{N} \sum_{i=1}^N \bX_i\bX_i'\bbeta_i \\ & \quad + \left( \dfrac{1}{N} \sum_{i=1}^N \bX_i\bX_i' \right)^{-1}\dfrac{1}{N} \sum_{i=1}^N \bX_i U_i \end{aligned} \]

Limit of the OLS Estimator Under Model (2)

Proposition 7 Let

  1. \(X_i\) be scalar, \((X_i, \beta_i, U_i)\) be IID and model (2) hold as \[ \small Y_i^x = \beta_i x + U_i \]
  2. \(\E[X_i^2] \in (0,\infty)\), \(\E[X_i^2\cdot|\beta_i|]<\infty\), \(\E[|X_iU_i|]<\infty\), \(\E[X_iU_i]=0\)

Then \[ \small \hat{\beta} \xrightarrow{p} \E[W(X_i)\beta_i], \quad W(X_i) = X_i^2/\E[X_i^2] \]
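
A simulation sketch of Proposition 7 (assuming NumPy; the joint distribution of \((X_i, \beta_i)\) below is illustrative). When \(\beta_i\) depends on \(X_i\), the OLS estimate approaches the weighted average \(\E[W(X_i)\beta_i]\), here \(1.8\), rather than \(\E[\beta_i] = 1.5\):

```python
# Sketch: with heterogeneous slopes dependent on X, OLS converges to the
# variance-weighted average E[W(X_i) beta_i], not to E[beta_i] (illustrative DGP).
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000

x = rng.choice([1.0, 2.0], size=N)   # scalar regressor, values 1 and 2 equally likely
beta_i = x                           # heterogeneous slope, dependent on x
u = rng.normal(size=N)               # independent noise, so E[x * u] = 0
y = beta_i * x + u                   # model (2) without an intercept

beta_hat = (x @ y) / (x @ x)         # OLS of y on x, no intercept

print("OLS estimate:      ", round(beta_hat, 3))   # close to 1.8
print("E[beta_i]:         ", 1.5)                  # (1 + 2) / 2
print("E[W(X_i) beta_i]:  ", 1.8)                  # (1*1 + 4*2)/2 divided by E[x^2] = 2.5
```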

Proposition 7: Discussion


  • Proposition 7: OLS is estimating a weighted average of individual \(\beta_i\)
  • Weights \(W(X_i)\) are non-negative and \(\E[W(X_i)]=1\) (population version of summing to 1)
  • Still a “causal” parameter, but maybe hard to interpret
  • Not equal to \(\E[\bbeta_i]\) without further assumptions

Can You Identify \(\E[\bbeta_i]\)?


Only under restrictions:

  • Independence of \(\bbeta_i\) and \(\bX_i\) (experimental settings). Show this! (see the sketch after this list)
  • With instruments
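
A sketch of the independence argument in the scalar setting of Proposition 7: if \(\beta_i\) is independent of \(X_i\), then \[ \E[W(X_i)\beta_i] = \E\left[\dfrac{X_i^2}{\E[X_i^2]}\,\beta_i\right] = \dfrac{\E[X_i^2]\,\E[\beta_i]}{\E[X_i^2]} = \E[\beta_i], \] so the OLS limit is the average effect \(\E[\beta_i]\).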

Recap and Conclusions

Recap

In this lecture we

  1. Reviewed convergence in probability and consistency
  2. Discussed consistency of the OLS estimator
    • Without a causal framework (covariances)
    • In a causal model with homogeneous effects
    • In a causal model with heterogeneous effects

Next Questions


  • How fast does \(\hat{\bbeta}\) converge?
  • Distributional properties of \(\hat{\bbeta}\)?
  • Inference on \(\bbeta\)

References

Bertsekas, Dimitri P., and John N. Tsitsiklis. 2008. Introduction to Probability. 2nd ed. Optimization and Computation Series. Belmont, MA: Athena Scientific.
Hansen, Bruce. 2022. Econometrics. Princeton, NJ: Princeton University Press.
Wainwright, Martin J. 2019. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. https://doi.org/10.1017/9781108627771.
Wooldridge, Jeffrey M. 2020. Introductory Econometrics: A Modern Approach. 7th ed. Boston, MA: Cengage.