Consistency of the OLS Estimator

Convergence as Sample Size Grows

Vladislav Morozov

Introduction

Lecture Info

Learning Outcomes

This lecture is about consistency in general and consistency of the OLS estimator in particular


By the end, you should be able to

  • Provide definitions of convergence in probability and consistency
  • Handle the question of invertibility of \(\bX'\bX\)
  • Derive consistency results for the OLS estimator

Textbook References

  • Refresher on probability:
    • Your favorite probability textbook (e.g. chapter 5 in Bertsekas and Tsitsiklis (2008))
    • Sections B-C in Wooldridge (2020)
  • Consistency of the OLS estimator
    • 7.1-7.2 in Hansen (2022)
    • Or 5.1 and E4 in Wooldridge (2020)

Consistency as Basic Requirement

Want estimators with good properties


Consistency is a minimal requirement for a “sensible” estimator


Informally:

An estimation procedure is consistent if it gets the target parameter right as the sample size grows infinitely large

Probability Background

Definitions

Reminder: Convergence of a Deterministic Sequence

Recall:

Definition 1 Let \(\bx_1, \bx_2, \dots\) be a sequence of vectors in \(\R^p\). Then \(\bx_N\) converges to some \(\bx\in \R^p\) if for any \(\varepsilon>0\) there exists an \(N_0\) such that for all \(N\geq N_0\) \[ \norm{ \bx_N - \bx } < \varepsilon \]

Here \(\norm{\cdot}\) is the Euclidean norm on \(\R^p\): if \(\by = (y_1, \dots, y_p)\), then \(\norm{\by} = \sqrt{ \sum_{i=1}^p y_i^2 }\)

Towards Formalizing Convergence

  • Let \(\theta\in \R^p\)
  • Sample of size \(N\) is \((X_1, \dots, X_N)\)
  • Recall: estimators \(\curl{\hat{\theta}_N}_{N=N_{\min}}^{\infty}\) are a sequence of functions: the \(N\)th estimator maps \((X_1, \dots, X_N)\) to \(\R^p\) to produce an estimate


Sample is random \(\Rightarrow\) each \(\hat{\theta}_N(X_1, \dots, X_N)\) is random

\(\Rightarrow\) How to formalize convergence?

Convergence in Probability: Definition


Definition 2 Let \(\bX_1, \bX_2, \dots\) be a sequence of random matrices in \(\R^{k\times p}\). Then \(\bX_N\) converges to some \(\bX\in \R^{k\times p}\) in probability if for any \(\varepsilon>0\) it holds that \[ \lim_{N\to\infty} P(\norm{\bX_N - \bX}>\varepsilon) = 0 \]

Convergence in Probability: Discussion


  • The limit \(\bX\) can be random or deterministic
  • Convergence in probability is written \(\bX_N\xrightarrow{p}\bX\)
  • \(\bX_N\xrightarrow{p} \bX\) is the same as \(\bX_N-\bX\xrightarrow{p} 0\)

Two Important Characterizations

Proposition 1 Let \(\bA_N, \bA\) be \(m \times n\) matrices with \((i, j)\)th elements \(a_{ij, N}\) and \(a_{ij}\). Then \[ \bA_N\xrightarrow{p}\bA \Leftrightarrow a_{ij, N} \xrightarrow{p} a_{ij} \text{ for all } i, j \]


Proposition 2 \(\bX_N\xrightarrow{p}\bX\) if and only if \(P(\bX_N\in U)\to 1\) for any open set \(U\) that contains \(\bX\)

Definition of Consistency

Definition 3 The estimator (sequence) \(\hat{\theta}_N\) is consistent for \(\theta\) if as \(N\to\infty\) \[ \hat{\theta}_N(X_1, \dots, X_N) \xrightarrow{p} \theta \]

Note: we usually use the word “estimator” to refer to the whole sequence \(\curl{\hat{\theta}_N}\)

Tools for Showing Consistency

Two Approaches To Showing Consistency

  1. Qualitative: just shows that convergence happens
  • Relies on laws of large numbers (LLNs) and related results
  • Approach in this course
  2. Quantitative: shows that convergence happens and how fast
  • Usually based on concentration inequalities
  • Check out chapter 2 in Wainwright (2019)

Tool: (Weak) Law of Large Numbers

Proposition 3 Let \(\bX_1, \bX_2, \dots\) be a sequence of random vectors such that

  1. \(\bX_i\) are independently and identically distributed (IID)
  2. \(\E[\norm{\bX_i}]<\infty\)

Then \[ \small \dfrac{1}{N} \sum_{i=1}^N \bX_i \xrightarrow{p} \E[\bX_i] \]

Tool: Continuous Mapping Theorem

Proposition 4 Let \(\bX_N\xrightarrow{p}\bX\), and let \(f(\cdot)\) be continuous in some neighborhood of all the possible values of \(\bX\).

Then
\[ f(\bX_N) \xrightarrow{p} f(\bX) \]

In words: convergence in probability is preserved under continuous transformations

CMT Examples

Simple examples

  1. If \(X_N\) is scalar and \(X_N\xrightarrow{p}X\), then
    • \(X_N^2 \xrightarrow{p} X^2\)
    • \(\max\curl{0, X_N}\xrightarrow{p} \max\curl{0, X}\)
  2. If \(\bX_N\xrightarrow{p} \bX\), \(\bX_N\in \R^p\) and \(\bA_N\xrightarrow{p}\bA\), \(\bA_N\in\R^{k\times p}\), then \[ \bA_N\bX_N\xrightarrow{p} \bA\bX \]

Visualizing the Law of Large Numbers
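
A minimal simulation sketch of the WLLN (Proposition 3). Assumptions not from the lecture: NumPy is available and the draws are Exponential(1), so the population mean is 1. The running sample mean settles near 1 as \(N\) grows.

```python
# Sketch: running sample means of IID Exponential(1) draws approach the
# population mean E[X_i] = 1 (illustrative distribution and seed).
import numpy as np

rng = np.random.default_rng(0)
draws = rng.exponential(scale=1.0, size=10_000)

# Running means: (1/N) * sum_{i=1}^{N} X_i for N = 1, ..., 10_000
running_means = np.cumsum(draws) / np.arange(1, draws.size + 1)

for n in (10, 100, 1_000, 10_000):
    print(f"N = {n:>6}: sample mean = {running_means[n - 1]:.4f}")
```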

Consistency of the OLS Estimator

Returning to the OLS Estimator

Let’s go back to the OLS estimator based on IID sample \((\bX_1, Y_1), \dots, (\bX_N, Y_N)\)


Is the OLS estimator consistent?


Consistent for what?

Convergence Without a Causal Model

Invertibility of \(\bX'\bX\)

First an unpleasant technical issue

How to handle non-invertible \(\bX'\bX\)?


Use some known fall-back value \(\bc\): \[ \hat{\bbeta} = \begin{cases} (\bX'\bX)^{-1}\bX'\bY, & \bX'\bX \text{ invertible}\\ \bc, & \bX'\bX \text{ not invertible} \end{cases} \] Now \(\hat{\bbeta}\) is always defined, but does \(\bc\) matter?
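
A minimal code sketch of this piecewise definition (assuming NumPy; the function name and the fallback argument are illustrative, not from the lecture):

```python
# Piecewise OLS estimator: the usual formula when X'X is invertible,
# a pre-specified fallback value c otherwise.
import numpy as np

def beta_hat(X, Y, c):
    XtX = X.T @ X
    if np.linalg.matrix_rank(XtX) < XtX.shape[0]:   # X'X not invertible
        return c
    return np.linalg.solve(XtX, X.T @ Y)            # (X'X)^{-1} X'Y
```

As the next slides show, the particular choice of \(\bc\) does not matter asymptotically.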

Representation in Terms of Averages

First lecture shows that \[ \bX'\bX = \sum_{i=1}^N \bX_i\bX_i', \quad \bX'\bY = \sum_{i=1}^N \bX_i Y_i \]

Then we can write (under invertibility) \[ (\bX'\bX)^{-1}\bX'\bY = \left(\textcolor{teal}{\dfrac{1}{N}} \sum_{i=1}^N \bX_i\bX_i'\right)^{-1} \left( \textcolor{teal}{\dfrac{1}{N}} \sum_{i=1}^N \bX_i Y_i\right) \]

Limits of Averages

We know how to handle averages. If

  • \(\E[\norm{\bX_i\bX_i'}]<\infty\)
  • \(\E[\norm{\bX_iY_i}]<\infty\)

then by the WLLN (Proposition 3) \[ \begin{aligned} \dfrac{1}{N} \sum_{i=1}^N \bX_i\bX_i' \xrightarrow{p} \E[\bX_i\bX_i'], \quad \dfrac{1}{N} \sum_{i=1}^N \bX_iY_i \xrightarrow{p} \E[\bX_i Y_i] \end{aligned} \]

Handling the Inverse of \(\bX'\bX\)

Two facts:

  • The inverse function \(\bA\to \bA^{-1}\) is continuous on the space of invertible matrices
  • The set of invertible matrices is open

So if \(\E[\bX_i\bX_i']\) is invertible, then by Proposition 2 and the CMT (Proposition 4)

  • \(P(\frac{1}{N} \sum_{i=1}^N \bX_i\bX_i' \text{ is invertible})\to 1\)
  • \((\frac{1}{N}\sum_{i=1}^N \bX_i\bX_i')^{-1} \xrightarrow{p} \left(\E[\bX_i\bX_i']\right)^{-1}\)

\(\bc\) Does Not Matter

Since \(\frac{1}{N} \sum_{i=1}^N \bX_i\bX_i'\) is invertible with probability approaching 1 (w.p.a.1), it holds w.p.a.1 that \[ \hat{\bbeta} = (\bX'\bX)^{-1}\bX'\bY = \left( \dfrac{1}{N} \sum_{i=1}^N \bX_i\bX_i'\right)^{-1} \left( \dfrac{1}{N} \sum_{i=1}^N \bX_i Y_i\right) \]

It follows that if \(\bc\neq \E[\bX_i\bX_i']^{-1}\E[\bX_iY_i]\), then \[ P(\hat{\bbeta}= \bc) \to 0 \]

Combining Together

Proposition 5 Let

  1. \((\bX_i, Y_i)\) be IID
  2. \(\E[\norm{\bX_i\bX_i'}]<\infty\), \(\E[\norm{\bX_iY_i}]<\infty\)
  3. \(\E[\bX_i\bX_i']\) be invertible

Then \[ \hat{\bbeta} \xrightarrow{p} \left( \E[\bX_i\bX_i'] \right)^{-1} \E[\bX_iY_i] \]
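
A simulation sketch of Proposition 5 (assuming NumPy; the data-generating process below, which is nonlinear in \(X_i\), is purely illustrative). With \(x\sim N(0,1)\) and \(Y = 1 + 0.5x + 0.3x^2 + \varepsilon\), the population projection of \(Y_i\) on \((1, X_i)\) has coefficients \((1.3, 0.5)\), and the OLS estimates approach them:

```python
# Sketch: OLS converges to the population projection coefficients (Proposition 5),
# even though no linear causal model is assumed for the DGP (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def ols(N):
    x = rng.normal(size=N)
    y = 1.0 + 0.5 * x + 0.3 * x**2 + rng.normal(size=N)
    X = np.column_stack([np.ones(N), x])           # regressors: intercept and x
    return np.linalg.solve(X.T @ X, X.T @ y)       # (X'X)^{-1} X'Y

for N in (100, 10_000, 1_000_000):
    print(N, ols(N))
# With x ~ N(0,1): E[X_i X_i'] = I and E[X_i Y_i] = (1.3, 0.5),
# so the limit is (1.3, 0.5).
```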

Quicker Way of Writing

It is common and acceptable (also on the exam) to directly write \[ \hat{\bbeta} = (\bX'\bX)^{-1}\bX'\bY \] provided that you

  1. Are talking about asymptotic properties (consistency, asymptotic distributions, asymptotic confidence intervals)
  2. Make the assumption that \(\E[\bX_i\bX_i']\) is invertible and say that \(\bX'\bX\) is invertible w.p.a.1

Convergence of Estimator and Objective Function

Discussion


  • Proposition 5: no causal framework
  • OLS just measures covariances in general
  • The limit \(\left( \E[\bX_i\bX_i'] \right)^{-1} \E[\bX_iY_i]\) is called the “population projection coefficient of \(Y_i\) on \(\bX_i\)”

Convergence under Causal Model with Exogeneity and Homogeneous Effects

Potential Outcomes Framework

Let’s go back to our causal framework to add a causal part to Proposition 5

  • Treatment \(\bX_i\) with possible values \(\bx\)
  • Potential outcome under \(\bX_i=\bx\): \[ Y_i^{\bx} = \bx'\bbeta + U_i \tag{1}\]

Observed outcomes \(Y_i\) satisfy \[ Y_i = \bX_i'\bbeta + U_i \]

Sampling Error Representation

Can now substitute the equation for \(Y_i\) (+invertibility assumptions) to get
\[ \begin{aligned} \hat{\bbeta} & = \left( \dfrac{1}{N}\sum_{i=1}^N \bX_i\bX_i' \right)^{-1} \dfrac{1}{N}\sum_{i=1}^N \bX_iY_i\\ & = \bbeta + \left( \dfrac{1}{N}\sum_{i=1}^N \bX_i\bX_i' \right)^{-1} \dfrac{1}{N}\sum_{i=1}^N \bX_i U_i \end{aligned} \] Last line — sampling error form
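
Applying the WLLN (Proposition 3) and the CMT (Proposition 4) to the sampling error form then gives the limit directly, provided the moment and invertibility conditions of the next proposition hold and \(\E[\bX_iU_i]=0\): \[ \hat{\bbeta} = \bbeta + \left( \dfrac{1}{N}\sum_{i=1}^N \bX_i\bX_i' \right)^{-1} \dfrac{1}{N}\sum_{i=1}^N \bX_i U_i \xrightarrow{p} \bbeta + \left(\E[\bX_i\bX_i']\right)^{-1} \E[\bX_iU_i] = \bbeta \]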

Consistency of the OLS Estimator

Proposition 6 Let

  1. \((\bX_i, U_i)\) be IID and model (1) hold
  2. \(\E[\norm{\bX_i\bX_i'}]<\infty\), \(\E[\norm{\bX_iU_i}]<\infty\), \(\E[\bX_iU_i]=0\)
  3. \(\E[\bX_i\bX_i']\) be invertible

Then \[ \hat{\bbeta} \xrightarrow{p} \bbeta \]

Discussion of Assumptions


Need assumptions for causal interpretation:

  • Assumptions on the assignment mechanism (IID \(\bX_i\) and exogeneity/orthogonality)
  • Same assumptions as for identification of \(\bbeta\) (why can you use Proposition 6 to prove identification?)

Orthogonality vs Strict Exogeneity

Recall:

If \(\E[U_i|\bX_i]=0\), then \(\E[\bX_iU_i]=0\)

  • \(\E[U_i|\bX_i]=0\) is stronger than \(\E[\bX_iU_i]=0\)
  • If \(\E[U_i|\bX_i]=0\) holds, then \(\E[\hat{\bbeta}] = \bbeta\) (unbiased)
  • If \(\E[U_i|\bX_i]\neq 0\) but \(\E[\bX_i U_i]=0\), then \(\hat{\bbeta}\) may be biased in finite samples, but it still converges to the correct \(\bbeta\) (see the example after this list)
  • What if both fail?
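
An example where orthogonality holds but strict exogeneity fails (a sketch, not from the lecture): let \(X_i\sim N(0, 1)\) and \(U_i = X_i^2 - 1\). Then \[ \E[U_i|X_i] = X_i^2 - 1 \neq 0, \qquad \E[X_iU_i] = \E[X_i^3] - \E[X_i] = 0 \] If both conditions fail, the sampling error form shows that \(\hat{\bbeta}\) converges to \(\bbeta + \left(\E[\bX_i\bX_i']\right)^{-1}\E[\bX_iU_i]\), which generally differs from the causal \(\bbeta\).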

Convergence under Causal Model with Heterogeneous Effects

Allowing Heterogeneous Causal Effects

Before concluding, let’s go back to model (1)

  • Causal effects are homogeneous (same for everyone)
  • Might not be realistic

More general potential outcome equation \[ Y_i^\bx = \bx'\bbeta_{\textcolor{teal}{i}} + U_i \tag{2}\] Causal effect of a shift from \(\bx_1\) to \(\bx_2\) for unit \(i\): \[ (\bx_2-\bx_1)'\bbeta_i \]

Parameters of Interest In Model (2)

  1. Average \(\E[\bbeta_i]\) — enough to compute all average treatment effects
  2. Variance, other moments of \(\bbeta_i\)
  3. Distribution of \(\bbeta_i\)
  4. Other objects?

OLS Under Model (2)

We can still apply OLS under (2), but the representation in terms of \(U_i\) and \(\bbeta_i\) is different: \[ \small \begin{aligned} \hat{\bbeta} & = \left( \dfrac{1}{N} \sum_{i=1}^N \bX_i\bX_i' \right)^{-1}\dfrac{1}{N} \sum_{i=1}^N \bX_i Y_i \\ & = \left( \dfrac{1}{N} \sum_{i=1}^N \bX_i\bX_i' \right)^{-1}\dfrac{1}{N} \sum_{i=1}^N \bX_i\bX_i'\bbeta_i \\ & \quad + \left( \dfrac{1}{N} \sum_{i=1}^N \bX_i\bX_i' \right)^{-1}\dfrac{1}{N} \sum_{i=1}^N \bX_i U_i \end{aligned} \]

Limit of the OLS Estimator Under Model (2)

Proposition 7 Let

  1. \(X_i\) be scalar, \((X_i, \beta_i, U_i)\) be IID and model (2) hold as \[ \small Y_i^x = \beta_i x + U_i \]
  2. \(\E[X_i^2] \in (0,\infty)\), \(\E[X_i^2\cdot|\beta_i|]<\infty\), \(\E[|X_iU_i|]<\infty\), \(\E[X_iU_i]=0\)

Then \[ \small \hat{\beta} \xrightarrow{p} \E[W(X_i)\beta_i], \quad W(X_i) = X_i^2/\E[X_i^2] \]
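
A simulation sketch of Proposition 7 (assuming NumPy; the joint distribution of \((X_i, \beta_i)\) below is illustrative). When \(\beta_i\) depends on \(X_i\), the OLS estimate approaches the weighted average \(\E[W(X_i)\beta_i]\), here \(1.8\), rather than \(\E[\beta_i] = 1.5\):

```python
# Sketch: with heterogeneous slopes dependent on X, OLS converges to the
# variance-weighted average E[W(X_i) beta_i], not to E[beta_i] (illustrative DGP).
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000

x = rng.choice([1.0, 2.0], size=N)   # scalar regressor, values 1 and 2 equally likely
beta_i = x                           # heterogeneous slope, dependent on x
u = rng.normal(size=N)               # independent noise, so E[x * u] = 0
y = beta_i * x + u                   # model (2) without an intercept

beta_hat = (x @ y) / (x @ x)         # OLS of y on x, no intercept

print("OLS estimate:      ", round(beta_hat, 3))   # close to 1.8
print("E[beta_i]:         ", 1.5)                  # (1 + 2) / 2
print("E[W(X_i) beta_i]:  ", 1.8)                  # (1*1 + 4*2)/2 divided by E[x^2] = 2.5
```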

Proposition 7: Discussion


  • Proposition 7: OLS is estimating a weighted average of individual \(\beta_i\)
  • Weights \(W(X_i)\) are non-negative and \(\E[W(X_i)]=1\) (population version of summing to 1)
  • Still a “causal” parameter, but maybe hard to interpret
  • Not equal to \(\E[\bbeta_i]\) without further assumptions

Can You Identify \(\E[\bbeta_i]\)?


Only under restrictions:

  • Independence of \(\bbeta_i\) and \(\bX_i\) (experimental settings). Show this! (see the sketch after this list)
  • With instruments
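
A sketch of the independence argument in the scalar setting of Proposition 7: if \(\beta_i\) is independent of \(X_i\), then \[ \E[W(X_i)\beta_i] = \E\left[\dfrac{X_i^2}{\E[X_i^2]}\,\beta_i\right] = \dfrac{\E[X_i^2]\,\E[\beta_i]}{\E[X_i^2]} = \E[\beta_i], \] so the OLS limit is the average effect \(\E[\beta_i]\).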

Recap and Conclusions

Recap

In this lecture we

  1. Reviewed convergence in probability and consistency
  2. Discussed consistency of the OLS estimator
    • Without a causal framework (covariances)
    • In a causal model with homogeneous effects
    • In a causal model with heterogeneous effects

Next Questions


  • How fast does \(\hat{\bbeta}\) converge?
  • Distributional properties of \(\hat{\bbeta}\)?
  • Inference on \(\bbeta\)

References

Bertsekas, Dimitri P., and John N. Tsitsiklis. 2008. Introduction to Probability. 2nd ed. Optimization and Computation Series. Belmont, MA: Athena Scientific.
Hansen, Bruce. 2022. Econometrics. Princeton, NJ: Princeton University Press.
Wainwright, Martin J. 2019. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. https://doi.org/10.1017/9781108627771.
Wooldridge, Jeffrey M. 2020. Introductory Econometrics: A Modern Approach. 7th ed. Boston, MA: Cengage.