Convergence as Sample Size Grows
This lecture is about consistency in general and the consistency of the OLS estimator in particular
By the end, you should be able to define convergence in probability and consistency, apply the WLLN and the CMT, and establish consistency of the OLS estimator
Want estimators with good properties
Consistency is a minimal required property for a “sensible” estimator
An estimation procedure is consistent if it gets the target parameter right as the sample size grows infinitely large
Recall:
Definition 1 Let \(\bx_1, \bx_2, \dots\) be a sequence of vectors in \(\R^p\). Then \(\bx_N\) converges to some \(\bx\in \R^p\) if for any \(\varepsilon>0\) there exists an \(N_0\) such that for all \(N\geq N_0\) \[ \norm{ \bx_N - \bx } < \varepsilon \]
Here \(\norm{\cdot}\) is the Euclidean norm on \(\R^p\): if \(\by = (y_1, \dots, y_p)\), then \(\norm{\by} = \sqrt{ \sum_{i=1}^p y_i^2 }\)
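As a quick numerical illustration (a minimal sketch, not from the lecture): for the hypothetical sequence \(\bx_N = \bx + \bv/N\), the definition can be checked directly, since \(\norm{\bx_N - \bx} = \norm{\bv}/N\).

```python
import numpy as np

# Hypothetical example: x_N = x + v / N converges to x in the Euclidean norm
x = np.array([1.0, -2.0, 0.5])
v = np.array([3.0, 3.0, 3.0])

eps = 1e-3
for N in [10, 100, 10_000, 1_000_000]:
    x_N = x + v / N
    dist = np.linalg.norm(x_N - x)   # Euclidean norm ||x_N - x||
    print(N, dist, dist < eps)

# For any eps > 0 we have ||x_N - x|| = ||v|| / N < eps once N > ||v|| / eps,
# so the definition is satisfied with any N_0 larger than ||v|| / eps.
```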
Sample is random \(\Rightarrow\) each \(\hat{\theta}_N(X_1, \dots, X_N)\) is random
\(\Rightarrow\) How to formalize convergence?
See Wikipedia
Definition 2 Let \(\bX_1, \bX_2, \dots\) be a sequence of random matrices in \(\R^{k\times p}\). Then \(\bX_N\) converges to some \(\bX\in \R^{k\times p}\) in probability if for any \(\varepsilon>0\) it holds that \[ \lim_{N\to\infty} P(\norm{\bX_N - \bX}>\varepsilon) = 0 \]
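To make the definition concrete, here is a small simulation (an illustrative sketch, with a design I chose, not from the slides) that estimates \(P(\norm{\bX_N - \bX}>\varepsilon)\) when \(\bX_N\) is the sample mean of \(N\) IID draws and \(\bX\) is the (constant) population mean:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])   # the (constant) limit: the population mean
eps = 0.1
n_reps = 2_000               # Monte Carlo replications

for N in [10, 100, 1_000]:
    # each replication: the mean of N IID N(mu, I_2) draws
    draws = rng.normal(loc=mu, scale=1.0, size=(n_reps, N, 2))
    means = draws.mean(axis=1)                    # shape (n_reps, 2)
    dist = np.linalg.norm(means - mu, axis=1)     # ||X_bar_N - mu|| in each replication
    print(N, (dist > eps).mean())                 # estimate of P(||X_bar_N - mu|| > eps)
```

The estimated probability shrinks toward 0 as \(N\) grows, which is exactly what convergence in probability requires.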
Proposition 1 Let \(\bA_N, \bA\) be \(m \times n\) matrices with \((i, j)\)th elements \(a_{ij, N}\) and \(a_{ij}\), respectively. Then \[ \bA_N\xrightarrow{p}\bA \Leftrightarrow a_{ij, N} \xrightarrow{p} a_{ij} \text{ for all } i, j \]
Proposition 2 \(\bX_N\xrightarrow{p}\bX\) if and only if \(P(\bX_N\in U)\to 1\) for any open set \(U\) that contains \(\bX\)
Definition 3 The estimator (sequence) \(\hat{\theta}_N\) is consistent for \(\theta\) if as \(N\to\infty\) \[ \hat{\theta}_N(X_1, \dots, X_N) \xrightarrow{p} \theta \]
Note: we usually use the word “estimator” to refer to the whole sequence \(\curl{\hat{\theta}_N}\)
Wikipedia gives a list of some concentration inequalities
Proposition 3 Let \(\bX_1, \bX_2, \dots\) be a sequence of random vectors such that the \(\bX_i\) are IID and \(\E\left[ \norm{\bX_i} \right]<\infty\).
Then \[ \small \dfrac{1}{N} \sum_{i=1}^N \bX_i \xrightarrow{p} \E[\bX_i] \]
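A short simulation (an illustrative sketch, with a design of my own choosing) of the WLLN for the kind of average that appears later, \(\frac{1}{N}\sum_{i} \bX_i\bX_i'\):

```python
import numpy as np

rng = np.random.default_rng(1)

# Population: X_i = (1, Z_i)' with Z_i ~ N(0, 1), so E[X_i X_i'] = I_2
E_XX = np.eye(2)

for N in [100, 10_000, 1_000_000]:
    Z = rng.standard_normal(N)
    X = np.column_stack([np.ones(N), Z])     # N x 2 matrix with rows X_i'
    avg_XX = X.T @ X / N                     # (1/N) * sum_i X_i X_i'
    print(N, np.abs(avg_XX - E_XX).max())    # largest element-wise error shrinks (Proposition 1)
```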
Proposition 4 Let \(\bX_N\xrightarrow{p}\bX\), and let \(f(\cdot)\) be continuous in some neighborhood of all the possible values of \(\bX\).
Then
\[
f(\bX_N) \xrightarrow{p} f(\bX)
\]
In words: convergence in probability is preserved under continuous transformations
Simple examples
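For instance (an illustrative sketch, assuming \(Z_i\sim N(2, 1)\)): \(\bar{Z}_N\xrightarrow{p} 2\) by the WLLN, and since \(x\mapsto 1/x\) and \(x\mapsto e^x\) are continuous at 2, the CMT gives \(1/\bar{Z}_N\xrightarrow{p} 1/2\) and \(\exp(\bar{Z}_N)\xrightarrow{p} e^2\).

```python
import numpy as np

rng = np.random.default_rng(2)

for N in [10, 1_000, 100_000]:
    Z_bar = rng.normal(loc=2.0, scale=1.0, size=N).mean()  # Z_bar -> 2 by the WLLN
    # continuous functions of Z_bar converge to the functions of the limit (CMT)
    print(N, 1 / Z_bar, np.exp(Z_bar))  # approach 1/2 and exp(2) ~ 7.389
```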
Let’s go back to the OLS estimator based on an IID sample \((\bX_1, Y_1), \dots, (\bX_N, Y_N)\)
Is the OLS estimator consistent?
Consistent for what?
First, an unpleasant technical issue
How to handle non-invertible \(\bX'\bX\)?
Pick some known fall-back value \(\bc\): \[ \hat{\bbeta} = \begin{cases} (\bX'\bX)^{-1}\bX'\bY, & \bX'\bX \text{ invertible}\\ \bc, & \bX'\bX \text{ not invertible} \end{cases} \] Now \(\hat{\bbeta}\) is always defined, but does \(\bc\) matter?
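A minimal sketch of this fall-back estimator in code (the function name and the rank check are my own choices, not from the lecture):

```python
import numpy as np

def ols_with_fallback(X, Y, c):
    """OLS estimate, returning the known fall-back value c when X'X is singular."""
    XtX = X.T @ X
    if np.linalg.matrix_rank(XtX) < XtX.shape[0]:   # X'X not invertible
        return c
    return np.linalg.solve(XtX, X.T @ Y)            # (X'X)^{-1} X'Y

# A collinear design triggers the fall-back value
rng = np.random.default_rng(3)
Z = rng.standard_normal(50)
X = np.column_stack([Z, 2 * Z])                 # second column is a multiple of the first
Y = rng.standard_normal(50)
print(ols_with_fallback(X, Y, c=np.zeros(2)))   # returns c = [0, 0]
```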
The first lecture showed that \[ \bX'\bX = \sum_{i=1}^N \bX_i\bX_i', \quad \bX'\bY = \sum_{i=1}^N \bX_i Y_i \]
Then we can write (under invertibility) \[ (\bX'\bX)^{-1}\bX'\bY = \left(\textcolor{teal}{\dfrac{1}{N}} \sum_{i=1}^N \bX_i\bX_i'\right)^{-1} \left( \textcolor{teal}{\dfrac{1}{N}} \sum_{i=1}^N \bX_i Y_i\right) \]
The WLLN can handle such averages. If the observations are IID with \(\E\left[\norm{\bX_i\bX_i'}\right]<\infty\) and \(\E\left[\norm{\bX_iY_i}\right]<\infty\),
then by the WLLN (Proposition 3) \[ \begin{aligned} \dfrac{1}{N} \sum_{i=1}^N \bX_i\bX_i' \xrightarrow{p} \E[\bX_i\bX_i'], \quad \dfrac{1}{N} \sum_{i=1}^N \bX_iY_i \xrightarrow{p} \E[\bX_i Y_i] \end{aligned} \]
Two facts: the set of invertible square matrices is open, and the map \(\bA\mapsto \bA^{-1}\) is continuous on that set
So if \(\E[\bX_i\bX_i']\) is invertible, then by Proposition 2 and the CMT (Proposition 4) \[ \left( \dfrac{1}{N} \sum_{i=1}^N \bX_i\bX_i'\right)^{-1} \xrightarrow{p} \left( \E[\bX_i\bX_i'] \right)^{-1} \] and \(\frac{1}{N} \sum_{i=1}^N \bX_i\bX_i'\) is invertible with probability approaching 1
See this StackExchange discussion for more details
Since \(\frac{1}{N} \sum_{i=1}^N \bX_i\bX_i'\) is invertible with probability approaching 1 (w.p.a. 1), it holds w.p.a. 1 that \[ \hat{\bbeta} = (\bX'\bX)^{-1}\bX'\bY = \left( \dfrac{1}{N} \sum_{i=1}^N \bX_i\bX_i'\right)^{-1} \left( \dfrac{1}{N} \sum_{i=1}^N \bX_i Y_i\right) \]
It follows that if \(\bc\neq \E[\bX_i\bX_i']^{-1}\E[\bX_iY_i]\), then \[ P(\hat{\bbeta}= \bc) \to 0 \]
Proposition 5 Let \((\bX_1, Y_1), \dots, (\bX_N, Y_N)\) be IID with \(\E\left[\norm{\bX_i\bX_i'}\right]<\infty\), \(\E\left[\norm{\bX_iY_i}\right]<\infty\), and \(\E[\bX_i\bX_i']\) invertible.
Then \[ \hat{\bbeta} \xrightarrow{p} \left( \E[\bX_i\bX_i'] \right)^{-1} \E[\bX_iY_i] \]
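A simulation sketch of Proposition 5 (my own illustrative design, not from the lecture): with \(\bX_i = (1, Z_i)'\), \(Z_i\sim N(0,1)\), and \(Y_i = Z_i^2 + \varepsilon_i\), the limit \(\left( \E[\bX_i\bX_i'] \right)^{-1} \E[\bX_iY_i]\) can be computed by hand as \((1, 0)'\), and \(\hat{\bbeta}\) approaches it even though \(Y_i\) is not linear in \(\bX_i\).

```python
import numpy as np

rng = np.random.default_rng(4)
limit = np.array([1.0, 0.0])   # (E[X_i X_i'])^{-1} E[X_i Y_i] computed by hand for this design

for N in [100, 10_000, 1_000_000]:
    Z = rng.standard_normal(N)
    X = np.column_stack([np.ones(N), Z])   # X_i = (1, Z_i)'
    Y = Z**2 + rng.standard_normal(N)      # Y_i is NOT linear in X_i
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    print(N, beta_hat, np.linalg.norm(beta_hat - limit))
```

This illustrates the answer to “consistent for what?”: without further assumptions, the limit is the linear projection coefficient rather than a structural parameter.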
It is common and acceptable (also on the exam) to directly write \[ \hat{\bbeta} = (\bX'\bX)^{-1}\bX'\bY \] provided that you keep the invertibility issue in mind and state that \(\E[\bX_i\bX_i']\) (and hence, w.p.a. 1, \(\bX'\bX\)) is invertible
See Wikipedia on inner products of random variables
Let’s go back to our causal framework to add a causal part to Proposition 5
Observed outcomes \(Y_i\) satisfy \[ Y_i = \bX_i'\bbeta + U_i \]
We can now substitute the equation for \(Y_i\) (plus the invertibility assumptions) to get
\[
\begin{aligned}
\hat{\bbeta} & = \left( \dfrac{1}{N}\sum_{i=1}^N \bX_i\bX_i' \right)^{-1} \dfrac{1}{N}\sum_{i=1}^N \bX_iY_i\\
& = \bbeta + \left( \dfrac{1}{N}\sum_{i=1}^N \bX_i\bX_i' \right)^{-1} \dfrac{1}{N}\sum_{i=1}^N \bX_i U_i
\end{aligned}
\] Last line — sampling error form
Proposition 6 Let the assumptions of Proposition 5 hold, and in addition let \(Y_i = \bX_i'\bbeta + U_i\) with \(\E[\bX_iU_i]=0\).
Then \[ \hat{\bbeta} \xrightarrow{p} \bbeta \]
We need assumptions for a causal interpretation: in particular, exogeneity of the regressors (e.g., \(\E[U_i|\bX_i]=0\))
Recall:
If \(\E[U_i|\bX_i]=0\), then \(\E[\bX_iU_i]=0\)
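The implication follows from the law of iterated expectations (a one-line check, using only the stated condition): \[ \E[\bX_iU_i] = \E\left[ \E[\bX_iU_i|\bX_i] \right] = \E\left[ \bX_i\, \E[U_i|\bX_i] \right] = \E[\bX_i \cdot 0] = 0 \]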
Before concluding, let’s go back to model (1)
More general potential outcome equation \[ Y_i^\bx = \bx'\bbeta_{\textcolor{teal}{i}} + U_i \tag{2}\] Causal effect of shift from \(\bx_1\) to \(\bx_2\) for unit \(i\): \[ (\bx_2-\bx_1)'\bbeta_i \]
We can still apply OLS under (2), but the representation in terms of \(U_i\) and \(\bbeta_i\) is different: \[ \small \begin{aligned} \hat{\bbeta} & = \left( \dfrac{1}{N} \sum_{i=1}^N \bX_i\bX_i' \right)^{-1}\dfrac{1}{N} \sum_{i=1}^N \bX_i Y_i \\ & = \left( \dfrac{1}{N} \sum_{i=1}^N \bX_i\bX_i' \right)^{-1}\dfrac{1}{N} \sum_{i=1}^N \bX_i\bX_i'\bbeta_i \\ & \quad + \left( \dfrac{1}{N} \sum_{i=1}^N \bX_i\bX_i' \right)^{-1}\dfrac{1}{N} \sum_{i=1}^N \bX_i U_i \end{aligned} \]
Proposition 7 Let \(X_i\) be scalar; let \((X_i, \beta_i, U_i)\) be IID across \(i\) with \(\E[X_i^2]\in(0, \infty)\), \(\E\left[\left|X_i^2\beta_i\right|\right]<\infty\), and \(\E[X_iU_i]=0\).
Then \[ \small \hat{\bbeta} \xrightarrow{p} \E[W(X_i)\beta_i], \quad W(X_i) = X_i^2/\E[X_i^2] \]
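A simulation sketch of Proposition 7 (my own illustrative design): take scalar \(X_i\sim N(0,1)\), \(\beta_i = 1 + X_i^2\), and \(U_i\sim N(0,1)\) independent of \(X_i\). Then \(\E[\beta_i]=2\), while \(\E[W(X_i)\beta_i] = \E[X_i^2(1+X_i^2)]/\E[X_i^2] = (1+3)/1 = 4\), and \(\hat{\beta}\) approaches 4, not 2.

```python
import numpy as np

rng = np.random.default_rng(5)

for N in [100, 10_000, 1_000_000]:
    X = rng.standard_normal(N)
    beta_i = 1 + X**2                 # heterogeneous effects, correlated with X_i
    U = rng.standard_normal(N)
    Y = X * beta_i + U                # outcome generated by equation (2) at x = X_i
    beta_hat = (X @ Y) / (X @ X)      # OLS with a single scalar regressor, no intercept
    print(N, beta_hat)                # approaches E[W(X_i) beta_i] = 4, not E[beta_i] = 2
```

So with heterogeneous effects, OLS is consistent for a weighted average of the \(\beta_i\), not (in general) for the average effect \(\E[\beta_i]\).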
The limit equals the average effect \(\E[\beta_i]\) only under restrictions: for example, if \(\beta_i\) is mean independent of \(X_i\), then \(\E[W(X_i)\beta_i] = \E[W(X_i)]\E[\beta_i] = \E[\beta_i]\)
In this lecture we defined convergence in probability and consistency, reviewed the WLLN and the continuous mapping theorem, and showed that the OLS estimator is consistent for \(\left( \E[\bX_i\bX_i'] \right)^{-1} \E[\bX_iY_i]\); under a linear causal model with \(\E[\bX_iU_i]=0\) this limit is \(\bbeta\), while under heterogeneous effects it is a weighted average of the \(\beta_i\)
A Deeper Look at Linear Regression: Consistency