7  Variance of Heterogeneous Coefficients

Summary and Learning Outcomes

This section discusses how to identify the variance of heterogeneous coefficients without restricting the dependence between coefficients and covariates.

By the end of this section, you should be able to:

  • Explain why average coefficients provide an incomplete picture.
  • Identify the variance of heterogeneous coefficients using a temporal homoskedasticity assumption on the residuals.
  • Construct a consistent estimator for the variance of coefficients.

7.1 Beyond Average Coefficients

7.1.1 Potential Objects of Interest

Previously, we focused on estimating the average coefficient vector \(\E[\bbeta_i]\) in our model (2.7): \[ y_{it} = \bx_{it}'\bbeta_i + u_{it}. \tag{7.1}\] In particular, Chapter 6 has shown that the mean group estimator can consistently estimate \(\E[\bbeta_i]\) even without imposing restrictions on the dependence between \(\bbeta_i\) and \(\bx_{it}\).

Average effects are informative but limited. They do not capture the full variation in responses across individuals (Heckman, Smith, and Clements 1997). Other parameters of interest include:

  • The moments of \(\bbeta_i\):
    • Variance: overall dispersion in effects.
    • Skewness: asymmetry in responses.
    • Higher moments: shape of the distribution.
  • The full distribution and quantiles of \(\bbeta_i\). For example, the distribution may be used to compute what proportion of people benefit from changes in \(x_{it}\) and what proportion are hurt by them.

7.1.2 Model

As it turns out, it is possible to identify such distributional features in the static version of model (7.1) without restricting the dependence structure between \(\bbeta_i\) and \(\bx_{it}\) (Arellano and Bonhomme 2012). In this section we discuss a streamlined version of their results for variance, while Chapter 9 shows how to identify the maximal object of interest — the full distribution.

Specifically, we consider model (7.1) under a strict exogeneity condition of the form: \[ \E[u_{it}|\bbeta_i, \bX_i] = 0. \tag{7.2}\] The number \(T\) of unit-level observations is assumed to exceed the number \(p\) of covariates. We will treat \(T\) as fixed and consider large-\(N\) identification and estimation arguments.

As before, the model can be written in unit matrix form as \[ \by_i = \bX_i\bbeta_i + \bu_i. \] We will again assume that \(\det(\bX_i'\bX_i)>0\) for all \(i\). If this condition does not hold, our results will be about the subpopulation of units with positive determinant, as in Chapter 6.
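To make the setting concrete, here is a minimal simulation sketch of model (7.1) in numpy. The specific choices below (Gaussian covariates, the particular form of dependence between \(\bbeta_i\) and \(\bX_i\), the scale function for the errors) are illustrative assumptions, not part of the model; the errors are drawn serially uncorrelated and homoskedastic over time given \(\bX_i\), anticipating the restriction imposed in Section 7.2.2.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_panel(N=2000, T=6, p=2):
    """Draw (y_i, X_i, beta_i) from y_it = x_it' beta_i + u_it under (7.2)."""
    X = rng.normal(size=(N, T, p))
    # beta_i is allowed to depend on X_i (here through the unit-level mean
    # of the covariates), so the beta-X dependence is unrestricted.
    beta = np.ones(p) + 0.7 * X.mean(axis=1) + rng.normal(scale=0.5, size=(N, p))
    # u_it: mean zero given (beta_i, X_i), serially uncorrelated, and
    # homoskedastic over t, with a scale that may depend on X_i.
    sigma = 0.5 + 0.5 * np.abs(X[:, 0, 0])
    u = sigma[:, None] * rng.normal(size=(N, T))
    y = np.einsum('ntp,np->nt', X, beta) + u
    return y, X, beta
```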

7.2 Identification

Our object of interest in this section is the variance-covariance matrix \(\var(\bbeta_i)\) of the coefficients \(\bbeta_i\): \[ \var(\bbeta_i) = \E[\bbeta_i\bbeta_i'] - \E[\bbeta_i]\E[\bbeta_i]'. \] Its diagonal terms are the variances of the individual coefficients, which show how dispersed the effects are overall. The off-diagonal covariance terms capture whether the effects of different covariates tend to move in the same or opposite directions.

7.2.1 Variance Decomposition

To identify \(\var(\bbeta_i)\), we will first look at the second moments of \(\by_i\). Specifically, we will consider the conditional second moment of \(\by_i\) given \(\bX_i=\bX\), where \(\bX\) is some potential value of \(\bX_i\) such that \(\det(\bX'\bX)>0\) and \(\bX\) lies in the support of \(\bX_i\).

Using model (7.1) and condition (7.2), the conditional second moment of \(\by_i\) can be represented as \[ \begin{aligned} &\E[\by_i\by_i'|\bX_i=\bX]\\ & = \E\left[ (\bX_i\bbeta_i+ \bu_i)(\bX_i\bbeta_i+\bu_i)'|\bX_i=\bX \right]\\ & =\bX \E\left[\bbeta_i\bbeta_i'|\bX_i=\bX \right]\bX' + \E\left[\bu_i\bu_i'|\bX_i=\bX \right], \end{aligned} \tag{7.3}\] where the cross term \(\bX\E[\bbeta_i\bu_i'|\bX_i=\bX]\) and its transpose vanish by condition (7.2).

The above expression decomposes the conditional second moment of \(\by_i\) into two components — one corresponding to \(\bbeta_i\) and the other one to \(\bu_i\).
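The decomposition is straightforward to verify numerically. The following sketch (continuing the numpy setup above, with illustrative distributions) fixes a single design matrix shared by many independent draws and compares both sides of Equation 7.3:

```python
# Monte Carlo check of decomposition (7.3) for one fixed design matrix
T, p = 4, 1
X_fix = rng.normal(size=(T, p))            # one design shared by all draws
R = 200_000
beta_r = 1.0 + 0.5 * rng.normal(size=(R, p))
u_r = 0.3 * rng.normal(size=(R, T))        # so E[u u' | X] = 0.09 * I_T
y_r = beta_r @ X_fix.T + u_r
lhs = y_r.T @ y_r / R                      # sample analogue of E[y y' | X]
rhs = X_fix @ (beta_r.T @ beta_r / R) @ X_fix.T + 0.09 * np.eye(T)
print(np.abs(lhs - rhs).max())             # close to zero
```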

In general, we cannot separate the two components of the second moment of \(\by_i\). To see why, consider Equation 7.3 as a system of linear equations. The unknowns are the components of the symmetric matrices

  • \(\E\left[\bbeta_i\bbeta_i'|\bX_i=\bX \right]\) — a total of \(p(p+1)/2\) unknowns.
  • \(\E\left[\bu_i\bu_i'|\bX_i=\bX \right]\) — a total of \(T(T+1)/2\) unknowns.

The system is underdetermined: the symmetric \(T\times T\) matrix on the left-hand side supplies only \(T(T+1)/2\) distinct equations, fewer than the \(T(T+1)/2 + p(p+1)/2\) unknowns.

At heart, the underdeterminacy stems from the fact that the model allows any dynamic structure for \(\curl{u_{it}}_{t=1}^T\). As a consequence, \(\E[\bu_i\bu_i'|\bX_i=\bX]\) has too many free elements relative to the number of equations.

7.2.2 Imposing Structure on the Error Term

This issue can be resolved by imposing assumptions on the time series dependence of \(u_{it}\). The magnitudes of \(T\) and \(p\) determine how many restrictions are necessary. After taking out the \(p(p+1)/2\) parameters of \(\E[\bbeta_i\bbeta_i'|\bX_i=\bX]\), we have at most \(\left[T(T+1)-p(p+1) \right]/2\) equations left, which caps the number of free parameters we can allow in \(\E[\bu_i\bu_i'|\bX_i=\bX]\). In the most unfavorable case \(T=p+1\), and we can allow only \(T\) free parameters in \(\E[\bu_i\bu_i'|\bX_i=\bX]\).

Various assumptions are possible, and Arellano and Bonhomme (2012) explore moving average and autoregressive structures for \(u_{it}\). Here, we will consider the simplest case in which \(u_{it}\) is conditionally homoskedastic across time (but not necessarily across \(\bX\)) and uncorrelated across \(t\): \[ \begin{aligned} \E[u_{it}^2|\bX_i=\bX] & = \sigma^2(\bX), \\ \E[u_{it}u_{is}|\bX_i=\bX] & = 0, \quad t\neq s. \end{aligned} \tag{7.4}\] Under assumption (7.4) it holds that \[ \E\left[\bu_i\bu_i'|\bX_i=\bX \right] = \sigma^2(\bX)\bI_T. \] There is only one unknown parameter in \(\E[\bu_i\bu_i'|\bX_i=\bX]\)!

7.2.3 Variance of Residuals

We can identify the unknown \(\sigma^2(\bX)\) using a standard argument. Let the annihilator matrix associated with \(\bX\) be given by \[ \bM(\bX) = \bI_T-\bX(\bX'\bX)^{-1}\bX'. \tag{7.5}\] Recall three key properties that \(\bM(\cdot)\) possesses: \[ \begin{aligned} \bM(\bX)\bX & =0, \\ \bM(\bX)\bM(\bX) & =\bM(\bX), \\ \bM(\bX)' & =\bM(\bX). \end{aligned} \] Now consider the following second moment of \(\bM(\bX)\by_i\) conditional on \(\bX_i=\bX\): \[ \begin{aligned} & \E[\by_i'\bM(\bX)'\bM(\bX)\by_i|\bX_i=\bX]\\ & = \E[\bu_i'\bM(\bX)\bu_i|\bX_i=\bX] \\ & = \E\left[ \mathrm{tr}\left(\bM(\bX)\bu_i\bu_i'\right)|\bX_i=\bX\right] \\ & = \mathrm{tr}\left(\bM(\bX)\E[\bu_i\bu_i'|\bX_i=\bX]\right) \\ & = \sigma^2(\bX)\,\mathrm{tr}(\bM(\bX)) \\ & = \sigma^2(\bX)(T-p). \end{aligned} \] The first equality uses \(\bM(\bX)\by_i=\bM(\bX)\bu_i\) together with the symmetry and idempotence of \(\bM(\bX)\), the second and third use the cyclic property of the trace, and the last one uses \(\mathrm{tr}(\bM(\bX)) = T - \mathrm{tr}\left(\bX(\bX'\bX)^{-1}\bX'\right) = T-p\). Further details of the trace argument can be found in section 4.11 of Hansen (2022).

We conclude that \(\sigma^2(\bX)\) is identified as \[ \sigma^2(\bX) = \dfrac{1}{T-p} \E[\by_i'\bM(\bX)\by_i|\bX_i=\bX]. \]
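With continuously distributed covariates we cannot literally average over units that share the same \(\bX\). However, the same trace argument applied at \(\bX=\bX_i\) shows that \(\by_i'\bM(\bX_i)\by_i/(T-p)\) is conditionally unbiased for \(\sigma^2(\bX_i)\) unit by unit, which is all that the estimator of Section 7.3 requires. A minimal sketch, continuing the numpy setup above:

```python
def sigma2_hat(y_i, X_i):
    """Per-unit estimate of sigma^2(X_i): y' M(X) y / (T - p)."""
    T, p = X_i.shape
    # Annihilator matrix M(X) = I - X (X'X)^{-1} X' from (7.5)
    M = np.eye(T) - X_i @ np.linalg.solve(X_i.T @ X_i, X_i.T)
    return y_i @ M @ y_i / (T - p)
```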

Note that it is also possible to solve for variance parameters from Equation 7.3! This approach can be useful when residuals exhibit more general dependence.

7.2.4 Variance of Coefficients

We are now in a position to identify \(\var(\bbeta_i)\). In principle, we can solve for \(\E[\bbeta_i\bbeta_i']\) by suitably vectorizing the system in Equation 7.3. However, it will be more convenient to go back to the individual estimators (6.2). For brevity, let \(\bH_i = (\bX_i'\bX_i)^{-1}\bX_i'\), so that
\[ \hat{\bbeta}_i = (\bX_i'\bX_i)^{-1}\bX_i'\by_i = \bbeta_i + \bH_i\bu_i. \] As \(\E[\bbeta_i(\bH_i\bu_i)']=0\) and \(\E[\bH_i\bu_i]=0\) by condition (7.2), the variance of the individual estimator can be decomposed as \[ \begin{aligned} & \var(\hat{\bbeta}_i) = \var\left(\bbeta_i + \bH_i\bu_i \right)\\ & = \var(\bbeta_i) + \var(\bH_i\bu_i)\\ & = \var(\bbeta_i) + \E\left[\E\left[\bH_i\bu_i\bu_i'\bH_i'|\bX_i \right]\right]\\ & = \var(\bbeta_i) + \E\left[\bH_i\E\left[\bu_i\bu_i'|\bX_i \right]\bH_i'\right]\\ & = \var(\bbeta_i) + \E\left[\sigma^2(\bX_i)\bH_i\bH_i'\right], \end{aligned} \] where the final equality uses assumption (7.4). Note that \(\bH_i\bH_i' = (\bX_i'\bX_i)^{-1}\).

Observe that the variance \(\var(\hat{\bbeta}_i)\) of the individual estimators \(\hat{\bbeta}_i\) (not of \(\bbeta_i\)!) is identified as \[ \var(\hat{\bbeta}_i) = \var\left( (\bX_i'\bX_i)^{-1}\bX_i'\by_i \right). \]

Substituting the expression for \(\sigma^2(\bX)\) identified above and rearranging, we obtain the following explicit expression for \(\var(\bbeta_i)\): \[ \begin{aligned} \var(\bbeta_i) & = \var(\hat{\bbeta}_i) - \E\left[\sigma^2(\bX_i)\bH_i\bH_i'\right]\\ & = \var\left( (\bX_i'\bX_i)^{-1}\bX_i'\by_i \right)\\ & \quad - \dfrac{1}{T-p}\E\left[ \E[\by_i'\bM(\bX_i)\by_i|\bX_i]\,\bH_i\bH_i'\right]. \end{aligned} \tag{7.6}\] The right-hand side is a function of the distribution of the observed data \((\by_i, \bX_i)\); hence \(\var(\bbeta_i)\) is identified.

7.3 Estimation

To construct an estimator for \(\var(\bbeta_i)\) based on Equation 7.6, we replace the population expectations with their sample counterparts. Specifically, \(\var(\hat{\bbeta}_i)\) is estimated using the sample variance of the individual estimators: \[ \begin{aligned} \widehat{\var(\hat{\bbeta}_i) } = \dfrac{1}{N}\sum_{i=1}^N \left( \hat{\bbeta}_i - \hat{\bbeta}_{MG} \right)\left(\hat{\bbeta}_i-\hat{\bbeta}_{MG} \right)'. \end{aligned} \] Likewise, we replace the expectation in the second term of Equation 7.6 with a sample average.

This process yields an estimator for the variance of \(\bbeta_i\): \[ \begin{aligned} \widehat{\var(\bbeta_i) } & = \dfrac{1}{N}\sum_{i=1}^N \left( \hat{\bbeta}_i - \hat{\bbeta}_{MG} \right)\left(\hat{\bbeta}_i-\hat{\bbeta}_{MG} \right)'\\ & \quad - \dfrac{1}{(T-p)N}\sum_{i=1}^N \left(\by_i'\bM(\bX_i)\by_i\right)\bH_i\bH_i'. \end{aligned} \] Under standard regularity conditions, \(\widehat{\var(\bbeta_i) }\) is consistent and (with some extra work) asymptotically normal, enabling inference on the variance of \(\bbeta_i\).
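A compact numpy implementation of this estimator, continuing the simulation sketch above, could look as follows. It uses the simplification \(\bH_i\bH_i' = (\bX_i'\bX_i)^{-1}\) noted earlier; everything else follows Equation 7.6 directly.

```python
def var_beta_hat(y, X):
    """Sample counterpart of Equation 7.6."""
    N, T, p = X.shape
    beta_hat = np.empty((N, p))
    correction = np.zeros((p, p))
    for i in range(N):
        XtX_inv = np.linalg.inv(X[i].T @ X[i])
        H = XtX_inv @ X[i].T                    # H_i
        beta_hat[i] = H @ y[i]                  # unit-level OLS estimate
        M = np.eye(T) - X[i] @ H                # annihilator M(X_i)
        s2 = y[i] @ M @ y[i] / (T - p)          # estimate of sigma^2(X_i)
        correction += s2 * XtX_inv              # H_i H_i' = (X_i'X_i)^{-1}
    correction /= N
    b_mg = beta_hat.mean(axis=0)                # mean group estimator
    dev = beta_hat - b_mg
    return dev.T @ dev / N - correction

y, X, beta = simulate_panel()
print(var_beta_hat(y, X))                       # estimated var(beta_i)
print(np.cov(beta, rowvar=False))               # sample variance of true betas
```

On simulated data the two printed matrices should agree up to sampling noise, illustrating the consistency of the estimator.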

7.4 Higher-Order Moments and Moving Beyond

It turns out that a similar strategy can be used to identify the third and higher-order moments of \(\bbeta_i\); see Arellano and Bonhomme (2012) for the details. The key ingredients of the identification argument for the \(k\)th moments of \(\bbeta_i\) are:

  • \(k\)th moments (or cumulants) of \(\by_i\) and \(\hat{\bbeta}_i\).
  • Restrictions on the time series dynamics of \(\curl{u_{it}}_{t=1}^T\).
  • Multiplicative separability of the moments of \(\bbeta_i\) and \(\bu_i\) up to order \(k\).

However, the moments of \(\bbeta_i\) can also be obtained from its full distribution, which may further be used to compute quantiles of \(\bbeta_i\) and other policy-relevant parameters. As such, the distribution is our maximal object of interest, and we now turn towards its identification.


Next Section

In the next section, we will review characteristic functions and deconvolution, key ingredients in the identification of the distribution of \(\bbeta_i\).