9  Distribution of Heterogeneous Coefficients

Summary and Learning Outcomes

This section shows how to identify the full distribution of the coefficients using a conditional deconvolution argument.

By the end of this section, you should be able to:

  • Describe the impact of conditional independence assumptions between \(\bbeta_i\) and \(u_{it}\).
  • Identify the distribution of \(\bbeta_i\) using a conditional deconvolution argument and assumptions on \(u_{it}\).
  • Propose a distribution estimator for the case of discrete covariates.

9.1 Model and Conditional Independence Assumption

9.1.1 Model

We are now in a position to obtain our final and strongest result in this block: identifying the full distribution of \(\bbeta_i\) in model (2.7): \[ y_{it} = \bbeta_i'\bx_{it} + u_{it}. \] As in Chapter 7, we impose strict exogeneity \[ \E[\bu_i|\bbeta_i, \bX_i] = 0. \] We also assume the existence of the first two moments of \(\bbeta_i\), \(\bX_i\), and \(\bu_i\).

As before, the number \(T\) of unit-level observations is assumed to exceed the number \(p\) of covariates. We treat \(T\) as fixed, and consider large-\(N\) identification and estimation arguments.
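To fix ideas, here is a minimal simulation of model (2.7). All distributional choices below (Gaussian coefficients and errors, the specific dimensions) are illustrative assumptions, not part of the model.

```python
import numpy as np

# Simulate y_it = beta_i' x_it + u_it with T > p, beta_i independent of u_it
rng = np.random.default_rng(0)
N, T, p = 1_000, 6, 2

X = rng.normal(size=(N, T, p))            # covariates x_it
beta = 1.0 + rng.normal(size=(N, p))      # unit-specific coefficients beta_i
u = rng.normal(size=(N, T))               # idiosyncratic errors u_it
y = np.einsum("ntp,np->nt", X, beta) + u  # N x T array of outcomes y_it
```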

9.1.2 Conditional Independence of \(\bbeta_i\) and \(\bu_i\)

The key to our identification strategy is the following assumption of conditional independence between \(\bbeta_i\) and \(\bu_i\): \[ \bbeta_i \independent \curl{u_{it}}_{t=1}^T |\bX_i. \tag{9.1}\] To understand assumption (9.1), consider a production function example. Let \(u_{it}\) be the measurement error in the output value \(y_{it}\). The variance of \(u_{it}\) may depend on the scale of the firm (captured by capital), leading to dependence between \(\bX_i\) and \(\bu_i\). However, it is plausible that firm size captures all the information about firm technology that is relevant for the measurement error. In this case, the assumption appears reasonable.

9.1.3 Implication of Conditional Independence

Given the conditional independence assumption (9.1), both \(\by_i\) and the individual estimators \(\hat{\bbeta}_i\) are sums of two conditionally independent vectors. Specifically, conditionally on \(\curl{\bX_i=\bX}\):

  • \(\by_i\) is the sum of \(\bX\bbeta_i\) and \(\bu_i\)
  • \(\hat{\bbeta}_i\) is the sum of \(\bbeta_i\) and \((\bX'\bX)^{-1}\bX'\bu_i\)

We can write the conditional characteristic function of \(\by_i\) given \(\bX_i=\bX\) using properties (8.1) and (8.2) as: \[ \begin{aligned} \varphi_{\by_i|\bX_i}(\bs|\bX) & = \varphi_{\bX\bbeta_i|\bX_i}(\bs|\bX)\varphi_{\bu_i|\bX_i}(\bs|\bX) \\ & = \varphi_{\bbeta_i|\bX_i}(\bX'\bs|\bX)\varphi_{\bu_i|\bX_i}(\bs|\bX). \end{aligned} \tag{9.2}\] Similarly, the conditional characteristic function of \(\hat{\bbeta}_i\) given \(\bX_i=\bX\) satisfies \[ \begin{aligned} \varphi_{\hat{\bbeta}_i|\bX_i}(\bs|\bX) & = \varphi_{\bbeta_i|\bX_i}(\bs|\bX) \varphi_{\bH_i\bu_i|\bX_i}(\bs|\bX) \\ & = \varphi_{\bbeta_i|\bX_i}(\bs|\bX) \varphi_{\bu_i|\bX_i}(\bX(\bX'\bX)^{-1}\bs|\bX), \end{aligned} \tag{9.3}\] where we again define \(\bH_i = (\bX_i'\bX_i)^{-1}\bX_i'\).
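The second decomposition is an exact algebraic identity and can be confirmed numerically; the sketch below uses arbitrary simulated values for a single unit.

```python
import numpy as np

# Verify beta_hat_i = beta_i + (X'X)^{-1} X' u_i for one unit
rng = np.random.default_rng(1)
T, p = 6, 2
X = rng.normal(size=(T, p))
beta_i = rng.normal(size=p)
u_i = rng.normal(size=T)
y_i = X @ beta_i + u_i

H = np.linalg.solve(X.T @ X, X.T)              # H_i = (X'X)^{-1} X'
print(np.allclose(H @ y_i, beta_i + H @ u_i))  # True: exact decomposition
```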

9.2 Identification of the Distribution

9.2.1 Overall Strategy

To identify the distribution, we will proceed similarly to how we identified the variance. The steps are:

  1. Identify \(\varphi_{\bu_i|\bX_i}(\cdot|\bX)\) from Equation 9.2.
  2. Apply deconvolution to Equation 9.3 to recover \(\varphi_{\bbeta_i|\bX_i}(\bs|\bX)\).
  3. Invert the characteristic function of \(\bbeta_i\) to obtain the distribution.

9.2.2 Equation in Hessians of Characteristic Functions

We start by rewriting Equation 9.2 in a more useful form. The characteristic functions in (9.2) are twice differentiable under our moment assumptions. Taking logarithms and differentiating twice with respect to \(\bs\) yields, by the chain rule, \[ \begin{aligned} & \dfrac{\partial^2 \log( \varphi_{\by_i|\bX_i}(\bs|\bX))}{\partial\bs\partial\bs'} \\ & = \bX \left.\dfrac{\partial^2 \log( \varphi_{\bbeta_i|\bX_i}(\bt|\bX)) }{\partial \bt\partial\bt'}\right|_{\bt=\bX'\bs} \bX' + \dfrac{\partial^2 \log(\varphi_{\bu_i|\bX_i}(\bs|\bX))}{\partial \bs\partial\bs'} . \end{aligned} \tag{9.4}\]

This equation is similar to the expression (7.3) we obtained for the variance; indeed, evaluating (9.4) at \(\bs=0\) recovers (7.3), because the Hessian of a log characteristic function at the origin equals minus the corresponding covariance matrix. The equation decomposes the characteristic function of the data into contributions from the coefficients \(\bbeta_i\) and the residuals \(\bu_i\). Unlike the variance expression, system (9.4) is a functional equation parametrized by \(\bs\).
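To make the chain-rule structure of the first term in (9.4) concrete, the sketch below numerically differentiates \(\bs \mapsto \log\varphi_{\bbeta_i|\bX_i}(\bX'\bs|\bX)\) and confirms the \(\bX(\cdot)\bX'\) form. The scalar Laplace-distributed coefficient (\(p=1\)) is an illustrative assumption.

```python
import numpy as np

# Chain-rule term in (9.4) for p = 1 and a Laplace(0,1) coefficient,
# whose characteristic function is 1 / (1 + t^2)
T = 4
rng = np.random.default_rng(2)
X = rng.normal(size=(T, 1))

log_cf_b = lambda t: -np.log1p(t**2)                 # log cf of beta_i
d2_log_cf_b = lambda t: 2 * (t**2 - 1) / (1 + t**2) ** 2

g = lambda s: log_cf_b((X.T @ s).item())             # log cf of X beta_i at s

# finite-difference Hessian of g at an arbitrary point s0
s0, h, I = rng.normal(size=T), 1e-5, np.eye(T)
H = np.array([[(g(s0 + h * (I[a] + I[c])) - g(s0 + h * I[a])
                - g(s0 + h * I[c]) + g(s0)) / h**2
               for c in range(T)] for a in range(T)])

t0 = (X.T @ s0).item()
print(np.allclose(H, d2_log_cf_b(t0) * (X @ X.T), atol=1e-3))  # True
```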

9.2.3 Imposing Structure on the Error Term

Our goal is to solve for the second term in the linear system (9.4). However, like system (7.3), system (9.4) is underdetermined. Accordingly, we need to impose additional assumptions to disentangle the \(\bu_i\) component from the \(\bbeta_i\) one.

In these notes, we consider a simple assumption that strengthens our temporal homoskedasticity assumption (7.4). Specifically, we assume that \(u_{it}\) is IID across \(i\) and \(t\), conditional on \(\bX_i=\bX\). Under this assumption, all the \(u_{it}\) share the same conditional characteristic function: \[ \varphi_{u_{i1}|\bX_i}(s|\bX) = \cdots = \varphi_{u_{iT}|\bX_i}(s|\bX). \tag{9.5}\] We label the common function \(\varphi_{u|\bX_i}(s|\bX)\).

The characteristic function of the \(T\)-vector \(\bu_i\) can be written as \[ \begin{aligned} \varphi_{\bu_i|\bX_i}(\bs|\bX) & = \prod_{j=1}^T \varphi_{u|\bX_i}(s_j|\bX), \\ \bs & = (s_1, s_2, \dots, s_T). \end{aligned} \] Taking logarithms turns the product into a sum: \[ \log\left(\varphi_{\bu_i|\bX_i}(\bs|\bX)\right) = \sum_{j=1}^T \log(\varphi_{u|\bX_i}(s_j|\bX)). \] The Hessian of this function with respect to \(\bs\) is diagonal: \[ \begin{aligned} \dfrac{\partial^2 \log(\varphi_{\bu_i|\bX_i}(\bs|\bX))}{\partial \bs\partial\bs'} & = \diag\curl{\bphi(\bs)},\\ \end{aligned} \] where \[ \bphi(\bs) = \left(\dfrac{d^2\log(\varphi_{u|\bX_i}(s_1|\bX))}{ds_1^2}, \dots, \dfrac{d^2\log(\varphi_{u|\bX_i}(s_T|\bX))}{ds_T^2}\right). \] To summarize, assumption (9.5) reduces the unknown \(T\times T\) matrix \(\frac{\partial^2 \log(\varphi_{\bu_i|\bX_i}(\bs|\bX))}{\partial \bs\partial\bs'}\) to an unknown \(T\)-vector \(\bphi(\bs)\). There are now sufficiently many equations to cover all the remaining unknown components, provided standard rank conditions hold.

9.2.4 Solving for the Distribution of Residuals

To solve for \(\bphi(\bs)\), we return to (9.4). We first put it into the more familiar tall form (one equation per row) using the vectorization operator, which yields \[ \begin{aligned} & \vecc\left(\dfrac{\partial^2 \log( \varphi_{\by_i|\bX_i}(\bs|\bX))}{\partial\bs\partial\bs'}\right) \\ & = (\bX \otimes \bX) \vecc\left(\left.\dfrac{\partial^2 \log( \varphi_{\bbeta_i|\bX_i}(\bt|\bX)) }{\partial \bt\partial\bt'}\right|_{\bt=\bX'\bs}\right) + \bA\bphi(\bs), \end{aligned} \tag{9.6}\] where \(\bA\) is the known \(T^2\times T\) matrix satisfying \(\bA\bphi(\bs) = \vecc(\diag\curl{\bphi(\bs)})\); its \(j\)th column has a single one, in position \((j-1)T+j\).

Now we premultiply system (9.6) by \(\bM(\bX\otimes \bX)\) where \(\bM(\cdot)\) is defined in (7.5): \[ \begin{aligned} & \bM(\bX\otimes \bX) \vecc\left(\dfrac{\partial^2 \log( \varphi_{\by_i|\bX_i}(\bs|\bX))}{\partial\bs\partial\bs'}\right) \\ & = \bM(\bX\otimes \bX)\bA\bphi(\bs) \end{aligned} \tag{9.7}\] We can solve this system for \(\bphi(\bs)\) provided \(\rank(\bM(\bX\otimes \bX)\bA)=T\). Indeed, this rank condition holds in this case, as shown by Arellano and Bonhomme (2012). As both \(\bM(\bX\otimes\bX)\) and \(\varphi_{\by_i|\bX_i}(\cdot|\cdot)\) are identified, we conclude that \(\bphi(\bs)\) is also identified.
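This rank condition can also be checked numerically. The sketch below assumes that \(\bM(\bC) = \bI - \bC(\bC'\bC)^{-1}\bC'\) is the annihilator matrix from (7.5) (an assumption about the earlier chapter's notation) and constructs \(\bA\) explicitly.

```python
import numpy as np

# Check rank(M(X kron X) A) = T for a generic X
rng = np.random.default_rng(4)
T, p = 5, 2
X = rng.normal(size=(T, p))

C = np.kron(X, X)                                      # T^2 x p^2
M = np.eye(T * T) - C @ np.linalg.solve(C.T @ C, C.T)  # annihilator of C

A = np.zeros((T * T, T))                               # vec(diag(phi)) = A phi
for j in range(T):
    A[j * T + j, j] = 1.0

print(np.linalg.matrix_rank(M @ A))                    # T = 5 generically
```

Given an estimate of the left-hand side of (9.7), \(\bphi(\bs)\) can then be computed pointwise by least squares, e.g. with `np.linalg.lstsq`.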

The characteristic function of \(\bu_i\) is now straightforward to recover from \(\bphi(\bs)\) by integrating twice with respect to \(\bs\). As \(\bphi(\bs)\) encodes second derivatives, we need two initial values. These are provided by the properties of the characteristic function and the assumption of strict exogeneity: \[ \begin{aligned} \dfrac{\partial \log(\varphi_{\bu_i|\bX_i}(0|\bX))}{\partial\bs'} & = i\E[\bu_i|\bX_i=\bX]' = 0,\\ \log(\varphi_{\bu_i|\bX_i}(0|\bX)) & = 0. \end{aligned} \] This completes the first identification step.
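In the scalar case, this double integration is easy to carry out numerically. The sketch below takes \(u \sim N(0,1)\), an illustrative choice for which \((\log\varphi_u)'' \equiv -1\), and recovers the log characteristic function by cumulative trapezoidal integration with the two initial values above.

```python
import numpy as np

# Recover log cf of u ~ N(0,1) from its second derivative (identically -1),
# using the initial values log cf(0) = 0 and (log cf)'(0) = 0
s = np.linspace(0.0, 3.0, 301)
d2 = -np.ones_like(s)                       # phi(s) in the scalar case

trapz_cum = lambda g: np.concatenate(
    ([0.0], np.cumsum((g[1:] + g[:-1]) / 2 * np.diff(s))))
d1 = trapz_cum(d2)                          # (log cf)', starts at 0
log_cf = trapz_cum(d1)                      # log cf, starts at 0

print(np.allclose(log_cf, -s**2 / 2))       # True: Gaussian log cf recovered
```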

9.2.5 Identifying the Distribution of Coefficients

For the second step, identification of \(\varphi_{\bbeta_i|\bX_i}\), we return to Equation 9.3. We make an additional assumption: \[ \varphi_{\bu_i|\bX_i}(\bs|\bX)\neq 0 \text{ for all }\bs. \] This non-vanishing condition is standard in deconvolution problems and is satisfied by many common distributions, including the normal and the Laplace. It allows us to divide by the characteristic function of \(\bu_i\) in Equation 9.3 and obtain \[ \varphi_{\bbeta_i|\bX_i}(\bs|\bX) = \dfrac{\varphi_{\hat{\bbeta}_i|\bX_i}(\bs|\bX) }{\varphi_{\bu_i|\bX_i}(\bX(\bX'\bX)^{-1}\bs|\bX)}. \tag{9.8}\]

Finally, the density or cumulative distribution function of the coefficients may be recovered using inversion formulae. For continuously distributed coefficients with an absolutely integrable characteristic function, the conditional density is: \[ f_{\bbeta_i|\bX_i}(\bb|\bX) = \dfrac{1}{(2\pi)^p} \int_{\R^p} \exp(-i\bs'\bb)\varphi_{\bbeta_i|\bX_i}(\bs|\bX)d\bs. \] Lastly, we can recover the unconditional distribution of the coefficients, since the marginal distribution of \(\bX_i\) is directly identified from the data. For example, if \(f_{\bX_i}\) is the marginal density, the unconditional density of \(\bbeta_i\) is obtained by integrating \(\bX_i\) out: \[ f_{\bbeta_i}(\bb) = \int f_{\bbeta_i|\bX_i}(\bb|\bX)f_{\bX_i}(\bX)d\bX. \]
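A one-dimensional numerical illustration of the inversion formula, using the known characteristic function of a \(N(1, 0.5^2)\) coefficient (an illustrative choice):

```python
import numpy as np

# Invert the cf of N(1, 0.5^2) numerically to recover its density
s = np.linspace(-30.0, 30.0, 2001)
ds = s[1] - s[0]
cf = np.exp(1j * s - 0.5 * (0.5 * s) ** 2)       # cf of N(1, 0.25)

b = np.linspace(-1.0, 3.0, 81)                   # evaluation points
f = (np.exp(-1j * np.outer(b, s)) @ cf).real * ds / (2 * np.pi)

f_true = np.exp(-0.5 * ((b - 1.0) / 0.5) ** 2) / (0.5 * np.sqrt(2 * np.pi))
print(np.abs(f - f_true).max())                  # close to zero
```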

9.3 Estimation

9.3.1 With Discrete Covariates

For estimation, we discuss the conceptually simpler case where \(\bX_i\) has finite support. In this setting, each support point occurs with positive probability, so conditioning on \(\bX_i=\bX\) amounts to restricting attention to the subsample of units with that covariate value.

Estimation proceeds in three steps:

  1. Estimation of \(\varphi_{\hat{\bbeta}_i|\bX_i}(\cdot|\bX)\).
  2. Estimation of \(\varphi_{\bu_i|\bX_i}(\cdot|\bX)\).
  3. Combining the estimators of the first two steps using Equation 9.8 and inverting the resulting estimated characteristic function.

The characteristic function of the individual estimators \(\hat{\bbeta}_i\) can be estimated with the empirical characteristic function on the sample of units with \(\bX_i=\bX\): \[ \hat{\varphi}_{\hat{\bbeta}_i|\bX_i}(\bs|\bX) = \dfrac{1 }{\sum_{i=1}^N \I\curl{\bX_i=\bX} }\sum_{i=1}^N \I\curl{\bX_i=\bX} \exp\left( i\bs'\hat{\bbeta}_i \right). \]
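A sketch of this conditional empirical characteristic function (the data below are toy values; in practice \(\hat{\bbeta}_i\) would come from unit-level OLS):

```python
import numpy as np

def cond_ecf(values, pattern_id, target, s):
    """Empirical cf of the rows of `values` (N x p) over the subsample
    of units whose discrete covariate pattern equals `target`."""
    mask = pattern_id == target
    return np.exp(1j * values[mask] @ s).mean()

# toy data: two covariate support points, scalar coefficient estimates
rng = np.random.default_rng(6)
N = 5_000
pattern_id = rng.integers(0, 2, size=N)           # which support point X_i is
beta_hat = rng.normal(pattern_id, 1.0)[:, None]   # N x 1 "estimates"

print(cond_ecf(beta_hat, pattern_id, 1, np.array([0.5])))
```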

As for the characteristic function of \(\bu_i\), it can be estimated from a sample version of Equation 9.7. We replace the characteristic function of the data with its empirical counterpart: \[ \hat{\varphi}_{\by_i|\bX_i}(\bs|\bX) = \dfrac{1 }{\sum_{i=1}^N \I\curl{\bX_i=\bX} }\sum_{i=1}^N \I\curl{\bX_i=\bX} \exp\left( i\bs'\by_i \right). \]
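The three steps can be put together in a deliberately simplified setting: a single scalar coefficient, no covariate conditioning, and Gaussian estimation noise whose characteristic function is treated as known. All of these are illustrative assumptions made to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 50_000
beta = rng.normal(1.0, 0.5, size=N)        # heterogeneous coefficients
noise = rng.normal(0.0, 0.3, size=N)       # estimation noise, cf known here
beta_hat = beta + noise                    # stand-in for unit-level estimates

# Steps 1-2: empirical cf of beta_hat; cf of the noise taken as known
s = np.linspace(-8.0, 8.0, 401)
cf_hat = np.array([np.exp(1j * si * beta_hat).mean() for si in s])
cf_noise = np.exp(-0.5 * (0.3 * s) ** 2)

# Step 3: deconvolve as in (9.8), then invert numerically
cf_beta = cf_hat / cf_noise
b = np.linspace(-0.5, 2.5, 61)
ds = s[1] - s[0]
f_beta = (np.exp(-1j * np.outer(b, s)) @ cf_beta).real * ds / (2 * np.pi)

f_true = np.exp(-0.5 * ((b - 1.0) / 0.5) ** 2) / (0.5 * np.sqrt(2 * np.pi))
print(np.abs(f_beta - f_true).max())       # small: density of beta recovered
```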

9.3.2 With Continuous Covariates

If \(\bX_i\) is continuously distributed, it is instead necessary to estimate the characteristic functions using techniques such as kernel regression. Evdokimov (2010) studies such conditional deconvolution estimators and their asymptotic properties. Inference may be conducted using the results of Kato, Sasaki, and Ura (2021).

Such nonparametric estimators may perform poorly due to the curse of dimensionality if \(T\) and \(p\) are not small. In the next block we will discuss some assumptions that can reduce the dimensionality of the problem and be applied in this context.


Next Section

In the next section, we briefly conclude the block and discuss some further results on heterogeneous linear models.