13 Nonparametric Identification of Average Effects
13.1 A Finite Difference Perspective on Heterogeneity Bias
The previous section highlighted how heterogeneity bias prevents identification of causal marginal effects using cross-sectional averages. We now show how panel data, together with smoothness and stationarity assumptions, can overcome this issue.
To motivate our identification strategy for average marginal effects, it is helpful to recall the definition of a derivative: the derivative of a function \(g\) is the limit of a sequence of finite differences, \[ g'(x) = \lim\limits_{h\to 0}\dfrac{g(x+h)-g(x)}{h}. \]
If we take \(g(x)=\E[Y_{it}|X_{it}=x]\), the above finite difference is given by \[ \begin{aligned} & \dfrac{\E[Y_{it}|X_{it} = x+h] - \E[Y_{it} |X_{it}=x] }{h} \\ & = \dfrac{\E[\phi(x+h, A_i, U_{it})|X_{it}=x+h] - \E[\phi(x, A_i, U_{it})|X_{it}=x] }{h}. \end{aligned} \tag{13.1}\] Notice that the conditioning sets on the two expectations are different. Consequently, the conditional distributions of \((A_i, U_{it})\) given \(X_{it}\) are different between the two expectations. In other words, we cannot hold the distribution of \((A_i, U_{it})\) fixed as we vary \(x\) in \(\E[Y_{it}|X_{it}]\). This violates the ceteris paribus logic required for interpreting \(\partial_x \E[Y_{it} | X_{it} = x]\) as a causal effect, as mentioned in the previous section.
The average marginal effect can itself be restated using a finite difference as \[ \begin{aligned} & \E[\partial_x\phi(x, A_i, U_{it})|\cdots] \\ & = \E\left[\lim_{h\to 0} \dfrac{ \phi(x+h, A_i, U_{it}) - \phi(x, A_i, U_{it}) }{h} \Bigg| \cdots \right]. \end{aligned} \tag{13.2}\] In this view, the failure of the naive approach of the previous section stems from the fact that we cannot interchange finite differences and expectations when using cross-sectional data, as in Equation 13.1.
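To see the failure concretely, consider a minimal simulation sketch. All functional forms here are hypothetical choices for illustration: with \(\phi(x, a, u) = a\,x + u\) and \(A_i\) correlated with \(X_{it}\), the finite-difference slope of \(\E[Y_{it}|X_{it}=x]\) does not recover the average marginal effect \(\E[A_i]\).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Hypothetical structural function: phi(x, a, u) = a * x + u,
# so the marginal effect of x for unit i is A_i and the AME is E[A] = 1.
a = 1 + rng.normal(size=n)        # heterogeneous slopes
x = a + rng.normal(size=n)        # X correlated with A: source of the bias
u = rng.normal(size=n)
y = a * x + u

# Finite-difference slope of E[Y | X = x] at x = 0, via narrow bins.
h, bw = 0.25, 0.05
m_plus = y[np.abs(x - h) < bw].mean()    # approximates E[Y | X = +h]
m_minus = y[np.abs(x + h) < bw].mean()   # approximates E[Y | X = -h]
print("slope of E[Y|X=x] at x=0:", (m_plus - m_minus) / (2 * h))  # ~0.5

# The conditional distribution of A shifts with x, so this slope is
# biased for the average marginal effect:
print("average marginal effect E[A]:", a.mean())                  # ~1.0
```

In this design the conditional distribution of \(A_i\) given \(X_{it}\) shifts with \(x\), exactly the violation of ceteris paribus described above.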
13.2 Identifying Average Marginal Effects for Stayers
13.2.1 Stationarity: Time as an Instrument
Panel data offers a natural way to address this failure of ceteris paribus reasoning. By comparing outcomes within the same unit, we can hold \(A_i\) constant, isolating variation over time.
At the same time, contrasting outcomes within the same unit necessarily means contrasting outcomes from different periods — and hence dealing with variation in \(U_{it}\).
As it turns out, to handle such comparisons it is sufficient to assume that \(U_{it}\) is conditionally stationary. Formally, we assume that the distribution of \(U_{i1}\) conditional on \((X_{i1}, X_{i2}, A_i)\) is equal to the conditional distribution of \(U_{i2}\) given the same variables. Under this assumption, the distribution of \(\phi(x, A_i, U_{it})\) given \((X_{i1}, X_{i2})\) no longer depends on time \(t\). Intuitively, one may think of time as being randomly assigned to observations, an interpretation called “time as an instrument” (Chernozhukov et al. 2013). The variation in \(U_{it}\) over time becomes non-systematic, and we can use the time dimension to isolate causal effects.
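In display form, the assumption states that the two conditional distributions coincide: \[ U_{i1} \mid (X_{i1}, X_{i2}, A_i) \overset{d}{=} U_{i2} \mid (X_{i1}, X_{i2}, A_i). \]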
13.2.2 Two Key Analysis Steps
We base our panel data identification argument on the following expectation involving \(\phi\): \[ \E\left[ \dfrac{\phi(x+h, A_i, U_{i2}) - \phi(x, A_i, U_{i1})}{h}\Bigg|\cdots \right]. \tag{13.3}\] Expectation (13.3) reflects the idea sketched out above:
- The same \(A_i\) appears in both \(\phi\) terms.
- The \(U_{it}\) terms change over time.
The latter point means that, in contrast to Equation 13.2, (13.3) does not involve a genuine finite difference. As a consequence, there are two key conceptual pieces in our analysis: an identification step, which maps (13.3) onto observable quantities, and a limit step, which turns (13.3) into a genuine average marginal effect as \(h\to 0\). We take these up in turn.
13.2.3 Identifying Finite-Difference Moments
We begin with the identification step, showing how the expression in (13.3) maps onto observable quantities. To identify the limit of expectation (13.3) as \(h\to 0\), it is sufficient to identify (13.3) for all \(h>0\) in some neighborhood of 0. To do so, consider the population of units with \(\curl{ X_{i1} = x, X_{i2} = x+h}\). By model (12.1), the outcomes of these units satisfy \[ \begin{aligned} Y_{i2} & = \phi(x+h, A_i, U_{i2}),\\ Y_{i1} & = \phi(x, A_i, U_{i1}). \end{aligned} \] Accordingly, \[ \begin{aligned} & \E\left[\dfrac{Y_{i2}-Y_{i1}}{h}\Bigg|X_{i1} = x, X_{i2} = x+h \right] \\ & = \E\left[\dfrac{\phi(x+h, A_i, U_{i2}) - \phi(x, A_i, U_{i1})}{h}\Bigg|X_{i1} = x, X_{i2} = x+h \right]. \end{aligned} \tag{13.4}\] We conclude that (13.3) is identified for all \(h>0\) small enough, provided that the joint density of \((X_{i1}, X_{i2})\) is positive in some neighborhood of \((x, x)\).
This shows that we can observe the difference in outcomes directly in panel data for near-stayers — units whose covariates change only slightly over time (\(X_{i1}=x, X_{i2}=x+h\) for small \(h>0\); see e.g. Sasaki and Ura (2021)). It also determines the population for which we are able to identify the parameter of interest. We return to this topic below.
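As an illustration of the near-stayer moment in (13.4), here is a minimal simulation sketch. The structural function and all distributional choices are hypothetical, and the crude binning estimator merely stands in for the local averaging discussed later.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000_000

# Hypothetical model: phi(x, a, u) = a * x + u with stationary U_it.
a = 1 + rng.normal(size=n)                    # unit heterogeneity A_i
x1, x2 = a + rng.normal(size=n), a + rng.normal(size=n)
u1, u2 = rng.normal(size=n), rng.normal(size=n)
y1, y2 = a * x1 + u1, a * x2 + u2

x, h, bw = 0.0, 0.3, 0.1

# Near-stayer moment: E[(Y2 - Y1)/h | X1 ~ x, X2 ~ x + h].
near = (np.abs(x1 - x) < bw) & (np.abs(x2 - (x + h)) < bw)
moment = ((y2[near] - y1[near]) / h).mean()

# Benchmark: the stayer AME E[d/dx phi(x, A, U) | X1 = X2 = x],
# which equals E[A | X1 = X2 = x] in this model.
stay = (np.abs(x1 - x) < bw) & (np.abs(x2 - x) < bw)
benchmark = a[stay].mean()

# As h -> 0 (with n large enough that near-stayers remain plentiful),
# the moment converges to the benchmark.
print(f"near-stayer moment (h={h}): {moment:.3f}")
print(f"stayer AME benchmark:       {benchmark:.3f}")
```

The remaining gap between the two printed numbers reflects the finite \(h\); shrinking \(h\) closes it, at the cost of requiring more near-stayer observations.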
13.2.4 Average Differences and Almost Marginal Effects
With identification in hand, we now analyze the behavior of this expression as \(h\to 0\) to ensure that it converges to a marginal effect. Establishing convergence requires some continuity assumptions on the relationship between \(X_{it}\), \(A_i\) and \(U_{it}\). To motivate their form, we write out (13.3) as an integral, similarly to how we proceeded in the previous section. To that end, define the following conditional densities:
- \(f_{A_i, U_{i1}, U_{i2}|X_{i1}, X_{i2}}(a, u_1, u_2|x_1, x_2)\), the conditional density of \((A_i, U_{i1}, U_{i2})\) given \(\curl{X_{i1} =x_1, X_{i2}=x_2}\);
- \(f_{U_{i1}, U_{i2}|X_{i1}, X_{i2}, A_i}(u_1, u_2|x_1, x_2, a)\), the conditional density of \((U_{i1}, U_{i2})\) given \(\curl{X_{i1} =x_1, X_{i2}=x_2, A_i=a}\);
- \(f_{A_i|X_{i1}, X_{i2}}(a|x_1, x_2)\), the conditional density of \(A_i\) given \(\curl{X_{i1}=x_1, X_{i2}=x_2}\);
- \(f_{U_{it}|X_{i1}, X_{i2}, A_i}(u|x_1, x_2, a)\), the conditional density of \(U_{it}\) given \(\curl{X_{i1} =x_1, X_{i2}=x_2, A_i=a}\);
- \(f_{A_i, U_{it}|X_{i1}, X_{i2}}(a, u|x_1, x_2)\), the conditional density of \((A_i, U_{it})\) given \(\curl{X_{i1} =x_1, X_{i2}=x_2}\).
Observe that the latter two densities do not depend on \(t\) by the conditional stationarity assumption. This property will be crucial for reducing (13.3) to an integral of a finite difference in \(\phi\).
Throughout, we assume that the above densities are taken with respect to some overall dominating measure \(\mu\) that does not depend on \((x_1, x_2)\).
With this notation, we can represent (13.3) as \[ \begin{aligned} & \E\left[\dfrac{\phi(x+h, A_i, U_{i2}) - \phi(x, A_i, U_{i1})}{h}\Bigg|X_{i1} = x, X_{i2} = x+h \right]\\ % & = \int \dfrac{\phi(x+h, a, u_2) - \phi(x, a, u_1)}{h} \\ & \hspace{2cm}\times f_{A_i, U_{i1}, U_{i2}|X_{i1}, X_{i2}}(a, u_1, u_2|x,x+h) \mu(da, du_1, du_2)\\ % & = \int \dfrac{\phi(x+h, a, u_2) - \phi(x, a, u_1)}{h} \\ & \hspace{2cm}\times f_{U_{i1}, U_{i2}|X_{i1}, X_{i2}, A_i}(u_1, u_2|x, x+h, a) \\ & \hspace{2cm}\times f_{A_i|X_{i1}, X_{i2}}(a|x, x+h) \mu(da, du_1, du_2), \end{aligned} \] where the second equality factorizes the joint density of \((A_i, U_{i1}, U_{i2})\) into the conditional density of \((U_{i1}, U_{i2})\) given \(A_i\) and the marginal density of \(A_i\). Now we make four key observations:
- We can split the integral into two integrals: one involving \(\phi(x+h, a, u_2)\) and one involving \(\phi(x, a, u_1)\). For brevity, we suppress the common \(1/h\) factor in the next two displays.
- \(\phi(x+h, a, u_2)\) does not depend on \(u_1\), and so \(u_{1}\) is just integrated out: \[ \begin{aligned} & \int \phi(x+h, a, u_2) f_{A_i|X_{i1}, X_{i2}}(a|x, x+h) \\ & \hspace{2cm}\times f_{U_{i1}, U_{i2}|X_{i1}, X_{i2}, A_i}(u_1, u_2|x, x+h, a) \mu(da, du_1, du_2) \\ & = \int \phi(x+h, a, u_2) f_{A_i|X_{i1}, X_{i2}}(a|x, x+h) \\ & \hspace{2cm}\times f_{U_{i2}|X_{i1}, X_{i2}, A_i}(u_2|x, x+h, a) \mu(da, du_2). \end{aligned} \]
- The conditional density of \(U_{i2}\) is equal to the time-invariant density \(f_{U_{it}|X_{i1}, X_{i2}, A_i}\): \[ \begin{aligned} & = \int \phi(x+h, a, u_2) f_{A_i|X_{i1}, X_{i2}}(a|x, x+h) \\ & \hspace{2cm}\times f_{U_{i2}|X_{i1}, X_{i2}, A_i}(u_2|x, x+h, a) \mu(da, du_2)\\ & = \int \phi(x+h, a, u) f_{A_i|X_{i1}, X_{i2}}(a|x, x+h) \\ & \hspace{2cm}\times f_{U_{it}|X_{i1}, X_{i2}, A_i}(u|x, x+h, a) \mu(da, du). \end{aligned} \]
- A symmetric argument applies to the integral with \(\phi(x, a, u_1)\).
We conclude that we can represent (13.3) as \[ \begin{aligned} & \E\left[\dfrac{\phi(x+h, A_i, U_{i2}) - \phi(x, A_i, U_{i1})}{h}\Bigg|X_{i1} = x, X_{i2} = x+h \right]\\ & = \int \dfrac{ \phi(x+h, a, u) - \phi(x, a, u) }{h}\, f_{A_i, U_{it}|X_{i1}, X_{i2}}(a, u|x, x+h) \mu(da, du). \end{aligned} \]
The achievement of this representation is that we have obtained a genuine finite difference in \(\phi\) — notice that only \(x\) changes under \(\phi\), while the same \(a\) and \(u\) appear in both terms. We can apply the mean value theorem to this finite difference as \[ \phi(x+h, a, u) = \phi(x, a, u) + h\partial_x\phi(\tilde{x}, a, u), \tag{13.5}\] where \(\tilde{x}\) is a point between \(x\) and \(x+h\) that may depend on \(a\), \(u\), and \(h\).
Substituting Equation 13.5 into the above representation for (13.3) we conclude that \[ \begin{aligned} & \E\left[\dfrac{\phi(x+h, A_i, U_{i2}) - \phi(x, A_i, U_{i1})}{h}\Bigg|X_{i1} = x, X_{i2} = x+h \right]\\ & = \int \partial_x\phi(\tilde{x}, a, u)f_{A_i, U_{it}|X_{i1}, X_{i2}}(a, u|x, x+h) \mu(da, du). \end{aligned} \tag{13.6}\]
13.2.5 Distributional Continuity in Unobservables
To convert this representation into a true marginal effect, we analyze the right-hand side of Equation 13.6. In expectation form, it reads \[ \E[\partial_x \phi(\tilde{x}, A_i, U_{it})|X_{i1}=x, X_{i2}=x+h]. \] This object is not quite an average marginal effect: the derivative \(\partial_x\phi\) is evaluated at a potentially random point \(\tilde{x}=\tilde{x}(A_i, U_{it})\).
To resolve this issue, we need \(\tilde{x}\) to collapse to \(x\). Since \(\tilde{x}\) lies between \(x\) and \(x+h\), this can be achieved by taking \(h\to 0\) in Equation 13.6, while preserving the expectation interpretation of the result.
To ensure that this limiting behavior is well-defined, we impose our second key assumption, which allows taking \(h\to 0\). We assume that the density \(f_{A_i, U_{it}|X_{i1}, X_{i2}}(a, u|x_1, x_2)\) is continuous in \((x_1, x_2)\). In words, we assume that the distribution of the unobserved components varies smoothly as the covariates vary.
With this assumption, and continuity of \(\partial_x\phi\) in its first argument, the integrand in Equation 13.6 converges as \(h\to 0\): \[ \begin{aligned} & \partial_x\phi(\tilde{x}, a, u)f_{A_i, U_{it}|X_{i1}, X_{i2}}(a, u|x, x+h) \\ & \to \partial_x\phi(x, a, u)f_{A_i, U_{it}|X_{i1}, X_{i2}}(a, u|x, x). \end{aligned} \] Under mild dominance conditions, this convergence also applies to the integrals: \[ \begin{aligned} & \int \partial_x\phi(\tilde{x}, a, u)f_{A_i, U_{it}|X_{i1}, X_{i2}}(a, u|x, x+h) \mu(da, du) \\ & \to \int \partial_x\phi(x, a, u)f_{A_i, U_{it}|X_{i1}, X_{i2}}(a, u|x, x) \mu(da, du) \\ & = \E\left[ \partial_x\phi(x, A_i, U_{it})|X_{i1}=X_{i2}=x \right], \end{aligned} \tag{13.7}\] finally yielding an average marginal effect.
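For concreteness, one sufficient dominance condition (a formulation we supply here, not one taken from the cited results) is an integrable envelope: for some \(\bar h > 0\) and some \(\mu\)-integrable function \(g\), \[ \sup_{\tilde{x} \in [x, x+\bar h]}\left|\partial_x\phi(\tilde{x}, a, u)\right| \times \sup_{h \in [0, \bar h]} f_{A_i, U_{it}|X_{i1}, X_{i2}}(a, u|x, x+h) \le g(a, u), \] in which case the dominated convergence theorem delivers Equation 13.7.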
13.2.6 Result Statement
Combining equations (13.4), (13.6), and (13.7), we obtain our identification result: \[ \begin{aligned} & \E\left[ \partial_x\phi(x, A_i, U_{it})|X_{i1}=X_{i2}=x \right] \\ & = \lim_{h\to 0} \E\left[ \dfrac{Y_{i2}-Y_{i1}}{h}\Bigg|X_{i1}=x, X_{i2} = x+h \right]. \end{aligned} \tag{13.8}\] This result, derived by Hoderlein and White (2012) and Chernozhukov et al. (2015), establishes nonparametric identification of average marginal effects for stayers at \(x\) — units with identical values of \(X_{i1}\) and \(X_{i2}\).
To obtain this result, we relied on two key assumptions:
- Conditional stationarity of \(U_{it}\): used to convert the expected change in outcomes into an expected finite difference in \(\phi\).
- Continuity of the conditional distribution of \((A_i, U_{it})\) in the conditioning variables: used to take the limit \(h\to 0\). Both assumptions are illustrated in the simple example below.
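To see both assumptions at work, consider a hypothetical random-coefficient special case \(\phi(x, a, u) = a\,x + u\). Then \[ \E\left[\dfrac{Y_{i2}-Y_{i1}}{h}\Bigg|X_{i1}=x, X_{i2}=x+h\right] = \E[A_i|X_{i1}=x, X_{i2}=x+h] + \dfrac{\E[U_{i2}-U_{i1}|X_{i1}=x, X_{i2}=x+h]}{h}. \] Conditional stationarity makes the second term vanish for every \(h\), and distributional continuity lets the first term converge to \(\E[A_i|X_{i1}=X_{i2}=x]\) as \(h\to 0\), which is exactly the left-hand side of Equation 13.8 since \(\partial_x\phi = A_i\) in this case.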
What makes Equation 13.8 especially powerful is that it is compatible with potentially infinite-dimensional unobserved heterogeneity. We can nonparametrically identify meaningful causal parameters without knowing the dimension or the functional form of the unobserved components \((A_i, U_{it})\).
This is a remarkably flexible framework. For example,
- In a consumption context, individuals may have arbitrarily different preferences.
- In a production setting, firms may differ in their entire technology sets.
As long as there is smoothness in the distribution of unobservables and comparability over time (i.e., conditional stationarity), the average marginal effect for stayers is identified.
13.3 Interpretation and Stayers
Before proceeding, we briefly discuss the population for which Equation 13.8 identifies marginal effects. Stayers represent an important subpopulation in nonparametric panel data analysis, dating back to the seminal work of Chamberlain (1982).
Stayers matter for two main reasons, one positive, one negative:
Empirical relevance: In many microdata settings, stayers and near-stayers comprise a substantial share or an outright majority of the population. See, for example, the evidence and discussion in Sasaki and Wang (2022).
Theoretical necessity: Stayers are typically the only population for which identification is possible without meaningful restrictions on the dependence between \(X_{it}\) and \((A_i, U_{it})\). As shown by the constructive counterexamples in Cooprider, Hoderlein, and Meister (2022), identification generally fails for all distributional features of marginal effects for non-stayers in models (11.4) and (11.5).
Next Section
In the next section, we refine Equation 13.8 into a limit-free expression and discuss practical estimation of the average marginal effect.