14 Nonparametric Estimation of Average Effects
14.1 An Estimable Form of the Average Marginal Effect
In the previous section, we identified the average marginal effect for stayers in model (11.4) as a limit of the average change in outcomes across periods (Equation 13.8): \[ \begin{aligned} & \E\left[ \partial_x\phi(x, a, u)|X_{i1}=X_{i2}=x \right] \\ & = \lim_{h\to 0} \E\left[ \dfrac{Y_{i2}-Y_{i1}}{h}\Bigg|X_{i1}=x, X_{i2} = x+h \right]. \end{aligned} \] This representation is convenient for identification and makes explicit that the source of identification is the within variation of near-stayers.
However, the limit-based representation poses challenges for estimation, which typically benefits from an explicit, non-limit representation. To address this, we define the following function: \[ g(x_1, x_2) = \E\left[Y_{i2}- Y_{i1}|X_{i1}=x_1, X_{i2} = x_2 \right] . \] This function captures how the average outcome difference varies with the covariates in the two periods. It plays a central role in constructing our estimator.
Observe that \(g(x, x) = 0\), since \[ \begin{aligned} g(x, x) & = \E\left[Y_{i2}- Y_{i1}|X_{i1}=X_{i2} = x \right] \\ & = \E\left[\phi(x, A_i, U_{i2})|X_{i1}=X_{i2} = x \right] - \E\left[\phi(x, A_i, U_{i1})|X_{i1}=X_{i2} = x \right]\\ & = 0, \end{aligned} \] where the last equality uses the stationarity of \(U_{it}\) (and hence of \(\phi(x, A_i, U_{it})\)) conditional on \((X_{i1}, X_{i2})\): the two conditional expectations coincide.
Thus, by the definition of the partial derivative, as \(h \to 0\), \[ \begin{aligned} & \E\left[\dfrac{Y_{i2}-Y_{i1}}{h}\Bigg|X_{i1}=x, X_{i2}= x+h \right] \\ & = \dfrac{g(x, x+h)}{h} = \dfrac{g(x, x+h) - g(x, x)}{h} \\ & \to \partial_{x_2} g(x, x), \end{aligned} \tag{14.1}\] where \(\partial_{x_2}\) denotes the partial derivative with respect to the second argument.
By combining Equation 14.1 with Equation 13.8, we obtain the following representation for the average marginal effects of stayers: \[ \begin{aligned} & \E[\partial_x\phi (x, A_i, U_{it})|X_{i1}=X_{i2}=x] \\ & = \partial_{x_2} \E\left[Y_{i2}- Y_{i1}|X_{i1}=x_1, X_{i2} = x_2 \right]\Big|_{(x_1, x_2)=(x, x)}. \end{aligned} \tag{14.2}\]
The practical importance of Equation 14.2 is that it reduces the problem of estimating average marginal effects to the problem of estimating a derivative of a conditional expectation function, which is a standard nonparametric regression problem.
14.2 A Primer on Multivariate Local Polynomial Estimation
14.2.1 Idea
In principle, many nonparametric regression methods can estimate derivatives. Of these, local polynomial (LP) regression provides a particularly convenient approach. LP estimators are available in closed form, can estimate derivatives directly, and have favorable asymptotic properties.
We now present the essential idea using a local quadratic estimator of a bivariate regression function; the argument extends straightforwardly to polynomials of any order and any number of variables. See Section 2.5.2 in Li and Racine (2007) and Fan and Gijbels (1996) for standard references.
To motivate the approach, let \(g(\cdot, \cdot)\) be a smooth function of two variables, and suppose that we observe data \((Y_i, X_{i1}, X_{i2})\) such that \[ \E[Y_i|X_{i1}, X_{i2}] = g(X_{i1}, X_{i2}), \quad i=1, \dots, N. \] Let \((x_1, x_2)\) be some fixed point.
The bivariate Taylor’s theorem tells us that, if \((X_{i1}, X_{i2})\) is “close” to \((x_1, x_2)\), then (expanding up to second order) \[ \begin{aligned} & g(X_{i1}, X_{i2}) \\ & \approx g(x_1, x_2) \\ & \quad + \partial_{x_1} g(x_1, x_2) (X_{i1}-x_1) + \partial_{x_2} g(x_1, x_2)(X_{i2} - x_2) \\ & \quad + \dfrac{1}{2} \partial_{x_1}^2 g(x_1, x_2)(X_{i1}-x_1)^2 + \dfrac{1}{2} \partial_{x_2}^2 g(x_1, x_2)(X_{i2}-x_2)^2 \\ & \quad + \partial_{x_1}\partial_{x_2} g(x_1, x_2) (X_{i1}-x_1)(X_{i2}-x_2) \\ & = \bZ_i(x_1, x_2)'\bbeta(x_1, x_2), \end{aligned} \] where \[ \begin{aligned} \bZ_i(x_1, x_2) & = \begin{pmatrix} 1\\ X_{i1}- x_1\\ X_{i2} - x_2\\ (X_{i1}-x_1)^2/2\\ (X_{i1}- x_1)(X_{i2}-x_2) \\ (X_{i2} - x_2)^2/2 \end{pmatrix}, \\ \bbeta(x_1, x_2) & = \begin{pmatrix} g(x_1, x_2)\\ \partial_{x_1} g(x_1, x_2)\\ \partial_{x_2} g(x_1, x_2)\\ \partial^2_{x_1} g(x_1, x_2)\\ \partial_{x_1}\partial_{x_2} g(x_1, x_2)\\ \partial^2_{x_2} g(x_1, x_2) \end{pmatrix}. \end{aligned} \] This expansion suggests that we can estimate the function value and its leading derivatives, collected in \(\bbeta(x_1, x_2)\), by regressing \(Y_i\) on \(\bZ_i(x_1, x_2)\) using weighted least squares, with observations closer to the target point \((x_1, x_2)\) receiving higher weight.
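To make the mapping from the expansion to computation concrete, here is a minimal Python sketch that builds the basis vectors \(\bZ_i(x_1, x_2)\) for all observations at once (the function name `quad_basis` is ours, not part of any library):

```python
import numpy as np

def quad_basis(x1, x2, t1, t2):
    """Local quadratic basis Z_i(t1, t2) from the Taylor expansion above.

    x1, x2 are arrays of observations X_{i1}, X_{i2}; (t1, t2) is the target point.
    Returns an (N, 6) matrix whose i-th row is
    (1, X_{i1}-t1, X_{i2}-t2, (X_{i1}-t1)^2/2, (X_{i1}-t1)(X_{i2}-t2), (X_{i2}-t2)^2/2).
    """
    d1 = np.asarray(x1, dtype=float) - t1
    d2 = np.asarray(x2, dtype=float) - t2
    return np.column_stack([
        np.ones_like(d1),
        d1,
        d2,
        0.5 * d1**2,
        d1 * d2,
        0.5 * d2**2,
    ])
```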
14.2.2 Estimator
We formalize the idea of closeness using a bivariate kernel function \(K(\cdot, \cdot)\) that measures the proximity of data points to the target point \((x_1, x_2)\). In principle, we can take any function \(K\) that satisfies the following properties: \[ \begin{aligned} K(u_1, u_2) & \geq 0, \\ \iint K(u_1, u_2)\,du_1 du_2 & =1, \\ \iint u_j K(u_1, u_2)\,du_1 du_2 & =0, \quad j=1, 2. \end{aligned} \] A common approach is to take \(K\) to be a product of univariate density functions \(K_1\): \[ K(x_1, x_2) = K_1(x_1) K_1(x_2), \] where \(K_1\) may be, for example, the probability density function of the standard Gaussian distribution.
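As an illustration, a minimal sketch of such a product kernel in Python, taking \(K_1\) to be the standard Gaussian density (the name `product_gaussian_kernel` is ours):

```python
import numpy as np

def product_gaussian_kernel(u1, u2):
    """Product kernel K(u1, u2) = K1(u1) * K1(u2) with K1 the standard normal pdf.

    Nonnegative, integrates to one, and has zero first moments, so it satisfies
    the conditions listed above.
    """
    k1 = np.exp(-0.5 * np.asarray(u1)**2) / np.sqrt(2 * np.pi)
    k2 = np.exp(-0.5 * np.asarray(u2)**2) / np.sqrt(2 * np.pi)
    return k1 * k2
```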
To estimate the derivative coefficients in \(\bbeta(x_1, x_2)\), we perform a weighted least squares regression of \(Y_i\) on the “covariates”/basis functions \(\bZ_i(x_1, x_2)\), using kernel weights that reflect proximity to the target point. The resulting estimator is: \[ \begin{aligned} \hat{\bbeta}(x_1, x_2) & = \left( \sum_{i=1}^N K\left(\dfrac{X_{i1}-x_1}{s}, \dfrac{X_{i2}-x_2}{s} \right) \bZ_i(x_1, x_2)\bZ_i(x_1, x_2)'\right)^{-1}\\ & \hspace{0.9cm} \times \sum_{i=1}^N K\left(\dfrac{X_{i1}-x_1}{s}, \dfrac{X_{i2}-x_2}{s} \right) \bZ_i(x_1, x_2) Y_i, \end{aligned} \tag{14.3}\] where \(s>0\) is the smoothing parameter (bandwidth). As usual with kernel estimators, larger values of \(s\) correspond to stronger smoothing — the estimator considers points in a larger neighborhood of \((x_1, x_2)\).
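The following sketch implements Equation 14.3 for the local quadratic case, reusing `quad_basis` and `product_gaussian_kernel` from the sketches above. It is a minimal illustration under our own naming choices, with no numerical safeguards or data-driven bandwidth selection:

```python
import numpy as np

def local_quadratic(y, x1, x2, target, bandwidth):
    """Weighted least squares estimator of Equation 14.3 at target = (t1, t2).

    Returns the 6-vector beta_hat(t1, t2), ordered as
    (g, dg/dx1, dg/dx2, d^2g/dx1^2, d^2g/dx1dx2, d^2g/dx2^2).
    Assumes quad_basis and product_gaussian_kernel from the earlier sketches.
    """
    t1, t2 = target
    Z = quad_basis(x1, x2, t1, t2)                      # (N, 6) basis matrix
    w = product_gaussian_kernel((np.asarray(x1) - t1) / bandwidth,
                                (np.asarray(x2) - t2) / bandwidth)
    ZtW = Z.T * w                                       # weight each observation i by K(.)
    return np.linalg.solve(ZtW @ Z, ZtW @ np.asarray(y, dtype=float))  # (Z'WZ)^{-1} Z'Wy
```

Note that any multiplicative normalizing constant in the kernel cancels in the weighted least squares formula, so only the relative weights matter.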
One may generalize the approach to consider local polynomials of general order \(p\). Taking higher \(p\) permits estimation of higher-order derivatives.
14.3 Estimating Average Marginal Effects
14.3.1 Estimator
Equipped with the local polynomial estimator, we can now estimate the average marginal effect of stayers. To do so, we use \(Y_{i2}-Y_{i1}\) as the dependent variable in Equation 14.3 and select the target point as \((x_1, x_2)=(x, x)\). The third coordinate of \(\hat{\bbeta}(x, x)\) is an estimator for the partial derivative of \(\E[Y_{i2}-Y_{i1}|X_{i1}=x_1, X_{i2}=x_2]\) with respect to \(x_2\), evaluated at \((x, x)\) — precisely the average marginal effect of interest by Equation 14.2: \[ \widehat{ \E}\left[ \partial_x\phi(x, a, u)|X_{i1}=X_{i2}=x \right] = \hat{\bbeta}_3(x, x). \]
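Continuing the sketch above, the average marginal effect for stayers at a point \(x\) can be estimated by feeding the outcome differences into `local_quadratic` and reading off the third coordinate (index 2 in zero-based Python). The data-generating process below is purely illustrative and is not taken from the text; the bandwidth value is likewise arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5_000

# Illustrative random-coefficient model Y_it = A_i * X_it + U_it, for which the
# marginal effect for stayers at x equals E[A_i | X_{i1} = X_{i2} = x].
A = rng.normal(1.0, 0.5, N)
X1 = rng.normal(0.0, 1.0, N) + 0.5 * A      # covariates may depend on A_i
X2 = rng.normal(0.0, 1.0, N) + 0.5 * A
Y1 = A * X1 + rng.normal(0.0, 0.1, N)       # U_it drawn i.i.d. across t (stationary)
Y2 = A * X2 + rng.normal(0.0, 0.1, N)

x = 0.0                                     # evaluation point for stayers
beta_hat = local_quadratic(Y2 - Y1, X1, X2, target=(x, x), bandwidth=0.5)
ame_hat = beta_hat[2]                       # third coordinate: d g / d x_2 at (x, x)
print(f"Estimated average marginal effect for stayers at x={x}: {ame_hat:.3f}")
```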
14.3.2 Asymptotic Properties
Our estimator inherits all the desirable properties of local polynomial estimators, including consistency and asymptotic normality (see theorem 2.10 in Li and Racine 2007; Masry 1996a, 1996b). In particular, by suitably undersmoothing the local polynomial estimator (taking \(s\) smaller than the MSE-optimal value), one can conduct inference on the target average effect of interest.
14.4 Another Estimator
Before we move on to generalizations of the identification result, it is worth discussing Equation 14.2 a bit further. The preceding identification and estimation arguments were based on the one-sided finite difference \(\phi(x+h, a, u)-\phi(x, a, u)\). While the one-sided difference yields a valid estimator, symmetric (central) differences often provide more accurate approximations to derivatives. We now present a refinement of our identification result based on this idea.
By suitably adjusting the above argument, one can show that \[ \begin{aligned} & \E[\partial_x \phi(x, A_i, U_{it})|X_{i1}=X_{i2}=x] \\ & = \dfrac{1}{2} \dfrac{d}{dh}\Big( \E\left[Y_{i2}- Y_{i1}|X_{i1}=x-h, X_{i2} = x+h \right]\Big)\Big|_{h=0}\\ & = \dfrac{1}{2}\Bigg( \partial_{x_2} \E[Y_{i2}-Y_{i1}|X_{i1}=x_1, X_{i2}=x_2] \\ & \hspace{2cm} -\partial_{x_1} \E[Y_{i2}-Y_{i1}|X_{i1}=x_1, X_{i2}=x_2] \Bigg)\Bigg|_{(x_1, x_2)=(x, x)}. \end{aligned} \] A symmetric estimator is then based on half the difference between the estimated partial derivatives with respect to \(x_2\) and \(x_1\), i.e., the third and second elements of \(\hat{\bbeta}(x, x)\).
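In code, continuing the illustrative example above, the symmetric version combines the second and third coordinates of the same local quadratic fit:

```python
# Symmetric estimator: half the difference between the estimated derivatives
# with respect to x_2 and x_1 (third and second coordinates of beta_hat).
ame_hat_symmetric = 0.5 * (beta_hat[2] - beta_hat[1])
```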
Next Section
In the next section, we somewhat relax the stationarity assumption on the model by allowing certain changes in the structural function.