Inference II: Nonlinear Hypotheses

Handling Nonlinearities with the Delta Method

Vladislav Morozov

Introduction

Lecture Info

Learning Outcomes

This lecture is about extending our distributional and inference results to nonlinear functions of parameters


By the end, you should be able to

  • Derive the asymptotic distribution of nonlinear transformations of parameters using the delta method
  • Construct confidence intervals and hypothesis tests for potentially nonlinear hypotheses
  • Discuss the connection between these results and those for linear hypotheses

References


  • Corresponding section on Wikipedia (up to the “Example” section)
  • Or sections 6.5 and 7.10 in Hansen (2022)

Reminder on the Empirical Example

Reminder: Empirical Model

Studying the link between wages and (education, experience): \[ \begin{aligned} \ln(\text{wage}_i) & = \beta_1 + \beta_2 \times \text{education} \\ & \quad + \beta_3 \times \text{experience} + \beta_4 \times \dfrac{\text{experience}^2}{100} + U_i \end{aligned} \tag{1}\]

Reminder: Estimation Results
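
The variables endog and exog used below were constructed earlier in the course. A minimal sketch of that step, assuming the wage data sits in a hypothetical DataFrame df with columns matching the output table:

from statsmodels.api import OLS, add_constant

# Sketch only: df is a hypothetical DataFrame holding the wage data
endog = df["log_wage"]
exog = add_constant(
    df[["education", "experience"]]
    .assign(experience_sq_div=lambda d: d["experience"] ** 2 / 100)
)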

results = OLS(endog, exog).fit(cov_type='HC0') # Robust covariance matrix estimator
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               log_wage   R-squared:                       0.226
Model:                            OLS   Adj. R-squared:                  0.226
Method:                 Least Squares   F-statistic:                     862.5
Date:                Tue, 20 May 2025   Prob (F-statistic):               0.00
Time:                        21:48:33   Log-Likelihood:                -8152.9
No. Observations:               10402   AIC:                         1.631e+04
Df Residuals:                   10398   BIC:                         1.634e+04
Df Model:                           3                                         
Covariance Type:                  HC0                                         
=====================================================================================
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                 0.9799      0.040     24.675      0.000       0.902       1.058
education             0.1114      0.002     50.185      0.000       0.107       0.116
experience            0.0229      0.002     12.257      0.000       0.019       0.027
experience_sq_div    -0.0347      0.004     -8.965      0.000      -0.042      -0.027
==============================================================================
Omnibus:                     4380.404   Durbin-Watson:                   1.833
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           134722.859
Skew:                          -1.401   Prob(JB):                         0.00
Kurtosis:                      20.406   Cond. No.                         219.
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC0)

Parameter of Interest: Nonlinear Transformation

We still have one parameter of interest to look at: \[ \theta = -50\frac{\beta_3}{\beta_4} \]

  • Interpretation: experience level that maximizes expected log wage
  • This \(\theta\) is a smooth nonlinear transformation of \(\bbeta\)

How to do inference on such \(\theta\)?

The Delta Method

Scalar Case

Mean Value Theorem

Recall the following useful result:

Proposition 1 Let \(f(\cdot): \R\to\R\) be continuous on \([x, y]\) and differentiable on \((x, y)\). Then there exists some \(c\in[x, y]\) such that \[ f(y)-f(x) = f'(c)(y-x) \]

Rearranged: the “mean value expansion around \(x\)”: \[ f(y) = f(x) + f'(c)(y-x) \]

Manual Illustration of the Argument

If \(X_1, \dots, X_N\) are IID with mean \(\theta\) and variance \(\sigma^2\), then \(\sqrt{N}(\bar{X}-\theta)\xrightarrow{d} N(0, \sigma^2)\)

What is the asymptotic distribution of \((\bar{X})^2\)?

Mean value theorem (Proposition 1): \[\small (\bar{X})^2 = \theta^2 + 2(\theta+\alpha_N[\bar{X}-\theta])(\bar{X}-\theta), \quad \alpha_N\in[0, 1] \tag{2}\]

Since \(\bar{X}\xrightarrow{p}\theta\), the derivative term in Equation 2 satisfies \(2(\theta+\alpha_N[\bar{X}-\theta])\xrightarrow{p} 2\theta\). By Slutsky’s theorem, if \(\theta\neq 0\), \[\small \sqrt{N}(\bar{X}^2 - \theta^2) \xrightarrow{d} N( 0, (2\theta)^2 \sigma^2 ) \]

More Abstract Form of (2)

Can write Equation 2 as \[ \sqrt{N}(f(Y_N)-f(\theta)) = f'(\theta + \alpha_N[Y_N-\theta] ) \sqrt{N}(Y_N-\theta) \] for

  • \(f(y) = y^2\)
  • \(Y_N = \bar{X}\)

Abstracting the Argument


Can replicate the argument if

  • \(Y_N\xrightarrow{p} \theta\) and \(f'(\cdot)\) is continuous with \(f'(\theta)\neq 0\)
  • \(\sqrt{N}(Y_N-\theta)\) converges to a normal distribution

Delta Method in the Univariate Case

Combining the previous arguments gives:

Proposition 2 Let \(\sqrt{N}(Y_N-\theta)\xrightarrow{d} N(0, \sigma^2)\) and let \(f(\cdot)\) be continuously differentiable with \(f'(\theta)\neq 0\). Then

\[ \sqrt{N}(f(Y_N) - f(\theta)) \xrightarrow{d} N(0, [f'(\theta)]^2\sigma^2) \]

More properly called the first-order delta method — there are higher-order versions if \(f'(\theta)=0\)
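
A quick Monte Carlo check of Proposition 2 for \(f(y)=y^2\) (a sketch; the sample size, replication count, and seed are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
theta, sigma, N, R = 2.0, 1.0, 1_000, 2_000

# R replications of sqrt(N) * (f(Xbar) - f(theta)) for f(y) = y^2
xbars = rng.normal(theta, sigma, size=(R, N)).mean(axis=1)
stat = np.sqrt(N) * (xbars**2 - theta**2)

# Proposition 2 predicts a N(0, (2*theta*sigma)^2) limit
print(stat.std(), 2 * theta * sigma)  # both should be close to 4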

Multivariate Case

Motivation

Proposition 2 has two limitations:

  • \(Y_N\) is scalar, but we deal with vectors like \(\hat{\bbeta}\)
  • \(f(\cdot)\) is scalar-valued — but we may have multiple transformations of \(\bbeta\) at the same time


Can solve both! Let \(\ba(\cdot):\R^p\to\R^k\) be the transformation of interest

Jacobian of \(\ba(\cdot)\)

Let \(\ba(\cdot) = (a_1(\cdot), \dots, a_k(\cdot))'\). Define its Jacobian matrix \(\bA(\bbeta)\) as \[ \bA(\bbeta) = \begin{pmatrix} \frac{\partial a_1}{\partial \beta_1}(\bbeta) & \cdots & \frac{\partial a_1}{\partial \beta_p}(\bbeta)\\ \vdots & \ddots & \vdots\\ \frac{\partial a_k}{\partial \beta_1}(\bbeta) & \cdots & \frac{\partial a_k}{\partial \beta_p}(\bbeta) \end{pmatrix} \] Rows correspond to components of \(\ba(\cdot)\); columns — to components of \(\bbeta\)
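
When an analytical Jacobian is tedious, it can also be approximated numerically (the statsmodels helper shown later does exactly this if no derivative is supplied). A minimal forward-difference sketch of the definition above:

import numpy as np

def jacobian(a, beta, eps=1e-6):
    """Forward-difference approximation of the k x p Jacobian of a(.)."""
    beta = np.asarray(beta, dtype=float)
    base = np.atleast_1d(a(beta))
    J = np.zeros((base.size, beta.size))
    for j in range(beta.size):
        step = np.zeros_like(beta)
        step[j] = eps
        J[:, j] = (np.atleast_1d(a(beta + step)) - base) / eps
    return J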

Delta Method in the Multivariate Case

Proposition 3 Let \(\sqrt{N}(\bY_N-\btheta)\xrightarrow{d} N(0, \bSigma)\). Let \(\ba(\cdot)\) be continuously differentiable. Let \(\bA(\btheta)\) have rank \(k\). Then

\[ \sqrt{N}\left(\ba(\bY_N) - \ba(\btheta)\right) \xrightarrow{d} N(0, \bA(\btheta)\bSigma\bA(\btheta)') \]

  • Proof (not examinable) is similar to the univariate case
  • OLS: take \(\bY_N=\hat{\bbeta}\) and \(\btheta=\bbeta\)

Vector Example I: Norm of \(\bbeta\)

Suppose that our parameter of interest is \(\ba(\bbeta) = \norm{\bbeta}\). Then \[ \bA(\bbeta) = \begin{pmatrix} \dfrac{\beta_1}{\norm{\bbeta}} & \cdots & \dfrac{\beta_p}{\norm{\bbeta}} \end{pmatrix} \]

If \(\bbeta\neq 0\), the delta method (Proposition 3) tells us that \[ \sqrt{N}\left(\norm{\hat{\bbeta}}-\norm{\bbeta}\right)\xrightarrow{d} N(0, \bA(\bbeta)\avar(\hat{\bbeta})\bA(\bbeta)') \]
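
A sketch of this computation for the fitted wage regression (purely illustrative, since the norm of \(\hat{\bbeta}\) has no economic meaning here):

import numpy as np

b = np.asarray(results.params)
V = np.asarray(results.cov_params())  # estimates avar(beta-hat)/N

A = b / np.linalg.norm(b)             # Jacobian row from above (as a 1-D array)
se_norm = np.sqrt(A @ V @ A)          # delta-method standard error of the norm
print(np.linalg.norm(b), se_norm)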

Vector Example II: Linear Transformations

Another example: \(\ba(\bbeta) = \bR\bbeta\). Then \(\bA(\bbeta)=\bR\) and

\[ \sqrt{N}(\bR\hat{\bbeta}-\bR\bbeta)\xrightarrow{d} N(0, \bR\avar(\hat{\bbeta})\bR') \]


In words, the delta method implies our results for linear transformations from before

Generalizations

Delta method — extremely general tool!


Some generalizations:

  • The limit does not have to be normal
  • Speed of convergence does not have to be \(\sqrt{N}\)
  • \(f(\cdot)\) can have functions as inputs and outputs

Inference on Nonlinear Transformations

Confidence Intervals

Overall Idea

Can use the delta method (Proposition 3) for inference!


Intuitively, it says that \[ \ba(\hat{\bbeta}) \overset{a}{\sim} N\left(\ba(\bbeta), \dfrac{1}{N}\bA(\bbeta)\avar(\hat{\bbeta})\bA(\bbeta)' \right) \]

Can construct tests and confidence intervals the same way as before; we just need to compute the Jacobian \(\bA\)

Estimating the Asymptotic Variance

For construction, need to estimate \[ \avar(\ba(\hat{\bbeta})) = \bA(\bbeta)\avar(\hat{\bbeta})\bA(\bbeta)' \]

To consistently estimate it:

  • Estimate \(\avar(\hat{\bbeta})\) as before with the HC0 (or another robust) estimator \(\widehat{\avar}(\hat{\bbeta})\)
  • For \(\bA(\bbeta)\), just use \(\bA(\hat{\bbeta})\)

Example: Confidence Interval for Ratio

Suppose that \(\bbeta=(\beta_1, \beta_2)\), \(\beta_2\neq 0\), and \(a(\bbeta) = \beta_1/\beta_2\)


As in the previous lecture, the following is a \((1-\alpha)\times 100\%\) asymptotic confidence interval: \[ \small S = \left[ \dfrac{\hat{\beta}_1}{\hat{\beta}_2} - z_{1-\alpha/2} \sqrt{\dfrac{\widehat{\avar}(\hat{\beta}_1/\hat{\beta}_2)}{N} } , \dfrac{\hat{\beta}_1}{\hat{\beta}_2} + z_{1-\alpha/2} \sqrt{\dfrac{\widehat{\avar}(\hat{\beta}_1/\hat{\beta}_2)}{N} } \right] \]

Example: Estimating \(\widehat{\avar}(\hat{\beta}_1/\hat{\beta}_2)\)

The Jacobian of our \(a(\cdot)\) is \[ \bA(\bbeta) = \begin{pmatrix} 1/\beta_2 & -\beta_1/\beta_2^2 \end{pmatrix} \] \(\bA(\bbeta)\) is defined and has maximal rank if \(\beta_2\neq 0\)

So

\[ \widehat{\avar}(\hat{\beta}_1/\hat{\beta}_2) = \begin{pmatrix} 1/\hat{\beta}_2 & -\hat{\beta}_1/\hat{\beta}_2^2 \end{pmatrix}\widehat{\avar}(\hat{\bbeta}) \begin{pmatrix} 1/\hat{\beta}_2 \\ -\hat{\beta}_1/\hat{\beta}_2^2 \end{pmatrix} \]

Application to Empirical Parameter

Our empirical parameter of interest was \[ \ba(\bbeta) = -50\beta_3/\beta_4 \]

Here Jacobian is

\[ \bA(\bbeta) = \begin{pmatrix} 0 & 0 & -50/\beta_4 & 50\beta_3/\beta_4^2 \end{pmatrix} \]
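
As a sketch, these formulas can be evaluated directly from the fitted results object (the point estimate and standard error should match the statsmodels output below):

import numpy as np

b = results.params
V = np.asarray(results.cov_params())  # estimates avar(beta-hat)/N

# Point estimate of theta = -50*beta_3/beta_4
theta_hat = -50 * b["experience"] / b["experience_sq_div"]

# Jacobian row of a(.): zeros for const and education
A = np.array([
    0.0,
    0.0,
    -50 / b["experience_sq_div"],
    50 * b["experience"] / b["experience_sq_div"] ** 2,
])

se = np.sqrt(A @ V @ A)  # the 1/N factor is already inside V
print(theta_hat, se)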

The Delta Method in statsmodels

Can use the delta method in statsmodels using the NonlinearDeltaCov class — compatible with many different models and estimators, not just OLS.

To define an instance, need

  • Function \(\ba(\cdot)\)
  • \(\hat{\bbeta}\) and \(N^{-1}\widehat{\avar}(\hat{\bbeta})\) (the estimated covariance matrix of \(\hat{\bbeta}\))
  • (Optionally): function for \(\bA(\cdot)\)

Documentation of NonlinearDeltaCov

The functionality is not documented online, but the code has informative docstrings:

from statsmodels.stats._delta_method import NonlinearDeltaCov
help(NonlinearDeltaCov)
Help on class NonlinearDeltaCov in module statsmodels.stats._delta_method:

class NonlinearDeltaCov(builtins.object)
 |  NonlinearDeltaCov(func, params, cov_params, deriv=None, func_args=None)
 |
 |  Asymptotic covariance by Deltamethod
 |
 |  The function is designed for 2d array, with rows equal to
 |  the number of equations or constraints and columns equal to the number
 |  of parameters. 1d params work by chance ?
 |
 |  fun: R^{m*k) -> R^{m}  where m is number of equations and k is
 |  the number of parameters.
 |
 |  equations follow Greene
 |
 |  This class does not use any caching. The intended usage is as a helper
 |  function. Extra methods have been added for convenience but might move
 |  to calling functions.
 |
 |  The naming in this class uses params for the original random variable, and
 |  cov_params for it's covariance matrix. However, this class is independent
 |  of the use cases in support of the models.
 |
 |  Parameters
 |  ----------
 |  func : callable, f(params)
 |      Nonlinear function of the estimation parameters. The return of
 |      the function can be vector valued, i.e. a 1-D array.
 |  params : ndarray
 |      Parameters at which function `func` is evaluated.
 |  cov_params : ndarray
 |      Covariance matrix of the parameters `params`.
 |  deriv : function or None
 |      First derivative or Jacobian of func. If deriv is None, then a
 |      numerical derivative will be used. If func returns a 1-D array,
 |      then the `deriv` should have rows corresponding to the elements
 |      of the return of func.
 |  func_args : None
 |      Not yet implemented.
 |
 |  Methods defined here:
 |
 |  __init__(self, func, params, cov_params, deriv=None, func_args=None)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |
 |  conf_int(self, alpha=0.05, use_t=False, df=None, var_extra=None, predicted=None, se=None)
 |      Confidence interval for predicted based on delta method.
 |
 |      Parameters
 |      ----------
 |      alpha : float, optional
 |          The significance level for the confidence interval.
 |          ie., The default `alpha` = .05 returns a 95% confidence interval.
 |      use_t : boolean
 |          If use_t is False (default), then the normal distribution is used
 |          for the confidence interval, otherwise the t distribution with
 |          `df` degrees of freedom is used.
 |      df : int or float
 |          degrees of freedom for t distribution. Only used and required if
 |          use_t is True.
 |      var_extra : None or array_like float
 |          Additional variance that is added to the variance based on the
 |          delta method. This can be used to obtain confidence intervalls for
 |          new observations (prediction interval).
 |      predicted : ndarray (float)
 |          Predicted value, can be used to avoid repeated calculations if it
 |          is already available.
 |      se : ndarray (float)
 |          Standard error, can be used to avoid repeated calculations if it
 |          is already available.
 |
 |      Returns
 |      -------
 |      conf_int : array
 |          Each row contains [lower, upper] limits of the confidence interval
 |          for the corresponding parameter. The first column contains all
 |          lower, the second column contains all upper limits.
 |
 |  cov(self)
 |      Covariance matrix of the transformed random variable.
 |
 |  grad(self, params=None, **kwds)
 |      First derivative, jacobian of func evaluated at params.
 |
 |      Parameters
 |      ----------
 |      params : None or ndarray
 |          Values at which gradient is evaluated. If params is None, then
 |          the attached params are used.
 |          TODO: should we drop this
 |      kwds : keyword arguments
 |          This keyword arguments are used without changes in the calulation
 |          of numerical derivatives. These are only used if a `deriv` function
 |          was not provided.
 |
 |      Returns
 |      -------
 |      grad : ndarray
 |          gradient or jacobian of the function
 |
 |  predicted(self)
 |      Value of the function evaluated at the attached params.
 |
 |      Note: This is not equal to the expected value if the transformation is
 |      nonlinear. If params is the maximum likelihood estimate, then
 |      `predicted` is the maximum likelihood estimate of the value of the
 |      nonlinear function.
 |
 |  se_vectorized(self)
 |      standard error for each equation (row) treated separately
 |
 |  summary(self, xname=None, alpha=0.05, title=None, use_t=False, df=None)
 |      Summarize the Results of the nonlinear transformation.
 |
 |      This provides a parameter table equivalent to `t_test` and reuses
 |      `ContrastResults`.
 |
 |      Parameters
 |      -----------
 |      xname : list of strings, optional
 |          Default is `c_##` for ## in p the number of regressors
 |      alpha : float
 |          Significance level for the confidence intervals. Default is
 |          alpha = 0.05 which implies a confidence level of 95%.
 |      title : string, optional
 |          Title for the params table. If not None, then this replaces the
 |          default title
 |      use_t : boolean
 |          If use_t is False (default), then the normal distribution is used
 |          for the confidence interval, otherwise the t distribution with
 |          `df` degrees of freedom is used.
 |      df : int or float
 |          degrees of freedom for t distribution. Only used and required if
 |          use_t is True.
 |
 |      Returns
 |      -------
 |      smry : string or Summary instance
 |          This contains a parameter results table in the case of t or z test
 |          in the same form as the parameter results table in the model
 |          results summary.
 |          For F or Wald test, the return is a string.
 |
 |  var(self)
 |      standard error for each equation (row) treated separately
 |
 |  wald_test(self, value)
 |      Joint hypothesis tests that H0: f(params) = value.
 |
 |      The alternative hypothesis is two-sided H1: f(params) != value.
 |
 |      Warning: this might be replaced with more general version that returns
 |      ContrastResults.
 |      currently uses chisquare distribution, use_f option not yet implemented
 |
 |      Parameters
 |      ----------
 |      value : float or ndarray
 |          value of f(params) under the Null Hypothesis
 |
 |      Returns
 |      -------
 |      statistic : float
 |          Value of the test statistic.
 |      pvalue : float
 |          The p-value for the hypothesis test, based and chisquare
 |          distribution and implies a two-sided hypothesis test
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |
 |  __dict__
 |      dictionary for instance variables
 |
 |  __weakref__
 |      list of weak references to the object

Creating an Instance of NonlinearDeltaCov

In our example, define the function \(a(\cdot)\):

import numpy as np
import pandas as pd

def max_earn(beta: pd.Series):
    return np.array([-50 * beta.loc["experience"] / beta.loc["experience_sq_div"]])


Then we supply \(a(\cdot)\) together with the parameter estimates and their covariance matrix:

delta_ratio = NonlinearDeltaCov(max_earn, results.params, results.cov_params())

Constructing CIs with NonlinearDeltaCov

Can construct a 95% confidence interval with the conf_int() method

delta_ratio.conf_int(alpha=0.05)
array([[30.36791181, 35.41407441]])

or the summary() method

delta_ratio.summary(alpha=0.1)
                             Test for Constraints                             
==============================================================================
                 coef    std err          z      P>|z|      [0.05      0.95]
------------------------------------------------------------------------------
c0            32.8910      1.287     25.550      0.000      30.774     35.008
==============================================================================

Nonlinear Wald Tests

Hypotheses

The delta method also allows testing \[ H_0: \ba(\bbeta) = 0 \quad \text{ vs. } \quad H_1: \ba(\bbeta) \neq 0 \] where

  • \(\ba(\cdot)\) is a smooth function
  • If there are any constants, they are absorbed into the definition of \(\ba\)

Example I: Experience and Maximal Earnings

Our remaining empirical question: \[ H_0: -\dfrac{50\beta_3}{\beta_4} - 15 = 0 \quad \text{ vs. } \quad H_1: -\dfrac{50\beta_3}{\beta_4} - 15\neq 0 \]


Interpretation of \(H_0\): expected log wage is maximized at 15 years of experience

Example II: Equal Effects

Sometimes the same hypothesis can be written in several ways

Example:

  • Want to check that two variables have the same coefficients
  • One way to phrase it: \(H_0: \beta_k/\beta_j - 1 = 0\) (requires \(\beta_j\neq 0\))
  • Another way: \(H_0: \beta_k-\beta_j = 0\)

The Wald statistic is generally not invariant to such reparametrizations: equivalent formulations can give different values of \(W\) in finite samples, so the linear form is usually preferable

Wald Statistic

Use same idea as before: compare distance between \(\ba(\hat{\bbeta})\) and \(0\)


Wald statistic: \[ W = N\ba(\hat{\bbeta})'\left( \bA(\hat{\bbeta})\widehat{\avar}(\hat{\bbeta})\bA(\hat{\bbeta})' \right)^{-1} \ba(\hat{\bbeta}) \tag{3}\]

Decision Rule: Wald Test

We call the following the asymptotic size \(\alpha\) Wald test:

Let \(c_{1-\alpha}\) solve \(P(\chi^2_k\leq c_{1-\alpha})=1-\alpha\) where \(k\) is the number of components in \(\ba(\cdot)\). Then

  • Reject \(H_0\) if \(W>c_{1-\alpha}\)
  • Do not reject \(H_0\) if \(W\leq c_{1-\alpha}\)

Exactly the Wald (and \(t\)) test for \(H_0: \bR\bbeta=\bq\) taken with \(\ba(\bbeta) =\bR\bbeta-\bq\)

Properties of the Wald Test

Proposition 4 Let the assumptions for asymptotic normality of the OLS estimator hold. Let \(\bA(\bbeta)\) have rank \(k\) where \(k\) is the number of components of \(\ba(\bbeta)\). Let \(W\) be defined as in Equation 3. Then

  1. If \(H_0: \ba(\bbeta)=0\) holds, then \(W\xrightarrow{d} \chi^2_k\). The associated test has asymptotic size \(\alpha\)
  2. If \(H_0: \ba(\bbeta)=0\) does not hold, then \(W\xrightarrow{p} +\infty\). The associated test is consistent
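
Continuing the manual sketch from before (reusing theta_hat, A, and V), the Wald statistic for our empirical \(H_0\) can be computed directly; since results.cov_params() already contains the \(1/N\) factor, the leading \(N\) in Equation 3 is absorbed:

from scipy import stats

a_val = theta_hat - 15.0          # a(beta-hat); the constant 15 is absorbed into a(.)
W = a_val**2 / (A @ V @ A)        # scalar case of Equation 3
p_value = stats.chi2.sf(W, df=1)  # k = 1 restriction
print(W, p_value)  # should match the wald_test output below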

Illustration: Nonlinear Wald Test with statsmodels

Can do the Wald test with the wald_test method of NonlinearDeltaCov:

delta_ratio.wald_test(np.array([15]))
(np.float64(193.1535027123987), np.float64(6.516561763470986e-44))

Outputs:

  • Value of \(W\)
  • Corresponding \(p\)-value

Recap and Conclusions

Lecture Recap

Recap

In this lecture we

  1. Established the delta method as a way of obtaining the asymptotic distribution of transformations of parameters
  2. Discussed inference on such transformations through confidence intervals and potentially nonlinear Wald tests

Block Conclusion

Overall Concluding Thoughts on the Block


We have now finished the first block — a deeper look at linear regression


What did we do?

Results I: Linear Model Analysis

Deeply analyzed the linear model itself

  • Key properties and causal framework
  • Linear models are useful even in some nonparametric settings (as we will see with event studies and difference-in-differences)

Results II: Asymptotic Arguments

Discussed how to establish consistency and asymptotic normality of the OLS estimator

  • Proofs represented the OLS estimator as a sample average and applied LLNs and CLTs
  • Generally useful approach: turns out many estimators can be represented and handled similarly
    • Linear models (linear IV and linear GMM)
    • Nonlinear models (including familiar ones like logit and probit)

Results III: Inference

We discussed confidence intervals and tests for both linear and nonlinear hypotheses

  • Our constructions and proofs for test statistics relied on consistency and asymptotic normality of the OLS estimator, but not on the linearity of the model itself (check this!)
  • \(\Rightarrow\) Same strategies can be used for inference in any model where we know the asymptotic distribution of the estimator

References

Hansen, Bruce. 2022. Econometrics. Princeton University Press.