Simulating Test Power and Size for Comparing Tests
This lecture is about evaluating hypothesis tests
By the end, you should be able to
Often want to test joint hypotheses: hypotheses that involve several statements about parameters at the same time:
\[ \begin{aligned} & H_0: \text{for all $k$ the statement $A_k$ is true} \quad \text{vs.} \\ & H_1: \text{for at least some $k$, the statement $A_k$ is not true} \end{aligned} \]
Example: null that coefficients are zero (\(A_k=\curl{\theta_k=0}\)) \[ \begin{aligned} H_0: \theta_k = 0, \quad k=1, \dots, p \end{aligned} \]
Two example approaches:
Question of today:
When do we generally use the first approach?
How to answer this question?
Today talk about
Suppose that we have a model with some parameters \(\theta\) (of whatever nature)
Two competing hypotheses (statements about parameters \(\theta\)) \[ H_0: \theta\in \Theta_0 \text{ vs. } H_1: \theta \in \Theta_1 \] for some non-intersecting \(\Theta_0\) and \(\Theta_1\)
Example: linear model of the form:
\[\small Y_i = \sum_{k=0}^{p} \beta_k X_i^{(k)} + U_i \]
A test is a decision rule: you see the sample and then you decide in favor of \(H_0\) or \(H_1\)
Formally:
Definition 1 A test \(T\) is a function of the sample \((X_1, \dots, X_N)\) to the space \(\{ \text{Reject } H_0\), \(\text{Do not reject }H_0 \}\)
Technically, Definition 1 is about deterministic tests, there are also randomized tests
Definition 2 The power function \(\text{Power}_T(\theta)\) of the test \(T\) is the probability that \(T\) rejects if \(\theta\) is the true parameter value: \[ \text{Power}_T(\theta) = P(T(X_1, \dots, X_N)=\text{Reject }H_0|\theta) \]
Definition 3 The size \(\alpha\) of the test \(T\) is \[ \alpha = \max_{\theta\in\Theta_0} \text{Power}_T(\theta) \]
In other words, the probability of falsely rejecting the null — type I error
The best possible test has perfect detection: power 0 everywhere on \(\Theta_0\) and power 1 everywhere on \(\Theta_1\)
Usually impossible in practice. Instead, we ask
Power function — key metric in comparing tests
You should prefer tests that
How do we construct a test/decision rule?
Basic approach:
Recall: compare with critical values to measure compatibility with \(H_0\)
Critical values chosen to guarantee
At least in the limit, the test should not falsely reject \(H_0\) “too much”
The asymptotic size \(\alpha\) of the test \(T\) is \(\alpha = \lim_{N\to\infty} \max_{\theta\in\Theta_0} P(T(X_1, \dots, X_N)=\text{Reject }H_0|\theta)\)
Suppose that have estimator \(\hat{\theta}_k\) such that \[ \small \sqrt{N}(\hat{\theta}_k-\theta_k)\xrightarrow{d} N(0, \avar(\hat{\theta}_k)) \] where \(N\) is sample size
Will base the test on the \(t\)-statistic: \[ \small t = \dfrac{\hat{\theta}_k}{\sqrt{ \widehat{\avar}(\hat{\theta}_k)/N } } \]
We call the following the asymptotic size \(\alpha\) \(t\)-test:
Let \(z_{1-\alpha/2} = \Phi^{-1}(1-\alpha/2)\). Then the test rejects \(H_0\) if \(|t| \geq z_{1-\alpha/2}\)
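As a quick illustration (a minimal sketch, not part of the lecture's simulation code; the function name is mine):

import numpy as np
from scipy.stats import norm


def t_test_decision(theta_hat: float, avar_hat: float, n_obs: int, alpha: float = 0.05) -> bool:
    """Asymptotic size-alpha t-test of H0: theta_k = 0; returns True if H0 is rejected."""
    t_stat = theta_hat / np.sqrt(avar_hat / n_obs)
    z_crit = norm.ppf(1 - alpha / 2)  # two-sided critical value z_{1-alpha/2}
    return bool(abs(t_stat) >= z_crit)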
Let \(\btheta= (\theta_0, \theta_1, \dots, \theta_p)\), \(\bR\) some matrix, \(\bc\) some vector
Suppose that have estimator \(\hat{\btheta}\) such that \[ \small \sqrt{N}(\hat{\btheta}-\btheta)\xrightarrow{d} N(0, \avar(\hat{\btheta})) \]
Will base the test on the Wald statistic: \[ \small W = N\left( \bR\hat{\btheta}-\bc \right)'\left(\bR\widehat{\avar}(\hat{\btheta})\bR'\right)^{-1}\left( \bR\hat{\btheta}-\bc \right) \]
We call the following the asymptotic size \(\alpha\) Wald-test:
Let \(c_{1-\alpha}\) solve \(P(\chi^2_k\leq c_{1-\alpha})=1-\alpha\) where \(k\) is the number of rows and rank of \(\bR\). Then the test rejects \(H_0\) if \(W \geq c_{1-\alpha}\)
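A minimal numpy/scipy sketch of this decision rule (illustrative only; function and argument names are mine):

import numpy as np
from scipy.stats import chi2


def wald_test_decision(theta_hat, avar_hat, R, c, n_obs, alpha=0.05):
    """Asymptotic size-alpha Wald test of H0: R theta = c; returns True if H0 is rejected."""
    R = np.atleast_2d(R)
    diff = R @ theta_hat - c
    w_stat = n_obs * diff @ np.linalg.inv(R @ avar_hat @ R.T) @ diff
    crit = chi2.ppf(1 - alpha, df=R.shape[0])  # df = number of restrictions (rows of R)
    return bool(w_stat >= crit)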
Now turn to answering our motivating question: joint vs. multiple tests for joint nulls
Plan:
After: implementation
Suppose that have coefficients \(\theta_k\) and want to test the joint null \[ H_0: \theta_k = c_k, \quad k=1, \dots, p \] Have estimator \(\hat{\btheta}\) such that \(\sqrt{N}(\hat{\btheta}-\btheta)\xrightarrow{d} N(0, \avar(\hat{\btheta}))\)
Two main approaches for testing \(H_0\):
A joint test uses a single statistic to conduct the test
Example: Wald test based on
\[ W = N(\hat{\btheta}-\bc)'(\widehat{\avar}(\hat{\btheta}))^{-1}(\hat{\btheta}-\bc), \] Test based on comparing \(W\) with \((1-\alpha)\)th quantile of \(\chi^2_p\)
A multiple test aggregates the decisions of many separate tests
For example, let \[ \small t_k = \dfrac{\sqrt{N}(\hat{\theta}_k-c_k)}{\sqrt{\widehat{\avar}(\hat{\theta}_k)}} \]
A multiple \(t\)-test for \(H_0\) rejects if \(\max_{k=1, \dots, p}\lbrace \abs{t_k}\rbrace\) exceeds some prespecified critical value \(c_{\alpha}\)
Critical values of multiple tests have to be set to ensure overall correct size
Have to be careful: e.g. with \(p=2\) have \[ \begin{aligned} & P(\text{test rejects}|H_0) \\ & = P\left( \lbrace |{t}_1| \geq c_{\alpha}\rbrace |H_0 \right) + P\left( \lbrace |{t}_2| \geq c_{\alpha}\rbrace |H_0 \right) \\ & \quad - P\left( \lbrace |{t}_1| \geq c_{\alpha}\rbrace \cap \lbrace |{t}_2| \geq c_{\alpha}\rbrace |H_0 \right), \end{aligned} \]
Hence if each \(t\)-test is a size \(\alpha\)-test, resulting multiple test has
\[ P(\text{test rejects}|H_0) \in [\alpha, 2\alpha] \]
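For intuition: if the two \(t\)-statistics were independent, the inclusion-exclusion formula above would give a rejection probability of exactly \(1-(1-\alpha)^2\), which sits strictly between \(\alpha\) and \(2\alpha\). A quick numerical check:

alpha = 0.05
print(1 - (1 - alpha) ** 2)  # 0.0975, between alpha = 0.05 and 2*alpha = 0.10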
Now know the most important metrics:
Should compare joint vs. multiple approach to testing in terms of power
But now key open questions:
Today:
When no constraint on setting is present, start with the simplest setting possible
Ideally: simplest = some kind of “textbook” reference setting
Very often in testing:
Key reason:
There are theoretical reasons for importance of normality as a “base” case, see chapters 6-7 in Van der Vaart (1998)
Simple linear model: \[ Y_i = 1 + \theta_1 X_i^{(1)} + \theta_2 X_{i}^{(2)} + U_i \]
Very simple kind of joint hypothesis: \[ H_0: \theta_1 = \theta_2 = 0 \]
\[ \begin{aligned} U_{i} & \sim N(0, 1), \\ \begin{pmatrix} X_i^{(1)} \\ X_i^{(2)} \end{pmatrix} & \sim N\left(\begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right) \end{aligned}, \] where the correlation \(\rho\) runs between \(-1\) and \(1\)
Varying \(\rho\) is the key axis: it changes \(\avar(\hat{\btheta})\), which controls the geometry of many asymptotic joint tests
Power function is function of \(\btheta=(\theta_1, \theta_2)\)
\[ \small \text{Power}_T(\btheta) = P(T(X_1, \dots, X_N)=\text{Reject }H_0|\btheta) \]
Always need to look at DGPs that vary at least in the coefficients that enter \(H_0\)
\(\Rightarrow\) DGPs vary in \(c\) and \(\rho\)
Not that restrictive to have \(\theta_1=\theta_2\): can suitably rotate \((X_{i}^{(1)}, X_{i}^{(2)})\) to turn any \((\theta_1, \theta_2)\) into a case with \(\theta_1=\theta_2\).
In our case: think about which tests are the most popular
Will compare those based on OLS estimator, just need to choose adjusted critical values for multiple \(t\)-test
Recall: if each \(t\)-test is done at size \(\alpha\), then multiple \(t\)-test based on two has \[ P(\text{test rejects}|H_0) \leq 2\alpha \]
If want overall multiple test to have size \(\leq \alpha\), need to do each component test with size \(\alpha/2\)
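Numerically, for \(\alpha=0.05\) and \(p=2\) (a quick check with scipy):

from scipy.stats import norm

alpha, p = 0.05, 2
print(norm.ppf(1 - alpha / 2))        # ~1.96: per-test critical value without adjustment
print(norm.ppf(1 - alpha / (2 * p)))  # ~2.24: Bonferroni-adjusted critical value (size alpha/2 per test)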
\(\rho\) controls correlation between \(\hat{\theta}_1\) and \(\hat{\theta}_2\) with \(\mathrm{corr}(\hat{\theta}_1, \hat{\theta}_2) \approx -\rho\)
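A quick numerical check of this claim (using the fact that, with homoskedastic errors, the asymptotic variance of the slope estimates is proportional to the inverse of the covariance matrix of the covariates; unit variances assumed as in the DGP above):

import numpy as np

rho = 0.6
q = np.array([[1.0, rho], [rho, 1.0]])  # covariance of (X1, X2)
v = np.linalg.inv(q)                    # proportional to avar of (theta1_hat, theta2_hat)
print(v[0, 1] / np.sqrt(v[0, 0] * v[1, 1]))  # -0.6 = -rho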
Mostly use same code structure developed in first block
Some differences:
tests instead of estimators
├── README.md
├── requirements.txt
├── dgps
│ ├── __init__.py
│ └── linear.py
├── main.py
├── results
│ ├── ...
├── sim_infrastructure
│ ├── __init__.py
│ ├── orchestrators.py
│   ├── protocols.py
│   ├── results_processor.py
│ ├── runner.py
│ └── scenarios.py
└── tests
├── __init__.py
├── joint.py
└── multiple.py
“Infrastructure” classes organized in separate folder so that main.py is the only .py file in root
What does execution look like for a single dataset?
For many datasets? A minimal sketch of both follows below
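A minimal sketch of that flow, using the DGP and test classes defined later in this lecture (illustrative parameter values):

from dgps.linear import BivariateLinearModel
from tests.joint import WaldWithOLS

dgp = BivariateLinearModel(common_coef_val=0.2, covar_corr=0.5)
test = WaldWithOLS()

# Single dataset: draw data, run the test, read off the decision
x, y = dgp.sample(n_obs=200, seed=1)
test.test(x, y)
print(test.decision)

# Many datasets: repeat with different seeds; the rejection rate estimates the power at this DGP
decisions = []
for sim_id in range(500):
    x, y = dgp.sample(n_obs=200, seed=1 + sim_id)
    test.test(x, y)
    decisions.append(test.decision)
print(sum(decisions) / len(decisions))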
SimulationRunner Implementation (sim_infrastructure/runner.py)
"""
Module for executing a given Monte Carlo scenario.
This module contains the SimulationRunner class, which executes a scenario
by combining a DGP and a test. It handles data generation, estimation, testing,
and result collection.
Classes:
SimulationRunner: Runs simulations and summarizes results.
"""
import numpy as np
from sim_infrastructure.protocols import DGPProtocol, TestProtocol
class SimulationRunner:
"""Runs Monte Carlo simulations for a given DGP and estimator.
Attributes:
dgp: data-generating process with a sample() method and beta1 attribute.
estimator: estimator with a fit() method and beta1_hat attribute.
errors: array of estimation errors (beta1_hat - beta1) for each simulation.
"""
def __init__(
self,
dgp: DGPProtocol,
test: TestProtocol,
) -> None:
"""Initializes the simulation runner.
Args:
dgp: An instance of a DGP class (must implement `sample`).
            test: An instance of a test class (must implement `test`).
"""
self.dgp: DGPProtocol = dgp
self.test: TestProtocol = test
self.test_decisions: np.ndarray = np.empty(0)
def simulate(self, n_sim: int, n_obs: int, first_seed: int | None = None) -> None:
"""Runs simulations and stores estimation errors.
Args:
n_sim (int): number of simulations to run.
n_obs (int): Number of observations per simulation.
first_seed (int | None): Starting random seed for reproducibility.
Defaults to None.
"""
        # Preallocate array to hold test decisions
self.test_decisions = np.empty(n_sim)
# Run simulation
for sim_id in range(n_sim):
# Draw data
            x, y = self.dgp.sample(
                n_obs, seed=first_seed + sim_id if first_seed is not None else None
            )
# Test null of zero coefficients
self.test.test(x, y)
            # Store decision
self.test_decisions[sim_id] = self.test.decision
def summarize_results(self) -> dict:
"""Return a summary of results
Returns:
dict: power of current test at current DGP
"""
return {
"test": self.test.name,
"covar_corr": self.dgp.covar_corr,
"common_coef_val": self.dgp.common_coef_val,
"power": self.test_decisions.mean(),
        }

Notice how the SimulationRunner summarizes its result: that's the power function computation!
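For example, a single runner can be used directly; with common_coef_val=0 the reported "power" is the empirical size (a sketch with illustrative values):

from dgps.linear import BivariateLinearModel
from sim_infrastructure.runner import SimulationRunner
from tests.joint import WaldWithOLS

runner = SimulationRunner(
    dgp=BivariateLinearModel(common_coef_val=0.0, covar_corr=0.3),
    test=WaldWithOLS(),
)
runner.simulate(n_sim=500, n_obs=200, first_seed=1)
print(runner.summarize_results())  # under the null (c=0), "power" is the empirical size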
tests/joint.py Module: Wald Test (tests/joint.py)
"""
Module for joint tests for joint hypothesis.
Classes:
WaldWithOLS: OLS-based Wald test that all non-intercept coefficients are zero.
"""
import numpy as np
import pandas as pd
from statsmodels.regression.linear_model import OLS
class WaldWithOLS:
"""Class for applying Wald test to test null of zero coefficients.
Tailored to linear model
Y_i = theta0 + theta1 X_{i1} + ... + thetap X_{ip} + U_i
Tests the null:
H0: theta1 = ... = thetap = 0
Attributes:
name (str): test name, "Wald"
decision (np.bool): whether null is rejected
"""
def __init__(self) -> None:
self.name: str = "Wald"
self.decision: np.bool
def test(self, x: pd.DataFrame, y: pd.DataFrame) -> None:
"""Carry out the Wald test with given data.
Args:
x (pd.DataFrame): covariates, including a leading constant column.
y (pd.DataFrame): outcomes.
"""
# Fit models
lin_reg = OLS(y, x)
lin_reg_fit = lin_reg.fit()
# Perform Wald test
num_covars = x.shape[1]
wald_restriction_matrix = np.concatenate(
(np.zeros((num_covars - 1, 1)), np.eye(num_covars - 1)), axis=1
)
wald_test = lin_reg_fit.wald_test(
wald_restriction_matrix,
use_f=False,
scalar=True,
)
        self.decision = wald_test.pvalue <= 0.05

TestProtocol: asks tests to have a test() method and a decision field
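The protocols module itself is not shown on these slides; a minimal sketch of what sim_infrastructure/protocols.py could look like, reconstructed from how the protocols are used (my reconstruction, not the original file):

"""Structural typing protocols for DGPs and tests (illustrative reconstruction)."""
from typing import Protocol

import pandas as pd


class DGPProtocol(Protocol):
    """A DGP: scenario parameters as attributes plus a sample() method."""

    common_coef_val: float  # read by SimulationRunner.summarize_results()
    covar_corr: float

    def sample(
        self, n_obs: int, seed: int | None = None
    ) -> tuple[pd.DataFrame, pd.DataFrame]: ...


class TestProtocol(Protocol):
    """A test: a name, a test() method, and a decision field set by test()."""

    name: str
    decision: bool  # in practice a NumPy bool, set by test()

    def test(self, x: pd.DataFrame, y: pd.DataFrame) -> None: ...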
tests/multiple.py Module: Multiple \(t\)-Test (tests/multiple.py)
"""
Module for multiple tests for joint hypothesis.
Classes:
    BonferroniMultipleTWithOLS: OLS-based multiple t-test that all non-intercept
coefficients are zero. Uses Bonferroni correction.
"""
import numpy as np
import pandas as pd
from statsmodels.regression.linear_model import OLS
from statsmodels.stats.multitest import multipletests
class BonferroniMultipleTWithOLS:
"""Class for applying multiple t-test to test null of zero coefficients.
Tailored to linear model
Y_i = theta0 + theta1 X_{i1} + ... + thetap X_{ip} + U_i
Tests the null:
H0: theta1 = ... = thetap = 0
Attributes:
name (str): test name, "Bonferroni t"
decision (np.bool): whether null is rejected
"""
def __init__(self) -> None:
self.name: str = "Bonferroni t"
self.decision: np.bool
def test(self, x: pd.DataFrame, y: pd.DataFrame) -> None:
"""Carry out the multiple t-test with given data.
Args:
x (pd.DataFrame): covariates, including a leading constant column.
y (pd.DataFrame): outcomes.
"""
# Fit models
lin_reg = OLS(y, x)
lin_reg_fit = lin_reg.fit()
# Perform multiple t-tests
p_vals_t = lin_reg_fit.pvalues.iloc[1:]
t_test_corrected_bonf = multipletests(
p_vals_t,
method="bonferroni", # Bonferroni
)
        self.decision = t_test_corrected_bonf[0].sum() > 0

Again: power function depends on \(c\) (from \(\theta_1=\theta_2=c\))
Here: evaluate for 231 points evenly spaced between \(-1.75\) and \(+1.75\), a range wide enough to trace out the full shape of the power functions
No special meaning for 231, just a reasonably large number
Want to also see the power function for different values of covariate correlation \(\rho\)
Will analyze for 51 values of \(\rho\) evenly spaced between \(-0.99\) and \(0.99\)
That is, for every \(\rho\), compute power function on the \(c\)-grid
Using an odd number of grid points ensures that 0 is included in the grid
Will create as follows:
a DGP class parameterized by \(c\) and \(\rho\) (dgps/linear.py)
scenarios.py to create a DGP with every combination of \(c\) and \(\rho\) from the grid

dgps/linear.py
"""
Module for linear data-generating processes (DGPs).
This module contains classes for generating data from simple linear models.
Classes:
BivariateLinearModel: A DGP for static linear models with normal variables,
adjustable correlation between covariates, and two variables + constant.
"""
import numpy as np
import pandas as pd
class BivariateLinearModel:
"""A data-generating process (DGP) for a static linear model
Attributes:
common_coef_val (float): Common values for theta1 and theta2
covar_corr (float): Correlation coefficient between covariates
"""
def __init__(self, common_coef_val: float, covar_corr: float) -> None:
self.common_coef_val: float = common_coef_val
self.covar_corr: float = covar_corr
def sample(
self, n_obs: int, seed: int | None = None
) -> tuple[pd.DataFrame, pd.DataFrame]:
"""Samples data from the static DGP.
Args:
n_obs (int): Number of observations to sample.
seed (int, optional): Random seed for reproducibility. Defaults to None.
Returns:
tuple: (x, y) DataFrames, each of length n_obs.
"""
# Initialize RNG
rng = np.random.default_rng(seed)
        # Construct mean and covariance for the covariates (first column is the constant)
x_mean = np.array([1, 0, 0])
x_covar = np.array([[0, 0, 0], [0, 1, self.covar_corr], [0, self.covar_corr, 1]])
# Construct vector of coefficients
        thetas = np.array([1, self.common_coef_val, self.common_coef_val])
# Draw covariates and residuals, combine into output
covariates = rng.multivariate_normal(
x_mean,
x_covar,
size=n_obs,
)
resids = rng.normal(0, np.sqrt(1), size=n_obs)
y = (covariates @ thetas) + resids
# Convert output into pandas dataframes with dynamic variable names for X
covariates_df = pd.DataFrame(
covariates,
columns=[f"X{i}" for i in range(covariates.shape[1])],
)
y_df = pd.DataFrame(
y,
columns=["y"],
)
        return covariates_df, y_df

scenarios.py (sim_infrastructure/scenarios.py)
"""
Module for defining simulation scenarios.
This module contains scenarios for comparing power functions of tests under model:
Y = theta0 + theta1*x1 + theta2*x2 + u
The null being tested is
H0: theta1=theta2=0
Classes:
SimulationScenario: data class for scenarios
Variables:
    scenarios (list): list of scenarios to run.
"""
import numpy as np
from dataclasses import dataclass
from itertools import product
from dgps.linear import BivariateLinearModel
from sim_infrastructure.protocols import DGPProtocol, TestProtocol
from tests.joint import WaldWithOLS
from tests.multiple import BonferroniMultipleTWithOLS
@dataclass(frozen=True)
class SimulationScenario:
"""A single simulation scenario: DGP, test, and sample size."""
name: str # For readability
dgp: type[DGPProtocol]
dgp_params: dict # E.g. betas go here
test: type[TestProtocol]
test_params: dict # E.g. reg_params go here
sample_size: int
n_simulations: int = 500
first_seed: int = 1
# Create DGP combinations indexed by correlation and coefficient values
common_coef_vals = np.linspace(-1.75, 1.75, 231)
covar_corr_vals = np.linspace(-0.99, 0.99, 51)
dgps = [
(
BivariateLinearModel,
{"common_coef_val": common_coef_val, "covar_corr": covar_corr},
)
for common_coef_val, covar_corr in product(common_coef_vals, covar_corr_vals)
]
# Create list of tests
tests = [
(WaldWithOLS, {}),
    (BonferroniMultipleTWithOLS, {}),
]
sample_sizes = [200]
# Generate all combinations
scenarios = [
SimulationScenario(
name="", # will identify results through info from SimulationRunner
dgp=dgp_class,
dgp_params=dgp_params,
test=test_class,
test_params=test_params,
sample_size=size,
)
for (dgp_class, dgp_params), (
test_class,
test_params,
), size in product(dgps, tests, sample_sizes)
]

We end up with many DGPs: 231 coefficient values \(\times\) 51 correlation values = 11,781 DGPs, each paired with both tests
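A quick sanity check on how many scenarios this produces (the counts follow directly from the grids above):

from sim_infrastructure.scenarios import dgps, scenarios

print(len(dgps))       # 231 coefficient values * 51 correlations = 11781 DGPs
print(len(scenarios))  # 11781 DGPs * 2 tests * 1 sample size = 23562 scenarios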
Will implement SimulationOrchestrator that can
run all scenarios in parallel (ThreadPoolExecutor, under free-threaded Python 3.14+)
show a progress bar (tqdm)
Parallel computation is a deep topic. See Antao (2023) on Python, but be aware of the removal of the GIL in CPython starting from 3.14
SimulationOrchestratorParallel (sim_infrastructure/orchestrators.py)
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
from sim_infrastructure.runner import SimulationRunner
from sim_infrastructure.scenarios import SimulationScenario
class SimulationOrchestratorParallel:
"""Parallelized simulation orchestrator class."""
def __init__(self, scenarios: list[SimulationScenario]) -> None:
self.scenarios = scenarios
self.summary_results = []
def run_single_scenario(self, scenario):
"""Run a single scenario and return the result dictionary."""
dgp = scenario.dgp(**scenario.dgp_params)
        test = scenario.test(**scenario.test_params)
        runner = SimulationRunner(dgp, test)
runner.simulate(
n_sim=scenario.n_simulations,
n_obs=scenario.sample_size,
first_seed=scenario.first_seed,
)
return runner.summarize_results()
def run_all(self, max_workers=None):
"""Run all scenarios in parallel using ThreadPoolExecutor."""
with ThreadPoolExecutor(max_workers=max_workers) as executor:
# Submit all scenarios to the executor
futures = {
executor.submit(self.run_single_scenario, scenario): scenario
for scenario in self.scenarios
}
# Collect results as they complete, with a progress bar
for future in tqdm(
as_completed(futures), total=len(futures), desc="Running simulations"
):
                self.summary_results.append(future.result())

Here: create a custom ResultsProcessor that processes results and exports plots
Sadly, no general approaches for making graphs — always a lot of customization and tight coupling to setting
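The ResultsProcessor is not reproduced on these slides. A minimal sketch of the interface that main.py expects (my reconstruction; the real plotting code is more customized):

"""Illustrative sketch of sim_infrastructure/results_processor.py (reconstruction, not the original)."""
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd


class ResultsProcessor:
    """Loads simulation results and exports one power-curve plot per correlation value."""

    def __init__(self, results_path: Path, plot_folder: Path) -> None:
        self.results = pd.read_csv(results_path)
        self.plot_folder = plot_folder

    def export_all_plots(self) -> None:
        self.plot_folder.mkdir(parents=True, exist_ok=True)
        # One figure per covariate correlation: power vs. common coefficient value, by test
        for covar_corr, df_corr in self.results.groupby("covar_corr"):
            fig, ax = plt.subplots()
            for test_name, df_test in df_corr.groupby("test"):
                df_test = df_test.sort_values("common_coef_val")
                ax.plot(df_test["common_coef_val"], df_test["power"], label=test_name)
            ax.set_xlabel("Common coefficient value c")
            ax.set_ylabel("Rejection probability")
            ax.set_title(f"Covariate correlation = {covar_corr:.2f}")
            ax.legend()
            fig.savefig(self.plot_folder / f"power_corr_{covar_corr:.2f}.png")
            plt.close(fig)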
main.py
from pathlib import Path
import pandas as pd
from sim_infrastructure.orchestrators import SimulationOrchestratorParallel
from sim_infrastructure.results_processor import ResultsProcessor
from sim_infrastructure.scenarios import scenarios
SIM_RESULTS_PATH = Path() / "results" / "sim_results.csv"
PLOT_FOLDER = Path() / "results" / "plots"
if __name__ == "__main__":
# Create and execute simulations
orchestrator = SimulationOrchestratorParallel(scenarios)
orchestrator.run_all()
# Export results
pd.DataFrame(orchestrator.summary_results).to_csv(SIM_RESULTS_PATH)
# Export plots
results_processor = ResultsProcessor(SIM_RESULTS_PATH, PLOT_FOLDER)
    results_processor.export_all_plots()

Can we justify always using Wald tests based on this?
Yes. The Wald test is safe
In this lecture we defined test power and size, compared joint (Wald) and multiple (Bonferroni \(t\)) approaches to testing joint hypotheses, and built a simulation framework for computing and comparing their power functions
Evaluating Hypothesis Tests