Why Monte Carlo? General Principles

Role of Simulations. Principles, Design, and Anatomy

Vladislav Morozov

Introduction

Lecture Info

Learning Outcomes

This lecture is a high-level overview of why and how we do Monte Carlo simulations


By the end, you should be able to

  • Discuss the pros and cons of different approaches for evaluating statistical methods
  • State the three key characteristics of good simulations
  • Describe the high-level flow of a simulation study

Role of Monte Carlo Simulations

The Method Evaluation Problem

What’s The Point of Statistics?

In one sentence:

Developing new methods that allow users to learn parameters of interest from data

Parameters of interest:

  • Depend on context
  • Specified by the consumer of the method
  • Goal of statistics: say if parameter can be learned and how

What Methods Do Consumers Want?

Consumers of statistical methods want to use methods that they know “work well”

Natural:

  • A better tool gives better results
  • Easier to defend choosing to use it


But what’s “well”?

What’s “Well”?

Modern understanding:

A method works “well” if some metric of interest “tends” to “look good” under a reasonably broad variety of data generating processes

  • Metrics: bias, prediction risk, confidence interval coverage (illustrated in the sketch below)
  • “Tends”: allow for some possibility of bad performance, but with low probability
  • DGPs: we don’t love parametric assumptions
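A minimal sketch of how such metrics are tabulated in practice (my own illustration: the DGP is a normal location model, the estimator is the sample mean, and all names and values are made up, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta, n, n_reps = 1.0, 200, 2_000   # illustrative values

estimates = np.empty(n_reps)
covered = np.empty(n_reps, dtype=bool)
for r in range(n_reps):
    x = rng.normal(loc=true_theta, scale=2.0, size=n)       # toy DGP
    theta_hat = x.mean()                                     # estimator under study
    se = x.std(ddof=1) / np.sqrt(n)                          # its standard error
    covered[r] = abs(theta_hat - true_theta) <= 1.96 * se    # does the 95% CI cover the truth?
    estimates[r] = theta_hat

bias = estimates.mean() - true_theta                  # metric: bias
rmse = np.sqrt(np.mean((estimates - true_theta)**2))  # metric: root mean squared error
coverage = covered.mean()                             # metric: CI coverage (should be near 0.95)
print(f"bias = {bias:+.4f}, RMSE = {rmse:.4f}, coverage = {coverage:.3f}")
```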

Theoretical and Empirical Evaluation

Evaluation Methods


Three main methods, in decreasing order of strength:

  1. Finite sample theoretical guarantees
  2. (Good) Monte Carlo simulations
  3. Empirical validation

Theoretical Bounds

Nonparametric finite sample bounds — best possible results

Usually possible for specific scenarios (bounded variables, bounded dimensions). Examples: the finite-sample results collected in Wainwright (2019) and Shalev-Shwartz and Ben-David (2014)

Limitations of Theoretical Bounds

Key challenge:

In many problems it is impossible to obtain useful guarantees


Problematic scenarios:

  • Highly-structured DGPs (dependence or structured outputs)
  • Nonlinear and multistep algorithms
  • Settings without natural known bounds on variables

Empirical Evaluation

Other end of spectrum:

Testing methods on real datasets of interest


Examples: benchmark datasets and suites such as ImageNet (Deng et al. 2009), GLUE (Wang et al. 2019), and the OpenML benchmarking suites (Bischl et al. 2021)

Limitations of Evaluation on Real Data

Limited to scenarios where the ground truth is known: works for prediction, but not for causal inference


Other issues:

  • Eventually invalid: comparing many algorithms on the same test set amounts to implicit training on the test set
  • Cost: real-life data is not easy to obtain
  • Limited scope: results only informative for the dataset used and similar data

Simulations

Place of Simulations

Simulations: check every aspect of performance in a “lab” setting with many synthetic datasets

  • Lie somewhere between generic theory and specific real test datasets
  • Using different DGPs — poor person’s version of generic theory, gives confidence for some scenarios
  • Synthetic data \(\Rightarrow\) full knowledge of target quantities \(\Rightarrow\) can evaluate both causal and predictive methods

Simulations as Evaluation Tools

Simulations allow answering “what if” questions, e.g.:

  • Does this estimator actually work when tail conditions hold?
  • Does this inference method suffer size distortions when identification is irregular?
  • Is there a big efficiency loss when using a more general estimator?

Limitations of Simulations

  • Only as good as the DGPs used
  • Computationally expensive — challenging with algorithms that take a long time to train
  • Limited scope:
    • Mostly useful with numerical data
    • Reason: not clear how to write DGPs for image, text, etc.

Other Uses: Motivating Tool

An easy, clear simulation is a good way of motivating a problem

Example: figure from intro of Chernozhukov et al. (2018) — danger of not using Neyman orthogonalization (left panel)

Monte Carlo vs. Other Kinds of Simulations

Our focus: Monte Carlo simulations

MC: drawing many random datasets and tabulating performance across these datasets


Not the only kind of simulations. Contrast with

  • Deterministic simulations
  • Synthetic data generation for training data augmentation

Principles of Good Simulations

Characteristics of Good Simulations

The Three Key Characteristics


Good simulations are

  • Realistic
  • Reproducible
  • Targeted

Realism

DGPs should mimic essential real-world features without excess complexity


Intuitively:

  • Simulations are like crash-testing cars in a lab vs. on real roads.
  • Lab crashes must be similar to real ones to be informative
  • But don’t need to replicate every single aspect of roads

Reproducibility

Simulations should be reproducible exactly


Steps to achieve:

  • Set random seeds (see the sketch after this list)
  • Share code and give replication instructions
  • Describe exactly the environment used (in plain text or via Docker)
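A minimal sketch of the seed and environment points (assuming NumPy; the seed value is arbitrary and only for illustration):

```python
import sys
import numpy as np

SEED = 12345                               # fixed seed makes every rerun identical
draws_1 = np.random.default_rng(SEED).normal(size=5)
draws_2 = np.random.default_rng(SEED).normal(size=5)
assert np.allclose(draws_1, draws_2)       # re-seeding reproduces the exact same data

# Record the environment alongside the results (plain-text alternative to a Docker image)
print(f"Python {sys.version.split()[0]}, NumPy {np.__version__}, seed {SEED}")
```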

Targeted Simulations

Simulation DGPs should reflect the property of interest


Examples:

  • When evaluating IV-related methods, use DGPs that vary in instrument strength/number of moment conditions (see the sketch after this list)
  • When evaluating inference on extreme quantiles, use DGPs that vary in tail properties
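To illustrate the IV example, the sketch below parameterizes a toy linear IV DGP by a first-stage coefficient `pi` that controls instrument strength (my own construction; all numbers are arbitrary):

```python
import numpy as np

def draw_iv_data(n, pi, rng):
    """Toy linear IV DGP: pi controls instrument strength (pi near 0 = weak instrument)."""
    z = rng.normal(size=n)                                   # instrument
    errors = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=n)
    u, v = errors[:, 0], errors[:, 1]                        # correlated structural and first-stage errors
    x = pi * z + v                                           # first stage
    y = 1.0 * x + u                                          # structural equation, true beta = 1
    return y, x, z

rng = np.random.default_rng(0)
for pi in (0.05, 0.5, 2.0):                                  # weak to strong instruments
    y, x, z = draw_iv_data(500, pi, rng)
    beta_iv = (z @ y) / (z @ x)                              # just-identified IV estimator
    print(f"pi = {pi:4.2f}: IV estimate = {beta_iv:6.3f} (truth is 1)")
```

Sweeping `pi` over a grid is what makes the simulation targeted: the same estimator is stress-tested precisely along the dimension (instrument strength) that matters for the question.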

Anatomy of a Simulation

Common Structure: The Three Steps

[Flowchart: identify research question and metric of interest → simulation loop (draw data → apply method → compare estimates to ground truth, repeat) → summarize results across datasets]

  1. Choose what you care about (e.g. bias of several estimators)
  2. Run simulation loop:
    • Draw a dataset from given DGP
    • Apply methods of interest
    • Compare the obtained estimates to the ground truth
  3. Summarize results: compute averages, tabulate distributions, etc. (see the sketch below)
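Putting the three steps together, here is a minimal end-to-end sketch (my own toy example: the metric is the bias and RMSE of the sample mean versus the sample median for a normal location parameter):

```python
import numpy as np

# Step 1: what we care about (bias/RMSE of two location estimators) and the design
rng = np.random.default_rng(2024)
true_mu, n, n_reps = 0.5, 100, 5_000
errors = {"mean": np.empty(n_reps), "median": np.empty(n_reps)}

# Step 2: simulation loop
for r in range(n_reps):
    data = rng.normal(loc=true_mu, scale=1.0, size=n)   # draw a dataset from the DGP
    errors["mean"][r] = data.mean() - true_mu           # apply method, compare to ground truth
    errors["median"][r] = np.median(data) - true_mu

# Step 3: summarize results across datasets
for name, err in errors.items():
    print(f"{name:>6}: bias = {err.mean():+.4f}, RMSE = {np.sqrt(np.mean(err**2)):.4f}")
```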

Recap and Conclusions

Recap


In this lecture we

  1. Discussed ways to evaluate statistical methods
  2. Formulated basic principles of good simulations
  3. Looked at core simulation anatomy

Next Questions


We now have an idea of the what and the why; the next question is how:

  • How to implement the steps in code?
  • How to choose DGPs?
  • How to approach different statistical scenarios?
  • How to improve reproducibility?

References

Bischl, Bernd, Giuseppe Casalicchio, Matthias Feurer, Pieter Gijsbers, Frank Hutter, Michel Lang, Rafael G. Mantovani, Jan N. van Rijn, and Joaquin Vanschoren. 2021. “OpenML Benchmarking Suites.” arXiv. https://doi.org/10.48550/arXiv.1708.03731.
Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2018. “Double/Debiased Machine Learning for Treatment and Structural Parameters.” The Econometrics Journal 21 (1): C1–68. https://doi.org/10.1111/ectj.12097.
Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. “ImageNet: A Large-Scale Hierarchical Image Database.” In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–55. Miami, FL: IEEE. https://doi.org/10.1109/CVPR.2009.5206848.
Shalev-Shwartz, Shai, and Shai Ben-David. 2014. Understanding Machine Learning. 1st ed. West Nyack: Cambridge University Press.
Wainwright, Martin J. 2019. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. https://doi.org/10.1017/9781108627771.
Wang, Alex, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.” arXiv. https://doi.org/10.48550/arXiv.1804.07461.