Why Monte Carlo? General Principles

Role of Simulations. Principles, Design, and Anatomy

Vladislav Morozov

Introduction

Lecture Info

Learning Outcomes

This lecture is a high-level overview of why and how we do Monte Carlo simulations


By the end, you should be able to

  • Discuss the pros and cons of different approaches for evaluating statistical methods
  • State the three key characteristics of good simulations
  • Describe the high-level flow of a simulation study

Role of Monte Carlo Simulations

The Method Evaluation Problem

What’s The Point of Statistics?

In one sentence:

Developing new methods that allow users to learn parameters of interest from data

Parameters of interest:

  • Depend on context
  • Specified by the consumer of the method
  • Goal of statistics: say if parameter can be learned and how

What Methods Do Consumers Want?

Consumers of statistical methods want to use methods that they know “work well”

Natural:

  • A better tool gives better results
  • Easier to defend choosing to use it


But what’s “well”?

What’s “Well”?

Modern understanding:

A method works “well” if some metric of interest “tends” to “look good” under a reasonably broad variety of data generating processes

  • Metrics: bias, prediction risk, confidence interval coverage (illustrated in the sketch below)
  • “Tends”: allow for some possibility of bad performance, but with low probability
  • DGPs: we don’t love parametric assumptions
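A minimal sketch of how such metrics are tabulated in practice (my own illustration: the DGP is a normal location model, the estimator is the sample mean, and all names and values are made up, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta, n, n_reps = 1.0, 200, 2_000   # illustrative values

estimates = np.empty(n_reps)
covered = np.empty(n_reps, dtype=bool)
for r in range(n_reps):
    x = rng.normal(loc=true_theta, scale=2.0, size=n)       # toy DGP
    theta_hat = x.mean()                                     # estimator under study
    se = x.std(ddof=1) / np.sqrt(n)                          # its standard error
    covered[r] = abs(theta_hat - true_theta) <= 1.96 * se    # does the 95% CI cover the truth?
    estimates[r] = theta_hat

bias = estimates.mean() - true_theta                  # metric: bias
rmse = np.sqrt(np.mean((estimates - true_theta)**2))  # metric: root mean squared error
coverage = covered.mean()                             # metric: CI coverage (should be near 0.95)
print(f"bias = {bias:+.4f}, RMSE = {rmse:.4f}, coverage = {coverage:.3f}")
```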

Theoretical and Empirical Evaluation

Evaluation Methods


Three main methods, in decreasing order of strength:

  1. Finite sample theoretical guarantees
  2. (Good) Monte Carlo simulations
  3. Empirical validation

Theoretical Bounds

Nonparametric finite sample bounds — best possible results

Usually possible for specific scenarios (bounded variables, bounded dimensions). Examples: the finite-sample results collected in Wainwright (2019) and Shalev-Shwartz and Ben-David (2014)

Limitations of Theoretical Bounds

Key challenge:

In many problems it is impossible to obtain useful guarantees


Problematic scenarios:

  • Highly-structured DGPs (dependence or structured outputs)
  • Nonlinear and multistep algorithms
  • Settings without natural known bounds on variables

Empirical Evaluation

Other end of spectrum:

Testing methods on real datasets of interest


Examples: benchmark datasets and suites such as ImageNet (Deng et al. 2009), GLUE (Wang et al. 2019), and the OpenML benchmarking suites (Bischl et al. 2021)

Limitations of Evaluation on Real Data

Limited to scenarios where the ground truth is known: works for prediction, but not for causal inference


Other issues:

  • Eventually invalid: comparing many algorithms on the same test set amounts to implicit training on the test set
  • Cost: real-life data is not easy to obtain
  • Limited scope: results only informative for the dataset used and similar data

Simulations

Place of Simulations

Simulations: check every aspect of performance in a “lab” setting with many synthetic datasets

  • Lie somewhere between generic theory and specific real test datasets
  • Using different DGPs — poor person’s version of generic theory, gives confidence for some scenarios
  • Synthetic data \(\Rightarrow\) full knowledge of target quantities \(\Rightarrow\) can evaluate both causal and predictive methods

Simulations as Evaluation Tools

Simulations allow answering “what if” questions, e.g.:

  • Does this estimator actually work when tail conditions hold?
  • Does this inference method suffer size distortions when identification is irregular?
  • Is there a big efficiency loss when using a more general estimator?

Limitations of Simulations

  • Only as good as the DGPs used
  • Computationally expensive — challenging with algorithms that take a long time to train
  • Limited scope:
    • Mostly useful with numerical data
    • Reason: not clear how to write DGPs for image, text, etc.

Other Uses: Motivating Tool

An easy, clear simulation is a good way of motivating a problem

Example: figure from intro of Chernozhukov et al. (2018) — danger of not using Neyman orthogonalization (left panel)

Monte Carlo vs. Other Kinds of Simulations

Our focus: Monte Carlo simulations

MC: drawing many random datasets and tabulating performance across these datasets


Not the only kind of simulations. Contrast with

  • Deterministic simulations
  • Synthetic data generation for training data augmentation

Principles of Good Simulations

Characteristics of Good Simulations

The Three Key Characteristics


Good simulations are

  • Realistic
  • Reproducible
  • Targeted

Realism

DGPs should mimic essential real-world features without excess complexity


Intuitively:

  • Simulations are like crash-testing cars in a lab vs. on real roads.
  • Lab crashes must be similar to real ones to be informative
  • But don’t need to replicate every single aspect of roads

Reproducibility

Simulations should be reproducible exactly


Steps to achieve:

  • Set random seeds (see the sketch after this list)
  • Share code and give replication instructions
  • Describe exactly the environment used (in plain text or via Docker)
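A minimal sketch of the seed and environment points (assuming NumPy; the seed value is arbitrary and only for illustration):

```python
import sys
import numpy as np

SEED = 12345                               # fixed seed makes every rerun identical
draws_1 = np.random.default_rng(SEED).normal(size=5)
draws_2 = np.random.default_rng(SEED).normal(size=5)
assert np.allclose(draws_1, draws_2)       # re-seeding reproduces the exact same data

# Record the environment alongside the results (plain-text alternative to a Docker image)
print(f"Python {sys.version.split()[0]}, NumPy {np.__version__}, seed {SEED}")
```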

Targeted Simulations

Simulation DGPs should reflect the property of interest


Examples:

  • When evaluating IV-related methods, use DGPs that vary in instrument strength/number of moment conditions (see the sketch after this list)
  • When evaluating inference on extreme quantiles, use DGPs that vary in tail properties
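To illustrate the IV example, the sketch below parameterizes a toy linear IV DGP by a first-stage coefficient `pi` that controls instrument strength (my own construction; all numbers are arbitrary):

```python
import numpy as np

def draw_iv_data(n, pi, rng):
    """Toy linear IV DGP: pi controls instrument strength (pi near 0 = weak instrument)."""
    z = rng.normal(size=n)                                   # instrument
    errors = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=n)
    u, v = errors[:, 0], errors[:, 1]                        # correlated structural and first-stage errors
    x = pi * z + v                                           # first stage
    y = 1.0 * x + u                                          # structural equation, true beta = 1
    return y, x, z

rng = np.random.default_rng(0)
for pi in (0.05, 0.5, 2.0):                                  # weak to strong instruments
    y, x, z = draw_iv_data(500, pi, rng)
    beta_iv = (z @ y) / (z @ x)                              # just-identified IV estimator
    print(f"pi = {pi:4.2f}: IV estimate = {beta_iv:6.3f} (truth is 1)")
```

Sweeping `pi` over a grid is what makes the simulation targeted: the same estimator is stress-tested precisely along the dimension (instrument strength) that matters for the question.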

Anatomy of a Simulation

Common Structure: The Three Steps

[Flowchart: identify research question and metric of interest → simulation loop (draw data → apply method → compare estimates to ground truth, repeat) → summarize results across datasets]

  1. Choose what you care about (e.g. bias of several estimators)
  2. Run simulation loop:
    • Draw a dataset from given DGP
    • Apply methods of interest
    • Compare the obtained estimates to the ground truth
  3. Summarize results: compute averages, tabulate distributions, etc. (see the sketch below)
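Putting the three steps together, here is a minimal end-to-end sketch (my own toy example: the metric is the bias and RMSE of the sample mean versus the sample median for a normal location parameter):

```python
import numpy as np

# Step 1: what we care about (bias/RMSE of two location estimators) and the design
rng = np.random.default_rng(2024)
true_mu, n, n_reps = 0.5, 100, 5_000
errors = {"mean": np.empty(n_reps), "median": np.empty(n_reps)}

# Step 2: simulation loop
for r in range(n_reps):
    data = rng.normal(loc=true_mu, scale=1.0, size=n)   # draw a dataset from the DGP
    errors["mean"][r] = data.mean() - true_mu           # apply method, compare to ground truth
    errors["median"][r] = np.median(data) - true_mu

# Step 3: summarize results across datasets
for name, err in errors.items():
    print(f"{name:>6}: bias = {err.mean():+.4f}, RMSE = {np.sqrt(np.mean(err**2)):.4f}")
```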

Recap and Conclusions

Recap


In this lecture we

  1. Discussed ways to evaluate statistical methods
  2. Formulated basic principles of good simulations
  3. Looked at core simulation anatomy

Next Questions


We now have an idea of the what and the why; the next question is how:

  • How to implement the steps in code?
  • How to choose DGPs?
  • How to approach different statistical scenarios?
  • How to improve reproducibility?

References

Bischl, Bernd, Giuseppe Casalicchio, Matthias Feurer, Pieter Gijsbers, Frank Hutter, Michel Lang, Rafael G. Mantovani, Jan N. van Rijn, and Joaquin Vanschoren. 2021. “OpenML Benchmarking Suites.” arXiv. https://doi.org/10.48550/arXiv.1708.03731.
Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2018. “Double/Debiased Machine Learning for Treatment and Structural Parameters.” The Econometrics Journal 21 (1): C1–68. https://doi.org/10.1111/ectj.12097.
Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. “ImageNet: A Large-Scale Hierarchical Image Database.” In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–55. Miami, FL: IEEE. https://doi.org/10.1109/CVPR.2009.5206848.
Shalev-Shwartz, Shai, and Shai Ben-David. 2014. Understanding Machine Learning. 1st ed. West Nyack: Cambridge University Press.
Wainwright, Martin J. 2019. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. https://doi.org/10.1017/9781108627771.
Wang, Alex, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.” arXiv. https://doi.org/10.48550/arXiv.1804.07461.