Good Simulation Code IV: Orchestration

Data Classes for Scenarios. Running Many Simulations

Vladislav Morozov

Introduction

Lecture Info

Learning Outcomes

This lecture is about making our code execute many simulation scenarios


By the end, you should be able to

  • Capture simulation scenarios in data classes
  • Automatically and manually build collections of simulation scenarios
  • Implement a simulation orchestrator class for running many simulations

References


Programming:

  • Chapter 5 in Ramalho (2022) about data classes
  • Chapter 39 in Lutz (2025) on decorators
  • Chapter 6 in Lau (2023) for a refresher on pandas

Reminder: Previous Simulation Setting

Reminder: Study Bias of Penalized SSR Estimators

Talking about bias of different penalized SSR-based estimators in a simple linear model:

\[ \small Y_{t} = \beta_0 + \beta_1 X_{t} + U_t, \quad t=1, \dots, T \]

Estimators for \(\beta_1\) minimize penalized SSR of the form \[ \small (\hat{\beta}_0, \hat{\beta}_1) = \operatorname*{arg\,min}_{(b_0, b_1)} \sum_{t=1}^T (Y_t - b_0 - b_1 X_t)^2 + \lambda \mathcal{P}(b_0, b_1) \] where \(\mathcal{P}(\cdot)\) is the penalty (0 for OLS, \(L^1\) for Lasso, \(L^2\) for ridge)

Reminder: File Structure

  • Already implemented some DGPs, estimators, and a simulation runner
  • Figured out a basic file structure
project/
├── dgps/
│   ├── __init__.py
│   ├── static.py       # StaticNormalDGP
│   └── dynamic.py      # DynamicNormalDGP
├── estimators/
│   ├── __init__.py
│   └── ols_like.py     # SimpleOLS, SimpleRidge, LassoWrapper
├── main.py             # Main script that we call from the CLI
├── protocols.py        # DGPProtocol, EstimatorProtocol
└── runner.py           # SimulationRunner

Problem Statement

Reminder: main.py with Only One Scenario

main.py
from dgps.dynamic import DynamicNormalDGP
from estimators.ols_like import LassoWrapper
from runner import SimulationRunner

if __name__ == "__main__": 
    dgp = DynamicNormalDGP(beta0=0.0, beta1=0.95)
    estimator = LassoWrapper(reg_param=0.04)
    n_obs = 50

    # Run simulation for specified scenario
    runner = SimulationRunner(dgp, estimator)
    runner.simulate(n_sim=1000, n_obs=n_obs, first_seed=1)

    # Print results
    print(
        f"Bias for {dgp.__class__.__name__} + {estimator.__class__.__name__}: "
    )
    runner.summarize_bias()

Issue: How To Run Many Scenarios?

Key challenge:

How do we run many scenarios automatically?

A problem of orchestration: coordinating multiple tasks

  • Automatically
  • As single workflow done in the correct order

Goal: being able to focus on results

Why Not Hardcode?

One way: add all combos (DGPs, estimators, sample sizes) manually in main.py, create SimulationRunner for each one

Not a very good approach

  • Brittle: have to edit main.py for every change
  • Prone to errors
  • Repeats code (e.g. SimulationRunner creation)
  • Breaks separation of concerns: job of the main script is not to say which scenarios you want today

Questions to Answer Today

  • How do we capture what a scenario is?
  • How do we execute all these scenarios?
  • What do we do with the outputs?


  • First: just executing simulations and printing results to the console as before
  • Second: basics of dealing with outputs

Expressing Simulation Scenarios

SimulationScenario Class

What’s A Simulation Scenario

“Scenario” — collection of characteristics that uniquely define a setting for SimulationRunner


Our case has three characteristics:

  • DGP
  • Estimator
  • Sample size

How To Encode Scenarios?

  • Implicitly (e.g. hardcoded directly in main.py)
  • Explicitly in an object with suitable info:
    • Dictionaries
    • Named tuples (through collections.namedtuple() or typing.NamedTuple; sketched below)
    • @dataclasses.dataclass

Generally good practice to be explicit (know if something goes wrong; clearer code)
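
For comparison, a minimal sketch of what a named-tuple encoding could look like (the class name ScenarioTuple is hypothetical; below we use a data class instead):

from typing import NamedTuple

from protocols import DGPProtocol, EstimatorProtocol


class ScenarioTuple(NamedTuple):
    """Hypothetical named-tuple encoding of a minimal scenario."""
    dgp: type[DGPProtocol]
    estimator: type[EstimatorProtocol]
    sample_size: int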

Reminder About Data Classes

  • “Data class” — class that’s just a collection of fields with little extra functionality
  • Here: use @dataclasses.dataclass like in the EPP class (but be aware of other simpler options)

To create: @dataclass and attributes with types

from dataclasses import dataclass

@dataclass(frozen=True)
class SimulationScenario:       # Simple example
    dgp: type[DGPProtocol]
    estimator: type[EstimatorProtocol]
    sample_size: int

SimulationScenario Data Class Definition

@dataclass(frozen=True)
class SimulationScenario:
    """A single simulation scenario: DGP, estimator, and sample size."""
    name: str               # For readability  
    dgp: type[DGPProtocol] 
    dgp_params: dict        # E.g. betas go here
    estimator: type[EstimatorProtocol]
    estimator_params: dict  # E.g. reg_params go here
    sample_size: int
    n_simulations: int = 1000
    first_seed: int = 1
  • Self-documenting
  • A dataclass comes with __init__, a nice __eq__ and other useful methods practically for free

Example SimulationScenario

Can now define example instance:

example_scenario = SimulationScenario(
    name="static_ols_T50",
    dgp=StaticNormalDGP,
    dgp_params={"beta0": 0.0, "beta1": 0.5},
    estimator=SimpleOLS,
    estimator_params={},
    sample_size=50, 
)

Using SimulationScenario with SimulationRunner

# Initialize the scenario
dgp = example_scenario.dgp(**example_scenario.dgp_params)
estimator = example_scenario.estimator(**example_scenario.estimator_params)

# Run the simulation
runner = SimulationRunner(dgp, estimator)
runner.simulate(
    n_sim=example_scenario.n_simulations, 
    n_obs=example_scenario.sample_size, 
    first_seed=example_scenario.first_seed,
)
# Print results
print(f"Bias for {example_scenario.name}: {runner.errors.mean():.4f}")
Bias for static_ols_T50: -0.0027

example_scenario contains all the information necessary for SimulationRunner

Collections of Scenarios

Two Ways To Create Many Scenarios

  1. Manually: a file that explicitly specifies the desired scenarios
  2. Automatically (e.g. as a Cartesian product of list of DGPs, sample sizes, estimators)

Choice depends on your goal:

  • Specific ones (beware of missing a desired combination)
  • All combinations (beware of exponential growth)

Where To Store Scenarios

A couple of options:

  • In Python code (e.g. a list of scenarios)
  • In an external config file (e.g. a YAML config; sketched below)


For now: a Python list coming from a scenarios.py file is fine for us
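
For reference, a minimal sketch of the YAML route (the file layout, keys, and registry are hypothetical assumptions, and PyYAML is assumed to be installed); we stick with the Python list here:

# Hypothetical scenarios.yaml:
#   - name: static_ols_T50
#     dgp: StaticNormalDGP
#     dgp_params: {beta0: 0.0, beta1: 0.5}
#     estimator: SimpleOLS
#     estimator_params: {}
#     sample_size: 50
import yaml

from dgps.static import StaticNormalDGP
from estimators.ols_like import SimpleOLS

# Map the names used in the config file to the actual classes
CLASS_REGISTRY = {"StaticNormalDGP": StaticNormalDGP, "SimpleOLS": SimpleOLS}

with open("scenarios.yaml") as file:
    entries = yaml.safe_load(file)

# SimulationScenario as defined above
scenarios = [
    SimulationScenario(
        name=entry["name"],
        dgp=CLASS_REGISTRY[entry["dgp"]],
        dgp_params=entry["dgp_params"],
        estimator=CLASS_REGISTRY[entry["estimator"]],
        estimator_params=entry["estimator_params"],
        sample_size=entry["sample_size"],
    )
    for entry in entries
]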

Manual Example: List of Scenarios

scenarios.py
from dgps.static import StaticNormalDGP
from dgps.dynamic import DynamicNormalDGP
from estimators.ols_like import SimpleOLS, LassoWrapper

# SimulationScenario is defined earlier in this file
scenarios = [
    SimulationScenario(
        name="static_ols_T50",
        dgp=StaticNormalDGP,
        dgp_params={"beta0": 0.0, "beta1": 0.5},
        estimator=SimpleOLS,
        estimator_params={},
        sample_size=50, 
    ),
    SimulationScenario(
        name="dynamic_lasso_T200",
        dgp=DynamicNormalDGP,
        dgp_params={"beta0": 0.0, "beta1": 0.95},
        estimator=LassoWrapper,
        estimator_params={"reg_param": 0.1},
        sample_size=200, 
    )
]

Creating All Possible Combinations

Other extreme: all possible combinations of scenario characteristics


Creation steps:

  1. Create lists/sets of DGPs, estimators, sample sizes
  2. Take Cartesian product
  3. Store results in a list

scenarios.py With All Possible Combinations:

scenarios.py
from itertools import product

from dgps.static import StaticNormalDGP
from dgps.dynamic import DynamicNormalDGP
from estimators.ols_like import SimpleOLS, SimpleRidge, LassoWrapper

# Define lists of components
dgps = [
    (StaticNormalDGP, {"beta0": 0.0, "beta1": 1.0}, 'static'),
    (DynamicNormalDGP, {"beta0": 0.0, "beta1": 0.0}, 'dynamic_low_pers'),
    (DynamicNormalDGP, {"beta0": 0.0, "beta1": 0.5}, 'dynamic_mid_pers'),
    (DynamicNormalDGP, {"beta0": 0.0, "beta1": 0.95}, 'dynamic_high_pers'),
]
estimators = [
    (SimpleOLS, {}),
    (LassoWrapper, {"reg_param": 0.1}),
    (SimpleRidge, {"reg_param": 0.1})
]
sample_sizes = [50, 200]

# Generate all combinations
scenarios = [
    SimulationScenario(
        name=f"{dgp_class.__name__.lower()}_{dgp_descr}_{estimator_class.__name__.lower()}_T{size}",
        dgp=dgp_class,
        dgp_params=dgp_params,
        estimator=estimator_class,
        estimator_params=estimator_params,
        sample_size=size, 
    )
    for (dgp_class, dgp_params, dgp_descr), (estimator_class, estimator_params), size
    in product(dgps, estimators, sample_sizes)
]

Our Choice

Here: choose automatic approach

len(scenarios)
24

Would be annoying to write all these by hand


Note:

In reality, often some hybrid: manually create a set of (DGP, estimator) pairs, then take the product with sample sizes only (rather than forming all possible DGP-estimator combinations); a sketch follows below
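
A minimal sketch of such a hybrid scenarios.py (the particular pairs chosen here are purely illustrative):

from itertools import product

# Hand-picked (DGP, DGP params, estimator, estimator params) combinations
dgp_estimator_pairs = [
    (StaticNormalDGP, {"beta0": 0.0, "beta1": 1.0}, SimpleOLS, {}),
    (DynamicNormalDGP, {"beta0": 0.0, "beta1": 0.95}, LassoWrapper, {"reg_param": 0.1}),
]
sample_sizes = [50, 200]

# Product only over the hand-picked pairs and the sample sizes
scenarios = [
    SimulationScenario(
        name=f"{dgp_cls.__name__.lower()}_{est_cls.__name__.lower()}_T{size}",
        dgp=dgp_cls,
        dgp_params=dgp_params,
        estimator=est_cls,
        estimator_params=est_params,
        sample_size=size,
    )
    for (dgp_cls, dgp_params, est_cls, est_params), size
    in product(dgp_estimator_pairs, sample_sizes)
]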

Resulting File Structure

Now have added a new scenarios.py file to our folder:

project/
├── dgps/
│   ├── __init__.py
│   ├── static.py
│   └── dynamic.py       
├── estimators/
│   ├── __init__.py
│   └── ols_like.py       
├── protocols.py
├── runner.py
├── scenarios.py       # New: Defines SimulationScenario and scenarios list
└── main.py

Running Many Scenarios. SimulationOrchestrator Class

What’s Left?

So far:

  • All the simulation infrastructure (runner, DGPs, estimators)
  • Scenario list


Goal: want all scenarios executed when we run

python main.py

What Should It Do?

A simple simulation orchestrator:

  • Should ingest list of scenarios
  • Run all the scenarios:
    • For each scenario, create a SimulationRunner
    • simulate()
  • Do something with the results

More advanced: can parallelize/distribute computation, etc.
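
For illustration, a minimal sketch (one possible approach, not the lecture's implementation) of parallelizing the scenario loop with the standard-library concurrent.futures:

from concurrent.futures import ProcessPoolExecutor

from runner import SimulationRunner
from scenarios import SimulationScenario, scenarios


def run_one(scenario: SimulationScenario) -> tuple[str, float]:
    """Run a single scenario and return its name and mean bias."""
    dgp = scenario.dgp(**scenario.dgp_params)
    estimator = scenario.estimator(**scenario.estimator_params)
    runner = SimulationRunner(dgp, estimator)
    runner.simulate(
        n_sim=scenario.n_simulations,
        n_obs=scenario.sample_size,
        first_seed=scenario.first_seed,
    )
    return scenario.name, runner.errors.mean()


if __name__ == "__main__":
    # Each scenario runs in its own worker process
    with ProcessPoolExecutor() as executor:
        for name, bias in executor.map(run_one, scenarios):
            print(f"Bias for {name}: {bias:.4f}")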

What Should Our main.py Look Like?

main.py
from orchestrator import SimulationOrchestrator
from scenarios import scenarios          # Get scenarios

if __name__ == "__main__":
    # Create and execute simulations: run them all
    orchestrator = SimulationOrchestrator(scenarios)
    orchestrator.run_all()

    # Results logic: do something with the results
    ...

Simple SimulationOrchestrator Class Definition

orchestrator.py
from runner import SimulationRunner
from scenarios import SimulationScenario


class SimulationOrchestrator:
    """Simple simulation orchestration class without any result handling."""

    def __init__(self, scenarios: list[SimulationScenario]):
        self.scenarios = scenarios

    def run_all(self):
        for scenario in self.scenarios:
            # Create DGP and estimator
            dgp = scenario.dgp(**scenario.dgp_params)
            estimator = scenario.estimator(**scenario.estimator_params)

            # Run the simulation
            runner = SimulationRunner(dgp, estimator)
            runner.simulate(
                n_sim=scenario.n_simulations, 
                n_obs=scenario.sample_size, 
                first_seed=scenario.first_seed,
            )
            # Print results
            print(f"Bias for {scenario.name}: {runner.errors.mean():.4f}")

Executing main.py

Executing the script now prints the results for all scenarios!

python main.py
Bias for staticnormaldgp_static_simpleols_T50: -0.0027
Bias for staticnormaldgp_static_simpleols_T200: -0.0028
Bias for staticnormaldgp_static_lassowrapper_T50: -0.1108
Bias for staticnormaldgp_static_lassowrapper_T200: -0.1048
Bias for staticnormaldgp_static_simpleridge_T50: -0.0048
Bias for staticnormaldgp_static_simpleridge_T200: -0.0033
Bias for dynamicnormaldgp_low_pers_simpleols_T50: -0.0261
Bias for dynamicnormaldgp_low_pers_simpleols_T200: -0.0029
Bias for dynamicnormaldgp_low_pers_lassowrapper_T50: -0.0655
Bias for dynamicnormaldgp_low_pers_lassowrapper_T200: -0.0715
Bias for dynamicnormaldgp_low_pers_simpleridge_T50: -0.0262
Bias for dynamicnormaldgp_low_pers_simpleridge_T200: -0.0029
Bias for dynamicnormaldgp_mid_pers_simpleols_T50: -0.0509
Bias for dynamicnormaldgp_mid_pers_simpleols_T200: -0.0103
Bias for dynamicnormaldgp_mid_pers_lassowrapper_T50: -0.1374
Bias for dynamicnormaldgp_mid_pers_lassowrapper_T200: -0.0880
Bias for dynamicnormaldgp_mid_pers_simpleridge_T50: -0.0515
Bias for dynamicnormaldgp_mid_pers_simpleridge_T200: -0.0105
Bias for dynamicnormaldgp_high_pers_simpleols_T50: -0.0992
Bias for dynamicnormaldgp_high_pers_simpleols_T200: -0.0204
Bias for dynamicnormaldgp_high_pers_lassowrapper_T50: -0.1314
Bias for dynamicnormaldgp_high_pers_lassowrapper_T200: -0.0347
Bias for dynamicnormaldgp_high_pers_simpleridge_T50: -0.0993
Bias for dynamicnormaldgp_high_pers_simpleridge_T200: -0.0204

Discussion

A total victory:

  • Clean, well-focused files and implementation
  • Automatic collection and construction of scenarios
  • Full execution of all simulations


Can talk about further improvements, handling results, but we have an extensible and broadly-applicable core

Simulation Outputs

More On Handling Results

So far: just printing bias values to the console

More typical: export results in some nice tabular/text form

  • Often: whole raw simulation results, particularly in simulations where individual runs are expensive (export sketched below)
  • Also: summaries
    • Summary tables
    • Plots
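
For instance, a minimal sketch (one possibility, assuming as elsewhere in this lecture that runner.errors holds the array of per-run estimation errors) of exporting raw results plus a summary table:

import pandas as pd

from runner import SimulationRunner
from scenarios import scenarios

# Collect raw per-run errors for every scenario
raw_results = {}
for scenario in scenarios:
    dgp = scenario.dgp(**scenario.dgp_params)
    estimator = scenario.estimator(**scenario.estimator_params)
    runner = SimulationRunner(dgp, estimator)
    runner.simulate(
        n_sim=scenario.n_simulations,
        n_obs=scenario.sample_size,
        first_seed=scenario.first_seed,
    )
    raw_results[scenario.name] = runner.errors

# Raw results: one column per scenario, one row per simulation draw
raw_df = pd.DataFrame(raw_results)
raw_df.to_csv("raw_errors.csv", index=False)

# Summary table: bias and spread per scenario
raw_df.agg(["mean", "std"]).T.to_csv("summary.csv")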

Here: Simple Example


Here: just a brief example

  • Store summary bias results on the orchestrator
  • Somehow export them from main.py

Updated SimulationOrchestrator Class

class SimulationOrchestrator:
    """Simulation orchestrator that stores summary results in a dictionary."""

    def __init__(self, scenarios: list[SimulationScenario]):
        self.scenarios = scenarios
        self.summary_results = {}

    def run_all(self):
        for scenario in self.scenarios:
            # Create DGP and estimator
            dgp = scenario.dgp(**scenario.dgp_params)
            estimator = scenario.estimator(**scenario.estimator_params)

            # Run the simulation
            runner = SimulationRunner(dgp, estimator)
            runner.simulate(
                n_sim=scenario.n_simulations, 
                n_obs=scenario.sample_size, 
                first_seed=scenario.first_seed,
            )
            # Save results
            self.summary_results[scenario.name] = runner.errors.mean()

Results Handling Discussion

  • Here: took a quicker solution: the SimulationOrchestrator implementation knows that it is handling bias
  • But can make it more loosely coupled (sketched below):
    • Add a summarize() method to SimulationRunner that knows what to export
    • The orchestrator will just receive whatever summarize() gives
    • Would make the orchestrator even more reusable
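
A minimal sketch of that looser design (the summarize() method and its return format are hypothetical additions, not part of the lecture code):

class SimulationRunner:
    # ... existing __init__() and simulate() unchanged ...

    def summarize(self) -> dict[str, float]:
        """Return whatever summary statistics this runner deems relevant."""
        return {"bias": self.errors.mean(), "std": self.errors.std()}


class SimulationOrchestrator:
    def __init__(self, scenarios: list[SimulationScenario]):
        self.scenarios = scenarios
        self.summary_results = {}

    def run_all(self):
        for scenario in self.scenarios:
            dgp = scenario.dgp(**scenario.dgp_params)
            estimator = scenario.estimator(**scenario.estimator_params)
            runner = SimulationRunner(dgp, estimator)
            runner.simulate(
                n_sim=scenario.n_simulations,
                n_obs=scenario.sample_size,
                first_seed=scenario.first_seed,
            )
            # The orchestrator no longer knows what the summary contains
            self.summary_results[scenario.name] = runner.summarize()

With this design the orchestrator works unchanged for whatever summary the runner chooses to report.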

Changing main.py

main.py
import pandas as pd

from orchestrator import SimulationOrchestrator 
from scenarios import scenarios                                             

if __name__ == "__main__":
    # Create and execute simulations 
    orchestrator = SimulationOrchestrator(scenarios)                         
    orchestrator.run_all()

    # Results logic (print or export as pd.Series)
    print(pd.Series(orchestrator.summary_results))

Here for simplicity: print the Series, but in general you would export it with to_csv()

Executing Results

Executing the script now prints the results for all scenarios!

python main.py
staticnormaldgp_static_simpleols_T50           -0.002680
staticnormaldgp_static_simpleols_T200          -0.002753
staticnormaldgp_static_lassowrapper_T50        -0.110823
staticnormaldgp_static_lassowrapper_T200       -0.104789
staticnormaldgp_static_simpleridge_T50         -0.004828
staticnormaldgp_static_simpleridge_T200        -0.003261
dynamicnormaldgp_low_pers_simpleols_T50        -0.026099
dynamicnormaldgp_low_pers_simpleols_T200       -0.002854
dynamicnormaldgp_low_pers_lassowrapper_T50     -0.065491
dynamicnormaldgp_low_pers_lassowrapper_T200    -0.071531
dynamicnormaldgp_low_pers_simpleridge_T50      -0.026203
dynamicnormaldgp_low_pers_simpleridge_T200     -0.002900
dynamicnormaldgp_mid_pers_simpleols_T50        -0.050864
dynamicnormaldgp_mid_pers_simpleols_T200       -0.010286
dynamicnormaldgp_mid_pers_lassowrapper_T50     -0.137379
dynamicnormaldgp_mid_pers_lassowrapper_T200    -0.088010
dynamicnormaldgp_mid_pers_simpleridge_T50      -0.051537
dynamicnormaldgp_mid_pers_simpleridge_T200     -0.010470
dynamicnormaldgp_high_pers_simpleols_T50       -0.099214
dynamicnormaldgp_high_pers_simpleols_T200      -0.020369
dynamicnormaldgp_high_pers_lassowrapper_T50    -0.131433
dynamicnormaldgp_high_pers_lassowrapper_T200   -0.034744
dynamicnormaldgp_high_pers_simpleridge_T50     -0.099312
dynamicnormaldgp_high_pers_simpleridge_T200    -0.020425
dtype: float64

Recap and Conclusions

Recap


In this lecture we

  • Discussed how to specify a simulation scenario
  • Talked about how to construct a list of many scenarios
  • Implemented a simple orchestrator that pulls together scenarios and executes them

Further Improvements

Can keep adding things to code:

  • Logging and progress tracking (sketched below)
  • Improve robustness of code by adding error handling (also sketched below)
  • Parallelize to take advantage of multiple cores
  • Custom output handler classes
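
For example, a minimal sketch (one possible approach; the logging setup and failure policy are assumptions) of adding progress logging and basic error handling to run_all():

import logging

from runner import SimulationRunner
from scenarios import SimulationScenario

logger = logging.getLogger(__name__)


class SimulationOrchestrator:
    def __init__(self, scenarios: list[SimulationScenario]):
        self.scenarios = scenarios
        self.summary_results = {}

    def run_all(self):
        for i, scenario in enumerate(self.scenarios, start=1):
            logger.info("Scenario %d/%d: %s", i, len(self.scenarios), scenario.name)
            try:
                dgp = scenario.dgp(**scenario.dgp_params)
                estimator = scenario.estimator(**scenario.estimator_params)
                runner = SimulationRunner(dgp, estimator)
                runner.simulate(
                    n_sim=scenario.n_simulations,
                    n_obs=scenario.sample_size,
                    first_seed=scenario.first_seed,
                )
                self.summary_results[scenario.name] = runner.errors.mean()
            except Exception:
                # Log the failure and keep the remaining scenarios running
                logger.exception("Scenario %s failed", scenario.name)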

Project could also benefit from more reproducibility:

  • Getting the right environment for reproducibility?
  • Not having to rerun all the simulations every time?

Block Recap

This block: structuring and thinking about simulation code

Overall design:

  • Starting simple with functions
  • Going modular for more complex scenarios

Quality of life features:

  • Scenario builders
  • Orchestrator

References

Lau, Sam. 2023. Learning Data Science. 1st ed. Sebastopol: O’Reilly Media, Incorporated.
Lutz, Mark. 2025. Learning Python: Powerful Object-Oriented Programming. Sixth edition. Santa Rosa, CA: O’Reilly.
Ramalho, Luciano. 2022. Fluent Python: Clear, Concise, and Effective Programming. 2nd edition. Sebastopol, California: O’Reilly Media, Inc.