Good Simulation Code II: Modular Approach

A Flexible and Extensible Design

Vladislav Morozov

Introduction

Lecture Info

Learning Outcomes

This lecture is about a modular approach to designing simulation code

By the end, you should be able to

Refactor code into modular reusable classes
Compose DGPs and estimators into simulations
Understand how modularity leads to clearer, more extensible code

References

Statistics:

Chapter 4 of Hansen (2022) for OLS
Chapter 14 of Hansen (2022) for time series basics

Programming

Chapter 2-3 of Hillard (2020) on basics of design
Chapter 26-27 of Lutz (2025) on OOP in Python

Reminder: Simulation Setting

Reminder: Studying OLS Bias

This time: continue studying bias of OLS estimator in simple linear model:

\[ Y_{t} = \beta_0 + \beta_1 X + U_t, \quad t=1, \dots, T \]

Metric of interest: absolute bias

\[ \text{Bias}(\hat{\beta}_1) = \E[\hat{\beta}_1] - \beta_1 \]

Reminder: Static and Dynamic DGPs

Static:

\[ \small Y_{t} = \beta_0 + 0.5 X_{t} + U_{t} \]

Covariate \(X_t\) independent from \(U_t\) and over time

Dynamic:

\[ \small Y_{t} = \beta_0 + \beta_1 Y_{t-1} + U_{t} \]

Dependence: \(\beta_1\) value

\[ \small \beta_1 \in \curl{0, 0.5, 0.95} \]

One DGP per \(U_t, X_t, Y_0\): each \(N(0, 1)\)
\(T=50, 200\)

Reminder: Current State of Simulation

Last time: implemented things with a basic function

def simulate_ols(
    n_sim: int = 1000,
    n_obs: int = 50,
    beta0: float = 0.0,
    beta1: float = 0.5,
    seed: int = 1,
    dgp_type: str = "static",
) -> list[float]:
    """Simulates OLS estimation for static or AR(1) DGP.
    Args:
        n_sim (int): Number of simulations to run. Defaults to 1000.
        n_obs (int): Number of observations per simulation. Defaults to 100.
        beta0 (float): True intercept value. Defaults to 0.
        beta1 (float): True slope coefficient. Defaults to 0.5
        seed (int): random number generator seed. Defaults to 1.
        dgp_type (str): Type of DGP: "static" or "ar1". Defaults to "static".

    Returns:
        list[float]: List of errors of beta1 estimates for each simulation.
    """

    # Create container to store results
    results_ols_errors = []

    # InitializeRNG
    rng = np.random.default_rng(seed)

    # Core simulation loop
    for _ in range(n_sim):
        # 1. Draw data
        if dgp_type == "static":
            x = rng.normal(size=n_obs)
            u = rng.normal(size=n_obs)
            y = beta0 + beta1 * x + u
        elif dgp_type == "ar1":
            y = np.zeros(n_obs + 1)
            u = rng.normal(size=n_obs + 1)
            y[0] = beta0 + u[0]
            for t in range(1, n_obs + 1):
                y[t] = beta0 + beta1 * y[t - 1] + u[t]
            # Use the last n_obs elements of y as and the first n_obs as x
            x = y[:-1]  # x is y lagged (n_obs elements)
            y = y[1:]  # y[1:] is the dependent variable (n_obs elements)
        else:
            # Handle the case of giving the wrong DGP value
            raise ValueError("Invalid DGP choice")

        # 2. Run OLS estimation
        W = np.column_stack([np.ones(n_obs), x])
        beta_hat = np.linalg.inv(W.T @ W) @ W.T @ y

        # 3. Compute error
        results_ols_errors.append(beta_hat[1] - beta1)

    return results_ols_errors

Problem: How To Expand Simulations?

What if we want to expand our simulations?

More distributions (e.g. heavier tails for innovations, different \(X\))
Different models (not necessarily \(Y_t = \beta_0 + \beta_1 X_t + U_t\))

Function approach is difficult to expand: would have to keep adding components and selectors to simulate_ols()

This Lecture: Dealing With Complexity

Today: dealing with design

Untangling the different components of simulation
Learning a more modular structure
Understanding the why of it

Towards a Modular Design

Identifying Component Blocks in Code

A Step Back: What Are We Doing?

There are three code tasks:

Data generation: draw data from a DGP
Estimation: apply an estimator to the data
Orchestration: run the above many times and collect results

For now things are all jumbled up and hardcoded by simulate_ols() — monolithic design

Why Should The Simulation Loop Care?

The simulation loop shouldn’t need to know:

How data is generated (e.g., AR(1) vs. static)
How estimation works (e.g., OLS vs. IV)

It only needs to:

Get data from something (DGP)
Pass data to something (estimator)
Repeat and store results

Why Should the Estimator and DGP Care?

Same logic goes for DGP: it shouldn’t care

Whether it’s going to be used one time or many times in a loop
About how the estimator uses its data after

Likewise, the estimator:

Doesn’t need details of the loop
Doesn’t need to know how the y and x are created, just their form

Conclusion: Components

Overall: identified three logical units in code

These units should be more or less independent (loosely coupled)

E.g.:

Changing DGP details should not affect loop, if loop can still sample in the same way
Can also then swap or add more components easily

Some Design Concepts

Background: Separation of Concerns

Key in clear code: divide various behavious into small, manageable pieces

Best way to divide — by concern. This:

Reduces complexity (each part does one thing well)
Improves reusability (swap DGPs/estimators without rewriting everything)
Makes collaboration easier (different people can work on different parts)

Interfaces: Glue Between Components

Simulation runner only needs to know how to interface with DGP and estimator:

DGP interface: Must provide a method like sample(n_obs, seed)
Estimator interface: Must provide a method like fit(X, y)

As long as a DGP/estimator implement given interface, simulation runner can use it

Their implementation details do not matter to runner

Background: Interfaces

Interfaces are like contracts

DGP promises to give valid data if sample() is called
Estimator promises to give suitable coefficients if fit() is called

Benefit: You can add new DGPs/estimators without touching the simulation runner!

DGPs Encapsulate Their Logic

Each DGP encapsulates its own logic:

AR(1) DGP handles its own loops, initial conditions, etc.
Static DGP just draws IID data.

The outside world only sees the sample() method that returns a sample of X and y

Implementation

DGP and Estimator Classes

Approach: Classes

Now want implementation with these good properties (separation of concerns, loose coupling, encapsulation)

Will capture required logic with appropriate classes

Going OOP allows us to view DGPs, estimators, and simulations as objects:
- Each will have some appropriate data and some logic for doing something with that data
- Different objects can interact using that logic
Simulation runner can compose DGPs and estimators

What Attributes and Methods Do Our Classes Need?

To define a class need to choose what data and functions its instances contain

DGPs need to:
- Know their true beta1 (e.g. for bias computation)
- Offer samples with given number of points from DGP-specific distribution
Estimator needs to:
- Compute beta1_hat based on data
- Remember computed beta1_hat

Example: Static DGP Class

class StaticNormalDGP:
    """A data-generating process (DGP) for a static linear model

    Attributes:
        beta0 (float): Intercept term.
        beta1 (float): Slope coefficient.
    """

    def __init__(self, beta0: float = 0.0, beta1: float = 0.5) -> None:
        """Initializes the DGP with intercept and slope.

        Args:
            beta0 (float): Intercept term. Defaults to 0.0.
            beta1 (float): Slope coefficient. Defaults to 1.0.
        """
        self.beta0: float = beta0
        self.beta1: float = beta1

    def sample(
        self, n_obs: int, seed: int | None = None
    ) -> tuple[np.ndarray, np.ndarray]:
        """Samples data from the static DGP.

        Args:
            n_obs (int): Number of observations to sample.
            seed (int, optional): Random seed for reproducibility. Defaults to None.

        Returns:
            tuple: (x, y) arrays, each of length n_obs.
        """
        rng = np.random.default_rng(seed)
        x = rng.normal(size=n_obs)
        u = rng.normal(size=n_obs)
        y = self.beta0 + self.beta1 * x + u
        return x, y

Discussion of DGP Class

We use a class to:

Capture a specific concern (data drawing)
Specify interfaces (a sample() method)
Encapsulate logic (class keeps its DGP details in sample())

Example usage

static_sampler = StaticNormalDGP(beta1=0.4)
x, y = static_sampler.sample(n_obs=5, seed=1)
print(f"x: {x.round(2)} \ny: {y.round(2)}")

x: [ 0.35  0.82  0.33 -1.3   0.91] 
y: [ 0.58 -0.21  0.71 -0.16  0.66]

Example: AR(1) Class

A class for AR(1) with same sample() interface

class DynamicNormalDGP:
    """A data-generating process (DGP) for a dynamic linear model: y_t = beta0 + beta1*y_{t-1} + u_t.

    Attributes:
        beta0 (float): Intercept term.
        beta1 (float): AR(1) coefficient.
    """

    def __init__(self, beta0: float = 0.0, beta1: float = 0.5):
        """Initializes the DGP with intercept and AR(1) coefficient.

        Args:
            beta0 (float): Intercept term. Defaults to 0.0.
            beta1 (float): AR(1) coefficient. Defaults to 0.5.
        """
        self.beta0: float = beta0
        self.beta1: float = beta1

    def sample(
        self, n_obs: int, seed: int | None = None
    ) -> tuple[np.ndarray, np.ndarray]:
        """Samples data from the dynamic DGP.

        Args:
            n_obs (int): Number of observations to sample.
            seed (int, optional): Random seed for reproducibility. Defaults to None.

        Returns:
            tuple: (x, y) arrays, each of length n_obs.
                  x is y_{t-1} (lagged y), and y is y_t.
        """
        rng = np.random.default_rng(seed)
        y = np.zeros(n_obs + 1)  # Extra observation for lag
        u = rng.normal(size=n_obs + 1)
        y[0] = self.beta0 + u[0]  # Initial condition
        for t in range(1, n_obs + 1):
            y[t] = self.beta0 + self.beta1 * y[t - 1] + u[t]
        # Return lagged y as x and y[1:] as y
        return y[:-1], y[1:]

Discussion: DGP Classes

Now StaticNormalDGP and DynamicNormalDGP capture our DGPs
Important: can interact with them in the same way:
- Ask their true beta1 value
- sample() given number of poinst with given seed
In line with what simulation loop wants, all other logic kept inside

Same Idea: Simple Estimator Class

Hide estimator logic and only provide fit()

class SimpleOLS:
    """A simple OLS estimator for the linear model y = beta0 + beta1*x + u.

    Attributes:
        beta0_hat (float): Estimated intercept. NaN until fit is called.
        beta1_hat (float): Estimated slope. NaN until fit is called.
    """

    def __init__(self) -> None:
        """Initializes the OLS estimator with no estimates."""
        self.beta0_hat: float = np.nan
        self.beta1_hat: float = np.nan

    def fit(self, x: np.ndarray, y: np.ndarray) -> None:
        """Fit OLS to the provided data.

        Args:
            x (np.ndarray): Independent variable (1D array).
            y (np.ndarray): Dependent variable (1D array).
        """
 
        # Add constant to x
        X = np.column_stack([np.ones(len(x)), x])
        # OLS estimation
        beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
        self.beta0_hat, self.beta1_hat = beta_hat[0], beta_hat[1]

`SimpleOLS` In Action

To fit an instance of SimpleOLS, just pass data:

ols = SimpleOLS()
x, y = static_sampler.sample(n_obs=5000, seed=1)
ols.fit(x, y)
print(f"Estimated slope coefficient: {ols.beta1_hat:.3f}")

Estimated slope coefficient: 0.366

Now ready to put things together into simulation

Simulation Runner Class

Capturing Simulation Logic

Now missing final piece — the simulation loop:

Will create appropriate SimulationRunner class
SimulationRunner composes DGP and estimator (part of its data)
With data, will run Monte Carlo simulation for given number of datasets

Simulation Runner Implementation

class SimulationRunner:
    """Runs Monte Carlo simulations for a given DGP and estimator.

    Attributes:
        dgp: data-generating process with a sample() method.
        estimator: estimator with a fit() method and beta1_hat attribute.
        errors: array of estimation errors (beta1_hat - beta1) for each simulation.
    """

    def __init__(
        self,
        dgp: StaticNormalDGP | DynamicNormalDGP,
        estimator: SimpleOLS,
    ) -> None:
        """Initializes the simulation runner.

        Args:
            dgp: An instance of a DGP class (must implement `sample`).
            estimator: An instance of an estimator class (must implement `fit`).
        """
        self.dgp: StaticNormalDGP | DynamicNormalDGP = dgp
        self.estimator: SimpleOLS = estimator
        self.errors: np.ndarray = np.empty(0)

    def simulate(self, n_sim: int, n_obs: int, first_seed: int | None = None) -> None:
        """Runs simulations and stores estimation errors.

        Args:
            n_sim (int): number of simulations to run.
            n_obs (int): Number of observations per simulation.
            first_seed (int | None): Starting random seed for reproducibility. 
                Defaults to None.
        """
        # Preallocate array to hold estimation errors
        self.errors = np.empty(n_sim)

        # Run simulation
        for sim_id in range(n_sim):
            # Draw data
            x, y = self.dgp.sample(n_obs, seed=first_seed + sim_id if first_seed else None)
            # Fit model
            self.estimator.fit(x, y)
            # Store error
            self.errors[sim_id] = self.estimator.beta1_hat - self.dgp.beta1

Simulation Runner in Action

Simulation flow now:

Create DGP and estimator
Pass to SimulationRunner and simulate()

# Initialize DGP and estimator
static_dgp = StaticNormalDGP(beta0=0.0, beta1=0.5)
ols_estimator = SimpleOLS()

# Initialize and run simulation
ols_static_sim = SimulationRunner(static_dgp, ols_estimator)
ols_static_sim.simulate(n_sim=1000, n_obs=50, first_seed=1)

# Summarize bias
print(f"Average estimation error (bias): {ols_static_sim.errors.mean():.4f}")

Average estimation error (bias): -0.0027

Discussion I: What Just Happened?

DGP, estimator, and simulation runner are now only loosely connected

Benefits:

Can add new scenarios without touching the core simulation logic
Different people can work on these without conflicts
Can reuse components elsewhere
Easier to read and change the code

Discussion II: Reproducibility in `simulate()`

simulate() ensures reproducibility with the first_seed argument

first_seed — seed used for first dataset
\(i\)th dataset uses seed = first_seed + i
Ensures different datasets in different steps

Summarizing Results

What To Do With Results

SimulationRunner stores raw simulation results (here: full estimation errors in errors )

How these are handled — depends on what you want to do

Can add summary methods
Can create other objects that process run simulations (and generate figures)

We will add a simple summarize_bias() method

Example: Adding A Summary to `SimulationRunner`

class SimulationRunner:
    """Runs Monte Carlo simulations for a given DGP and estimator.

    Attributes:
        dgp: Data-generating process with a `sample` method.
        estimator: Estimator with a `fit` method and `beta1_hat` attribute.
        errors: array of estimation errors (beta1_hat - beta1) for each simulation.
    """

    def __init__(
        self,
        dgp: "StaticNormalDGP" | "DynamicNormalDGP",
        estimator: SimpleOLS,
    ) -> None:
        """Initializes the simulation runner.

        Args:
            dgp: An instance of a DGP class (must implement `sample`).
            estimator: An instance of an estimator class (must implement `fit`).
        """
        self.dgp = dgp
        self.estimator = estimator
        self.errors = None

    def simulate(self, n_sim: int, n_obs: int, first_seed: int | None = None) -> None:
        """Runs simulations and stores estimation errors.

        Args:
            n_sim (int): number of simulations to run.
            n_obs (int): Number of observations per simulation.
            first_seed (int | None): Random seed for reproducibility. Defaults to None.
        """
        # Preallocate array to hold estimation errors
        self.errors = np.empty(n_sim)

        # Run simulation
        for sim_id in range(n_sim):
            # Draw data
            x, y = self.dgp.sample(n_obs, seed=first_seed + sim_id if first_seed else None)
            # Fit model
            self.estimator.fit(x, y)
            # Store error
            self.errors[sim_id] = self.estimator.beta1_hat - self.dgp.beta1

    def summarize_bias(self) -> None:
        """Prints the average estimation error (bias) for beta1. 
        """ 
        print(f"Average estimation error (bias): {np.mean(self.errors):.4f}")

Example: Summarizing AR(1) Simulation

For example: compute bias in dynamic model with a lot of persistence:

# Initialize DGP and estimator
dynamic_dgp = DynamicNormalDGP(beta0=0.0, beta1=0.95)
ols_estimator = SimpleOLS()

# Initialize and run simulation
ols_dynamic_sim = SimulationRunner(dynamic_dgp, ols_estimator)
ols_dynamic_sim.simulate(n_sim=1000, n_obs=50, first_seed=1)

# Summarize bias
ols_dynamic_sim.summarize_bias()

Average estimation error (bias): -0.0992

Recap and Conclusions

Recap

In this lecture we

Discussed the benefits of a more modular structure
Implemented such an approach using appropriate classes
Talked about basic result summaries

How To Intepret Example Code

Choose the appropriate level of complexity for your situation:

A brief simulation with one DGP does not need deep architecture and can be done with a monolithic function (last time)
A complex simulation involving many scenarios and estimators would benefit from a clearer structure

The specific implementations of today are not absolute, but an example of how to think

Further Improvements

Our simulation is in a good prototype shape, but can still improve some things:

Clean up relationships between DGP classes and organize them in a common family
Add simulation metadata (DGP name, estimator name for further purposes)
Tracking progress

Next Questions

How to clean up some loose ends in existing code?
How to organize running many simulations?
How to handle simulation results?
How to apply this logic to deeper statistical scenarios?

References

Hansen, Bruce. 2022. Econometrics. Princeton_University_Press.

Hillard, Dane. 2020. Practices of the Python Pro. Shelter Island, NY: Manning Publications Co.

Lutz, Mark. 2025. Learning Python: Powerful Object-Oriented Programming. Sixth edition. Santa Rosa, CA: O’Reilly.

Good Simulation Code II: Modular Approach

Introduction

Lecture Info

Learning Outcomes

References

Reminder: Simulation Setting

Reminder: Studying OLS Bias

Reminder: Static and Dynamic DGPs

Reminder: Current State of Simulation

Problem: How To Expand Simulations?

This Lecture: Dealing With Complexity

Towards a Modular Design

Identifying Component Blocks in Code

A Step Back: What Are We Doing?

Why Should The Simulation Loop Care?

Why Should the Estimator and DGP Care?

Conclusion: Components

Some Design Concepts

Background: Separation of Concerns

Interfaces: Glue Between Components

Background: Interfaces

DGPs Encapsulate Their Logic

Implementation

DGP and Estimator Classes

Approach: Classes

What Attributes and Methods Do Our Classes Need?

Example: Static DGP Class

Discussion of DGP Class

Example: AR(1) Class

Discussion: DGP Classes

Same Idea: Simple Estimator Class

SimpleOLS In Action

Simulation Runner Class

Capturing Simulation Logic

Simulation Runner Implementation

Simulation Runner in Action

Discussion I: What Just Happened?

Discussion II: Reproducibility in simulate()

Summarizing Results

What To Do With Results

Example: Adding A Summary to SimulationRunner

Example: Summarizing AR(1) Simulation

Recap and Conclusions

Recap

How To Intepret Example Code

Further Improvements

Next Questions

References

`SimpleOLS` In Action

Discussion II: Reproducibility in `simulate()`

Example: Adding A Summary to `SimulationRunner`