Good Simulation Code II: Modular Approach

A Flexible and Extensible Design

Vladislav Morozov

Introduction

Lecture Info

Learning Outcomes

This lecture is about a modular approach to designing simulation code


By the end, you should be able to

  • Refactor code into modular reusable classes
  • Compose DGPs and estimators into simulations
  • Understand how modularity leads to clearer, more extensible code

References

Statistics:

  • Chapter 4 of Hansen (2022) for OLS
  • Chapter 14 of Hansen (2022) for time series basics

Programming

  • Chapter 2-3 of Hillard (2020) on basics of design
  • Chapter 26-27 of Lutz (2025) on OOP in Python

Reminder: Simulation Setting

Reminder: Studying OLS Bias

This time: continue studying bias of OLS estimator in simple linear model:

\[ Y_{t} = \beta_0 + \beta_1 X + U_t, \quad t=1, \dots, T \]

Metric of interest: absolute bias

\[ \text{Bias}(\hat{\beta}_1) = \E[\hat{\beta}_1] - \beta_1 \]

Reminder: Static and Dynamic DGPs

Static:

\[ \small Y_{t} = \beta_0 + 0.5 X_{t} + U_{t} \]

Covariate \(X_t\) independent from \(U_t\) and over time

Dynamic:

\[ \small Y_{t} = \beta_0 + \beta_1 Y_{t-1} + U_{t} \]

Dependence: \(\beta_1\) value

\[ \small \beta_1 \in \curl{0, 0.5, 0.95} \]

  • One DGP per \(U_t, X_t, Y_0\): each \(N(0, 1)\)
  • \(T=50, 200\)

Reminder: Current State of Simulation

Last time: implemented things with a basic function

def simulate_ols(
    n_sim: int = 1000,
    n_obs: int = 50,
    beta0: float = 0.0,
    beta1: float = 0.5,
    seed: int = 1,
    dgp_type: str = "static",
) -> list[float]:
    """Simulates OLS estimation for static or AR(1) DGP.
    Args:
        n_sim (int): Number of simulations to run. Defaults to 1000.
        n_obs (int): Number of observations per simulation. Defaults to 100.
        beta0 (float): True intercept value. Defaults to 0.
        beta1 (float): True slope coefficient. Defaults to 0.5
        seed (int): random number generator seed. Defaults to 1.
        dgp_type (str): Type of DGP: "static" or "ar1". Defaults to "static".

    Returns:
        list[float]: List of errors of beta1 estimates for each simulation.
    """

    # Create container to store results
    results_ols_errors = []

    # InitializeRNG
    rng = np.random.default_rng(seed)

    # Core simulation loop
    for _ in range(n_sim):
        # 1. Draw data
        if dgp_type == "static":
            x = rng.normal(size=n_obs)
            u = rng.normal(size=n_obs)
            y = beta0 + beta1 * x + u
        elif dgp_type == "ar1":
            y = np.zeros(n_obs + 1)
            u = rng.normal(size=n_obs + 1)
            y[0] = beta0 + u[0]
            for t in range(1, n_obs + 1):
                y[t] = beta0 + beta1 * y[t - 1] + u[t]
            # Use the last n_obs elements of y as and the first n_obs as x
            x = y[:-1]  # x is y lagged (n_obs elements)
            y = y[1:]  # y[1:] is the dependent variable (n_obs elements)
        else:
            # Handle the case of giving the wrong DGP value
            raise ValueError("Invalid DGP choice")

        # 2. Run OLS estimation
        W = np.column_stack([np.ones(n_obs), x])
        beta_hat = np.linalg.inv(W.T @ W) @ W.T @ y

        # 3. Compute error
        results_ols_errors.append(beta_hat[1] - beta1)

    return results_ols_errors

Problem: How To Expand Simulations?

What if we want to expand our simulations?

  • More distributions (e.g. heavier tails for innovations, different \(X\))
  • Different models (not necessarily \(Y_t = \beta_0 + \beta_1 X_t + U_t\))


Function approach is difficult to expand: would have to keep adding components and selectors to simulate_ols()

This Lecture: Dealing With Complexity

Today: dealing with design

  • Untangling the different components of simulation
  • Learning a more modular structure
  • Understanding the why of it

Towards a Modular Design

Identifying Component Blocks in Code

A Step Back: What Are We Doing?

There are three code tasks:

  1. Data generation: draw data from a DGP
  2. Estimation: apply an estimator to the data
  3. Orchestration: run the above many times and collect results


For now things are all jumbled up and hardcoded by simulate_ols()monolithic design

Why Should The Simulation Loop Care?

The simulation loop shouldn’t need to know:

  • How data is generated (e.g., AR(1) vs. static)
  • How estimation works (e.g., OLS vs. IV)

It only needs to:

  • Get data from something (DGP)
  • Pass data to something (estimator)
  • Repeat and store results

Why Should the Estimator and DGP Care?

Same logic goes for DGP: it shouldn’t care

  • Whether it’s going to be used one time or many times in a loop
  • About how the estimator uses its data after

Likewise, the estimator:

  • Doesn’t need details of the loop
  • Doesn’t need to know how the y and x are created, just their form

Conclusion: Components

Overall: identified three logical units in code


These units should be more or less independent (loosely coupled)


E.g.:

  • Changing DGP details should not affect loop, if loop can still sample in the same way
  • Can also then swap or add more components easily

Some Design Concepts

Background: Separation of Concerns

Key in clear code: divide various behavious into small, manageable pieces

Best way to divide — by concern. This:

  • Reduces complexity (each part does one thing well)
  • Improves reusability (swap DGPs/estimators without rewriting everything)
  • Makes collaboration easier (different people can work on different parts)

Interfaces: Glue Between Components

Simulation runner only needs to know how to interface with DGP and estimator:

  • DGP interface: Must provide a method like sample(n_obs, seed)
  • Estimator interface: Must provide a method like fit(X, y)


As long as a DGP/estimator implement given interface, simulation runner can use it

Their implementation details do not matter to runner

Background: Interfaces

Interfaces are like contracts

  • DGP promises to give valid data if sample() is called
  • Estimator promises to give suitable coefficients if fit() is called


Benefit: You can add new DGPs/estimators without touching the simulation runner!

DGPs Encapsulate Their Logic

Each DGP encapsulates its own logic:

  • AR(1) DGP handles its own loops, initial conditions, etc.
  • Static DGP just draws IID data.


The outside world only sees the sample() method that returns a sample of X and y

Implementation

DGP and Estimator Classes

Approach: Classes

Now want implementation with these good properties (separation of concerns, loose coupling, encapsulation)

Will capture required logic with appropriate classes

  • Going OOP allows us to view DGPs, estimators, and simulations as objects:
    • Each will have some appropriate data and some logic for doing something with that data
    • Different objects can interact using that logic
  • Simulation runner can compose DGPs and estimators

What Attributes and Methods Do Our Classes Need?

To define a class need to choose what data and functions its instances contain

  • DGPs need to:
    • Know their true beta1 (e.g. for bias computation)
    • Offer samples with given number of points from DGP-specific distribution
  • Estimator needs to:
    • Compute beta1_hat based on data
    • Remember computed beta1_hat

Example: Static DGP Class

class StaticNormalDGP:
    """A data-generating process (DGP) for a static linear model

    Attributes:
        beta0 (float): Intercept term.
        beta1 (float): Slope coefficient.
    """

    def __init__(self, beta0: float = 0.0, beta1: float = 0.5) -> None:
        """Initializes the DGP with intercept and slope.

        Args:
            beta0 (float): Intercept term. Defaults to 0.0.
            beta1 (float): Slope coefficient. Defaults to 1.0.
        """
        self.beta0: float = beta0
        self.beta1: float = beta1

    def sample(
        self, n_obs: int, seed: int | None = None
    ) -> tuple[np.ndarray, np.ndarray]:
        """Samples data from the static DGP.

        Args:
            n_obs (int): Number of observations to sample.
            seed (int, optional): Random seed for reproducibility. Defaults to None.

        Returns:
            tuple: (x, y) arrays, each of length n_obs.
        """
        rng = np.random.default_rng(seed)
        x = rng.normal(size=n_obs)
        u = rng.normal(size=n_obs)
        y = self.beta0 + self.beta1 * x + u
        return x, y

Discussion of DGP Class

We use a class to:

  • Capture a specific concern (data drawing)
  • Specify interfaces (a sample() method)
  • Encapsulate logic (class keeps its DGP details in sample())

Example usage

static_sampler = StaticNormalDGP(beta1=0.4)
x, y = static_sampler.sample(n_obs=5, seed=1)
print(f"x: {x.round(2)} \ny: {y.round(2)}")
x: [ 0.35  0.82  0.33 -1.3   0.91] 
y: [ 0.58 -0.21  0.71 -0.16  0.66]

Example: AR(1) Class

A class for AR(1) with same sample() interface

class DynamicNormalDGP:
    """A data-generating process (DGP) for a dynamic linear model: y_t = beta0 + beta1*y_{t-1} + u_t.

    Attributes:
        beta0 (float): Intercept term.
        beta1 (float): AR(1) coefficient.
    """

    def __init__(self, beta0: float = 0.0, beta1: float = 0.5):
        """Initializes the DGP with intercept and AR(1) coefficient.

        Args:
            beta0 (float): Intercept term. Defaults to 0.0.
            beta1 (float): AR(1) coefficient. Defaults to 0.5.
        """
        self.beta0: float = beta0
        self.beta1: float = beta1

    def sample(
        self, n_obs: int, seed: int | None = None
    ) -> tuple[np.ndarray, np.ndarray]:
        """Samples data from the dynamic DGP.

        Args:
            n_obs (int): Number of observations to sample.
            seed (int, optional): Random seed for reproducibility. Defaults to None.

        Returns:
            tuple: (x, y) arrays, each of length n_obs.
                  x is y_{t-1} (lagged y), and y is y_t.
        """
        rng = np.random.default_rng(seed)
        y = np.zeros(n_obs + 1)  # Extra observation for lag
        u = rng.normal(size=n_obs + 1)
        y[0] = self.beta0 + u[0]  # Initial condition
        for t in range(1, n_obs + 1):
            y[t] = self.beta0 + self.beta1 * y[t - 1] + u[t]
        # Return lagged y as x and y[1:] as y
        return y[:-1], y[1:]

Discussion: DGP Classes

  • Now StaticNormalDGP and DynamicNormalDGP capture our DGPs
  • Important: can interact with them in the same way:
    • Ask their true beta1 value
    • sample() given number of poinst with given seed
  • In line with what simulation loop wants, all other logic kept inside

Same Idea: Simple Estimator Class

Hide estimator logic and only provide fit()

class SimpleOLS:
    """A simple OLS estimator for the linear model y = beta0 + beta1*x + u.

    Attributes:
        beta0_hat (float): Estimated intercept. NaN until fit is called.
        beta1_hat (float): Estimated slope. NaN until fit is called.
    """

    def __init__(self) -> None:
        """Initializes the OLS estimator with no estimates."""
        self.beta0_hat: float = np.nan
        self.beta1_hat: float = np.nan

    def fit(self, x: np.ndarray, y: np.ndarray) -> None:
        """Fit OLS to the provided data.

        Args:
            x (np.ndarray): Independent variable (1D array).
            y (np.ndarray): Dependent variable (1D array).
        """
 
        # Add constant to x
        X = np.column_stack([np.ones(len(x)), x])
        # OLS estimation
        beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
        self.beta0_hat, self.beta1_hat = beta_hat[0], beta_hat[1]

SimpleOLS In Action

To fit an instance of SimpleOLS, just pass data:

ols = SimpleOLS()
x, y = static_sampler.sample(n_obs=5000, seed=1)
ols.fit(x, y)
print(f"Estimated slope coefficient: {ols.beta1_hat:.3f}")
Estimated slope coefficient: 0.366


Now ready to put things together into simulation

Simulation Runner Class

Capturing Simulation Logic

Now missing final piece — the simulation loop:

  • Will create appropriate SimulationRunner class
  • SimulationRunner composes DGP and estimator (part of its data)
  • With data, will run Monte Carlo simulation for given number of datasets

Simulation Runner Implementation

class SimulationRunner:
    """Runs Monte Carlo simulations for a given DGP and estimator.

    Attributes:
        dgp: data-generating process with a sample() method.
        estimator: estimator with a fit() method and beta1_hat attribute.
        errors: array of estimation errors (beta1_hat - beta1) for each simulation.
    """

    def __init__(
        self,
        dgp: StaticNormalDGP | DynamicNormalDGP,
        estimator: SimpleOLS,
    ) -> None:
        """Initializes the simulation runner.

        Args:
            dgp: An instance of a DGP class (must implement `sample`).
            estimator: An instance of an estimator class (must implement `fit`).
        """
        self.dgp: StaticNormalDGP | DynamicNormalDGP = dgp
        self.estimator: SimpleOLS = estimator
        self.errors: np.ndarray = np.empty(0)

    def simulate(self, n_sim: int, n_obs: int, first_seed: int | None = None) -> None:
        """Runs simulations and stores estimation errors.

        Args:
            n_sim (int): number of simulations to run.
            n_obs (int): Number of observations per simulation.
            first_seed (int | None): Starting random seed for reproducibility. 
                Defaults to None.
        """
        # Preallocate array to hold estimation errors
        self.errors = np.empty(n_sim)

        # Run simulation
        for sim_id in range(n_sim):
            # Draw data
            x, y = self.dgp.sample(n_obs, seed=first_seed + sim_id if first_seed else None)
            # Fit model
            self.estimator.fit(x, y)
            # Store error
            self.errors[sim_id] = self.estimator.beta1_hat - self.dgp.beta1

Simulation Runner in Action

Simulation flow now:

  • Create DGP and estimator
  • Pass to SimulationRunner and simulate()
# Initialize DGP and estimator
static_dgp = StaticNormalDGP(beta0=0.0, beta1=0.5)
ols_estimator = SimpleOLS()

# Initialize and run simulation
ols_static_sim = SimulationRunner(static_dgp, ols_estimator)
ols_static_sim.simulate(n_sim=1000, n_obs=50, first_seed=1)

# Summarize bias
print(f"Average estimation error (bias): {ols_static_sim.errors.mean():.4f}")
Average estimation error (bias): -0.0027

Discussion I: What Just Happened?

DGP, estimator, and simulation runner are now only loosely connected


Benefits:

  • Can add new scenarios without touching the core simulation logic
  • Different people can work on these without conflicts
  • Can reuse components elsewhere
  • Easier to read and change the code

Discussion II: Reproducibility in simulate()

simulate() ensures reproducibility with the first_seed argument


  • first_seed — seed used for first dataset
  • \(i\)th dataset uses seed = first_seed + i
  • Ensures different datasets in different steps

Summarizing Results

What To Do With Results

SimulationRunner stores raw simulation results (here: full estimation errors in errors )


How these are handled — depends on what you want to do

  • Can add summary methods
  • Can create other objects that process run simulations (and generate figures)

We will add a simple summarize_bias() method

Example: Adding A Summary to SimulationRunner

class SimulationRunner:
    """Runs Monte Carlo simulations for a given DGP and estimator.

    Attributes:
        dgp: Data-generating process with a `sample` method.
        estimator: Estimator with a `fit` method and `beta1_hat` attribute.
        errors: array of estimation errors (beta1_hat - beta1) for each simulation.
    """

    def __init__(
        self,
        dgp: "StaticNormalDGP" | "DynamicNormalDGP",
        estimator: SimpleOLS,
    ) -> None:
        """Initializes the simulation runner.

        Args:
            dgp: An instance of a DGP class (must implement `sample`).
            estimator: An instance of an estimator class (must implement `fit`).
        """
        self.dgp = dgp
        self.estimator = estimator
        self.errors = None

    def simulate(self, n_sim: int, n_obs: int, first_seed: int | None = None) -> None:
        """Runs simulations and stores estimation errors.

        Args:
            n_sim (int): number of simulations to run.
            n_obs (int): Number of observations per simulation.
            first_seed (int | None): Random seed for reproducibility. Defaults to None.
        """
        # Preallocate array to hold estimation errors
        self.errors = np.empty(n_sim)

        # Run simulation
        for sim_id in range(n_sim):
            # Draw data
            x, y = self.dgp.sample(n_obs, seed=first_seed + sim_id if first_seed else None)
            # Fit model
            self.estimator.fit(x, y)
            # Store error
            self.errors[sim_id] = self.estimator.beta1_hat - self.dgp.beta1

    def summarize_bias(self) -> None:
        """Prints the average estimation error (bias) for beta1. 
        """ 
        print(f"Average estimation error (bias): {np.mean(self.errors):.4f}")

Example: Summarizing AR(1) Simulation

For example: compute bias in dynamic model with a lot of persistence:

# Initialize DGP and estimator
dynamic_dgp = DynamicNormalDGP(beta0=0.0, beta1=0.95)
ols_estimator = SimpleOLS()

# Initialize and run simulation
ols_dynamic_sim = SimulationRunner(dynamic_dgp, ols_estimator)
ols_dynamic_sim.simulate(n_sim=1000, n_obs=50, first_seed=1)

# Summarize bias
ols_dynamic_sim.summarize_bias()
Average estimation error (bias): -0.0992

Recap and Conclusions

Recap


In this lecture we

  • Discussed the benefits of a more modular structure
  • Implemented such an approach using appropriate classes
  • Talked about basic result summaries

How To Intepret Example Code

Choose the appropriate level of complexity for your situation:

  • A brief simulation with one DGP does not need deep architecture and can be done with a monolithic function (last time)
  • A complex simulation involving many scenarios and estimators would benefit from a clearer structure

The specific implementations of today are not absolute, but an example of how to think

Further Improvements


Our simulation is in a good prototype shape, but can still improve some things:

  • Clean up relationships between DGP classes and organize them in a common family
  • Add simulation metadata (DGP name, estimator name for further purposes)
  • Tracking progress

Next Questions


  • How to clean up some loose ends in existing code?
  • How to organize running many simulations?
  • How to handle simulation results?
  • How to apply this logic to deeper statistical scenarios?

References

Hansen, Bruce. 2022. Econometrics. Princeton_University_Press.
Hillard, Dane. 2020. Practices of the Python Pro. Shelter Island, NY: Manning Publications Co.
Lutz, Mark. 2025. Learning Python: Powerful Object-Oriented Programming. Sixth edition. Santa Rosa, CA: O’Reilly.