Good Simulation Code III: Organization

Defining Interfaces. Organizing Code Files

Vladislav Morozov

Introduction

Lecture Info

Learning Outcomes

This lecture is about organizing our code in two ways: defining interfaces and organizing our files


By the end, you should be able to

  • Specify desired interfaces using protocols
  • Organize simulation components in modules
  • Execute the simulation from the command line

References

Statistics:

  • Sections 28.19–28.23 in Hansen (2022) on shrinkage
  • Section 29.5 in Hansen (2022) on ridge regression

Programming:

  • Chapter 22 in Lutz (2025) on modules and program structure
  • Chapter 13 in Ramalho (2022) and this tutorial on protocols in Python

Scenario: Scope Change

Reminder: Previous Simulation Setting

Reminder: Studying OLS Bias

Recall: so far in this block we have studied the bias of the OLS estimator in a simple linear model:

\[ Y_{t} = \beta_0 + \beta_1 X_{t} + U_t, \quad t=1, \dots, T \]

Metric of interest: bias of the slope estimator

\[ \text{Bias}(\hat{\beta}_1) = \E[\hat{\beta}_1] - \beta_1 \]

Reminder: Static and Dynamic DGPs

Static (StaticNormalDGP):

\[ \small Y_{t} = \beta_0 + 0.5 X_{t} + U_{t} \]

Covariate \(X_t\) independent from \(U_t\) and over time

Dynamic (DynamicNormalDGP):

\[ \small Y_{t} = \beta_0 + \beta_1 Y_{t-1} + U_{t} \]

Dependence strength determined by the \(\beta_1\) value:

\[ \small \beta_1 \in \curl{0, 0.5, 0.95} \]

  • One DGP per \(\beta_1\) value; \(U_t\), \(X_t\), \(Y_0\): each \(N(0, 1)\)
  • \(T=50, 200\)

Reminder: SimpleOLS

OLS logic captured by SimpleOLS class:

class SimpleOLS:
    """A simple OLS estimator for the linear model y = beta0 + beta1*x + u.

    Attributes:
        beta0_hat (float): Estimated intercept. NaN until fit is called.
        beta1_hat (float): Estimated slope. NaN until fit is called.
    """

    def __init__(self) -> None:
        """Initializes the OLS estimator with no estimates."""
        self.beta0_hat: float = np.nan
        self.beta1_hat: float = np.nan

    def fit(self, x: np.ndarray, y: np.ndarray) -> None:
        """Fit OLS to the provided data.

        Args:
            x (np.ndarray): Independent variable (1D array).
            y (np.ndarray): Dependent variable (1D array).
        """
 
        # Add constant to x
        X = np.column_stack([np.ones(len(x)), x])
        # OLS estimation
        beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
        self.beta0_hat, self.beta1_hat = beta_hat[0], beta_hat[1]

Reminder: SimulationRunner

Simulation logic captured by SimulationRunner class:

class SimulationRunner:
    """Runs Monte Carlo simulations for a given DGP and estimator.

    Attributes:
        dgp: Data-generating process with a sample method.
        estimator: Estimator with a fit method and beta1_hat attribute.
        errors: array of estimation errors (beta1_hat - beta1) for each simulation.
    """

    def __init__(
        self,
        dgp: StaticNormalDGP | DynamicNormalDGP,
        estimator: SimpleOLS,
    ) -> None:
        """Initializes the simulation runner.

        Args:
            dgp: An instance of a DGP class (must implement sample).
            estimator: An instance of an estimator class (must implement fit).
        """
        self.dgp = dgp
        self.estimator = estimator
        self.errors = None

    def simulate(self, n_sim: int, n_obs: int, first_seed: int | None = None) -> None:
        """Runs simulations and stores estimation errors.

        Args:
            n_sim (int): number of simulations to run.
            n_obs (int): Number of observations per simulation.
            first_seed (int | None): Random seed for reproducibility. Defaults to None.
        """
        # Preallocate array to hold estimation errors
        self.errors = np.empty(n_sim)

        # Run simulation
        for sim_id in range(n_sim):
            # Draw data
            x, y = self.dgp.sample(n_obs, seed=first_seed + sim_id if first_seed is not None else None)
            # Fit model
            self.estimator.fit(x, y)
            # Store error
            self.errors[sim_id] = self.estimator.beta1_hat - self.dgp.beta1

    def summarize_bias(self) -> None:
        """Prints the average estimation error (bias) for beta1. 
        """ 
        print(f"Average estimation error (bias): {np.mean(self.errors):.4f}")

Shifting Scope and Adding a New Estimator

Problem: Scope Change

Now imagine that the scope of the overall project changes to a broader one:

New goal: comparing the bias of several “near-OLS” estimators in static and dynamic settings

  • Scope changes are natural, especially in “experimental” scenarios
  • Need to be flexible both in statistics and code

How to proceed?

Add a New Estimator (Ridge)

The first OLS-like estimator is ridge

\[ \small \hat{\bbeta}^{Ridge} = (\bX'\bX+\lambda I)^{-1} \bX'\bY \]

  • Ridge — one of the simplest examples of a shrinkage estimator
  • Shrinkage strength controlled by \(\lambda \geq 0\)
  • Minimizes sum of squared residuals with \(L^2\) penalty: \[ \small \hat{\bbeta}^{Ridge} = \argmin \sum_{t=1}^T (Y_t - b_0 - b_1 X_t)^2 + \lambda \norm{(b_0, b_1)}_2^2 \]
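The closed form above follows from the first-order condition of this minimization (stacking the coefficients into \(b\)):

\[ \small \frac{\partial}{\partial b} \left[ (\bY - \bX b)'(\bY - \bX b) + \lambda b'b \right] = -2\bX'(\bY - \bX b) + 2\lambda b = 0 \quad \Rightarrow \quad \hat{\bbeta}^{Ridge} = (\bX'\bX + \lambda I)^{-1} \bX'\bY \]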

Implementation of Simple Ridge Estimator

Implement ridge with the same interface as SimpleOLS:

class SimpleRidge:
    """A simple ridge estimator for the linear model y = beta0 + beta1*x + u.

    Attributes:
        beta0_hat (float): Estimated intercept. NaN until fit is called.
        beta1_hat (float): Estimated slope. NaN until fit is called.
        reg_param (float): Strength of regularization. Defaults to 0.01.
    """

    def __init__(self, reg_param: float = 0.01) -> None:
        """Initializes the OLS estimator with no estimates."""
        self.beta0_hat: float = np.nan
        self.beta1_hat: float = np.nan
        self.reg_param = reg_param

    def fit(self, x: np.ndarray, y: np.ndarray) -> None:
        """Fits the ridge estimator to the provided data.

        Args:
            x (np.ndarray): Independent variable (1D array).
            y (np.ndarray): Dependent variable (1D array).
        """
 
        # Add constant to x
        X = np.column_stack([np.ones(len(x)), x])
        # Ridge estimation
        beta_hat = np.linalg.inv(X.T @ X + self.reg_param * np.eye(2)) @ X.T @ y
        self.beta0_hat, self.beta1_hat = beta_hat[0], beta_hat[1]

Bias vs. Regularization Strength

# Initialize DGP and estimator
static_dgp = StaticNormalDGP(beta0=0.0, beta1=0.5)
# Estimators
ridge_weak = SimpleRidge(reg_param=0.01)
ridge_strong = SimpleRidge(reg_param=100)

# Initialize and run simulations
sim_ridge_weak = SimulationRunner(static_dgp, ridge_weak)
sim_ridge_strong = SimulationRunner(static_dgp, ridge_strong)
sim_ridge_weak.simulate(n_sim=1000, n_obs=50, first_seed=1)
sim_ridge_strong.simulate(n_sim=1000, n_obs=50, first_seed=1)
# Summarize bias 
sim_ridge_weak.summarize_bias()
sim_ridge_strong.summarize_bias()
Average estimation error (bias): -0.0028
Average estimation error (bias): -0.3377

More regularization \(\Rightarrow\) more bias
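Why? A one-line limit using the closed form above: the penalty shrinks the whole coefficient vector toward zero, so

\[ \small \lim_{\lambda \to \infty} \hat{\bbeta}^{Ridge} = \lim_{\lambda \to \infty} (\bX'\bX + \lambda I)^{-1} \bX'\bY = 0 \quad \Rightarrow \quad \text{Bias}(\hat{\beta}_1^{Ridge}) \to -\beta_1 \]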

Defining Interfaces

Organizing Classes with Protocols

Results for Ridge: Discussion

SimulationRunner works with the new SimpleRidge, even though it was written with SimpleOLS in mind

Why?

  • SimpleOLS and SimpleRidge look the same to SimulationRunner: same fit() method and beta1_hat attribute
  • Example of duck typing in action

\(\Rightarrow\) can we just keep adding new estimators to answer the new, expanded study question?

Problem 1: Communicating Estimator Interface

Problem with “just adding”:

How do you remember or tell teammates what interface the estimators should have?

  • Could write docs: good, but extra effort that many people put off until later
  • Could look at how SimulationRunner uses the estimator: works best if it’s your own fresh code. Otherwise it’s harder, and you can miss something

Better to be explicit about what new estimators need

Problem 2: Incorrect Type Hints

A smaller but annoying problem:

  • Our type hints in SimulationRunner don’t allow for SimpleRidge as an estimator
  • Not a problem for execution: Python doesn’t check type hints at runtime
  • But code quality tools/IDEs will (rightfully) complain:

Figure: code problem raised by Pylance/Pyright

Solution: Protocols

Python protocols allow us to:

  • Express that SimulationRunner expects estimators to have fit() and beta1_hat
  • Capture in one place the expectations for our estimators (and DGPs)
  • Type hint this

Protocols again work like contracts:

If estimator meets protocol, it is compatible with SimulationRunner

Estimator Protocol

Here’s how to express our two requirements:

from typing import Protocol

class EstimatorProtocol(Protocol):
    def fit(self, x: np.ndarray, y: np.ndarray) -> None: ...  # (1)

    @property
    def beta1_hat(self) -> float: ...  # (2)

1. A fit() method that takes x and y numpy arrays
2. A float attribute called beta1_hat


Both SimpleOLS and SimpleRidge follow our protocol
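Optionally, we can verify this at runtime: marking the protocol @runtime_checkable lets isinstance() check that the required members exist (presence only, not signatures). A minimal sketch, assuming SimpleOLS and SimpleRidge as defined earlier:

from typing import Protocol, runtime_checkable

import numpy as np

@runtime_checkable
class EstimatorProtocol(Protocol):
    def fit(self, x: np.ndarray, y: np.ndarray) -> None: ...

    @property
    def beta1_hat(self) -> float: ...

# Conformance is structural: neither class inherits from the protocol
print(isinstance(SimpleOLS(), EstimatorProtocol))    # True
print(isinstance(SimpleRidge(), EstimatorProtocol))  # True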

Protocol for DGPs

What about DGPs? Two requirements:

  • sample() method with two arguments (n_obs and seed)
  • True beta1 value (for estimation error computation)

Same approach:

class DGPProtocol(Protocol):
    def sample(
        self, n_obs: int, seed: int | None = None
    ) -> tuple[np.ndarray, np.ndarray]:
        ...

    @property
    def beta1(self) -> float: ...

How To Use Protocols?

Protocols communicate information about interfaces both to readers and type checkers — a contract

Now modify SimulationRunner to make this explicit:

  • Insert DGPProtocol as type hint for dgp argument
  • Insert EstimatorProtocol as type hint for estimator

Seeing a Protocol in type hints suggests that some particular interface is expected here (e.g. fit() and beta1_hat)

Updating SimulationRunner

Updated implementation with protocols in place:

class SimulationRunner:
    """Runs Monte Carlo simulations for a given DGP and estimator.

    Attributes:
        dgp: data-generating process with a sample() method and beta1 attribute.
        estimator: estimator with a fit() method and beta1_hat attribute.
        errors: array of estimation errors (beta1_hat - beta1) for each simulation.
    """

    def __init__(
        self,
        dgp: DGPProtocol,
        estimator: EstimatorProtocol,
    ) -> None:
        """Initializes the simulation runner.

        Args:
            dgp: An instance of a DGP class (must implement `sample`).
            estimator: An instance of an estimator class (must implement `fit`).
        """
        self.dgp: DGPProtocol = dgp
        self.estimator: EstimatorProtocol = estimator
        self.errors: np.ndarray = np.empty(0)

    def simulate(self, n_sim: int, n_obs: int, first_seed: int | None = None) -> None:
        """Runs simulations and stores estimation errors.

        Args:
            n_sim (int): number of simulations to run.
            n_obs (int): Number of observations per simulation.
            first_seed (int | None): Starting random seed for reproducibility.
                Defaults to None.
        """
        # Preallocate array to hold estimation errors
        self.errors = np.empty(n_sim)

        # Run simulation
        for sim_id in range(n_sim):
            # Seed for this replication; stays None if first_seed is None
            current_seed = first_seed + sim_id if first_seed is not None else None

            # Draw data
            x, y = self.dgp.sample(n_obs, seed=current_seed)
            # Fit model
            self.estimator.fit(x, y)
            # Store error
            self.errors[sim_id] = self.estimator.beta1_hat - self.dgp.beta1

    def summarize_bias(self) -> None:
        """Prints the average estimation error (bias) for beta1. 
        """ 
        print(f"Average estimation error (bias): {np.mean(self.errors):.4f}")

Aside: Alternative Approach with ABCs

Abstract base classes provide another way to specify the desired interface contract

  • With ABCs you define abstract BaseDGP and BaseEstimator classes with abstract methods of the desired form
  • Concrete estimators and DGPs inherit from the bases and implement sample() and fit()

Protocols vs. ABCs: depends on whether you want an inheritance structure plus shared logic (no inheritance needed with protocols); see the sketch below
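A minimal sketch of the ABC route (the BaseEstimator name here is hypothetical; the key difference is that concrete classes must now inherit):

from abc import ABC, abstractmethod

import numpy as np

class BaseEstimator(ABC):
    """Abstract base: subclasses must implement fit() and set beta1_hat."""

    beta1_hat: float

    @abstractmethod
    def fit(self, x: np.ndarray, y: np.ndarray) -> None: ...

class SimpleOLSViaABC(BaseEstimator):
    """Same OLS logic as before, but now inheriting explicitly."""

    def __init__(self) -> None:
        self.beta1_hat = np.nan

    def fit(self, x: np.ndarray, y: np.ndarray) -> None:
        # Add constant to x and apply the usual OLS formula
        X = np.column_stack([np.ones(len(x)), x])
        beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
        self.beta1_hat = float(beta_hat[1])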

Adding Lasso Following Our Protocols

Lasso

Another penalized SSR (“OLS-like”) estimator for \(\beta_1\) is lasso, which solves: \[ \small \hat{\bbeta}^{Lasso} = \argmin \sum_{t=1}^T (Y_t - b_0 - b_1 X_t)^2 + \lambda \norm{(b_0, b_1)}_1 \]

  • Lasso penalizes \(L^1\) norm of \((b_0, b_1)\), not \(L^2\) like ridge
  • Want to add to our simulation under new scope

How to Add Lasso?

  • Unlike ridge, \(\hat{\bbeta}^{Lasso}\) has no closed-form solution
  • Can implement numerical minimization logic ourselves

But we don’t want to implement it ourselves (hard to do well, easy to get wrong)

  • There is already a well-optimized implementation in scikit-learn
  • Problem: scikit-learn lasso does not follow our protocol

Not a problem: can adapt it in a compatible manner

Point of This Section

Why are we discussing this lasso example?


Three reasons:

  • Seeing how our protocol guides us and serves as a contract
  • Seeing how you can make use of existing things from other sources
  • Seeing another new estimator

A Look at scikit-learn Lasso

In general, would use scikit-learn Lasso like this:

from sklearn.linear_model import Lasso
x, y = static_dgp.sample(5000)
x = x.reshape(-1, 1)           # scikit-learn expects 2D arrays
lasso_est = Lasso(alpha=0.05)  # alpha - regularization parameter
lasso_est.fit(x, y)

Now can predict() (“intended” usage of scikit-learn) or access estimated coefficients (what we need):

print(f"Estimates: beta0: {lasso_est.intercept_:.3f}; estimated beta1: {lasso_est.coef_[0]:.3f}")
Estimates: beta0: -0.002; estimated beta1: 0.447

Lasso has a different interface: fit() expects a 2D array (before we had 1D); estimates are stored in intercept_ and coef_

Lasso Adapter Implementation

Adapter class that aligns Lasso with EstimatorProtocol:

from sklearn.linear_model import Lasso as SklearnLasso 

class LassoWrapper:
    """A wrapper for scikit-learn's Lasso to match the EstimatorProtocol.
    Model y = beta0 + beta1*x + u.

    Attributes:
        model (sklearn.linear_model.Lasso): a scikit-learn Lasso instance.
        beta0_hat (float): Estimated intercept. Initialized as np.nan.
        beta1_hat (float): Estimated slope. Initialized as np.nan.
    """

    def __init__(self, reg_param: float = 1.0) -> None:
        """Initializes the Lasso wrapper with a scikit-learn Lasso model.

        Args:
            reg_param (float): Regularization strength (alpha). Defaults to 1.0.
        """
        self.model = SklearnLasso(alpha=reg_param)
        self.beta0_hat: float = np.nan
        self.beta1_hat: float = np.nan

    def fit(self, x: np.ndarray, y: np.ndarray) -> None:
        """Fits the Lasso model to the provided data.

        Args:
            x (np.ndarray): Independent variable (1D array).
            y (np.ndarray): Dependent variable (1D array).
        """ 
        x = x.reshape(-1, 1) # sklearn expects 2d array inputs
        self.model.fit(x, y)
        self.beta0_hat = float(self.model.intercept_)
        self.beta1_hat = float(self.model.coef_[0]) 

Lasso Wrapper in Action

Now works like SimpleRidge:

# Estimators 
lasso_strong = LassoWrapper(reg_param=100)
# Simulations and results  
sim_lasso_strong = SimulationRunner(static_dgp, lasso_strong)
sim_lasso_strong.simulate(n_sim=1000, n_obs=50, first_seed=1) 
sim_lasso_strong.summarize_bias() 
print(f"Variance of estimation error: {sim_lasso_strong.errors.var()}")
Average estimation error (bias): -0.5000
Variance of estimation error: 0.0

Statistical conclusion: strong regularization kills \(\beta_1\) completely: \(\hat{\beta}^{Lasso}_1 = 0\) in every replication, so the error is always \(0 - 0.5 = -0.5\) with zero variance

Adding Lasso: Discussion

On protocols:

  • Our LassoWrapper conforms to the EstimatorProtocol
  • \(\Rightarrow\) didn’t need to change SimulationRunner at all; all of its docs and type hints remain valid
  • Can now freely add as many estimators and DGPs as necessary without touching core logic

On the wrapper:

  • A technique for making use of external implementations

Organizing Files

Splitting Code Appropriately

A Recap

Have quite a few things implemented:

  • Three estimators: SimpleOLS, SimpleRidge, LassoWrapper
  • Two DGPs: StaticNormalDGP, DynamicNormalDGP
  • Two protocols: DGPProtocol, EstimatorProtocol
  • SimulationRunner

So far: just defining them on slides, in a Jupyter-notebook-like fashion

Goal: Organization

Good project file organization:

  • Makes it easy to work in parallel on different features
  • Is self-documenting
  • Allows easy version control
  • Makes it easy to execute overall project

Now we turn to explicit file structure

Why Not a Notebook?

Notebooks are great for exploration, but:

  • Hard to version control (diffing .ipynb files is messy)
  • Difficult to reuse code across projects
  • No clear entry point for running simulations


Modules + scripts = better for production and collaboration

Basics: Python Program Architecture

A typical Python program:

  • One top-level file — script
    • Has main program, can be executed
    • Example: our simulation execution
  • \(\geq 0\) supplemental module files
    • Libraries of tools; group components used by script
    • Not executed (or don’t do anything when executed)
    • Example: estimators, DGPs, etc.

Reminder: Python Modules

Modules are the highest-level way to organize Python code


Benefits:

  • Easy way to systematically organize things
  • Can reuse code
  • Avoids duplication
  • Self-contained and avoid name clashes
  • Help with encapsulation

Overall Example File Organization

project/
├── dgps/
│   ├── __init__.py
│   ├── static.py       # StaticNormalDGP
│   └── dynamic.py      # DynamicNormalDGP
├── estimators/
│   ├── __init__.py
│   └── ols_like.py     # SimpleOLS, SimpleRidge, LassoWrapper
├── main.py             # Main script that we call from the CLI
├── protocols.py        # DGPProtocol, EstimatorProtocol
└── runner.py           # SimulationRunner
  • main.py — runs the main logic
  • Other .py files — modules (libraries) of tools
  • Module file names must be valid Python identifiers: ols_like.py, not ols-like.py
  • Can also have a .ipynb file for experimenting that relies on the same .py modules
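One optional convenience (a sketch, not required): let estimators/__init__.py re-export the classes so imports get shorter:

# estimators/__init__.py
from estimators.ols_like import LassoWrapper, SimpleOLS, SimpleRidge

__all__ = ["SimpleOLS", "SimpleRidge", "LassoWrapper"]

After this, from estimators import SimpleRidge works from the project root.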

Example: Estimators Module

Group OLS-like estimator class definitions together:

estimators/ols_like.py
"""
Module for OLS-like estimators.

This module contains classes for estimating simple model
    y = b0 + b1 * x + u
with OLS-like methods. All estimators implement the EstimatorProtocol interface.

Classes:
    SimpleOLS: ordinary least squares estimator.
    SimpleRidge: ridge regression estimator.
    LassoWrapper: wrapper for scikit-learn's Lasso estimator.
"""

import numpy as np
from sklearn.linear_model import Lasso as SklearnLasso


class SimpleOLS:
    """A simple OLS estimator for the linear model y = beta0 + beta1*x + u.

    Attributes:
        beta0_hat (float): Estimated intercept. NaN until fit is called.
        beta1_hat (float): Estimated slope. NaN until fit is called.
    """

    def __init__(self) -> None:
        """Initializes the OLS estimator with no estimates."""
        self.beta0_hat: float = np.nan
        self.beta1_hat: float = np.nan

    def fit(self, x: np.ndarray, y: np.ndarray) -> None:
        """Fit OLS to the provided data.

        Args:
            x (np.ndarray): Independent variable (1D array).
            y (np.ndarray): Dependent variable (1D array).
        """

        # Add constant to x
        X = np.column_stack([np.ones(len(x)), x])
        # OLS estimation
        beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
        self.beta0_hat, self.beta1_hat = beta_hat[0], beta_hat[1]


class SimpleRidge:
    """A simple ridge estimator for the linear model y = beta0 + beta1*x + u.

    Attributes:
        beta0_hat (float): Estimated intercept. NaN until fit is called.
        beta1_hat (float): Estimated slope. NaN until fit is called.
        reg_param (float): Strength of regularization. Defaults to 0.01.
    """

    def __init__(self, reg_param: float = 0.01) -> None:
        """Initializes the OLS estimator with no estimates."""
        self.beta0_hat: float = np.nan
        self.beta1_hat: float = np.nan
        self.reg_param = reg_param

    def fit(self, x: np.ndarray, y: np.ndarray) -> None:
        """Fits the ridge estimator to the provided data.

        Args:
            x (np.ndarray): Independent variable (1D array).
            y (np.ndarray): Dependent variable (1D array).
        """

        # Add constant to x
        X = np.column_stack([np.ones(len(x)), x])
        # Ridge estimation
        beta_hat = np.linalg.inv(X.T @ X + self.reg_param * np.eye(2)) @ X.T @ y
        self.beta0_hat, self.beta1_hat = beta_hat[0], beta_hat[1]


class LassoWrapper:
    """A wrapper for scikit-learn's Lasso to match the EstimatorProtocol.
    Model y = beta0 + beta1*x + u.

    Attributes:
        model (sklearn.linear_model.Lasso): a scikit-learn Lasso instance.
        beta0_hat (float): Estimated intercept. Initialized as np.nan.
        beta1_hat (float): Estimated slope. Initialized as np.nan.
    """

    def __init__(self, reg_param: float = 1.0) -> None:
        """Initializes the Lasso wrapper with a scikit-learn Lasso model.

        Args:
            reg_param (float): Regularization strength (alpha). Defaults to 1.0.
        """
        self.model = SklearnLasso(alpha=reg_param)
        self.beta0_hat: float = np.nan
        self.beta1_hat: float = np.nan

    def fit(self, x: np.ndarray, y: np.ndarray) -> None:
        """Fits the Lasso model to the provided data.

        Args:
            x (np.ndarray): Independent variable (1D array).
            y (np.ndarray): Dependent variable (1D array).
        """
        x = x.reshape(-1, 1)  # sklearn expects 2d array inputs
        self.model.fit(x, y)
        self.beta0_hat = float(self.model.intercept_)
        self.beta1_hat = float(self.model.coef_[0])

Reminder: Importing From Modules

Can now import either the whole module (from the project root):

import estimators.ols_like as ols_like


Or can import specific estimators:

from estimators.ols_like import SimpleOLS
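For example, a quick script or notebook cell at the project root can assemble a run from these modules (same classes and scenario as the ridge example above):

from dgps.static import StaticNormalDGP
from estimators.ols_like import SimpleRidge
from runner import SimulationRunner

# Static DGP with weak ridge regularization
dgp = StaticNormalDGP(beta0=0.0, beta1=0.5)
runner = SimulationRunner(dgp, SimpleRidge(reg_param=0.01))
runner.simulate(n_sim=1000, n_obs=50, first_seed=1)
runner.summarize_bias()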

File Organization: Discussion

Main rule: choose what’s easiest for understanding and collaboration

Split more if:

  • You have many classes (e.g., 10+ estimators)
  • Multiple people are working on the code

Split less if:

  • The project is small (e.g., 2–3 estimators)
  • You want to minimize file overhead

main.py and Executing Our Simulation

Towards a Main Script

So far have modules, but still need the main.py script

main.py imports all the necessary components and actually executes the simulations. It’s the entry point


  • Today: just a skeleton script that runs a given scenario and prints output to console
  • Next time: running many scenarios and handling outputs

Our First main.py

A script with a single scenario:

main.py
from dgps.dynamic import DynamicNormalDGP
from estimators.ols_like import LassoWrapper
from runner import SimulationRunner

if __name__ == "__main__":
    # Today: just one scenario
    dgp = DynamicNormalDGP(beta0=0.0, beta1=0.95)
    estimator = LassoWrapper(reg_param=0.04)
    n_obs = 50

    # Run simulation for specified scenario
    runner = SimulationRunner(dgp, estimator)
    runner.simulate(n_sim=1000, n_obs=n_obs, first_seed=1)

    # Print results
    print(
        f"Bias for {dgp.__class__.__name__} + {estimator.__class__.__name__}: "
    )
    runner.summarize_bias()

Note: About __name__ and __main__

  • The if __name__ == "__main__": idiom is a very Python thing
  • You will often see it in scripts executed from the command line


Ensures the code runs only when the script is executed directly, not when imported (avoids accidents)

See chapters 24–25 in Lutz (2025)
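A tiny demonstration (hypothetical file demo.py):

# demo.py
print(f"__name__ is {__name__!r}")

if __name__ == "__main__":
    print("Executed directly")

Running python demo.py prints both lines (__name__ is '__main__'); import demo prints only the first (__name__ is 'demo').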

Running Our Simulation

To execute, call from the command line when in project root:

py main.py

or

python main.py

or (depending on your system/Python)

python3 main.py


Prints the following to the console:

Bias for DynamicNormalDGP + LassoWrapper: 
Average estimation error (bias): -0.1121

Discussion: So Far and What’s Next?

Already in quite good shape:

  • Modular and extensible structure
  • Reasonable understanding of how components interact
  • Quite a bit of progress on the statistical question

What’s left for next time?

  • How to execute many scenarios conveniently in main.py?
  • How to store results of these scenarios

Recap and Conclusions

Recap


In this lecture we

  • Discussed how to handle scope changes
  • Introduced and defined protocols for enforcing interfaces
  • Showed how to adapt external libraries
  • Made our file structure explicit

Next Questions


  • How to organize running many simulations?
  • How to deal with simulation outputs?
  • How to apply this logic to deeper statistical scenarios?

References

Hansen, Bruce. 2022. Econometrics. Princeton, NJ: Princeton University Press.
Lutz, Mark. 2025. Learning Python: Powerful Object-Oriented Programming. Sixth edition. Santa Rosa, CA: O’Reilly.
Ramalho, Luciano. 2022. Fluent Python: Clear, Concise, and Effective Programming. 2nd edition. Sebastopol, California: O’Reilly Media, Inc.