Difference-in-Differences

Causal Inference of Binary Treatment under Parallel Trends

Vladislav Morozov

Introduction

Lecture Info

Learning Outcomes

This lecture is about handling changes over time in causal studies with a binary treatment


By the end, you should be able to

  • State the parallel trends assumption
  • Identify the ATT using a difference-in-differences (DiD) strategy
  • Give a regression characterization of DiD

References

Textbooks:

  • Chapter 18 in Huntington-Klein (2025)
  • “Difference-in-Differences” in Cunningham (2021)
  • Chapter 13.2 in Wooldridge (2020)
  • Chapter 18 in Hansen (2022)

Roth et al. (2023): good overview of recent advances

Empirical Motivation

Empirical Question

Does raising the minimum wage reduce employment?

  • Micro 1 with perfectly competitive markets and elastic labor demand says yes
  • But what about imperfect competition, general equilibrium effects of some workers having more spending power, inelastic labor demand, …?

Not necessarily obvious what the overall effect is. See Neumark, Salas, and Wascher (2014) for history of question

Identification

Motivation

Previous lecture — “event studies” with “no trends” in average untreated outcome: \(\E[Y_{it}^0]\) does not depend on \(t\)


Assumption of no trends difficult to justify unless time horizon very short

  • If there are trends, event studies do not work
  • What can we do?

Two-Unit Thought Experiment

Setting: Two Units with Binary Treatment

To start: very simplified case

  • Two periods of time: \(t=1, 2\)
  • Two units: unit 1 with outcomes \(Y\) and unit 2 with outcome \(Z\)
  • Second unit treated between \(t=1\) and \(t=2\). First unit not treated

Potential outcomes at time \(t\): \(Y_t^d\) and \(Z_t^d\) for treatment values \(d=0, 1\)

Object of Interest


Object of interest: treatment effect for unit 2 at \(t=2\): \[ \delta_T = Z_2^1- Z_2^0 \]

Change in Outcomes

Observed outcomes satisfy \[ Z_{1} = Z_1^0, \quad Z_{2} = Z_{2}^1 \]

Difference in outcomes (like in event studies): \[ \begin{aligned} Z_2 - Z_1 & = \gamma_T + \delta_T\\ \gamma_T & = Z_{2}^0 - Z_{1}^0 \end{aligned} \] \(\gamma_T\) — trend in outcomes without treatment

Trend for Other Unit

Outcomes of other unit satisfy: \[ Y_1 = Y_1^0, \quad Y_2 = Y_2^0 \]


Difference in outcomes: only trend in outcomes without treatment \[ \begin{aligned} Y_2 - Y_1 & = \gamma_U\\ \gamma_U & = Y_2^0 - Y_1^0 \end{aligned} \]

Difference-in-Differences with Two Units

Under parallel trends (\(\gamma_T = \gamma_U\)) identify \(\delta_T\) as \[ \delta_T = (Z_2- Z_1) - (Y_2- Y_1) \]

  • Difference of two unit-specific differences, hence the name difference-in-differences
  • Old argument, goes back at least to Snow (1856)
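A minimal numeric sketch of the two-unit argument. All potential outcomes below are made-up numbers chosen so that parallel trends holds; in real data \(Z_2^0\) is never observed.

```python
# Hypothetical potential outcomes for the two units (illustrative numbers only)
Z1_0, Z2_0, Z2_1 = 10.0, 12.0, 15.0   # unit 2 (treated): Z_1^0, Z_2^0, Z_2^1
Y1_0, Y2_0 = 4.0, 6.0                 # unit 1 (untreated): Y_1^0, Y_2^0

# Untreated trends of the two units; parallel trends means they coincide
gamma_T = Z2_0 - Z1_0
gamma_U = Y2_0 - Y1_0

# Observed outcomes: unit 2 is treated at t=2, unit 1 never is
Z1, Z2 = Z1_0, Z2_1
Y1, Y2 = Y1_0, Y2_0

delta_T = Z2_1 - Z2_0                 # true effect (uses the unobserved Z_2^0)
did = (Z2 - Z1) - (Y2 - Y1)           # uses observed outcomes only
naive = Z2 - Z1                       # before/after comparison: effect plus trend

print(gamma_T == gamma_U, delta_T, did, naive)
```

The before/after difference for unit 2 mixes the effect with the trend, while subtracting the change for unit 1 removes the trend exactly when the two trends agree.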

Difference-in-Differences

Limitations of Two Unit Approach

Two unit situation might be

  • Unrealistic: literal parallel trends might not hold
  • Uninteresting: if you want to predict the effect for a new unit


\(\Rightarrow\) Now consider an approach with multiple units

Setting

  • Units indexed by \(i=1, \dots, N\), time indexed by \(t=1, 2\)
  • All outcomes labeled \(Y_{it}\)
  • Treatment pattern: realized treatment indicator \(D_{it}\)
    • Untreated/control group (group \(U\)): \(D_{i1}=D_{i2}=0\)
    • Treated group (group \(T\)): \(D_{i1}=0\), \(D_{i2}=1\)

Potential Outcomes

Potential outcomes \(Y_{it}^d\) for \(d=0, 1\)

  • Units in group \(T\): \[ Y_{i1} = Y_{i1}^0, \quad Y_{i2} = Y_{i2}^1 \]
  • Units in group \(U\) \[ Y_{i1} = Y_{i1}^0, \quad Y_{i2}= Y_{i2}^0 \]

Change in Outcomes for Treated

Decompose average of \(Y_{i2}-Y_{i1}\) for treated units (\(T\)): \[ \E[Y_{i2} - Y_{i1}|T] = \E[Y_{i2}^1 - Y_{i2}^0|T] + \E[Y_{i2}^0- Y_{i1}^0|T] \]

  • \(\E[Y_{i2}^1 - Y_{i2}^0|T]\) — parameter of interest, the average treatment effect for the treated (ATT)
  • \(\E[Y_{i2}^0- Y_{i1}^0|T]\) is the trend in outcomes without treatment

Change in Outcomes for Untreated


For untreated units (\(U\)): \[ \E[Y_{i2}-Y_{i1}|U] = \E[Y_{i2}^0 - Y_{i1}^0|U] \]


Only the time trend is present

Result Statement

Proposition 1 (DiD Identification) Suppose that parallel trends hold: \[ \E[Y_{i2}^0 - Y_{i1}^0|T] = \E[Y_{i2}^0 - Y_{i1}^0|U], \] and that \(P(D_{i1}=D_{i2} =0)>0\) and \(P(D_{i1}=0, D_{i2}=1)>0\).

Then the ATT \(\E[Y_{i2}^1 - Y_{i2}^0|T]\) is identified as a difference in differences: \[ \begin{aligned} \E[Y_{i2}^1 - Y_{i2}^0|T] & = \E[Y_{i2}-Y_{i1}|T] - \E[Y_{i2}-Y_{i1}|U] \end{aligned} \]


The positivity conditions mean that both treated and untreated units exist
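
A sketch of the argument, combining the two decompositions of the average changes for groups \(T\) and \(U\) given earlier: \[ \begin{aligned} \E[Y_{i2}-Y_{i1}|T] - \E[Y_{i2}-Y_{i1}|U] & = \E[Y_{i2}^1 - Y_{i2}^0|T] + \E[Y_{i2}^0- Y_{i1}^0|T] - \E[Y_{i2}^0 - Y_{i1}^0|U] \\ & = \E[Y_{i2}^1 - Y_{i2}^0|T] \end{aligned} \] The second equality uses parallel trends, \(\E[Y_{i2}^0 - Y_{i1}^0|T] = \E[Y_{i2}^0 - Y_{i1}^0|U]\), so that the two trend terms cancel. Both conditional expectations on the left are well defined because both groups have positive probability.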

Discussion

What does DiD actually do?

  • Solves selection into treatment by
    • Considering effects only for the treated (ATT instead of ATE\(=\E[Y_{i2}^1 - Y_{i2}^0]\))
    • Comparing treated units to themselves
  • Solves evolution over time by
    • Assuming parallel trends without treatment
    • Identifying the trend from untreated units

Estimation and Regression View

DiD Estimator

We have \(ATT = \E[Y_{i2}-Y_{i1}|T] - \E[Y_{i2}-Y_{i1}|U]\)

Sample version: the DiD estimator:

\[ \widehat{ATT}^{DiD} = \dfrac{1}{N_T}\sum_{i \in T} (Y_{i2}-Y_{i1}) - \dfrac{1}{N_U}\sum_{i \in U} (Y_{i2}-Y_{i1}) \tag{2}\]

\(N_T\) — number of treated units, \(N_U\) — of untreated units
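
A minimal sketch of the estimator on simulated data. The function name `did_estimator` and the data-generating process are illustrative assumptions, not part of the lecture; the simulation imposes a common trend and lets treated units differ in levels.

```python
import numpy as np

def did_estimator(y1, y2, treated):
    """Difference of mean outcome changes between treated and untreated units."""
    y1, y2 = np.asarray(y1, dtype=float), np.asarray(y2, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    change = y2 - y1
    return change[treated].mean() - change[~treated].mean()

# Simulated two-period panel (hypothetical data-generating process)
rng = np.random.default_rng(0)
n = 5_000
treated = rng.random(n) < 0.4
alpha = rng.normal(size=n) + 2.0 * treated       # selection: treated units differ in levels
y1 = alpha + rng.normal(size=n)                  # Y_{i1} = Y_{i1}^0 for everyone
trend = 1.5                                      # common untreated trend (parallel trends)
effect = 2.0                                     # true ATT in this simulation
y2 = alpha + trend + effect * treated + rng.normal(size=n)

att_hat = did_estimator(y1, y2, treated)
# Cross-sectional comparison at t=2: contaminated by the level difference
naive_hat = y2[treated].mean() - y2[~treated].mean()
print(round(att_hat, 2), round(naive_hat, 2))
```

In this design the DiD estimate should land near the true effect of 2, while the period-2 comparison of group means also picks up the level difference between groups.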

Regression Representation: Coefficients

Can frame DiD estimator differently

Define “coefficients”: \[ \begin{aligned} \alpha_i & = Y_{i1}^0, \\ \gamma & = \E[Y_{i2}^0 - Y_{i1}^0 |T], \\ \delta & = \E[Y_{i2}^1 - Y_{i2}^0 |T]. \end{aligned} \]

By parallel trends, can also have \(\gamma = \E[Y_{i2}^0 - Y_{i1}^0 |U]\)

Regression Representation: Outcomes

Recall \(D_{it}\). Define indicator of second period: \[ SP_{it} = \I\curl{t=2} \]

Can express outcomes as \[ Y_{it} = \begin{cases} \alpha_i + U_{it}, & SP_{it} = 0, D_{i2} \in \{0, 1\}, \\ \alpha_i + \gamma + U_{it}, & SP_{it} = 1, D_{i2} = 0, \\ \alpha_i + \gamma + \delta + U_{it}, & SP_{it} = 1, D_{i2} = 1, \end{cases} \] What is the \(U_{it}\)?

TWFE Representation

Compact form of representation: \[ Y_{it} = \alpha_i + \gamma SP_{it} + \delta D_{it} + U_{it} \tag{3}\] Equation 3 is called a two-way fixed effects (TWFE) model (more on that in next lecture)

Can eliminate \(\alpha_i\) by taking first differences across time (recall \(D_{i1}=0\); \(U_{i1}=0\) because \(\alpha_i = Y_{i1}^0\)) to get \[ Y_{i2}- Y_{i1} = \gamma + \delta D_{i2} + U_{i2} \tag{4}\]

Regression Representation: Result

Proposition 2 (Canonical DiD is TWFE) The OLS estimator \(\hat{\delta}\) in Equation 4 satisfies \[ \hat{\delta} = \widehat{ATT}^{DiD}, \] for the ATT estimator of Equation 2.

Proof: by brute force evaluating the OLS estimator, see exercise set 3
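
Proposition 2 can also be checked numerically (a sketch on simulated data, not a substitute for the proof). The variable names and the simulated outcome changes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
d2 = (rng.random(n) < 0.5).astype(float)          # D_{i2}: treated in period 2
dy = 1.0 + 2.0 * d2 + rng.normal(size=n)          # simulated Y_{i2} - Y_{i1}

# OLS of the outcome change on a constant and D_{i2} (Equation 4)
X = np.column_stack([np.ones(n), d2])
gamma_hat, delta_hat = np.linalg.lstsq(X, dy, rcond=None)[0]

# DiD estimator: difference in mean changes between the two groups
did_hat = dy[d2 == 1].mean() - dy[d2 == 0].mean()

print(np.isclose(delta_hat, did_hat), np.isclose(gamma_hat, dy[d2 == 0].mean()))
```

With a single binary regressor, the OLS slope equals the difference in group means of the dependent variable and the intercept equals the mean for the untreated group, which is exactly the structure of the DiD estimator.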

Regression Representation: Discussion

  • Like for event studies: managed to get linear regression in a nonparametric context
  • Consistency of OLS = consistency of \(\widehat{ATT}^{DiD}\) (exercise: prove that parallel trends equivalent to strict exogeneity of \(D_{it}\))
  • Regression results: can use all results for OLS, including limit theory and inference

Limitations

Two obvious ones:

  • Need parallel trends
  • We did not identify average effect for the untreated and the overall ATE


Can relax a bit if we have another “control” group — DDD identification and estimator (see Cunningham (2021) or Wooldridge (2020))

Some Extensions

Adding Covariates I: Conditioning

What if there are extra covariates \(\bX_i = (\bX_{i1}, \bX_{i2})\)?

Can relax parallel trends to conditional parallel trends: for each (or some) value \(\bx=(\bx_1, \bx_2)\) \[ \small \E[Y_{i2}^0 - Y_{i1}^0 |T, \bX_{i}=\bx] = \E[Y_{i2}^0 - Y_{i1}^0 |U, \bX_{i}=\bx] \]

Same argument as before: identify conditional ATT \[\small CATT(\bx) = \E[Y_{i2}^1 - Y_{i2}^0 |T, \bX_{i}=\bx] \]

If parallel trends hold for all possible \(\bx\), can also identify the overall (\(\bX\)-unconditional) ATT
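
A sketch of how conditioning can matter, using a simulated discrete covariate. The design is a hypothetical one in which the untreated trend depends on the covariate and treatment is more likely for high covariate values, so unconditional parallel trends fails while conditional parallel trends holds.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 60_000
x = rng.integers(0, 3, size=n)                           # covariate with values 0, 1, 2
treated = rng.random(n) < np.array([0.2, 0.5, 0.8])[x]   # treatment more likely for high x
trend = np.array([0.0, 1.0, 2.0])[x]                     # untreated trend depends on x only
effect = np.array([1.0, 2.0, 3.0])[x]                    # CATT(x) in this simulation
dy = trend + effect * treated + rng.normal(size=n)       # observed Y_{i2} - Y_{i1}

# Unconditional DiD: biased here because the groups have different x-mixes
did_uncond = dy[treated].mean() - dy[~treated].mean()

# Cell-by-cell DiD estimates CATT(x) for each value of x
catt_hat = np.array([
    dy[treated & (x == v)].mean() - dy[~treated & (x == v)].mean()
    for v in range(3)
])
# Aggregate with the distribution of x among the treated to get the ATT
weights = np.array([(x[treated] == v).mean() for v in range(3)])
att_hat = catt_hat @ weights

true_att = effect[treated].mean()
print(catt_hat.round(2), round(att_hat, 2), round(true_att, 2), round(did_uncond, 2))
```

The weights are the shares of each covariate value among treated units, since the target is an average over the treated population.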

Adding Covariates II: Assuming Linearity

Often see linearity assumptions on the causal model in the form: \[ \begin{aligned} Y_{it}^d = Y_{it}^0 + d(Y_{i2}^1- Y_{i2}^0) + \bX_{it}'\bbeta_i \end{aligned} \]

Can estimate ATT consistently from TWFE regression \[ Y_{i2}-Y_{i1} = \gamma + \delta D_{i2} + (\bX_{i2}-\bX_{i1})'\bbeta + u_{it}, \] if \(\bbeta\) appropriately defined (think of OLS estimator under heterogeneous coefficient causal model)

Adding More Periods: Setting

Can accommodate more periods of data

  • \(T\geq 2\)
  • Two groups: never treated (\(U\)) and treated, treatment starts between \(t_0-1\) and \(t_0\)
  • Treatment indicators \(D_{it\tau}\), \(t, \tau=1,\dots, T\):
    • Untreated units have \(D_{it\tau}=0\) for all \(t, \tau\)
    • Treated units \(D_{it\tau}=1\) if and only if \(t=\tau\) and \(t\geq t_0\)

Adding More Periods: TWFE

Multi-period specification: \[ \small \begin{aligned} Y_{it} & = \alpha_i + \gamma_t + \sum_{\tau=t_0}^T \delta_\tau D_{it\tau} + U_{it},\\ \alpha_i & = Y_{i1}^0, \\ \gamma_t & = \E[Y_{it}^0- Y_{i1}^0|T],\\ \delta_t & = \E[Y_{it}^1- Y_{it}^0|T] \end{aligned} \] OLS consistent for dynamic ATTs \(\delta_t\), \(t\geq t_0\), if for all \(t\)
\[ \small \E[Y_{it}^0- Y_{i1}^0|T]=\E[Y_{it}^0- Y_{i1}^0|U] \]
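
A sketch of the multi-period specification on a simulated balanced panel, estimated by OLS with explicit unit and time dummies. All names and parameter values are illustrative assumptions; the simulation imposes common time effects, so parallel trends holds in every period.

```python
import numpy as np

rng = np.random.default_rng(3)
n_units, n_periods, t0 = 300, 5, 3                 # treatment starts between t0-1 and t0
treated = rng.random(n_units) < 0.5
alpha = rng.normal(size=n_units) + treated         # unit effects, correlated with treatment
gamma = np.array([0.0, 0.5, 1.0, 1.5, 2.0])        # common time effects
delta = {3: 1.0, 4: 2.0, 5: 3.0}                   # true dynamic ATTs in this simulation

rows = []
for i in range(n_units):
    for t in range(1, n_periods + 1):
        y_it = alpha[i] + gamma[t - 1] + rng.normal(scale=0.5)
        if treated[i] and t >= t0:
            y_it += delta[t]
        rows.append((i, t, y_it))
unit, time, y = map(np.array, zip(*rows))

# Regressors: unit dummies, time dummies (t=1 as baseline), and D_{it,tau}
unit_dummies = (unit[:, None] == np.arange(n_units)).astype(float)
time_dummies = (time[:, None] == np.arange(2, n_periods + 1)).astype(float)
event_dummies = np.column_stack([
    (treated[unit] & (time == tau)).astype(float)
    for tau in range(t0, n_periods + 1)
])
X = np.hstack([unit_dummies, time_dummies, event_dummies])

coef = np.linalg.lstsq(X, y, rcond=None)[0]
delta_hat = coef[-event_dummies.shape[1]:]          # estimates of delta_3, delta_4, delta_5
print(delta_hat.round(2))
```

Building the dummies by hand keeps the example dependency-free; with many units a within transformation or a panel package would be the more practical route.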

Adding More Groups

  • So far: had two groups (untreated and treated at some specific point in time)
  • In many settings: different units receive treatment in different periods (e.g. states implement same laws at different times)
  • Can learn with such staggered timing
  • Need to be careful, cannot just do TWFE, need more sophisticated approaches

See section 3 in Roth et al. (2023)

Empirical Application

Context and Setting

Context

Recall: interested in effect of raising minimum wage

We replicate the classic paper of Card and Krueger (1994). Background of their analysis:

  • New Jersey (NJ) raised its minimum wage in 1992
  • Neighboring Pennsylvania (PA) did not
  • NJ and PA touch in Philadelphia area — likely fairly similar populations there

Data Description

Card and Krueger (1994) collect data on fast food restaurants in NJ and PA before and after NJ minimum wage raise:

  • Units \(i\) are restaurants
  • Groups \(T\) and \(U\) are \(i\) in NJ and in PA, respectively
  • Outcome: number of “full-time equivalent” workers (sum of full-time, managers, and half of part-time workers)

Loading and Looking at the Data

Loading and processing the data
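import numpy as np
import pandas as pd
import statsmodels.api as sm
import linearmodels as lm

ck_data = pd.read_csv("data/card-krueger-1994.csv")

# Compute the number of full-time equivalent employees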
ck_data = ck_data.replace(".", np.nan)
ck_data = ck_data.astype("float64")
ck_data["emp_ft"] = (
    ck_data["empft"] + 
    0.5*ck_data["emppt"] +
    ck_data["nmgrs"]
)

# Extract only the necessary columns
ck_data = ck_data.loc[:, ["store", "state", "time", "emp_ft", "hoursopen"]]

# Insert any missing (store, time) rows
full_index = pd.MultiIndex.from_product(
    [ck_data["store"].unique(), ck_data["time"].unique()],
    names = ["store", "time"]
)
ck_data = (
    ck_data.set_index(["store", "time"])
        .reindex(full_index)
        .reset_index()
)

# Drop stores with missing employee numbers
stores_with_nan = (
    ck_data.groupby("store")["emp_ft"]
    .apply(lambda g: g.isnull().any())
)
stores_with_nan = stores_with_nan[stores_with_nan].index 
ck_data = (
    ck_data.loc[
        ~ck_data["store"].isin(stores_with_nan), 
        :
        ]
)

# Recode states and times more descriptively
states_dict = {0: "PA", 1: "NJ"}
ck_data["state_name"] = ck_data["state"].replace(states_dict)
time_dict = {0:"Before", 1:"After"}
ck_data["time_name"] = ck_data["time"].replace(time_dict)

# Set index
ck_data = ck_data.set_index(["store", "time_name"])
ck_data.head()
                 time  state  emp_ft  hoursopen state_name
store time_name
46.0  Before      0.0    0.0    40.5       17.0         PA
      After       1.0    0.0    24.5       17.0         PA
49.0  Before      0.0    0.0    14.5       13.0         PA
      After       1.0    0.0    11.5       13.0         PA
506.0 Before      0.0    0.0     8.5       10.0         PA

Estimation Results

Tabulating Averages

The key components of difference-in-differences are the (state, period)-specific average outcomes:

Computing (state, period)-averages
emp_means = ck_data.groupby(by=["state_name", "time_name"])["emp_ft"].mean()
print(emp_means)
state_name  time_name
NJ          After        20.925566
            Before       20.475728
PA          After        21.140000
            Before       23.440000
Name: emp_ft, dtype: float64

Applying DiD

With the averages in hand, the DiD estimate is easy to compute. Recall that NJ is the treated group. Applying DiD:

Applying DiD manually
( 
    (emp_means.loc[("NJ", "After")] - emp_means.loc[("NJ", "Before")])
    - (emp_means.loc[("PA", "After")] - emp_means.loc[("PA", "Before")])
).round(2)
np.float64(2.75)

Positive estimate! Under parallel trends, increasing the minimum wage led to an average gain of 2.75 full-time equivalent workers per fast food restaurant in NJ
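
It can help to see which of the two within-state changes drives the estimate. The sketch below re-enters the four printed group averages by hand (so it runs without the data file) and splits the DiD into its two components.

```python
import pandas as pd

# Group averages copied from the printed (state, time) table
emp_means = pd.Series(
    {
        ("NJ", "After"): 20.925566,
        ("NJ", "Before"): 20.475728,
        ("PA", "After"): 21.140000,
        ("PA", "Before"): 23.440000,
    }
)
emp_means.index.names = ["state_name", "time_name"]

# Rows: states; columns: periods
table = emp_means.unstack("time_name")
change = table["After"] - table["Before"]     # within-state change over time
did = change["NJ"] - change["PA"]             # difference in differences
print(change.round(2).to_dict(), round(did, 2))
```

Average employment rose by about 0.45 in NJ and fell by 2.30 in PA, so most of the 2.75 comes from the decline in the control state. Under parallel trends, that decline is what would have happened in NJ without the minimum wage increase.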

Visualizing 1992 Employment Levels

Regression Implementation I

Easiest way to obtain standard error of estimated effect — use regression characterization

We have a choice

  • Differenced equation \[ Y_{i2} - Y_{i1} = \gamma + \delta D_{i2} + U_{i2} \]
  • Undifferenced equation \[ Y_{it} = \alpha_i + \gamma SP_{it} + \delta D_{it} + U_{it} \]

Estimation of Differenced Equation

Differenced equation — easy to estimate with OLS in statsmodels

Regression with differenced equation
endog = ck_data.groupby(by="store")["emp_ft"].diff().dropna()
exog = ck_data.loc[(slice(None),"After"), "state"]
exog.name = "Treated"
exog = sm.add_constant(exog)

# Run OLS
fitted_model = sm.OLS(endog, exog).fit(cov_type="HC0")
print(fitted_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 emp_ft   R-squared:                       0.015
Model:                            OLS   Adj. R-squared:                  0.012
Method:                 Least Squares   F-statistic:                     4.246
Date:                Tue, 20 May 2025   Prob (F-statistic):             0.0400
Time:                        22:02:25   Log-Likelihood:                -1385.7
No. Observations:                 384   AIC:                             2775.
Df Residuals:                     382   BIC:                             2783.
Df Model:                           1                                         
Covariance Type:                  HC0                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -2.3000      1.246     -1.847      0.065      -4.741       0.141
Treated        2.7498      1.335      2.061      0.039       0.134       5.365
==============================================================================
Omnibus:                       33.243   Durbin-Watson:                   2.194
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               95.234
Skew:                          -0.362   Prob(JB):                     2.09e-21
Kurtosis:                       5.330   Cond. No.                         4.32
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC0)

Estimation of Full TWFE Equation

  • Can estimate without differencing data ourselves using linearmodels or pyfixest (more details in next lecture)
  • Here use PanelOLS from linearmodels with entity_effects (\(\alpha_i\)) and time_effects (\(\gamma\))

Estimation using linearmodels
ck_data = ck_data.reset_index().set_index(["store", "time"])
endog = ck_data["emp_ft"]
exog = ck_data.index.get_level_values(1) * ck_data.loc[:, "state"]
exog.name = "Treated"

twfe_results = lm.PanelOLS(
  endog, 
  exog,  
  entity_effects=True, 
  time_effects=True, 
  drop_absorbed=True,
).fit(cov_type="robust")
print(twfe_results)
                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:                 emp_ft   R-squared:                        0.0147
Estimator:                   PanelOLS   R-squared (Between):              0.0865
No. Observations:                 768   R-squared (Within):              -0.0506
Date:                Tue, May 20 2025   R-squared (Overall):              0.0813
Time:                        22:02:25   Log-likelihood                   -2239.1
Cov. Estimator:                Robust                                           
                                        F-statistic:                      5.6903
Entities:                         384   P-value                           0.0175
Avg Obs:                       2.0000   Distribution:                   F(1,382)
Min Obs:                       2.0000                                           
Max Obs:                       2.0000   F-statistic (robust):             4.2238
                                        P-value                           0.0405
Time periods:                       2   Distribution:                   F(1,382)
Avg Obs:                       384.00                                           
Min Obs:                       384.00                                           
Max Obs:                       384.00                                           
                                                                                
                             Parameter Estimates                              
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
Treated        2.7498     1.3380     2.0552     0.0405      0.1191      5.3806
==============================================================================

F-test for Poolability: 3.5258
P-value: 0.0000
Distribution: F(384,382)

Included effects: Entity, Time

Discussion of Estimation Results


  • In both cases same estimate — 2.75 workers
  • Nearly identical standard errors: 1.335 and 1.338 (about 1.34)
  • Effect is significantly different from zero (at 5% level), though evidence is not overwhelming
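
The inference statements can be reproduced from the reported coefficient and standard error. This is a sketch that uses the rounded values printed in the differenced regression and a normal approximation, so the last digits may differ slightly from the regression table.

```python
from statistics import NormalDist

# Estimate and robust standard error copied from the differenced regression output
coef, se = 2.7498, 1.335

std_normal = NormalDist()
z_stat = coef / se
p_value = 2 * (1 - std_normal.cdf(abs(z_stat)))      # two-sided p-value
crit = std_normal.inv_cdf(0.975)                     # 97.5% standard normal quantile
ci_low, ci_high = coef - crit * se, coef + crit * se

print(f"z = {z_stat:.2f}, p = {p_value:.3f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
# Compare against the 5% and 1% significance levels
print(p_value < 0.05, p_value < 0.01)
```

The null of no effect is rejected at the 5% level but not at the 1% level, and the confidence interval ranges from close to zero to more than five workers, which is the sense in which the evidence is not overwhelming.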

Including Covariates

  • Restaurants that are open for longer tend to have more workers, so we want to include opening hours
  • Easy to do with linearmodels, just add variable to exog
  • Find similar effect (2.77 vs. 2.75)

Adding linear covariates to TWFE DiD regression
exog = pd.DataFrame(exog)
exog["hoursopen"] = ck_data["hoursopen"]
exog = exog.dropna() 
# Align index of endog with exog
endog = endog[exog.index]

twfe_results = lm.PanelOLS(
  endog,
  exog,
  entity_effects=True,
  time_effects=True,
  drop_absorbed=True,
).fit(cov_type="robust")
print(twfe_results)
                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:                 emp_ft   R-squared:                        0.0312
Estimator:                   PanelOLS   R-squared (Between):              0.8836
No. Observations:                 761   R-squared (Within):              -0.0262
Date:                Tue, May 20 2025   R-squared (Overall):              0.8501
Time:                        22:02:25   Log-likelihood                   -2203.5
Cov. Estimator:                Robust                                           
                                        F-statistic:                      6.0137
Entities:                         384   P-value                           0.0027
Avg Obs:                       1.9818   Distribution:                   F(2,374)
Min Obs:                       1.0000                                           
Max Obs:                       2.0000   F-statistic (robust):             7.8111
                                        P-value                           0.0005
Time periods:                       2   Distribution:                   F(2,374)
Avg Obs:                       380.50                                           
Min Obs:                       377.00                                           
Max Obs:                       384.00                                           
                                                                                
                             Parameter Estimates                              
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
Treated        2.7702     1.3213     2.0967     0.0367      0.1722      5.3683
hoursopen      1.1348     0.3347     3.3904     0.0008      0.4767      1.7930
==============================================================================

F-test for Poolability: 2.2060
P-value: 0.0000
Distribution: F(384,374)

Included effects: Entity, Time

Recap and Conclusions

Recap

In this lecture we

  1. Discussed a way to handle changes over time — parallel trends
  2. Identified ATT for binary treatment with difference-in-differences strategy
  3. Gave a regression (TWFE) characterization of DiD estimation
  4. Proposed some extensions

Next Questions

  • How is the TWFE regression actually estimated?
  • What if the treatment is not binary?

References

Card, David, and Alan Krueger. 1994. “Minimum Wages and Employment: A Case Study of the Fast Food Industry in New Jersey and Pennsylvania.” American Economic Review 84 (4): 772–93. https://doi.org/10.3386/w4509.
Cunningham, Scott. 2021. Causal Inference: The Mixtape. Yale University Press. https://doi.org/10.2307/j.ctv1c29t27.
Hansen, Bruce. 2022. Econometrics. Princeton University Press.
Huntington-Klein, Nick. 2025. The Effect: An Introduction to Research Design and Causality. Chapman and Hall/CRC.
Neumark, David, J M Ian Salas, and William Wascher. 2014. “Revisiting the Minimum Wage-Employment Debate: Throwing Out the Baby with the Bathwater?” ILR Review 67 (3): 608–48. https://doi.org/10.1177/00197939140670S307.
Roth, Jonathan, Pedro H. C. Sant’Anna, Alyssa Bilinski, and John Poe. 2023. “What’s Trending in Difference-in-Differences? A Synthesis of the Recent Econometrics Literature.” Journal of Econometrics 235 (2): 2218–44. https://doi.org/10.1016/j.jeconom.2023.03.008.
Snow, John. 1856. “On the Mode of Communication of Cholera.” Edinburgh Medical Journal 1 (7): 668–70.
Wooldridge, Jeffrey M. 2020. Introductory Econometrics: A Modern Approach. Seventh edition. Boston, MA: Cengage.