Difference-in-Differences

Causal Inference of Binary Treatment under Parallel Trends

Vladislav Morozov

Introduction

Lecture Info

Learning Outcomes

This lecture is about handling changes over time in causal studies with a binary treatment

By the end, you should be able to

State the parallel trends assumption
Identify the ATT using a difference-in-differences (DiD) strategy
Give a regression characterization of DiD

References

Textbooks:

Chapter 18 in Huntington-Klein (2025)
“Difference-in-Differences” in Cunningham (2021)
Chapter 13.2 in Wooldridge (2020)
Chapter 18 in Hansen (2022)

Roth et al. (2023): good overview of recent advances

Empirical Motivation

Empirical Question

Does raising the minimum wage reduce employment?

Micro 1 with perfectly competitive markets and elastic labor demand says yes
But what about imperfect competition, general equilibrium effects of some workers having more spending power, inelastic labor demand, …?

Not necessarily obvious what the overall effect is. See Neumark, Salas, and Wascher (2014) for history of question

Identification

Motivation

Previous lecture — “event studies” with “no trends” in average untreated outcome: \(\E[Y_{it}^0]\) does not depend on \(t\)

Assumption of no trends difficult to justify unless time horizon very short

If there are trends, event studies do not work
What can we do?

Two-Unit Thought Experiment

Setting: Two Units with Binary Treatment

To start: very simplified case

Two periods of time: \(t=1, 2\)
Two units: unit 1 with outcomes \(Y\) and unit 2 with outcomes \(Z\)
Second unit treated between \(t=1\) and \(t=2\). First unit not treated

Potential outcomes at time \(t\): \(Y_t^d\) and \(Z_t^d\) for treatment values \(d=0, 1\)

Object of Interest

Object of interest: treatment effect for unit 2 (the treated unit) at \(t=2\): \[ \delta_T = Z_2^1- Z_2^0 \]

Change in Outcomes

Observed outcomes satisfy \[ Z_{1} = Z_1^0, \quad Z_{2} = Z_{2}^1 \]

Difference in outcomes (like in event studies): \[ \begin{aligned} Z_2 - Z_1 & = \gamma_T + \delta_T\\ \gamma_T & = Z_{2}^0 - Z_{1}^0 \end{aligned} \] \(\gamma_T\) — trend in outcomes without treatment

Role of Assumptions of No Trends

Strict no trends assumption from previous lecture: \[ \gamma_T = 0 \tag{1}\]

Under (1) \(Z_2-Z_1=\delta_T\) — identification of treatment effect
Without (1) change in outcome — sum of treatment effect and evolution over time
Hard to justify (1) outside of high-frequency data

Trend for Other Unit

Outcomes of other unit satisfy: \[ Y_1 = Y_1^0, \quad Y_2 = Y_2^0 \]

Difference in outcomes: only trend in outcomes without treatment \[ \begin{aligned} Y_2 - Y_1 & = \gamma_U\\ \gamma_U & = Y_2^0 - Y_1^0 \end{aligned} \]

Parallel Trends

Difference in differences of outcomes of units: \[ \begin{aligned} (Z_2- Z_1) - (Y_2- Y_1) = \delta_T + (\gamma_T - \gamma_U) \end{aligned} \]

What if the two units had the same evolution?

Assumption (literal parallel trends): \[ \gamma_T = \gamma_U \]

Difference-in-Differences with Two Units

Under parallel trends identify \(\delta_T\) as \[ \delta_T = (Z_2- Z_1) - (Y_2- Y_1) \]

Difference of two unit-specific differences, hence the name difference-in-differences
Old argument, goes back at least to Snow (1856)
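A minimal numeric sketch of the two-unit argument. All outcome values are hypothetical, chosen so that both units share an untreated trend of 2 and the treatment effect is 3. In practice \(Z_2^0\) is never observed.

```python
# Hypothetical potential outcomes (illustration only, not from any data set)
Z1_0, Z2_0, Z2_1 = 10.0, 12.0, 15.0   # treated unit: untreated path 10 -> 12, treated outcome 15
Y1_0, Y2_0 = 4.0, 6.0                 # untreated unit: untreated path 4 -> 6

# Observed outcomes: unit Z is treated in period 2, unit Y never is
Z1, Z2 = Z1_0, Z2_1
Y1, Y2 = Y1_0, Y2_0

gamma_T = Z2_0 - Z1_0   # untreated trend of the treated unit (unobservable)
gamma_U = Y2_0 - Y1_0   # untreated trend of the untreated unit (observable)
delta_T = Z2_1 - Z2_0   # true treatment effect

before_after = Z2 - Z1           # event-study style difference: delta_T + gamma_T
did = (Z2 - Z1) - (Y2 - Y1)      # difference-in-differences

print(before_after, did, delta_T)
```

The before-after difference mixes the effect with the trend, while the difference-in-differences recovers the effect because \(\gamma_T = \gamma_U\) holds by construction here.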

Difference-in-Differences

Limitations of Two Unit Approach

Two unit situation might be

Unrealistic: literal parallel trends might not hold
Uninteresting: if you want to predict the effect for a new unit

\(\Rightarrow\) Now consider an approach with multiple units

Setting

Units indexed by \(i=1, \dots, N\), time indexed by \(t=1, 2\)
All outcomes labeled \(Y_{it}\)
Treatment pattern: realized treatment indicator \(D_{it}\)
- Untreated/control group (group \(U\)): \(D_{i1}=D_{i2}=0\)
- Treated group (group \(T\)): \(D_{i1}=0\), \(D_{i2}=1\)

Potential Outcomes

Potential outcomes \(Y_{it}^d\) for \(d=0, 1\)

Units in group \(T\): \[ Y_{i1} = Y_{i1}^0, \quad Y_{i2} = Y_{i2}^1 \]
Units in group \(U\) \[ Y_{i1} = Y_{i1}^0, \quad Y_{i2}= Y_{i2}^0 \]

Change in Outcomes for Treated

Decompose average of \(Y_{i2}-Y_{i1}\) for treated units (\(T\)): \[ \E[Y_{i2} - Y_{i1}|T] = \E[Y_{i2}^1 - Y_{i2}^0|T] + \E[Y_{i2}^0- Y_{i1}^0|T] \]

\(\E[Y_{i2}^1 - Y_{i2}^0|T]\) — parameter of interest, the average treatment effect for the treated (ATT)
\(\E[Y_{i2}^0- Y_{i1}^0|T]\) trend in outcomes without treatment

Change in Outcomes for Untreated

For untreated units (\(U\)): \[ \E[Y_{i2}-Y_{i1}|U] = \E[Y_{i2}^0 - Y_{i1}^0|U] \]

Only the time trend is present

Parallel Trends Assumption

Same trends on average:

Assumption (average parallel trends): \[ \E[Y_{i2}^0 - Y_{i1}^0|T] = \E[Y_{i2}^0 - Y_{i1}^0|U] \]

Then difference-in-differences \[ \begin{aligned} \E[Y_{i2}^1 - Y_{i2}^0|T] & = \E[Y_{i2}-Y_{i1}|T] - \E[Y_{i2}-Y_{i1}|U] \end{aligned} \] Expressed ATT in terms of distribution of the data
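Spelling out the step, using only the decomposition for the treated, the expression for the untreated, and average parallel trends (which cancels the last two terms in the first line):

\[ \begin{aligned} \E[Y_{i2}-Y_{i1}|T] - \E[Y_{i2}-Y_{i1}|U] & = \E[Y_{i2}^1 - Y_{i2}^0|T] + \E[Y_{i2}^0- Y_{i1}^0|T] - \E[Y_{i2}^0 - Y_{i1}^0|U] \\ & = \E[Y_{i2}^1 - Y_{i2}^0|T] \end{aligned} \]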

Result Statement

Proposition 1 (DiD Identification) Suppose that \(P(D_{i1}=D_{i2} =0)>0\) and \(P(D_{i1}=0, D_{i2}=1)>0\), and that average parallel trends hold.

Then the ATT \(\E[Y_{i2}^1 - Y_{i2}^0|T]\) is identified in terms of a difference in differences: \[ \begin{aligned} \E[Y_{i2}^1 - Y_{i2}^0|T] & = \E[Y_{i2}-Y_{i1}|T] - \E[Y_{i2}-Y_{i1}|U] \end{aligned} \]

The probability conditions mean that both treated and untreated units exist

Discussion

What does DiD actually do?

Solves selection into treatment by
- Considering effects only for the treated (ATT instead of ATE\(=\E[Y_{i2}^1 - Y_{i2}^0]\))
- Comparing treated units to themselves
Solves evolution over time by
- Assuming parallel trends without treatment
- Identifying the trend from untreated units

Estimation and Regression View

DiD Estimator

We have \(ATT = \E[Y_{i2}-Y_{i1}|T] - \E[Y_{i2}-Y_{i1}|U]\)

Sample version: the DiD estimator:

\[ \widehat{ATT}^{DiD} = \dfrac{1}{N_T}\sum_{i \in T} (Y_{i2}-Y_{i1}) - \dfrac{1}{N_U}\sum_{i \in U} (Y_{i2}-Y_{i1}) \tag{2}\]

\(N_T\) — number of treated units, \(N_U\) — of untreated units
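A minimal sketch of the estimator in Equation 2 on simulated data. The function name `did_estimator` and all simulated numbers are assumptions made for illustration. The design has selection on levels (treated units start higher) but the same average untreated trend in both groups, so parallel trends hold by construction.

```python
import numpy as np

def did_estimator(y1, y2, treated):
    """Sample DiD: mean change among treated minus mean change among untreated."""
    y1, y2 = np.asarray(y1, dtype=float), np.asarray(y2, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    if treated.all() or (~treated).all():
        raise ValueError("need at least one treated and one untreated unit")
    change = y2 - y1
    return change[treated].mean() - change[~treated].mean()

# Simulated data (all numbers hypothetical)
rng = np.random.default_rng(0)
n = 20_000
treated = rng.random(n) < 0.4
y1 = 5.0 + 3.0 * treated + rng.normal(size=n)   # treated units start from a higher baseline
trend = 1.5 + rng.normal(size=n)                # same average untreated trend in both groups
effect = 2.0 + rng.normal(size=n)               # heterogeneous effects with ATT = 2
y2 = y1 + trend + effect * treated

att_hat = did_estimator(y1, y2, treated)
naive = y2[treated].mean() - y2[~treated].mean()   # cross-sectional comparison in period 2
print(round(att_hat, 2), round(naive, 2))
```

The DiD estimate should land near the true ATT of 2, while the naive period-2 comparison also picks up the baseline gap of 3 between the groups.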

Regression Representation: Coefficients

Can frame DiD estimator differently

Define “coefficients”: \[ \begin{aligned} \alpha_i & = Y_{i1}^0, \\ \gamma & = \E[Y_{i2}^0 - Y_{i1}^0 |T], \\ \delta & = \E[Y_{i2}^1 - Y_{i2}^0 |T]. \end{aligned} \]

By parallel trends, can also have \(\gamma = \E[Y_{i2}^0 - Y_{i1}^0 |U]\)

Regression Representation: Outcomes

Recall \(D_{it}\). Define indicator of second period: \[ SP_{it} = \I\curl{t=2} \]

Can express outcomes as \[ Y_{it} = \begin{cases} \alpha_i + U_{it}, & SP_{it} = 0, D_{i2} \in \{0, 1\}, \\ \alpha_i + \gamma + U_{it}, & SP_{it} = 1, D_{i2} = 0, \\ \alpha_i + \gamma + \delta + U_{it}, & SP_{it} = 1, D_{i2} = 1, \end{cases} \] What is the \(U_{it}\)?

TWFE Representation

Compact form of representation: \[ Y_{it} = \alpha_i + \gamma SP_{it} + \delta D_{it} + U_{it} \tag{3}\] Equation 3 is called a two-way fixed effects (TWFE) model (more on that in the next lecture)

Can eliminate \(\alpha_i\) by taking first differences across time to get \[ Y_{i2}- Y_{i1} = \gamma + \delta D_{i2} + U_{i2} \tag{4}\]

Regression Representation: Result

Proposition 2 (Canonical DiD is TWFE) The OLS estimator \(\hat{\delta}\) in Equation 4 satisfies \[ \hat{\delta} = \widehat{ATT}^{DiD}, \] for the ATT estimator of Equation 2.

Proof: by brute force evaluating the OLS estimator, see exercise set 3
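A quick numerical check of Proposition 2 (not a proof) on simulated differenced outcomes. All numbers are hypothetical. The regression is solved with `np.linalg.lstsq` rather than a regression package, to keep the sketch self-contained.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
d2 = (rng.random(n) < 0.3).astype(float)     # D_{i2}: 1 if treated in period 2
dy = 1.0 + 2.0 * d2 + rng.normal(size=n)     # Y_{i2} - Y_{i1}, hypothetical values

# OLS of the differenced outcome on a constant and D_{i2} (Equation 4)
X = np.column_stack([np.ones(n), d2])
gamma_hat, delta_hat = np.linalg.lstsq(X, dy, rcond=None)[0]

# DiD estimator of Equation 2 computed from the same differences
att_did = dy[d2 == 1].mean() - dy[d2 == 0].mean()

print(np.isclose(delta_hat, att_did), np.isclose(gamma_hat, dy[d2 == 0].mean()))
```

The slope coincides with the difference in mean changes, and the intercept coincides with the mean change among the untreated.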

Regression Representation: Discussion

Like for event studies: managed to get linear regression in a nonparametric context
Consistency of OLS = consistency of \(\widehat{ATT}^{DiD}\) (exercise: prove that parallel trends is equivalent to strict exogeneity of \(D_{it}\))
Regression results: can use all results for OLS, including limit theory and inference

Limitations

Two obvious ones:

Need parallel trends
We did not identify average effect for the untreated and the overall ATE

Can relax a bit if we have another “control” group — DDD identification and estimator (see Cunningham (2021) or Wooldridge (2020))

Some Extensions

Adding Covariates I: Conditioning

What if there are extra covariates \(\bX_i = (\bX_{i1}, \bX_{i2})\)?

Can relax parallel trends to conditional parallel trends: for all (or some) values \(\bx=(\bx_1, \bx_2)\) \[ \small \E[Y_{i2}^0 - Y_{i1}^0 |T, \bX_{i}=\bx] = \E[Y_{i2}^0 - Y_{i1}^0 |U, \bX_{i}=\bx] \]

Same argument as before: identify conditional ATT \[\small CATT(\bx) = \E[Y_{i2}^1 - Y_{i2}^0 |T, \bX_{i}=\bx] \]

If parallel trends hold for all possible \(\bx\), can also identify the overall (\(\bX\)-unconditional) ATT
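A sketch of how the overall ATT can be recovered from conditional DiD: estimate \(CATT(\bx)\) within each covariate cell, then average over the distribution of the covariate among the treated. The simulation is hypothetical. It uses a single discrete covariate with three values, and untreated trends that depend on it, so that unconditional parallel trends fail while conditional parallel trends hold.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 60_000
x = rng.integers(0, 3, size=n)                              # discrete covariate (hypothetical)
treated = rng.random(n) < np.array([0.2, 0.5, 0.8])[x]      # treatment probability depends on x
trend = np.array([0.0, 1.0, 2.0])[x] + rng.normal(size=n)   # untreated trend depends on x only
effect = np.array([1.0, 2.0, 3.0])[x]                       # true CATT(x)
dy = trend + effect * treated                               # Y_{i2} - Y_{i1}

# Unconditional DiD: biased here, since trends vary with x and x differs across groups
did_uncond = dy[treated].mean() - dy[~treated].mean()

# Conditional DiD within each cell, then average with weights P(X = x | T)
cells = range(3)
catt_hat = np.array([
    dy[treated & (x == v)].mean() - dy[~treated & (x == v)].mean() for v in cells
])
weights = np.array([(x[treated] == v).mean() for v in cells])
att_hat = catt_hat @ weights

att_true = effect[treated].mean()
print(np.round(catt_hat, 2), round(att_hat, 2), round(att_true, 2), round(did_uncond, 2))
```

With a continuous or high-dimensional covariate, cell means are no longer feasible and some form of modeling or smoothing is needed. That is outside this sketch.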

Adding Covariates II: Assuming Linearity

Often see linearity assumptions on the causal model in the form: \[ \begin{aligned} Y_{it}^d = Y_{it}^0 + d(Y_{it}^1- Y_{it}^0) + \bX_{it}'\bbeta_i \end{aligned} \]

Can estimate ATT consistently from TWFE regression \[ Y_{i2}-Y_{i1} = \gamma + \delta D_{i2} + (\bX_{i2}-\bX_{i1})'\bbeta + U_{i2}, \] if \(\bbeta\) appropriately defined (think of OLS estimator under heterogeneous coefficient causal model)

Adding More Periods: Setting

Can accommodate more periods of data

\(T\geq 2\)
Two groups: never treated (\(U\)) and treated, treatment starts between \(t_0-1\) and \(t_0\)
Treatment indicators \(D_{it, \tau}\), \(t, \tau=1,\dots, T\):
- Untreated units have \(D_{it, \tau}=0\) for all \(t, \tau\)
- Treated units \(D_{it, \tau}=1\) if and only if \(t=\tau\) and \(t\geq t_0\)

Adding More Periods: TWFE

Multi-period specification: \[ \small \begin{aligned} Y_{it} & = \alpha_i + \gamma_t + \sum_{\tau=t_0}^T \delta_\tau D_{it, \tau} + U_{it},\\ \alpha_i & = Y_{i1}^0, \\ \gamma_t & = \E[Y_{it}^0- Y_{i1}^0|T],\\ \delta_t & = \E[Y_{it}^1- Y_{it}^0|T] \end{aligned} \] OLS consistent for dynamic ATTs \(\delta_t\), \(t\geq t_0\), if for all \(t\)
\[ \small \E[Y_{it}^0- Y_{i1}^0|T]=\E[Y_{it}^0- Y_{i1}^0|U] \]
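A minimal sketch of the multi-period regression on simulated data, estimated by least squares on explicit unit, time, and treatment-period dummies. All values (`n`, `n_periods`, `t0`, the effects) are hypothetical. The dummy-variable approach is chosen for transparency and is not necessarily how panel packages compute the estimator.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_periods, t0 = 300, 4, 3                 # units, periods, first treated period (hypothetical)
treated = rng.random(n) < 0.5
alpha = rng.normal(size=n) + 2.0 * treated   # unit effects, correlated with treatment status
gamma = np.array([0.0, 0.5, 1.0, 1.5])       # common time effects, first period normalized to 0
delta = {3: 1.0, 4: 2.0}                     # true dynamic ATTs for t >= t0

unit_eye, time_eye = np.eye(n), np.eye(n_periods)
rows, y = [], []
for i in range(n):
    for t in range(1, n_periods + 1):
        d = bool(treated[i]) and t >= t0
        y.append(alpha[i] + gamma[t - 1] + (delta[t] if d else 0.0) + 0.1 * rng.normal())
        # D_{it,tau}: one dummy per treated period tau = t0, ..., n_periods
        treat_dummies = [float(d and t == tau) for tau in range(t0, n_periods + 1)]
        # Regressors: unit dummies, time dummies for t >= 2, treatment-period dummies
        rows.append(np.concatenate([unit_eye[i], time_eye[t - 1][1:], treat_dummies]))

X, y = np.array(rows), np.array(y)
coef = np.linalg.lstsq(X, y, rcond=None)[0]
delta_hat = coef[-(n_periods - t0 + 1):]     # last columns hold the D_{it,tau} coefficients
print(np.round(delta_hat, 2))
```

The estimated coefficients should be close to the true dynamic effects, because the simulated untreated trends are common to both groups.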

Adding More Groups

So far: had two groups (untreated and treated at some specific point in time)
In many settings: different units receive treatment in different periods (e.g. states implement same laws at different times)
Can learn with such staggered timing
Need to be careful, cannot just do TWFE, need more sophisticated approaches

See section 3 in Roth et al. (2023)

Empirical Application

Context and Setting

Context

Recall: interested in effect of raising minimum wage

We replicate the classic paper of Card and Krueger (1994). Background of their analysis:

New Jersey (NJ) raised its minimum wage in 1992
Neighboring Pennsylvania (PA) did not
NJ and PA touch in Philadelphia area — likely fairly similar populations there

Data Description

Card and Krueger (1994) collect data on fast food restaurants in NJ and PA before and after NJ minimum wage raise:

Units \(i\) are restaurants
Groups \(T\) and \(U\) are \(i\) in NJ and in PA, respectively
Outcome: number of “full-time equivalent” workers (sum of full-time, managers, and half of part-time workers)

Loading and Looking at the Data

Loading and processing the data

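import numpy as np
import pandas as pd
import statsmodels.api as sm
import linearmodels as lm

ck_data = pd.read_csv("data/card-krueger-1994.csv")

# Compute the number of full-time equivalent employees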
ck_data = ck_data.replace(".", np.nan)
ck_data = ck_data.astype("float64")
ck_data["emp_ft"] = (
    ck_data["empft"] + 
    0.5*ck_data["emppt"] +
    ck_data["nmgrs"]
)

# Extract only the necessary columns
ck_data = ck_data.loc[:, ["store", "state", "time", "emp_ft", "hoursopen"]]

# Insert any missing (store, time) rows
full_index = pd.MultiIndex.from_product(
    [ck_data["store"].unique(), ck_data["time"].unique()],
    names = ["store", "time"]
)
ck_data = (
    ck_data.set_index(["store", "time"])
        .reindex(full_index)
        .reset_index()
)

# Drop stores with missing employee numbers
stores_with_nan = (
    ck_data.groupby("store")["emp_ft"]
    .apply(lambda g: g.isnull().any())
)
stores_with_nan = stores_with_nan[stores_with_nan].index 
ck_data = (
    ck_data.loc[
        ~ck_data["store"].isin(stores_with_nan), 
        :
        ]
)

# Recode states and times more descriptively
states_dict = {0: "PA", 1: "NJ"}
ck_data["state_name"] = ck_data["state"].replace(states_dict)
time_dict = {0:"Before", 1:"After"}
ck_data["time_name"] = ck_data["time"].replace(time_dict)

# Set index
ck_data = ck_data.set_index(["store", "time_name"])
ck_data.head()

                 time  state  emp_ft  hoursopen state_name
store time_name
46.0  Before      0.0    0.0    40.5       17.0         PA
46.0  After       1.0    0.0    24.5       17.0         PA
49.0  Before      0.0    0.0    14.5       13.0         PA
49.0  After       1.0    0.0    11.5       13.0         PA
506.0 Before      0.0    0.0     8.5       10.0         PA

Estimation Results

Tabulating Averages

The key components of difference-in-differences are the (state, period)-specific average outcomes:

Computing (state, period)-averages

emp_means = ck_data.groupby(by=["state_name", "time_name"])["emp_ft"].mean()
print(emp_means)

state_name  time_name
NJ          After        20.925566
            Before       20.475728
PA          After        21.140000
            Before       23.440000
Name: emp_ft, dtype: float64

Applying DiD

With the averages in hand, DiD is easy to compute. Recall that NJ is the treated group. Applying DiD:

Applying DiD manually

( 
    (emp_means.loc[("NJ", "After")] - emp_means.loc[("NJ", "Before")])
    - (emp_means.loc[("PA", "After")] - emp_means.loc[("PA", "Before")])
).round(2)

np.float64(2.75)

Positive estimate! Under parallel trends, increasing the minimum wage led to an average gain of 2.75 full-time equivalent workers per fast food restaurant in NJ
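The same calculation can be laid out as a 2×2 table, which shows where the estimate comes from. The sketch re-enters the four group means printed in the output above instead of reloading the data. The name `printed_means` is introduced here only for illustration.

```python
import pandas as pd

# The four (state, period) means printed in the output above
printed_means = pd.Series(
    [20.925566, 20.475728, 21.140000, 23.440000],
    index=pd.MultiIndex.from_tuples(
        [("NJ", "After"), ("NJ", "Before"), ("PA", "After"), ("PA", "Before")],
        names=["state_name", "time_name"],
    ),
    name="emp_ft",
)

# Rows: states, columns: periods, plus the within-state change
table = printed_means.unstack("time_name")[["Before", "After"]]
table["Change"] = table["After"] - table["Before"]
did = table.loc["NJ", "Change"] - table.loc["PA", "Change"]

print(table.round(2))
print(round(did, 2))
```

Most of the estimate comes from the decline of 2.3 workers in PA. The NJ average itself rose by only about 0.45.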

Visualizing 1992 Employment Levels

Regression Implementation I

Easiest way to obtain standard error of estimated effect — use regression characterization

We have a choice

Differenced equation \[ Y_{i2} - Y_{i1} = \gamma + \delta D_{i2} + U_{i2} \]
Undifferenced equation \[ Y_{it} = \alpha_i + \gamma SP_{it} + \delta D_{it} + U_{it} \]

Estimation of Differenced Equation

Differenced equation — easy to estimate with OLS in statsmodels

Regression with differenced equation

endog = ck_data.groupby(by="store")["emp_ft"].diff().dropna()
exog = ck_data.loc[(slice(None),"After"), "state"]
exog.name = "Treated"
exog = sm.add_constant(exog)

# Run OLS
fitted_model = sm.OLS(endog, exog).fit(cov_type="HC0")
print(fitted_model.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 emp_ft   R-squared:                       0.015
Model:                            OLS   Adj. R-squared:                  0.012
Method:                 Least Squares   F-statistic:                     4.246
Date:                Fri, 06 Jun 2025   Prob (F-statistic):             0.0400
Time:                        08:19:28   Log-Likelihood:                -1385.7
No. Observations:                 384   AIC:                             2775.
Df Residuals:                     382   BIC:                             2783.
Df Model:                           1                                         
Covariance Type:                  HC0                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -2.3000      1.246     -1.847      0.065      -4.741       0.141
Treated        2.7498      1.335      2.061      0.039       0.134       5.365
==============================================================================
Omnibus:                       33.243   Durbin-Watson:                   2.194
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               95.234
Skew:                          -0.362   Prob(JB):                     2.09e-21
Kurtosis:                       5.330   Cond. No.                         4.32
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC0)

Estimation of Full TWFE Equation

Can estimate without differencing data ourselves using linearmodels or pyfixest (more details in next lecture)
Here use PanelOLS from linearmodels with entity_effects (\(\alpha_i\)) and time_effects (\(\gamma\))

Estimation using linearmodels

ck_data = ck_data.reset_index().set_index(["store", "time"])
endog = ck_data["emp_ft"]
exog = ck_data.index.get_level_values(1) * ck_data.loc[:, "state"]
exog.name = "Treated"

twfe_results = lm.PanelOLS(
  endog, 
  exog,  
  entity_effects=True, 
  time_effects=True, 
  drop_absorbed=True,
).fit(cov_type="robust")
print(twfe_results)

                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:                 emp_ft   R-squared:                        0.0147
Estimator:                   PanelOLS   R-squared (Between):              0.0865
No. Observations:                 768   R-squared (Within):              -0.0506
Date:                Fri, Jun 06 2025   R-squared (Overall):              0.0813
Time:                        08:19:28   Log-likelihood                   -2239.1
Cov. Estimator:                Robust                                           
                                        F-statistic:                      5.6903
Entities:                         384   P-value                           0.0175
Avg Obs:                       2.0000   Distribution:                   F(1,382)
Min Obs:                       2.0000                                           
Max Obs:                       2.0000   F-statistic (robust):             4.2238
                                        P-value                           0.0405
Time periods:                       2   Distribution:                   F(1,382)
Avg Obs:                       384.00                                           
Min Obs:                       384.00                                           
Max Obs:                       384.00                                           
                                                                                
                             Parameter Estimates                              
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
Treated        2.7498     1.3380     2.0552     0.0405      0.1191      5.3806
==============================================================================

F-test for Poolability: 3.5258
P-value: 0.0000
Distribution: F(384,382)

Included effects: Entity, Time

Discussion of Estimation Results

In both cases same point estimate: 2.75 workers
Nearly identical standard errors: 1.335 and 1.338
Effect is significantly different from zero (at 5% level), though evidence is not overwhelming
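As a sanity check on the significance statement, the test statistic, two-sided p-value, and 95% confidence interval can be approximately recomputed from the rounded coefficient and standard error in the differenced regression output. This is a sketch under the normal approximation. Small discrepancies with the printed table come from rounding of the inputs.

```python
import math

# "Treated" row of the differenced regression output (rounded values)
estimate, std_err = 2.7498, 1.335

z = estimate / std_err
p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value for a standard normal statistic
half_width = 1.959964 * std_err              # 97.5% standard normal quantile times the standard error
ci = (estimate - half_width, estimate + half_width)

print(round(z, 2), round(p_value, 3), tuple(round(c, 2) for c in ci))
```

The interval excludes zero, but only narrowly, which matches the reading that the evidence is not overwhelming.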

Including Covariates

Restaurants that are open longer tend to have more workers, so we want to control for opening hours
Easy to do with linearmodels, just add variable to exog
Find similar effect

Adding linear covariates to TWFE DiD regression

exog = pd.DataFrame(exog)
exog["hoursopen"] = ck_data["hoursopen"]
exog = exog.dropna() 
# Align index of endog with exog
endog = endog[exog.index]

twfe_results = lm.PanelOLS(endog, exog,  entity_effects=True, time_effects=True, drop_absorbed=True).fit(cov_type="robust")
print(twfe_results)

                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:                 emp_ft   R-squared:                        0.0312
Estimator:                   PanelOLS   R-squared (Between):              0.8836
No. Observations:                 761   R-squared (Within):              -0.0262
Date:                Fri, Jun 06 2025   R-squared (Overall):              0.8501
Time:                        08:19:28   Log-likelihood                   -2203.5
Cov. Estimator:                Robust                                           
                                        F-statistic:                      6.0137
Entities:                         384   P-value                           0.0027
Avg Obs:                       1.9818   Distribution:                   F(2,374)
Min Obs:                       1.0000                                           
Max Obs:                       2.0000   F-statistic (robust):             7.8111
                                        P-value                           0.0005
Time periods:                       2   Distribution:                   F(2,374)
Avg Obs:                       380.50                                           
Min Obs:                       377.00                                           
Max Obs:                       384.00                                           
                                                                                
                             Parameter Estimates                              
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
Treated        2.7702     1.3213     2.0967     0.0367      0.1722      5.3683
hoursopen      1.1348     0.3347     3.3904     0.0008      0.4767      1.7930
==============================================================================

F-test for Poolability: 2.2060
P-value: 0.0000
Distribution: F(384,374)

Included effects: Entity, Time

Recap and Conclusions

Recap

In this lecture we

Discussed a way to handle changes over time — parallel trends
Identified ATT for binary treatment with difference-in-differences strategy
Gave a regression (TWFE) characterization of DiD estimation
Proposed some extensions

Next Questions

How is the TWFE regression actually estimated?
What if the treatment is not binary?

References

Card, David, and Alan Krueger. 1994. “Minimum Wages and Employment: A Case Study of the Fast Food Industry in New Jersey and Pennsylvania.” American Economic Review 84 (4): 772–93. https://doi.org/10.3386/w4509.

Cunningham, Scott. 2021. Causal Inference: The Mixtape. Yale University Press. https://doi.org/10.2307/j.ctv1c29t27.

Hansen, Bruce. 2022. Econometrics. Princeton University Press.

Huntington-Klein, Nick. 2025. The Effect: An Introduction to Research Design and Causality. Chapman and Hall/CRC.

Neumark, David, J M Ian Salas, and William Wascher. 2014. “Revisiting the Minimum Wage—Employment Debate: Throwing Out the Baby with the Bathwater?” ILR Review 67 (3): 608–48. https://doi.org/10.1177/00197939140670S307.

Roth, Jonathan, Pedro H. C. Sant’Anna, Alyssa Bilinski, and John Poe. 2023. “What’s Trending in Difference-in-Differences? A Synthesis of the Recent Econometrics Literature.” Journal of Econometrics 235 (2): 2218–44. https://doi.org/10.1016/j.jeconom.2023.03.008.

Snow, John. 1856. “On the Mode of Communication of Cholera.” Edinburgh Medical Journal 1 (7): 668–70.

Wooldridge, Jeffrey M. 2020. Introductory Econometrics: A Modern Approach. Seventh edition. Boston, MA: Cengage.