Beyond Binary Treatments: Fixed Effects Estimators and their Properties
This lecture is about handling more general treatments in panel data using “fixed effect/random intercepts” estimators
By the end, you should be able to
How strongly does pollution affect labor market outcomes?
Cannot just regress labor market outcomes on overall pollution
What if we could control for this likelihood?
Recall: for difference-in-differences we showed that \[ \small \widehat{ATT}^{DiD} = \hat{\delta} \] where \(\hat{\delta}\) was the OLS estimator in the regression \[\small Y_{i2}- Y_{i1} = \gamma + \delta D_{it} + U_{i2} \tag{1}\] where \(\delta\) — the ATT; \(\gamma\) — average change in outcomes (trend)
Equation 1 obtained by differencing the two-way fixed effect equation: \[ Y_{it} = \alpha_i + \gamma_t + \delta D_{it} + U_{it}, \tag{2}\] where \[ \gamma_1 = 0, \quad \gamma_2 = \gamma \] and \(\alpha_i = Y_{i1}^0\) — baseline differences between units
Estimated Equation 2 by treating \(\alpha_i\) and \(\gamma_t\) as parameters, using `PanelOLS` from `linearmodels`
First question: what does “treating \(\alpha_i\) as parameters” mean?
For now forget about \(D_{it}\) and \(\gamma\) and consider: \[ Y_{it} = \alpha_i + U_{it}, \quad i=1,\dots, N; t=1, \dots, T \tag{3}\] \(\alpha_i\) — individual-specific intercept (“unit fixed effect”). Data assumed balanced (same \(T\) for all units)
Want to represent Equation 3 in vector-matrix form
Before that: more info on matrix forms for panel data.
Vector form as before: single observation (now fixed \(i\) and \(t\)) with vector of covariates: \[ Y_{it} = \bX_{it}'\bbeta + U_{it} \]
Two key matrix forms:
Individual level. Let \(\bY_i = (Y_{i1}, \dots, Y_{iT})'\), \(\bX_i = (\bX_{i1}, \dots, \bX_{iT})'\), then \[\small \bY_i = \bX_i\bbeta + \bU_i \]
Full sample. Let \(\bY = (\bY_1', \dots, \bY_N')'\), \(\bX= (\bX_1', \dots, \bX_N')'\). Then \[ \small \bY = \bX\bbeta + \bU \] What are the dimensions of \(\bY_i, \bX_i, \bY, \bX\)?
Model (3) in individual matrix form: \[ \bY_i = \mathbf{1}_T\alpha_i + \bU_i \] where \(\mathbf{1}_T\) — \(T\)-vector of ones
Not that insightful
Model (3) in full sample matrix form \[ \begin{aligned} \bY & = \bF \bLambda + \bU, \\ \bLambda & = (\alpha_1, \dots, \alpha_N)', \\ \bF & = \bI_N \otimes \mathbf{1}_T, \end{aligned} \tag{4}\] where \(\otimes\) is the Kronecker product. Intuition: the \(i\)-th column of \(\bF\) is the dummy for unit \(i\), so \(\bF\bLambda\) stacks each \(\alpha_i\) on top of itself \(T\) times
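A minimal numerical sketch of this construction (hypothetical small \(N\) and \(T\)):

```python
# Sketch: building F = I_N ⊗ 1_T for N = 3 units and T = 2 periods
import numpy as np

N, T = 3, 2
F = np.kron(np.eye(N), np.ones((T, 1)))   # (NT x N) matrix of unit dummies
Lam = np.array([1.0, 2.0, 3.0])           # (alpha_1, alpha_2, alpha_3)

print(F.shape)   # (6, 3)
print(F @ Lam)   # [1. 1. 2. 2. 3. 3.] -- each alpha_i repeated T times
```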
Now consider more general model: \[ Y_{it} = \alpha_i + \gamma_t + U_{it} \] Here want to treat both \(\alpha_i\) and \(\gamma_t\) as parameters
Individual matrix form: \[ \begin{aligned} \bY_i & = \mathbf{1}_T \alpha_i + \bI_T \bgamma + \bU_i\\ \bgamma & = (\gamma_1, \dots, \gamma_T)' \end{aligned} \]
Can write \[ \begin{aligned} \bY & = \bF\bLambda + \bU, \\ \bF & = \left(\bI_N \otimes \mathbf{1}_T, \mathbf{1}_N\otimes \bI_T \right)\\ \bLambda & = (\alpha_1, \dots, \alpha_N, \gamma_1, \dots, \gamma_T)' \end{aligned} \]
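The same idea in code for the two-way case (again hypothetical small \(N\), \(T\)): the first block of columns of \(\bF\) contains the unit dummies, the second the time dummies.

```python
# Sketch: the two-way design matrix F for N = 3, T = 2
import numpy as np

N, T = 3, 2
F = np.hstack([
    np.kron(np.eye(N), np.ones((T, 1))),   # unit dummies, NT x N
    np.kron(np.ones((N, 1)), np.eye(T)),   # time dummies, NT x T
])
print(F.shape)   # (6, 5), i.e. N + T columns
```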
Can write Equation 2 as \[ Y_{it} = \alpha_i + \gamma_t + \bX_{it}'\bbeta + U_{it}, \] for \(\bX_{it} = (D_{it})\) and \(\bbeta = (\delta)\)
More generally, consider any vector \(\bX_{it}\) — not just binary treatments
Its matrix form is \[ \bY = \bF\bLambda + \bX\bbeta + \bU \]
Definition 1 Models of the kind \[ \small \bY = \bF\bLambda +\bX\bbeta + \bU, \tag{5}\] where \(\bF\) is a matrix of 0s and 1s, are called fixed effects or random intercept models
Suppose \(\E[U_{it}|\bX_i]=0\). How to estimate parameters of Model (5)?
There are two main strategies:
LSDV (least squares dummy variable) — simply regress \(\bY\) on \((\bF, \bX)\): \[ (\hat{\bLambda}, \hat{\bbeta}^{LSDV}) = \argmin_{\bL, \bb} \norm{\bY - \bF\bL -\bX\bb }_2^2 \]
For example with two-way effects:
\[\small \begin{aligned} & \left(\hat{\alpha}_1, \dots, \hat{\alpha}_N, \hat{\gamma}_1, \dots, \hat{\gamma}_T, \hat{\bbeta}^{LSDV} \right)\\ & = \argmin_{a_1, \dots, a_N, g_1, \dots, g_T, \bb}\sum_{i=1}^N \sum_{t=1}^T \left(Y_{it} - a_i - g_t - \bX_{it}'\bb \right)^{2} \end{aligned} \]
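A rough sketch of LSDV on a simulated toy panel (the data-generating process and all names here are made up for illustration): build the dummies by hand and run OLS.

```python
# Sketch: LSDV on a simulated toy panel
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
N, T, beta = 200, 5, 1.5
df = pd.DataFrame({
    "unit": np.repeat(np.arange(N), T),
    "time": np.tile(np.arange(T), N),
})
alpha = rng.normal(size=N)[df["unit"].to_numpy()]
gamma = rng.normal(size=T)[df["time"].to_numpy()]
df["x"] = rng.normal(size=N * T) + alpha   # regressor correlated with alpha_i
df["y"] = alpha + gamma + beta * df["x"] + rng.normal(size=N * T)

# Regress y on a constant, unit dummies, time dummies, and x
D = pd.get_dummies(df[["unit", "time"]].astype("category"), drop_first=True)
Z = np.column_stack([np.ones(len(df)), D.to_numpy(dtype=float), df["x"].to_numpy()])
coefs, *_ = np.linalg.lstsq(Z, df["y"].to_numpy(), rcond=None)
print(coefs[-1])   # coefficient on x, close to 1.5
```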
First consider the one-way model \(Y_{it} = \alpha_{i} + \bX_{it}'\bbeta + U_{it}\). For \(W_{it} = Y_{it}, \bX_{it}, U_{it}\), define the (one-way) within-transformed version of \(W_{it}\) as \[ \small \tilde{W}_{it} = W_{it} - \dfrac{1}{T}\sum_{s=1}^T W_{is} \tag{6}\]
The within transformation eliminates the fixed effects (the \(\bF\) part): \[ \small \tilde{Y}_{it} = \tilde{\bX}_{it}'\bbeta + \tilde{U}_{it} \]
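In code, the one-way within transformation is just group-wise demeaning. A sketch on the same kind of simulated toy panel:

```python
# Sketch: one-way within transformation via groupby demeaning
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
N, T = 200, 5
unit = np.repeat(np.arange(N), T)
alpha = rng.normal(size=N)[unit]
x = rng.normal(size=N * T) + alpha
y = alpha + 1.5 * x + rng.normal(size=N * T)
df = pd.DataFrame({"unit": unit, "x": x, "y": y})

# Subtract unit means (equation 6), then OLS on the demeaned variables
x_til = df["x"] - df.groupby("unit")["x"].transform("mean")
y_til = df["y"] - df.groupby("unit")["y"].transform("mean")
print((x_til @ y_til) / (x_til @ x_til))   # within slope, close to 1.5
```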
Suppose \(Y_{it} = \alpha_i + \gamma_t + \bX_{it}'\bbeta + U_{it}\). Define (two-way) within-transformed variables as \[ \small \tilde{W}_{it} = W_{it} - \dfrac{1}{T}\sum_{s=1}^T W_{is} - \dfrac{1}{N} \sum_{j=1}^N W_{jt} + \dfrac{1}{NT} \sum_{j=1}^N \sum_{s=1}^T W_{js} \tag{7}\]
Again the fixed effects (\(\bF\)) are eliminated: \[ \small \tilde{Y}_{it} = \tilde{\bX}_{it}'\bbeta + \tilde{U}_{it} \]
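The two-way version subtracts unit means and time means and adds back the grand mean. A sketch (balanced toy panel, hypothetical DGP):

```python
# Sketch: two-way within transformation (equation 7) on a balanced toy panel
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
N, T = 200, 5
df = pd.DataFrame({
    "unit": np.repeat(np.arange(N), T),
    "time": np.tile(np.arange(T), N),
})
alpha = rng.normal(size=N)[df["unit"].to_numpy()]
gamma = rng.normal(size=T)[df["time"].to_numpy()]
df["x"] = rng.normal(size=N * T) + alpha
df["y"] = alpha + gamma + 1.5 * df["x"] + rng.normal(size=N * T)

def two_way_demean(s: pd.Series) -> pd.Series:
    # W_it - unit mean - time mean + grand mean
    return (s - s.groupby(df["unit"]).transform("mean")
              - s.groupby(df["time"]).transform("mean") + s.mean())

x_til, y_til = two_way_demean(df["x"]), two_way_demean(df["y"])
print((x_til @ y_til) / (x_til @ x_til))   # close to 1.5
```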
Can find the matrix formula in section 3.2 of Baltagi (2021)
Consider general model: \[ \small \bY = \bF\bLambda + \bX\bbeta + \bU \]
There exists a linear transformation that eliminates \(\bF\): \[ \small \tilde{\bY} = \tilde{\bX}\bbeta + \tilde{\bU} \]
Called the FWL or the (generalized) within transformation
Transformations — just an application of the Frisch-Waugh-Lovell (“anatomy of regression”) theorem. See E1a in Wooldridge (2020)
Within estimation: just regressing \(\tilde{\bY}\) on \(\tilde{\bX}\) with OLS: \[ \hat{\bbeta}^{W} = \argmin_{\bb} \sum_{i=1}^N \sum_{t=1}^T (\tilde{Y}_{it} - \tilde{\bX}_{it}'\bb)^2 \]
Proposition 1 \[ \hat{\bbeta}^{LSDV} = \hat{\bbeta}^{W} \]
Not examinable: proof is a consequence of the Frisch-Waugh-Lovell theorem (E1a in Wooldridge 2020)
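A quick numerical check of Proposition 1 on a toy panel (not a proof, just a sanity check): LSDV with explicit dummies and the two-way within estimator return the same coefficient.

```python
# Sketch: checking beta_LSDV == beta_W numerically on a tiny toy panel
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
N, T = 50, 4
df = pd.DataFrame({
    "unit": np.repeat(np.arange(N), T),
    "time": np.tile(np.arange(T), N),
    "x": rng.normal(size=N * T),
})
df["y"] = 1.5 * df["x"] + rng.normal(size=N * T)

# LSDV: constant + unit dummies + time dummies + x
D = pd.get_dummies(df[["unit", "time"]].astype("category"), drop_first=True)
Z = np.column_stack([np.ones(len(df)), D.to_numpy(dtype=float), df["x"].to_numpy()])
beta_lsdv = np.linalg.lstsq(Z, df["y"].to_numpy(), rcond=None)[0][-1]

# Within: two-way demeaning, then OLS slope without intercept
def demean(s: pd.Series) -> pd.Series:
    return (s - s.groupby(df["unit"]).transform("mean")
              - s.groupby(df["time"]).transform("mean") + s.mean())

x_til, y_til = demean(df["x"]), demean(df["y"])
beta_w = (x_til @ y_til) / (x_til @ x_til)
print(np.isclose(beta_lsdv, beta_w))   # True
```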
When to use LSDV vs. within estimation?
Sometimes impossible to compute LSDV estimator: number of fixed effects is too large to even simply store the data matrix:
Another special case of model (5) — pooled OLS:
`linearmodels` supports both LSDV and within transformations (e.g., `PanelOLS.fit(use_lsdv=True)`)
`pyfixest` was designed for high-dimensional FE estimation (can handle small examples too)
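A sketch of both routes on simulated data (the variable names and DGP are hypothetical): `linearmodels` via `PanelOLS` with `use_lsdv=True`, `pyfixest` via `feols` with the fixed effects after `|`.

```python
# Sketch: LSDV and within estimation on a simulated toy panel with both
# linearmodels and pyfixest
import numpy as np
import pandas as pd
from linearmodels.panel import PanelOLS
import pyfixest as pf

rng = np.random.default_rng(0)
N, T = 100, 6
df = pd.DataFrame({
    "unit": np.repeat(np.arange(N), T),
    "time": np.tile(np.arange(T), N),
})
alpha = rng.normal(size=N)[df["unit"].to_numpy()]
gamma = rng.normal(size=T)[df["time"].to_numpy()]
df["x"] = rng.normal(size=N * T) + alpha   # regressor correlated with alpha_i
df["y"] = alpha + gamma + 2.0 * df["x"] + rng.normal(size=N * T)

# linearmodels: entity/time effects in the formula; use_lsdv=True forces LSDV
panel_df = df.set_index(["unit", "time"])
fit_lm = PanelOLS.from_formula(
    "y ~ x + EntityEffects + TimeEffects", data=panel_df
).fit(use_lsdv=True)

# pyfixest: fixed effects go after the | in the formula (within estimation)
fit_pf = pf.feols("y ~ x | unit + time", data=df)

print(fit_lm.params["x"], fit_pf.coef()["x"])   # same estimate (Proposition 1)
```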
So far:
What are the causal properties of such estimators? Under which models do they give meaningful results?
Note:
Will consider two kinds of models under strict exogeneity
For definiteness, we do two-way effects, but can apply same analysis for any configuration of random intercepts, just need to define \(\tilde{Y}_{it}\) appropriately
Work in the following setting
Realized data satisfies \(\tilde{Y}_{it} = \tilde{\bX}_{it}'\bbeta + \tilde{U}_{it}\)
So as \(N\to\infty\) and \(T\) is fixed: \[ \small \hat{\bbeta}^{FE} \xrightarrow{p} \bbeta + \left( \E[\tilde{\bX}_{i}'\tilde{\bX}_i]\right)^{-1} \E[\tilde{\bX}_{i}'\tilde{\bU}_{i}] \]
\(T\) fixed — basically treat each unit as a single \(T\)-dimensional observation (as in \(\tilde{\bU}_i\))
So far we only needed to impose a nonsingularity condition: \(\E[\tilde{\bX}_{i}'\tilde{\bX}_i]\) must be invertible
What does it require of \(\bX_{it}\)?
For consistency want \[ \small \E[\tilde{\bX}_i'\tilde{\bU}_i] = \sum_{t=1}^T \E\left[\tilde{\bX}_{it} \tilde{U}_{it} \right] =0 \]
Sufficient that for all \(t\) \[ \small \E\left[\tilde{\bX}_{it} \tilde{U}_{it} \right] =0 \tag{9}\]
What does this condition require of \(\bX_{it}\) and \(U_{it}\)?
Under one-way transformation (6), Equation 9 becomes \[ \scriptsize \E\left[\tilde{\bX}_{it} \tilde{U}_{it} \right] = \E\left[ \bX_{it}U_{it} - \dfrac{\bX_{it}}{T}\sum_{s=1}^T U_{is} - \dfrac{U_{it}}{T}\sum_{r=1}^T \bX_{ir} + \dfrac{1}{T^2} \sum_{s=1}^T\sum_{r=1}^T \bX_{is} U_{ir}\right] = 0 \]
Here would be sufficient that for all \(t\), \(s\) \[ \small \E[\bX_{it}U_{is}] = 0 \tag{10}\] Intuition: \(\bX_{it}\) and \(U_{is}\) are uncorrelated across all points in time
Problematic direction is usually \(s<t\): predicting future \(\bX_{it}\) from past \(U_{is}\) (your shocks influence your future decisions)
What about beyond one-way effects? Usually impose an assumption that covers all cases — the panel data version of strict exogeneity:
Assumption (strict exogeneity): \[ \E[U_{it} | \bX_{i1}, \dots, \bX_{iT}] =0 \]
Much stronger than just \(\E[U_{it}|\bX_{it}]=0\)
Proposition 2 Let \(\E[U_{is} | \bX_{i1}, \dots, \bX_{iT}] =0\) for all \(s\). Then for any within transformation it holds for all \(t\) that \[ \E\left[\tilde{\bX}_{it} \tilde{U}_{it}\right] =0, \quad \text{and hence} \quad \E[\tilde{\bX}_{i}'\tilde{\bU}_i] =0 \]
See first block for key properties of conditional expectation
Proposition 3 Let (1) the data be IID across \(i\) with \(T\) fixed; (2) strict exogeneity hold: \(\E[U_{it} | \bX_{i1}, \dots, \bX_{iT}] =0\) for all \(t\); (3) \(\E[\tilde{\bX}_i'\tilde{\bX}_i]\) be invertible.
Then as \(N\to\infty\)
\[ \hat{\bbeta}^{FE} \xrightarrow{p} \bbeta \]
Proposition 4 Let the conditions of Proposition 3 hold, along with standard moment conditions.
Then as \(N\to\infty\)
\[ \scriptsize \sqrt{N}\left(\hat{\bbeta}^{FE} - \bbeta\right) \xrightarrow{d} N\left(0, \left(\E[\tilde{\bX}_i'\tilde{\bX}_i] \right)^{-1} \E[\tilde{\bX}_i'\tilde{\bU}_i\tilde{\bU}_i'\tilde{\bX}_i] \left(\E[\tilde{\bX}_i'\tilde{\bX}_i] \right)^{-1}\right) \]
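The middle matrix involves unit-level sums, which is why standard errors for FE estimators are typically clustered by unit. A sketch with linearmodels, reusing the toy `panel_df` built in the earlier example:

```python
# Sketch: unit-clustered (sandwich) standard errors with linearmodels,
# reusing the toy panel_df from the earlier sketch
from linearmodels.panel import PanelOLS

fit = PanelOLS.from_formula(
    "y ~ x + EntityEffects + TimeEffects", data=panel_df
).fit(cov_type="clustered", cluster_entity=True)
print(fit.std_errors)
```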
Consider different potential outcomes setting: \[ \small Y_{it}^{\bx} = \bx'\bbeta_{\textcolor{teal}{i}} + U_{it} \tag{11}\] under strict exogeneity \(\E[U_{it} | \bX_{i1}, \dots, \bX_{iT}] =0\)
Can still use the FE estimator: \[ \hat{\bbeta}^{FE} = \left(\sum_{i=1}^N \tilde{\bX}_i'\tilde{\bX}_i \right)^{-1}\sum_{i=1}^N \tilde{\bX}_i'\tilde{\bY}_i \] Can do any transformation such that \(\E[\tilde{\bX}_i'\tilde{\bX}_i]\) is invertible
What does \(\hat{\bbeta}^{FE}\) do under model (11)?
Substituting model (11) gets us \[ \small \begin{aligned} \hat{\bbeta}^{FE} & = \left(\sum_{i=1}^N \tilde{\bX}_i'\tilde{\bX}_i \right)^{-1}\sum_{i=1}^N \tilde{\bX}_i'\tilde{\bX}_i\bbeta_i + \left(\sum_{i=1}^N \tilde{\bX}_i'\tilde{\bX}_i \right)^{-1}\sum_{i=1}^N \tilde{\bX}_i'\tilde{\bU}_i\\ & \xrightarrow{p} \E\left[ \bW(\tilde{\bX}_i) \bbeta_i\right] \end{aligned} \] for \(\small\bW(\tilde{\bX}_i) = \left(\E\left[\tilde{\bX}_i'\tilde{\bX}_i\right] \right)^{-1} \tilde{\bX}_i'\tilde{\bX}_i\)
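A simulation sketch of this weighted-average result (hypothetical one-regressor DGP, one-way demeaning): units whose \(\tilde{\bX}_i\) varies more receive more weight, so the probability limit differs from the simple average of the \(\bbeta_i\).

```python
# Sketch: with heterogeneous beta_i, the FE estimator converges to a
# weighted average of beta_i with weights proportional to X~_i' X~_i
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
N, T = 20000, 4
beta_i = rng.uniform(1.0, 3.0, size=N)   # unit-specific slopes
sigma_i = beta_i                         # sd of x made to correlate with beta_i

unit = np.repeat(np.arange(N), T)
x = rng.normal(size=N * T) * sigma_i[unit]
y = beta_i[unit] * x + rng.normal(size=N * T)   # no intercepts, for simplicity
df = pd.DataFrame({"unit": unit, "x": x, "y": y})

x_til = df["x"] - df.groupby("unit")["x"].transform("mean")
y_til = df["y"] - df.groupby("unit")["y"].transform("mean")
beta_fe = (x_til @ y_til) / (x_til @ x_til)

ssx = (x_til ** 2).groupby(df["unit"]).sum()   # X~_i' X~_i for each unit
print(beta_fe)                                 # FE estimate
print(np.average(beta_i, weights=ssx))         # close to beta_fe
print(beta_i.mean())                           # simple average differs
```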
Can generalize random intercept models to the form \[ Y_{it}^{\bx} = \balpha_i'\bgamma_t + \bx'\bbeta_i + U_{it} \]
How does pollution affect labor market outcomes?
Use data from Borgschulte, Molitor, and Zou (2024)
```python
# Load data
import pandas as pd

columns_to_load = [
    "countyfip",          # FIPS
    "rfrnc_yr",           # Year
    "rfrnc_qtroy",        # Quarter
    "d_pc_qwi_payroll",   # Earnings (annual diff)
    "hms_deep",           # Number of smoke days
    "fe_countyqtroy",     # County-by-quarter-of-year FE id
    "fe_styr",            # State-by-year FE id
    "fe_stqtros",         # State-by-quarter FE id
    "seer_pop",           # Population
]
county_df = pd.read_csv(
    "data/county_quarter.tab",
    sep="\t",
    usecols=columns_to_load,
)

# Rename columns
column_rename_dict = {
    "countyfip": "fips",
    "rfrnc_yr": "year",
    "rfrnc_qtroy": "quarter",
    "d_pc_qwi_payroll": "diff_payroll",
    "hms_deep": "smoke_days",
    "fe_countyqtroy": "fe_id_county_quarter",
    "fe_styr": "fe_id_state_year",
    "fe_stqtros": "fe_id_state_quarter",
    "seer_pop": "population",
}
county_df = county_df.rename(columns=column_rename_dict)

# View data
county_df.dropna(inplace=True)
county_df.head(2)
```
|  | fips | year | quarter | smoke_days | population | fe_id_state_year | fe_id_county_quarter | fe_id_state_quarter | diff_payroll |
|---|---|---|---|---|---|---|---|---|---|
| 4 | 1001 | 2007 | 1 | 0.0 | 52405.0 | 2 | 1 | 5 | 25.800293 |
| 5 | 1001 | 2007 | 2 | 9.0 | 52405.0 | 2 | 2 | 6 | 33.243042 |
Pollution — number of smoke days because of wildfires
Can download the data from the Harvard dataverse
Need to choose random intercepts/FEs so that strict exogeneity holds
Specification of Borgschulte, Molitor, and Zou (2024): include county-by-quarter-of-year and state-by-year intercepts (`fe_id_county_quarter`, `fe_id_state_year`)
About 13000 different random intercepts (a bit more complicated than \(\alpha_i\) and \(\gamma_t\))
Use `pyfixest` this time to estimate: `feols` for fixed effect estimation
Model formula goes in `fml`, random intercepts after `|`
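A sketch of what the estimation call might look like. The two fixed effects and the two-way clustering are taken from the output below; the population weighting is an assumption and may not match the exact specification.

```python
import pyfixest as pf

# Random intercepts after |; clustering matches the reported S.E. type below.
# The population weights are an assumption, not necessarily the exact spec.
fit = pf.feols(
    fml="diff_payroll ~ smoke_days | fe_id_county_quarter + fe_id_state_year",
    data=county_df,
    weights="population",
    vcov={"CRV1": "fips+fe_id_state_quarter"},
)
pf.etable([fit])
```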
|  | diff_payroll |
|---|---|
|  | (1) |
| coef |  |
| smoke_days | -5.217*** (0.774) |
| fe |  |
| fe_id_county_quarter | x |
| fe_id_state_year | x |
| stats |  |
| Observations | 160346 |
| S.E. type | by: fips+fe_id_state_quarter |
| R2 | - |

Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001. Format of coefficient cell: Coefficient (Std. Error)
In this lecture we introduced fixed effects (random intercept) models, discussed LSDV and within estimation and their equivalence, studied consistency and asymptotic normality under strict exogeneity, and applied FE estimation to the effect of wildfire smoke on labor market outcomes
Panel Data: Linear Panel Data Models