Learning Average Effects in Linear Panel Data Models
This lecture is about estimating the average of heterogeneous coefficients
By the end, you should be able to define the mean group (MG) estimator, state conditions under which it recovers \(\E[\bbeta_i]\), and implement it on real panel data
Mean group estimation rarely discussed in textbooks
Just two references:
Consider potential outcome model: \[ Y_{it}^{\bx} = \bx'\bbeta_{\textcolor{teal}{i}} + U_{it} \tag{1}\]
Heterogeneous coefficients realistic:
Want to learn \(\E[\bbeta_i]\)
Last time proved that under model (1) FE estimators estimate weighted average \[ \hat{\bbeta}^{FE} \xrightarrow{p} \E[\bW(\tilde{\bX}_i)\bbeta_i] \] Outside of RCTs usually \(\E[\bW(\tilde{\bX}_i)\bbeta_i]\neq \E[\bbeta_i]\)
This lecture — way to estimate \(\E[\bbeta_i]\) with panel data
Last time discussed relationship between pollution and labor market outcomes using US county-level data
What if pollution effect differs between counties?
Vector form of realized outcomes in model (1): \[ \bY_i = \bX_i\bbeta_i + \bU_i \] \(\bbeta_i\) — \(p\)-vector. Interested in \(\E[\bbeta_i]\)
Assumption: \(T\geq p\)
At least as many observations as there are coefficients
Key component — individual estimators \[ \hat{\bbeta}_i = (\bX_i'\bX_i)^{-1}\bX_i'\bY_i \]
Definition 1 The mean group (MG) estimator is defined as \[ \hat{\bbeta}^{MG} = \dfrac{1}{N}\sum_{i=1}^N \hat{\bbeta}_i \]
Cross-sectional average of individual OLS estimators
If \(\bX_i'\bX_i\) is singular for some units, can define the MG estimator over only the units with invertible \(\bX_i'\bX_i\) as
\[ \small \hat{\bbeta}^{MG, +} = \dfrac{1}{\sum_{i=1}^N \I\curl{\det(\bX_i'\bX_i)>0} } \sum_{i=1}^N \I\curl{\det(\bX_i'\bX_i)>0} \hat{\bbeta}_i \]
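In code, this trimmed estimator just skips units with a singular \(\bX_i'\bX_i\). A minimal sketch (the function name, data layout, and rank check are illustrative assumptions, not part of the application below):

```python
import numpy as np

def mg_plus(unit_data):
    """Trimmed MG estimator: average OLS estimates over units
    with invertible X_i'X_i.

    unit_data: iterable of (X_i, Y_i) pairs, X_i of shape (T, p).
    """
    estimates = []
    for X, Y in unit_data:
        XtX = X.T @ X
        if np.linalg.matrix_rank(XtX) == X.shape[1]:  # keep only det(X_i'X_i) > 0
            estimates.append(np.linalg.solve(XtX, X.T @ Y))
    return np.mean(estimates, axis=0)
```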
Key expansion: \[ \begin{aligned} \hat{\bbeta}^{MG} & = \dfrac{1}{N}\sum_{i=1}^N \bbeta_i + \dfrac{1}{N} \sum_{i=1}^N (\bX_i'\bX_i)^{-1}\bX_i'\bU_i \end{aligned} \]
Represent \(\hat{\bbeta}^{MG}\) as a continuous transformation of a single vector average: \[ \hat{\bbeta}^{MG} = \begin{pmatrix} \bI_p & \bI_p \end{pmatrix} \left[ \dfrac{1}{N} \sum_{i=1}^N \underbrace{\begin{pmatrix} \bbeta_i \\ (\bX_i'\bX_i)^{-1}\bX_i'\bU_i \end{pmatrix}}_{=\bW_i} \right] \]
Can now prove consistency and asymptotic normality results by applying LLN+CLT to \(\bW_i\), then use CMT (exercise)
Proposition 1 If (1) \((\bX_i, \bU_i, \bbeta_i)\) are iid across \(i\); (2) \(\E[(\bX_i'\bX_i)^{-1}\bX_i'\bU_i] = 0\); (3) \(\E[\norm{\bbeta_i}]<\infty\) and \(\E[\norm{(\bX_i'\bX_i)^{-1}\bX_i'\bU_i}]<\infty\)
Then as \(N\to\infty\), \(T\geq p\) fixed \[ \hat{\bbeta}^{MG} \xrightarrow{p} \E[\bbeta_i] \]
The condition \(\E[\norm{(\bX_i'\bX_i)^{-1}\bX_i'\bU_i}]<\infty\) is an assumption of sufficient variation in \(\bX_i\)
Proposition 2 If (1) \((\bX_i, \bU_i, \bbeta_i)\) are iid across \(i\); (2) \(\E[\bU_i|\bX_i, \bbeta_i] = 0\); (3) \(\E[\norm{\bbeta_i}^2]<\infty\) and \(\E[\norm{(\bX_i'\bX_i)^{-1}\bX_i'\bU_i}^2]<\infty\)
Then as \(N\to\infty\), \(T\geq p\) fixed \[\small \sqrt{N}\left(\hat{\bbeta}^{MG} - \E[\bbeta_i] \right) \xrightarrow{d} N\left(0, \var(\bbeta_i) + \var\left( (\bX_i'\bX_i)^{-1}\bX_i'\bU_i \right) \right) \]
Notice stronger form of strict exogeneity!
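A quick Monte Carlo sketch of both propositions (the normal design and the values of \(N\), \(T\), \(p\) below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, p = 2000, 10, 2

# Heterogeneous coefficients with E[beta_i] = (1.0, -0.5)
betas = rng.normal(loc=[1.0, -0.5], scale=0.5, size=(N, p))

individual_ests = np.empty((N, p))
for i in range(N):
    X = rng.normal(size=(T, p))  # regressors with sufficient variation
    U = rng.normal(size=T)       # errors independent of X_i and beta_i
    Y = X @ betas[i] + U
    individual_ests[i] = np.linalg.solve(X.T @ X, X.T @ Y)  # beta_i hat

beta_mg = individual_ests.mean(axis=0)                    # close to (1.0, -0.5)
se_mg = individual_ests.std(axis=0, ddof=1) / np.sqrt(N)  # Proposition 2 SEs
print(beta_mg, se_mg)
```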
Need \(\widehat{\avar}(\hat{\bbeta}^{MG})\) if we want to do inference on \(\E[\bbeta_i]\)
Correct answer — the simplest one: the sample variance of the individual estimators \[ \widehat{\avar}(\hat{\bbeta}^{MG}) = \dfrac{1}{N-1}\sum_{i=1}^N \left(\hat{\bbeta}_i - \hat{\bbeta}^{MG}\right)\left(\hat{\bbeta}_i - \hat{\bbeta}^{MG}\right)' \] since \(\var(\hat{\bbeta}_i) = \var(\bbeta_i) + \var\left((\bX_i'\bX_i)^{-1}\bX_i'\bU_i\right)\) under the assumptions of Proposition 2
A pretty obvious result that follows directly from the LLN and CLT. But it is useful to express the MG estimator in terms of the “primitive” data \((\bX_i, \bU_i)\)
Recall: studying relationship between pollution and earnings using approach of Borgschulte, Molitor, and Zou (2024)
Data available from the Harvard Dataverse
Last time assumed that two random intercepts sufficient to capture important unobserved components:
Unit of analysis effectively county in given season (e.g. King County in summer is a single \(i\))
After that assumed that the effect of pollution was the same across places
What if pollution effects differ between counties and seasons?
Our approach to estimating \(\E[\beta_i]\) with the mean group estimator: (1) subtract state-year averages from the outcome and the smoke variable; (2) run OLS separately for each county-season unit to obtain \(\hat{\beta}_i\); (3) average the \(\hat{\beta}_i\)
Step (1) is not part of MG estimation; it is included here due to our model
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Load data
columns_to_load = [
"countyfip", # FIPS
"rfrnc_yr", # Year
"rfrnc_qtroy", # Quarter
"d_pc_qwi_payroll", # Earnings (annual diff)
"hms_deep", # Number of smoke days
"fe_countyqtroy",
"fe_styr",
"fe_stqtros",
"seer_pop", # Population
]
county_df = pd.read_csv(
"data/county_quarter.tab",
sep="\t",
usecols=columns_to_load,
)
# Rename columns
column_rename_dict = {
"countyfip":"fips",
"rfrnc_yr":"year",
"rfrnc_qtroy":"quarter",
"d_pc_qwi_payroll":"diff_payroll",
"hms_deep":"smoke_days",
"fe_countyqtroy":"fe_id_county_quarter",
"fe_styr":"fe_id_state_year",
"fe_stqtros":"fe_id_state_quarter",
"seer_pop":"population",
}
county_df = county_df.rename(columns=column_rename_dict)
# Drop missing rows
county_df.dropna(inplace=True)
# Subtract state-year averages
county_df[["diff_payroll_no_state_year", "smoke_days_no_state_year"]] = (
county_df.groupby("fe_id_state_year")[["diff_payroll", "smoke_days"]]
.transform(lambda x: x-x.mean())
)
Mean group not implemented anywhere (to the best of my knowledge), but very easy to do by hand with a `for`-loop
A small function for running individual regressions:
def run_individual_regression(county_quarter_df: pd.DataFrame) -> np.float64:
"""
Compute the OLS coefficient for 'smoke_days_no_state_year' on 'diff_payroll'
for a given county-quarter DataFrame.
Parameters
----------
county_quarter_df : pd.DataFrame
DataFrame containing at least the columns 'diff_payroll' and 'smoke_days_no_state_year'.
Returns
-------
np.float64
Estimated coefficient for 'smoke_days_no_state_year'.
Raises
------
KeyError
If required columns are missing.
ValueError
If regression cannot be performed (e.g., insufficient data).
"""
try:
y = county_quarter_df['diff_payroll_no_state_year']
X = sm.add_constant(county_quarter_df['smoke_days_no_state_year'])
model = sm.OLS(y, X)
results = model.fit()
coef = results.params['smoke_days_no_state_year']
return coef
except KeyError as e: # Column missing
raise ValueError(f"Required column missing: {e}")
except ValueError as e: # Regression-related issues
raise ValueError(f"Regression could not be performed: {e}")
Apply OLS to each \(i\):
ind_ests = (
county_df.groupby("fe_id_county_quarter")
.apply(run_individual_regression, include_groups=False)
)
ind_ests.name = "ind_estimates"
# Merge back with data
ind_ests = (
county_df[['fips', 'fe_id_county_quarter', 'quarter']]
.drop_duplicates()
.merge(ind_ests, left_on="fe_id_county_quarter", right_index=True)
)
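With the individual estimates in hand, the MG point estimate and its standard error are just the sample mean and the scaled sample standard deviation. A minimal sketch using the objects constructed above (Q3 is the quarter discussed next):

```python
# MG estimate and its standard error for Q3
q3_ests = ind_ests.loc[ind_ests["quarter"] == 3, "ind_estimates"].dropna()
mg_q3 = q3_ests.mean()
se_q3 = q3_ests.std(ddof=1) / np.sqrt(len(q3_ests))
print(f"MG estimate (Q3): {mg_q3:.2f}, SE: {se_q3:.2f}")
```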
Our mean group estimates in Q3 (~summer)
Negative and not far from FE value
But:
Large standard deviation! Cannot reject \(\E[\beta_i]=0\); we do not recover the finding of Borgschulte, Molitor, and Zou (2024) in this model
import matplotlib.pyplot as plt
import seaborn as sns

# Set colors
BG_COLOR = "whitesmoke"
FONT_COLOR = "black"
GEO_COLOR = "rgb(201, 201, 201)"
OCEAN_COLOR = "rgb(136, 136, 136)"
# Select quarter and extract data
QUARTER = 3
plot_data = ind_ests.loc[ind_ests["quarter"]==QUARTER, "ind_estimates"]
fig, ax = plt.subplots(figsize=(14, 6.5))
# Add density
sns.kdeplot(
plot_data,
ax=ax,
linewidth=3,
label='Density of $\\hat{\\beta}$',
)
# Add line at 0 and mean
ax.axvline(0, color='gray', linestyle='--', linewidth=0.5, label='')
ax.axvline(plot_data.mean(), linestyle='--', linewidth=0.5, label='$E[\\beta]$')
# Colors
fig.patch.set_facecolor(BG_COLOR)
ax.set_facecolor(BG_COLOR)
ax.spines["top"].set_color(FONT_COLOR)
ax.spines["bottom"].set_color(FONT_COLOR)
ax.spines["left"].set_color(FONT_COLOR)
ax.spines["right"].set_color(FONT_COLOR)
ax.tick_params(colors=FONT_COLOR)
# Limits, legend, text
ax.set_xlim([-150, 100])
ax.legend()
ax.set_title('Density of Individual Estimators (Q3 Effects)', color=FONT_COLOR, loc='left', weight='bold')
ax.set_xlabel('')
plt.show()
Density of \(\hat{\beta}_i\) is wide, which could be due to two reasons: genuine heterogeneity (large \(\var(\beta_i)\)) or estimation noise in each \(\hat{\beta}_i\) (large \(\var((\bX_i'\bX_i)^{-1}\bX_i'\bU_i)\), since \(T\) is small)
In the first case collecting more \(T\) would not help
The variance \(\var(\beta_i)\) can actually be identified and estimated; see my notes (Morozov 2025, section 7)
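The idea, in brief, follows from the same decomposition used in Proposition 2 (a sketch; the full argument is in the referenced notes): \[ \var(\hat{\bbeta}_i) = \var(\bbeta_i) + \var\left( (\bX_i'\bX_i)^{-1}\bX_i'\bU_i \right), \] so subtracting an estimate of the second (sampling-noise) term from the sample variance of the \(\hat{\bbeta}_i\) leaves an estimate of \(\var(\bbeta_i)\)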
import geopandas as gpd
import plotly.express as px

# Load the shapefile
shapefile_path = "data/us_counties/USA_Counties.shp"
gdf = gpd.read_file(shapefile_path)
gdf = gdf.to_crs(epsg=4326)
gdf["geometry"] = gdf["geometry"].buffer(0)
ind_ests["fips"] = ind_ests["fips"].astype(str).str.zfill(5)
merged_data = gdf.merge(ind_ests, left_on="FIPS", right_on="fips", how="right")
merged_data = merged_data.loc[merged_data["quarter"]==QUARTER,["NAME", "fips", "geometry", "ind_estimates"]]
merged_data = merged_data.set_index('fips')
# Define custom range limits
vmin, vmax = -5, 5 # Trim extreme values while keeping a balance
# Clip data values to avoid excessive white in the middle
merged_data["scaled_value"] = np.clip(merged_data["ind_estimates"], vmin, vmax)
# Create choropleth map
fig = px.choropleth(merged_data,
geojson=merged_data.geometry,
locations=merged_data.index,
color="scaled_value",
scope="usa",
color_continuous_scale="PuOr",
custom_data=["NAME", "ind_estimates"],
color_continuous_midpoint=0,
width=1150, height=550,
)
# Apply theme settings
fig.update_layout(
font_family="Arial",
plot_bgcolor=BG_COLOR,
paper_bgcolor=BG_COLOR,
font=dict(color=FONT_COLOR),
title=dict(
text="<b>Effect of an Additional Day of Wildfire Smoke on Quarterly Earnings (Q3 Effects)</b>",
x=0.03,
xanchor="left",
y=0.97,
yanchor="top",
font=dict(color=FONT_COLOR, size=20)
),
margin=dict(l=20, r=20, t=60, b=20),
coloraxis_colorbar=dict(
title=dict(
text="Smoke days",
font=dict(color=FONT_COLOR, size=12),
side="right"
),
tickfont=dict(color=FONT_COLOR, size=10),
len=0.7,
thickness=20,
x=1.02,
yanchor="middle",
y=0.5
)
)
fig.update_geos(fitbounds="locations", visible=False,
bgcolor=BG_COLOR, # Map background
landcolor=GEO_COLOR, # Land color
lakecolor=OCEAN_COLOR, # Water color
showocean=True,
oceancolor=OCEAN_COLOR, )
fig.update_traces(
hovertemplate="<b>%{customdata[0]}</b><br>Estimated effect: earnings change by %{customdata[1]:.2f} dollars"
)
fig.show()
If the map on this slide is not showing up, reload the page
In this lecture we defined the mean group estimator, established its consistency and asymptotic normality with fixed \(T\), and applied it to estimate the average effect of wildfire smoke on earnings