3  Within (Fixed Effects) Estimator and Heterogeneous Coefficients

Summary and Learning Outcomes

This section shows that the within (fixed effects) estimator consistently estimates average coefficients only under narrow assumptions.

By the end of this section, you should

  • Refresh your knowledge of within (fixed effects) estimation.
  • Learn when the within estimator gives consistent estimates of average coefficients in heterogeneous models.
  • See the first example of heterogeneity bias in this course.

3.1 Introduction

As noted in the previous section, in this block we will focus our attention on the linear panel data model with unit-specific heterogeneous coefficients (Equation 2.7): \[ y_{it} = \bbeta_i'\bx_{it} + u_{it}. \tag{3.1}\]

Our first key parameter of interest is the average coefficient vector \(\E[\bbeta_i]\) — the linear model analog to the average treatment effect.

3.1.1 Focus: Workhorse Estimators under Heterogeneity

Can existing workhorse estimators for linear panel data models — the within (fixed effects) and dynamic panel IV estimators — correctly estimate \(\E[\bbeta_i]\)? Of course, those methods were developed in the context of the simpler random intercept model (2.8) \[ y_{it} = \alpha_i + \bbeta'\bx_{it} + u_{it}. \tag{3.2}\] However, if they can also handle the more general Equation 3.1, then all the better for us — we do not need to develop any new methods.

Unfortunately, such standard estimators usually fail when coefficients vary across units, as we will demonstrate. This failure arises both for static and for dynamic formulations of Equation 3.1.

3.1.2 This Section: Static Model and Fixed \(T\)

In this section, we consider the static case and the associated workhorse estimator — the within (fixed effects) estimator. By “static”, we mean that the model does not include lagged dependent variables as regressors. In addition, we assume strict exogeneity \[ \E[\bu_i|\bX_i]=0. \]

Throughout, we will assume that the number \(N\) of cross-sectional units is potentially large, while the number \(T\) of data points per unit is fixed. This setup is more reflective of typical micropanels.

3.2 Recall: Within Estimator in the Random Intercept Model

To begin, we will briefly go through the construction and the properties of the within estimator under the random intercept model (3.2).

3.2.1 Construction

Within (“fixed effects”) estimation of the random intercept model has two steps:

  1. Apply the within transformation to each unit.
  2. Apply OLS to the resulting pooled data.

3.2.1.1 Step 1: Within Transformation

To perform the within transformation, we first average the equations for unit \(i\) across \(t\). Label the average of \(y_{it}\) across \(t\) for unit \(i\) as

\[ y_{i\cdot} = \frac{1}{T} \sum_{t=1}^{T} y_{it}. \]

The averaged outcome \(y_{i\cdot}\) satisfies the averaged equation

\[ y_{i\cdot} = \alpha_i + \bbeta' \bx_{i\cdot} + u_{i\cdot}, \] where \(\bx_{i\cdot}\) and \(u_{i\cdot}\) are defined analogously to \(y_{i\cdot}\).

Define the within-transformed variables by subtracting this averaged equation from the original equation for each \(t\): \[ \tilde{y}_{it} = y_{it} - y_{i\cdot}, \] with \(\tilde{\bx}_{it}\) and \(\tilde{u}_{it}\) defined analogously.

If Equation 3.2 is true, the within-transformed variables follow the within-transformed equation:

\[ \tilde{y}_{it} = \bbeta' \tilde{\bx}_{it} + \tilde{u}_{it}. \tag{3.3}\]

Under Equation 3.2, the within transformation eliminates the individual random intercepts \(\alpha_i\). Equation 3.3 now looks like a regular homogeneous regression.

3.2.1.2 Step 2: OLS on the Within-Transformed Equation

The within (fixed effects) estimator is obtained by simply pooling the data across \(i\) and \(t\) in Equation 3.3 and applying OLS to it. Specifically, the estimator is given by

\[ \hat{\bbeta}^W = \left(\sum_{i=1}^{N} \tilde{\bX}_i' \tilde{\bX}_i \right)^{-1} \sum_{i=1}^{N} \tilde{\bX}_i' \tilde{\by}_i. \tag{3.4}\]
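The two-step construction can be sketched in a few lines of code. The following is a minimal illustration on simulated data; the function name `within_estimator` and the data layout (a balanced panel stored as NumPy arrays) are our assumptions, not part of the text.

```python
import numpy as np

def within_estimator(y, X):
    """Within (fixed effects) estimator for a balanced panel.
    y: (N, T) outcomes; X: (N, T, K) regressors."""
    y_til = y - y.mean(axis=1, keepdims=True)      # step 1: within transformation
    X_til = X - X.mean(axis=1, keepdims=True)
    X_pool = X_til.reshape(-1, X.shape[2])         # step 2: pool over (i, t) ...
    y_pool = y_til.reshape(-1)
    return np.linalg.solve(X_pool.T @ X_pool, X_pool.T @ y_pool)  # ... and run OLS

# Simulated random intercept model (3.2) with intercepts correlated with x
rng = np.random.default_rng(0)
N, T, K = 5_000, 4, 2
beta = np.array([1.0, -0.5])
alpha = rng.normal(size=(N, 1))
X = alpha[:, :, None] + rng.normal(size=(N, T, K))  # x correlated with alpha_i
y = alpha + X @ beta + rng.normal(size=(N, T))
print(within_estimator(y, X))   # close to [1.0, -0.5] despite the correlation
```

Note that the simulated intercepts are correlated with the regressors, which would bias pooled OLS in levels but does not affect the within estimator.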

3.2.2 Properties

The within estimator enjoys several desirable properties under the random intercept model, most of which can be derived from its sampling error representation \[ \hat{\bbeta}^W = \bbeta + \left(\sum_{i=1}^{N} \tilde{\bX}_i' \tilde{\bX}_i \right)^{-1} \sum_{i=1}^{N} \tilde{\bX}_i' \tilde{\bu}_i. \tag{3.5}\]

  • \(\hat{\bbeta}^W\) is unbiased for \(\bbeta\). To show this, it is sufficient to notice that strict exogeneity of \(\bu_i\) with respect to \(\bX_i\) implies strict exogeneity of \(\tilde{\bu}_i\) with respect to \(\tilde{\bX}_i\): \[ \E[\tilde{\bu}_i|\tilde{\bX}_i] = \E[\E[\tilde{\bu}_i|\bX_i]|\tilde{\bX}_i] = 0. \] It follows that the mean of the second term in Equation 3.5 is 0, and so \[ \E[\hat{\bbeta}^W] = \bbeta. \]

  • \(\hat{\bbeta}^W\) is consistent for \(\bbeta\) and asymptotically normal, provided a standard rank condition holds for \(\tilde{\bX}_i\): \[ \hat{\bbeta}^W \xrightarrow{p} \bbeta, \quad \sqrt{N}(\hat{\bbeta}^W - \bbeta) \Rightarrow N(0, \Sigma). \tag{3.6}\]

Since \(\bbeta\) is the average coefficient vector in this homogeneous model, we conclude that the within estimator consistently estimates average coefficients under the random intercept model (3.2).

3.3 Adding Heterogeneous Coefficients

However, there is usually no theoretical reason for the slopes \(\bbeta\) to be homogeneous. Assuming slope homogeneity sits uneasily with acknowledging heterogeneity through the random intercepts \(\alpha_i\) in the first place. It is more realistic to consider the more general Equation 3.1. Accordingly, we now turn to studying the properties of \(\hat{\bbeta}^W\) in this more realistic setting.

3.3.1 Sampling Error Form

Applying the within transformation to the heterogeneous Equation 3.1 yields another version of the within-transformed equation:

\[ \tilde{y}_{it} = \bbeta_i' \tilde{\bx}_{it} + \tilde{u}_{it}. \]

Note that now the individual heterogeneity is not eliminated! The heterogeneous coefficients \(\bbeta_i\) remain in the equation.

The within estimator on the above equation may then be represented as \[ \begin{aligned} \hat{\bbeta}^W & = \left(\sum_{i=1}^N \tilde{\bX}_i'\tilde{\bX}_i \right)^{-1} \sum_{i=1}^N \tilde{\bX}_i' \tilde{\by}_i \\ & = \left(\sum_{i=1}^N \tilde{\bX}_i'\tilde{\bX}_i \right)^{-1} \sum_{i=1}^N \tilde{\bX}_i' \tilde{\bX}_i\bbeta_i + \left(\sum_{i=1}^N \tilde{\bX}_i'\tilde{\bX}_i \right)^{-1} \sum_{i=1}^N \tilde{\bX}_i' \tilde{\bu}_i. \end{aligned} \] Note how this sampling error representation differs from Equation 3.5. The first term is now a weighted average of the individual coefficients, with weights that depend on the second moments of the within-transformed explanatory variables. This may be viewed as a kind of variance weighting across units.

Does the within estimator target \(\E[\bbeta_i]\)? To proceed, we decompose \(\bbeta_i\) into a common mean component \(\E[\bbeta_i]\) and an idiosyncratic deviation \(\bEta_i\): \[ \bbeta_i = \E[\bbeta_i] + \bEta_i. \]

With this representation, we can further analyze the within estimator as \[ \begin{aligned} \hat{\bbeta}^W & = \E[\bbeta_i] + \left( \dfrac{1}{N}\sum_{i=1}^N \tilde{\bX}_i'\tilde{\bX}_i \right)^{-1} \dfrac{1}{N}\sum_{i=1}^N \tilde{\bX}_i' \tilde{\bX}_i\bEta_i \\ & \phantom{ = \E[\bbeta_i] } + \left( \dfrac{1}{N}\sum_{i=1}^N \tilde{\bX}_i'\tilde{\bX}_i \right)^{-1} \dfrac{1}{N}\sum_{i=1}^N \tilde{\bX}_i' \tilde{\bu}_i\\ & \xrightarrow{p} \E[\bbeta_i] + \left(\E\left[\tilde{\bX}_i'\tilde{\bX}_i \right] \right)^{-1} \E\left[\tilde{\bX}_i'\tilde{\bX}_i\bEta_i \right] \\ & \phantom{ = \E[\bbeta_i] } + \left(\E\left[\tilde{\bX}_i'\tilde{\bX}_i \right] \right)^{-1} \E\left[\tilde{\bX}_i'\tilde{\bu}_i \right]\\ & = \E[\bbeta_i] + \left(\E\left[\tilde{\bX}_i'\tilde{\bX}_i \right] \right)^{-1} \E\left[\tilde{\bX}_i'\tilde{\bX}_i\bEta_i \right], \end{aligned} \] where we have assumed that a suitable law of large numbers applies as \(N\to\infty\) with \(T\) fixed, and where \(\E\left[\tilde{\bX}_i'\tilde{\bu}_i \right]=0\) by strict exogeneity, as above.

3.3.2 Conditions for Estimating Average Coefficients

The above representation shows that the within estimator does not estimate \(\E[\bbeta_i]\) unless the following orthogonality condition holds: \[ \E\left[\tilde{\bX}_i'\tilde{\bX}_i\bEta_i \right] =0. \tag{3.7}\] Even though \(\E[\bEta_i]=0\), this condition does not necessarily hold if \(\bbeta_i\) and \(\bX_i\) are allowed to be dependent.

If condition (3.7) holds, the within estimator is consistent for \(\E[\bbeta_i]\) in the heterogeneous coefficient model (3.1). If it fails, the estimator is biased. The difference between the estimand and \(\E[\bbeta_i]\) is known as heterogeneity bias (see Campello, Galvao, and Juhl (2019) for the linear case).

This orthogonality condition (3.7) can be hard to interpret. A simpler sufficient condition is mean independence of the coefficients given the within-transformed covariates: \[ \E[\bEta_i|\tilde{\bX}_i] = 0. \tag{3.8}\] Under this condition, \(\E[\tilde{\bX}_i'\tilde{\bX}_i\bEta_i] =0\), and thus the within estimator is consistent for \(\E[\bbeta_i]\). These conditions were proposed by Wooldridge (2003, 2005) (see also Murtazashvili and Wooldridge (2008) for the IV within estimator).

Conditions (3.7) and (3.8) restrict the dependence structure between \(\bbeta_i\) and \(\bx_{it}\). Such conditions are sometimes called correlated random effects (CRE) in the literature. CRE assumptions lie between fixed effects (FE) frameworks — which do not restrict the dependence — and random effects (RE) — which assume that unobserved components are independent of the observed ones.

3.3.3 Intuition

How can we interpret condition (3.8)? Intuitively, it requires that the changes in \(\bx_{it}\) over time are uncorrelated with the individual coefficients.

To see this interpretation, it is helpful to think of the following example framework. Suppose that \(\bx_{it}\) is stationary, that is, its distribution does not depend on \(t\). Decompose \(\bx_{it}\) as \[ \bx_{it} = \E[\bx_{it}|\bbeta_i] + \bxi_{it}, \] where \(\bxi_{it}\) is a mean-zero deviation. By stationarity, \(\E[\bx_{it}|\bbeta_i]\) does not depend on \(t\).

Then the within-transformed variables satisfy \[ \tilde{\bx}_{it} = \tilde{\bxi}_{it}. \]

The “systematic” component \(\E[\bx_{it}|\bbeta_i]\) is not present in \(\tilde{\bx}_{it}\). Only the deviations across time \(\tilde{\bxi}_{it}\) are left. Condition (3.8) requires that \(\tilde{\bxi}_{it}\) and \(\bEta_i\) are unrelated on average. Note that it permits an arbitrary relationship between \(\bbeta_i\) and \(\E[\bx_{it}|\bbeta_i]\).

As an example, suppose that we are working with consumption data. A consumer knows their marginal utilities \(\bbeta_i\) of consuming more of a variety of products. With this knowledge, they choose the optimal desired level of consumption — \(\E[\bx_{it}|\bbeta_i]\). When they try to buy this level of products, they may encounter some frictions \(\bxi_{it}\) which cause them to deviate from \(\E[\bx_{it}|\bbeta_i]\) — in rough words, the supermarket might not have their favorite cereal. If these “frictions” are uncorrelated with \(\bbeta_i\), then condition (3.8) holds and consistency follows. In the consumption example, this means that unpredictable deviations in short-term choices do not necessarily bias the estimation of overall preferences.
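The role of the decomposition can be checked numerically. Below is a minimal simulation sketch; the DGP is hypothetical and chosen only for illustration. In both designs the level of \(x_{it}\) equals \(\beta_i\) plus noise, so the coefficient and the level of the covariate are strongly dependent. In case A the time deviations are independent of \(\beta_i\), so condition (3.8) holds; in case B the scale of the deviations depends on \(\eta_i\), which violates condition (3.7).

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 20_000, 4
eta = rng.normal(0.0, 0.5, size=N)   # eta_i = beta_i - E[beta_i]
beta = 0.5 + eta                     # heterogeneous slopes, E[beta_i] = 0.5
alpha = rng.normal(size=N)

def within_slope(scale):
    """Within estimate when x_it = beta_i + scale_i * z_it."""
    z = rng.normal(size=(N, T))
    x = beta[:, None] + scale[:, None] * z           # level of x depends on beta_i
    y = alpha[:, None] + beta[:, None] * x + rng.normal(size=(N, T))
    x_til = x - x.mean(axis=1, keepdims=True)        # within transformation
    y_til = y - y.mean(axis=1, keepdims=True)
    return (x_til * y_til).sum() / (x_til ** 2).sum()

# Case A: deviations independent of beta_i, condition (3.8) holds
print(within_slope(np.ones(N)))      # close to E[beta_i] = 0.5
# Case B: deviation scale depends on eta_i, condition (3.7) fails
print(within_slope(1.0 + eta))       # noticeably above 0.5
```

Case B is biased because units with larger \(\eta_i\) also have more time variation in \(x_{it}\), so they receive more weight in the variance-weighted average that the within estimator computes.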

3.3.4 Why Panel Data is Useful

It is useful to contrast condition (3.8) with the stronger condition that \[ \E[\bEta_i|\bX_i]=0. \tag{3.9}\] Note the difference in the conditioning sets.

Condition (3.9) is stronger than (3.8) by the tower property of conditional expectation. Intuitively, we can compute \(\tilde{\bX}_i\) from \(\bX_i\), but not vice versa. The requirement that \(\E[\bEta_i|\bX_i]=0\) may be very strong, since it would in general require that \(\E[\bx_{it}|\bbeta_i]\) does not depend on \(\bbeta_i\): systematic dependence is ruled out even on average.
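Formally, since \(\tilde{\bX}_i\) is a function of \(\bX_i\), condition (3.9) implies condition (3.8) in one line:

```latex
\E[\bEta_i|\tilde{\bX}_i]
  = \E\left[\E[\bEta_i|\bX_i] \middle| \tilde{\bX}_i\right]
  = \E\left[0 \middle| \tilde{\bX}_i\right]
  = 0.
```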

This contrast also highlights the advantages of panel data. If you want to consistently estimate \(\E[\bbeta_i]\) using OLS and cross-sectional data, you need the very strong condition (3.9), or at least that \(\E[\bx_i\bx_i'\bEta_i]=0\). With panel data, weaker conditions are sufficient, and systematic dependence between \(\bbeta_i\) and \(\bx_{it}\) is possible.

3.3.5 Example

We conclude this section with a small illustration of the results in a tractable model (see this blog post for a simulation with some dramatic examples). Specifically, we consider the following panel model with two periods, coefficient heterogeneity, and a scalar regressor: \[ y_{it} = \alpha_i + \beta_i x_{it} + u_{it}, \quad t=1,2, \]

where \((x_{i1}, x_{i2}, \beta_i)\) are jointly normal:

\[ \begin{aligned} \begin{pmatrix} x_{i1}\\ x_{i2}\\ \beta_i \end{pmatrix} \sim N\left( \begin{pmatrix} 1\\ 2\\ 0.5 \end{pmatrix}, \begin{pmatrix} 1 & \rho_{1, \beta}\rho_{2, \beta} & \rho_{1, \beta}\\ \rho_{1, \beta}\rho_{2, \beta} & 1 & \rho_{2, \beta}\\ \rho_{1, \beta} & \rho_{2, \beta} & 1 \end{pmatrix} \right), \end{aligned} \tag{3.10}\] where the various \(\rho\) parameters are correlations. The covariance between \(x_{i1}\) and \(x_{i2}\) equals \(\rho_{1, \beta}\rho_{2, \beta}\), so that the two are correlated only through \(\beta_i\). The distribution of \(\alpha_i\) does not have to be specified, as it is not involved in the consistency conditions. Likewise, we only need to assume that \(\E[u_{it}|x_{i1}, x_{i2}]=0\) without further specifying the distribution of \(u_{it}\).

Since \(T=2\), the within transformation is equivalent (up to scale) to first differencing, which yields the following equation: \[ y_{i2} - y_{i1} = \beta_i(x_{i2}-x_{i1}) + (u_{i2}-u_{i1}). \] The mean independence condition (3.8) takes the form \[ \E[ \beta_i|x_{i2} - x_{i1}] = 0.5. \]

It is not difficult to work out from the conditional mean formula for jointly normal vectors (Brockwell and Davis 2016, A.3.1) that \[ \E[ \beta_i|x_{i2} - x_{i1}] = 0.5 + \frac{\rho_{2, \beta}- \rho_{1, \beta}}{2(1-\rho_{1, \beta}\rho_{2, \beta})}(x_{i2} - x_{i1} - 1), \] since the covariance of \(\beta_i\) with \(x_{i2}-x_{i1}\) is \(\rho_{2, \beta}-\rho_{1, \beta}\) and the variance of \(x_{i2}-x_{i1}\) is \(2(1-\rho_{1, \beta}\rho_{2, \beta})\). Thus, \(\E[ \beta_i|x_{i2} - x_{i1}]=0.5\) holds only if \[ \rho_{2, \beta} = \rho_{1, \beta}. \tag{3.11}\] In this case, the covariance between \(x_{it}\) and \(\beta_i\) does not change over time.

If condition (3.11) holds, the within estimator is consistent. It is also possible to show that if condition (3.11) fails, so does the more general condition (3.7), and the within estimator is inconsistent. We represent this fact in Figure 3.1, where we fix \(\rho_{1, \beta}=0.25\) and vary \(\rho_{2, \beta}\) between \(-1\) and \(1\). Observe that the within estimator is consistent if \(\rho_{2, \beta} = \rho_{1, \beta} = 0.25\). Otherwise, it is biased, potentially severely. For \(\rho_{2, \beta}\leq -0.6\), the estimand of the within estimator even has a different sign from that of \(\E[\beta_i]\).

Figure 3.1: The within estimator under coefficient heterogeneity. Consistency and inconsistency for the average coefficient under data generating process (3.10)
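The behavior shown in Figure 3.1 is straightforward to reproduce by simulation. The sketch below draws from the DGP (3.10) and computes the within estimator (for \(T=2\), the first difference estimator); the function name and sample size are our choices, not from the text.

```python
import numpy as np

def within_two_period(rho1, rho2, n=400_000, seed=0):
    """Within (= first difference) estimate in the two-period model
    y_it = alpha_i + beta_i * x_it + u_it under the Gaussian DGP (3.10)."""
    rng = np.random.default_rng(seed)
    mean = [1.0, 2.0, 0.5]
    cov = [[1.0, rho1 * rho2, rho1],
           [rho1 * rho2, 1.0, rho2],
           [rho1, rho2, 1.0]]
    x1, x2, b = rng.multivariate_normal(mean, cov, size=n).T
    du = rng.normal(size=n) - rng.normal(size=n)   # u_i2 - u_i1
    dy = b * (x2 - x1) + du                        # alpha_i cancels in differences
    dx = x2 - x1
    return (dx @ dy) / (dx @ dx)

print(within_two_period(0.25, 0.25))   # close to 0.5: consistent when rho1 = rho2
print(within_two_period(0.25, -0.8))   # negative, although E[beta_i] = 0.5
```

The second call reproduces the sign reversal visible in Figure 3.1 for strongly negative \(\rho_{2, \beta}\).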

Next Section

In the next section, we turn to the dynamic case and briefly introduce the “standard” dynamic panel instrumental variable estimators.