PAC Learning, Bias-Complexity Trade-Off, and the No Free Lunch Theorem
This lecture gives a framework for thinking about learning algorithms and hypothesis classes.
By the end, you should be able to:
Think in terms of learning algorithms:
Example of learning algorithm: ERM
Want \(\Hcal\) and algorithms that “tend to” return \(\hat{h}_S(\cdot)\in \Hcal\) with “low” \(R(\hat{h}_S)\)
This lecture: how to think about this matter
Special name for lowest possible risk:
Definition 1 The Bayes risk is defined as \[ R^* = \inf_{h} R(h), \] where the infimum is taken over all hypotheses \(h(\cdot)\) for which \(R(h)\) is well-defined.
If \(h\) is such that \(R(h)=R^*\), this \(h\) is known as the Bayes predictor
Note: \(R^*\) depends on the distribution of the data
Infimum instead of minimum because a risk-minimizing \(h\) may not exist; the infimum is always well-defined
Example: let \(Y\) be the Heads/Tails outcome of a fair (50/50) coin, independent of \(\bX\)
Here \(R^*=0.5\) under the indicator loss: \(\bX\) is useless for predicting \(Y\)
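A quick simulation makes the point concrete (a minimal sketch, not part of the lecture; the "always Heads" and "random guess" rules are just two illustrative predictors):

```python
import random

random.seed(0)

# Fair coin: Y is Heads/Tails with probability 1/2 each, independent of X.
# Under the indicator (0-1) loss, every predictor has risk 1/2, so R* = 0.5.
n = 100_000
ys = [random.choice(["H", "T"]) for _ in range(n)]

# A fixed rule (always predict "H") attains risk ~0.5
risk_always_heads = sum(y != "H" for y in ys) / n

# A random-guess rule does no better
risk_random = sum(y != random.choice(["H", "T"]) for y in ys) / n

print(round(risk_always_heads, 2), round(risk_random, 2))  # both ≈ 0.5
```

No rule, however clever, can push the risk below the Bayes risk of 0.5 here.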
True model: \[ Y = \bX'\bbeta + U \] where \(U\) is independent of \(\bX\) with \(\E[U]=0, \var(U) =\sigma^2\). Under the squared loss, the Bayes predictor is \(h(\bx)=\bx'\bbeta\) and \(R^*=\sigma^2\)
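A simulation sketch of this example (scalar \(X\), made-up values \(\beta=2\), \(\sigma=1\); not part of the lecture): under squared loss, the predictor \(h(x)=x\beta\) attains risk \(\sigma^2\), and any other linear predictor does worse.

```python
import random

random.seed(1)
beta, sigma = 2.0, 1.0
n = 200_000

# Simulate Y = X*beta + U with U ~ N(0, sigma^2) independent of X
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [x * beta + random.gauss(0, sigma) for x in xs]

def mse(h):
    """Empirical squared-error risk of predictor h."""
    return sum((y - h(x)) ** 2 for x, y in zip(xs, ys)) / n

# The Bayes predictor under squared loss is h(x) = x*beta, with risk sigma^2
bayes_risk = mse(lambda x: x * beta)

# A perturbed coefficient gives strictly larger risk
worse_risk = mse(lambda x: x * (beta + 0.5))

print(round(bayes_risk, 2), round(worse_risk, 2))
```

The gap between the two risks is the excess error of the perturbed predictor.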
Definition 2 The excess error of a hypothesis \(h\) is defined as \[ R(h) - R^* \]
Intuition: how well we are doing with \(h\) vs. how well we could do at best
Useful to decompose \[ \begin{aligned} R(h) - R^* = \underbrace{\left(R(h) - \inf_{h'\in\Hcal} R(h') \right)}_{\text{Estimation error}} + \underbrace{\left(\inf_{h'\in\Hcal} R(h')- R^* \right)}_{\text{Approximation error}} \end{aligned} \]
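The decomposition can be made concrete with a toy sketch (all choices made up for illustration): take \(\Hcal\) to be the constant predictors \(h_c(x)=c\) while the truth is linear, so both error terms are available in closed form.

```python
import random

random.seed(2)
sigma = 1.0

def draw(n):
    """Sample from Y = X + U, X ~ N(0,1), U ~ N(0, sigma^2)."""
    xs = [random.gauss(0, 1) for _ in range(n)]
    ys = [x + random.gauss(0, sigma) for x in xs]
    return xs, ys

# Under squared loss:
#   Bayes risk R* = sigma^2 (Bayes predictor is h(x) = x)
#   best-in-class risk inf_c R(h_c) = Var(Y) = 1 + sigma^2, attained at c = E[Y] = 0
# so the approximation error is Var(X) = 1, while the estimation error of the
# ERM choice c_hat = mean(ys) is R(h_{c_hat}) - inf_c R(h_c) = c_hat^2 = O(1/n).
_, ys = draw(1000)
c_hat = sum(ys) / len(ys)

bayes_risk = sigma ** 2
best_in_class = 1 + sigma ** 2
approx_error = best_in_class - bayes_risk   # = 1, fixed by the choice of Hcal
est_error = c_hat ** 2                      # small, shrinks as n grows

print(approx_error, round(est_error, 4))
```

Note that no amount of data fixes the approximation error; only enriching \(\Hcal\) does.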
\[ \text{Approximation error} = \inf_{h\in\Hcal} R(h)- R^* \]
Shows how well \(\Hcal\) can do in terms of risk
A richer \(\Hcal\) can generally get closer to the Bayes risk (e.g. by containing hypotheses closer to the Bayes predictor \(h_{Bayes}\))
Image: figure 4.2 in Mohri, Rostamizadeh, and Talwalkar (2018)
\[ \text{Estimation error} = R(h) - \inf_{h\in\Hcal} R(h) \]
Called estimation error because we usually care about \(R(\hat{h}) - \inf_{h\in\Hcal} R(h)\), where \(\hat{h}\) is selected based on data
Controlled by
How does estimation error depend on complexity of \(\Hcal\)?
Remember: estimation error is about choosing the best available option
Trade-off: less bias (approximation error) requires a more complex \(\Hcal\), which makes the estimation error harder to control
If approximation error dominates
Solution: use a richer \(\Hcal\)
Estimation error dominates if \(\hat{h}\) is chosen poorly
Solutions:
As mentioned before, approximation error is studied by approximation theory; it is not a statistical question
How is estimation error studied?
Pieces of learning:
\(\Acal\) selects some \(\hat{h}_S^{\Acal}\) from \(\Hcal\) after seeing \(S\)
Generalization error (risk) of \(\hat{h}^{\Acal}_S\): \[ R(\hat{h}^{\Acal}_S) = \E_{(Y, \bX)}\left[ l(Y, \hat{h}^{\Acal}_S(\bX)) \right] \]
PAC learning (Valiant 1984) combines these requirements
Definition 3 \(\Acal\) is a PAC learning algorithm if for any \(\varepsilon>0, \delta>0\) there exists a sample size \(m(\varepsilon, \delta)\) such that for all distributions of data \(\Dcal\) and all i.i.d. samples \(S\) of size at least \(m\) from \(\Dcal\) it holds that \[ P_{S}\left(R(\hat{h}^{\Acal}_S) - \min_{h\in\Hcal} R(h) \leq \varepsilon \right) \geq 1-\delta, \] where \(\hat{h}^{\Acal}_S\) is the hypothesis selected by \(\Acal\) from \(\Hcal\)
Simplified: the full definition also requires that the sample size \(m\) grow at most polynomially in \((1/\varepsilon, 1/\delta)\)
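The definition can be illustrated numerically (a toy sketch, not part of the formal development; the threshold class, the 10%-label-noise distribution, and \(\varepsilon = 0.05\) are all made up): for ERM over a small finite class, the probability that the excess risk over the best-in-class stays below \(\varepsilon\) approaches 1 as \(m\) grows.

```python
import random

random.seed(3)

# Finite hypothesis class: threshold classifiers h_t(x) = 1{x > t} on a grid
H = [lambda x, t=t: int(x > t) for t in [i / 10 for i in range(11)]]

def draw(m):
    """X ~ U[0,1], Y = 1{X > 0.35} with labels flipped with probability 0.1."""
    data = []
    for _ in range(m):
        x = random.random()
        y = int(x > 0.35)
        if random.random() < 0.1:
            y = 1 - y
        data.append((x, y))
    return data

def emp_risk(h, data):
    return sum(y != h(x) for x, y in data) / len(data)

def erm(data):
    """ERM: the hypothesis in H with the smallest empirical risk."""
    return min(H, key=lambda h: emp_risk(h, data))

# Monte Carlo estimate of P(R(h_hat) - min_h R(h) <= eps), using a large
# held-out sample to approximate the true risks
test_set = draw(20000)
best = min(emp_risk(h, test_set) for h in H)
results = {}
for m in (10, 100, 1000):
    hits = sum(emp_risk(erm(draw(m)), test_set) - best <= 0.05
               for _ in range(200))
    results[m] = hits / 200
print(results)
```

The success probability rises toward 1 with \(m\), matching the \((\varepsilon, \delta)\) guarantee for this (finite) class.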
So far: talking on the level of a single algorithm \(\Acal\)
Do we even need to study and use different algorithms?
Given a risk function \(R\), is there a universal learning algorithm?
Universal = always capable of returning low risk regardless of \(\Dcal\)
No free lunch theorems: no such algorithm exists
NFL theorems (Wolpert 1996) look like this:
Proposition 1 Let \(\Acal\) be any learning algorithm for binary classification under indicator loss with \(X\in \R\). Let \(m\) be any sample size. There exists some distribution \(\Dcal\) of data such that
Here all risks are evaluated under the “bad” distribution \(\Dcal\)!
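A toy simulation conveys the intuition (a sketch with a made-up "memorize and default to 0" learner, not the formal theorem): with uniformly random but fixed labels over a finite domain, the true labeling is perfectly predictable (Bayes risk 0), yet a learner that has seen only half the domain errs on about half of the unseen points.

```python
import random

random.seed(4)

# "Bad" distribution for this learner: domain of size 2m, labels assigned
# at random once and then fixed, so the Bayes risk is 0.
m = 200
domain = list(range(2 * m))
labels = {x: random.randint(0, 1) for x in domain}

# Learner sees m points, memorizes them, and defaults to 0 elsewhere
train = random.sample(domain, m)
memory = {x: labels[x] for x in train}

def predict(x):
    return memory.get(x, 0)

# Perfect on seen points, ~50% wrong on the unseen half => risk ~ 1/4
risk = sum(predict(x) != labels[x] for x in domain) / len(domain)
print(round(risk, 3))
```

The same gap (risk bounded away from the Bayes risk of 0) can be forced on any fixed algorithm by choosing the labeling adversarially, which is the content of the NFL theorems.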
You always need some prior (domain) knowledge for successful learning
Prior knowledge is expressed through selecting \(\Hcal\) and \(\Acal\), imposing penalties, choosing predictors, picking a network architecture, etc.
In this lecture we
With the theoretical foundations of the last three sections, we can now dive deeper into both theoretical and practical aspects:
Etc.
Learning Framework, Limits and Trade-Offs