Learning Framework, Limits and Trade-Offs

PAC Learning, Bias-Complexity Trade-Off, and the No Free Lunch Theorem

Vladislav Morozov

Introduction

Lecture Info

Learning Outcomes

This lecture gives a framework for thinking about learning algorithms and hypothesis classes


By the end, you should be able to

  • Discuss the bias-complexity trade-off of selecting the hypothesis class
  • Define the PAC learning framework and relate PAC learning to complexity
  • Intuitively state the no free lunch theorem of learning

References


  • Chapters 1–2 in James et al. (2023)
  • Deeper reading with details:
    • PAC learning: chapter 3 in Shalev-Shwartz and Ben-David (2014) or chapter 2 in Mohri, Rostamizadeh, and Talwalkar (2018)
    • No-free-lunch theorem: chapter 5 in Shalev-Shwartz and Ben-David (2014)

This Lecture

Learning Algorithms

Think in terms of learning algorithms:

  1. Algorithm \(\Acal\) sees sample \(S=\curl{(Y_1, \bX_1), \dots, (Y_N, \bX_N)}\) and hypothesis class \(\Hcal\)
  2. Algorithm returns some \(\hat{h}_{S}^{\Acal}(\cdot)\in\Hcal\)

Example of learning algorithm: ERM
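
To make this concrete, here is a minimal Python sketch of ERM over a finite class of threshold classifiers; the data-generating process and the hypothesis class are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sample S: Y = 1{X > 0.6}, with 10% of labels flipped
N = 200
X = rng.uniform(0, 1, N)
Y = ((X > 0.6) ^ (rng.uniform(size=N) < 0.1)).astype(int)

# Finite hypothesis class H: threshold classifiers h_t(x) = 1{x > t}
thresholds = np.linspace(0, 1, 101)

def empirical_risk(t):
    """Empirical indicator risk of h_t on the sample S."""
    return np.mean((X > t).astype(int) != Y)

# ERM: return the hypothesis in H with the smallest empirical risk
t_hat = min(thresholds, key=empirical_risk)
print(f"ERM threshold: {t_hat:.2f}, empirical risk: {empirical_risk(t_hat):.3f}")
```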

This Lecture

Want \(\Hcal\) and algorithms that “tend to” return \(\hat{h}_S(\cdot)\in \Hcal\) with “low” \(R(\hat{h}_S)\)

This lecture: how to think about this matter

  • Inherent trade-off in terms of complexity of \(\Hcal\)
  • Thinking about \(\Acal\) choosing from \(\Hcal\)
  • Is there a universally valid \(\Acal\)?

Bias-Complexity Tradeoff

Bayes Risk

Bayes Risk

Special name for lowest possible risk:

Definition 1 The Bayes risk is defined as \[ R^* = \inf_h R(h), \] where the infimum is taken over all \(h(\cdot)\) for which \(R(h)\) is defined.

If \(h\) is such that \(R(h)=R^*\), this \(h\) is known as the Bayes predictor

Note: \(R^*\) depends on the distribution of the data

Bayes Risk Example: Independent Coin

Let \(Y=\) Heads or Tails of a 50/50 coin

  • \(\bX\) — some variables that do not depend on \(Y\)
  • Risk — indicator risk


Here \(R^*=0.5\), \(\bX\) — useless for predicting

  • \(h(\bX)=\text{Tails}\) and \(h(\bX) = \text{Heads}\) both have \(R(h)=0.5\) (see the check below)
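
A quick Monte Carlo check of this example (simulated fair coin; the particular choices of \(h\) are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.integers(0, 2, 100_000)    # fair coin: 1 = Heads, 0 = Tails
X = rng.normal(size=100_000)       # independent of Y, hence useless

# Indicator risk of the two constant predictors: both near R* = 0.5
print("risk of h = Heads:", np.mean(Y != 1))
print("risk of h = Tails:", np.mean(Y != 0))
# Any h(X) does no better, e.g. predicting Heads when X > 0
print("risk of h(X) = 1{X > 0}:", np.mean(Y != (X > 0).astype(int)))
```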

Bayes Risk Example: Linear Model

True model: \[ Y = \bX'\bbeta + U \] \(U\) — independent of \(\bX\) with \(\E[U]=0, \var(U) =\sigma^2\)

  • Risk: MSE
  • Best you can do — know actual \(\bbeta\) (next lecture and exercise)
  • \(\Rightarrow\) \(R^* = R(\bbeta) = \sigma^2\)
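
Sketch of why: for any candidate \(h\), independence of \(U\) and \(\bX\) together with \(\E[U]=0\) makes the cross term vanish, so \[ R(h) = \E\left[ (\bX'\bbeta - h(\bX) + U)^2 \right] = \E\left[ (\bX'\bbeta - h(\bX))^2 \right] + \sigma^2 \geq \sigma^2, \] with equality exactly at \(h(\bX) = \bX'\bbeta\)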

Excess Error


Definition 2 The excess error of a hypothesis \(h\) is defined as \[ R(h) - R^* \]

Intuition: how well we are doing with \(h\) vs. how well we could do at best

Bias-Complexity Trade-Off

Bias-Complexity Trade-Off I: Decomposition

Useful to decompose \[ \begin{aligned} R(h) - R^* = \underbrace{\left(R(h) - \inf_{h'\in\Hcal} R(h') \right)}_{\text{Estimation error}} + \underbrace{\left(\inf_{h'\in\Hcal} R(h')- R^* \right)}_{\text{Approximation error}} \end{aligned} \]

  • Inserted \(\inf_{h'\in\Hcal} R(h')\): the best you can do within \(\Hcal\)
  • Expresses key trade-off — bias-complexity trade-off

Approximation Error I

\[ \text{Approximation error} = \inf_{h\in\Hcal} R(h)- R^* \]

Shows how good \(\Hcal\) can do in terms of risk

  • Not a statistical question, studied by approximation theory
  • Increasing size/richness of hypothesis class may lower approximation error (but not raise it)

Approximation Error II

  • Example: \(\Hcal_1 \subseteq \Hcal_2 \subseteq \dots \subseteq \Hcal_{\gamma} \subseteq \dots\)
  • Increasing \(\gamma\) — more complex class (e.g. allowing higher powers of polynomials)

Can generally get closer to the Bayes risk (e.g. by getting closer to the Bayes predictor \(h_{\text{Bayes}}\))
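
A rough numerical illustration of such a sequence (assumed toy setup: \(Y = \sin(2\pi X) + U\) with \(R^* = \sigma^2 = 0.25\); a very large sample is used to proxy the population risk of the best polynomial of each degree):

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(2)

# "Population" proxy: a very large sample from Y = sin(2*pi*X) + U, Var(U) = 0.25
n = 200_000
X = rng.uniform(-1, 1, n)
Y = np.sin(2 * np.pi * X) + rng.normal(0, 0.5, n)

# H_gamma: polynomials of degree <= gamma; the large-n least-squares fit
# approximates inf over H_gamma of the MSE risk
for gamma in [1, 3, 5, 9]:
    p = Polynomial.fit(X, Y, deg=gamma)
    risk = np.mean((Y - p(X)) ** 2)
    print(f"degree {gamma}: approx. best risk {risk:.3f} (Bayes risk R* = 0.25)")
```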

Estimation Error I

\[ \text{Estimation error} = R(h) - \inf_{h'\in\Hcal} R(h') \]

Called the estimation error because we usually care about \(R(\hat{h}) - \inf_{h'\in\Hcal} R(h')\), where \(\hat{h}\) is selected based on the data

Controlled by

  • Learning algorithm: how is \(\hat{h}\) chosen
  • Complexity of \(\Hcal\)
  • Amount of data available for learning

Estimation Error and Overfitting

How does estimation error depend on complexity of \(\Hcal\)?

  • In general: more difficult to control with larger \(\Hcal\)
  • Intuitively:
    • Sample is finite: only limited information
    • Harder to choose well with more options
  • Related to overfitting: choosing an overly complex \(\hat{h}\) with low risk on the training sample but poor generalization (see the simulation below)
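
A minimal simulation of both failure modes under the same toy setup as above: low degrees typically underfit, while a high degree fits the 30 training points closely but tends to generalize poorly:

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(3)

def draw(n):
    X = rng.uniform(-1, 1, n)
    return X, np.sin(2 * np.pi * X) + rng.normal(0, 0.5, n)

X_train, Y_train = draw(30)       # small training sample
X_test, Y_test = draw(10_000)     # large test sample proxies the true risk

for gamma in [1, 3, 9, 15]:
    p = Polynomial.fit(X_train, Y_train, deg=gamma)
    train = np.mean((Y_train - p(X_train)) ** 2)
    test = np.mean((Y_test - p(X_test)) ** 2)
    print(f"degree {gamma:2d}: train MSE {train:.3f}, test MSE {test:.3f}")
```

An intermediate degree balances the two error sources: the bias-complexity trade-off in action.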

Bias-Complexity Trade-Off II: Visually

  • Increasing complexity: approximation error \(\Downarrow\), estimation error \(\Uparrow\) (more or less)
  • Creates a trade-off
  • Finding the optimal point is an art: it depends on the specific problem (more on that later)

Underfitting

If approximation error dominates

  • \(\Hcal\) not complex enough
  • Called underfitting


Solution: use a richer \(\Hcal\)

Overfitting

Estimation error dominates if \(\hat{h}\) chosen poorly

  • Usual source: overfitting
  • May also be due to optimization issues

Solutions:

  • Use a less complex \(\Hcal\)
  • Add a penalty to the empirical risk/objective to punish more complex models (e.g. the \(L^1\) or \(L^2\) penalties from last time; see the sketch below)
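
As a sketch of the penalization idea, here is ridge regression in plain numpy (an \(L^2\) penalty added to the least-squares objective; the toy design and \(\lambda\) values are invented):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data: 30 observations, 20 predictors, only 3 of which matter
n, p = 30, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.0, 0.5]
Y = X @ beta + rng.normal(0, 1.0, n)

def ridge(lam):
    """Minimizer of ||Y - Xb||^2 + lam * ||b||^2 (L2-penalized ERM)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

for lam in [0.0, 1.0, 10.0]:
    b_hat = ridge(lam)
    print(f"lambda = {lam:4.1f}: estimation error ||b_hat - beta|| = "
          f"{np.linalg.norm(b_hat - beta):.3f}")
```

Larger \(\lambda\) shrinks the estimate toward zero, trading additional bias for reduced variance.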

PAC Learning

Motivation

Mentioned before: approximation error studied by approximation theory, not a statistical question


How is estimation error studied?

  • Part of “core” statistical learning theory
  • Basic framework for thinking — probably approximately correct (PAC) learning
  • Introduce it now and talk more about generalization

Reminder: Notation

Pieces of learning:

  • Algorithm \(\Acal\)
  • Sample \(S=\curl{(Y_1, \bX_1), \dots, (Y_N, \bX_N)}\)
  • Hypothesis class \(\Hcal\)


\(\Acal\) selects some \(\hat{h}_S^{\Acal}\) from \(\Hcal\) after seeing \(S\)

Randomness in Generalization and Estimation Errors

Generalization error (risk) of \(\hat{h}^{\Acal}_S\): \[ R(\hat{h}^{\Acal}_S) = \E_{(Y, \bX)}\left[ l(Y, \hat{h}^{\Acal}_S(\bX)) \right] \]

  • Expectation over new independent point \((Y, \bX)\), not over the sample (\(\hat{h}^{\Acal}_S\) fixed for the expectation)!
  • \(\Rightarrow R(\hat{h}^{\Acal}_S)\) is random because it depends on sample \(S\)
  • \(\Rightarrow\) estimation error \(R(\hat{h}^{\Acal}_S) - \inf_{h\in\Hcal}R(h)\) is also random (illustrated below)
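
A small Monte Carlo illustration of this randomness (same assumed toy setup as earlier; the training sample is redrawn 200 times and the risk of each fitted hypothesis is approximated on a large held-out sample):

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(5)

def draw(n):
    X = rng.uniform(-1, 1, n)
    return X, np.sin(2 * np.pi * X) + rng.normal(0, 0.5, n)

X_test, Y_test = draw(50_000)     # proxy for the population risk

# R(h_S) is a random variable: redraw S and refit to see it vary
risks = []
for _ in range(200):
    X_S, Y_S = draw(30)
    p = Polynomial.fit(X_S, Y_S, deg=5)
    risks.append(np.mean((Y_test - p(X_test)) ** 2))
print(f"R(h_S) over 200 samples: mean {np.mean(risks):.3f}, "
      f"sd {np.std(risks):.3f}, max {np.max(risks):.3f}")
```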

PAC: Intuitive Form

  • Want to have \(R(\hat{h}^{\Acal}_S) - \inf_{h\in\Hcal}R(h)\) small — less than some \(\varepsilon\) (to be approximately correct)
  • Want this to happen for most samples, i.e. for at least a share \(1-\delta\) of possible samples of size \(m\) (for it to happen probably)
  • Want it to hold regardless of the distribution of data


PAC learning (Valiant 1984) combines these requirements

PAC: Simplified Definition

Definition 3 \(\Acal\) is a PAC learning algorithm if for any \(\varepsilon>0, \delta>0\) there exists a sample size \(m\) such that for every data distribution \(\Dcal\), an i.i.d. sample \(S\) of size \(m\) from \(\Dcal\) satisfies \[ P_{S}\left(R(\hat{h}^{\Acal}_S) - \min_{h\in\Hcal} R(h) \leq \varepsilon \right) \geq 1-\delta, \] where \(\hat{h}^{\Acal}_S\) is the hypothesis selected by \(\Acal\) from \(\Hcal\)
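
A concrete instance: if \(\Hcal\) is finite and the loss takes values in \([0,1]\), then ERM is a PAC learning algorithm, and a sufficient sample size is \[ m(\varepsilon, \delta) = \left\lceil \frac{2\log(2|\Hcal|/\delta)}{\varepsilon^2} \right\rceil \] (a standard uniform-convergence result; see chapter 4 of Shalev-Shwartz and Ben-David (2014)). Note that \(m\) grows with the size of \(\Hcal\)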

PAC: Discussion

  • PAC learning algorithms are good because they control estimation error
  • Complexity and PAC:
    • If \(\Hcal\) is more complex, then \(m\) is larger — need larger samples to achieve same \(\varepsilon\) and \(1-\delta\)
    • Formal justification for why more complex \(\Hcal\) are harder to learn
  • “Bad” PAC results do not mean bad performance in practice (e.g. deep neural networks)

No Free Lunch Theorem

Motivation

So far: talking on the level of a single algorithm \(\Acal\)

  • Trade-offs you face in terms of bias vs. complexity
  • How to express the ability to control estimation error (PAC)


Do we even need to study and use different algorithms?

Given a risk function \(R\), is there a universal learning algorithm?

No Free Lunch: Intuitive Form

No free lunch theorems: no such algorithm exists

NFL theorems (Wolpert 1996) look like this:

  • Fix risk \(R(\cdot)\) and any algorithm \(\Acal\) that returns \(\hat{h}_S^{\Acal}\) after seeing a sample \(S\)
  • There exists a data distribution such that
    • \(R(\hat{h}_S^{\Acal})\) is “big”
    • There exists another \(h\) such that \(R(h)=0\)

No Free Lunch: Simple Example

Proposition 1 Let \(\Acal\) be any learning algorithm for binary classification under indicator loss with \(X\in \R\). Let \(m\) be any sample size. There exists some distribution \(\Dcal\) of the data such that

  • There exists a hypothesis \(h^*\) such that \(R(h^*)=0\)
  • With probability of at least \(1/7\) over samples \(S\) of size \(m\), \[ R(\hat{h}^{\Acal}_S) \geq 1/8. \]

No Free Lunch: Discussion

  • What NFL theorems show: every algorithm fails somewhere
  • What NFL theorems do not show:
    • NFL is not a statement that some problems are inherently harder (have higher Bayes risk)
    • It specifically constructs a failure of \(\Acal\) on a problem where another algorithm succeeds

No Free Lunch: Implications

You always need some prior (domain) knowledge for successful learning

Prior knowledge expressed in terms of selecting \(\Hcal\), \(\Acal\), imposing penalties, choosing predictors, network architecture, etc.

  • Also: NFL — no universal automatic way to select \(\Hcal\)
  • \(\Rightarrow\) no automatic way of finding the optimal bias-complexity trade-off

Recap and Conclusions

Recap


In this lecture we

  1. Discussed the bias-complexity trade-off
  2. Introduced PAC-learning
  3. Discussed limitations of learning in the form of the no free lunch theorems

Next Questions


With the theoretical foundations of the last three sections, we can now dive deeper into both theoretical and practical aspects:

  • What does a practical learning problem look like?
  • How do you estimate the risk of the chosen hypothesis?
  • Which algorithms are appropriate in which case?
  • How to select \(\Hcal\)?

Etc.

References

James, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan E. Taylor. 2023. An Introduction to Statistical Learning: With Applications in Python. Springer Texts in Statistics. Cham: Springer.
Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar. 2018. Foundations of Machine Learning. The MIT Press. https://doi.org/10.5555/3360093.
Shalev-Shwartz, Shai, and Shai Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge: Cambridge University Press.
Valiant, Leslie G. 1984. “A Theory of the Learnable.” Communications of the ACM 27 (11): 1134–42. https://doi.org/10.1145/1968.1972.
Wolpert, David H. 1996. “The Lack of A Priori Distinctions Between Learning Algorithms.” Neural Computation 8 (7): 1341–90. https://doi.org/10.1162/neco.1996.8.7.1341.