Risk and Hypothesis Classes
This lecture is about the key components of a prediction problem
By the end, you should be able to
We will work only in the supervised setting in this course
Setup:
Note the vocabulary: ML has its own established terminology, which sometimes differs from causal inference terminology
Two main kinds of supervised problems:
Key goal of prediction — predicting \(Y\) “well”
How to define “well”?
Quality of prediction measured with risk function
Let \(h(\bX)\) be a prediction of \(Y\) given \(\bX\) (hypothesis)

Definition 1 Let the loss function \(l(y, \hat{y})\) satisfy \(l(y, \hat{y})\geq 0\) and \(l(y, y)=0\) for all \(y, \hat{y}\).
The risk function of the hypothesis (prediction) \(h(\cdot)\) is the expected loss: \[ R(h) = \E_{(Y, \bX)}\left[ l(Y, h(\bX)) \right] \]
Risk measures how well the hypothesis \(h\) performs on unseen data — generalization error
Example: indicator risk: \[ \E[\I\curl{Y\neq h(\bX)}] = P(Y\neq h(\bX)) \] Probability of incorrectly predicting \(Y\) with \(h(\bX)\) — where \(Y\) and \(\bX\) are a new observation
Risk function
Risk is not necessarily the criterion function used for estimation
Ideally want to choose best possible \(h(\cdot)\): \[ \small h(\cdot) \in \argmin_{h} R(h(\cdot)) \tag{1}\] But challenges:
Sample version of risk — empirical risk : \[ \small \hat{R}_N(h) = \dfrac{1}{N}\sum_{i=1}^N l(Y_i, h(\bX_i)). \] Average over sample \(S = \curl{(Y_1, \bX_1), \dots, (Y_N, \bX_N)}\)
Minimizing \(\hat{R}_N\) — empirical risk minimization (ERM): \[ \small \hat{h}^{ERM}_S \in \argmin_{h} \hat{R}_N(h) \]
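A minimal sketch of ERM with the indicator loss, assuming (for illustration only) a toy sample and a small finite class of threshold classifiers \(h_t(x) = \I\curl{x \geq t}\):

```python
# Sketch: empirical risk minimization over a small hypothesis class.
# Hypothesis class (an assumption for illustration): threshold classifiers
# h_t(x) = 1{x >= t} over a grid of thresholds t.
# Loss: indicator/misclassification loss l(y, yhat) = 1{y != yhat}.

def empirical_risk(h, sample):
    """Average loss of hypothesis h over the sample S = {(y_i, x_i)}."""
    return sum(y != h(x) for y, x in sample) / len(sample)

# Toy sample of (y, x) pairs: labels mostly switch from 0 to 1 around x = 2.5
sample = [(0, 1.0), (0, 2.0), (1, 2.4), (1, 3.0), (1, 4.0), (0, 0.5)]

# ERM: pick the threshold with the smallest empirical risk
thresholds = [0.0, 1.0, 2.0, 2.5, 3.0, 4.0]
risks = {t: empirical_risk(lambda x, t=t: int(x >= t), sample) for t in thresholds}
t_hat = min(risks, key=risks.get)
print(t_hat, risks[t_hat])
```

Even on this toy sample no threshold achieves zero empirical risk: the class is too simple to separate the data perfectly.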
Minimization in Equation 1 — over all \(h\) such that the risk makes sense
Issues:
Solution: look for \(h\) in some hypothesis class \(\Hcal\)
Then ERM: \[ \hat{h}_N^{ERM} \in \argmin_{h\in\Hcal} \hat{R}_N(h) \]
Careful with terminology
“Hypothesis” in ML — basically a model with specific coefficients.
Do not confuse with hypotheses in inference
Popular class: linear predictors/classifiers
Regression: \[ \small \Hcal= \curl{h(\bx)=\varphi(\bx)'\bbeta: \bbeta\in \R^{\dim(\varphi(\bx))} } \]
Binary classification: \[ \small \Hcal = \curl{h(\bx) = \I\curl{\varphi(\bx)'\bbeta \geq 0}: \bbeta\in \R^{\dim(\varphi(\bx))} } \]
\(\varphi(\bx)\) — some known transformation of predictors
Already know an example of ERM with specific hypothesis class — linear regression
Problem elements:
Empirical risk minimizer: \[ \begin{aligned} \hat{h}(\bx) & = \bx'\hat{\bbeta}, \\ \hat{\bbeta} & = \argmin_{\bb} \dfrac{1}{N}\sum_{i=1}^N (Y_i - \bX_i'\bb)^2 = (\bX'\bX)^{-1}\bX'\bY \end{aligned} \]
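The closed-form expression above can be checked numerically; a sketch with simulated data (the data generating process is assumed for illustration):

```python
import numpy as np

# Sketch: linear regression as ERM with squared loss.
# The closed-form minimizer (X'X)^{-1} X'y of the empirical risk
# (1/N) * sum_i (y_i - x_i'b)^2 is compared against numpy's least-squares solver.
rng = np.random.default_rng(0)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])  # intercept + one regressor
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(size=N)

beta_closed = np.linalg.solve(X.T @ X, X.T @ y)        # (X'X)^{-1} X'y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)     # numerical ERM
print(np.allclose(beta_closed, beta_lstsq))
```

Both routes minimize the same empirical risk, so the coefficient vectors agree up to numerical precision.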
Choice of \(\Hcal\) — part of choice of inductive bias
Definition 2 Inductive bias is the set of assumptions that the learning algorithm uses to generalize to unseen data
Examples:
Another approach is taken by decision trees
Trees:
Can use both for regression and classification
Illustration: figure 8.3 in James et al. (2023)
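To make the tree idea concrete, here is a sketch of a depth-one regression tree (a "stump") as ERM over piecewise-constant hypotheses; the data is assumed for illustration:

```python
import numpy as np

# Sketch: a depth-one regression tree ("stump") as ERM over piecewise-constant
# hypotheses. We choose the split point s that minimizes the total squared loss
# when each leaf predicts its own sample mean.
rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 10, size=100))
y = (x > 5).astype(float) + rng.normal(scale=0.1, size=100)  # true step at x = 5

def sse(v):
    """Squared loss of a leaf predicting the leaf mean."""
    return np.sum((v - v.mean()) ** 2) if v.size else 0.0

splits = [(s, sse(y[x <= s]) + sse(y[x > s])) for s in x[1:-1]]
s_hat = min(splits, key=lambda t: t[1])[0]
print(s_hat)  # the learned split lands near the true step at x = 5
```

Deeper trees repeat this greedy split recursively within each leaf, which is how the partitions in figure 8.3 of James et al. (2023) arise.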
ERM over \(\Hcal\) \[ \small \hat{h}^{ERM}_S \in \argmin_{h \in \Hcal} \hat{R}_N(h) \]
Can still have some challenges:
The story with overfitting is more complicated due to “double descent”, observed especially in deep learning
Visual example — binary classification (red, blue) with two features
Image from Wikipedia
Suppose: \(X\) scalar, \(\Hcal\) — polynomials up to 10th degree \[ \small \Hcal= \curl{h(x) = \sum_{k=0}^{10} \beta_k x^k, \beta\in \R^{11} } \]
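With 11 sample points, a 10th-degree polynomial can interpolate the data exactly, so ERM over this class drives the empirical risk to (numerically) zero even when the true relationship is much simpler. A sketch with assumed data:

```python
import numpy as np

# Sketch (illustrative, assumed data): ERM over 10th-degree polynomials on a
# sample of 11 points. The fitted polynomial interpolates the data, so the
# empirical (training) risk is essentially zero, even though the true model
# is linear — a classic overfitting setup.
rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 11)
y = 2 * x + rng.normal(scale=0.3, size=11)   # true relationship is linear

coef = np.polyfit(x, y, deg=10)              # ERM over the polynomial class
train_risk = np.mean((y - np.polyval(coef, x)) ** 2)
print(train_risk)                            # near machine zero
```

The near-zero training risk says nothing about the actual risk on unseen data, which is exactly why ERM over a rich class needs a preference for simpler hypotheses.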
How to prefer simpler explanations with ERM?
General answer
\[ \small \hat{h}\in\argmin_{h\in\Hcal} \hat{R}_N(h) + \lambda \Pcal(h) \tag{2}\]
\(\lambda\geq 0\) — fixed penalty parameter, controls balance between penalty and risk
Hypothesis set: \(\Hcal= \curl{h(\bx)= \varphi(\bx)'\bbeta: \bbeta\in \R^{\dim(\varphi(\bx))}}\)
Popular penalties:
Arise often and used in many models (see lecture on predictive regression)
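Under the linear hypothesis set with squared loss, a ridge (squared L2) penalty, for example, yields a penalized ERM problem with a closed-form solution; a sketch with simulated data (scaling conventions for \(\lambda\) vary across texts):

```python
import numpy as np

# Sketch: penalized ERM with a ridge penalty P(h) = ||b||^2 under the linear
# hypothesis class and squared loss. The minimizer has the closed form
# (X'X + lambda*N*I)^{-1} X'y (one common scaling convention).
rng = np.random.default_rng(2)
N, p = 100, 5
X = rng.normal(size=(N, p))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(size=N)

def ridge(X, y, lam):
    n, k = X.shape
    return np.linalg.solve(X.T @ X + lam * n * np.eye(k), X.T @ y)

# A larger lambda puts more weight on the penalty and shrinks b toward zero
b_small, b_large = ridge(X, y, 0.01), ridge(X, y, 10.0)
print(np.linalg.norm(b_small), np.linalg.norm(b_large))
```

The shrinkage visible here is the penalty trading a bit of empirical risk for a simpler hypothesis.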
\(\lambda\) in Equation 2:
Important
Quality of \(\hat{h}(\cdot)\) evaluated in terms of actual risk regardless
Example: binary classification (\(Y=0, 1\)) with a logit classifier based on some \(\bX\)
Classifiers indexed by \(\bbeta\): \[ \begin{aligned} h(\bx) & = \I\curl{ \Lambda( \bx'\bbeta ) \geq 0.5 }, \\ \Lambda(x) & = \dfrac{1}{1+\exp(-x)} \end{aligned} \] \(\Lambda\) — CDF of the logistic distribution
ERM to learn best \(\bbeta\) (with indicator/misclassification risk) \[ \hat{\bbeta}^{ERM} \in \argmin_{\bb} \dfrac{1}{N} \sum_{i=1}^N \I\curl{ Y_i\neq \I\curl{ \Lambda( \bX_i'\bbeta ) \geq 0.5 } } \]
Hard minimization — function is not even continuous in \(\bbeta\)
Instead of ERM can maximize the (quasi) log likelihood: \[ \small \hat{\bbeta}^{QML} = \argmax_{\bb} \sum_{i=1}^N\left[Y_i \log(\Lambda( \bX_i'\bbeta )) + (1-Y_i) \log\left( 1 - \Lambda( \bX_i'\bbeta ) \right) \right] \]
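A sketch of the quasi-ML route on simulated data: estimate \(\bbeta\) by gradient ascent on the log likelihood (a simple stand-in for the usual Newton-type routines), then evaluate the resulting classifier with the actual misclassification risk:

```python
import numpy as np

# Sketch: quasi-ML estimation of a logit classifier on simulated data.
# The smooth log likelihood is maximized by plain gradient ascent, and the
# fitted classifier is then judged by the actual criterion of interest:
# the misclassification rate.
rng = np.random.default_rng(3)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([-0.5, 2.0])
p = 1 / (1 + np.exp(-(X @ beta_true)))         # Lambda(x'beta)
y = rng.binomial(1, p)

beta = np.zeros(2)
for _ in range(2000):                          # gradient ascent on the QML objective
    grad = X.T @ (y - 1 / (1 + np.exp(-(X @ beta)))) / N
    beta += 0.5 * grad

y_hat = (1 / (1 + np.exp(-(X @ beta))) >= 0.5).astype(int)
print(np.mean(y != y_hat))                     # in-sample misclassification rate
```

The objective being maximized is the smooth likelihood, but the number reported at the end is the misclassification rate: estimation criterion and evaluation criterion need not coincide.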
Original justification — a very specific data generating process (see 17.1 in Wooldridge (2020)).
In prediction, using maximum likelihood does not mean that you believe the model — “quasi” means the likelihood does not necessarily reflect the true data generating process
Hypothesis quality checked using the actual risk — probability of predicting \(Y\) incorrectly
Some algorithms can return such scores or probabilities (e.g. SVMs, logit-like). Some algorithms can only return the final labels (e.g. classification trees)
In this lecture we