Regression I: MSE, Data Preparation

MSE, Train-Test Split, Exploratory Analysis, and Pipelines

Vladislav Morozov

Introduction

Lecture Info

Learning Outcomes

This lecture — first part of our illustrated regression example


By the end, you should be able to

  • Discuss properties of MSE, its relation to the conditional mean
  • Explain why we need a separate test set
  • Perform basics of exploratory data analysis
  • Use scikit-learn transformers and pipelines

References


  • Chapter 3 of James et al. (2023)
  • Relevant material from sections 7.1–7.3 of the scikit-learn documentation (transformations, pipelines, feature extraction)
  • More on exploratory data analysis with Python: chapters 9–12 in Lau (2023)

Empirical Setup

Framing Prediction Tasks

Imagine the following scenario

  • You are interested in investing in a region in California
  • Want to decide where to invest

Investing may require buying some houses — want to accurately price them

Thus: current prediction problem is

Accurately predict house prices in small subregions in California

Meet scikit-learn

For learning we will use scikit-learn — a fantastic and Pythonic library for predictive learning


Will use capabilities from different blocks. Imports:

Imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px

# Data source
from sklearn.datasets import fetch_california_housing

# For splitting dataset
from sklearn.model_selection import train_test_split

# For composing transformations
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# For preprocessing data
from sklearn.preprocessing import (
    FunctionTransformer, 
    PolynomialFeatures,
    StandardScaler
)

Data


Will use the California housing data available with scikit-learn

  • Nice, clean dataset
  • Can be retrieved with special function in sklearn.datasets
  • Describes median price of house in block and some block characteristics

Loading the Data

# data_path: local folder where the dataset is cached (defined beforehand)
data = fetch_california_housing(data_home=data_path, as_frame=True)
data_df = data.frame.copy()                     # Separate data DF
print(data.DESCR)
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).

A household is a group of people residing within a home. Since the average
number of rooms and bedrooms in this dataset are provided per household, these
columns may take surprisingly large values for block groups with few households
and many empty houses, such as vacation resorts.

It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.

.. rubric:: References

- Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
  Statistics and Probability Letters, 33 (1997) 291-297

Key Data Info

Dataset is clean, with no missing values and nice names — good for learning and practicing now:

data_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrms    20640 non-null  float64
 4   Population   20640 non-null  float64
 5   AveOccup     20640 non-null  float64
 6   Latitude     20640 non-null  float64
 7   Longitude    20640 non-null  float64
 8   MedHouseVal  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB

Learning Steps

Key Steps

We will go through the following key steps:

  • Choose the risk (key metric)
  • Split data into training and test sets
  • Explore the training set, select features
  • Train and validate models, select best-performing during validation
  • Evaluate the final chosen model using the test sample

This lecture: first three steps

Are There More Steps?

  • In research, not really — we get a nice model that solves a fixed problem
  • In production use, many things beyond just model development
    • Data engineering
    • Model deployment and monitoring
    • Infrastructure

See book by Huyen (2022) for a great overview

Risk in Regression

(Root) Mean Square Error

Which Criterion to Use?

In this example just want to be precise:

  • Overprediction and underprediction equally bad
  • \(\Rightarrow\) symmetric loss


Also want a bigger penalty for bigger mistakes

Mean Squared Error

Popular choice of risk with above properties — mean squared error: \[ MSE(\hat{Y}) = \E[(Y-\hat{Y})^2] \]

Why is MSE popular?

  • MSE is “generic”: For \(Y\approx \hat{Y}\) “locally equivalent” to many other smooth risk functions
  • Motivated by maximum likelihood under normality
  • Bayes predictor is known and interpretable — \(\E[Y|\bX]\)

MSE-Optimal Predictor (Bayes Predictor)

Proposition 1 Suppose that \(Y\) has a finite second moment. Then \[ \E[Y|\bX] = \argmin_{h(\cdot): \E[h(\bX)^2]<\infty} \E[ (Y-h(\bX))^2] \]

  • MSE-best guess: conditional expectation of \(Y\)
  • Explains in what sense \(\E[Y|\bX]\) was the “best guess” for \(Y\) given “information” \(\bX\)

Proof I: Expansion

Key trick: add and subtract \(\E[Y|\bX]\) under the MSE \[ \begin{aligned} MSE(h) & = \E[(Y- h(\bX))^2] \\ & = \E\left[\left( (Y-\E[Y|\bX]) + (\E[Y|\bX] - h(\bX)) \right)^2\right] \\ & = \E\left[ (Y-\E[Y|\bX])^2 \right] + \E[(\E[Y|\bX] - h(\bX))^2]\\ & \quad + 2\E\left[ (Y-\E[Y|\bX]) (\E[Y|\bX] - h(\bX)) \right] \end{aligned} \]

Proof II: Cross-Term = 0

Recall

Proposition 2 (Properties of conditional expectations) For any variables \(V, W\) it holds that

  1. \(\E[V]= \E[ \E[V|W]]\)
  2. \(\E[f(W)V|W] =f(W)\E[V|W]\)

It follows that (why?) \[ \E\left[ (Y-\E[Y|\bX]) (\E[Y|\bX] - h(\bX)) \right] = 0 \]
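One way to see this (a sketch using the two properties above): condition on \(\bX\), pull out the known factor (property 2), and apply the law of iterated expectations (property 1): \[ \begin{aligned} \E\left[ (Y-\E[Y|\bX]) (\E[Y|\bX] - h(\bX)) \right] & = \E\left[ \E\left[ (Y-\E[Y|\bX]) (\E[Y|\bX] - h(\bX)) \big| \bX \right] \right] \\ & = \E\left[ (\E[Y|\bX] - h(\bX)) \, \E\left[ Y-\E[Y|\bX] \big| \bX \right] \right] \\ & = \E\left[ (\E[Y|\bX] - h(\bX)) \cdot 0 \right] = 0 \end{aligned} \]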

Proof III: Conclusion

So \[ MSE(h) = \E\left[ (Y-\E[Y|\bX])^2 \right] + \E[(\E[Y|\bX] - h(\bX))^2] \]

  • First term does not depend on \(h\)
  • Second term minimized by taking \(h(\bX) = \E[Y|\bX]\)

This proves Proposition 1

Root MSE

Typically instead of MSE report root MSE: \[ RMSE(\hat{Y}) = \sqrt{ \E[(Y-\hat{Y})^2] } \]

Why?

  • RMSE expressed in the same units as the outcome, more interpretable than raw MSE
  • RMSE also minimized by \(\E[Y|\bX]\)

We will focus on RMSE
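As a minimal sketch (the arrays below are illustrative, not from the housing data), the sample RMSE of a set of predictions can be computed directly with NumPy:

# Illustrative outcomes and predictions
y_true = np.array([2.5, 1.2, 3.1, 0.8])
y_pred = np.array([2.0, 1.5, 3.0, 1.1])

# Sample RMSE: square root of the average squared error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)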

Other Risks

When Is MSE not Satisfactory

Sometimes MSE is not the right choice


Examples:

  • Asymmetry in preferences, e.g. overpredicting is more dangerous than underpredicting
  • Interest in predicting a specific part of the distribution of \(Y\): e.g. price point not exceeded by 90% of houses
  • Many outliers in \(Y\) (heavy tails)

Splitting the Data

Estimating the Risk

Recall: want models with good risk: \[ R(\hat{h}_S) = \E_{\bX, Y}\left[ l(Y, \hat{h}_S(\bX)) \right] \] Expectation — taken over a new point \((\bX, Y)\) — not part of sample \(S\) used to select \(\hat{h}_S\)


How do you estimate risk of \(\hat{h}_S\)?

Training Loss

Naive approach:

  • Let \(S= \curl{(\bX_1, Y_1), \dots, (\bX_N, Y_N)}\) be the training set — sample used to select \(\hat{h}_S\)
  • Estimate \(R(\hat{h}_S)\) with empirical risk on \(S\) \[ \small \hat{R}_S(\hat{h}_S) = \dfrac{1}{N} \sum_{i=1}^N l(Y_i, \hat{h}_S(\bX_i)) \]

\(\hat{R}_S(\hat{h}_S)\) often called training loss

Issues with Evaluating on Training Data

Training loss is bad — too optimistic

\(\hat{R}_S(\hat{h}_S)\) is downward biased estimator of \(R(\hat{h}_S)\)

  • Unlike in the risk definition: the average is not over a new point
  • Intuition: \(\hat{h}_S\) is picked to do well on \(S\), and \(S\) is not “new” to \(\hat{h}_S\) (see the toy sketch below)
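A toy sketch of this optimism (purely synthetic data, separate from the housing example): fit a very flexible polynomial on a small training sample and compare its average squared loss on that sample with the loss on a large fresh sample.

rng = np.random.default_rng(0)

# Small training sample from Y = X + noise
x_tr = rng.uniform(-1, 1, size=20)
y_tr = x_tr + rng.normal(scale=0.5, size=20)

# Flexible model: degree-9 polynomial fitted on the training sample
coefs = np.polyfit(x_tr, y_tr, deg=9)

# Large fresh sample from the same distribution
x_new = rng.uniform(-1, 1, size=10_000)
y_new = x_new + rng.normal(scale=0.5, size=10_000)

train_loss = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)
new_loss = np.mean((y_new - np.polyval(coefs, x_new)) ** 2)
print(train_loss, new_loss)  # training loss is typically much smaller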

Split: Train and Test Set

Solution — use a separate test set with observations that are new

  • Split \(S\) into two sets:
    • Training set \(S_{Tr}\): used for selecting \(\hat{h}_{S_{Tr}}\)
    • Test set \(S_{Test}\) with \(N_{S_{Test}}\) observations
  • Estimate risk with average over \(S_{Test}\) \[ \hat{R}_{S_{Test}}(\hat{h}_{S_{Tr}}) = \dfrac{1}{N_{S_{Test}}} \sum_{j=1}^{N_{S_{Test}}} l(Y_j, \hat{h}_{S_{Tr}}(\bX_j)) \]

Test Set Gives Unbiased Estimator

Evaluating on test set — good picture of performance

\[ \small \E\left[ \hat{R}_{S_{Test}}(\hat{h}_{S_{Tr}}) \, \big| \, S_{Tr} \right] = R(\hat{h}_{S_{Tr}}) \]

  • This unbiasedness depends on the algorithm not seeing any part of \(S_{Test}\)
  • Otherwise you get data leakage

Caution

Do not use any part of the test set for training and comparing models!

Another Problem: Choosing Between Models

Now a problem:

  • Can’t compare models based on training set
  • Can’t compare models based on test set


How do you compare competing models?

Further Splits: Validation

Answer: split training set into training set and validation set

  • Train on the training set, check risk on the validation set
  • Each validation-set risk estimate is unbiased
  • Select model with best validation performance


Can use multiple splits for better estimates (cross-validation, more on that later)

In Practice

Can do simple split with train_test_split() from sklearn.model_selection:

train_set, test_set = train_test_split(data_df, test_size=0.2, random_state=1)
print(train_set.shape)
print(test_set.shape)
(16512, 9)
(4128, 9)

Just a simple random split into two sets


We will use cross-validation, so there is no need to explicitly split off a validation set here
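If you did want an explicit validation set, a minimal sketch is a second call to train_test_split() on the training set (the split fraction is illustrative):

# Illustrative: carve a validation set out of the training set
train_subset, val_set = train_test_split(train_set, test_size=0.25, random_state=1)
print(train_subset.shape, val_set.shape)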

Exploring the Data

Exploratory Data Analysis

Can now do exploratory data analysis:

  • Looking at data: distributions, descriptive stats, etc.
  • Identifying promising variables
  • Making some features (feature extraction)
  • Other exploration to gain insights into data

Geographical Distribution of Data
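This slide's figure (a map of the training observations) is not reproduced here. A minimal sketch of one way to draw such a plot with plotly.express (the arguments are illustrative):

# Scatter plot of block group locations, colored by median house value
fig = px.scatter(
    train_set,
    x="Longitude",
    y="Latitude",
    color="MedHouseVal",
    opacity=0.4,
    title="Median house value by location (training set)",
)
fig.show()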

Variable Distributions
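This slide's figure (histograms of the variables) is not reproduced here. A quick sketch of one way to inspect the distributions with pandas and matplotlib:

# Histogram of every numeric column in the training set
train_set.hist(bins=50, figsize=(12, 8))
plt.tight_layout()
plt.show()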

Variable Distributions: Results

  • Most variables look normal
  • But AveRooms, AveOccup, and AveBedrms look suspicious: there are some very high values
  • Check values
train_set.nlargest(3, "AveOccup").loc[:,["AveOccup", "Longitude", "Latitude"]]
          AveOccup  Longitude  Latitude
19006  1243.333333    -121.98     38.32
3364    599.714286    -120.51     40.41
16669   502.461538    -120.70     35.32

Looking at Suspicious Observations

  • These places have large prisons
  • Up to you to decide: drop or keep these observations

Feature Engineering I

Need to think about which features to include

  • Some probably not directly helpful (longitude and latitude), though maybe can be transformed into something more useful (like proximity to activity centers)
  • About others: sometimes useful to look at scatterplots and correlations


Turning raw data into useful features called feature engineering

Scatterplots

  • Median income broadly linearly predictive
  • Less obvious for other features
  • Note: house values seem to cluster at “round” valuations like $250,000, $300,000, etc.

Feature Engineering II: Adding a New Feature

Can we add any interesting variables?

  • Example: if a house has a lower share of bedrooms among all rooms, it usually has more “luxury” rooms
  • Such houses likely more expensive
  • Can define new variable: bedroom ratio
train_exp = train_set.copy()
train_exp["BedroomRatio"] = train_exp["AveBedrms"]/train_exp["AveRooms"]

Example of feature extraction

Correlations

(train_exp.corr()["MedHouseVal"]
        .sort_values(ascending=False)
)
MedHouseVal     1.000000
MedInc          0.688194
AveRooms        0.146508
HouseAge        0.105758
AveOccup       -0.021979
Population     -0.023884
AveBedrms      -0.041592
Longitude      -0.050893
Latitude       -0.139374
BedroomRatio   -0.253362
Name: MedHouseVal, dtype: float64
  • More or less confirms what we have seen
  • New feature rather strongly correlated with label — not bad

Summarizing EDA

What we learned:

  • Data quite nice for the most part
  • Potentially interesting variable: bedroom ratio
  • Some patterns due to presence of places with large prisons
    • We will keep those observations
    • Exercise: retry the analysis after dropping them from the training set (see the sketch below)
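For that exercise, a possible starting point (the occupancy threshold is illustrative, not part of the lecture):

# Drop block groups with implausibly high average occupancy (e.g. prisons)
train_no_prisons = train_set[train_set["AveOccup"] < 50].copy()
print(train_set.shape, train_no_prisons.shape)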

A Note on Limitations of Our Data

The housing data set was nice and clean

  • Good for us now to focus on key ideas
  • But be aware that real-life data is messy
    • Genuine incorrect values
    • Missing values
    • Etc.

Preparing the Data

Separating Data

Before everything: separate data into features \(\bX\) and label \(Y\):

X_train = train_set.drop("MedHouseVal", axis=1)
y_train = train_set["MedHouseVal"].copy()

print(X_train.head())
       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
15961  3.1908      52.0  5.000000   1.014184       879.0  3.117021     37.71   
1771   3.6094      42.0  4.900990   0.957096       971.0  3.204620     37.95   
16414  2.6250      16.0  8.333333   1.666667        20.0  3.333333     37.90   
5056   1.5143      34.0  3.805981   1.149526      3538.0  2.580598     34.02   
8589   7.3356      38.0  5.894904   1.057325       750.0  2.388535     33.89   

       Longitude  
15961    -122.43  
1771     -122.35  
16414    -121.24  
5056     -118.35  
8589     -118.39  

Reproducibility

  • EDA process — ad hoc/experimental
  • For training want a reproducible flow

More formally:

A data pipeline is a series of chained data transformations

Want a pipeline that

  • Ingests the raw data (original variables)
  • Produces the dataset we will present to the learning algorithms (next lecture)

Transformers

What Transformations?

What do we want?

  • Use features identified in EDA
  • Drop unused features (longitude, latitude)
  • Many algorithms work best with standardized data

In terms of transformations:

  • Create ratio of two columns for bedroom ratio
  • Drop geography
  • Make polynomials and standardize all included vars

scikit-learn Transformers

scikit-learn provides transformers — tools for preprocessing data into a suitable format

  • Transformers have a unified interface
  • Can chain transformers into pipelines
  • Can combine parallel pipelines together
  • End result takes in original variables and can be run with a single method

Example Transformer: Standardization I

First example: standardization. Want each column to have mean 0 and variance 1

Standardized version of the \(k\)th variable: \[ \tilde{X}^{(k)}_i = \dfrac{ X_i^{(k)} - \E[X_i^{(k)}] }{\sqrt{\var(X_i^{(k)})}} \]

Here the mean \(\E[X_i^{(k)}]\) and variance \(\var(X_i^{(k)})\) are unknown transformation parameters that need to be learned

Example Transformer: Standardization II

Use StandardScaler() from sklearn.preprocessing:

std_scaler = StandardScaler()

X_standardized = pd.DataFrame(
    std_scaler.fit_transform(X_train),
    columns=std_scaler.get_feature_names_out(),
    index=X_train.index,
)

# Check the mean and the standard deviation
X_standardized.agg(['mean', 'std']).round(3)
      MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
mean    -0.0       0.0       0.0        0.0         0.0       0.0       0.0       -0.0
std      1.0       1.0       1.0        1.0         1.0       1.0       1.0        1.0

Transformer Interface

Transformers in scikit-learn have the same interface.

Key common methods:

  • fit() — learn the parameters of the transformation (e.g. means and standard deviation)
  • transform() — transform data (training or new)
  • fit_transform() — combination of fit() and transform()

Transformers usually return NumPy arrays; column names of the result can be obtained from get_feature_names_out()

Applying the Transformer

  • Our StandardScaler has learned the parameters — the means and standard deviations of each column
std_scaler.mean_
array([ 3.87614927e+00,  2.86044695e+01,  5.44111400e+00,  1.09959762e+00,
        1.42525715e+03,  3.09497079e+00,  3.56321936e+01, -1.19574288e+02])
  • Can now transform any new collection of \(\bX\).
  • It will use the same parameters (also when we transform validation and test data); see the sketch below
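A minimal sketch of applying the already-fitted scaler to “new” data (here two training rows stand in for new observations):

# transform() reuses the means and standard deviations learned in fit()
X_new = X_train.head(2)          # stand-in for new observations
std_scaler.transform(X_new)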

Custom Transformers I

sklearn.preprocessing has many transformers

  • Encoders (e.g. making dummies with OneHotEncoder)
  • Normalizers and scalers
  • Functional transformations (e.g. polynomials with PolynomialFeatures)

But can also write our own transformers

Custom Transformers II

Ways of writing custom transformations

  • Based on specific simple functions with FunctionTransformer
  • Fully custom ones: just need to implement a minimal interface (inheriting from BaseEstimator and TransformerMixin is helpful)


Important part: the result should support fitting and transforming (see the sketch below)
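A minimal sketch of a fully custom transformer (this class is illustrative, not from scikit-learn) that divides the first input column by the second:

from sklearn.base import BaseEstimator, TransformerMixin

class ColumnRatio(BaseEstimator, TransformerMixin):
    """Illustrative custom transformer: ratio of the first column to the second."""

    def fit(self, X, y=None):
        # Nothing to learn for a fixed ratio
        return self

    def transform(self, X):
        X = np.asarray(X)
        return X[:, [0]] / X[:, [1]]

    def get_feature_names_out(self, input_features=None):
        return np.array(["ratio"])

Usage would look like ColumnRatio().fit_transform(X_train[["AveBedrms", "AveRooms"]])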

Custom Example: Column Ratio

  • Example: transformer for computing the share of bedrooms among all rooms with FunctionTransformer
  • Sufficient to supply the transforming function (here: divide the first column by the second)
def column_ratio(X):
    # Divide the first column by the second, keeping a 2D shape
    return X[:, [0]] / X[:, [1]]

divider_transformer = FunctionTransformer(column_ratio, validate=True)
divider_transformer
FunctionTransformer(func=<function column_ratio at 0x000001BCC8027600>,
                    validate=True)

Applying Column Ratio

Can now apply:

divider_transformer.fit_transform(X_train.loc[:, ["AveBedrms", "AveRooms"]])
array([[0.20283688],
       [0.1952862 ],
       [0.2       ],
       ...,
       [0.20586183],
       [0.19067619],
       [0.25471698]])
  • What about feature names?
  • How to fit that with the rest of the processing?

Pipelines

Pipelines

To compose transformations, can use pipelines

  • Pipeline from sklearn.pipeline
  • A pipeline is a sequence of transformations.
  • Optionally can attach a predictor at the end (next time)
  • Can operate as a single transformer/predictor
  • Simple to specify: just provide a list of tuples of form (name, Transformer) to Pipeline

Pipelines: Polynomials + Standardization

polynom_pipeline = Pipeline(
    [ 
        ('poly', PolynomialFeatures(degree=2, include_bias=False)),
        ('scale', StandardScaler()),
    ],
)
polynom_pipeline
Pipeline(steps=[('poly', PolynomialFeatures(include_bias=False)),
                ('scale', StandardScaler())])

Pipelines: Polynomial Example in Action

X_polyn = polynom_pipeline.fit_transform(X_train)
pd.DataFrame(
    X_polyn,
    columns=polynom_pipeline.get_feature_names_out(),
    index=X_train.index,
).head(3)
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude MedInc^2 MedInc HouseAge ... Population^2 Population AveOccup Population Latitude Population Longitude AveOccup^2 AveOccup Latitude AveOccup Longitude Latitude^2 Latitude Longitude Longitude^2
15961 -0.362326 1.858903 -0.168773 -0.168415 -0.486114 0.001901 0.972290 -1.422509 -0.389251 0.786236 ... -0.198471 -0.033110 -0.440025 0.468169 -0.010737 0.016055 -0.008191 0.957187 -1.086319 1.427586
1771 -0.141023 1.064348 -0.206655 -0.280981 -0.404243 0.009455 1.084596 -1.382659 -0.257675 0.591984 ... -0.185074 -0.028474 -0.346316 0.384502 -0.010693 0.025220 -0.015618 1.074766 -1.167590 1.386898
16414 -0.661450 -1.001494 1.106584 1.118131 -1.250537 0.020554 1.061199 -0.829737 -0.541351 -0.893977 ... -0.259256 -0.066439 -1.259810 1.254967 -0.010626 0.035837 -0.024165 1.050209 -1.018952 0.825090

3 rows × 44 columns

Column Transformers I

Pipelines allow sequential combination of transformations

What about parallel combinations? Recall: want to

  • Drop some variables
  • Take ratio of some other vars
  • Leave the rest untouched

Only then make polynomials and standardize

Use ColumnTransformer from sklearn.compose

Column Transformers II

  • ColumnTransformer: different transformations for different columns
  • Also simple to specify: with a list of tuples of (name, transformer, columns)
feat_extr_pipe = ColumnTransformer(
  [
    ('bedroom_ratio', divider_transformer, ['AveBedrms', 'AveRooms']),
    (
      'passthrough', 
      'passthrough', 
      [
        'MedInc', 
        'HouseAge', 
        'AveRooms', 
        'AveBedrms', 
        'Population', 
        'AveOccup',
      ]
    ),
    ('drop', 'drop', ['Longitude', 'Latitude'])
  ]
) 

Here we used a slightly more refined version of the divider transformer, which also supplies output feature names:

Details
def ratio_name(function_transformer, feature_names_in):
    # Name for the single output column produced by column_ratio
    return ["ratio"]

divider_transformer = FunctionTransformer(
    column_ratio,
    validate=True,
    feature_names_out=ratio_name,
)

Column Transformers III

Our “feature extraction” pipeline:

ColumnTransformer(transformers=[('bedroom_ratio',
                                 FunctionTransformer(feature_names_out=<function ratio_name at 0x000001BCC92A76A0>,
                                                     func=<function column_ratio at 0x000001BCC8027600>,
                                                     validate=True),
                                 ['AveBedrms', 'AveRooms']),
                                ('passthrough', 'passthrough',
                                 ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                                  'Population', 'AveOccup']),
                                ('drop', 'drop', ['Longitude', 'Latitude'])])

Results of its action

bedroom_ratio__ratio passthrough__MedInc passthrough__HouseAge passthrough__AveRooms passthrough__AveBedrms passthrough__Population passthrough__AveOccup
15961 0.202837 3.1908 52.0 5.00000 1.014184 879.0 3.117021
1771 0.195286 3.6094 42.0 4.90099 0.957096 971.0 3.204620

Discussion

  • Same column can go into several arms of the column transformer
  • Note: component names are prepended to column names with two underscores __
  • Columns in \(\bX\) not specified anywhere are handled according to the remainder argument
    • Defaults to 'drop' (can also be set to 'passthrough')
    • In small applications it may be better to be explicit about dropping (see the sketch below)
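A minimal sketch of the remainder argument (illustrative, not part of our preprocessing pipeline):

# Unlisted columns are kept as-is instead of being dropped
ct_keep_rest = ColumnTransformer(
    [('bedroom_ratio', divider_transformer, ['AveBedrms', 'AveRooms'])],
    remainder='passthrough',
)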

Adding The Remaining Components

  • Now only need to add the polynomial features and the scaler
  • Can freely combine column transformers and pipelines into other pipelines
preprocessing = Pipeline(
  [
    ('extraction', feat_extr_pipe),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scale', StandardScaler()),
  ]
)

Full Preprocessing Pipeline

Pipeline(steps=[('extraction',
                 ColumnTransformer(transformers=[('bedroom_ratio',
                                                  FunctionTransformer(feature_names_out=<function ratio_name at 0x000001BCC92A76A0>,
                                                                      func=<function column_ratio at 0x000001BCC8027600>,
                                                                      validate=True),
                                                  ['AveBedrms', 'AveRooms']),
                                                 ('passthrough', 'passthrough',
                                                  ['MedInc', 'HouseAge',
                                                   'AveRooms', 'AveBedrms',
                                                   'Population', 'AveOccup']),
                                                 ('drop', 'drop',
                                                  ['Longitude', 'Latitude'])])),
                ('poly', PolynomialFeatures(include_bias=False)),
                ('scale', StandardScaler())])

Looking at the Data

Pipeline ingests original data, applies all our steps, and gives a preprocessed dataset

pd.DataFrame(
  preprocessing.fit_transform(X_train), 
  columns=preprocessing.get_feature_names_out(),
  index=X_train.index
).head(2)
bedroom_ratio__ratio passthrough__MedInc passthrough__HouseAge passthrough__AveRooms passthrough__AveBedrms passthrough__Population passthrough__AveOccup bedroom_ratio__ratio^2 bedroom_ratio__ratio passthrough__MedInc bedroom_ratio__ratio passthrough__HouseAge ... passthrough__AveRooms^2 passthrough__AveRooms passthrough__AveBedrms passthrough__AveRooms passthrough__Population passthrough__AveRooms passthrough__AveOccup passthrough__AveBedrms^2 passthrough__AveBedrms passthrough__Population passthrough__AveBedrms passthrough__AveOccup passthrough__Population^2 passthrough__Population passthrough__AveOccup passthrough__AveOccup^2
15961 -0.178281 -0.362326 1.858903 -0.168773 -0.168415 -0.486114 0.001901 -0.221732 -0.448966 1.251270 ... -0.051199 -0.042603 -0.480251 -0.020454 -0.039705 -0.524949 -0.018452 -0.198471 -0.033110 -0.010737
1771 -0.307097 -0.141023 1.064348 -0.206655 -0.280981 -0.404243 0.009455 -0.307359 -0.216433 0.575888 ... -0.055588 -0.050494 -0.424589 -0.018274 -0.049913 -0.493710 -0.026735 -0.185074 -0.028474 -0.010693

2 rows × 35 columns

Recap and Conclusions

Recap


In this lecture we

  1. Discussed theoretical properties of MSE
  2. Set up the empirical example
    1. Data
    2. Exploration
    3. Preparation
  3. Met scikit-learn

Next Questions


Ready for actual prediction:

  • How do predictors work in scikit-learn?
  • How to attach them to pipelines?
  • How to evaluate competing models?

References

Huyen, Chip. 2022. Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications. First edition. Beijing Boston Farnham Sebastopol Tokyo: O’Reilly.
James, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan E. Taylor. 2023. An Introduction to Statistical Learning: With Applications in Python. Springer Texts in Statistics. Cham: Springer.
Lau, Sam. 2023. Learning Data Science. 1st ed. Sebastopol: O’Reilly Media, Incorporated.