Regression I: MSE, Data Preparation

MSE, Train-Test Split, Exploratory Analysis, and Pipelines

Vladislav Morozov

Introduction

Lecture Info

Learning Outcomes

This lecture — first part of our illustrated regression example


By the end, you should be able to

  • Discuss properties of MSE, its relation to the conditional mean
  • Explain why we need a separate test set
  • Perform basics of exploratory data analysis
  • Use scikit-learn transformers and pipelines

References


  • Chapter 3 of James et al. (2023)
  • Relevant material from sections 7.1–7.3 of the scikit-learn documentation (transformations, pipelines, feature extraction)
  • More on exploratory data analysis with Python: chapters 9–12 in Lau (2023)

Empirical Setup

Framing Prediction Tasks

Imagine the following scenario

  • You are interested in investing in a region in California
  • Want to decide where to invest

Investing may require buying some houses — want to accurately price them

Thus: current prediction problem is

Accurately predict house prices in small subregions in California

Meet scikit-learn

For learning we will use scikit-learn — a fantastic and Pythonic library for predictive learning


Will use capabilities from different blocks. Imports:

Imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px

# Data source
from sklearn.datasets import fetch_california_housing

# For splitting dataset
from sklearn.model_selection import train_test_split

# For composing transformations
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# For preprocessing data
from sklearn.preprocessing import (
    FunctionTransformer, 
    PolynomialFeatures,
    StandardScaler
)

Data


Will use the California housing data available with scikit-learn

  • Nice, clean dataset
  • Can be retrieved with special function in sklearn.datasets
  • Describes median price of house in block and some block characteristics

Loading the Data

# data_path: local folder where the dataset is cached (defined beforehand)
data = fetch_california_housing(data_home=data_path, as_frame=True)
data_df = data.frame.copy()                     # Separate data DF
print(data.DESCR)
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).

A household is a group of people residing within a home. Since the average
number of rooms and bedrooms in this dataset are provided per household, these
columns may take surprisingly large values for block groups with few households
and many empty houses, such as vacation resorts.

It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.

.. rubric:: References

- Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
  Statistics and Probability Letters, 33 (1997) 291-297

Key Data Info

Dataset is clean, with no missing values and nice names — good for learning and practicing now:

data_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrms    20640 non-null  float64
 4   Population   20640 non-null  float64
 5   AveOccup     20640 non-null  float64
 6   Latitude     20640 non-null  float64
 7   Longitude    20640 non-null  float64
 8   MedHouseVal  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB

Learning Steps

Key Steps

We will go through the following key steps:

  • Choose the risk (key metric)
  • Split data into training and test sets
  • Explore the training set, select features
  • Train and validate models, select best-performing during validation
  • Evaluate the final chosen model using the test sample

This lecture: first three steps

Are There More Steps?

  • In research, not really — we get a nice model that solves a fixed problem
  • In production use, many things beyond just model development
    • Data engineering
    • Model deployment and monitoring
    • Infrastructure

See book by Huyen (2022) for a great overview

Risk in Regression

(Root) Mean Square Error

Which Criterion to Use?

In this example just want to be precise:

  • Overprediction and underprediction equally bad
  • \(\Rightarrow\) symmetric loss


Also want a bigger penalty for bigger mistakes

Mean Squared Error

Popular choice of risk with above properties — mean squared error: \[ MSE(\hat{Y}) = \E[(Y-\hat{Y})^2] \]

Why is MSE popular?

  • MSE is “generic”: For \(Y\approx \hat{Y}\) “locally equivalent” to many other smooth risk functions
  • Motivated by maximum likelihood under normality
  • Bayes predictor is known and interpretable — \(\E[Y|\bX]\)

MSE-Optimal Predictor (Bayes Predictor)

Proposition 1 Suppose that \(Y\) has a finite second moment. Then \[ \E[Y|\bX] = \argmin_{h(\cdot): \E[h(\bX)^2]<\infty} \E[ (Y-h(\bX))^2] \]

  • MSE-best guess: conditional expectation of \(Y\)
  • Explains in what sense \(\E[Y|\bX]\) was the “best guess” for \(Y\) given “information” \(\bX\)

Proof I: Expansion

Key trick: add and subtract \(\E[Y|\bX]\) under the MSE \[ \begin{aligned} MSE(h) & = \E[(Y- h(\bX))^2] \\ & = \E\left[\left( (Y-\E[Y|\bX]) + (\E[Y|\bX] - h(\bX)) \right)^2\right] \\ & = \E\left[ (Y-\E[Y|\bX])^2 \right] + \E[(\E[Y|\bX] - h(\bX))^2]\\ & \quad + 2\E\left[ (Y-\E[Y|\bX]) (\E[Y|\bX] - h(\bX)) \right] \end{aligned} \]

Proof II: Cross-Term = 0

Recall

Proposition 2 (Properties of conditional expectations) For any variables \(V, W\) it holds that

  1. \(\E[V]= \E[ \E[V|W]]\)
  2. \(\E[f(W)V|W] =f(W)\E[V|W]\)

It follows that (why?) \[ \E\left[ (Y-\E[Y|\bX]) (\E[Y|\bX] - h(\bX)) \right] = 0 \]
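One way to see this (a sketch using the two properties above): condition on \(\bX\), pull out the known factor (property 2), and apply the law of iterated expectations (property 1): \[ \begin{aligned} \E\left[ (Y-\E[Y|\bX]) (\E[Y|\bX] - h(\bX)) \right] & = \E\left[ \E\left[ (Y-\E[Y|\bX]) (\E[Y|\bX] - h(\bX)) \big| \bX \right] \right] \\ & = \E\left[ (\E[Y|\bX] - h(\bX)) \, \E\left[ Y-\E[Y|\bX] \big| \bX \right] \right] \\ & = \E\left[ (\E[Y|\bX] - h(\bX)) \cdot 0 \right] = 0 \end{aligned} \]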

Proof III: Conclusion

So \[ MSE(h) = \E\left[ (Y-\E[Y|\bX])^2 \right] + \E[(\E[Y|\bX] - h(\bX))^2] \]

  • First term does not depend on \(h\)
  • Second term minimized by taking \(h(\bX) = \E[Y|\bX]\)

This proves Proposition 1

Root MSE

Typically instead of MSE report root MSE: \[ RMSE(\hat{Y}) = \sqrt{ \E[(Y-\hat{Y})^2] } \]

Why?

  • RMSE expressed in the same units as the outcome, more interpretable than raw MSE
  • RMSE also minimized by \(\E[Y|\bX]\)

We will focus on RMSE
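As a minimal sketch (the arrays below are illustrative, not from the housing data), the sample RMSE of a set of predictions can be computed directly with NumPy:

# Illustrative outcomes and predictions
y_true = np.array([2.5, 1.2, 3.1, 0.8])
y_pred = np.array([2.0, 1.5, 3.0, 1.1])

# Sample RMSE: square root of the average squared error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)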

Other Risks

When Is MSE not Satisfactory

Sometimes MSE is not the right choice


Examples:

  • Asymmetry in preferences, e.g. overpredicting is more dangerous than underpredicting
  • Interest in predicting a specific part of the distribution of \(Y\): e.g. price point not exceeded by 90% of houses
  • Many outliers in \(Y\) (heavy tails)

Splitting the Data

Estimating the Risk

Recall: want models with good risk: \[ R(\hat{h}_S) = \E_{\bX, Y}\left[ l(Y, \hat{h}_S(\bX)) \right] \] Expectation — taken over a new point \((\bX, Y)\) — not part of sample \(S\) used to select \(\hat{h}_S\)


How do you estimate risk of \(\hat{h}_S\)?

Training Loss

Naive approach:

  • Let \(S= \curl{(\bX_1, Y_1), \dots, (\bX_N, Y_N)}\) be the training set — sample used to select \(\hat{h}_S\)
  • Estimate \(R(\hat{h}_S)\) with empirical risk on \(S\) \[ \small \hat{R}_S(\hat{h}_S) = \dfrac{1}{N} \sum_{i=1}^N l(Y_i, \hat{h}_S(\bX_i)) \]

\(\hat{R}_S(\hat{h}_S)\) often called training loss

Issues with Evaluating on Training Data

Training loss is bad — too optimistic

\(\hat{R}_S(\hat{h}_S)\) is downward biased estimator of \(R(\hat{h}_S)\)

  • Unlike in the risk definition: the average is not over a new point
  • Intuition: \(\hat{h}_S\) is picked to do well on \(S\), and \(S\) is not “new” to \(\hat{h}_S\) (see the toy sketch below)
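A toy sketch of this optimism (purely synthetic data, separate from the housing example): fit a very flexible polynomial on a small training sample and compare its average squared loss on that sample with the loss on a large fresh sample.

rng = np.random.default_rng(0)

# Small training sample from Y = X + noise
x_tr = rng.uniform(-1, 1, size=20)
y_tr = x_tr + rng.normal(scale=0.5, size=20)

# Flexible model: degree-9 polynomial fitted on the training sample
coefs = np.polyfit(x_tr, y_tr, deg=9)

# Large fresh sample from the same distribution
x_new = rng.uniform(-1, 1, size=10_000)
y_new = x_new + rng.normal(scale=0.5, size=10_000)

train_loss = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)
new_loss = np.mean((y_new - np.polyval(coefs, x_new)) ** 2)
print(train_loss, new_loss)  # training loss is typically much smaller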

Split: Train and Test Set

Solution — use a separate test set with observations that are new

  • Split \(S\) into two sets:
    • Training set \(S_{Tr}\): used for selecting \(\hat{h}_{S_{Tr}}\)
    • Test set \(S_{Test}\) with \(N_{S_{Test}}\) observations
  • Estimate risk with average over \(S_{Test}\) \[ \hat{R}_{S_{Test}}(\hat{h}_{S_{Tr}}) = \dfrac{1}{N_{S_{Test}}} \sum_{j=1}^{N_{S_{Test}}} l(Y_j, \hat{h}_{S_{Tr}}(\bX_j)) \]

Test Set Gives Unbiased Estimator

Evaluating on test set — good picture of performance

\[ \small \E\left[ \hat{R}_{S_{Test}}(\hat{h}_{S_{Tr}}) \, \big| \, S_{Tr} \right] = R(\hat{h}_{S_{Tr}}) \]

  • This unbiasedness depends on the algorithm not seeing any part of \(S_{Test}\)
  • Otherwise you get data leakage

Caution

Do not use any part of the test set for training and comparing models!

Another Problem: Choosing Between Models

Now a problem:

  • Can’t compare models based on training set
  • Can’t compare models based on test set


How do you compare competing models?

Further Splits: Validation

Answer: split training set into training set and validation set

  • Train on the training set, check risk on the validation set
  • Each validation-set risk estimate is unbiased
  • Select model with best validation performance


Can use multiple splits for better estimates (cross-validation, more on that later)

In Practice

Can do simple split with train_test_split() from sklearn.model_selection:

train_set, test_set = train_test_split(data_df, test_size=0.2, random_state=1)
print(train_set.shape)
print(test_set.shape)
(16512, 9)
(4128, 9)

Just a simple random split into two sets


We will use cross-validation, so there is no need to explicitly split off a validation set here
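If you did want an explicit validation set, a minimal sketch is a second call to train_test_split() on the training set (the split fraction is illustrative):

# Illustrative: carve a validation set out of the training set
train_subset, val_set = train_test_split(train_set, test_size=0.25, random_state=1)
print(train_subset.shape, val_set.shape)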

Exploring the Data

Exploratory Data Analysis

Can now do exploratory data analysis:

  • Looking at data: distributions, descriptive stats, etc.
  • Identifying promising variables
  • Making some features (feature extraction)
  • Other exploration to gain insights into data

Geographical Distribution of Data
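This slide's figure (a map of the training observations) is not reproduced here. A minimal sketch of one way to draw such a plot with plotly.express (the arguments are illustrative):

# Scatter plot of block group locations, colored by median house value
fig = px.scatter(
    train_set,
    x="Longitude",
    y="Latitude",
    color="MedHouseVal",
    opacity=0.4,
    title="Median house value by location (training set)",
)
fig.show()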

Variable Distributions
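This slide's figure (histograms of the variables) is not reproduced here. A quick sketch of one way to inspect the distributions with pandas and matplotlib:

# Histogram of every numeric column in the training set
train_set.hist(bins=50, figsize=(12, 8))
plt.tight_layout()
plt.show()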

Variable Distributions: Results

  • Most variables look normal
  • But AveRooms, AveOccup, and AveBedrms look suspicious: there are some very high values
  • Check values
train_set.nlargest(3, "AveOccup").loc[:,["AveOccup", "Longitude", "Latitude"]]
          AveOccup  Longitude  Latitude
19006  1243.333333    -121.98     38.32
3364    599.714286    -120.51     40.41
16669   502.461538    -120.70     35.32

Looking at Suspicious Observations

  • These places have large prisons
  • Up to you to decide: drop or keep these observations

Feature Engineering I

Need to think about which features to include

  • Some probably not directly helpful (longitude and latitude), though maybe can be transformed into something more useful (like proximity to activity centers)
  • About others: sometimes useful to look at scatterplots and correlations


Turning raw data into useful features called feature engineering

Scatterplots

  • Median income broadly linearly predictive
  • Less obvious for other features
  • Note: house values seem to cluster at “round” valuations like $250,000, $300,000, etc.

Feature Engineering II: Adding a New Feature

Can we add any interesting variables?

  • Example: if a house has a lower share of bedrooms among all rooms, it usually has more “luxury” rooms
  • Such houses likely more expensive
  • Can define new variable: bedroom ratio
train_exp = train_set.copy()
train_exp["BedroomRatio"] = train_exp["AveBedrms"]/train_exp["AveRooms"]

Example of feature extraction

Correlations

(train_exp.corr()["MedHouseVal"]
        .sort_values(ascending=False)
)
MedHouseVal     1.000000
MedInc          0.688194
AveRooms        0.146508
HouseAge        0.105758
AveOccup       -0.021979
Population     -0.023884
AveBedrms      -0.041592
Longitude      -0.050893
Latitude       -0.139374
BedroomRatio   -0.253362
Name: MedHouseVal, dtype: float64
  • More or less confirms what we have seen
  • New feature rather strongly correlated with label — not bad

Summarizing EDA

What we learned:

  • Data quite nice for the most part
  • Potentially interesting variable: bedroom ratio
  • Some patterns due to presence of places with large prisons
    • We will keep those observations
    • Exercise: retry the analysis after dropping them from the training set (see the sketch below)
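For that exercise, a possible starting point (the occupancy threshold is illustrative, not part of the lecture):

# Drop block groups with implausibly high average occupancy (e.g. prisons)
train_no_prisons = train_set[train_set["AveOccup"] < 50].copy()
print(train_set.shape, train_no_prisons.shape)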

A Note on Limitations of Our Data

The housing data set was nice and clean

  • Good for us now to focus on key ideas
  • But be aware that real-life data is messy
    • Genuine incorrect values
    • Missing values
    • Etc.

Preparing the Data

Separating Data

Before everything: separate data into features \(\bX\) and label \(Y\):

X_train = train_set.drop("MedHouseVal", axis=1)
y_train = train_set["MedHouseVal"].copy()

print(X_train.head())
       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
15961  3.1908      52.0  5.000000   1.014184       879.0  3.117021     37.71   
1771   3.6094      42.0  4.900990   0.957096       971.0  3.204620     37.95   
16414  2.6250      16.0  8.333333   1.666667        20.0  3.333333     37.90   
5056   1.5143      34.0  3.805981   1.149526      3538.0  2.580598     34.02   
8589   7.3356      38.0  5.894904   1.057325       750.0  2.388535     33.89   

       Longitude  
15961    -122.43  
1771     -122.35  
16414    -121.24  
5056     -118.35  
8589     -118.39  

Reproducibility

  • EDA process — ad hoc/experimental
  • For training want a reproducible flow

More formally:

A data pipeline is a series of chained data transformations

Want a pipeline that

  • Ingests the raw data (original variables)
  • Produces the dataset we will present to the learning algorithms (next lecture)

Transformers

What Transformations?

What do we want?

  • Use features identified in EDA
  • Drop unused features (longitude, latitude)
  • Many algorithms work best with standardized data

In terms of transformations:

  • Create ratio of two columns for bedroom ratio
  • Drop geography
  • Make polynomials and standardize all included vars

scikit-learn Transformers

scikit-learn provides transformers — tools for preprocessing data into a suitable format

  • Transformers have a unified interface
  • Can chain transformers into pipelines
  • Can combine parallel pipelines together
  • End result takes in original variables and can be run with a single method

Example Transformer: Standardization I

First example: standardization. Want each column to have mean 0 and variance 1

Standardized version of the \(k\)th variable: \[ \tilde{X}^{(k)}_i = \dfrac{ X_i^{(k)} - \E[X_i^{(k)}] }{\sqrt{\var(X_i^{(k)})}} \]

Here the mean \(\E[X_i^{(k)}]\) and variance \(\var(X_i^{(k)})\) are unknown transformation parameters that need to be learned

Example Transformer: Standardization II

Use StandardScaler() from sklearn.preprocessing:

std_scaler = StandardScaler()

X_standardized = pd.DataFrame(
    std_scaler.fit_transform(X_train),
    columns=std_scaler.get_feature_names_out(),
    index=X_train.index,
)

# Check the mean and the standard deviation
X_standardized.agg(['mean', 'std']).round(3)
      MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
mean    -0.0       0.0       0.0        0.0         0.0       0.0       0.0       -0.0
std      1.0       1.0       1.0        1.0         1.0       1.0       1.0        1.0

Transformer Interface

Transformers in scikit-learn have the same interface.

Key common methods:

  • fit() — learn the parameters of the transformation (e.g. means and standard deviation)
  • transform() — transform data (training or new)
  • fit_transform() — combination of fit() and transform()

Transformers usually return NumPy arrays; column names of the result can be obtained from get_feature_names_out()

Applying the Transformer

  • Our StandardScaler has learned the parameters — the means and standard deviations of each column
std_scaler.mean_
array([ 3.87614927e+00,  2.86044695e+01,  5.44111400e+00,  1.09959762e+00,
        1.42525715e+03,  3.09497079e+00,  3.56321936e+01, -1.19574288e+02])
  • Can now transform any new collection of \(\bX\).
  • It will use the same parameters (also when we transform validation and test data); see the sketch below
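A minimal sketch of applying the already-fitted scaler to “new” data (here two training rows stand in for new observations):

# transform() reuses the means and standard deviations learned in fit()
X_new = X_train.head(2)          # stand-in for new observations
std_scaler.transform(X_new)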

Custom Transformers I

sklearn.preprocessing has many transformers

  • Encoders (e.g. making dummies with OneHotEncoder)
  • Normalizers and scalers
  • Functional transformations (e.g. polynomials with PolynomialFeatures)

But can also write our own transformers

Custom Transformers II

Ways of writing custom transformations

  • Based on specific simple functions with FunctionTransformer
  • Fully custom ones: just need to implement a minimal interface (inheriting from BaseEstimator and TransformerMixin is helpful)


Important part: the result should support fitting and transforming (see the sketch below)
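A minimal sketch of a fully custom transformer (this class is illustrative, not from scikit-learn) that divides the first input column by the second:

from sklearn.base import BaseEstimator, TransformerMixin

class ColumnRatio(BaseEstimator, TransformerMixin):
    """Illustrative custom transformer: ratio of the first column to the second."""

    def fit(self, X, y=None):
        # Nothing to learn for a fixed ratio
        return self

    def transform(self, X):
        X = np.asarray(X)
        return X[:, [0]] / X[:, [1]]

    def get_feature_names_out(self, input_features=None):
        return np.array(["ratio"])

Usage would look like ColumnRatio().fit_transform(X_train[["AveBedrms", "AveRooms"]])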

Custom Example: Column Ratio

  • Example: transformer for computing the share of bedrooms among all rooms with FunctionTransformer
  • Sufficient to supply the transforming function (here: divide the first column by the second)
def column_ratio(X):
    # Divide the first column by the second, keeping a 2D shape
    return X[:, [0]] / X[:, [1]]

divider_transformer = FunctionTransformer(column_ratio, validate=True)
divider_transformer
FunctionTransformer(func=<function column_ratio at 0x000001BCC8027600>,
                    validate=True)

Applying Column Ratio

Can now apply:

divider_transformer.fit_transform(X_train.loc[:, ["AveBedrms", "AveRooms"]])
array([[0.20283688],
       [0.1952862 ],
       [0.2       ],
       ...,
       [0.20586183],
       [0.19067619],
       [0.25471698]])
  • What about feature names?
  • How to fit that with the rest of the processing?

Pipelines

Pipelines

To compose transformations, can use pipelines

  • Pipeline from sklearn.pipeline
  • A pipeline is a sequence of transformations.
  • Optionally can attach a predictor at the end (next time)
  • Can operate as a single transformer/predictor
  • Simple to specify: just provide a list of tuples of form (name, Transformer) to Pipeline

Pipelines: Polynomials + Standardization

polynom_pipeline = Pipeline(
    [ 
        ('poly', PolynomialFeatures(degree=2, include_bias=False)),
        ('scale', StandardScaler()),
    ],
)
polynom_pipeline
Pipeline(steps=[('poly', PolynomialFeatures(include_bias=False)),
                ('scale', StandardScaler())])

Pipelines: Polynomial Example in Action

X_polyn = polynom_pipeline.fit_transform(X_train)
pd.DataFrame(
    X_polyn,
    columns=polynom_pipeline.get_feature_names_out(),
    index=X_train.index,
).head(3)
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude MedInc^2 MedInc HouseAge ... Population^2 Population AveOccup Population Latitude Population Longitude AveOccup^2 AveOccup Latitude AveOccup Longitude Latitude^2 Latitude Longitude Longitude^2
15961 -0.362326 1.858903 -0.168773 -0.168415 -0.486114 0.001901 0.972290 -1.422509 -0.389251 0.786236 ... -0.198471 -0.033110 -0.440025 0.468169 -0.010737 0.016055 -0.008191 0.957187 -1.086319 1.427586
1771 -0.141023 1.064348 -0.206655 -0.280981 -0.404243 0.009455 1.084596 -1.382659 -0.257675 0.591984 ... -0.185074 -0.028474 -0.346316 0.384502 -0.010693 0.025220 -0.015618 1.074766 -1.167590 1.386898
16414 -0.661450 -1.001494 1.106584 1.118131 -1.250537 0.020554 1.061199 -0.829737 -0.541351 -0.893977 ... -0.259256 -0.066439 -1.259810 1.254967 -0.010626 0.035837 -0.024165 1.050209 -1.018952 0.825090

3 rows × 44 columns

Column Transformers I

Pipelines allow sequential combination of transformations

What about parallel combinations? Recall: want to

  • Drop some variables
  • Take ratio of some other vars
  • Leave the rest untouched

Only then make polynomials and standardize

Use ColumnTransformer from sklearn.compose

Column Transformers II

  • ColumnTransformer: different transformations for different columns
  • Also simple to specify: with a list of tuples of (name, transformer, columns)
feat_extr_pipe = ColumnTransformer(
  [
    ('bedroom_ratio', divider_transformer, ['AveBedrms', 'AveRooms']),
    (
      'passthrough', 
      'passthrough', 
      [
        'MedInc', 
        'HouseAge', 
        'AveRooms', 
        'AveBedrms', 
        'Population', 
        'AveOccup',
      ]
    ),
    ('drop', 'drop', ['Longitude', 'Latitude'])
  ]
) 

Here we used a slightly more refined version of the divider transformer, which also supplies output feature names:

Details
def ratio_name(function_transformer, feature_names_in):
    # Name for the single output column produced by column_ratio
    return ["ratio"]

divider_transformer = FunctionTransformer(
    column_ratio,
    validate=True,
    feature_names_out=ratio_name,
)

Column Transformers III

Our “feature extraction” pipeline:

ColumnTransformer(transformers=[('bedroom_ratio',
                                 FunctionTransformer(feature_names_out=<function ratio_name at 0x000001BCC92A76A0>,
                                                     func=<function column_ratio at 0x000001BCC8027600>,
                                                     validate=True),
                                 ['AveBedrms', 'AveRooms']),
                                ('passthrough', 'passthrough',
                                 ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                                  'Population', 'AveOccup']),
                                ('drop', 'drop', ['Longitude', 'Latitude'])])

Results of its action

bedroom_ratio__ratio passthrough__MedInc passthrough__HouseAge passthrough__AveRooms passthrough__AveBedrms passthrough__Population passthrough__AveOccup
15961 0.202837 3.1908 52.0 5.00000 1.014184 879.0 3.117021
1771 0.195286 3.6094 42.0 4.90099 0.957096 971.0 3.204620

Discussion

  • Same column can go into several arms of the column transformer
  • Note: component names are prepended to column names with two underscores __
  • Columns in \(\bX\) not specified anywhere are handled according to the remainder argument
    • Defaults to 'drop' (can also be set to 'passthrough')
    • In small applications it may be better to be explicit about dropping (see the sketch below)
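A minimal sketch of the remainder argument (illustrative, not part of our preprocessing pipeline):

# Unlisted columns are kept as-is instead of being dropped
ct_keep_rest = ColumnTransformer(
    [('bedroom_ratio', divider_transformer, ['AveBedrms', 'AveRooms'])],
    remainder='passthrough',
)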

Adding The Remaining Components

  • Now only need to add the polynomial features and the scaler
  • Can freely combine column transformers and pipelines into other pipelines
preprocessing = Pipeline(
  [
    ('extraction', feat_extr_pipe),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scale', StandardScaler()),
  ]
)

Full Preprocessing Pipeline

Pipeline(steps=[('extraction',
                 ColumnTransformer(transformers=[('bedroom_ratio',
                                                  FunctionTransformer(feature_names_out=<function ratio_name at 0x000001BCC92A76A0>,
                                                                      func=<function column_ratio at 0x000001BCC8027600>,
                                                                      validate=True),
                                                  ['AveBedrms', 'AveRooms']),
                                                 ('passthrough', 'passthrough',
                                                  ['MedInc', 'HouseAge',
                                                   'AveRooms', 'AveBedrms',
                                                   'Population', 'AveOccup']),
                                                 ('drop', 'drop',
                                                  ['Longitude', 'Latitude'])])),
                ('poly', PolynomialFeatures(include_bias=False)),
                ('scale', StandardScaler())])

Looking at the Data

Pipeline ingests original data, applies all our steps, and gives a preprocessed dataset

pd.DataFrame(
  preprocessing.fit_transform(X_train), 
  columns=preprocessing.get_feature_names_out(),
  index=X_train.index
).head(2)
bedroom_ratio__ratio passthrough__MedInc passthrough__HouseAge passthrough__AveRooms passthrough__AveBedrms passthrough__Population passthrough__AveOccup bedroom_ratio__ratio^2 bedroom_ratio__ratio passthrough__MedInc bedroom_ratio__ratio passthrough__HouseAge ... passthrough__AveRooms^2 passthrough__AveRooms passthrough__AveBedrms passthrough__AveRooms passthrough__Population passthrough__AveRooms passthrough__AveOccup passthrough__AveBedrms^2 passthrough__AveBedrms passthrough__Population passthrough__AveBedrms passthrough__AveOccup passthrough__Population^2 passthrough__Population passthrough__AveOccup passthrough__AveOccup^2
15961 -0.178281 -0.362326 1.858903 -0.168773 -0.168415 -0.486114 0.001901 -0.221732 -0.448966 1.251270 ... -0.051199 -0.042603 -0.480251 -0.020454 -0.039705 -0.524949 -0.018452 -0.198471 -0.033110 -0.010737
1771 -0.307097 -0.141023 1.064348 -0.206655 -0.280981 -0.404243 0.009455 -0.307359 -0.216433 0.575888 ... -0.055588 -0.050494 -0.424589 -0.018274 -0.049913 -0.493710 -0.026735 -0.185074 -0.028474 -0.010693

2 rows × 35 columns

Recap and Conclusions

Recap


In this lecture we

  1. Discussed theoretical properties of MSE
  2. Set up the empirical example
    1. Data
    2. Exploration
    3. Preparation
  3. Met scikit-learn

Next Questions


Ready for actual prediction:

  • How do predictors work in scikit-learn?
  • How to attach them to pipelines?
  • How to evaluate competing models?

References

Huyen, Chip. 2022. Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications. First edition. Beijing Boston Farnham Sebastopol Tokyo: O’Reilly.
James, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan E. Taylor. 2023. An Introduction to Statistical Learning: With Applications in Python. Springer Texts in Statistics. Cham: Springer.
Lau, Sam. 2023. Learning Data Science. 1st ed. Sebastopol: O’Reilly Media, Incorporated.