Overfitting and Regularization

  • ID: MLPY-F-L07
  • Type: Lesson
  • Audience: Public
  • Theme: Generalization discipline

When Good Training Performance Is Not Enough

A model can perform extremely well on training data and still fail in practice.

This happens when the model memorizes patterns that do not generalize.

This phenomenon is called overfitting.

Overfitting occurs when:

  • The model is too complex for the available data
  • Noise is mistaken for signal
  • Evaluation is performed incorrectly
  • Data leakage is present

The goal of machine learning is not to fit training data.

The goal is to generalize.
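The difference between fitting and generalizing is easy to demonstrate. Below is a minimal sketch on randomly generated data (an assumption for illustration; this is not the churn dataset used in this lesson): an unconstrained tree memorizes its training points perfectly yet scores noticeably lower on points it has never seen.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy labels: part of each label is pure noise, so a model
# that fits the training set perfectly has memorized that noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unconstrained tree can drive training error to zero.
model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(model.score(X_tr, y_tr))  # 1.0: the training data is memorized
print(model.score(X_te, y_te))  # lower: the generalization gap
```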


Load Data

import pandas as pd

df = pd.read_csv("data/ml-ready/cdi-customer-churn.csv")

X = df.drop(columns="churn")
y = df["churn"]

Note that customer_id remains among the features here; a unique identifier carries no generalizable signal, and encoding it is itself a route to overfitting.

Train/Test Split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

Define Preprocessing

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline

numeric_features = X.select_dtypes(include=["int64", "float64"]).columns
categorical_features = X.select_dtypes(include=["object"]).columns

numeric_transformer = Pipeline(
    steps=[("scaler", StandardScaler())]
)

categorical_transformer = Pipeline(
    steps=[("encoder", OneHotEncoder(handle_unknown="ignore"))]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

A More Flexible Model

Decision trees can model complex patterns.

They can also overfit easily.

from sklearn.tree import DecisionTreeClassifier

tree_clf = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", DecisionTreeClassifier(random_state=42))
    ]
)

tree_clf.fit(X_train, y_train)
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  Index(['tenure_months', 'monthly_spend', 'support_calls'], dtype='str')),
                                                 ('cat',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  Index(['customer_id', 'contract_type', 'autopay'], dtype='str'))])),
                ('classifier', DecisionTreeClassifier(random_state=42))])

Compare Training vs Test Performance

from sklearn.metrics import accuracy_score

train_acc = accuracy_score(y_train, tree_clf.predict(X_train))
test_acc = accuracy_score(y_test, tree_clf.predict(X_test))

train_acc, test_acc
(1.0, 0.6)

If training accuracy is much higher than test accuracy, the model is likely overfitting.
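A single split can also mislead: one lucky or unlucky test set can hide or exaggerate the gap. Cross-validation averages over several splits and gives a sturdier out-of-sample estimate. A sketch on synthetic data (an assumption for illustration; the same check applies to the tree pipeline above):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn features (illustrative assumption).
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 5))
y = (X[:, 0] - X[:, 1] + rng.normal(size=400) > 0).astype(int)

model = DecisionTreeClassifier(random_state=42)

# Training accuracy on the full data vs. cross-validated accuracy:
# a large gap between the two is the signature of overfitting.
train_acc = model.fit(X, y).score(X, y)
cv_acc = cross_val_score(model, X, y, cv=5).mean()
print(train_acc, cv_acc)
```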


Controlling Complexity

Decision trees have parameters that limit complexity:

  • max_depth
  • min_samples_split
  • min_samples_leaf

Limiting complexity usually costs some training accuracy but improves generalization.

tree_reg = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", DecisionTreeClassifier(
            max_depth=3,
            min_samples_leaf=10,
            random_state=42
        ))
    ]
)

tree_reg.fit(X_train, y_train)

train_acc_reg = accuracy_score(y_train, tree_reg.predict(X_train))
test_acc_reg = accuracy_score(y_test, tree_reg.predict(X_test))

train_acc_reg, test_acc_reg
(0.6734375, 0.6625)

Training accuracy drops, but the gap between training and test accuracy nearly closes: the constrained tree generalizes about as well as it fits.
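The effect of complexity on the train/validation gap can be traced directly by sweeping max_depth. A minimal sketch on synthetic data (an assumption for illustration; the churn pipeline above could be swept the same way):

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn features (illustrative assumption).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.8 * rng.normal(size=500) > 0).astype(int)

# As max_depth grows, training accuracy keeps rising while
# cross-validated accuracy plateaus, so the gap widens.
depths = [1, 2, 3, 5, 10, None]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42),
    X, y,
    param_name="max_depth",
    param_range=depths,
    cv=5,
)
for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"max_depth={d}: train={tr:.3f}, val={va:.3f}")
```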


Regularization in Linear Models

Regularization also applies to linear models.

It adds a penalty for large coefficients.

Two common forms:

  • Ridge (L2 penalty)
  • Lasso (L1 penalty)

from sklearn.linear_model import LogisticRegression

log_reg = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", LogisticRegression(
            penalty="l2",
            C=0.5,
            max_iter=1000
        ))
    ]
)

log_reg.fit(X_train, y_train)

train_acc_lr = accuracy_score(y_train, log_reg.predict(X_train))
test_acc_lr = accuracy_score(y_test, log_reg.predict(X_test))

train_acc_lr, test_acc_lr
(0.7796875, 0.63125)

The parameter C is the inverse of regularization strength.

Lower C means a stronger penalty and smaller coefficients.
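The effect of C can be made visible by sweeping it and watching the coefficient norm shrink. A sketch on synthetic data (an illustrative assumption, not the churn dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative assumption).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + X[:, 1] + rng.normal(size=300) > 0).astype(int)

# Smaller C -> stronger L2 penalty -> coefficients shrink toward zero.
norms = {}
for C in [100.0, 1.0, 0.01]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms[C] = np.linalg.norm(clf.coef_)
    print(f"C={C}: ||coef|| = {norms[C]:.3f}")
```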


Bias–Variance Tradeoff

Overfitting and underfitting represent two extremes:

  • High variance → overfitting
  • High bias → underfitting

A well-regularized model balances bias and variance.

The objective is stable performance on unseen data.
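A learning curve is one practical diagnostic for which extreme a model sits at: if training and validation scores stay far apart as the training set grows, variance dominates; if both plateau at a low score, bias dominates. A sketch on synthetic data (an illustrative assumption):

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative assumption).
rng = np.random.default_rng(42)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + 0.7 * rng.normal(size=600) > 0).astype(int)

# An unconstrained tree: training accuracy stays at 1.0 while
# validation accuracy lags, the high-variance signature.
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=42),
    X, y,
    train_sizes=[0.2, 0.5, 1.0],
    cv=5,
)
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n}: train={tr:.3f}, val={va:.3f}")
```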


Looking Ahead

In the next lesson, we move from generalization control to interpretation.

Feature importance helps us understand model behavior.

But interpretation requires caution.