Overfitting and Regularization
When Good Training Performance Is Not Enough
A model can perform extremely well on training data and still fail in practice.
This happens when the model memorizes patterns that do not generalize.
This phenomenon is called overfitting.
Overfitting occurs when:
- The model is too complex for the available data
- Noise is mistaken for signal
- Evaluation is performed incorrectly
- Data leakage is present
The goal of machine learning is not to fit training data.
The goal is to generalize.
Load Data
import pandas as pd
df = pd.read_csv("data/ml-ready/cdi-customer-churn.csv")
X = df.drop(columns="churn")
y = df["churn"]
Train/Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.2,
random_state=42,
stratify=y
)
Define Preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
numeric_features = X.select_dtypes(include=["int64", "float64"]).columns
categorical_features = X.select_dtypes(include=["object"]).columns
numeric_transformer = Pipeline(
steps=[("scaler", StandardScaler())]
)
categorical_transformer = Pipeline(
steps=[("encoder", OneHotEncoder(handle_unknown="ignore"))]
)
preprocessor = ColumnTransformer(
transformers=[
("num", numeric_transformer, numeric_features),
("cat", categorical_transformer, categorical_features)
]
)
A More Flexible Model
Decision trees can model complex patterns.
They can also overfit easily.
from sklearn.tree import DecisionTreeClassifier
tree_clf = Pipeline(
steps=[
("preprocessor", preprocessor),
("classifier", DecisionTreeClassifier(random_state=42))
]
)
tree_clf.fit(X_train, y_train)
Compare Training vs Test Performance
from sklearn.metrics import accuracy_score
train_acc = accuracy_score(y_train, tree_clf.predict(X_train))
test_acc = accuracy_score(y_test, tree_clf.predict(X_test))
train_acc, test_acc
(1.0, 0.6)
If training accuracy is much higher than test accuracy, the model is likely overfitting.
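The same gap can be measured more robustly with cross-validation, which averages held-out accuracy over several folds instead of relying on a single split. A minimal sketch on synthetic data (not the course CSV; `make_classification` and its parameters here are illustrative):

```python
# Synthetic demo: an unconstrained decision tree memorizes the training
# set, so its training accuracy far exceeds its cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=5, flip_y=0.2, random_state=42
)

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_demo, y_demo)

train_score = tree.score(X_demo, y_demo)  # accuracy on data the tree has seen
cv_score = cross_val_score(tree, X_demo, y_demo, cv=5).mean()  # held-out folds

print(f"train: {train_score:.2f}, cross-val: {cv_score:.2f}")
```

The training score is perfect because the tree can split until every leaf is pure, while the cross-validated score reflects how the same model behaves on data it has not seen.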
Controlling Complexity
Decision trees have parameters that limit complexity:
- max_depth
- min_samples_split
- min_samples_leaf
Reducing complexity often improves generalization.
tree_reg = Pipeline(
steps=[
("preprocessor", preprocessor),
("classifier", DecisionTreeClassifier(
max_depth=3,
min_samples_leaf=10,
random_state=42
))
]
)
tree_reg.fit(X_train, y_train)
train_acc_reg = accuracy_score(y_train, tree_reg.predict(X_train))
test_acc_reg = accuracy_score(y_test, tree_reg.predict(X_test))
train_acc_reg, test_acc_reg
(0.6734375, 0.6625)
The regularized tree no longer memorizes the training set, and the gap between training and test accuracy narrows.
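Rather than picking complexity settings by hand, they can be chosen by cross-validated search. A sketch on synthetic data (the grid values below are illustrative, not prescriptive):

```python
# Choose tree complexity by cross-validation instead of guessing.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=5, flip_y=0.2, random_state=42
)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [2, 3, 5, 10, None], "min_samples_leaf": [1, 5, 10]},
    cv=5,
)
grid.fit(X_demo, y_demo)

print(grid.best_params_)  # complexity settings with the best held-out accuracy
print(round(grid.best_score_, 3))
```

To tune the lesson's pipeline directly, the grid keys would be prefixed with the step name, e.g. `classifier__max_depth`.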
Regularization in Linear Models
Regularization also applies to linear models.
It adds a penalty for large coefficients.
Two common forms:
- Ridge (L2 penalty)
- Lasso (L1 penalty)
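The practical difference between the two penalties is easiest to see on a regression task. A minimal sketch on synthetic data (feature counts and `alpha` values are illustrative): L2 shrinks all coefficients toward zero, while L1 drives some of them exactly to zero.

```python
# Compare unpenalized, L2 (Ridge), and L1 (Lasso) coefficients.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=10.0, random_state=42
)

ols = LinearRegression().fit(X_demo, y_demo)
ridge = Ridge(alpha=10.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=5.0).fit(X_demo, y_demo)

print(np.linalg.norm(ols.coef_))    # unpenalized coefficient norm
print(np.linalg.norm(ridge.coef_))  # smaller under the L2 penalty
print(int((np.abs(lasso.coef_) < 1e-8).sum()))  # features Lasso zeroed out
```

Because Lasso produces exact zeros, it doubles as a rough form of feature selection.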
from sklearn.linear_model import LogisticRegression
log_reg = Pipeline(
steps=[
("preprocessor", preprocessor),
("classifier", LogisticRegression(
penalty="l2",
C=0.5,
max_iter=1000
))
]
)
log_reg.fit(X_train, y_train)
train_acc_lr = accuracy_score(y_train, log_reg.predict(X_train))
test_acc_lr = accuracy_score(y_test, log_reg.predict(X_test))
train_acc_lr, test_acc_lr
(0.7796875, 0.63125)
The parameter C is the inverse of regularization strength.
Lower C means stronger regularization.
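This inverse relationship shows up directly in the size of the fitted coefficients. A sketch on synthetic data (the C values are illustrative):

```python
# Smaller C -> stronger L2 penalty -> smaller coefficient norm.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

norms = {}
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    norms[C] = np.linalg.norm(clf.coef_)

print(norms)  # coefficient norm grows as C (inverse regularization) grows
```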
Bias–Variance Tradeoff
Overfitting and underfitting represent two extremes:
- High variance → overfitting
- High bias → underfitting
A well-calibrated model balances bias and variance.
The objective is stable performance on unseen data.
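Both extremes can be seen side by side with decision trees of different depths. A sketch on noisy synthetic data (dataset parameters are illustrative): a depth-1 stump underfits, an unconstrained tree overfits.

```python
# High bias vs high variance, using tree depth as the complexity knob.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=5, flip_y=0.2, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

stump = DecisionTreeClassifier(max_depth=1, random_state=42).fit(X_tr, y_tr)
deep = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

# High bias: train and test accuracy are both modest, and close together.
print(stump.score(X_tr, y_tr), stump.score(X_te, y_te))
# High variance: perfect on train, a large drop on test.
print(deep.score(X_tr, y_tr), deep.score(X_te, y_te))
```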
Looking Ahead
In the next lesson, we move from generalization control to interpretation.
Feature importance helps us understand model behavior.
But interpretation requires caution.