Model Evaluation

Evaluation Is Not a Decoration

A model is not useful merely because it produces predictions.
It is useful if:
- It generalizes to new data
- Its errors are acceptable in context
- Its evaluation matches the real decision goal

This lesson focuses on disciplined evaluation for classification models.

Load Data

import pandas as pd

df = pd.read_csv("data/ml-ready/cdi-customer-churn.csv")
X = df.drop(columns="churn")
y = df["churn"]
Train/Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

Define Preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
numeric_features = X.select_dtypes(include=["int64", "float64"]).columns
categorical_features = X.select_dtypes(include=["object"]).columns
numeric_transformer = Pipeline(
    steps=[("scaler", StandardScaler())]
)
categorical_transformer = Pipeline(
    steps=[("encoder", OneHotEncoder(handle_unknown="ignore"))]
)
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

Train a Baseline Classifier
from sklearn.linear_model import LogisticRegression
clf = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", LogisticRegression(max_iter=1000))
    ]
)
clf.fit(X_train, y_train)
Predictions and Probabilities
import numpy as np
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]
y_pred[:10], np.round(y_prob[:10], 3)

(array([0, 1, 1, 1, 1, 1, 1, 1, 0, 1]),
 array([0.345, 0.666, 0.649, 0.827, 0.811, 0.825, 0.768, 0.57 , 0.411,
        0.666]))
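The [:, 1] slice assumes the positive class sits in the second column. predict_proba orders its columns by the model's classes_ attribute, which is worth verifying rather than assuming. A quick check on toy data (the toy arrays here are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data just to demonstrate column ordering
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

model = LogisticRegression().fit(X_toy, y_toy)

# predict_proba columns follow model.classes_ (sorted labels),
# so with labels {0, 1}, column 1 is the probability of class 1
print(model.classes_)  # [0 1]
probs = model.predict_proba(X_toy)

# Each row is a probability distribution over the classes
assert np.allclose(probs.sum(axis=1), 1.0)
```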
Confusion Matrix
A confusion matrix breaks predictions into four counts:
- True positives (TP)
- False positives (FP)
- True negatives (TN)
- False negatives (FN)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[19, 39],
       [19, 83]])
For readability, we can label it.
import pandas as pd
cm_df = pd.DataFrame(
cm,
index=["Actual 0", "Actual 1"],
columns=["Pred 0", "Pred 1"]
)
cm_df

|          | Pred 0 | Pred 1 |
|---|---|---|
| Actual 0 | 19 | 39 |
| Actual 1 | 19 | 83 |
Accuracy Can Be Misleading
Accuracy answers:
How often was the prediction correct?
But accuracy does not reflect:
- Class imbalance
- Asymmetric error costs
For churn, false negatives (missed churners, i.e. lost customers) are often more costly than false positives (unneeded retention offers).
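A quick illustration of why accuracy alone can deceive, using made-up class counts rather than our churn data: a "model" that always predicts the majority class scores high accuracy while catching zero churners.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 90 non-churners, 10 churners
y_true = np.array([0] * 90 + [1] * 10)

# A degenerate model that always predicts the majority class
y_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_majority)  # 0.9 -- looks strong
rec = recall_score(y_true, y_majority)    # 0.0 -- catches no churners
print(acc, rec)
```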
Precision, Recall, and F1
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
accuracy, precision, recall, f1

(0.6375, 0.680327868852459, 0.8137254901960784, 0.7410714285714286)
Interpretation:
- Precision: among predicted churners, how many truly churned?
- Recall: among true churners, how many did we catch?
- F1: balance between precision and recall
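These definitions can be checked by hand against the labeled confusion matrix above (TN=19, FP=39, FN=19, TP=83):

```python
# Counts read off the confusion matrix above
tn, fp, fn, tp = 19, 39, 19, 83

precision = tp / (tp + fp)                          # 83 / 122
recall = tp / (tp + fn)                             # 83 / 102
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = (tp + tn) / (tp + tn + fp + fn)          # 102 / 160

# Matches the sklearn values reported above
print(round(precision, 3), round(recall, 3), round(f1, 3), accuracy)
```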
Thresholds Change Behavior
By default, predict converts probabilities to labels using a threshold of 0.5.
We can change it.
def predict_with_threshold(probs, threshold=0.5):
    return (probs >= threshold).astype(int)

Compare metrics at different thresholds.
thresholds = [0.3, 0.5, 0.7]
rows = []
for t in thresholds:
    y_t = predict_with_threshold(y_prob, threshold=t)
    rows.append({
        "threshold": t,
        "precision": precision_score(y_test, y_t),
        "recall": recall_score(y_test, y_t),
        "f1": f1_score(y_test, y_t)
    })
pd.DataFrame(rows)

|   | threshold | precision | recall | f1 |
|---|---|---|---|---|
| 0 | 0.3 | 0.647059 | 0.970588 | 0.776471 |
| 1 | 0.5 | 0.680328 | 0.813725 | 0.741071 |
| 2 | 0.7 | 0.786885 | 0.470588 | 0.588957 |
Lower thresholds usually increase recall.
Higher thresholds usually increase precision.
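Rather than checking a handful of thresholds by hand, precision_recall_curve sweeps every threshold at once. A minimal sketch on toy labels and scores (stand-ins for y_test and y_prob above):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and predicted scores, made up for illustration
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.2, 0.7])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# As the threshold rises along the returned arrays,
# recall never increases -- the tradeoff described above
assert all(recall[i] >= recall[i + 1] for i in range(len(recall) - 1))
print(recall)
```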
ROC Curve and AUC
The ROC curve summarizes performance across thresholds.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, thr = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)
auc

0.6511156186612576
Plot the ROC curve.
fig, ax = plt.subplots()
ax.plot(fpr, tpr)
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.set_title("ROC Curve")
plt.show()
AUC ranges from 0.5 (no skill, i.e. random ranking) to 1.0 (perfect separation).
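One useful reading of AUC: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A brute-force check on toy data (the labels and scores below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores
y_true = np.array([0, 0, 1, 0, 1, 1])
y_score = np.array([0.2, 0.6, 0.4, 0.1, 0.8, 0.7])

# Fraction of (positive, negative) pairs ranked correctly,
# counting ties as half credit
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
manual_auc = np.mean(
    [1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg]
)

assert np.isclose(manual_auc, roc_auc_score(y_true, y_score))
print(manual_auc)  # 8 of 9 pairs ranked correctly
```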
Cross-Validation
A single train/test split can be noisy.
Cross-validation provides a more stable estimate of generalization.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
scores, scores.mean()

(array([0.74782609, 0.75862069, 0.80869565, 0.74178404, 0.79475983]),
 np.float64(0.7703372583343606))
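cross_val_score reports one metric at a time; cross_validate can collect several in one pass. A sketch on synthetic data (make_classification stands in for our churn frame here):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the churn features and labels
X_demo, y_demo = make_classification(n_samples=300, random_state=42)

results = cross_validate(
    LogisticRegression(max_iter=1000),
    X_demo,
    y_demo,
    cv=5,
    scoring=["f1", "precision", "recall"],
)

# One array of 5 fold scores per metric, keyed as "test_<metric>"
print(results["test_f1"].mean(), results["test_recall"].mean())
```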
Looking Ahead
In the next lesson, we will focus on overfitting and regularization.
The goal is not to chase metrics.
The goal is to build models that generalize.