Data Preparation for Machine Learning

  • ID: MLPY-F-L03
  • Type: Lesson
  • Audience: Public
  • Theme: Preventing leakage through structured preparation

Load the Dataset

import pandas as pd

# Load the churn dataset and preview the first five rows.
df = pd.read_csv("data/ml-ready/cdi-customer-churn.csv")
df.head()
   customer_id  tenure_months  monthly_spend  support_calls   contract_type  autopay  churn
0      C100000              6          45.85              2        one-year       no      0
1      C100001             17          69.95              1  month-to-month      yes      1
2      C100002             64          70.98              0  month-to-month       no      1
3      C100003             59          33.02              2  month-to-month      yes      0
4      C100004              8          70.74              0  month-to-month      yes      1

Define Features and Target

# Separate the predictors (X) from the label the model should learn (y).
X = df.drop(columns="churn")
y = df["churn"]

Train/Test Split

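from sklearn.model_selection import train_test_split

# Split before fitting any transformer so no test-set statistics leak into training.
# stratify=y keeps the churn rate the same in both partitions; random_state makes
# the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)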
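
A quick way to confirm that stratification worked is to compare the churn rate in the two partitions. The sketch below uses a small synthetic dataset (an assumption, standing in for the lesson's `X` and `y`; the names `X_demo`, `y_demo`, and the 30% churn rate are illustrative only) so that it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data: 1,000 rows, 30% positives (assumed values).
rng = np.random.default_rng(0)
y_demo = pd.Series(
    np.r_[np.ones(300, dtype=int), np.zeros(700, dtype=int)], name="churn"
)
X_demo = pd.DataFrame({"feature": rng.normal(size=len(y_demo))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, the positive rate should be nearly identical in both partitions.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train churn rate: {train_rate:.3f}, test churn rate: {test_rate:.3f}")
```

The same two `.mean()` calls can be run on the lesson's `y_train` and `y_test` to check the real split.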

Identify Feature Types

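# Select feature columns by dtype. customer_id is a row identifier, not a predictor,
# so leave it out of the columns to encode; ColumnTransformer drops unlisted columns
# by default.
numeric_features = X.select_dtypes(include=["int64", "float64"]).columns
categorical_features = X.select_dtypes(include=["object"]).columns.drop("customer_id")

numeric_features, categorical_features
(Index(['tenure_months', 'monthly_spend', 'support_calls'], dtype='str'),
 Index(['contract_type', 'autopay'], dtype='str'))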
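
Selecting by dtype alone cannot tell a real categorical feature from a row identifier such as `customer_id`, which has one distinct value per row and would produce one one-hot column per customer. One possible check, sketched here on a five-row sample with the lesson's column names (the `sample` frame and the 1.0 threshold are assumptions for illustration), is to compare the number of distinct values with the number of rows.

```python
import pandas as pd

# Five-row sample mirroring the lesson's string columns (illustrative only).
sample = pd.DataFrame({
    "customer_id": ["C100000", "C100001", "C100002", "C100003", "C100004"],
    "contract_type": ["one-year", "month-to-month", "month-to-month",
                      "month-to-month", "month-to-month"],
    "autopay": ["no", "yes", "no", "yes", "yes"],
})

# Fraction of distinct values per column; 1.0 means every row is unique.
unique_ratio = sample.nunique() / len(sample)

# Columns where every value is unique behave like row identifiers.
id_like = unique_ratio[unique_ratio == 1.0].index.tolist()
print(id_like)
```

On a full dataset a threshold slightly below 1.0 may be more practical, since a high-cardinality column can be uninformative even when a few values repeat.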

Define Preprocessing Pipeline

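from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline

# Scale numeric columns to zero mean and unit variance.
numeric_transformer = Pipeline(
    steps=[("scaler", StandardScaler())]
)

# One-hot encode categorical columns; categories not seen during fit
# are encoded as all zeros instead of raising an error.
categorical_transformer = Pipeline(
    steps=[("encoder", OneHotEncoder(handle_unknown="ignore"))]
)

# Route each column group to its transformer. Nothing is fitted here;
# fit on the training split only.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)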
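
Defining the preprocessor does not fit anything yet. To keep test-set statistics out of training, the usual pattern is to call `fit_transform` on the training split and only `transform` on the test split. The sketch below illustrates that pattern on a tiny hypothetical frame: the column names follow the lesson, but the rows (including the `two-year` contract value) and the names `demo` and `demo_preprocessor` are assumptions made so the example is self-contained.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hypothetical frame with the lesson's column names (values are illustrative).
demo = pd.DataFrame({
    "tenure_months": [6, 17, 64, 59, 8, 30, 12, 48],
    "monthly_spend": [45.85, 69.95, 70.98, 33.02, 70.74, 50.0, 61.5, 39.9],
    "support_calls": [2, 1, 0, 2, 0, 3, 1, 0],
    "contract_type": ["one-year", "month-to-month", "month-to-month",
                      "month-to-month", "month-to-month", "two-year",
                      "one-year", "month-to-month"],
    "autopay": ["no", "yes", "no", "yes", "yes", "no", "yes", "no"],
    "churn": [0, 1, 1, 0, 1, 0, 0, 1],
})

X_demo = demo.drop(columns="churn")
y_demo = demo["churn"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

num_cols = ["tenure_months", "monthly_spend", "support_calls"]
cat_cols = ["contract_type", "autopay"]

demo_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ]
)

# Learn means, standard deviations, and category lists from the training rows only.
X_tr_prepared = demo_preprocessor.fit_transform(X_tr)

# Reuse the learned statistics on the test rows; do not call fit here.
X_te_prepared = demo_preprocessor.transform(X_te)

print(X_tr_prepared.shape, X_te_prepared.shape)
```

Both outputs have the same number of columns because the encoder's category list is fixed at fit time; a category that appears only in the test rows is encoded as all zeros rather than adding a column.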