Data Preparation for Machine Learning

  • ID: MLPY-F-L03
  • Type: Lesson
  • Audience: Public
  • Theme: Preventing leakage through structured preparation

Loading the Dataset

import pandas as pd

df = pd.read_csv("data/ml-ready/cdi-customer-churn.csv")
print(df.head())
  customer_id  tenure_months  monthly_spend  support_calls   contract_type  \
0     C100000              6          45.85              2        one-year   
1     C100001             17          69.95              1  month-to-month   
2     C100002             64          70.98              0  month-to-month   
3     C100003             59          33.02              2  month-to-month   
4     C100004              8          70.74              0  month-to-month   

  autopay  churn  
0      no      0  
1     yes      1  
2      no      1  
3     yes      0  
4     yes      1  

Define Features and Target

# Separate the predictors (X) from the target column (y).
X = df.drop(columns="churn")
y = df["churn"]

Train/Test Split

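from sklearn.model_selection import train_test_split

# Split before fitting any transformer so test rows cannot influence preprocessing.
# stratify=y keeps the churn rate the same in the training and test subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)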
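
To see what stratify=y does, the split can be checked on a small example. The sketch below is a minimal illustration, not part of the lesson's dataset: the toy frame and the names X_toy, y_toy, train_rate and test_rate are hypothetical, and the 30% churn rate is an assumed value chosen only to make the proportions easy to read.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the churn dataset (assumed 30% churners).
toy = pd.DataFrame({
    "tenure_months": range(100),
    "churn": [1] * 30 + [0] * 70,
})
X_toy = toy.drop(columns="churn")
y_toy = toy["churn"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both subsets keep the overall class proportions.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train churn rate: {train_rate:.2f}, test churn rate: {test_rate:.2f}")
```

Without stratify, the two rates can drift apart by chance, which matters most when the positive class is rare.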

Identify Feature Types

numeric_features = X.select_dtypes(include=["int64", "float64"]).columns
# customer_id is a unique row identifier, not a predictive category, so it is
# excluded rather than one-hot encoded.
categorical_features = X.select_dtypes(include=["object"]).columns.drop("customer_id")

numeric_features, categorical_features
(Index(['tenure_months', 'monthly_spend', 'support_calls'], dtype='str'),
 Index(['contract_type', 'autopay'], dtype='str'))
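
Selecting columns by dtype alone also picks up identifier columns such as customer_id, which carry no generalizable signal and would add one one-hot column per customer. One possible way to flag such columns is a simple cardinality check. The sketch below assumes a tiny frame built from the rows printed earlier; the names toy, unique_ratio and id_like are hypothetical, and the "every value is unique" rule is an illustrative heuristic rather than a fixed threshold.

```python
import pandas as pd

# Toy frame copied from the first five rows shown by df.head().
toy = pd.DataFrame({
    "customer_id": ["C100000", "C100001", "C100002", "C100003", "C100004"],
    "contract_type": ["one-year", "month-to-month", "month-to-month",
                      "month-to-month", "month-to-month"],
    "autopay": ["no", "yes", "no", "yes", "yes"],
})

# Fraction of distinct values per column; 1.0 means every row is unique.
unique_ratio = toy.nunique() / len(toy)

# Columns where every value is unique behave like identifiers.
id_like = unique_ratio[unique_ratio == 1.0].index.tolist()
print(id_like)
```

On a full dataset a threshold slightly below 1.0 may be more practical, since a genuine feature rarely has a distinct value on nearly every row.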

Define Preprocessing Pipeline

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline

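# Scale numeric columns to zero mean and unit variance.
numeric_transformer = Pipeline(
    steps=[("scaler", StandardScaler())]
)

# One-hot encode categories; unseen categories at predict time become all zeros.
categorical_transformer = Pipeline(
    steps=[("encoder", OneHotEncoder(handle_unknown="ignore"))]
)

# Columns not named in a transformer are dropped (remainder="drop" is the default).
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)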
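
Using the Preprocessor Without Leakage

Defining the preprocessor does not fit anything yet. To keep test data out of the learned statistics, fit on the training rows only and then apply the fitted transformer to the test rows. The sketch below is a minimal, self-contained illustration under stated assumptions: the miniature train and test frames reuse values from the printed rows, the "two-year" category is hypothetical and included only to show how an unseen category is handled, and sparse_threshold=0 is set just to force a dense array that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Miniature frames with the lesson's column names (illustrative only).
train = pd.DataFrame({
    "tenure_months": [6, 17, 64, 59],
    "monthly_spend": [45.85, 69.95, 70.98, 33.02],
    "contract_type": ["one-year", "month-to-month",
                      "month-to-month", "month-to-month"],
})
test = pd.DataFrame({
    "tenure_months": [8],
    "monthly_spend": [70.74],
    "contract_type": ["two-year"],  # hypothetical category never seen in training
})

prep = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["tenure_months", "monthly_spend"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["contract_type"]),
    ],
    sparse_threshold=0,  # always return a dense array for easy inspection
)

# Learn means, variances and categories from the training rows only.
train_arr = prep.fit_transform(train)

# Reuse those learned statistics on the test rows; nothing is re-fitted here.
test_arr = prep.transform(test)

# The scaler's means come from the training data alone.
train_means = prep.named_transformers_["num"].mean_
print(train_arr.shape, test_arr.shape)
print(train_means)
```

Calling fit_transform on the full dataset before splitting would let test rows shape the scaling statistics and the category list, which is the leakage this lesson's structure is meant to prevent.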