ML Thinking and Problem Types

  • ID: MLPY-F-L02
  • Type: Lesson
  • Audience: Public
  • Theme: Framing predictive problems correctly

Machine Learning Begins Before Modeling

Most mistakes in machine learning happen before the first model is trained.

They occur when:

  • The prediction target is unclear
  • The data does not match the question
  • Future information leaks into training
  • Evaluation metrics do not match the real objective

Machine learning begins with framing.


What Is a Prediction Problem?

A prediction problem answers:

Given the available information, what do we want to estimate?

To make this concrete, define three things:

  1. Target (y): what you want to predict
  2. Features (X): what you will use to make the prediction
  3. Prediction context: when the prediction happens and what will be known at that time

Without these three elements, modeling is premature.
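As a lightweight habit, the three elements can be written down before any code runs. The sketch below is a plain checklist for a hypothetical churn problem; the field names and wording are illustrative, not part of any library:

```python
# A plain dictionary documenting the framing of a hypothetical churn problem.
# This is a checklist, not modeling code.
problem_framing = {
    "target": "churn",  # y: did the customer cancel within the period?
    "features": ["tenure_months", "monthly_spend", "support_calls"],  # X
    "prediction_context": (
        "scored at the start of each billing cycle, "
        "using only information available on that date"
    ),
}

for key, value in problem_framing.items():
    print(f"{key}: {value}")
```

If any of the three entries cannot be filled in, the problem is not yet framed.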


Supervised Learning in One Sentence

This guide focuses on supervised learning.

Supervised learning means:

We have labeled examples. Each row contains:

  • input features (X)
  • a known outcome (y)

The model learns a mapping from X to y that should generalize to new data.


Two Core Problem Types

The problem type is determined by the target variable.

Regression

Regression is used when the target is numeric and continuous.

Examples:

  • predicting monthly spend
  • predicting house price
  • predicting customer lifetime value

Classification

Classification is used when the target is categorical.

Examples:

  • churn vs no churn
  • fraud vs not fraud
  • disease vs no disease

A classification model may output:

  • a class label (0 or 1)
  • a probability (risk score)
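The two kinds of output can be seen side by side with scikit-learn; the toy data below is made up purely to illustrate that one fitted classifier exposes both:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: one feature, binary target (purely illustrative).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

labels = clf.predict(X)        # hard class labels: 0 or 1
scores = clf.predict_proba(X)  # per-class probabilities; each row sums to 1

print(labels)
print(scores[:, 1])  # probability of class 1, usable as a risk score
```

Which output to use depends on the decision: a hard label for an automatic action, a probability when you want to rank customers by risk or apply a custom threshold.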

Our Dataset: Churn Prediction

We will use a synthetic churn dataset generated in Lesson 01.

The goal is to predict:

  • churn = 1 if the customer churned, 0 otherwise

Load the dataset

import pandas as pd

df = pd.read_csv("data/ml-ready/cdi-customer-churn.csv")
df.head()
  customer_id  tenure_months  monthly_spend  support_calls   contract_type  autopay  churn
0     C100000              6          45.85              2        one-year       no      0
1     C100001             17          69.95              1  month-to-month      yes      1
2     C100002             64          70.98              0  month-to-month       no      1
3     C100003             59          33.02              2  month-to-month      yes      0
4     C100004              8          70.74              0  month-to-month      yes      1

Identify target and features

X = df.drop(columns="churn")
y = df["churn"]

X.shape, y.shape
((800, 6), (800,))

Confirm the problem type

y.dtype, sorted(y.unique())
(dtype('int64'), [np.int64(0), np.int64(1)])

Because the target is binary (0/1), this is a classification problem.
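It is also worth checking how the two classes are balanced, since a heavily imbalanced target changes which metrics are informative. A sketch, using a small stand-in DataFrame in place of the real dataset loaded above:

```python
import pandas as pd

# Stand-in for the churn dataset; in the lesson, df comes from the CSV above.
df = pd.DataFrame({"churn": [0, 1, 1, 0, 1, 0, 0, 0]})

counts = df["churn"].value_counts()            # absolute counts per class
rates = df["churn"].value_counts(normalize=True)  # class proportions

print(counts)
print(rates)  # heavy imbalance calls for metrics beyond plain accuracy
```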


Why Problem Type Matters

The model choice depends on the problem type, but the workflow is shared.

  • Regression uses regression metrics (MAE, MSE, R²)
  • Classification uses classification metrics (precision, recall, ROC AUC)

Using the wrong metric leads to the wrong conclusion.
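To make the distinction concrete, here is a sketch computing one metric of each kind on made-up predictions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, accuracy_score

# Regression: numeric target, error measured in the target's own units.
y_true_reg = np.array([100.0, 150.0, 200.0])
y_pred_reg = np.array([110.0, 140.0, 205.0])
mae = mean_absolute_error(y_true_reg, y_pred_reg)  # (10 + 10 + 5) / 3 ≈ 8.33

# Classification: categorical target, fraction of correct labels.
y_true_clf = np.array([0, 1, 1, 0])
y_pred_clf = np.array([0, 1, 0, 0])
acc = accuracy_score(y_true_clf, y_pred_clf)  # 3 of 4 correct → 0.75

print(mae, acc)
```

Note that neither metric makes sense for the other problem type: accuracy on a continuous target or MAE on class labels would answer the wrong question.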


The Train/Test Mental Model

A predictive model must work on new data.

To simulate this, we split the dataset into:

  • training set: used to fit the model
  • test set: used to evaluate generalization

Evaluating on training data produces overly optimistic performance estimates.

A model that performs well on the training data but fails to generalize to new data is said to overfit.

We will study overfitting later, but the mental model starts here.
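The split itself is one line with scikit-learn. The data below is synthetic, and the `test_size` and `random_state` values are illustrative choices, not requirements:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 rows, 2 features (synthetic)
y = np.array([0, 1] * 5)

# Hold out 30% of rows for evaluation; fix the seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```

`stratify=y` keeps the class proportions similar in both splits, which matters for imbalanced classification targets.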


Data Leakage: The Hidden Failure

Data leakage occurs when the model has access to information it would not have at prediction time.

Common examples:

  • using future-derived variables
  • fitting preprocessing on the full dataset before splitting
  • including target-derived variables

Leakage produces high accuracy and false confidence.

The CDI workflow prevents leakage by design:

Split first, then transform, then model.
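The rule can be shown with a scaler: fit it on the training split only, then apply the fitted transform to both splits. The data here is synthetic:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from train only
X_test_scaled = scaler.transform(X_test)        # same statistics reused, never refit

# Calling scaler.fit on the full X before splitting would leak
# test-set statistics into training: the hidden failure described above.
print(X_train_scaled.shape, X_test_scaled.shape)
```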


From Question to Workflow

Before you train any model, ask:

  1. What exactly am I predicting?
  2. When will this prediction be made?
  3. What information will be available at that time?
  4. Which error is more costly: a false positive or a false negative?
  5. What metric matches that cost?

These questions determine the workflow:

Design → Data → Model → Evaluation

Not the other way around.


Looking Ahead

In the next lesson, we will prepare data properly for modeling.

We will introduce pipelines to ensure:

  • transformations are fit only on training data
  • the same transformations apply to test data
  • leakage is prevented automatically
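As a preview, a minimal scikit-learn Pipeline already provides all three guarantees. The steps and data below are illustrative, not the configuration used in the next lesson:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a binary target.
X = np.random.default_rng(1).normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

pipe = Pipeline([
    ("scale", StandardScaler()),   # fit on training data only
    ("model", LogisticRegression()),
])

pipe.fit(X_train, y_train)          # scaler statistics come from X_train alone
score = pipe.score(X_test, y_test)  # the same fitted scaler is applied to X_test
print(round(score, 3))
```

Because the pipeline is a single object, there is no opportunity to accidentally fit a transformation on the test data.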