Preface and Setup

  • ID: MLPY-F-L01
  • Type: Lesson
  • Audience: Public
  • Theme: Workflow-first machine learning

Why This Guide Exists

Machine learning is often presented as a collection of algorithms.

Linear regression.
Logistic regression.
Decision trees.
Neural networks.

But applied machine learning is not a menu of techniques.

It is a structured workflow:

Design → Data → Model → Evaluation → Interpretation

In production settings, this workflow extends further to deployment and monitoring. In this free track, we focus on building the foundation correctly.

Machine learning is a system, not a model.


What Machine Learning Means in Practice

Machine learning is the practice of building systems that learn patterns from data to make predictions or decisions on unseen inputs.

In applied settings, success depends less on algorithm choice and more on:

  • framing the right problem,
  • establishing strong baselines,
  • validating results properly,
  • preventing data leakage,
  • interpreting outputs responsibly.

Unlike in traditional programming, where the rules are written by hand, in machine learning the rules are learned from the data.
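The contrast can be made concrete with a small sketch. The churn rule, threshold, and six-row dataset below are invented for illustration; they are not from this guide's dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Traditional programming: the rule is written by hand.
def churn_rule(monthly_logins: int) -> bool:
    return monthly_logins < 3  # hard-coded threshold chosen by a developer

# Machine learning: the rule is learned from labeled examples.
X = np.array([[1], [2], [4], [8], [10], [0]])  # monthly logins
y = np.array([1, 1, 0, 0, 0, 1])               # 1 = churned

model = LogisticRegression().fit(X, y)

print(churn_rule(2))          # decision from the hand-written rule
print(model.predict([[2]]))   # decision from the learned rule
```

Both produce a decision, but only the second one changes when the data changes.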


Where Machine Learning Is Used

Machine learning supports decision-making across domains:

  • Healthcare: risk scoring and early detection
  • Finance: fraud detection and credit risk
  • Marketing: churn prediction and segmentation
  • Operations: demand forecasting and optimization
  • Research: predictive modeling and pattern discovery

The domain changes.

The workflow remains.


Scope of This Free Track

This guide focuses on supervised learning using structured tabular data.

You will learn how to:

  • Frame regression and classification problems
  • Split data correctly
  • Prevent leakage using pipelines
  • Train baseline models in scikit-learn
  • Evaluate models with appropriate metrics
  • Diagnose overfitting
  • Interpret feature importance cautiously

This is a complete and scalable foundation.

Advanced topics such as hyperparameter tuning, ensemble methods, deployment, and monitoring build on this structure but are beyond the scope of this free track.
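The skills listed above fit together in a single pattern, sketched here on synthetic data (the dataset and scores are illustrative stand-ins, not the guide's churn dataset):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Data: a stand-in for a real tabular dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Split before any preprocessing, so the test set stays unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Baseline first: any real model must beat this.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Model inside a pipeline: scaling is fit on training data only.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

print("baseline:", accuracy_score(y_test, baseline.predict(X_test)))
print("model:   ", accuracy_score(y_test, model.predict(X_test)))
```

Each later lesson expands one step of this pattern; the overall shape does not change.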


Prerequisites

You should be comfortable with:

  • Basic Python syntax
  • Working with pandas DataFrames
  • Running commands in a terminal

Basic familiarity with statistics is helpful but not required.


Environment Setup

This project uses a virtual environment to isolate dependencies.

Create the Environment

#| label: create-environment
bash scripts/setup-env.sh
source .venv/bin/activate

This installs the core dependencies:

  • pandas
  • numpy
  • matplotlib
  • scikit-learn

Generate the Synthetic Dataset

#| label: generate-dataset
python scripts/make-cdi-customer-churn.py

The dataset will be saved to:

data/ml-ready/cdi-customer-churn.csv

This dataset will be used consistently throughout the guide.
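A quick sketch of loading the generated file with pandas; the existence check makes it safe to run even before the generator script has been executed (the column layout of the file is not assumed here):

```python
from pathlib import Path

import pandas as pd

DATA_PATH = Path("data/ml-ready/cdi-customer-churn.csv")

if DATA_PATH.exists():
    df = pd.read_csv(DATA_PATH)
    print(df.shape)  # (rows, columns) of the generated dataset
else:
    print("Dataset not found - run the generator script first.")
```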


Verify Installation

#| label: verify-installation
import sklearn
import pandas as pd
import numpy as np

sklearn.__version__
'1.8.0'

If a version number prints (the exact number may differ), the environment is correctly configured.


Rendering the Book

To render the Quarto book locally:

#| label: render-book
bash scripts/build-all.sh

The rendered site will appear in:

docs/

Navigation is handled automatically by Quarto.


How to Approach This Guide

Do not rush into modeling.

Most machine learning errors originate in:

  • unclear problem framing
  • improper data splitting
  • hidden leakage
  • incorrect metric selection
  • overinterpretation of results
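Hidden leakage in particular is worth seeing once before modeling begins. The sketch below is illustrative: it uses feature selection on pure-noise data because that makes the effect dramatic, not because it is the only way leakage occurs.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))  # pure noise features
y = rng.integers(0, 2, size=100)  # random labels: true accuracy is ~0.5

# Leaky: selecting features on ALL the data lets the test labels
# influence which features survive, inflating cross-validation scores.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(), X_leaky, y, cv=5).mean()

# Correct: selection lives inside the pipeline, so it is refit on
# each training fold and never sees the held-out labels.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression())
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy:  {leaky:.2f}")   # optimistically inflated
print(f"honest CV accuracy: {honest:.2f}")  # near chance
```

The leaky estimate looks good on data that contains no signal at all; the honest estimate correctly reports chance-level performance. This is why the guide keeps all preprocessing inside pipelines.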

The next lesson begins with the most important skill:

Framing the prediction problem correctly.