Preface and Setup

  • ID: MLPY-F-L01
  • Type: Lesson
  • Audience: Public
  • Theme: Workflow-first machine learning

Why This Guide Exists

Machine learning is often presented as a collection of algorithms.

Linear regression.
Logistic regression.
Decision trees.
Neural networks.

But applied machine learning is not a menu of techniques.

It is a structured workflow:

Design → Data → Model → Evaluation → Interpretation

In production settings, this workflow extends further to deployment and monitoring. In this free track, we focus on building the foundation correctly.

Machine learning is a system, not a model.


What Machine Learning Means in Practice

Machine learning is the practice of building systems that learn patterns from data to make predictions or decisions on unseen inputs.

In applied settings, success depends less on algorithm choice and more on:

  • framing the right problem,
  • establishing strong baselines,
  • validating results properly,
  • preventing data leakage,
  • interpreting outputs responsibly.

Unlike in traditional programming, where the rules are written by hand, in machine learning the rules are learned from the data.
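The contrast can be made concrete with a small sketch. The churn rule, threshold, and six-row dataset below are invented for illustration; they are not from this guide's dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Traditional programming: the rule is written by hand.
def churn_rule(monthly_logins: int) -> bool:
    return monthly_logins < 3  # hard-coded threshold chosen by a developer

# Machine learning: the rule is learned from labeled examples.
X = np.array([[1], [2], [4], [8], [10], [0]])  # monthly logins
y = np.array([1, 1, 0, 0, 0, 1])               # 1 = churned

model = LogisticRegression().fit(X, y)

print(churn_rule(2))          # decision from the hand-written rule
print(model.predict([[2]]))   # decision from the learned rule
```

Both produce a decision, but only the second one changes when the data changes.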


Where Machine Learning Is Used

Machine learning supports decision-making across domains:

  • Healthcare: risk scoring and early detection
  • Finance: fraud detection and credit risk
  • Marketing: churn prediction and segmentation
  • Operations: demand forecasting and optimization
  • Research: predictive modeling and pattern discovery

The domain changes.

The workflow remains.


Scope of This Free Track

This guide focuses on supervised learning using structured tabular data.

You will learn how to:

  • Frame regression and classification problems
  • Split data correctly
  • Prevent leakage using pipelines
  • Train baseline models in scikit-learn
  • Evaluate models with appropriate metrics
  • Diagnose overfitting
  • Interpret feature importance cautiously

This is a complete and scalable foundation.

Advanced topics such as hyperparameter tuning, ensemble methods, deployment, and monitoring build on this structure but are beyond the scope of this free track.
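The skills listed above fit together in a single pattern, sketched here on synthetic data (the dataset and scores are illustrative stand-ins, not the guide's churn dataset):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Data: a stand-in for a real tabular dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Split before any preprocessing, so the test set stays unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Baseline first: any real model must beat this.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Model inside a pipeline: scaling is fit on training data only.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

print("baseline:", accuracy_score(y_test, baseline.predict(X_test)))
print("model:   ", accuracy_score(y_test, model.predict(X_test)))
```

Each later lesson expands one step of this pattern; the overall shape does not change.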


Prerequisites

You should be comfortable with:

  • Basic Python syntax
  • Working with pandas DataFrames
  • Running commands in a terminal

Basic familiarity with statistics is helpful but not required.


Environment Setup

This project uses a virtual environment to isolate dependencies.

Create the Environment

#| label: create-environment
bash scripts/setup-env.sh
source .venv/bin/activate

This installs the core dependencies:

  • pandas
  • numpy
  • matplotlib
  • scikit-learn

Generate the Synthetic Dataset

#| label: generate-dataset
python scripts/make-cdi-customer-churn.py

The dataset will be saved to:

data/ml-ready/cdi-customer-churn.csv

This dataset will be used consistently throughout the guide.
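A quick sketch of loading the generated file with pandas; the existence check makes it safe to run even before the generator script has been executed (the column layout of the file is not assumed here):

```python
from pathlib import Path

import pandas as pd

DATA_PATH = Path("data/ml-ready/cdi-customer-churn.csv")

if DATA_PATH.exists():
    df = pd.read_csv(DATA_PATH)
    print(df.shape)  # (rows, columns) of the generated dataset
else:
    print("Dataset not found - run the generator script first.")
```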


Verify Installation

#| label: verify-installation
import sklearn
import pandas as pd
import numpy as np

sklearn.__version__
'1.8.0'

If a version number prints (the exact number may differ), the environment is correctly configured.


Rendering the Book

To render the Quarto book locally:

#| label: render-book
bash scripts/build-all.sh

The rendered site will appear in:

docs/

Navigation is handled automatically by Quarto.


How to Approach This Guide

Do not rush into modeling.

Most machine learning errors originate in:

  • unclear problem framing
  • improper data splitting
  • hidden leakage
  • incorrect metric selection
  • overinterpretation of results
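Hidden leakage in particular is worth seeing once before modeling begins. The sketch below is illustrative: it uses feature selection on pure-noise data because that makes the effect dramatic, not because it is the only way leakage occurs.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))  # pure noise features
y = rng.integers(0, 2, size=100)  # random labels: true accuracy is ~0.5

# Leaky: selecting features on ALL the data lets the test labels
# influence which features survive, inflating cross-validation scores.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(), X_leaky, y, cv=5).mean()

# Correct: selection lives inside the pipeline, so it is refit on
# each training fold and never sees the held-out labels.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression())
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy:  {leaky:.2f}")   # optimistically inflated
print(f"honest CV accuracy: {honest:.2f}")  # near chance
```

The leaky estimate looks good on data that contains no signal at all; the honest estimate correctly reports chance-level performance. This is why the guide keeps all preprocessing inside pipelines.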

The next lesson begins with the most important skill:

Framing the prediction problem correctly.