Appendix: Troubleshooting and Extra Notes

  • ID: MLPY-APP
  • Type: Lesson
  • Audience: Public
  • Theme: Troubleshooting and extra notes

Common Rendering Issues

NameError: variable not defined

Cause: Each Quarto chapter runs in a fresh Python session.

Solution: Ensure every chapter is self-contained:

  • load the dataset
  • split the data
  • define preprocessing
  • define the model

Do not rely on variables from previous chapters.
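As a sketch, a self-contained chapter preamble might look like the following (using scikit-learn's built-in breast-cancer data as a stand-in for the chapter's own dataset):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 1. Load the dataset (stand-in for the chapter's own data).
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# 2. Split the data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3./4. Define preprocessing and the model together in a pipeline.
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
```

Because every name the chapter needs is created here, the chapter renders identically whether or not any other chapter ran first.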


ValueError: All arrays must be of the same length

Cause: Mixing expanded one-hot feature names with original column names.

Solution:

  • Built-in tree feature importance → use the expanded one-hot feature names
  • Permutation importance → use the original X.columns

These represent different feature spaces.


ModuleNotFoundError

Cause: Virtual environment not activated.

Solution:

#| label: activate-env
source .venv/bin/activate

Verify installation:

import sklearn
import pandas

print(sklearn.__version__)
print(pandas.__version__)

Reproducibility Notes

To ensure consistent results:

  • Use fixed random_state values
  • Keep train/test splits consistent
  • Avoid modifying datasets mid-analysis
  • Do not evaluate on training data
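For example, a fixed random_state makes a split repeatable across runs (sketched with the built-in iris data):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# The same random_state yields the identical split every run,
# so results can be reproduced in a fresh session.
split_a = train_test_split(X, y, test_size=0.3, random_state=42)
split_b = train_test_split(X, y, test_size=0.3, random_state=42)
```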

For more robust validation, consider:

  • Cross-validation
  • Multiple random seeds
  • Version control for datasets
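A minimal sketch of combining the first two ideas, repeating cross-validation under several seeds to check how stable a score really is:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Repeat 5-fold cross-validation under several seeds;
# a small spread between means suggests a stable estimate.
scores = []
for seed in (0, 1, 2):
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores.append(cross_val_score(model, X, y, cv=cv).mean())

spread = max(scores) - min(scores)
```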

Model Reuse and Persistence

Applied machine learning rarely ends at evaluation.

In real settings, you often need to reuse a trained pipeline outside the notebook.

Deployment does not automatically mean cloud infrastructure.

A practical first step is being able to:

  • save a trained model or pipeline
  • reload it in a fresh session
  • run predictions reliably on new records

The format matters less than the habit.

Reproducibility and versioning are more important than the specific deployment tool.
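One common pattern, sketched here with joblib (installed alongside scikit-learn), is to dump the fitted pipeline to disk and reload it later:

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

# Save the fitted pipeline to disk ...
path = os.path.join(tempfile.gettempdir(), "model.joblib")
joblib.dump(pipe, path)

# ... and reload it, as you would in a fresh session.
loaded = joblib.load(path)
new_record = X[:1]  # stands in for a new incoming record
prediction = loaded.predict(new_record)
```

Saving the whole pipeline, not just the model, keeps preprocessing and prediction together, so new records are transformed exactly as the training data was.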


On Metrics and Interpretation

Accuracy is not always sufficient.

Always ask:

  • What error is more costly?
  • False positive or false negative?
  • Does the metric match the operational objective?

Metrics should reflect consequences.
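A small hypothetical example makes the trade-off concrete: with 1 meaning "condition present", recall tracks missed cases (false negatives) and precision tracks false alarms (false positives):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = condition present.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# If missing a case (FN) is the costly error, watch recall;
# if a false alarm (FP) is the costly error, watch precision.
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
```

Here accuracy is 5/8, yet two of three actual cases were missed; which number matters depends on the cost of each error.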


On Feature Importance

Feature importance indicates influence on prediction within this dataset.

It does not imply:

  • causation
  • mechanism
  • domain-level truth

Treat feature importance as a diagnostic tool.

Not as a claim.


Scaling Up

If you apply this workflow to real-world data, expect additional challenges:

  • Missing data
  • Severe class imbalance
  • High-dimensional features
  • Temporal leakage
  • Deployment constraints

The structured workflow still applies:

Design → Data → Model → Evaluation → Interpretation
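As one illustration, severe class imbalance can often be addressed at the model level; a sketch on synthetic data using scikit-learn's class_weight option:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data with roughly 5% positives.
X = rng.normal(size=(2000, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=2000) > 1.8).astype(int)

# Stratify so the rare class appears in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# class_weight="balanced" re-weights the rare class during fitting,
# typically trading some false positives for fewer missed positives.
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
```

The workflow is unchanged: the imbalance is handled inside the Model step, and the Evaluation step uses a metric (recall) that reflects it.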


Final Note

Machine learning is powerful when disciplined.

The workflow you learned here is more important than any single algorithm.

Structure first.
Then complexity.