Appendix: Troubleshooting and Extra Notes
Common Rendering Issues
NameError: variable not defined
Cause: Each Quarto chapter runs in a fresh Python session.
Solution: Ensure every chapter is self-contained:
- load the dataset
- split the data
- define preprocessing
- define the model
Do not rely on variables from previous chapters.
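The four steps above can be sketched as a chapter opening that runs on its own. This is a minimal illustration, not the book's actual setup: the toy DataFrame and its column names (`age`, `plan`, `churned`) are hypothetical stand-ins for the real dataset, and the choice of `LogisticRegression` is an assumption.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# 1. Load the dataset (hypothetical toy data; a real chapter would read a file here).
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 52, 47, 33, 38, 26, 59, 44, 31],
    "plan": ["basic", "pro", "basic", "pro", "basic", "pro",
             "basic", "pro", "basic", "pro", "basic", "pro"],
    "churned": [0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1],
})
X = df.drop(columns="churned")
y = df["churned"]

# 2. Split the data with a fixed seed so the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# 3. Define preprocessing.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])

# 4. Define the model as one pipeline and fit it on the training split only.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
```

Because everything the chapter needs is created in this block, rendering the chapter in a fresh session cannot fail on a name defined elsewhere.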
ValueError: All arrays must be of the same length
Cause: Mixing expanded one-hot feature names with original column names.
Solution:
- Built-in tree feature importance → use expanded feature names
- Permutation importance → use original X.columns
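These are different feature spaces: the fitted tree sees the one-hot-encoded columns, while permutation importance computed on the full pipeline shuffles the original columns, so the two importance arrays have different lengths.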
ModuleNotFoundError
Cause: Virtual environment not activated.
Solution:
#| label: activate-env
source .venv/bin/activate
Verify installation:
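One way to run this check is to import the packages from the activated environment and print their versions. This sketch assumes `sklearn` and `pandas` are the packages the chapters need; extend the tuple as required.

```python
import importlib
import sys

# Shows which interpreter is running; it should point inside .venv when the environment is active.
print(sys.executable)

# Import each package and report its version; a missing package raises ModuleNotFoundError here.
versions = {}
for name in ("sklearn", "pandas"):
    module = importlib.import_module(name)
    versions[name] = module.__version__
    print(f"{name} {versions[name]}")
```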
Reproducibility Notes
To ensure consistent results:
- Use fixed random_state values
- Keep train/test splits consistent
- Avoid modifying datasets mid-analysis
- Do not evaluate on training data
For more robust validation, consider:
- Cross-validation
- Multiple random seeds
- Version control for datasets
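Cross-validation and multiple seeds can be combined in a few lines. This is a sketch under assumptions: synthetic data from `make_classification` stands in for the book's dataset, and the seed list and the `LogisticRegression` model are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data stands in for the real dataset (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

seeds = [0, 1, 2]  # hypothetical choice of seeds
mean_scores = []
for seed in seeds:
    # Each seed produces a different, but repeatable, 5-fold partition.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
    mean_scores.append(scores.mean())

print(f"accuracy: {np.mean(mean_scores):.3f} ± {np.std(mean_scores):.3f} across seeds")
```

A small spread across seeds suggests the result does not hinge on one lucky split; a large spread is a signal to collect more data or report the variation explicitly.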
Model Reuse and Persistence
Applied machine learning rarely ends at evaluation.
In real settings, you often need to reuse a trained pipeline outside the notebook.
Deployment does not automatically mean cloud infrastructure.
A practical first step is being able to:
- save a trained model or pipeline
- reload it in a fresh session
- run predictions reliably on new records
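The three habits above can be sketched with `joblib`, which is installed alongside scikit-learn. The training data, column names, and file name are hypothetical, and the temporary directory is used only so the example cleans up after itself.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data.
X_train = pd.DataFrame({"age": [23, 35, 41, 29, 52, 47], "visits": [1, 4, 2, 5, 1, 6]})
y_train = [0, 1, 0, 1, 0, 1]

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipeline.fit(X_train, y_train)

# Save the fitted pipeline (preprocessing and model together).
model_path = Path(tempfile.mkdtemp()) / "pipeline.joblib"
joblib.dump(pipeline, model_path)

# In a fresh session: reload and predict on new records with the same columns.
reloaded = joblib.load(model_path)
new_records = pd.DataFrame({"age": [30, 50], "visits": [5, 1]})
print(reloaded.predict(new_records))
```

Saving the whole pipeline rather than the bare model keeps preprocessing and prediction in step. Reload with the same library versions used for training, and only load files from sources you trust, since joblib files are pickle-based.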
The format matters less than the habit.
Reproducibility and versioning are more important than the specific deployment tool.
On Metrics and Interpretation
Accuracy alone is not always sufficient, particularly when classes are imbalanced or errors carry unequal costs.
Always ask:
- Which error is more costly: a false positive or a false negative?
- Does the metric match the operational objective?
Metrics should reflect consequences.
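A small sketch makes the point concrete. The labels, predictions, and cost values below are hypothetical; the 10-to-1 cost ratio is an assumption chosen only to show how a cost-weighted view can differ from accuracy.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions (1 = positive class).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Hypothetical costs: a missed positive is assumed to cost 10x a false alarm.
COST_FP, COST_FN = 1, 10
total_cost = fp * COST_FP + fn * COST_FN

print(f"accuracy  = {(tp + tn) / len(y_true):.2f}")
print(f"precision = {precision_score(y_true, y_pred):.2f}")
print(f"recall    = {recall_score(y_true, y_pred):.2f}")
print(f"FP={fp}, FN={fn}, total cost={total_cost}")
```

Here accuracy is 0.70, yet half of the positives are missed. If false negatives are the expensive error, recall or an explicit cost total describes the model better than accuracy does.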
On Feature Importance
Feature importance indicates how strongly a feature influences the model's predictions on this dataset.
It does not imply:
- causation
- mechanism
- domain-level truth
Treat feature importance as a diagnostic tool, not as a claim.
Scaling Up
If you apply this workflow to real-world data, expect additional challenges:
- Missing data
- Severe class imbalance
- High-dimensional features
- Temporal leakage
- Deployment constraints
The structured workflow still applies:
Design → Data → Model → Evaluation → Interpretation
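As one example from the list above, temporal leakage can often be avoided at the Data step by splitting in time order instead of at random. The sketch below uses scikit-learn's `TimeSeriesSplit` on hypothetical records that are assumed to be sorted oldest first.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: 12 records already sorted by time (oldest first).
X = np.arange(12).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=3)
folds = list(splitter.split(X))
for fold, (train_idx, test_idx) in enumerate(folds):
    # Every training index precedes every test index, so the model never trains on the future.
    assert train_idx.max() < test_idx.min()
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

The rest of the workflow is unchanged; only the splitting rule differs from a shuffled train/test split.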
Final Note
Machine learning is powerful when disciplined.
The workflow you learned here is more important than any single algorithm.
Structure first.
Then complexity.