Lecture 21: Foundations & Core Terminology

Machine Learning (ML) builds systems that learn patterns from data to make predictions or decisions. This lecture establishes a rigorous vocabulary and mental model you’ll use throughout the course.

Contents

  1. Problem Taxonomy
  2. Data, Features, Labels
  3. Train/Validation/Test & Leakage
  4. Loss, Risk & Objective
  5. Bias–Variance, Capacity & Regularization
  6. Assumptions & Inductive Bias
  7. End-to-End Workflow (Bird’s-eye)
  8. Terminology Glossary (Quick Reference)

1) Problem Taxonomy

Supervised Learning

Learn a mapping from inputs to known labels: regression predicts continuous targets, classification predicts discrete ones.

Example: Predict HbA1c from lifestyle + labs (regression).
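A minimal sketch of supervised regression, using synthetic stand-in data (the feature names and coefficients are illustrative, not from a real HbA1c dataset): fit ordinary least squares and recover the generating weights.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                 # two features, e.g. activity + diet score
true_w = np.array([1.5, -2.0])
y = X @ true_w + 0.5 + 0.1 * rng.normal(size=100)   # noisy linear target with bias 0.5

Xb = np.hstack([X, np.ones((100, 1))])        # append a bias column
w = np.linalg.lstsq(Xb, y, rcond=None)[0]     # least-squares fit
print(w)                                      # close to [1.5, -2.0, 0.5]
```

Because the noise is small relative to the signal, the fitted weights land near the true ones — the essence of "learning from labeled examples."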

Unsupervised / Self-supervised

Find structure in unlabeled data (clustering, dimensionality reduction), or construct supervision signals from the data itself.

Example: Group food logs into diet patterns (clustering).
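The clustering example can be sketched with a hand-rolled k-means (Lloyd's algorithm) on synthetic two-dimensional "diet pattern" data — the two blobs stand in for distinct eating patterns:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two well-separated synthetic clusters in a 2-D feature space
a = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))
b = rng.normal(loc=[3, 3], scale=0.3, size=(50, 2))
X = np.vstack([a, b])

# Minimal k-means with k = 2: alternate assignment and centroid updates
centers = X[rng.choice(len(X), 2, replace=False)]
for _ in range(20):
    dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
    labels = dists.argmin(axis=1)                 # assign each point to nearest center
    centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])
```

With this separation the algorithm recovers the two groups without ever seeing a label.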

Other Learning Settings

Semi-supervised learning (few labels, many unlabeled examples), reinforcement learning (learning from reward signals via interaction), and online learning (data arrives as a stream).

2) Data, Features, Labels

Core Units

Example/instance – one row of data. Feature – an input attribute (one column). Label/target – the value to predict.

Feature Types

Numerical (continuous or counts), categorical (nominal or ordinal), text, timestamps.

Encoding: one-hot, ordinal/label, target encoding (with care — target encoding fit on the full dataset leaks label information).
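One-hot encoding is mechanical enough to write by hand; a small sketch with hypothetical diet categories:

```python
import numpy as np

diets = np.array(["keto", "vegan", "keto", "mediterranean"])
categories = np.unique(diets)                     # sorted vocabulary of categories
one_hot = (diets[:, None] == categories).astype(int)  # one indicator column per category

print(categories)   # ['keto' 'mediterranean' 'vegan']
print(one_hot)      # each row has exactly one 1
```

Each row is an indicator vector; in practice a library encoder is preferable because it remembers the category vocabulary fitted on the training split.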

3) Train/Validation/Test & Leakage

Train fits model parameters; validation guides model selection and hyperparameter tuning; test is used once, for the final unbiased performance estimate.

Data Leakage: information from validation/test reaches training (e.g., scaling statistics computed on the full dataset, or future timestamps used to predict the past). Always fit preprocessors on the training split only.
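A leakage-free standardization sketch: the mean and standard deviation are computed on the training split only, then reused unchanged on the test split.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_train, X_test = X[:80], X[80:]

# Correct: statistics come from the training split only ...
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_train_s = (X_train - mu) / sigma
X_test_s = (X_test - mu) / sigma     # ... and are reused on test, never refit
```

Computing `mu` and `sigma` on the full `X` instead would let test-set statistics influence training — exactly the leakage described above.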

4) Loss, Risk & Objective

| Task           | Common Loss      | Intuition                                              |
|----------------|------------------|--------------------------------------------------------|
| Regression     | MSE / MAE        | Penalize distance between prediction and truth.        |
| Classification | Log Loss / Hinge | Encourage confident, correct probabilities/margins.    |
Empirical Risk Minimization (ERM): minimize average loss on training data. Add regularization to control complexity.
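The losses above and the regularized ERM objective can be sketched directly; the function names here are illustrative:

```python
import numpy as np

def mse(y, yhat):
    # Mean squared error: average squared distance to the truth
    return np.mean((y - yhat) ** 2)

def log_loss(y, p, eps=1e-12):
    # Binary cross-entropy; clip probabilities away from 0/1 for stability
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def objective(w, X, y, lam):
    # ERM with an L2 penalty: empirical risk + lam * ||w||^2
    return mse(y, X @ w) + lam * np.sum(w ** 2)
```

A perfect prediction gives zero MSE, and a maximally uncertain classifier (p = 0.5 everywhere) gives log loss of ln 2 ≈ 0.693 — a useful baseline to compare against.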

5) Bias–Variance, Capacity & Regularization

High-capacity models fit training data closely (low bias) but vary more across resamples (high variance); regularization constrains capacity, trading a small increase in bias for a larger reduction in variance.

Example: A linear model with an L2 penalty shrinks weight magnitudes → smoother fit.
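The shrinkage effect can be seen directly from the closed-form ridge solution w = (XᵀX + λI)⁻¹Xᵀy on synthetic data; λ = 0 recovers ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=30)

def ridge(X, y, lam):
    # Closed-form ridge regression: (X^T X + lam * I)^-1 X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_ols = ridge(X, y, 0.0)    # unregularized fit
w_reg = ridge(X, y, 10.0)   # L2-penalized fit
print(np.linalg.norm(w_ols), np.linalg.norm(w_reg))  # the penalized norm is smaller
```

Increasing λ monotonically shrinks the weight norm — the "smoother fit" the example describes.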

6) Assumptions & Inductive Bias

Inductive bias is the set of assumptions a learner uses to generalize beyond its training data — e.g., linearity in linear models, smoothness in kernel methods, translation invariance in CNNs. Without some inductive bias, no generalization is possible.

7) End-to-End Workflow (Bird’s-eye)

  1. Define objective & success metrics.
  2. Acquire & audit data (schema, quality, bias).
  3. Split → preprocess → feature engineer.
  4. Train baseline → iterate with CV and tuning.
  5. Evaluate → explain → stress test for robustness.
  6. Deploy with monitoring (drift, performance, fairness).

8) Terminology Glossary (Quick Reference)

Confusion Matrix – counts of TP (true positives), FP (false positives), TN (true negatives), FN (false negatives)
Precision = TP/(TP+FP)
Recall (TPR) = TP/(TP+FN)
Specificity (TNR) = TN/(TN+FP)
F1 = 2·Prec·Rec/(Prec+Rec)
ROC-AUC / PR-AUC – threshold-independent summaries
Calibration – predicted probs ≈ empirical freq
RMSE/MAE/R² – regression metrics
Class Imbalance – skewed label frequencies
Sampling – stratified, SMOTE, undersample
Feature Scaling – standardize, min-max, robust
Pipelines – chain transforms + model safely
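A worked example of the glossary formulas on hypothetical confusion-matrix counts:

```python
# Hypothetical counts from a binary classifier's confusion matrix
tp, fp, tn, fn = 40, 10, 45, 5

precision = tp / (tp + fp)                          # 40/50 = 0.8
recall = tp / (tp + fn)                             # 40/45 ≈ 0.889
specificity = tn / (tn + fp)                        # 45/55 ≈ 0.818
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean ≈ 0.842

print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Note that F1, as a harmonic mean, sits closer to the smaller of precision and recall — it punishes imbalance between the two.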
Mini Project Idea: Build a baseline classifier for diabetes risk using demographic + lifestyle data. Start with logistic regression, using log loss as the training objective and ROC-AUC as the model-selection metric.
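A minimal sketch of that baseline, assuming synthetic stand-in features (a real project would load the demographic + lifestyle data here): logistic regression trained by gradient descent on the log loss.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))                  # stand-in features
true_w = np.array([2.0, -1.0, 0.5])
y = (X @ true_w > 0).astype(float)             # stand-in binary risk labels

w = np.zeros(3)
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w)))             # predicted probabilities
    grad = X.T @ (p - y) / len(y)              # gradient of the average log loss
    w -= 0.5 * grad                            # gradient-descent step

p = 1 / (1 + np.exp(-(X @ w)))
acc = ((p > 0.5).astype(float) == y).mean()    # training accuracy
```

This omits the splitting, tuning, and ROC-AUC selection from the workflow above; in practice a library implementation with cross-validation would replace the hand-rolled loop.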