Lecture 21: Foundations & Core Terminology

Machine Learning (ML) builds systems that learn patterns from data to make predictions or decisions. This lecture establishes a rigorous vocabulary and mental model you’ll use throughout the course.

Contents

  1. Problem Taxonomy
  2. Data, Features, Labels
  3. Train/Validation/Test & Leakage
  4. Loss, Risk & Objective
  5. Bias–Variance, Capacity & Regularization
  6. Assumptions & Inductive Bias
  7. End-to-End Workflow (Bird’s-eye)
  8. Terminology Glossary (Quick Reference)

1) Problem Taxonomy

Supervised Learning

Learn a mapping from inputs to known labels: regression predicts continuous targets, classification predicts discrete ones.

Example: Predict HbA1c from lifestyle + labs (regression).
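A minimal sketch of supervised regression, using synthetic stand-in data (the feature names and coefficients are illustrative, not from a real HbA1c dataset): fit ordinary least squares and recover the generating weights.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                 # two features, e.g. activity + diet score
true_w = np.array([1.5, -2.0])
y = X @ true_w + 0.5 + 0.1 * rng.normal(size=100)   # noisy linear target with bias 0.5

Xb = np.hstack([X, np.ones((100, 1))])        # append a bias column
w = np.linalg.lstsq(Xb, y, rcond=None)[0]     # least-squares fit
print(w)                                      # close to [1.5, -2.0, 0.5]
```

Because the noise is small relative to the signal, the fitted weights land near the true ones — the essence of "learning from labeled examples."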

Unsupervised / Self-supervised

Find structure in unlabeled data (clustering, dimensionality reduction), or construct supervision signals from the data itself.

Example: Group food logs into diet patterns (clustering).
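The clustering example can be sketched with a hand-rolled k-means (Lloyd's algorithm) on synthetic two-dimensional "diet pattern" data — the two blobs stand in for distinct eating patterns:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two well-separated synthetic clusters in a 2-D feature space
a = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))
b = rng.normal(loc=[3, 3], scale=0.3, size=(50, 2))
X = np.vstack([a, b])

# Minimal k-means with k = 2: alternate assignment and centroid updates
centers = X[rng.choice(len(X), 2, replace=False)]
for _ in range(20):
    dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
    labels = dists.argmin(axis=1)                 # assign each point to nearest center
    centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])
```

With this separation the algorithm recovers the two groups without ever seeing a label.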

Other Learning Settings

Semi-supervised learning (few labels, many unlabeled examples), reinforcement learning (learning from reward signals via interaction), and online learning (data arrives as a stream).

2) Data, Features, Labels

Core Units

Example/instance – one row of data. Feature – an input attribute (one column). Label/target – the value to predict.

Feature Types

Numerical (continuous or counts), categorical (nominal or ordinal), text, timestamps.

Encoding: one-hot, ordinal/label, target encoding (with care — target encoding fit on the full dataset leaks label information).
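One-hot encoding is mechanical enough to write by hand; a small sketch with hypothetical diet categories:

```python
import numpy as np

diets = np.array(["keto", "vegan", "keto", "mediterranean"])
categories = np.unique(diets)                     # sorted vocabulary of categories
one_hot = (diets[:, None] == categories).astype(int)  # one indicator column per category

print(categories)   # ['keto' 'mediterranean' 'vegan']
print(one_hot)      # each row has exactly one 1
```

Each row is an indicator vector; in practice a library encoder is preferable because it remembers the category vocabulary fitted on the training split.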

3) Train/Validation/Test & Leakage

Train fits model parameters; validation guides model selection and hyperparameter tuning; test is used once, for the final unbiased performance estimate.

Data Leakage: information from validation/test reaches training (e.g., scaling statistics computed on the full dataset, or future timestamps used to predict the past). Always fit preprocessors on the training split only.
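A leakage-free standardization sketch: the mean and standard deviation are computed on the training split only, then reused unchanged on the test split.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_train, X_test = X[:80], X[80:]

# Correct: statistics come from the training split only ...
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_train_s = (X_train - mu) / sigma
X_test_s = (X_test - mu) / sigma     # ... and are reused on test, never refit
```

Computing `mu` and `sigma` on the full `X` instead would let test-set statistics influence training — exactly the leakage described above.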

4) Loss, Risk & Objective

| Task           | Common Loss      | Intuition                                              |
|----------------|------------------|--------------------------------------------------------|
| Regression     | MSE / MAE        | Penalize distance between prediction and truth.        |
| Classification | Log Loss / Hinge | Encourage confident, correct probabilities/margins.    |
Empirical Risk Minimization (ERM): minimize average loss on training data. Add regularization to control complexity.
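The losses above and the regularized ERM objective can be sketched directly; the function names here are illustrative:

```python
import numpy as np

def mse(y, yhat):
    # Mean squared error: average squared distance to the truth
    return np.mean((y - yhat) ** 2)

def log_loss(y, p, eps=1e-12):
    # Binary cross-entropy; clip probabilities away from 0/1 for stability
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def objective(w, X, y, lam):
    # ERM with an L2 penalty: empirical risk + lam * ||w||^2
    return mse(y, X @ w) + lam * np.sum(w ** 2)
```

A perfect prediction gives zero MSE, and a maximally uncertain classifier (p = 0.5 everywhere) gives log loss of ln 2 ≈ 0.693 — a useful baseline to compare against.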

5) Bias–Variance, Capacity & Regularization

High-capacity models fit training data closely (low bias) but vary more across resamples (high variance); regularization constrains capacity, trading a small increase in bias for a larger reduction in variance.

Example: A linear model with an L2 penalty shrinks weight magnitudes → smoother fit.
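The shrinkage effect can be seen directly from the closed-form ridge solution w = (XᵀX + λI)⁻¹Xᵀy on synthetic data; λ = 0 recovers ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=30)

def ridge(X, y, lam):
    # Closed-form ridge regression: (X^T X + lam * I)^-1 X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_ols = ridge(X, y, 0.0)    # unregularized fit
w_reg = ridge(X, y, 10.0)   # L2-penalized fit
print(np.linalg.norm(w_ols), np.linalg.norm(w_reg))  # the penalized norm is smaller
```

Increasing λ monotonically shrinks the weight norm — the "smoother fit" the example describes.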

6) Assumptions & Inductive Bias

Inductive bias is the set of assumptions a learner uses to generalize beyond its training data — e.g., linearity in linear models, smoothness in kernel methods, translation invariance in CNNs. Without some inductive bias, no generalization is possible.

7) End-to-End Workflow (Bird’s-eye)

  1. Define objective & success metrics.
  2. Acquire & audit data (schema, quality, bias).
  3. Split → preprocess → feature engineer.
  4. Train baseline → iterate with CV and tuning.
  5. Evaluate → explain → stress test for robustness.
  6. Deploy with monitoring (drift, performance, fairness).

8) Terminology Glossary (Quick Reference)

Confusion Matrix – counts of TP (true positives), FP (false positives), TN (true negatives), FN (false negatives)
Precision = TP/(TP+FP)
Recall (TPR) = TP/(TP+FN)
Specificity (TNR) = TN/(TN+FP)
F1 = 2·Prec·Rec/(Prec+Rec)
ROC-AUC / PR-AUC – threshold-independent summaries
Calibration – predicted probs ≈ empirical freq
RMSE/MAE/R² – regression metrics
Class Imbalance – skewed label frequencies
Sampling – stratified, SMOTE, undersample
Feature Scaling – standardize, min-max, robust
Pipelines – chain transforms + model safely
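A worked example of the glossary formulas on hypothetical confusion-matrix counts:

```python
# Hypothetical counts from a binary classifier's confusion matrix
tp, fp, tn, fn = 40, 10, 45, 5

precision = tp / (tp + fp)                          # 40/50 = 0.8
recall = tp / (tp + fn)                             # 40/45 ≈ 0.889
specificity = tn / (tn + fp)                        # 45/55 ≈ 0.818
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean ≈ 0.842

print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Note that F1, as a harmonic mean, sits closer to the smaller of precision and recall — it punishes imbalance between the two.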
Mini Project Idea: Build a baseline classifier for diabetes risk using demographic + lifestyle data. Start with logistic regression, using log loss as the training objective and ROC-AUC as the model-selection metric.
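A minimal sketch of that baseline, assuming synthetic stand-in features (a real project would load the demographic + lifestyle data here): logistic regression trained by gradient descent on the log loss.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))                  # stand-in features
true_w = np.array([2.0, -1.0, 0.5])
y = (X @ true_w > 0).astype(float)             # stand-in binary risk labels

w = np.zeros(3)
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w)))             # predicted probabilities
    grad = X.T @ (p - y) / len(y)              # gradient of the average log loss
    w -= 0.5 * grad                            # gradient-descent step

p = 1 / (1 + np.exp(-(X @ w)))
acc = ((p > 0.5).astype(float) == y).mean()    # training accuracy
```

This omits the splitting, tuning, and ROC-AUC selection from the workflow above; in practice a library implementation with cross-validation would replace the hand-rolled loop.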