Lecture 23a: Supervised Learning Algorithms (Classic & Strong Baselines)

1) Linear & Logistic Regression

Linear Regression
Logistic Regression
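A minimal sketch of both models on assumed toy data (not from the lecture); the slope and intercept follow from the constructed targets y = 2x + 1:

```python
# Toy data (assumed for illustration): one feature, exact linear targets.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y_reg = np.array([1.0, 3.0, 5.0, 7.0])   # y = 2x + 1, so OLS recovers it exactly
y_clf = np.array([0, 0, 1, 1])           # binary labels split around x = 1.5

lin = LinearRegression().fit(X, y_reg)
log = LogisticRegression().fit(X, y_clf)

print(round(lin.coef_[0], 2), round(lin.intercept_, 2))  # slope 2.0, intercept 1.0
print(log.predict([[0.5], [2.5]]))                       # one point per side of 1.5
```

Linear regression fits a continuous target; logistic regression pushes the same linear score through a sigmoid to get class probabilities.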

2) k-Nearest Neighbors (kNN)
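A hypothetical example (toy data assumed): kNN is distance-based, so unscaled features with large ranges would dominate — hence the StandardScaler in front.

```python
# Two features on very different scales; scaling makes distances meaningful.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 100], [2, 110], [8, 900], [9, 950]], dtype=float)
y = np.array([0, 0, 1, 1])

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn.fit(X, y)
print(knn.predict([[1.5, 105], [8.5, 920]]))  # each query sits inside one group
```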

3) Naïve Bayes

4) Decision Trees & Ensembles

Decision Trees
Random Forests
Gradient Boosting (XGBoost/LightGBM/GBM)
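A sketch on synthetic data (assumed, not from the lecture) comparing a single tree against the two ensemble flavours, using scikit-learn's in-library implementations:

```python
# Compare one tree vs. bagged trees (random forest) vs. boosted trees.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for model in (DecisionTreeClassifier(random_state=0),
              RandomForestClassifier(n_estimators=200, random_state=0),
              GradientBoostingClassifier(random_state=0)):
    scores[type(model).__name__] = model.fit(X_tr, y_tr).score(X_te, y_te)
print(scores)  # ensembles typically beat the single tree
```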

5) Support Vector Machines (SVM)

6) Practical Baseline Recipe

# Classification baseline (scikit-learn)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

# num_cols / cat_cols: lists of numeric / categorical column names
baseline = Pipeline([
    ("pre", ColumnTransformer([
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ])),
    ("clf", LogisticRegression(max_iter=200, class_weight="balanced")),
])

7) When to Use What?

Hands-on Challenge: Train LR, RF, and GBM on the same dataset with a single pipeline. Use 5-fold CV; compare PR-AUC for the positive class; plot calibration curves and discuss trade-offs.
Lecture 23b: Unsupervised Learning & Representation

1) Clustering

k-Means
Hierarchical
DBSCAN / HDBSCAN
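A sketch on assumed toy blobs contrasting the two families above: k-Means needs k up front and assumes roughly convex clusters; DBSCAN discovers density-based clusters and marks outliers as noise.

```python
# Three well-separated blobs; both algorithms should handle them.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

print(len(set(km.labels_)))          # exactly 3 k-means clusters (we asked for 3)
print(len(set(db.labels_) - {-1}))   # DBSCAN cluster count (label -1 = noise)
```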

2) Mixture Models & Soft Clustering

Gaussian Mixture Models (GMM) assume the data are drawn from a mixture of Gaussians; the EM algorithm learns each component's mean and covariance and yields soft cluster-membership probabilities.
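The soft-membership point above, sketched on assumed synthetic data: `predict_proba` returns one probability per component, and each row sums to 1.

```python
# Two Gaussian clusters; GMM recovers soft memberships via EM.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),   # cluster near the origin
               rng.normal(6, 1, (100, 2))])  # cluster near (6, 6)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gmm.predict_proba(X[:1])   # soft membership for the first point
print(probs.shape)                 # (1, 2): one probability per component
```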

3) Dimensionality Reduction

PCA / SVD
t-SNE / UMAP
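A minimal PCA sketch (synthetic, nearly one-dimensional data assumed): when the data lie close to a line, the first principal component captures almost all the variance.

```python
# Second feature is ~2x the first plus small noise, so the data are nearly rank-1.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=200)])

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # first component dominates
```

t-SNE and UMAP are the nonlinear counterparts, used for visualization rather than as preprocessing.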

4) Association Rules

5) Anomaly Detection

Isolation Forest – isolates anomalies with short paths.
One-Class SVM – boundary around normal data.
Autoencoders – reconstruction error flags anomalies.
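A hypothetical example for the first item in the list above: an obvious outlier is isolated by short random-partition paths and flagged.

```python
# 200 normal points plus one far-away outlier.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), [[10.0, 10.0]]])

iso = IsolationForest(random_state=0).fit(X)
labels = iso.predict(X)   # -1 = anomaly, 1 = normal
print(labels[-1])         # the appended outlier is flagged as -1
```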

6) Representation Learning (Brief)

7) Practical Playbook

# Customer Segmentation (scikit-learn)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# num_cols / cat_cols: lists of numeric / categorical column names
segmenter = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), num_cols),
        # dense output: PCA cannot consume the default sparse matrix
        ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), cat_cols),
    ])),
    ("dim", PCA(n_components=10)),
    ("cluster", KMeans(n_clusters=5, n_init="auto")),
])

Exercise: Compare k-Means vs GMM for k = 3..8 on the same standardized data. Plot silhouette scores; pick k; interpret centroids and discuss business actions.