Lecture 23a: Supervised Learning Algorithms (Classic & Strong Baselines)

1) Linear & Logistic Regression

Linear Regression
Logistic Regression
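A minimal sketch of both models on assumed toy data (not from the lecture); the slope and intercept follow from the constructed targets y = 2x + 1:

```python
# Toy data (assumed for illustration): one feature, exact linear targets.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y_reg = np.array([1.0, 3.0, 5.0, 7.0])   # y = 2x + 1, so OLS recovers it exactly
y_clf = np.array([0, 0, 1, 1])           # binary labels split around x = 1.5

lin = LinearRegression().fit(X, y_reg)
log = LogisticRegression().fit(X, y_clf)

print(round(lin.coef_[0], 2), round(lin.intercept_, 2))  # slope 2.0, intercept 1.0
print(log.predict([[0.5], [2.5]]))                       # one point per side of 1.5
```

Linear regression fits a continuous target; logistic regression pushes the same linear score through a sigmoid to get class probabilities.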

2) k-Nearest Neighbors (kNN)
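A hypothetical example (toy data assumed): kNN is distance-based, so unscaled features with large ranges would dominate — hence the StandardScaler in front.

```python
# Two features on very different scales; scaling makes distances meaningful.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 100], [2, 110], [8, 900], [9, 950]], dtype=float)
y = np.array([0, 0, 1, 1])

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn.fit(X, y)
print(knn.predict([[1.5, 105], [8.5, 920]]))  # each query sits inside one group
```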

3) Naïve Bayes

4) Decision Trees & Ensembles

Decision Trees
Random Forests
Gradient Boosting (XGBoost/LightGBM/GBM)
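A sketch on synthetic data (assumed, not from the lecture) comparing a single tree against the two ensemble flavours, using scikit-learn's in-library implementations:

```python
# Compare one tree vs. bagged trees (random forest) vs. boosted trees.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for model in (DecisionTreeClassifier(random_state=0),
              RandomForestClassifier(n_estimators=200, random_state=0),
              GradientBoostingClassifier(random_state=0)):
    scores[type(model).__name__] = model.fit(X_tr, y_tr).score(X_te, y_te)
print(scores)  # ensembles typically beat the single tree
```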

5) Support Vector Machines (SVM)

6) Practical Baseline Recipe

# Classification baseline (scikit-learn)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

# num_cols / cat_cols: lists of numeric / categorical column names
baseline = Pipeline([
    ("pre", ColumnTransformer([
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ])),
    ("clf", LogisticRegression(max_iter=200, class_weight="balanced")),
])

7) When to Use What?

Hands-on Challenge: Train LR, RF, and GBM on the same dataset with a single pipeline. Use 5-fold CV; compare PR-AUC for the positive class; plot calibration curves and discuss trade-offs.
Lecture 23b: Unsupervised Learning & Representation

1) Clustering

k-Means
Hierarchical
DBSCAN / HDBSCAN
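A sketch on assumed toy blobs contrasting the two families above: k-Means needs k up front and assumes roughly convex clusters; DBSCAN discovers density-based clusters and marks outliers as noise.

```python
# Three well-separated blobs; both algorithms should handle them.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

print(len(set(km.labels_)))          # exactly 3 k-means clusters (we asked for 3)
print(len(set(db.labels_) - {-1}))   # DBSCAN cluster count (label -1 = noise)
```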

2) Mixture Models & Soft Clustering

Gaussian Mixture Models (GMM) assume the data are drawn from a mixture of Gaussians; the EM algorithm learns each component's mean and covariance and yields soft cluster-membership probabilities.
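The soft-membership point above, sketched on assumed synthetic data: `predict_proba` returns one probability per component, and each row sums to 1.

```python
# Two Gaussian clusters; GMM recovers soft memberships via EM.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),   # cluster near the origin
               rng.normal(6, 1, (100, 2))])  # cluster near (6, 6)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gmm.predict_proba(X[:1])   # soft membership for the first point
print(probs.shape)                 # (1, 2): one probability per component
```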

3) Dimensionality Reduction

PCA / SVD
t-SNE / UMAP
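A minimal PCA sketch (synthetic, nearly one-dimensional data assumed): when the data lie close to a line, the first principal component captures almost all the variance.

```python
# Second feature is ~2x the first plus small noise, so the data are nearly rank-1.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=200)])

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # first component dominates
```

t-SNE and UMAP are the nonlinear counterparts, used for visualization rather than as preprocessing.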

4) Association Rules

5) Anomaly Detection

Isolation Forest – isolates anomalies with short paths.
One-Class SVM – boundary around normal data.
Autoencoders – reconstruction error flags anomalies.
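A hypothetical example for the first item in the list above: an obvious outlier is isolated by short random-partition paths and flagged.

```python
# 200 normal points plus one far-away outlier.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), [[10.0, 10.0]]])

iso = IsolationForest(random_state=0).fit(X)
labels = iso.predict(X)   # -1 = anomaly, 1 = normal
print(labels[-1])         # the appended outlier is flagged as -1
```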

6) Representation Learning (Brief)

7) Practical Playbook

# Customer Segmentation (scikit-learn)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# num_cols / cat_cols: lists of numeric / categorical column names
segmenter = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), num_cols),
        # dense output: PCA cannot consume the default sparse matrix
        ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), cat_cols),
    ])),
    ("dim", PCA(n_components=10)),
    ("cluster", KMeans(n_clusters=5, n_init="auto")),
])

Exercise: Compare k-Means vs GMM for k = 3..8 on the same standardized data. Plot silhouette scores; pick k; interpret centroids and discuss business actions.