Lecture 15 — Optimization Examples: Linear Regression & SVD

Linear Regression (Normal equation, Gradient Descent, Ridge) & SVD / Pseudoinverse — step-by-step with code

Overview

This lecture gives concrete optimization procedures used in ML: closed-form & iterative solutions for linear regression, regularization (ridge), and stable solutions via SVD (pseudoinverse). Each method includes intuition, step-by-step math, and runnable code.

Part A — Linear Regression: Problem & Objective

Given data \(X\in\mathbb{R}^{n\times p}\) (rows = samples, columns = features) and targets \(y\in\mathbb{R}^n\), we use a linear model:

y ≈ X β

Minimize squared error (ordinary least squares):

J(β) = ||y - Xβ||₂²

Goal: find β that minimizes J(β).

Part B — Closed-form: Normal Equations

Derivation (brief):

  1. J(β) = (y - Xβ)ᵀ(y - Xβ).
  2. Gradient: ∇β J = -2 Xᵀ(y - Xβ).
  3. Set ∇β J = 0 → XᵀX β = Xᵀy.
  4. Solve (if XᵀX is invertible): β̂ = (XᵀX)⁻¹ Xᵀ y.

In code, use np.linalg.solve rather than an explicit matrix inverse for numerical stability.

Python (NumPy) — closed-form

import numpy as np

# X: n x p, y: n
XtX = X.T @ X
Xty = X.T @ y
beta_closed = np.linalg.solve(XtX, Xty)   # equivalent to (XtX)^{-1} X^T y

Part C — Gradient Descent (iterative)

Update rule:

β ← β - η ∇β J = β - 2η Xᵀ(Xβ - y)

Algorithm: choose learning rate η, initialize β₀ (e.g., zeros), iterate until convergence (monitor loss or gradient norm).

Python (NumPy) — gradient descent

# Simple full-batch gradient descent for linear regression
p = X.shape[1]
beta = np.zeros(p)
lr = 1e-3                             # learning rate eta; tune per problem
for epoch in range(10000):
    grad = 2 * X.T @ (X @ beta - y)   # gradient of ||y - X beta||^2, shape (p,)
    beta -= lr * grad
# monitor the loss or gradient norm for convergence; adjust lr as needed

GD is simple but requires tuning η; for many problems, SGD or mini-batch gradient descent is preferred.
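The mini-batch variant shuffles the data each epoch and updates on small subsets. A minimal sketch (the synthetic data and hyperparameters below are illustrative):

```python
import numpy as np

# Synthetic regression data for illustration
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.01 * rng.normal(size=n)

beta = np.zeros(p)
lr, batch = 1e-2, 32
for epoch in range(200):
    idx = rng.permutation(n)                  # reshuffle each epoch
    for start in range(0, n, batch):
        b = idx[start:start + batch]
        # gradient of the mean squared error over the mini-batch
        grad = 2 * X[b].T @ (X[b] @ beta - y[b]) / len(b)
        beta -= lr * grad

print(beta)   # close to beta_true
```

Each update uses only a batch of 32 samples, so the per-step cost is constant in n while the averaged gradient keeps the step size comparable across batch sizes.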

Part D — Ridge Regression (Tikhonov regularization)

Add L2 regularization to stabilize ill-conditioned problems:

J_ridge(β) = ||y - Xβ||₂² + λ ||β||₂²

Closed-form solution:

β̂ = (XᵀX + λI)⁻¹ Xᵀ y

Python (NumPy) — ridge

lam = 1.0
p = X.shape[1]
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

λ > 0 improves conditioning; pick λ by cross-validation.
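Cross-validation for λ can be sketched as a grid search over k folds (the data, grid, and fold count below are assumptions for illustration):

```python
import numpy as np

# Synthetic data for illustration
rng = np.random.default_rng(1)
n, p = 120, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

lambdas = [1e-4, 1e-2, 1.0, 100.0]      # candidate grid
k = 5
folds = np.array_split(rng.permutation(n), k)

cv_err = []
for lam in lambdas:
    errs = []
    for f in folds:
        train = np.setdiff1d(np.arange(n), f)        # hold out fold f
        b = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((y[f] - X[f] @ b) ** 2))  # validation MSE
    cv_err.append(np.mean(errs))

best_lam = lambdas[int(np.argmin(cv_err))]
print(best_lam)
```

Fit on k-1 folds, score on the held-out fold, and average; the λ with the lowest average validation error wins.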

Part E — SVD & Pseudoinverse (numerical stability)

When X is rank-deficient or XᵀX is ill-conditioned, use the SVD: X = U Σ Vᵀ. The Moore–Penrose pseudoinverse X⁺ = V Σ⁺ Uᵀ gives the minimum-norm solution:

β̂ = X⁺ y = V Σ⁺ Uᵀ y

where Σ⁺ replaces each non-zero σᵢ by 1/σᵢ and leaves zeros as zero. Truncating small σᵢ yields a form of regularization (truncated SVD).

Python (NumPy) — SVD solution

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U @ diag(s) @ Vt
S_inv = np.diag([1/si if si > 1e-12 else 0.0 for si in s])
X_pinv = Vt.T @ S_inv @ U.T
beta_svd = X_pinv @ y

The SVD-based solution is numerically stable and reveals rank and conditioning (via the singular values s).

Part F — Worked numeric example (small)

Data:

X = [[1, 1],
     [1, 2],
     [1, 3]]
y = [1, 2, 2]

Compute the normal-equation, ridge, and SVD solutions — they produce comparable β on this example.

Python full example (copy to Jupyter)

import numpy as np

X = np.array([[1.,1.],[1.,2.],[1.,3.]])
y = np.array([1.,2.,2.])

# closed-form
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# ridge
lam = 1e-3
beta_ridge = np.linalg.solve(X.T @ X + lam*np.eye(2), X.T @ y)

# SVD pseudo-inverse
U, s, Vt = np.linalg.svd(X, full_matrices=False)
S_inv = np.diag([1/si if si > 1e-12 else 0. for si in s])
beta_svd = Vt.T @ S_inv @ U.T @ y

print("closed:", beta_closed)
print("ridge:", beta_ridge)
print("svd  :", beta_svd)

Try changing the data to make columns collinear (e.g., repeat a column) and observe differences.
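That experiment might look like the following sketch (behavior of np.linalg.solve on an exactly singular system can vary, hence the try/except):

```python
import numpy as np

# Duplicate the second column so that X^T X is singular
X = np.array([[1., 1., 1.],
              [1., 2., 2.],
              [1., 3., 3.]])
y = np.array([1., 2., 2.])

# Normal equations: X^T X is singular, so solve typically raises LinAlgError
try:
    print("closed:", np.linalg.solve(X.T @ X, X.T @ y))
except np.linalg.LinAlgError as e:
    print("normal equations failed:", e)

# Ridge: adding lambda*I restores invertibility
lam = 1e-3
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Pseudoinverse: minimum-norm solution splits weight evenly
# across the duplicated columns
beta_svd = np.linalg.pinv(X) @ y
print("ridge:", beta_ridge)
print("svd  :", beta_svd)
```

The closed form breaks down, while ridge and the pseudoinverse both return well-defined answers that nearly agree for small λ.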

Part G — Interpretation & Practical Tips

Part H — Interactive: Small Gradient Descent Demo (Intercept + Slope)

This playground runs a small batch gradient descent on a manually entered 1-D dataset (fitting y = w0 + w1 x). Use it to see how the parameters evolve step by step.
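The same demo can be sketched offline in NumPy (the dataset values are illustrative; here y = 1 + 2x exactly, so GD should recover w0 = 1, w1 = 2):

```python
import numpy as np

# Hand-entered 1-D dataset
x = np.array([0., 1., 2., 3.])
y = np.array([1., 3., 5., 7.])     # lies exactly on y = 1 + 2x

w0, w1 = 0.0, 0.0                  # intercept and slope
lr = 0.05
for step in range(2000):
    err = (w0 + w1 * x) - y        # residuals of current fit
    g0 = 2 * err.mean()            # d/dw0 of mean squared error
    g1 = 2 * (err * x).mean()      # d/dw1 of mean squared error
    w0 -= lr * g0
    w1 -= lr * g1
    if step % 500 == 0:
        print(step, round(w0, 3), round(w1, 3))   # parameter trajectory

print(w0, w1)   # approaches (1, 2)
```

Printing the trajectory every 500 steps shows the intercept converging more slowly than the slope, the same effect the interactive widget visualizes.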

Part I — Links between SVD, PCA and regression
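In brief: PCA of a centered data matrix is its SVD — the rows of Vᵀ are the principal directions, s²/(n-1) are the explained variances, and regressing y on the top-k projections gives principal component regression. A sketch with synthetic data (the dataset and correlation structure are made up):

```python
import numpy as np

# Synthetic data with two correlated columns
rng = np.random.default_rng(2)
A = rng.normal(size=(100, 4))
A[:, 1] = 2 * A[:, 0] + 0.1 * rng.normal(size=100)

Ac = A - A.mean(axis=0)                # center before PCA
U, s, Vt = np.linalg.svd(Ac, full_matrices=False)

explained_var = s**2 / (len(A) - 1)    # variance along each principal direction
scores = Ac @ Vt.T                     # PCA scores (projections onto directions)

# Cross-check against the eigendecomposition of the covariance matrix
evals = np.sort(np.linalg.eigvalsh(np.cov(Ac, rowvar=False)))[::-1]
print(np.allclose(explained_var, evals))
```

The first singular direction captures the shared variance of the two correlated columns, which is exactly why truncated SVD stabilizes regression on collinear features.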

Part J — Exercises (recommended)

  1. Generate a dataset where two features are highly collinear. Compare β̂ from closed-form, ridge, and SVD (with truncation).
  2. Implement mini-batch gradient descent and compare convergence speed with full-batch GD on a larger generated dataset.
  3. Perform PCA (SVD) on a dataset and reconstruct using top-k components; measure reconstruction error as a function of k.