Lecture 15 — Optimization Examples: Linear Regression & SVD

Linear Regression (Normal equation, Gradient Descent, Ridge) & SVD / Pseudoinverse — step-by-step with code

Overview

This lecture gives concrete optimization procedures used in ML: closed-form & iterative solutions for linear regression, regularization (ridge), and stable solutions via SVD (pseudoinverse). Each method includes intuition, step-by-step math, and runnable code.

Part A — Linear Regression: Problem & Objective

Given data \(X\in\mathbb{R}^{n\times p}\) (rows = samples, columns = features) and targets \(y\in\mathbb{R}^n\), we use a linear model:

y ≈ X β

Minimize squared error (ordinary least squares):

J(β) = ||y - Xβ||₂²

Goal: find β that minimizes J(β).

Part B — Closed-form: Normal Equations

Derivation (brief):

  1. J(β) = (y - Xβ)ᵀ(y - Xβ).
  2. Gradient: ∇β J = -2 Xᵀ(y - Xβ).
  3. Set ∇β J = 0 → XᵀX β = Xᵀy.
  4. Solve (if XᵀX is invertible): β̂ = (XᵀX)⁻¹ Xᵀ y.

In code, use np.linalg.solve rather than an explicit matrix inverse for numerical stability.

Python (NumPy) — closed-form

import numpy as np

# X: n x p, y: n
XtX = X.T @ X
Xty = X.T @ y
beta_closed = np.linalg.solve(XtX, Xty)   # equivalent to (XtX)^{-1} X^T y

Part C — Gradient Descent (iterative)

Update rule:

β ← β - η ∇β J = β - 2η Xᵀ(Xβ - y)

Algorithm: choose learning rate η, initialize β₀ (e.g., zeros), iterate until convergence (monitor loss or gradient norm).

Python (NumPy) — gradient descent

# Simple full-batch gradient descent for linear regression
p = X.shape[1]
beta = np.zeros(p)
lr = 1e-3                             # learning rate eta; tune per problem
for epoch in range(10000):
    grad = 2 * X.T @ (X @ beta - y)   # gradient of ||y - X beta||^2, shape (p,)
    beta -= lr * grad
# monitor the loss or gradient norm for convergence; adjust lr as needed

GD is simple but requires tuning η; for many problems, SGD or mini-batch gradient descent is preferred.
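The mini-batch variant shuffles the data each epoch and updates on small subsets. A minimal sketch (the synthetic data and hyperparameters below are illustrative):

```python
import numpy as np

# Synthetic regression data for illustration
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.01 * rng.normal(size=n)

beta = np.zeros(p)
lr, batch = 1e-2, 32
for epoch in range(200):
    idx = rng.permutation(n)                  # reshuffle each epoch
    for start in range(0, n, batch):
        b = idx[start:start + batch]
        # gradient of the mean squared error over the mini-batch
        grad = 2 * X[b].T @ (X[b] @ beta - y[b]) / len(b)
        beta -= lr * grad

print(beta)   # close to beta_true
```

Each update uses only a batch of 32 samples, so the per-step cost is constant in n while the averaged gradient keeps the step size comparable across batch sizes.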

Part D — Ridge Regression (Tikhonov regularization)

Add L2 regularization to stabilize ill-conditioned problems:

J_ridge(β) = ||y - Xβ||₂² + λ ||β||₂²

Closed-form solution:

β̂ = (XᵀX + λI)⁻¹ Xᵀ y

Python (NumPy) — ridge

lam = 1.0
p = X.shape[1]
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

λ > 0 improves conditioning; pick λ by cross-validation.
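Cross-validation for λ can be sketched as a grid search over k folds (the data, grid, and fold count below are assumptions for illustration):

```python
import numpy as np

# Synthetic data for illustration
rng = np.random.default_rng(1)
n, p = 120, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

lambdas = [1e-4, 1e-2, 1.0, 100.0]      # candidate grid
k = 5
folds = np.array_split(rng.permutation(n), k)

cv_err = []
for lam in lambdas:
    errs = []
    for f in folds:
        train = np.setdiff1d(np.arange(n), f)        # hold out fold f
        b = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((y[f] - X[f] @ b) ** 2))  # validation MSE
    cv_err.append(np.mean(errs))

best_lam = lambdas[int(np.argmin(cv_err))]
print(best_lam)
```

Fit on k-1 folds, score on the held-out fold, and average; the λ with the lowest average validation error wins.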

Part E — SVD & Pseudoinverse (numerical stability)

When X is rank-deficient or XᵀX is ill-conditioned, use the SVD: X = U Σ Vᵀ. The Moore–Penrose pseudoinverse X⁺ = V Σ⁺ Uᵀ gives the minimum-norm solution:

β̂ = X⁺ y = V Σ⁺ Uᵀ y

where Σ⁺ replaces each non-zero σᵢ by 1/σᵢ and leaves zeros as zero. Truncating small σᵢ yields a form of regularization (truncated SVD).

Python (NumPy) — SVD solution

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U @ diag(s) @ Vt
S_inv = np.diag([1/si if si > 1e-12 else 0.0 for si in s])
X_pinv = Vt.T @ S_inv @ U.T
beta_svd = X_pinv @ y

The SVD-based solution is numerically stable and reveals rank and conditioning (via the singular values s).

Part F — Worked numeric example (small)

Data:

X = [[1, 1],
     [1, 2],
     [1, 3]]
y = [1, 2, 2]

Compute the normal-equation, ridge, and SVD solutions — they produce comparable β on this example.

Python full example (copy to Jupyter)

import numpy as np

X = np.array([[1.,1.],[1.,2.],[1.,3.]])
y = np.array([1.,2.,2.])

# closed-form
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# ridge
lam = 1e-3
beta_ridge = np.linalg.solve(X.T @ X + lam*np.eye(2), X.T @ y)

# SVD pseudo-inverse
U, s, Vt = np.linalg.svd(X, full_matrices=False)
S_inv = np.diag([1/si if si > 1e-12 else 0. for si in s])
beta_svd = Vt.T @ S_inv @ U.T @ y

print("closed:", beta_closed)
print("ridge:", beta_ridge)
print("svd  :", beta_svd)

Try changing the data to make columns collinear (e.g., repeat a column) and observe differences.
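That experiment might look like the following sketch (behavior of np.linalg.solve on an exactly singular system can vary, hence the try/except):

```python
import numpy as np

# Duplicate the second column so that X^T X is singular
X = np.array([[1., 1., 1.],
              [1., 2., 2.],
              [1., 3., 3.]])
y = np.array([1., 2., 2.])

# Normal equations: X^T X is singular, so solve typically raises LinAlgError
try:
    print("closed:", np.linalg.solve(X.T @ X, X.T @ y))
except np.linalg.LinAlgError as e:
    print("normal equations failed:", e)

# Ridge: adding lambda*I restores invertibility
lam = 1e-3
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Pseudoinverse: minimum-norm solution splits weight evenly
# across the duplicated columns
beta_svd = np.linalg.pinv(X) @ y
print("ridge:", beta_ridge)
print("svd  :", beta_svd)
```

The closed form breaks down, while ridge and the pseudoinverse both return well-defined answers that nearly agree for small λ.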

Part G — Interpretation & Practical Tips

Part H — Interactive: Small Gradient Descent Demo (Intercept + Slope)

This playground runs a small batch gradient descent on a manually entered 1-D dataset (fitting y = w0 + w1 x). Use it to see how the parameters evolve step by step.
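The same demo can be sketched offline in NumPy (the dataset values are illustrative; here y = 1 + 2x exactly, so GD should recover w0 = 1, w1 = 2):

```python
import numpy as np

# Hand-entered 1-D dataset
x = np.array([0., 1., 2., 3.])
y = np.array([1., 3., 5., 7.])     # lies exactly on y = 1 + 2x

w0, w1 = 0.0, 0.0                  # intercept and slope
lr = 0.05
for step in range(2000):
    err = (w0 + w1 * x) - y        # residuals of current fit
    g0 = 2 * err.mean()            # d/dw0 of mean squared error
    g1 = 2 * (err * x).mean()      # d/dw1 of mean squared error
    w0 -= lr * g0
    w1 -= lr * g1
    if step % 500 == 0:
        print(step, round(w0, 3), round(w1, 3))   # parameter trajectory

print(w0, w1)   # approaches (1, 2)
```

Printing the trajectory every 500 steps shows the intercept converging more slowly than the slope, the same effect the interactive widget visualizes.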

Part I — Links between SVD, PCA and regression
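In brief: PCA of a centered data matrix is its SVD — the rows of Vᵀ are the principal directions, s²/(n-1) are the explained variances, and regressing y on the top-k projections gives principal component regression. A sketch with synthetic data (the dataset and correlation structure are made up):

```python
import numpy as np

# Synthetic data with two correlated columns
rng = np.random.default_rng(2)
A = rng.normal(size=(100, 4))
A[:, 1] = 2 * A[:, 0] + 0.1 * rng.normal(size=100)

Ac = A - A.mean(axis=0)                # center before PCA
U, s, Vt = np.linalg.svd(Ac, full_matrices=False)

explained_var = s**2 / (len(A) - 1)    # variance along each principal direction
scores = Ac @ Vt.T                     # PCA scores (projections onto directions)

# Cross-check against the eigendecomposition of the covariance matrix
evals = np.sort(np.linalg.eigvalsh(np.cov(Ac, rowvar=False)))[::-1]
print(np.allclose(explained_var, evals))
```

The first singular direction captures the shared variance of the two correlated columns, which is exactly why truncated SVD stabilizes regression on collinear features.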

Part J — Exercises (recommended)

  1. Generate a dataset where two features are highly collinear. Compare β̂ from closed-form, ridge, and SVD (with truncation).
  2. Implement mini-batch gradient descent and compare convergence speed with full-batch GD on a larger generated dataset.
  3. Perform PCA (SVD) on a dataset and reconstruct using top-k components; measure reconstruction error as a function of k.