This lecture gives concrete optimization procedures used in ML: closed-form & iterative solutions for linear regression, regularization (ridge), and stable solutions via SVD (pseudoinverse). Each method includes intuition, step-by-step math, and runnable code.
Given data \(X\in\mathbb{R}^{n\times p}\) (rows = samples, columns = features) and targets \(y\in\mathbb{R}^n\), we use a linear model:
y ≈ X β
Minimize squared error (ordinary least squares):
J(β) = ||y - Xβ||₂²
Goal: find β that minimizes J(β).
Derivation (brief): expand J(β) = (y - Xβ)ᵀ(y - Xβ), take the gradient ∇β J = -2Xᵀ(y - Xβ), and set it to zero. This yields the normal equations XᵀX β = Xᵀy, hence β̂ = (XᵀX)⁻¹ Xᵀ y whenever XᵀX is invertible.
Use np.linalg.solve in code rather than explicit matrix inverse for numerical stability.
import numpy as np
# X: n x p, y: n
XtX = X.T @ X
Xty = X.T @ y
beta_closed = np.linalg.solve(XtX, Xty) # equivalent to (XtX)^{-1} X^T y
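As a sanity check, the normal-equation solution can be compared against NumPy's built-in least-squares solver on synthetic data (the random X and y below are illustrative, not part of the lecture's example):

```python
import numpy as np

# synthetic, well-conditioned problem (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

beta_closed = np.linalg.solve(X.T @ X, X.T @ y)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_closed, beta_lstsq))  # the two solvers agree
```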
Update rule:
β ← β - η ∇β J = β - 2η Xᵀ(Xβ - y)
Algorithm: choose learning rate η, initialize β₀ (e.g., zeros), iterate until convergence (monitor loss or gradient norm).
# Simple gradient descent for linear regression
beta = np.zeros(X.shape[1])  # one weight per feature
lr = 1e-3
for epoch in range(10000):
    grad = 2 * X.T @ (X @ beta - y)  # shape (p,)
    beta -= lr * grad
    # check convergence (loss or gradient norm), adjust lr, or use adaptive optimizers
GD is simple but requires tuning η; for large datasets, stochastic (SGD) or mini-batch gradient descent is usually preferred.
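A minimal mini-batch SGD sketch on synthetic data (beta_true, the batch size, and the learning rate below are illustrative choices, not prescribed by the lecture):

```python
import numpy as np

# synthetic regression problem (illustrative only)
rng = np.random.default_rng(0)
n, p = 200, 2
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0])
y = X @ beta_true + 0.01 * rng.normal(size=n)

beta = np.zeros(p)
lr, batch = 1e-2, 32
for epoch in range(200):
    idx = rng.permutation(n)          # reshuffle each epoch
    for start in range(0, n, batch):
        b = idx[start:start + batch]
        # average gradient over the mini-batch
        grad = 2 * X[b].T @ (X[b] @ beta - y[b]) / len(b)
        beta -= lr * grad
print(beta)  # close to beta_true
```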
Add L2 regularization to stabilize ill-conditioned problems:
J_ridge(β) = ||y - Xβ||₂² + λ ||β||₂²
Closed-form solution:
β̂ = (XᵀX + λI)⁻¹ Xᵀ y
lam = 1.0
p = X.shape[1]
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
λ > 0 improves conditioning; pick λ by cross-validation.
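One lightweight way to pick λ is a hold-out split, a simple stand-in for full cross-validation (the data and the λ grid below are made up for illustration):

```python
import numpy as np

# synthetic data for the λ search (illustrative only)
rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

# hold-out split: train on the first 80 samples, validate on the rest
Xtr, Xval = X[:80], X[80:]
ytr, yval = y[:80], y[80:]

best_lam, best_err = None, np.inf
for lam in [1e-4, 1e-2, 1.0, 100.0]:
    beta = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(p), Xtr.T @ ytr)
    err = np.mean((Xval @ beta - yval) ** 2)   # validation MSE
    if err < best_err:
        best_lam, best_err = lam, err
print(best_lam)
```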
When X is rank-deficient or XᵀX is ill-conditioned, use the SVD: X = U Σ Vᵀ. The Moore–Penrose pseudoinverse X⁺ = V Σ⁺ Uᵀ gives the minimum-norm least-squares solution:
β̂ = X⁺ y = V Σ⁺ Uᵀ y
where Σ⁺ replaces each non-zero σᵢ by 1/σᵢ and leaves zeros as zero. Truncating small σᵢ yields a form of regularization (truncated SVD).
U, s, Vt = np.linalg.svd(X, full_matrices=False) # X = U @ diag(s) @ Vt
S_inv = np.diag([1/si if si > 1e-12 else 0.0 for si in s])
X_pinv = Vt.T @ S_inv @ U.T
beta_svd = X_pinv @ y
SVD-based solution is numerically stable and reveals rank & conditioning (singular values s).
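For instance, on a small made-up matrix the singular values returned by np.linalg.svd expose rank and conditioning directly:

```python
import numpy as np

# made-up matrix with two nearly collinear rows (illustrative only)
X = np.array([[1., 1.],
              [1., 1.0001],
              [1., 2.]])
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(s)                          # singular values, descending
print(s[0] / s[-1])               # condition number of X
print(np.linalg.matrix_rank(X))   # numerical rank
```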
Data:
X = [[1, 1],
[1, 2],
[1, 3]]
y = [1, 2, 2]
Compute the normal-equation, ridge, and SVD solutions; they produce nearly identical β in this well-conditioned example.
import numpy as np
X = np.array([[1.,1.],[1.,2.],[1.,3.]])
y = np.array([1.,2.,2.])
# closed-form
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)
# ridge
lam = 1e-3
beta_ridge = np.linalg.solve(X.T @ X + lam*np.eye(2), X.T @ y)
# SVD pseudo-inverse
U, s, Vt = np.linalg.svd(X, full_matrices=False)
S_inv = np.diag([1/si if si > 1e-12 else 0. for si in s])
beta_svd = Vt.T @ S_inv @ U.T @ y
print("closed:", beta_closed)
print("ridge:", beta_ridge)
print("svd :", beta_svd)
Try changing the data to make columns collinear (e.g., repeat a column) and observe differences.
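One way to run that experiment, duplicating the second column so XᵀX becomes exactly singular (this extends the example data above; the expected behavior is that the plain normal equations fail while ridge and the pseudoinverse still return answers):

```python
import numpy as np

# duplicate the second column so X is rank-deficient
X = np.array([[1., 1., 1.],
              [1., 2., 2.],
              [1., 3., 3.]])
y = np.array([1., 2., 2.])

try:
    np.linalg.solve(X.T @ X, X.T @ y)       # XᵀX is exactly singular
except np.linalg.LinAlgError:
    print("normal equations: singular matrix")

lam = 1e-3
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
beta_pinv = np.linalg.pinv(X) @ y           # minimum-norm solution
print(beta_ridge)
print(beta_pinv)  # the duplicated columns share the weight equally
```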
This playground runs a small batch gradient descent on a manually-entered 1-D dataset (fit y = w0 + w1 x). Use it to see how parameters evolve step-by-step.
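A minimal version of that playground might look like the following batch-GD loop on a made-up 1-D dataset (the data, learning rate, and step count are arbitrary illustrative choices):

```python
import numpy as np

# tiny manually-entered 1-D dataset: exactly y = 1 + 2x
x = np.array([0., 1., 2., 3., 4.])
y = np.array([1., 3., 5., 7., 9.])

w0, w1 = 0.0, 0.0
lr = 0.05
for step in range(2000):
    pred = w0 + w1 * x
    err = pred - y
    # gradients of the mean squared error w.r.t. w0 and w1
    w0 -= lr * 2 * err.mean()
    w1 -= lr * 2 * (err * x).mean()
    if step % 500 == 0:
        print(step, w0, w1, (err ** 2).mean())
print(w0, w1)  # → approaches (1, 2)
```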