Stochastic Gradient Descent (SGD)
For loss \(\mathcal{L}(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^n \ell(\mathbf{w}; \mathbf{x}_i,y_i)\), pick a random sample (or mini-batch) and update
\[ \mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \, \widehat{\nabla} \mathcal{L}(\mathbf{w}_t). \]
With diminishing step sizes satisfying the Robbins–Monro conditions \(\sum_t \eta_t = \infty\), \(\sum_t \eta_t^2 < \infty\) (e.g., \(\eta_t = \eta_0/(1+\gamma t)\)) and convex \(\ell\), SGD converges in expectation to a minimizer.
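The update above can be sketched on a tiny least-squares problem; the data, `eta0`, and `gamma` here are our own illustrative choices, not from any particular reference.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
w_star = np.array([2.0, -1.0])
y = X @ w_star + 0.01 * rng.normal(size=500)   # nearly noiseless targets

w = np.zeros(2)
eta0, gamma = 0.1, 0.01
for t in range(2000):
    i = rng.integers(len(X))              # draw one sample at random
    grad = (X[i] @ w - y[i]) * X[i]       # gradient of 0.5 * (x_i . w - y_i)^2
    eta_t = eta0 / (1 + gamma * t)        # diminishing step size
    w -= eta_t * grad
```

After a couple of thousand single-sample updates, `w` should sit close to `w_star`; the decaying step size is what tames the gradient noise near the optimum.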
Momentum & Nesterov
# Classical momentum: exponential moving average of gradients
v = beta * v + (1 - beta) * grad      # velocity (EMA of past gradients)
w = w - eta * v
# Nesterov lookahead (pseudo-code)
w_look = w - eta * beta * v           # step ahead along the current velocity
grad = grad_at(w_look)                # evaluate the gradient there, not at w
v = beta * v + (1 - beta) * grad
w = w - eta * v
Momentum accelerates along gentle valleys and damps oscillations in steep directions.
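A runnable version of the Nesterov pseudo-code above, on an ill-conditioned quadratic \(\tfrac{1}{2}\mathbf{w}^\top A\mathbf{w}\); the matrix `A`, `eta`, and `beta` are our own illustrative choices.

```python
import numpy as np

A = np.diag([10.0, 0.1])            # one steep and one gentle direction
grad_at = lambda w: A @ w           # gradient of the quadratic

w = np.array([1.0, 1.0])
v = np.zeros(2)
eta, beta = 0.05, 0.9
for _ in range(1000):
    w_look = w - eta * beta * v     # look ahead along the velocity
    g = grad_at(w_look)             # evaluate the gradient at the lookahead point
    v = beta * v + (1 - beta) * g
    w = w - eta * v
```

Along the steep axis the iterates spiral in quickly; along the gentle axis the accumulated velocity keeps pushing `w` toward the minimizer at the origin.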
Learning-Rate Schedules
- Step decay: \(\eta_t = \eta_0\,\gamma^{\lfloor t/T\rfloor}\)
- Cosine: \(\eta_t = \eta_{\min}+\tfrac{1}{2}(\eta_0-\eta_{\min})(1+\cos(\pi t/T))\)
- Cyclical: triangular/triangular2 with periodic restarts
- Warmup: start small then ramp to \(\eta_0\)
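The schedules above can be written as small functions of the step counter t; the function names and default hyperparameters here are our own.

```python
import math

def step_decay(t, eta0=0.1, gamma=0.5, T=30):
    # Multiply by gamma every T steps.
    return eta0 * gamma ** (t // T)

def cosine(t, eta0=0.1, eta_min=1e-4, T=100):
    # Anneal from eta0 down to eta_min over T steps.
    return eta_min + 0.5 * (eta0 - eta_min) * (1 + math.cos(math.pi * t / T))

def triangular(t, eta_min=1e-3, eta0=0.1, T=50):
    # Cyclical: rise linearly to eta0 over T steps, fall back over the next T.
    cycle_pos = abs((t % (2 * T)) - T) / T   # 1 at cycle start, 0 at the peak
    return eta_min + (eta0 - eta_min) * (1 - cycle_pos)

def warmup_then_constant(t, eta0=0.1, warmup=10):
    # Ramp linearly from ~0 to eta0 over the first `warmup` steps.
    return eta0 * min(1.0, (t + 1) / warmup)
```

Each one maps a step index to \(\eta_t\), so it can be dropped into any of the update loops in this section.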
Worked Example (Mini-batch Logistic with Momentum)
import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
w_true = np.array([1.5, -2.0, 0.5]); b_true = -0.3   # ground-truth parameters
logits = X @ w_true + b_true
p = 1 / (1 + np.exp(-logits))
y = (rng.uniform(size=400) < p).astype(float)        # sample Bernoulli labels
w = np.zeros(3); b = 0.0
eta = 0.1; beta = 0.9
v_w = np.zeros_like(w); v_b = 0.0
batch = 32
for t in range(200):
    idx = rng.choice(len(X), size=batch, replace=False)   # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    z = Xb @ w + b
    pb = 1 / (1 + np.exp(-z))                             # predicted probabilities
    grad_w = Xb.T @ (pb - yb) / batch                     # gradient of the mean log-loss
    grad_b = np.sum(pb - yb) / batch
    v_w = beta * v_w + (1 - beta) * grad_w                # momentum updates
    v_b = beta * v_b + (1 - beta) * grad_b
    w -= eta * v_w
    b -= eta * v_b
Adam (Bonus)
Bias-corrected moment estimates (t counts updates starting from 1; at t = 0 the corrections below would divide by zero):
m = beta1*m + (1-beta1)*grad               # first moment: EMA of gradients
v = beta2*v + (1-beta2)*(grad**2)          # second moment: EMA of squared gradients
mh = m/(1-beta1**t)                        # bias-corrected first moment
vh = v/(1-beta2**t)                        # bias-corrected second moment
w -= eta * mh/(np.sqrt(vh) + 1e-8)         # small epsilon guards against division by zero
Adam adapts per-parameter step sizes; useful for sparse/ill-scaled problems.
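Putting the pieces together, a runnable sketch of Adam on a logistic-regression problem like the worked example above; the learning rate is ours, the betas and epsilon are the common defaults, and the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = (rng.uniform(size=400) < 1 / (1 + np.exp(-(X @ w_true)))).astype(float)

w = np.zeros(3)
m = np.zeros(3); v = np.zeros(3)
eta, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8
for t in range(1, 501):                    # t starts at 1 for the bias correction
    p = 1 / (1 + np.exp(-(X @ w)))
    grad = X.T @ (p - y) / len(X)          # full-batch gradient, for simplicity
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    mh = m / (1 - beta1**t)                # bias-corrected first moment
    vh = v / (1 - beta2**t)                # bias-corrected second moment
    w -= eta * mh / (np.sqrt(vh) + eps)
```

Swapping the full-batch gradient for a mini-batch one, as in the worked example, recovers the usual stochastic Adam.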