Objective
Understand vanilla gradient descent by minimizing a simple convex quadratic:
\[ f(\mathbf{w}) = \tfrac{1}{2}\,\mathbf{w}^\top \mathbf{A} \, \mathbf{w} - \mathbf{b}^\top \mathbf{w} + c, \quad \mathbf{A} \succ 0. \]
The minimizer is \(\mathbf{w}^* = \mathbf{A}^{-1}\mathbf{b}\). Gradient descent iterates
\[ \mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla f(\mathbf{w}_t) = \mathbf{w}_t - \eta (\mathbf{A} \mathbf{w}_t - \mathbf{b}). \]
Convergence Condition
Subtracting \(\mathbf{w}^*\) from both sides of the update (and using \(\mathbf{b} = \mathbf{A}\mathbf{w}^*\)) gives the error recursion
\[ \mathbf{w}_{t+1} - \mathbf{w}^* = (\mathbf{I} - \eta \mathbf{A})(\mathbf{w}_t - \mathbf{w}^*), \]
so the error contracts if and only if every eigenvalue of \(\mathbf{I} - \eta\mathbf{A}\) has magnitude less than 1. If \(0 < \lambda_{\min} \le \cdots \le \lambda_{\max}\) are the eigenvalues of \(\mathbf{A}\), this holds for a constant step size \(\eta\) exactly when
\[ 0 < \eta < \frac{2}{\lambda_{\max}}. \]
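This bound can be checked numerically. The sketch below reuses the worked example's \(\mathbf{A}\) and \(\mathbf{b}\) (defined later in the text); the helper `run_gd` is introduced here for illustration and is not part of the original notes:

```python
import numpy as np

# Quadratic from the worked example: f(w) = 0.5 w^T A w - b^T w
A = np.array([[3., 0.], [0., 1.]])
b = np.array([6., 2.])
w_star = np.linalg.solve(A, b)            # minimizer A^{-1} b = [2, 2]

lam_max = np.linalg.eigvalsh(A).max()     # largest eigenvalue = 3
eta_bound = 2.0 / lam_max                 # convergence threshold 2 / lambda_max

def run_gd(eta, steps=200):
    """Return the final error ||w_t - w*|| after `steps` gradient steps."""
    w = np.zeros(2)
    for _ in range(steps):
        w = w - eta * (A @ w - b)
    return np.linalg.norm(w - w_star)

print(run_gd(0.9 * eta_bound))  # just below the bound: error shrinks to ~0
print(run_gd(1.1 * eta_bound))  # just above the bound: error blows up
```

Any step size strictly inside \((0, 2/\lambda_{\max})\) converges, though rates near the endpoints are slow.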
Worked Example
Take \(\mathbf{A}=\begin{bmatrix}3&0\\0&1\end{bmatrix}\), \(\mathbf{b}=\begin{bmatrix}6\\2\end{bmatrix}\). Then \(\mathbf{w}^* = [2,\,2]^\top\). With \(\eta = 0.5\), start at \(\mathbf{w}_0=[0,0]^\top\):
\[ \nabla f(\mathbf{w}_0) = -\mathbf{b} = [-6,-2]^\top, \quad \mathbf{w}_1 = [3,1]^\top. \]
Next:
\[ \nabla f(\mathbf{w}_1) = \mathbf{A}\mathbf{w}_1-\mathbf{b} = [3,-1]^\top, \quad \mathbf{w}_2 = [1.5,1.5]^\top. \]
Continuing yields geometric convergence to \([2,2]^\top\).
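The geometric rate can be made explicit. Because \(\mathbf{A}\) is diagonal, the coordinates decouple, and each error coordinate is scaled by \(1-\eta\lambda_i\) per step:
\[ w_{t,1} - 2 = (1 - 0.5\cdot 3)^t\,(w_{0,1} - 2) = (-0.5)^t(-2), \qquad w_{t,2} - 2 = (1 - 0.5\cdot 1)^t\,(w_{0,2} - 2) = (0.5)^t(-2), \]
so both errors halve in magnitude every iteration (the first coordinate oscillating around the minimizer as it converges).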
Implementation Snippet (NumPy)
import numpy as np

A = np.array([[3., 0.], [0., 1.]])
b = np.array([6., 2.])
w = np.array([0., 0.])
eta = 0.5
for t in range(10):
    grad = A @ w - b      # gradient of f at w
    w = w - eta * grad    # gradient descent step
    print(t + 1, w)
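As a quick sanity check (an addition, not part of the original snippet), the loop's result can be compared against the closed-form minimizer:

```python
import numpy as np

A = np.array([[3., 0.], [0., 1.]])
b = np.array([6., 2.])
w = np.array([0., 0.])
eta = 0.5
for _ in range(10):
    w = w - eta * (A @ w - b)     # same update as the snippet above

w_star = np.linalg.solve(A, b)    # closed-form minimizer A^{-1} b = [2, 2]
print(w, w_star)
print(np.linalg.norm(w - w_star))  # residual error shrinks like 0.5**t
```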
Key Takeaways
- Quadratics provide a clean sandbox to see step-size effects.
- The condition number \(\kappa=\lambda_{\max}/\lambda_{\min}\) dictates the convergence rate: with the best constant step size, the error contracts by a factor of \((\kappa-1)/(\kappa+1)\) per iteration.
- Preconditioning (e.g., feature scaling) reduces \(\kappa\) and speeds convergence.
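The last bullet can be demonstrated numerically. The sketch below uses a made-up ill-conditioned diagonal quadratic (not from the text) and the classical optimal constant step size \(2/(\lambda_{\min}+\lambda_{\max})\); Jacobi rescaling \(\mathbf{D} = \operatorname{diag}(\mathbf{A})^{-1/2}\) stands in for generic preconditioning:

```python
import numpy as np

# Assumed ill-conditioned quadratic: kappa = 100.
A = np.array([[100., 0.], [0., 1.]])
b = np.array([100., 1.])                 # minimizer w* = [1, 1]

def best_eta(A):
    """Optimal constant step size 2 / (lambda_min + lambda_max)."""
    lams = np.linalg.eigvalsh(A)         # eigenvalues in ascending order
    return 2.0 / (lams[0] + lams[-1])

def steps_to_converge(A, b, eta, tol=1e-8, max_steps=10_000):
    """Run gradient descent until ||grad|| < tol; return the iteration count."""
    w = np.zeros_like(b)
    for t in range(max_steps):
        g = A @ w - b
        if np.linalg.norm(g) < tol:
            return t
        w = w - eta * g
    return max_steps

raw = steps_to_converge(A, b, best_eta(A))

# Jacobi preconditioning: D A D is the identity here, so kappa drops to 1.
D = np.diag(1.0 / np.sqrt(np.diag(A)))
A_pre, b_pre = D @ A @ D, D @ b
pre = steps_to_converge(A_pre, b_pre, best_eta(A_pre))

print(raw, pre)   # the preconditioned problem converges in far fewer steps
```

For this diagonal example the rescaled matrix is exactly the identity, so the preconditioned run finishes in a single step; for general matrices the improvement is proportional to the reduction in \(\kappa\).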