Optimization Techniques in Machine Learning

Lecture 17 – Example 1: Gradient Descent on a Quadratic

Objective

Understand vanilla gradient descent by minimizing a simple convex quadratic:

\[ f(\mathbf{w}) = \tfrac{1}{2}\,\mathbf{w}^\top \mathbf{A} \, \mathbf{w} - \mathbf{b}^\top \mathbf{w} + c, \quad \mathbf{A} \succ 0. \]

The minimizer is \(\mathbf{w}^* = \mathbf{A}^{-1}\mathbf{b}\). Gradient descent iterates

\[ \mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla f(\mathbf{w}_t) = \mathbf{w}_t - \eta (\mathbf{A} \mathbf{w}_t - \mathbf{b}). \]
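Subtracting \(\mathbf{w}^*\) from both sides (and using \(\mathbf{b}=\mathbf{A}\mathbf{w}^*\)) gives the error recursion that drives the analysis:

\[ \mathbf{w}_{t+1} - \mathbf{w}^* = (\mathbf{I} - \eta \mathbf{A})(\mathbf{w}_t - \mathbf{w}^*), \]

so the error contracts exactly when every eigenvalue of \(\mathbf{I}-\eta\mathbf{A}\) has magnitude below 1, i.e. \(|1-\eta\lambda|<1\) for all eigenvalues \(\lambda\) of \(\mathbf{A}\).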

Convergence condition

Let \(\lambda_{\min}\) and \(\lambda_{\max}\) be the smallest and largest eigenvalues of \(\mathbf{A}\). For a constant step size \(\eta\), convergence from any starting point is guaranteed if

\[ 0 < \eta < \frac{2}{\lambda_{\max}}. \]
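This bound can be checked numerically on the worked example below (there \(\lambda_{\max}=3\), so the threshold is \(2/3\)); the helper `gd_error` is illustrative, not part of the lecture code:

```python
import numpy as np

A = np.array([[3., 0.], [0., 1.]])   # lambda_max = 3, so the bound is eta < 2/3
b = np.array([6., 2.])
w_star = np.linalg.solve(A, b)       # exact minimizer [2, 2]

def gd_error(eta, steps=50):
    """Distance to the minimizer after `steps` gradient-descent iterations."""
    w = np.zeros(2)
    for _ in range(steps):
        w = w - eta * (A @ w - b)
    return np.linalg.norm(w - w_star)

print(gd_error(0.5))   # inside the bound: error shrinks geometrically
print(gd_error(0.7))   # outside the bound: the eigen-direction with lambda = 3 diverges
```

With \(\eta=0.7 > 2/3\), the component along the \(\lambda=3\) eigenvector is multiplied by \(|1-2.1|=1.1\) per step and blows up, even though the other component still converges.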

Tip: A good practical choice is \(\eta=\frac{1}{\lambda_{\max}}\) when \(\lambda_{\max}\) is known or can be estimated (e.g., by power iteration).
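A minimal power-iteration sketch for estimating \(\lambda_{\max}\) (the function name and iteration count are illustrative choices; it assumes a symmetric \(\mathbf{A}\) with a dominant eigenvalue):

```python
import numpy as np

def power_iteration(A, num_iters=100):
    """Estimate the largest eigenvalue of a symmetric matrix A."""
    v = np.ones(A.shape[0])        # any vector not orthogonal to the top eigenvector
    for _ in range(num_iters):
        v = A @ v                  # amplify the dominant eigen-direction
        v /= np.linalg.norm(v)     # renormalize to avoid overflow
    return v @ A @ v               # Rayleigh quotient of the (near-)eigenvector

A = np.array([[3., 0.], [0., 1.]])
lam_max = power_iteration(A)       # ~3 for this A
eta = 1.0 / lam_max                # the step size suggested in the tip
```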

Worked Example

Take \(\mathbf{A}=\begin{bmatrix}3&0\\0&1\end{bmatrix}\), \(\mathbf{b}=\begin{bmatrix}6\\2\end{bmatrix}\). Then \(\mathbf{w}^* = [2,\,2]^\top\). With \(\eta = 0.5\), start at \(\mathbf{w}_0=[0,0]^\top\):

\[ \nabla f(\mathbf{w}_0) = -\mathbf{b} = [-6,-2]^\top, \quad \mathbf{w}_1 = [3,1]^\top. \]

Next:

\[ \nabla f(\mathbf{w}_1) = \mathbf{A}\mathbf{w}_1-\mathbf{b} = [3,\,-1]^\top, \quad \mathbf{w}_2 = [1.5,\,1.5]^\top. \]

Continuing yields geometric convergence to \([2,2]^\top\): with \(\eta=0.5\), each coordinate's error is multiplied by \(1-\eta\lambda\) per step (\(-0.5\) and \(0.5\) here), so the distance to the minimizer halves every iteration.

Implementation Snippet (NumPy)

import numpy as np

A = np.array([[3., 0.], [0., 1.]])
b = np.array([6., 2.])
w = np.array([0., 0.])
eta = 0.5                    # step size; must satisfy 0 < eta < 2/lambda_max = 2/3
for t in range(10):
    grad = A @ w - b         # gradient of the quadratic: A w - b
    w = w - eta * grad       # gradient-descent update
    print(t + 1, w)          # iteration count and current iterate

Key Takeaways

  • Quadratics provide a clean sandbox to see step-size effects.
  • Condition number \(\kappa=\lambda_{\max}/\lambda_{\min}\) dictates speed.
  • Preconditioning (e.g., feature scaling) reduces \(\kappa\) and speeds convergence.
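As a sketch of the last point: a diagonal (Jacobi) preconditioner \(\mathbf{D}=\operatorname{diag}(\mathbf{A})^{-1/2}\) applied to the worked example makes the preconditioned Hessian \(\mathbf{D}\mathbf{A}\mathbf{D}=\mathbf{I}\), so \(\kappa=1\) and GD with \(\eta=1\) reaches the minimizer in a single step. The choice of preconditioner is illustrative, and it is exact here only because \(\mathbf{A}\) is diagonal:

```python
import numpy as np

A = np.array([[3., 0.], [0., 1.]])
b = np.array([6., 2.])

# Jacobi preconditioner: rescale so every curvature direction has eigenvalue 1.
D = np.diag(1.0 / np.sqrt(np.diag(A)))   # D A D = I because A is diagonal here

z = np.zeros(2)                          # optimize in preconditioned coords, w = D z
grad_z = D @ A @ D @ z - D @ b           # gradient of f(D z) with respect to z
z = z - 1.0 * grad_z                     # one GD step with eta = 1
w = D @ z                                # map back: w = [2, 2] = w*
print(w)
```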