Lecture 29 — SVM Dual Formulation & Kernels

Dual problem • KKT conditions • Kernel trick • Practical tips • Interactive demo

1. From Primal to Dual (intuition + derivation)

Primal (soft-margin SVM): for training examples (xᵢ, yᵢ), i = 1, …, n, with labels yᵢ ∈ {−1, +1},

min_{w,b,ξ}  (1/2) ||w||² + C ∑_{i} ξᵢ
s.t.  yᵢ (w·xᵢ + b) ≥ 1 - ξᵢ,   ξᵢ ≥ 0  for all i

Introduce Lagrange multipliers αᵢ ≥ 0 for the margin constraints and μᵢ ≥ 0 for ξᵢ ≥ 0. Setting the Lagrangian's gradients with respect to the primal variables w, b, ξ to zero and substituting back eliminates those variables, yielding the dual:

max_{α}  ∑_{i} αᵢ − (1/2) ∑_{i,j} αᵢ αⱼ yᵢ yⱼ (xᵢ·xⱼ)
s.t.   0 ≤ αᵢ ≤ C,   ∑_{i} αᵢ yᵢ = 0

Key observations:

  • Only training points appear through inner-products xᵢ·xⱼ (the Gram matrix).
  • Support vectors are the points with αᵢ > 0. Those with 0 < αᵢ < C lie exactly on the margin; those with αᵢ = C may have ξᵢ > 0, i.e. lie inside the margin or be misclassified.
  • After solving for α, compute w = ∑ αᵢ yᵢ xᵢ, and recover b from any support vector with 0 < αᵢ < C, for which yᵢ (w·xᵢ + b) = 1.
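The dual above can be checked numerically. The sketch below (a toy illustration using scipy's general-purpose SLSQP solver on a made-up 4-point dataset, not a production approach) minimizes the negated dual objective under the box and equality constraints, then recovers w and b from the optimal α:

```python
# Solve the SVM dual on a tiny separable dataset, then recover w and b
# from the optimal alphas. Toy sketch only; real solvers use SMO/QP.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 10.0

G = (y[:, None] * X) @ (y[:, None] * X).T   # G[i,j] = y_i y_j (x_i . x_j)

def neg_dual(a):                            # minimize the negated dual
    return 0.5 * a @ G @ a - a.sum()

cons = {'type': 'eq', 'fun': lambda a: a @ y}    # sum_i alpha_i y_i = 0
bnds = [(0.0, C)] * len(y)                       # 0 <= alpha_i <= C
res = minimize(neg_dual, np.zeros(len(y)), bounds=bnds, constraints=cons)

alpha = res.x
w = (alpha * y) @ X                          # w = sum_i alpha_i y_i x_i
free = (alpha > 1e-6) & (alpha < C - 1e-6)   # on-margin support vectors
b = np.mean(y[free] - X[free] @ w)           # from y_i (w . x_i + b) = 1
print(w, b)
```

On this dataset the recovered hyperplane separates both classes and the equality constraint ∑ αᵢ yᵢ = 0 holds at the optimum.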

2. KKT Conditions (optimality)

The Karush-Kuhn-Tucker conditions link primal and dual optimal solutions:

  • Primal feasibility: constraints satisfied.
  • Dual feasibility: αᵢ ≥ 0 and μᵢ ≥ 0.
  • Stationarity: gradient of the Lagrangian w.r.t. the primal variables = 0 → yields w = ∑ αᵢ yᵢ xᵢ (from ∂/∂w), ∑ αᵢ yᵢ = 0 (from ∂/∂b), and αᵢ = C − μᵢ (from ∂/∂ξᵢ).
  • Complementary slackness: αᵢ [ yᵢ (w·xᵢ + b) − 1 + ξᵢ ] = 0 and μᵢ ξᵢ = 0.

Complementary slackness implies:

  • If αᵢ > 0 → the corresponding constraint is active: yᵢ (w·xᵢ + b) = 1 − ξᵢ.
  • If 0 < αᵢ < C → ξᵢ = 0 and point lies exactly on margin (distance = 1/||w||).
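Complementary slackness can be observed empirically. The sketch below (on synthetic blob data of my own choosing, with a large C so the separable problem behaves like hard-margin) fits a linear SVM and checks that every support vector with αᵢ < C sits exactly on the margin; note that scikit-learn's dual_coef_ stores yᵢαᵢ, so we take absolute values to get αᵢ:

```python
# Empirical KKT check: for support vectors with alpha_i < C we expect
# xi_i = 0, hence y_i (w . x_i + b) = 1 exactly (up to solver tolerance).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(3, 0.5, (40, 2)), rng.normal(-3, 0.5, (40, 2))])
y = np.array([1] * 40 + [-1] * 40)

C = 100.0
clf = SVC(C=C, kernel='linear').fit(X, y)

alpha = np.abs(clf.dual_coef_[0])            # dual_coef_ holds y_i * alpha_i
margins = y[clf.support_] * clf.decision_function(X[clf.support_])

free = alpha < C - 1e-6                      # 0 < alpha_i < C  ->  xi_i = 0
print(np.abs(margins[free] - 1).max())       # close to 0
```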

3. Kernel Trick

Replace inner product xᵢ·xⱼ by a kernel function K(xᵢ,xⱼ)=φ(xᵢ)·φ(xⱼ) computed directly in input space. Dual becomes:

max_{α}  ∑ αᵢ − (1/2) ∑ αᵢ αⱼ yᵢ yⱼ K(xᵢ,xⱼ)
s.t.  0 ≤ αᵢ ≤ C,  ∑ αᵢ yᵢ = 0

Common kernels:

Linear: K(u,v)=u·v
Polynomial: K(u,v)=(γ u·v + r)^d
RBF / Gaussian: K(u,v)=exp(−γ ||u−v||²)
Sigmoid: K(u,v)=tanh(γ u·v + r)

Use kernels when you suspect a nonlinear decision boundary but want to avoid explicit mapping φ(x).
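The identity K(u,v) = φ(u)·φ(v) can be verified by hand for a small case. For the homogeneous quadratic kernel K(u,v) = (u·v)² in 2-D, a standard explicit feature map is φ(x) = (x₁², √2·x₁x₂, x₂²); the sketch below checks that the kernel computed in input space matches the inner product in feature space:

```python
# Kernel-trick sanity check: (u . v)^2 equals phi(u) . phi(v) for the
# explicit quadratic feature map phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
import numpy as np

def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

k_direct = (u @ v) ** 2        # kernel in input space: O(d) work
k_mapped = phi(u) @ phi(v)     # inner product in 3-D feature space
print(k_direct, k_mapped)      # both equal 1.0
```

The kernel side costs O(d) per pair regardless of the feature-space dimension, which is the whole point: for high-degree polynomial or RBF kernels, φ is huge or infinite-dimensional, but K stays cheap.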

4. Practical Notes

  • Solve the dual via quadratic programming (QP) — many libraries (libsvm, scikit-learn) use SMO or specialized solvers.
  • Scale features — kernels are sensitive to feature scales.
  • Choose C (tradeoff margin vs errors) and kernel hyperparameters (γ, degree) via cross-validation.
  • For large datasets prefer linear SVM (liblinear) or approximate methods / subsampling.
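The scaling and cross-validation advice above combines naturally in a Pipeline, which keeps the scaler fitted only on each training fold. A minimal sketch (the dataset and grid values are illustrative placeholders, not recommendations):

```python
# Scale features inside a Pipeline, then tune C and gamma by
# cross-validated grid search.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
grid = {'svc__C': [0.1, 1, 10], 'svc__gamma': [0.01, 0.1, 1]}

search = GridSearchCV(pipe, grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```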

5. Python sketch (scikit-learn)

# train an SVM with RBF kernel
from sklearn.svm import SVC
clf = SVC(C=1.0, kernel='rbf', gamma='scale')
clf.fit(X_train, y_train)

# support vectors and dual coefficients:
sv = clf.support_vectors_
alphas = clf.dual_coef_   # stores y_i * alpha_i; shape (n_classes-1, n_SV)
intercept = clf.intercept_

Note: scikit-learn hides the QP solver; under the hood, libsvm implements SMO and stores the dual coefficients.
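Those stored coefficients are enough to reproduce predictions by hand: the decision function is f(x) = ∑ᵢ (yᵢαᵢ) K(svᵢ, x) + b, summing over support vectors only. A sketch (on a made-up two-moons dataset) comparing the manual reconstruction against clf.decision_function:

```python
# Rebuild f(x) = sum_i (y_i alpha_i) K(sv_i, x) + b from dual_coef_,
# support_vectors_, and intercept_, and compare with decision_function.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.1, random_state=0)
clf = SVC(C=1.0, kernel='rbf', gamma=0.5).fit(X, y)

K = rbf_kernel(X, clf.support_vectors_, gamma=0.5)   # K(x, sv_i)
f_manual = K @ clf.dual_coef_[0] + clf.intercept_[0]

print(np.max(np.abs(f_manual - clf.decision_function(X))))  # ~0
```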