1. From Primal to Dual (intuition + derivation)
Primal (soft-margin SVM): for training examples (xᵢ, yᵢ), yᵢ∈{−1,+1},
min_{w,b,ξ} (1/2) ||w||² + C ∑_{i} ξᵢ
s.t. yᵢ (w·xᵢ + b) ≥ 1 - ξᵢ, ξᵢ ≥ 0 for all i
Introduce Lagrange multipliers αᵢ ≥ 0 for the margin constraints and μᵢ ≥ 0 for ξᵢ ≥ 0. Solving the stationarity conditions eliminates the primal variables w, b, ξ, which yields the dual:
max_{α} ∑_{i} αᵢ − (1/2) ∑_{i,j} αᵢ αⱼ yᵢ yⱼ (xᵢ·xⱼ)
s.t. 0 ≤ αᵢ ≤ C, ∑_{i} αᵢ yᵢ = 0
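The dual objective above depends on the data only through the Gram matrix, which is easy to verify numerically. A minimal numpy sketch (the vectors `alpha`, `y`, `X` below are illustrative toy values, not from the text):

```python
import numpy as np

def dual_objective(alpha, y, X):
    """Soft-margin SVM dual: sum(alpha) - 0.5 * sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)."""
    G = X @ X.T                             # Gram matrix of inner products x_i . x_j
    Q = (y[:, None] * y[None, :]) * G       # Q_ij = y_i y_j (x_i . x_j)
    return alpha.sum() - 0.5 * alpha @ Q @ alpha

# toy check on two orthogonal unit points (note sum_i alpha_i y_i = 0 holds)
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])
val = dual_objective(alpha, y, X)   # 1.0 - 0.5 * (0.25 + 0.25) = 0.75
```

Note that `dual_objective` never touches individual coordinates of the xᵢ except through `X @ X.T`; this is exactly the property the kernel trick (Section 3) exploits.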
Key observations:
- Only training points appear through inner-products xᵢ·xⱼ (the Gram matrix).
- Support vectors are those with αᵢ > 0; points with 0 < αᵢ < C (free support vectors) lie exactly on the margin, while points with αᵢ = C (bound support vectors) have ξᵢ > 0: they sit inside the margin or are misclassified.
- After solving for α, compute
w = ∑ αᵢ yᵢ xᵢ and find b from any support vector with 0 < αᵢ < C (for such a point, b = yᵢ − w·xᵢ).
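The reconstruction w = ∑ αᵢ yᵢ xᵢ can be checked against scikit-learn's linear SVC: its `dual_coef_` attribute already stores the signed products yᵢαᵢ over the support vectors, so the sum is a single matrix product. A sketch on an assumed toy blob dataset:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# toy separable-ish 2-class data (illustrative, not from the text)
X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.0, random_state=0)
clf = SVC(kernel='linear', C=1.0).fit(X, y)

# w = sum_i alpha_i y_i x_i; dual_coef_ already carries the y_i * alpha_i factor
w = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w, clf.coef_))   # → True
```

For the linear kernel scikit-learn exposes `coef_` computed the same way, which is why the two agree exactly.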
2. KKT Conditions (optimality)
The Karush-Kuhn-Tucker conditions link primal and dual optimal solutions:
- Primal feasibility: constraints satisfied.
- Dual feasibility: αᵢ ≥ 0 and μᵢ ≥ 0.
- Stationarity: the gradient of the Lagrangian w.r.t. each primal variable vanishes → w = ∑ αᵢ yᵢ xᵢ (from w), ∑ αᵢ yᵢ = 0 (from b), and αᵢ + μᵢ = C (from ξᵢ, which gives the box constraint αᵢ ≤ C).
- Complementary slackness: αᵢ [ yᵢ (w·xᵢ + b) − 1 + ξᵢ ] = 0 and μᵢ ξᵢ = 0.
Complementary slackness implies:
- If αᵢ > 0 → the corresponding constraint is active: yᵢ (w·xᵢ + b) = 1 − ξᵢ.
- If 0 < αᵢ < C → μᵢ = C − αᵢ > 0 forces ξᵢ = 0, so the point lies exactly on the margin (distance 1/||w|| from the hyperplane).
- If αᵢ = C → μᵢ = 0, so ξᵢ may be positive: the point is inside the margin or misclassified.
- If αᵢ = 0 → the point lies strictly outside the margin and does not contribute to w.
3. Kernel Trick
Replace the inner product xᵢ·xⱼ with a kernel function K(xᵢ,xⱼ) = φ(xᵢ)·φ(xⱼ) computed directly in input space; any symmetric positive-semidefinite (Mercer) kernel corresponds to some feature map φ. The dual becomes:
max_{α} ∑ αᵢ − (1/2) ∑ αᵢ αⱼ yᵢ yⱼ K(xᵢ,xⱼ)
s.t. 0 ≤ αᵢ ≤ C, ∑ αᵢ yᵢ = 0
Common kernels:
Linear: K(u,v)=u·v
Polynomial: K(u,v)=(γ u·v + r)^d
RBF / Gaussian: K(u,v)=exp(−γ ||u−v||²)
Sigmoid: K(u,v)=tanh(γ u·v + r)
Use kernels when you suspect a nonlinear decision boundary but want to avoid explicit mapping φ(x).
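The "avoid explicit mapping" point can be made concrete with the polynomial kernel: for 2-D inputs, K(u,v) = (u·v + 1)² equals the inner product of an explicit 6-dimensional degree-2 feature map. A small numpy demonstration (the map `phi` below is one standard choice, written out for illustration):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map so that phi(u).phi(v) == (u.v + 1)^2 in 2-D."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])
k_direct = (u @ v + 1.0) ** 2          # kernel evaluated in input space: (1 + 1)^2 = 4
k_mapped = phi(u) @ phi(v)             # same value via the explicit 6-D mapping
print(np.isclose(k_direct, k_mapped))  # → True
```

For an RBF kernel the corresponding φ is infinite-dimensional, so evaluating K directly is not just cheaper but the only option.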
4. Practical Notes
- Solve the dual via quadratic programming (QP) — many libraries (libsvm, scikit-learn) use SMO or specialized solvers.
- Scale features — kernels are sensitive to feature scales.
- Choose C (tradeoff margin vs errors) and kernel hyperparameters (γ, degree) via cross-validation.
- For large datasets prefer a linear SVM (liblinear) or approximate kernel methods (e.g., Nyström, random Fourier features) / subsampling.
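The scaling and hyperparameter advice above fits naturally into one pipeline. A sketch of the workflow with assumed toy data and an illustrative search grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# scaler inside the pipeline so CV folds are scaled without leakage
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
grid = {'svc__C': [0.1, 1, 10], 'svc__gamma': ['scale', 0.01, 0.1]}
search = GridSearchCV(pipe, grid, cv=5).fit(X, y)
best_C = search.best_params_['svc__C']
```

Putting `StandardScaler` inside the pipeline matters: fitting the scaler on the full dataset before cross-validation would leak test-fold statistics into training.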
5. Python sketch (scikit-learn)
# train an SVM with RBF kernel (assumes X_train, y_train are already prepared)
from sklearn.svm import SVC
clf = SVC(C=1.0, kernel='rbf', gamma='scale')
clf.fit(X_train, y_train)
# support vectors and dual coefficients:
sv = clf.support_vectors_      # shape (n_SV, n_features)
duals = clf.dual_coef_         # signed yᵢ αᵢ (not raw αᵢ), shape (n_classes-1, n_SV)
intercept = clf.intercept_     # b
Note: scikit-learn hides the QP solver; under the hood, libsvm implements SMO and stores the dual coefficients as the signed products yᵢ αᵢ.