A kernel function k(x, y) computes an inner product between feature-space mappings of x and y without explicitly computing the mapping:
k(x, y) = ⟨φ(x), φ(y)⟩
The key benefit (the kernel trick) is that many learning algorithms (SVM, kernel ridge, kernel PCA) only need inner products; a kernel lets us operate in a high (possibly infinite) dimensional feature space implicitly and efficiently.
Common kernel functions
| Kernel | Formula | Notes |
|---|---|---|
| Linear | k(x,y) = x·y | Equivalent to no mapping; fast, good baseline. |
| Polynomial | k(x,y) = (α x·y + c)^d | Allows interactions up to degree d. |
| RBF / Gaussian | k(x,y) = exp(−‖x−y‖² / (2σ²)) | Infinite-dimensional; locally sensitive; popular. |
| Sigmoid | k(x,y) = tanh(α x·y + c) | Linked to neural nets; not always positive semidefinite. |
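These kernels are straightforward to compute directly; the sketch below uses NumPy, with illustrative parameter names (alpha, c, d, sigma) that mirror the formulas above:

```python
import numpy as np

def linear_kernel(x, y):
    # k(x, y) = x . y
    return np.dot(x, y)

def polynomial_kernel(x, y, alpha=1.0, c=1.0, d=2):
    # k(x, y) = (alpha * x . y + c)^d
    return (alpha * np.dot(x, y) + c) ** d

def rbf_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)); k(x, x) = 1
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, y, alpha=1.0, c=0.0):
    # k(x, y) = tanh(alpha * x . y + c); not always positive semidefinite
    return np.tanh(alpha * np.dot(x, y) + c)
```

Note that the RBF kernel always returns 1 when x = y, since the squared distance in the exponent vanishes.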
Properties required for valid kernels
Symmetry: k(x,y)=k(y,x).
Positive semidefiniteness: For any finite set {x₁,...,x_n}, the kernel matrix K with Kᵢⱼ=k(xᵢ,xⱼ) must be positive semidefinite (all eigenvalues ≥0). This is Mercer's condition.
Closure properties: Sums, products, and limits of valid kernels are valid kernels.
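Both conditions can be checked numerically on a finite sample: build the kernel matrix, confirm it is symmetric, and confirm its eigenvalues are non-negative. A minimal sketch, using the RBF kernel as one valid choice:

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def gram_matrix(xs, k):
    # K[i, j] = k(x_i, x_j) for a finite set of points
    n = len(xs)
    return np.array([[k(xs[i], xs[j]) for j in range(n)] for i in range(n)])

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))           # five random 3-D points
K = gram_matrix(xs, rbf)

assert np.allclose(K, K.T)             # symmetry
eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() >= -1e-10         # PSD, up to numerical tolerance
```

This is only a necessary check on one sample, not a proof of validity for all inputs.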
Example: Polynomial kernel (simple)
Let x=[x₁,x₂], y=[y₁,y₂], choose d=2, c=1:
k(x,y) = (x·y + 1)² = (x₁y₁ + x₂y₂ + 1)²
This equals the dot product in a 6-dimensional feature space of constant, linear, squared, and cross terms, but we avoid computing φ(x) explicitly.
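The correspondence can be verified by writing out one explicit feature map that realizes this kernel (a sketch; the √2 factors are what make the dot product match the expansion of (x·y + 1)²):

```python
import numpy as np

def phi(v):
    # Explicit 6-D feature map for k(x, y) = (x . y + 1)^2 with 2-D inputs
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])

k_implicit = (np.dot(x, y) + 1.0) ** 2   # kernel trick: O(dim) work
k_explicit = np.dot(phi(x), phi(y))      # explicit mapping: 6-D dot product
assert np.isclose(k_implicit, k_explicit)
```

For higher degrees and dimensions the explicit space grows combinatorially, which is exactly why the implicit computation matters.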
Machine learning usage
SVM: Replace x·y by k(x,y) in the dual formulation to learn nonlinear boundaries.
Kernel PCA: Compute principal components from the kernel matrix to perform nonlinear dimensionality reduction.
Kernel ridge regression: Solve ridge regression in feature space via kernel matrices.
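Kernel ridge regression, for instance, reduces to a single linear solve with the kernel matrix. A minimal sketch assuming an RBF kernel, where lam is the ridge penalty:

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def fit_kernel_ridge(X, y, lam=0.1, sigma=1.0):
    # Solve (K + lam * I) alpha = y for the dual coefficients alpha
    n = len(X)
    K = np.array([[rbf(X[i], X[j], sigma) for j in range(n)] for i in range(n)])
    return np.linalg.solve(K + lam * np.eye(n), y)

def predict(X_train, alpha, x_new, sigma=1.0):
    # f(x) = sum_i alpha_i * k(x_i, x)
    return sum(a * rbf(xi, x_new, sigma) for a, xi in zip(alpha, X_train))
```

The model never forms feature vectors; training cost is governed by the n×n kernel matrix rather than the feature-space dimension.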
Part B — Norms in Linear Algebra
A norm is a function that assigns a non-negative length or size to vectors (and, by extension, matrices). Every norm ‖·‖ satisfies three properties: positive definiteness (‖x‖ ≥ 0, with equality iff x = 0), absolute homogeneity (‖αx‖ = |α| ‖x‖), and the triangle inequality (‖x + y‖ ≤ ‖x‖ + ‖y‖).
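The common vector norms and their defining properties can be checked numerically; a sketch using NumPy's np.linalg.norm:

```python
import numpy as np

x = np.array([3.0, -4.0])
y = np.array([1.0, 2.0])

# Common vector norms
assert np.linalg.norm(x, 1) == 7.0        # l1: |3| + |-4|
assert np.linalg.norm(x, 2) == 5.0        # l2: sqrt(9 + 16)
assert np.linalg.norm(x, np.inf) == 4.0   # l-inf: max_i |x_i|

# Absolute homogeneity and triangle inequality for the l2 norm
assert np.isclose(np.linalg.norm(-2 * x), 2 * np.linalg.norm(x))
assert np.linalg.norm(x + y) <= np.linalg.norm(x) + np.linalg.norm(y)
```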