Optimization is at the core of machine learning. Every learning algorithm (from linear regression to deep neural networks) involves minimizing or maximizing a cost/loss function. Here we explore the most important optimization techniques.
Gradient Descent (GD)
Concept: Moves step by step in the opposite direction of the gradient to minimize the loss function.
Update Rule:
θ = θ - η ∇L(θ), where η is the learning rate.
Importance: Fundamental algorithm used in almost all machine learning models.
Stochastic Gradient Descent (SGD)
Concept: Updates parameters using one random training sample at a time. Faster but noisier updates.
Update Rule: θ = θ - η ∇L(θ; xᵢ, yᵢ)
Importance: Makes optimization feasible for large-scale data.
Mini-Batch Gradient Descent
Concept: A compromise between GD and SGD. Updates are performed using small batches of training data.
Importance: Reduces noise of SGD and computational burden of full GD.
Optimization techniques are the backbone of training machine learning models. These algorithms are used to minimize a loss function (error) by updating the model parameters iteratively. Different algorithms vary in convergence speed, stability, and efficiency for different datasets.
Gradient Descent is the most basic optimization algorithm. It updates parameters in the direction opposite to the gradient of the loss function.
Update Rule: θ = θ - η ∇L(θ)
Application: Used in linear regression, logistic regression, neural networks.
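As a minimal sketch (the function name and toy data are assumptions for illustration), gradient descent can fit a one-variable linear regression by repeatedly applying θ = θ - η ∇L(θ) to the mean squared error:

```python
# Minimal sketch: batch gradient descent for one-variable linear regression.
# Loss: L(w, b) = (1/n) * sum((w*x_i + b - y_i)^2)  (mean squared error).

def gradient_descent_linreg(xs, ys, eta=0.05, steps=500):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Full-batch gradients of the MSE with respect to w and b.
        grad_w = (2.0 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = (2.0 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        # Update rule: theta = theta - eta * grad L(theta)
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b

# Toy data generated from y = 2x + 1, so the optimum is w = 2, b = 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent_linreg(xs, ys)
```

If η is too large the iterates diverge; too small and convergence is slow, which is why the learning rate is the key hyperparameter here.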
Instead of using the entire dataset, SGD updates the parameters using one random sample at a time. This makes it faster but noisier.
Advantage: Faster convergence on large datasets.
Application: Deep learning frameworks like TensorFlow and PyTorch use SGD variants.
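The same toy regression can sketch the stochastic variant: each update uses a single randomly chosen sample rather than the full dataset (function name and data are again hypothetical):

```python
import random

# Minimal sketch of SGD: one randomly drawn sample (x_i, y_i) per update.

def sgd_linreg(xs, ys, eta=0.02, steps=3000, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        i = rng.randrange(len(xs))       # draw one random training sample
        err = w * xs[i] + b - ys[i]      # prediction error on that sample
        # Update rule: theta = theta - eta * grad L(theta; x_i, y_i)
        w -= eta * 2.0 * err * xs[i]
        b -= eta * 2.0 * err
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # exactly y = 2x + 1
w, b = sgd_linreg(xs, ys)
```

Each step costs O(1) instead of O(n), which is the point on large datasets; the price is a noisy trajectory toward the optimum.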
Mini-Batch Gradient Descent is a compromise between Gradient Descent and SGD. It uses small batches of data for each update.
Advantage: Balances efficiency and convergence stability.
Application: Deep learning training pipelines.
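A minimal sketch of the mini-batch variant on the same toy data (batch size and helper name are assumptions): shuffle the indices each epoch, then average the gradient over each small batch.

```python
import random

# Minimal sketch of mini-batch gradient descent with batch_size=2.

def minibatch_gd_linreg(xs, ys, eta=0.05, batch_size=2, epochs=300, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)                 # new random batches every epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # Gradients averaged over the mini-batch only.
            grad_w = sum(2.0 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2.0 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= eta * grad_w
            b -= eta * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # exactly y = 2x + 1
w, b = minibatch_gd_linreg(xs, ys)
```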
Newton's Method uses second-order derivatives (the Hessian matrix) for faster convergence.
Update Rule: θ = θ - H⁻¹ ∇L(θ)
Application: Logistic regression, convex optimization problems.
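A one-dimensional sketch, where the Hessian reduces to the scalar second derivative. The toy convex objective f(x) = eˣ - 2x is an assumption for illustration; its minimum is at x = ln 2.

```python
import math

# One-dimensional Newton's method: x = x - f'(x) / f''(x),
# the scalar analogue of theta = theta - H^{-1} grad L(theta).

def newton_minimize(x0, steps=10):
    x = x0
    for _ in range(steps):
        grad = math.exp(x) - 2.0   # f'(x) for f(x) = exp(x) - 2x
        hess = math.exp(x)         # f''(x) > 0 (convex), so the step is well defined
        x -= grad / hess           # Newton update
    return x

x_min = newton_minimize(0.0)  # converges to ln 2 ≈ 0.6931 in a few steps
```

Note the trade-off: each step is more expensive (in higher dimensions it needs the Hessian and a linear solve), but convergence near the optimum is quadratic rather than linear.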
Momentum accelerates SGD by adding a fraction of the previous update to the current one.
Update Rule: v = βv + η∇L(θ); θ = θ - v
Application: Neural networks (helps escape local minima).
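A minimal sketch of the update rule above (v = βv + η∇L(θ); θ = θ - v) on the toy objective f(x) = x², whose gradient is 2x:

```python
# Momentum: the velocity v accumulates a decaying sum of past gradients.

def momentum_minimize(x0, eta=0.1, beta=0.9, steps=300):
    x, v = x0, 0.0
    for _ in range(steps):
        grad = 2.0 * x             # gradient of f(x) = x^2
        v = beta * v + eta * grad  # v = beta*v + eta*grad L(theta)
        x = x - v                  # theta = theta - v
    return x

x_min = momentum_minimize(5.0)  # approaches the minimum at x = 0
```

With β = 0 this reduces exactly to plain gradient descent; larger β lets past gradients carry the iterate through flat regions and shallow local minima.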
AdaGrad adapts the learning rate for each parameter based on its past gradients.
Advantage: Works well with sparse data.
Application: NLP problems like word embeddings.
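A minimal sketch of AdaGrad on the toy objective f(x) = x² (a single parameter; with many parameters each one keeps its own accumulator):

```python
import math

# AdaGrad: the effective step size shrinks as squared gradients accumulate.

def adagrad_minimize(x0, eta=1.0, steps=500, eps=1e-8):
    x = x0
    g2_sum = 0.0                   # running sum of squared gradients
    for _ in range(steps):
        grad = 2.0 * x             # gradient of f(x) = x^2
        g2_sum += grad * grad
        # Per-parameter scaled step: eta / sqrt(accumulated squared grads).
        x -= eta * grad / (math.sqrt(g2_sum) + eps)
    return x

x_min = adagrad_minimize(5.0)
```

This per-parameter scaling is why AdaGrad suits sparse data: rarely updated parameters keep a small accumulator and therefore a large effective learning rate.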
Adam (Adaptive Moment Estimation) combines Momentum and RMSProp. It is the most popular optimization algorithm for deep learning.
Update Rule: Maintains exponential moving averages of the gradient (first moment) and the squared gradient (second moment), applies bias correction to both, and scales each parameter's step accordingly.
Application: Training deep neural networks, CNNs, RNNs.
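A minimal sketch of Adam on the toy objective f(x) = x² (hyperparameter values follow the common defaults, except the step size, which is an assumption here):

```python
import math

# Adam: m is a momentum-like average of gradients, v an RMSProp-like
# average of squared gradients; both are bias-corrected before the update.

def adam_minimize(x0, eta=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2.0 * x
        m = beta1 * m + (1 - beta1) * grad          # first moment estimate
        v = beta2 * v + (1 - beta2) * grad * grad   # second moment estimate
        m_hat = m / (1 - beta1 ** t)                # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= eta * m_hat / (math.sqrt(v_hat) + eps)
    return x

x_min = adam_minimize(5.0)  # moves toward the minimum at x = 0
```

The bias correction matters early on: without it, m and v start near zero and the first steps would be far too small.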
Exercise: Try updating x step by step using Gradient Descent for f(x) = x².
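One way to sketch this exercise in code, using f'(x) = 2x:

```python
# Gradient descent on f(x) = x^2: gradient f'(x) = 2x, minimum at x = 0.

def gd_quadratic(x0, eta=0.1, steps=50):
    x = x0
    for _ in range(steps):
        x -= eta * 2.0 * x   # x = x - eta * f'(x)
    return x

x_min = gd_quadratic(5.0)  # each step multiplies x by (1 - 2*eta) = 0.8
```

Try printing x inside the loop for different values of η: for 0 < η < 1 the iterates shrink toward 0, and for η > 1 they overshoot and diverge.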