Optimization is at the core of machine learning. Every learning algorithm (from linear regression to deep neural networks) involves minimizing or maximizing a cost/loss function. Here we explore the most important optimization techniques.
Gradient Descent (GD)
Concept: Moves step by step in the opposite direction of the gradient to minimize the loss function.
Update Rule:
θ = θ - η ∇L(θ), where η is the learning rate.
Importance: Fundamental algorithm used in almost all machine learning models.
Stochastic Gradient Descent (SGD)
Concept: Updates parameters using one random training sample at a time. Faster but noisier updates.
Update Rule: θ = θ - η ∇L(θ; xᵢ, yᵢ)
Importance: Makes optimization feasible for large-scale data.
Mini-Batch Gradient Descent
Concept: A compromise between GD and SGD. Updates are performed using small batches of training data.
Importance: Reduces noise of SGD and computational burden of full GD.
Optimization techniques are the backbone of training machine learning models. These algorithms are used to minimize a loss function (error) by updating the model parameters iteratively. Different algorithms vary in convergence speed, stability, and efficiency for different datasets.
Gradient Descent is the most basic optimization algorithm. It updates parameters in the direction opposite to the gradient of the loss function.
Update Rule: θ = θ - η ∇L(θ)
Application: Used in linear regression, logistic regression, neural networks.
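As a minimal sketch (the function name and toy data are assumptions for illustration), gradient descent can fit a one-variable linear regression by repeatedly applying θ = θ - η ∇L(θ) to the mean squared error:

```python
# Minimal sketch: batch gradient descent for one-variable linear regression.
# Loss: L(w, b) = (1/n) * sum((w*x_i + b - y_i)^2)  (mean squared error).

def gradient_descent_linreg(xs, ys, eta=0.05, steps=500):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Full-batch gradients of the MSE with respect to w and b.
        grad_w = (2.0 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = (2.0 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        # Update rule: theta = theta - eta * grad L(theta)
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b

# Toy data generated from y = 2x + 1, so the optimum is w = 2, b = 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent_linreg(xs, ys)
```

If η is too large the iterates diverge; too small and convergence is slow, which is why the learning rate is the key hyperparameter here.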
Instead of using the entire dataset, SGD updates the parameters using one random sample at a time. This makes it faster but noisier.
Advantage: Faster convergence on large datasets.
Application: Deep learning frameworks like TensorFlow and PyTorch use SGD variants.
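The same toy regression can sketch the stochastic variant: each update uses a single randomly chosen sample rather than the full dataset (function name and data are again hypothetical):

```python
import random

# Minimal sketch of SGD: one randomly drawn sample (x_i, y_i) per update.

def sgd_linreg(xs, ys, eta=0.02, steps=3000, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        i = rng.randrange(len(xs))       # draw one random training sample
        err = w * xs[i] + b - ys[i]      # prediction error on that sample
        # Update rule: theta = theta - eta * grad L(theta; x_i, y_i)
        w -= eta * 2.0 * err * xs[i]
        b -= eta * 2.0 * err
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # exactly y = 2x + 1
w, b = sgd_linreg(xs, ys)
```

Each step costs O(1) instead of O(n), which is the point on large datasets; the price is a noisy trajectory toward the optimum.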
Mini-Batch Gradient Descent is a compromise between Gradient Descent and SGD. It uses small batches of data for each update.
Advantage: Balances efficiency and convergence stability.
Application: Deep learning training pipelines.
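A minimal sketch of the mini-batch variant on the same toy data (batch size and helper name are assumptions): shuffle the indices each epoch, then average the gradient over each small batch.

```python
import random

# Minimal sketch of mini-batch gradient descent with batch_size=2.

def minibatch_gd_linreg(xs, ys, eta=0.05, batch_size=2, epochs=300, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)                 # new random batches every epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # Gradients averaged over the mini-batch only.
            grad_w = sum(2.0 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2.0 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= eta * grad_w
            b -= eta * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # exactly y = 2x + 1
w, b = minibatch_gd_linreg(xs, ys)
```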
Newton's Method uses second-order derivatives (the Hessian matrix) for faster convergence.
Update Rule: θ = θ - H⁻¹ ∇L(θ)
Application: Logistic regression, convex optimization problems.
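A one-dimensional sketch, where the Hessian reduces to the scalar second derivative. The toy convex objective f(x) = eˣ - 2x is an assumption for illustration; its minimum is at x = ln 2.

```python
import math

# One-dimensional Newton's method: x = x - f'(x) / f''(x),
# the scalar analogue of theta = theta - H^{-1} grad L(theta).

def newton_minimize(x0, steps=10):
    x = x0
    for _ in range(steps):
        grad = math.exp(x) - 2.0   # f'(x) for f(x) = exp(x) - 2x
        hess = math.exp(x)         # f''(x) > 0 (convex), so the step is well defined
        x -= grad / hess           # Newton update
    return x

x_min = newton_minimize(0.0)  # converges to ln 2 ≈ 0.6931 in a few steps
```

Note the trade-off: each step is more expensive (in higher dimensions it needs the Hessian and a linear solve), but convergence near the optimum is quadratic rather than linear.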
Momentum accelerates SGD by adding a fraction of the previous update to the current one.
Update Rule: v = βv + η∇L(θ); θ = θ - v
Application: Neural networks (helps escape local minima).
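A minimal sketch of the update rule above (v = βv + η∇L(θ); θ = θ - v) on the toy objective f(x) = x², whose gradient is 2x:

```python
# Momentum: the velocity v accumulates a decaying sum of past gradients.

def momentum_minimize(x0, eta=0.1, beta=0.9, steps=300):
    x, v = x0, 0.0
    for _ in range(steps):
        grad = 2.0 * x             # gradient of f(x) = x^2
        v = beta * v + eta * grad  # v = beta*v + eta*grad L(theta)
        x = x - v                  # theta = theta - v
    return x

x_min = momentum_minimize(5.0)  # approaches the minimum at x = 0
```

With β = 0 this reduces exactly to plain gradient descent; larger β lets past gradients carry the iterate through flat regions and shallow local minima.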
AdaGrad adapts the learning rate for each parameter based on its past gradients.
Advantage: Works well with sparse data.
Application: NLP problems like word embeddings.
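A minimal sketch of AdaGrad on the toy objective f(x) = x² (a single parameter; with many parameters each one keeps its own accumulator):

```python
import math

# AdaGrad: the effective step size shrinks as squared gradients accumulate.

def adagrad_minimize(x0, eta=1.0, steps=500, eps=1e-8):
    x = x0
    g2_sum = 0.0                   # running sum of squared gradients
    for _ in range(steps):
        grad = 2.0 * x             # gradient of f(x) = x^2
        g2_sum += grad * grad
        # Per-parameter scaled step: eta / sqrt(accumulated squared grads).
        x -= eta * grad / (math.sqrt(g2_sum) + eps)
    return x

x_min = adagrad_minimize(5.0)
```

This per-parameter scaling is why AdaGrad suits sparse data: rarely updated parameters keep a small accumulator and therefore a large effective learning rate.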
Adam (Adaptive Moment Estimation) combines Momentum and RMSProp. It is the most popular optimization algorithm for deep learning.
Update Rule: Maintains exponential moving averages of the gradient (first moment) and the squared gradient (second moment), applies bias correction to both, and scales each parameter's step accordingly.
Application: Training deep neural networks, CNNs, RNNs.
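A minimal sketch of Adam on the toy objective f(x) = x² (hyperparameter values follow the common defaults, except the step size, which is an assumption here):

```python
import math

# Adam: m is a momentum-like average of gradients, v an RMSProp-like
# average of squared gradients; both are bias-corrected before the update.

def adam_minimize(x0, eta=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2.0 * x
        m = beta1 * m + (1 - beta1) * grad          # first moment estimate
        v = beta2 * v + (1 - beta2) * grad * grad   # second moment estimate
        m_hat = m / (1 - beta1 ** t)                # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= eta * m_hat / (math.sqrt(v_hat) + eps)
    return x

x_min = adam_minimize(5.0)  # moves toward the minimum at x = 0
```

The bias correction matters early on: without it, m and v start near zero and the first steps would be far too small.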
Exercise: Try updating x step by step using Gradient Descent for f(x) = x².
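One way to sketch this exercise in code, using f'(x) = 2x:

```python
# Gradient descent on f(x) = x^2: gradient f'(x) = 2x, minimum at x = 0.

def gd_quadratic(x0, eta=0.1, steps=50):
    x = x0
    for _ in range(steps):
        x -= eta * 2.0 * x   # x = x - eta * f'(x)
    return x

x_min = gd_quadratic(5.0)  # each step multiplies x by (1 - 2*eta) = 0.8
```

Try printing x inside the loop for different values of η: for 0 < η < 1 the iterates shrink toward 0, and for η > 1 they overshoot and diverge.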