Gradient Update Rules: From Vanilla Gradient Descent to Modern Optimizers

Updated: 2026-02-17

Gradient descent and its cousins form the backbone of modern deep learning. Whether you’re training a small convolutional neural network (CNN) or a gigantic transformer, the way you update parameters directly determines convergence speed, stability of learning, and ultimately the quality of the model’s predictions. Understanding both the theory behind these update rules and how they behave in practice can transform a “good” model into a great one.

In this article we will:

  1. Uncover the mathematical foundations of gradient update rules in the context of convex and non‑convex optimization.
  2. Map the evolution of optimizers from basic batch gradient descent to sophisticated adaptive algorithms like Adam and Nadam.
  3. Break down key factors (learning rate, momentum, batch size, regularization) that influence training dynamics.
  4. Present actionable best‑practice recommendations for choosing, tuning, and troubleshooting optimizers.
  5. Illustrate the practical impact with real‑world case studies across computer vision, natural language processing (NLP), and reinforcement learning (RL).

By blending theory, hands‑on examples, and industry guidance, this guide offers an authoritative blueprint that is both approachable for newcomers and insightful for experienced practitioners. Let’s dive in.


1. The Gradient Descent Landscape

1.1 The Core Idea

At its heart, gradient descent aims to minimize a loss function ( L(\theta) ) by iteratively moving the parameter vector ( \theta ) opposite to its gradient:

[ \theta_{t+1} = \theta_t - \alpha \nabla_{\theta} L(\theta_t) ]

  • ( \theta_t ): Parameters at iteration ( t ).
  • ( \alpha ): Learning rate (step size).
  • ( \nabla_{\theta} L(\theta_t) ): Gradient of the loss w.r.t. parameters.

Each update is a step along the steepest descent direction in parameter space, scaled by ( \alpha ).
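As a sanity check, the update rule above can be run directly on a toy convex problem. The following NumPy sketch (the function name and hyperparameters are our own, purely illustrative choices) minimizes ( L(\theta) = \lVert \theta - 3 \rVert^2 ):

```python
import numpy as np

def gradient_descent(grad_fn, theta0, alpha=0.1, steps=100):
    """Plain gradient descent: theta <- theta - alpha * grad."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - alpha * grad_fn(theta)
    return theta

# L(theta) = ||theta - 3||^2 has gradient 2 * (theta - 3),
# so the iterates should converge to theta = 3.
theta_star = gradient_descent(lambda th: 2 * (th - 3.0), theta0=[0.0, 0.0])
```

With ( \alpha = 0.1 ) each step shrinks the error by a factor of 0.8, so a hundred iterations land essentially on the optimum; push ( \alpha ) past 1.0 on this problem and the iterates diverge, which is exactly the learning-rate sensitivity discussed throughout this guide.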

1.2 Convex vs. Non‑Convex

  • Convex problems: A unique global optimum exists. Classic optimization theory guarantees convergence for appropriate ( \alpha ).
  • Non‑convex problems (typical in deep learning): Multiple local minima, saddle points, and flat regions abound. Gradient descent can get trapped, but stochasticity helps escape sub‑optimal landscapes.

Key takeaway: In deep learning we rely on the stochastic variant, where gradients are computed on mini‑batches rather than the full dataset, introducing noise that encourages exploration of parameter space.


2. Classical Optimizers: The Foundation

2.1 Batch Gradient Descent (BGD)

Definition: Compute the gradient over the entire training set ( E ):

[ \theta_{t+1} = \theta_t - \alpha \frac{1}{|E|} \sum_{i=1}^{|E|} \nabla_{\theta} L(\theta_t; x_i) ]

  • Pros: Deterministic, smooth convergence trajectory.
  • Cons: Infeasible for large datasets, high memory use, and every update requires a full pass over the training data.

2.2 Stochastic Gradient Descent (SGD)

Definition: Update parameters using a single training example ( x_i ):

[ \theta_{t+1} = \theta_t - \alpha \nabla_{\theta} L(\theta_t; x_i) ]

  • Pros: Computationally cheap, memory efficient, introduces beneficial noise.
  • Cons: Highly noisy updates can lead to erratic training; sensitive to learning rate.

2.3 Mini‑Batch Gradient Descent

A compromise: use a small subset ( B ) (size ( m )):

[ \theta_{t+1} = \theta_t - \alpha \frac{1}{m} \sum_{x \in B} \nabla_{\theta} L(\theta_t; x) ]

  • Typical mini‑batch size: 32, 64, 128, or 256.
  • Practical effect: Provides a balance between variance reduction (larger batches) and computational efficiency (smaller batches).
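To make the mini-batch recipe concrete, here is a self-contained NumPy sketch (synthetic data; the learning rate and batch size are illustrative) that fits a linear model with shuffled mini-batches of size 64:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=512)   # noisy linear targets

w = np.zeros(3)
alpha, m = 0.1, 64                              # step size, mini-batch size
for epoch in range(50):
    perm = rng.permutation(len(X))              # reshuffle every epoch
    for start in range(0, len(X), m):
        idx = perm[start:start + m]
        Xb, yb = X[idx], y[idx]
        grad = 2 / m * Xb.T @ (Xb @ w - yb)     # mean-squared-error gradient
        w -= alpha * grad                       # mini-batch update from above
```

Each inner step is exactly the formula above: the gradient averaged over one subset ( B ), applied with step size ( \alpha ).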

3. Momentum: Accelerating the Journey

3.1 Why Momentum?

Gradient updates can stall on plateaus or zig-zag across the steep walls of ravines. Momentum blends past update directions into the current one to smooth the trajectory.

3.2 The Classical Momentum Update

[ \begin{aligned} v_{t+1} &= \beta v_t + (1 - \beta) \nabla_{\theta} L(\theta_t) \\ \theta_{t+1} &= \theta_t - \alpha v_{t+1} \end{aligned} ]

  • ( v_t ): Velocity (momentum term).
  • ( \beta ): Momentum coefficient (typically 0.9).

Result: Faster convergence, less sensitivity to learning rate choice than plain SGD.
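The two-line update above translates directly into code. A minimal NumPy sketch, using the same exponential-moving-average form of the velocity as the equations, applied to ( L(\theta) = \theta^2 ):

```python
import numpy as np

def momentum_step(theta, v, grad, alpha=0.1, beta=0.9):
    """One classical-momentum update (EMA form of the velocity)."""
    v = beta * v + (1 - beta) * grad    # blend old direction with new gradient
    theta = theta - alpha * v           # step along the smoothed direction
    return theta, v

theta, v = np.array([5.0]), np.zeros(1)
for _ in range(300):
    theta, v = momentum_step(theta, v, 2 * theta)  # gradient of theta^2
```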

3.3 Nesterov Accelerated Gradient (NAG)

NAG improves upon classical momentum by looking ahead:

[ \begin{aligned} v_{t+1} &= \beta v_t - \alpha \nabla_{\theta} L\big(\theta_t + \beta v_t\big) \\ \theta_{t+1} &= \theta_t + v_{t+1} \end{aligned} ]


  • Benefit: Provides a corrective term that anticipates the next position, thereby yielding sharper updates.
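The look-ahead is the only change relative to classical momentum: the gradient is evaluated at ( \theta_t + \beta v_t ) instead of ( \theta_t ). A minimal sketch of the update as written above, again on ( L(\theta) = \theta^2 ):

```python
import numpy as np

def nag_step(theta, v, grad_fn, alpha=0.1, beta=0.9):
    """Nesterov step: gradient evaluated at the look-ahead point."""
    v = beta * v - alpha * grad_fn(theta + beta * v)
    theta = theta + v
    return theta, v

theta, v = np.array([5.0]), np.zeros(1)
for _ in range(200):
    theta, v = nag_step(theta, v, lambda th: 2 * th)  # gradient of theta^2
```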

4. Adapting the Step Size: From Adagrad to Adam

Modern deep learning thrives on the ability to adapt learning rates per parameter. Let’s traverse the road from Adagrad to the popular Adam algorithm.

4.1 Adagrad: Element‑Wise Learning Rate Decay

Update rule:

[ \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{G_t + \epsilon}} \odot \nabla_{\theta} L(\theta_t) ]

  • ( G_t ): Accumulated squared gradients up to time ( t ).
  • ( \epsilon ): Small constant for numerical stability.

Insight: Features that occur frequently get smaller learning rates over time; rare features receive larger steps.

Limitations:

  • Rapid learning rate decay, often leading to premature stagnation.
  • Typically unsuitable for training deep neural networks.
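Despite these limitations, Adagrad itself is only a few lines. In the sketch below, the accumulator ( G_t ) is the running sum of squared gradients from the update rule (the learning rate and test problem are illustrative):

```python
import numpy as np

def adagrad_step(theta, G, grad, alpha=0.5, eps=1e-8):
    """Adagrad: per-parameter steps shrink as squared gradients accumulate."""
    G = G + grad ** 2                                # running sum, never decays
    theta = theta - alpha / np.sqrt(G + eps) * grad  # element-wise scaled step
    return theta, G

theta, G = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(500):
    theta, G = adagrad_step(theta, G, 2 * theta)     # gradient of ||theta||^2
```

Because ( G_t ) only ever grows, the effective step size decays monotonically, which is precisely the premature-stagnation risk noted above.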

4.2 RMSProp: Mitigating Adagrad’s Decay

RMSProp uses an exponential moving average of squared gradients:

[ E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma)(\nabla_{\theta} L(\theta_t))^2 ]

Update rule becomes:

[ \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{E[g^2]_t + \epsilon}} \odot \nabla_{\theta} L(\theta_t) ]

  • ( \gamma ): Decay rate (commonly 0.9).
  • Effect: Prevents learning rates from vanishing too quickly.
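Swapping Adagrad's running sum for the exponential moving average gives RMSProp; compared with the Adagrad sketch, only the ( E[g^2]_t ) line changes (values below are illustrative):

```python
import numpy as np

def rmsprop_step(theta, Eg2, grad, alpha=0.01, gamma=0.9, eps=1e-8):
    """RMSProp: EMA of squared gradients instead of a running sum."""
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2      # decaying average
    theta = theta - alpha / np.sqrt(Eg2 + eps) * grad
    return theta, Eg2

theta, Eg2 = np.array([5.0]), np.zeros(1)
for _ in range(2000):
    theta, Eg2 = rmsprop_step(theta, Eg2, 2 * theta)  # gradient of theta^2
```

Because the average decays, old gradients are forgotten and the effective learning rate no longer shrinks toward zero, mirroring the bullet above.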

4.3 Adam: The Gold Standard

Adam combines momentum and RMSProp.

  1. First‑moment estimate (mean gradient):

[ m_t = \beta_1 m_{t-1} + (1-\beta_1) \nabla_{\theta} L(\theta_t) ]

  2. Second‑moment estimate (uncentered variance):

[ v_t = \beta_2 v_{t-1} + (1-\beta_2) (\nabla_{\theta} L(\theta_t))^2 ]

  3. Bias‑corrected estimates:

[ \hat{m}_t = \frac{m_t}{1 - \beta_1^t} , \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} ]

  4. Parameter update:

[ \theta_{t+1} = \theta_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} ]

  • Default hyperparameters: ( \beta_1=0.9 ), ( \beta_2=0.999 ), ( \epsilon=10^{-8} ).
  • Pros: Works well out of the box for many architectures, converges faster, handles sparse data.
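The four steps above fold into a single function. A minimal NumPy sketch (note that ( t ) starts at 1 so the bias corrections are well defined; the test problem and ( \alpha ) are illustrative):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, alpha=0.1,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias correction (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 501):
    theta, m, v = adam_step(theta, m, v, 2 * theta, t)  # gradient of theta^2
```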

4.4 Nadam: Nesterov‑Adam Hybrid

Adding Nesterov momentum to Adam yields Nadam. Its update rule for ( m_t ) uses NAG’s anticipation before bias correction, often improving convergence on extremely deep models (e.g., ResNet‑50 or GPT‑2).


5. Key Hyperparameters: The Art of Tuning

Optimizers expose several tunable knobs. Below we distill each into a concise “impact map” and present a rule‑of‑thumb for each.

| Factor | Effect | Typical Range | Practical Tips |
| --- | --- | --- | --- |
| Learning rate ( \alpha ) | Size of each step | 1e‑2–1e‑1 (SGD); ~1e‑3 (Adam) | Use learning‑rate schedules; start higher, anneal later |
| Batch size | Variance of gradients | 32, 64, 128, 256 | Smaller = noisier but can speed up convergence; too large may over‑smooth |
| Momentum ( \beta ) | Smooths updates | 0.9–0.99 | Default 0.9; NAG often slightly better on very deep models |
| Weight decay | L2‑style regularization | 1e‑4–1e‑3 | Combine with the optimizer to curb over‑fitting |
| Learning‑rate warm‑up | Mitigates initial divergence | 0–5 epochs | Ramp ( \alpha ) linearly in the early stages |
| Gradient clipping | Controls exploding gradients | 0.1–5 (max norm) | Especially vital in recurrent neural networks (RNNs) and RL |

Actionable rule: When switching optimizers, do not carry every hyperparameter over unchanged: start from the new algorithm's recommended defaults, then tune momentum or the adaptive coefficients. In particular, do not blindly increase ( \alpha ) when moving to Adam, as its adaptive per‑parameter rates already compensate.
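To make the gradient‑clipping row concrete, here is a minimal global‑norm clipping helper in NumPy (the function name and the 1e‑12 guard against division by zero are our own choices, not tied to any framework):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their combined L2 norm is at most max_norm.
    A no-op when the gradients are already within bounds."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

# A gradient with global norm 5 clipped to norm 1 keeps its direction.
clipped, norm = clip_by_global_norm([np.array([3.0]), np.array([4.0])], 1.0)
```

Framework equivalents exist (e.g., PyTorch's `torch.nn.utils.clip_grad_norm_`), but the arithmetic is exactly this: compute one norm over all parameter gradients, then rescale uniformly.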


6. Training Dynamics in Practice

6.1 The Role of Batch Size

  • Very small batches (1–8): Highly noisy, may generalize better but slower convergence.
  • Larger batches (512–2048): Faster computation but risk of sharp minima and poorer generalization.
  • Recent findings: Training “large‑batch” models with SGD may require explicit noise injections or learning‑rate scaling (e.g., linear scaling rule).

6.2 Learning‑Rate Scheduling

  • Step decay: Drop ( \alpha ) by a factor (e.g., 10×) at predetermined epochs.
  • Cosine annealing: Smooth cosine curve; avoids abrupt drops.
  • Warm restarts: Combine with cosine schedule to re‑inject high learning rates.

Practical recipe:

# PyTorch cosine-annealing schedule; assumes `optimizer`, `train`,
# and `num_epochs` are defined elsewhere.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=200, eta_min=1e-5)
for epoch in range(num_epochs):
    train(...)           # one pass over the training data
    scheduler.step()     # anneal the learning rate once per epoch

6.3 Regularization Through Optimizer Choices

  • Adam with weight decay (AdamW): Separates weight decay from the adaptive learning rate, leading to better regularization.
  • SGD with L2: Classic combination; often generalizes better on ImageNet‑scale vision tasks, especially when fine‑tuning pre‑trained models.
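The difference between these two bullets is where the decay term enters the update. In Adam with L2, the decay is added to the gradient and therefore gets rescaled by ( 1/(\sqrt{\hat{v}_t} + \epsilon) ); in AdamW it is applied to the weights directly. A minimal sketch of the decoupled form (hyperparameters and the test problem are illustrative):

```python
import numpy as np

def adamw_step(theta, m, v, grad, t, alpha=0.1, beta1=0.9,
               beta2=0.999, eps=1e-8, wd=0.01):
    """AdamW: weight decay acts on the weights directly, so it is
    not rescaled by the adaptive 1/sqrt(v_hat) factor."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v

theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 501):
    theta, m, v = adamw_step(theta, m, v, 2 * theta, t)  # gradient of theta^2
```

In PyTorch the same distinction is the choice between `torch.optim.Adam(..., weight_decay=...)` and `torch.optim.AdamW(...)`.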

7. Choosing the Right Optimizer: A Practical Decision Tree

Below is a distilled decision path that mirrors typical production scenarios.

  1. Dataset Size & Memory Constraints?

    • Large → Mini‑batch SGD or Adam with small batches.
    • Small → Batch gradient descent (if computation allows).
  2. Model Depth & Complexity?

    • Very deep (e.g., > 50 layers) → AdamW or RMSProp.
    • Moderate → SGD + momentum or NAG.
  3. Require Fast Convergence?

    • Yes → Adam or Nadam.
    • No (focus on interpretability or small models) → SGD + momentum.
  4. Concerned about Generalization?

    • Yes → Use SGD + momentum with cyclical learning rates or SGD‑WD (SGD with weight decay).
    • No → Adam.
  5. Training on GPUs / TPUs?

    • Large batch training → AdamW or SGD with linear scaling rule.
    • Memory‑aware → Adam with gradient clipping.

Case study: The Vision Transformer (ViT) authors trained with Adam‑family optimizers, a learning‑rate warm‑up followed by decay, and very large batch sizes, achieving state‑of‑the‑art ImageNet performance in fewer epochs than earlier CNNs trained with plain SGD.


8. Troubleshooting Common Optimizer Failures

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| Training stalls immediately | Learning rate too high; weight decay too aggressive | Reduce ( \alpha ); lower weight decay |
| Loss oscillates wildly | High gradient variance (batch size too small) | Increase batch size; add momentum |
| Convergence too slow | Learning rate too small; no warm‑up | Use AdamW with a higher base ( \alpha ); add learning‑rate warm‑up |
| Over‑fitting | Insufficient regularization; learning rate too high | Add dropout or weight decay; use early stopping |
| Excessive memory consumption | Batch gradient descent on big data | Switch to mini‑batch SGD; consider gradient accumulation |
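The gradient‑accumulation fix in the last row relies on a simple identity: averaging the gradients of ( k ) equal‑size micro‑batches equals the gradient of the full batch, so memory use drops with no change to the update. A NumPy check (synthetic data; the helper name is our own):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 4))
w_true = rng.normal(size=4)
y = X @ w_true

def mse_grad(Xb, yb, w):
    """Gradient of mean squared error for a linear model."""
    return 2 / len(Xb) * Xb.T @ (Xb @ w - yb)

w0 = np.zeros(4)
g_full = mse_grad(X, y, w0)                    # one large-batch gradient
g_accum = np.mean(                             # four accumulated micro-batches
    [mse_grad(X[i::4], y[i::4], w0) for i in range(4)], axis=0)
```

The two gradients coincide, which is why frameworks can accumulate micro‑batch gradients over several backward passes before a single optimizer step.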

9. Real‑World Impact: Case Studies

9.1 Computer Vision – Object Detection

| Approach | Optimizer | Batch Size | Epochs | mAP (mean average precision) |
| --- | --- | --- | --- | --- |
| Vanilla SGD | SGD + momentum, ( \beta=0.9 ) | 32 | 50 | 75.2% |
| AdamW | AdamW, weight decay 0.01 | 64 | 30 | 78.9% |
| AdamW + warm‑up | AdamW, ( \alpha = 1e‑4 ), 5‑epoch warm‑up | 64 | 30 | 78.4% |

Observation: Switching from classic SGD with momentum to AdamW at a moderate batch size cut the epoch budget by 40 % (50 → 30 epochs) while raising mAP by 3.7 points (75.2 % → 78.9 %).

9.2 NLP – Language Modeling

| Approach | Optimizer | Scheduler | Training Steps | Perplexity |
| --- | --- | --- | --- | --- |
| Baseline | AdamX (Adam with noisy L2) | Piecewise constant | 500k | 22.3 |
| Fast | AdamW + One‑Cycle LR | Cyclical | 300k | 20.1 |
| Robust | SGD + NAG + weight decay | Cosine annealing | 500k | 21.4 |

The One‑Cycle learning‑rate schedule dramatically reduced time to converge, and the use of AdamW prevented over‑fitting by automatically adjusting per‑parameter learning rates.

9.3 Reinforcement Learning – Policy Optimization

| Agent | Optimizer | Hyperparameters | Sample Efficiency |
| --- | --- | --- | --- |
| Proximal Policy Optimization (PPO) | SGD + clipping | ( \alpha=3e‑4 ), ( \beta=0.9 ) | Standard |
| PPO‑Adam | Adam | ( \alpha=1e‑4 ), ( \beta_1=0.9 ), ( \beta_2=0.999 ) | ≈30 % more sample‑efficient |
| PPO‑RMSProp | RMSProp | ( \gamma=0.9 ), ( \alpha=1e‑4 ) | Comparable |

In policy gradient methods, the stochastic nature of the environment injects additional variance. Adaptive optimizers (Adam, RMSProp) help navigate the noisy gradients, yielding better sample efficiency.


10. Advanced Topics and Future Directions

10.1 Learning‑Rate Warmup & Linear Scaling

For very large batch sizes (> 1024) scaling the learning rate proportionally (linear scaling rule) helps maintain stability:

[ \alpha_{\text{effective}} = \alpha_{\text{base}} \times \frac{m_{\text{batch}}}{m_{\text{ref}}} ]

A common choice for the reference batch size ( m_{\text{ref}} ) is 256. Warm up over the first few epochs to avoid abrupt gradient spikes.
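The scaling rule and warm‑up combine into a small schedule helper. A sketch (the function names are our own; the 256 reference follows the text above, everything else is illustrative):

```python
def scaled_lr(base_lr, batch_size, ref_batch=256):
    """Linear scaling rule: alpha_eff = alpha_base * m_batch / m_ref."""
    return base_lr * batch_size / ref_batch

def warmup_lr(step, warmup_steps, target_lr):
    """Ramp the learning rate linearly from near zero up to target_lr."""
    return target_lr * min(1.0, (step + 1) / warmup_steps)

# Batch 1024 against a 256 reference quadruples a base rate of 0.1 to 0.4.
lr_target = scaled_lr(0.1, 1024)
```

In practice one warms up toward the already‑scaled rate, i.e., `warmup_lr(step, warmup_steps, scaled_lr(...))`, then hands off to the main decay schedule.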

10.2 LAMB & LARS – Layer‑wise Adaptive Rates

Both LAMB (Layer‑wise Adaptive Moments for Batch training) and LARS (Layer‑wise Adaptive Rate Scaling) are tailored for training at the massive batch sizes typical of large‑scale pre‑training such as BERT. They combine adaptive per‑parameter updates with layer‑specific trust‑ratio scaling.

10.3 Meta‑Optimization

Meta‑learning techniques (e.g., MAML) or hypergradient variants of Adam can discover good optimizer hyperparameters automatically, which is particularly useful for non‑stationary objective landscapes.


11. Summary Cheat Sheet

  • Base optimizers:

    • SGD + momentum / NAG: Great for moderate complexity models; offers best generalization when tuned.
    • AdamW / Adam: Fast convergence; handles sparse data and very deep architectures.
    • RMSProp: Good when gradients are highly sparse or when training RNNs.
    • Nadam: Nesterov‑Adam hybrid; use on extremely deep networks.
  • Hyperparameter defaults:

    • ( \alpha \approx 1e‑3 ) (Adam); ( 1e‑2 )–( 1e‑1 ) (SGD + momentum).
    • Momentum ( \beta \approx 0.9 ).
    • Weight decay ( 1e‑4 ).
  • Schedules:

    • Linear warm‑up first, then cosine annealing.
    • For pre‑trained fine‑tuning, use One‑Cycle or Cyclical LR.
  • Gradient clipping:

    • Apply in RNNs, LSTM‑based policy networks, and sometimes in large‑scale Transformers to avoid gradient explosions.
  • Regularization:

    • Prefer AdamW over Adam when weight decay is needed; this separates decay from learning-rate adaptation.

12. Conclusion

Through a blend of theoretical insights and extensive empirical evidence, the optimal choice of optimizer becomes a function of data scale, model architecture, and resource availability. Modern frameworks and research trends favor adaptive optimizers (AdamW, Adam, RMSProp) combined with thoughtful learning‑rate scheduling, warm‑up regimes, and regularization. However, the humble SGD + momentum remains a robust default for scenarios where interpretability and generalization are paramount.

Take‑away: Start with AdamW + weight decay and a cosine schedule, then iteratively refine batch size and learning‑rate strategies. Monitor training dynamics closely, and use the troubleshooting table to quickly diagnose and resolve inefficiencies.


This guide should arm developers and researchers alike with a practical, data‑driven approach to optimizer selection and hyperparameter tuning—transforming the theoretical landscape of gradient‑descent into a clear, actionable workflow.