The applicability of batch or stochastic gradient descent really depends on the error manifold expected.

Batch gradient descent computes the gradient using the whole dataset. This is great for convex or relatively smooth error manifolds. In this case, we move somewhat directly towards an optimum solution, either local or global. Additionally, batch gradient descent, given an annealed learning rate, will eventually find the minimum located in its basin of attraction.
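As a concrete sketch (a hypothetical least-squares example, not from the original answer), a batch update computes one gradient from every sample before moving:

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.1, n_iters=500):
    """Each update uses the exact gradient of the mean squared error
    over the *entire* dataset, so every step points straight down the
    (convex) error surface toward the basin's minimum."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        residual = X @ w - y                # errors on all samples
        grad = X.T @ residual / len(y)      # full-dataset gradient
        w -= lr * grad
    return w
```

On a convex problem like this one, the iterates head directly for the single minimum, which is the "somewhat direct" behavior described above.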

Stochastic gradient descent (SGD) computes the gradient using a single sample. Most applications of SGD actually use a minibatch of several samples, for reasons that will be explained a bit later. SGD works well (not well, I suppose, but better than batch gradient descent) for error manifolds that have lots of local maxima/minima. In this case, the somewhat noisier gradient calculated from the reduced number of samples tends to jerk the model out of local minima into a region that is hopefully more optimal. Single samples are really noisy, while minibatches average a little of the noise out, so the amount of jerk is reduced when using minibatches. A good balance is struck when the minibatch size is small enough to avoid some of the poor local minima, but large enough that it doesn't avoid the global minimum or the better-performing local minima. (Incidentally, this assumes that the best minima have larger and deeper basins of attraction, and are therefore easier to fall into.)
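A hedged sketch of the minibatch variant, on the same hypothetical least-squares setup: the only change from batch gradient descent is which samples feed each gradient, and `batch_size=1` recovers single-sample SGD:

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.05, batch_size=32, n_iters=1000, seed=0):
    """Each step estimates the gradient from a random subset of the
    data. Smaller batches give noisier (jerkier) steps; larger batches
    average more of that noise out."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        idx = rng.choice(len(y), size=batch_size, replace=False)
        residual = X[idx] @ w - y[idx]       # errors on the minibatch only
        w -= lr * X[idx].T @ residual / batch_size
    return w
```

The `batch_size` knob is exactly the balance described above: raise it and the steps smooth out; lower it and the jerk (and the chance of escaping a poor basin) grows.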

One benefit of SGD is that it's computationally a whole lot faster. Large datasets often can't be held in RAM, which makes vectorization much less efficient. Rather, each sample or batch of samples must be loaded, worked with, the results stored, and so on. The minibatch in minibatch SGD, on the other hand, is usually intentionally made small enough to be computationally tractable.

Usually, this computational advantage is leveraged by performing many more iterations of SGD, taking many more steps than conventional batch gradient descent. This usually results in a model that is very close to the one that would be found via batch gradient descent, or better.

The way I like to think of how SGD works is to imagine that I have one point that represents my input distribution. My model is attempting to learn that input distribution. Surrounding the input distribution is a shaded area that represents the input distributions of all of the possible minibatches I could sample. It's usually a fair assumption that the minibatch input distributions are close to the true input distribution. Batch gradient descent, at every step, takes the steepest route to reach the true input distribution. SGD, on the other hand, chooses a random point within the shaded area and takes the steepest route towards that point. At each iteration, though, it chooses a new point. The average of all of these steps will approximate the true input distribution, usually quite well.
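This picture can be checked numerically (a hypothetical linear-regression setup, purely illustrative): each minibatch gradient is one random point scattered around the full-batch direction, and the average of many such points lands close to it:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 5))
y = X @ rng.normal(size=5) + rng.normal(size=10_000)  # noisy targets
w = np.zeros(5)                                       # some current model

full_grad = X.T @ (X @ w - y) / len(y)   # the "steepest route" of batch GD

# Each minibatch gradient is one random point in the "shaded area"...
mb_grads = [
    X[idx].T @ (X[idx] @ w - y[idx]) / 64
    for idx in (rng.choice(len(y), size=64, replace=False) for _ in range(500))
]

# ...and the average of those points approximates the full gradient well.
rel_err = (np.linalg.norm(np.mean(mb_grads, axis=0) - full_grad)
           / np.linalg.norm(full_grad))
```

Individual `mb_grads` entries point in noticeably different directions, but `rel_err` comes out small, matching the "average of all of these steps" intuition.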

Top answer (1 of 7, score 187) from Jason_L_Bens on Stack Exchange

Answer 2 of 7 (score 19)

As the other answer suggests, the main reason to use SGD is to reduce the computational cost of the gradient while still largely maintaining the gradient direction when averaged over many minibatches or samples, which surely helps bring you to a local minimum.

  1. Why minibatch works.

The mathematics behind this is that the "true" gradient of the cost function (the gradient of the generalization error, or for an infinitely large sample set) is the expectation of the gradient over the true data-generating distribution $p_{data}$; the actual gradient computed over a batch of samples is always an approximation of the true gradient, computed with the empirical data distribution instead. Batch gradient descent can bring you the best possible gradient given all of your data samples, but it is still not the "true" gradient. A smaller batch (i.e. a minibatch) is probably not as good an approximation as the full batch, but they are both approximations, as is the single-sample minibatch (SGD).
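In symbols (using the standard formulation rather than anything specific to this answer), with per-sample loss $L(x; \theta)$ the two objects being compared are

$$\nabla_\theta J^*(\theta) = \mathbb{E}_{x \sim p_{data}}\big[\nabla_\theta L(x; \theta)\big], \qquad \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta L(x^{(i)}; \theta),$$

where $m$ is the batch size: $m = N$ (the full dataset) for batch gradient descent, $m = 1$ for pure SGD, and something in between for a minibatch. All finite-$m$ choices of $\hat{g}$ are approximations of $\nabla_\theta J^*(\theta)$.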

Assuming there is no dependence between the samples in one minibatch, the computed gradient is an unbiased estimate of the true gradient. The squared standard error of the estimate is inversely proportional to the minibatch size; that is, the reduction in standard error is the square root of the increase in sample size. This means that if the minibatch size is small, the learning rate has to be small too, in order to achieve stability despite the larger variance. When the samples are not independent, the estimate is no longer unbiased, which is why you need to shuffle the samples before training if they are not already ordered randomly enough.
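The square-root relationship is easy to verify empirically (again on a hypothetical least-squares setup): quadrupling the batch size should roughly halve the standard error of the gradient estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 4))
y = X @ rng.normal(size=4) + rng.normal(size=20_000)
w = np.zeros(4)

def grad_std(batch_size, trials=2000):
    """Spread (norm of per-component standard deviation) of the
    minibatch gradient estimate at w, over many random minibatches."""
    grads = np.stack([
        X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
        for idx in (rng.choice(len(y), size=batch_size, replace=False)
                    for _ in range(trials))
    ])
    return np.linalg.norm(grads.std(axis=0))

ratio = grad_std(16) / grad_std(64)   # expect roughly sqrt(64/16) = 2
```

The ratio hovering near 2 is the $1/\sqrt{m}$ scaling of the standard error in action.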

  2. Why minibatch may work better.

Firstly, minibatching turns some learning problems from technically intractable into tractable, due to the reduced computational demand of a smaller batch size.

Secondly, a reduced batch size does not necessarily mean reduced gradient accuracy. The training samples may have lots of noise, outliers, or biases. A randomly sampled minibatch may reflect the true data-generating distribution better than (or at least no worse than) the original full batch. If some iterations of the minibatch gradient updates give you a better estimate, then overall the averaged result of one epoch can be better than the gradient computed from a full batch.

Thirdly, minibatching not only helps deal with unpleasant data samples, but also helps deal with an unpleasant cost function that has many local minima. As Jason_L_Bens mentions, an error manifold may more easily trap the regular full-batch gradient in a local minimum than the temporarily random gradient computed from a minibatch.

Finally, with gradient descent you are not reaching the global minimum in one step, but iterating over the error manifold. The gradient largely gives you only the direction in which to iterate. With minibatches, you can iterate much faster, and in many cases, the more iterations, the better the point you can reach. You do not really care whether that point is optimal globally, or even locally. You just want to reach a reasonable model that gives you an acceptable generalization error, and minibatches make that easier.

You may find that the book Deep Learning by Ian Goodfellow et al. has pretty good discussions of this topic if you read through it carefully.
