The applicability of batch or stochastic gradient descent really depends on the error manifold expected.

Batch gradient descent computes the gradient using the whole dataset. This is great for convex or relatively smooth error manifolds: we move fairly directly towards an optimum solution, either local or global. Additionally, batch gradient descent, given an annealed learning rate, will eventually find the minimum located in its basin of attraction.
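
As a concrete sketch (my own illustrative example, not from the answer), here is batch gradient descent on a small least-squares problem; the loss is convex, so the full-dataset gradient marches straight into the minimum:

```python
import numpy as np

# Batch gradient descent: every update uses the gradient over the WHOLE dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))               # full dataset: 200 samples, 3 features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)

w = np.zeros(3)
lr = 0.1
for step in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # mean-squared-error gradient, all samples
    w -= lr * grad                          # one precise step per pass over the data

print(w)                                    # ends up very close to true_w
```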

Stochastic gradient descent (SGD) computes the gradient using a single sample. Most applications of SGD actually use a minibatch of several samples, for reasons explained a bit later. SGD works well (not well, exactly, but better than batch gradient descent) for error manifolds that have lots of local maxima and minima. Here, the somewhat noisier gradient computed from the reduced number of samples tends to jerk the model out of local minima into a region that is hopefully more optimal. Single samples are really noisy, while minibatches average a little of the noise out, so the amount of jerk is reduced when using minibatches. A good balance is struck when the minibatch is small enough to escape some of the poor local minima, but large enough that it doesn't escape the global minimum or better-performing local minima. (Incidentally, this assumes that the best minima have a larger and deeper basin of attraction, and are therefore easier to fall into.)
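
A minimal minibatch-SGD counterpart (again an illustrative sketch; the batch size and learning rate are arbitrary choices): each step estimates the gradient from a small random subset, so individual steps are noisy, but on a convex problem the iterates still settle near the minimum:

```python
import numpy as np

# Minibatch SGD: each update uses a small random subset, so every gradient is
# a noisy estimate of the full-batch gradient.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)

w = np.zeros(3)
lr, batch_size = 0.05, 16
for step in range(2000):
    idx = rng.choice(len(y), size=batch_size, replace=False)  # random minibatch
    Xb, yb = X[idx], y[idx]
    w -= lr * 2 * Xb.T @ (Xb @ w - yb) / batch_size           # noisy step

print(w)   # noisy path, but it still settles near true_w on this convex problem
```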

One benefit of SGD is that it's computationally much faster. Large datasets often can't be held in RAM, which makes vectorization much less efficient: each sample or batch of samples must be loaded, processed, the results stored, and so on. Minibatches, on the other hand, are usually intentionally kept small enough to be computationally tractable.

Usually, this computational advantage is leveraged by performing many more iterations of SGD, taking many more steps than conventional batch gradient descent. This usually results in a model that is very close to the one batch gradient descent would find, or better.

The way I like to think of how SGD works is to imagine that I have one point that represents my input distribution. My model is attempting to learn that input distribution. Surrounding it is a shaded area that represents the input distributions of all of the possible minibatches I could sample. It's usually a fair assumption that the minibatch input distributions are close to the true input distribution. Batch gradient descent, at every step, takes the steepest route to reach the true input distribution. SGD, on the other hand, chooses a random point within the shaded area and takes the steepest route towards it, choosing a new point at each iteration. The average of all of these steps will approximate the true input distribution, usually quite well.
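
This picture can be checked numerically: averaged over many random minibatches, the minibatch gradient lines up with the full-batch gradient. The model, data, and batch size below are my own illustrative choices:

```python
import numpy as np

# Averaged over many random minibatches, the minibatch gradient matches the
# full-batch gradient at a fixed parameter point (unbiasedness).
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4))
y = rng.normal(size=1000)
w = rng.normal(size=4)                           # fixed parameter point

def grad(Xs, ys):
    return 2 * Xs.T @ (Xs @ w - ys) / len(ys)    # MSE gradient on a subset

full = grad(X, y)                                # full-batch direction
estimates = []
for _ in range(20_000):
    idx = rng.choice(1000, size=32, replace=False)
    estimates.append(grad(X[idx], y[idx]))       # one noisy minibatch gradient
avg = np.mean(estimates, axis=0)

print(np.max(np.abs(avg - full)))   # small: the steps average out to the batch step
```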

Answer from Jason_L_Bens on Stack Exchange


As the other answer suggests, the main reason to use SGD is to reduce the computation cost of the gradient while still largely maintaining the gradient direction when averaged over many minibatches or samples, which surely helps bring you to a local minimum.

  1. Why minibatch works.

The mathematics behind this is that the "true" gradient of the cost function (the gradient of the generalization error, or for an infinitely large sample set) is the expectation of the gradient over the true data-generating distribution $p_{data}$; the actual gradient computed over a batch of samples is always an approximation to that true gradient, taken under the empirical data distribution. Batch gradient descent can give you the best possible gradient given all of your data samples, but it is still not the "true" gradient. A smaller batch (i.e. a minibatch) is probably not as good an approximation as the full batch, but both are approximations, and so is the single-sample minibatch (SGD).
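
In symbols (my notation; $p_{data}$ is the true data-generating distribution and $\hat{p}_{data}$ the empirical one), the true gradient and its $m$-sample estimate are

$$g = \mathbb{E}_{(x,y) \sim p_{data}}\left[\nabla_\theta L(f(x;\theta), y)\right], \qquad \hat{g} = \frac{1}{m}\sum_{i=1}^{m} \nabla_\theta L(f(x^{(i)};\theta), y^{(i)}),$$

where $m = n$ (the full dataset) gives batch gradient descent, $1 < m < n$ a minibatch, and $m = 1$ single-sample SGD; all three estimate $g$ from samples of $\hat{p}_{data}$ rather than $p_{data}$, hence all three are approximations.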

Assuming there is no dependence between the samples in one minibatch, the computed gradient is an unbiased estimate of the true gradient. The squared standard error of the estimate is inversely proportional to the minibatch size; in other words, the standard error shrinks with the square root of the increase in sample size. This means that if the minibatch size is small, the learning rate has to be small too, in order to achieve stability despite the larger variance. When the samples are not independent, the estimate is no longer unbiased, which is why you should shuffle the samples before training if they are not already ordered randomly enough.
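
The scaling claim is easy to verify empirically: halving the standard error requires four times the batch size. A sketch under illustrative assumptions (i.i.d. samples, squared-error loss on a linear model; the constant depends on the problem, only the scaling matters):

```python
import numpy as np

# The standard error of the minibatch gradient shrinks like 1 / sqrt(m).
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 5))
y = rng.normal(size=100_000)
w = rng.normal(size=5)                           # fixed parameter point

def minibatch_grad_std(m, trials=2000):
    g = np.empty((trials, 5))
    for t in range(trials):
        idx = rng.integers(0, len(y), size=m)    # sample with replacement
        Xb, yb = X[idx], y[idx]
        g[t] = 2 * Xb.T @ (Xb @ w - yb) / m      # minibatch MSE gradient
    return g.std(axis=0).mean()                  # mean per-component std error

s16, s64 = minibatch_grad_std(16), minibatch_grad_std(64)
print(s16 / s64)   # near sqrt(64 / 16) = 2
```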

  2. Why minibatch may work better.

Firstly, minibatching turns some learning problems from technically intractable into tractable ones, owing to the reduced computation demanded by a smaller batch size.

Secondly, a reduced batch size does not necessarily mean reduced gradient accuracy. The training samples may contain lots of noise, outliers, or biases. A randomly sampled minibatch may reflect the true data-generating distribution as well as (or better than) the original full batch. If some iterations of the minibatch gradient updates give you a better estimate, then overall the averaged result of one epoch can be better than the gradient computed from a full batch.

Thirdly, minibatches help not only with unpleasant data samples, but also with unpleasant cost functions that have many local minima. As Jason_L_Bens mentions, the error manifolds are sometimes more likely to trap a regular gradient in a local minimum than the temporarily random gradient computed with a minibatch.

Finally, with gradient descent you are not reaching the global minimum in one step, but iterating over the error manifold. The gradient largely gives you only the direction in which to iterate. With minibatches, you can iterate much faster, and in many cases the more iterations you take, the better the point you can reach. You do not really care whether the point is optimal globally, or even locally. You just want to reach a reasonable model that gives you an acceptable generalization error, and minibatches make that easier.

You may find that the book "Deep Learning" by Ian Goodfellow et al. has pretty good discussions of this topic if you read it carefully.

Question from r/learnmachinelearning on Reddit (August 13, 2022): How is Batch Gradient Descent different from Stochastic?

Yes, I understand the textbook definition for what the difference is. The part that I am confused about is the actual math. Where does the actual batching or grouping take place? From what I understand, when doing backprop you need to use the Z values and Y values for an individual input. Ex. dC/dA in the last layer is (a[i] - y[i]) and dA/dZ = f'(Z) where f is the activation function. So how would grouping inputs affect any of this?

I've read that after computing each input's prediction for a given batch you then just "update the weights." What does that mean, though? Does it mean you have to store all the Z values and Y values and then perform backprop on each individual input's set of Z's and Y's? To then update the weights, would I take the average W(new) of all the individual backprops? If that's the case, then I don't see how that is any different from stochastic gradient descent. Is the time at which you update the weights the only thing that is different? Or am I missing something along the way?
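
For what it's worth, the usual mechanics are to average the per-sample gradients and apply one update per batch; and note that for a single plain gradient step, averaging the individually updated weights W(new) gives the identical result, since mean(W - lr*g_i) = W - lr*mean(g_i). A sketch for one linear layer (a toy setup of my own, not from the thread):

```python
import numpy as np

# Minibatch update mechanics for one linear "layer" with squared-error loss:
# backprop each sample, average the GRADIENTS, apply ONE update per batch.
rng = np.random.default_rng(0)
W = rng.normal(size=3)                 # weights of a 1-output linear layer
X = rng.normal(size=(8, 3))            # one minibatch of 8 inputs
y = rng.normal(size=8)                 # targets

grads = []
for xi, yi in zip(X, y):               # backprop each sample on its own
    a = xi @ W                         # forward pass (identity activation)
    dC_da = a - yi                     # dC/da for squared error
    grads.append(dC_da * xi)           # chain rule: dC/dW = dC/da * dA/dW

grad = np.mean(grads, axis=0)          # average the gradients across the batch
W_new = W - 0.1 * grad                 # a single update for the whole batch
```

Single-sample SGD applies the same per-sample formula but updates W immediately after each sample; the batching only changes when the update happens and what it averages over.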

For a quick simple explanation:

In both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters in an iterative manner to minimize an error function.

While in GD you have to run through ALL the samples in your training set to do a single update of a parameter in a particular iteration, in SGD, on the other hand, you use ONLY ONE sample, or a SUBSET of samples, from your training set to do the update in a particular iteration. If you use a SUBSET, it is called minibatch stochastic gradient descent.

Thus, if the number of training samples is large, in fact very large, then using gradient descent may take too long, because in every iteration, when you are updating the parameter values, you run through the complete training set. On the other hand, SGD will be faster because you use only one training sample, and it starts improving right away from the first sample.
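
The contrast in code (an illustrative least-squares sketch, not from the original answer): one pass over the data yields a single GD update, while single-sample SGD updates after every example and has typically moved much further by the end of the pass:

```python
import numpy as np

# One epoch of (full-)batch GD vs one epoch of single-sample SGD on an
# illustrative least-squares problem. Learning rates are arbitrary choices.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_w = np.array([3.0, -1.0])
y = X @ true_w + 0.1 * rng.normal(size=500)

def gd_epoch(w, lr=0.1):
    return w - lr * 2 * X.T @ (X @ w - y) / len(y)   # ONE update per pass

def sgd_epoch(w, lr=0.01):
    for i in rng.permutation(len(y)):                # shuffle, then...
        xi, yi = X[i], y[i]
        w = w - lr * 2 * xi * (xi @ w - yi)          # ...update per SAMPLE
    return w

w_gd = gd_epoch(np.zeros(2))
w_sgd = sgd_epoch(np.zeros(2))
# After a single pass over the data, w_sgd is much closer to true_w than w_gd.
```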

SGD often converges much faster than GD, but the error function is not as well minimized as in the case of GD. In most cases, though, the close approximation to the parameter values that you get with SGD is enough, because they approach the optimal values and keep oscillating around them.

If you need an example of this with a practical case, check Andrew Ng's notes (cs229-notes), where he clearly shows you the steps involved in both cases.

Source: Quora Thread


The inclusion of the word stochastic simply means that random samples from the training data are chosen in each run to update the parameters during optimisation, within the framework of gradient descent.

Doing so not only computes errors and updates weights in faster iterations (because we only process a small selection of samples in one go), it also often helps us move towards an optimum more quickly. Have a look at the answers here for more information on why using stochastic minibatches for training offers advantages.

One possible downside is that the path to the optimum (assuming it would always be the same optimum) can be much noisier. So instead of a nice smooth loss curve showing how the error decreases in each iteration of gradient descent, you might see something like this (the original answer includes a plot of a jagged, noisy loss curve):

We clearly see the loss decreasing over time, but there are large variations from epoch to epoch (training batch to training batch), so the curve is noisy.

This is simply because we compute the mean error over a stochastically/randomly selected subset of the entire dataset in each iteration. Some samples will produce high error, some low, so the average can vary depending on which samples we happened to use for one iteration of gradient descent.
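
That batch-to-batch scatter is easy to reproduce: even with the model parameters held fixed, the mean loss over a random minibatch fluctuates around the full-dataset mean (synthetic per-sample losses, purely illustrative):

```python
import numpy as np

# The mean loss over a random minibatch varies from batch to batch even when
# the parameters (and hence the per-sample losses) are held FIXED.
rng = np.random.default_rng(0)
losses = rng.exponential(scale=1.0, size=10_000)   # stand-in per-sample losses

full_mean = losses.mean()
batch_means = [losses[rng.choice(10_000, 32)].mean() for _ in range(100)]
print(full_mean, min(batch_means), max(batch_means))
# The batch means scatter widely around the full mean: that scatter IS the
# jaggedness in the minibatch loss curve.
```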
