For a quick simple explanation:

In both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters in an iterative manner to minimize an error function.

In GD, you have to run through ALL the samples in your training set to do a single update of a parameter in a particular iteration. In SGD, on the other hand, you use ONLY ONE sample, or a SUBSET of samples, from your training set to do the update of a parameter in a particular iteration. If you use a subset, it is called mini-batch stochastic gradient descent.
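In symbols (my notation, not from the original answer), with parameters θ, learning rate η, and per-sample loss Lᵢ over N training samples, the three update rules are:

```latex
\text{GD:} \qquad \theta \leftarrow \theta - \eta \, \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta L_i(\theta)

\text{SGD:} \qquad \theta \leftarrow \theta - \eta \, \nabla_\theta L_i(\theta), \quad i \text{ drawn at random}

\text{Mini-batch:} \qquad \theta \leftarrow \theta - \eta \, \frac{1}{|B|} \sum_{i \in B} \nabla_\theta L_i(\theta), \quad B \subset \{1, \dots, N\}
```

So GD pays for one full pass over all N samples per update, while SGD performs N updates in roughly the same amount of computation.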

Thus, if the number of training samples is large, in fact very large, then using gradient descent may take too long, because in every iteration, when you are updating the parameter values, you are running through the complete training set. On the other hand, SGD is faster because it uses only one training sample, and it starts improving right away from the first sample.

SGD often converges much faster than GD, but the error function is not as well minimized as it is with GD. In most cases, though, the close approximation of the parameter values that you get with SGD is enough, because the parameters reach the neighbourhood of the optimal values and keep oscillating there.

If you need a practical example of this, check Andrew Ng's notes, where he clearly shows you the steps involved in both cases: cs229-notes

Source: Quora Thread

Answer from Sociopath on Stack Exchange


2 of 7

The inclusion of the word stochastic simply means that random samples from the training data are chosen on each run to update the parameters during optimisation, within the framework of gradient descent.

Doing so not only computes errors and updates weights in faster iterations (because we only process a small selection of samples in one go), it also often helps us move towards an optimum more quickly. Have a look at the answers here for more information on why using stochastic minibatches for training offers advantages.

One possible downside is that the path to the optimum (assuming it would always be the same optimum) can be much noisier. So instead of a nice smooth loss curve, showing how the error decreases in each iteration of gradient descent, you might see something like this:

[image: a jagged loss curve trending downwards]

We clearly see the loss decreasing over time; however, there are large variations from epoch to epoch (training batch to training batch), so the curve is noisy.

This is simply because, in each iteration, we compute the mean error over our stochastically/randomly selected subset of the entire dataset. Some samples will produce a high error, some a low one, so the average can vary depending on which samples we happened to use for that iteration of gradient descent.
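This effect can be seen without training anything: hold the per-sample errors fixed and only resample the minibatch, and the measured mean error still varies from draw to draw. A minimal sketch (toy numbers and names are my own, not from the answer):

```python
import numpy as np

# Hold everything fixed and only resample the minibatch -- the measured
# mean error still fluctuates from draw to draw.
rng = np.random.default_rng(1)
per_sample_error = rng.exponential(scale=1.0, size=10_000)  # some samples high error, some low

full_mean = per_sample_error.mean()                  # what full-batch GD would report
minibatch_means = np.array([
    rng.choice(per_sample_error, size=32, replace=False).mean()
    for _ in range(5)
])

print(f"full-dataset mean error: {full_mean:.3f}")
print("five minibatch means:   ", np.round(minibatch_means, 3))
# The minibatch means scatter around the full mean; that scatter is exactly
# the iteration-to-iteration noise seen in the loss curve.
```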

Discussions

ELI5 Gradient Descent and Stochastic Gradient Descent
GD: check all your data, find the answer that minimises cost for all data on average SGD: check some, minimise cost for them, check some more and so on. ELI5: GD is a ball rolling down the slope, SGD is a wobbly bicycle free-rolling down a slope. More on reddit.com
machine learning - Gradient Descent vs Stochastic Gradient Descent algorithms - Stack Overflow
I tried to train a FeedForward Neural Network on the MNIST Handwritten Digits dataset (includes 60K training samples). I each time iterated over all the training samples, performing Backpropagatio... More on stackoverflow.com
machine learning - Stochastic Gradient Descent vs Online Gradient Descent - Cross Validated
I was wondering what the difference between stochastic gradient descent and online gradient descent is? Or is it the same algorithm? More on stats.stackexchange.com
[D] The unreasonable effectiveness of stochastic gradient descent
Most explanations mention some hand-wavy argument about "escaping narrow local minima" and "escaping saddle points" But what else could it possibly be? Seems like a perfectly good explanation. I'd be absolutely shocked if SGD found better solutions than if you partition the parameter space into a huge number of equally sized regions, do GD starting in each, and pick the best solution More on reddit.com

Top answer
1 of 2

I'll try to give you some intuition over the problem...

Initially, updates were made in what you (correctly) call (Batch) Gradient Descent. This ensures that each update to the weights is done in the "right" direction (Fig. 1): the one that minimizes the cost function.

With the growth of dataset sizes, and more complex computations in each step, Stochastic Gradient Descent came to be preferred in these cases. Here, the weights are updated as each sample is processed and, as such, subsequent calculations already use "improved" weights. Nonetheless, this very reason leads to some misdirection in minimizing the error function (Fig. 2).

As such, in many situations it is preferred to use Mini-batch Gradient Descent, combining the best of both worlds: each update to the weights is done using a small batch of the data. This way, the direction of the updates is somewhat rectified in comparison with the stochastic updates, but the weights are updated much more regularly than in the case of the (original) Gradient Descent.

[UPDATE] As requested, I present below the pseudocode for batch gradient descent in binary classification:

error = 0

for sample in data:
    prediction = neural_network.predict(sample)
    sample_error = evaluate_error(prediction, sample["label"])  # may be as simple as
                                                                # abs(prediction - sample["label"])
    error += sample_error

neural_network.backpropagate_and_update(error)

(In the case of multi-class labeling, error represents an array of the error for each label.)

This code is run for a given number of iterations, or while the error is above a threshold. For stochastic gradient descent, neural_network.backpropagate_and_update() is instead called inside the for loop, with the sample error as its argument.
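The pseudocode above can be made concrete. A minimal runnable sketch, using logistic regression as the "neural network" for binary classification (toy data, and every name and hyperparameter here, are my own illustrative assumptions, not from the answer):

```python
import numpy as np

# Batch gradient descent for binary classification with logistic regression.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
labels = (X[:, 0] + X[:, 1] > 0).astype(float)   # linearly separable labels

w = np.zeros(2)
learning_rate = 0.5
for _ in range(200):                       # run for a given number of iterations
    grad = np.zeros(2)
    for x, label in zip(X, labels):        # accumulate the error over ALL samples...
        prediction = 1.0 / (1.0 + np.exp(-x @ w))   # sigmoid output
        grad += (prediction - label) * x            # cross-entropy gradient term
    w -= learning_rate * grad / len(labels)         # ...then do ONE weight update

accuracy = (((1.0 / (1.0 + np.exp(-X @ w))) > 0.5) == labels).mean()
print("training accuracy:", accuracy)
```

For the stochastic variant, as the answer says, the `w -= ...` update would move inside the inner loop and use each sample's gradient directly.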

2 of 2

The new scenario you describe (performing backpropagation on each randomly picked sample) is one common "flavor" of Stochastic Gradient Descent, as described here: https://www.quora.com/Whats-the-difference-between-gradient-descent-and-stochastic-gradient-descent

The 3 most common flavors according to this document are (Your flavor is C):

A)

randomly shuffle samples in the training set
for one or more epochs, or until approx. cost minimum is reached:
    for training sample i:
        compute gradients and perform weight updates

B)

for one or more epochs, or until approx. cost minimum is reached:
    randomly shuffle samples in the training set
    for training sample i:
        compute gradients and perform weight updates

C)

for iterations t, or until approx. cost minimum is reached:
    draw random sample from the training set
    compute gradients and perform weight updates
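These flavors can be sketched directly in runnable form. Here is flavor B for least-squares linear regression (toy data; names and hyperparameters are my own illustrative choices, not from the document cited above):

```python
import numpy as np

# Flavor B: reshuffle the training set at the start of every epoch.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -1.0])            # noiseless targets, true weights [3, -1]

w = np.zeros(2)
learning_rate = 0.05
for epoch in range(50):                  # "for one or more epochs ..."
    for i in rng.permutation(len(y)):    # "randomly shuffle samples in the training set"
        grad = 2.0 * X[i] * (X[i] @ w - y[i])   # "compute gradients ..."
        w -= learning_rate * grad               # "... and perform weight updates"

print(np.round(w, 3))   # converges to the true weights [3, -1]
```

Flavor A would hoist the `rng.permutation` call above the epoch loop (shuffle once); flavor C would instead draw a random index per update, with replacement.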
Top answer
1 of 3

Apparently, different authors have different ideas about stochastic gradient descent. Bishop says:

On-line gradient descent, also known as sequential gradient descent or stochastic gradient descent, makes an update to the weight vector based on one data point at a time…

Whereas [2] describes that as subgradient descent, and gives a more general definition for stochastic gradient descent:

In stochastic gradient descent we do not require the update direction to be based exactly on the gradient. Instead, we allow the direction to be a random vector and only require that its expected value at each iteration will equal the gradient direction. Or, more generally, we require that the expected value of the random vector will be a subgradient of the function at the current vector.

Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
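In symbols (my notation), this more general definition only requires that the random update direction v_t at the current iterate w_t satisfies:

```latex
\mathbb{E}\left[ v_t \mid w_t \right] = \nabla f(w_t)
\qquad \text{(or, more generally, } \mathbb{E}\left[ v_t \mid w_t \right] \in \partial f(w_t) \text{)}
```

Picking a single training sample uniformly at random satisfies this condition, because the expectation of the per-sample gradient is the full gradient.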

2 of 3

As an example, let's place ourselves in the context of Linear/Logistic Regression. Let's assume you have $N$ samples in your training set, and you want to loop once through those samples to learn the coefficients of your model.

  • Stochastic Gradient Descent: you would randomly select one of those training samples at each iteration to update your coefficients.
  • Online Gradient Descent: you would use the "most recent" sample at each iteration. There is no stochasticity, as you deterministically select your sample. In industry, where datasets are large, we train "live" by using the most recent samples as soon as they arrive to update the coefficients.
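The distinction is only in how the next sample index is chosen. A minimal least-squares sketch (toy data; all names and hyperparameters are my own assumptions):

```python
import numpy as np

# Contrast the two sample-selection rules on least-squares regression.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = X @ np.array([2.0, -1.0])            # noiseless targets, true weights [2, -1]

def sgd(X, y, steps=500, lr=0.05):
    """Stochastic: draw a RANDOM training sample at each iteration."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        i = rng.integers(len(y))                 # random index -> stochasticity
        w -= lr * 2.0 * X[i] * (X[i] @ w - y[i])
    return w

def online_gd(X, y, lr=0.05):
    """Online: use the MOST RECENT sample, in arrival order -- no randomness."""
    w = np.zeros(X.shape[1])
    for i in range(len(y)):                      # deterministic, one pass
        w -= lr * 2.0 * X[i] * (X[i] @ w - y[i])
    return w

print(np.round(sgd(X, y), 2))        # both end up near the true weights [2, -1]
print(np.round(online_gd(X, y), 2))
```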