For a quick simple explanation:
In both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters in an iterative manner to minimize an error function.
In GD, you run through ALL the samples in your training set to make a single parameter update in a given iteration; in SGD, on the other hand, you use ONLY ONE sample, or a SUBSET of samples, from your training set per update. If you use a SUBSET, it is called minibatch stochastic gradient descent.
Thus, if the number of training samples is very large, gradient descent may take too long, because every single parameter update requires a pass through the complete training set. SGD, on the other hand, is faster because it uses only one training sample and starts improving right away, from the very first sample.
SGD often converges much faster than GD, but the error function is not as well minimized as with GD. In most cases, though, the close approximation you get with SGD is good enough, because the parameters reach near-optimal values and keep oscillating around them.
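To make the contrast concrete, here is a minimal numerical sketch of the two update loops for least-squares regression (the data, learning rates, and iteration counts are my own illustrative choices, not from the answer):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                # 100 training samples, 3 features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)  # noisy targets

# Batch gradient descent: EVERY update runs through ALL samples.
w_gd = np.zeros(3)
for _ in range(200):
    grad = 2 * X.T @ (X @ w_gd - y) / len(y)  # gradient of the mean squared error
    w_gd -= 0.1 * grad

# Stochastic gradient descent: every update uses ONE random sample.
w_sgd = np.zeros(3)
for _ in range(2000):
    i = rng.integers(len(y))
    grad = 2 * X[i] * (X[i] @ w_sgd - y[i])   # gradient on a single sample
    w_sgd -= 0.01 * grad

print(w_gd, w_sgd)  # both end up close to true_w
```

Note that each SGD step touches only one row of X, so an epoch of 100 SGD steps costs about as much as a single batch GD step here.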
If you need a worked example of this, check Andrew Ng's notes, where he clearly shows the steps involved in both cases: cs229-notes
Source: Quora Thread
The word stochastic simply means that random samples from the training data are chosen on each run to update the parameters during optimisation, within the framework of gradient descent.
Doing so not only computes errors and updates weights in faster iterations (because we only process a small selection of samples in one go), it also often helps us move towards an optimum more quickly. Have a look at the answers here for more information on why using stochastic minibatches for training offers advantages.
One possible downside is that the path to the optimum (assuming it would always be the same optimum) can be much noisier. So instead of a nice smooth loss curve showing how the error decreases in each iteration of gradient descent, you might see something like this:

We clearly see the loss decreasing over time; however, there are large variations from epoch to epoch (training batch to training batch), so the curve is noisy.
This is simply because, in each iteration, we compute the mean error over our stochastically/randomly selected subset of the entire dataset. Some samples will produce a high error, some a low one, so the average can vary depending on which samples were randomly used for one iteration of gradient descent.
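That effect is easy to reproduce numerically; the per-sample errors and batch size below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
per_sample_error = rng.exponential(scale=1.0, size=10_000)  # some errors high, some low

full_mean = per_sample_error.mean()  # the loss batch GD would report every iteration
# Means over random minibatches scatter around the full mean -> a noisy loss curve.
batch_means = np.array([rng.choice(per_sample_error, size=32).mean()
                        for _ in range(1000)])
print(f"full mean {full_mean:.3f}; "
      f"minibatch means {batch_means.mean():.3f} +/- {batch_means.std():.3f}")
```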
Related discussions:
- ELI5 Gradient Descent and Stochastic Gradient Descent
- machine learning - Gradient Descent vs Stochastic Gradient Descent algorithms - Stack Overflow
- machine learning - Stochastic Gradient Descent vs Online Gradient Descent - Cross Validated
- [D] The unreasonable effectiveness of stochastic gradient descent
As the title says, I want to understand gradient descent and SGD as a child would.
Answer from Sociopath on Stack Exchange
I'll try to give you some intuition about the problem...
Initially, updates were made in what you (correctly) call (Batch) Gradient Descent. This assures that each update to the weights is made in the "right" direction (Fig. 1): the one that minimizes the cost function.

With the growth of dataset sizes, and more complex computations in each step, Stochastic Gradient Descent came to be preferred in these cases. Here, updates to the weights are made as each sample is processed and, as such, subsequent calculations already use "improved" weights. Nonetheless, this very reason leads to some misdirection in minimizing the error function (Fig. 2).

As such, in many situations it is preferred to use Mini-batch Gradient Descent, combining the best of both worlds: each update to the weights is done using a small batch of the data. This way, the direction of the updates is somewhat rectified in comparison with the stochastic updates, but the weights are updated much more frequently than in the case of the (original) Gradient Descent.
[UPDATE] As requested, I present below the pseudocode for batch gradient descent in binary classification:
error = 0
for sample in data:
    prediction = neural_network.predict(sample)
    # may be as simple as abs(prediction - sample["label"])
    sample_error = evaluate_error(prediction, sample["label"])
    error += sample_error
neural_network.backpropagate_and_update(error)
(In the case of multi-class labeling, error represents an array of the error for each label.)
This code is run for a given number of iterations, or while the error is above a threshold. For stochastic gradient descent, neural_network.backpropagate_and_update() is called inside the for loop, with the sample error as its argument.
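Following that description, a sketch of the stochastic variant (using the same hypothetical neural_network and evaluate_error helpers as above) simply moves the update inside the loop:

for sample in data:
    prediction = neural_network.predict(sample)
    sample_error = evaluate_error(prediction, sample["label"])
    neural_network.backpropagate_and_update(sample_error)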
The new scenario you describe (performing backpropagation on each randomly picked sample) is one common "flavor" of Stochastic Gradient Descent, as described here: https://www.quora.com/Whats-the-difference-between-gradient-descent-and-stochastic-gradient-descent
The 3 most common flavors according to that document are (your flavor is C):
A)
randomly shuffle samples in the training set
for one or more epochs, or until approx. cost minimum is reached:
    for training sample i:
        compute gradients and perform weight updates
B)
for one or more epochs, or until approx. cost minimum is reached:
    randomly shuffle samples in the training set
    for training sample i:
        compute gradients and perform weight updates
C)
for iterations t, or until approx. cost minimum is reached:
    draw random sample from the training set
    compute gradients and perform weight updates
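For concreteness, flavor C translates almost line-for-line into code; this toy least-squares setup (my own construction, not from the quoted document) follows the three pseudocode steps:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))
y = X @ np.array([3.0, -1.0])            # noiseless toy targets

w, lr = np.zeros(2), 0.02
for t in range(5000):                    # "for iterations t, or until ... minimum"
    i = rng.integers(len(y))             # "draw random sample from the training set"
    grad = 2 * X[i] * (X[i] @ w - y[i])  # "compute gradients ..."
    w -= lr * grad                       # "... and perform weight updates"

print(w)  # approaches [3.0, -1.0]
```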
Apparently, different authors have different ideas about stochastic gradient descent. Bishop says:
On-line gradient descent, also known as sequential gradient descent or stochastic gradient descent, makes an update to the weight vector based on one data point at a time…
Whereas [2] describes that as subgradient descent and gives a more general definition of stochastic gradient descent:
In stochastic gradient descent we do not require the update direction to be based exactly on the gradient. Instead, we allow the direction to be a random vector and only require that its expected value at each iteration will equal the gradient direction. Or, more generally, we require that the expected value of the random vector will be a subgradient of the function at the current vector.
Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
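The "expected value equals the gradient" property from that definition can be checked numerically: averaging the per-sample gradients of a squared-error loss reproduces the full-batch gradient exactly, so a uniformly drawn sample gives an unbiased gradient estimate (toy data of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = rng.normal(size=200)
w = rng.normal(size=4)

# Per-sample gradients of the squared error, one row per training sample.
per_sample_grads = 2 * (X @ w - y)[:, None] * X

# Full gradient of the mean squared error over the whole training set.
full_grad = 2 * X.T @ (X @ w - y) / len(y)

# A uniformly drawn row of per_sample_grads is a random vector whose
# expected value is exactly full_grad -> an unbiased gradient estimate.
print(np.max(np.abs(per_sample_grads.mean(axis=0) - full_grad)))  # ~0
```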
As an example, let's place ourselves in the context of linear/logistic regression. Assume you have $N$ samples in your training set and want to loop through them once to learn the coefficients of your model.
- Stochastic Gradient Descent: you would randomly select one of those training samples at each iteration to update your coefficients.
- Online Gradient Descent: you would use the "most recent" sample at each iteration. There is no stochasticity, as you deterministically select your sample. In industry, where datasets are large, models are trained "live": the most recent samples are used to update the coefficients as soon as they arrive.
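A sketch of that online setting (the function name and toy data stream are my own): the only change from the stochastic version is that each update consumes the newest sample instead of a randomly drawn one.

```python
import numpy as np

def online_gd_step(w, x, y, lr=0.02):
    """One update from the most recent sample (squared-error loss)."""
    return w - lr * 2 * x * (x @ w - y)

rng = np.random.default_rng(4)
w = np.zeros(2)
for _ in range(3000):                # samples "arrive" one at a time
    x = rng.normal(size=2)
    y = x @ np.array([2.0, -1.0])    # stream generated by a fixed true model
    w = online_gd_step(w, x, y)      # deterministic: always the newest sample

print(w)  # approaches [2.0, -1.0]
```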