For a quick simple explanation:
In both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters in an iterative manner to minimize an error function.
In GD, you run through ALL the samples in your training set to make a single parameter update in a given iteration; in SGD, on the other hand, you use ONLY ONE sample, or a SUBSET of samples, from your training set per update. If you use a SUBSET, it is called minibatch stochastic gradient descent.
Thus, if the number of training samples is very large, gradient descent may take too long, because every single parameter update requires a pass through the complete training set. SGD, on the other hand, is faster because it uses only one training sample and starts improving right away, from the very first sample.
SGD often converges much faster than GD, but the error function is not as well minimized as with GD. In most cases, though, the close approximation you get with SGD is good enough, because the parameters reach near-optimal values and keep oscillating around them.
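To make the contrast concrete, here is a minimal numerical sketch of the two update loops for least-squares regression (the data, learning rates, and iteration counts are my own illustrative choices, not from the answer):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                # 100 training samples, 3 features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)  # noisy targets

# Batch gradient descent: EVERY update runs through ALL samples.
w_gd = np.zeros(3)
for _ in range(200):
    grad = 2 * X.T @ (X @ w_gd - y) / len(y)  # gradient of the mean squared error
    w_gd -= 0.1 * grad

# Stochastic gradient descent: every update uses ONE random sample.
w_sgd = np.zeros(3)
for _ in range(2000):
    i = rng.integers(len(y))
    grad = 2 * X[i] * (X[i] @ w_sgd - y[i])   # gradient on a single sample
    w_sgd -= 0.01 * grad

print(w_gd, w_sgd)  # both end up close to true_w
```

Note that each SGD step touches only one row of X, so an epoch of 100 SGD steps costs about as much as a single batch GD step here.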
If you need a worked example of this, check Andrew Ng's notes, where he clearly shows the steps involved in both cases: cs229-notes
Source: Quora Thread
The word stochastic simply means that random samples from the training data are chosen on each run to update the parameters during optimisation, within the framework of gradient descent.
Doing so not only computes errors and updates weights in faster iterations (because we only process a small selection of samples in one go), it also often helps us move towards an optimum more quickly. Have a look at the answers here for more information on why using stochastic minibatches for training offers advantages.
One possible downside is that the path to the optimum (assuming it would always be the same optimum) can be much noisier. So instead of a nice smooth loss curve showing how the error decreases in each iteration of gradient descent, you might see something like this:

We clearly see the loss decreasing over time; however, there are large variations from epoch to epoch (training batch to training batch), so the curve is noisy.
This is simply because, in each iteration, we compute the mean error over our stochastically/randomly selected subset of the entire dataset. Some samples will produce a high error, some a low one, so the average can vary depending on which samples were randomly used for one iteration of gradient descent.
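That effect is easy to reproduce numerically; the per-sample errors and batch size below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
per_sample_error = rng.exponential(scale=1.0, size=10_000)  # some errors high, some low

full_mean = per_sample_error.mean()  # the loss batch GD would report every iteration
# Means over random minibatches scatter around the full mean -> a noisy loss curve.
batch_means = np.array([rng.choice(per_sample_error, size=32).mean()
                        for _ in range(1000)])
print(f"full mean {full_mean:.3f}; "
      f"minibatch means {batch_means.mean():.3f} +/- {batch_means.std():.3f}")
```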
Related discussions:
- ELI5 Gradient Descent and Stochastic Gradient Descent
- machine learning - Gradient Descent vs Stochastic Gradient Descent algorithms - Stack Overflow
- machine learning - Stochastic Gradient Descent vs Online Gradient Descent - Cross Validated
- [D] The unreasonable effectiveness of stochastic gradient descent
As the title says, I want to understand gradient descent and SGD as a child would.
Answer from Sociopath on Stack Exchange
I'll try to give you some intuition about the problem...
Initially, updates were made in what you (correctly) call (Batch) Gradient Descent. This assures that each update to the weights is made in the "right" direction (Fig. 1): the one that minimizes the cost function.

With the growth of dataset sizes, and more complex computations in each step, Stochastic Gradient Descent came to be preferred in these cases. Here, updates to the weights are made as each sample is processed and, as such, subsequent calculations already use "improved" weights. Nonetheless, this very reason leads to some misdirection in minimizing the error function (Fig. 2).

As such, in many situations it is preferred to use Mini-batch Gradient Descent, combining the best of both worlds: each update to the weights is done using a small batch of the data. This way, the direction of the updates is somewhat rectified in comparison with the stochastic updates, but the weights are updated much more frequently than in the case of the (original) Gradient Descent.
[UPDATE] As requested, I present below the pseudocode for batch gradient descent in binary classification:
error = 0
for sample in data:
    prediction = neural_network.predict(sample)
    # may be as simple as abs(prediction - sample["label"])
    sample_error = evaluate_error(prediction, sample["label"])
    error += sample_error
neural_network.backpropagate_and_update(error)
(In the case of multi-class labeling, error represents an array of the error for each label.)
This code is run for a given number of iterations, or while the error is above a threshold. For stochastic gradient descent, neural_network.backpropagate_and_update() is called inside the for loop, with the sample error as its argument.
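Following that description, a sketch of the stochastic variant (using the same hypothetical neural_network and evaluate_error helpers as above) simply moves the update inside the loop:

for sample in data:
    prediction = neural_network.predict(sample)
    sample_error = evaluate_error(prediction, sample["label"])
    neural_network.backpropagate_and_update(sample_error)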
The new scenario you describe (performing backpropagation on each randomly picked sample) is one common "flavor" of Stochastic Gradient Descent, as described here: https://www.quora.com/Whats-the-difference-between-gradient-descent-and-stochastic-gradient-descent
The 3 most common flavors according to that document are (your flavor is C):
A)
randomly shuffle samples in the training set
for one or more epochs, or until approx. cost minimum is reached:
    for training sample i:
        compute gradients and perform weight updates
B)
for one or more epochs, or until approx. cost minimum is reached:
    randomly shuffle samples in the training set
    for training sample i:
        compute gradients and perform weight updates
C)
for iterations t, or until approx. cost minimum is reached:
    draw random sample from the training set
    compute gradients and perform weight updates
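For concreteness, flavor C translates almost line-for-line into code; this toy least-squares setup (my own construction, not from the quoted document) follows the three pseudocode steps:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))
y = X @ np.array([3.0, -1.0])            # noiseless toy targets

w, lr = np.zeros(2), 0.02
for t in range(5000):                    # "for iterations t, or until ... minimum"
    i = rng.integers(len(y))             # "draw random sample from the training set"
    grad = 2 * X[i] * (X[i] @ w - y[i])  # "compute gradients ..."
    w -= lr * grad                       # "... and perform weight updates"

print(w)  # approaches [3.0, -1.0]
```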
Apparently, different authors have different ideas about stochastic gradient descent. Bishop says:
On-line gradient descent, also known as sequential gradient descent or stochastic gradient descent, makes an update to the weight vector based on one data point at a time…
Whereas [2] describes that as subgradient descent and gives a more general definition of stochastic gradient descent:
In stochastic gradient descent we do not require the update direction to be based exactly on the gradient. Instead, we allow the direction to be a random vector and only require that its expected value at each iteration will equal the gradient direction. Or, more generally, we require that the expected value of the random vector will be a subgradient of the function at the current vector.
Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
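The "expected value equals the gradient" property from that definition can be checked numerically: averaging the per-sample gradients of a squared-error loss reproduces the full-batch gradient exactly, so a uniformly drawn sample gives an unbiased gradient estimate (toy data of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = rng.normal(size=200)
w = rng.normal(size=4)

# Per-sample gradients of the squared error, one row per training sample.
per_sample_grads = 2 * (X @ w - y)[:, None] * X

# Full gradient of the mean squared error over the whole training set.
full_grad = 2 * X.T @ (X @ w - y) / len(y)

# A uniformly drawn row of per_sample_grads is a random vector whose
# expected value is exactly full_grad -> an unbiased gradient estimate.
print(np.max(np.abs(per_sample_grads.mean(axis=0) - full_grad)))  # ~0
```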
As an example, let's place ourselves in the context of linear/logistic regression. Assume you have $N$ samples in your training set and want to loop through them once to learn the coefficients of your model.
- Stochastic Gradient Descent: you would randomly select one of those training samples at each iteration to update your coefficients.
- Online Gradient Descent: you would use the "most recent" sample at each iteration. There is no stochasticity, as you deterministically select your sample. In industry, where datasets are large, models are trained "live": the most recent samples are used to update the coefficients as soon as they arrive.
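A sketch of that online setting (the function name and toy data stream are my own): the only change from the stochastic version is that each update consumes the newest sample instead of a randomly drawn one.

```python
import numpy as np

def online_gd_step(w, x, y, lr=0.02):
    """One update from the most recent sample (squared-error loss)."""
    return w - lr * 2 * x * (x @ w - y)

rng = np.random.default_rng(4)
w = np.zeros(2)
for _ in range(3000):                # samples "arrive" one at a time
    x = rng.normal(size=2)
    y = x @ np.array([2.0, -1.0])    # stream generated by a fixed true model
    w = online_gd_step(w, x, y)      # deterministic: always the newest sample

print(w)  # approaches [2.0, -1.0]
```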