The applicability of batch or stochastic gradient descent really depends on the error manifold expected.
Batch gradient descent computes the gradient using the whole dataset. This is great for convex, or relatively smooth error manifolds. In this case, we move somewhat directly towards an optimum solution, either local or global. Additionally, batch gradient descent, given an annealed learning rate, will eventually find the minimum located in its basin of attraction.
Stochastic gradient descent (SGD) computes the gradient using a single sample. Most applications of SGD actually use a minibatch of several samples, for reasons that will be explained a bit later. SGD works well (not well, exactly, but better than batch gradient descent) for error manifolds that have lots of local maxima and minima. In this case, the somewhat noisier gradient computed from the reduced number of samples tends to jerk the model out of local minima into a region that is hopefully more optimal. Single samples are really noisy, while minibatches average a little of the noise out, so the amount of jerk is reduced when using minibatches. A good balance is struck when the minibatch size is small enough to avoid some of the poor local minima, but large enough that it doesn't avoid the global minimum or better-performing local minima. (Incidentally, this assumes that the best minima have larger and deeper basins of attraction, and are therefore easier to fall into.)
One benefit of SGD is that it's computationally a whole lot faster. Large datasets often can't be held in RAM, which makes vectorization much less efficient. Rather, each sample or batch of samples must be loaded, worked with, the results stored, and so on. Minibatch SGD, on the other hand, is usually intentionally made small enough to be computationally tractable.
Usually, this computational advantage is leveraged by performing many more iterations of SGD, taking many more steps than conventional batch gradient descent. This usually results in a model that is very close to the one that would be found via batch gradient descent, or better.
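The contrast between the two update schemes can be sketched on a toy problem. This is a minimal illustration, not anyone's reference implementation: the 1-D regression data, the learning rate, and the step counts are all made up for the example.

```python
import random

# Toy 1-D linear regression: fit w so that y ≈ w * x.
# Data generated with a hypothetical true slope of 3.0 plus a little noise.
random.seed(0)
data = [(x, 3.0 * x + random.gauss(0, 0.1)) for x in [i / 10 for i in range(1, 101)]]

def grad(w, samples):
    # Gradient of the mean squared error 0.5*(w*x - y)^2 over the given samples.
    return sum((w * x - y) * x for x, y in samples) / len(samples)

def batch_gd(w=0.0, lr=0.01, steps=100):
    for _ in range(steps):
        w -= lr * grad(w, data)          # one step per pass over ALL the data
    return w

def minibatch_sgd(w=0.0, lr=0.01, epochs=100, batch_size=10):
    for _ in range(epochs):
        random.shuffle(data)             # random sampling of minibatches
        for i in range(0, len(data), batch_size):
            w -= lr * grad(w, data[i:i + batch_size])  # many cheap, noisy steps
    return w

print(batch_gd())       # ≈ 3.0
print(minibatch_sgd())  # ≈ 3.0, but each step touched only 10 samples
```

Both loops end up near the same slope; the minibatch version simply takes many more, much cheaper steps per pass over the data.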
The way I like to think of how SGD works is to imagine that I have one point that represents my input distribution. My model is attempting to learn that input distribution. Surrounding the input distribution is a shaded area that represents the input distributions of all of the possible minibatches I could sample. It's usually a fair assumption that the minibatch input distributions are close to the true input distribution. Batch gradient descent, at all steps, takes the steepest route to reach the true input distribution. SGD, on the other hand, chooses a random point within the shaded area, and takes the steepest route towards this point. At each iteration, though, it chooses a new point. The average of all of these steps will approximate the true input distribution, usually quite well.
Source: answer from Jason_L_Bens on Stack Exchange
As the other answer suggests, the main reason to use SGD is to reduce the computational cost of the gradient while still largely maintaining the gradient direction when averaged over many minibatches or samples; that certainly helps bring you to a local minimum.
- Why minibatch works.
The mathematics behind this is that the "true" gradient of the cost function (the gradient of the generalization error, or for an infinitely large sample set) is the expectation of the gradient over the true data-generating distribution $p_{data}$:

$$g = \mathbb{E}_{x \sim p_{data}}\left[\nabla_\theta L(x, \theta)\right]$$

The actual gradient computed over a batch of $m$ samples is always an approximation to the true gradient, taken instead over the empirical data distribution $\hat{p}_{data}$:

$$\hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta L(x^{(i)}, \theta)$$
Batch gradient descent can give you the best possible "optimal" gradient given all your data samples, but it is still not the "true" gradient. A smaller batch (i.e. a minibatch) is probably not as good an approximation as the full batch, but both are approximations, and so is the single-sample minibatch (SGD).
Assuming there is no dependence between the samples within one minibatch, the computed minibatch gradient $\hat{g}$ is an unbiased estimate of the true gradient. The squared standard error of the estimate is inversely proportional to the minibatch size $m$, i.e.

$$\mathrm{SE}(\hat{g}) \propto \frac{1}{\sqrt{m}}$$

That is, the reduction in standard error is only the square root of the increase in sample size. This means that if the minibatch size is small, the learning rate has to be small too, in order to achieve stability in spite of the larger variance. When the samples are not independent, the estimate is no longer unbiased; that is why you should shuffle the samples before training if they are not already in a sufficiently random order.
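The square-root scaling of the standard error can be checked numerically. This hypothetical setup stands in for per-sample gradients with noisy scalars; the population size, trial counts, and batch sizes are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)
# Stand-in "per-sample gradients": noisy scalars around a true mean of 1.0.
population = [1.0 + random.gauss(0, 1.0) for _ in range(100_000)]

def minibatch_mean(m):
    # One minibatch estimate of the true mean, using m random samples.
    return statistics.fmean(random.sample(population, m))

def std_error(m, trials=2000):
    # Empirical standard deviation of the minibatch estimate across many draws.
    return statistics.stdev(minibatch_mean(m) for _ in range(trials))

se_4, se_64 = std_error(4), std_error(64)
# A 16x larger minibatch should shrink the standard error by about sqrt(16) = 4x.
print(se_4 / se_64)   # ≈ 4
```

The ratio comes out near 4, matching the claim: to halve the noise in the gradient estimate you need four times as many samples per minibatch.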
- Why minibatch may work better.
Firstly, minibatch makes some learning problems go from technically intractable to tractable, owing to the reduced computational demand of the smaller batch size.
Secondly, a reduced batch size does not necessarily mean reduced gradient accuracy. The training samples may contain lots of noise, outliers, or biases. A randomly sampled minibatch may reflect the true data-generating distribution better (or no worse) than the original full batch. If some iterations of the minibatch gradient updates give you a better estimate, then overall the averaged result of one epoch can be better than the gradient computed from a full batch.
Thirdly, minibatch not only helps deal with unpleasant data samples, but also helps deal with an unpleasant cost function that has many local minima. As Jason_L_Bens mentions, some error manifolds can more easily trap a regular full-batch gradient in a local minimum than the temporarily random gradient computed with a minibatch.
Finally, with gradient descent, you are not reaching the global minimum in one step; you are iterating over the error manifold. The gradient largely gives you only the direction in which to iterate. With minibatch, you can iterate much faster. In many cases, the more iterations, the better the point you can reach. You do not really care whether the point is globally or even locally optimal. You just want to reach a reasonable model that gives you an acceptable generalization error. Minibatch makes that easier.
You may find that the book "Deep Learning" by Ian Goodfellow et al. has a pretty good discussion of this topic if you read through it carefully.
machine learning - Stochastic Gradient Descent(SGD) vs Mini-batch size 1 - Stack Overflow
Yes, I understand the textbook definition for what the difference is. The part that I am confused about is the actual math. Where does the actual batching or grouping take place? From what I understand, when doing backprop you need to use the Z values and Y values for an individual input. Ex. dC/dA in the last layer is (a[i] - y[i]) and dA/dZ = f'(Z) where f is the activation function. So how would grouping inputs affect any of this?
I've read that after computing each input's prediction for a given batch, you just "update the weights." What does that mean, though? Does it mean you have to store all the Z values and Y values and then perform backprop on each individual input's set of Z's and Y's? To then update the weights, would I take the average W(new) of all the individual backprops? If that's the case, then I don't see how that is any different from stochastic gradient descent. Is the time at which you update the weights the only thing that is different? Or am I missing something along the way?
For a quick simple explanation:
In both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters in an iterative manner to minimize an error function.
While in GD you have to run through ALL the samples in your training set to do a single update of a parameter in a particular iteration, in SGD, on the other hand, you use ONLY ONE training sample, or a SUBSET of them, to do the update in a particular iteration. If you use a SUBSET, it is called Minibatch Stochastic Gradient Descent.
Thus, if the number of training samples is large, in fact very large, then using gradient descent may take too long, because in every iteration, when you are updating the parameter values, you run through the complete training set. On the other hand, SGD is faster because it uses only one training sample, and it starts improving itself right away from the first sample.
SGD often converges much faster than GD, but the error function is not as well minimized as in the case of GD. In most cases, though, the close approximation to the parameter values that you get with SGD is enough, because the parameters reach the neighbourhood of the optimal values and keep oscillating there.
If you need an example of this with a practical case, check Andrew Ng's notes here, where he clearly shows you the steps involved in both cases. cs229-notes
Source: Quora Thread
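To make "update the weights" in the question above concrete: with a minibatch you accumulate the per-sample gradients (not the per-sample W(new) values) and apply their average in a single update. Here is a minimal sketch with a hypothetical one-weight "network" and squared loss; the data points and learning rate are invented for illustration.

```python
import math

def per_sample_grad(w, x, y):
    # Chain rule exactly as in the question: dC/dw = dC/dA * dA/dZ * dZ/dw.
    z = w * x
    a = math.tanh(z)            # activation f(z) = tanh(z)
    dC_da = a - y               # dC/dA for squared loss 0.5*(a - y)^2
    da_dz = 1 - a * a           # dA/dZ = f'(z) for tanh
    return dC_da * da_dz * x    # dZ/dw = x

batch = [(0.5, 0.3), (1.0, 0.9), (-0.5, -0.2)]  # hypothetical (x, y) pairs
w, lr = 0.1, 0.1

# Minibatch step: sum the per-sample GRADIENTS, then update the weight ONCE
# with their average. No per-sample W(new) is ever computed or averaged.
g = sum(per_sample_grad(w, x, y) for x, y in batch) / len(batch)
w_minibatch = w - lr * g

# Pure SGD instead updates after EVERY sample, so each later gradient is
# evaluated at already-moved weights.
w_sgd = w
for x, y in batch:
    w_sgd -= lr * per_sample_grad(w_sgd, x, y)

print(w_minibatch, w_sgd)   # similar direction, but not identical results
```

So the timing of the update is exactly what differs: SGD moves the weights between gradient evaluations, while the minibatch step evaluates every per-sample gradient at the same weights before applying one averaged update.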
The inclusion of the word "stochastic" simply means that, within the framework of gradient descent, random samples from the training data are chosen on each run to update the parameters during optimisation.
Doing so not only computes errors and updates weights in faster iterations (because we only process a small selection of samples in one go), it also often helps us move towards an optimum more quickly. Have a look at the answers here for more information on why using stochastic minibatches for training offers advantages.
One potential downside is that the path to the optimum (assuming it would always be the same optimum) can be much noisier. So instead of a nice smooth loss curve showing how the error decreases in each iteration of gradient descent, you might see something like this:

[Figure: training loss per epoch, decreasing overall but fluctuating noticeably from epoch to epoch]
We clearly see the loss decreasing over time; however, there are large variations from epoch to epoch (training batch to training batch), so the curve is noisy.
This is simply because we compute the mean error over our stochastically/randomly selected subset of the entire dataset in each iteration. Some samples will produce a high error, some a low one. So the average can vary, depending on which samples we happened to use for one iteration of gradient descent.
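That batch-to-batch scatter can be simulated directly. The per-sample error values below are made up (mostly small errors with a few large outliers mixed in), purely to show how the mean over a random subset wanders around the full-dataset mean.

```python
import random
import statistics

random.seed(1)
# Hypothetical per-sample errors for a fixed model: 990 ordinary samples
# plus 10 outliers with much larger error.
errors = [random.gauss(1.0, 0.3) for _ in range(990)] + [8.0] * 10

full_mean = statistics.fmean(errors)

# Mean error over several random minibatches of 32 samples each -- one value
# per simulated "iteration" of minibatch gradient descent.
minibatch_means = [statistics.fmean(random.sample(errors, 32)) for _ in range(5)]

print(full_mean)        # the smooth value full-batch GD would report
print(minibatch_means)  # scattered around full_mean, batch to batch
```

The scatter in `minibatch_means` is precisely the noise in the loss curve: batches that happen to include an outlier report a higher mean error, batches that miss them report a lower one.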