The applicability of batch or stochastic gradient descent really depends on the error manifold expected.
Batch gradient descent computes the gradient using the whole dataset. This is great for convex, or relatively smooth error manifolds. In this case, we move somewhat directly towards an optimum solution, either local or global. Additionally, batch gradient descent, given an annealed learning rate, will eventually find the minimum located in it's basin of attraction.
Stochastic gradient descent (SGD) computes the gradient using a single sample. Most applications of SGD actually use a minibatch of several samples, for reasons that will be explained a bit later. SGD works well (Not well, I suppose, but better than batch gradient descent) for error manifolds that have lots of local maxima/minima. In this case, the somewhat noisier gradient calculated using the reduced number of samples tends to jerk the model out of local minima into a region that hopefully is more optimal. Single samples are really noisy, while minibatches tend to average a little of the noise out. Thus, the amount of jerk is reduced when using minibatches. A good balance is struck when the minibatch size is small enough to avoid some of the poor local minima, but large enough that it doesn't avoid the global minima or better-performing local minima. (Incidently, this assumes that the best minima have a larger and deeper basin of attraction, and are therefore easier to fall into.)
One benefit of SGD is that it's computationally a whole lot faster. Large datasets often can't be held in RAM, which makes vectorization much less efficient. Rather, each sample or batch of samples must be loaded, worked with, the results stored, and so on. Minibatch SGD, on the other hand, is usually intentionally made small enough to be computationally tractable.
Usually, this computational advantage is leveraged by performing many more iterations of SGD, making many more steps than conventional batch gradient descent. This usually results in a model that is very close to that which would be found via batch gradient descent, or better.
The way I like to think of how SGD works is to imagine that I have one point that represents my input distribution. My model is attempting to learn that input distribution. Surrounding the input distribution is a shaded area that represents the input distributions of all of the possible minibatches I could sample. It's usually a fair assumption that the minibatch input distributions are close in proximity to the true input distribution. Batch gradient descent, at all steps, takes the steepest route to reach the true input distribution. SGD, on the other hand, chooses a random point within the shaded area, and takes the steepest route towards this point. At each iteration, though, it chooses a new point. The average of all of these steps will approximate the true input distribution, usually quite well.
Answer from Jason_L_Bens on Stack ExchangeVideos
The applicability of batch or stochastic gradient descent really depends on the error manifold expected.
Batch gradient descent computes the gradient using the whole dataset. This is great for convex, or relatively smooth error manifolds. In this case, we move somewhat directly towards an optimum solution, either local or global. Additionally, batch gradient descent, given an annealed learning rate, will eventually find the minimum located in it's basin of attraction.
Stochastic gradient descent (SGD) computes the gradient using a single sample. Most applications of SGD actually use a minibatch of several samples, for reasons that will be explained a bit later. SGD works well (Not well, I suppose, but better than batch gradient descent) for error manifolds that have lots of local maxima/minima. In this case, the somewhat noisier gradient calculated using the reduced number of samples tends to jerk the model out of local minima into a region that hopefully is more optimal. Single samples are really noisy, while minibatches tend to average a little of the noise out. Thus, the amount of jerk is reduced when using minibatches. A good balance is struck when the minibatch size is small enough to avoid some of the poor local minima, but large enough that it doesn't avoid the global minima or better-performing local minima. (Incidently, this assumes that the best minima have a larger and deeper basin of attraction, and are therefore easier to fall into.)
One benefit of SGD is that it's computationally a whole lot faster. Large datasets often can't be held in RAM, which makes vectorization much less efficient. Rather, each sample or batch of samples must be loaded, worked with, the results stored, and so on. Minibatch SGD, on the other hand, is usually intentionally made small enough to be computationally tractable.
Usually, this computational advantage is leveraged by performing many more iterations of SGD, making many more steps than conventional batch gradient descent. This usually results in a model that is very close to that which would be found via batch gradient descent, or better.
The way I like to think of how SGD works is to imagine that I have one point that represents my input distribution. My model is attempting to learn that input distribution. Surrounding the input distribution is a shaded area that represents the input distributions of all of the possible minibatches I could sample. It's usually a fair assumption that the minibatch input distributions are close in proximity to the true input distribution. Batch gradient descent, at all steps, takes the steepest route to reach the true input distribution. SGD, on the other hand, chooses a random point within the shaded area, and takes the steepest route towards this point. At each iteration, though, it chooses a new point. The average of all of these steps will approximate the true input distribution, usually quite well.
As other answer suggests, the main reason to use SGD is to reduce the computation cost of gradient while still largely maintaining the gradient direction when averaged over many mini-batches or samples - that surely helps bring you to the local minima.
- Why minibatch works.
The mathematics behind this is that, the "true" gradient of the cost function (the gradient for the generalization error or for infinitely large samples set) is the expectation of the gradient over the true data generating distribution $p_{data}$; the actual gradient
computed over a batch of samples is always an approximation to the true gradient with the empirical data distribution
.
Batch gradient descent can bring you the possible "optimal" gradient given all your data samples, it is not the "true" gradient though. A smaller batch (i.e. a minibatch) is probably not as optimal as the full batch, but they are both approximations - so is the single-sample minibatch (SGD).
Assuming there is no dependence between the samples in one minibatch, the computed
is an unbiased estimate of the true gradient. The (squared) standard errors of the estimates with different minibatch sizes is inversely proportional to the sizes of the minibatch. That is,
I.e., the reduction of standard error is the square root of the increase of sample size. This means, if the minibatch size is small, the learning rate has to be small too, in order to achieve stability over the big variance. When the samples are not independent, the property of unbiased estimate is no longer maintained. That requires you to shuffle the samples before the training, if the samples are sequenced not randomly enough.
- Why minibatch may work better.
Firstly, minibatch makes some learning problems from technically intractable to be tractable due to the reduced computation demand with smaller batch size.
Secondly, reduced batch size does not necessarily mean reduced gradient accuracy. The training samples many have lots of noises or outliers or biases. A randomly sampled minibatch may reflect the true data generating distribution better (or no worse) than the original full batch. If some iterations of the minibatch gradient updates give you a better estimation, overall the averaged result of one epoch can be better than the gradient computed from a full batch.
Thirdly, minibatch does not only help deal with unpleasant data samples, but also help deal with unpleasant cost function that has many local minima. As Jason_L_Bens mentions, sometimes the error manifolds may be easier to trap a regular gradient into a local minima, while more difficult to trap the temporarily random gradient computed with minibatch.
Finally, with gradient descent, you are not reaching the global minima in one step, but iterating on the error manifold. Gradient largely gives you only the direction to iterate. With minibatch, you can iterate much faster. In many cases, the more iterations, the better point you can reach. You do not really care at all weather the point is optimal globally or even locally. You just want to reach a reasonable model that brings you acceptable generalization error. Minibatch makes that easier.
You may find the book "Deep learning" by Ian Goodfellow, et al, has pretty good discussions on this topic if you read through it carefully.