Yes, you are right. In Keras, batch_size refers to the batch size in mini-batch gradient descent. If you want to run batch gradient descent, you need to set batch_size to the number of training samples. Your code looks fine, except that I don't understand why you assign the return value of model.fit to an object called history.
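For reference, model.fit returns a Keras History object whose .history dict records per-epoch metrics, which is the usual reason for assigning it. To see the "one update per epoch" behaviour of batch gradient descent without any framework, here is a minimal pure-NumPy sketch on a hypothetical toy linear model (all data and variable names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))          # hypothetical toy dataset
y = 3.0 * X[:, 0] + 1.0                # true weight 3, true bias 1

w, b, lr = 0.0, 0.0, 0.1
updates = 0
for epoch in range(200):
    pred = w * X[:, 0] + b
    err = pred - y
    w -= lr * (err * X[:, 0]).mean()   # gradient over the WHOLE dataset
    b -= lr * err.mean()
    updates += 1                       # batch GD: exactly one update per epoch

print(updates)                         # 200, one per epoch
print(round(w, 2), round(b, 2))        # converges to roughly 3.0 and 1.0
```

In Keras, setting batch_size=len(x_train) in model.fit reproduces exactly this update schedule.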
Theoretical considerations aside, given a real-life dataset and the size of a typical modern neural network, it would usually take unreasonably long to train on batches of size one, and you won't have enough RAM and/or GPU memory to train on the whole dataset at once. So the question is usually not "whether" mini-batches should be used, but "what size" of batches to use. The batch_size argument is the number of observations to train on in a single step; smaller sizes often work better because they have a regularizing effect. Moreover, people often use more sophisticated optimizers (e.g. Adam, RMSprop) and other regularization tricks, which makes the relation between model performance, batch size, learning rate, and computation time more complicated.
There are actually three (3) cases:

- batch_size = 1 means indeed stochastic gradient descent (SGD)
- A batch_size equal to the whole of the training data is (batch) gradient descent (GD)
- Intermediate cases (which are actually used in practice) are usually referred to as mini-batch gradient descent
See A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size for more details and references. Truth is, in practice, when we say "SGD" we usually mean "mini-batch SGD".
These definitions are in fact fully compliant with what you report from your experiments:
- With batch_size = len(train_data) (GD case), only one update is indeed expected per epoch (since there is only one batch), hence the 1/1 indication in Keras output.
- In contrast, with batch_size = 1 (SGD case), you expect as many updates as samples in your training data (since this is now the number of your batches), i.e. 90000, hence the 90000/90000 indication in Keras output.
In other words, the number of updates per epoch (which Keras indicates) equals the number of batches used, not the batch size.
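The counts above can be checked with one line of arithmetic (the 90000-sample figure comes from the question; the helper name is made up):

```python
import math

def updates_per_epoch(n_samples, batch_size):
    # Keras reports one step per batch; the last batch may be partial,
    # hence the ceiling division
    return math.ceil(n_samples / batch_size)

print(updates_per_epoch(90000, 90000))  # 1     (GD case, the "1/1" log line)
print(updates_per_epoch(90000, 1))      # 90000 (SGD case, "90000/90000")
print(updates_per_epoch(90000, 32))     # 2813  (mini-batch case)
```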
batch_size is the number of samples used to compute each update.
Here, batch_size=1 means the size of each update is 1 sample. By your definitions, this would be SGD.
If you have batch_size=len(train_data), that means that each update to your weights will require the resulting gradient from your entire dataset. This is actually just good old gradient descent.
Mini-batch gradient descent is somewhere in the middle, where the batch_size isn't 1 and the batch size isn't your entire training dataset. Take 32, for example. Mini-batch gradient descent would update your weights every 32 examples, so it smooths out the ruggedness of SGD with just 1 example (where outliers may have a lot of impact) and yet retains the benefits that SGD has over regular gradient descent.
Although the other answer basically already gives you the correct result, I would like to clarify a few points you made in your post, and correct them.
The (commonly accepted) definitions of the different terms are as follows.
- Gradient Descent (GD): Iterative method to find a (local or global) optimum in your function. Default Gradient Descent will go through all examples (one epoch), then update once.
- Stochastic Gradient Descent (SGD): Unlike regular GD, it will go through one example, then immediately update. This way, you get a way higher update rate.
- Mini Batching: Since the frequent updates of SGD are quite costly (updating the gradients is kind of tedious to perform), and can lead to worse results in certain circumstances, it is helpful to aggregate multiple (but not all) examples into one update. This means, you would go through n examples (where n is your batch size), and then update. This will still result in multiple updates within one epoch, but not necessarily as many as with SGD.
- Epoch: One epoch simply refers to a pass through all of your training data. You can generally perform as many epochs as you would like.
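The definitions above differ only in how many examples feed each update. A plain NumPy sketch of the mini-batch variant (toy linear data and all names here are hypothetical) makes the epoch/update distinction concrete:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(96, 1))           # hypothetical toy dataset
y = 2.0 * X[:, 0]                      # true weight is 2

w, lr, batch_size = 0.0, 0.1, 32
n_updates = 0
for epoch in range(50):                # one epoch = one full pass over X
    idx = rng.permutation(len(X))      # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        err = w * X[batch, 0] - y[batch]
        w -= lr * (err * X[batch, 0]).mean()   # one update per mini-batch
        n_updates += 1

# 96 samples / batch_size 32 -> 3 updates per epoch, 150 over 50 epochs
print(n_updates)
```

Setting batch_size to 1 or to len(X) in the same loop would give the SGD and GD cases, respectively.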
On another note, you are correct about Adam. It is generally seen as a more powerful variant of vanilla gradient descent, since it uses adaptive estimates of the first and second moments of the gradients to speed up and stabilize convergence.
Your understanding of epoch and batch_size seems correct.
Little more precision below.
An epoch corresponds to one whole training dataset sweep. This sweep can be performed in several ways.
- Batch mode: Gradient of loss over the whole training dataset is used to update model weights. One optimisation iteration corresponds to one epoch.
- Stochastic mode: Gradient of loss over one training dataset point is used to update model weights. If there are N examples in the training dataset, N optimisation iterations correspond to one epoch.
- Mini-batch mode: Gradient of loss over a small sample of points from the training dataset is used to update model weights. The sample is of size batch_size. If there are N_examples examples in the training dataset, N_examples/batch_size optimisation iterations correspond to one epoch.
In your case (epochs=100, batch_size=32), the regressor would sweep the whole dataset 100 times, with mini-batches of size 32 (i.e. mini-batch mode).
If I assume your dataset size is N_examples, the regressor would perform N_examples/32 model weight optimisation iterations per epoch.
So for 100 epochs: 100*N_examples/32 model weight optimisation iterations.
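Plugging hypothetical dataset sizes into that formula makes the arithmetic concrete (partial last batches are rounded up, as Keras does):

```python
import math

epochs, batch_size = 100, 32
for n_examples in (3200, 10000):       # hypothetical dataset sizes
    per_epoch = math.ceil(n_examples / batch_size)   # updates per epoch
    total = epochs * per_epoch                       # updates over all epochs
    print(n_examples, per_epoch, total)
# 3200  -> 100 updates/epoch, 10000 in total
# 10000 -> 313 updates/epoch, 31300 in total
```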
All in all, having epoch>1 and having batch_size>1 are compatible.

