Yes, you are right. In Keras, batch_size refers to the batch size in mini-batch gradient descent. If you want to run batch gradient descent, you need to set batch_size to the number of training samples. Your code looks fine, except that I don't understand why you assign the return value of model.fit to a variable named history.

Answer from Ernest S Kirubakaran on Stack Exchange
Top answer (1 of 1, score 8)

This happens for two reasons:

  • First, when the data is not shuffled, the train/validation split is inappropriate.
  • Second, full gradient descent performs a single update per epoch, so more training epochs might be required to converge.

Why doesn't your model match the wave?

From the Keras documentation for model.fit:

  • validation_split: Float between 0 and 1. Fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch. The validation data is selected from the last samples in the x and y data provided, before shuffling.

Which means that your validation set consists of the last 20% training samples. Because you are using a log scale for your independent variable (x_train), it turns out that your train/validation split is:

import matplotlib.pyplot as plt  # x_train, y_train and N = len(x_train) come from the question's code

split_point = int(0.2*N)
x_val = x_train[-split_point:]        # last 20% of the samples, exactly what validation_split takes
y_val = y_train[-split_point:]
x_train_ = x_train[:-split_point]
y_train_ = y_train[:-split_point]
plt.scatter(x_train_, y_train_, c='g')  # training portion in green
plt.scatter(x_val, y_val, c='r')        # validation portion in red
plt.show()

In the previous plot, training and validation data are represented by green and red points, respectively. Note that your training dataset is not representative of the whole population.


Why does it still not match the training dataset?

In addition to an inappropriate train/test split, full gradient descent might require more training epochs to converge (the gradients are less noisy, but it only performs a single gradient update per epoch). If, instead, you train your model for ~1500 epochs (or use mini-batch gradient descent with a batch size of, say, 32), you end up getting:
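The effect of "more updates per epoch" on convergence can be seen without Keras at all. Below is a minimal pure-Python sketch (stdlib only; the model, data, and learning rate are all made up for illustration, not taken from the question): fitting y = 2x with the same number of epochs, full-batch gradient descent performs one update per epoch while batch size 32 performs eight, so the mini-batch run gets much closer to the true weight.

```python
import random

def fit_linear(batch_size, epochs, lr=0.01, n=256, seed=0):
    """Fit y = w*x (true w = 2.0) by gradient descent on mean squared error."""
    rng = random.Random(seed)
    data = [(i / n, 2.0 * i / n) for i in range(n)]
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, n, batch_size):
            batch = data[start:start + batch_size]
            # d/dw of mean((w*x - y)^2) over the batch
            grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w

w_full = fit_linear(batch_size=256, epochs=20)  # full-batch GD: 1 update per epoch
w_mini = fit_linear(batch_size=32, epochs=20)   # mini-batch: 8 updates per epoch
print(w_full, w_mini)  # mini-batch ends up much closer to the true w = 2.0
```

The full-batch run would need many more epochs (on the order of the ~1500 mentioned above) to catch up, which is exactly the trade-off described in the answer.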

Top answer (1 of 2, score 3)

There are actually three (3) cases:

  • batch_size = 1 means indeed stochastic gradient descent (SGD)
  • A batch_size equal to the whole of the training data is (batch) gradient descent (GD)
  • Intermediate cases (which are actually used in practice) are usually referred to as mini-batch gradient descent

See A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size for more details and references. Truth is, in practice, when we say "SGD" we usually mean "mini-batch SGD".

These definitions are in fact fully compliant with what you report from your experiments:

  • With batch_size=len(train_data) (GD case), only one update is indeed expected per epoch (since there is only one batch), hence the 1/1 indication in Keras output.

  • In contrast, with batch_size = 1 (SGD case), you expect as many updates as samples in your training data (since this is now the number of your batches), i.e. 90000, hence the 90000/90000 indication in Keras output.

i.e. the number of updates per epoch (which Keras indicates) is equal to the number of batches used (and not to the batch size).
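That relationship can be checked with a couple of lines of stdlib Python; the batch size 32 row is just an illustrative extra, not from the question:

```python
import math

def updates_per_epoch(n_samples, batch_size):
    # Keras' progress bar shows ceil(n_samples / batch_size) steps per epoch,
    # i.e. the number of batches, not the batch size
    return math.ceil(n_samples / batch_size)

print(updates_per_epoch(90000, 90000))  # 1     -> the "1/1" full-batch GD case
print(updates_per_epoch(90000, 1))      # 90000 -> the "90000/90000" SGD case
print(updates_per_epoch(90000, 32))     # 2813  -> a typical mini-batch setting
```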

Answer 2 of 2 (score 0)

batch_size determines how many samples each weight update is computed from.

Here, batch_size=1 means each update is computed from a single sample. By your definitions, this would be SGD.

If you have batch_size=len(train_data), that means that each update to your weights will require the resulting gradient from your entire dataset. This is actually just good old gradient descent.

Mini-batch gradient descent is somewhere in the middle, where batch_size is neither 1 nor the size of your entire training dataset. Take 32, for example: mini-batch gradient descent would update your weights every 32 examples, so it smooths out the ruggedness of SGD with just 1 example (where outliers may have a lot of impact) and yet retains the benefits that SGD has over regular batch gradient descent.
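That smoothing effect is just averaging at work. A small stdlib-only sketch (the per-sample "gradients" are simulated here as a true value plus Gaussian noise, an assumption for illustration): averaging over batches of 32 shrinks the spread of the gradient estimate by roughly a factor of √32.

```python
import random
import statistics

rng = random.Random(0)
# Toy per-sample "gradients": true gradient 1.0 plus unit Gaussian noise
samples = [1.0 + rng.gauss(0, 1) for _ in range(10000)]

# SGD-style estimates: one sample per update
sgd_est = samples
# Mini-batch estimates: the mean over each batch of 32 samples
mb_est = [statistics.mean(samples[i:i + 32]) for i in range(0, len(samples), 32)]

print(statistics.stdev(sgd_est))  # roughly 1.0
print(statistics.stdev(mb_est))   # roughly 1/sqrt(32), about 0.18
```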

Top answer (1 of 3, score 3)

Although the other answer basically already gives you the correct result, I would like to clarify on a few points you made in your post, and correct it.
The (commonly accepted) definitions of the different terms are as follows.

  • Gradient Descent (GD): Iterative method to find a (local or global) optimum in your function. Default Gradient Descent will go through all examples (one epoch), then update once.
  • Stochastic Gradient Descent (SGD): Unlike regular GD, it goes through one example, then immediately updates. This way, you get a much higher update rate.
  • Mini Batching: Since the frequent updates of SGD are quite costly (computing and applying a gradient update for every single example is expensive), and can lead to worse results in certain circumstances, it is helpful to aggregate multiple (but not all) examples into one update. This means you go through n examples (where n is your batch size) and then update. This still results in multiple updates within one epoch, but not necessarily as many as with SGD.
  • Epoch: One epoch simply refers to a pass through all of your training data. You can generally perform as many epochs as you would like.
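The four definitions above collapse into a single loop that only differs in batch size. A toy pure-Python sketch (this is not Keras internals, just the shape of the idea; the update callback is a stand-in for a real gradient step):

```python
def run_epoch(data, batch_size, update):
    """One epoch = one pass over all training data, in chunks of batch_size.
    batch_size == len(data) -> GD (one update per epoch);
    batch_size == 1         -> SGD (one update per example);
    anything in between     -> mini-batch."""
    n_updates = 0
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        update(batch)   # compute the gradient on `batch` and step once
        n_updates += 1
    return n_updates

data = list(range(100))
n_gd = run_epoch(data, 100, lambda b: None)
n_sgd = run_epoch(data, 1, lambda b: None)
n_mb = run_epoch(data, 32, lambda b: None)
print(n_gd, n_sgd, n_mb)  # 1 100 4
```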

On another note, you are correct about Adam. It is generally seen as a more powerful variant of vanilla gradient descent, since it uses more sophisticated heuristics (running estimates of the first and second moments of the gradients) to speed up and stabilize convergence.

Answer 2 of 3 (score 2)

Your understanding of epoch and batch_size seems correct.

Little more precision below.

An epoch corresponds to one whole training dataset sweep. This sweep can be performed in several ways.

  • Batch mode: Gradient of loss over the whole training dataset is used to update model weights. One optimisation iteration corresponds to one epoch.
  • Stochastic mode: Gradient of loss over one training dataset point is used to update model weights. If there are N examples in the training dataset, N optimisation iterations correspond to one epoch.
  • Mini-batch mode: Gradient of loss over a small sample of points from the training dataset is used to update model weights. The sample is of size batch_size. If there are N_examples examples in the training dataset, N_examples/batch_size optimisation iterations correspond to one epoch.

In your case (epochs=100, batch_size=32), the regressor would sweep the whole dataset 100 times, with mini-batches of size 32 (i.e. mini-batch mode).

Assuming your dataset contains N_examples examples, the regressor would perform N_examples/32 model weight optimisation iterations per epoch.

So for 100 epochs: 100*N_examples/32 model weight optimisation iterations.

All in all, having epoch>1 and having batch_size>1 are compatible.
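As a sanity check on that arithmetic (N_examples = 6400 here is a made-up dataset size, purely for illustration):

```python
import math

n_examples = 6400   # hypothetical dataset size
batch_size = 32
epochs = 100

# one epoch = ceil(n_examples / batch_size) weight updates in mini-batch mode
updates_per_epoch = math.ceil(n_examples / batch_size)
total_updates = epochs * updates_per_epoch
print(updates_per_epoch, total_updates)  # 200 updates/epoch, 20000 in total
```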
