Yes, you are right. In Keras, batch_size refers to the batch size in mini-batch gradient descent. If you want to run batch gradient descent, you need to set batch_size to the number of training samples. Your code looks fine, except that I don't understand why you assign the return value of model.fit to a variable named history.

Answer from Ernest S Kirubakaran on Stack Exchange
Top answer (1 of 1, score 8)

This happens for two reasons:

  • First, when the data is not shuffled, the train/validation split is inappropriate.
  • Second, full gradient descent performs a single update per epoch, so more training epochs might be required to converge.

Why doesn't your model match the wave?

From the Keras documentation for model.fit:

  • validation_split: Float between 0 and 1. Fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch. The validation data is selected from the last samples in the x and y data provided, before shuffling.

Which means that your validation set consists of the last 20% training samples. Because you are using a log scale for your independent variable (x_train), it turns out that your train/validation split is:

import matplotlib.pyplot as plt  # x_train, y_train and N = len(x_train) come from the question's code

split_point = int(0.2*N)
x_val = x_train[-split_point:]        # last 20% of the samples, exactly what validation_split takes
y_val = y_train[-split_point:]
x_train_ = x_train[:-split_point]
y_train_ = y_train[:-split_point]
plt.scatter(x_train_, y_train_, c='g')  # training portion in green
plt.scatter(x_val, y_val, c='r')        # validation portion in red
plt.show()

In the previous plot, training and validation data are represented by green and red points, respectively. Note that your training dataset is not representative of the whole population.


Why does it still not match the training dataset?

In addition to an inappropriate train/test split, full gradient descent might require more training epochs to converge (the gradients are less noisy, but it only performs a single gradient update per epoch). If, instead, you train your model for ~1500 epochs (or use mini-batch gradient descent with a batch size of, say, 32), you end up getting:
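The effect of "more updates per epoch" on convergence can be seen without Keras at all. Below is a minimal pure-Python sketch (stdlib only; the model, data, and learning rate are all made up for illustration, not taken from the question): fitting y = 2x with the same number of epochs, full-batch gradient descent performs one update per epoch while batch size 32 performs eight, so the mini-batch run gets much closer to the true weight.

```python
import random

def fit_linear(batch_size, epochs, lr=0.01, n=256, seed=0):
    """Fit y = w*x (true w = 2.0) by gradient descent on mean squared error."""
    rng = random.Random(seed)
    data = [(i / n, 2.0 * i / n) for i in range(n)]
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, n, batch_size):
            batch = data[start:start + batch_size]
            # d/dw of mean((w*x - y)^2) over the batch
            grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w

w_full = fit_linear(batch_size=256, epochs=20)  # full-batch GD: 1 update per epoch
w_mini = fit_linear(batch_size=32, epochs=20)   # mini-batch: 8 updates per epoch
print(w_full, w_mini)  # mini-batch ends up much closer to the true w = 2.0
```

The full-batch run would need many more epochs (on the order of the ~1500 mentioned above) to catch up, which is exactly the trade-off described in the answer.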

Top answer (1 of 2, score 3)

There are actually three (3) cases:

  • batch_size = 1 means indeed stochastic gradient descent (SGD)
  • A batch_size equal to the whole of the training data is (batch) gradient descent (GD)
  • Intermediate cases (which are actually used in practice) are usually referred to as mini-batch gradient descent

See A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size for more details and references. Truth is, in practice, when we say "SGD" we usually mean "mini-batch SGD".

These definitions are in fact fully compliant with what you report from your experiments:

  • With batch_size=len(train_data) (GD case), only one update is indeed expected per epoch (since there is only one batch), hence the 1/1 indication in Keras output.

  • In contrast, with batch_size = 1 (SGD case), you expect as many updates as samples in your training data (since this is now the number of your batches), i.e. 90000, hence the 90000/90000 indication in Keras output.

i.e. the number of updates per epoch (which Keras indicates) is equal to the number of batches used (and not to the batch size).
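That relationship can be checked with a couple of lines of stdlib Python; the batch size 32 row is just an illustrative extra, not from the question:

```python
import math

def updates_per_epoch(n_samples, batch_size):
    # Keras' progress bar shows ceil(n_samples / batch_size) steps per epoch,
    # i.e. the number of batches, not the batch size
    return math.ceil(n_samples / batch_size)

print(updates_per_epoch(90000, 90000))  # 1     -> the "1/1" full-batch GD case
print(updates_per_epoch(90000, 1))      # 90000 -> the "90000/90000" SGD case
print(updates_per_epoch(90000, 32))     # 2813  -> a typical mini-batch setting
```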

Answer 2 of 2 (score 0)

batch_size determines how many samples each weight update is computed from.

Here, batch_size=1 means each update is computed from a single sample. By your definitions, this would be SGD.

If you have batch_size=len(train_data), that means that each update to your weights will require the resulting gradient from your entire dataset. This is actually just good old gradient descent.

Mini-batch gradient descent is somewhere in the middle, where batch_size is neither 1 nor the size of your entire training dataset. Take 32, for example: mini-batch gradient descent would update your weights every 32 examples, so it smooths out the ruggedness of SGD with just 1 example (where outliers may have a lot of impact) and yet retains the benefits that SGD has over regular batch gradient descent.
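That smoothing effect is just averaging at work. A small stdlib-only sketch (the per-sample "gradients" are simulated here as a true value plus Gaussian noise, an assumption for illustration): averaging over batches of 32 shrinks the spread of the gradient estimate by roughly a factor of √32.

```python
import random
import statistics

rng = random.Random(0)
# Toy per-sample "gradients": true gradient 1.0 plus unit Gaussian noise
samples = [1.0 + rng.gauss(0, 1) for _ in range(10000)]

# SGD-style estimates: one sample per update
sgd_est = samples
# Mini-batch estimates: the mean over each batch of 32 samples
mb_est = [statistics.mean(samples[i:i + 32]) for i in range(0, len(samples), 32)]

print(statistics.stdev(sgd_est))  # roughly 1.0
print(statistics.stdev(mb_est))   # roughly 1/sqrt(32), about 0.18
```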

Top answer (1 of 3, score 3)

Although the other answer basically already gives you the correct result, I would like to clarify on a few points you made in your post, and correct it.
The (commonly accepted) definitions of the different terms are as follows.

  • Gradient Descent (GD): Iterative method to find a (local or global) optimum in your function. Default Gradient Descent will go through all examples (one epoch), then update once.
  • Stochastic Gradient Descent (SGD): Unlike regular GD, it goes through one example, then immediately updates. This way, you get a much higher update rate.
  • Mini Batching: Since the frequent updates of SGD are quite costly (computing and applying a gradient update for every single example is expensive), and can lead to worse results in certain circumstances, it is helpful to aggregate multiple (but not all) examples into one update. This means you go through n examples (where n is your batch size) and then update. This still results in multiple updates within one epoch, but not necessarily as many as with SGD.
  • Epoch: One epoch simply refers to a pass through all of your training data. You can generally perform as many epochs as you would like.
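The four definitions above collapse into a single loop that only differs in batch size. A toy pure-Python sketch (this is not Keras internals, just the shape of the idea; the update callback is a stand-in for a real gradient step):

```python
def run_epoch(data, batch_size, update):
    """One epoch = one pass over all training data, in chunks of batch_size.
    batch_size == len(data) -> GD (one update per epoch);
    batch_size == 1         -> SGD (one update per example);
    anything in between     -> mini-batch."""
    n_updates = 0
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        update(batch)   # compute the gradient on `batch` and step once
        n_updates += 1
    return n_updates

data = list(range(100))
n_gd = run_epoch(data, 100, lambda b: None)
n_sgd = run_epoch(data, 1, lambda b: None)
n_mb = run_epoch(data, 32, lambda b: None)
print(n_gd, n_sgd, n_mb)  # 1 100 4
```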

On another note, you are correct about Adam. It is generally seen as a more powerful variant of vanilla gradient descent, since it uses more sophisticated heuristics (running estimates of the first and second moments of the gradients) to speed up and stabilize convergence.

Answer 2 of 3 (score 2)

Your understanding of epoch and batch_size seems correct.

Little more precision below.

An epoch corresponds to one whole training dataset sweep. This sweep can be performed in several ways.

  • Batch mode: Gradient of loss over the whole training dataset is used to update model weights. One optimisation iteration corresponds to one epoch.
  • Stochastic mode: Gradient of loss over one training dataset point is used to update model weights. If there are N examples in the training dataset, N optimisation iterations correspond to one epoch.
  • Mini-batch mode: Gradient of loss over a small sample of points from the training dataset is used to update model weights. The sample is of size batch_size. If there are N_examples examples in the training dataset, N_examples/batch_size optimisation iterations correspond to one epoch.

In your case (epochs=100, batch_size=32), the regressor would sweep the whole dataset 100 times, with mini-batches of size 32 (i.e. mini-batch mode).

Assuming your dataset contains N_examples examples, the regressor would perform N_examples/32 model weight optimisation iterations per epoch.

So for 100 epochs: 100*N_examples/32 model weight optimisation iterations.

All in all, having epoch>1 and having batch_size>1 are compatible.
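As a sanity check on that arithmetic (N_examples = 6400 here is a made-up dataset size, purely for illustration):

```python
import math

n_examples = 6400   # hypothetical dataset size
batch_size = 32
epochs = 100

# one epoch = ceil(n_examples / batch_size) weight updates in mini-batch mode
updates_per_epoch = math.ceil(n_examples / batch_size)
total_updates = epochs * updates_per_epoch
print(updates_per_epoch, total_updates)  # 200 updates/epoch, 20000 in total
```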
