batch gradient descent example

Batch gradient descent versus stochastic gradient descent

stats.stackexchange.com › questions › 49528 › batch-gradient-descent-versus-stochastic-gradient-descent

The applicability of batch or stochastic gradient descent really depends on the error manifold expected.

Batch gradient descent computes the gradient using the whole dataset. This is great for convex or relatively smooth error manifolds. In this case, we move somewhat directly towards an optimum solution, either local or global. Additionally, batch gradient descent, given an annealed learning rate, will eventually find the minimum located in its basin of attraction.
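To make the "basin of attraction" point concrete, here is a minimal sketch rather than anything from the answer itself. The toy loss `f(w) = w**4 - 3*w**2 + w`, the decay schedule `lr0 / (1 + decay * t)`, and the starting points are all assumptions chosen for illustration. Because the full gradient is deterministic, the starting point alone decides which minimum is reached.

```python
def f_prime(w):
    """Derivative of the toy non-convex loss f(w) = w**4 - 3*w**2 + w."""
    return 4 * w**3 - 6 * w + 1


def descend(w, lr0=0.05, decay=0.01, steps=500):
    """Plain gradient descent with an annealed (decaying) learning rate."""
    for t in range(steps):
        lr = lr0 / (1 + decay * t)  # hypothetical annealing schedule
        w -= lr * f_prime(w)
    return w


# Two starting points on either side of the local maximum near w = 0.17.
left = descend(-0.5)   # settles in the left basin, near w = -1.30
right = descend(0.5)   # settles in the right basin, near w = 1.13
print(left, right)
```

The left minimum is the deeper of the two for this particular loss, so the run started at `0.5` ends in the worse minimum and has no mechanism for leaving it.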

Stochastic gradient descent (SGD) computes the gradient using a single sample. Most applications of SGD actually use a minibatch of several samples, for reasons that will be explained a bit later. SGD works well (not well, I suppose, but better than batch gradient descent) for error manifolds that have lots of local maxima/minima. In this case, the somewhat noisier gradient calculated using the reduced number of samples tends to jerk the model out of local minima into a region that hopefully is more optimal. Single samples are really noisy, while minibatches tend to average a little of the noise out. Thus, the amount of jerk is reduced when using minibatches. A good balance is struck when the minibatch size is small enough to avoid some of the poor local minima, but large enough that it doesn't avoid the global minimum or better-performing local minima. (Incidentally, this assumes that the best minima have a larger and deeper basin of attraction, and are therefore easier to fall into.)

One benefit of SGD is that each update is computationally a whole lot cheaper. Large datasets often can't be held in RAM, which makes vectorization over the full dataset much less efficient. Rather, each sample or batch of samples must be loaded, worked with, the results stored, and so on. Minibatch SGD, on the other hand, is usually intentionally made small enough to be computationally tractable.
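The "load a batch, work with it, move on" pattern can be sketched as follows. This is an illustration under stated assumptions: `stream_minibatches` is a hypothetical helper, the two-column CSV layout is made up, and an in-memory `io.StringIO` stands in for a file too large to load at once.

```python
import csv
import io
from itertools import islice


def stream_minibatches(csv_file, batch_size):
    """Yield lists of (x, y) float pairs, reading only batch_size rows at a time."""
    reader = csv.reader(csv_file)
    while True:
        rows = list(islice(reader, batch_size))
        if not rows:
            return
        yield [(float(x), float(y)) for x, y in rows]


# Stand-in for a large file on disk (hypothetical data following y = 2x + 1).
fake_file = io.StringIO("0,1\n1,3\n2,5\n3,7\n4,9\n")

w, b, lr = 0.0, 0.0, 0.02
batch_sizes = []
for batch in stream_minibatches(fake_file, batch_size=2):
    batch_sizes.append(len(batch))
    # Mean squared error gradient computed on this minibatch only.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
    w -= lr * grad_w
    b -= lr * grad_b

print(batch_sizes)
```

Only one minibatch is held in memory at any time, and the last batch is simply smaller when the row count is not a multiple of the batch size. A single pass like this will not converge; in practice the file would be re-read for many epochs.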

Usually, this computational advantage is leveraged by performing many more iterations of SGD, making many more steps than conventional batch gradient descent. This usually results in a model that is very close to that which would be found via batch gradient descent, or better.

The way I like to think of how SGD works is to imagine that I have one point that represents my input distribution. My model is attempting to learn that input distribution. Surrounding the input distribution is a shaded area that represents the input distributions of all of the possible minibatches I could sample. It's usually a fair assumption that the minibatch input distributions are close in proximity to the true input distribution. Batch gradient descent, at all steps, takes the steepest route to reach the true input distribution. SGD, on the other hand, chooses a random point within the shaded area, and takes the steepest route towards this point. At each iteration, though, it chooses a new point. The average of all of these steps will approximate the true input distribution, usually quite well.
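The "average of all of these steps" intuition can be checked numerically in a simplified setting. The sketch below assumes a one-parameter model `y ~ w * x`, made-up noisy data, and gradients all evaluated at the same fixed `w`. In real SGD the parameters move between steps, so the agreement is only approximate there.

```python
import random


def mse_gradient(w, points):
    """Gradient of mean squared error for the one-parameter model y ~ w * x."""
    return sum(2 * (w * x - y) * x for x, y in points) / len(points)


rng = random.Random(0)

# Hypothetical noisy data scattered around y = 3x.
data = []
for _ in range(12):
    x = rng.uniform(-2, 2)
    data.append((x, 3 * x + rng.gauss(0, 1)))

w = 0.5  # fixed parameter value at which every gradient is evaluated
full_gradient = mse_gradient(w, data)

# Split the shuffled data into equal-size minibatches and average their gradients.
rng.shuffle(data)
batch_size = 3
minibatch_gradients = [
    mse_gradient(w, data[i:i + batch_size])
    for i in range(0, len(data), batch_size)
]
average_gradient = sum(minibatch_gradients) / len(minibatch_gradients)

print(full_gradient, average_gradient, minibatch_gradients)
```

Individual minibatch gradients scatter around the full-batch value, but with equal-size batches their mean matches it up to floating-point rounding.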

Answer from Jason_L_Bens on Stack Exchange

Deepgram

deepgram.com › ai-glossary › batch-gradient-descent

Batch Gradient Descent

The Foundation: Batch Gradient Descent (BGD) is an iterative optimization algorithm central to machine learning, designed to minimize the cost function (a measure of prediction error) in models.

All-encompassing Approach: Unlike its counterparts, BGD leverages the entire dataset to calculate the gradient of the cost function, ensuring each step is informed by a comprehensive view of the data landscape.

Steady Convergence: By considering all training examples ...

Zilliz

zilliz.com › glossary › batch-gradient-descent

Batch Gradient Descent Explained

Choosing the right batch size is critical to balancing compute and model quality. In mini-batch gradient descent we typically choose batch sizes that are powers of 2, such as 32 or 64.
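One way to see the compute side of this trade-off is to count parameter updates per pass over the data. The sketch below is only arithmetic: `updates_per_epoch` is a hypothetical helper and the dataset size of 70,000 is an assumed figure.

```python
import math


def updates_per_epoch(n_samples, batch_size):
    """Number of parameter updates in one full pass over the data."""
    return math.ceil(n_samples / batch_size)


n_samples = 70_000  # assumed dataset size, for illustration only
counts = {bs: updates_per_epoch(n_samples, bs) for bs in (1, 32, 64, n_samples)}
for batch_size, n_updates in counts.items():
    print(f"batch size {batch_size:>6}: {n_updates:>6} updates per epoch")
```

A batch size of 1 (pure SGD) gives 70,000 noisy updates per epoch, 32 gives 2,188, 64 gives 1,094, and the full batch gives a single precise update.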

medium.com › @jaleeladejumo › gradient-descent-from-scratch-batch-gradient-descent-stochastic-gradient-descent-and-mini-batch-def681187473

Gradient Descent From Scratch- Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent. | by Jaleel Adejumo | Medium

April 12, 2023 - In this article, I will take you through the implementation of Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent from scratch in Python. This will be beginner-friendly. Understanding the gradient descent method will help you optimise your loss during ML model training.

Kenndanielso

kenndanielso.github.io › mlrefined › blog_posts › 13_Multilayer_perceptrons › 13_6_Stochastic_and_minibatch_gradient_descent.html

13.6 Stochastic and mini-batch gradient descent

We now run batch and mini-batch gradient descent with the same initialization and fixed steplength for $100$ iterations and plot the cost function evaluation as well as the number of misclassifications for each iteration. ... In this Example we compare $40$ iterations of batch and mini-batch (batch size $= 500$) gradient descent using the multi-class perceptron cost on the MNIST dataset consisting of $P = 70{,}000$ images of hand-written digits 0-9.

Spot Intelligence

spotintelligence.com › home › batch gradient descent in machine learning made simple & how to tutorial in python

Batch Gradient Descent In Machine Learning Made Simple & How To Tutorial In Python

May 22, 2024 - What is batch gradient descent? How does it work and where is it used in machine learning? A Python tutorial example and the math explained.

Stack Exchange

stats.stackexchange.com › questions › 49528 › batch-gradient-descent-versus-stochastic-gradient-descent

optimization - Batch gradient descent versus stochastic gradient descent - Cross Validated

Top answer

1 of 7

187

2 of 7

19

As the other answer suggests, the main reason to use SGD is to reduce the computational cost of the gradient while still largely maintaining the gradient direction when averaged over many minibatches or samples. That surely helps bring you to a local minimum.

Why minibatch works.

The mathematics behind this is that the "true" gradient of the cost function (the gradient for the generalization error, or for an infinitely large sample set) is the expectation of the gradient $g$ over the true data generating distribution $p_{data}$; the actual gradient $\hat{g}$ computed over a batch of samples is always an approximation to the true gradient, taken with the empirical data distribution $\hat{p}_{data}$.

$$\hat{g} = E_{\hat{p}_{data}}\left(\frac{\partial J(\theta)}{\partial \theta}\right)$$

Batch gradient descent can bring you the possible "optimal" gradient given all your data samples, but it is not the "true" gradient. A smaller batch (i.e. a minibatch) is probably not as optimal as the full batch, but they are both approximations, and so is the single-sample minibatch (SGD).

Assuming there is no dependence between the $m$ samples in one minibatch, the computed $\hat{g}(m)$ is an unbiased estimate of the true gradient. The (squared) standard errors of the estimates with different minibatch sizes are inversely proportional to the sizes of the minibatches. That is,

$$\frac{SE(\hat{g}(n))}{SE(\hat{g}(m))} = \sqrt{\frac{m}{n}}$$

I.e., the reduction in standard error is the square root of the increase in sample size. This means that if the minibatch size is small, the learning rate has to be small too, in order to remain stable despite the large variance. When the samples are not independent, the estimate is no longer guaranteed to be unbiased. That requires you to shuffle the samples before training if they are not already ordered randomly enough.
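The square-root scaling of the standard error can be checked with a small simulation. This is a sketch under assumptions: the per-sample "gradients" are just synthetic Gaussian numbers, minibatches are drawn with replacement so the samples are independent, and `standard_error_of_minibatch_mean` is a hypothetical helper.

```python
import random
import statistics


def standard_error_of_minibatch_mean(population, batch_size, trials, rng):
    """Empirical std. dev. of the minibatch mean over many random minibatches."""
    means = []
    for _ in range(trials):
        # Draw with replacement so samples within a minibatch are independent.
        batch = [rng.choice(population) for _ in range(batch_size)]
        means.append(sum(batch) / batch_size)
    return statistics.stdev(means)


rng = random.Random(0)

# Synthetic stand-in for per-sample gradient values.
per_sample_gradients = [rng.gauss(0.0, 1.0) for _ in range(1000)]

se_small = standard_error_of_minibatch_mean(per_sample_gradients, 4, 2000, rng)
se_large = standard_error_of_minibatch_mean(per_sample_gradients, 64, 2000, rng)
ratio = se_small / se_large  # theory predicts sqrt(64 / 4) = 4

print(se_small, se_large, ratio)
```

Growing the minibatch 16-fold should shrink the noise by a factor of roughly 4, not 16, which is one reason very large batches give diminishing returns per unit of compute.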

Why minibatch may work better.

Firstly, minibatches turn some technically intractable learning problems into tractable ones, because a smaller batch size reduces the computational demand.

Secondly, a reduced batch size does not necessarily mean reduced gradient accuracy. The training samples may contain a lot of noise, outliers, or biases. A randomly sampled minibatch may reflect the true data generating distribution better (or no worse) than the original full batch. If some iterations of the minibatch gradient updates give you a better estimate, the averaged result over one epoch can be better than the gradient computed from the full batch.

Thirdly, minibatches not only help deal with unpleasant data samples, but also help deal with an unpleasant cost function that has many local minima. As Jason_L_Bens mentions, some error manifolds can easily trap a regular gradient in a local minimum, while it is harder for them to trap the randomly varying gradient computed from minibatches.

Finally, with gradient descent, you are not reaching the global minimum in one step, but iterating on the error manifold. The gradient largely gives you only the direction in which to iterate. With minibatches, you can iterate much faster. In many cases, the more iterations, the better the point you can reach. You do not really care at all whether the point is optimal globally or even locally. You just want to reach a reasonable model that gives you an acceptable generalization error. Minibatches make that easier.

You may find that the book "Deep Learning" by Ian Goodfellow et al. has pretty good discussions of this topic if you read through it carefully.

Bogotobogo

bogotobogo.com › python › python_numpy_batch_gradient_descent_algorithm.php

Python Tutorial: batch gradient descent algorithm - 2020

Batch gradient descent algorithm; Single Layer Neural Network - Perceptron model on the Iris dataset using Heaviside step activation function; Batch gradient descent versus stochastic gradient descent; Single Layer Neural Network - Adaptive Linear Neuron using linear (identity) activation function with batch gradient descent method; Single Layer Neural Network: Adaptive Linear Neuron using linear (identity) activation function with stochastic gradient descent (SGD); Logistic Regression; VC (Vapnik-Chervonenkis) Dimension and Shatter; Bias-variance tradeoff; Maximum Likelihood Estimation (MLE); Neural

DeepLearning.AI

community.deeplearning.ai › course q&a › deep learning specialization › neural networks and deep learning

Understanding batch gradient descent over the entire training set - Neural Networks and Deep Learning - DeepLearning.AI

Top answer

1 of 1

1

Hello, @Zijun_Liu! I think your steps are fine, and I want to highlight that computing a single cost value and a single gradient value are two different paths using two different equations (though the gradient equation is derived from the cost equation). Consequently, your step 4 and step 3 are ind…

Analytics Vidhya

analyticsvidhya.com › home › variants of gradient descent algorithm

Variants of Gradient Descent Algorithm | Types of Gradient Descent

November 7, 2023 - Based on the way we are calculating this cost function there are different variants of Gradient Descent. Let’s say there are a total of ‘m’ observations in a data set and we use all these observations to calculate the cost function J, then this is known as Batch Gradient Descent.

Medium

medium.com › @lomashbhuva › batch-gradient-descent-a-comprehensive-guide-to-multi-dimensional-optimization-ccacd24569ba

Batch Gradient Descent: A Comprehensive Guide to Multi-Dimensional Optimization🌟🚀 | by Lomash Bhuva | Medium

February 23, 2025 - Stochastic Gradient Descent (SGD) — Uses a single random data point per iteration, making updates noisier but faster. Mini-Batch Gradient Descent — A compromise between BGD and SGD, using small batches of data per update.

Towards Data Science

towardsdatascience.com › home › latest › the math behind stochastic gradient descent

Batch, Mini Batch & Stochastic Gradient Descent

January 24, 2025 - Implementing Stochastic Gradient Descent (SGD) in machine learning models is a practical step that brings the theoretical aspects of the algorithm into real-world application. This section will guide you through the basic implementation of SGD and provide tips for integrating it into machine learning workflows. Now let’s consider a simple case of SGD applied to Linear Regression: class SGDRegressor: def __init__(self, learning_rate=0.01, epochs=100, batch_size=1, reg=None, reg_param=0.0): """ Constructor for the SGDRegressor.
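The snippet above is cut off after the constructor signature. As a rough idea of what a class with that signature might do, here is a minimal sketch. It is not the article's implementation: the `fit` and `predict` methods, the `seed` argument, the mean-squared-error loss, and the reading of `reg="l2"` as a ridge penalty are all assumptions made for illustration.

```python
import random


class SGDRegressor:
    """Linear regression trained by minibatch SGD (illustrative sketch only)."""

    def __init__(self, learning_rate=0.01, epochs=100, batch_size=1,
                 reg=None, reg_param=0.0):
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.batch_size = batch_size
        self.reg = reg              # None or "l2" in this sketch
        self.reg_param = reg_param
        self.weights = None
        self.bias = 0.0

    def predict(self, X):
        return [sum(w * v for w, v in zip(self.weights, row)) + self.bias
                for row in X]

    def fit(self, X, y, seed=0):
        rng = random.Random(seed)
        n, d = len(X), len(X[0])
        self.weights, self.bias = [0.0] * d, 0.0
        order = list(range(n))
        for _ in range(self.epochs):
            rng.shuffle(order)  # new random minibatches every epoch
            for start in range(0, n, self.batch_size):
                batch = order[start:start + self.batch_size]
                grad_w, grad_b = [0.0] * d, 0.0
                for i in batch:
                    err = self.predict([X[i]])[0] - y[i]
                    for j in range(d):
                        grad_w[j] += 2 * err * X[i][j] / len(batch)
                    grad_b += 2 * err / len(batch)
                if self.reg == "l2":  # assumed meaning: ridge penalty on weights
                    for j in range(d):
                        grad_w[j] += 2 * self.reg_param * self.weights[j]
                for j in range(d):
                    self.weights[j] -= self.learning_rate * grad_w[j]
                self.bias -= self.learning_rate * grad_b
        return self


# Usage on made-up noiseless data following y = 2x + 1.
X = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
model = SGDRegressor(learning_rate=0.02, epochs=2000, batch_size=1).fit(X, y)
prediction = model.predict([[5.0]])[0]  # should be close to 11
print(model.weights, model.bias, prediction)
```

With `batch_size=1` this is pure SGD, and setting `batch_size=len(X)` turns the same loop into batch gradient descent.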

GeeksforGeeks

geeksforgeeks.org › machine learning › gradient-descent-algorithm-and-its-variants

Gradient Descent Algorithm in Machine Learning - GeeksforGeeks

Variants include Batch Gradient Descent, Stochastic Gradient Descent and Mini Batch Gradient Descent

Published 2 weeks ago

StatusNeo

statusneo.com › home › deep learning › efficientdl: mini-batch gradient descent explained

EfficientDL: Mini-batch Gradient Descent Explained - StatusNeo

September 13, 2023 - Variance Reduction in Mini-batch Gradient Descent: In Mini-batch Gradient Descent, the training dataset is divided into smaller subsets known as mini-batches, each containing a fixed number of data points. When computing the gradient and updating model parameters, the algorithm considers the average gradient over the mini-batch.

MachineLearningMastery

machinelearningmastery.com › home › blog › a gentle introduction to mini-batch gradient descent and how to configure batch size

A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size - MachineLearningMastery.com

August 19, 2019 - Batch gradient descent is a variation of the gradient descent algorithm that calculates the error for each example in the training dataset, but only updates the model after all training examples have been evaluated.

GeeksforGeeks

geeksforgeeks.org › difference-between-batch-gradient-descent-and-stochastic-gradient-descent

Difference between Batch Gradient Descent and Stochastic Gradient Descent - GeeksforGeeks

March 4, 2025 - Stochastic Gradient Descent (SGD) addresses the inefficiencies of Batch Gradient Descent by computing the gradient using only a single training example (or a small subset) in each iteration.

Kaggle

kaggle.com › code › avadhutvarvatkar › gradient-descent-explanation

Gradient Descent Explanation 🔥💹

Baeldung

baeldung.com › home › artificial intelligence › machine learning › differences between gradient, stochastic and mini batch gradient descent

Differences Between Gradient, Stochastic and Mini Batch Gradient Descent | Baeldung on Computer Science

February 28, 2025 - We can see that, depending on the dataset, Gradient Descent may have to iterate through many samples, which can lead to being unproductive. As we can see, in this case, the gradients are calculated on one random shuffled part out of partitions. Let’s assume batches.

Medium

medium.com › @ugurozcan108 › batch-gradient-descent-in-python-4d3b16d40755

Batch Gradient Descent in Python. The gradient descent algorithm… | by Uğur Özcan | Medium

March 17, 2022 - In this problem, we expect you to implement the batch gradient descent algorithm manually. The cost/loss function will be the mean squared loss for linear regression: You have to initialize all the weights to zero to meet the desired output. Input: x: an array of training examples y: an array of output corresponding to each training example lr: the learning rate for the algorithm iter: number of iterations the algorithm will perform
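The exercise described above could be approached along the following lines. This is a sketch and not the article's solution. The snippet's `iter` is renamed `n_iter` here to avoid shadowing the Python builtin, an intercept is added as `weights[0]`, and the loss is assumed to be the plain mean of squared errors, since the article's exact formula is not shown.

```python
def batch_gradient_descent(x, y, lr, n_iter):
    """Fit linear regression by full-batch gradient descent on mean squared error.

    x: list of feature rows, y: list of targets, lr: learning rate,
    n_iter: number of full passes (one parameter update per pass).
    """
    n, d = len(x), len(x[0])
    weights = [0.0] * (d + 1)  # all weights start at zero; weights[0] is the intercept
    for _ in range(n_iter):
        grad = [0.0] * (d + 1)
        # Accumulate the gradient over EVERY training example before updating.
        for row, target in zip(x, y):
            features = [1.0] + list(row)
            error = sum(w * f for w, f in zip(weights, features)) - target
            for j, f in enumerate(features):
                grad[j] += 2.0 * error * f / n
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


# Made-up data following y = 2x + 1.
x = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
weights = batch_gradient_descent(x, y, lr=0.05, n_iter=2000)
print(weights)  # intercept close to 1, slope close to 2
```

The model is updated once per pass, only after the error on every example has been accumulated, which is what distinguishes this from the stochastic and minibatch variants.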

Sebastian Raschka

sebastianraschka.com › faq › docs › sgd-methods.html

How is stochastic gradient descent implemented in the context of machine learning and deep learning? | Sebastian Raschka, PhD

January 17, 2026 - Note that using only one training example per update results in very noisy gradients since the loss is approximated from one training example only. Noisy gradients can be useful if we have non-convex loss functions and want to escape sharp local minima. Batch gradient descent or just “gradient descent” is the deterministic (not stochastic) variant.

Medium

medium.com › data-science › batch-mini-batch-stochastic-gradient-descent-7a62ecba642a

Batch, Mini Batch & Stochastic Gradient Descent | by Sushant Patrikar | TDS Archive | Medium

October 1, 2019 - Batch Gradient Descent can be used for smoother curves. SGD can be used when the dataset is large. Batch Gradient Descent converges directly to the minimum. SGD converges faster for larger datasets. But since SGD uses only one example at a time, it cannot take advantage of a vectorized implementation.