Both formulas are correct. Here is how you can get formula 2, by minimizing the sum of squared errors:

$$ E(\theta) = \sum_i \left(y^i - h_\theta(x^i)\right)^2 $$

Dividing it by \(m\) gives you the MSE, and dividing it by \(2\) gives you the halved SSE used in formula 2. Since you are going to minimize this expression with the partial-derivative technique, choosing the factor \(\frac{1}{2}\) makes the derivation look nice, because it cancels the 2 brought down from the square.

Now, using a model where the input to the activation is linear, \(h_\theta(x) = f(\theta^T x)\), you get

$$ E(\theta) = \frac{1}{2}\sum_i \left(y^i - f(\theta x^i)\right)^2 $$

(I omit the transpose symbol for \(\theta\) in \(\theta x^i\).) When you compute its partial derivative over \(\theta_j\) for the \(i\)-th additive term, you have

$$ \frac{\partial E_i}{\partial \theta_j} = -\left(y^i - f(\theta x^i)\right) f'(\theta x^i)\, x_j^i \tag{1} $$

This is the formula 2 you gave. I don't give the detailed steps, but it is quite straightforward.
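As an illustrative sketch (not part of the original answer; the names `sse_grad`, `f`, and `f_prime` are mine), the per-sample SSE gradient with a generic differentiable activation can be written as:

```python
import numpy as np

def sse_grad(theta, x, y, f, f_prime):
    """Per-sample gradient of E_i = 0.5 * (y - f(theta . x))^2 w.r.t. theta."""
    z = np.dot(theta, x)
    # Chain rule: dE_i/dtheta_j = -(y - f(z)) * f'(z) * x_j
    return -(y - f(z)) * f_prime(z) * x

# With the identity activation f(z) = z, f'(z) = 1 and the factor disappears,
# recovering the plain linear-regression gradient.
theta = np.array([0.5, -0.2])
x = np.array([1.0, 2.0])
g = sse_grad(theta, x, y=1.0, f=lambda z: z, f_prime=lambda z: 1.0)
# g == -(1.0 - 0.1) * [1.0, 2.0] == [-0.9, -1.8]
```

Swapping in a different `f` and its derivative gives the corresponding update rule without re-deriving anything.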

Yes, \(f\) is the activation function, and you do have the factor \(f'(\theta x^i)\) in the derivative expression, as shown above. It disappears if it equals 1, i.e., when \(f(z) = z + c\), where \(c\) is invariable w.r.t. \(z\).

For example, if \(f(z) = z\) (i.e., \(c = 0\)), and the prediction model is linear, \(h_\theta(x) = \theta x\), too, then you have \(f'(\theta x^i) = 1\) and

$$ \frac{\partial E_i}{\partial \theta_j} = -\left(y^i - \theta x^i\right) x_j^i $$

For another example, if the cost is still the squared error, while the prediction model is sigmoid, \(f(z) = \frac{1}{1+e^{-z}}\), then \(f'(z) = f(z)(1 - f(z))\). This is why in the book Artificial Intelligence: A Modern Approach, the derivative of logistic regression is:

$$ \frac{\partial E_i}{\partial \theta_j} = -\left(y^i - h_\theta(x^i)\right) h_\theta(x^i)\left(1 - h_\theta(x^i)\right) x_j^i \tag{2} $$

On the other hand, formula 1, although similar in form, is deduced via a different approach. It is based on maximum likelihood (or, equivalently, minimum negative log-likelihood): multiply the output probability function over all the samples and then take its negative logarithm, as given below,

$$ L(\theta) = -\log \prod_i P(y^i \mid x^i; \theta) = -\sum_i \log P(y^i \mid x^i; \theta) $$

In a logistic regression problem, where the outputs are 0 and 1, each additive term becomes,

$$ -\log P(y^i \mid x^i; \theta) = -\left(y^i \log h_\theta(x^i) + (1-y^i)\log\left(1 - h_\theta(x^i)\right)\right) $$

Formula 1 is the derivative of it (and of its sum) when \(h_\theta(x) = \frac{1}{1+e^{-\theta x}}\), as below:

$$ \frac{\partial L_i}{\partial \theta_j} = -\left(y^i - h_\theta(x^i)\right) x_j^i \tag{3} $$

The derivation details are well given in another post.
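As a quick sanity check (my own sketch, not from the answer; the sample values are arbitrary), the analytic log-loss gradient \((h_\theta(x) - y)\,x\) can be compared against a finite-difference estimate of the negative log-likelihood:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(theta, x, y):
    """Per-sample negative log-likelihood (the cross-entropy term above)."""
    h = sigmoid(np.dot(theta, x))
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))

theta = np.array([0.3, -0.7])
x = np.array([1.0, 2.0])
y = 1.0

# Analytic gradient: (h - y) * x -- note there is no extra f' factor,
# because f'(z) = h(1 - h) cancels against the log in the chain rule.
analytic = (sigmoid(np.dot(theta, x)) - y) * x

# Central finite differences over each component of theta
eps = 1e-6
numeric = np.array([(nll(theta + eps * e, x, y) - nll(theta - eps * e, x, y)) / (2 * eps)
                    for e in np.eye(len(theta))])
```

The two gradients agree to within the finite-difference error, which confirms the cancellation.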

You can compare it with equations (1) and (2). Yes, equation (3) does not have the factor \(f'(\theta x^i)\) that equation (1) has, since that factor is not part of its deduction process at all. Equation (2) has the additional factor \(h_\theta(x^i)\left(1 - h_\theta(x^i)\right)\) compared to equation (3). Since \(h_\theta(x^i)\), as a probability, is within the range \((0, 1)\), you have \(h_\theta(x^i)\left(1 - h_\theta(x^i)\right) \le \frac{1}{4} < 1\); that means equation (2) gives you a gradient of smaller absolute value, and hence a slower convergence speed in gradient descent, than equation (3).
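The shrinking effect of the extra factor is easy to see numerically; a minimal sketch (the sample values here are arbitrary examples of mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([2.0, -1.0])
x = np.array([1.0, 1.0])
y = 1.0

h = sigmoid(np.dot(theta, x))          # a probability in (0, 1)
grad_logloss = (h - y) * x             # equation (3): no extra factor
grad_sse = (h - y) * h * (1 - h) * x   # equation (2): extra h(1 - h) factor

# The extra factor h(1 - h) is at most 1/4 (attained at h = 0.5),
# so the SSE-based gradient is always smaller in absolute value.
shrink = h * (1 - h)
```

The farther the prediction sits from 0.5, the smaller `shrink` gets, which is exactly the slow-convergence behavior described above.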

Note that the sum of squared errors (SSE) is essentially a special case of maximum likelihood, when we consider the prediction \(h_\theta(x)\) to be the mean of a conditional normal distribution. That is,

$$ P(y \mid x; \theta) = \mathcal{N}\!\left(y;\, h_\theta(x),\, \sigma^2\right) $$

If you want to get more in-depth knowledge in this area, I would suggest the Deep Learning book by Ian Goodfellow, et al.
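Spelling that claim out (my reconstruction of the standard argument, consistent with the Deep Learning book's treatment): with a Gaussian conditional of mean \(h_\theta(x)\) and fixed variance \(\sigma^2\), the per-sample negative log-likelihood reduces to a scaled squared error plus a constant:

```latex
-\log P(y \mid x; \theta)
  = -\log\!\left(\frac{1}{\sqrt{2\pi\sigma^2}}
      \exp\!\left(-\frac{(y - h_\theta(x))^2}{2\sigma^2}\right)\right)
  = \frac{(y - h_\theta(x))^2}{2\sigma^2} + \frac{1}{2}\log\left(2\pi\sigma^2\right)
```

The second term does not depend on \(\theta\), so minimizing this negative log-likelihood is equivalent to minimizing the squared error.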

Answer from Xiao-Feng Li on Stack Exchange
Upgrad (upgrad.com)
Gradient Descent in Logistic Regression - Learn in Minutes!
June 26, 2025 - We use the log loss function for logistic regression, which calculates the difference between predicted probabilities and the true labels. Log loss is used instead of squared error loss because it heavily penalizes confident wrong predictions—like predicting 0.99 for a class that is actually 0—ensuring better probability calibration. The goal is to minimize this cost by adjusting the model’s parameters (weights) through gradient descent in logistic regression.
Baeldung (baeldung.com)
Gradient Descent Equation in Logistic Regression | Baeldung on Computer Science
February 13, 2025 - Gradient descent is an iterative optimization algorithm, which finds the minimum of a differentiable function. In this process, we try different values and update them to reach the optimal ones, minimizing the output.
Discussions

Gradient descent for logistic regression partial derivative doubt - Cross Validated
I'm a software engineer, and I have just started a Udacity's nanodegree of deep learning. I have also worked my way through Stanford professor Andrew Ng's online course on machine learning and now... More on stats.stackexchange.com
stats.stackexchange.com
February 13, 2017
REALLY breaking down logistic regression gradient descent
Great job. I appreciate all the effort you put to write up the equations in Latex. More on reddit.com
r/learnmachinelearning
October 13, 2020
Gradient descent for logistic regression (xj^(i).)
Hi all, Below is a slide of “Gradient descent for logistic regression.” I just don’t understand why the following formula has xj^(i). This might be beyond of the scope of our course, but could anyone explain that mathematically? Thanks! More on community.deeplearning.ai
community.deeplearning.ai
October 13, 2023
Gradient descent vs Gradient ascent in Logistic Regression
derivation of the cost function of gradient descent This makes me think you might be misunderstanding. Gradient descent isn't a specific function, it's a method that can be applied to any cost function. More on reddit.com
r/learnmachinelearning
May 1, 2020
People also ask

Can Gradient Descent in Logistic Regression handle non-linear data?
Gradient Descent in Logistic Regression is primarily used for linear classification tasks. However, if your data is non-linear, logistic regression can still work by using transformations like polynomial features. For more complex non-linear problems, consider using other models like support vector machines or neural networks, which can better handle non-linear data relationships.
upgrad.com
How does Gradient Descent in Logistic Regression handle high-dimensional data?
When working with high-dimensional data, Gradient Descent in Logistic Regression can still be effective but may suffer from issues like slower convergence or overfitting. In these cases, dimensionality reduction techniques like PCA (Principal Component Analysis) or regularization methods (L1 or L2) can be applied to improve model performance and prevent overfitting while optimizing the cost function.
upgrad.com
Can I use Gradient Descent in Logistic Regression for multi-class classification?
Yes, Gradient Descent in Logistic Regression can be extended to multi-class classification through techniques like one-vs-rest or softmax regression. These approaches allow you to apply logistic regression to problems where there are more than two possible classes. The fundamental optimization process remains the same, but the model is adapted to handle multiple classes efficiently.
upgrad.com
Medium (medium.com)
Logistic regression using gradient descent | by Dhanoop Karunakaran | Intro to Artificial Intelligence | Medium
June 16, 2023 - The gradient always points towards the direction of the greatest increase, so in order to find the minimum or descent point, we update the weights in the opposite direction of the gradient.
ML Explained (ml-explained.com)
Logistic Regression - ML Explained
September 29, 2020 - We can get the gradient descent formula for Logistic Regression by taking the derivative of the loss function.
Stanford University (web.stanford.edu/~jurafsky/slp3/5.pdf)
Speech and Language Processing. Daniel Jurafsky & James H. Martin.
The update equations going from time step t to t + 1 in stochastic gradient descent are thus:

$$ c_{pos}^{t+1} = c_{pos}^{t} - \eta\left[\sigma(c_{pos}^{t} \cdot w^{t}) - 1\right]w^{t} \quad (5.25) $$
$$ c_{neg_i}^{t+1} = c_{neg_i}^{t} - \eta\left[\sigma(c_{neg_i}^{t} \cdot w^{t})\right]w^{t} \quad (5.26) $$
$$ w^{t+1} = w^{t} - \eta\left[\left[\sigma(c_{pos}^{t} \cdot w^{t}) - 1\right]c_{pos}^{t} + \sum_{i=1}^{k}\left[\sigma(c_{neg_i}^{t} \cdot w^{t})\right]c_{neg_i}^{t}\right] \quad (5.27) $$

Just as in logistic regression, then, the learning algorithm starts with randomly initialized W and C matrices, and then walks through the training corpus using gradient…
Medium (medium.com)
Gradient Descent for Logistic Regression | Medium
April 27, 2025 - Logistic regression uses the binary cross-entropy (log-loss): ... Minimizing J pulls the parameters so that the ŷ aligns with the true label y. Although J(w, b) is convex (there’s a single global minimum), it lacks a simple closed-form minimizer. Instead, we use gradient descent, which:
Rensselaer Polytechnic Institute (cs.rpi.edu/~magdon/courses/LFD-Slides/SlidesLect09.pdf)
Learning From Data Lecture 9 Logistic Regression and Gradient Descent
Logistic Regression and Gradient Descent (slide 4/23): the data is still binary, ±1: D = (x₁, y₁ = ±1), …, (x_N, y_N = ±1), where x_n is a person's health information and y_n = ±1 records whether they had a heart attack or not. We cannot measure a probability.
GeeksforGeeks (geeksforgeeks.org)
Gradient Descent Algorithm in Machine Learning - GeeksforGeeks
It predicts the probability that … Gradient Descent helps Logistic Regression find the best values of the model parameters so that the prediction error becomes smaller.
Published 1 week ago
Top answer
1 of 3 (6 votes): the answer by Xiao-Feng Li, quoted in full above.

2 of 3 (5 votes)

So, I think you are mixing up minimizing your loss function with maximizing your log-likelihood; but also, since the two are equivalent, the equations you have written are actually the same. Let's assume there is only one sample, to keep things simple.

The first equation shows the minimization-of-loss update equation:

$$ \theta := \theta - \alpha \nabla_\theta J(\theta) $$

Here, the gradient of the loss is given by:

$$ \nabla_\theta J(\theta) = \left(h_\theta(x) - y\right) x $$

However, the third equation you have written,

$$ \nabla_\theta \ell(\theta) = \left(y - h_\theta(x)\right) x, $$

is not the gradient with respect to the loss, but the gradient with respect to the log-likelihood!

This is why you have a discrepancy in your signs.

Now look back at your first equation and expand the gradient with its negative sign: you will then be maximizing the log-likelihood, which is the same as minimizing the loss. Therefore they are all equivalent. Hope that helped!
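The equivalence is easy to confirm numerically; a minimal sketch (the sample values are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.1, -0.4])
x = np.array([2.0, 1.0])
y = 1.0
alpha = 0.1

h = sigmoid(np.dot(theta, x))

grad_loss = (h - y) * x   # gradient of the negative log-likelihood (the loss)
grad_ll = (y - h) * x     # gradient of the log-likelihood: same vector, negated

descent_step = theta - alpha * grad_loss  # gradient descent on the loss
ascent_step = theta + alpha * grad_ll     # gradient ascent on the log-likelihood
# Both updates land on exactly the same parameters.
```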

Reddit (reddit.com)
r/learnmachinelearning on Reddit: REALLY breaking down logistic regression gradient descent
October 13, 2020

I recently wrote a blog post that breaks down the math behind maximum likelihood estimation for logistic regression. My friends found it helpful, so decided to spread it around. If you've found the math to be hard to follow in other tutorials, hopefully mine will guide you through it step by step.

Here it is: https://statisticalmusings.netlify.app/post/logistic-regression-mle-a-full-breakdown/

If you can get a firm grasp of logistic regression, you'll be well set to understand deep learning!

Scribd (scribd.com)
Gradient Descent in Logistic Regression | PDF | Mathematical Optimization | Dependent And Independent Variables
It discusses how gradient descent can be used to optimize parameters in linear and logistic regression models by minimizing a cost function. Gradient descent works by taking small steps in the direction of steepest descent.
YOU CANalytics (ucanalytics.com)
Gradient Descent for Logistic Regression Simplified - Step by Step Visual Guide
September 16, 2018 - If you will run the gradient descent without assuming β1 = β2 then β0 =-15.4233, β1 = 0.1090, and β2 = 0.1097. I suggest you try all these solutions using this code: Gradient Descent – Logistic Regression (R Code).
DeepLearning.AI (community.deeplearning.ai)
Gradient descent for logistic regression (xj^(i).) - Supervised ML: Regression and Classification - DeepLearning.AI
October 13, 2023 - Hi all, Below is a slide of “Gradient descent for logistic regression.” I just don’t understand why the following formula has xj^(i). This might be beyond of the scope of our course, but could anyone explain that mathem…
scikit-learn (scikit-learn.org/stable/modules/sgd.html)
1.5. Stochastic Gradient Descent — scikit-learn 1.8.0 documentation
For example, using SGDClassifier(loss='log_loss') results in logistic regression, i.e. a model equivalent to LogisticRegression which is fitted via SGD instead of being fitted by one of the other solvers in LogisticRegression. Similarly, SGDRegressor(loss='squared_error', penalty='l2') and Ridge solve the same optimization problem, via different means. The advantages of Stochastic Gradient Descent are:
Hongyang Zhang (hongyangzhang.com/ds5220/logistic_regression.pdf)
DS 5220, Lecture 5: Logistic Regression Using Gradient Descent
September 20, 2024 - Unlike the least squares problem, logistic regression does not permit a closed-form solution. One way to solve this regression problem is using an optimization algorithm such as gradient descent. We need to compute the gradient of the loss, \(\nabla \hat{L}(\beta)\).
nucleartalent.github.io
Data Analysis and Machine Learning: Logistic Regression and Gradient Methods
The cost function is convex, which guarantees that gradient descent converges for small enough learning rates. We revisit the example from homework set 1 where we had $$ y_i = 5x_i^2 + 0.1\xi_i, \ i=1,\cdots,100 $$ with \( x_i \in [0,1] \) chosen randomly with a uniform distribution. Additionally \( \xi_i \) represents stochastic noise chosen according to a normal distribution \( \cal {N}(0,1) \). The linear regression model is given by $$ h_\beta(x) = \boldsymbol{y} = \beta_0 + \beta_1 x, $$ such that $$ \boldsymbol{y}_i = \beta_0 + \beta_1 x_i. $$
arXiv (arxiv.org/abs/2406.05033)
[2406.05033] Gradient Descent on Logistic Regression with Non-Separable Data and Large Step Sizes
November 4, 2024 - We study gradient descent (GD) ... data, it is known that GD converges to the minimizer with arbitrarily large step sizes, a property which no longer holds when the problem is not separable....
ML Glossary (ml-cheatsheet.readthedocs.io)
Logistic Regression — ML Glossary documentation
To minimize our cost, we use Gradient Descent just like before in Linear Regression. There are other more sophisticated optimization algorithms out there such as conjugate gradient like BFGS, but you don’t have to worry about these.
Medium (medium.com)
Logistic Regression with Gradient Descent Explained | Machine Learning | by Ashwin Prasad | Analytics Vidhya | Medium
February 5, 2024 - Logistic Regression with Gradient Descent Explained | Machine Learning What is Logistic Regression ? Why is it used for Classification ? What is Classification Problem ? In general , Supervised …