Both formulas are correct. Here is how you can get formula 2, by minimizing the sum of squared errors,

$$ E(\theta) = \frac{1}{2}\sum_i \big(h_\theta(x^i) - y^i\big)^2 $$

Dividing it by the number of samples $m$ gives you MSE, and dropping the $\frac{1}{2}$ gives you the plain SSE; formula 2 uses the $\frac{1}{2}$-scaled version. Since you are going to minimize this expression with the partial derivation technique, choosing the factor $\frac{1}{2}$ makes the derivation look nice.

Now if you use a model $h_\theta(x) = f(\theta x)$, where $f$ is the activation function and $\theta x$ is a linear combination of the inputs, you get, (I omit the transpose symbol for $\theta$ in $\theta x$)

$$ E(\theta) = \frac{1}{2}\sum_i \big(f(\theta x^i) - y^i\big)^2 $$

When you compute its partial derivative over $\theta_j$ for the $i$-th additive term, you have,

$$ \frac{\partial E^i}{\partial \theta_j} = \big(h_\theta(x^i) - y^i\big)\,f'(\theta x^i)\,x_j^i \tag{1} $$

This is the formula 2 you gave. I don't give the detailed steps, but it is quite straightforward.

Yes, $f$ is the activation function, and you do have the factor $f'(\theta x)$ in the derivative expression as shown above. It disappears if it equals 1, i.e., $f'(\theta x) = 1$, which holds when $f(z) = z + c$, where $c$ is invariable w.r.t. $\theta$.

For example, if $f(z) = z$ (i.e., $h_\theta(x) = \theta x$), and the prediction model is linear too, then you have $f'(\theta x) = 1$ and $\frac{\partial E^i}{\partial \theta_j} = \big(h_\theta(x^i) - y^i\big)\,x_j^i$.
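As a quick sanity check (my own NumPy sketch, not part of the original answer), the identity-activation case can be verified numerically: the analytic per-term gradient $(h_\theta(x) - y)\,x$ matches a finite-difference derivative of the squared error.

```python
import numpy as np

# Sketch: with identity activation f(z) = z, the gradient of the
# 0.5 * (h - y)^2 term reduces to (h - y) * x, since f'(theta x) = 1.
rng = np.random.default_rng(0)
theta = rng.normal(size=3)
x = rng.normal(size=3)
y = 0.7

h = theta @ x                        # linear model, f(z) = z
analytic_grad = (h - y) * x          # (h - y) * f'(theta x) * x with f' = 1

# Central-difference check of d/d theta_j of 0.5 * (theta x - y)^2
eps = 1e-6
numeric_grad = np.empty_like(theta)
for j in range(len(theta)):
    tp, tm = theta.copy(), theta.copy()
    tp[j] += eps
    tm[j] -= eps
    numeric_grad[j] = (0.5 * (tp @ x - y) ** 2 - 0.5 * (tm @ x - y) ** 2) / (2 * eps)

print(np.max(np.abs(analytic_grad - numeric_grad)))
```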

For another example, if $f(z) = \sigma(z) = \frac{1}{1+e^{-z}}$, while the prediction model is sigmoid where $h_\theta(x) = \sigma(\theta x)$, then $f'(\theta x) = h_\theta(x)\big(1-h_\theta(x)\big)$. This is why in the book Artificial Intelligence: A Modern Approach, the derivative of logistic regression (under squared error) is:

$$ \frac{\partial E^i}{\partial \theta_j} = \big(h_\theta(x^i) - y^i\big)\,h_\theta(x^i)\big(1-h_\theta(x^i)\big)\,x_j^i \tag{2} $$

On the other hand, the formula 1, although looking like a similar form, is deduced via a different approach. It is based on the maximum likelihood (or equivalently minimum negative log-likelihood), obtained by multiplying the output probability function over all the samples and then taking its negative logarithm, as given below,

$$ L(\theta) = -\log \prod_i P(y^i \mid x^i; \theta) = -\sum_i \log P(y^i \mid x^i; \theta) $$
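The sigmoid-derivative identity used above is easy to confirm numerically; this small NumPy sketch (my addition, not from the answer) checks $\sigma'(z) = \sigma(z)\big(1-\sigma(z)\big)$ on a grid of points.

```python
import numpy as np

# Sketch: numerically verify sigma'(z) = sigma(z) * (1 - sigma(z)),
# the factor that appears in the squared-error gradient of a sigmoid model.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4, 4, 9)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference
identity = sigmoid(z) * (1 - sigmoid(z))

print(np.max(np.abs(numeric - identity)))
```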

In a logistic regression problem, where the outputs are 0 and 1, each additive term becomes,

$$ -\log P(y^i \mid x^i;\theta) = -\big(y^i\log h_\theta(x^i) + (1-y^i)\log(1-h_\theta(x^i))\big) $$

The formula 1 is the derivative of it (and its sum) when $h_\theta(x) = \sigma(\theta x)$, as below,

$$ \frac{\partial L}{\partial \theta_j} = \sum_i \big(h_\theta(x^i) - y^i\big)\,x_j^i \tag{3} $$

The derivation details are well given in another post.

You can compare it with equations (1) and (2). Yes, equation (3) does not have the factor $f'(\theta x)$ that equation (1) has, since that factor is not part of its deduction process at all. Equation (2) has the additional factor $h_\theta(x^i)\big(1-h_\theta(x^i)\big)$ compared to equation (3). Since $h_\theta(x^i)$, as a probability, is within the range (0, 1), you have $h_\theta(x^i)\big(1-h_\theta(x^i)\big) \le \frac{1}{4}$; that means equation (2) gives you a gradient of smaller absolute value, hence a slower convergence speed in gradient descent than equation (3).
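This shrinking effect can be illustrated with a short NumPy sketch (my own illustration with made-up data, not from the answer): each per-sample term of the squared-error gradient carries the extra $h(1-h)$ factor, so it is at most a quarter of the corresponding MLE term.

```python
import numpy as np

# Sketch: compare per-sample gradient magnitudes of eq. (2) (squared error)
# and eq. (3) (MLE/cross-entropy) for a sigmoid model on random data.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
theta = rng.normal(size=4)
X = rng.normal(size=(100, 4))
y = (rng.random(100) < 0.5).astype(float)

h = sigmoid(X @ theta)
term_mle = np.abs(h - y)           # per-sample |h - y| from eq. (3)
term_mse = term_mle * h * (1 - h)  # per-sample term from eq. (2)

# Since h*(1-h) <= 1/4 on (0, 1), every MSE term is at most a quarter
# of the matching MLE term.
print(float(term_mse.max()), float(term_mle.max()))
```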

Note the sum of squared errors (SSE) is essentially a special case of maximum likelihood, when we consider the prediction $h_\theta(x)$ to be the mean of a conditional normal distribution. That is, $P(y \mid x; \theta) = \mathcal{N}\big(y;\, h_\theta(x),\, \sigma^2\big)$, whose negative log-likelihood is the SSE up to a positive scale and an additive constant. If you want to get more in-depth knowledge in this area, I would suggest the Deep Learning book by Ian Goodfellow, et al.
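That SSE-as-Gaussian-MLE claim can be checked numerically. In this sketch (my addition; the fixed noise standard deviation `s` and the random data are assumptions for illustration), the Gaussian negative log-likelihood equals the half-SSE scaled by $1/s^2$ plus a $\theta$-independent constant.

```python
import numpy as np

# Sketch: with P(y|x) = Normal(h_theta(x), s^2), the negative log-likelihood
# is (0.5 * SSE) / s^2 plus a constant that does not depend on theta, so
# minimizing SSE is the same as maximizing this Gaussian likelihood.
rng = np.random.default_rng(2)
n = 50
h = rng.normal(size=n)   # stand-in model predictions h_theta(x^i)
y = rng.normal(size=n)   # stand-in targets
s = 1.3                  # fixed noise std (assumed, not fitted)

sse = 0.5 * np.sum((h - y) ** 2)
nll = np.sum(0.5 * np.log(2 * np.pi * s**2) + (y - h) ** 2 / (2 * s**2))
const = n * 0.5 * np.log(2 * np.pi * s**2)   # theta-independent part

print(nll - const, sse / s**2)   # the two quantities coincide
```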

Answer from Xiao-Feng Li on Stack Exchange
Baeldung (baeldung.com)
Gradient Descent Equation in Logistic Regression | Baeldung on Computer Science
February 13, 2025 - Surprisingly, the update rule is the same as the one derived by using the sum of the squared errors in linear regression. As a result, we can use the same gradient descent formula for logistic regression as well.
Stanford University (web.stanford.edu/~jurafsky/slp3/5.pdf)
Speech and Language Processing. Daniel Jurafsky & James H. Martin.
The update equations going from time step t to t + 1 in stochastic gradient descent are thus: c_pos^{t+1} = c_pos^t − η[σ(c_pos^t · w^t) − 1]w^t (5.25); c_neg_i^{t+1} = c_neg_i^t − η[σ(c_neg_i^t · w^t)]w^t (5.26); w^{t+1} = w^t − η[(σ(c_pos^t · w^t) − 1)c_pos^t + Σ_{i=1}^{k} σ(c_neg_i^t · w^t)c_neg_i^t] (5.27). Just as in logistic regression, then, the learning algorithm starts with randomly initialized W and C matrices, and then walks through the training corpus using gradient ...
Discussions

REALLY breaking down logistic regression gradient descent
Great job. I appreciate all the effort you put to write up the equations in Latex. More on reddit.com
r/learnmachinelearning, October 13, 2020
Gradient descent for logistic regression partial derivative doubt - Cross Validated
In the case of linear regression, ... first formula. In the case of logistic regression, f′(h) = f(h) * (1 - f(h)). I would not recommend using mean squared error loss for logistic regression, as it's very slow. Binary cross entropy is a much better loss function to use with logistic regression. More on stats.stackexchange.com
stats.stackexchange.com
February 13, 2017
MLE & Gradient Descent in Logistic Regression - Data Science Stack Exchange
In Logistic Regression, MLE is used to develop a mathematical function to estimate the model parameters, optimization techniques like Gradient Descent are used to solve this function. Can somebody ... More on datascience.stackexchange.com
datascience.stackexchange.com
January 9, 2022
Cases when a 'simpler' model was a better solution than Gradient Boosters in your job or project?
when theory suggests additivity, among other things-is a reasonable model. Especially if you care about inference. there's a reason why linear regression is still around. More on reddit.com
r/datascience, November 30, 2023
People also ask

Can Gradient Descent in Logistic Regression handle non-linear data?
Gradient Descent in Logistic Regression is primarily used for linear classification tasks. However, if your data is non-linear, logistic regression can still work by using transformations like polynomial features. For more complex non-linear problems, consider using other models like support vector machines or neural networks, which can better handle non-linear data relationships.
upgrad.com: Gradient Descent in Logistic Regression - Learn in Minutes!
Can Gradient Descent in Logistic Regression be used for regression tasks?
While Gradient Descent in Logistic Regression is primarily used for binary classification tasks, the core concept of gradient descent can be applied to other types of regression. For regression problems, Linear Regression is used, where the goal is to predict continuous values. Logistic regression, however, is suited for classification tasks due to its probability output via the sigmoid function.
Source: upgrad.com
How does Gradient Descent in Logistic Regression handle high-dimensional data?
When working with high-dimensional data, Gradient Descent in Logistic Regression can still be effective but may suffer from issues like slower convergence or overfitting. In these cases, dimensionality reduction techniques like PCA (Principal Component Analysis) or regularization methods (L1 or L2) can be applied to improve model performance and prevent overfitting while optimizing the cost function.
Source: upgrad.com
Medium (medium.com/intro-to-artificial-intelligence/logistic-regression-using-gradient-descent-bf8cbe749ceb)
Logistic regression using gradient descent | by Dhanoop Karunakaran | Intro to Artificial Intelligence | Medium
June 16, 2023 - We consider f(x) = g(z), and we know the fact that f(x) = wx+b for the linear regression. In this case, we can rewrite z as z = wx+b. Remember, z formula is not just tied with this equation, we can replace this with multiple linear regression ...
Upgrad (upgrad.com)
Gradient Descent in Logistic Regression - Learn in Minutes!
June 26, 2025 - We use the log loss function for logistic regression, which calculates the difference between predicted probabilities and the true labels. Log loss is used instead of squared error loss because it heavily penalizes confident wrong predictions—like predicting 0.99 for a class that is actually 0—ensuring better probability calibration. The goal is to minimize this cost by adjusting the model’s parameters (weights) through gradient descent in logistic regression.
Reddit (reddit.com/r/learnmachinelearning)
r/learnmachinelearning on Reddit: REALLY breaking down logistic regression gradient descent
October 13, 2020 -

I recently wrote a blog post that breaks down the math behind maximum likelihood estimation for logistic regression. My friends found it helpful, so decided to spread it around. If you've found the math to be hard to follow in other tutorials, hopefully mine will guide you through it step by step.

Here it is: https://statisticalmusings.netlify.app/post/logistic-regression-mle-a-full-breakdown/

If you can get a firm grasp of logistic regression, you'll be well set to understand deep learning!


2 of 3
5

So, I think you are mixing up minimizing your loss function versus maximizing your log-likelihood; since the two are equivalent, the equations you have written are actually the same. Let's assume there is only one sample, to keep things simple.

The first equation shows the loss-minimization update equation:

$$ \theta := \theta - \alpha \nabla_\theta J(\theta) $$

Here, the gradient of the loss is given by:

$$ \nabla_\theta J(\theta) = \big(h_\theta(x) - y\big)\,x $$

However, the third equation you have written,

$$ \nabla_\theta \ell(\theta) = \big(y - h_\theta(x)\big)\,x $$

is not the gradient with respect to the loss, but the gradient with respect to the log likelihood!

This is why you have a discrepancy in your signs.

Now look back at your first equation, and expand the gradient with the negative sign: you will then be maximizing the log likelihood, which is the same as minimizing the loss. Therefore they are all equivalent. Hope that helped!
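A tiny numeric check (my own sketch, not the poster's code, with made-up numbers) shows that a descent step on the loss and an ascent step on the log-likelihood land on the same parameters.

```python
import numpy as np

# Sketch: for one sample, J = -log-likelihood, so grad J = -grad ll,
# and gradient descent on J equals gradient ascent on the log-likelihood.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.0])
x = np.array([1.0, 2.0])
y = 1.0
alpha = 0.1

h = sigmoid(theta @ x)
grad_loss = (h - y) * x   # gradient of J = -(y log h + (1-y) log(1-h))
grad_ll = (y - h) * x     # gradient of the log-likelihood

step_descent = theta - alpha * grad_loss   # minimize loss
step_ascent = theta + alpha * grad_ll      # maximize log-likelihood
print(step_descent, step_ascent)           # identical steps
```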

Medium (medium.com/analytics-vidhya/logistic-regression-with-gradient-descent-explained-machine-learning-a9a12b38d710)
Logistic Regression with Gradient Descent Explained | Machine Learning | by Ashwin Prasad | Analytics Vidhya | Medium
February 5, 2024 - Logistic Regression with Gradient Descent Explained | Machine Learning What is Logistic Regression ? Why is it used for Classification ? What is Classification Problem ? In general , Supervised …
ML Glossary (ml-cheatsheet.readthedocs.io/en/latest/logistic_regression.html)
Logistic Regression — ML Glossary documentation
The benefits of taking the logarithm reveal themselves when you look at the cost function graphs for y=1 and y=0. These smooth monotonic functions [7] (always increasing or always decreasing) make it easy to calculate the gradient and minimize cost. Image from Andrew Ng’s slides on logistic regression [1].
Compphysics (compphysics.github.io/CompSciProgram/doc/pub/week45/html/week45.html)
Logistic Regression and Gradient Methods
To find the optimal parameters we would typically use a gradient descent method. Newton's method and gradient descent methods are discussed in the material on optimization methods. We show here how we can use a simple regression case on the breast cancer data using Logistic regression as our algorithm for classification.
Medium (medium.com/@edwinvarghese4442/logistic-regression-with-gradient-descent-tutorial-part-1-theory-529c93866001)
Logistic regression with gradient descent —Tutorial Part 1 — Theory | by Edwin Varghese | Medium
June 18, 2018 - For this tutorial, we are going to use Logistic regression. Let’s frame the equation for the first observation as follows: ... 1/(1+e^(-z)) is the sigmoid function, also called the logit function. It squashes the value of the output between 0 and 1. ... Suppose a (probability) = .26 and the ground truth is 1. Now there is an error. This error we reduce by adjusting the weights using the gradient descent algorithm.
Top answer
1 of 2
1

Maximum Likelihood


Maximum likelihood estimation involves defining a likelihood function for calculating the conditional probability of observing the data sample given a probability distribution and its parameters. This approach can be used to search a space of possible distributions and parameters.

The logistic model uses the sigmoid function (denoted by sigma) to estimate the probability that a given sample y belongs to class $1$ given inputs $X$ and weights $W$,

\begin{align} \ P(y=1 \mid x) = \sigma(W^TX) \end{align}

where the sigmoid of the activation $a_n$ for a given sample $n$ is:

\begin{align} \large y_n = \sigma(a_n) = \frac{1}{1+e^{-a_n}} \end{align}

The accuracy of our model predictions can be captured by the objective function $L$, which we are trying to maximize.

\begin{align} \large L = \displaystyle\prod_{n=1}^N y_n^{t_n}(1-y_n)^{1-t_n} \end{align}

If we take the log of the above function, we obtain the log-likelihood function, whose form will enable easier calculations of partial derivatives. Specifically, taking the log and maximizing it is acceptable because the logarithm is monotonically increasing, and therefore maximizing the log-likelihood will yield the same answer as maximizing our original objective function.

\begin{align} \ L = \displaystyle \sum_{n=1}^N \big[t_n\log y_n+(1-t_n)\log(1-y_n)\big] \end{align}

In our example, we will actually convert the objective function (which we would try to maximize) into a cost function (which we are trying to minimize) by converting it into the negative log-likelihood function:

\begin{align} \ J = -\displaystyle \sum_{n=1}^N \big[t_n\log y_n+(1-t_n)\log(1-y_n)\big] \end{align}

Gradient Descent


Gradient descent is an iterative optimization algorithm, which finds the minimum of a differentiable function. In this process, we try different values and update them to reach the optimal ones, minimizing the output.

Once we have an objective function, we can generally take its derivative with respect to the parameters (weights), set it equal to zero, and solve for the parameters to obtain the ideal solution. However, in the case of logistic regression (and many other complex or otherwise non-linear systems), this analytical method doesn’t work. Instead, we resort to a method known as gradient descent, whereby we randomly initialize and then incrementally update our weights by calculating the slope of our objective function. When applying the cost function, we want to continue updating our weights until the slope of the gradient gets as close to zero as possible. We can show this mathematically:

\begin{align} \ w:=w+\triangle w \end{align}

where the second term on the right is defined as the negative of the learning rate times the derivative of the cost function with respect to the weights (which is our gradient):

\begin{align} \ \triangle w = -\eta\nabla J(w) \end{align}

Thus, we want to take the derivative of the cost function with respect to the weight, which, using the chain rule, gives us:

\begin{align} \frac{\partial J}{\partial w_i} = \displaystyle \sum_{n=1}^N \frac{\partial J}{\partial y_n}\frac{\partial y_n}{\partial a_n}\frac{\partial a_n}{\partial w_i} \end{align}

Thus, we are looking to obtain three different derivatives. Let us start by solving for the derivative of the cost function with respect to $y$:

\begin{align} \frac{\partial J}{\partial y_n} = -\left(t_n \frac{1}{y_n} + (1-t_n) \frac{1}{1-y_n}(-1)\right) = -\left(\frac{t_n}{y_n} - \frac{1-t_n}{1-y_n}\right) \end{align}

Next, let us solve for the derivative of $y$ with respect to our activation function:

\begin{align} \large y_n = \sigma(a_n) = \frac{1}{1+e^{-a_n}} \end{align}

\begin{align} \frac{\partial y_n}{\partial a_n} = \frac{-1}{(1+e^{-a_n})^2}(e^{-a_n})(-1) = \frac{e^{-a_n}}{(1+e^{-a_n})^2} = \frac{1}{1+e^{-a_n}} \cdot \frac{e^{-a_n}}{1+e^{-a_n}} \end{align}

\begin{align} \frac{\partial y_n}{\partial a_n} = y_n(1-y_n) \end{align}

And lastly, we solve for the derivative of the activation function with respect to the weights:

\begin{align} \ a_n = W^TX_n \end{align}

\begin{align} \ a_n = w_0x_{n0} + w_1x_{n1} + w_2x_{n2} + \cdots + w_Dx_{nD} \end{align}

\begin{align} \frac{\partial a_n}{\partial w_i} = x_{ni} \end{align}

Now we can put it all together and simplify.

\begin{align} \frac{\partial J}{\partial w_i} = - \displaystyle\sum_{n=1}^N\left[\frac{t_n}{y_n}y_n(1-y_n)x_{ni}-\frac{1-t_n}{1-y_n}y_n(1-y_n)x_{ni}\right] \end{align}

\begin{align} = - \displaystyle\sum_{n=1}^N\left[t_n(1-y_n)-(1-t_n)y_n\right]x_{ni} \end{align}

\begin{align} = - \displaystyle\sum_{n=1}^N[t_n-t_ny_n-y_n+t_ny_n]x_{ni} \end{align}

\begin{align} \frac{\partial J}{\partial w_i} = \displaystyle\sum_{n=1}^N(y_n-t_n)x_{ni} \quad\text{and}\quad \frac{\partial J}{\partial w} = \displaystyle\sum_{n=1}^{N}(y_n-t_n)x_n \end{align}

We can get rid of the summation above by applying the principle that a dot product between two vectors is a sum over their shared index. That is:

\begin{align} \ a^Tb = \displaystyle\sum_{n=1}^Na_nb_n \end{align}

Therefore, the gradient with respect to w is:

\begin{align} \frac{\partial J}{\partial w} = X^T(Y-T) \end{align}

If you are asking yourself where the bias term of our equation ($w_0$) went, we calculate it the same way, except our $x$ becomes 1.
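The derivation above can be sketched in NumPy (my own illustration with made-up data; variable names follow the answer): compute the vectorized gradient $X^T(Y-T)$ and confirm it against a numerical derivative of the cost $J$.

```python
import numpy as np

# Sketch: verify dJ/dw = X^T (Y - T) for the negative log-likelihood J
# by comparing against a central-difference numerical gradient.
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cost(w, X, t):
    y = sigmoid(X @ w)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 3))
t = (rng.random(20) < 0.5).astype(float)
w = rng.normal(size=3) * 0.1

y = sigmoid(X @ w)
analytic = X.T @ (y - t)          # the closed-form gradient X^T (Y - T)

eps = 1e-6
numeric = np.array([
    (cost(w + eps * e, X, t) - cost(w - eps * e, X, t)) / (2 * eps)
    for e in np.eye(3)
])
print(np.max(np.abs(analytic - numeric)))
```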

  • Deep Learning Prerequisites: Logistic Regression in Python
  • Logistic Regression using Gradient descent and MLE (Projection)
  • Logistic Regression.pdf
  • Maximum likelihood and gradient descent
  • MAXIMUM LIKELIHOOD ESTIMATION (MLE)
  • Stanford.edu-Logistic Regression.pdf
  • Gradient Descent Equation in Logistic Regression
2 of 2
0

In short, maximum likelihood estimation is used to find the parameters given target values y and inputs x. Maximum likelihood estimation finds the parameters that maximise the probability of y given x. It has been proved that the MLE problem can be solved by finding the parameters that give the least cross entropy in the case of binary classification.

Gradient descent is an optimisation algorithm that helps you update the parameters iteratively to find the parameters which give the highest probability of y.

For more details: https://www.google.com/amp/s/glassboxmedicine.com/2019/12/07/connections-log-likelihood-cross-entropy-kl-divergence-logistic-regression-and-neural-networks/amp/

ML Explained (ml-explained.com/blog/logistic-regression-explained)
Logistic Regression - ML Explained
September 29, 2020 - We can get the gradient descent formula for Logistic Regression by taking the derivative of the loss function.
YOU CANalytics (ucanalytics.com)
YOU CANalytics | Gradient Descent for Logistic Regression Simplified - Step by Step Visual Guide – YOU CANalytics |
September 16, 2018 - If you will run the gradient descent without assuming β1 = β2 then β0 =-15.4233, β1 = 0.1090, and β2 = 0.1097. I suggest you try all these solutions using this code: Gradient Descent – Logistic Regression (R Code).
mlxtend documentation (rasbt.github.io/mlxtend/user_guide/classifier/LogisticRegression)
LogisticRegression: A binary classifier - mlxtend - GitHub Pages
Another advantage is that we can obtain the derivative more easily, using the addition trick to rewrite the product of factors as a summation term, which we can then maximize using optimization algorithms such as gradient ascent. As an alternative to maximizing the log-likelihood, we can define a cost function to be minimized; we rewrite the log-likelihood as: $$ J\big(\phi(z), y; \mathbf{w}\big) = \begin{cases} -\log\big(\phi(z)\big) & \text{if } y = 1 \\ -\log\big(1-\phi(z)\big) & \text{if } y = 0 \end{cases} $$ As we can see in the figure above, we penalize wrong predictions with an increasingly larger cost. To learn the weight coefficient of a logistic regression model via gradient-based optimization, we compute the partial derivative of the log-likelihood function -- w.r.t.
Medium (medium.com/analytics-vidhya/derivative-of-log-loss-function-for-logistic-regression-9b832f025c2d)
The Derivative of Cost Function for Logistic Regression | by Saket Thavanani | Analytics Vidhya | Medium
February 8, 2024 - In order to preserve the convex nature for the loss function, a log loss error function has been designed for logistic regression. The cost function is split for two cases y=1 and y=0. For the case when we have y=1 we can observe that when hypothesis function tends to 1 the error is minimized to zero and when it tends to 0 the error is maximum. This criterion exactly follows the criterion as we wanted ... In order to optimize this convex function, we can either go with gradient-descent or newtons method.
MLU-Explain (mlu-explain.github.io/logistic-regression)
Logistic Regression
A visual, interactive explanation of logistic regression for machine learning.
Google Developers (developers.google.com)
Linear regression: Gradient descent | Machine Learning | Google for Developers
February 3, 2026 - Learn how gradient descent iteratively finds the weight and bias that minimize a model's loss. This page explains how the gradient descent algorithm works, and how to determine that a model has converged by looking at its loss curve.
Atma's blog (atmamani.github.io/projects/ml/implementing-logistic-regression-in-python)
Implementing Gradient Descent for Logistic Regression - Atma's blog
Note: At this point, I realize my gradient descent is not really optimizing well. The equation of the decision boundary line is way off. Hence I approach to solve this problem using Scikit-Learn and see what its parameters are. Using the logistic regression from SKlearn, we fit the same data ...
Medium (medium.com/@ilmunabid/beginners-guide-to-finding-gradient-derivative-of-log-loss-by-hand-detailed-steps-74a6cacfe5cf)
Beginner’s Guide to Finding Gradient/Derivative of Log Loss by Hand (Detailed Steps) | by Abid Ilmun Fisabil | Medium
August 17, 2022 - First we find the gradient of SR with respect to R. Then we find the gradient of R with respect to yhat. Finally we find the gradient of yhat with respect to w. And by multiplying the results to each other you get the gradient of SR with respect to w. I guess now you know why it is called The “Chain” Rule, init? For a quick reference to logistic regression.