Both formulas are correct. Here is how you can get formula 2, by minimizing the sum of squared errors over $m$ samples:

$$E(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^i) - y^i\right)^2$$

Dividing it by $m$ gives you the MSE, and dropping the factor $\frac{1}{2}$ gives you the SSE used in formula 2. Since you are going to minimize this expression by taking partial derivatives, choosing the factor $\frac{1}{2}$ makes the derivation look nice.

Now if you use a model $h_\theta(x) = g(\theta x)$, where $g$ is the activation function and $\theta x = \sum_j \theta_j x_j$, you get

$$E(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(g(\theta x^i) - y^i\right)^2$$

(I omit the transpose symbol for $\theta$ in $\theta x$.) When you compute its partial derivative over $\theta_j$ for the $i$-th additive term, you have

$$\frac{\partial E^i}{\partial \theta_j} = \left(h_\theta(x^i) - y^i\right)\, g'(\theta x^i)\, x^i_j \tag{1}$$
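This derivative can be verified with a quick finite-difference check (an illustrative NumPy sketch; the data and the choice of a sigmoid activation are my own, not from the question):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sse_loss(theta, X, y):
    # E(theta) = 1/2 * sum_i (g(theta.x^i) - y^i)^2
    h = sigmoid(X @ theta)
    return 0.5 * np.sum((h - y) ** 2)

def sse_grad(theta, X, y):
    # dE/dtheta_j = sum_i (h^i - y^i) * g'(theta.x^i) * x^i_j,
    # with g'(z) = g(z)(1 - g(z)) for the sigmoid
    h = sigmoid(X @ theta)
    return X.T @ ((h - y) * h * (1 - h))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
theta = rng.normal(size=3)

# central finite differences along each coordinate of theta
eps = 1e-6
numeric = np.array([(sse_loss(theta + eps * np.eye(3)[j], X, y)
                     - sse_loss(theta - eps * np.eye(3)[j], X, y)) / (2 * eps)
                    for j in range(3)])
print(np.allclose(numeric, sse_grad(theta, X, y), atol=1e-6))  # True
```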

This is the formula 2 you gave. I won't give the detailed steps, but they are quite straightforward.

Yes, $g$ is the activation function, and you do have the factor $g'(\theta x)$ in the derivative expression as shown above. It disappears if it equals 1, i.e., if $g(z) = z + c$, where $c$ is invariable w.r.t. $z$.

For example, if $g(z) = z$ (i.e., $c = 0$), so the prediction model is linear, $h_\theta(x) = \theta x$, then you have $g'(\theta x) = 1$ and

$$\frac{\partial E^i}{\partial \theta_j} = \left(h_\theta(x^i) - y^i\right)x^i_j$$
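For this linear case, a few gradient-descent steps on made-up noise-free data recover the true parameters (an illustrative sketch; the learning rate and values are arbitrary):

```python
import numpy as np

# linear case: g(z) = z, so h_theta(x) = theta.x and the g' factor is 1
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
true_theta = np.array([2.0, -3.0])
y = X @ true_theta            # noise-free targets from a known linear model

theta = np.zeros(2)
for _ in range(500):
    grad = X.T @ (X @ theta - y)   # sum_i (h^i - y^i) x^i
    theta -= 0.01 * grad
print(np.allclose(theta, true_theta, atol=1e-6))  # True
```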

For another example, if $g(z) = \frac{1}{1 + e^{-z}}$, so the prediction model is sigmoid, $h_\theta(x) = g(\theta x)$, then $g'(z) = g(z)\left(1 - g(z)\right)$. This is why, in the book Artificial Intelligence: A Modern Approach, the derivative of logistic regression is:

$$\frac{\partial E^i}{\partial \theta_j} = \left(h_\theta(x^i) - y^i\right)\, h_\theta(x^i)\left(1 - h_\theta(x^i)\right) x^i_j \tag{2}$$

On the other hand, formula 1, although it has a similar-looking form, is deduced via a different approach. It is based on maximum likelihood (or, equivalently, minimum negative log-likelihood): you multiply the output probability function over all the samples and then take its negative logarithm, as given below:

$$L(\theta) = -\log \prod_{i=1}^{m} P(y^i \mid x^i; \theta) = -\sum_{i=1}^{m} \log P(y^i \mid x^i; \theta)$$
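The identity $g'(z) = g(z)\left(1 - g(z)\right)$ is easy to confirm numerically (an illustrative sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4.0, 4.0, 9)
g = sigmoid(z)
# central finite differences vs. the analytic identity g' = g(1 - g)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(np.allclose(numeric, g * (1 - g), atol=1e-8))  # True
```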

In a logistic regression problem, where the outputs are 0 and 1, each additive term becomes

$$ -\log P(y^i \mid x^i; \theta) = -\left(y^i \log h_\theta(x^i) + (1 - y^i)\log\left(1 - h_\theta(x^i)\right)\right) $$

Formula 1 is the derivative of it (and of its sum) when $h_\theta(x) = \frac{1}{1 + e^{-\theta x}}$, as below:

$$\frac{\partial L^i}{\partial \theta_j} = \left(h_\theta(x^i) - y^i\right)x^i_j \tag{3}$$

The derivation details are well covered in other posts.
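The claim that the per-sample derivative of the negative log-likelihood is $(h_\theta(x) - y)\,x$ can be checked by finite differences (an illustrative sketch with made-up values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(theta, x, y):
    # -(y log h + (1-y) log(1-h)) for a single sample
    h = sigmoid(x @ theta)
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))

x = np.array([1.0, -2.0, 0.5])    # one sample, illustrative values
y = 1.0
theta = np.array([0.3, 0.1, -0.7])

analytic = (sigmoid(x @ theta) - y) * x   # formula 1: (h - y) x
eps = 1e-6
numeric = np.array([(nll(theta + eps * np.eye(3)[j], x, y)
                     - nll(theta - eps * np.eye(3)[j], x, y)) / (2 * eps)
                    for j in range(3)])
print(np.allclose(numeric, analytic, atol=1e-6))  # True
```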

You can compare it with equations (1) and (2). Yes, equation (3) does not have the factor $g'(\theta x)$ that appears in equation (1), since that factor is not part of its deduction process at all. Equation (2) has the additional factor $h_\theta(x^i)\left(1 - h_\theta(x^i)\right)$ compared to equation (3). Since $h_\theta(x)$, as a probability, lies in the range $(0, 1)$, you have $0 < h_\theta(x)\left(1 - h_\theta(x)\right) \le \frac{1}{4}$, which means equation (2) gives you a gradient of smaller absolute value, and hence a slower convergence speed in gradient descent than equation (3).
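The magnitude argument can be demonstrated numerically: on random data, every per-sample gradient from equation (2) is at most a quarter of the corresponding one from equation (3) (an illustrative sketch; data and names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# made-up data purely to illustrate the magnitude comparison
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
theta = rng.normal(size=4)
y = (rng.random(100) < 0.5).astype(float)

h = sigmoid(X @ theta)
# per-sample gradient norms: equation (3) is (h-y)x, equation (2) adds h(1-h)
per_eq3 = np.abs(h - y) * np.linalg.norm(X, axis=1)
per_eq2 = np.abs(h - y) * h * (1 - h) * np.linalg.norm(X, axis=1)

print(np.all(h * (1 - h) <= 0.25))        # True: the extra factor never exceeds 1/4
print(np.all(per_eq2 <= 0.25 * per_eq3))  # True: each sample's gradient shrinks
```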

Note the sum of squared errors (SSE) is essentially a special case of maximum likelihood, when we consider the prediction $h_\theta(x)$ to be the mean of a conditional normal distribution. That is, $P(y \mid x; \theta) = \mathcal{N}\left(y;\, h_\theta(x),\, \sigma^2\right)$. If you want more in-depth knowledge in this area, I would suggest the Deep Learning book by Ian Goodfellow et al.
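Concretely, under that Gaussian assumption the negative log-likelihood of one sample is

$$-\log P(y \mid x; \theta) = \frac{\left(y - h_\theta(x)\right)^2}{2\sigma^2} + \frac{1}{2}\log\left(2\pi\sigma^2\right),$$

so, up to the additive constant and the scale $\sigma^2$, minimizing the summed negative log-likelihood is exactly minimizing $\frac{1}{2}\sum_{i}\left(y^i - h_\theta(x^i)\right)^2$, the (half) SSE.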

Answer from Xiao-Feng Li on Stack Exchange
Discussions

REALLY breaking down logistic regression gradient descent (r/learnmachinelearning, October 13, 2020)
Gradient descent for logistic regression partial derivative doubt (Cross Validated, February 13, 2017): Binary cross-entropy is a much better loss function to use with logistic regression.
Gradient descent vs Gradient ascent in Logistic Regression (r/learnmachinelearning, May 1, 2020): Gradient descent isn't a specific function; it's a method that can be applied to any cost function.
Is the code for linear and logistic regression always the same? (r/learnmachinelearning, September 4, 2023)
Short answer is "no"; longer answer is "noooooo." It depends what you are doing. I'm not primarily an ML person; my background is more traditional stats and econometrics. Linear regression (OLS) has a closed-form solution, but you could also use a cost function and gradient descent to get there; in other words, you can solve linear regression through maximum likelihood estimation (MLE). You could also invert a matrix and get the exact solution. For medium-to-large data you would approximate the matrix inverse using a pseudo-inverse technique. This assumes your problem is specified properly. My hunch is that the ML techniques will still run (the answer might not make sense, but the routine won't crash) when you can't invert the matrix because the variables you gave it include duplicates or linear combinations, etc. Logistic regression is an MLE problem: ML folks usually use some form of gradient descent, while traditional stats packages use a specific method like Newton-Raphson. In my experience Newton-Raphson is much faster for problems with no more than a couple dozen independent variables (features); gradient descent does well if you have hundreds of variables. More generally, N-R is faster, but has some constraints that G-D avoids (such as computing second derivatives). More discussion here: https://stats.stackexchange.com/questions/253632/why-is-newtons-method-not-widely-used-in-machine-learning For MLE-type problems there is a lot of different code, but it is all people trying to do the same thing either faster, or with more uncertainty quantification, or better managing edge cases.


Answer 2 of 3

So I think you are mixing up minimizing your loss function versus maximizing your log-likelihood; but also, since both are equivalent, the equations you have written are actually the same. Let's assume there is only one sample, to keep things simple.

The first equation shows the loss-minimization update:

$$\theta \leftarrow \theta - \alpha \nabla_\theta J(\theta)$$

Here, the gradient of the loss is given by:

$$\nabla_\theta J(\theta) = \left(h_\theta(x) - y\right)x$$

However, the third equation you have written:

$$\nabla_\theta \ell(\theta) = \left(y - h_\theta(x)\right)x$$

is not the gradient with respect to the loss, but the gradient with respect to the log-likelihood!

This is why you have a discrepancy in your signs.

Now look back at your first equation and substitute the gradient with its negative sign: you will then be maximizing the log-likelihood, which is the same as minimizing the loss. Therefore they are all equivalent. Hope that helped!
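This equivalence can be seen numerically: gradient descent on the negative log-likelihood and gradient ascent on the log-likelihood produce identical parameter trajectories (an illustrative sketch with made-up data):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
lr, steps = 0.01, 200

# gradient DESCENT on the negative log-likelihood (the loss)
theta_d = np.zeros(3)
for _ in range(steps):
    grad_loss = X.T @ (sigmoid(X @ theta_d) - y)   # gradient of the NLL
    theta_d -= lr * grad_loss

# gradient ASCENT on the log-likelihood
theta_a = np.zeros(3)
for _ in range(steps):
    grad_ll = X.T @ (y - sigmoid(X @ theta_a))     # = -(gradient of the NLL)
    theta_a += lr * grad_ll

print(np.allclose(theta_d, theta_a))  # True: identical trajectories
```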
