Both formulas are correct. Here is how you can get formula 2, by minimizing the sum of squared errors,

$$ E(\theta) = \frac{1}{2}\sum_i \big(h_\theta(x^i) - y^i\big)^2 $$

Dividing it by the number of samples $m$ gives you MSE, and dropping the $\frac{1}{2}$ gives you the plain SSE; formula 2 uses the $\frac{1}{2}$-scaled version. Since you are going to minimize this expression with the partial derivation technique, choosing the factor $\frac{1}{2}$ makes the derivation look nice.

Now if you use a model $h_\theta(x) = f(\theta x)$, where $f$ is the activation function and $\theta x$ is a linear combination of the inputs, you get, (I omit the transpose symbol for $\theta$ in $\theta x$)

$$ E(\theta) = \frac{1}{2}\sum_i \big(f(\theta x^i) - y^i\big)^2 $$

When you compute its partial derivative over $\theta_j$ for the $i$-th additive term, you have,

$$ \frac{\partial E^i}{\partial \theta_j} = \big(h_\theta(x^i) - y^i\big)\,f'(\theta x^i)\,x_j^i \tag{1} $$

This is the formula 2 you gave. I don't give the detailed steps, but it is quite straightforward.

Yes, $f$ is the activation function, and you do have the factor $f'(\theta x)$ in the derivative expression as shown above. It disappears if it equals 1, i.e., $f'(\theta x) = 1$, which holds when $f(z) = z + c$, where $c$ is invariable w.r.t. $\theta$.

For example, if $f(z) = z$ (i.e., $h_\theta(x) = \theta x$), and the prediction model is linear too, then you have $f'(\theta x) = 1$ and $\frac{\partial E^i}{\partial \theta_j} = \big(h_\theta(x^i) - y^i\big)\,x_j^i$.
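As a quick sanity check (my own NumPy sketch, not part of the original answer), the identity-activation case can be verified numerically: the analytic per-term gradient $(h_\theta(x) - y)\,x$ matches a finite-difference derivative of the squared error.

```python
import numpy as np

# Sketch: with identity activation f(z) = z, the gradient of the
# 0.5 * (h - y)^2 term reduces to (h - y) * x, since f'(theta x) = 1.
rng = np.random.default_rng(0)
theta = rng.normal(size=3)
x = rng.normal(size=3)
y = 0.7

h = theta @ x                        # linear model, f(z) = z
analytic_grad = (h - y) * x          # (h - y) * f'(theta x) * x with f' = 1

# Central-difference check of d/d theta_j of 0.5 * (theta x - y)^2
eps = 1e-6
numeric_grad = np.empty_like(theta)
for j in range(len(theta)):
    tp, tm = theta.copy(), theta.copy()
    tp[j] += eps
    tm[j] -= eps
    numeric_grad[j] = (0.5 * (tp @ x - y) ** 2 - 0.5 * (tm @ x - y) ** 2) / (2 * eps)

print(np.max(np.abs(analytic_grad - numeric_grad)))
```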

For another example, if $f(z) = \sigma(z) = \frac{1}{1+e^{-z}}$, while the prediction model is sigmoid where $h_\theta(x) = \sigma(\theta x)$, then $f'(\theta x) = h_\theta(x)\big(1-h_\theta(x)\big)$. This is why in the book Artificial Intelligence: A Modern Approach, the derivative of logistic regression (under squared error) is:

$$ \frac{\partial E^i}{\partial \theta_j} = \big(h_\theta(x^i) - y^i\big)\,h_\theta(x^i)\big(1-h_\theta(x^i)\big)\,x_j^i \tag{2} $$

On the other hand, the formula 1, although looking like a similar form, is deduced via a different approach. It is based on the maximum likelihood (or equivalently minimum negative log-likelihood), obtained by multiplying the output probability function over all the samples and then taking its negative logarithm, as given below,

$$ L(\theta) = -\log \prod_i P(y^i \mid x^i; \theta) = -\sum_i \log P(y^i \mid x^i; \theta) $$
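The sigmoid-derivative identity used above is easy to confirm numerically; this small NumPy sketch (my addition, not from the answer) checks $\sigma'(z) = \sigma(z)\big(1-\sigma(z)\big)$ on a grid of points.

```python
import numpy as np

# Sketch: numerically verify sigma'(z) = sigma(z) * (1 - sigma(z)),
# the factor that appears in the squared-error gradient of a sigmoid model.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4, 4, 9)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference
identity = sigmoid(z) * (1 - sigmoid(z))

print(np.max(np.abs(numeric - identity)))
```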

In a logistic regression problem, where the outputs are 0 and 1, each additive term becomes,

$$ -\log P(y^i \mid x^i;\theta) = -\big(y^i\log h_\theta(x^i) + (1-y^i)\log(1-h_\theta(x^i))\big) $$

The formula 1 is the derivative of it (and its sum) when $h_\theta(x) = \sigma(\theta x)$, as below,

$$ \frac{\partial L}{\partial \theta_j} = \sum_i \big(h_\theta(x^i) - y^i\big)\,x_j^i \tag{3} $$

The derivation details are well given in another post.

You can compare it with equations (1) and (2). Yes, equation (3) does not have the factor $f'(\theta x)$ that equation (1) has, since that factor is not part of its deduction process at all. Equation (2) has the additional factor $h_\theta(x^i)\big(1-h_\theta(x^i)\big)$ compared to equation (3). Since $h_\theta(x^i)$, as a probability, is within the range (0, 1), you have $h_\theta(x^i)\big(1-h_\theta(x^i)\big) \le \frac{1}{4}$; that means equation (2) gives you a gradient of smaller absolute value, hence a slower convergence speed in gradient descent than equation (3).
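This shrinking effect can be illustrated with a short NumPy sketch (my own illustration with made-up data, not from the answer): each per-sample term of the squared-error gradient carries the extra $h(1-h)$ factor, so it is at most a quarter of the corresponding MLE term.

```python
import numpy as np

# Sketch: compare per-sample gradient magnitudes of eq. (2) (squared error)
# and eq. (3) (MLE/cross-entropy) for a sigmoid model on random data.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
theta = rng.normal(size=4)
X = rng.normal(size=(100, 4))
y = (rng.random(100) < 0.5).astype(float)

h = sigmoid(X @ theta)
term_mle = np.abs(h - y)           # per-sample |h - y| from eq. (3)
term_mse = term_mle * h * (1 - h)  # per-sample term from eq. (2)

# Since h*(1-h) <= 1/4 on (0, 1), every MSE term is at most a quarter
# of the matching MLE term.
print(float(term_mse.max()), float(term_mle.max()))
```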

Note the sum of squared errors (SSE) is essentially a special case of maximum likelihood, when we consider the prediction $h_\theta(x)$ to be the mean of a conditional normal distribution. That is, $P(y \mid x; \theta) = \mathcal{N}\big(y;\, h_\theta(x),\, \sigma^2\big)$, whose negative log-likelihood is the SSE up to a positive scale and an additive constant. If you want to get more in-depth knowledge in this area, I would suggest the Deep Learning book by Ian Goodfellow, et al.
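That SSE-as-Gaussian-MLE claim can be checked numerically. In this sketch (my addition; the fixed noise standard deviation `s` and the random data are assumptions for illustration), the Gaussian negative log-likelihood equals the half-SSE scaled by $1/s^2$ plus a $\theta$-independent constant.

```python
import numpy as np

# Sketch: with P(y|x) = Normal(h_theta(x), s^2), the negative log-likelihood
# is (0.5 * SSE) / s^2 plus a constant that does not depend on theta, so
# minimizing SSE is the same as maximizing this Gaussian likelihood.
rng = np.random.default_rng(2)
n = 50
h = rng.normal(size=n)   # stand-in model predictions h_theta(x^i)
y = rng.normal(size=n)   # stand-in targets
s = 1.3                  # fixed noise std (assumed, not fitted)

sse = 0.5 * np.sum((h - y) ** 2)
nll = np.sum(0.5 * np.log(2 * np.pi * s**2) + (y - h) ** 2 / (2 * s**2))
const = n * 0.5 * np.log(2 * np.pi * s**2)   # theta-independent part

print(nll - const, sse / s**2)   # the two quantities coincide
```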

Answer from Xiao-Feng Li on Stack Exchange
Baeldung (baeldung.com)
Gradient Descent Equation in Logistic Regression | Baeldung on Computer Science
February 13, 2025 - Surprisingly, the update rule is the same as the one derived by using the sum of the squared errors in linear regression. As a result, we can use the same gradient descent formula for logistic regression as well.
Stanford University (web.stanford.edu/~jurafsky/slp3/5.pdf)
Speech and Language Processing. Daniel Jurafsky & James H. Martin.
The update equations going from time step t to t + 1 in stochastic gradient descent are thus: c_pos^{t+1} = c_pos^t − η[σ(c_pos^t · w^t) − 1]w^t (5.25); c_neg_i^{t+1} = c_neg_i^t − η[σ(c_neg_i^t · w^t)]w^t (5.26); w^{t+1} = w^t − η[(σ(c_pos^t · w^t) − 1)c_pos^t + Σ_{i=1}^{k} σ(c_neg_i^t · w^t)c_neg_i^t] (5.27). Just as in logistic regression, then, the learning algorithm starts with randomly initialized W and C matrices, and then walks through the training corpus using gradient ...
Discussions

REALLY breaking down logistic regression gradient descent
Great job. I appreciate all the effort you put to write up the equations in Latex. More on reddit.com
r/learnmachinelearning, October 13, 2020
Gradient descent for logistic regression partial derivative doubt - Cross Validated
In the case of linear regression, ... first formula. In the case of logistic regression, f′(h) = f(h) * (1 - f(h)). I would not recommend using mean squared error loss for logistic regression, as it's very slow. Binary cross entropy is a much better loss function to use with logistic regression. More on stats.stackexchange.com
stats.stackexchange.com
February 13, 2017
MLE & Gradient Descent in Logistic Regression - Data Science Stack Exchange
In Logistic Regression, MLE is used to develop a mathematical function to estimate the model parameters, optimization techniques like Gradient Descent are used to solve this function. Can somebody ... More on datascience.stackexchange.com
datascience.stackexchange.com
January 9, 2022
Cases when a 'simpler' model was a better solution than Gradient Boosters in your job or project?
when theory suggests additivity, among other things-is a reasonable model. Especially if you care about inference. there's a reason why linear regression is still around. More on reddit.com
r/datascience, November 30, 2023
People also ask

Can Gradient Descent in Logistic Regression handle non-linear data?
Gradient Descent in Logistic Regression is primarily used for linear classification tasks. However, if your data is non-linear, logistic regression can still work by using transformations like polynomial features. For more complex non-linear problems, consider using other models like support vector machines or neural networks, which can better handle non-linear data relationships.
upgrad.com: Gradient Descent in Logistic Regression - Learn in Minutes!
Can Gradient Descent in Logistic Regression be used for regression tasks?
While Gradient Descent in Logistic Regression is primarily used for binary classification tasks, the core concept of gradient descent can be applied to other types of regression. For regression problems, Linear Regression is used, where the goal is to predict continuous values. Logistic regression, however, is suited for classification tasks due to its probability output via the sigmoid function.
Source: upgrad.com
How does Gradient Descent in Logistic Regression handle high-dimensional data?
When working with high-dimensional data, Gradient Descent in Logistic Regression can still be effective but may suffer from issues like slower convergence or overfitting. In these cases, dimensionality reduction techniques like PCA (Principal Component Analysis) or regularization methods (L1 or L2) can be applied to improve model performance and prevent overfitting while optimizing the cost function.
Source: upgrad.com
Medium (medium.com/intro-to-artificial-intelligence/logistic-regression-using-gradient-descent-bf8cbe749ceb)
Logistic regression using gradient descent | by Dhanoop Karunakaran | Intro to Artificial Intelligence | Medium
June 16, 2023 - We consider f(x) = g(z), and we know the fact that f(x) = wx+b for the linear regression. In this case, we can rewrite z as z = wx+b. Remember, z formula is not just tied with this equation, we can replace this with multiple linear regression ...
Upgrad (upgrad.com)
Gradient Descent in Logistic Regression - Learn in Minutes!
June 26, 2025 - We use the log loss function for logistic regression, which calculates the difference between predicted probabilities and the true labels. Log loss is used instead of squared error loss because it heavily penalizes confident wrong predictions—like predicting 0.99 for a class that is actually 0—ensuring better probability calibration. The goal is to minimize this cost by adjusting the model’s parameters (weights) through gradient descent in logistic regression.
Reddit (reddit.com/r/learnmachinelearning)
r/learnmachinelearning on Reddit: REALLY breaking down logistic regression gradient descent
October 13, 2020 -

I recently wrote a blog post that breaks down the math behind maximum likelihood estimation for logistic regression. My friends found it helpful, so decided to spread it around. If you've found the math to be hard to follow in other tutorials, hopefully mine will guide you through it step by step.

Here it is: https://statisticalmusings.netlify.app/post/logistic-regression-mle-a-full-breakdown/

If you can get a firm grasp of logistic regression, you'll be well set to understand deep learning!


2 of 3
5

So, I think you are mixing up minimizing your loss function versus maximizing your log-likelihood; since the two are equivalent, the equations you have written are actually the same. Let's assume there is only one sample, to keep things simple.

The first equation shows the loss-minimization update equation:

$$ \theta := \theta - \alpha \nabla_\theta J(\theta) $$

Here, the gradient of the loss is given by:

$$ \nabla_\theta J(\theta) = \big(h_\theta(x) - y\big)\,x $$

However, the third equation you have written,

$$ \nabla_\theta \ell(\theta) = \big(y - h_\theta(x)\big)\,x $$

is not the gradient with respect to the loss, but the gradient with respect to the log likelihood!

This is why you have a discrepancy in your signs.

Now look back at your first equation, and expand the gradient with the negative sign: you will then be maximizing the log likelihood, which is the same as minimizing the loss. Therefore they are all equivalent. Hope that helped!
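A tiny numeric check (my own sketch, not the poster's code, with made-up numbers) shows that a descent step on the loss and an ascent step on the log-likelihood land on the same parameters.

```python
import numpy as np

# Sketch: for one sample, J = -log-likelihood, so grad J = -grad ll,
# and gradient descent on J equals gradient ascent on the log-likelihood.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.0])
x = np.array([1.0, 2.0])
y = 1.0
alpha = 0.1

h = sigmoid(theta @ x)
grad_loss = (h - y) * x   # gradient of J = -(y log h + (1-y) log(1-h))
grad_ll = (y - h) * x     # gradient of the log-likelihood

step_descent = theta - alpha * grad_loss   # minimize loss
step_ascent = theta + alpha * grad_ll      # maximize log-likelihood
print(step_descent, step_ascent)           # identical steps
```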

Medium (medium.com/analytics-vidhya/logistic-regression-with-gradient-descent-explained-machine-learning-a9a12b38d710)
Logistic Regression with Gradient Descent Explained | Machine Learning | by Ashwin Prasad | Analytics Vidhya | Medium
February 5, 2024 - Logistic Regression with Gradient Descent Explained | Machine Learning What is Logistic Regression ? Why is it used for Classification ? What is Classification Problem ? In general , Supervised …
ML Glossary (ml-cheatsheet.readthedocs.io/en/latest/logistic_regression.html)
Logistic Regression — ML Glossary documentation
The benefits of taking the logarithm reveal themselves when you look at the cost function graphs for y=1 and y=0. These smooth monotonic functions [7] (always increasing or always decreasing) make it easy to calculate the gradient and minimize cost. Image from Andrew Ng’s slides on logistic regression [1].
Compphysics (compphysics.github.io/CompSciProgram/doc/pub/week45/html/week45.html)
Logistic Regression and Gradient Methods
To find the optimal parameters we would typically use a gradient descent method. Newton's method and gradient descent methods are discussed in the material on optimization methods. We show here how we can use a simple regression case on the breast cancer data using Logistic regression as our algorithm for classification.
Medium (medium.com/@edwinvarghese4442/logistic-regression-with-gradient-descent-tutorial-part-1-theory-529c93866001)
Logistic regression with gradient descent —Tutorial Part 1 — Theory | by Edwin Varghese | Medium
June 18, 2018 - For this tutorial, we are going to use Logistic regression. Let’s frame the equation for the first observation as follows: ... 1/(1+e^(-z)) is the sigmoid function, also called the logit function. It squashes the value of the output between 0 and 1. ... Suppose a (probability) = .26 and the ground truth is 1. Now there is an error. This error we reduce by adjusting the weights using the gradient descent algorithm.
Top answer
1 of 2
1

Maximum Likelihood


Maximum likelihood estimation involves defining a likelihood function for calculating the conditional probability of observing the data sample given a probability distribution and its parameters. This approach can be used to search a space of possible distributions and parameters.

The logistic model uses the sigmoid function (denoted by sigma) to estimate the probability that a given sample y belongs to class $1$ given inputs $X$ and weights $W$,

\begin{align} \ P(y=1 \mid x) = \sigma(W^TX) \end{align}

where the sigmoid of the activation $a_n$ for a given sample $n$ is:

\begin{align} \large y_n = \sigma(a_n) = \frac{1}{1+e^{-a_n}} \end{align}

The accuracy of our model predictions can be captured by the objective function $L$, which we are trying to maximize.

\begin{align} \large L = \displaystyle\prod_{n=1}^N y_n^{t_n}(1-y_n)^{1-t_n} \end{align}

If we take the log of the above function, we obtain the log-likelihood function, whose form will enable easier calculations of partial derivatives. Specifically, taking the log and maximizing it is acceptable because the logarithm is monotonically increasing, and therefore maximizing the log-likelihood will yield the same answer as maximizing our original objective function.

\begin{align} \ L = \displaystyle \sum_{n=1}^N \big[t_n\log y_n+(1-t_n)\log(1-y_n)\big] \end{align}

In our example, we will actually convert the objective function (which we would try to maximize) into a cost function (which we are trying to minimize) by converting it into the negative log-likelihood function:

\begin{align} \ J = -\displaystyle \sum_{n=1}^N \big[t_n\log y_n+(1-t_n)\log(1-y_n)\big] \end{align}

Gradient Descent


Gradient descent is an iterative optimization algorithm, which finds the minimum of a differentiable function. In this process, we try different values and update them to reach the optimal ones, minimizing the output.

Once we have an objective function, we can generally take its derivative with respect to the parameters (weights), set it equal to zero, and solve for the parameters to obtain the ideal solution. However, in the case of logistic regression (and many other complex or otherwise non-linear systems), this analytical method doesn’t work. Instead, we resort to a method known as gradient descent, whereby we randomly initialize and then incrementally update our weights by calculating the slope of our objective function. When applying the cost function, we want to continue updating our weights until the slope of the gradient gets as close to zero as possible. We can show this mathematically:

\begin{align} \ w:=w+\triangle w \end{align}

where the second term on the right is defined as the negative of the learning rate times the derivative of the cost function with respect to the weights (which is our gradient):

\begin{align} \ \triangle w = -\eta\nabla J(w) \end{align}

Thus, we want to take the derivative of the cost function with respect to the weight, which, using the chain rule, gives us:

\begin{align} \frac{\partial J}{\partial w_i} = \displaystyle \sum_{n=1}^N \frac{\partial J}{\partial y_n}\frac{\partial y_n}{\partial a_n}\frac{\partial a_n}{\partial w_i} \end{align}

Thus, we are looking to obtain three different derivatives. Let us start by solving for the derivative of the cost function with respect to $y$:

\begin{align} \frac{\partial J}{\partial y_n} = -\left(t_n \frac{1}{y_n} + (1-t_n) \frac{1}{1-y_n}(-1)\right) = -\left(\frac{t_n}{y_n} - \frac{1-t_n}{1-y_n}\right) \end{align}

Next, let us solve for the derivative of $y$ with respect to our activation function:

\begin{align} \large y_n = \sigma(a_n) = \frac{1}{1+e^{-a_n}} \end{align}

\begin{align} \frac{\partial y_n}{\partial a_n} = \frac{-1}{(1+e^{-a_n})^2}(e^{-a_n})(-1) = \frac{e^{-a_n}}{(1+e^{-a_n})^2} = \frac{1}{1+e^{-a_n}} \cdot \frac{e^{-a_n}}{1+e^{-a_n}} \end{align}

\begin{align} \frac{\partial y_n}{\partial a_n} = y_n(1-y_n) \end{align}

And lastly, we solve for the derivative of the activation function with respect to the weights:

\begin{align} \ a_n = W^TX_n \end{align}

\begin{align} \ a_n = w_0x_{n0} + w_1x_{n1} + w_2x_{n2} + \cdots + w_Dx_{nD} \end{align}

\begin{align} \frac{\partial a_n}{\partial w_i} = x_{ni} \end{align}

Now we can put it all together and simplify.

\begin{align} \frac{\partial J}{\partial w_i} = - \displaystyle\sum_{n=1}^N\left[\frac{t_n}{y_n}y_n(1-y_n)x_{ni}-\frac{1-t_n}{1-y_n}y_n(1-y_n)x_{ni}\right] \end{align}

\begin{align} = - \displaystyle\sum_{n=1}^N\left[t_n(1-y_n)-(1-t_n)y_n\right]x_{ni} \end{align}

\begin{align} = - \displaystyle\sum_{n=1}^N[t_n-t_ny_n-y_n+t_ny_n]x_{ni} \end{align}

\begin{align} \frac{\partial J}{\partial w_i} = \displaystyle\sum_{n=1}^N(y_n-t_n)x_{ni} \quad\text{and}\quad \frac{\partial J}{\partial w} = \displaystyle\sum_{n=1}^{N}(y_n-t_n)x_n \end{align}

We can get rid of the summation above by applying the principle that a dot product between two vectors is a sum over their shared index. That is:

\begin{align} \ a^Tb = \displaystyle\sum_{n=1}^Na_nb_n \end{align}

Therefore, the gradient with respect to w is:

\begin{align} \frac{\partial J}{\partial w} = X^T(Y-T) \end{align}

If you are asking yourself where the bias term of our equation ($w_0$) went, we calculate it the same way, except our $x$ becomes 1.
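The derivation above can be sketched in NumPy (my own illustration with made-up data; variable names follow the answer): compute the vectorized gradient $X^T(Y-T)$ and confirm it against a numerical derivative of the cost $J$.

```python
import numpy as np

# Sketch: verify dJ/dw = X^T (Y - T) for the negative log-likelihood J
# by comparing against a central-difference numerical gradient.
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cost(w, X, t):
    y = sigmoid(X @ w)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 3))
t = (rng.random(20) < 0.5).astype(float)
w = rng.normal(size=3) * 0.1

y = sigmoid(X @ w)
analytic = X.T @ (y - t)          # the closed-form gradient X^T (Y - T)

eps = 1e-6
numeric = np.array([
    (cost(w + eps * e, X, t) - cost(w - eps * e, X, t)) / (2 * eps)
    for e in np.eye(3)
])
print(np.max(np.abs(analytic - numeric)))
```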

  • Deep Learning Prerequisites: Logistic Regression in Python
  • Logistic Regression using Gradient descent and MLE (Projection)
  • Logistic Regression.pdf
  • Maximum likelihood and gradient descent
  • MAXIMUM LIKELIHOOD ESTIMATION (MLE)
  • Stanford.edu-Logistic Regression.pdf
  • Gradient Descent Equation in Logistic Regression
2 of 2
0

In short, maximum likelihood estimation is used to find the parameters given target values y and inputs x. Maximum likelihood estimation finds the parameters that maximise the probability of y given x. It has been proved that the MLE problem can be solved by finding the parameters that give the least cross entropy in the case of binary classification.

Gradient descent is an optimisation algorithm that helps you update the parameters iteratively to find the parameters which give the highest probability of y.

For more details: https://www.google.com/amp/s/glassboxmedicine.com/2019/12/07/connections-log-likelihood-cross-entropy-kl-divergence-logistic-regression-and-neural-networks/amp/

ML Explained (ml-explained.com/blog/logistic-regression-explained)
Logistic Regression - ML Explained
September 29, 2020 - We can get the gradient descent formula for Logistic Regression by taking the derivative of the loss function.
YOU CANalytics (ucanalytics.com)
YOU CANalytics | Gradient Descent for Logistic Regression Simplified - Step by Step Visual Guide – YOU CANalytics |
September 16, 2018 - If you will run the gradient descent without assuming β1 = β2 then β0 =-15.4233, β1 = 0.1090, and β2 = 0.1097. I suggest you try all these solutions using this code: Gradient Descent – Logistic Regression (R Code).
mlxtend documentation (rasbt.github.io/mlxtend/user_guide/classifier/LogisticRegression)
LogisticRegression: A binary classifier - mlxtend - GitHub Pages
Another advantage is that we can obtain the derivative more easily, using the addition trick to rewrite the product of factors as a summation term, which we can then maximize using optimization algorithms such as gradient ascent. As an alternative to maximizing the log-likelihood, we can define a cost function to be minimized; we rewrite the log-likelihood as: $$ J\big(\phi(z), y; \mathbf{w}\big) = \begin{cases} -\log\big(\phi(z)\big) & \text{if } y = 1 \\ -\log\big(1-\phi(z)\big) & \text{if } y = 0 \end{cases} $$ As we can see in the figure above, we penalize wrong predictions with an increasingly larger cost. To learn the weight coefficient of a logistic regression model via gradient-based optimization, we compute the partial derivative of the log-likelihood function -- w.r.t.
Medium (medium.com/analytics-vidhya/derivative-of-log-loss-function-for-logistic-regression-9b832f025c2d)
The Derivative of Cost Function for Logistic Regression | by Saket Thavanani | Analytics Vidhya | Medium
February 8, 2024 - In order to preserve the convex nature for the loss function, a log loss error function has been designed for logistic regression. The cost function is split for two cases y=1 and y=0. For the case when we have y=1 we can observe that when hypothesis function tends to 1 the error is minimized to zero and when it tends to 0 the error is maximum. This criterion exactly follows the criterion as we wanted ... In order to optimize this convex function, we can either go with gradient-descent or newtons method.
MLU-Explain (mlu-explain.github.io/logistic-regression)
Logistic Regression
A visual, interactive explanation of logistic regression for machine learning.
Google Developers (developers.google.com)
Linear regression: Gradient descent | Machine Learning | Google for Developers
February 3, 2026 - Learn how gradient descent iteratively finds the weight and bias that minimize a model's loss. This page explains how the gradient descent algorithm works, and how to determine that a model has converged by looking at its loss curve.
Atma's blog (atmamani.github.io/projects/ml/implementing-logistic-regression-in-python)
Implementing Gradient Descent for Logistic Regression - Atma's blog
Note: At this point, I realize my gradient descent is not really optimizing well. The equation of the decision boundary line is way off. Hence I approach to solve this problem using Scikit-Learn and see what its parameters are. Using the logistic regression from SKlearn, we fit the same data ...
Medium (medium.com/@ilmunabid/beginners-guide-to-finding-gradient-derivative-of-log-loss-by-hand-detailed-steps-74a6cacfe5cf)
Beginner’s Guide to Finding Gradient/Derivative of Log Loss by Hand (Detailed Steps) | by Abid Ilmun Fisabil | Medium
August 17, 2022 - First we find the gradient of SR with respect to R. Then we find the gradient of R with respect to yhat. Finally we find the gradient of yhat with respect to w. And by multiplying the results to each other you get the gradient of SR with respect to w. I guess now you know why it is called The “Chain” Rule, init? For a quick reference to logistic regression.