logistic regression gradient descent

Gradient descent for logistic regression partial derivative doubt

stats.stackexchange.com › questions › 261692 › gradient-descent-for-logistic-regression-partial-derivative-doubt

Both formulas are correct. Here is how you can get formula 2, by minimizing the sum of squared errors: $\text{[math]}$ Dividing it by $\text{[math]}$ gives you the MSE, and by $\text{[math]}$ gives you the SSE used in formula 2. Since you are going to minimize this expression by taking partial derivatives, choosing $\text{[math]}$ makes the derivation look nice.

Now you use a linear model where $\text{[math]}$ and $\text{[math]}$ , and you get (I omit the transpose symbol for $\text{[math]}$ in $\text{[math]}$ ): $\text{[math]}$ When you compute its partial derivative with respect to $\text{[math]}$ for each additive term, you have: $\text{[math]}$

This is formula 2 as you gave it. I omit the detailed steps, but they are quite straightforward.

Yes, $\text{[math]}$ is the activation function, and you do have the factor $\text{[math]}$ in the derivative expression as shown above. It disappears if it equals 1, i.e., $\text{[math]}$ , where $\text{[math]}$ is constant with respect to $\text{[math]}$ .

For example, if $\text{[math]}$ , (i.e., $\text{[math]}$ ), and the prediction model is linear where $\text{[math]}$ , too, then you have $\text{[math]}$ and $\text{[math]}$ .

For another example, if $\text{[math]}$ , while the prediction model is a sigmoid where $\text{[math]}$ , then $\text{[math]}$ . This is why, in the book Artificial Intelligence: A Modern Approach, the derivative of logistic regression is: $\text{[math]}$

On the other hand, formula 1, although similar in form, is derived via a different approach. It is based on maximum likelihood (or, equivalently, minimum negative log-likelihood): multiply the output probability function over all the samples and then take the negative logarithm, as given below: $\text{[math]}$

In a logistic regression problem, when the outputs are 0 and 1, then each additive term becomes,

$$ -\log P(y^i \mid x^i;\theta) = -\left(y^i \log h_\theta(x^i) + (1-y^i)\log(1-h_\theta(x^i))\right) $$

Formula 1 is the derivative of it (and of its sum) when $\text{[math]}$ , as below: $\text{[math]}$ The derivation details are given in another post.
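As a rough sanity check on the per-example negative log-likelihood above, the sketch below compares the well-known closed-form gradient $(h_\theta(x) - y)\,x_j$ against a finite-difference estimate. It assumes $h_\theta$ is the sigmoid of a linear score; the parameter and feature values are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(theta, x, y):
    """Per-example negative log-likelihood for a label y in {0, 1}."""
    h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))

def analytic_grad(theta, x, y):
    """Closed-form gradient: (h - y) * x_j for each component j."""
    h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return [(h - y) * xi for xi in x]

def numeric_grad(theta, x, y, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        up[j] += eps
        down = list(theta)
        down[j] -= eps
        grad.append((nll(up, x, y) - nll(down, x, y)) / (2 * eps))
    return grad

# Hypothetical values, chosen only for illustration.
theta = [0.5, -1.2, 0.3]
x = [1.0, 2.0, -0.5]
y = 1

ga = analytic_grad(theta, x, y)
gn = numeric_grad(theta, x, y)
max_gap = max(abs(a - n) for a, n in zip(ga, gn))
print("analytic:", ga)
print("numeric: ", gn)
```

If the two gradients agree to several decimal places, the closed form is consistent with the loss as written.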

You can compare it with equations (1) and (2). Yes, equation (3) does not have the $\text{[math]}$ factor that equation (1) has, since that factor is not part of its derivation at all. Equation (2) has additional factors $\text{[math]}$ compared to equation (3). Since $\text{[math]}$ , as a probability, is within the range (0, 1), you have $\text{[math]}$ , which means equation (2) gives you a gradient of smaller absolute value, and hence slower convergence in gradient descent than equation (3).
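To make the size difference concrete, here is a small sketch that assumes the additional factor in the squared-error gradient is the sigmoid derivative $h(1-h)$, which never exceeds 1/4. The function names and the single-feature setup are illustrative assumptions, not taken from the answer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_cross_entropy(z, y, x):
    """Gradient of the log loss w.r.t. the weight: (h - y) * x."""
    h = sigmoid(z)
    return (h - y) * x

def grad_squared_error(z, y, x):
    """Gradient of 0.5 * (h - y)^2 w.r.t. the weight: (h - y) * h * (1 - h) * x."""
    h = sigmoid(z)
    return (h - y) * h * (1 - h) * x

y, x = 1, 1.0  # true label 1, a single feature equal to 1 (assumed)
rows = []
for z in (-6.0, -2.0, 0.0, 2.0):
    ce = grad_cross_entropy(z, y, x)
    se = grad_squared_error(z, y, x)
    rows.append((z, ce, se))
    print(f"z={z:+.1f}  cross-entropy grad={ce:+.4f}  squared-error grad={se:+.4f}")
```

The gap is widest when the model is confidently wrong (very negative `z` with `y = 1`): the cross-entropy gradient stays close to -1 while the squared-error gradient shrinks toward zero.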

Note that the sum of squared errors (SSE) is essentially a special case of maximum likelihood, in which the prediction of $\text{[math]}$ is taken to be the mean of a conditional normal distribution. That is, $\text{[math]}$

If you want more in-depth knowledge in this area, I would suggest the Deep Learning book by Ian Goodfellow et al.
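The link between SSE and a Gaussian likelihood can be checked numerically. The sketch below assumes a fixed standard deviation of 1 and uses made-up targets and predictions; under that assumption the Gaussian negative log-likelihood equals half the SSE plus a constant, so differences between two candidate predictions match exactly.

```python
import math

def gaussian_nll(y_true, y_pred, sigma=1.0):
    """Negative log-likelihood of y_true under N(y_pred, sigma^2), summed over samples."""
    total = 0.0
    for y, mu in zip(y_true, y_pred):
        total += 0.5 * math.log(2 * math.pi * sigma ** 2)
        total += (y - mu) ** 2 / (2 * sigma ** 2)
    return total

def sse(y_true, y_pred):
    """Sum of squared errors."""
    return sum((y - mu) ** 2 for y, mu in zip(y_true, y_pred))

# Hypothetical data, for illustration only.
y_true = [1.0, 2.0, 3.0]
pred_a = [1.1, 1.9, 3.2]
pred_b = [0.5, 2.5, 2.0]

# The constant term cancels when two prediction sets are compared.
nll_gap = gaussian_nll(y_true, pred_a) - gaussian_nll(y_true, pred_b)
half_sse_gap = 0.5 * (sse(y_true, pred_a) - sse(y_true, pred_b))
print(nll_gap, half_sse_gap)
```

Because the two gaps coincide, minimizing SSE and maximizing this Gaussian likelihood pick the same predictions.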

Answer from Xiao-Feng Li on Stack Exchange

Upgrad

upgrad.com › home › blog › artificial intelligence › understanding gradient descent in logistic regression: a guide for beginners

Gradient Descent in Logistic Regression - Learn in Minutes!

June 26, 2025 - We use the log loss function for logistic regression, which calculates the difference between predicted probabilities and the true labels. Log loss is used instead of squared error loss because it heavily penalizes confident wrong predictions—like predicting 0.99 for a class that is actually 0—ensuring better probability calibration. The goal is to minimize this cost by adjusting the model’s parameters (weights) through gradient descent in logistic regression.
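A quick illustration of that penalty, using the snippet's own example of predicting 0.99 for a class that is actually 0. The helper names are illustrative.

```python
import math

def log_loss(p, y):
    """Binary log loss for one predicted probability p and a label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def squared_error(p, y):
    return (p - y) ** 2

confident_wrong = (0.99, 0)  # predicts 0.99, true class is 0
mildly_wrong = (0.60, 0)     # predicts 0.60, true class is 0

for p, y in (confident_wrong, mildly_wrong):
    print(f"p={p:.2f} y={y}  log loss={log_loss(p, y):.3f}  squared error={squared_error(p, y):.3f}")
```

Squared error is bounded by 1 for probabilities, whereas log loss grows without bound as the predicted probability of the wrong class approaches 1.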

Baeldung

baeldung.com › home › core concepts › math and logic › gradient descent equation in logistic regression

Gradient Descent Equation in Logistic Regression | Baeldung on Computer Science

February 13, 2025 - Gradient descent is an iterative optimization algorithm, which finds the minimum of a differentiable function. In this process, we try different values and update them to reach the optimal ones, minimizing the output.
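As a minimal sketch of that iterative process, the loop below minimizes a simple one-dimensional function. The function $f(x) = (x - 3)^2$, the starting point, and the learning rate are all assumptions chosen for illustration.

```python
def f(x):
    return (x - 3.0) ** 2

def df(x):
    """Derivative of f."""
    return 2.0 * (x - 3.0)

x = 0.0              # arbitrary starting point
learning_rate = 0.1  # assumed step size
for _ in range(100):
    x -= learning_rate * df(x)  # step against the derivative

print(x, f(x))
```

Each update shrinks the distance to the minimizer at 3 by a constant factor of 0.8 for this step size, so the iterate converges geometrically.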

Discussions

Gradient descent for logistic regression partial derivative doubt - Cross Validated

I'm a software engineer, and I have just started a Udacity's nanodegree of deep learning. I have also worked my way through Stanford professor Andrew Ng's online course on machine learning and now... More on stats.stackexchange.com

stats.stackexchange.com

February 13, 2017

REALLY breaking down logistic regression gradient descent

Great job. I appreciate all the effort you put to write up the equations in Latex. More on reddit.com

r/learnmachinelearning

October 13, 2020

Gradient descent for logistic regression (xj^(i).)

Hi all, below is a slide on “Gradient descent for logistic regression.” I just don’t understand why the following formula has xj^(i). This might be beyond the scope of our course, but could anyone explain that mathematically? Thanks! More on community.deeplearning.ai

community.deeplearning.ai

October 13, 2023

Gradient descent vs Gradient ascent in Logistic Regression

"derivation of the cost function of gradient descent": This makes me think you might be misunderstanding. Gradient descent isn't a specific function, it's a method that can be applied to any cost function. More on reddit.com

r/learnmachinelearning

May 1, 2020

Videos

04:44

YouTube

Logistic Regression Gradient Descent | Derivation | Machine Learning ...

January 15, 2021

19:52

YouTube

7.2.4. Gradient Descent for Logistic Regression - YouTube

October 8, 2021

youtube.com

Gradient Descent in Logistic Regression | Complete Derivation ...

February 15, 2025

53:47

YouTube

Machine Learning: logistic regression and gradient descent - YouTube

Logistic Regression: Gradient Descent - YouTube

September 13, 2019

17:33

YouTube

Logistic Regression in Python - Machine Learning Basics - YouTube

December 27, 2019

Medium

medium.com › intro-to-artificial-intelligence › logistic-regression-using-gradient-descent-bf8cbe749ceb

Logistic regression using gradient descent | by Dhanoop Karunakaran | Intro to Artificial Intelligence | Medium

June 16, 2023 - The gradient always points in the direction of steepest increase, so to move toward the minimum, we update the weights in the opposite direction of the gradient.

Ml-explained

ml-explained.com › blog › logistic-regression-explained

Logistic Regression - ML Explained

September 29, 2020 - We can get the gradient descent formula for Logistic Regression by taking the derivative of the loss function.

Stanford University

web.stanford.edu › ~jurafsky › slp3 › 5.pdf pdf

Speech and Language Processing. Daniel Jurafsky & James H. Martin.

The update equations going from time step $t$ to $t+1$ in stochastic gradient descent are thus:

$$ c_{pos}^{t+1} = c_{pos}^{t} - \eta\,[\sigma(c_{pos}^{t} \cdot w^{t}) - 1]\,w^{t} \tag{5.25} $$

$$ c_{neg_i}^{t+1} = c_{neg_i}^{t} - \eta\,[\sigma(c_{neg_i}^{t} \cdot w^{t})]\,w^{t} \tag{5.26} $$

$$ w^{t+1} = w^{t} - \eta\left[ [\sigma(c_{pos}^{t} \cdot w^{t}) - 1]\,c_{pos}^{t} + \sum_{i=1}^{k} [\sigma(c_{neg_i}^{t} \cdot w^{t})]\,c_{neg_i}^{t} \right] \tag{5.27} $$

Just as in logistic regression, then, the learning algorithm starts with randomly initialized W and C matrices, and then walks through the training corpus using gradient ...

Medium

medium.com › @thourayabchir1 › gradient-descent-09e824b4ef47

Gradient Descent for Logistic Regression | Medium

April 27, 2025 - Logistic regression uses the binary cross-entropy (log-loss): ... Minimizing J pulls the parameters so that the ŷ aligns with the true label y. Although J(w, b) is convex (there’s a single global minimum), it lacks a simple closed-form minimizer. Instead, we use gradient descent, which:
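A minimal batch gradient descent sketch for this setup, written in plain Python. The toy dataset, learning rate, and iteration count are assumptions made for illustration; the updates use the standard gradients of the mean binary cross-entropy with respect to `w` and `b`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Predicted probability y_hat = sigmoid(w . x + b)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def log_loss(w, b, X, y, eps=1e-12):
    """Mean binary cross-entropy J(w, b); probabilities are clipped to avoid log(0)."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict(w, b, xi), eps), 1 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(X)

def gradient_step(w, b, X, y, lr):
    """One batch update: w <- w - lr * dJ/dw, b <- b - lr * dJ/db."""
    n = len(X)
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for xi, yi in zip(X, y):
        err = predict(w, b, xi) - yi  # (y_hat - y)
        for j, xj in enumerate(xi):
            grad_w[j] += err * xj / n
        grad_b += err / n
    new_w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
    return new_w, b - lr * grad_b

# Hypothetical toy data: two features, binary labels.
X = [[0.5, 1.0], [1.0, 1.5], [1.5, 0.5], [3.0, 3.5], [3.5, 3.0], [4.0, 4.5]]
y = [0, 0, 0, 1, 1, 1]

w, b = [0.0, 0.0], 0.0
losses = [log_loss(w, b, X, y)]
for _ in range(500):
    w, b = gradient_step(w, b, X, y, lr=0.1)
    losses.append(log_loss(w, b, X, y))

print("initial loss:", losses[0])
print("final loss:  ", losses[-1])
```

With all parameters at zero every prediction is 0.5, so the starting loss is $\ln 2$; the loss should fall from there as the parameters are updated.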

Rensselaer Polytechnic Institute

cs.rpi.edu › ~magdon › courses › LFD-Slides › SlidesLect09.pdf pdf

Learning From Data Lecture 9 Logistic Regression and Gradient Descent

The Data is Still Binary, ±1: $D = (x_1, y_1 = \pm 1), \dots, (x_N, y_N = \pm 1)$, where $x_n$ is a person’s health information and $y_n = \pm 1$ records whether they had a heart attack or not. We cannot measure a probability.
