Both formulas are correct. Here is how you can get formula 2, by minimizing the sum of squared errors over $m$ samples:

$$E(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^i) - y^i\right)^2$$

Dividing it by $m$ gives you the MSE, and dropping the factor $\frac{1}{2}$ gives you the SSE used in formula 2. Since you are going to minimize this expression by taking partial derivatives, choosing the factor $\frac{1}{2}$ makes the derivation look nice.

Now if you use a model $h_\theta(x) = g(\theta x)$, where $g$ is the activation function and $\theta x = \sum_j \theta_j x_j$, you get

$$E(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(g(\theta x^i) - y^i\right)^2$$

(I omit the transpose symbol for $\theta$ in $\theta x$.) When you compute its partial derivative over $\theta_j$ for the $i$-th additive term, you have

$$\frac{\partial E^i}{\partial \theta_j} = \left(h_\theta(x^i) - y^i\right)\, g'(\theta x^i)\, x^i_j \tag{1}$$
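This derivative can be verified with a quick finite-difference check (an illustrative NumPy sketch; the data and the choice of a sigmoid activation are my own, not from the question):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sse_loss(theta, X, y):
    # E(theta) = 1/2 * sum_i (g(theta.x^i) - y^i)^2
    h = sigmoid(X @ theta)
    return 0.5 * np.sum((h - y) ** 2)

def sse_grad(theta, X, y):
    # dE/dtheta_j = sum_i (h^i - y^i) * g'(theta.x^i) * x^i_j,
    # with g'(z) = g(z)(1 - g(z)) for the sigmoid
    h = sigmoid(X @ theta)
    return X.T @ ((h - y) * h * (1 - h))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
theta = rng.normal(size=3)

# central finite differences along each coordinate of theta
eps = 1e-6
numeric = np.array([(sse_loss(theta + eps * np.eye(3)[j], X, y)
                     - sse_loss(theta - eps * np.eye(3)[j], X, y)) / (2 * eps)
                    for j in range(3)])
print(np.allclose(numeric, sse_grad(theta, X, y), atol=1e-6))  # True
```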

This is the formula 2 you gave. I won't give the detailed steps, but they are quite straightforward.

Yes, $g$ is the activation function, and you do have the factor $g'(\theta x)$ in the derivative expression as shown above. It disappears if it equals 1, i.e., if $g(z) = z + c$, where $c$ is invariable w.r.t. $z$.

For example, if $g(z) = z$ (i.e., $c = 0$), so the prediction model is linear, $h_\theta(x) = \theta x$, then you have $g'(\theta x) = 1$ and

$$\frac{\partial E^i}{\partial \theta_j} = \left(h_\theta(x^i) - y^i\right)x^i_j$$
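For this linear case, a few gradient-descent steps on made-up noise-free data recover the true parameters (an illustrative sketch; the learning rate and values are arbitrary):

```python
import numpy as np

# linear case: g(z) = z, so h_theta(x) = theta.x and the g' factor is 1
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
true_theta = np.array([2.0, -3.0])
y = X @ true_theta            # noise-free targets from a known linear model

theta = np.zeros(2)
for _ in range(500):
    grad = X.T @ (X @ theta - y)   # sum_i (h^i - y^i) x^i
    theta -= 0.01 * grad
print(np.allclose(theta, true_theta, atol=1e-6))  # True
```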

For another example, if $g(z) = \frac{1}{1 + e^{-z}}$, so the prediction model is sigmoid, $h_\theta(x) = g(\theta x)$, then $g'(z) = g(z)\left(1 - g(z)\right)$. This is why, in the book Artificial Intelligence: A Modern Approach, the derivative of logistic regression is:

$$\frac{\partial E^i}{\partial \theta_j} = \left(h_\theta(x^i) - y^i\right)\, h_\theta(x^i)\left(1 - h_\theta(x^i)\right) x^i_j \tag{2}$$

On the other hand, formula 1, although it has a similar-looking form, is deduced via a different approach. It is based on maximum likelihood (or, equivalently, minimum negative log-likelihood): you multiply the output probability function over all the samples and then take its negative logarithm, as given below:

$$L(\theta) = -\log \prod_{i=1}^{m} P(y^i \mid x^i; \theta) = -\sum_{i=1}^{m} \log P(y^i \mid x^i; \theta)$$
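The identity $g'(z) = g(z)\left(1 - g(z)\right)$ is easy to confirm numerically (an illustrative sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4.0, 4.0, 9)
g = sigmoid(z)
# central finite differences vs. the analytic identity g' = g(1 - g)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(np.allclose(numeric, g * (1 - g), atol=1e-8))  # True
```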

In a logistic regression problem, where the outputs are 0 and 1, each additive term becomes

$$ -\log P(y^i \mid x^i; \theta) = -\left(y^i \log h_\theta(x^i) + (1 - y^i)\log\left(1 - h_\theta(x^i)\right)\right) $$

Formula 1 is the derivative of it (and of its sum) when $h_\theta(x) = \frac{1}{1 + e^{-\theta x}}$, as below:

$$\frac{\partial L^i}{\partial \theta_j} = \left(h_\theta(x^i) - y^i\right)x^i_j \tag{3}$$

The derivation details are well covered in other posts.
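The claim that the per-sample derivative of the negative log-likelihood is $(h_\theta(x) - y)\,x$ can be checked by finite differences (an illustrative sketch with made-up values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(theta, x, y):
    # -(y log h + (1-y) log(1-h)) for a single sample
    h = sigmoid(x @ theta)
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))

x = np.array([1.0, -2.0, 0.5])    # one sample, illustrative values
y = 1.0
theta = np.array([0.3, 0.1, -0.7])

analytic = (sigmoid(x @ theta) - y) * x   # formula 1: (h - y) x
eps = 1e-6
numeric = np.array([(nll(theta + eps * np.eye(3)[j], x, y)
                     - nll(theta - eps * np.eye(3)[j], x, y)) / (2 * eps)
                    for j in range(3)])
print(np.allclose(numeric, analytic, atol=1e-6))  # True
```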

You can compare it with equations (1) and (2). Yes, equation (3) does not have the factor $g'(\theta x)$ that appears in equation (1), since that factor is not part of its deduction process at all. Equation (2) has the additional factor $h_\theta(x^i)\left(1 - h_\theta(x^i)\right)$ compared to equation (3). Since $h_\theta(x)$, as a probability, lies in the range $(0, 1)$, you have $0 < h_\theta(x)\left(1 - h_\theta(x)\right) \le \frac{1}{4}$, which means equation (2) gives you a gradient of smaller absolute value, and hence a slower convergence speed in gradient descent than equation (3).
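The magnitude argument can be demonstrated numerically: on random data, every per-sample gradient from equation (2) is at most a quarter of the corresponding one from equation (3) (an illustrative sketch; data and names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# made-up data purely to illustrate the magnitude comparison
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
theta = rng.normal(size=4)
y = (rng.random(100) < 0.5).astype(float)

h = sigmoid(X @ theta)
# per-sample gradient norms: equation (3) is (h-y)x, equation (2) adds h(1-h)
per_eq3 = np.abs(h - y) * np.linalg.norm(X, axis=1)
per_eq2 = np.abs(h - y) * h * (1 - h) * np.linalg.norm(X, axis=1)

print(np.all(h * (1 - h) <= 0.25))        # True: the extra factor never exceeds 1/4
print(np.all(per_eq2 <= 0.25 * per_eq3))  # True: each sample's gradient shrinks
```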

Note the sum of squared errors (SSE) is essentially a special case of maximum likelihood, when we consider the prediction $h_\theta(x)$ to be the mean of a conditional normal distribution. That is, $P(y \mid x; \theta) = \mathcal{N}\left(y;\, h_\theta(x),\, \sigma^2\right)$. If you want more in-depth knowledge in this area, I would suggest the Deep Learning book by Ian Goodfellow et al.
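Concretely, under that Gaussian assumption the negative log-likelihood of one sample is

$$-\log P(y \mid x; \theta) = \frac{\left(y - h_\theta(x)\right)^2}{2\sigma^2} + \frac{1}{2}\log\left(2\pi\sigma^2\right),$$

so, up to the additive constant and the scale $\sigma^2$, minimizing the summed negative log-likelihood is exactly minimizing $\frac{1}{2}\sum_{i}\left(y^i - h_\theta(x^i)\right)^2$, the (half) SSE.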

Answer from Xiao-Feng Li on Stack Exchange
Discussions

REALLY breaking down logistic regression gradient descent (r/learnmachinelearning, October 13, 2020)
Gradient descent for logistic regression partial derivative doubt (Cross Validated, February 13, 2017): Binary cross-entropy is a much better loss function to use with logistic regression.
Gradient descent vs Gradient ascent in Logistic Regression (r/learnmachinelearning, May 1, 2020): Gradient descent isn't a specific function; it's a method that can be applied to any cost function.
Is the code for linear and logistic regression always the same? (r/learnmachinelearning, September 4, 2023)
Short answer is "no"; longer answer is "noooooo." It depends what you are doing. I'm not primarily an ML person; my background is more traditional stats and econometrics. Linear regression (OLS) has a closed-form solution, but you could also use a cost function and gradient descent to get there; in other words, you can solve linear regression through maximum likelihood estimation (MLE). You could also invert a matrix and get the exact solution. For medium-to-large data you would approximate the matrix inverse using a pseudo-inverse technique. This assumes your problem is specified properly. My hunch is that the ML techniques will still run (the answer might not make sense, but the routine won't crash) when you can't invert the matrix because the variables you gave it include duplicates or linear combinations, etc. Logistic regression is an MLE problem: ML folks usually use some form of gradient descent, while traditional stats packages use a specific method like Newton-Raphson. In my experience Newton-Raphson is much faster for problems with no more than a couple dozen independent variables (features); gradient descent does well if you have hundreds of variables. More generally, N-R is faster, but has some constraints that G-D avoids (such as computing second derivatives). More discussion here: https://stats.stackexchange.com/questions/253632/why-is-newtons-method-not-widely-used-in-machine-learning For MLE-type problems there is a lot of different code, but it is all people trying to do the same thing either faster, or with more uncertainty quantification, or better managing edge cases.


Answer 2 of 3

So I think you are mixing up minimizing your loss function versus maximizing your log-likelihood; but also, since both are equivalent, the equations you have written are actually the same. Let's assume there is only one sample, to keep things simple.

The first equation shows the loss-minimization update:

$$\theta \leftarrow \theta - \alpha \nabla_\theta J(\theta)$$

Here, the gradient of the loss is given by:

$$\nabla_\theta J(\theta) = \left(h_\theta(x) - y\right)x$$

However, the third equation you have written:

$$\nabla_\theta \ell(\theta) = \left(y - h_\theta(x)\right)x$$

is not the gradient with respect to the loss, but the gradient with respect to the log-likelihood!

This is why you have a discrepancy in your signs.

Now look back at your first equation and substitute the gradient with its negative sign: you will then be maximizing the log-likelihood, which is the same as minimizing the loss. Therefore they are all equivalent. Hope that helped!
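This equivalence can be seen numerically: gradient descent on the negative log-likelihood and gradient ascent on the log-likelihood produce identical parameter trajectories (an illustrative sketch with made-up data):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
lr, steps = 0.01, 200

# gradient DESCENT on the negative log-likelihood (the loss)
theta_d = np.zeros(3)
for _ in range(steps):
    grad_loss = X.T @ (sigmoid(X @ theta_d) - y)   # gradient of the NLL
    theta_d -= lr * grad_loss

# gradient ASCENT on the log-likelihood
theta_a = np.zeros(3)
for _ in range(steps):
    grad_ll = X.T @ (y - sigmoid(X @ theta_a))     # = -(gradient of the NLL)
    theta_a += lr * grad_ll

print(np.allclose(theta_d, theta_a))  # True: identical trajectories
```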
