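Both formulas are correct. Here is how you can get formula 2, by minimizing the sum of squared errors:
$$ \sum_{i=1}^{m} (h_\theta(x^i) - y^i)^2 $$
Dividing it by $m$ gives you the MSE, and dividing it by $2$ gives you the SSE form used in formula 2. Since you are going to minimize this expression by taking partial derivatives, choosing $2$ makes the derivation look nice, because it cancels the exponent.
Now use a model where $h_\theta(x) = \sigma(z)$ and $z = \theta x$ is linear in $\theta$ (I omit the transpose symbol for $\theta$ in $\theta x$), and you get
$$ J(\theta) = \frac{1}{2} \sum_{i=1}^{m} (\sigma(\theta x^i) - y^i)^2 $$
When you compute the partial derivative over $\theta_j$ of each additive term, with $z^i = \theta x^i$, you have
$$ \frac{\partial}{\partial \theta_j} \frac{1}{2} (\sigma(\theta x^i) - y^i)^2 = (h_\theta(x^i) - y^i)\, \sigma'(z^i)\, x_j^i \tag{1} $$
This is the formula 2 you gave. I don't give the detailed steps, but it is quite straightforward.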
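Yes, $\sigma$ is the activation function, and you do have the factor $\sigma'(z)$ in the derivative expression as shown above. It disappears if it equals 1, i.e., $\sigma(z) = z + b$, where $b$ is constant w.r.t. $z$.
For example, if $\sigma(z) = z$ (i.e., $b = 0$), so that the prediction model is linear with $h_\theta(x) = \theta x$, then you have $\sigma'(z) = 1$ and the derivative of each term reduces to $(h_\theta(x^i) - y^i)\, x_j^i$.
For another example, if $\sigma(z) = \frac{1}{1 + e^{-z}}$, so that the prediction model is a sigmoid with $h_\theta(x) = \frac{1}{1 + e^{-\theta x}}$, then $\sigma'(z) = \sigma(z)(1 - \sigma(z)) = h_\theta(x)(1 - h_\theta(x))$. This is why, in the book Artificial Intelligence: A Modern Approach, the derivative for logistic regression carries the factor $h_\theta(x)(1 - h_\theta(x))$:
$$ (h_\theta(x^i) - y^i)\, h_\theta(x^i)(1 - h_\theta(x^i))\, x_j^i \tag{2} $$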
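As a sanity check on the squared-error gradient with a sigmoid model, here is a minimal sketch in plain Python that compares the analytic expression $(h - y)\,h(1 - h)\,x_j$ against a central finite difference. The function names, the sample values, and the step size `eps` are illustrative assumptions, not part of the original answer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    # h_theta(x) = sigma(theta . x)
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def sse_term(theta, x, y):
    # One additive term of the squared-error cost: 0.5 * (h - y)^2
    return 0.5 * (predict(theta, x) - y) ** 2

def sse_grad(theta, x, y):
    # Analytic gradient of that term: (h - y) * h * (1 - h) * x_j for every j
    h = predict(theta, x)
    return [(h - y) * h * (1.0 - h) * xj for xj in x]

def numeric_grad(f, theta, eps=1e-6):
    # Central finite differences, used only as a sanity check
    grad = []
    for j in range(len(theta)):
        up = theta[:j] + [theta[j] + eps] + theta[j + 1:]
        down = theta[:j] + [theta[j] - eps] + theta[j + 1:]
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad

# Made-up sample for illustration
theta = [0.3, -0.8, 0.5]
x = [1.0, 2.0, -1.5]
y = 1.0

analytic = sse_grad(theta, x, y)
numeric = numeric_grad(lambda t: sse_term(t, x, y), theta)
print(analytic)
print(numeric)
```

The two printed lists should agree to several decimal places, which is what you would expect if the chain-rule factor $h(1 - h)$ is in the right place.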
On the other hand, formula 1, although it looks similar, is derived via a different approach. It is based on maximum likelihood (or, equivalently, minimum negative log-likelihood): multiply the output probability function over all the samples and then take the negative logarithm, as given below,
$$ -\log \prod_{i=1}^{m} P(y^i|x^i;\theta) = -\sum_{i=1}^{m} \log P(y^i|x^i;\theta) $$
In a logistic regression problem, where the outputs are 0 and 1, each additive term becomes
$$ -\log P(y^i|x^i;\theta) = -(y^i\log{h_\theta(x^i)} + (1-y^i)\log(1-h_\theta(x^i))) $$
Formula 1 is the derivative of it (and of its sum) when $h_\theta(x) = \frac{1}{1 + e^{-\theta x}}$, as below,
$$ \frac{\partial}{\partial \theta_j}\left(-\log P(y^i|x^i;\theta)\right) = (h_\theta(x^i) - y^i)\, x_j^i \tag{3} $$
The derivation details are well given in another post.
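For completeness, here is a short sketch of that derivation for a single sample, writing $h = h_\theta(x^i)$ and $y = y^i$, and using the sigmoid identity $\frac{\partial h}{\partial \theta_j} = h(1 - h)\, x_j^i$. The shorthand $h$ and $y$ is introduced here only to keep the lines short.

$$
\begin{aligned}
\frac{\partial}{\partial \theta_j}\Big(-\big(y \log h + (1 - y)\log(1 - h)\big)\Big)
&= -\left(\frac{y}{h} - \frac{1 - y}{1 - h}\right)\frac{\partial h}{\partial \theta_j} \\
&= -\left(\frac{y}{h} - \frac{1 - y}{1 - h}\right) h(1 - h)\, x_j^i \\
&= -\big(y(1 - h) - (1 - y)h\big)\, x_j^i \\
&= (h - y)\, x_j^i
\end{aligned}
$$

The factor $h(1 - h)$ coming from the sigmoid cancels against the denominators coming from the logarithms, which is why it does not survive in the final expression.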
You can compare it with equations (1) and (2). Yes, equation (3) does not have the factor $\sigma'(z)$ that equation (1) has, since that factor is not part of its derivation at all. Equation (2) has the additional factor $h_\theta(x)(1 - h_\theta(x))$ compared to equation (3). Since $h_\theta(x)$, as a probability, is within the range (0, 1), you have $0 < h_\theta(x)(1 - h_\theta(x)) \le \frac{1}{4}$, which means equation (2) gives you a gradient of smaller absolute value, and hence typically slower convergence in gradient descent than equation (3).
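To make the size difference concrete, the sketch below evaluates both per-sample gradient factors (leaving out the shared $x_j$) for a few values of $z$ with a true label of 1. The helper names and the chosen $z$ values are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sse_factor(z, y):
    # Squared-error gradient without x_j: (h - y) * h * (1 - h)
    h = sigmoid(z)
    return (h - y) * h * (1.0 - h)

def nll_factor(z, y):
    # Negative log-likelihood gradient without x_j: (h - y)
    return sigmoid(z) - y

y = 1.0  # true label
rows = []
for z in [-6.0, -2.0, 0.0, 2.0, 6.0]:
    rows.append((z, sse_factor(z, y), nll_factor(z, y)))
    print(f"z={z:+.1f}  sse={rows[-1][1]:+.6f}  nll={rows[-1][2]:+.6f}")
```

At $z = -6$ the model is confidently wrong, yet the squared-error factor is close to zero while the log-likelihood factor is close to $-1$. That saturation is the practical reason the squared-error gradient tends to move more slowly.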
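Note that the sum of squared errors (SSE) is essentially a special case of maximum likelihood, in which the prediction $h_\theta(x)$ is taken to be the mean of a conditional normal distribution. That is,
$$ P(y|x;\theta) = \mathcal{N}(y;\ h_\theta(x),\ s^2) $$
where $s^2$ is a fixed variance.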
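A sketch of why this holds, assuming independent samples and a Gaussian with a fixed variance (written $s^2$ here, a symbol chosen for this illustration so that it does not clash with the activation $\sigma$):

$$
\begin{aligned}
-\log \prod_{i=1}^{m} \mathcal{N}(y^i;\ h_\theta(x^i),\ s^2)
&= -\sum_{i=1}^{m} \log\left( \frac{1}{\sqrt{2\pi s^2}} \exp\left( -\frac{(y^i - h_\theta(x^i))^2}{2 s^2} \right) \right) \\
&= \frac{1}{2 s^2} \sum_{i=1}^{m} (y^i - h_\theta(x^i))^2 + \frac{m}{2} \log(2\pi s^2)
\end{aligned}
$$

The second term does not depend on $\theta$, so minimizing this negative log-likelihood over $\theta$ is the same as minimizing the sum of squared errors.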
If you want more in-depth knowledge in this area, I would suggest the Deep Learning book by Ian Goodfellow et al.
So, I think you are mixing up minimizing your loss function with maximizing your log-likelihood. Since both are equivalent, the equations you have written are actually the same. Let's assume there is only one sample, to keep things simple.
The first equation shows the update for minimizing the loss:
$$ \theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j} $$
Here, the gradient of the loss is given by:
$$ \frac{\partial J(\theta)}{\partial \theta_j} = (h_\theta(x) - y)\, x_j $$
However, the third equation you have written:
$$ (y - h_\theta(x))\, x_j $$
is not the gradient of the loss, but the gradient of the log-likelihood!
This is why you have a discrepancy in your signs.
Now look back at your first equation and expand the gradient with the negative sign: you will then be maximizing the log-likelihood, which is the same as minimizing the loss. Therefore they are all equivalent. Hope that helped!
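The equivalence can be checked numerically. The sketch below takes one gradient descent step on the loss and one gradient ascent step on the log-likelihood for a single sample and compares the results. The function names, the learning rate `alpha`, and the sample values are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def descent_step(theta, x, y, alpha):
    # Minimize the loss: theta_j - alpha * (h - y) * x_j
    h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return [t - alpha * (h - y) * xj for t, xj in zip(theta, x)]

def ascent_step(theta, x, y, alpha):
    # Maximize the log-likelihood: theta_j + alpha * (y - h) * x_j
    h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return [t + alpha * (y - h) * xj for t, xj in zip(theta, x)]

# Made-up single sample
theta = [0.1, -0.4]
x = [1.0, 2.5]
y = 1.0

down = descent_step(theta, x, y, alpha=0.1)
up = ascent_step(theta, x, y, alpha=0.1)
print(down, up)
```

Both steps land on the same parameters, because flipping the sign of the gradient and the sign of the update cancel each other out.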
I recently wrote a blog post that breaks down the math behind maximum likelihood estimation for logistic regression. My friends found it helpful, so I decided to spread it around. If you've found the math hard to follow in other tutorials, hopefully mine will guide you through it step by step.
Here it is: https://statisticalmusings.netlify.app/post/logistic-regression-mle-a-full-breakdown/
If you can get a firm grasp of logistic regression, you'll be well set to understand deep learning!