My answer to my own question: yes, it can be shown that the gradient of the logistic loss is equal to the difference between the predicted probabilities and the true values. A brief explanation was found here.

First, the logistic loss is just the negative log-likelihood, so we can start with the expression for the log-likelihood of a single observation (p. 74 — this expression is the log-likelihood itself, not the negative log-likelihood):

$$L(y, p) = y \log p + (1 - y) \log(1 - p)$$

Here $p = \sigma(z)$ is the logistic function:

$$\sigma(z) = \frac{1}{1 + e^{-z}},$$ where

$z$ is the predicted value before the logistic transformation (i.e., the log-odds):

$$z = \log\frac{p}{1 - p}$$

First derivative with respect to $p$, obtained using Wolfram Alpha:

$$\frac{\partial L}{\partial p} = \frac{y}{p} - \frac{1 - y}{1 - p} = \frac{y - p}{p(1 - p)}$$

After multiplying by $\frac{\partial p}{\partial z} = p(1 - p)$ (the chain rule, since $p$ is a function of $z$):

$$\frac{\partial L}{\partial z} = y - p$$

After changing the sign we have the expression for the gradient of the logistic loss function:

$$\frac{\partial (-L)}{\partial z} = p - y$$
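The result above can be sanity-checked numerically with a finite-difference approximation. A minimal Python sketch (the function names are mine, not from any library):

```python
import math

def sigmoid(z):
    # logistic function: p = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + math.exp(-z))

def logistic_loss(y, z):
    # negative log-likelihood of a single observation,
    # written in terms of the log-odds z
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def grad_logistic_loss(y, z):
    # closed form derived above: d(-L)/dz = p - y
    return sigmoid(z) - y

# finite-difference check of the closed form
eps = 1e-6
for y in (0.0, 1.0):
    for z in (-2.0, 0.5, 3.0):
        numeric = (logistic_loss(y, z + eps) - logistic_loss(y, z - eps)) / (2 * eps)
        assert abs(numeric - grad_logistic_loss(y, z)) < 1e-5
```

Note that the derivative is taken with respect to the log-odds $z$, not the probability $p$; that distinction is exactly what the other answer below is about.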
AdamO is correct: if you just want the gradient of the logistic loss (what the OP asked for in the title), then it needs a $\frac{1}{p(1-p)}$ factor. Unfortunately, people in the DL community often assume the logistic loss is always bundled with a sigmoid, pack the two gradients together, and call the result the logistic loss gradient (the internet is filled with posts asserting this). Since the gradient of the sigmoid happens to be $p(1-p)$, it cancels the $\frac{1}{p(1-p)}$ in the logistic loss gradient. But if you are implementing SGD (walking back through the layers) and applying the sigmoid gradient when you reach the sigmoid, then you need to start with the actual logistic loss gradient, which does have the $\frac{1}{p(1-p)}$.
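The cancellation described here can be checked numerically. A minimal sketch (function names are my own): the gradient with respect to $p$ carries the $\frac{1}{p(1-p)}$ factor, and multiplying by the sigmoid's derivative $p(1-p)$ recovers the familiar $p - y$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dloss_dp(y, p):
    # gradient of the logistic loss w.r.t. the probability p itself:
    # -(y/p - (1-y)/(1-p)) = (p - y) / (p * (1 - p))
    return (p - y) / (p * (1 - p))

def dsigmoid_dz(z):
    # derivative of the sigmoid: p * (1 - p)
    p = sigmoid(z)
    return p * (1 - p)

y, z = 1.0, 0.7
p = sigmoid(z)
# chain rule: dL/dz = dL/dp * dp/dz; the p(1-p) factors cancel
chained = dloss_dp(y, p) * dsigmoid_dz(z)
assert abs(chained - (p - y)) < 1e-12
```

In a layer-by-layer backprop implementation, `dloss_dp` is what the loss node should emit, and `dsigmoid_dz` is applied separately at the sigmoid node; only their product equals $p - y$.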


