My answer to my own question: yes, it can be shown that the gradient of the logistic loss is equal to the difference between the true values and the predicted probabilities. A brief explanation was found here.
First, the logistic loss is just the negative log-likelihood, so we can start with the expression for the log-likelihood (p. 74; note this expression is the log-likelihood itself, not the negative log-likelihood):

$$\ell = y \log(p) + (1 - y)\log(1 - p)$$

$p$ is the logistic function:

$$p = \frac{1}{1 + e^{-x}},$$ where

$x$ is the predicted value before the logistic transformation (i.e., the log-odds):

$$x = \log\frac{p}{1 - p}$$

First derivative with respect to $x$, obtained using Wolfram Alpha:

$$\frac{\partial \ell}{\partial x} = y - 1 + \frac{1}{e^{x} + 1}$$

After multiplying by $\frac{e^{-x}}{e^{-x}}$:

$$\frac{\partial \ell}{\partial x} = y - 1 + \frac{e^{-x}}{1 + e^{-x}} = y - \frac{1}{1 + e^{-x}} = y - p$$

After changing sign we have the expression for the gradient of the logistic loss function:

$$\frac{\partial(-\ell)}{\partial x} = p - y$$
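As a quick sanity check (my own addition, not part of the original answer), the small numpy sketch below compares the analytic gradient p - y with a central finite-difference approximation of the negative log-likelihood with respect to the log-odds x; the function name neg_log_likelihood and the sample values are arbitrary:

import numpy as np

def neg_log_likelihood(x, y):
    # logistic loss (negative log-likelihood) as a function of the log-odds x
    p = 1.0 / (1.0 + np.exp(-x))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

x, y, eps = 0.7, 1.0, 1e-6
p = 1.0 / (1.0 + np.exp(-x))
analytic = p - y                                                            # gradient derived above
numeric = (neg_log_likelihood(x + eps, y) - neg_log_likelihood(x - eps, y)) / (2 * eps)
print(analytic, numeric)   # the two values should agree closely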
AdamO is correct: if you just want the gradient of the logistic loss with respect to the predicted probability (what the OP asked for in the title), then it carries a factor of 1/(p(1-p)). Unfortunately, people in the DL community often assume the logistic loss is always bundled with a sigmoid, pack the two gradients together, and call that the logistic loss gradient (the internet is full of posts asserting this). Since the gradient of the sigmoid happens to be p(1-p), it cancels the 1/(p(1-p)) in the logistic loss gradient. But if you are implementing SGD (walking back through the layers) and applying the sigmoid gradient when you get to the sigmoid, then you need to start from the actual logistic loss gradient, which has the 1/(p(1-p)) factor.
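To make the distinction concrete, here is a small numpy sketch of my own (not from the answer above): it evaluates the gradient with respect to the probability p, which carries the 1/(p(1-p)) factor, then multiplies by the sigmoid derivative p(1-p) and recovers p - y, the fused gradient with respect to the logit. The values of y and x are arbitrary:

import numpy as np

y, x = 1.0, 0.3                       # label and pre-sigmoid logit
p = 1.0 / (1.0 + np.exp(-x))          # sigmoid output

dL_dp = (p - y) / (p * (1 - p))       # gradient of the logistic loss w.r.t. p
dp_dx = p * (1 - p)                   # gradient of the sigmoid
dL_dx = dL_dp * dp_dx                 # chain rule through the sigmoid
print(dL_dx, p - y)                   # both print the same number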
Find negative log-likelihood cost for logistic regression in python and gradient loss with respect to w, b - Stack Overflow
There are some small mistakes: you should use -np.sum(Y*np.log(A) + (1-Y)*np.log(1-A)) / m in place of .mean(), and the other mistake, I think, is that np.subtract(A, Y) should be replaced with simply A - Y, because there is no need for a numpy call there. It's working for me.
import numpy as np

def sigmoid(z):
    # standard logistic function
    return 1 / (1 + np.exp(-z))

def propagate(w, b, X, Y):
    """
    Implement the cost function and its gradient for the propagation explained above

    Arguments:
    w -- weights, a numpy array of size (num_px * num_px * 3, 1)
    b -- bias, a scalar
    X -- data of size (num_px * num_px * 3, number of examples)
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat) of size (1, number of examples)

    Return:
    cost -- negative log-likelihood cost for logistic regression
    dw -- gradient of the loss with respect to w, thus same shape as w
    db -- gradient of the loss with respect to b, thus same shape as b

    Tips:
    - Write your code step by step for the propagation. np.log(), np.dot()
    """
    m = X.shape[1]

    # FORWARD PROPAGATION (FROM X TO COST)
    ### START CODE HERE ### (≈ 2 lines of code)
    A = sigmoid(np.dot(w.T, X) + b)                               # compute activation
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m  # compute cost
    ### END CODE HERE ###

    # BACKWARD PROPAGATION (TO FIND GRAD)
    ### START CODE HERE ### (≈ 2 lines of code)
    dw = np.dot(X, (A - Y).T) / m      # gradient of the cost with respect to w
    db = np.sum(A - Y, axis=1) / m     # gradient of the cost with respect to b
    ### END CODE HERE ###

    assert(dw.shape == w.shape)
    assert(db.dtype == float)
    cost = np.squeeze(cost)
    assert(cost.shape == ())

    grads = {"dw": dw,
             "db": db}

    return grads, cost
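For reference, here is a tiny usage example (my own, with made-up numbers, not part of the assignment) that calls propagate and cross-checks db against a central finite difference on b:

w = np.array([[1.0], [2.0]])
b = 2.0
X = np.array([[1.0, 2.0, -1.0], [3.0, 4.0, -3.2]])
Y = np.array([[1, 0, 1]])

grads, cost = propagate(w, b, X, Y)
print(cost)            # scalar negative log-likelihood
print(grads["dw"])     # shape (2, 1), same as w
print(grads["db"])     # gradient for b

# optional finite-difference check on b
eps = 1e-6
_, cost_plus = propagate(w, b + eps, X, Y)
_, cost_minus = propagate(w, b - eps, X, Y)
print((cost_plus - cost_minus) / (2 * eps))   # should be close to grads["db"]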
dw = np.dot(X, db.T) / m
is wrong.
Instead of db, X should be multiplied by the gradient propagated back through the derivative of the activation function, here the sigmoid:
A = sigmoid(k)
dA = dloss * A * (1 - A)   # chain rule through the sigmoid: its derivative is A * (1 - A)
dw = np.dot(X, dA.T) / m
The code is not tested, but the solution would be along this line. See here to calculate dloss.
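To illustrate that decomposition (again my own sketch, not from the answer), the step-by-step backward pass through the sigmoid gives the same dw as the fused A - Y shortcut used in propagate above, reusing w, b, X, Y and sigmoid from the earlier example:

m = X.shape[1]
A = sigmoid(np.dot(w.T, X) + b)               # forward pass
dloss = -(Y / A - (1 - Y) / (1 - A)) / m      # gradient of the averaged cost w.r.t. A
dZ = dloss * A * (1 - A)                      # back through the sigmoid
dw_stepwise = np.dot(X, dZ.T)
dw_shortcut = np.dot(X, (A - Y).T) / m        # the fused form used in propagate
print(np.allclose(dw_stepwise, dw_shortcut))  # True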


