My answer to my question: yes, it can be shown that the gradient of the logistic loss equals the difference between the true values and the predicted probabilities. A brief explanation was found here.

First, logistic loss is just the negative log-likelihood, so we can start with the expression for the log-likelihood (p. 74; this expression is the log-likelihood itself, not the negative log-likelihood):

$$\ell = y \log(p) + (1 - y) \log(1 - p)$$

Here $p$ is the logistic function: $p = \frac{1}{1 + e^{-\hat{y}}}$, where $\hat{y}$ is the predicted value before the logistic transformation (i.e., the log-odds).

The first derivative, obtained using Wolfram Alpha:

$$\frac{\partial \ell}{\partial p} = \frac{y}{p} - \frac{1 - y}{1 - p}$$

After multiplying by $\frac{\partial p}{\partial \hat{y}} = p(1 - p)$:

$$\frac{\partial \ell}{\partial \hat{y}} = y - p$$

After changing the sign we have the expression for the gradient of the logistic loss function:

$$\frac{\partial L}{\partial \hat{y}} = p - y$$

Answer from Ogurtsov on Stack Exchange
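A quick numerical sanity check of this result (a sketch added for illustration, not part of the original answer; `logistic` and `logistic_loss` are names I chose): compare the analytic gradient $p - y$ with a finite-difference derivative of the loss with respect to the log-odds.

```python
import math

def logistic(f):
    # logistic (sigmoid) transformation of the log-odds f
    return 1.0 / (1.0 + math.exp(-f))

def logistic_loss(f, y):
    # negative log-likelihood of a single observation
    p = logistic(f)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

f, y, eps = 0.7, 1.0, 1e-6
p = logistic(f)
analytic = p - y  # the gradient derived above
numeric = (logistic_loss(f + eps, y) - logistic_loss(f - eps, y)) / (2 * eps)
assert abs(analytic - numeric) < 1e-6
```

The same check passes for y = 0 and for any finite log-odds, since the derivation makes no assumption about the sign of $\hat{y}$.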
Medium
medium.com › @ilmunabid › beginners-guide-to-finding-gradient-derivative-of-log-loss-by-hand-detailed-steps-74a6cacfe5cf
Beginner’s Guide to Finding Gradient/Derivative of Log Loss by Hand (Detailed Steps) | by Abid Ilmun Fisabil | Medium
August 17, 2022 - For a quick reference to logistic regression: the cost function is used to evaluate our prediction, and the prediction (using a linear equation) is transformed into a probability using the sigmoid function before it can be used inside the cost function. We calculate the gradient of the cost function to know which direction our loss is moving, up or down.
Discussions

Confused in the gradient descent of the logistic log loss function
Let's keep the derivation part aside, it is too complicated for now. Why is y subtracted? In the previous lecture (simplified form), no matter what class we use, the y term was supposed to be multiplied with the ln part. … More on community.deeplearning.ai
January 11, 2023
numpy - How is the gradient and hessian of logarithmic loss computed in the custom objective function example script in xgboost's github repository? - Stack Overflow
The log loss function is the sum of $-y_i \log(p_i) - (1 - y_i) \log(1 - p_i)$, where $p_i = \frac{1}{1 + e^{-\hat{y}_i}}$. The gradient (with respect to $p$) is then $\frac{p - y}{p(1 - p)}$, however in the code it's $p - y$. More on stackoverflow.com
September 18, 2016
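The apparent mismatch in that question dissolves with the chain rule: XGBoost differentiates the loss with respect to the raw score, not the probability, so the $p(1-p)$ factor from the sigmoid cancels the denominator. A minimal sketch (variable names are mine, not from the xgboost script):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x, y, eps = -0.4, 1.0, 1e-6   # raw score, label
p = sigmoid(x)
grad_wrt_p = (p - y) / (p * (1 - p))   # gradient of log loss w.r.t. the probability p
dp_dx = p * (1 - p)                    # derivative of the sigmoid
grad_wrt_x = grad_wrt_p * dp_dx        # chain rule: collapses to p - y
assert abs(grad_wrt_x - (p - y)) < 1e-12
```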
sgd - Gradient for log regression loss - Stack Overflow
More on stackoverflow.com
reinforcement learning - Can we simply remove the log term for loss in policy gradient methods? - Artificial Intelligence Stack Exchange
2 Is the negative of the policy loss function in a simple policy gradient algorithm an estimator of expected returns? 2 What happens with policy gradient methods if rewards are differentiable? 3 What specifically is the gradient of the log of the probability in policy gradient methods? More on ai.stackexchange.com
TTIC
home.ttic.edu › ~suriya › website-intromlss2018 › course_material › Day3b.pdf pdf
On Logistic Regression: Gradients of the Log Loss, Multi- ...
June 20, 2018 - The probability of off is $p(0 \mid x, w) = 1 - \sigma(w \cdot x) = \sigma(-w \cdot x)$. Today's focus: 1. Optimizing the log loss by gradient descent; 2. Multi-class classification to handle more than two classes; 3. More on optimization: Newton's method, stochastic gradient descent.
Medium
medium.com › @sumbatilinda › deep-learning-part-2-loss-function-and-gradient-function-2f64c566a1d6
Deep Learning(Part 2). Loss Function and Gradient Function | by Sumbatilinda | Medium
April 9, 2024 - I hope that was a good explanation for loss functions: ... A gradient is nothing but a derivative that defines the effects on outputs of the function with a little bit of variation in inputs.
Medium
medium.com › analytics-vidhya › derivative-of-log-loss-function-for-logistic-regression-9b832f025c2d
Derivative of Log-Loss function for Logistic Regression
February 8, 2024 - In order to preserve the convex nature of the loss function, a log loss error function has been designed for logistic regression. The cost function is split for two cases, y=1 and y=0. For the case when we have y=1, we can observe that when the hypothesis function tends to 1 the error is minimized to zero, and when it tends to 0 the error is maximum. This exactly follows the criterion we wanted ... In order to optimize this convex function, we can go with either gradient descent or Newton's method.
Baeldung
baeldung.com › home › core concepts › math and logic › gradient descent equation in logistic regression
Gradient Descent Equation in Logistic Regression | Baeldung on Computer Science
February 13, 2025 - When dealing with a binary classification problem, the logarithmic cost of error depends on the value of $y$. We can define the cost for the two cases separately: $-\log(h_\theta(x))$ for $y = 1$ and $-\log(1 - h_\theta(x))$ for $y = 0$. Because when the actual outcome $y = 1$, the cost is $0$ for $h_\theta(x) = 1$ and takes its maximum value for $h_\theta(x) = 0$. Similarly, if $y = 0$, the cost is $0$ for $h_\theta(x) = 0$. As the output can either be $0$ or $1$, we can simplify the equation to $-y \log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x))$. Gradient descent is an iterative optimization algorithm, which finds the minimum of a differentiable function.
ML Glossary
ml-cheatsheet.readthedocs.io › en › latest › logistic_regression.html
Logistic Regression — ML Glossary documentation
def update_weights(features, labels, weights, lr):
    '''
    Vectorized Gradient Descent
    Features: (200, 3)
    Labels:   (200, 1)
    Weights:  (3, 1)
    '''
    N = len(features)
    # 1 - Get predictions
    predictions = predict(features, weights)
    # 2 - Transpose features from (200, 3) to (3, 200)
    # so we can multiply with the (200, 1) cost matrix.
    # Returns a (3, 1) matrix holding 3 partial derivatives --
    # one for each feature -- representing the aggregate
    # slope of the cost function across all observations
    gradient = np.dot(features.T, predictions - labels)
    # 3 - Take the average cost derivative for each feature
    gradient /= N
    # 4 - Multiply the gradient by our learning rate
    gradient *= lr
    # 5 - Subtract from our weights to minimize cost
    weights -= gradient
    return weights
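A hedged usage sketch for the ML Glossary function above; `predict` is not shown in the snippet, so the sigmoid-of-linear-combination definition below is my assumption (it matches the page's stated shapes), and the toy data is invented:

```python
import numpy as np

def predict(features, weights):
    # assumed: sigmoid of the linear combination (not shown in the snippet)
    return 1 / (1 + np.exp(-np.dot(features, weights)))

def update_weights(features, labels, weights, lr):
    # same update as the snippet above, written without in-place mutation
    N = len(features)
    predictions = predict(features, weights)
    gradient = np.dot(features.T, predictions - labels) / N
    return weights - lr * gradient

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, :1] > 0).astype(float)   # toy labels driven by feature 0, shape (200, 1)
w = np.zeros((3, 1))
for _ in range(100):
    w = update_weights(X, y, w, lr=0.5)
assert w.shape == (3, 1)
assert w[0, 0] > 0  # the weight on the informative feature grows positive
```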
High on Science
highonscience.com › blog › 2021 › 06 › 18 › ml-loss-function-cheat-sheet
Machine Learning Likelihood, Loss, Gradient, and Hessian Cheat Sheet - High on Science
June 18, 2021 - This is called the risk set, because they are the users at risk of canceling at the time user $i$ canceled. The risk set includes user $i$. In clinical studies, users are subjects and churn is non-survival, i.e. death. Loss \[\begin{equation} \ell_i = \delta_i \left[ - f_i + \log{\sum_{j:t_j \geq t_i} \exp{f_j}} \right] \end{equation}\] ... The efficient algorithm to compute the gradient and hessian involves ordering the $n$ survival data points, which are index by $i$, by time $t_i$. This turns $n^2$ time complexity into $n\log{n}$ for the sort followed by $n$ for the progressive total-loss compute (ref).
Buffalo
cedar.buffalo.edu › ~srihari › CSE676 › 18.1 Log-likelihood Gradient.pdf pdf
Deep Learning Srihari 1 The Log-likelihood Gradient Sargur N. Srihari
Determine parameters $\theta$ that maximize the log-likelihood (the negative loss): $\max_\theta L(\{x^{(1)}, \ldots, x^{(M)}\}; \theta) = \sum_m \log p(x^{(m)}; \theta)$. The partition function is intractable. For stochastic gradient ascent, take derivatives of the positive phase and the negative phase of the probability distribution of the undirected model (Gibbs), e.g. $\frac{\partial}{\partial W_{i,j}} E(v, h) = -v_i h_j$.
Tomasbeuzen
tomasbeuzen.com › deep-learning-with-pytorch › chapters › appendixB_logistic-loss.html
Appendix B: Logistic Loss — Deep Learning with PyTorch
y_hat = np.arange(0.01, 1.00, 0.01)
log_loss = pd.DataFrame({
    "y_hat": y_hat,
    "y=0": -np.log(1 - y_hat),
    "y=1": -np.log(y_hat),
}).melt(id_vars="y_hat", var_name="y", value_name="loss")
fig = px.line(log_loss, x="y_hat", y="loss", color="y")
fig.update_layout(width=500, height=400)

In Chapter 1 we used the gradient of the log loss to implement gradient descent.
CliffsNotes
cliffsnotes.com › home › computer science
Deriving Gradient in Logistic Regression & Sigmoid Function - CliffsNotes
November 22, 2024 - The derivative of $\sigma(z)$ with respect to $z$ is $\sigma(z)(1 - \sigma(z))$. This is an important result that will simplify the gradient calculation. Step 4: Gradient of the Negative Log-Likelihood. To minimize the negative log-likelihood and find the optimal weight vector $w$, we need to compute the gradient of the NLL with respect to each component of the weight vector.
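The quoted derivative identity is easy to verify numerically; this sketch (mine, not from the CliffsNotes page) checks $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ by central differences.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z, eps = 1.3, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid(z) * (1 - sigmoid(z))  # sigma'(z) = sigma(z)(1 - sigma(z))
assert abs(numeric - analytic) < 1e-9
```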
Transactions on Machine Learning Research
jmlr.csail.mit.edu › papers › volume22 › 20-1372 › 20-1372.pdf pdf
When Does Gradient Descent with Logistic Loss Find ...
… separation conditions, then the loss after a single step of gradient descent decreases by an amount that is exponential in $p^{1/2-\beta}$ with high probability. This result only requires the width $p$ to be poly-logarithmic in the number of samples, the input dimension, and $1/\delta$.
Google
developers.google.com › machine learning › logistic regression: loss and regularization
Logistic regression: Loss and regularization | Machine Learning | Google for Developers
Consequently, most logistic regression models use one of the following two strategies to decrease model complexity: L2 regularization, or early stopping (limiting the number of training steps to halt training while loss is still decreasing). Note: You'll learn more about regularization in the Datasets, Generalization, and Overfitting module of the course. Key terms: gradient descent.
Analytics Vidhya
analyticsvidhya.com › home › log loss vs. mean squared error: choosing the right metric for classification
Log Loss vs. Mean Squared Error: Choosing the Right Metric for Classification
May 1, 2025 - When we try to optimize values using gradient descent, it will create complications in finding the global minimum. Another reason is that in classification problems we have target values like 0/1, so (Ŷ − Y)² will always be between 0 and 1, which can make it very difficult to keep track of the errors, and it is difficult to store high-precision floating-point numbers. The cost function used in Logistic Regression is Log Loss...
Colorado
cmci.colorado.edu › classes › INFO-4604 › fa17 › files › slides-5_logistic.pdf pdf
Logistic Regression INFO-4604, Applied Machine Learning
We want to minimize (that's why it's called "loss"), but we want to maximize probability! So let's minimize the negative log-likelihood: $$L(w) = -\sum_{i=1}^{N} \log P(y_i \mid x_i) = -\sum_{i=1}^{N} \left[ y_i \log(\phi(w^T x_i)) + (1 - y_i) \log(1 - \phi(w^T x_i)) \right]$$ Learning: we can use gradient descent to minimize the negative log-likelihood $L(w)$. The partial derivative of $L$ with respect to $w_j$ is $$\frac{\partial L}{\partial w_j} = -\sum_{i=1}^{N} x_{ij}\,(y_i - \phi(w^T x_i))$$
Top answer

It's not advisable to remove the log term simply because it's monotone. Intuitively, as a score function, the log term in the policy gradient theorem (PGT) is not arbitrary but a necessary step to ensure proper scaling and direction of the gradients, especially for actions with small probabilities. A simple 1-d parameter-space example shows this.

Let's say we have two possible actions $a_1, a_2$ in a simple bandit problem, with the parameterized policy for both actions defined as $\pi_{\theta}(a_1)=\sigma(\theta), \pi_{\theta}(a_2)=1-\sigma(\theta)$, where $\sigma$ is the standard sigmoid function. Also suppose the rewards for the actions are $r(a_1)=1, r(a_2)=0$; since it's a bandit without state, the return $G_t$ is simply $r(a)$ for each action $a$. Then it's straightforward to compute the gradient as follows: $$\nabla_{\theta} J(\theta)=\mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a)G_t]=\pi_{\theta}(a_1)(1-\sigma(\theta))r(a_1)+\pi_{\theta}(a_2)(-\sigma(\theta))r(a_2)=\sigma(\theta)(1-\sigma(\theta))$$ Similarly, you can get the gradient for the case of removing the log: $$\nabla_{\theta} J'(\theta)=\sigma(\theta)^2(1-\sigma(\theta))$$

Therefore, clearly, even in the 1-d bandit case the gradient without the log becomes extremely small when $\sigma(\theta)$, i.e., $\pi_{\theta}(a_1)$, is very small. This means that actions that are initially very unlikely under the to-be-optimized policy cannot be updated efficiently, consistent with the intuition above, because the proper scaling effect of the log is lost. In the usual multi-dimensional policies for MDP cases, both the magnitude and the direction of the gradient are similarly negatively impacted.
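The two closed-form gradients above can be sanity-checked numerically; in this sketch (mine, not part of the original answer) $J(\theta)$ is the expected return of the bandit, and a finite difference confirms the with-log expression while showing how much smaller the log-free one is when $\pi_\theta(a_1)$ is small.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def J(theta):
    # expected return: pi(a1) * r(a1) + pi(a2) * r(a2) with rewards (1, 0)
    return sigmoid(theta) * 1.0 + (1.0 - sigmoid(theta)) * 0.0

theta, eps = -3.0, 1e-6          # pi(a1) = sigmoid(-3) is small, about 0.047
numeric = (J(theta + eps) - J(theta - eps)) / (2 * eps)
with_log = sigmoid(theta) * (1 - sigmoid(theta))          # PGT gradient
without_log = sigmoid(theta) ** 2 * (1 - sigmoid(theta))  # log term removed
assert abs(numeric - with_log) < 1e-6
assert without_log < 0.1 * with_log  # over 10x smaller for the unlikely action
```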

Finally, with the common log-gradient trick, the existing form in the PGT can be transformed into a quotient whose numerator is just the gradient of the policy without the log and whose denominator is the policy itself. From the exploitation-exploration balance point of view, this denominator is required for exploration: if the probability of taking a certain action in a state is small, the algorithm updates the parameters so that the probability of taking that action increases. The other term, $G_t \approx Q_t(s_t,a_t)$, reflects exploitation: if an action's value is large, the algorithm updates the parameters so that the probability of taking that action is enhanced. Therefore, if you remove the denominator policy, you are equivalently not exploring enough.