logistic regression cost function derivation

How is the cost function from Logistic Regression differentiated

stats.stackexchange.com › questions › 278771 › how-is-the-cost-function-from-logistic-regression-differentiated

Adapted from the course notes, which (including this derivation) I don't see available anywhere other than the student-contributed notes on the page of Andrew Ng's Coursera Machine Learning course.

In what follows, the superscript $(i)$ denotes individual measurements or training "examples."

$$\begin{aligned}
\frac{\partial J(\theta)}{\partial \theta_j} &= \frac{\partial}{\partial \theta_j}\,\frac{-1}{m}\sum_{i=1}^m \left[ y^{(i)}\log\left(h_\theta\left(x^{(i)}\right)\right) + \left(1-y^{(i)}\right)\log\left(1-h_\theta\left(x^{(i)}\right)\right)\right] \\[2ex]
&\underset{\text{linearity}}{=} \frac{-1}{m}\sum_{i=1}^m \left[ y^{(i)}\frac{\partial}{\partial \theta_j}\log\left(h_\theta\left(x^{(i)}\right)\right) + \left(1-y^{(i)}\right)\frac{\partial}{\partial \theta_j}\log\left(1-h_\theta\left(x^{(i)}\right)\right) \right] \\[2ex]
&\underset{\text{chain rule}}{=} \frac{-1}{m}\sum_{i=1}^m \left[ y^{(i)}\frac{\frac{\partial}{\partial \theta_j}h_\theta\left(x^{(i)}\right)}{h_\theta\left(x^{(i)}\right)} + \left(1-y^{(i)}\right)\frac{\frac{\partial}{\partial \theta_j}\left(1-h_\theta\left(x^{(i)}\right)\right)}{1-h_\theta\left(x^{(i)}\right)} \right] \\[2ex]
&\underset{h_\theta(x)=\sigma\left(\theta^\top x\right)}{=} \frac{-1}{m}\sum_{i=1}^m \left[ y^{(i)}\frac{\frac{\partial}{\partial \theta_j}\sigma\left(\theta^\top x^{(i)}\right)}{h_\theta\left(x^{(i)}\right)} + \left(1-y^{(i)}\right)\frac{\frac{\partial}{\partial \theta_j}\left(1-\sigma\left(\theta^\top x^{(i)}\right)\right)}{1-h_\theta\left(x^{(i)}\right)} \right] \\[2ex]
&\underset{\sigma'}{=} \frac{-1}{m}\sum_{i=1}^m \left[ y^{(i)}\,\frac{\sigma\left(\theta^\top x^{(i)}\right)\left(1-\sigma\left(\theta^\top x^{(i)}\right)\right)\frac{\partial}{\partial \theta_j}\left(\theta^\top x^{(i)}\right)}{h_\theta\left(x^{(i)}\right)} - \left(1-y^{(i)}\right)\frac{\sigma\left(\theta^\top x^{(i)}\right)\left(1-\sigma\left(\theta^\top x^{(i)}\right)\right)\frac{\partial}{\partial \theta_j}\left(\theta^\top x^{(i)}\right)}{1-h_\theta\left(x^{(i)}\right)} \right] \\[2ex]
&\underset{\sigma\left(\theta^\top x\right)=h_\theta(x)}{=} \frac{-1}{m}\sum_{i=1}^m \left[ y^{(i)}\frac{h_\theta\left(x^{(i)}\right)\left(1-h_\theta\left(x^{(i)}\right)\right)\frac{\partial}{\partial \theta_j}\left(\theta^\top x^{(i)}\right)}{h_\theta\left(x^{(i)}\right)} - \left(1-y^{(i)}\right)\frac{h_\theta\left(x^{(i)}\right)\left(1-h_\theta\left(x^{(i)}\right)\right)\frac{\partial}{\partial \theta_j}\left(\theta^\top x^{(i)}\right)}{1-h_\theta\left(x^{(i)}\right)} \right] \\[2ex]
&\underset{\frac{\partial}{\partial \theta_j}\left(\theta^\top x^{(i)}\right)=x_j^{(i)}}{=} \frac{-1}{m}\sum_{i=1}^m \left[ y^{(i)}\left(1-h_\theta\left(x^{(i)}\right)\right)x_j^{(i)} - \left(1-y^{(i)}\right)h_\theta\left(x^{(i)}\right)x_j^{(i)} \right] \\[2ex]
&\underset{\text{distribute}}{=} \frac{-1}{m}\sum_{i=1}^m \left[ y^{(i)} - y^{(i)}h_\theta\left(x^{(i)}\right) - h_\theta\left(x^{(i)}\right) + y^{(i)}h_\theta\left(x^{(i)}\right) \right]x_j^{(i)} \\[2ex]
&\underset{\text{cancel}}{=} \frac{-1}{m}\sum_{i=1}^m \left[ y^{(i)} - h_\theta\left(x^{(i)}\right) \right]x_j^{(i)} \\[2ex]
&= \frac{1}{m}\sum_{i=1}^m \left[ h_\theta\left(x^{(i)}\right) - y^{(i)} \right]x_j^{(i)}
\end{aligned}$$

The derivative of the sigmoid function is

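$$\frac{d}{dz}\sigma(z) = \frac{d}{dz}\left(\frac{1}{1+e^{-z}}\right) = \frac{e^{-z}}{\left(1+e^{-z}\right)^2} = \frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}} = \sigma(z)\left(1-\sigma(z)\right)$$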
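
A quick way to sanity-check the final expression is to compare it against a finite-difference approximation of $J(\theta)$. The sketch below is only an illustration: the toy data, the starting `theta`, and all function names are made up for the example, and the first feature is assumed to be the intercept term $x_0 = 1$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    # h_theta(x) = sigma(theta^T x)
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def cost(theta, X, y):
    # J(theta) = -(1/m) * sum[ y log h + (1 - y) log(1 - h) ]
    total = 0.0
    for xi, yi in zip(X, y):
        h = hypothesis(theta, xi)
        total += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return -total / len(X)

def analytic_gradient(theta, X, y):
    # dJ/dtheta_j = (1/m) * sum[ (h - y) * x_j ], the last line of the derivation
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        err = hypothesis(theta, xi) - yi
        for j, xij in enumerate(xi):
            grad[j] += err * xij
    return [g / len(X) for g in grad]

def numeric_gradient(theta, X, y, eps=1e-6):
    # Central finite differences, one coordinate at a time
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += eps
        down[j] -= eps
        grad.append((cost(up, X, y) - cost(down, X, y)) / (2 * eps))
    return grad

# Hypothetical toy data: each row is [x_0 = 1, x_1]
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.1]]
y = [1, 0, 1, 0]
theta = [0.3, -0.7]

max_diff = max(
    abs(a - n)
    for a, n in zip(analytic_gradient(theta, X, y), numeric_gradient(theta, X, y))
)
print("largest gap between analytic and numeric gradient:", max_diff)
```

If the derivation is right, the printed gap should be tiny, limited only by floating-point error in the finite differences.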

Answer from Antoni Parellada on Stack Exchange

Medium

medium.com › analytics-vidhya › derivative-of-log-loss-function-for-logistic-regression-9b832f025c2d

The Derivative of Cost Function for Logistic Regression | by Saket Thavanani | Analytics Vidhya | Medium

February 8, 2024 - The cost function is split into two cases, y=1 and y=0. For y=1, the error approaches zero as the hypothesis function tends to 1 and grows without bound as it tends to 0.
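
To make the two cases concrete, here is a small sketch of the per-example cost. The function name `example_cost` and the probe values are assumptions chosen for illustration, not taken from the article.

```python
import math

def example_cost(h, y):
    """Log loss of a single example: h is the predicted probability in (0, 1), y is 0 or 1."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

# Confident-and-right predictions cost almost nothing; confident-and-wrong ones cost a lot.
for h in (0.999, 0.5, 0.001):
    print(f"h={h}: cost if y=1 is {example_cost(h, 1):.4f}, cost if y=0 is {example_cost(h, 0):.4f}")
```

The y=0 case mirrors the y=1 case: predicting h for a positive example costs the same as predicting 1 - h for a negative one.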

Stack Exchange

stats.stackexchange.com › questions › 278771 › how-is-the-cost-function-from-logistic-regression-differentiated

How is the cost function from Logistic Regression differentiated - Cross Validated

Top answer

1 of 5

55

2 of 5

14

To avoid the impression that the matter is excessively complex, let us just look at the structure of the solution.

With some simplification and abuse of notation, let $\text{[math]}$ be a term in the sum of $\text{[math]}$ , where $\text{[math]}$ is a function of $\text{[math]}$ : $\text{[math]}$

We may use the chain rule: $\text{[math]}$ and solve the factors one by one ( $\text{[math]}$ and $\text{[math]}$ are constants).

$\text{[math]}$ For the sigmoid, $\text{[math]}$ holds, which is just the denominator of the previous expression.

Finally, $\text{[math]}$ .

Combining all the results gives the sought-for expression: $\text{[math]}$ Hope that helps.
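
For readers who want the structure spelled out, one possible way to write this chain-rule argument is sketched below. The symbols are assumptions and may differ from the ones the answer used: $L$ is one term of the sum in $J(\theta)$, $z = \theta^\top x$, and $a = \sigma(z)$.

$$\begin{aligned}
L &= -\left[\,y\log a + (1-y)\log(1-a)\,\right], \qquad a=\sigma(z), \qquad z=\theta^\top x \\
\frac{\partial L}{\partial \theta_j} &= \frac{\partial L}{\partial a}\,\frac{\partial a}{\partial z}\,\frac{\partial z}{\partial \theta_j} \\
\frac{\partial L}{\partial a} &= -\frac{y}{a}+\frac{1-y}{1-a} = \frac{a-y}{a(1-a)} \\
\frac{\partial a}{\partial z} &= a(1-a), \qquad \frac{\partial z}{\partial \theta_j} = x_j \\
\frac{\partial L}{\partial \theta_j} &= \frac{a-y}{a(1-a)}\cdot a(1-a)\cdot x_j = (a-y)\,x_j
\end{aligned}$$

The factor $a(1-a)$ from the sigmoid cancels the denominator of $\partial L/\partial a$, which is why the final gradient is so simple.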

Discussions

How to get the derivatives of the logistic cost/loss function [TEACHING STAFF]

Hi! If you’re wondering how to get the derivatives for the logistic cost / loss function shown in course 1 week 3 “Gradient descent implementation”: I made a Google Colab (includes videos and code) that explains how to get these equations. (hold Ctrl + click for windows, Command + Click ... More on community.deeplearning.ai

community.deeplearning.ai

18

22

June 24, 2022

REALLY breaking down logistic regression gradient descent

Great job. I appreciate all the effort you put to write up the equations in Latex. More on reddit.com

r/learnmachinelearning

19

189

October 14, 2020

Videos

06:22

YouTube

Logistic Regression Cost Function | Machine Learning | Simply ...

January 5, 2021

08:12

YouTube

Logistic Regression Cost Function (C1W2L03) - YouTube

August 25, 2017

youtube.com

Understanding the Cost Function in Logistic Regression

44:28

Geeksforgeeks

Cost function in Logistic Regression in Machine Learning - ...

14 Cost function for logistic regression - YouTube

September 2, 2024

04:44

YouTube

Logistic Regression Gradient Descent | Derivation | Machine Learning ...

January 15, 2021

Stack Exchange

math.stackexchange.com › questions › 477207 › derivative-of-cost-function-for-logistic-regression

statistics - derivative of cost function for Logistic Regression - Mathematics Stack Exchange

Top answer

1 of 8

175

The reason is the following. We use the notation:

$\text{[math]}$

Then

$\text{[math]}$ $\text{[math]}$ [ this used: $\text{[math]}$ the 1's in numerator cancel, then we used: $\text{[math]}$ ]

Since our original cost function is the form of:

$\text{[math]}$

Plugging in the two simplified expressions above, we obtain $\text{[math]}$ , which can be simplified to: $\text{[math]}$

where the second equality follows from

$\text{[math]}$ [ we used $\text{[math]}$ ]

All you need now is to compute the partial derivatives of $\text{[math]}$ w.r.t. $\text{[math]}$ . As $$\frac{\partial}{\partial \theta_j}y_i\theta x^i=y_ix^i_j, $$ $$\frac{\partial}{\partial \theta_j}\log(1+e^{\theta x^i})=\frac{x^i_je^{\theta x^i}}{1+e^{\theta x^i}}=x^i_jh_\theta(x^i),$$

the thesis follows.
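
The two partial derivatives above suggest that the simplified cost is built from the terms $y_i\theta x^i$ and $\log(1+e^{\theta x^i})$. As a rough check, the sketch below compares that simplified form with the usual cross-entropy form numerically. The function names and the toy data are made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(theta, x):
    return sum(t * xi for t, xi in zip(theta, x))

def cost_cross_entropy(theta, X, y):
    # J = -(1/m) * sum[ y log h + (1 - y) log(1 - h) ]
    s = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(dot(theta, xi))
        s += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return -s / len(X)

def cost_simplified(theta, X, y):
    # J = -(1/m) * sum[ y * (theta x) - log(1 + exp(theta x)) ]
    s = 0.0
    for xi, yi in zip(X, y):
        z = dot(theta, xi)
        s += yi * z - math.log(1.0 + math.exp(z))
    return -s / len(X)

# Hypothetical toy data
X = [[1.0, 0.4], [1.0, -2.0], [1.0, 1.5]]
y = [1, 0, 0]
theta = [-0.2, 0.9]

gap = abs(cost_cross_entropy(theta, X, y) - cost_simplified(theta, X, y))
print("difference between the two forms:", gap)
```

Differentiating the simplified form term by term with the two identities above gives $(h_\theta(x^i) - y_i)\,x^i_j$ per example, the same gradient as before.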

2 of 8

7

You have to get the partial derivative with respect to $\text{[math]}$ . Remember that the hypothesis function here is equal to the sigmoid function, which is a function of $\text{[math]}$ ; in other words, we need to apply the chain rule. This is my approach:

$\text{[math]}$

Anything without $\text{[math]}$ is treated as constant:

$\text{[math]}$

Let's solve each derivative separately and then plug back in on (1):

$\text{[math]}$

Plug (3) and (2) in (1):

$\text{[math]}$

Notice that using the chain rule, the derivative of the hypothesis function can be understood as $\text{[math]}$

where

$\text{[math]}$ and $\text{[math]}$

Plug (5) in (4):

$\text{[math]}$

Applying some algebra and solving subtraction:

$\text{[math]}$

There is a $\text{[math]}$ factor missing from your expected answer.

Hope this helps.

University of Michigan

public.websites.umich.edu › ~yuekai › stats415 › posts › logistic-regression.html

Logistic regression cost function derivation | STATS 415

November 2, 2022 - For logistic regression, the log-likelihood is \[\begin{aligned} \log L(\beta_0,\beta) &= \sum_{i=1}^n \left[Y_i\log s(\beta_0 + \beta^\top X_i) + (1-Y_i)\log\left(1- s(\beta_0 + \beta^\top X_i)\right)\right] &&\text{(properties of log)}\\ &= \sum_{i=1}^n \left[Y_i\log\frac{s(\beta_0 + \beta^\top X_i)}{1-s(\beta_0 + \beta^\top X_i)} + \log\left(1- s(\beta_0 + \beta^\top X_i)\right)\right]. \end{aligned}\]

GeeksforGeeks

geeksforgeeks.org › machine learning › ml-cost-function-in-logistic-regression

Cost function in Logistic Regression in Machine Learning - GeeksforGeeks

MSE works well for regression, but in Logistic Regression it creates a non-convex curve (multiple local minima). Log loss ensures a convex cost function, making optimization with Gradient Descent easier and guaranteeing a global minimum.

Published January 19, 2026

RPubs

rpubs.com › dnuttle › ml-logistic-cost-func_derivative

RPubs - Partial derivative of cost function for logistic regression

August 1, 2018 - Partial derivative of cost function for logistic regression · by Dan Nuttle

Internal Pointers

internalpointers.com › post › cost-function-logistic-regression

The cost function in logistic regression - Internal Pointers

October 29, 2017 - If you try to use linear regression's cost function to generate $J(\theta)$ in a logistic regression problem, you end up with a non-convex function: a weirdly shaped graph with no easy-to-find global minimum, as seen in the picture below.
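
The non-convexity claim can be probed numerically. The sketch below is a simplified illustration, not the article's own experiment: it assumes a single made-up training example (x = 1, y = 1) and a scalar parameter, then estimates the second derivative of each cost with finite differences. A negative value anywhere means the function is not convex.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def squared_error(theta, x=1.0, y=1.0):
    # Linear regression's cost applied to a sigmoid hypothesis
    return (sigmoid(theta * x) - y) ** 2

def log_loss(theta, x=1.0, y=1.0):
    h = sigmoid(theta * x)
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))

def second_derivative(f, t, eps=1e-3):
    # Central finite-difference estimate of f''(t)
    return (f(t + eps) - 2.0 * f(t) + f(t - eps)) / eps ** 2

grid = [-4.0, -3.0, -2.0, -1.0, 0.0, 1.0, 2.0]
sq_curv = [second_derivative(squared_error, t) for t in grid]
ll_curv = [second_derivative(log_loss, t) for t in grid]

for t, sq, ll in zip(grid, sq_curv, ll_curv):
    print(f"theta={t:+.1f}  squared-error curvature={sq:+.4f}  log-loss curvature={ll:+.4f}")
```

On this grid the squared-error curvature changes sign, while the log-loss curvature stays positive, which is consistent with the convexity argument.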

Aman's AI Journal

aman.ai › primers › backprop › derivative-logistic-regression

Aman's AI Journal • Primers • Partial Derivative of the Cost Function for Logistic Regression

Let’s begin with the cost function used for logistic regression, which is the average of the log loss across all training examples, as given below: \[J(\theta)=-\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \left(h_{\theta}\left(x^{(i)}\right)\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right]\]

ML Glossary

ml-cheatsheet.readthedocs.io › en › latest › logistic_regression.html

Logistic Regression — ML Glossary documentation

The benefits of taking the logarithm reveal themselves when you look at the cost function graphs for y=1 and y=0. These smooth monotonic functions [7] (always increasing or always decreasing) make it easy to calculate the gradient and minimize cost. Image from Andrew Ng’s slides on logistic regression [1].

DeepLearning.AI

community.deeplearning.ai › course q&a › machine learning specialization › supervised ml: regression and classification

How to get the derivatives of the logistic cost/loss function [TEACHING STAFF] - Supervised ML: Regression and Classification - DeepLearning.AI

Top answer

1 of 15

3

Thank you! This was really good explanation and straightforward. I feel at peace now that I know how these equations are arrived at.

2 of 15

1

Great explanation for anyone trying to understand these derivatives. Thanks!

Medium

medium.com › @arnanbonny › 011-understanding-logistic-regression-cost-function-and-optimization-0edfd40b5568

011: Understanding Logistic Regression (Cost Function and Optimization) | by ArnanBonny | Medium

November 22, 2025 - Hence we use the log loss or cross-entropy loss function for logistic regression. This particular cost function is derived from statistics using a statistical principle called Maximum Likelihood Estimation.

Analytics Vidhya

analyticsvidhya.com › home › logistic regression in machine learning

Logistic Regression in Machine Learning

April 25, 2025 - In linear regression, we use the ... in linear regression is like this: In logistic regression Yi is a non-linear function (Ŷ=1/1+ ......

Nucleusbox

nucleusbox.com › cost-function-in-logistic-regression

Cost Function in Logistic Regression: Explanation & Insights

March 26, 2025 - So, in order to get the parameter θ of the hypothesis, we can either maximize the likelihood or minimize the cost function. Now we can take the log of the above logistic regression likelihood equation.

SlideShare

slideshare.net › home › engineering › derivation of cost function for logistic regression

Derivation of cost function for logistic regression | PPT

January 8, 2017 - This document discusses logistic regression and its cost function. It introduces zero-one classification and the softmax function, which generalizes the logistic function to represent a categorical distribution for multi-class classification problems.

Stanford

web.stanford.edu › class › archive › cs › cs109 › cs109.1166 › pdfs › 40 LogisticRegression.pdf pdf

Logistic Regression Chris Piech CS109 Handout #40 May 20th, 2016

In the case of Logistic Regression you can prove that the result will always be a global maximum. The small step that we continually take given the training dataset can be calculated as: ... Where η is the magnitude of the step size that we take. If you keep updating θ using the equation above you ... In this section we provide the mathematical derivations for the log-likelihood function and the gradient.

Quora

quora.com › How-do-you-take-the-derivative-of-Logistic-regression-cost-function-for-Gradient-Descent

How to take the derivative of Logistic regression cost function for Gradient Descent - Quora

Answer: To start, here is a super slick way of writing the probability of one datapoint: Since each datapoint is independent, the probability of all the data is: And if you take the log of this function, you get the reported Log Likelihood for Logistic Regression. The next step is to calculate...

MLU-Explain

mlu-explain.github.io › logistic-regression

Logistic Regression

A visual, interactive explanation of logistic regression for machine learning.

Stack Overflow

stackoverflow.com › questions › 32986123 › why-the-cost-function-of-logistic-regression-has-a-logarithmic-expression

machine learning - Why the cost function of logistic regression has a logarithmic expression? - Stack Overflow

Top answer

1 of 5

57

Source: my own notes taken during Stanford's Machine Learning course on Coursera, by Andrew Ng. All credits to him and this organization. The course is freely available for anybody to take at their own pace. The images are made by myself using LaTeX (formulas) and R (graphics).

Hypothesis function

Logistic regression is used when the variable y to be predicted can only take discrete values (i.e., classification).

Considering a binary classification problem (y can only take two values), and given a set of parameters θ and a set of input features x, the hypothesis function can be defined so that it is bounded in [0, 1], where g() represents the sigmoid function:

$$h_\theta(x) = g\left(\theta^\top x\right), \qquad g(z) = \frac{1}{1+e^{-z}}$$

This hypothesis function also represents the estimated probability that y = 1 on input x, parameterized by θ:

$$h_\theta(x) = P(y = 1 \mid x; \theta)$$

Cost function

The cost function represents the optimization objective.

Although the cost function could be defined as the mean of the Euclidean distance between the hypothesis h_θ(x) and the actual value y over all m samples in the training set, this definition yields a non-convex cost function whenever the hypothesis is built from the sigmoid function, so a local minimum could easily be found before reaching the global minimum. To ensure the cost function is convex (and therefore ensure convergence to the global minimum), the cost function is transformed using the logarithm of the sigmoid function.

This way the optimization objective function can be defined as the mean of the costs/errors in the training set:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^m \left[ y^{(i)}\log\left(h_\theta\left(x^{(i)}\right)\right) + \left(1-y^{(i)}\right)\log\left(1-h_\theta\left(x^{(i)}\right)\right)\right]$$

2 of 5

19

This cost function is simply a reformulation of the maximum-(log-)likelihood criterion.

The model of the logistic regression is:

P(y=1 | x) = logistic(θ x)
P(y=0 | x) = 1 - P(y=1 | x) = 1 - logistic(θ x)

The likelihood is written as:

L = P(y_0, ..., y_n | x_0, ..., x_n) = \prod_i P(y_i | x_i)

The log-likelihood is:

l = log L = \sum_i log P(y_i | x_i)

We want to find θ which maximizes the likelihood:

max_θ \prod_i P(y_i | x_i)

This is the same as maximizing the log-likelihood:

max_θ \sum_i log P(y_i | x_i)

We can rewrite this as a minimization of the cost C=-l:

min_θ \sum_i - log P(y_i | x_i)
  P(y_i | x_i) = logistic(θ x_i)      when y_i = 1
  P(y_i | x_i) = 1 - logistic(θ x_i)  when y_i = 0
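
The chain of steps above (likelihood, log-likelihood, negated log-likelihood as a cost) can be sketched in a few lines. This is only an illustration under assumptions: the data, the scalar `theta`, and the helper name `prob` are invented, and a single scalar feature is used to keep it short.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def prob(y_i, x_i, theta):
    """P(y_i | x_i) under the model above, for a scalar feature x_i."""
    p1 = logistic(theta * x_i)
    return p1 if y_i == 1 else 1.0 - p1

# Hypothetical data and parameter
xs = [0.5, -1.0, 2.0, 0.3]
ys = [1, 0, 1, 0]
theta = 0.8

# L = prod_i P(y_i | x_i)
likelihood = math.prod(prob(y, x, theta) for x, y in zip(xs, ys))

# C = -l = sum_i -log P(y_i | x_i)
neg_log_likelihood = -sum(math.log(prob(y, x, theta)) for x, y in zip(xs, ys))

# The same quantity written as the (unaveraged) cross-entropy cost
cross_entropy = -sum(
    y * math.log(logistic(theta * x)) + (1 - y) * math.log(1.0 - logistic(theta * x))
    for x, y in zip(xs, ys)
)

print(likelihood, neg_log_likelihood, cross_entropy)
```

The last two numbers should agree up to rounding, and both equal the negative log of the first, so maximizing the likelihood and minimizing this cost pick out the same θ. Dividing by the number of samples gives the averaged cost without changing the minimizer.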

Ml-explained

ml-explained.com › blog › logistic-regression-explained

Logistic Regression - ML Explained

September 29, 2020 - To find the coefficients (weights) that minimize the loss function we will use Gradient Descent. There are more sophisticated optimization algorithms out there such as Adam but we won't worry about those in this article. Remember the general form of gradient descent looks like: We can get the gradient descent formula for Logistic Regression by taking the derivative of the loss function.
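
As a rough sketch of what such a gradient descent loop can look like, the code below repeatedly subtracts the learning rate times the gradient of the log loss. It is not the article's implementation: the data set, the learning rate `alpha`, the step count, and every function name are assumptions for the example, and the first feature is an intercept term fixed at 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def cost(theta, X, y):
    # Average log loss over the training set
    return -sum(
        yi * math.log(predict(theta, xi)) + (1 - yi) * math.log(1 - predict(theta, xi))
        for xi, yi in zip(X, y)
    ) / len(X)

def gradient(theta, X, y):
    # dJ/dtheta_j = (1/m) * sum[ (h - y) * x_j ]
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        err = predict(theta, xi) - yi
        for j, xij in enumerate(xi):
            grad[j] += err * xij / len(X)
    return grad

def gradient_descent(X, y, alpha=0.1, steps=500):
    theta = [0.0] * len(X[0])
    history = [cost(theta, X, y)]
    for _ in range(steps):
        g = gradient(theta, X, y)
        # Update rule: theta_j := theta_j - alpha * dJ/dtheta_j
        theta = [t - alpha * gj for t, gj in zip(theta, g)]
        history.append(cost(theta, X, y))
    return theta, history

# Hypothetical, deliberately non-separable toy data: rows are [1, x_1]
X = [[1.0, 0.2], [1.0, 1.1], [1.0, 1.9], [1.0, 3.0], [1.0, 1.4], [1.0, 2.2]]
y = [0, 0, 1, 1, 1, 0]

theta, history = gradient_descent(X, y)
print("theta:", theta)
print("cost went from", history[0], "to", history[-1])
```

With all-zero starting weights every prediction is 0.5, so the initial cost is log 2; with a small enough learning rate the cost should then fall at every step.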

Medium

medium.com › @ilmunabid › beginners-guide-to-finding-gradient-derivative-of-log-loss-by-hand-detailed-steps-74a6cacfe5cf

Beginner’s Guide to Finding Gradient/Derivative of Log Loss by Hand (Detailed Steps) | by Abid Ilmun Fisabil | Medium

August 17, 2022 - Putting all the results of each function we have derived, we have the following function. Which then to be known as the derivative/gradient of our logistic regression’s cost function. Below is the gradient of our cost function with respect to w (weights).