Answer from Avitus on Stack Exchange:

The reason is the following. We use the notation

$$\theta x^i := \theta_0 + \theta_1 x^i_1 + \dots + \theta_p x^i_p.$$

Then

$$\log h_\theta(x^i) = \log\frac{1}{1+e^{-\theta x^i}} = -\log\left(1+e^{-\theta x^i}\right),$$
$$\log\left(1-h_\theta(x^i)\right) = \log\left(1 - \frac{1}{1+e^{-\theta x^i}}\right) = -\theta x^i - \log\left(1+e^{-\theta x^i}\right)$$

[ this used: $1 - \frac{1}{1+e^{-\theta x^i}} = \frac{e^{-\theta x^i}}{1+e^{-\theta x^i}}$, i.e. the 1's in the numerator cancel, then we used $\log\left(e^{-\theta x^i}\right) = -\theta x^i$ ]

Since our original cost function is of the form

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^m \left[y_i \log h_\theta(x^i) + (1-y_i)\log\left(1-h_\theta(x^i)\right)\right],$$

plugging in the two simplified expressions above, we obtain

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^m \left[-y_i\log\left(1+e^{-\theta x^i}\right) + (1-y_i)\left(-\theta x^i - \log\left(1+e^{-\theta x^i}\right)\right)\right],$$

which can be simplified to

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^m \left[y_i\,\theta x^i - \theta x^i - \log\left(1+e^{-\theta x^i}\right)\right] = -\frac{1}{m}\sum_{i=1}^m \left[y_i\,\theta x^i - \log\left(1+e^{\theta x^i}\right)\right],$$

where the second equality follows from

$$-\theta x^i - \log\left(1+e^{-\theta x^i}\right) = -\left[\log e^{\theta x^i} + \log\left(1+e^{-\theta x^i}\right)\right] = -\log\left(1+e^{\theta x^i}\right)$$

[ we used $e^{\theta x^i}\left(1+e^{-\theta x^i}\right) = 1+e^{\theta x^i}$ ]

All you need now is to compute the partial derivatives of $J(\theta)$ w.r.t. $\theta_j$. As

$$\frac{\partial}{\partial \theta_j}y_i\theta x^i=y_ix^i_j, $$
$$\frac{\partial}{\partial \theta_j}\log(1+e^{\theta x^i})=\frac{x^i_je^{\theta x^i}}{1+e^{\theta x^i}}=x^i_jh_\theta(x^i),$$

the thesis follows.
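The simplification above is easy to verify numerically. The sketch below is not part of the original answer; the scalar `z` stands for $\theta x^i$, and it checks that the original cross-entropy cost and the simplified form agree for each example:

```python
# Numeric sanity check: the per-example cost
#   -[y*log(h) + (1-y)*log(1-h)]   with h = 1/(1+exp(-z))
# equals the simplified form
#   -[y*z - log(1+exp(z))]
# derived in the answer above.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost_original(z, y):
    h = sigmoid(z)
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))

def cost_simplified(z, y):
    return -(y * z - math.log(1 + math.exp(z)))

for z in (-2.0, -0.5, 0.0, 1.0, 3.0):
    for y in (0, 1):
        assert abs(cost_original(z, y) - cost_simplified(z, y)) < 1e-12
print("both forms of the cost agree")
```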
You have to take the partial derivative with respect to $\theta_j$. Remember that the hypothesis function here is the sigmoid function, which is itself a function of $\theta$; in other words, we need to apply the chain rule. This is my approach:

$$\frac{\partial J(\theta)}{\partial \theta_j} = -\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}\frac{\partial}{\partial \theta_j}\log h_\theta\left(x^{(i)}\right) + \left(1-y^{(i)}\right)\frac{\partial}{\partial \theta_j}\log\left(1-h_\theta\left(x^{(i)}\right)\right)\right] \tag{1}$$

Anything without $\theta_j$ is treated as a constant. Let's solve each derivative separately and then plug them back into (1):

$$\frac{\partial}{\partial \theta_j}\log h_\theta\left(x^{(i)}\right) = \frac{1}{h_\theta\left(x^{(i)}\right)}\,\frac{\partial h_\theta\left(x^{(i)}\right)}{\partial \theta_j} \tag{2}$$

$$\frac{\partial}{\partial \theta_j}\log\left(1-h_\theta\left(x^{(i)}\right)\right) = -\frac{1}{1-h_\theta\left(x^{(i)}\right)}\,\frac{\partial h_\theta\left(x^{(i)}\right)}{\partial \theta_j} \tag{3}$$

Plug (3) and (2) in (1):

$$\frac{\partial J(\theta)}{\partial \theta_j} = -\frac{1}{m}\sum_{i=1}^m\left[\frac{y^{(i)}}{h_\theta\left(x^{(i)}\right)} - \frac{1-y^{(i)}}{1-h_\theta\left(x^{(i)}\right)}\right]\frac{\partial h_\theta\left(x^{(i)}\right)}{\partial \theta_j} \tag{4}$$

Notice that, using the chain rule, the derivative of the hypothesis function can be understood as

$$\frac{\partial h_\theta\left(x^{(i)}\right)}{\partial \theta_j} = h_\theta\left(x^{(i)}\right)\left(1-h_\theta\left(x^{(i)}\right)\right)x_j^{(i)}, \tag{5}$$

where $h_\theta\left(x^{(i)}\right) = \sigma\left(\theta^\top x^{(i)}\right)$ and $\sigma'(z) = \sigma(z)\left(1-\sigma(z)\right)$.

Plug (5) in (4):

$$\frac{\partial J(\theta)}{\partial \theta_j} = -\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}\left(1-h_\theta\left(x^{(i)}\right)\right) - \left(1-y^{(i)}\right)h_\theta\left(x^{(i)}\right)\right]x_j^{(i)}$$

Applying some algebra and solving the subtraction:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^m\left[h_\theta\left(x^{(i)}\right)-y^{(i)}\right]x_j^{(i)}$$

There is a factor missing in your expected answer.

Hope this helps.
How is the cost function from Logistic Regression differentiated - Cross Validated
How to get the derivatives of the logistic cost/loss function [TEACHING STAFF]
machine learning - Why the cost function of logistic regression has a logarithmic expression? - Stack Overflow
machine learning - Is it ok to define your own cost function for logistic regression? - Stack Overflow
Adapted from the course notes, which I don't see available (including this derivation) outside of the notes contributed by students on the page of Andrew Ng's Coursera Machine Learning course.
In what follows, the superscript $(i)$ denotes individual measurements or training "examples."
$\small \frac{\partial J(\theta)}{\partial \theta_j} = \frac{\partial}{\partial \theta_j} \,\frac{-1}{m}\sum_{i=1}^m \left[ y^{(i)}\log\left(h_\theta \left(x^{(i)}\right)\right) + (1 -y^{(i)})\log\left(1-h_\theta \left(x^{(i)}\right)\right)\right] \\[2ex]\small\underset{\text{linearity}}= \,\frac{-1}{m}\,\sum_{i=1}^m \left[ y^{(i)}\frac{\partial}{\partial \theta_j}\log\left(h_\theta \left(x^{(i)}\right)\right) + (1 -y^{(i)})\frac{\partial}{\partial \theta_j}\log\left(1-h_\theta \left(x^{(i)}\right)\right) \right] \\[2ex]\Tiny\underset{\text{chain rule}}= \,\frac{-1}{m}\,\sum_{i=1}^m \left[ y^{(i)}\frac{\frac{\partial}{\partial \theta_j}h_\theta \left(x^{(i)}\right)}{h_\theta\left(x^{(i)}\right)} + (1 -y^{(i)})\frac{\frac{\partial}{\partial \theta_j}\left(1-h_\theta \left(x^{(i)}\right)\right)}{1-h_\theta\left(x^{(i)}\right)} \right] \\[2ex]\small\underset{h_\theta(x)=\sigma\left(\theta^\top x\right)}=\,\frac{-1}{m}\,\sum_{i=1}^m \left[ y^{(i)}\frac{\frac{\partial}{\partial \theta_j}\sigma\left(\theta^\top x^{(i)}\right)}{h_\theta\left(x^{(i)}\right)} + (1 -y^{(i)})\frac{\frac{\partial}{\partial \theta_j}\left(1-\sigma\left(\theta^\top x^{(i)}\right)\right)}{1-h_\theta\left(x^{(i)}\right)} \right] \\[2ex]\Tiny\underset{\sigma'}=\frac{-1}{m}\,\sum_{i=1}^m \left[ y^{(i)}\, \frac{\sigma\left(\theta^\top x^{(i)}\right)\left(1-\sigma\left(\theta^\top x^{(i)}\right)\right)\frac{\partial}{\partial \theta_j}\left(\theta^\top x^{(i)}\right)}{h_\theta\left(x^{(i)}\right)} - (1 -y^{(i)})\,\frac{\sigma\left(\theta^\top x^{(i)}\right)\left(1-\sigma\left(\theta^\top x^{(i)}\right)\right)\frac{\partial}{\partial \theta_j}\left(\theta^\top x^{(i)}\right)}{1-h_\theta\left(x^{(i)}\right)} \right] \\[2ex]\small\underset{\sigma\left(\theta^\top x\right)=h_\theta(x)}= \,\frac{-1}{m}\,\sum_{i=1}^m \left[ y^{(i)}\frac{h_\theta\left( x^{(i)}\right)\left(1-h_\theta\left( x^{(i)}\right)\right)\frac{\partial}{\partial \theta_j}\left(\theta^\top x^{(i)}\right)}{h_\theta\left(x^{(i)}\right)} - 
(1 -y^{(i)})\frac{h_\theta\left( x^{(i)}\right)\left(1-h_\theta\left(x^{(i)}\right)\right)\frac{\partial}{\partial \theta_j}\left( \theta^\top x^{(i)}\right)}{1-h_\theta\left(x^{(i)}\right)} \right] \\[2ex]\small\underset{\frac{\partial}{\partial \theta_j}\left(\theta^\top x^{(i)}\right)=x_j^{(i)}}=\,\frac{-1}{m}\,\sum_{i=1}^m \left[y^{(i)}\left(1-h_\theta\left(x^{(i)}\right)\right)x_j^{(i)}- \left(1-y^{(i)}\right)\,h_\theta\left(x^{(i)}\right)x_j^{(i)} \right] \\[2ex]\small\underset{\text{distribute}}=\,\frac{-1}{m}\,\sum_{i=1}^m \left[y^{(i)}-y^{(i)}h_\theta\left(x^{(i)}\right)- h_\theta\left(x^{(i)}\right)+y^{(i)}h_\theta\left(x^{(i)}\right) \right]\,x_j^{(i)} \\[2ex]\small\underset{\text{cancel}}=\,\frac{-1}{m}\,\sum_{i=1}^m \left[y^{(i)}-h_\theta\left(x^{(i)}\right)\right]\,x_j^{(i)} \\[2ex]\small=\frac{1}{m}\sum_{i=1}^m\left[h_\theta\left(x^{(i)}\right)-y^{(i)}\right]\,x_j^{(i)} $
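As a sanity check (my addition, with made-up toy data), the final expression $\frac{1}{m}\sum_i \left[h_\theta(x^{(i)})-y^{(i)}\right]x_j^{(i)}$ can be compared against a finite-difference approximation of $\partial J/\partial\theta_j$:

```python
# Finite-difference check that dJ/dtheta_j = (1/m) * sum_i (h - y_i) * x_ij
# for the cross-entropy cost J derived above. The data below is made up.
import math

X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.1]]  # first column is the intercept
Y = [1, 0, 1, 0]

def h(theta, x):
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def J(theta):
    m = len(X)
    return -sum(y * math.log(h(theta, x)) + (1 - y) * math.log(1 - h(theta, x))
                for x, y in zip(X, Y)) / m

def grad(theta):
    m = len(X)
    return [sum((h(theta, x) - y) * x[j] for x, y in zip(X, Y)) / m
            for j in range(len(theta))]

theta = [0.3, -0.7]
eps = 1e-6
for j in range(2):
    bumped = theta[:]
    bumped[j] += eps
    numeric = (J(bumped) - J(theta)) / eps   # one-sided finite difference
    assert abs(numeric - grad(theta)[j]) < 1e-4
print("analytic gradient matches finite differences")
```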
The derivative of the sigmoid function is

$$\sigma'(z) = \sigma(z)\left(1-\sigma(z)\right).$$
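This identity can be spot-checked numerically; an illustrative sketch, not from the notes:

```python
# Check sigma'(z) = sigma(z) * (1 - sigma(z)) via central differences.
import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

eps = 1e-6
for z in (-3.0, -1.0, 0.0, 0.5, 2.0):
    numeric = (sigma(z + eps) - sigma(z - eps)) / (2 * eps)
    assert abs(numeric - sigma(z) * (1 - sigma(z))) < 1e-8
print("sigma' = sigma * (1 - sigma) confirmed numerically")
```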
To avoid the impression that the matter is excessively complex, let us just look at the structure of the solution.

With simplification and some abuse of notation, let $G(\theta)$ be a term in the sum of $J(\theta)$, and $h = \frac{1}{1+e^{-z}}$ is a function of $z = \theta x$:

$$G = y\log(h) + (1-y)\log(1-h).$$

We may use the chain rule:

$$\frac{dG}{d\theta} = \frac{dG}{dh}\cdot\frac{dh}{dz}\cdot\frac{dz}{d\theta}$$

and solve it one by one ($y$ and $m$ are constants).

$$\frac{dG}{dh} = \frac{y}{h} - \frac{1-y}{1-h} = \frac{y-h}{h(1-h)}$$

For the sigmoid, $\frac{dh}{dz} = h(1-h)$ holds, which is just the denominator of the previous statement.

Finally, $\frac{dz}{d\theta} = x$.

Combining the results all together gives the sought-for expression:

$$\frac{dG}{d\theta} = (y-h)\,x$$

Since $J(\theta)$ is $-\frac{1}{m}$ times the sum of such terms, the gradient is $\frac{1}{m}\sum_i\left(h-y^{(i)}\right)x^{(i)}$.

Hope that helps.
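The three chain-rule factors can be multiplied out numerically to confirm they collapse to $(h-y)\,x$; the values below are made up for illustration:

```python
# Numeric illustration of the chain-rule factorization
#   dJ_i/dtheta_j = (dJ_i/dh) * (dh/dz) * (dz/dtheta_j)
#                 = ((h - y)/(h*(1-h))) * (h*(1-h)) * x_j = (h - y) * x_j
import math

theta = [0.4, -1.1]
x = [1.0, 0.8]   # example with an intercept term
y = 1

z = sum(t * xi for t, xi in zip(theta, x))
h = 1.0 / (1.0 + math.exp(-z))

for xj in x:
    dJ_dh = (h - y) / (h * (1 - h))  # derivative of -(y*log h + (1-y)*log(1-h)) w.r.t. h
    dh_dz = h * (1 - h)              # sigmoid derivative
    dz_dtj = xj                      # derivative of theta.x w.r.t. theta_j
    assert abs(dJ_dh * dh_dz * dz_dtj - (h - y) * xj) < 1e-12
print("chain-rule factors collapse to (h - y) * x_j")
```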
Source: my own notes taken during Stanford's Machine Learning course on Coursera, taught by Andrew Ng. All credit to him and this organization. The course is freely available for anybody to take at their own pace. The images were made by myself using LaTeX (formulas) and R (graphics).
Hypothesis function
Logistic regression is used when the variable y to be predicted can only take discrete values (i.e., classification).

Considering a binary classification problem (y can take only two values), and given a set of parameters θ and a set of input features x, the hypothesis function can be defined so that it is bounded within [0, 1], where g(·) represents the sigmoid function:

$$h_\theta(x) = g\left(\theta^\top x\right) = \frac{1}{1+e^{-\theta^\top x}}, \qquad 0 \le h_\theta(x) \le 1$$

This hypothesis function at the same time represents the estimated probability that y = 1 on input x, parameterized by θ:

$$h_\theta(x) = P(y=1 \mid x;\,\theta)$$
Cost function
The cost function represents the optimization objective:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^m \mathrm{Cost}\left(h_\theta\left(x^{(i)}\right),\,y^{(i)}\right)$$

A natural candidate for the cost would be the mean of the Euclidean distances between the hypothesis h_θ(x) and the actual value y over all m samples in the training set. However, because the hypothesis is built from the sigmoid function, that definition yields a non-convex cost function, so optimization could settle into a local minimum before reaching the global minimum. To ensure that the cost function is convex (and therefore guarantee convergence to the global minimum), it is defined using the logarithm of the sigmoid function:

$$\mathrm{Cost}\left(h_\theta(x),\,y\right) = \begin{cases} -\log\left(h_\theta(x)\right) & \text{if } y = 1 \\ -\log\left(1-h_\theta(x)\right) & \text{if } y = 0 \end{cases}$$

This way the optimization objective can be defined as the mean of the costs/errors over the training set:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^m \left[y^{(i)}\log\left(h_\theta\left(x^{(i)}\right)\right) + \left(1-y^{(i)}\right)\log\left(1-h_\theta\left(x^{(i)}\right)\right)\right]$$
This cost function is simply a reformulation of the maximum-(log-)likelihood criterion.
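Because the cost above is convex, plain gradient descent converges to the global minimum. A minimal sketch, assuming made-up toy data and an illustrative learning rate:

```python
# Minimal gradient-descent sketch for the convex logistic-regression cost.
# Toy data and learning rate are made up for illustration.
import math

X = [[1.0, 0.2], [1.0, 1.5], [1.0, -0.6], [1.0, 2.3], [1.0, -1.8], [1.0, 0.9]]
Y = [1, 1, 0, 1, 0, 0]   # deliberately overlapping classes (non-separable)

def h(theta, x):
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def J(theta):
    m = len(X)
    return -sum(y * math.log(h(theta, x)) + (1 - y) * math.log(1 - h(theta, x))
                for x, y in zip(X, Y)) / m

theta = [0.0, 0.0]
alpha = 0.3                     # illustrative learning rate
prev = J(theta)
for _ in range(500):
    m = len(X)
    grad = [sum((h(theta, x) - y) * x[j] for x, y in zip(X, Y)) / m
            for j in range(2)]
    theta = [t - alpha * g for t, g in zip(theta, grad)]
cur = J(theta)
assert cur < prev               # descent strictly reduces the convex cost
print("cost decreased from %.4f to %.4f" % (prev, cur))
```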
The model of the logistic regression is:
P(y=1 | x) = logistic(θ x)
P(y=0 | x) = 1 - P(y=1 | x) = 1 - logistic(θ x)
The likelihood is written as:
L = P(y_0, ..., y_n | x_0, ..., x_n) = \prod_i P(y_i | x_i)
The log-likelihood is:
l = log L = \sum_i log P(y_i | x_i)
We want to find θ which maximizes the likelihood:
max_θ \prod_i P(y_i | x_i)
This is the same as maximizing the log-likelihood:
max_θ \sum_i log P(y_i | x_i)
We can rewrite this as a minimization of the cost C=-l:
min_θ \sum_i - log P(y_i | x_i)

where

P(y_i | x_i) = logistic(θ x_i) when y_i = 1
P(y_i | x_i) = 1 - logistic(θ x_i) when y_i = 0

so each summand -log P(y_i | x_i) is exactly -[y_i log(logistic(θ x_i)) + (1 - y_i) log(1 - logistic(θ x_i))], the cross-entropy cost from above.
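A small numeric check (illustrative) that −log P(y_i | x_i) under this model equals the cross-entropy cost term, so minimizing the cost C maximizes the likelihood:

```python
# For y in {0, 1}, -log P(y | x) under the logistic model equals
# -[y*log(h) + (1-y)*log(1-h)] with h = logistic(theta . x).
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

for z in (-1.5, 0.0, 0.7, 2.0):   # z stands for theta . x_i
    p1 = logistic(z)
    for y in (0, 1):
        p = p1 if y == 1 else 1.0 - p1   # P(y | x) from the model
        cost = -(y * math.log(p1) + (1 - y) * math.log(1 - p1))
        assert abs(-math.log(p) - cost) < 1e-12
print("minimizing the cross-entropy cost maximizes the likelihood")
```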
Yes, you can define your own loss function, but if you're a novice, you're probably better off using one from the literature. There are conditions that loss functions should meet:
They should approximate the actual loss you're trying to minimize. As was said in the other answer, the standard loss function for classification is the zero-one loss (misclassification rate), and the ones used for training classifiers are approximations of that loss.
The squared-error loss from linear regression isn't used because it doesn't approximate zero-one-loss well: when your model predicts +50 for some sample while the intended answer was +1 (positive class), the prediction is on the correct side of the decision boundary so the zero-one-loss is zero, but the squared-error loss is still 49² = 2401. Some training algorithms will waste a lot of time getting predictions very close to {-1, +1} instead of focusing on getting just the sign/class label right.(*)
The loss function should work with your intended optimization algorithm. That's why zero-one-loss is not used directly: it doesn't work with gradient-based optimization methods since it doesn't have a well-defined gradient (or even a subgradient, like the hinge loss for SVMs has).
The main algorithm that optimizes the zero-one-loss directly is the old perceptron algorithm.
Also, when you plug in a custom loss function, you're no longer building a logistic regression model but some other kind of linear classifier.
(*) Squared error is used with linear discriminant analysis, but that's usually solved in closed form rather than iteratively.
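The +50 example above can be reproduced in a few lines (labels in {−1, +1}, as in that setting):

```python
# A confident, correct prediction of +50 for a +1 target incurs zero
# zero-one loss but a huge squared-error loss, as described above.
prediction, target = 50.0, 1.0

zero_one = 0.0 if prediction * target > 0 else 1.0  # misclassification loss
squared = (prediction - target) ** 2                # linear-regression loss

assert zero_one == 0.0
assert squared == 49.0 ** 2                         # 2401, as stated in the text
print("zero-one loss:", zero_one, "squared error:", squared)
```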
The logistic loss, hinge loss, smoothed hinge loss, etc. are used because they are upper bounds on the zero-one binary classification loss.
These functions generally also penalize examples that are correctly classified but are still near the decision boundary, thus creating a "margin."
So, if you are doing binary classification, then you should certainly choose a standard loss function.
If you are trying to solve a different problem, then a different loss function will likely perform better.
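A quick spot check (illustrative) of the upper-bound property mentioned above, writing the losses as functions of the margin m = y·f(x), with the logistic loss in base 2 so that it passes through 1 at m = 0:

```python
# The logistic loss log2(1 + e^{-m}) and the hinge loss max(0, 1 - m)
# both dominate the zero-one loss at every margin m.
import math

for m in (-2.0, -0.5, 0.0, 0.5, 2.0):
    zero_one = 1.0 if m <= 0 else 0.0
    logistic_loss = math.log2(1.0 + math.exp(-m))
    hinge_loss = max(0.0, 1.0 - m)
    assert logistic_loss >= zero_one
    assert hinge_loss >= zero_one
print("both surrogate losses dominate the zero-one loss")
```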