The reason is the following. We use the notation

$$h_\theta(x^i)=\frac{1}{1+e^{-\theta x^i}}.$$

Then

$$\log h_\theta(x^i)=\log\frac{1}{1+e^{-\theta x^i}}=-\log\left(1+e^{-\theta x^i}\right),$$ $$\log\left(1-h_\theta(x^i)\right)=\log\frac{e^{-\theta x^i}}{1+e^{-\theta x^i}}=-\theta x^i-\log\left(1+e^{-\theta x^i}\right)$$

[ this used: the 1's in numerator cancel, then we used: $\log e^{-\theta x^i}=-\theta x^i$ ]

Since our original cost function is of the form

$$J(\theta)=-\frac{1}{m}\sum_{i=1}^m\left[y_i\log h_\theta(x^i)+(1-y_i)\log\left(1-h_\theta(x^i)\right)\right],$$

plugging in the two simplified expressions above, we obtain

$$J(\theta)=-\frac{1}{m}\sum_{i=1}^m\left[-y_i\log\left(1+e^{-\theta x^i}\right)+(1-y_i)\left(-\theta x^i-\log\left(1+e^{-\theta x^i}\right)\right)\right],$$

which can be simplified to:

$$J(\theta)=-\frac{1}{m}\sum_{i=1}^m\left[y_i\theta x^i-\theta x^i-\log\left(1+e^{-\theta x^i}\right)\right]=-\frac{1}{m}\sum_{i=1}^m\left[y_i\theta x^i-\log\left(1+e^{\theta x^i}\right)\right],$$

where the second equality follows from

$$-\theta x^i-\log\left(1+e^{-\theta x^i}\right)=-\left[\log e^{\theta x^i}+\log\left(1+e^{-\theta x^i}\right)\right]=-\log\left(1+e^{\theta x^i}\right)$$

[ we used $\log a+\log b=\log ab$ and $e^{\theta x^i}\left(1+e^{-\theta x^i}\right)=1+e^{\theta x^i}$ ]

All you need now is to compute the partial derivatives of $J(\theta)$ w.r.t. $\theta_j$. As $$\frac{\partial}{\partial \theta_j}y_i\theta x^i=y_ix^i_j, $$ $$\frac{\partial}{\partial \theta_j}\log(1+e^{\theta x^i})=\frac{x^i_je^{\theta x^i}}{1+e^{\theta x^i}}=x^i_jh_\theta(x^i),$$

we get

$$\frac{\partial J(\theta)}{\partial \theta_j}=\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^i)-y_i\right)x^i_j,$$

and the thesis follows.

Answer from Avitus on Stack Exchange
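The algebraic simplification in the answer above can be sanity-checked numerically. The sketch below (plain Python, arbitrary probe values for the scalar product $\theta x^i$) verifies that $y\log h + (1-y)\log(1-h)$ and $y\,\theta x - \log(1+e^{\theta x})$ agree:

```python
import math

# Check that y*log(h) + (1-y)*log(1-h) equals y*(theta.x) - log(1 + e^(theta.x))
# for the sigmoid h = 1 / (1 + e^(-theta.x)).  Probe values are arbitrary.
for theta_x in (-2.0, -0.5, 0.0, 0.7, 3.0):
    for y in (0.0, 1.0):
        h = 1.0 / (1.0 + math.exp(-theta_x))
        lhs = y * math.log(h) + (1.0 - y) * math.log(1.0 - h)
        rhs = y * theta_x - math.log(1.0 + math.exp(theta_x))
        assert abs(lhs - rhs) < 1e-12, (theta_x, y)
print("simplification verified")
```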
Medium
The Derivative of Cost Function for Logistic Regression | by Saket Thavanani | Analytics Vidhya | Medium
February 8, 2024 - Evaluating the partial derivative using the pattern of the derivative of the sigmoid function. ... Removing the summation term by converting it into a matrix form for the gradient with respect to all the weights including the bias term. This little calculus exercise shows that both linear regression and logistic regression (actually a kind of classification) arrive at the same update rule. What we should appreciate is that the design of the cost function is part of the reasons why such “coincidence” happens.
Discussions

How is the cost function from Logistic Regression differentiated - Cross Validated
I am doing the Machine Learning Stanford course on Coursera. In the chapter on Logistic Regression, the cost function is this: … Then, it is differentiated here: … I tried getting the derivative of the … More on stats.stackexchange.com
How to get the derivatives of the logistic cost/loss function [TEACHING STAFF]
Hi! If you’re wondering how to get the derivatives for the logistic cost / loss function shown in course 1 week 3 “Gradient descent implementation”: I made a Google Colab (includes videos and code) that explains how to get these equations. (hold Ctrl + Click for Windows, Command + Click ... More on community.deeplearning.ai
June 24, 2022
machine learning - Why the cost function of logistic regression has a logarithmic expression? - Stack Overflow
Logistic regression is used when the variable y that is wanted to be predicted can only take discrete values (i.e.: classification). Considering a binary classification problem (y can only take two values), then having a set of parameters θ and set of input features x, the hypothesis function ... More on stackoverflow.com
machine learning - Is it ok to define your own cost function for logistic regression? - Stack Overflow
However, if you calculate the C'(w) and C'(b) for cross-entropy cost function, this problem doesn't occur, as unlike the derivatives of quadratic cost, the derivatives of cross entropy cost is not a multiple of $sigma'(wx+b)$, and hence when the logistic regression model outputs close to 0 ... More on stackoverflow.com
GeeksforGeeks
Cost function in Logistic Regression in Machine Learning - GeeksforGeeks
MSE works well for regression, but in Logistic Regression it creates a non-convex curve (multiple local minima). Log loss ensures a convex cost function, making optimization with Gradient Descent easier and guaranteeing a global minimum.
Published January 19, 2026
University of Michigan
Logistic regression cost function derivation | STATS 415
November 2, 2022 - For logistic regression, the log-likelihood is \[\begin{aligned} \log L(\beta_0,\beta) &= \sum_{i=1}^n Y_i\log s(\beta_0 + \beta^\top X_i) + (1-Y_i)\log(1- s(\beta_0 + \beta^\top X_i)) \\ &= \sum_{i=1}^n{\textstyle Y_i\log\frac{s(\beta_0 + \beta^\top X_i)}{1-s(\beta_0 + \beta^\top X_i)}} + \log(1- s(\beta_0 + \beta^\top X_i)) && \text{(properties of log)} \end{aligned}\]
RPubs
RPubs - Partial derivative of cost function for logistic regression
August 1, 2018 - Partial derivative of cost function for logistic regression · by Dan Nuttle
Top answer · 1 of 5 · 55 votes

Adapted from the course notes, which I don't see available (including this derivation) outside the student-contributed notes on the page of Andrew Ng's Coursera Machine Learning course.


In what follows, the superscript $(i)$ denotes individual measurements or training "examples."

$\small \frac{\partial J(\theta)}{\partial \theta_j} = \frac{\partial}{\partial \theta_j} \,\frac{-1}{m}\sum_{i=1}^m \left[ y^{(i)}\log\left(h_\theta \left(x^{(i)}\right)\right) + (1 -y^{(i)})\log\left(1-h_\theta \left(x^{(i)}\right)\right)\right] \\[2ex]\small\underset{\text{linearity}}= \,\frac{-1}{m}\,\sum_{i=1}^m \left[ y^{(i)}\frac{\partial}{\partial \theta_j}\log\left(h_\theta \left(x^{(i)}\right)\right) + (1 -y^{(i)})\frac{\partial}{\partial \theta_j}\log\left(1-h_\theta \left(x^{(i)}\right)\right) \right] \\[2ex]\Tiny\underset{\text{chain rule}}= \,\frac{-1}{m}\,\sum_{i=1}^m \left[ y^{(i)}\frac{\frac{\partial}{\partial \theta_j}h_\theta \left(x^{(i)}\right)}{h_\theta\left(x^{(i)}\right)} + (1 -y^{(i)})\frac{\frac{\partial}{\partial \theta_j}\left(1-h_\theta \left(x^{(i)}\right)\right)}{1-h_\theta\left(x^{(i)}\right)} \right] \\[2ex]\small\underset{h_\theta(x)=\sigma\left(\theta^\top x\right)}=\,\frac{-1}{m}\,\sum_{i=1}^m \left[ y^{(i)}\frac{\frac{\partial}{\partial \theta_j}\sigma\left(\theta^\top x^{(i)}\right)}{h_\theta\left(x^{(i)}\right)} + (1 -y^{(i)})\frac{\frac{\partial}{\partial \theta_j}\left(1-\sigma\left(\theta^\top x^{(i)}\right)\right)}{1-h_\theta\left(x^{(i)}\right)} \right] \\[2ex]\Tiny\underset{\sigma'}=\frac{-1}{m}\,\sum_{i=1}^m \left[ y^{(i)}\, \frac{\sigma\left(\theta^\top x^{(i)}\right)\left(1-\sigma\left(\theta^\top x^{(i)}\right)\right)\frac{\partial}{\partial \theta_j}\left(\theta^\top x^{(i)}\right)}{h_\theta\left(x^{(i)}\right)} - (1 -y^{(i)})\,\frac{\sigma\left(\theta^\top x^{(i)}\right)\left(1-\sigma\left(\theta^\top x^{(i)}\right)\right)\frac{\partial}{\partial \theta_j}\left(\theta^\top x^{(i)}\right)}{1-h_\theta\left(x^{(i)}\right)} \right] \\[2ex]\small\underset{\sigma\left(\theta^\top x\right)=h_\theta(x)}= \,\frac{-1}{m}\,\sum_{i=1}^m \left[ y^{(i)}\frac{h_\theta\left( x^{(i)}\right)\left(1-h_\theta\left( x^{(i)}\right)\right)\frac{\partial}{\partial \theta_j}\left(\theta^\top x^{(i)}\right)}{h_\theta\left(x^{(i)}\right)} - 
(1 -y^{(i)})\frac{h_\theta\left( x^{(i)}\right)\left(1-h_\theta\left(x^{(i)}\right)\right)\frac{\partial}{\partial \theta_j}\left( \theta^\top x^{(i)}\right)}{1-h_\theta\left(x^{(i)}\right)} \right] \\[2ex]\small\underset{\frac{\partial}{\partial \theta_j}\left(\theta^\top x^{(i)}\right)=x_j^{(i)}}=\,\frac{-1}{m}\,\sum_{i=1}^m \left[y^{(i)}\left(1-h_\theta\left(x^{(i)}\right)\right)x_j^{(i)}- \left(1-y^{(i)}\right)\,h_\theta\left(x^{(i)}\right)x_j^{(i)} \right] \\[2ex]\small\underset{\text{distribute}}=\,\frac{-1}{m}\,\sum_{i=1}^m \left[y^{(i)}-y^{(i)}h_\theta\left(x^{(i)}\right)- h_\theta\left(x^{(i)}\right)+y^{(i)}h_\theta\left(x^{(i)}\right) \right]\,x_j^{(i)} \\[2ex]\small\underset{\text{cancel}}=\,\frac{-1}{m}\,\sum_{i=1}^m \left[y^{(i)}-h_\theta\left(x^{(i)}\right)\right]\,x_j^{(i)} \\[2ex]\small=\frac{1}{m}\sum_{i=1}^m\left[h_\theta\left(x^{(i)}\right)-y^{(i)}\right]\,x_j^{(i)} $


The derivative of the sigmoid function, used in the step labeled $\sigma'$ above, is $\sigma'(z)=\sigma(z)\left(1-\sigma(z)\right)$.
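As an independent check of the final expression $\frac{1}{m}\sum_i\big(h_\theta(x^{(i)})-y^{(i)}\big)x_j^{(i)}$, the sketch below compares the analytic gradient with central finite differences on random synthetic data (NumPy assumed available; all sizes and seeds are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3                       # 50 synthetic examples, 3 parameters
X = rng.normal(size=(m, n))
y = (rng.random(m) < 0.5).astype(float)
theta = rng.normal(size=n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def J(theta):
    # Cross-entropy cost J(theta) averaged over the m examples
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Analytic gradient from the derivation: (1/m) * sum_i (h(x_i) - y_i) * x_i
grad = X.T @ (sigmoid(X @ theta) - y) / m

# Central finite differences as an independent numerical check
eps = 1e-6
fd = np.array([(J(theta + eps * e) - J(theta - eps * e)) / (2 * eps)
               for e in np.eye(n)])
assert np.allclose(grad, fd, atol=1e-6)
print("gradient matches finite differences")
```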

Answer 2 of 5 · 14 votes

To avoid the impression that the matter is excessively complex, let us just look at the structure of the solution.

With simplification and some abuse of notation, let $G(\theta)$ be a term in the sum of $J(\theta)$, and $h=\frac{1}{1+e^{-z}}$ is a function of $z(\theta)=x\theta$:

$$G = y\log(h) + (1-y)\log(1-h)$$

We may use the chain rule:

$$\frac{dG}{d\theta}=\frac{dG}{dh}\frac{dh}{dz}\frac{dz}{d\theta}$$

and solve it one by one ($y$ and $1-y$ are constants):

$$\frac{dG}{dh}=\frac{y}{h}-\frac{1-y}{1-h}=\frac{y-h}{h(1-h)}$$

For sigmoid $\frac{dh}{dz}=h(1-h)$ holds, which is just a denominator of the previous statement.

Finally, $\frac{dz}{d\theta}=x$.

Combining results all together gives the sought-for expression:

$$\frac{dG}{d\theta}=\frac{y-h}{h(1-h)}\cdot h(1-h)\cdot x=(y-h)\,x$$

Hope that helps.
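The factor-by-factor structure can be spot-checked for a single scalar training point; all values below are made up, and the product of the three chain-rule factors is compared both with a finite-difference derivative and with the combined result $(y-h)x$:

```python
import math

# One hypothetical scalar training point: z = x * theta, h = sigmoid(z),
# G = y*log(h) + (1-y)*log(1-h).
x, theta, y = 1.3, -0.4, 1.0
eps = 1e-6

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
G = lambda h: y * math.log(h) + (1 - y) * math.log(1 - h)

z = x * theta
h = sigmoid(z)

dG_dh = (y - h) / (h * (1 - h))   # = y/h - (1-y)/(1-h)
dh_dz = h * (1 - h)               # sigmoid derivative
dz_dtheta = x

# Finite-difference derivative of G with respect to theta
fd = (G(sigmoid(x * (theta + eps))) - G(sigmoid(x * (theta - eps)))) / (2 * eps)
assert abs(dG_dh * dh_dz * dz_dtheta - fd) < 1e-6
assert abs(dG_dh * dh_dz * dz_dtheta - (y - h) * x) < 1e-12  # combined result
print("chain-rule factors verified")
```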

Internal Pointers
The cost function in logistic regression - Internal Pointers
October 29, 2017 - If you try to use the linear regression's cost function to generate $J(\theta)$ in a logistic regression problem, you would end up with a non-convex function: a weirdly-shaped graph with no easy-to-find global minimum, as seen in the picture below.
Aman's AI Journal
Aman's AI Journal • Primers • Partial Derivative of the Cost Function for Logistic Regression
Let’s begin with the cost function used for logistic regression, which is the average of the log loss across all training examples, as given below: \[J(\theta)=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \log \left(h_{\theta}\left(x^{(i)}\right)\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right]\]
ML Glossary
Logistic Regression — ML Glossary documentation
The benefits of taking the logarithm reveal themselves when you look at the cost function graphs for y=1 and y=0. These smooth monotonic functions [7] (always increasing or always decreasing) make it easy to calculate the gradient and minimize cost. Image from Andrew Ng’s slides on logistic regression [1].
Medium
011: Understanding Logistic Regression (Cost Function and Optimization) | by ArnanBonny | Medium
November 22, 2025 - Hence we use the log loss or cross-entropy loss function for logistic regression. This particular cost function is derived from statistics using a statistical principle called Maximum Likelihood Estimation.
Analytics Vidhya
Logistic Regression in Machine Learning
April 25, 2025 - In linear regression, we use the ... cost function in linear regression is like this: In logistic regression Yi is a non-linear function (Ŷ=1​/1+ ......
Nucleusbox
Cost Function in Logistic Regression: Explanation & Insights
March 26, 2025 - So in order to get the parameter θ of the hypothesis, we can either maximize the likelihood or minimize the cost function. Now we can take a log of the above logistic regression likelihood equation.
SlideShare
Derivation of cost function for logistic regression | PPT
January 8, 2017 - This document discusses logistic regression and its cost function. It introduces zero-one classification and the softmax function, which generalizes the logistic function to represent a categorical distribution for multi-class classification problems.
Quora
How to take the derivative of Logistic regression cost function for Gradient Descent - Quora
Answer: To start, here is a super slick way of writing the probability of one datapoint: Since each datapoint is independent, the probability of all the data is: And if you take the log of this function, you get the reported Log Likelihood for Logistic Regression. The next step is to calculate...
Stanford
Logistic Regression Chris Piech CS109 Handout #40 May 20th, 2016
May 20, 2016 - In the case of Logistic Regression you can prove that the result will always be a global maximum. The small step that we continually take given the training dataset can be calculated as: ... Where η is the magnitude of the step size that we take. If you keep updating θ using the equation above you ... In this section we provide the mathematical derivations for the log-likelihood function and the gradient.
Medium
Beginner’s Guide to Finding Gradient/Derivative of Log Loss by Hand (Detailed Steps) | by Abid Ilmun Fisabil | Medium
August 17, 2022 - Putting all the results of each function we have derived, we have the following function. Which then to be known as the derivative/gradient of our logistic regression’s cost function. Below is the gradient of our cost function with respect to w (weights).
Top answer · 1 of 5 · 57 votes

Source: my own notes taken during Stanford's Machine Learning course on Coursera, by Andrew Ng. All credits to him and this organization. The course is freely available for anybody to take at their own pace. The images are made by myself using LaTeX (formulas) and R (graphics).

Hypothesis function

Logistic regression is used when the variable y that is wanted to be predicted can only take discrete values (i.e.: classification).

Considering a binary classification problem (y can only take two values), then having a set of parameters θ and a set of input features x, the hypothesis function could be defined so that it is bounded between [0, 1], in which g() represents the sigmoid function:

$$h_\theta(x) = g(\theta^\top x), \qquad g(z) = \frac{1}{1+e^{-z}}$$

This hypothesis function at the same time represents the estimated probability that y = 1 on input x parameterized by θ:

$$h_\theta(x) = P(y = 1 \mid x;\, \theta)$$

Cost function

The cost function represents the optimization objective.

A possible definition of the cost function would be the mean of the Euclidean distance between the hypothesis h_θ(x) and the actual value y over all m samples in the training set. However, because the hypothesis is formed with the sigmoid function, that definition results in a non-convex cost function, which means a local minimum could easily be reached before the global minimum. In order to ensure the cost function is convex (and therefore ensure convergence to the global minimum), the cost function is transformed using the logarithm of the sigmoid function.

This way the optimization objective function can be defined as the mean of the costs/errors in the training set:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^m \operatorname{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right), \qquad \operatorname{Cost}(h_\theta(x), y) = \begin{cases} -\log\left(h_\theta(x)\right) & \text{if } y = 1 \\ -\log\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases}$$
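A minimal sketch (hypothetical parameter and feature values) confirming that the per-class costs $-\log(h)$ for $y=1$ and $-\log(1-h)$ for $y=0$ coincide with the combined form $-y\log(h)-(1-y)\log(1-h)$, and that the sigmoid keeps the hypothesis inside $(0, 1)$:

```python
import math

def h(theta, x):
    # Sigmoid hypothesis h_theta(x) = g(theta . x)
    return 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))

def cost_piecewise(p, y):
    # -log(p) if y == 1, else -log(1 - p)
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

def cost_combined(p, y):
    # -y*log(p) - (1-y)*log(1-p)
    return -y * math.log(p) - (1 - y) * math.log(1.0 - p)

theta = [0.5, -1.0]                   # hypothetical parameters
for x in ([1.0, 0.2], [1.0, -0.7]):   # first feature acts as an intercept
    p = h(theta, x)
    assert 0.0 < p < 1.0              # sigmoid keeps the hypothesis in (0, 1)
    for y in (0, 1):
        assert abs(cost_piecewise(p, y) - cost_combined(p, y)) < 1e-12
print("piecewise and combined costs agree")
```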

Answer 2 of 5 · 19 votes

This cost function is simply a reformulation of the maximum-(log-)likelihood criterion.

The model of the logistic regression is:

P(y=1 | x) = logistic(θ x)
P(y=0 | x) = 1 - P(y=1 | x) = 1 - logistic(θ x)

The likelihood is written as:

L = P(y_0, ..., y_n | x_0, ..., x_n) = \prod_i P(y_i | x_i)

The log-likelihood is:

l = log L = \sum_i log P(y_i | x_i)

We want to find θ which maximizes the likelihood:

max_θ \prod_i P(y_i | x_i)

This is the same as maximizing the log-likelihood:

max_θ \sum_i log P(y_i | x_i)

We can rewrite this as a minimization of the cost C=-l:

min_θ \sum_i - log P(y_i | x_i)
  P(y_i | x_i) = logistic(θ x_i)      when y_i = 1
  P(y_i | x_i) = 1 - logistic(θ x_i)  when y_i = 0
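Because log is monotone, maximizing the likelihood and minimizing the cost C = -l pick out the same θ. A toy check with made-up data and a one-dimensional parameter grid:

```python
import math

# Made-up binary data and a hypothetical scalar parameter theta.
xs = [-1.0, 0.5, 2.0, -0.3]
ys = [0, 1, 1, 0]

def p_y_given_x(theta, x, y):
    p1 = 1.0 / (1.0 + math.exp(-theta * x))   # P(y=1 | x) = logistic(theta*x)
    return p1 if y == 1 else 1.0 - p1

def likelihood(theta):
    # L = product over i of P(y_i | x_i)
    L = 1.0
    for x, y in zip(xs, ys):
        L *= p_y_given_x(theta, x, y)
    return L

def cost(theta):
    # C = -l = -sum over i of log P(y_i | x_i)
    return -sum(math.log(p_y_given_x(theta, x, y)) for x, y in zip(xs, ys))

# The theta picked by maximizing L and by minimizing C must agree.
thetas = [-1.0, 0.0, 0.5, 1.0, 2.0]
best_by_L = max(thetas, key=likelihood)
best_by_C = min(thetas, key=cost)
assert best_by_L == best_by_C
print("maximizing likelihood and minimizing cost agree")
```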
Top answer · 1 of 6 · 29 votes

Yes, you can define your own loss function, but if you're a novice, you're probably better off using one from the literature. There are conditions that loss functions should meet:

  1. They should approximate the actual loss you're trying to minimize. As was said in the other answer, the standard loss function for classification is the zero-one loss (misclassification rate), and the losses used for training classifiers are approximations of it.

    The squared-error loss from linear regression isn't used because it doesn't approximate zero-one-loss well: when your model predicts +50 for some sample while the intended answer was +1 (positive class), the prediction is on the correct side of the decision boundary so the zero-one-loss is zero, but the squared-error loss is still 49² = 2401. Some training algorithms will waste a lot of time getting predictions very close to {-1, +1} instead of focusing on getting just the sign/class label right.(*)

  2. The loss function should work with your intended optimization algorithm. That's why zero-one-loss is not used directly: it doesn't work with gradient-based optimization methods since it doesn't have a well-defined gradient (or even a subgradient, like the hinge loss for SVMs has).

    The main algorithm that optimizes the zero-one-loss directly is the old perceptron algorithm.

Also, when you plug in a custom loss function, you're no longer building a logistic regression model but some other kind of linear classifier.

(*) Squared error is used with linear discriminant analysis, but that's usually solved in closed form instead of iteratively.
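The +50 example in point 1 can be made concrete (labels taken in {-1, +1}; the prediction value is hypothetical):

```python
def zero_one_loss(pred, label):
    # 0 if the prediction is on the correct side of the decision boundary, else 1
    return 0 if pred * label > 0 else 1

def squared_error(pred, label):
    return (pred - label) ** 2

pred, label = 50.0, 1.0   # confidently correct prediction, intended answer +1
assert zero_one_loss(pred, label) == 0          # correct side: no actual loss
assert squared_error(pred, label) == 49.0 ** 2  # 2401: heavily penalized anyway
print("zero-one:", zero_one_loss(pred, label), "squared:", squared_error(pred, label))
```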

Answer 2 of 6 · 10 votes

The logistic function, hinge-loss, smoothed hinge-loss, etc. are used because they are upper bounds on the zero-one binary classification loss.

These functions generally also penalize examples that are correctly classified but are still near the decision boundary, thus creating a "margin."

So, if you are doing binary classification, then you should certainly choose a standard loss function.

If you are trying to solve a different problem, then a different loss function will likely perform better.
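The upper-bound claim can be spot-checked on a grid of margins m = y·f(x). Note that the logistic loss must be taken base-2 for the bound to hold at m = 0 (the hinge loss needs no scaling); the grid below is arbitrary:

```python
import math

def zero_one(m):
    # 1 if misclassified; a margin of exactly 0 is counted as an error here
    return 1.0 if m <= 0 else 0.0

def hinge(m):
    # max(0, 1 - m)
    return max(0.0, 1.0 - m)

def logistic_base2(m):
    # log2(1 + e^(-m)); equals exactly 1 at m = 0, so it bounds zero-one loss
    return math.log2(1.0 + math.exp(-m))

for i in range(-60, 61):
    m = i / 10.0
    assert hinge(m) >= zero_one(m)
    assert logistic_base2(m) >= zero_one(m)
print("both losses upper-bound the zero-one loss on the grid")
```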

ML Explained
Logistic Regression - ML Explained
September 29, 2020 - Notice that the result is identical to the one of Linear Regression. First we need to calculate the derivative of the sigmoid function. The derivative of the sigmoid function is quite easy to calculate using the quotient rule. Now we are ready to find the partial derivative. Even though Logistic Regression was created to solve binary classification problems, it can also be used for more than two classes.