Adapted from the course notes for Andrew Ng's Coursera Machine Learning course; as far as I can tell, this derivation is only available in the student-contributed notes on that course's page.


In what follows, the superscript $(i)$ denotes individual measurements or training "examples."

$\small \frac{\partial J(\theta)}{\partial \theta_j} = \frac{\partial}{\partial \theta_j} \,\frac{-1}{m}\sum_{i=1}^m \left[ y^{(i)}\log\left(h_\theta \left(x^{(i)}\right)\right) + (1 -y^{(i)})\log\left(1-h_\theta \left(x^{(i)}\right)\right)\right] \\[2ex]\small\underset{\text{linearity}}= \,\frac{-1}{m}\,\sum_{i=1}^m \left[ y^{(i)}\frac{\partial}{\partial \theta_j}\log\left(h_\theta \left(x^{(i)}\right)\right) + (1 -y^{(i)})\frac{\partial}{\partial \theta_j}\log\left(1-h_\theta \left(x^{(i)}\right)\right) \right] \\[2ex]\Tiny\underset{\text{chain rule}}= \,\frac{-1}{m}\,\sum_{i=1}^m \left[ y^{(i)}\frac{\frac{\partial}{\partial \theta_j}h_\theta \left(x^{(i)}\right)}{h_\theta\left(x^{(i)}\right)} + (1 -y^{(i)})\frac{\frac{\partial}{\partial \theta_j}\left(1-h_\theta \left(x^{(i)}\right)\right)}{1-h_\theta\left(x^{(i)}\right)} \right] \\[2ex]\small\underset{h_\theta(x)=\sigma\left(\theta^\top x\right)}=\,\frac{-1}{m}\,\sum_{i=1}^m \left[ y^{(i)}\frac{\frac{\partial}{\partial \theta_j}\sigma\left(\theta^\top x^{(i)}\right)}{h_\theta\left(x^{(i)}\right)} + (1 -y^{(i)})\frac{\frac{\partial}{\partial \theta_j}\left(1-\sigma\left(\theta^\top x^{(i)}\right)\right)}{1-h_\theta\left(x^{(i)}\right)} \right] \\[2ex]\Tiny\underset{\sigma'}=\frac{-1}{m}\,\sum_{i=1}^m \left[ y^{(i)}\, \frac{\sigma\left(\theta^\top x^{(i)}\right)\left(1-\sigma\left(\theta^\top x^{(i)}\right)\right)\frac{\partial}{\partial \theta_j}\left(\theta^\top x^{(i)}\right)}{h_\theta\left(x^{(i)}\right)} - (1 -y^{(i)})\,\frac{\sigma\left(\theta^\top x^{(i)}\right)\left(1-\sigma\left(\theta^\top x^{(i)}\right)\right)\frac{\partial}{\partial \theta_j}\left(\theta^\top x^{(i)}\right)}{1-h_\theta\left(x^{(i)}\right)} \right] \\[2ex]\small\underset{\sigma\left(\theta^\top x\right)=h_\theta(x)}= \,\frac{-1}{m}\,\sum_{i=1}^m \left[ y^{(i)}\frac{h_\theta\left( x^{(i)}\right)\left(1-h_\theta\left( x^{(i)}\right)\right)\frac{\partial}{\partial \theta_j}\left(\theta^\top x^{(i)}\right)}{h_\theta\left(x^{(i)}\right)} - (1 -y^{(i)})\frac{h_\theta\left( x^{(i)}\right)\left(1-h_\theta\left(x^{(i)}\right)\right)\frac{\partial}{\partial \theta_j}\left( \theta^\top x^{(i)}\right)}{1-h_\theta\left(x^{(i)}\right)} \right] \\[2ex]\small\underset{\frac{\partial}{\partial \theta_j}\left(\theta^\top x^{(i)}\right)=x_j^{(i)}}=\,\frac{-1}{m}\,\sum_{i=1}^m \left[y^{(i)}\left(1-h_\theta\left(x^{(i)}\right)\right)x_j^{(i)}- \left(1-y^{(i)}\right)\,h_\theta\left(x^{(i)}\right)x_j^{(i)} \right] \\[2ex]\small\underset{\text{distribute}}=\,\frac{-1}{m}\,\sum_{i=1}^m \left[y^{(i)}-y^{(i)}h_\theta\left(x^{(i)}\right)- h_\theta\left(x^{(i)}\right)+y^{(i)}h_\theta\left(x^{(i)}\right) \right]\,x_j^{(i)} \\[2ex]\small\underset{\text{cancel}}=\,\frac{-1}{m}\,\sum_{i=1}^m \left[y^{(i)}-h_\theta\left(x^{(i)}\right)\right]\,x_j^{(i)} \\[2ex]\small=\frac{1}{m}\sum_{i=1}^m\left[h_\theta\left(x^{(i)}\right)-y^{(i)}\right]\,x_j^{(i)} $


The derivative of the sigmoid function, used in the step labeled $\sigma'$ above, is $\sigma'(z) = \sigma(z)\left(1-\sigma(z)\right)$.
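For completeness, that identity can be checked in one line by differentiating $\sigma(z)=\left(1+e^{-z}\right)^{-1}$ directly:

$\sigma'(z) = \frac{e^{-z}}{\left(1+e^{-z}\right)^{2}} = \frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}} = \sigma(z)\left(1-\sigma(z)\right)$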

Answer from Antoni Parellada on Stack Exchange
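The final expression can also be verified numerically. Below is a minimal NumPy sketch (my own illustration, not part of the answer above) that compares the analytic gradient $\frac{1}{m}\sum_{i=1}^m\left[h_\theta\left(x^{(i)}\right)-y^{(i)}\right]x_j^{(i)}$ against a central finite-difference approximation of $J(\theta)$ on random data; all variable names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ]
    h = sigmoid(X @ theta)
    m = len(y)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

def analytic_grad(theta, X, y):
    # dJ/dtheta_j = (1/m) * sum[ (h - y) * x_j ]
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / len(y)

def numeric_grad(theta, X, y, eps=1e-6):
    # central finite differences, one coordinate at a time
    g = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta)
        e[j] = eps
        g[j] = (cost(theta + e, X, y) - cost(theta - e, X, y)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) < 0.5).astype(float)
theta = rng.normal(size=3)

print(np.max(np.abs(analytic_grad(theta, X, y) - numeric_grad(theta, X, y))))
# should be ~1e-9 or smaller if the analytic gradient is correct
```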


To avoid the impression that the matter is excessively complex, let us just look at the structure of the solution.

With some simplification and abuse of notation, let $G(\theta)$ be a single term in the sum defining $J(\theta)$, and let $h = \sigma(z)$ be a function of $z(\theta) = \theta^\top x$:

$G = y\,\log(h) + (1-y)\,\log(1-h)$

We may use the chain rule,

$\frac{\partial G}{\partial \theta_j} = \frac{\partial G}{\partial h}\cdot\frac{\partial h}{\partial z}\cdot\frac{\partial z}{\partial \theta_j},$

and solve it one factor at a time ($x$ and $y$ are constants):

$\frac{\partial G}{\partial h} = \frac{y}{h} - \frac{1-y}{1-h} = \frac{y-h}{h\,(1-h)}$

For the sigmoid, $\frac{\partial h}{\partial z} = h\,(1-h)$ holds, which is just the denominator of the previous statement.

Finally, $\frac{\partial z}{\partial \theta_j} = x_j$.

Combining all the results gives the sought-for expression:

$\frac{\partial G}{\partial \theta_j} = \frac{y-h}{h\,(1-h)}\cdot h\,(1-h)\cdot x_j = (y-h)\,x_j$

Hope that helps.
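A tiny numeric illustration of that chain-rule factorization (a sketch with arbitrary single-example values; none of the numbers come from the answer itself): each factor is computed separately and their product matches $(y-h)\,x$.

```python
import numpy as np

# one training example with a single feature, arbitrary values
x, y, theta = 2.0, 1.0, -0.3

z = theta * x                       # z(theta) = theta * x
h = 1.0 / (1.0 + np.exp(-z))        # h = sigmoid(z)

dG_dh = y / h - (1 - y) / (1 - h)   # = (y - h) / (h * (1 - h))
dh_dz = h * (1 - h)                 # sigmoid derivative
dz_dtheta = x

print(dG_dh * dh_dz * dz_dtheta)    # chain-rule product
print((y - h) * x)                  # closed form; the two values agree
```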


Source: my own notes taken during Stanford's Machine Learning course on Coursera, by Andrew Ng. All credit to him and this organization. The course is freely available for anybody to take at their own pace. The images were made by me using LaTeX (formulas) and R (graphics).

Hypothesis function

Logistic regression is used when the variable y to be predicted can only take discrete values (i.e., classification).

Considering a binary classification problem (y can take only two values), and given a set of parameters θ and a set of input features x, the hypothesis function can be defined so that it is bounded between 0 and 1, where g() represents the sigmoid function:

$h_\theta(x) = g\left(\theta^\top x\right) = \frac{1}{1 + e^{-\theta^\top x}}$

This hypothesis function at the same time represents the estimated probability that y = 1 on input x, parameterized by θ:

$h_\theta(x) = P\left(y = 1 \mid x;\, \theta\right)$
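As a concrete illustration of the bounded hypothesis and its probabilistic reading, here is a short NumPy sketch; the function names, the 0.5 threshold, and the example numbers are mine rather than from the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, X):
    # h_theta(x): estimated probability that y = 1 for each row of X
    return sigmoid(X @ theta)

def predict(theta, X, threshold=0.5):
    # turn the probability into a discrete class label
    return (predict_proba(theta, X) >= threshold).astype(int)

theta = np.array([-1.0, 2.0])            # made-up parameters
X = np.array([[1.0, 0.2], [1.0, 1.5]])   # intercept term + one feature
print(predict_proba(theta, X))           # values strictly between 0 and 1
print(predict(theta, X))                 # 0/1 class labels
```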

Cost function

The cost function represents the optimization objective.

A possible definition of the cost function would be the mean of the Euclidean distance between the hypothesis $h_\theta(x)$ and the actual value y over all m samples in the training set. However, because the hypothesis is formed with the sigmoid function, that definition results in a non-convex cost function, which means a local minimum could easily be reached before the global minimum. To ensure the cost function is convex (and therefore ensure convergence to the global minimum), it is instead built from the logarithm of the sigmoid hypothesis.
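One quick way to see the convexity difference numerically is to scan a single-example cost over a grid of θ values and check the sign of its discrete second differences; the sketch below (my own, with x = 1 and y = 1) does exactly that.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# single training example: x = 1, y = 1
thetas = np.linspace(-4.0, 4.0, 201)
h = sigmoid(thetas)                 # hypothesis as a function of theta

mse_cost = (h - 1.0) ** 2           # squared-error cost
log_cost = -np.log(h)               # log-loss cost for y = 1

# discrete second differences: negative values indicate non-convex regions
d2_mse = np.diff(mse_cost, 2)
d2_log = np.diff(log_cost, 2)

print(d2_mse.min() < 0)   # True: squared error through a sigmoid is not convex
print(d2_log.min() >= 0)  # True: the log loss is convex in theta
```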

The optimization objective can then be defined as the mean of the per-example costs over the training set:

$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log\left(h_\theta\left(x^{(i)}\right)\right) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta\left(x^{(i)}\right)\right)\right]$
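A compact NumPy version of that objective, as a sketch: the small `eps` clamp is my own addition to keep the logarithms finite and is not part of the original notes.

```python
import numpy as np

def hypothesis(theta, X):
    # h_theta(x) = sigmoid(theta^T x), computed for every row of X at once
    return 1.0 / (1.0 + np.exp(-(X @ theta)))

def log_loss_cost(theta, X, y, eps=1e-12):
    # J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ]
    h = np.clip(hypothesis(theta, X), eps, 1 - eps)  # avoid log(0)
    m = len(y)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m

# toy usage with made-up numbers
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 3.1]])  # first column = intercept
y = np.array([1.0, 0.0, 1.0])
theta = np.zeros(2)
print(log_loss_cost(theta, X, y))  # = log(2) ≈ 0.693 when h = 0.5 everywhere
```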


This cost function is simply a reformulation of the maximum-(log-)likelihood criterion.

The model of logistic regression is (writing $\sigma$ for the logistic/sigmoid function):

$P(y=1 \mid x) = \sigma\left(\theta^\top x\right)$
$P(y=0 \mid x) = 1 - P(y=1 \mid x) = 1 - \sigma\left(\theta^\top x\right)$

Assuming independent examples, the likelihood of the data is written as:

$L = P\left(y_0, \ldots, y_n \mid x_0, \ldots, x_n\right) = \prod_i P\left(y_i \mid x_i\right)$

The log-likelihood is:

$\ell = \log L = \sum_i \log P\left(y_i \mid x_i\right)$

We want to find θ which maximizes the likelihood:

$\max_\theta \prod_i P\left(y_i \mid x_i\right)$

This is the same as maximizing the log-likelihood:

$\max_\theta \sum_i \log P\left(y_i \mid x_i\right)$

We can rewrite this as minimization of the cost $C = -\ell$:

$\min_\theta \sum_i -\log P\left(y_i \mid x_i\right)$
  $P\left(y_i \mid x_i\right) = \sigma\left(\theta^\top x_i\right)$   when $y_i = 1$
  $P\left(y_i \mid x_i\right) = 1 - \sigma\left(\theta^\top x_i\right)$   when $y_i = 0$
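Up to the constant $\frac{1}{m}$ factor, this is exactly the cost $J(\theta)$ used above. A small numeric check of the equivalence between the two-case form and the single log-loss formula (arbitrary example values; all names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.4, -0.7])            # arbitrary parameters
X = np.array([[1.0, 2.0], [1.0, -0.5]])  # two examples (intercept + one feature)
y = np.array([1.0, 0.0])

h = sigmoid(X @ theta)

# two-case form: P(y_i | x_i) is h_i when y_i = 1 and (1 - h_i) when y_i = 0
p = np.where(y == 1, h, 1 - h)
cost_two_case = -np.sum(np.log(p))

# single-formula form: -sum[ y*log(h) + (1-y)*log(1-h) ]
cost_log_loss = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

print(cost_two_case, cost_log_loss)  # identical up to floating-point rounding
```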