Here is my try

$$J(x) = -\frac{1}{m}\sum_{i = 1}^{m} b_i\ln(h_i) + (1 - b_i)\ln(1 - h_i)$$

where $h_i = \sigma(a_i^Tx)$ and $a_i^T$ is the $i$-th row of the data matrix $A$. Let $h = \sigma(Ax)$. Assuming $\sigma$ and $\ln$ work element-wise on vectors, $\circ$ is element-wise multiplication and $\mathbf{1}$ is a vector of $1$s, we have

$$J(x) = -\frac{1}{m}\left(b^T\ln(h) + (\mathbf{1} - b)^T\ln(\mathbf{1} - h)\right)$$

Now

$$\nabla J(x) = -\frac{1}{m}A^T\left(b \circ (\mathbf{1} - h) - (\mathbf{1} - b) \circ h\right) = \frac{1}{m}A^T(h - b)$$

Answer from Łukasz Grad on Stack Exchange
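As a sanity check, the standard vectorized gradient of the cross-entropy cost, $\nabla J(x) = \frac{1}{m}A^T(\sigma(Ax) - b)$, can be compared against central finite differences of the cost itself. A minimal NumPy sketch on synthetic data (all names here are illustrative, not from the answer):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(A, b, x):
    # J(x) = -(1/m) * sum_i [ b_i ln(h_i) + (1 - b_i) ln(1 - h_i) ]
    h = sigmoid(A @ x)
    return -(b @ np.log(h) + (1 - b) @ np.log(1 - h)) / len(b)

def grad(A, b, x):
    # vectorized gradient: (1/m) * A^T (h - b)
    return A.T @ (sigmoid(A @ x) - b) / len(b)

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))                    # toy data matrix
b = rng.integers(0, 2, size=50).astype(float)   # toy 0/1 labels
x = rng.normal(size=3)                          # arbitrary coefficient vector

# central finite differences along each coordinate direction
eps = 1e-6
numeric = np.array([(cost(A, b, x + eps * e) - cost(A, b, x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(numeric, grad(A, b, x), atol=1e-6))  # True
```

The agreement between the analytic and numeric gradients confirms the final line of the derivation.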
Top answer (1 of 2, score 20)

In linear regression, the Maximum Likelihood Estimation (MLE) solution for estimating $x$ has the following closed-form solution (assuming $A$ is a matrix with full column rank):

$$\hat{x}_\text{lin}=\underset{x}{\text{argmin}} \|Ax-b\|_2^2 = (A^TA)^{-1}A^Tb$$

This is read as "find the $x$ that minimizes the objective function, $\|Ax-b\|_2^2$". The nice thing about representing the linear regression objective function this way is that we can keep everything in matrix notation and solve for $\hat{x}_\text{lin}$ by hand. As Alex R. mentions, in practice we often don't compute $(A^TA)^{-1}$ directly because it is computationally inefficient and $A$ often does not meet the full-rank criterion. Instead, we turn to the Moore-Penrose pseudoinverse. Computing the pseudoinverse in practice can involve the Cholesky decomposition or the Singular Value Decomposition.
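To illustrate, on a well-conditioned problem the normal-equations solve and an SVD-based least-squares routine (the pseudoinverse route) agree. A small NumPy sketch with synthetic data (all names are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 4))                  # full column rank with probability 1
x_true = np.array([1.0, -2.0, 0.5, 3.0])
b = A @ x_true + 0.01 * rng.normal(size=100)   # slightly noisy observations

# normal equations: fine here, but ill-advised when A is ill-conditioned
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

# SVD-based least squares, as used by np.linalg.lstsq / pinv internally
x_svd, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_ne, x_svd))  # True
```

On ill-conditioned or rank-deficient problems only the SVD route remains reliable, which is the point made above.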

Alternatively, the MLE solution for estimating the coefficients in logistic regression is:

$$\hat{x}_\text{log} = \underset{x}{\text{argmin}} \sum_{i=1}^{N} y^{(i)}\log(1+e^{-x^Ta^{(i)}}) + (1-y^{(i)})\log(1+e^{x^T a^{(i)}})$$

where (assuming each sample of data is stored row-wise):

- $x$ is a vector representing the regression coefficients
- $a^{(i)}$ is a vector representing the $i^{th}$ sample/row of the data matrix $A$
- $y^{(i)}$ is a scalar in $\{0, 1\}$, the label corresponding to the $i^{th}$ sample
- $N$ is the number of data samples (the number of rows of the data matrix $A$)

Again, this is read as "find the $x$ that minimizes the objective function".

If you wanted to, you could take it a step further and represent $\hat{x}_\text{log}$ in matrix notation as follows:

$$ \hat{x}_\text{log} = \underset{x}{\text{argmin}}\; \operatorname{tr}\left( \begin{bmatrix} y^{(1)} & (1-y^{(1)}) \\ \vdots & \vdots \\ y^{(N)} & (1-y^{(N)})\\\end{bmatrix} \begin{bmatrix} \log(1+e^{-x^Ta^{(1)}}) & \cdots & \log(1+e^{-x^Ta^{(N)}}) \\ \log(1+e^{x^Ta^{(1)}}) & \cdots & \log(1+e^{x^Ta^{(N)}}) \end{bmatrix} \right) $$

but you don't gain anything from doing this. Logistic regression does not have a closed-form solution and does not gain the same benefits from matrix notation that linear regression does. To solve for $\hat{x}_\text{log}$, estimation techniques such as gradient descent and the Newton-Raphson method are used. Through some of these techniques (e.g., Newton-Raphson), $\hat{x}_\text{log}$ is approximated and can be represented in matrix notation (see the link provided by Alex R.).
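Since there is no closed form, here is a minimal Newton-Raphson sketch for $\hat{x}_\text{log}$, using the standard gradient $A^T(h-y)$ and Hessian $A^T \operatorname{diag}(h(1-h))\,A$ of the negative log-likelihood. The synthetic, non-separable data and all names are assumptions; a production solver would add a line search or regularization:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(A, y, iters=15):
    """Newton-Raphson on the logistic negative log-likelihood."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        h = sigmoid(A @ x)                       # predicted probabilities
        g = A.T @ (h - y)                        # gradient
        H = A.T @ (A * (h * (1 - h))[:, None])   # Hessian
        x -= np.linalg.solve(H, g)               # Newton step
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 3))
x_true = np.array([0.5, -1.0, 0.25])
y = (rng.random(500) < sigmoid(A @ x_true)).astype(float)

x_hat = newton_logistic(A, y)
print(np.linalg.norm(A.T @ (sigmoid(A @ x_hat) - y)))  # ~0 at the optimum
```

At convergence the gradient vanishes, i.e. the fit satisfies the score equation discussed in the second answer below.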

Answer 2 of 2 (score 18)

@joceratops' answer focuses on the optimization problem of maximum likelihood estimation. This is indeed a flexible approach that is amenable to many types of problems. For estimating most models, including linear and logistic regression, there is another general approach based on method-of-moments estimation.

The linear regression estimator can also be formulated as the root of the estimating equation:

$$0 = \mathbf{X}^T(Y - \mathbf{X}\beta)$$

In this regard, $\beta$ is seen as the value which retrieves an average residual of 0. It needn't rely on any underlying probability model to have this interpretation. It is, however, interesting to derive the score equations for a normal likelihood: you will see that they take exactly the form displayed above. Maximizing the likelihood of a regular exponential family for a linear model (e.g. linear or logistic regression) is equivalent to obtaining solutions to its score equations.
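Numerically, the least-squares fit does solve this estimating equation: the residuals come out orthogonal to every column of $\mathbf{X}$. A small sketch on synthetic data (names assumed for the example):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
Y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=200)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# the estimating equation 0 = X^T (Y - X beta), evaluated at the fit
score = X.T @ (Y - X @ beta_hat)
print(np.allclose(score, 0, atol=1e-8))  # True
```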

$$0 = \sum_{i=1}^n S_i(\alpha, \beta) = \frac{\partial}{\partial \beta} \log \mathcal{L}( \beta, \alpha, X, Y) = \mathbf{X}^T (Y - g(\mathbf{X}\beta))$$

where $Y_i$ has expected value $g(\mathbf{X}_i \beta)$. In GLM estimation, $g$ is said to be the inverse of a link function. In the normal likelihood equations, $g^{-1}$ is the identity function, and in logistic regression $g^{-1}$ is the logit function. A more general approach would be to require $0 = \sum_{i=1}^n \left(Y_i - g(\mathbf{X}_i\beta)\right)$, which allows for model misspecification.

Additionally, it is interesting to note that for regular exponential families, $\frac{\partial g(\mathbf{X}\beta)}{\partial \beta} = \mathbf{V}(g(\mathbf{X}\beta))$, which is called a mean-variance relationship. Indeed, for logistic regression the mean $p = g(\mathbf{X}\beta)$ is related to the variance by $\mbox{var}(Y_i) = p_i(1-p_i)$. This suggests interpreting a misspecified GLM as one which gives a 0 average Pearson residual, and it suggests a further generalization allowing non-proportional functional mean derivatives and mean-variance relationships.
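For the logistic case this is just the sigmoid derivative identity $\sigma'(\eta) = \sigma(\eta)(1-\sigma(\eta))$, which equates the slope of the mean function with the Bernoulli variance at that mean. It is easy to verify numerically (a small sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4, 4, 9)
eps = 1e-6
num_deriv = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

# the derivative of the mean equals the variance p(1 - p)
print(np.allclose(num_deriv, sigmoid(z) * (1 - sigmoid(z)), atol=1e-8))  # True
```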

A generalized estimating equation approach would specify linear models in the following way:

$$0 = \frac{\partial g(\mathbf{X}\beta)}{\partial \beta} \mathbf{V}^{-1}\left(Y - g(\mathbf{X}\beta)\right)$$

With $\mathbf{V}$ a matrix of variances based on the fitted value (mean) given by $g(\mathbf{X}\beta)$. This approach to estimation allows one to pick a link function and mean variance relationship as with GLMs.

In logistic regression $g$ would be the inverse logit, and $V_{ii}$ would be given by $g(\mathbf{X}_i \beta)(1-g(\mathbf{X}_i\beta))$. The solutions to this estimating equation, obtained by Newton-Raphson, will yield the $\beta$ obtained from logistic regression. However, a somewhat broader class of models is estimable under a similar framework. For instance, the link function can be taken to be the log of the linear predictor, so that the regression coefficients are relative risks rather than odds ratios; given the well-documented pitfalls of interpreting ORs as RRs, this behooves me to ask why anyone fits logistic regression models at all anymore.
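As an illustration of that last point, the estimating equation above can be solved by Fisher scoring with a log link, so the fitted coefficients are log relative risks. The sketch below is an assumed toy setup, not a robust log-binomial fitter: it relies on the fitted means staying inside $(0, 1)$, which the synthetic data is chosen to guarantee:

```python
import numpy as np

def fit_log_link(A, y, iters=50):
    """Fisher scoring for 0 = D^T V^{-1} (y - mu), with mu = exp(A beta)
    and V = diag(mu (1 - mu)); coefficients are log relative risks."""
    beta = np.array([np.log(y.mean())] + [0.0] * (A.shape[1] - 1))
    for _ in range(iters):
        mu = np.exp(A @ beta)           # fitted means, must stay below 1
        D = mu[:, None] * A             # d mu / d beta for the log link
        W = 1.0 / (mu * (1 - mu))       # diagonal of V^{-1}
        beta += np.linalg.solve(D.T @ (W[:, None] * D), D.T @ (W * (y - mu)))
    return beta

rng = np.random.default_rng(0)
A = np.column_stack([np.ones(2000), rng.random(2000)])  # intercept + covariate in [0, 1]
beta_true = np.array([-2.0, 0.5])                       # keeps mu well below 1
y = (rng.random(2000) < np.exp(A @ beta_true)).astype(float)

beta = fit_log_link(A, y)
mu = np.exp(A @ beta)
# estimating-equation residual at the fit; ~0 at convergence
print(np.linalg.norm((mu[:, None] * A).T @ ((y - mu) / (mu * (1 - mu)))))
```

Here `exp(beta[1])` is the fitted relative risk per unit of the covariate, with no odds-ratio interpretation needed.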