Here is my try

$$J(x) = -\frac{1}{m}\sum_{i = 1}^{m} b_i\ln(h_i) + (1 - b_i)\ln(1 - h_i)$$

where $h_i = \sigma(a_i^Tx)$ and $a_i^T$ is the $i$-th row of the data matrix $A$. Let $h = \sigma(Ax)$. Assuming $\sigma$ and $\ln$ work element-wise on vectors, $\circ$ is element-wise multiplication and $\mathbf{1}$ is a vector of $1$s, we have

$$J(x) = -\frac{1}{m}\left(b^T\ln(h) + (\mathbf{1} - b)^T\ln(\mathbf{1} - h)\right)$$

Now

$$\nabla J(x) = -\frac{1}{m}A^T\left(b \circ (\mathbf{1} - h) - (\mathbf{1} - b) \circ h\right) = \frac{1}{m}A^T(h - b)$$

Answer from Łukasz Grad on Stack Exchange
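As a sanity check, the standard vectorized gradient of the cross-entropy cost, $\nabla J(x) = \frac{1}{m}A^T(\sigma(Ax) - b)$, can be compared against central finite differences of the cost itself. A minimal NumPy sketch on synthetic data (all names here are illustrative, not from the answer):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(A, b, x):
    # J(x) = -(1/m) * sum_i [ b_i ln(h_i) + (1 - b_i) ln(1 - h_i) ]
    h = sigmoid(A @ x)
    return -(b @ np.log(h) + (1 - b) @ np.log(1 - h)) / len(b)

def grad(A, b, x):
    # vectorized gradient: (1/m) * A^T (h - b)
    return A.T @ (sigmoid(A @ x) - b) / len(b)

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))                    # toy data matrix
b = rng.integers(0, 2, size=50).astype(float)   # toy 0/1 labels
x = rng.normal(size=3)                          # arbitrary coefficient vector

# central finite differences along each coordinate direction
eps = 1e-6
numeric = np.array([(cost(A, b, x + eps * e) - cost(A, b, x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(numeric, grad(A, b, x), atol=1e-6))  # True
```

The agreement between the analytic and numeric gradients confirms the final line of the derivation.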
Top answer (1 of 2, score 20)

In linear regression, the Maximum Likelihood Estimation (MLE) solution for estimating $x$ has the following closed-form solution (assuming $A$ is a matrix with full column rank):

$$\hat{x}_\text{lin}=\underset{x}{\text{argmin}} \|Ax-b\|_2^2 = (A^TA)^{-1}A^Tb$$

This is read as "find the $x$ that minimizes the objective function, $\|Ax-b\|_2^2$". The nice thing about representing the linear regression objective function this way is that we can keep everything in matrix notation and solve for $\hat{x}_\text{lin}$ by hand. As Alex R. mentions, in practice we often don't compute $(A^TA)^{-1}$ directly because it is computationally inefficient and $A$ often does not meet the full-rank criterion. Instead, we turn to the Moore-Penrose pseudoinverse. Computing the pseudoinverse in practice can involve the Cholesky decomposition or the Singular Value Decomposition.
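To illustrate, on a well-conditioned problem the normal-equations solve and an SVD-based least-squares routine (the pseudoinverse route) agree. A small NumPy sketch with synthetic data (all names are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 4))                  # full column rank with probability 1
x_true = np.array([1.0, -2.0, 0.5, 3.0])
b = A @ x_true + 0.01 * rng.normal(size=100)   # slightly noisy observations

# normal equations: fine here, but ill-advised when A is ill-conditioned
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

# SVD-based least squares, as used by np.linalg.lstsq / pinv internally
x_svd, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_ne, x_svd))  # True
```

On ill-conditioned or rank-deficient problems only the SVD route remains reliable, which is the point made above.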

Alternatively, the MLE solution for estimating the coefficients in logistic regression is:

$$\hat{x}_\text{log} = \underset{x}{\text{argmin}} \sum_{i=1}^{N} y^{(i)}\log(1+e^{-x^Ta^{(i)}}) + (1-y^{(i)})\log(1+e^{x^T a^{(i)}})$$

where (assuming each sample of data is stored row-wise):

- $x$ is a vector representing the regression coefficients
- $a^{(i)}$ is a vector representing the $i^{th}$ sample/row of the data matrix $A$
- $y^{(i)}$ is a scalar in $\{0, 1\}$, the label corresponding to the $i^{th}$ sample
- $N$ is the number of data samples (the number of rows of the data matrix $A$)

Again, this is read as "find the $x$ that minimizes the objective function".

If you wanted to, you could take it a step further and represent $\hat{x}_\text{log}$ in matrix notation as follows:

$$ \hat{x}_\text{log} = \underset{x}{\text{argmin}}\; \operatorname{tr}\left( \begin{bmatrix} y^{(1)} & (1-y^{(1)}) \\ \vdots & \vdots \\ y^{(N)} & (1-y^{(N)})\\\end{bmatrix} \begin{bmatrix} \log(1+e^{-x^Ta^{(1)}}) & \cdots & \log(1+e^{-x^Ta^{(N)}}) \\ \log(1+e^{x^Ta^{(1)}}) & \cdots & \log(1+e^{x^Ta^{(N)}}) \end{bmatrix} \right) $$

but you don't gain anything from doing this. Logistic regression does not have a closed-form solution and does not gain the same benefits from matrix notation that linear regression does. To solve for $\hat{x}_\text{log}$, estimation techniques such as gradient descent and the Newton-Raphson method are used. Through some of these techniques (e.g., Newton-Raphson), $\hat{x}_\text{log}$ is approximated and can be represented in matrix notation (see the link provided by Alex R.).
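Since there is no closed form, here is a minimal Newton-Raphson sketch for $\hat{x}_\text{log}$, using the standard gradient $A^T(h-y)$ and Hessian $A^T \operatorname{diag}(h(1-h))\,A$ of the negative log-likelihood. The synthetic, non-separable data and all names are assumptions; a production solver would add a line search or regularization:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(A, y, iters=15):
    """Newton-Raphson on the logistic negative log-likelihood."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        h = sigmoid(A @ x)                       # predicted probabilities
        g = A.T @ (h - y)                        # gradient
        H = A.T @ (A * (h * (1 - h))[:, None])   # Hessian
        x -= np.linalg.solve(H, g)               # Newton step
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 3))
x_true = np.array([0.5, -1.0, 0.25])
y = (rng.random(500) < sigmoid(A @ x_true)).astype(float)

x_hat = newton_logistic(A, y)
print(np.linalg.norm(A.T @ (sigmoid(A @ x_hat) - y)))  # ~0 at the optimum
```

At convergence the gradient vanishes, i.e. the fit satisfies the score equation discussed in the second answer below.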

Answer 2 of 2 (score 18)

@joceratops' answer focuses on the optimization problem of maximum likelihood estimation. This is indeed a flexible approach that is amenable to many types of problems. For estimating most models, including linear and logistic regression, there is another general approach based on method-of-moments estimation.

The linear regression estimator can also be formulated as the root of the estimating equation:

$$0 = \mathbf{X}^T(Y - \mathbf{X}\beta)$$

In this regard, $\beta$ is seen as the value which retrieves an average residual of 0. It needn't rely on any underlying probability model to have this interpretation. It is, however, interesting to derive the score equations for a normal likelihood: you will see that they take exactly the form displayed above. Maximizing the likelihood of a regular exponential family for a linear model (e.g. linear or logistic regression) is equivalent to obtaining solutions to its score equations.
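Numerically, the least-squares fit does solve this estimating equation: the residuals come out orthogonal to every column of $\mathbf{X}$. A small sketch on synthetic data (names assumed for the example):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
Y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=200)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# the estimating equation 0 = X^T (Y - X beta), evaluated at the fit
score = X.T @ (Y - X @ beta_hat)
print(np.allclose(score, 0, atol=1e-8))  # True
```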

$$0 = \sum_{i=1}^n S_i(\alpha, \beta) = \frac{\partial}{\partial \beta} \log \mathcal{L}( \beta, \alpha, X, Y) = \mathbf{X}^T (Y - g(\mathbf{X}\beta))$$

where $Y_i$ has expected value $g(\mathbf{X}_i \beta)$. In GLM estimation, $g$ is said to be the inverse of a link function. In the normal likelihood equations, $g^{-1}$ is the identity function, and in logistic regression $g^{-1}$ is the logit function. A more general approach would be to require $0 = \sum_{i=1}^n \left(Y_i - g(\mathbf{X}_i\beta)\right)$, which allows for model misspecification.

Additionally, it is interesting to note that for regular exponential families, $\frac{\partial g(\mathbf{X}\beta)}{\partial \beta} = \mathbf{V}(g(\mathbf{X}\beta))$, which is called a mean-variance relationship. Indeed, for logistic regression the mean $p = g(\mathbf{X}\beta)$ is related to the variance by $\mbox{var}(Y_i) = p_i(1-p_i)$. This suggests interpreting a misspecified GLM as one which gives a 0 average Pearson residual, and it suggests a further generalization allowing non-proportional functional mean derivatives and mean-variance relationships.
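For the logistic case this is just the sigmoid derivative identity $\sigma'(\eta) = \sigma(\eta)(1-\sigma(\eta))$, which equates the slope of the mean function with the Bernoulli variance at that mean. It is easy to verify numerically (a small sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4, 4, 9)
eps = 1e-6
num_deriv = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

# the derivative of the mean equals the variance p(1 - p)
print(np.allclose(num_deriv, sigmoid(z) * (1 - sigmoid(z)), atol=1e-8))  # True
```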

A generalized estimating equation approach would specify linear models in the following way:

$$0 = \frac{\partial g(\mathbf{X}\beta)}{\partial \beta} \mathbf{V}^{-1}\left(Y - g(\mathbf{X}\beta)\right)$$

With $\mathbf{V}$ a matrix of variances based on the fitted value (mean) given by $g(\mathbf{X}\beta)$. This approach to estimation allows one to pick a link function and mean variance relationship as with GLMs.

In logistic regression $g$ would be the inverse logit, and $V_{ii}$ would be given by $g(\mathbf{X}_i \beta)(1-g(\mathbf{X}_i\beta))$. The solutions to this estimating equation, obtained by Newton-Raphson, will yield the $\beta$ obtained from logistic regression. However, a somewhat broader class of models is estimable under a similar framework. For instance, the link function can be taken to be the log of the linear predictor, so that the regression coefficients are relative risks rather than odds ratios; given the well-documented pitfalls of interpreting ORs as RRs, this behooves me to ask why anyone fits logistic regression models at all anymore.
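As an illustration of that last point, the estimating equation above can be solved by Fisher scoring with a log link, so the fitted coefficients are log relative risks. The sketch below is an assumed toy setup, not a robust log-binomial fitter: it relies on the fitted means staying inside $(0, 1)$, which the synthetic data is chosen to guarantee:

```python
import numpy as np

def fit_log_link(A, y, iters=50):
    """Fisher scoring for 0 = D^T V^{-1} (y - mu), with mu = exp(A beta)
    and V = diag(mu (1 - mu)); coefficients are log relative risks."""
    beta = np.array([np.log(y.mean())] + [0.0] * (A.shape[1] - 1))
    for _ in range(iters):
        mu = np.exp(A @ beta)           # fitted means, must stay below 1
        D = mu[:, None] * A             # d mu / d beta for the log link
        W = 1.0 / (mu * (1 - mu))       # diagonal of V^{-1}
        beta += np.linalg.solve(D.T @ (W[:, None] * D), D.T @ (W * (y - mu)))
    return beta

rng = np.random.default_rng(0)
A = np.column_stack([np.ones(2000), rng.random(2000)])  # intercept + covariate in [0, 1]
beta_true = np.array([-2.0, 0.5])                       # keeps mu well below 1
y = (rng.random(2000) < np.exp(A @ beta_true)).astype(float)

beta = fit_log_link(A, y)
mu = np.exp(A @ beta)
# estimating-equation residual at the fit; ~0 at convergence
print(np.linalg.norm((mu[:, None] * A).T @ ((y - mu) / (mu * (1 - mu)))))
```

Here `exp(beta[1])` is the fitted relative risk per unit of the covariate, with no odds-ratio interpretation needed.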