machine learning - Gradient descent implementation of logistic regression - Data Science Stack Exchange
You are missing a minus sign before your binary cross-entropy loss function. The loss you currently have becomes more negative as the predictions get worse and approaches zero as they get better, so if you minimize it the model will move its weights in the wrong direction and start performing worse. To make the model perform better, either maximize the loss you currently have (i.e. use gradient ascent instead of gradient descent, as in your second example), or add the minus sign so that a decrease in the loss corresponds to a better prediction.
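To make the sign issue concrete, here is a small sketch (variable names are mine, mirroring the question's single data point $x=1$, $y=1$, $a=b=1$) showing that descending the correctly signed BCE improves the prediction:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# single data point x = 1 with label y = 1, current weights a = b = 1
x, y, a, b = 1.0, 1.0, 1.0, 1.0
z = sigmoid(a * x + b)

# correctly signed BCE: small when the prediction is good
bce = -(y * np.log(z) + (1 - y) * np.log(1 - z))

# gradient of this BCE w.r.t. a is (z - y) * x, so gradient
# *descent* moves a in the direction (y - z) * x
lr = 1.0
a_new = a - lr * (z - y) * x
z_new = sigmoid(a_new * x + b)
bce_new = -(y * np.log(z_new) + (1 - y) * np.log(1 - z_new))
# bce_new < bce: the loss decreased and z moved towards y
```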
I think your implementation is correct and the answer provided is just wrong.
Just for reference, the below figure represents the theory / math we are using here to implement Logistic Regression with Gradient Descent:

Here, we have the learnable parameter vector $\theta = [b,\;a]^T$ and $m=1$ (since there is a single data point), with $X=[1,\; x]$, where $1$ corresponds to the intercept (bias) term.
Making your implementation a little more modular and increasing the number of epochs to 10 (instead of 1):
import numpy as np

def update_params(a, b, x, y, z, lr):
    # gradient descent: step *against* the gradient, i.e. a -= lr * dJ/da
    a = a - lr * x * (z - y)
    a = np.round(a, decimals=3)
    b = b - lr * (z - y)
    b = np.round(b, decimals=3)
    return a, b

def LogitRegression(arr):
    x, y, a, b = arr
    lr = 1.0
    num_epochs = 10
    #losses, preds = [], []
    for _ in range(num_epochs):
        z = 1.0 / (1.0 + np.exp(-a * x - b))
        bce = -y * np.log(z) - (1 - y) * np.log(1 - z)
        #losses.append(bce)
        #preds.append(z)
        print(bce, y, z, a, b)
        a, b = update_params(a, b, x, y, z, lr)
    return ", ".join([str(a), str(b)])
LogitRegression([1,1,1,1])
# 0.12692801104297263 1 0.8807970779778823 1 1
# 0.10135698320837692 1 0.9036104015620354 1.119 1.119 # values after 1 epoch
# 0.08437500133718023 1 0.9190865327845347 1.215 1.215
# 0.0721998635352405 1 0.9303449352007099 1.296 1.296
# 0.06305834631954188 1 0.9388886913913739 1.366 1.366
# 0.05601486564909184 1 0.9455250799418752 1.427 1.427
# 0.05042252914331105 1 0.9508275872468411 1.481 1.481
# 0.04582166273506799 1 0.9552122969502131 1.53 1.53
# 0.041959389233941616 1 0.958908721799535 1.575 1.575
# 0.03871910934525996 1 0.962020893877162 1.616 1.616
If you plot the BCE loss and the predicted y (i.e., z) over iterations, you get the following figure (as expected, BCE loss is monotonically decreasing and z is getting closer to ground truth y with increasing iterations, leading to convergence):

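For reference, a plot like that can be generated by keeping the bookkeeping lists that are commented out above; a minimal self-contained sketch (no rounding, matplotlib assumed):

```python
import numpy as np
import matplotlib.pyplot as plt

x, y, a, b, lr = 1.0, 1.0, 1.0, 1.0, 1.0
losses, preds = [], []
for _ in range(10):
    z = 1.0 / (1.0 + np.exp(-a * x - b))
    losses.append(-y * np.log(z) - (1 - y) * np.log(1 - z))
    preds.append(z)
    # gradient descent updates
    a = a - lr * x * (z - y)
    b = b - lr * (z - y)

plt.plot(losses, label='BCE loss')
plt.plot(preds, label='predicted z')
plt.xlabel('epoch')
plt.legend()
plt.grid()
plt.show()
```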
Now, if you change your update_params() to the following:
def update_params(a, b, x, y, z, lr):
    # note the flipped sign: this steps *along* the gradient (ascent on the loss)
    a = a + lr * x * (z - y)
    a = np.round(a, decimals=3)
    b = b + lr * (z - y)
    b = np.round(b, decimals=3)
    return a, b
and call LogitRegression() with the same set of inputs:
LogitRegression([1,1,1,1])
# 0.12692801104297263 1 0.8807970779778823 1 1
# 0.15845663982299638 1 0.8534599691639768 0.881 0.881 # provided in the answer
# 0.2073277757451888 1 0.8127532055353431 0.734 0.734
# 0.28883714051459425 1 0.7491341990786479 0.547 0.547
# 0.4403300268044629 1 0.6438239068707556 0.296 0.296
# 0.7549461015956136 1 0.4700359482354282 -0.06 -0.06
# 1.4479476778575628 1 0.2350521962362353 -0.59 -0.59
# 2.774416770021533 1 0.0623858513799944 -1.355 -1.355
# 4.596141947283801 1 0.010090691161759239 -2.293 -2.293
# 6.56740642634977 1 0.0014054377957286094 -3.283 -3.283
you will end up with the following figure if you plot (clearly this is wrong: the loss function increases with every epoch and z moves further away from the ground-truth y, leading to divergence):

Also, the above implementation can easily be extended to multi-dimensional data containing many data points like the following:
def VanillaLogisticRegression(x, y):  # LR without regularization
    m, n = x.shape
    w = np.zeros((n + 1, 1))
    X = np.hstack((np.ones(m)[:, None], x))  # include the feature corresponding to the bias term
    num_epochs = 1000  # number of epochs to run gradient descent, tune this hyperparameter
    lr = 0.5  # learning rate, tune this hyperparameter
    losses = []
    for _ in range(num_epochs):
        y_hat = 1. / (1. + np.exp(-np.dot(X, w)))  # predicted y by the LR model
        J = np.mean(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))  # the binary cross-entropy loss
        grad_J = np.mean((y_hat - y) * X, axis=0)  # the gradient of the loss function
        w -= lr * grad_J[:, None]  # the gradient descent step, update the parameter vector w
        losses.append(J)
        # to test correctness of the implementation:
        # loss J should monotonically decrease & y_hat should get closer to y, with increasing iterations
        # print(J)
    return w
m, n = 1000, 5 # 1000 rows, 5 columns
# randomly generate dataset, note that y can have values as 0 and 1 only
x, y = np.random.random(m*n).reshape(m,n), np.random.randint(0,2,m).reshape(-1,1)
w = VanillaLogisticRegression(x, y)
w # learnt parameters
# array([[-0.0749518 ],
# [ 0.28592107],
# [ 0.15202566],
# [-0.15020757],
# [ 0.08147078],
# [-0.18823631]])
If you plot the loss function value over iterations, you will get a plot like the following one, showing how it converges.

Finally, let's compare the above implementation with sklearn's, which by default uses the more advanced optimization algorithm lbfgs and is hence likely to converge much faster. If our implementation is correct, both of them should converge to the same global minimum, since the loss function is convex. (Note that sklearn applies regularization by default; to have almost no regularization, we need to set the hyperparameter $C$ to a very high value.)
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0, C=10**12).fit(x, y.ravel())  # y.ravel() since sklearn expects a 1-D target
print(clf.coef_, clf.intercept_)
# [[ 0.28633262 0.15256914 -0.14975667 0.08192404 -0.18780851]] [-0.07612282]
Compare the parameter values obtained from the above implementation and the one obtained with sklearn's implementation: they are almost equal.
Also, let's compare the predicted probabilities obtained from these two implementations of LR (one from scratch, the other sklearn's library function). As the following scatterplot shows, they are almost identical:
import matplotlib.pyplot as plt
X = np.hstack((np.ones(m)[:, None], x))  # rebuild the design matrix with the bias column
pred_probs = 1 / (1 + np.exp(-X @ w))
plt.scatter(pred_probs, clf.predict_proba(x)[:, 1])
plt.grid()
plt.xlabel('pred prob', size=20)
plt.ylabel('pred prob (sklearn)', size=20)
plt.show()

Finally, let's compute the accuracies obtained, they are identical too:
print(sum((pred_probs > 0.5) == y) / len(y))
# [0.527]
clf.score(x, y)
# 0.527
This also shows the correctness of the implementation.
I recently wrote a blog post that breaks down the math behind maximum likelihood estimation for logistic regression. My friends found it helpful, so I decided to spread it around. If you've found the math hard to follow in other tutorials, hopefully mine will guide you through it step by step.
Here it is: https://statisticalmusings.netlify.app/post/logistic-regression-mle-a-full-breakdown/
If you can get a firm grasp of logistic regression, you'll be well set to understand deep learning!
Maximum Likelihood
Maximum likelihood estimation involves defining a likelihood function for calculating the conditional probability of observing the data sample given a probability distribution and its parameters. This approach can be used to search a space of possible distributions and parameters.
The logistic model uses the sigmoid function (denoted by $\sigma$) to estimate the probability that a given sample $y$ belongs to class $1$ given inputs $X$ and weights $W$:
\begin{align} P(y=1 \mid X) = \sigma(W^TX) \end{align}
where the sigmoid of the activation $a_n$ for a given sample $n$ is:
\begin{align} \large y_n = \sigma(a_n) = \frac{1}{1+e^{-a_n}} \end{align}
The accuracy of our model predictions can be captured by the objective function $L$, which we are trying to maximize.
\begin{align} \large L = \displaystyle\prod_{n=1}^N y_n^{t_n}(1-y_n)^{1-t_n} \end{align}
If we take the log of the above function, we obtain the log-likelihood function, whose form enables easier calculation of partial derivatives. Taking the log and maximizing it is acceptable because the logarithm is monotonically increasing, so maximizing the log-likelihood yields the same answer as maximizing the original objective function.
\begin{align} L = \displaystyle \sum_{n=1}^N t_n\log y_n+(1-t_n)\log(1-y_n) \end{align}
In our example, we will actually convert the objective function (which we would try to maximize) into a cost function (which we try to minimize) by taking the negative log-likelihood:
\begin{align} J = -\displaystyle \sum_{n=1}^N \left[t_n\log y_n+(1-t_n)\log(1-y_n)\right] \end{align}
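A quick numerical illustration (the numbers are made up) of the relation between the log-likelihood $L$ and the cost $J$: better predictions raise $L$ and lower $J$ by exactly the same amount, since $J = -L$:

```python
import numpy as np

t = np.array([1, 0, 1, 1])           # targets t_n
y = np.array([0.9, 0.2, 0.8, 0.6])   # model outputs y_n

L = np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))   # log-likelihood
J = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))  # cost J = -L

# sharper (better) predictions: L goes up, J goes down
y_better = np.array([0.99, 0.01, 0.99, 0.99])
J_better = -np.sum(t * np.log(y_better) + (1 - t) * np.log(1 - y_better))
```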
Gradient Descent
Gradient descent is an iterative optimization algorithm, which finds the minimum of a differentiable function. In this process, we try different values and update them to reach the optimal ones, minimizing the output.
Once we have an objective function, we can generally take its derivative with respect to the parameters (weights), set it equal to zero, and solve for the parameters to obtain the ideal solution. However, in the case of logistic regression (and many other complex or otherwise non-linear systems), this analytical method doesn't work. Instead, we resort to a method known as gradient descent, whereby we randomly initialize and then incrementally update our weights by calculating the slope of our objective function. When applying the cost function, we want to continue updating our weights until the slope of the gradient gets as close to zero as possible. We can show this mathematically:
\begin{align} w := w+\Delta w \end{align}
where the second term on the right is the negative of the learning rate times the derivative of the cost function with respect to the weights (which is our gradient):
\begin{align} \Delta w = -\eta\,\nabla J(w) \end{align}
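A toy sketch (example of mine, on a simple 1-D convex cost rather than logistic regression) of this update rule:

```python
# minimize the toy convex cost J(w) = (w - 3)^2 with gradient descent
w, eta = 0.0, 0.1
for _ in range(100):
    grad = 2 * (w - 3)        # dJ/dw
    w = w + (-eta * grad)     # w := w + Δw, with Δw = -η∇J(w)
# w ends up very close to the minimizer w* = 3
```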
Thus, we want to take the derivative of the cost function with respect to the weight, which, using the chain rule, gives us:
\begin{align} \frac{\partial J}{\partial w_i} = \displaystyle \sum_{n=1}^N \frac{\partial J}{\partial y_n}\frac{\partial y_n}{\partial a_n}\frac{\partial a_n}{\partial w_i} \end{align}
Thus, we are looking to obtain three different derivatives. Let us start by solving for the derivative of the cost function with respect to $y$:
\begin{align} \frac{\partial J}{\partial y_n} = -\left(t_n \frac{1}{y_n} + (1-t_n) \frac{1}{1-y_n}(-1)\right) = -\left(\frac{t_n}{y_n} - \frac{1-t_n}{1-y_n}\right) \end{align}
Next, let us solve for the derivative of $y$ with respect to our activation function:
\begin{align} \large y_n = \sigma(a_n) = \frac{1}{1+e^{-a_n}} \end{align}
\begin{align} \frac{\partial y_n}{\partial a_n} = \frac{-1}{(1+e^{-a_n})^2}(e^{-a_n})(-1) = \frac{e^{-a_n}}{(1+e^{-a_n})^2} = \frac{1}{1+e^{-a_n}} \cdot \frac{e^{-a_n}}{1+e^{-a_n}} \end{align}
\begin{align} \frac{\partial y_n}{\partial a_n} = y_n(1-y_n) \end{align}
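This identity is easy to check numerically, e.g. against a central finite difference (the point and tolerance are my choice):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = 0.7
y = sigmoid(a)
analytic = y * (1 - y)  # the derivative y_n (1 - y_n) derived above

eps = 1e-6
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
# analytic and numeric agree to high precision
```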
And lastly, we solve for the derivative of the activation function with respect to the weights:
\begin{align} \ a_n = W^TX_n \end{align}
\begin{align} a_n = w_0x_{n0} + w_1x_{n1} + w_2x_{n2} + \cdots + w_Dx_{nD} \end{align}
where $D$ is the number of features and $x_{n0}=1$ corresponds to the bias.
\begin{align} \frac{\partial a_n}{\partial w_i} = x_{ni} \end{align}
Now we can put it all together and simplify.
\begin{align} \frac{\partial J}{\partial w_i} = - \displaystyle\sum_{n=1}^N\frac{t_n}{y_n}y_n(1-y_n)x_{ni}-\frac{1-t_n}{1-y_n}y_n(1-y_n)x_{ni} \end{align}
\begin{align} = - \displaystyle\sum_{n=1}^Nt_n(1-y_n)x_{ni}-(1-t_n)y_nx_{ni} \end{align}
\begin{align} = - \displaystyle\sum_{n=1}^N[t_n-t_ny_n-y_n+t_ny_n]x_{ni} \end{align}
\begin{align} \frac{\partial J}{\partial w_i} = \displaystyle\sum_{n=1}^N(y_n-t_n)x_{ni} \end{align}
or, collecting all components into vector form,
\begin{align} \frac{\partial J}{\partial w} = \displaystyle\sum_{n=1}^{N}(y_n-t_n)x_n \end{align}
We can get rid of the summation above by applying the principle that a dot product between two vectors is a sum over the shared index. That is:
\begin{align} \ a^Tb = \displaystyle\sum_{n=1}^Na_nb_n \end{align}
Therefore, the gradient with respect to w is:
\begin{align} \frac{\partial J}{\partial w} = X^T(Y-T) \end{align}
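A small numpy check (random data, shapes of my choosing) that the vectorized form $X^T(Y-T)$ matches the component-wise sums:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 8, 3
X = rng.random((N, D))                        # N samples, D features
T = rng.integers(0, 2, (N, 1)).astype(float)  # targets t_n
W = rng.normal(size=(D, 1))
Y = 1.0 / (1.0 + np.exp(-X @ W))              # predictions y_n

grad_vec = X.T @ (Y - T)  # vectorized gradient X^T (Y - T)

# component-wise form: dJ/dw_i = sum_n (y_n - t_n) x_{ni}
grad_sum = np.array([[np.sum((Y[:, 0] - T[:, 0]) * X[:, i])] for i in range(D)])
# grad_vec and grad_sum are equal
```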
If you are asking yourself where the bias term $w_0$ of our equation went: we compute its gradient the same way, except that the corresponding input $x_{n0}$ is fixed at $1$.
In short, maximum likelihood estimation is used to find the parameters given target values $y$ and inputs $x$: it finds the parameters that maximise the probability of $y$ given $x$. It has been shown that, in the case of binary classification, the MLE problem can be solved by finding the parameters that give the least cross-entropy.
Gradient descent is an optimisation algorithm that helps you update the parameters iteratively, to find the parameters which give the highest probability of $y$.
For more details: https://www.google.com/amp/s/glassboxmedicine.com/2019/12/07/connections-log-likelihood-cross-entropy-kl-divergence-logistic-regression-and-neural-networks/amp/