It looks like you have some things mixed up here. When doing this, it's critical to keep track of the shapes of your vectors and make sure you're getting sensible results. For example, you are calculating cost with:

cost = ((-y) * np.log(sigmoid(X[i]))) - ((1 - y) * np.log(1 - sigmoid(X[i])))

In your case y is a vector with 20 items and X[i] is a single value. This makes your cost calculation a 20-item vector, which doesn't make sense; your cost should be a single value. (You're also recalculating this cost many times for no reason in your gradient descent function.)
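A quick, self-contained sketch of that shape problem (the values here are stand-ins):

import numpy as np

def sigmoid(a):
    return 1.0 / (1 + np.exp(-a))

y = np.zeros(20)  # stand-in for the 20-element label vector
xi = 0.5          # X[i] is a single scalar value
print(((-y) * np.log(sigmoid(xi))).shape)  # (20,) -- a vector, not a scalar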

Also, if you want this to be able to fit your data, you need to add a bias term to X. So let's start there.

import numpy as np

X = np.asarray([
    [0.50],[0.75],[1.00],[1.25],[1.50],[1.75],[1.75],
    [2.00],[2.25],[2.50],[2.75],[3.00],[3.25],[3.50],
    [4.00],[4.25],[4.50],[4.75],[5.00],[5.50]])

ones = np.ones(X.shape)
X = np.hstack([ones, X])
# X.shape is now (20, 2)

Theta will now need two values, one for each column of X. So initialize it and Y:

Y = np.array([0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]).reshape([-1, 1])
# reshape Y so it's a column vector, making the matrix multiplication easier
Theta = np.array([[0], [0]])

Your sigmoid function is good. Let's also make a vectorized cost function:

def sigmoid(a):
    return 1.0 / (1 + np.exp(-a))

def cost(x, y, theta):
    m = x.shape[0]
    h = sigmoid(np.matmul(x, theta))
    cost = (np.matmul(-y.T, np.log(h)) - np.matmul((1 - y.T), np.log(1 - h))) / m
    return cost

The cost function works because Theta has a shape of (2, 1) and X has a shape of (20, 2), so matmul(X, Theta) will be shaped (20, 1). We then matrix-multiply by the transpose of Y (y.T has shape (1, 20)), which results in a single value: our cost for a given value of Theta.
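As a quick sanity check (a sketch), the shapes line up, and the all-zeros Theta gives the expected initial cost of ln(2) ≈ 0.693, since sigmoid(0) = 0.5 for every sample:

print(np.matmul(X, Theta).shape)  # (20, 1)
print(cost(X, Y, Theta))          # [[0.6931...]]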

We can then write a function that performs a single step of batch gradient descent:

def gradient_Descent(theta, alpha, x, y):
    m = x.shape[0]
    h = sigmoid(np.matmul(x, theta))
    grad = np.matmul(x.T, (h - y)) / m  # use the x argument, not the global X
    theta = theta - alpha * grad
    return theta

Notice np.matmul(x.T, (h - y)) is multiplying shapes (2, 20) and (20, 1), which results in a shape of (2, 1), the same shape as Theta, which is what you want from your gradient. This allows you to multiply it by your learning rate and subtract it from the initial Theta, which is what gradient descent is supposed to do.
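As a quick shape check (sketch):

h = sigmoid(np.matmul(X, Theta))             # (20, 1)
grad = np.matmul(X.T, (h - Y)) / X.shape[0]  # (2, 20) @ (20, 1) -> (2, 1)
print(grad.shape == Theta.shape)             # True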

So now you just write a loop for a number of iterations and update Theta until it looks like it converges:

n_iterations = 500
learning_rate = 0.5

for i in range(n_iterations):
    Theta = gradient_Descent(Theta, learning_rate, X, Y)
    if i % 50 == 0:
        print(cost(X, Y, Theta))

This will print the cost every 50 iterations, showing a steadily decreasing cost, which is what you hope for:

[[ 0.6410409]]
[[ 0.44766253]]
[[ 0.41593581]]
[[ 0.40697167]]
[[ 0.40377785]]
[[ 0.4024982]]
[[ 0.40195]]
[[ 0.40170533]]
[[ 0.40159325]]
[[ 0.40154101]]

You can try different initial values of Theta and you will see it always converges to the same thing.
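For example (a sketch with an arbitrary, hypothetical starting point):

Theta2 = np.array([[1.0], [-1.0]])  # hypothetical non-zero initialization
for i in range(n_iterations):
    Theta2 = gradient_Descent(Theta2, learning_rate, X, Y)
print(cost(X, Y, Theta2))  # should approach ~0.4015, as in the run above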

Now you can use your newly found values of Theta to make predictions:

h = sigmoid(np.matmul(X, Theta))
print((h > .5).astype(int))

This prints what you would expect for a linear fit to your data:

[[0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]]
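As a final sketch, you can also read off the decision boundary: the prediction flips from 0 to 1 where the linear term crosses zero.

# p = 0.5 exactly where Theta[0] + Theta[1] * x = 0
boundary = -Theta[0, 0] / Theta[1, 0]
print(boundary)  # should fall between 2.50 and 2.75, consistent with the predictions above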
Answer from Mark on Stack Overflow

Answers from "Gradient descent implementation of logistic regression" on Data Science Stack Exchange

Answer 1 of 2

You are missing a minus sign before your binary cross entropy loss function. The loss function you currently have becomes more negative (positive) if the predictions are worse (better), therefore if you minimize this loss function the model will change its weights in the wrong direction and start performing worse. To make the model perform better you either maximize the loss function you currently have (i.e. use gradient ascent instead of gradient descent, as you have in your second example), or you add a minus sign so that a decrease in the loss is linked to a better prediction.
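A minimal numeric sketch of that sign behavior (the values here are illustrative):

import numpy as np

y, z = 1.0, 0.88  # ground-truth label and a fairly good predicted probability
no_minus = y*np.log(z) + (1-y)*np.log(1-z)   # about -0.128: gets more negative as z improves
bce = -(y*np.log(z) + (1-y)*np.log(1-z))     # about 0.128: decreases as z improves
print(no_minus, bce)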

Answer 2 of 2

I think your implementation is correct and the answer provided is just wrong.

Just for reference, the theory / math we are using here to implement Logistic Regression with Gradient Descent is: the prediction is $z = \sigma(ax + b) = \frac{1}{1 + e^{-(ax+b)}}$, the binary cross-entropy loss is $J = -y\log z - (1-y)\log(1-z)$, and the gradient descent updates are $a \leftarrow a - \eta\, x (z - y)$ and $b \leftarrow b - \eta\, (z - y)$, with learning rate $\eta$.

Here, we have the learnable parameter vector $\theta = [b,\;a]^T$ and $m=1$ (since there is a single data point), with $X=[1,\; x]$, where the $1$ corresponds to the intercept (bias) term.

Just making your implementation a little modular and increasing the number of epochs to 10 (instead of 1):

import numpy as np

def update_params(a, b, x, y, z, lr):
    # gradient descent: step *against* the gradient of the BCE loss
    a = a - lr * x * (z-y)
    a = np.round(a, decimals=3)
    b = b - lr * (z-y)
    b = np.round(b, decimals=3)
    return a, b

def LogitRegression(arr):
    x, y, a, b = arr
    lr = 1.0
    num_epochs = 10
    #losses, preds = [], []
    for _ in range(num_epochs):
        z = 1.0 / (1.0 + np.exp(-a * x - b))
        bce = -y*np.log(z) - (1-y)*np.log(1-z)
        #losses.append(bce)
        #preds.append(z)
        print(bce, y, z, a, b)
        a, b = update_params(a, b, x, y, z, lr)

    return ", ".join([str(a), str(b)])

LogitRegression([1,1,1,1])
# 0.12692801104297263 1 0.8807970779778823 1 1
# 0.10135698320837692 1 0.9036104015620354 1.119 1.119 # values after 1 epoch
# 0.08437500133718023 1 0.9190865327845347 1.215 1.215
# 0.0721998635352405 1 0.9303449352007099 1.296 1.296
# 0.06305834631954188 1 0.9388886913913739 1.366 1.366
# 0.05601486564909184 1 0.9455250799418752 1.427 1.427
# 0.05042252914331105 1 0.9508275872468411 1.481 1.481
# 0.04582166273506799 1 0.9552122969502131 1.53 1.53
# 0.041959389233941616 1 0.958908721799535 1.575 1.575
# 0.03871910934525996 1 0.962020893877162 1.616 1.616

If you plot the BCE loss and the predicted y (i.e., z) over the iterations, you can see that, as expected, the BCE loss is monotonically decreasing and z is getting closer to the ground truth y with increasing iterations, leading to convergence.
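A minimal plotting sketch for that check (assuming matplotlib; it re-runs the same single-point loop while recording the values):

import numpy as np
import matplotlib.pyplot as plt

# re-run the loop for the single point (x, y) = (1, 1), starting from a = b = 1
x, y, a, b, lr = 1, 1, 1.0, 1.0, 1.0
losses, preds = [], []
for _ in range(10):
    z = 1.0 / (1.0 + np.exp(-a * x - b))
    losses.append(-y*np.log(z) - (1-y)*np.log(1-z))
    preds.append(z)
    a, b = update_params(a, b, x, y, z, lr)

plt.plot(losses, label='BCE loss')
plt.plot(preds, label='predicted z')
plt.xlabel('epoch')
plt.legend()
plt.show()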

Now, if you change your update_params() to the following:

def update_params(a, b, x, y, z, lr):
    # sign flipped: this steps *along* the gradient (gradient ascent on the loss)
    a = a + lr * x * (z-y)
    a = np.round(a, decimals=3)
    b = b + lr * (z-y)
    b = np.round(b, decimals=3)
    return a, b

and call LogitRegression() with the same set of inputs:

LogitRegression([1,1,1,1])
# 0.12692801104297263 1 0.8807970779778823 1 1
# 0.15845663982299638 1 0.8534599691639768 0.881 0.881 # provided in the answer
# 0.2073277757451888 1 0.8127532055353431 0.734 0.734
# 0.28883714051459425 1 0.7491341990786479 0.547 0.547
# 0.4403300268044629 1 0.6438239068707556 0.296 0.296
# 0.7549461015956136 1 0.4700359482354282 -0.06 -0.06
# 1.4479476778575628 1 0.2350521962362353 -0.59 -0.59
# 2.774416770021533 1 0.0623858513799944 -1.355 -1.355
# 4.596141947283801 1 0.010090691161759239 -2.293 -2.293
# 6.56740642634977 1 0.0014054377957286094 -3.283 -3.283

and if you plot the same curves, you will see that this is clearly wrong: the loss function increases with every epoch and z moves further away from the ground-truth y, leading to divergence.

Also, the above implementation can easily be extended to multi-dimensional data containing many data points, like the following:

def VanillaLogisticRegression(x, y): # LR without regularization
    m, n = x.shape
    w = np.zeros((n+1, 1))
    X = np.hstack((np.ones(m)[:,None], x)) # include the feature corresponding to the bias term
    num_epochs = 1000 # number of epochs to run gradient descent, tune this hyperparameter
    lr = 0.5 # learning rate, tune this hyperparameter
    losses = []
    for _ in range(num_epochs):
        y_hat = 1. / (1. + np.exp(-np.dot(X, w))) # predicted y by the LR model
        J = np.mean(-y*np.log2(y_hat) - (1-y)*np.log2(1-y_hat)) # the binary cross-entropy loss (in bits, since log2 is used)
        grad_J = np.mean((y_hat - y)*X, axis=0) # gradient of the natural-log BCE; the constant 1/ln(2) factor is absorbed into the learning rate
        w -= lr * grad_J[:, None] # the gradient descent step, update the parameter vector w
        losses.append(J)
        # test correctness of the implementation:
        # loss J should monotonically decrease & y_hat should get closer to y, with increasing iterations
        # print(J)
    return w

m, n = 1000, 5 # 1000 rows, 5 columns
# randomly generate a dataset; note that y can only take the values 0 and 1
x, y = np.random.random(m*n).reshape(m,n), np.random.randint(0,2,m).reshape(-1,1)
w = VanillaLogisticRegression(x, y)
w # learnt parameters
# array([[-0.0749518 ],
#   [ 0.28592107],
#   [ 0.15202566],
#   [-0.15020757],
#   [ 0.08147078],
#   [-0.18823631]])

If you plot the loss function value over the iterations, you will see how it converges.
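A quick sketch of that plot, assuming a hypothetical variant of the function whose last line is changed to return w, losses instead of return w:

import matplotlib.pyplot as plt

w, losses = VanillaLogisticRegression(x, y)  # hypothetical variant returning losses too

plt.plot(losses)
plt.xlabel('iteration')
plt.ylabel('BCE loss')
plt.show()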

Finally, let's compare the above implementation with sklearn's, which uses a more advanced optimization algorithm (lbfgs) by default and is hence likely to converge much faster. If our implementation is correct, both of them should converge to the same global minimum, since the loss function is convex. (Note that sklearn uses regularization by default; to have almost no regularization, we need to set the input hyperparameter $C$ to a very high value.)

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0, C=10**12).fit(x, y.ravel()) # ravel to pass the 1-D labels sklearn expects
print(clf.coef_, clf.intercept_)
# [[ 0.28633262  0.15256914 -0.14975667  0.08192404 -0.18780851]] [-0.07612282]

Compare the parameter values obtained from the above implementation and the one obtained with sklearn's implementation: they are almost equal.

Also, let's compare the predicted probabilities obtained using these two different implementations of LR (one from scratch, the other from sklearn's library function). As can be seen from the following scatterplot, they are almost identical:

import matplotlib.pyplot as plt

X = np.hstack((np.ones(len(x))[:, None], x)) # rebuild the design matrix with the bias column, since X was local to the function
pred_probs = 1 / (1 + np.exp(-X@w))
plt.scatter(pred_probs, clf.predict_proba(x)[:,1])
plt.grid()
plt.xlabel('pred prob', size=20)
plt.ylabel('pred prob (sklearn)', size=20)
plt.show()

Finally, let's compute the accuracies obtained; they are identical too:

print(sum((pred_probs > 0.5) == y) / len(y)) 
# [0.527]
clf.score(x, y)   
# 0.527

This also shows the correctness of the implementation.

from mlxtend.data import iris_data from mlxtend.plotting import plot_decision_regions from mlxtend.classifier import LogisticRegression import matplotlib.pyplot as plt # Loading Data X, y = iris_data() X = X[:, [0, 3]] # sepal length and petal width X = X[0:100] # class 0 and class 1 y = y[0:100] # class 0 and class 1 # standardize X[:,0] = (X[:,0] - X[:,0].mean()) / X[:,0].std() X[:,1] = (X[:,1] - X[:,1].mean()) / X[:,1].std() lr = LogisticRegression(eta=0.1, l2_lambda=0.0, epochs=100, minibatches=1, # for Gradient Descent random_seed=1, print_progress=3) lr.fit(X, y) plot_decision_regions(X, y, clf=lr) plt.title('Logistic Regression - Gradient Descent') plt.show() plt.plot(range(len(lr.cost_)), lr.cost_) plt.xlabel('Iterations') plt.ylabel('Cost') plt.show()