machine learning - Gradient descent implementation of logistic regression - Data Science Stack Exchange
You are missing a minus sign before your binary cross-entropy loss function. The loss you currently have becomes more negative as the predictions get worse and approaches zero as they get better, so if you minimize it the model will move its weights in the wrong direction and start performing worse. To make the model perform better, either maximize the loss you currently have (i.e. use gradient ascent instead of gradient descent, as in your second example), or add the minus sign so that a decrease in the loss corresponds to a better prediction.
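To make the sign issue concrete, here is a small sketch (variable names are mine, mirroring the question's single data point $x=1$, $y=1$, $a=b=1$) showing that descending the correctly signed BCE improves the prediction:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# single data point x = 1 with label y = 1, current weights a = b = 1
x, y, a, b = 1.0, 1.0, 1.0, 1.0
z = sigmoid(a * x + b)

# correctly signed BCE: small when the prediction is good
bce = -(y * np.log(z) + (1 - y) * np.log(1 - z))

# gradient of this BCE w.r.t. a is (z - y) * x, so gradient
# *descent* moves a in the direction (y - z) * x
lr = 1.0
a_new = a - lr * (z - y) * x
z_new = sigmoid(a_new * x + b)
bce_new = -(y * np.log(z_new) + (1 - y) * np.log(1 - z_new))
# bce_new < bce: the loss decreased and z moved towards y
```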
I think your implementation is correct and the answer provided is just wrong.
Just for reference, the below figure represents the theory / math we are using here to implement Logistic Regression with Gradient Descent:

Here, we have the learnable parameter vector $\theta = [b,\;a]^T$ and $m=1$ (since there is a single data point), with $X=[1,\; x]$, where $1$ corresponds to the intercept (bias) term.
Making your implementation a little more modular and increasing the number of epochs to 10 (instead of 1):
import numpy as np

def update_params(a, b, x, y, z, lr):
    # gradient descent: step *against* the gradient, i.e. a -= lr * dJ/da
    a = a - lr * x * (z - y)
    a = np.round(a, decimals=3)
    b = b - lr * (z - y)
    b = np.round(b, decimals=3)
    return a, b

def LogitRegression(arr):
    x, y, a, b = arr
    lr = 1.0
    num_epochs = 10
    #losses, preds = [], []
    for _ in range(num_epochs):
        z = 1.0 / (1.0 + np.exp(-a * x - b))
        bce = -y * np.log(z) - (1 - y) * np.log(1 - z)
        #losses.append(bce)
        #preds.append(z)
        print(bce, y, z, a, b)
        a, b = update_params(a, b, x, y, z, lr)
    return ", ".join([str(a), str(b)])
LogitRegression([1,1,1,1])
# 0.12692801104297263 1 0.8807970779778823 1 1
# 0.10135698320837692 1 0.9036104015620354 1.119 1.119 # values after 1 epoch
# 0.08437500133718023 1 0.9190865327845347 1.215 1.215
# 0.0721998635352405 1 0.9303449352007099 1.296 1.296
# 0.06305834631954188 1 0.9388886913913739 1.366 1.366
# 0.05601486564909184 1 0.9455250799418752 1.427 1.427
# 0.05042252914331105 1 0.9508275872468411 1.481 1.481
# 0.04582166273506799 1 0.9552122969502131 1.53 1.53
# 0.041959389233941616 1 0.958908721799535 1.575 1.575
# 0.03871910934525996 1 0.962020893877162 1.616 1.616
If you plot the BCE loss and the predicted y (i.e., z) over iterations, you get the following figure (as expected, BCE loss is monotonically decreasing and z is getting closer to ground truth y with increasing iterations, leading to convergence):

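For reference, a plot like that can be generated by keeping the bookkeeping lists that are commented out above; a minimal self-contained sketch (no rounding, matplotlib assumed):

```python
import numpy as np
import matplotlib.pyplot as plt

x, y, a, b, lr = 1.0, 1.0, 1.0, 1.0, 1.0
losses, preds = [], []
for _ in range(10):
    z = 1.0 / (1.0 + np.exp(-a * x - b))
    losses.append(-y * np.log(z) - (1 - y) * np.log(1 - z))
    preds.append(z)
    # gradient descent updates
    a = a - lr * x * (z - y)
    b = b - lr * (z - y)

plt.plot(losses, label='BCE loss')
plt.plot(preds, label='predicted z')
plt.xlabel('epoch')
plt.legend()
plt.grid()
plt.show()
```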
Now, if you change your update_params() to the following:
def update_params(a, b, x, y, z, lr):
    # note the flipped sign: this steps *along* the gradient (ascent on the loss)
    a = a + lr * x * (z - y)
    a = np.round(a, decimals=3)
    b = b + lr * (z - y)
    b = np.round(b, decimals=3)
    return a, b
and call LogitRegression() with the same set of inputs:
LogitRegression([1,1,1,1])
# 0.12692801104297263 1 0.8807970779778823 1 1
# 0.15845663982299638 1 0.8534599691639768 0.881 0.881 # provided in the answer
# 0.2073277757451888 1 0.8127532055353431 0.734 0.734
# 0.28883714051459425 1 0.7491341990786479 0.547 0.547
# 0.4403300268044629 1 0.6438239068707556 0.296 0.296
# 0.7549461015956136 1 0.4700359482354282 -0.06 -0.06
# 1.4479476778575628 1 0.2350521962362353 -0.59 -0.59
# 2.774416770021533 1 0.0623858513799944 -1.355 -1.355
# 4.596141947283801 1 0.010090691161759239 -2.293 -2.293
# 6.56740642634977 1 0.0014054377957286094 -3.283 -3.283
you will end up with the following figure if you plot (clearly this is wrong: the loss function increases with every epoch and z moves further away from the ground-truth y, leading to divergence):

Also, the above implementation can easily be extended to multi-dimensional data containing many data points like the following:
def VanillaLogisticRegression(x, y):  # LR without regularization
    m, n = x.shape
    w = np.zeros((n + 1, 1))
    X = np.hstack((np.ones(m)[:, None], x))  # include the feature corresponding to the bias term
    num_epochs = 1000  # number of epochs to run gradient descent, tune this hyperparameter
    lr = 0.5  # learning rate, tune this hyperparameter
    losses = []
    for _ in range(num_epochs):
        y_hat = 1. / (1. + np.exp(-np.dot(X, w)))  # predicted y by the LR model
        J = np.mean(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))  # the binary cross-entropy loss
        grad_J = np.mean((y_hat - y) * X, axis=0)  # the gradient of the loss function
        w -= lr * grad_J[:, None]  # the gradient descent step, update the parameter vector w
        losses.append(J)
        # to test correctness of the implementation:
        # loss J should monotonically decrease & y_hat should get closer to y, with increasing iterations
        # print(J)
    return w
m, n = 1000, 5 # 1000 rows, 5 columns
# randomly generate dataset, note that y can have values as 0 and 1 only
x, y = np.random.random(m*n).reshape(m,n), np.random.randint(0,2,m).reshape(-1,1)
w = VanillaLogisticRegression(x, y)
w # learnt parameters
# array([[-0.0749518 ],
# [ 0.28592107],
# [ 0.15202566],
# [-0.15020757],
# [ 0.08147078],
# [-0.18823631]])
If you plot the loss function value over iterations, you will get a plot like the following one, showing how it converges.

Finally, let's compare the above implementation with sklearn's, which by default uses the more advanced optimization algorithm lbfgs and is hence likely to converge much faster. If our implementation is correct, both of them should converge to the same global minimum, since the loss function is convex. (Note that sklearn applies regularization by default; to have almost no regularization, we need to set the hyperparameter $C$ to a very high value.)
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0, C=10**12).fit(x, y.ravel())  # y.ravel() since sklearn expects a 1-D target
print(clf.coef_, clf.intercept_)
# [[ 0.28633262 0.15256914 -0.14975667 0.08192404 -0.18780851]] [-0.07612282]
Compare the parameter values obtained from the above implementation and the one obtained with sklearn's implementation: they are almost equal.
Also, let's compare the predicted probabilities obtained from these two implementations of LR (one from scratch, the other sklearn's library function). As the following scatterplot shows, they are almost identical:
import matplotlib.pyplot as plt
X = np.hstack((np.ones(m)[:, None], x))  # rebuild the design matrix with the bias column
pred_probs = 1 / (1 + np.exp(-X @ w))
plt.scatter(pred_probs, clf.predict_proba(x)[:, 1])
plt.grid()
plt.xlabel('pred prob', size=20)
plt.ylabel('pred prob (sklearn)', size=20)
plt.show()

Finally, let's compute the accuracies obtained, they are identical too:
print(sum((pred_probs > 0.5) == y) / len(y))
# [0.527]
clf.score(x, y)
# 0.527
This also shows the correctness of the implementation.
I recently wrote a blog post that breaks down the math behind maximum likelihood estimation for logistic regression. My friends found it helpful, so I decided to spread it around. If you've found the math hard to follow in other tutorials, hopefully mine will guide you through it step by step.
Here it is: https://statisticalmusings.netlify.app/post/logistic-regression-mle-a-full-breakdown/
If you can get a firm grasp of logistic regression, you'll be well set to understand deep learning!
Maximum Likelihood
Maximum likelihood estimation involves defining a likelihood function for calculating the conditional probability of observing the data sample given a probability distribution and its parameters. This approach can be used to search a space of possible distributions and parameters.
The logistic model uses the sigmoid function (denoted by $\sigma$) to estimate the probability that a given sample $y$ belongs to class $1$ given inputs $X$ and weights $W$:
\begin{align} P(y=1 \mid X) = \sigma(W^TX) \end{align}
where the sigmoid of the activation $a_n$ for a given sample $n$ is:
\begin{align} \large y_n = \sigma(a_n) = \frac{1}{1+e^{-a_n}} \end{align}
The accuracy of our model predictions can be captured by the objective function $L$, which we are trying to maximize.
\begin{align} \large L = \displaystyle\prod_{n=1}^N y_n^{t_n}(1-y_n)^{1-t_n} \end{align}
If we take the log of the above function, we obtain the log-likelihood function, whose form enables easier calculation of partial derivatives. Taking the log and maximizing it is acceptable because the logarithm is monotonically increasing, so maximizing the log-likelihood yields the same answer as maximizing the original objective function.
\begin{align} L = \displaystyle \sum_{n=1}^N t_n\log y_n+(1-t_n)\log(1-y_n) \end{align}
In our example, we will actually convert the objective function (which we would try to maximize) into a cost function (which we try to minimize) by taking the negative log-likelihood:
\begin{align} J = -\displaystyle \sum_{n=1}^N \left[t_n\log y_n+(1-t_n)\log(1-y_n)\right] \end{align}
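A quick numerical illustration (the numbers are made up) of the relation between the log-likelihood $L$ and the cost $J$: better predictions raise $L$ and lower $J$ by exactly the same amount, since $J = -L$:

```python
import numpy as np

t = np.array([1, 0, 1, 1])           # targets t_n
y = np.array([0.9, 0.2, 0.8, 0.6])   # model outputs y_n

L = np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))   # log-likelihood
J = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))  # cost J = -L

# sharper (better) predictions: L goes up, J goes down
y_better = np.array([0.99, 0.01, 0.99, 0.99])
J_better = -np.sum(t * np.log(y_better) + (1 - t) * np.log(1 - y_better))
```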
Gradient Descent
Gradient descent is an iterative optimization algorithm, which finds the minimum of a differentiable function. In this process, we try different values and update them to reach the optimal ones, minimizing the output.
Once we have an objective function, we can generally take its derivative with respect to the parameters (weights), set it equal to zero, and solve for the parameters to obtain the ideal solution. However, in the case of logistic regression (and many other complex or otherwise non-linear systems), this analytical method doesn't work. Instead, we resort to a method known as gradient descent, whereby we randomly initialize and then incrementally update our weights by calculating the slope of our objective function. When applying the cost function, we want to continue updating our weights until the slope of the gradient gets as close to zero as possible. We can show this mathematically:
\begin{align} w := w+\Delta w \end{align}
where the second term on the right is the negative of the learning rate times the derivative of the cost function with respect to the weights (which is our gradient):
\begin{align} \Delta w = -\eta\,\nabla J(w) \end{align}
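A toy sketch (example of mine, on a simple 1-D convex cost rather than logistic regression) of this update rule:

```python
# minimize the toy convex cost J(w) = (w - 3)^2 with gradient descent
w, eta = 0.0, 0.1
for _ in range(100):
    grad = 2 * (w - 3)        # dJ/dw
    w = w + (-eta * grad)     # w := w + Δw, with Δw = -η∇J(w)
# w ends up very close to the minimizer w* = 3
```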
Thus, we want to take the derivative of the cost function with respect to the weight, which, using the chain rule, gives us:
\begin{align} \frac{\partial J}{\partial w_i} = \displaystyle \sum_{n=1}^N \frac{\partial J}{\partial y_n}\frac{\partial y_n}{\partial a_n}\frac{\partial a_n}{\partial w_i} \end{align}
Thus, we are looking to obtain three different derivatives. Let us start by solving for the derivative of the cost function with respect to $y$:
\begin{align} \frac{\partial J}{\partial y_n} = -\left(t_n \frac{1}{y_n} + (1-t_n) \frac{1}{1-y_n}(-1)\right) = -\left(\frac{t_n}{y_n} - \frac{1-t_n}{1-y_n}\right) \end{align}
Next, let us solve for the derivative of $y$ with respect to our activation function:
\begin{align} \large y_n = \sigma(a_n) = \frac{1}{1+e^{-a_n}} \end{align}
\begin{align} \frac{\partial y_n}{\partial a_n} = \frac{-1}{(1+e^{-a_n})^2}(e^{-a_n})(-1) = \frac{e^{-a_n}}{(1+e^{-a_n})^2} = \frac{1}{1+e^{-a_n}} \cdot \frac{e^{-a_n}}{1+e^{-a_n}} \end{align}
\begin{align} \frac{\partial y_n}{\partial a_n} = y_n(1-y_n) \end{align}
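This identity is easy to check numerically, e.g. against a central finite difference (the point and tolerance are my choice):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = 0.7
y = sigmoid(a)
analytic = y * (1 - y)  # the derivative y_n (1 - y_n) derived above

eps = 1e-6
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
# analytic and numeric agree to high precision
```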
And lastly, we solve for the derivative of the activation function with respect to the weights:
\begin{align} \ a_n = W^TX_n \end{align}
\begin{align} a_n = w_0x_{n0} + w_1x_{n1} + w_2x_{n2} + \cdots + w_Dx_{nD} \end{align}
where $D$ is the number of features and $x_{n0}=1$ corresponds to the bias.
\begin{align} \frac{\partial a_n}{\partial w_i} = x_{ni} \end{align}
Now we can put it all together and simplify.
\begin{align} \frac{\partial J}{\partial w_i} = - \displaystyle\sum_{n=1}^N\frac{t_n}{y_n}y_n(1-y_n)x_{ni}-\frac{1-t_n}{1-y_n}y_n(1-y_n)x_{ni} \end{align}
\begin{align} = - \displaystyle\sum_{n=1}^Nt_n(1-y_n)x_{ni}-(1-t_n)y_nx_{ni} \end{align}
\begin{align} = - \displaystyle\sum_{n=1}^N[t_n-t_ny_n-y_n+t_ny_n]x_{ni} \end{align}
\begin{align} \frac{\partial J}{\partial w_i} = \displaystyle\sum_{n=1}^N(y_n-t_n)x_{ni} \end{align}
or, collecting all components into vector form,
\begin{align} \frac{\partial J}{\partial w} = \displaystyle\sum_{n=1}^{N}(y_n-t_n)x_n \end{align}
We can get rid of the summation above by applying the principle that a dot product between two vectors is a sum over the shared index. That is:
\begin{align} \ a^Tb = \displaystyle\sum_{n=1}^Na_nb_n \end{align}
Therefore, the gradient with respect to w is:
\begin{align} \frac{\partial J}{\partial w} = X^T(Y-T) \end{align}
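A small numpy check (random data, shapes of my choosing) that the vectorized form $X^T(Y-T)$ matches the component-wise sums:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 8, 3
X = rng.random((N, D))                        # N samples, D features
T = rng.integers(0, 2, (N, 1)).astype(float)  # targets t_n
W = rng.normal(size=(D, 1))
Y = 1.0 / (1.0 + np.exp(-X @ W))              # predictions y_n

grad_vec = X.T @ (Y - T)  # vectorized gradient X^T (Y - T)

# component-wise form: dJ/dw_i = sum_n (y_n - t_n) x_{ni}
grad_sum = np.array([[np.sum((Y[:, 0] - T[:, 0]) * X[:, i])] for i in range(D)])
# grad_vec and grad_sum are equal
```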
If you are asking yourself where the bias term $w_0$ of our equation went: we compute its gradient the same way, except that the corresponding input $x_{n0}$ is fixed at $1$.
In short, maximum likelihood estimation is used to find the parameters given target values $y$ and inputs $x$: it finds the parameters that maximise the probability of $y$ given $x$. It has been shown that, in the case of binary classification, the MLE problem can be solved by finding the parameters that give the least cross-entropy.
Gradient descent is an optimisation algorithm that helps you update the parameters iteratively, to find the parameters which give the highest probability of $y$.
For more details: https://www.google.com/amp/s/glassboxmedicine.com/2019/12/07/connections-log-likelihood-cross-entropy-kl-divergence-logistic-regression-and-neural-networks/amp/