I think your code is a bit too complicated and needs more structure, because otherwise you'll get lost in all the equations and operations. In the end this regression boils down to four operations:

  1. Calculate the hypothesis h = X * theta
  2. Calculate the loss = h - y and maybe the squared cost (loss^2)/2m
  3. Calculate the gradient = X' * loss / m
  4. Update the parameters theta = theta - alpha * gradient
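
These four operations can be written down as one short NumPy function. This is a minimal sketch (the function name `gradient_step` and the tiny dataset are mine, not from the original code):

```python
import numpy as np

def gradient_step(X, y, theta, alpha):
    """One batch gradient-descent step for linear regression.

    X: (m, n) design matrix, y: (m,) targets, theta: (n,) parameters.
    Returns the updated theta and the current cost.
    """
    m = len(y)
    h = X.dot(theta)                      # 1. hypothesis
    loss = h - y                          # 2. loss per example
    cost = np.sum(loss ** 2) / (2 * m)    #    squared cost (for tracking)
    gradient = X.T.dot(loss) / m          # 3. average gradient
    theta = theta - alpha * gradient      # 4. parameter update
    return theta, cost

# tiny check: on y = 2*x the cost should shrink step by step
X = np.c_[np.ones(5), np.arange(5.0)]
y = 2.0 * np.arange(5.0)
theta = np.zeros(2)
theta, c0 = gradient_step(X, y, theta, 0.01)
theta, c1 = gradient_step(X, y, theta, 0.01)
```

Running a few steps and watching the cost fall is the quickest way to convince yourself the four operations are wired up correctly.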

In your case, I guess you have confused m with n. Here m denotes the number of examples in your training set, not the number of features.

Let's have a look at my variation of your code:

import numpy as np
import random

# m denotes the number of examples here, not the number of features
def gradientDescent(x, y, theta, alpha, m, numIterations):
    xTrans = x.transpose()
    for i in range(0, numIterations):
        hypothesis = np.dot(x, theta)
        loss = hypothesis - y
        # avg cost per example (the 2 in 2*m doesn't really matter here.
        # But to be consistent with the gradient, I include it)
        cost = np.sum(loss ** 2) / (2 * m)
        print("Iteration %d | Cost: %f" % (i, cost))
        # avg gradient per example
        gradient = np.dot(xTrans, loss) / m
        # update
        theta = theta - alpha * gradient
    return theta


def genData(numPoints, bias, variance):
    x = np.zeros(shape=(numPoints, 2))
    y = np.zeros(shape=numPoints)
    # basically a straight line
    for i in range(0, numPoints):
        # bias feature
        x[i][0] = 1
        x[i][1] = i
        # our target variable
        y[i] = (i + bias) + random.uniform(0, 1) * variance
    return x, y

# gen 100 points with a bias of 25 and 10 variance as a bit of noise
x, y = genData(100, 25, 10)
m, n = np.shape(x)
numIterations = 100000
alpha = 0.0005
theta = np.ones(n)
theta = gradientDescent(x, y, theta, alpha, m, numIterations)
print(theta)

First, I create a small random dataset. (The original answer shows a scatter plot of it here, together with the regression line and formula that Excel calculated.)

You need to keep the intuition of regression via gradient descent in mind. As you do a complete batch pass over your data X, you need to reduce the m losses of all examples to a single weight update. In this case, that is the average of the summed gradients, hence the division by m.
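
The vectorized `np.dot(xTrans, loss) / m` in the code above is exactly the mean of the m per-example gradients. A small sketch (variable names are mine) makes this explicit:

```python
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
theta = np.array([0.5, 0.5])
y = np.array([0.0, 1.0, 2.0])
m = len(y)

loss = X.dot(theta) - y

# one gradient per example, then reduced to a single update by averaging
per_example = np.array([loss[i] * X[i] for i in range(m)])
gradient_loop = per_example.mean(axis=0)

# the vectorized form used in the answer above
gradient_vec = X.T.dot(loss) / m

print(np.allclose(gradient_loop, gradient_vec))  # True
```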

The next thing to take care of is tracking convergence and adjusting the learning rate. You should track your cost on every iteration, and maybe even plot it.
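
A lightweight way to do that (a hypothetical helper of my own, not part of the original answer) is to record the cost each iteration and stop early once the improvement drops below a tolerance:

```python
import numpy as np

def gradient_descent_tracked(X, y, theta, alpha, max_iter=10000, tol=1e-9):
    """Batch gradient descent that records the cost history and
    stops early once the cost improvement falls below tol."""
    m = len(y)
    costs = []
    for _ in range(max_iter):
        loss = X.dot(theta) - y
        costs.append(np.sum(loss ** 2) / (2 * m))
        theta = theta - alpha * X.T.dot(loss) / m
        if len(costs) > 1 and costs[-2] - costs[-1] < tol:
            break
    return theta, costs

X = np.c_[np.ones(50), np.arange(50.0)]
y = 3.0 + 2.0 * np.arange(50.0)
theta, costs = gradient_descent_tracked(X, y, np.zeros(2), 0.0005)
# inspect convergence with e.g. matplotlib: plt.plot(costs)
```

If the cost ever increases from one iteration to the next, the learning rate is too large for this data and should be reduced.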

If you run my example, the theta returned will look like this:

Iteration 99997 | Cost: 47883.706462
Iteration 99998 | Cost: 47883.706462
Iteration 99999 | Cost: 47883.706462
[ 29.25567368   1.01108458]

This is actually quite close to the equation that Excel calculated (y = x + 30). Note that since we passed the bias in as the first column, the first theta value denotes the bias weight.
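
One way to sanity-check the result (my addition, not part of the original answer) is to compare against the closed-form least-squares solution, which gradient descent should approach on a problem of this shape:

```python
import numpy as np

# same shape of problem as genData above: column of ones plus a linear feature
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), np.arange(100.0)]
y = np.arange(100.0) + 25 + rng.uniform(0, 1, size=100) * 10

# closed-form solution via least squares
theta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)

# gradient descent with the same hyperparameters as in the answer
theta = np.ones(2)
for _ in range(100000):
    loss = X.dot(theta) - y
    theta -= 0.0005 * X.T.dot(loss) / 100

print(theta, theta_exact)  # the two should agree closely
```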

Answer from Thomas Jungblut on Stack Overflow

Answer 2 of 6 (score 12):

Below you can find my implementation of gradient descent for the linear regression problem.

At each iteration, you calculate the gradient as X.T * (X * w - y) / N and update all components of your current weight vector w with it simultaneously.

  • X: feature matrix
  • y: target values
  • w: weight vector (the parameters being fitted)
  • N: size of training set

Here is the Python code:

import numpy as np
from matplotlib import pyplot as plt
import random

def generateSample(N, variance=100):
    X = np.matrix(range(N)).T + 1
    Y = np.matrix([random.random() * variance + i * 10 + 900 for i in range(len(X))]).T
    return X, Y

def fitModel_gradient(x, y):
    N = len(x)
    w = np.zeros((x.shape[1], 1))
    eta = 0.0001

    maxIteration = 100000
    for i in range(maxIteration):
        error = x * w - y
        gradient = x.T * error / N
        w = w - eta * gradient
    return w

def plotModel(x, y, w):
    plt.plot(x[:,1], y, "x")
    plt.plot(x[:,1], x * w, "r-")
    plt.show()

def test(N, variance, modelFunction):
    X, Y = generateSample(N, variance)
    X = np.hstack([np.matrix(np.ones(len(X))).T, X])
    w = modelFunction(X, Y)
    plotModel(X, Y, w)


test(50, 600, fitModel_gradient)
test(50, 1000, fitModel_gradient)
test(100, 200, fitModel_gradient)
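
Note that `np.matrix` has been deprecated in NumPy for a while now. The same algorithm with plain `ndarray`s would be a sketch like this (my translation; the name `fit_model_gradient` and the data-generation details are mine):

```python
import numpy as np

def fit_model_gradient(X, y, eta=0.0001, max_iteration=100000):
    """Batch gradient descent, same update rule as above, with ndarrays."""
    N = len(X)
    w = np.zeros(X.shape[1])
    for _ in range(max_iteration):
        error = X.dot(w) - y            # X * w - y
        gradient = X.T.dot(error) / N   # X.T * error / N
        w -= eta * gradient
    return w

# sample data analogous to generateSample: y grows linearly with noise
rng = np.random.default_rng(1)
N = 50
x = np.arange(1, N + 1, dtype=float)
y = rng.random(N) * 100 + np.arange(N) * 10 + 900
X = np.c_[np.ones(N), x]                # prepend the bias column
w = fit_model_gradient(X, y)
```

The matrix `*` operator becomes an explicit `dot`, which is the main translation step; everything else carries over unchanged.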

Top answer (score 3) to a related Stack Overflow question:

But I can't understand when the optimization happens, when gradient descent happens, and most importantly, what the relation is with the rounded bucket example.

For all machine learning problems, you have a loss function. The loss is higher the farther you are from a desirable solution. For example, in a classification problem you can calculate the error of your current classifier. You could take that error as a simple loss function: the more errors your classifier makes, the worse it is.

Now your model has parameters. Let's call those "weights" w. If you have n of them, you can write w \in R^n.

For each set of weights w, you can assign an error. If n = 2, you can plot a graph of this error function. (The original answer shows a 3-D plot of a bowl-shaped error surface here.)

Each position in the x-y plane is one set of parameters, and the value in the z direction is the error. You want to minimize the error, so your optimization problem is a minimization problem: you want to go down into that bowl. You don't know it is a bowl; that is just a visualization. But by looking at the gradient, you can calculate which direction will reduce the error. Hence gradient descent: reducing the error by optimizing the weights.
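
To make that picture concrete, here is a toy sketch (my example, not from the answer) that walks down a two-dimensional bowl E(w) = w1^2 + w2^2 by following the negative gradient:

```python
import numpy as np

def grad(w):
    # gradient of the bowl-shaped error E(w) = w1^2 + w2^2
    return 2 * w

w = np.array([3.0, -4.0])    # start somewhere up on the bowl's rim
eta = 0.1                    # learning rate
for _ in range(100):
    w = w - eta * grad(w)    # step in the direction that reduces the error

print(w)  # both weights end up very close to the minimum at (0, 0)
```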

Usually, you don't have n=2, but rather n=100 * 10^6 or something similar.

Alec Radford made a couple of great visualizations of this process for different kinds of gradient descent (linked as "Source" in the original answer).

Answer 2 of 4 (score 2):

For classical neural networks you have two steps:

  • Feeding inputs through the network
  • Backpropagation of the error and correction of the weights (synapses)

The second one is where gradient descent is used.

This is the example from your link http://iamtrask.github.io/2015/07/27/python-network-part2/

import numpy as np
X = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
y = np.array([[0,1,1,0]]).T
alpha,hidden_dim = (0.5,4)
synapse_0 = 2*np.random.random((3,hidden_dim)) - 1
synapse_1 = 2*np.random.random((hidden_dim,1)) - 1
for j in range(60000):  # xrange in the original Python 2 code
    layer_1 = 1/(1+np.exp(-(np.dot(X,synapse_0))))
    layer_2 = 1/(1+np.exp(-(np.dot(layer_1,synapse_1))))
    layer_2_delta = (layer_2 - y)*(layer_2*(1-layer_2))
    layer_1_delta = layer_2_delta.dot(synapse_1.T) * (layer_1 * (1-layer_1))
    synapse_1 -= (alpha * layer_1.T.dot(layer_2_delta))
    synapse_0 -= (alpha * X.T.dot(layer_1_delta))

In the forward step you apply the activation function f(x) = 1/(1+exp(-x)) to the weighted sum of each neuron's inputs (the dot product, a.k.a. scalar product, is shorthand for that weighted sum) to get the neuron's state.

Gradient descent is hidden in the backpropagation, in the lines where you calculate the layer_x_delta values:

  • layer_2*(1-layer_2) is the derivative of the f above, evaluated at layer_2. So the learning delta essentially follows this gradient in the right direction.
  • In layer_1_delta you take the calculated delta from the second layer, pull it backwards linearly with np.dot (again just a weighted sum), and then take the direction of the gradient as above with x*(1-x).
  • Then the weights are changed according to the delta (error) at the target neuron and the activation of the source neuron (np.dot(layer_1.T, layer_2_delta)). alpha is just a learning rate (usually 0 < alpha < 1) to avoid overcorrection.
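
The claim that layer_2*(1-layer_2) is the derivative of f can be checked numerically; a small sketch (my addition):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-3, 3, 7)
s = sigmoid(x)

# analytic derivative expressed in terms of the output, as in the code above
analytic = s * (1 - s)

# central finite-difference approximation of the same derivative
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

print(np.allclose(analytic, numeric, atol=1e-8))  # True
```

Expressing the derivative in terms of the already-computed output s is exactly why the backpropagation code never needs to recompute the exponential.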

I hope you can get something out of this answer!