I think your code is a bit too complicated and it needs more structure, because otherwise you'll be lost in all equations and operations. In the end this regression boils down to four operations:
- Calculate the hypothesis h = X * theta
- Calculate the loss = h - y and maybe the squared cost (loss^2)/2m
- Calculate the gradient = X' * loss / m
- Update the parameters theta = theta - alpha * gradient
In your case, I guess you have confused m with n. Here m denotes the number of examples in your training set, not the number of features.
Let's have a look at my variation of your code:
import numpy as np
import random
# m denotes the number of examples here, not the number of features
def gradientDescent(x, y, theta, alpha, m, numIterations):
xTrans = x.transpose()
for i in range(0, numIterations):
hypothesis = np.dot(x, theta)
loss = hypothesis - y
# avg cost per example (the 2 in 2*m doesn't really matter here.
# But to be consistent with the gradient, I include it)
cost = np.sum(loss ** 2) / (2 * m)
print("Iteration %d | Cost: %f" % (i, cost))
# avg gradient per example
gradient = np.dot(xTrans, loss) / m
# update
theta = theta - alpha * gradient
return theta
def genData(numPoints, bias, variance):
x = np.zeros(shape=(numPoints, 2))
y = np.zeros(shape=numPoints)
# basically a straight line
for i in range(0, numPoints):
# bias feature
x[i][0] = 1
x[i][1] = i
# our target variable
y[i] = (i + bias) + random.uniform(0, 1) * variance
return x, y
# gen 100 points with a bias of 25 and 10 variance as a bit of noise
x, y = genData(100, 25, 10)
m, n = np.shape(x)
numIterations= 100000
alpha = 0.0005
theta = np.ones(n)
theta = gradientDescent(x, y, theta, alpha, m, numIterations)
print(theta)
At first I create a small random dataset which should look like this:

As you can see I also added the generated regression line and formula that was calculated by excel.
You need to take care about the intuition of the regression using gradient descent. As you do a complete batch pass over your data X, you need to reduce the m-losses of every example to a single weight update. In this case, this is the average of the sum over the gradients, thus the division by m.
The next thing you need to take care about is to track the convergence and adjust the learning rate. For that matter you should always track your cost every iteration, maybe even plot it.
If you run my example, the theta returned will look like this:
Iteration 99997 | Cost: 47883.706462
Iteration 99998 | Cost: 47883.706462
Iteration 99999 | Cost: 47883.706462
[ 29.25567368 1.01108458]
Which is actually quite close to the equation that was calculated by excel (y = x + 30). Note that as we passed the bias into the first column, the first theta value denotes the bias weight.
Answer from Thomas Jungblut on Stack OverflowVideos
I think your code is a bit too complicated and it needs more structure, because otherwise you'll be lost in all equations and operations. In the end this regression boils down to four operations:
- Calculate the hypothesis h = X * theta
- Calculate the loss = h - y and maybe the squared cost (loss^2)/2m
- Calculate the gradient = X' * loss / m
- Update the parameters theta = theta - alpha * gradient
In your case, I guess you have confused m with n. Here m denotes the number of examples in your training set, not the number of features.
Let's have a look at my variation of your code:
import numpy as np
import random
# m denotes the number of examples here, not the number of features
def gradientDescent(x, y, theta, alpha, m, numIterations):
xTrans = x.transpose()
for i in range(0, numIterations):
hypothesis = np.dot(x, theta)
loss = hypothesis - y
# avg cost per example (the 2 in 2*m doesn't really matter here.
# But to be consistent with the gradient, I include it)
cost = np.sum(loss ** 2) / (2 * m)
print("Iteration %d | Cost: %f" % (i, cost))
# avg gradient per example
gradient = np.dot(xTrans, loss) / m
# update
theta = theta - alpha * gradient
return theta
def genData(numPoints, bias, variance):
x = np.zeros(shape=(numPoints, 2))
y = np.zeros(shape=numPoints)
# basically a straight line
for i in range(0, numPoints):
# bias feature
x[i][0] = 1
x[i][1] = i
# our target variable
y[i] = (i + bias) + random.uniform(0, 1) * variance
return x, y
# gen 100 points with a bias of 25 and 10 variance as a bit of noise
x, y = genData(100, 25, 10)
m, n = np.shape(x)
numIterations= 100000
alpha = 0.0005
theta = np.ones(n)
theta = gradientDescent(x, y, theta, alpha, m, numIterations)
print(theta)
At first I create a small random dataset which should look like this:

As you can see I also added the generated regression line and formula that was calculated by excel.
You need to take care about the intuition of the regression using gradient descent. As you do a complete batch pass over your data X, you need to reduce the m-losses of every example to a single weight update. In this case, this is the average of the sum over the gradients, thus the division by m.
The next thing you need to take care about is to track the convergence and adjust the learning rate. For that matter you should always track your cost every iteration, maybe even plot it.
If you run my example, the theta returned will look like this:
Iteration 99997 | Cost: 47883.706462
Iteration 99998 | Cost: 47883.706462
Iteration 99999 | Cost: 47883.706462
[ 29.25567368 1.01108458]
Which is actually quite close to the equation that was calculated by excel (y = x + 30). Note that as we passed the bias into the first column, the first theta value denotes the bias weight.
Below you can find my implementation of gradient descent for linear regression problem.
At first, you calculate gradient like X.T * (X * w - y) / N and update your current theta with this gradient simultaneously.
- X: feature matrix
- y: target values
- w: weights/values
- N: size of training set
Here is the python code:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import random
def generateSample(N, variance=100):
X = np.matrix(range(N)).T + 1
Y = np.matrix([random.random() * variance + i * 10 + 900 for i in range(len(X))]).T
return X, Y
def fitModel_gradient(x, y):
N = len(x)
w = np.zeros((x.shape[1], 1))
eta = 0.0001
maxIteration = 100000
for i in range(maxIteration):
error = x * w - y
gradient = x.T * error / N
w = w - eta * gradient
return w
def plotModel(x, y, w):
plt.plot(x[:,1], y, "x")
plt.plot(x[:,1], x * w, "r-")
plt.show()
def test(N, variance, modelFunction):
X, Y = generateSample(N, variance)
X = np.hstack([np.matrix(np.ones(len(X))).T, X])
w = modelFunction(X, Y)
plotModel(X, Y, w)
test(50, 600, fitModel_gradient)
test(50, 1000, fitModel_gradient)
test(100, 200, fitModel_gradient)

But I can't understand when optimization happens. When gradient descent happens and most importantly, What is the relation with the rounded bucket example?
For all machine learning problems, you have a loss function. The loss is higher the farther you are away from a desirable solution. For example, in a classification problem you can calculate the error of you current classifier. You could take the error as a simple loss functions. The more errors your classifier makes, the worse it is.
Now your models have parameters. Lets call those "weights" w. If you have n of those, you can write w \in R^n.
For each set of weights w, you can assign it an error. If n=2, you can plot a graph for this error function. It could look like this:

Each position at the x-y- plane is one set of parameters. The point in the z direction is the error. You want to minimize the error. Hence your optimization problem is a minimization problem. You want to go down in that bowl. You don't know it is a bowl, this is just a visualiztion. But by looking at the gradient, you can calculate which direction will reduce the error. Hence gradient descent. Reducing the error by optimizing the weights.
Usually, you don't have n=2, but rather n=100 * 10^6 or something similar.
Alec Redford made a couple of great visualizations for this process for different kinds of gradient descent:

Source
For classical neural networks you have two steps:
- Feeding inputs through the network
- Backpropagation of the error and correction of the weights (synapses) The second one is where gradient descent is used.
This is the example from your link http://iamtrask.github.io/2015/07/27/python-network-part2/
import numpy as np
X = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
y = np.array([[0,1,1,0]]).T
alpha,hidden_dim = (0.5,4)
synapse_0 = 2*np.random.random((3,hidden_dim)) - 1
synapse_1 = 2*np.random.random((hidden_dim,1)) - 1
for j in xrange(60000):
layer_1 = 1/(1+np.exp(-(np.dot(X,synapse_0))))
layer_2 = 1/(1+np.exp(-(np.dot(layer_1,synapse_1))))
layer_2_delta = (layer_2 - y)*(layer_2*(1-layer_2))
layer_1_delta = layer_2_delta.dot(synapse_1.T) * (layer_1 * (1-layer_1))
synapse_1 -= (alpha * layer_1.T.dot(layer_2_delta))
synapse_0 -= (alpha * X.T.dot(layer_1_delta))
In the forward step you apply f(x)=1/(1+exp(-x)) (activation function) to the weighted sum of the inputs (dot-product aka scalar product is a short form for that) to a neuron's state.
The gradient descent is hidden in the backpropagation in the line where you calc. the layer_x_delta:
layer_2*(1-layer_2)is the derivation (also known as gradient) of thefabove at positionlayer_2. So the learning delta is essentially following this gradient in the right direction.- In the
layer_1_deltayou take the calculated delta from the second layer, pull it backwards in a linear way withnp.dot(again just weighted sum) and then take the direction of the gradient as above withx(1-x) - Then one changes the weights according to the delta (error) in the target neuron and the activation of the source neuron. (
np.dot(layer_1, delta_layer_2)). alpha is just a learning rate (usually0 < alpha < 1) to avoid overcorrection.
I hope you can get something out of this answer!