You are misunderstanding SVM. The sub-gradient descent algorithm is simply a method for solving the optimization problem that underlies the SVM.
The SVM is always a linear classifier, but by using kernels it can operate in a higher-dimensional feature space. The separating hyperplane (linear!) computed in feature space (kernel!) therefore appears non-linear in the input space.
Effectively you are thereby solving a non-linear classification task, but you do so by projecting into a higher-dimensional feature space where the task can be solved by a linear classifier.
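The "linear in feature space, non-linear in input space" point can be sketched numerically. This is a minimal illustration (the feature map, kernel, and variable names are mine, not from the answer above): the polynomial kernel $k(x, z) = (x \cdot z)^2$ equals an ordinary inner product in an explicit higher-dimensional feature space, so a hyperplane in that space induces a non-linear boundary in the input space.

```julia
using LinearAlgebra

# Explicit feature map for 2D inputs: φ(x) = (x₁², √2·x₁x₂, x₂²)
φ(x) = [x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2];

# Polynomial kernel: computes the same inner product without forming φ
k(x, z) = dot(x, z)^2;

x = [1.0, 2.0];
z = [3.0, -1.0];

# Both sides agree: the kernel evaluates the feature-space inner product
println(k(x, z))          # → 1.0
println(dot(φ(x), φ(z)))  # → 1.0
```

A linear classifier on `φ(x)` is exactly what the kernel SVM computes, which is why its decision boundary looks curved in the original coordinates.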
Please read:
Wikipedia on svm
Please watch:
great lecture about SVM
Answer from Nikolas Rieble on Stack Exchange.
The gradient in this case is computed with calculus (analytically, NOT numerically!). So we differentiate the loss function with respect to $w_{y_i}$ like this:

$$ \nabla_{w_{y_i}} L_i = -\left[\sum_{j \ne y_i} \mathbf{I}(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0)\right] x_i $$

and with respect to $w_j$ when $j \ne y_i$:

$$ \nabla_{w_j} L_i = \mathbf{I}(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0)\, x_i $$

The $\mathbf{I}(\cdot)$ is just the indicator function, so each term contributes only when its condition is true. When you write this in code, the example you provided is the answer.
Since you are using the cs231n example, you should definitely check the course notes and videos if needed.
Hope this helps!
If the subtraction is less than zero, the loss is zero, so the gradient of $W$ is also zero. If the subtraction is larger than zero, then the gradient of $W$ is the partial derivative of the loss.
Let's start with basics. The so-called gradient is just the ordinary derivative, that is, slope. For example, slope of the linear function $y=kx+b$ equals $k$, so its gradient w.r.t. $x$ equals $k$. If $x$ and $k$ are not numbers, but vectors, then the gradient is also a vector.
Another piece of good news is that the gradient is a linear operator: you can add functions and multiply by constants before or after differentiation, and it makes no difference.
Now take the definition of SVM loss function for a single $i$-th observation. It is
$\mathrm{loss} = \mathrm{max}(0, \mathrm{something} - w_y*x)$
where $\mathrm{something}=wx+\Delta$. Thus, loss equals $\mathrm{something}-w_y*x$, if the latter is non-negative, and $0$ otherwise.
In the first (non-negative) case the loss $\mathrm{something}-w_y*x$ is linear in $w_y$, so the gradient is just the slope of this function of $w_y$, that is, $-x$.
In the second (negative) case the loss $0$ is constant, so its derivative is also $0$.
To write all this cases in one equation, we invent a function (it is called indicator) $I(x)$, which equals $1$ if $x$ is true, and $0$ otherwise. With this function, we can write
$\mathrm{derivative} = I(\mathrm{something} - w_y*x > 0) * (-x)$
If $\mathrm{something} - w_y*x > 0$, the first multiplier equals 1, and the gradient equals $-x$. Otherwise, the first multiplier equals 0, and so does the gradient. So I just rewrote the two cases in a single line.
Now let's turn from a single $i$-th observation to the whole loss. The loss is sum of individual losses. Thus, because differentiation is linear, the gradient of a sum equals sum of gradients, so we can write
$\text{total derivative} = \sum_i \left( I(\mathrm{something} - w_y*x_i > 0) * (-x_i) \right)$
Now, move the $-$ multiplier from $x_i$ to the beginning of the formula, and you will get your expression.
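The case split above can be sketched as a few lines of Julia. This is a minimal sketch (the function and variable names are mine): the subgradient with respect to $w_y$ is $-x$ when the hinge is active and the zero vector otherwise.

```julia
using LinearAlgebra

# Per-observation hinge subgradient w.r.t. w_y, as derived above:
# loss = max(0, something - w_y ⋅ x)  →  gradient = I(something - w_y ⋅ x > 0) * (-x)
function hinge_subgrad(w_y::Vector{Float64}, x::Vector{Float64}, something::Float64)
    margin = something - dot(w_y, x)
    # Indicator: -x when the hinge is active, the zero vector otherwise
    return margin > 0 ? -x : zero(x)
end

w_y = [1.0, 0.5];
x   = [2.0, 2.0];

println(hinge_subgrad(w_y, x, 5.0))  # 5 - 3 = 2 > 0  → [-2.0, -2.0]
println(hinge_subgrad(w_y, x, 1.0))  # 1 - 3 = -2 ≤ 0 → [0.0, 0.0]
```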
David has provided a good answer. But I would point out that the sum() in David's answer:
total_derivative = sum(I(something - w_y*x[i] > 0) * (-x[i]))
is different from the one in Nikhil's original question:
$$ \def\w{{\mathbf w}} \nabla_{\w_{y_i}} L_i=-\left[\sum_{j\ne y_i} \mathbf{I}( \w_j^T x_i - \w_{y_i}^T x_i + \Delta >0) \right] x_i $$ The above equation is still the gradient due to the i-th observation, but for the weight of the ground truth class, i.e. $w_{y_i}$. There is the summation $\sum_{j \ne y_i}$, because $w_{y_i}$ is in every term of the SVM loss $L_i$:
$$ \def\w{{\mathbf w}} L_i = \sum_{j \ne y_i} \max (0, \w_j^T x_i - \w_{y_i}^T x_i + \Delta) $$ For every non-zero term, i.e. $w^T_j x_i - w^T_{y_i} x_i + \Delta > 0$, you obtain the gradient $-x_i$. In total, the gradient $\nabla_{w_{y_i}} L_i$ is $\text{numOfNonZeroTerms} \times (-x_i)$, the same as the equation above.
Gradients of individual observations $\nabla L_i$ (computed above) are then averaged to obtain the gradient of the batch of observations $\nabla L$.
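The counting argument above can be sketched in a few lines of Julia (a minimal sketch; the function and variable names are mine, and `W` holds one class-weight vector per column): count the classes $j \ne y_i$ whose margin is violated, then scale $-x_i$ by that count.

```julia
using LinearAlgebra

# ∇_{w_{y_i}} L_i for the multiclass SVM loss: numOfNonZeroTerms × (-x_i)
function grad_wyi(W::Matrix{Float64}, x::Vector{Float64}, yi::Int, Δ::Float64)
    scores = W' * x   # one score per class (columns of W are class weights)
    # Count the classes j ≠ yi whose hinge term is active
    numViolated = count(j -> j != yi && scores[j] - scores[yi] + Δ > 0, axes(W, 2))
    return -numViolated * x
end

W = [1.0 0.0 0.5;
     0.0 1.0 0.5];    # 2 features × 3 classes
x = [1.0, 2.0];

# Classes 2 and 3 both violate the margin for yi = 1, so the gradient is -2x
println(grad_wyi(W, x, 1, 1.0))  # → [-2.0, -4.0]
```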
The objective function for the Primal Kernel SVM (No bias term):
$$ \arg \min_{\boldsymbol{\beta}} \frac{\lambda}{2} \boldsymbol{\beta}^{\top} \boldsymbol{K} \boldsymbol{\beta} + \frac{1}{n} \sum_{i = 1}^{n} \max \left\{ 0, 1 - {y}_{i} \boldsymbol{k}_{i}^{\top} \boldsymbol{\beta} \right\} $$
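For reference, the objective above can be evaluated directly. This is a sketch under my own naming (the function name and the toy Gram matrix are mine): given the Gram matrix $\boldsymbol{K}$, labels $\boldsymbol{y}$, and coefficients $\boldsymbol{\beta}$, it sums the regularizer $\frac{\lambda}{2}\boldsymbol{\beta}^\top \boldsymbol{K} \boldsymbol{\beta}$ and the mean hinge loss.

```julia
using LinearAlgebra

# Evaluate the primal kernel SVM objective for given K, y, β, λ
function kernelSvmObjective(K::Matrix{Float64}, y::Vector{Float64}, β::Vector{Float64}, λ::Float64)
    n = length(y)
    reg   = (λ / 2) * dot(β, K * β)                                  # (λ/2) βᵀKβ
    hinge = sum(max(0.0, 1.0 - y[i] * dot(K[:, i], β)) for i in 1:n) / n
    return reg + hinge
end

K = [1.0 0.2;
     0.2 1.0];            # toy 2×2 Gram matrix
y = [1.0, -1.0];
β = [0.5, -0.5];

println(kernelSvmObjective(K, y, β, 0.1))  # → 0.62
```

Such a function is handy for checking that a solver's iterates actually decrease the objective.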
The simplest formulation of the Kernel Pegasos algorithm, in my opinion, is in Shai Shalev Shwartz, Shai Ben David - Understanding Machine Learning: From Theory to Algorithms, a book by the same author as the paper Pegasos: Primal Estimated Sub GrAdient SOlver for SVM.
For the Kernel SVM, see the algorithm box on page 223 (SGD for Solving Soft SVM with Kernels).
My implementation (Kernel SVM):
```julia
function PegasosKernelSVM!( mX :: Matrix{T}, hKₖ :: Function, vY :: Vector{T}, λ :: T, vαₜ :: Vector{T}, vβₜ :: Vector{T} ) where {T <: AbstractFloat}
    # Following Shai Shalev Shwartz, Shai Ben David - Understanding Machine Learning: From Theory to Algorithms.
    # See page 223 - SGD for Solving Soft SVM with Kernels.
    # `hKₖ(vZ, ii)` - Returns the dot product of the `ii`-th data sample with `vZ`.

    numSamples    = length(vY);
    dataDim       = size(mX, 1);
    numIterations = size(mX, 2);

    # First iteration
    tt = 1;
    ηₜ = inv(λ * tt);
    vαₜ .= ηₜ .* vβₜ;
    kk = rand(1:numSamples);
    yₖ = vY[kk];
    valSum = yₖ * hKₖ(vαₜ, kk);
    vβₜ[kk] += (valSum < one(T)) * yₖ;
    @views mX[:, 1] .= vαₜ;

    for ii in 2:numIterations
        tt = ii;
        ηₜ = inv(λ * tt);
        vαₜ .= ηₜ .* vβₜ;
        kk = rand(1:numSamples);
        yₖ = vY[kk];
        valSum = yₖ * hKₖ(vαₜ, kk);
        vβₜ[kk] += (valSum < one(T)) * yₖ;
        # Running average of the iterates (the output path)
        @views mX[:, ii] .= inv(tt) .* ((T(ii - 1) .* mX[:, ii - 1]) .+ vαₜ);
    end

    return mX;
end
```
Remarks:
- The algorithm assumes no bias / intercept term. It may be adapted to include one, yet it will lose its speed.
- The code calculates the whole path of the estimation of the parameter $\boldsymbol{\beta}$ (which is `vαₜ` in the code).
I verified the implementation against a DCP solver on a non-separable synthetic data set:


The prediction, for a given new sample $\boldsymbol{z}$, is given by:
$$ \boldsymbol{z} \to \operatorname{sign} \left( \sum_{i} {\beta}_{i} k \left( \boldsymbol{z}, \boldsymbol{x}_{i} \right) \right) $$
Where $\boldsymbol{x}_{i}$ is the $i$-th data sample and $k \left( \cdot, \cdot \right)$ is the kernel function.
It is different from the calculation for the Dual Variable as seen in Prediction of a Sample Given the Optimal Dual Variable of Kernel SVM.
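The prediction rule above is a one-liner. This is a sketch with my own naming conventions (the function name and toy data are mine; `mX` holds the training samples as columns, matching the style of the solver above):

```julia
using LinearAlgebra

# Prediction: sign( Σᵢ βᵢ · k(z, xᵢ) ), with training samples as columns of mX
function predictKernelSvm(vβ::Vector{Float64}, mX::Matrix{Float64}, k::Function, vZ::Vector{Float64})
    return sign(sum(vβ[i] * k(vZ, mX[:, i]) for i in eachindex(vβ)))
end

k(x, z) = dot(x, z);        # linear kernel, just for the toy example
mX = [1.0 -1.0;
      1.0 -1.0];             # two training samples as columns
vβ = [1.0, -1.0];

println(predictKernelSvm(vβ, mX, k, [2.0, 0.5]))  # → 1.0
```

Note that the sum runs over all samples; only the $\beta_i \ne 0$ entries actually contribute, so a sparse $\boldsymbol{\beta}$ keeps prediction cheap.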
The code is available on my StackExchange Code GitHub Repository (Look at the CrossValidated\Q215733 folder).
The objective function for the Primal Linear SVM (No bias term):
$$ \arg \min_{\boldsymbol{w}} \frac{\lambda}{2} {\left\| \boldsymbol{w} \right\|}_{2}^{2} + \frac{1}{n} \sum_{k = 1}^{n} \max \left\{ 0, 1 - {y}_{k} \boldsymbol{x}_{k}^{\top} \boldsymbol{w} \right\} $$
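As with the kernel case, the objective above can be evaluated directly, which is useful for verifying a solver. This is a sketch under my own naming (the function name and toy data are mine; samples are columns of `mX`):

```julia
using LinearAlgebra

# Evaluate the primal linear SVM objective for a candidate w
function linearSvmObjective(mX::Matrix{Float64}, vY::Vector{Float64}, vW::Vector{Float64}, λ::Float64)
    n = length(vY)
    reg   = (λ / 2) * sum(abs2, vW)                                      # (λ/2) ‖w‖₂²
    hinge = sum(max(0.0, 1.0 - vY[k] * dot(mX[:, k], vW)) for k in 1:n) / n
    return reg + hinge
end

mX = [1.0 -1.0;
      0.0  0.0];      # two samples as columns
vY = [1.0, -1.0];
vW = [0.5, 0.0];

println(linearSvmObjective(mX, vY, vW, 0.2))  # → 0.525
```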
There is a simpler formulation of the algorithm, in my opinion, in Shai Shalev Shwartz, Shai Ben David - Understanding Machine Learning: From Theory to Algorithms, a book by the same author as the paper Pegasos: Primal Estimated Sub GrAdient SOlver for SVM.
For the Linear SVM, see the algorithm box on page 213 (SGD for Solving Soft SVM).
My implementation (Linear SVM):
```julia
using LinearAlgebra;

function PegasosSVM!( mW :: Matrix{T}, hXₖ! :: Function, vY :: Vector{T}, λ :: T, vWₜ :: Vector{T}, vθₜ :: Vector{T}, vXₜ :: Vector{T} ) where {T <: AbstractFloat}
    # Following Shai Shalev Shwartz, Shai Ben David - Understanding Machine Learning: From Theory to Algorithms.
    # See page 213 - SGD for Solving Soft SVM.
    # `hXₖ!(vZ, ii)` - Returns the `ii`-th data sample in `vZ`.

    numSamples    = length(vY);
    dataDim       = size(mW, 1);
    numIterations = size(mW, 2);

    # First iteration
    tt = 1;
    ηₜ = inv(λ * tt);
    vWₜ .= ηₜ .* vθₜ;
    kk = rand(1:numSamples);
    yₖ = vY[kk];
    hXₖ!(vXₜ, kk);
    valSum = yₖ * dot(vWₜ, vXₜ);
    if valSum < one(T)
        vθₜ .+= yₖ .* vXₜ;
    end
    @views mW[:, 1] .= vWₜ;

    for ii in 2:numIterations
        tt = ii;
        ηₜ = inv(λ * tt);
        vWₜ .= ηₜ .* vθₜ;
        kk = rand(1:numSamples);
        yₖ = vY[kk];
        hXₖ!(vXₜ, kk);
        valSum = yₖ * dot(vWₜ, vXₜ);
        if valSum < one(T)
            vθₜ .+= yₖ .* vXₜ;
        end
        # Running average of the iterates (the output path)
        @views mW[:, ii] .= inv(tt) .* ((T(ii - 1) .* mW[:, ii - 1]) .+ vWₜ);
    end

    return mW;
end
```
Remarks:
- The algorithm assumes no bias / intercept term. It may be adapted to include the bias term, yet it will become much slower (the problem goes from strongly convex to merely convex).
- The code calculates the whole path of the estimation of the parameter $\boldsymbol{w}$.
I verified the implementation against a DCP solver on a non-separable synthetic data set:


The code is available on my StackExchange Code GitHub Repository (Look at the CrossValidated\Q215733 folder).