The way to compute the gradient here is calculus (analytically, NOT numerically!). So we differentiate the loss function with respect to $w_{y_i}$ like this:

$$ \nabla_{w_{y_i}} L_i = -\left[\sum_{j\ne y_i} \mathbf{I}( w_j^T x_i - w_{y_i}^T x_i + \Delta > 0) \right] x_i $$

and with respect to $w_j$ when $j \ne y_i$:

$$ \nabla_{w_j} L_i = \mathbf{I}( w_j^T x_i - w_{y_i}^T x_i + \Delta > 0)\, x_i $$

The $\mathbf{I}$ is just the indicator function, so each margin term contributes to the gradient only when its condition is true. When you write this in code, the example you provided is the answer.
Since you are using the cs231n example, you should definitely check the notes and videos if needed.
Hope this helps!
If the subtraction is less than zero, the loss is zero, so the gradient of $W$ is also zero. If the subtraction is larger than zero, then the gradient of $W$ is the partial derivative of the loss.
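To make the two cases concrete, here is a minimal Python sketch (the function name is illustrative, not from any library): the hinge term contributes a gradient only when it is positive.

```python
import numpy as np

def hinge_and_grad(x, margin_term):
    """Hinge value max(0, margin_term) and its (sub)gradient w.r.t. w_j,
    where margin_term stands for w_j.x - w_yi.x + delta for one class j.

    If the term is negative the loss is 0, so the gradient is 0;
    if it is positive the loss is linear in w_j, so the gradient is x.
    """
    if margin_term > 0:
        return margin_term, x
    return 0.0, np.zeros_like(x)
```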
Let's start with basics. The so-called gradient is just the ordinary derivative, that is, slope. For example, slope of the linear function $y=kx+b$ equals $k$, so its gradient w.r.t. $x$ equals $k$. If $x$ and $k$ are not numbers, but vectors, then the gradient is also a vector.
Another piece of good news is that the gradient is a linear operator. That means you can add functions and multiply by constants before or after differentiation; it makes no difference.
Now take the definition of SVM loss function for a single $i$-th observation. It is
$\mathrm{loss} = \mathrm{max}(0, \mathrm{something} - w_y*x)$
where $\mathrm{something}=wx+\Delta$. Thus, loss equals $\mathrm{something}-w_y*x$, if the latter is non-negative, and $0$ otherwise.
In the first (non-negative) case the loss $\mathrm{something}-w_y*x$ is linear in $w_y$, so the gradient is just the slope of this function of $w_y$, that is, $-x$.
In the second (negative) case the loss $0$ is constant, so its derivative is also $0$.
To write all these cases in one equation, we invent a function (it is called the indicator) $I(x)$, which equals $1$ if $x$ is true, and $0$ otherwise. With this function, we can write
$\mathrm{derivative} = I(\mathrm{something} - w_y*x > 0) * (-x)$
If $\mathrm{something} - w_y*x > 0$, the first multiplier equals $1$, and the gradient equals $-x$. Otherwise, the first multiplier equals $0$, and so does the gradient. So I just rewrote the two cases in a single line.
Now let's turn from a single $i$-th observation to the whole loss. The loss is the sum of the individual losses. Thus, because differentiation is linear, the gradient of a sum equals the sum of gradients, so we can write
$\text{total derivative} = \sum_i \left( I(\mathrm{something} - w_y*x_i > 0) * (-x_i) \right)$
Now, move the $-$ multiplier from $x_i$ to the beginning of the formula, and you will get your expression.
David has provided a good answer. But I would point out that the sum() in David's answer:
total_derivative = sum(I(something - w_y*x[i] > 0) * (-x[i]))
is different from the one in Nikhil's original question:
$$ \def\w{{\mathbf w}} \nabla_{\w_{y_i}} L_i=-\left[\sum_{j\ne y_i} \mathbf{I}( \w_j^T x_i - \w_{y_i}^T x_i + \Delta >0) \right] x_i $$ The above equation is still the gradient due to the i-th observation, but for the weight of the ground truth class, i.e. $w_{y_i}$. There is the summation $\sum_{j \ne y_i}$, because $w_{y_i}$ is in every term of the SVM loss $L_i$:
$$ \def\w{{\mathbf w}} L_i = \sum_{j \ne y_i} \max (0, \w_j^T x_i - \w_{y_i}^T x_i + \Delta) $$ For every non-zero term, i.e. $w^T_j x_i - w^T_{y_i} x_i + \Delta > 0$, you would obtain the gradient $-x_i$. In total, the gradient $\nabla_{w_{y_i}} L_i$ is $\text{numOfNonZeroTerms} \times (-x_i)$, the same as the equation above.
Gradients of individual observations $\nabla L_i$ (computed above) are then averaged to obtain the gradient of the batch of observations $\nabla L$.
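Putting the pieces above together, here is a hedged NumPy sketch of the multiclass SVM loss and its batch-averaged gradient (function and variable names are mine, not cs231n's):

```python
import numpy as np

def svm_loss_grad(W, X, y, delta=1.0):
    """Multiclass SVM loss and gradient, vectorized.

    W: (C, D) weight matrix, X: (N, D) samples, y: (N,) true class indices.
    Implements the two gradient formulas above, averaged over the batch.
    """
    N = X.shape[0]
    scores = X @ W.T                           # (N, C) class scores
    correct = scores[np.arange(N), y][:, None] # score of the true class
    margins = np.maximum(0, scores - correct + delta)
    margins[np.arange(N), y] = 0               # only j != y_i terms count
    loss = margins.sum() / N

    ind = (margins > 0).astype(float)          # indicator I(margin > 0)
    ind[np.arange(N), y] = -ind.sum(axis=1)    # -numOfNonZeroTerms for y_i
    dW = ind.T @ X / N                         # (C, D) averaged gradient
    return loss, dW
```

The true-class row gets minus the count of positive margins, exactly as in the $\nabla_{w_{y_i}} L_i$ formula above.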
The objective function for the Primal Kernel SVM (No bias term):
$$ \arg \min_{\boldsymbol{\beta}} \frac{\lambda}{2} \boldsymbol{\beta}^{\top} \boldsymbol{K} \boldsymbol{\beta} + \frac{1}{n} \sum_{i = 1}^{n} \max \left\{ 0, 1 - {y}_{i} \boldsymbol{k}_{i}^{\top} \boldsymbol{\beta} \right\} $$
The simplest formulation of the Kernel Pegasos algorithm, in my opinion, is in Shai Shalev-Shwartz, Shai Ben-David - Understanding Machine Learning: From Theory to Algorithms, a book by the same author as the paper Pegasos: Primal Estimated sub-GrAdient SOlver for SVM.
For the Kernel SVM (Page 223):

My implementation (Kernel SVM):
function PegasosKernelSVM!( mX :: Matrix{T}, hKₖ :: Function, vY :: Vector{T}, λ :: T, vαₜ :: Vector{T}, vβₜ :: Vector{T} ) where {T <: AbstractFloat}
    # Following Shai Shalev-Shwartz, Shai Ben-David - Understanding Machine Learning: From Theory to Algorithms.
    # See page 223 - SGD for Solving Soft SVM with Kernels.
    # `hKₖ(vZ, ii)` - Returns the dot product of the `ii`-th data sample with `vZ`.

    numSamples    = length(vY);
    dataDim       = size(mX, 1);
    numIterations = size(mX, 2);

    # First iteration
    ii = 1;
    tt = ii;
    ηₜ = inv(λ * tt);
    vαₜ .= ηₜ .* vβₜ;
    kk = rand(1:numSamples);
    yₖ = vY[kk];
    valSum = yₖ * hKₖ(vαₜ, kk);
    vβₜ[kk] += (valSum < one(T)) * yₖ;
    @views mX[:, ii] = vαₜ;

    for ii in 2:numIterations
        tt = ii;
        ηₜ = inv(λ * tt);
        vαₜ .= ηₜ .* vβₜ;
        kk = rand(1:numSamples);
        yₖ = vY[kk];
        valSum = yₖ * hKₖ(vαₜ, kk);
        vβₜ[kk] += (valSum < one(T)) * yₖ;
        # Running average of the iterates
        @views mX[:, ii] .= inv(tt) .* ((T(ii - 1) .* mX[:, ii - 1]) .+ vαₜ);
    end

    return mX;
end
Remarks:
- The algorithm assumes no bias / intercept term. It may be adapted to include one, yet it will lose its speed.
- The code calculates the whole path of the estimation of the parameter $\boldsymbol{\beta}$ (which is `vαₜ` in the code).
I verified the implementation against a DCP solver on a non-separable synthetic data set:


The prediction, for a given new sample $\boldsymbol{z}$ is given by:
$$ \boldsymbol{z} \to \operatorname{sign} \left( \sum_{i} {\beta}_{i} k \left( \boldsymbol{z}, \boldsymbol{x}_{i} \right) \right) $$
Where $\boldsymbol{x}_{i}$ is the $i$-th data sample and $k \left( \cdot, \cdot \right)$ is the kernel function.
It is different from the calculation for the Dual Variable as seen in Prediction of a Sample Given the Optimal Dual Variable of Kernel SVM.
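A small Python sketch of this prediction rule (the implementation above is in Julia; the names here are illustrative):

```python
import numpy as np

def predict_kernel_svm(z, X, beta, kernel):
    """Predict the label of a new sample z via sign(sum_i beta_i * k(z, x_i)).

    X: (n, d) training samples as rows, beta: (n,) learned coefficients,
    kernel: a function k(u, v) -> float.
    """
    scores = np.array([kernel(z, X[i]) for i in range(X.shape[0])])
    return np.sign(beta @ scores)
```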
The code is available on my StackExchange Code GitHub Repository (Look at the CrossValidated\Q215733 folder).
The objective function for the Primal Linear SVM (No bias term):
$$ \arg \min_{\boldsymbol{w}} \frac{\lambda}{2} {\left\| \boldsymbol{w} \right\|}_{2}^{2} + \frac{1}{n} \sum_{k = 1}^{n} \max \left\{ 0, 1 - {y}_{k} \boldsymbol{x}_{k}^{\top} \boldsymbol{w} \right\} $$
There is a simpler formulation, in my opinion, of the algorithm in Shai Shalev-Shwartz, Shai Ben-David - Understanding Machine Learning: From Theory to Algorithms, a book by the same author as the paper Pegasos: Primal Estimated sub-GrAdient SOlver for SVM.
For the Linear SVM:

My implementation (Linear SVM):
function PegasosSVM!( mW :: Matrix{T}, hXₖ! :: Function, vY :: Vector{T}, λ :: T, vWₜ :: Vector{T}, vθₜ :: Vector{T}, vXₜ :: Vector{T} ) where {T <: AbstractFloat}
    # Following Shai Shalev-Shwartz, Shai Ben-David - Understanding Machine Learning: From Theory to Algorithms.
    # See page 213 - SGD for Solving Soft SVM.
    # `hXₖ!(vZ, ii)` - Returns the `ii`-th data sample in `vZ`.

    numSamples    = length(vY);
    dataDim       = size(mW, 1);
    numIterations = size(mW, 2);

    # First iteration
    tt = 1;
    ηₜ = inv(λ * tt);
    vWₜ .= ηₜ .* vθₜ;
    kk = rand(1:numSamples);
    yₖ = vY[kk];
    hXₖ!(vXₜ, kk);
    valSum = yₖ * dot(vWₜ, vXₜ);
    if valSum < one(T)
        vθₜ .+= yₖ .* vXₜ;
    end
    @views mW[:, 1] = vWₜ;

    for ii in 2:numIterations
        tt = ii;
        ηₜ = inv(λ * tt);
        vWₜ .= ηₜ .* vθₜ;
        kk = rand(1:numSamples);
        yₖ = vY[kk];
        hXₖ!(vXₜ, kk);
        valSum = yₖ * dot(vWₜ, vXₜ);
        if valSum < one(T)
            vθₜ .+= yₖ .* vXₜ;
        end
        # Running average of the iterates
        @views mW[:, ii] .= inv(tt) .* ((T(ii - 1) .* mW[:, ii - 1]) .+ vWₜ);
    end

    return mW;
end
Remarks:
- The algorithm assumes no bias / intercept term. It may be adapted to include the bias term, yet it will become much slower (the objective goes from strongly convex to merely convex).
- The code calculates the whole path of the estimation of the parameter $\boldsymbol{w}$.
I verified the implementation against a DCP solver on a non-separable synthetic data set:


The code is available on my StackExchange Code GitHub Repository (Look at the CrossValidated\Q215733 folder).
Set $\mathbf w = \phi(\mathbf x)\cdot \mathbf u$ so that $\mathbf w^t \phi(\mathbf x)=\mathbf u^t \cdot \mathbf K$ and $\mathbf w^t\mathbf w = \mathbf u^t\mathbf K\mathbf u$, with $\mathbf K = \phi(\mathbf x)^t\phi(\mathbf x)$, where $\phi(x)$ is a mapping of the original input matrix, $\mathbf x$. This allows one to solve the SVM through the primal formulation. Using your notation for the loss:
$$J(\mathbf{w}, b) = C {\displaystyle \sum\limits_{i=1}^{m} max\left(0, 1 - y^{(i)} (\mathbf{u}^t \cdot \mathbf{K}^{(i)} + b)\right)} + \dfrac{1}{2} \mathbf{u}^t \cdot \mathbf{K} \cdot \mathbf{u}$$
$ \mathbf{K}$ is a $m \times m$ matrix, and $\mathbf{u}$ is a $m \times 1$ matrix. Neither is infinite.
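A quick numerical sanity check of the identities $\mathbf w^t \phi(\mathbf x)=\mathbf u^t \mathbf K$ and $\mathbf w^t\mathbf w = \mathbf u^t\mathbf K\mathbf u$, using an arbitrary explicit feature map (the map and dimensions are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 5))        # d x m input matrix, columns are samples
phi = lambda X: np.vstack([X, X**2])   # some explicit feature map
P = phi(x)                             # feature matrix, columns are phi(x_i)
K = P.T @ P                            # Gram matrix K = phi(x)^t phi(x), m x m
u = rng.standard_normal(5)
w = P @ u                              # w = phi(x) . u

assert np.allclose(w @ P, u @ K)       # w^t phi(x) = u^t K
assert np.allclose(w @ w, u @ K @ u)   # w^t w = u^t K u
```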
Indeed, the dual is usually faster to solve, but the primal has its advantages as well, such as approximate solutions (which are not guaranteed in the dual formulation).
Now, why the dual is so much more prominent isn't obvious at all: [1]
The historical reasons for which most of the research in the last decade has been about dual optimization are unclear. We believe that it is because SVMs were first introduced in their hard margin formulation [Boser et al., 1992], for which a dual optimization (because of the constraints) seems more natural. In general, however, soft margin SVMs should be preferred, even if the training data are separable: the decision boundary is more robust because more training points are taken into account [Chapelle et al., 2000]
Chapelle (2007) argues the time complexity of both primal and dual optimization is $\mathcal{O}\left(nn_{sv} + n_{sv}^3\right)$, worst case being $\mathcal{O}\left(n^3\right)$, but they analyzed quadratic and approximate hinge losses, not the proper hinge loss, which is not differentiable and hence cannot be used with Newton's method.
[1] Chapelle, O. (2007). Training a support vector machine in the primal. Neural computation, 19(5), 1155-1178.
If we apply a transformation $\phi$ to all input vectors ($\mathbf{x}^{(i)}$), we get the following cost function:
$J(\mathbf{w}, b) = C {\displaystyle \sum\limits_{i=1}^{m} max\left(0, 1 - y^{(i)} (\mathbf{w}^t \cdot \phi(\mathbf{x}^{(i)}) + b)\right)} \quad + \quad \dfrac{1}{2} \mathbf{w}^t \cdot \mathbf{w}$
The kernel trick replaces $\phi(\mathbf{u})^t \cdot \phi(\mathbf{v})$ by $K(\mathbf{u}, \mathbf{v})$. Since the weight vector $\mathbf{w}$ is not transformed, the kernel trick cannot be applied to the cost function above.
The cost function above corresponds to the primal form of the SVM objective:
$\underset{\mathbf{w}, b, \mathbf{\zeta}}\min{C \sum\limits_{i=1}^m{\zeta^{(i)}} + \dfrac{1}{2}\mathbf{w}^t \cdot \mathbf{w}}$
subject to $y^{(i)}(\mathbf{w}^t \cdot \phi(\mathbf{x}^{(i)}) + b) \ge 1 - \zeta^{(i)}$ and $\zeta^{(i)} \ge 0$ for $i=1, \cdots, m$
The dual form is:
$\underset{\mathbf{\alpha}}\min{\dfrac{1}{2}\mathbf{\alpha}^t \cdot \mathbf{Q} \cdot \mathbf{\alpha} - \mathbf{1}^t \cdot \mathbf{\alpha}}$
subject to $\mathbf{y}^t \cdot \mathbf{\alpha} = 0$ and $0 \le \alpha_i \le C$ for $i = 1, 2, \cdots, m$
where $\mathbf{1}$ is a vector full of 1s and $\mathbf{Q}$ is an $m \times m$ matrix with elements $Q_{ij} = y^{(i)} y^{(j)} \phi(\mathbf{x}^{(i)})^t \cdot \phi(\mathbf{x}^{(j)})$.
Now we can use the kernel trick by computing $Q_{ij}$ like so:
$Q_{ij} = y^{(i)} y^{(j)} K(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$
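For illustration, a small NumPy sketch of building $\mathbf{Q}$ this way (the function name and the choice of an RBF kernel are mine):

```python
import numpy as np

def dual_Q(X, y, kernel):
    """Build the dual matrix Q with Q_ij = y_i * y_j * K(x_i, x_j).

    X: (m, d) samples as rows, y: (m,) labels in {-1, +1},
    kernel: a function k(u, v) -> float.
    """
    m = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
    return np.outer(y, y) * K

# e.g. a Gaussian (RBF) kernel
rbf = lambda u, v, gamma=0.5: np.exp(-gamma * np.sum((u - v) ** 2))
```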
So the kernel trick can only be used on the dual form of the SVM problem (and on some other algorithms, such as logistic regression).
Now you can use off-the-shelf Quadratic Programming libraries to solve this problem, or use Lagrange multipliers to get an unconstrained function (the dual cost function), then search for a minimum using Gradient Descent or any other optimization technique. One of the most efficient approaches seems to be the SMO algorithm, implemented by the libsvm library (for kernelized SVMs).