The method to calculate the gradient in this case is calculus (analytically, NOT numerically!). So we differentiate the loss function with respect to $w_{y_i}$ like this:

$\nabla_{w_{y_i}} L_i = - \left( \sum_{j\neq y_i} \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) \right) x_i$

and with respect to $w_j$ when $j \neq y_i$ it is:

$\nabla_{w_j} L_i = \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0)\, x_i$

The $\mathbb{1}$ is just the indicator function: when its condition is true, the margin expression inside it only contributes a count of 1, leaving just $\pm x_i$. And when you write it in code, the example you provided is the answer.

Since you are using the cs231n example, you should definitely check the notes and videos if needed.

Hope this helps!

Answer from dexhunter on Stack Overflow
Top answer
1 of 3
12

Let's start with the basics. The so-called gradient is just the ordinary derivative, that is, the slope. For example, the slope of the linear function $f(x) = a\,x + b$ equals $a$, so its gradient w.r.t. $x$ equals $a$. If $x$ and $a$ are not numbers but vectors, then the gradient is also a vector.

Another piece of good news is that the gradient is a linear operator. That means you can add functions and multiply by constants before or after differentiation; it doesn't make any difference.

Now take the definition of the SVM loss function for a single $i$-th observation. It is

$L_i = \max(0,\; \text{something} - w_{y_i}\cdot x_i)$

where $\text{something} = w_j\cdot x_i + \Delta$ does not depend on $w_{y_i}$. Thus, the loss equals $\text{something} - w_{y_i}\cdot x_i$ if the latter is non-negative, and $0$ otherwise.

In the first (non-negative) case the loss is linear in $w_{y_i}$, so the gradient is just the slope of this function of $w_{y_i}$, that is, $-x_i$.

In the second (negative) case the loss is constant, so its derivative is also constant: $0$.

To write all these cases in one equation, we invent a function (it is called the indicator) $I(\text{condition})$, which equals $1$ if the condition is true, and $0$ otherwise. With this function, we can write

$\nabla_{w_{y_i}} L_i = I(\text{something} - w_{y_i}\cdot x_i > 0)\cdot(-x_i)$

If $\text{something} - w_{y_i}\cdot x_i > 0$, the first multiplier equals 1, and the gradient equals $-x_i$. Otherwise, the first multiplier equals 0, and so does the gradient. So I just rewrote the two cases in a single line.

Now let's turn from a single $i$-th observation to the whole loss $L$. The loss is the sum of the individual losses $L_i$. Thus, because differentiation is linear, the gradient of a sum equals the sum of the gradients, so we can write

$\text{total derivative} = \sum_i I(\text{something} - w_{y_i}\cdot x_i > 0) \cdot (-x_i)$

Now, move the multiplier $-x_i$ from the end to the beginning of the formula, and you will get your expression.
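The derivation above is easy to check numerically. A minimal sketch (all values, and the name `something` for the part that does not depend on the true-class weights, are illustrative):

```python
import numpy as np

# one observation and the weights of its true class (values are illustrative)
x_i = np.array([1.0, 2.0, -1.0, 0.5, 3.0])
w_y = np.array([0.1, -0.2, 0.3, 0.4, -0.1])
something = 2.0   # the part of the loss that does not depend on w_y

def loss(w):
    # single-observation hinge-style loss: max(0, something - w . x_i)
    return max(0.0, something - w @ x_i)

# analytic gradient from the derivation: I(something - w_y . x_i > 0) * (-x_i)
grad = float(something - w_y @ x_i > 0) * (-x_i)

# numerical check with central differences
eps = 1e-6
num_grad = np.array([(loss(w_y + eps * e) - loss(w_y - eps * e)) / (2 * eps)
                     for e in np.eye(5)])
```

Here the margin is comfortably positive, so the indicator is 1 and both gradients agree at $-x_i$; exactly at the kink the numerical check would break down, because the hinge is not differentiable there.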

2 of 3
1

David has provided a good answer. But I would point out that the sum() in David's answer:

total_derivative = sum(I(something - w_y*x[i] > 0) * (-x[i]))

is different from the one in Nikhil's original question:

$\nabla_{w_{y_i}} L_i = - \left( \sum_{j\neq y_i} \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) \right) x_i$

The above equation is still the gradient due to the $i$-th observation, but for the weight of the ground-truth class, i.e. $w_{y_i}$. There is the summation $\sum_{j\neq y_i}$ because $w_{y_i}$ appears in every term of the SVM loss $L_i$:

$L_i = \sum_{j\neq y_i} \max(0,\; w_j^Tx_i - w_{y_i}^Tx_i + \Delta)$

For every non-zero term, i.e. every $j$ with $w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0$, you would obtain the gradient $-x_i$. In total, the gradient is $numOfNonZeroTerms \times (- x_i)$, the same as the equation above.

Gradients of individual observations (computed above) are then averaged to obtain the gradient of the whole batch of observations: $\nabla_W L = \frac{1}{N}\sum_{i=1}^{N} \nabla_W L_i$.
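That counting view, followed by averaging over the batch, might be sketched like this for the true-class rows only (toy data; `delta` stands for the margin $\Delta$, and all shapes are illustrative):

```python
import numpy as np

# toy data: N=3 observations, D=2 features, C=3 classes (values illustrative)
X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]])   # N x D
y = np.array([0, 1, 2])                               # true class per row
W = np.zeros((2, 3))                                  # D x C weights
delta = 1.0

scores = X @ W                                        # N x C class scores
margins = scores - scores[np.arange(3), y][:, None] + delta
margins[np.arange(3), y] = 0.0                        # exclude j == y_i

# number of classes violating the margin, per observation
num_violations = (margins > 0).sum(axis=1)            # length-N vector

# gradient row for the true class of observation i: -num_violations[i] * x_i
dW = np.zeros_like(W)
for i in range(3):
    dW[:, y[i]] -= num_violations[i] * X[i]
dW /= 3  # average over the batch
```

With `W` all zeros, every wrong class violates the margin, so each observation contributes $(C-1)\times(-x_i)$ to its true-class column before averaging.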

Discussions

svm loss function gradient - Cross Validated - Stack Exchange
I was taking Stanford's cs231n class and was unable to understand the gradient calculated using the SVM loss function. More on stats.stackexchange.com
🌐 stats.stackexchange.com
June 8, 2019
Gradient for hinge loss multiclass - Cross Validated
While the expression may look scary when it is written out, when you're implementing this in code you'd simply count the number of classes that didn't meet the desired margin (and hence contributed to the loss function) and then the data vector $x_i$ scaled by this number is the gradient. More on stats.stackexchange.com
🌐 stats.stackexchange.com
June 2, 2015
Concept explanation - AI Discussions - DeepLearning.AI
Hi community, I am following a computer vision class and I am trying to implement the naive SVM. The aim is to compute the gradient of the SVM term of the loss function: compute the derivative at the same time that the loss is being computed. Here is the function code: def svm_loss_naive( W: ... More on community.deeplearning.ai
🌐 community.deeplearning.ai
0
July 11, 2024
Calculating SVM Gradient

Can you be more specific about the part you didn't understand? Are you just asking about the numerical gradient?

More on reddit.com
🌐 r/cs231n
4
1
January 19, 2016
🌐
University of Utah
users.cs.utah.edu › ~zhe › pdf › lec-19-2-svm-sgd-upload.pdf pdf
1 Support Vector Machines: Training with Stochastic Gradient Descent
Compute gradient of J(w) at wt. Call it ∇J(w ... This algorithm is guaranteed to converge to the minimum of J if the step size is small enough. ... Hinge loss is not differentiable!
🌐
GitHub
github.com › amanchadha › stanford-cs231n-assignments-2020 › blob › master › assignment1 › svm.ipynb
stanford-cs231n-assignments-2020/assignment1/svm.ipynb at master · amanchadha/stanford-cs231n-assignments-2020
For instance, the gradient of the SVM loss function is undefined at the hinge, i.e., at x = 0. Generally, when we have max(x, y), at x = y the gradient is undefined. These non-differentiable parts of the function are called ``kinks'' and they ...
Author   amanchadha
🌐
Stack Exchange
stats.stackexchange.com › questions › 412077 › svm-loss-function-gradient
svm loss function gradient - Cross Validated - Stack Exchange
June 8, 2019 - When you take the derivative wrt some $k\neq y_i$, the $w_k$ appears in the whole expression just once, i.e. when $j=k$. And, the gradient will be just $x_i$ times the indicator. For example, let the set $j\neq y_i$ be $\{a,b,k\}$. The expanded version of the loss function will be $$L_i=\max(0,w_ax_i-w_{y_i}x_i+\Delta)+\max(0,w_bx_i-w_{y_i}x_i+\Delta)+\max(0,w_kx_i-w_{y_i}x_i+\Delta)$$
Find elsewhere
🌐
MIT CSAIL
people.csail.mit.edu › dsontag › courses › ml16 › slides › lecture5.pdf pdf
Support vector machines (SVMs) Lecture 5 David Sontag
Empirical loss · Regularization term · Equivalent if · Soft margin SVM · Subgradient (for non-differentiable functions) · (Sub)gradient descent of SVM objective
🌐
Anna-Lena Popkes
alpopkes.com › posts › machine_learning › support_vector_machines
Support vector machines
April 13, 2021 - The hinge loss function is not differentiable (namely at the point $t=1$). Therefore, we cannot compute the gradient right away. However, we can use a method called subgradient descent to solve our optimization problem.
Top answer
1 of 2
8

Let's use the example of the SVM loss function for a single datapoint:

$L_i = \sum_{j\neq y_i} \left[ \max(0, w_j^Tx_i - w_{y_i}^Tx_i + \Delta) \right]$

Where $\Delta$ is the desired margin.

We can differentiate the function with respect to the weights. For example, taking the gradient with respect to $w_{y_i}$ we obtain:

$\nabla_{w_{y_i}} L_i = - \left( \sum_{j\neq y_i} \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) \right) x_i$

Where $\mathbb{1}$ is the indicator function that is one if the condition inside is true and zero otherwise. While the expression may look scary when it is written out, when you're implementing this in code you'd simply count the number of classes that didn't meet the desired margin (and hence contributed to the loss function), and then the data vector $x_i$ scaled by this number is the gradient. Notice that this is the gradient only with respect to the row of $W$ that corresponds to the correct class. For the other rows, where $j \neq y_i$, the gradient is:

$\nabla_{w_j} L_i = \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) x_i$

Once you derive the expression for the gradient it is straight-forward to implement the expressions and use them to perform the gradient update.

Taken from Stanford CS231N optimization notes posted on github.
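The two expressions above translate almost line-for-line into a naive double loop. A sketch under the same conventions (a D x C weight matrix `W` with one column per class; the function name and toy shapes are illustrative, not the official assignment code):

```python
import numpy as np

def svm_loss_and_grad(W, X, y, delta=1.0):
    """W: D x C weights, X: N x D data, y: length-N integer labels."""
    N = X.shape[0]
    loss = 0.0
    dW = np.zeros_like(W)
    for i in range(N):
        scores = X[i] @ W
        for j in range(W.shape[1]):
            if j == y[i]:
                continue
            margin = scores[j] - scores[y[i]] + delta
            if margin > 0:                 # indicator is 1 for this term
                loss += margin
                dW[:, j] += X[i]           # wrong-class row: +x_i
                dW[:, y[i]] -= X[i]        # true-class row:  -x_i per violation
    return loss / N, dW / N
```

Each violated margin adds $+x_i$ to one wrong-class column and $-x_i$ to the true-class column, so the true-class column accumulates exactly the counted sum from the formula.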

2 of 2
0

First of all, note that the multi-class hinge loss function is a function of $W_r$:

\begin{equation} l(W_r) = \max( 0, 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i - W_{y_i} \cdot x_i) \end{equation}

Next, the max function is non-differentiable at $0$, so we need to calculate its subgradient:

\begin{equation} \frac{\partial l(W_r)}{\partial W_r} = \begin{cases} \{0\}, & W_{y_i}\cdot x_i > 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i \\ \{x_i\}, & W_{y_i}\cdot x_i < 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i\\ \{\alpha x_i\}, & \alpha \in [0,1], W_{y_i}\cdot x_i = 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i \end{cases} \end{equation}

In the second case, note that $W_{y_i}$ is independent of $W_r$. This definition of the subgradient of the multi-class hinge loss is analogous to the subgradient of the binary-class hinge loss.
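One subgradient descent step under that case analysis might be sketched as follows, picking the $\{x_i\}$ element for the argmax wrong-class row; the symmetric $-x_i$ update for the true-class row $W_{y_i}$ is not derived above but follows the same reasoning (names and the learning rate are illustrative):

```python
import numpy as np

def subgradient_step(W, x_i, y_i, lr=0.1):
    """One subgradient step on l(W) = max(0, 1 + max_{r != y_i} W_r.x_i - W_{y_i}.x_i).
    W: C x D, one row of weights per class."""
    scores = W @ x_i
    scores_wrong = scores.copy()
    scores_wrong[y_i] = -np.inf            # restrict the max to r != y_i
    r = int(np.argmax(scores_wrong))       # the argmax wrong class
    if scores[y_i] >= 1 + scores[r]:
        return W                            # loss inactive (or at the kink,
                                            # where 0 is a valid subgradient)
    W = W.copy()
    W[r] -= lr * x_i                        # subgradient {x_i}: push r down
    W[y_i] += lr * x_i                      # true-class row: push y_i up
    return W
```

At the kink the code deliberately chooses $\alpha = 0$ from the subgradient set, which keeps the update rule deterministic.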

🌐
Kaggle
kaggle.com › code › residentmario › support-vector-machines-and-stoch-gradient-descent
Support vector machines and stoch gradient descent
🌐
Gitbooks
sharad-s.gitbooks.io › cs231n › content › lecture_3_-_loss_functions_and_optimization › multiclass_svm_loss_deep_dive.html
Multiclass SVM Loss (Deep Dive) · CS231n
A: You would expect a loss of approximately (C-1) where C is the number of classes. This is because if you look at the equation for Multiclass SVM Loss, you will see that max(0, 0-0 + 1) evaluates to a loss of 1 for each class.
🌐
Twice22
twice22.github.io › hingeloss
Hinge Loss Gradient Computation
assign $x_i$ to each column of this matrix if ($j \neq y_i$ and $x_iw_j - x_iw_{y_i} + \Delta > 0$); assign $-\sum\limits_{j \neq y_{i}}1(x_iw_j - x_iw_{y_i} + \Delta > 0)x_i$ to the $y_i$ column

dW = np.zeros(W.shape)  # initialize the gradient as zero
# compute the loss and the gradient
num_classes = W.shape[1]
num_train = X.shape[0]
loss = 0.0
for i in xrange(num_train):
    scores = X[i].dot(W)
    correct_class_score = scores[y[i]]
    nb_sup_zero = 0
    for j in xrange(num_classes):
        if j == y[i]:
            continue
        margin = scores[j] - correct_class_score + 1  # note delta = 1
        if margin > 0:
            nb_sup_zero += 1
            loss += margin
            dW[:, j] += X[i]
    dW[:, y[i]] -= nb_sup_zero * X[i]
🌐
scikit-learn
scikit-learn.org › stable › modules › generated › sklearn.linear_model.SGDClassifier.html
SGDClassifier — scikit-learn 1.8.0 documentation
This estimator implements regularized ... (SGD) learning: the gradient of the loss is estimated each sample at a time and the model is updated along the way with a decreasing strength schedule (aka learning rate)....
🌐
Wikipedia
en.wikipedia.org › wiki › Hinge_loss
Hinge loss - Wikipedia
January 26, 2026 - The hinge loss is a convex function, so many of the usual convex optimizers used in machine learning can work with it. It is not differentiable, but has a subgradient with respect to model parameters w of a linear SVM with score function
🌐
GitHub
github.com › huyouare › CS231n › blob › master › assignment1 › cs231n › classifiers › linear_svm.py
CS231n/assignment1/cs231n/classifiers/linear_svm.py at master · huyouare/CS231n
Structured SVM loss function, naive implementation (with loops) Inputs: - W: C x D array of weights · - X: D x N array of data. Data are D-dimensional columns · - y: 1-dimensional array of length N with labels 0...K-1, for K classes · - reg: (float) regularization strength · Returns: a tuple of: - loss as single float · - gradient with respect to weights W; an array of same shape as W ·
Author   huyouare
🌐
DeepLearning.AI
community.deeplearning.ai › ai discussions
Concept explanation - AI Discussions - DeepLearning.AI
July 11, 2024 - The aim is to compute the gradient of the SVM term of the loss function: compute the derivative at the same time that the loss is being computed. Here is the function code: def svm_loss_naive( W: ...
🌐
University of Oxford
robots.ox.ac.uk › ~az › lectures › ml › lect2.pdf pdf
Lecture 2: The SVM classifier
Feature: histogram of oriented gradients (HOG) Feature vector dimension = 16 x 8 (for tiling) x 8 (orientations) = 1024 · image · dominant · direction · HOG · frequency · orientation · • tile window into 8 x 8 pixel cells · • each cell represented by HOG · Averaged positive examples · Training (Learning) • Represent each example window by a HOG feature vector · • Train a SVM ...