Answer from dexhunter on Stack Overflow:

The way to calculate the gradient in this case is calculus (analytically, NOT numerically!). So we differentiate the loss function with respect to $w_{y_i}$ like this:

$\nabla_{w_{y_i}} L_i = - \left( \sum_{j\neq y_i} \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) \right) x_i$

and with respect to $w_j$ when $j \neq y_i$ like this:

$\nabla_{w_j} L_i = \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) x_i$

The $\mathbb{1}(\cdot)$ is just an indicator function: the whole margin expression inside it collapses to $1$ when the condition is true and to $0$ otherwise. When you write this in code, the example you provided is exactly what it computes.
Since you are working from the cs231n example, you should definitely check the course notes and videos if needed.
Hope this helps!
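One way to convince yourself the analytic gradient described above is correct is to compare it against a numerical finite-difference estimate. Below is a minimal sketch with made-up toy numbers (3 classes, 4 features, one sample; `delta` is the margin):

```python
import numpy as np

# Toy setup (made-up numbers): 3 classes, 4 features, one sample.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # one weight row per class
x = rng.normal(size=4)
y = 1                         # ground-truth class index
delta = 1.0

def loss(W):
    scores = W.dot(x)
    margins = np.maximum(0.0, scores - scores[y] + delta)
    margins[y] = 0.0
    return margins.sum()

# Analytic gradient from the two formulas above.
scores = W.dot(x)
margins = scores - scores[y] + delta
active = margins > 0          # indicator per class
active[y] = False
dW = np.zeros_like(W)
dW[active] = x                # rows j != y with positive margin get +x
dW[y] = -active.sum() * x     # ground-truth row gets -(count) * x

# Numerical check via central differences.
num = np.zeros_like(W)
h = 1e-5
for idx in np.ndindex(W.shape):
    Wp = W.copy(); Wp[idx] += h
    Wm = W.copy(); Wm[idx] -= h
    num[idx] = (loss(Wp) - loss(Wm)) / (2 * h)

print(np.allclose(dW, num, atol=1e-4))  # → True
```

The numerical check only fails if a margin lands exactly on the hinge kink, which does not happen for generic random inputs.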
If the subtraction is less than zero, the loss is zero, so the gradient of $W$ is also zero. If the subtraction is larger than zero, the gradient of $W$ is the partial derivative of the loss.
Let's start with the basics. The so-called gradient is just the ordinary derivative, that is, a slope. For example, the slope of the linear function $f(w) = w \cdot x$ equals $x$, so its gradient w.r.t. $w$ equals $x$. If $w$ and $x$ are not numbers but vectors, then the gradient is also a vector.
Another piece of good news is that the gradient is a linear operator. It means you can add functions and multiply by constants before or after differentiation, and it doesn't make any difference.
Now take the definition of the SVM loss for a single $i$-th observation. Each of its terms is
$\max(0, \text{something} - w_{y_i} \cdot x_i)$
where $\text{something} = w_j \cdot x_i + \Delta$ does not depend on $w_{y_i}$. Thus, the term equals
$\text{something} - w_{y_i} \cdot x_i$, if the latter is non-negative, and $0$ otherwise.
In the first (non-negative) case the loss is linear in $w_{y_i}$, so the gradient is just the slope of this function of $w_{y_i}$, that is, $-x_i$.
In the second (negative) case the loss is constant, so its derivative is also $0$.
To write these cases in one equation, we invent a function (it is called an indicator) $I(\text{condition})$, which equals $1$ if the condition is true, and $0$ otherwise. With this function, we can write
$I(\text{something} - w_{y_i} \cdot x_i > 0) \cdot (-x_i)$
If $\text{something} - w_{y_i} \cdot x_i > 0$, the first multiplier equals 1, and the gradient equals $-x_i$. Otherwise, the first multiplier equals 0, and the gradient does as well. So I just rewrote the two cases in a single line.
Now let's turn from a single term to the whole loss for the $i$-th observation. The loss is a sum of such terms over $j \neq y_i$. Because differentiation is linear, the gradient of a sum equals the sum of gradients, so we can write
$\text{total derivative} = \sum(I(\text{something} - w_{y_i} \cdot x_i > 0) \cdot (-x_i))$
Now, move the multiplier $-x_i$ from inside the sum to the beginning of the formula, and you will get your expression.
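The indicator-sum formula above can be checked with concrete numbers. A tiny sketch (all values made up): four classes, the correct class is index 1, and we count which classes violate the margin.

```python
import numpy as np

# Toy numbers (made up): scores for 4 classes on one observation.
scores = np.array([3.2, 5.1, 1.7, 4.9])
y = 1                              # correct class
delta = 1.0
x_i = np.array([0.5, -1.0, 2.0])   # the observation's feature vector

# Indicator for each j != y: did class j violate the margin?
indicators = (scores - scores[y] + delta) > 0
indicators[y] = False

# Sum-of-indicators form of the gradient w.r.t. w_{y}:
grad_wy = -indicators.sum() * x_i
print(indicators.sum())  # 1: only class 3 violates the margin
print(grad_wy)           # one violating class, so the gradient is -1 * x_i
```

Here only class 3 has $4.9 - 5.1 + 1 > 0$, so the count is 1 and the gradient is $-x_i$.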
David has provided a good answer. But I would point out that the sum in David's answer,
total_derivative = sum(I(something - w_y*x[i] > 0) * (-x[i]))
is different from the one in the original Nikhil's question:
$\nabla_{w_{y_i}} L_i = - \left( \sum_{j\neq y_i} \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) \right) x_i$
The above equation is still the gradient due to the $i$-th observation, but for the weight of the ground-truth class, i.e. $w_{y_i}$. There is a summation $\sum_{j \neq y_i}$ because $w_{y_i}$ appears in every term of the SVM loss $L_i$:
$L_i = \sum_{j\neq y_i} \max(0, w_j^Tx_i - w_{y_i}^Tx_i + \Delta)$
For every non-zero term, i.e. every $j$ with $w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0$, you obtain the gradient $-x_i$. In total, the gradient $\nabla_{w_{y_i}} L_i$ is $\text{numOfNonZeroTerm} \times (- x_i)$, the same as the equation above.
The gradients of individual observations (computed above) are then averaged to obtain the gradient for the whole batch of observations.
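The averaging step is just a mean over the per-observation gradients. A trivial sketch with a hypothetical stack of three per-observation gradient vectors (numbers made up):

```python
import numpy as np

# Hypothetical per-observation gradients, one row per observation.
grads_per_obs = np.array([[ 1.0, -2.0],
                          [ 3.0,  0.0],
                          [-1.0,  2.0]])

# The batch gradient is the mean across observations.
batch_grad = grads_per_obs.mean(axis=0)
print(batch_grad)  # → [1. 0.]
```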
Let's use the example of the SVM loss function for a single datapoint:
$L_i = \sum_{j\neq y_i} \left[ \max(0, w_j^Tx_i - w_{y_i}^Tx_i + \Delta) \right]$
Where $\Delta$ is the desired margin.
We can differentiate the function with respect to the weights. For example, taking the gradient with respect to $w_{y_i}$ we obtain:
$\nabla_{w_{y_i}} L_i = - \left( \sum_{j\neq y_i} \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) \right) x_i$
Where $\mathbb{1}$ is the indicator function, which is one if the condition inside is true and zero otherwise. While the expression may look scary when written out, when you're implementing this in code you'd simply count the number of classes that didn't meet the desired margin (and hence contributed to the loss function); the data vector $x_i$ scaled by this number is the gradient. Notice that this is the gradient only with respect to the row of $W$ that corresponds to the correct class. For the other rows, where $j \neq y_i$, the gradient is:
$\nabla_{w_j} L_i = \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) x_i$
Once you derive the expression for the gradient, it is straightforward to implement the expressions and use them to perform the gradient update.
Taken from the Stanford CS231n optimization notes posted on GitHub.
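The two gradient expressions above can be sketched as a small loop over classes for a single datapoint. Shapes and numbers below are made up for illustration:

```python
import numpy as np

# Made-up single datapoint: 4 classes, 3 features, one row of W per class.
rng = np.random.default_rng(2)
W = rng.normal(size=(4, 3))
x_i = rng.normal(size=3)
y_i = 2
delta = 1.0

scores = W.dot(x_i)
dW = np.zeros_like(W)
count = 0
for j in range(W.shape[0]):
    if j == y_i:
        continue
    if scores[j] - scores[y_i] + delta > 0:  # margin violated
        dW[j] = x_i                          # second expression: +x_i
        count += 1
dW[y_i] = -count * x_i                       # first expression: -(count) * x_i
print(count, dW.shape)
```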
First of all, note that the multi-class hinge loss is a function of $W_r$:
\begin{equation}
l(W_r) = \max( 0, 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i - W_{y_i} \cdot x_i)
\end{equation}
Next, the max function is non-differentiable at $0$, so we need to calculate its subgradient.
\begin{equation}
\frac{\partial l(W_r)}{\partial W_r} =
\begin{cases}
\{0\}, & W_{y_i}\cdot x_i > 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i \\
\{x_i\}, & W_{y_i}\cdot x_i < 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i\\
\{\alpha x_i\}, & \alpha \in [0,1], W_{y_i}\cdot x_i = 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i
\end{cases}
\end{equation}
In the second case, $W_{y_i}$ is independent of $W_r$. The above definition of the subgradient of the multi-class hinge loss is analogous to the subgradient of the binary hinge loss.
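To see which of the three subgradient cases applies, here is a small numeric check (all numbers made up), taking the derivative with respect to the maximizing rival row $W_r$ as in the cases above:

```python
import numpy as np

# Made-up example: 3 classes, 2 features.
x_i = np.array([1.0, 2.0])
W = np.array([[0.2, 0.1],    # class 0 (the true class)
              [0.5, -0.3],   # class 1
              [0.0, 0.4]])   # class 2
y_i = 0

scores = W.dot(x_i)
rival = max(scores[r] for r in range(W.shape[0]) if r != y_i)
if scores[y_i] > 1 + rival:
    sub = np.zeros_like(x_i)   # case 1: zero loss, subgradient {0}
elif scores[y_i] < 1 + rival:
    sub = x_i                  # case 2: subgradient {x_i}
else:
    sub = 0.5 * x_i            # kink: any alpha in [0, 1] times x_i works
print(sub)  # margin violated here, so the subgradient is x_i
```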
Can anyone elaborate on the SVM gradient equation described here in practical terms? http://cs231n.github.io/optimization-1/#gradcompute
Can you be more specific about the part you didn't understand? Are you just asking about the numerical gradient?
import numpy as np

score = X.dot(W)
#predicted (correct-class) scores
y_pred = score[range(score.shape[0]),y]
#calculating loss: clamp margins at zero so negative margins do not contribute
margins = np.maximum(0, score - y_pred[:,None] + delta)
margins[range(score.shape[0]),y] = 0
loss = np.sum(margins,axis=1)
#indicator variable equation - number of classes with loss > 0
non_zeros_count = (margins > 0).sum(axis=1)
#scaling input values by the count of non_zero loss
grad = X * non_zeros_count[:,None]
#updating the actual class value by multiplying by -1 - have a look at the link posted above
grad[range(score.shape[0]),y] *= -1
Please look at the code. I am having a hard time calculating the gradients. How do I proceed from here? Is it right so far?
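One issue with the snippet above is that `grad = X * non_zeros_count[:,None]` only scales each input row; the gradient should instead be a matrix with the same shape as `W`, built by scattering each $x_i$ into the columns of the violating classes. A hedged sketch of the standard vectorized completion, with made-up shapes for `X`, `y`, `W` (`delta` is the margin):

```python
import numpy as np

# Hypothetical small inputs so the sketch runs end to end.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))        # 5 samples, 4 features
y = rng.integers(0, 3, size=5)     # 3 classes
W = rng.normal(size=(4, 3))
delta = 1.0

score = X.dot(W)
y_pred = score[range(score.shape[0]), y]
margins = np.maximum(0.0, score - y_pred[:, None] + delta)
margins[range(score.shape[0]), y] = 0
loss = margins.sum() / X.shape[0]

# Indicator matrix: 1 where a class contributed to the loss.
ind = (margins > 0).astype(float)
# The correct-class column gets minus the count of violating classes.
ind[range(score.shape[0]), y] = -ind.sum(axis=1)
# dW has the same shape as W: X.T.dot(ind) scatters each x_i correctly.
dW = X.T.dot(ind) / X.shape[0]
print(dW.shape)  # (4, 3), same shape as W
```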