svm loss function gradient

Compute the gradient of the SVM loss function

stackoverflow.com › questions › 36020583 › compute-the-gradient-of-the-svm-loss-function

The method to calculate gradient in this case is Calculus (analytically, NOT numerically!). So we differentiate loss function with respect to W(yi) like this:

and with respect to W(j) when j!=yi is:

The 1 is just indicator function so we can ignore the middle form when condition is true. And when you write in code, the example you provided is the answer.

Since you are using cs231n example, you should definitely check note and videos if needed.

Hope this helps!

Answer from dexhunter on Stack Overflow

Stack Overflow

stackoverflow.com › questions › 36020583 › compute-the-gradient-of-the-svm-loss-function

python - Compute the gradient of the SVM loss function - Stack Overflow

Top answer

1 of 3

The method to calculate gradient in this case is Calculus (analytically, NOT numerically!). So we differentiate loss function with respect to W(yi) like this:

and with respect to W(j) when j!=yi is:

The 1 is just indicator function so we can ignore the middle form when condition is true. And when you write in code, the example you provided is the answer.

Since you are using cs231n example, you should definitely check note and videos if needed.

Hope this helps!

2 of 3

If the substraction less than zero the loss is zero so the gradient of W is also zero. If the substarction larger than zero, then the gradient of W is the partial derviation of the loss.

Stack Exchange

math.stackexchange.com › questions › 2572318 › derivation-of-gradient-of-svm-loss

machine learning - Derivation of gradient of SVM loss - Mathematics Stack Exchange

Top answer

1 of 3

Let's start with basics. The so-called gradient is just the ordinary derivative, that is, slope. For example, slope of the linear function $\text{[math]}$ equals $\text{[math]}$ , so its gradient w.r.t. $\text{[math]}$ equals $\text{[math]}$ . If $\text{[math]}$ and $\text{[math]}$ are not numbers, but vectors, then the gradient is also a vector.

Another piece of good news is that gradient is a linear operator. It means, you can add functions and multiply by constants before or after differentiation, it doesn't make any difference

Now take the definition of SVM loss function for a single $\text{[math]}$ -th observation. It is

$\text{[math]}$

where $\text{[math]}$ . Thus, loss equals $\text{[math]}$ , if the latter is non-negative, and $\text{[math]}$ otherwise.

In the first (non-negative) case the loss $\text{[math]}$ is linear in $\text{[math]}$ , so the gradient is just the slope of this function of $\text{[math]}$ , that is , $\text{[math]}$ .

In the second (negative) case the loss $\text{[math]}$ is constant, so its derivative is also $\text{[math]}$ .

To write all this cases in one equation, we invent a function (it is called indicator) $\text{[math]}$ , which equals $\text{[math]}$ if $\text{[math]}$ is true, and $\text{[math]}$ otherwise. With this function, we can write

$\text{[math]}$

If $\text{[math]}$ , the first multiplier equals 1, and gradient equals $\text{[math]}$ . Otherwise, the first multiplier equals 0, and gradient as well. So I just rewrote the two cases in a single line.

Now let's turn from a single $\text{[math]}$ -th observation to the whole loss. The loss is sum of individual losses. Thus, because differentiation is linear, the gradient of a sum equals sum of gradients, so we can write

$\text{total derivative} = \sum(I(something - w_y*x_i > 0) * (-x_i))$

Now, move the $\text{[math]}$ multiplier from $\text{[math]}$ to the beginning of the formula, and you will get your expression.

2 of 3

David has provided good answer. But I would point out that the sum() in David's answer:

total_derivative = sum(I(something - w_y*x[i] > 0) * (-x[i]))

is different from the one in the original Nikhil's question:

$\text{[math]}$ The above equation is still the gradient due to the i-th observation, but for the weight of the ground truth class, i.e. $\text{[math]}$ . There is the summation $\text{[math]}$ , because $\text{[math]}$ is in every term of the SVM loss $\text{[math]}$ :

$\text{[math]}$ For every non-zero term, i.e. $\text{[math]}$ , you would obtain the gradient $\text{[math]}$ . In total, the gradient $\text{[math]}$ is $numOfNonZeroTerm \times (- x_i)$, same as the equation above.

Gradients of individual observations $\text{[math]}$ (computed above) are then averaged to obtain the gradient of the batch of observations $\text{[math]}$ .

Videos

05:30

YouTube

What is the Hinge Loss in SVM in Machine Learning | Data Science ...

Introduction to Hinge Loss | Loss function SVM | Machine Learning ...

February 16, 2023

22:18

YouTube

7.3.4. Loss Function for Support Vector Machine Classifier - Hinge ...

November 1, 2021

18:28

YouTube

7.3.5. Gradient Descent for Support Vector Machine Classifier - ...

November 3, 2021

10:46

YouTube

Gradient Descent for Support Vector Machines and Subgradients - ...

stackoverflow.com › questions › 38581242 › gradient-of-a-loss-function-for-an-svm

python - Gradient of a Loss Function for an SVM - Stack Overflow

Top answer

1 of 1

I don't have enough reputation to comment, so I am answering here. Whenever you compute loss vector for x[i], ith training example and get some nonzero loss, that means you should move your weight vector for the incorrect class (j != y[i]) away by x[i], and at the same time, move the weights or hyperplane for the correct class (j==y[i]) near x[i]. By parallelogram law, w + x lies in between w and x. So this way w[y[i]] tries to come nearer to x[i] each time it finds loss>0.

Thus, dW[:,y[i]] += -X[i] and dW[:,j] += X[i] is done in the loop, but while update, we will do in direction of decreasing gradient, so we are essentially adding X[i] to correct class weights and going away by X[i] from weights that miss classify.

GitHub

github.com › amanchadha › stanford-cs231n-assignments-2020 › blob › master › assignment1 › svm.ipynb

stanford-cs231n-assignments-2020/assignment1/svm.ipynb at master · amanchadha/stanford-cs231n-assignments-2020

We have provided code that does this for you:"]},{"cell_type":"code","metadata":{"id":"Ys4ZyO1g1Gzt","colab_type":"code","outputId":"6555ab0a-f3c1-45f9-8c25-4d118d1f2161","executionInfo":{"status":"ok","timestamp":1587284901777,"user_tz":420,"elapsed":20564,"user":{"displayName":"Aman Chadha","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjiX3RiuA7fo0mAOK6_lUUA-fdtlCDBrW7oDjZEuQ=s64","userId":"13683555200340050974"}},"colab":{"base_uri":"https://localhost:8080/","height":357}},"source":["# Once you've implemented the gradient, recompute it with the code below\n","# and gradient check it with the function we provided for you\n","\n","# Compute the loss and its gradient at W.\n","loss, grad = svm_loss_naive(W, X_dev, y_dev, 0.0)\n","\n","# Numerically compute the gradient along several randomly chosen dimensions, and\n","# compare them with your analytically computed gradient.

Author amanchadha

Stack Exchange

stats.stackexchange.com › questions › 412077 › svm-loss-function-gradient

svm loss function gradient - Cross Validated - Stack Exchange

June 8, 2019 - When you take the derivative wrt some $k\neq y_i$, the $w_k$ appears in the whole expression just once, i.e. when $j=k$. And, the gradient will be just $x_i$ times the indicator. For example, let the set $j\neq y_i$ be $\{a,b,k\}$. The expanded version of the loss function will be $$L_i=\max(0,w_ax_i-w_{y_i}x_i+\Delta)+\max(0,w_bx_i-w_{y_i}x_i+\Delta)+\max(0,w_kx_i-w_{y_i}x_i+\Delta)$$

University of Oxford

robots.ox.ac.uk › ~az › lectures › ml › lect2.pdf pdf

Lecture 2: The SVM classifier

• Support Vector Machine (SVM) classifier · • Wide margin · • Cost function · • Slack variables · • Loss functions revisited · • Optimization · Binary Classification · Given training data (xi, yi) for i = 1 . . . N, with · xi ∈Rd and yi ∈{−1, 1}, learn a classiﬁer f(x) such that ·

Stack Exchange

stats.stackexchange.com › questions › 155088 › gradient-for-hinge-loss-multiclass

Gradient for hinge loss multiclass - Cross Validated

Top answer

1 of 2

Let's use the example of the SVM loss function for a single datapoint:

$L_i = \sum_{j\neq y_i} \left[ \max(0, w_j^Tx_i - w_{y_i}^Tx_i + \Delta) \right]$

Where $\text{[math]}$ is the desired margin.

We can differentiate the function with respect to the weights. For example, taking the gradient with respect to $\text{[math]}$ we obtain:

$\nabla_{w_{y_i}} L_i = - \left( \sum_{j\neq y_i} \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) \right) x_i$

Where 1 is the indicator function that is one if the condition inside is true or zero otherwise. While the expression may look scary when it is written out, when you're implementing this in code you'd simply count the number of classes that didn't meet the desired margin (and hence contributed to the loss function) and then the data vector $\text{[math]}$ scaled by this number is the gradient. Notice that this is the gradient only with respect to the row of $\text{[math]}$ that corresponds to the correct class. For the other rows where $j≠{{y}_{i}}$ the gradient is:

$\nabla_{w_j} L_i = \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) x_i$

Once you derive the expression for the gradient it is straight-forward to implement the expressions and use them to perform the gradient update.

Taken from Stanford CS231N optimization notes posted on github.

2 of 2

First of all, note that multi-class hinge loss function is a function of $\text{[math]}$ . \begin{equation} l(W_r) = \max( 0, 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i - W_{y_i} \cdot x_i) \end{equation} Next, max function is non-differentiable at $\text{[math]}$ . So, we need to calculate the subgradient of it. \begin{equation} \frac{\partial l(W_r)}{\partial W_r} = \begin{cases} \{0\}, & W_{y_i}\cdot x_i > 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i \\ \{x_i\}, & W_{y_i}\cdot x_i < 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i\\ \{\alpha x_i\}, & \alpha \in [0,1], W_{y_i}\cdot x_i = 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i \end{cases} \end{equation} In the second case, $\text{[math]}$ is independent of $\text{[math]}$ . Above definition of subgradient of multi-class hinge loss is similar to subgradient of binary class hinge loss.

GitHub

github.com › huyouare › CS231n › blob › master › assignment1 › cs231n › classifiers › linear_svm.py

CS231n/assignment1/cs231n/classifiers/linear_svm.py at master · huyouare/CS231n

Structured SVM loss function, naive implementation (with loops) Inputs: - W: C x D array of weights · - X: D x N array of data. Data are D-dimensional columns · - y: 1-dimensional array of length N with labels 0...K-1, for K classes · - reg: (float) regularization strength · Returns: a tuple of: - loss as single float · - gradient with respect to weights W; an array of same shape as W ·

Author huyouare

Find elsewhere

Google Bing Mojeek

Wikipedia

en.wikipedia.org › wiki › Hinge_loss

Hinge loss - Wikipedia

January 26, 2026 - The hinge loss is a convex function, so many of the usual convex optimizers used in machine learning can work with it. It is not differentiable, but has a subgradient with respect to model parameters w of a linear SVM with score function

Extensions Optimization

Gitbooks

sharad-s.gitbooks.io › cs231n › content › lecture_3_-_loss_functions_and_optimization › multiclass_svm_loss_deep_dive.html

Multiclass SVM Loss (Deep Dive) · CS231n

A: You would expect a loss of approximately (C-1) where C is the number of classes. This is because if you look at the equation for Multiclass SVM Loss, you will see that max(0, 0-0 + 1) evaluates to a loss of 1 for each class.

Aman's AI Journal

aman.ai › primers › backprop › derivative-svm

Aman's AI Journal • Primers • Partial Derivative for SVM/Hinge Loss

Starting with the SVM loss function for a single datapoint: \[L_i = \sum_{j\neq y_i} \left[ \max(0, w_j^Tx_i - w_{y_i}^Tx_i + \Delta) \right]\] Differentiating the function to obtain the gradient with respect to the weights corresponding to the correct class $w_{y_i}$, we obtain: \[\boxed{\nabla_{w_{y_i}} L_i = - \left( \sum_{j\neq y_i} \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) \right) x_i}\]

Twice22

twice22.github.io › pdf › gradient_svm.pdf pdf

Gradient of the SVM Hinge loss

April 9, 2017 - Machine learning enthusiast

CS231n

cs231n.github.io › linear-classify

CS231n Deep Learning for Computer Vision

You can see that the first term gives zero since [-7 - 13 + 10] gives a negative number, which is then thresholded to zero with the $max(0,-)$ function. We get zero loss for this pair because the correct class score (13) was greater than the incorrect class score (-7) by at least the margin 10. In fact the difference was 20, which is much greater than 10 but the SVM only cares that the difference is at least 10; Any additional difference above the margin is clamped at zero with the max operation.

University of Utah

users.cs.utah.edu › ~zhe › pdf › lec-19-2-svm-sgd-upload.pdf pdf

1 Support Vector Machines: Training with Stochastic Gradient Descent

Compute gradient of J(w) at wt. Call it ∇J(w ... This algorithm is guaranteed to converge to the minimum of J if %t is small enough. ... Hinge loss is not differentiable!

Codemia

codemia.io › knowledge-hub › path › compute_the_gradient_of_the_svm_loss_function

Compute the gradient of the SVM loss function

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises

Stack Overflow

stackoverflow.com › questions › 47607814 › vectorized-svm-gradient

python - Vectorized SVM gradient - Stack Overflow

Top answer

1 of 2

Let us recap the scenario and the loss function first, so we are on the same page:

Given are P sample points in N-dimensional space in the form of a PxN matrix X, so the points are the rows of this matrix. Each point in X is assigned to one out of M categories. These are given as a vector Y of length P that has integer values between 0 and M-1.

The goal is to predict the classes of all points by M linear classifiers (one for each category) given in the form of a weight matrix W of shape NxM, so the classifiers are the columns of W. To predict the categories of all samples X the scalar products between all points and all weight vectors are formed. This is the same as matrix multiplying X and W yielding a score matrix Y0 that is arranged such that its rows are ordered like theh elements of Y, each row corresponds to one sample. The predicted category for each sample is simply that with the largest score.

There are no bias terms so I presume there is some kind of symmetry or zero mean assumption.

Now, to find a good set of weights we want a loss function that is small for good predictions and large for bad predictions and that lets us do gradient descent. One of the most straight-forward ways is to just punish for each sample i each score that is larger than the score of the correct category for that sample and let the penalty grow linearly with the difference. So if we write A[i] for the set of categories j that score more than the correct category Y0[i, j] > Y0[i, Y[i]] the loss for sample i could be written as

sum_{j in A[i]} (Y0[i, j] - Y0[i, Y[i]])

or equivalently if we write #A[i] for the number of elements in A[i]

(sum_{j in A[i]} Y0[i, j]) - #A[i] Y0[i, Y[i]]

The partial derivatives with respect to the score are thus simply

                    | -#A[i]      if j == Y[i]
dloss / dY0[i, j] = {      1      if j in A[i]
                    |      0      else

which is precisely what the first four lines you say you don't understand compute.

The next line applies the chain rule dloss/dW = dloss/dY0 dY0/dW.

It remains to divide by the number of samples to get a per sample loss and to add the derivative of the regulatization term which the regularization being just a componentwise quadratic function is easy.

2 of 2

Personally, I found it much easier to understand the whole gradient calculation through looking at the analytic derivation of the loss function in more detail. To extend on the given answer, I would like to point to the derivatives of the loss function

with respect to the weights as follows:

Loss gradient wrt w_yi (correct class)

Hence, we count the cases where w_j is not meeting the margin requirement and sum those cases up. This negative sum is then specified as weight for the position of the correct class w_yi. (we later need to multiply this value with xi, this is what you do in your code in line 5)

2) Loss gradient wrt w_j (incorrect classes)

where 1 is the indicator function, 1 if true, else 0.

In other words, "programatically" we need to apply equation (2) to all cases where the margin requirement is not met, and adding the negative sum of all unmet requirements to the true class column (as in (1)).

So what you did in the first 3 lines of your code is to determine the cases where the margin is not met, as well as adding the negative sum of these cases to the correct class column (j). In the 5 line, you do the final step where you multiply the x_i's to the other term - and this completes the gradient calculations as in (1) and (2).

I hope this makes it easier to understand, let me know if anything remains unclear. source

Twice22

twice22.github.io › hingeloss

Hinge Loss Gradient Computation

Actually, in the lecture we can see the formula of the gradient of the SVM loss. Although the formula seems understandable, I still thinks we might need to get our hands dirty by doing the math. In this part, I will quickly define the problem according to the data of the first assignment of CS231n. Let’s define our Loss function by: \[L_i = \sum\limits_{j \neq y_i}[\ max(0, x_iw_j - x_iw_{y_i} + \Delta)\ ]\]

DeepLearning.AI

community.deeplearning.ai › ai discussions

Concept explanation - AI Discussions - DeepLearning.AI

July 11, 2024 - The aim is to compute the gradient of the SVM term of the loss function: compute the derivative at the same time that the loss is being computed. Here is the function code: def svm_loss_naive( W: ...

Stack Exchange

stats.stackexchange.com › questions › 74499 › what-is-the-loss-function-of-hard-margin-svm

What is the loss function of hard margin SVM? - Cross Validated

Top answer

1 of 3

The hinge loss term $\sum_i\max(0,1-y_i(\mathbf{w}^\intercal \mathbf{x}_i+b))$ in soft margin SVM penalizes misclassifications. In hard margin SVM there are, by definition, no misclassifications.

This indeed means that hard margin SVM tries to minimize $\|\mathbf{w}\|^2$. Due to the formulation of the SVM problem, the margin is $2/\|\mathbf{w}\|$. As such, minimizing the norm of $\mathbf{w}$ is geometrically equivalent to maximizing the margin. Exactly what we want!

Regularization is a technique to avoid overfitting by penalizing large coefficients in the solution vector. In hard margin SVM $\|\mathbf{w}\|^2$ is both the loss function and an $L_2$ regularizer.

In soft-margin SVM, the hinge loss term also acts like a regularizer but on the slack variables instead of $\mathbf{w}$ and in $L_1$ rather than $L_2$. $L_1$ regularization induces sparsity, which is why standard SVM is sparse in terms of support vectors (in contrast to least-squares SVM).

2 of 3

There's no "loss function" for hard-margin SVMs, but when we're solving soft-margin SVMs, it turns out the loss exists.

Now is the detailed explanation:

When we talk about loss function, what we really mean is a training objective that we want to minimize.

In hard-margin SVM setting, the "objective" is to maximize the geometric margin s.t each training example lies outside the separating hyperplane, i.e. $$\begin{aligned} & \max_{\gamma, w, b}\frac{1}{\Vert w \Vert} \\ &s.t\quad y(w^Tx+b) \ge 1 \end{aligned} $$ Note that this is a quadratic programming problem, so we cannot solve it numerically using direct gradient descent approach, that is, there is no analytic "loss function" for hard-margin SVMs.

However, in soft-margin SVM setting, we add a slack variable to allow our SVM to made mistakes. We now try to solve $$\begin{aligned} & \min_{w,b,\boldsymbol{\xi}}\frac{1}{2}\Vert w \Vert_2^2 + C\sum \xi_i \\ s.t\quad &y_i(w^Tx_i+b) \ge 1-\xi_i \\ & \boldsymbol{\xi} \succeq \mathbf{0} \end{aligned} $$ This is the same as we try to penalize the misclassified training example $x_i$ by adding $C\xi_i$ to our objective to be minimized. Recall hinge loss: $$ \ell_{\mbox{hinge}}(z) = \max\{0, 1-z\}, $$ since if the training example lies outside the margin $\xi_i$ will be zero and it will only be nonzero when training example falls into margin region, and since hinge loss is always nonnegative, it happens we can rephrase our problem as $$ \min \frac{1}{2}\Vert w \Vert_2^2 + C\sum\ell_{\mbox{hinge}}(y_i(w^Tx_i)). $$ We know that hinge loss is convex and its derivative is known, thus we can solve for soft-margin SVM directly by gradient descent.

So the slack variable is just hinge loss in disguise, and the property of hinge loss happens to wrap up our optimization constraints (i.e. nonnegativity and activates input when it's less than 1).

ScienceDirect

sciencedirect.com › science › article › abs › pii › S0360835222006180

Robust support vector machine classifier with truncated loss function by gradient algorithm - ScienceDirect

September 5, 2022 - The SVM classifier with the huberied truncated pinball loss (HTPSVM) is proposed. The HTPSVM combines the elastic net penalty and the nonconvex huberied truncated pinball loss. It inherits the benefits of both ... 2 norm regularizers. The HTPSVM involves nonconvex minimization, the accelerated proximal gradient (APG) algorithm was used to solve the corresponding optimization.