To get the gradient we differentiate the loss with respect to the $i$th component of $w$.

Rewrite hinge loss in terms of $w$ as $f(g(w))$ where $f(z)=\max(0,1-y\ z)$ and $g(w)=\mathbf{x}\cdot \mathbf{w}$

Using the chain rule we get

$$\frac{\partial}{\partial w_i} f(g(w))=\frac{\partial f}{\partial z} \frac{\partial g}{\partial w_i} $$

First derivative term is evaluated at $g(w)=\mathbf{x}\cdot \mathbf{w}$, becoming $-y$ when $y\ \mathbf{x}\cdot \mathbf{w}<1$, and 0 when $y\ \mathbf{x}\cdot \mathbf{w}>1$. Second derivative term becomes $x_i$. So in the end you get $$ \frac{\partial f(g(w))}{\partial w_i} = \begin{cases} -y\ x_i &\text{if } y\ \mathbf{x}\cdot \mathbf{w} < 1 \\ 0&\text{if } y\ \mathbf{x}\cdot \mathbf{w} > 1 \end{cases} $$

Since $i$ ranges over the components of $x$, you can view the above as a vector quantity, and write $\frac{\partial}{\partial w}$ as shorthand for $(\frac{\partial}{\partial w_1},\frac{\partial}{\partial w_2},\ldots)$

Answer from Yaroslav Bulatov on Stack Exchange
🌐
Twice22
twice22.github.io › hingeloss
Hinge Loss Gradient Computation
assign $x_i$ to each column of this matrix if ($j \neq y_i$ and $x_iw_j - x_iw_{y_i} + \Delta > 0$); assign $-\sum\limits_{j \neq y_{i}}1(x_iw_j - x_iw_{y_i} + \Delta > 0)x_i$ to the $y_i$ column:

dW = np.zeros(W.shape)  # initialize the gradient as zero
# compute the loss and the gradient
num_classes = W.shape[1]
num_train = X.shape[0]
loss = 0.0
for i in range(num_train):
    scores = X[i].dot(W)
    correct_class_score = scores[y[i]]
    nb_sup_zero = 0
    for j in range(num_classes):
        if j == y[i]:
            continue
        margin = scores[j] - correct_class_score + 1  # note delta = 1
        if margin > 0:
            nb_sup_zero += 1
            loss += margin
            dW[:, j] += X[i]
    dW[:, y[i]] -= nb_sup_zero * X[i]
🌐
Wikipedia
en.wikipedia.org › wiki › Hinge_loss
Hinge loss - Wikipedia
January 26, 2026 - In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs). For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined ...
Top answer
1 of 2
1

Hinge loss is difficult to work with when the derivative is needed, because the derivative is a piece-wise function. max has one non-differentiable point, and thus so does the hinge loss. This was a very prominent issue with non-separable cases of SVM (and a good reason to use ridge regression).

Here's a slide (Original source from Zhuowen Tu, apologies for the title typo):

Where hinge loss is defined as max(0, 1 - v) and v is the decision value of the SVM classifier (the label times the classifier score). More can be found on the hinge loss Wikipedia page.
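
The non-differentiable point is easy to exhibit numerically: the one-sided difference quotients of max(0, 1 - v) disagree at v = 1. A small sketch (mine, not part of the answer):

```python
def h(v):
    # hinge: max(0, 1 - v)
    return max(0.0, 1.0 - v)

eps = 1e-6
left  = (h(1.0) - h(1.0 - eps)) / eps   # slope approaching v = 1 from the left
right = (h(1.0 + eps) - h(1.0)) / eps   # slope approaching v = 1 from the right
print(left, right)  # approximately -1.0 and 0.0: the one-sided derivatives differ
```

Because the two limits differ, no single derivative exists at v = 1, which is exactly the piece-wise issue described above.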

As for your equation: you can easily pick out the v of the equation, however without more context of those functions it's hard to say how to derive. Unfortunately I don't have access to the paper and cannot guide you any further...

2 of 2
1

I disagree with the earlier answer that this is difficult to calculate. If we have the function \begin{align*} \sum_{t\in\mathcal{T}} \max \{0, 1 - d(t) \, y(t, \theta)\} \end{align*} the gradient with respect to $\theta$ is \begin{align*} & \sum_{t\in\mathcal{T}}g(t) \\ & g(t) := \begin{cases} 0 & \text{ if }1 - d(t) y(t, \theta) < 0 \\ -d(t)\dfrac{\partial y}{\partial \theta} & \text{ otherwise} \\ \end{cases} \end{align*} Theoretically this is OK; it just means that the gradient is not continuous. However, the objective is still continuous assuming that $d$ and $y$ are both continuous.

In practice, it's not a problem either. Any automatic differentiation software (TensorFlow, PyTorch, JAX) will handle something like this automatically and correctly.
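
As a concrete sketch of the case analysis above, assuming the linear model $y(t, \theta) = \theta \cdot x_t$ so that $\partial y/\partial \theta = x_t$ (the data and function name here are mine, for illustration only):

```python
import numpy as np

def total_hinge_grad(theta, X, d):
    """Gradient of sum_t max(0, 1 - d_t * (theta . x_t)) w.r.t. theta."""
    grad = np.zeros_like(theta)
    for x_t, d_t in zip(X, d):
        if 1.0 - d_t * np.dot(theta, x_t) > 0:  # hinge active: g(t) = -d(t) * dy/dtheta
            grad += -d_t * x_t
        # otherwise g(t) = 0: this example contributes nothing
    return grad

X = np.array([[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]])
d = np.array([1.0, -1.0, 1.0])
theta = np.array([0.5, 0.5])
print(total_hinge_grad(theta, X, d))  # third point has margin > 1, so it drops out
```

The third example satisfies its margin ($d_t\,\theta\cdot x_t = 3 \geq 1$), so only the first two contribute, illustrating the piecewise definition of $g(t)$.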

🌐
ScienceDirect
sciencedirect.com › topics › engineering › hinge-loss-function
Hinge Loss Function - an overview | ScienceDirect Topics
The hinge loss encourages the network to maximize the margin around the decision boundary separating the two classes, which can lead to better generalization performance than using cross-entropy. Additionally, the hinge loss has sparse gradients, which can be useful for training large models with limited memory (unlike cross-entropy with dense gradients). A frequently used variant of the hinge loss is the squared hinge loss, given by
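
For reference, the squared hinge loss the excerpt leads into is commonly written $\ell(v) = \max(0, 1-v)^2$; its derivative $-2\max(0, 1-v)$ is continuous at $v = 1$, unlike the plain hinge. A small numerical sketch of this (mine, not from the source):

```python
def sq_hinge(v):
    # squared hinge: max(0, 1 - v)**2
    return max(0.0, 1.0 - v) ** 2

eps = 1e-6
left  = (sq_hinge(1.0) - sq_hinge(1.0 - eps)) / eps   # slope from the left of v = 1
right = (sq_hinge(1.0 + eps) - sq_hinge(1.0)) / eps   # slope from the right of v = 1
print(left, right)  # both approximately 0: the one-sided derivatives now agree
```
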
🌐
arXiv
arxiv.org › pdf › 2004.00179 pdf
Fully-Corrective Gradient Boosting with Squared Hinge
curves of the number of selected weak learners. The detailed · experimental settings can be found in Simulation IV in ... C. Squared hinge loss · Since the gradient descent viewpoint connects the gradient
🌐
arXiv
arxiv.org › abs › 2302.11062
[2302.11062] A Log-linear Gradient Descent Algorithm for Unbalanced Binary Classification using the All Pairs Squared Hinge Loss
February 21, 2023 - Naive learning algorithms compute the gradient in quadratic time, which is too slow for learning using large batch sizes. We propose a new functional representation of the square loss and squared hinge loss, which results in algorithms that compute the gradient in either linear or log-linear time, and makes it possible to use gradient descent learning with large batch sizes.
🌐
GitHub
github.com › Gmoog › Svm
GitHub - Gmoog/Svm: Implement Linear SVM using squared hinge loss in python
We look at how to implement the Linear Support Vector Machine with a squared hinge loss in python. The code uses the fast gradient descent algorithm, and we find the optimal value for the regularization parameter using cross validation.
Author   Gmoog
🌐
Vivian Website
csie.ntu.edu.tw › ~cjlin › papers › l2mcsvm › l2mcsvm.pdf pdf
A Study on L2-Loss (Squared Hinge-Loss) Multi-Class SVM
From Tables 5 and 6, L2 loss is worse than L1 loss on the average training time · and sparsity. The higher percentage of support vectors is the same as the situation · in binary classification because the squared hinge loss leads to many small but non-
🌐
arXiv
arxiv.org › pdf › 2302.11062 pdf
A Log-linear Gradient Descent Algorithm for Unbalanced Binary
Theorem 2. If ℓis the squared hinge loss then the total loss over all pairs of positive and negative examples
🌐
Quora
quora.com › Why-is-squared-hinge-loss-differentiable
Why is squared hinge loss differentiable? - Quora
Answer (1 of 4): Let’s start by defining the hinge loss function h(x) = max(1-x,0). Now let’s think about the derivative h’(x). This does not exist at x = 1 because the left and right limits do not converge to the same number (ie: the derivative is undefined at x=1, but it is -1 for x
🌐
ScienceDirect
sciencedirect.com › science › article › abs › pii › S0893608021004950
Fully corrective gradient boosting with squared hinge: Fast learning rates and early stopping - ScienceDirect
December 29, 2021 - Besides the classical exponential ... least square loss in ... 2 Boosting (Buhlmann & Yu, 2003), and hinge loss in HingeBoost (Gao & Koller, 2011). The update scheme iteratively derives a new estimator based on the selected weak learners. According to the gradient descent view, ...
🌐
University of Maryland Department of Computer Science
cs.umd.edu › class › spring2017 › cmsc422 › slides0101 › lecture12.pdf pdf
(Sub)gradient Descent
CMSC422 University of Maryland · Machine learning is all about finding patterns in data to get computers to solve complex problems. Instead of explicitly programming computers to perform a task, machine learning lets us program the computer to learn from examples and improve over time without ...
🌐
Taylor & Francis
taylorandfrancis.com › knowledge › Engineering_and_technology › Engineering_support_and_special_topics › Hinge_loss
Hinge loss – Knowledge and References - Taylor & Francis
Hinge-loss function for multi-class SVM is mathematically defined as In this case when = r, else . Observe that when p = 1, equation (14) is hinge loss function (-loss ), whereas if p = 2, it is a squared hinge loss ...
Top answer
1 of 3
43

To get the gradient we differentiate the loss with respect to the $i$th component of $w$.

Rewrite hinge loss in terms of $w$ as $f(g(w))$ where $f(z)=\max(0,1-y\ z)$ and $g(w)=\mathbf{x}\cdot \mathbf{w}$

Using the chain rule we get

$$\frac{\partial}{\partial w_i} f(g(w))=\frac{\partial f}{\partial z} \frac{\partial g}{\partial w_i} $$

First derivative term is evaluated at $g(w)=\mathbf{x}\cdot \mathbf{w}$, becoming $-y$ when $y\ \mathbf{x}\cdot \mathbf{w}<1$, and 0 when $y\ \mathbf{x}\cdot \mathbf{w}>1$. Second derivative term becomes $x_i$. So in the end you get $$ \frac{\partial f(g(w))}{\partial w_i} = \begin{cases} -y\ x_i &\text{if } y\ \mathbf{x}\cdot \mathbf{w} < 1 \\ 0&\text{if } y\ \mathbf{x}\cdot \mathbf{w} > 1 \end{cases} $$

Since $i$ ranges over the components of $x$, you can view the above as a vector quantity, and write $\frac{\partial}{\partial w}$ as shorthand for $(\frac{\partial}{\partial w_1},\frac{\partial}{\partial w_2},\ldots)$
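
A quick way to sanity-check this piecewise gradient is to compare it with a central finite-difference estimate at a point where the hinge is active (this check is my own sketch; `hinge` and `hinge_grad` are illustrative names, not from the answer):

```python
import numpy as np

def hinge(w, x, y):
    # f(g(w)) = max(0, 1 - y * (x . w))
    return max(0.0, 1.0 - y * np.dot(x, w))

def hinge_grad(w, x, y):
    # piecewise gradient derived above: -y*x if y*(x.w) < 1, else 0
    return -y * x if y * np.dot(x, w) < 1 else np.zeros_like(w)

w = np.array([0.2, -0.1])
x = np.array([1.0, 2.0])
y = 1.0  # here y*(x.w) = 0.0 < 1, so the hinge is active

analytic = hinge_grad(w, x, y)
eps = 1e-6
numeric = np.array([
    (hinge(w + eps * e, x, y) - hinge(w - eps * e, x, y)) / (2 * eps)
    for e in np.eye(2)  # perturb each component w_i in turn
])
print(analytic, numeric)  # both approximately [-1., -2.], i.e. -y*x
```
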

2 of 3
19

This is 3 years late, but still may be relevant for someone...

Let $S$ denote a sample of points $x_i \in R^d$ and the set of corresponding labels $y_i \in \{-1,1\}$. We seek a hyperplane $w$ that minimizes the total hinge loss: \begin{equation} w^* = \underset{w}{\text{argmin }} L^{hinge}_S(w) = \underset{w}{\text{argmin }} \sum_i{l_{hinge}(w,x_i,y_i)}= \underset{w}{\text{argmin }} \sum_i{\max\{0,1-y_iw\cdot x_i\}} \end{equation} To find $w^*$, take the derivative of the total hinge loss $L^{hinge}_S(w)$. The gradient of each component is: $$ \frac{\partial{l_{hinge}}}{\partial w}= \begin{cases} 0 & y_iw\cdot x_i \geq 1 \\ -y_ix_i & y_iw\cdot x_i < 1 \end{cases} $$

The gradient of the sum is a sum of gradients. $$ \frac{\partial{L_S^{hinge}}}{\partial{w}}=\sum_i{\frac{\partial{l_{hinge}}}{\partial w}} $$ A Python example, which uses gradient descent to find the hinge-loss-optimal separating hyperplane, follows (it's probably not the most efficient code, but it works):

import numpy as np
import matplotlib.pyplot as plt

def hinge_loss(w,x,y):
    """ evaluates hinge loss and its gradient at w

    rows of x are data points
    y is a vector of labels
    """
    loss,grad = 0,0
    for (x_,y_) in zip(x,y):
        v = y_*np.dot(w,x_)
        loss += max(0,1-v)
        grad += 0 if v > 1 else -y_*x_
    return (loss,grad)

def grad_descent(x,y,w,step,thresh=0.001):
    """ gradient descent on the total hinge loss with a normalized,
    1/k decaying step; returns the average of all iterates
    """
    ws = np.zeros((2,0))
    ws = np.hstack((ws,w.reshape(2,1)))
    step_num = 1
    delta = np.inf
    loss0 = np.inf
    while np.abs(delta)>thresh:
        loss,grad = hinge_loss(w,x,y)
        delta = loss0-loss
        loss0 = loss
        grad_dir = grad/np.linalg.norm(grad)  # unit-length descent direction
        w = w-step*grad_dir/step_num          # step size decays as 1/step_num
        ws = np.hstack((ws,w.reshape((2,1))))
        step_num += 1
    return np.sum(ws,1)/np.size(ws,1)         # average of the iterates

def test1():
    # sample data points
    x1 = np.array((0,1,3,4,1))
    x2 = np.array((1,2,0,1,1))
    x  = np.vstack((x1,x2)).T
    # sample labels
    y = np.array((1,1,-1,-1,-1))
    w = grad_descent(x,y,np.array((0,0)),0.1)
    loss, grad = hinge_loss(w,x,y)
    plot_test(x,y,w)

def plot_test(x,y,w):
    plt.figure()
    x1, x2 = x[:,0], x[:,1]
    x1_min, x1_max = np.min(x1)*.7, np.max(x1)*1.3
    x2_min, x2_max = np.min(x2)*.7, np.max(x2)*1.3
    gridpoints = 2000
    x1s = np.linspace(x1_min, x1_max, gridpoints)
    x2s = np.linspace(x2_min, x2_max, gridpoints)
    gridx1, gridx2 = np.meshgrid(x1s,x2s)
    grid_pts = np.c_[gridx1.ravel(), gridx2.ravel()]
    predictions = np.array([np.sign(np.dot(w,x_)) for x_ in grid_pts]).reshape((gridpoints,gridpoints))
    plt.contourf(gridx1, gridx2, predictions, cmap=plt.cm.Paired)
    plt.scatter(x[:, 0], x[:, 1], c=y, cmap=plt.cm.Paired)
    plt.title('total hinge loss: %g' % hinge_loss(w,x,y)[0])
    plt.show()

if __name__ == '__main__':
    np.set_printoptions(precision=3)
    test1()
🌐
GitHub
github.com › tejasmhos › Linear-SVM-Using-Squared-Hinge-Loss
GitHub - tejasmhos/Linear-SVM-Using-Squared-Hinge-Loss: This is an implementation, from scratch, of the linear SVM using squared hinge loss · GitHub
This is an implementation of a Linear SVM that uses a squared hinge loss. This algorithm was coded using Python. This is my submission for the polished code release for DATA 558 - Statistical Machine Learning. The code was developed by Tejas Hosangadi. The Linear SVM that Uses Squared Hinge Loss writes out as shown below: The above equation is differentiable and convex, hence we can apply gradient descent.
Starred by 5 users
Forked by 3 users
Languages   Python
🌐
Analytics Vidhya
analyticsvidhya.com › home › what is hinge loss in machine learning?
What is Hinge loss in Machine Learning?
December 23, 2024 - Binary Classification: Hinge loss is highly effective for binary classification tasks and works well with linear classifiers. Sparse Gradients: When the prediction is correct with a margin (i.e., y⋅f(x)>1), the hinge loss gradient is zero.
🌐
arXiv
arxiv.org › pdf › 2103.00233 pdf
Learning with Smooth Hinge Losses Junru Luo ∗, Hong Qiao †, and Bo Zhang ‡
Due to the non-differentiability and non-convexity of · the 0/1 loss, it is an NP-hard problem to optimize (1) directly. To overcome · this difficulty, it is common to replace the 0/1 loss with a convex surrogate loss · function. And many efficient convex optimization methods can thus be applied · to obtain a good solution, such as the gradient-based method and the coordinate
Top answer
1 of 2
8

Let's use the example of the SVM loss function for a single datapoint:

$L_i = \sum_{j\neq y_i} \left[ \max(0, w_j^Tx_i - w_{y_i}^Tx_i + \Delta) \right]$

Where $\Delta$ is the desired margin.

We can differentiate the function with respect to the weights. For example, taking the gradient with respect to $w_{yi}$ we obtain:

$\nabla_{w_{y_i}} L_i = - \left( \sum_{j\neq y_i} \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) \right) x_i$

Where $\mathbb{1}$ is the indicator function that is one if the condition inside is true and zero otherwise. While the expression may look scary when it is written out, when you're implementing this in code you'd simply count the number of classes that didn't meet the desired margin (and hence contributed to the loss function); the data vector $x_i$ scaled by this count (and negated) is the gradient. Notice that this is the gradient only with respect to the row of $W$ that corresponds to the correct class. For the other rows, where $j \neq y_i$, the gradient is:

$\nabla_{w_j} L_i = \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) x_i$

Once you derive the expression for the gradient it is straightforward to implement the expressions and use them to perform the gradient update.

Taken from the Stanford CS231n optimization notes posted on GitHub.
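
The two expressions above translate almost line-for-line into NumPy. A minimal sketch for a single datapoint (my own, with $\Delta = 1$ and hypothetical names, not the CS231n reference implementation):

```python
import numpy as np

def svm_loss_grad_single(W, x, y, delta=1.0):
    """Multiclass SVM loss L_i and gradient dW for one datapoint.

    W: (D, C) weights, x: (D,) datapoint, y: correct class index.
    """
    scores = x.dot(W)                     # s_j = w_j^T x
    margins = scores - scores[y] + delta  # w_j^T x - w_{y_i}^T x + delta
    margins[y] = 0.0                      # j == y is excluded from the sum
    active = margins > 0                  # indicator 1(margin > 0) per class
    loss = margins[active].sum()
    dW = np.zeros_like(W)
    dW[:, active] += x[:, None]           # gradient for the rows with j != y
    dW[:, y] = -active.sum() * x          # correct-class row: -(count) * x_i
    return loss, dW

W = np.zeros((3, 4))
x = np.array([1.0, 2.0, 3.0])
loss, dW = svm_loss_grad_single(W, x, y=2)
print(loss)  # with zero weights every wrong-class margin equals delta, so loss = 3.0
```

With all-zero weights, all three wrong classes violate the margin, so the correct-class column gets $-3x_i$ and each wrong-class column gets $+x_i$, matching the two formulas.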

2 of 2
0

First of all, note that the multi-class hinge loss function is a function of $W_r$. \begin{equation} l(W_r) = \max( 0, 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i - W_{y_i} \cdot x_i) \end{equation} Next, the max function is non-differentiable at $0$, so we need to calculate its subgradient. \begin{equation} \frac{\partial l(W_r)}{\partial W_r} = \begin{cases} \{0\}, & W_{y_i}\cdot x_i > 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i \\ \{x_i\}, & W_{y_i}\cdot x_i < 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i\\ \{\alpha x_i\}, & \alpha \in [0,1], W_{y_i}\cdot x_i = 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i \end{cases} \end{equation} In the second case, $W_{y_i}$ is independent of $W_r$. This definition of the subgradient of the multi-class hinge loss is analogous to the subgradient of the binary-class hinge loss.