To get the gradient we differentiate the loss with respect to the $i$th component of $w$.

Rewrite hinge loss in terms of $w$ as $f(g(w))$ where $f(z)=\max(0,1-y\ z)$ and $g(w)=\mathbf{x}\cdot \mathbf{w}$

Using the chain rule we get

$$\frac{\partial}{\partial w_i} f(g(w))=\frac{\partial f}{\partial z} \frac{\partial g}{\partial w_i} $$

First derivative term is evaluated at $g(w)=\mathbf{x}\cdot \mathbf{w}$, becoming $-y$ when $y\ \mathbf{x}\cdot \mathbf{w}<1$, and 0 when $y\ \mathbf{x}\cdot \mathbf{w}>1$. Second derivative term becomes $x_i$. So in the end you get $$ \frac{\partial f(g(w))}{\partial w_i} = \begin{cases} -y\ x_i &\text{if } y\ \mathbf{x}\cdot \mathbf{w} < 1 \\ 0&\text{if } y\ \mathbf{x}\cdot \mathbf{w} > 1 \end{cases} $$

Since $i$ ranges over the components of $x$, you can view the above as a vector quantity, and write $\frac{\partial}{\partial w}$ as shorthand for $(\frac{\partial}{\partial w_1},\frac{\partial}{\partial w_2},\ldots)$

Answer from Yaroslav Bulatov on Stack Exchange
🌐
Twice22
twice22.github.io › hingeloss
Hinge Loss Gradient Computation
assign $x_i$ to each column of this matrix if ($j \neq y_i$ and $x_iw_j - x_iw_{y_i} + \Delta > 0$); assign $-\sum\limits_{j \neq y_{i}}1(x_iw_j - x_iw_{y_i} + \Delta > 0)x_i$ to the $y_i$ column:

dW = np.zeros(W.shape)  # initialize the gradient as zero
# compute the loss and the gradient
num_classes = W.shape[1]
num_train = X.shape[0]
loss = 0.0
for i in range(num_train):
    scores = X[i].dot(W)
    correct_class_score = scores[y[i]]
    nb_sup_zero = 0
    for j in range(num_classes):
        if j == y[i]:
            continue
        margin = scores[j] - correct_class_score + 1  # note delta = 1
        if margin > 0:
            nb_sup_zero += 1
            loss += margin
            dW[:, j] += X[i]
    dW[:, y[i]] -= nb_sup_zero * X[i]
🌐
Wikipedia
en.wikipedia.org › wiki › Hinge_loss
Hinge loss - Wikipedia
January 26, 2026 - In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs). For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined ...
Top answer
1 of 2
1

Hinge loss is difficult to work with when the derivative is needed, because the derivative is a piece-wise function. max has one non-differentiable point, and thus so does the hinge loss. This was a very prominent issue with non-separable cases of SVM (and a good reason to use ridge regression).

Here's a slide (Original source from Zhuowen Tu, apologies for the title typo):

Where hinge loss is defined as max(0, 1 - v) and v is the decision value of the SVM classifier (the label times the classifier score). More can be found on the hinge loss Wikipedia page.
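
The non-differentiable point is easy to exhibit numerically: the one-sided difference quotients of max(0, 1 - v) disagree at v = 1. A small sketch (mine, not part of the answer):

```python
def h(v):
    # hinge: max(0, 1 - v)
    return max(0.0, 1.0 - v)

eps = 1e-6
left  = (h(1.0) - h(1.0 - eps)) / eps   # slope approaching v = 1 from the left
right = (h(1.0 + eps) - h(1.0)) / eps   # slope approaching v = 1 from the right
print(left, right)  # approximately -1.0 and 0.0: the one-sided derivatives differ
```

Because the two limits differ, no single derivative exists at v = 1, which is exactly the piece-wise issue described above.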

As for your equation: you can easily pick out the v of the equation, however without more context of those functions it's hard to say how to derive. Unfortunately I don't have access to the paper and cannot guide you any further...

2 of 2
1

I disagree with the earlier answer that this is difficult to calculate. If we have the function \begin{align*} \sum_{t\in\mathcal{T}} \max \{0, 1 - d(t) \, y(t, \theta)\} \end{align*} the gradient with respect to $\theta$ is \begin{align*} & \sum_{t\in\mathcal{T}}g(t) \\ & g(t) := \begin{cases} 0 & \text{ if }1 - d(t) y(t, \theta) < 0 \\ -d(t)\dfrac{\partial y}{\partial \theta} & \text{ otherwise} \\ \end{cases} \end{align*} Theoretically this is OK; it just means that the gradient is not continuous. However, the objective is still continuous assuming that $d$ and $y$ are both continuous.

In practice, it's not a problem either. Any automatic differentiation software (TensorFlow, PyTorch, JAX) will handle something like this automatically and correctly.
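
As a concrete sketch of the case analysis above, assuming the linear model $y(t, \theta) = \theta \cdot x_t$ so that $\partial y/\partial \theta = x_t$ (the data and function name here are mine, for illustration only):

```python
import numpy as np

def total_hinge_grad(theta, X, d):
    """Gradient of sum_t max(0, 1 - d_t * (theta . x_t)) w.r.t. theta."""
    grad = np.zeros_like(theta)
    for x_t, d_t in zip(X, d):
        if 1.0 - d_t * np.dot(theta, x_t) > 0:  # hinge active: g(t) = -d(t) * dy/dtheta
            grad += -d_t * x_t
        # otherwise g(t) = 0: this example contributes nothing
    return grad

X = np.array([[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]])
d = np.array([1.0, -1.0, 1.0])
theta = np.array([0.5, 0.5])
print(total_hinge_grad(theta, X, d))  # third point has margin > 1, so it drops out
```

The third example satisfies its margin ($d_t\,\theta\cdot x_t = 3 \geq 1$), so only the first two contribute, illustrating the piecewise definition of $g(t)$.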

🌐
ScienceDirect
sciencedirect.com › topics › engineering › hinge-loss-function
Hinge Loss Function - an overview | ScienceDirect Topics
The hinge loss encourages the network to maximize the margin around the decision boundary separating the two classes, which can lead to better generalization performance than using cross-entropy. Additionally, the hinge loss has sparse gradients, which can be useful for training large models with limited memory (unlike cross-entropy with dense gradients). A frequently used variant of the hinge loss is the squared hinge loss, given by
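
For reference, the squared hinge loss the excerpt leads into is commonly written $\ell(v) = \max(0, 1-v)^2$; its derivative $-2\max(0, 1-v)$ is continuous at $v = 1$, unlike the plain hinge. A small numerical sketch of this (mine, not from the source):

```python
def sq_hinge(v):
    # squared hinge: max(0, 1 - v)**2
    return max(0.0, 1.0 - v) ** 2

eps = 1e-6
left  = (sq_hinge(1.0) - sq_hinge(1.0 - eps)) / eps   # slope from the left of v = 1
right = (sq_hinge(1.0 + eps) - sq_hinge(1.0)) / eps   # slope from the right of v = 1
print(left, right)  # both approximately 0: the one-sided derivatives now agree
```
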
🌐
arXiv
arxiv.org › pdf › 2004.00179 pdf
Fully-Corrective Gradient Boosting with Squared Hinge
curves of the number of selected weak learners. The detailed · experimental settings can be found in Simulation IV in ... C. Squared hinge loss · Since the gradient descent viewpoint connects the gradient
🌐
arXiv
arxiv.org › abs › 2302.11062
[2302.11062] A Log-linear Gradient Descent Algorithm for Unbalanced Binary Classification using the All Pairs Squared Hinge Loss
February 21, 2023 - Naive learning algorithms compute the gradient in quadratic time, which is too slow for learning using large batch sizes. We propose a new functional representation of the square loss and squared hinge loss, which results in algorithms that compute the gradient in either linear or log-linear time, and makes it possible to use gradient descent learning with large batch sizes.
🌐
GitHub
github.com › Gmoog › Svm
GitHub - Gmoog/Svm: Implement Linear SVM using squared hinge loss in python
We look at how to implement the Linear Support Vector Machine with a squared hinge loss in python. The code uses the fast gradient descent algorithm, and we find the optimal value for the regularization parameter using cross validation.
Author   Gmoog
🌐
Vivian Website
csie.ntu.edu.tw › ~cjlin › papers › l2mcsvm › l2mcsvm.pdf pdf
A Study on L2-Loss (Squared Hinge-Loss) Multi-Class SVM
From Tables 5 and 6, L2 loss is worse than L1 loss on the average training time · and sparsity. The higher percentage of support vectors is the same as the situation · in binary classification because the squared hinge loss leads to many small but non-
🌐
arXiv
arxiv.org › pdf › 2302.11062 pdf
A Log-linear Gradient Descent Algorithm for Unbalanced Binary
Theorem 2. If ℓis the squared hinge loss then the total loss over all pairs of positive and negative examples
🌐
Quora
quora.com › Why-is-squared-hinge-loss-differentiable
Why is squared hinge loss differentiable? - Quora
Answer (1 of 4): Let’s start by defining the hinge loss function h(x) = max(1-x,0). Now let’s think about the derivative h’(x). This does not exist at x = 1 because the left and right limits do not converge to the same number (ie: the derivative is undefined at x=1, but it is -1 for x
🌐
ScienceDirect
sciencedirect.com › science › article › abs › pii › S0893608021004950
Fully corrective gradient boosting with squared hinge: Fast learning rates and early stopping - ScienceDirect
December 29, 2021 - Besides the classical exponential ... least square loss in ... 2 Boosting (Buhlmann & Yu, 2003), and hinge loss in HingeBoost (Gao & Koller, 2011). The update scheme iteratively derives a new estimator based on the selected weak learners. According to the gradient descent view, ...
🌐
University of Maryland Department of Computer Science
cs.umd.edu › class › spring2017 › cmsc422 › slides0101 › lecture12.pdf pdf
(Sub)gradient Descent
CMSC422 University of Maryland · Machine learning is all about finding patterns in data to get computers to solve complex problems. Instead of explicitly programming computers to perform a task, machine learning lets us program the computer to learn from examples and improve over time without ...
🌐
Taylor & Francis
taylorandfrancis.com › knowledge › Engineering_and_technology › Engineering_support_and_special_topics › Hinge_loss
Hinge loss – Knowledge and References - Taylor & Francis
Hinge-loss function for multi-class SVM is mathematically defined as In this case when = r, else . Observe that when p = 1, equation (14) is hinge loss function (-loss ), whereas if p = 2, it is a squared hinge loss ...
Top answer
1 of 3
43

To get the gradient we differentiate the loss with respect to the $i$th component of $w$.

Rewrite hinge loss in terms of $w$ as $f(g(w))$ where $f(z)=\max(0,1-y\ z)$ and $g(w)=\mathbf{x}\cdot \mathbf{w}$

Using the chain rule we get

$$\frac{\partial}{\partial w_i} f(g(w))=\frac{\partial f}{\partial z} \frac{\partial g}{\partial w_i} $$

First derivative term is evaluated at $g(w)=\mathbf{x}\cdot \mathbf{w}$, becoming $-y$ when $y\ \mathbf{x}\cdot \mathbf{w}<1$, and 0 when $y\ \mathbf{x}\cdot \mathbf{w}>1$. Second derivative term becomes $x_i$. So in the end you get $$ \frac{\partial f(g(w))}{\partial w_i} = \begin{cases} -y\ x_i &\text{if } y\ \mathbf{x}\cdot \mathbf{w} < 1 \\ 0&\text{if } y\ \mathbf{x}\cdot \mathbf{w} > 1 \end{cases} $$

Since $i$ ranges over the components of $x$, you can view the above as a vector quantity, and write $\frac{\partial}{\partial w}$ as shorthand for $(\frac{\partial}{\partial w_1},\frac{\partial}{\partial w_2},\ldots)$
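
A quick way to sanity-check this piecewise gradient is to compare it with a central finite-difference estimate at a point where the hinge is active (this check is my own sketch; `hinge` and `hinge_grad` are illustrative names, not from the answer):

```python
import numpy as np

def hinge(w, x, y):
    # f(g(w)) = max(0, 1 - y * (x . w))
    return max(0.0, 1.0 - y * np.dot(x, w))

def hinge_grad(w, x, y):
    # piecewise gradient derived above: -y*x if y*(x.w) < 1, else 0
    return -y * x if y * np.dot(x, w) < 1 else np.zeros_like(w)

w = np.array([0.2, -0.1])
x = np.array([1.0, 2.0])
y = 1.0  # here y*(x.w) = 0.0 < 1, so the hinge is active

analytic = hinge_grad(w, x, y)
eps = 1e-6
numeric = np.array([
    (hinge(w + eps * e, x, y) - hinge(w - eps * e, x, y)) / (2 * eps)
    for e in np.eye(2)  # perturb each component w_i in turn
])
print(analytic, numeric)  # both approximately [-1., -2.], i.e. -y*x
```
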

2 of 3
19

This is 3 years late, but still may be relevant for someone...

Let $S$ denote a sample of points $x_i \in R^d$ and the set of corresponding labels $y_i \in \{-1,1\}$. We seek a hyperplane $w$ that minimizes the total hinge loss: \begin{equation} w^* = \underset{w}{\text{argmin }} L^{hinge}_S(w) = \underset{w}{\text{argmin }} \sum_i{l_{hinge}(w,x_i,y_i)}= \underset{w}{\text{argmin }} \sum_i{\max\{0,1-y_iw\cdot x_i\}} \end{equation} To find $w^*$, take the derivative of the total hinge loss $L^{hinge}_S(w)$. The gradient of each component is: $$ \frac{\partial{l_{hinge}}}{\partial w}= \begin{cases} 0 & y_iw\cdot x_i \geq 1 \\ -y_ix_i & y_iw\cdot x_i < 1 \end{cases} $$

The gradient of the sum is a sum of gradients. $$ \frac{\partial{L_S^{hinge}}}{\partial{w}}=\sum_i{\frac{\partial{l_{hinge}}}{\partial w}} $$ A Python example, which uses gradient descent to find the hinge-loss-optimal separating hyperplane, follows (it's probably not the most efficient code, but it works):

import numpy as np
import matplotlib.pyplot as plt

def hinge_loss(w,x,y):
    """ evaluates hinge loss and its gradient at w

    rows of x are data points
    y is a vector of labels
    """
    loss,grad = 0,0
    for (x_,y_) in zip(x,y):
        v = y_*np.dot(w,x_)
        loss += max(0,1-v)
        grad += 0 if v > 1 else -y_*x_
    return (loss,grad)

def grad_descent(x,y,w,step,thresh=0.001):
    """ gradient descent on the total hinge loss with a normalized,
    1/k decaying step; returns the average of all iterates
    """
    ws = np.zeros((2,0))
    ws = np.hstack((ws,w.reshape(2,1)))
    step_num = 1
    delta = np.inf
    loss0 = np.inf
    while np.abs(delta)>thresh:
        loss,grad = hinge_loss(w,x,y)
        delta = loss0-loss
        loss0 = loss
        grad_dir = grad/np.linalg.norm(grad)  # unit-length descent direction
        w = w-step*grad_dir/step_num          # step size decays as 1/step_num
        ws = np.hstack((ws,w.reshape((2,1))))
        step_num += 1
    return np.sum(ws,1)/np.size(ws,1)         # average of the iterates

def test1():
    # sample data points
    x1 = np.array((0,1,3,4,1))
    x2 = np.array((1,2,0,1,1))
    x  = np.vstack((x1,x2)).T
    # sample labels
    y = np.array((1,1,-1,-1,-1))
    w = grad_descent(x,y,np.array((0,0)),0.1)
    loss, grad = hinge_loss(w,x,y)
    plot_test(x,y,w)

def plot_test(x,y,w):
    plt.figure()
    x1, x2 = x[:,0], x[:,1]
    x1_min, x1_max = np.min(x1)*.7, np.max(x1)*1.3
    x2_min, x2_max = np.min(x2)*.7, np.max(x2)*1.3
    gridpoints = 2000
    x1s = np.linspace(x1_min, x1_max, gridpoints)
    x2s = np.linspace(x2_min, x2_max, gridpoints)
    gridx1, gridx2 = np.meshgrid(x1s,x2s)
    grid_pts = np.c_[gridx1.ravel(), gridx2.ravel()]
    predictions = np.array([np.sign(np.dot(w,x_)) for x_ in grid_pts]).reshape((gridpoints,gridpoints))
    plt.contourf(gridx1, gridx2, predictions, cmap=plt.cm.Paired)
    plt.scatter(x[:, 0], x[:, 1], c=y, cmap=plt.cm.Paired)
    plt.title('total hinge loss: %g' % hinge_loss(w,x,y)[0])
    plt.show()

if __name__ == '__main__':
    np.set_printoptions(precision=3)
    test1()
🌐
GitHub
github.com › tejasmhos › Linear-SVM-Using-Squared-Hinge-Loss
GitHub - tejasmhos/Linear-SVM-Using-Squared-Hinge-Loss: This is an implementation, from scratch, of the linear SVM using squared hinge loss · GitHub
This is an implementation of a Linear SVM that uses a squared hinge loss. This algorithm was coded using Python. This is my submission for the polished code release for DATA 558 - Statistical Machine Learning. The code was developed by Tejas Hosangadi. The Linear SVM that Uses Squared Hinge Loss writes out as shown below: The above equation is differentiable and convex, hence we can apply gradient descent.
Starred by 5 users
Forked by 3 users
Languages   Python
🌐
Analytics Vidhya
analyticsvidhya.com › home › what is hinge loss in machine learning?
What is Hinge loss in Machine Learning?
December 23, 2024 - Binary Classification: Hinge loss is highly effective for binary classification tasks and works well with linear classifiers. Sparse Gradients: When the prediction is correct with a margin (i.e., y⋅f(x)>1), the hinge loss gradient is zero.
🌐
arXiv
arxiv.org › pdf › 2103.00233 pdf
Learning with Smooth Hinge Losses Junru Luo ∗, Hong Qiao †, and Bo Zhang ‡
Due to the non-differentiability and non-convexity of · the 0/1 loss, it is an NP-hard problem to optimize (1) directly. To overcome · this difficulty, it is common to replace the 0/1 loss with a convex surrogate loss · function. And many efficient convex optimization methods can thus be applied · to obtain a good solution, such as the gradient-based method and the coordinate
Top answer
1 of 2
8

Let's use the example of the SVM loss function for a single datapoint:

$L_i = \sum_{j\neq y_i} \left[ \max(0, w_j^Tx_i - w_{y_i}^Tx_i + \Delta) \right]$

Where $\Delta$ is the desired margin.

We can differentiate the function with respect to the weights. For example, taking the gradient with respect to $w_{yi}$ we obtain:

$\nabla_{w_{y_i}} L_i = - \left( \sum_{j\neq y_i} \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) \right) x_i$

Where $\mathbb{1}$ is the indicator function that is one if the condition inside is true and zero otherwise. While the expression may look scary when it is written out, when you're implementing this in code you'd simply count the number of classes that didn't meet the desired margin (and hence contributed to the loss function); the data vector $x_i$ scaled by this count (and negated) is the gradient. Notice that this is the gradient only with respect to the row of $W$ that corresponds to the correct class. For the other rows, where $j \neq y_i$, the gradient is:

$\nabla_{w_j} L_i = \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) x_i$

Once you derive the expression for the gradient it is straightforward to implement the expressions and use them to perform the gradient update.

Taken from the Stanford CS231n optimization notes posted on GitHub.
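
The two expressions above translate almost line-for-line into NumPy. A minimal sketch for a single datapoint (my own, with $\Delta = 1$ and hypothetical names, not the CS231n reference implementation):

```python
import numpy as np

def svm_loss_grad_single(W, x, y, delta=1.0):
    """Multiclass SVM loss L_i and gradient dW for one datapoint.

    W: (D, C) weights, x: (D,) datapoint, y: correct class index.
    """
    scores = x.dot(W)                     # s_j = w_j^T x
    margins = scores - scores[y] + delta  # w_j^T x - w_{y_i}^T x + delta
    margins[y] = 0.0                      # j == y is excluded from the sum
    active = margins > 0                  # indicator 1(margin > 0) per class
    loss = margins[active].sum()
    dW = np.zeros_like(W)
    dW[:, active] += x[:, None]           # gradient for the rows with j != y
    dW[:, y] = -active.sum() * x          # correct-class row: -(count) * x_i
    return loss, dW

W = np.zeros((3, 4))
x = np.array([1.0, 2.0, 3.0])
loss, dW = svm_loss_grad_single(W, x, y=2)
print(loss)  # with zero weights every wrong-class margin equals delta, so loss = 3.0
```

With all-zero weights, all three wrong classes violate the margin, so the correct-class column gets $-3x_i$ and each wrong-class column gets $+x_i$, matching the two formulas.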

2 of 2
0

First of all, note that the multi-class hinge loss function is a function of $W_r$. \begin{equation} l(W_r) = \max( 0, 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i - W_{y_i} \cdot x_i) \end{equation} Next, the max function is non-differentiable at $0$, so we need to calculate its subgradient. \begin{equation} \frac{\partial l(W_r)}{\partial W_r} = \begin{cases} \{0\}, & W_{y_i}\cdot x_i > 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i \\ \{x_i\}, & W_{y_i}\cdot x_i < 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i\\ \{\alpha x_i\}, & \alpha \in [0,1], W_{y_i}\cdot x_i = 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i \end{cases} \end{equation} In the second case, $W_{y_i}$ is independent of $W_r$. This definition of the subgradient of the multi-class hinge loss is analogous to the subgradient of the binary-class hinge loss.