Answer from Yaroslav Bulatov on Stack Exchange
Top answer
1 of 3
43

To get the gradient we differentiate the loss with respect to the $i$th component of $w$.

Rewrite hinge loss in terms of $w$ as $f(g(w))$ where $f(z)=\max(0, 1-y\,z)$ and $g(w)=x\cdot w$.

Using the chain rule we get

$$\frac{\partial}{\partial w_i} f(g(w)) = \frac{\partial f}{\partial z}\,\frac{\partial g}{\partial w_i}$$

The first derivative term, evaluated at $z = g(w) = x\cdot w$, becomes $-y$ when $y\,x\cdot w < 1$, and $0$ when $y\,x\cdot w > 1$. The second derivative term becomes $x_i$. So in the end you get

$$\frac{\partial f(g(w))}{\partial w_i} = \begin{cases} -y\,x_i & \text{if } y\,x\cdot w < 1 \\ 0 & \text{if } y\,x\cdot w > 1 \end{cases}$$

Since $i$ ranges over the components of $x$, you can view the above as a vector quantity, and write $\frac{\partial}{\partial w}$ as shorthand for $\left(\frac{\partial}{\partial w_1}, \ldots, \frac{\partial}{\partial w_n}\right)$.
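A minimal numpy sketch of this piecewise gradient for a single sample (function and variable names are my own, not from the answer):

```python
import numpy as np

def hinge_grad(w, x, y):
    """Subgradient of max(0, 1 - y * (w . x)) with respect to w."""
    if y * np.dot(w, x) < 1:
        return -y * x            # margin violated: gradient is -y * x
    return np.zeros_like(w)      # margin satisfied: gradient is zero

w = np.array([1.0, -1.0])
x = np.array([0.5, 0.5])
print(hinge_grad(w, x, 1))    # y * w.x = 0 < 1, so this prints [-0.5 -0.5]
print(hinge_grad(w, x, -1))   # also violated, prints [0.5 0.5]
```

At the kink $y\,x\cdot w = 1$ any vector between these two cases is a valid subgradient; this sketch picks zero there.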

2 of 3
19

This is 3 years late, but still may be relevant for someone...

Let $S$ denote a sample of points $x_i$ and $\{y_i\}$ the set of corresponding labels $y_i \in \{\pm 1\}$. We search for a hyperplane $w$ that would minimize the total hinge loss: \begin{equation} w^* = \underset{w}{\text{argmin }} L^{hinge}_S(w) = \underset{w}{\text{argmin }} \sum_i{l_{hinge}(w,x_i,y_i)}= \underset{w}{\text{argmin }} \sum_i{\max{\{0,1-y_iw\cdot x_i}\}} \end{equation} To find $w^*$, take the derivative of the total hinge loss. The gradient of each summand is: $$ \frac{\partial{l_{hinge}}}{\partial w}= \begin{cases} 0 & y_iw\cdot x_i \geq 1 \\ -y_ix_i & y_iw\cdot x_i < 1 \end{cases} $$

The gradient of the sum is a sum of gradients. A Python example, which uses gradient descent to find the hinge-loss-optimal separating hyperplane, follows (it's probably not the most efficient code, but it works):

import numpy as np
import matplotlib.pyplot as plt

def hinge_loss(w,x,y):
    """ evaluates hinge loss and its gradient at w

    rows of x are data points
    y is a vector of labels
    """
    loss,grad = 0,0
    for (x_,y_) in zip(x,y):
        v = y_*np.dot(w,x_)
        loss += max(0,1-v)
        grad += 0 if v > 1 else -y_*x_
    return (loss,grad)

def grad_descent(x,y,w,step,thresh=0.001):
    ws = np.zeros((2,0))
    ws = np.hstack((ws,w.reshape(2,1)))
    step_num = 1
    delta = np.inf
    loss0 = np.inf
    while np.abs(delta)>thresh:
        loss,grad = hinge_loss(w,x,y)
        delta = loss0-loss
        loss0 = loss
        grad_norm = np.linalg.norm(grad)
        if grad_norm == 0:
            # all margins satisfied: zero is a valid subgradient, stop
            break
        # normalized subgradient step with a 1/step_num decaying step size
        w = w-step*(grad/grad_norm)/step_num
        ws = np.hstack((ws,w.reshape((2,1))))
        step_num += 1
    # return the average of the iterates, as usual for subgradient methods
    return np.sum(ws,1)/np.size(ws,1)

def test1():
    # sample data points
    x1 = np.array((0,1,3,4,1))
    x2 = np.array((1,2,0,1,1))
    x  = np.vstack((x1,x2)).T
    # sample labels
    y = np.array((1,1,-1,-1,-1))
    w = grad_descent(x,y,np.array((0,0)),0.1)
    loss, grad = hinge_loss(w,x,y)
    plot_test(x,y,w)

def plot_test(x,y,w):
    plt.figure()
    x1, x2 = x[:,0], x[:,1]
    x1_min, x1_max = np.min(x1)*.7, np.max(x1)*1.3
    x2_min, x2_max = np.min(x2)*.7, np.max(x2)*1.3
    gridpoints = 2000
    x1s = np.linspace(x1_min, x1_max, gridpoints)
    x2s = np.linspace(x2_min, x2_max, gridpoints)
    gridx1, gridx2 = np.meshgrid(x1s,x2s)
    grid_pts = np.c_[gridx1.ravel(), gridx2.ravel()]
    predictions = np.array([np.sign(np.dot(w,x_)) for x_ in grid_pts]).reshape((gridpoints,gridpoints))
    plt.contourf(gridx1, gridx2, predictions, cmap=plt.cm.Paired)
    plt.scatter(x[:, 0], x[:, 1], c=y, cmap=plt.cm.Paired)
    plt.title('total hinge loss: %g' % hinge_loss(w,x,y)[0])
    plt.show()

if __name__ == '__main__':
    np.set_printoptions(precision=3)
    test1()
Wikipedia
en.wikipedia.org › wiki › Hinge_loss
Hinge loss - Wikipedia
January 26, 2026 - In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs). For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined ...
Discussions

python - How to vectorize hinge loss gradient computation - Stack Overflow
I'm computing thousands of gradients and would like to vectorize the computations in Python. The context is SVM and the loss function is Hinge Loss. Y is Mx1, X is MxN and w is Nx1. L(w) = lam/2 ... More on stackoverflow.com
🌐 stackoverflow.com
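One way to vectorize the computation asked about above; a sketch assuming the objective is $L(w) = \frac{\lambda}{2}\lVert w\rVert^2 + \sum_i \max(0, 1 - y_i\, x_i\cdot w)$ (the snippet is truncated, so the exact form is a guess), with function names of my own:

```python
import numpy as np

def hinge_loss_grad(w, X, Y, lam=1.0):
    """Vectorized regularized hinge loss and a subgradient.

    X is (M, N), Y is (M,) with entries +/-1, w is (N,).
    """
    margins = 1 - Y * (X @ w)          # (M,) margin deficits
    active = margins > 0               # samples violating the margin
    loss = 0.5 * lam * (w @ w) + margins[active].sum()
    # each active sample contributes -y_i * x_i; one matvec sums them all
    grad = lam * w - X.T @ (Y * active)
    return loss, grad
```

This replaces the per-sample Python loop with two matrix-vector products, which is the usual route when thousands of gradients are needed.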
correct implementation of Hinge loss minimization for gradient descent

I'd start with writing some really simple test cases especially for your helper functions. Functions don't look crazy but bugs are subtle usually.

More on reddit.com
🌐 r/MachineLearning
3
0
March 11, 2015
java - correct implementation of Hinge loss minimization for gradient descent - Stack Overflow
I copied the hinge loss function from here (also LossC and LossFunc upon which it's based. Then I included it in my gradient descent algorithm like so: do { iteration++; error = 0.0;... More on stackoverflow.com
🌐 stackoverflow.com
functions - How do you minimize "hinge-loss"? - Mathematics Stack Exchange
And for the special problem of minimizing regularized the hinge loss, there is a special name to it known as support vector machine and most modern programming language has library to solve it. ... $\begingroup$ nice example. could you add the missing right brackets of the formula $f(b, w_1, w_2)$? $\endgroup$ ... Find the answer to your question by asking. Ask question ... See similar questions with these tags. 2 Gradient ... More on math.stackexchange.com
🌐 math.stackexchange.com
Top answer
1 of 2
1

Hinge loss is difficult to work with when the derivative is needed because the derivative is a piecewise function: max has one non-differentiable point, and so the hinge loss does too. This was a very prominent issue with non-separable cases of SVM (and a good reason to use ridge regression).

Here's a slide (Original source from Zhuowen Tu, apologies for the title typo):

Here hinge loss is defined as max(0, 1-v), where v is the decision value of the SVM classifier. More can be found on the Hinge Loss Wikipedia page.

As for your equation: you can easily pick out the v of the equation, however without more context of those functions it's hard to say how to derive. Unfortunately I don't have access to the paper and cannot guide you any further...

2 of 2
1

I disagree with the earlier answer that this is difficult to calculate. If we have the function \begin{align*} \sum_{t\in\mathcal{T}} \max \{0, 1 - d(t) \, y(t, \theta)\} \end{align*} the gradient with respect to $\theta$ is \begin{align*} & \sum_{t\in\mathcal{T}}g(t) \\ & g(t) := \begin{cases} 0 & \text{ if }1 - d(t) y(t, \theta) < 0 \\ -d(t)\dfrac{\partial y}{\partial \theta} & \text{ otherwise} \\ \end{cases} \end{align*} Theoretically this is OK; it just means that the gradient is not continuous. However, the objective is still continuous, assuming that $d(t)$ and $y(t,\theta)$ are both continuous.

In practice, it's not a problem either. Any automatic differentiation software (tensorflow, pytorch, jax) will handle something like this automatically and correctly.
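The piecewise formula is also easy to verify numerically away from the kink. A small sketch, assuming a linear model $y(t,\theta) = \theta\cdot x_t$ (the example values are my own):

```python
import numpy as np

def loss(theta, d, x):
    # hinge term for one sample under the linear model y(t, theta) = theta . x
    return max(0.0, 1 - d * np.dot(theta, x))

def grad(theta, d, x):
    # piecewise gradient from the answer, with dy/dtheta = x
    return -d * x if 1 - d * np.dot(theta, x) > 0 else np.zeros_like(theta)

theta = np.array([0.2, -0.1])
d, x = 1.0, np.array([1.0, 2.0])
eps = 1e-6
# central finite differences, one coordinate at a time
numerical = np.array([(loss(theta + eps * e, d, x) - loss(theta - eps * e, d, x)) / (2 * eps)
                      for e in np.eye(2)])
print(np.allclose(numerical, grad(theta, d, x), atol=1e-5))  # True
```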

Twice22
twice22.github.io › hingeloss
Hinge Loss Gradient Computation
assign $x_i$ to each column of this matrix if ($j \neq y_i$ and $(x_iw_j - x_iw_{y_i} + \Delta > 0)$); assign $-\sum\limits_{j \neq y_{i}}1(x_iw_j - x_iw_{y_i} + \Delta > 0)x_i$ to the $y_i$ column

dW = np.zeros(W.shape) # initialize the gradient as zero
# compute the loss and the gradient
num_classes = W.shape[1]
num_train = X.shape[0]
loss = 0.0
for i in xrange(num_train):
    scores = X[i].dot(W)
    correct_class_score = scores[y[i]]
    nb_sup_zero = 0
    for j in xrange(num_classes):
        if j == y[i]:
            continue
        margin = scores[j] - correct_class_score + 1 # note delta = 1
        if margin > 0:
            nb_sup_zero += 1
            loss += margin
            dW[:, j] += X[i]
    dW[:, y[i]] -= nb_sup_zero*X[i]
GeeksforGeeks
geeksforgeeks.org › machine learning › gradient-descent-algorithm-and-its-variants
Gradient Descent Algorithm in Machine Learning - GeeksforGeeks
... Creates a dataset: Generates ... the SVM using Gradient Descent: Calculates hinge loss, updates the model parameters step by step and improves the decision boundary during training....
Published   1 week ago
Find elsewhere
Medium
medium.com › @jainilgosalia › hinge-loss-understanding-and-implementing-it-from-scratch-a273d786f8e6
Hinge Loss: Understanding and Implementing it from Scratch | by Jainil Gosalia | Medium
May 15, 2025 - Here, C is a constant that controls the trade-off between having a wide margin and allowing some errors. Here’s a simple way to calculate hinge loss and its gradient, to build intuition.
Reddit
reddit.com › r/machinelearning › correct implementation of hinge loss minimization for gradient descent
r/MachineLearning on Reddit: correct implementation of Hinge loss minimization for gradient descent
March 11, 2015 -

I copied the hinge loss function from here (also LossC and LossFunc upon which it's based. Then I included it in my gradient descent algorithm like so:

  do 
  {
    iteration++;
    error = 0.0;
    cost = 0.0;
    
    //loop through all instances (complete one epoch)
    for (p = 0; p < number_of_files__train; p++) 
    {
    	
      // 1. Calculate the hypothesis h = X * theta
      hypothesis = calculateHypothesis( theta, feature_matrix__train, p, globo_dict_size );

      // 2. Calculate the loss = h - y and maybe the squared cost (loss^2)/2m
      //cost = hypothesis - outputs__train[p];
      cost = HingeLoss.loss(hypothesis, outputs__train[p]);
      System.out.println( "cost " + cost );
      
      // 3. Calculate the gradient = X' * loss / m
      gradient = calculateGradent( theta, feature_matrix__train, p, globo_dict_size, cost, number_of_files__train);
      
      // 4. Update the parameters theta = theta - alpha * gradient
      for (int i = 0; i < globo_dict_size; i++) 
      {
    	  theta[i] = theta[i] - LEARNING_RATE * gradient[i];
      }

    }
    
	//summation of squared error (error value for all instances)
    error += (cost*cost);	    
  
  /* Root Mean Squared Error */
  //System.out.println("Iteration " + iteration + " : RMSE = " + Math.sqrt( error/number_of_files__train ) );
  System.out.println("Iteration " + iteration + " : RMSE = " + Math.sqrt( error/number_of_files__train ) );
  
  } 
  while( error != 0 );

But this doesnt work at all. Is that due to the loss function? Maybe how I added the loss function to my code?

I guess it's also possible that my implementation of gradient descent is faulty.

Here are my methods for calculating the gradient and the hypothesis, are these right?

static double calculateHypothesis( double[] theta, double[][] feature_matrix, int file_index, int globo_dict_size )
{
	double hypothesis = 0.0;

	 for (int i = 0; i < globo_dict_size; i++) 
	 {
		 hypothesis += ( theta[i] * feature_matrix[file_index][i] );
	 }
	 //bias
	 hypothesis += theta[ globo_dict_size ];

	 return hypothesis;
}

static double[] calculateGradent( double theta[], double[][] feature_matrix, int file_index, int globo_dict_size, double cost, int number_of_files__train)
{
	double m = number_of_files__train;

	double[] gradient = new double[ globo_dict_size];//one for bias?
	
	for (int i = 0; i < gradient.length; i++) 
	{
		gradient[i] = (1.0/m) * cost * feature_matrix[ file_index ][ i ] ;
	}
	
	return gradient;
}

The rest of the code is here if you're interested to take a look.

Analytics Vidhya
analyticsvidhya.com › home › what is hinge loss in machine learning?
What is Hinge loss in Machine Learning?
December 23, 2024 - Binary Classification: Hinge loss is highly effective for binary classification tasks and works well with linear classifiers. Sparse Gradients: When the prediction is correct with a margin (i.e., y⋅f(x)>1), the hinge loss gradient is zero.
IEEE Xplore
ieeexplore.ieee.org › document › 6580579
Epoch gradient descent for smoothed hinge-loss linear SVMs | IEEE Conference Publication | IEEE Xplore
A gradient descent method for strongly convex problems with Lipschitz continuous gradients requires only O(logq ε) iterations to obtain an ε-accurate solution (q is a constant in (0; 1)). Support Vector Machines (SVMs) penalized with the popular ...
Vivian Website
csie.ntu.edu.tw › ~cjlin › papers › l2mcsvm › l2mcsvm.pdf pdf
A Study on L2-Loss (Squared Hinge-Loss) Multi-Class SVM
From Tables 5 and 6, L2 loss is worse than L1 loss on the average training time · and sparsity. The higher percentage of support vectors is the same as the situation · in binary classification because the squared hinge loss leads to many small but non-
Medium
koshurai.medium.com › understanding-hinge-loss-in-machine-learning-a-comprehensive-guide-0a1c82478de4
Understanding Hinge Loss in Machine Learning: A Comprehensive Guide | by KoshurAI | Medium
January 12, 2024 - One common task in machine learning is classification, where the goal is to assign a label to a given input. To optimize the performance of these models, it is essential to choose an appropriate loss function. Hinge loss is one such function that is commonly used in classification problems, especially in the context of support vector machines (SVM).
Top answer
1 of 3
20

To answer your questions directly:

  • A loss function is a scoring function used to evaluate how well a given boundary separates the training data. Each loss function represents a different set of priorities about what the scoring criteria are. In particular, the hinge loss function doesn't care about correctly classified points as long as they're correct, but imposes a penalty for incorrectly classified points which is directly proportional to how far away they are on the wrong side of the boundary in question.
  • A boundary's loss score is computed by seeing how well it classifies each training point, computing each training point's loss value (which is a function of its distance from the boundary), then adding up the results.

    By plotting how a single training point's loss score would vary based on how well it is classified, you can get a feel for the loss function's priorities. That's what your graphs are showing— the size of the penalty that would hypothetically be assigned to a single point based on how confidently it is classified or misclassified. They're pictures of the scoring rubric, not calculations of an actual score. [See diagram below!]

  • At least conceptually, you minimize the loss for a dataset by considering all possible linear boundaries, computing their loss scores, and picking the boundary whose loss score is smallest. Remember that the plots just show how an individual point would be scored in each case based on how accurately it is classified.

  • Interpret loss plots as follows: The horizontal axis corresponds to $\hat{y}$, which is how accurately a point is classified. Positive values correspond to increasingly confident correct classifications, while negative values correspond to increasingly confident incorrect classifications. (Or, geometrically, $\hat{y}$ is the distance of the point from the boundary, and we prefer boundaries that separate points as widely as possible.) The vertical axis is the magnitude of the penalty. (They're simplified in the sense that they're showing the scoring rubric for a single point; they're not showing you the computed total loss for various boundaries as a function of which boundary you pick.)

Details follow.


  1. The linear SVM problem is the problem of finding a line (or plane, etc.) in space that separates points of one class from points of the other class by the widest possible margin. You want to find, out of all possible planes, the one that separates the points best.

  2. If it helps to think geometrically, a plane can be completely defined by two parameters: a vector $\vec{w}$ perpendicular to the plane (which tells you the plane's orientation) and an offset $b$ (which tells you how far it is from the origin). Each choice of $\langle \vec{w}, b\rangle$ is therefore a choice of plane. Another helpful geometric fact for intuition: if $\langle \vec{w}, b\rangle$ is some plane and $\vec{x}$ is a point in space, then $\vec{w}\cdot \vec{x} + b$ is the distance between the plane and the point (!).

    [Nitpick: If $\vec{w}$ is not a unit vector, then this formula actually gives a multiple of a distance, but the constants don't matter here.]

  3. That planar-distance formula $\hat{y}(\vec{x}) \equiv \vec{w}\cdot \vec{x} + b$ is useful because it defines a measurement scheme throughout all space: points lying on the plane have a value of 0; points far away on one side of the boundary have increasingly positive value, and points far away on the other side of the boundary have increasingly negative value.

  4. We have two classes of points. By convention, we'll call one of the classes positive and the other negative. An effective decision boundary will be one that assigns very positive planar-distance values to positive points and very negative planar-distance values to negative points. In formal terms, if $y_i=\pm 1$ denotes the class of the $i$th training point and $\vec{x}_i$ denotes its position, then what we want is for $y_i$ and $\vec{w}\cdot \vec{x}_i+b$ to have the same sign and for $\vec{w}\cdot \vec{x}_i+b$ to be large in magnitude.

  5. A loss function is a way of scoring how well the boundary assigns planar-distance values that match each point's actual class. A loss function is always a function $f(y, \hat{y})$ of two arguments— for the first argument we plug in the true class $y=\pm 1$ of the point in question, and for the second $\hat{y}$ we plug in the planar distance value our plane assigns to it. The total loss for the planar boundary is the sum of the losses for each of the training points.

    Based on our choice of loss function, we might express a preference that points be classified correctly but that we don't care about the magnitude of the planar-distance value if it's beyond, say, 1000; or we might choose a loss function which allows some points to be misclassified as long as the rest are very solidly classified, etc.

    Your graphs show how different loss functions behave on a single point whose class $y=+1$ is fixed and whose planar distance $\hat{y}$ varies ($\hat{y}$ runs along the horizontal axis). This can give you an idea of what the loss function is prioritizing. (Under this scheme, by the way, positive values of $\hat{y}$ correspond to increasingly confident correct classification, and negative values of $\hat{y}$ correspond to increasingly confident incorrect classification.)

    As a concrete example, the hinge loss function is a mathematical formulation of the following preference:

    Hinge loss preference: When evaluating planar boundaries that separate positive points from negative points, it is irrelevant how far away from the boundary the correctly classified points are. However, misclassified points incur a penalty that is directly proportional to how far they are on the wrong side of the boundary.

    Formally, this preference falls out of the fact that a correctly classified point incurs zero loss once its planar distance $\hat{y}$ is greater than 1. On the other hand, it incurs a linear penalty directly proportional to the planar distance $\hat{y}$ as the classification becomes more badly incorrect.

  6. Computing the loss value means computing the value of the loss for a particular set of training points and a particular boundary. Minimizing the loss means finding, for a particular set of training data, the boundary for which the loss value is minimal.

    For a dataset as in the 2D picture provided, first draw any linear boundary; call one side the positive (or red square) side, and the other the negative (or blue circle) side. You can compute the loss of the boundary by first measuring the planar distance values of each point; here, the signed distance between each training point and the boundary. Points on the positive side have positive $\hat{y}$ values and points on the negative side have negative values. Next, each point contributes to the total loss: $L = \sum_i \ell(\hat{y}, y)$. Compute the loss for each of the points now that you've computed $\hat{y}$ for each point and you know $y=\pm 1$ whether the point is a red square or blue circle. Add them all up to compute the overall loss.

    The best boundary is the one that has the lowest loss on this dataset out of all possible linear boundaries you could draw. (Time permitting, I'll add illustrations for all of this.)

  7. If the training data can be separated by a linear boundary, then any boundary which does so will have a hinge loss of zero— the lowest achievable value. Based on our preferences as expressed by hinge loss, all such boundaries tie for first place.

    Only if the training data is not linearly separable will the best boundary have a nonzero (positive, worse) hinge loss. In that case, the hinge loss preference will favor a boundary whose misclassified points are not too far on the wrong side.

Addendum: As you can see from the shape of the curves, the loss functions in your picture express the following scoring rubrics:

  • Zero-one loss $[y\hat{y} < 0]$ : Being misclassified is uniformly bad— points on the wrong side of the boundary get the same size penalty regardless of how far on the wrong side they are. Similarly, all points on the correct side of the boundary get no penalty and no special bonus, even if they're very far on the correct side.
  • Exponential loss $[\exp{(-y\hat{y})}]$ : The more correct you are, the better. But once you're on the correct side of the boundary, it gets less and less important that you be far away from the boundary. On the other hand, the further you are on the wrong side of the boundary, the more exponentially urgent a problem it is.
  • $\log_2(\cdots)$ : Same, qualitatively, as previous function.
  • Hinge loss $\max(0, 1-y\hat{y})$ : If you're correctly classified beyond the margin ($\hat{y}>1$) then it's irrelevant just how far on the correct side you are. On the other hand, if you're within the margin or on the incorrect side, you get a penalty directly proportional to how far you are on the wrong side.
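The rubrics above can be compared side by side; a small sketch for a single point with true class $y=+1$ (my own code; note I use the standard $e^{-y\hat{y}}$ form of the exponential loss):

```python
import numpy as np

yhat = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # planar distances for a y = +1 point
y = 1.0

zero_one = (y * yhat < 0).astype(float)        # flat penalty for any misclassification
exponential = np.exp(-y * yhat)                # exponentially urgent on the wrong side
hinge = np.maximum(0.0, 1 - y * yhat)          # linear penalty inside/beyond the margin

for name, vals in [("zero-one", zero_one), ("exponential", exponential), ("hinge", hinge)]:
    print(f"{name:12s}", np.round(vals, 3))
```

Note how hinge loss still penalizes the correctly classified point at $\hat{y}=0.5$ (it is inside the margin), while zero-one loss does not.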
2 of 3
5

The hinge function is convex and the problem of its minimization can be cast as a quadratic program:

$\min \frac{1}{m}\sum t_i + ||w||^2 $

$\quad t_i \geq 1 - y_i(wx_i + b) , \forall i=1,\ldots,m$

$\quad t_i \geq 0 , \forall i=1,\ldots,m$

or in conic form

$\min \frac{1}{m}\sum t_i + z $

$\quad t_i \geq 1 - y_i(wx_i + b) , \forall i=1,\ldots,m$

$\quad t_i \geq 0 , \forall i=1,\ldots,m$

$ (2,z,w) \in Q_r^m$

where $Q_r$ is the rotated cone.

You can solve this problem either with a QP/SOCP solver, such as MOSEK, or with specialized algorithms that you can find in the literature. Note that the minimum is not necessarily zero, because of the interplay between the norm of $w$ and the blue (hinge) term of the objective function.
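As an illustration of solving the QP above without a dedicated solver, here is a sketch using scipy's general-purpose SLSQP method in place of MOSEK (the toy data and all names are my own; a real QP/SOCP solver would be faster and more robust):

```python
import numpy as np
from scipy.optimize import minimize

# toy separable data
X = np.array([[0.0, 1.0], [1.0, 2.0], [3.0, 0.0], [4.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, n = X.shape

def unpack(z):
    return z[:n], z[n], z[n + 1:]      # w, b, t

def objective(z):
    w, b, t = unpack(z)
    return t.mean() + w @ w            # (1/m) sum t_i + ||w||^2

cons = [
    # t_i >= 1 - y_i (w . x_i + b)   (SLSQP expects fun(z) >= 0)
    {"type": "ineq",
     "fun": lambda z: unpack(z)[2] - (1 - y * (X @ unpack(z)[0] + unpack(z)[1]))},
    # t_i >= 0
    {"type": "ineq", "fun": lambda z: unpack(z)[2]},
]

res = minimize(objective, np.zeros(n + 1 + m), method="SLSQP", constraints=cons)
w, b, t = unpack(res.x)
print(np.sign(X @ w + b))              # should match the labels y
```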

In the second picture, every feasible solution is a line. An optimal one will balance the separation of the two classes, given by the blue term, against the norm.

As for references, searching on Google Scholar works quite well, and even following the references from Wikipedia is a reasonable first step.

You can draw some inspiration from one of my recent blog posts:

http://blog.mosek.com/2014/03/swissquant-math-challenge-on-lasso.html

It is a similar problem. Also, for more general concepts on conic optimization and regression, you can check the classical book by Boyd.

Gitbook
sisyphus.gitbook.io › project › deep-learning-basics › basics › hinge-loss
Hinge Loss | The Truth of Sisyphus
At the initial time, every class scores should be similar and the expected loss is C-1 ( each wrong class has loss 1, and there are C - 1 wrong classes) Squared hinge loss is different from hinge loss, squared loss focusing more on the bad cases, ...
Gitbook
ztlevi.gitbook.io › ml-101 › loss › hinge_loss
Hinge Loss | ML_101
y*f(x)y∗f(x) increases with every misclassified point (very wrong points in Fig 5), the upper bound of hinge loss
Towards Data Science
towardsdatascience.com › home › latest › a definitive explanation to hinge loss for support vector machines.
A definitive explanation to Hinge Loss for Support Vector Machines. | Towards Data Science
January 23, 2025 - This essentially means that we are on the wrong side of the boundary, and that the instance will be classified incorrectly. On the flip size, a positive distance from the boundary incurs a low hinge loss, or no hinge loss at all, and the further we are away from the boundary(and on the right side of it), the lower our hinge loss will be.
Tum
dvl.in.tum.de › slides › i2dl-ws18 › 7.TrainingNN.pdf pdf
Lecture 6 recap
Chair for Computer Vision and Artificial Intelligence at the Technical University of Munich
Lightning AI
lightning.ai › docs › torchmetrics › stable › classification › hinge_loss.html
Hinge Loss — PyTorch-Metrics 1.8.2 documentation
Compute the mean Hinge loss typically used for Support Vector Machines (SVMs). This function is a simple wrapper to get the task specific versions of this metric, which is done by setting the task argument to either 'binary' or 'multiclass'. See the documentation of BinaryHingeLoss and ...
Taylor & Francis
taylorandfrancis.com › knowledge › Engineering_and_technology › Engineering_support_and_special_topics › Hinge_loss
Hinge loss - Knowledge and References | Taylor & Francis
Hinge loss is typically non-differentiable and can be expressed as loss = maximum (1 – (ytrue × ypred ),0), where ytrue values are expected to be -1 or 1.From: Handbook of Big Data [2019], Effective Processing of Convolutional Neural Networks ...