In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs). For an intended … Wikipedia
Wikipedia
en.wikipedia.org › wiki › Hinge_loss
Hinge loss - Wikipedia
January 26, 2026 - The Hinge loss is not a proper scoring rule. While binary SVMs are commonly extended to multiclass classification in a one-vs.-all or one-vs.-one fashion, it is also possible to extend the hinge loss itself for such an end. Several different variations of multiclass hinge loss have been proposed.
scikit-learn
scikit-learn.org › stable › modules › generated › sklearn.metrics.hinge_loss.html
hinge_loss — scikit-learn 1.8.0 documentation
In binary class case, assuming ... In multiclass case, the function expects that either all the labels are included in y_true or an optional labels argument is provided which contains all the labels....
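The binary-case formula this scikit-learn function computes can be sketched in plain Python (an illustrative sketch, not the library implementation; the real `sklearn.metrics.hinge_loss` additionally handles label encoding and a multiclass mode):

```python
# Binary hinge loss: the mean over samples of max(0, 1 - t * y), where
# t in {-1, +1} is the true label and y is the raw decision-function
# score (not the predicted class label).
def binary_hinge_loss(y_true, pred_decision):
    losses = [max(0.0, 1.0 - t * y) for t, y in zip(y_true, pred_decision)]
    return sum(losses) / len(losses)

# A confidently correct, a weakly correct, and a wrong prediction:
# losses are 0, 0.5, and 1.3, so the mean is 0.6.
print(binary_hinge_loss([-1, 1, 1], [-2.2, 0.5, -0.3]))
```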
GitHub
github.com › christianversloot › machine-learning-articles › blob › main › how-to-use-categorical-multiclass-hinge-with-keras.md
machine-learning-articles/how-to-use-categorical-multiclass-hinge-with-keras.md at main · christianversloot/machine-learning-articles
Hinge loss and squared hinge loss can be used for binary classification problems. Unfortunately, many of today's problems aren't binary, but rather, multiclass: the number of possible target classes is [latex]> 2[/latex].
Author   christianversloot
Top answer
1 of 2
8

Let's use the example of the SVM loss function for a single datapoint:

$L_i = \sum_{j\neq y_i} \left[ \max(0, w_j^Tx_i - w_{y_i}^Tx_i + \Delta) \right]$

where $\Delta$ is the desired margin.

We can differentiate the function with respect to the weights. For example, taking the gradient with respect to $w_{y_i}$ we obtain:

$\nabla_{w_{y_i}} L_i = - \left( \sum_{j\neq y_i} \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) \right) x_i$

where $\mathbb{1}(\cdot)$ is the indicator function that is one if the condition inside is true and zero otherwise. While the expression may look scary written out, when you implement this in code you simply count the number of classes that didn't meet the desired margin (and hence contributed to the loss), and the data vector $x_i$ scaled by this count is the gradient. Notice that this is the gradient only with respect to the row of $W$ that corresponds to the correct class. For the other rows, where $j \neq y_i$, the gradient is:

$\nabla_{w_j} L_i = \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) x_i$

Once you derive the expression for the gradient, it is straightforward to implement the expressions and use them to perform the gradient update.

Taken from the Stanford CS231n optimization notes posted on GitHub.
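The gradient expressions above can be sketched in plain Python for a single datapoint (the names `scores`, `x`, `y_i`, and `delta` are illustrative; `scores[j]` stands for $w_j^T x_i$):

```python
def svm_loss_grads(scores, x, y_i, delta=1.0):
    """Return (loss, grads), where grads[j] is dL_i/dw_j as a vector."""
    num_classes = len(scores)
    grads = [[0.0] * len(x) for _ in range(num_classes)]
    loss = 0.0
    margin_violations = 0
    for j in range(num_classes):
        if j == y_i:
            continue
        margin = scores[j] - scores[y_i] + delta
        if margin > 0:                       # indicator condition holds
            loss += margin
            margin_violations += 1
            grads[j] = [xi for xi in x]      # dL_i/dw_j = x_i
    # dL_i/dw_{y_i} = -(number of violating classes) * x_i
    grads[y_i] = [-margin_violations * xi for xi in x]
    return loss, grads
```

For example, with `scores = [3.2, 5.1, -1.7]`, `x = [1.0, 2.0]`, `y_i = 0`, only class 1 violates the margin (5.1 - 3.2 + 1 = 2.9 > 0), so the loss is 2.9, `grads[1]` is `x`, `grads[2]` is zero, and `grads[0]` is `-x`.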

2 of 2
0

First of all, note that the multi-class hinge loss is a function of $W_r$:

\begin{equation} l(W_r) = \max\left( 0,\; 1 + \underset{r \neq y_i}{\max}\, W_r \cdot x_i - W_{y_i} \cdot x_i \right) \end{equation}

Next, the max function is non-differentiable at $0$, so we need to calculate its subgradient:

\begin{equation} \frac{\partial l(W_r)}{\partial W_r} = \begin{cases} \{0\}, & W_{y_i}\cdot x_i > 1 + \underset{r \neq y_i}{\max}\, W_r \cdot x_i \\ \{x_i\}, & W_{y_i}\cdot x_i < 1 + \underset{r \neq y_i}{\max}\, W_r \cdot x_i \\ \{\alpha x_i\},\ \alpha \in [0,1], & W_{y_i}\cdot x_i = 1 + \underset{r \neq y_i}{\max}\, W_r \cdot x_i \end{cases} \end{equation}

In the first case the loss is zero, so it is independent of $W_r$. This definition of the subgradient of the multi-class hinge loss is analogous to that of the binary-class hinge loss.
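The case analysis above can be sketched in plain Python (illustrative names; this takes the common subgradient choice of $x_i$ for the maximizing row and, additionally, $-x_i$ for the true-class row whenever the loss is positive):

```python
def cs_hinge_subgrad(scores, x, y_i):
    """Return (loss, g), where g[r] is a subgradient of l w.r.t. row W_r."""
    # r_star = argmax over r != y_i of W_r . x_i
    others = [r for r in range(len(scores)) if r != y_i]
    r_star = max(others, key=lambda r: scores[r])
    loss = max(0.0, 1.0 + scores[r_star] - scores[y_i])
    g = [[0.0] * len(x) for _ in scores]
    if loss > 0:                        # the loss depends on W_r
        g[r_star] = [xi for xi in x]    # subgradient x_i for the maximizer
        g[y_i] = [-xi for xi in x]      # and -x_i for the true class
    return loss, g
```

For instance, with `scores = [0.2, 1.5, 0.3]` and `y_i = 0`, the maximizing competitor is class 1, the loss is 1 + 1.5 - 0.2 = 2.3, and the only nonzero subgradient rows are those of classes 1 and 0.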

TheCVF
openaccess.thecvf.com › content › WACV2021 › papers › Kavalerov_A_Multi-Class_Hinge_Loss_for_Conditional_GANs_WACV_2021_paper.pdf
A Multi-Class Hinge Loss for Conditional GANs Ilya Kavalerov
matching IPM loss (McGAN) [24] following the empirical successes of the Maximum Mean Discrepancy objective [17] ... McGAN. When combined with spectral normalization of weights in D [22], the hinge loss greatly improves performance ...
PyImageSearch
pyimagesearch.com › home › blog › multi-class svm loss
Multi-class SVM Loss - PyImageSearch
April 17, 2021 - We’ll return to regularization in a future post once we better understand loss functions. ... I’m glad you asked. Essentially, the hinge loss function is summing across all incorrect classes (
Quora
quora.com › What-is-an-intuitive-explanation-of-the-multiclass-hinge-loss
What is an intuitive explanation of the multiclass hinge loss? - Quora
Answer (1 of 2): A2A. To me, sometimes stepping back from the problem helps. That is, take some time to look at motivation. And, actually, history can help there. Reducing the context is a good start. In this case, we have the hinge loss function. Then, it gets extended beyond simple. Simple, m...
PyTorch
docs.pytorch.org › reference api › torch.nn › multimarginloss
MultiMarginLoss — PyTorch 2.10 documentation
January 1, 2023 - Creates a criterion that optimizes a multi-class classification hinge loss (margin-based loss) between input
Readthedocs
torchmetrics.readthedocs.io › en › stable › classification › hinge_loss.html
Hinge Loss — PyTorch-Metrics 1.8.2 documentation
class torchmetrics.classification.MulticlassHingeLoss(num_classes, squared=False, multiclass_mode='crammer-singer', ignore_index=None, validate_args=True, **kwargs)[source]: Compute the mean Hinge loss typically used for Support Vector Machines (SVMs) for multiclass tasks.
Stanford Artificial Intelligence Laboratory
ai.stanford.edu › ~tianshig › papers › multiclassHingeBoost-ICML2011.pdf
Multiclass Boosting with Hinge Loss based on Output Coding Tianshi Gao
HingeBoost.OC. Although both methods use the same loss, the regularization and the optimization procedure are different. Empirically the stage-wise optimization and regularization seems to give better performance, which might be due to lower variance than that of the ... Acknowledgment. This work was supported by the NSF under grant No. ... Allwein, E. L., Schapire, R. E., and Singer, Y. Reducing multiclass ...
YouTube
youtube.com › watch
4. Hinge Loss/Multi-class SVM Loss - YouTube
Hinge Loss/Multi-class SVM Loss is used for maximum-margin classification, especially for support vector machines or SVM. Hinge loss at value one is a safe m...
Published   July 2, 2022
Machinecurve
machinecurve.com › index.php › 2019 › 10 › 17 › how-to-use-categorical-multiclass-hinge-with-keras
How to use categorical / multiclass hinge with TensorFlow 2 and Keras? | MachineCurve.com
October 17, 2019 - Multiclass hinge was introduced by researchers Weston and Watkins (Wikipedia, 2011): ... For a prediction \(y\), take all \(y\) values unequal to \(t\), and compute the individual losses.
Google Research
research.google › pubs › l1-and-l2-regularization-for-multiclass-hinge-loss-models
L1 and L2 Regularization for Multiclass Hinge Loss Models
This paper investigates the relationship between the loss function, the type of regularization, and the resulting model sparsity of discriminatively-trained multiclass linear models. The effects on sparsity of optimizing log loss are straightforward: L2 regularization produces very dense models while L1 regularization produces much sparser models. However, optimizing hinge loss yields more nuanced behavior.
Gitbooks
sharad-s.gitbooks.io › cs231n › content › lecture_3_-_loss_functions_and_optimization › multiclass_svm_loss_deep_dive.html
Multiclass SVM Loss (Deep Dive) · CS231n
Q: If all of your scores are so small that they are approximately 0, what kind of loss would you expect? A: You would expect a loss of approximately (C-1) where C is the number of classes. This is because if you look at the equation for Multiclass SVM Loss, you will see that max(0, 0-0 + 1) ...
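The claim in that snippet is easy to verify numerically with a minimal sketch of the multiclass SVM loss (illustrative names; the margin is fixed at 1 as in the snippet's equation):

```python
# Multiclass SVM loss for one datapoint: sum over the incorrect classes
# of max(0, s_j - s_{y_i} + delta).
def multiclass_svm_loss(scores, y_i, delta=1.0):
    return sum(max(0.0, scores[j] - scores[y_i] + delta)
               for j in range(len(scores)) if j != y_i)

# With all C scores equal to 0, each of the C-1 incorrect classes
# contributes max(0, 0 - 0 + 1) = 1, so the loss is C - 1.
C = 5
print(multiclass_svm_loss([0.0] * C, y_i=0))  # prints 4.0
```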
HandWiki
handwiki.org › wiki › Hinge_loss
Hinge loss - HandWiki
February 6, 2024 - For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined as ... y should be the "raw" output of the classifier's decision function, not the predicted class label. For instance, in linear SVMs, ... While binary SVMs are commonly extended to multiclass classification in a one-vs.-all or one-vs.-one fashion,[2] it is also possible to extend the hinge loss itself for such an end.