In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs). For an intended … (Wikipedia)
Wikipedia
en.wikipedia.org › wiki › Hinge_loss
Hinge loss - Wikipedia
January 26, 2026 - The hinge loss is a convex function, so many of the usual convex optimizers used in machine learning can work with it.
Quora
quora.com › What-is-a-rigorous-proof-that-the-hinge-loss-is-a-convex-loss-function
What is a rigorous proof that the hinge loss is a convex loss function? - Quora
Answer: The hinge loss is the maximum of two linear functions, so you can prove it in two steps: 1. Any linear function is convex. 2. The maximum of two convex functions is convex.
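The two-step argument can be written out as a short derivation; the margin shorthand $m = y\,f(x)$ is my notation, not taken from the snippet:

```latex
\[
  H(m) = \max(0,\, 1 - m), \qquad m = y\,f(x).
\]
% Step 1: both pieces are affine, hence convex.
Both $g_1(m) = 0$ and $g_2(m) = 1 - m$ are affine, hence convex.
% Step 2: a pointwise maximum of convex functions is convex.
For $t \in [0,1]$ and any $m_1, m_2$, each $g_i$ satisfies
\[
  g_i\big(t m_1 + (1-t) m_2\big)
  \le t\, g_i(m_1) + (1-t)\, g_i(m_2)
  \le t \max_j g_j(m_1) + (1-t) \max_j g_j(m_2),
\]
and taking the maximum over $i$ on the left preserves the bound, so
$\max(g_1, g_2)$ is convex; in particular $H$ is convex.
```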
UBC Computer Science
cs.ubc.ca › ~schmidtm › Courses › 340-F17 › L21.pdf pdf
CPSC 340: Machine Learning and Data Mining More Linear Classifiers Fall 2017
• This is called the hinge loss. – It’s convex: max(constant, linear). – It’s not degenerate: w=0 now gives an error of 1 instead of 0. Hinge Loss: Convex Approximation to 0-1 Loss
Carnegie Mellon University
cs.cmu.edu › ~yandongl › loss.html
Loss Function
0/1 loss: $\min_\theta\sum_i L_{0/1}(\theta^Tx)$. We define $L_{0/1}(\theta^Tx) = 1$ if $y\cdot f \lt 0$, and $=0$ o.w. Non-convex and very hard to optimize. Hinge loss: approximate the 0/1 loss by $\min_\theta\sum_i H(\theta^Tx)$. We define $H(\theta^Tx) = \max(0, 1 - y\cdot f)$. Apparently $H$ ...
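The relationship between the two losses above can be checked directly; the function names here are illustrative, not from the sources:

```python
# Sketch: comparing the 0-1 loss with its hinge surrogate on a few
# margins m = y * f(x). Matches the definitions in the snippet above.

def zero_one_loss(margin):
    """1 if the prediction is wrong (margin < 0), else 0."""
    return 1.0 if margin < 0 else 0.0

def hinge_loss(margin):
    """max(0, 1 - margin): zero only once the margin reaches 1."""
    return max(0.0, 1.0 - margin)

# The hinge loss upper-bounds the 0-1 loss at every margin,
# which is what makes it a usable convex surrogate.
for m in [-2.0, -0.5, 0.0, 0.5, 1.0, 3.0]:
    assert hinge_loss(m) >= zero_one_loss(m)
```

Note that the hinge loss keeps penalizing correct-but-low-confidence predictions (0 < margin < 1), which is exactly where it differs from the 0/1 loss.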
arXiv
arxiv.org › pdf › 2103.00233 pdf
Learning with Smooth Hinge Losses Junru Luo ∗, Hong Qiao †, and Bo Zhang ‡
loss functions ψG(α; σ), ψM(α; σ). The general smooth convex loss function ψ(α) is then presented and discussed in Section 3. In Section 4, we give the smooth support vector machine by replacing the Hinge loss with the smooth Hinge
Davidrosenberg
davidrosenberg.github.io › mlcourse › Archive › 2016 › Homework › hw6-multiclass › hw6.pdf pdf
Generalized Hinge Loss and Multiclass SVM
we will eventually need our loss function to be a convex function of some w ∈ Rd that parameterizes our hypothesis space. It’ll be clear in what follows what we’re talking about. ... we have a linear hypothesis space. We’ll start with a special case, that the hinge loss is a convex
arXiv
arxiv.org › abs › 2103.00233
[2103.00233] Learning with Smooth Hinge Losses
March 15, 2021 - In this paper, we introduce two smooth Hinge losses $\psi_G(\alpha;\sigma)$ and $\psi_M(\alpha;\sigma)$ which are infinitely differentiable and converge to the Hinge loss uniformly in $\alpha$ as $\sigma$ tends to $0$. By replacing the Hinge loss with these two smooth Hinge losses, we obtain two smooth support vector machines (SSVMs), respectively. Solving the SSVMs with the Trust Region Newton method (TRON) leads to two quadratically convergent algorithms. Experiments in text classification tasks show that the proposed SSVMs are effective in real-world applications. We also introduce a general smooth convex loss function to unify several commonly-used convex loss functions in machine learning.
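The abstract does not reproduce the paper's exact $\psi_G$ and $\psi_M$; as a sketch of the general idea, here is one standard infinitely differentiable smoothing (a scaled softplus, my choice rather than the paper's definitions) that converges uniformly to the hinge loss as $\sigma \to 0$:

```python
import math

def hinge(margin):
    return max(0.0, 1.0 - margin)

def smooth_hinge(margin, sigma):
    """Softplus-smoothed hinge: sigma * log(1 + exp((1 - margin) / sigma)).

    Infinitely differentiable in the margin, and within sigma * log(2)
    of the true hinge loss everywhere, so it converges uniformly as
    sigma -> 0. (Illustrative smoothing only; not the paper's psi_G/psi_M.)
    """
    z = (1.0 - margin) / sigma
    # Stable evaluation of sigma * log(1 + exp(z)) for large |z|.
    if z > 0:
        return sigma * (z + math.log1p(math.exp(-z)))
    return sigma * math.log1p(math.exp(z))

# The approximation error shrinks with sigma, uniformly over margins.
for m in [-2.0, 0.0, 1.0, 4.0]:
    for s in [1.0, 0.1, 0.01]:
        assert abs(smooth_hinge(m, s) - hinge(m)) <= s * math.log(2) + 1e-12
```

Smoothness is what enables the second-order (Newton-type) solvers the abstract mentions: the plain hinge has no Hessian at the kink, while a smoothed version is twice differentiable everywhere.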
Davidrosenberg
davidrosenberg.github.io › mlcourse › Archive › 2017 › Homework › hw5.pdf pdf
Homework 5: Generalized Hinge Loss and Multiclass SVM
New homework on multiclass hinge loss and multiclass SVM · New homework on Bayesian methods, specifically the beta-binomial model, hierarchical models, empirical Bayes ML-II, MAP-II · New short lecture on correlated variables with L1, L2, and Elastic Net regularization · Added some details about subgradient methods, including a one-slide proof that subgradient descent moves us towards a minimizer of a convex ...
JMLR
jmlr.org › papers › v9 › bartlett08a.html
Classification with a Reject Option using a Hinge Loss
Just as in the conventional classification problem, minimization of the sample average of the cost is a difficult optimization problem. As an alternative, we propose the optimization of a certain convex loss function φ, analogous to the hinge loss used in support vector machines (SVMs).
arXiv
arxiv.org › abs › 1309.6813
[1309.6813] Hinge-loss Markov Random Fields: Convex Inference for Structured Prediction
September 26, 2013 - Graphical models for structured domains are powerful tools, but the computational complexities of combinatorial prediction spaces can force restrictions on models, or require approximate inference in order to be tractable. Instead of working in a combinatorial space, we use hinge-loss Markov random fields (HL-MRFs), an expressive class of graphical models with log-concave density functions over continuous variables, which can represent confidences in discrete predictions.
Gabormelli
gabormelli.com › RKB › Hinge-Loss_Function
Hinge-Loss Function - GM-RKB - Gabor Melli
The hinge loss function is defined as $V(f(\vec{x}),y) = \max(0, 1-yf(\vec{x})) = |1 - yf(\vec{x})|_{+}$. The hinge loss provides a relatively tight, convex upper bound on the 0–1 indicator function. Specifically, the hinge loss equals the 0–1 indicator function ...
ScienceDirect
sciencedirect.com › topics › engineering › hinge-loss-function
Hinge Loss Function - an overview | ScienceDirect Topics
The hinge loss encourages the network to maximize the margin around the decision boundary separating the two classes, which can lead to better generalization performance than using cross-entropy. Additionally, the hinge loss has sparse gradients, which can be useful for training large models with limited memory (unlike cross-entropy with dense gradients). A frequently used variant of the hinge loss is the squared hinge loss, given by
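The snippet's formula is cut off; the commonly used squared hinge, $\max(0, 1-yf)^2$, is presumably what it goes on to define. A minimal sketch, assuming that standard form:

```python
def hinge(margin):
    return max(0.0, 1.0 - margin)

def squared_hinge(margin):
    """Standard squared hinge, max(0, 1 - margin)**2 (assumed form; the
    source snippet is truncated before its definition).

    Same zero region as the hinge, but the kink at margin = 1 is squared
    away, making the loss differentiable everywhere, and large margin
    violations are penalized quadratically rather than linearly.
    """
    return hinge(margin) ** 2

# Differentiable at margin = 1: both one-sided slopes vanish there.
eps = 1e-6
left = (squared_hinge(1.0 - eps) - squared_hinge(1.0)) / eps
right = (squared_hinge(1.0) - squared_hinge(1.0 + eps)) / eps
assert abs(left) < 1e-5 and abs(right) < 1e-5
```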
arXiv
arxiv.org › abs › 1512.07797
[1512.07797] The Lovász Hinge: A Novel Convex Surrogate for Submodular Losses
May 15, 2017 - We propose instead a novel surrogate ... loss function to compute a gradient or cutting-plane. We prove that the Lovász hinge is convex and yields an extension....
Top answer

Here's my attempt to answer your questions:

  • Is an SVM as simple as saying it's a discriminative classifier that simply optimizes the hinge loss? Or is it more complex than that? Yes, you can say that. Also, don't forget that it regularizes the model. I wouldn't say SVM is more complex than that; however, it is important to mention that all of those choices (e.g. hinge loss and $L_2$ regularization) have precise mathematical interpretations and are not arbitrary. That's what makes SVMs so popular and powerful. For example, hinge loss is a continuous and convex upper bound on the task loss which, for binary classification problems, is the $0/1$ loss. Note that $0/1$ loss is non-convex and discontinuous. Convexity of hinge loss makes the entire training objective of SVM convex. The fact that it is an upper bound on the task loss guarantees that the minimizer of the bound also keeps the task loss under control: the task loss can never exceed the surrogate value achieved. $L_2$ regularization can be geometrically interpreted as controlling the size of the margin.

  • How do the support vectors come into play? Support vectors play an important role in training SVMs. They identify the separating hyperplane. Let $D$ denote a training set and $SV(D) \subseteq D$ be the set of support vectors that you get by training an SVM on $D$ (assume all hyperparameters are fixed a priori). If we throw out all the non-SV samples from $D$ and train another SVM (with the same hyperparameter values) on the remaining samples (i.e. on $SV(D)$) we get the same exact classifier as before!

  • What about the slack variables? SVM was originally designed for problems where there exists a separating hyperplane (i.e. a hyperplane that perfectly separates the training samples from the two classes), and the goal was to find, among all separating hyperplanes, the hyperplane with the largest margin. The margin, denoted by $d(w, D)$, is defined for a classifier $w$ and a training set $D$. Assuming $w$ perfectly separates all the examples in $D$, we have $d(w, D) = \min_{(x, y) \in D} y \frac{w^Tx}{||w||_2}$, which is the distance of the closest training example from the separating hyperplane $w$. Note that $y \in \{+1, -1\}$ here. The introduction of slack variables made it possible to train SVMs on problems where either 1) a separating hyperplane does not exist (i.e. the training data is not linearly separable), or 2) you are happy to (or would like to) sacrifice making some error (higher bias) for better generalization (lower variance). However, this comes at the price of breaking some of the concrete mathematical and geometric interpretations of SVMs without slack variables (e.g. the geometrical interpretation of the margin).

  • Why can't you have deep SVMs? The SVM objective is convex. More precisely, it is piecewise quadratic; that is because the $L_2$ regularizer is quadratic and the hinge loss is piecewise linear. The training objectives in deep hierarchical models, however, are much more complex. In particular, they are not convex. Of course, one can design a hierarchical discriminative model with hinge loss and $L_2$ regularization etc., but it wouldn't be called an SVM. In fact, the hinge loss is commonly used in DNNs (deep neural networks) for classification problems.
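The answer's description of the objective (quadratic $L_2$ regularizer plus piecewise-linear hinge loss, which subgradient methods can minimize) can be sketched as a tiny linear SVM. The toy data, learning rate, and regularization strength below are illustrative assumptions, not taken from any of the sources:

```python
# Minimal linear SVM trained by subgradient descent on the objective
#   (lam/2) * ||w||^2 + (1/n) * sum_i max(0, 1 - y_i * (w . x_i + b)).

def train_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [lam * wj for wj in w], 0.0  # gradient of the L2 term
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # hinge active: subgradient is -y_i * x_i
                for j in range(d):
                    gw[j] -= yi * xi[j] / n
                gb -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Linearly separable toy data: +1 around (2, 2), -1 around (-2, -2).
X = [[2.0, 2.0], [3.0, 1.0], [2.5, 3.0],
     [-2.0, -2.0], [-3.0, -1.0], [-2.5, -3.0]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_svm(X, y)
assert all(predict(w, b, xi) == yi for xi, yi in zip(X, y))
```

This is plain subgradient descent rather than a dedicated QP solver, so it recovers the classifier but not the dual variables that identify the support vectors; it is meant only to show that the convex hinge + $L_2$ objective is directly optimizable.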

Massachusetts Institute of Technology
mit.edu › ~9.520 › spring07 › Classes › svmwithfenchel.pdf pdf
Several Views of Support Vector Machines Ryan M. Rifkin
Unfortunately, the 0-1 loss is not convex. Therefore, we have little hope of being able to optimize this loss function in practice. (Note that the representer theorem does hold ... This is (basically) an SVM. So what? How will you solve this problem (find the minimizing y)? The hinge ...
Stack Exchange
stats.stackexchange.com › questions › 520792 › hinge-loss-is-the-tightest-convex-upper-bound-on-the-0-1-loss
svm - Hinge loss is the tightest convex upper bound on the 0-1 loss - Cross Validated
April 21, 2021 - I have read many times that the hinge loss is the tightest convex upper bound on the 0-1 loss (e.g. here, here and here). However, I have never seen a formal proof of this statement. How can we for...