Hinge loss - Wikipedia
en.wikipedia.org › wiki › Hinge_loss
In machine learning, the hinge loss is a loss function used for training classifiers; it is used for "maximum-margin" classification, most notably for support vector machines (SVMs). For an intended output $t = \pm 1$ and a classifier score $y$, the hinge loss of the prediction $y$ is defined as $\ell(y) = \max(0, 1 - ty)$.
January 26, 2026 - The hinge loss is a convex function, so many of the usual convex optimizers used in machine learning can work with it.
Quora
quora.com › What-is-a-rigorous-proof-that-the-hinge-loss-is-a-convex-loss-function
What is a rigorous proof that the hinge loss is a convex loss function? - Quora
Answer: The hinge loss is the maximum of two linear functions, so you can prove it in two steps: 1. Any linear function is convex. 2. The maximum of two convex functions is convex.
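That two-step argument can also be sanity-checked numerically. Here is a small Python sketch (function name and sampling ranges are my own) that verifies Jensen's inequality for the hinge loss on random pairs of points:

```python
import random

def hinge(m):
    """Hinge loss as a max of two linear functions of the margin m = t*y."""
    return max(0.0, 1.0 - m)

# Convexity check: f(a*x + (1-a)*y) <= a*f(x) + (1-a)*f(y) on random samples.
random.seed(0)
for _ in range(10_000):
    x, y = random.uniform(-5, 5), random.uniform(-5, 5)
    a = random.random()
    lhs = hinge(a * x + (1 - a) * y)
    rhs = a * hinge(x) + (1 - a) * hinge(y)
    assert lhs <= rhs + 1e-12, (x, y, a)
print("Jensen's inequality holds on all sampled pairs")
```

A numeric check is of course no substitute for the two-line proof above, but it makes the max-of-linear-functions structure concrete.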
Discussions

regression - What's the relationship between an SVM and hinge loss? - Cross Validated
I wouldn't say SVM is more complex ... interpretations and are not arbitrary. That's what makes SVMs so popular and powerful. For example, hinge loss is a continuous and convex upper bound to the task loss which, for binary classification problems, is the $0/1$ loss....
stats.stackexchange.com
Is this function (hinge loss times squared error) convex? - Mathematics Stack Exchange
I am wondering if a function of this form is convex: $h(x)=\max(1,1-ax)(x-a)^2$. Basically, if $x$ has the same sign as $a$, it is a regular quadratic function, otherwise it is of 3rd order. I plotted...
math.stackexchange.com
February 28, 2017
machine learning - What are the impacts of choosing different loss functions in classification to approximate 0-1 loss - Cross Validated
The objective function can be non-convex, in which case we often just stop at some local optima or saddle points. and interestingly, it also penalize correctly classified instances if they are weakly classified. It is a really strange design. I think such design sort of advises the model to not only make the right predictions, but also be confident about the predictions. If we don't want correctly classified instances to get punished, we can for example, move the hinge loss ...
stats.stackexchange.com
svm - Hinge loss is the tightest convex upper bound on the 0-1 loss - Cross Validated
I have read many times that the hinge loss is the tightest convex upper bound on the 0-1 loss (e.g. here, here and here). However, I have never seen a formal proof of this statement. How can we for...
stats.stackexchange.com
April 21, 2021
Stack Exchange
math.stackexchange.com › questions › 3587895 › showing-regularized-hinge-loss-is-convex-or-concave
Showing regularized Hinge Loss is convex or concave - Mathematics Stack Exchange
March 20, 2020 - Related: Showing a function is affine if it is convex and concave · Fenchel conjugate of the hinge loss · How can the loss functions of neural networks be non-convex? · Is the inverse of a nonnegative convex function convex/quasi-convex/concave/quasi-concave?
Carnegie Mellon University
cs.cmu.edu › ~yandongl › loss.html
Loss Function
Then we formulate the following loss functions: 0/1 loss: $\min_\theta\sum_i L_{0/1}(\theta^Tx)$, where $L_{0/1}(\theta^Tx) = 1$ if $y\cdot\theta^Tx \lt 0$, and $=0$ otherwise. Non-convex and very hard to optimize. Hinge loss: approximate the 0/1 loss by $\min_\theta\sum_i H(\theta^Tx)$. We define $H(\theta^Tx) ...
UBC Computer Science
cs.ubc.ca › ~schmidtm › Courses › 340-F17 › L21.pdf
CPSC 340: Machine Learning and Data Mining More Linear Classifiers Fall 2017
• This is called the hinge loss. – It's convex: max(constant, linear). – It's not degenerate: w = 0 now gives an error of 1 instead of 0. Hinge Loss: Convex Approximation to 0-1 Loss
Top answer

Here's my attempt to answer your questions:

  • Is an SVM as simple as saying it's a discriminative classifier that simply optimizes the hinge loss? Or is it more complex than that? Yes, you can say that. Also, don't forget that it regularizes the model too. I wouldn't say SVM is more complex than that; however, it is important to mention that all of those choices (e.g. hinge loss and $\ell_2$ regularization) have precise mathematical interpretations and are not arbitrary. That's what makes SVMs so popular and powerful. For example, hinge loss is a continuous and convex upper bound to the task loss which, for binary classification problems, is the $0/1$ loss. Note that the $0/1$ loss is non-convex and discontinuous. Convexity of the hinge loss makes the entire training objective of the SVM convex. The fact that it is an upper bound to the task loss guarantees that the minimizer of the bound won't have a bad value on the task loss. The $\ell_2$ regularization can be geometrically interpreted as maximizing the size of the margin.

  • How do the support vectors come into play? Support vectors play an important role in training SVMs: they identify the separating hyperplane. Let $\mathcal{T}$ denote a training set and $\mathcal{S}\subseteq\mathcal{T}$ be the set of support vectors that you get by training an SVM on $\mathcal{T}$ (assume all hyperparameters are fixed a priori). If we throw out all the non-SV samples from $\mathcal{T}$ and train another SVM (with the same hyperparameter values) on the remaining samples (i.e. on $\mathcal{S}$), we get the exact same classifier as before!

  • What about the slack variables? SVM was originally designed for problems where there exists a separating hyperplane (i.e. a hyperplane that perfectly separates the training samples from the two classes), and the goal was to find, among all separating hyperplanes, the hyperplane with the largest margin. The margin, denoted by $\gamma$, is defined for a classifier $f$ and a training set $\mathcal{T}$. Assuming $f$ perfectly separates all the examples in $\mathcal{T}$, $\gamma(f,\mathcal{T})$ is the distance of the closest training example from the separating hyperplane; note that $\gamma(f,\mathcal{T}) > 0$ here. The introduction of slack variables made it possible to train SVMs on problems where either 1) a separating hyperplane does not exist (i.e. the training data is not linearly separable), or 2) you are happy to (or would like to) sacrifice making some errors (higher bias) for better generalization (lower variance). However, this comes at the price of breaking some of the concrete mathematical and geometric interpretations of SVMs without slack variables (e.g. the geometrical interpretation of the margin).

  • Why can't you have deep SVMs? The SVM objective is convex. More precisely, it is piecewise quadratic; that is because the $\ell_2$ regularizer is quadratic and the hinge loss is piecewise linear. The training objectives in deep hierarchical models, however, are much more complex. In particular, they are not convex. Of course, one can design a hierarchical discriminative model with hinge loss and $\ell_2$ regularization etc., but it wouldn't be called an SVM. In fact, the hinge loss is commonly used in DNNs (Deep Neural Networks) for classification problems.
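As a quick illustration of the upper-bound claim in the first bullet, here is a minimal Python sketch (my own function names, under the convention that a signed margin $m = y\,f(x) < 0$ counts as a misclassification):

```python
def hinge(m):
    """Hinge loss of a signed margin m = y * f(x)."""
    return max(0.0, 1.0 - m)

def zero_one(m):
    """0-1 task loss; convention: m < 0 counts as a misclassification."""
    return 1.0 if m < 0 else 0.0

# The hinge loss upper-bounds the 0-1 loss at every sampled margin value.
margins = [i / 100 for i in range(-500, 501)]
assert all(hinge(m) >= zero_one(m) for m in margins)
print("hinge(m) >= 0-1(m) on all sampled margins")
```

Note that the bound is tight exactly where it matters least: for confidently correct points ($m \ge 1$) both losses are zero.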

Find elsewhere
Davidrosenberg
davidrosenberg.github.io › mlcourse › Archive › 2016 › Homework › hw6-multiclass › hw6.pdf
Generalized Hinge Loss and Multiclass SVM
we will eventually need our loss function to be a convex function of some w ∈ Rd that parameterizes our hypothesis space. It’ll be clear in what follows what we’re talking about. ... we have a linear hypothesis space. We’ll start with a special case, that the hinge loss is a convex
ScienceDirect
sciencedirect.com › topics › engineering › hinge-loss-function
Hinge Loss Function - an overview | ScienceDirect Topics
Deep neural networks with several hidden layers and/or the inclusion of non-linear activation functions (discussed in Section 16.2.2) typically have loss landscapes (i.e., a surface in some high-dimensional space defined by the loss function) that are highly non-convex.
arXiv
arxiv.org › pdf › 2103.00233
Learning with Smooth Hinge Losses Junru Luo ∗, Hong Qiao †, and Bo Zhang ‡
loss functions ψG(α; σ), ψM(α; σ). The general smooth convex loss function ψ(α) is then presented and discussed in Section 3. In Section 4, we give the smooth support vector machine by replacing the Hinge loss with the smooth Hinge
Davidrosenberg
davidrosenberg.github.io › mlcourse › Archive › 2017 › Homework › hw5.pdf
Homework 5: Generalized Hinge Loss and Multiclass SVM
New homework on multiclass hinge loss and multiclass SVM · New homework on Bayesian methods, specifically the beta-binomial model, hierarchical models, empirical Bayes ML-II, MAP-II · New short lecture on correlated variables with L1, L2, and Elastic Net regularization · Added some details about subgradient methods, including a one-slide proof that subgradient descent moves us towards a minimizer of a convex ...
Gabormelli
gabormelli.com › RKB › Hinge-Loss_Function
Hinge-Loss Function - GM-RKB - Gabor Melli
While the hinge loss function is both convex and continuous, it is not smooth (is not differentiable) at [math]\displaystyle{ yf(\vec{x})=1 }[/math] . Consequently, the hinge loss function cannot be used with gradient descent methods or stochastic gradient descent methods which rely on ...
Core
files01.core.ac.uk › download › pdf › 213011306.pdf
From Convex to Nonconvex: a Loss Function Analysis for Binary Classification
convex [8]. The main advantage of this type of loss function is the computational simplicity, and complex global optimization approaches can be avoided. Square loss and hinge loss are the
Top answer
1 of 3

Some of my thoughts, may not be correct though.

I understand the reason we have such a design (for hinge and logistic loss) is that we want the objective function to be convex.

Convexity is surely a nice property, but I think the most important reason is we want the objective function to have non-zero derivatives, so that we can make use of the derivatives to solve it. The objective function can be non-convex, in which case we often just stop at some local optima or saddle points.

and interestingly, it also penalizes correctly classified instances if they are weakly classified. It is a really strange design.

I think such a design sort of advises the model to not only make the right predictions, but also be confident about the predictions. If we don't want correctly classified instances to get punished, we can, for example, move the hinge loss (blue) to the left by 1, so that they no longer get any loss. But I believe this often leads to worse results in practice.
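The "move the hinge loss to the left by 1" idea above is exactly the perceptron loss $\max(0, -m)$. A small sketch (function names and the margin value are my own, for illustration):

```python
def hinge(m):
    """Standard hinge loss: penalizes any margin m below 1."""
    return max(0.0, 1.0 - m)

def shifted_hinge(m):
    """Hinge shifted left by 1, i.e. the perceptron loss max(0, -m):
    correctly classified points (m > 0) incur no loss at all."""
    return max(0.0, -m)

m = 0.5  # a correctly but weakly classified point (0 < m < 1)
print(hinge(m))          # 0.5 -> still penalized by the standard hinge
print(shifted_hinge(m))  # 0.0 -> no penalty after the shift
```

The shifted version stops rewarding confidence, which matches the remark that the shift often performs worse in practice.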

what are the prices we need to pay by using different "proxy loss functions", such as hinge loss and logistic loss?

IMO by choosing different loss functions we are bringing different assumptions to the model. For example, the logistic regression loss (red) assumes a Bernoulli distribution, and the MSE loss (green) assumes Gaussian noise.


Following the least squares vs. logistic regression example in PRML, I added the hinge loss for comparison.

As shown in the figure, hinge loss and logistic regression / cross entropy / log-likelihood / softplus give very close results, because their objective functions are close (figure below), while MSE is generally more sensitive to outliers. Hinge loss does not always have a unique solution because it is not strictly convex.

However, one important property of the hinge loss is that data points far away from the decision boundary contribute nothing to the loss, so the solution will be the same with those points removed.

The remaining points are called support vectors in the context of SVM. The SVM additionally uses a regularizer term to ensure the maximum-margin property and a unique solution.
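For a fixed classifier, this sparsity property is easy to verify: points with margin at least 1 add nothing to the total hinge loss. A sketch with made-up margin values (function name is my own):

```python
def total_hinge_loss(margins):
    """Sum of hinge losses for a list of signed margins (classifier held fixed)."""
    return sum(max(0.0, 1.0 - m) for m in margins)

margins = [-0.5, 0.2, 0.9, 1.0, 1.5, 3.0]   # made-up margins
inside = [m for m in margins if m < 1.0]    # only points on or inside the margin
# Dropping the points with margin >= 1 leaves the total loss unchanged.
assert total_hinge_loss(margins) == total_hinge_loss(inside)
```

During training the classifier does move, so which points matter can change between iterations, but at the optimum only the support vectors determine the solution.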

2 of 3

Posting a late reply, since there is a very simple answer which has not been mentioned yet.

what are the prices we need to pay by using different "proxy loss functions", such as hinge loss and logistic loss?

When you replace the non-convex 0-1 loss function by a convex surrogate (e.g. the hinge loss), you are actually now solving a different problem than the one you intended to solve (which is to minimize the number of classification mistakes). So you gain computational tractability (the problem becomes convex, meaning you can solve it efficiently using tools of convex optimization), but in the general case there is actually no way to relate the error of the classifier that minimizes a "proxy" loss to the error of the classifier that minimizes the 0-1 loss. If what you truly cared about was minimizing the number of misclassifications, I argue that this really is a big price to pay.

I should mention that this statement is worst-case, in the sense that it holds for any distribution $\mathcal D$. For some "nice" distributions, there are exceptions to this rule. The key example is of data distributions that have large margins w.r.t the decision boundary - see Theorem 15.4 in Shalev-Shwartz, Shai, and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.

Reddit
reddit.com › r/machinelearning › loss function: must it be convex?
r/MachineLearning on Reddit: Loss function: must it be convex?
November 19, 2015 - Non-convex losses might cause problems with the optimizer falling into local minima, and these problems may or may not be surmountable with tricks like random restarts. If your loss is nondifferentiable everywhere, you won't be able to compute ...
Cornell Computer Science
cs.cornell.edu › courses › cs4780 › 2018sp › lectures › lecturenote10.html
10: Empirical Risk Minimization
Remember the unconstrained SVM Formulation \[ \min_{\mathbf{w}}\ C\underset{Hinge-Loss}{\underbrace{\sum_{i=1}^{n}\max[1-y_{i}\underset{h({\mathbf{x}_i})}{\underbrace{(\mathbf{w}^{\top}{\mathbf{x}_i}+b)}},0]}}+\underset{l_{2}-Regularizer}{\underbrace{\left\Vert \mathbf{w}\right\Vert _{2}^{2}}} \] The hinge loss is the SVM's error function of choice, whereas the $l_{2}$-regularizer reflects the complexity of the solution and penalizes complex solutions.
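This unconstrained objective can be minimized directly with subgradient descent, since the hinge loss has an easy subgradient even though it is non-smooth at margin 1. A plain-Python sketch on made-up toy data; the function name, learning rate, epoch count, and data are illustrative assumptions, not part of the lecture notes:

```python
def svm_subgradient_descent(X, y, C=1.0, lr=0.01, epochs=2000):
    """Minimize C * sum_i max(1 - y_i (w.x_i + b), 0) + ||w||^2
    by batch subgradient descent. Plain-Python sketch; not tuned."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [2.0 * wj for wj in w], 0.0    # gradient of ||w||^2
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:                    # hinge subgradient is -y_i x_i here
                for j in range(d):
                    gw[j] -= C * yi * xi[j]
                gb -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

# Made-up linearly separable toy data.
X = [(2.0, 2.0), (3.0, 1.0), (2.5, 3.0), (-2.0, -1.0), (-3.0, -2.0), (-1.5, -2.5)]
y = [1, 1, 1, -1, -1, -1]
w, b = svm_subgradient_descent(X, y)
preds = [1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1 for x in X]
assert preds == y  # separable toy data: every point classified correctly
```

Because the objective is convex (hinge: max of linears; regularizer: quadratic), subgradient descent with a small enough step size approaches the global minimizer rather than a local one.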