In machine learning, a loss function used for maximum-margin classification
In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs). For an intended …
Wikipedia
en.wikipedia.org › wiki › Hinge_loss
Hinge loss - Wikipedia
January 26, 2026 - The hinge loss is a convex function, so many of the usual convex optimizers used in machine learning can work with it.
Carnegie Mellon University
cs.cmu.edu › ~yandongl › loss.html
Loss Function
Square loss: $\min_\theta \sum_i||y^{(i)}-\theta^Tx^{(i)}||^2$ Fortunately, hinge loss, logistic loss and square loss are all convex functions.
Baeldung
baeldung.com › home › artificial intelligence › machine learning › differences between hinge loss and logistic loss
Differences Between Hinge Loss and Logistic Loss | Baeldung on Computer Science
February 28, 2025 - One of the main characteristics of hinge loss is that it’s a convex function. This makes it different from other losses such as the 0-1 loss. With convexity comes the existence of a global optimum.
ScienceDirect
sciencedirect.com › topics › engineering › hinge-loss-function
Hinge Loss Function - an overview | ScienceDirect Topics
Deep neural networks with several hidden layers and/or the inclusion of non-linear activation functions (discussed in Section 16.2.2) typically have loss landscapes (i.e., a surface in some high-dimensional space defined by the loss function) that are highly non-convex.
UBC Computer Science
cs.ubc.ca › ~schmidtm › Courses › 340-F17 › L21.pdf
CPSC 340: Machine Learning and Data Mining More Linear Classifiers Fall 2017
• This is called the hinge loss. – It's convex: max(constant, linear). – It's not degenerate: w=0 now gives an error of 1 instead of 0. Hinge Loss: Convex Approximation to 0-1 Loss
Davidrosenberg
davidrosenberg.github.io › mlcourse › Archive › 2016 › Homework › hw6-multiclass › hw6.pdf
DS-GA 1003: Machine Learning and Computational Statistics
But to solve our machine learning ... we’re talking about. ... we have a linear hypothesis space. We’ll start with a special case, that the hinge loss is a convex...
arXiv
arxiv.org › pdf › 2103.00233
Learning with Smooth Hinge Losses — Junru Luo, Hong Qiao, and Bo Zhang
SVMs with different convex loss functions and then introduce the smooth Hinge loss functions ψG(α; σ), ψM(α; σ). The general smooth convex loss function ψ(α) is then presented and discussed in Section 3. In Section 4, we give the smooth support vector machine by replacing the ...
Core
files01.core.ac.uk › download › pdf › 213011306.pdf
From Convex to Nonconvex: a Loss Function Analysis for Binary Classification
convex [8]. The main advantage of this type of loss function is its computational simplicity: complex global optimization approaches can be avoided. Square loss and hinge loss are the most commonly adopted loss functions in machine learning.
Gabormelli
gabormelli.com › RKB › Hinge-Loss_Function
Hinge-Loss Function - GM-RKB - Gabor Melli
The hinge loss function is defined as $V(f(\vec{x}), y) = \max(0, 1 - yf(\vec{x})) = |1 - yf(\vec{x})|_{+}$. The hinge loss provides a relatively tight, convex upper bound on the 0–1 indicator function. Specifically, the hinge loss equals the 0–1 indicator function ...
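Read literally, the definition above translates into a one-line function. A minimal sketch (the numeric examples are illustrative, not taken from the source):

```python
def hinge_loss(score, y):
    """Hinge loss V(f(x), y) = max(0, 1 - y*f(x)) for a raw score f(x)
    and a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * score)

# Correct and confident (margin >= 1): zero loss, matching the 0-1 indicator.
print(hinge_loss(2.0, +1))   # 0.0
# Correct but inside the margin: small positive loss.
print(hinge_loss(0.5, +1))   # 0.5
# Misclassified: loss grows linearly with the violation.
print(hinge_loss(-1.0, +1))  # 2.0
```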
Tel Aviv University
tau.ac.il › ~mansour › advanced-agt+ml › scribe2_covex_func.pdf
Advanced Topics in Machine Learning and Algorithmic Game Theory
November 7, 2011 - hyperplane (essentially, a hypothesis), x ∈ X and y ∈ [−1, 1]. The hinge loss is a maximum of linear functions and therefore convex.
Wikipedia
en.wikipedia.org › wiki › Loss_functions_for_classification
Loss functions for classification - Wikipedia
January 12, 2026 - The hinge loss provides a relatively tight, convex upper bound on the 0–1 indicator function. Specifically, the hinge loss equals the 0–1 indicator function when $|yf(\vec{x})| \geq 1$. In addition, the empirical risk ...
Shadecoder
shadecoder.com › topics › hinge-loss-a-comprehensive-guide-for-2025
Hinge Loss: A Comprehensive Guide for 2025 - Shadecoder - 100% Invisibile AI Coding Interview Copilot
• Focuses on hard examples: Because correctly classified examples with a sufficient margin receive zero loss, training emphasizes borderline or misclassified points, which can accelerate learning where it matters most. • Convexity for linear models: For linear classifiers, hinge loss is convex, ...
Quora
quora.com › What-is-a-rigorous-proof-that-the-hinge-loss-is-a-convex-loss-function
What is a rigorous proof that the hinge loss is a convex loss function? - Quora
Answer: The hinge loss is the maximum of two linear functions, so you can prove it in two steps: 1. Any linear function is convex. 2. The maximum of two convex functions is convex.
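That two-step argument can be spot-checked numerically by sampling the convexity inequality on random pairs of margins. This is only an illustrative check, not a proof:

```python
import random

def hinge(z):
    # Hinge loss as a function of the margin z = y * f(x):
    # the maximum of the two linear functions 0 and 1 - z.
    return max(0.0, 1.0 - z)

random.seed(0)
for _ in range(10_000):
    a, b = random.uniform(-5.0, 5.0), random.uniform(-5.0, 5.0)
    t = random.random()
    # Convexity: h(t*a + (1-t)*b) <= t*h(a) + (1-t)*h(b).
    assert hinge(t * a + (1 - t) * b) <= t * hinge(a) + (1 - t) * hinge(b) + 1e-12
print("convexity inequality held at every sampled point")
```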
Texas A&M University
people.tamu.edu › ~sji › classes › loss-slides.pdf
A Unified View of Loss Functions in Supervised Learning Shuiwang Ji
are convex. Note that the hinge loss and perceptron loss are not strictly convex. Comparison of different loss functions in a unified view.
Grokipedia
grokipedia.com › hinge loss
Hinge loss — Grokipedia
January 14, 2026 - Hinge loss is a convex loss function primarily used in binary classification tasks within machine learning to penalize predictions that are incorrect or positioned too close to the decision boundary, thereby promoting robust separation between ...
Top answer
1 of 3

Some of my thoughts; they may not be correct, though.

I understand the reason we have such a design (for hinge and logistic loss) is that we want the objective function to be convex.

Convexity is surely a nice property, but I think the most important reason is we want the objective function to have non-zero derivatives, so that we can make use of the derivatives to solve it. The objective function can be non-convex, in which case we often just stop at some local optima or saddle points.

And interestingly, it also penalizes correctly classified instances if they are weakly classified. It is a really strange design.

I think such a design sort of advises the model to not only make the right predictions, but also be confident about the predictions. If we don't want correctly classified instances to get punished, we can, for example, move the hinge loss (blue) to the left by 1, so that they no longer get any loss. But I believe this often leads to worse results in practice.

what are the prices we need to pay by using different "proxy loss functions", such as hinge loss and logistic loss?

IMO, by choosing different loss functions we are bringing different assumptions to the model. For example, the logistic regression loss (red) assumes a Bernoulli distribution, and the MSE loss (green) assumes Gaussian noise.


Following the least squares vs. logistic regression example in PRML, I added the hinge loss for comparison.

As shown in the figure, hinge loss and logistic regression / cross entropy / log-likelihood / softplus have very close results, because their objective functions are close (figure below), while MSE is generally more sensitive to outliers. Hinge loss does not always have a unique solution because it's not strictly convex.
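The closeness of hinge and logistic loss, and the outlier sensitivity of squared error, can be seen by tabulating all three as functions of the margin z = y·f(x). A small illustrative sketch (computed values, not the PRML figure itself):

```python
import math

def hinge(z):    return max(0.0, 1.0 - z)
def logistic(z): return math.log(1.0 + math.exp(-z))  # log loss, a.k.a. softplus(-z)
def square(z):   return (1.0 - z) ** 2                # squared error on the margin

for z in (-2.0, -1.0, 0.0, 1.0, 2.0, 5.0):
    print(f"z={z:+.0f}  hinge={hinge(z):6.3f}  logistic={logistic(z):6.3f}  square={square(z):6.3f}")
```

For negative margins, hinge and logistic both grow roughly linearly while square grows quadratically; square also keeps penalizing confidently correct points (z = 5 gives 16), which is one way to see its sensitivity to points far from the boundary.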

However one important property of hinge loss is, data points far away from the decision boundary contribute nothing to the loss, the solution will be the same with those points removed.

The remaining points are called support vectors in the context of SVM. SVM additionally uses a regularizer term to ensure the maximum-margin property and a unique solution.
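That zero-contribution property is easy to verify directly: points with margin ≥ 1 add nothing to either the hinge loss or its subgradient, so deleting them leaves both unchanged. A minimal sketch with made-up numbers:

```python
def hinge_objective(w, data):
    """Total hinge loss and one subgradient w.r.t. w, for a linear scorer w·x."""
    loss, grad = 0.0, [0.0] * len(w)
    for x, y in data:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        if margin < 1.0:  # only points inside the margin contribute
            loss += 1.0 - margin
            grad = [gi - y * xi for gi, xi in zip(grad, x)]
    return loss, grad

w = [1.0, -1.0]
near = [([0.5, 0.2], +1), ([-0.1, 0.4], -1)]    # margins 0.3 and 0.5: active
far  = [([5.0, -5.0], +1), ([-4.0, 6.0], -1)]   # margins of 10: inactive
print(hinge_objective(w, near) == hinge_objective(w, near + far))  # True
```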

2 of 3

Posting a late reply, since there is a very simple answer which has not been mentioned yet.

what are the prices we need to pay by using different "proxy loss functions", such as hinge loss and logistic loss?

When you replace the non-convex 0-1 loss function by a convex surrogate (e.g. the hinge loss), you are actually now solving a different problem than the one you intended to solve (which is to minimize the number of classification mistakes). So you gain computational tractability (the problem becomes convex, meaning you can solve it efficiently using tools of convex optimization), but in the general case there is actually no way to relate the error of the classifier that minimizes a "proxy" loss and the error of the classifier that minimizes the 0-1 loss. If what you truly cared about was minimizing the number of misclassifications, I argue that this really is a big price to pay.

I should mention that this statement is worst-case, in the sense that it holds for any distribution $\mathcal D$. For some "nice" distributions, there are exceptions to this rule. The key example is of data distributions that have large margins w.r.t the decision boundary - see Theorem 15.4 in Shalev-Shwartz, Shai, and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.

Aiml
aiml.com › home › posts › machine learning interview questions › supervised learning › classification › support vector machine › how does hinge loss differ from logistic loss?
How does hinge loss differ from logistic loss? - AIML.com
March 26, 2023 - This property is one of the reasons SVM performs very well on many data sets, as it enables hyperplanes to find margins that result in the highest accuracy possible. As can be seen in the graphs above, hinge loss is non-differentiable at the hinge point, which means the optimization problem is no longer smooth (though it remains convex).
ScienceDirect
sciencedirect.com › science › article › abs › pii › S0925231221012509
Learning with smooth Hinge losses - ScienceDirect
August 18, 2021 - Although first-order methods are ... Motivated by the proposed smooth Hinge losses, we also propose a general smooth convex loss function...
Top answer
1 of 1

Here's my attempt to answer your questions:

  • Is an SVM as simple as saying it's a discriminative classifier that simply optimizes the hinge loss? Or is it more complex than that? Yes, you can say that. Also, don't forget that it regularizes the model too. I wouldn't say SVM is more complex than that, however, it is important to mention that all of those choices (e.g. hinge loss and $L_2$ regularization) have precise mathematical interpretations and are not arbitrary. That's what makes SVMs so popular and powerful. For example, hinge loss is a continuous and convex upper bound to the task loss which, for binary classification problems, is the $0/1$ loss. Note that $0/1$ loss is non-convex and discontinuous. Convexity of hinge loss makes the entire training objective of SVM convex. The fact that it is an upper bound to the task loss guarantees that the minimizer of the bound won't have a bad value on the task loss. $L_2$ regularization can be geometrically interpreted as the size of the margin.

  • How do the support vectors come into play? Support vectors play an important role in training SVMs. They identify the separating hyperplane. Let $D$ denote a training set and $SV(D) \subseteq D$ be the set of support vectors that you get by training an SVM on $D$ (assume all hyperparameters are fixed a priori). If we throw out all the non-SV samples from $D$ and train another SVM (with the same hyperparameter values) on the remaining samples (i.e. on $SV(D)$) we get the same exact classifier as before!

  • What about the slack variables? SVM was originally designed for problems where there exists a separating hyperplane (i.e. a hyperplane that perfectly separates the training samples from the two classes), and the goal was to find, among all separating hyperplanes, the hyperplane with the largest margin. The margin, denoted by $d(w, D)$, is defined for a classifier $w$ and a training set $D$. Assuming $w$ perfectly separates all the examples in $D$, we have $d(w, D) = \min_{(x, y) \in D} y \frac{w^Tx}{||w||_2}$, which is the distance of the closest training example from the separating hyperplane $w$. Note that $y \in \{+1, -1\}$ here. The introduction of slack variables made it possible to train SVMs on problems where either 1) a separating hyperplane does not exist (i.e. the training data is not linearly separable), or 2) you are happy to (or would like to) sacrifice making some error (higher bias) for better generalization (lower variance). However, this comes at the price of breaking some of the concrete mathematical and geometric interpretations of SVMs without slack variables (e.g. the geometrical interpretation of the margin).

  • Why can't you have deep SVM's? SVM objective is convex. More precisely, it is piecewise quadratic; that is because the $L_2$ regularizer is quadratic and the hinge loss is piecewise linear. The training objectives in deep hierarchical models, however, are much more complex. In particular, they are not convex. Of course, one can design a hierarchical discriminative model with hinge loss and $L_2$ regularization etc., but, it wouldn't be called an SVM. In fact, the hinge loss is commonly used in DNNs (Deep Neural Networks) for classification problems.
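The pieces discussed in the answer above — the convex training objective (L2 regularizer plus hinge loss) and the geometric margin — fit in a few lines of plain-Python subgradient descent. This is an illustrative sketch on made-up data, not a production SVM solver:

```python
import math

def train_linear_svm(X, Y, lam=0.01, lr=0.05, epochs=200):
    """Full-batch subgradient descent on the convex SVM objective
    (lam/2)*||w||^2 + (1/n) * sum_i max(0, 1 - y_i * w·x_i)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        g = [lam * wi for wi in w]  # gradient of the L2 regularizer
        for x, y in zip(X, Y):
            if y * sum(wi * xi for wi, xi in zip(w, x)) < 1.0:
                g = [gi - y * xi / n for gi, xi in zip(g, x)]
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

def margin(w, X, Y):
    """Geometric margin d(w, D) = min_i y_i * (w·x_i) / ||w||_2
    (positive iff w separates the data)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * sum(wi * xi for wi, xi in zip(w, x)) / norm
               for x, y in zip(X, Y))

# Tiny linearly separable toy set (hypothetical data).
X = [[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [-2.0, -0.5]]
Y = [+1, +1, -1, -1]
w = train_linear_svm(X, Y)
assert all((sum(wi * xi for wi, xi in zip(w, x)) > 0) == (y > 0)
           for x, y in zip(X, Y))
print("training accuracy 100%, geometric margin:", margin(w, X, Y))
```

Because the objective is convex, this simple first-order method suffices; the same loss dropped into a deep network would, as noted above, make the objective non-convex even though each hinge term stays piecewise linear.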