Here are some of my thoughts, though they may not be correct.
> I understand the reason we have such a design (for hinge and logistic loss) is that we want the objective function to be convex.
Convexity is surely a nice property, but I think the most important reason is that we want the objective function to have non-zero derivatives, so that we can use those derivatives to solve it. The objective function can even be non-convex, in which case we often just stop at a local optimum or saddle point.
> and interestingly, it also penalizes correctly classified instances if they are weakly classified. It is a really strange design.
I think such a design sort of advises the model not only to make the right predictions, but also to be confident about those predictions. If we don't want correctly classified instances to get punished, we can, for example, move the hinge loss (blue) to the left by 1 so that they no longer incur any loss. But I believe this often leads to worse results in practice.
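To make that concrete, here is a small numpy sketch comparing the standard hinge loss with a version shifted left by 1 (the margin values are made-up examples):

```python
import numpy as np

# Margins m = y * f(x): positive means correctly classified.
margins = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])

hinge = np.maximum(0.0, 1.0 - margins)     # standard hinge: max(0, 1 - m)
shifted_hinge = np.maximum(0.0, -margins)  # shifted left by 1: max(0, -m)

# m = 0.5 is correctly classified but only weakly: the standard hinge
# still charges 0.5 for it, while the shifted version charges nothing.
# hinge:         2, 1, 0.5, 0, 0
# shifted_hinge: 1, 0, 0,   0, 0
```

Only the standard hinge pushes weakly classified points further from the boundary, which is the "be confident" pressure described above.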
> what are the prices we need to pay by using different "proxy loss functions", such as hinge loss and logistic loss?
IMO, by choosing different loss functions we bring different assumptions into the model. For example, the logistic regression loss (red) assumes a Bernoulli distribution, while the MSE loss (green) assumes Gaussian noise.
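The Bernoulli connection can be verified numerically: the logistic loss on the margin is exactly the Bernoulli negative log-likelihood under $p(y=+1\mid x) = \sigma(f(x))$. A quick check (random scores and labels, just for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
f = rng.normal(size=100)                # model scores f(x)
y = rng.choice([-1.0, 1.0], size=100)   # labels in {-1, +1}

# Logistic loss on the margin y * f ...
logistic_loss = np.log1p(np.exp(-y * f))

# ... equals the Bernoulli negative log-likelihood with p = sigmoid(y * f),
# since -log sigmoid(m) = log(1 + e^{-m}).
bernoulli_nll = -np.log(sigmoid(y * f))

assert np.allclose(logistic_loss, bernoulli_nll)
```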
Following the least squares vs. logistic regression example in PRML, I added the hinge loss for comparison.

As shown in the figure, hinge loss and logistic regression / cross-entropy / log-likelihood / softplus give very similar results, because their objective functions are close (figure below), while MSE is generally more sensitive to outliers. Hinge loss does not always have a unique solution because it is not strictly convex.
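A quick numeric sketch of that closeness (margin values chosen arbitrarily): hinge and softplus both grow linearly for large negative margins, while the squared loss grows quadratically, which is where its outlier sensitivity comes from.

```python
import numpy as np

m = np.linspace(-3.0, 3.0, 7)  # margins y * f(x)

hinge = np.maximum(0.0, 1.0 - m)
softplus = np.log1p(np.exp(-m))  # logistic loss: log(1 + e^{-m})
squared = (1.0 - m) ** 2         # squared loss on the margin: (1 - m)^2

# At m = -3: hinge = 4, softplus ~= 3.05 (both linear in -m),
# but squared = 16 -- a badly misclassified point dominates the MSE objective.
for mi, h, s, q in zip(m, hinge, softplus, squared):
    print(f"m={mi:+.1f}  hinge={h:.3f}  softplus={s:.3f}  squared={q:.3f}")
```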

However, one important property of hinge loss is that data points far away from the decision boundary contribute nothing to the loss, so the solution is unchanged if those points are removed.
The remaining points are called support vectors in the context of SVMs. The SVM additionally uses a regularizer term to ensure the maximum-margin property and a unique solution.
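This can be seen directly from the hinge subgradient: points with margin at least 1 contribute nothing to it. A minimal sketch (toy data, made up for illustration):

```python
import numpy as np

def hinge_subgradient(w, X, y):
    """Subgradient of sum_i max(0, 1 - y_i * (w @ x_i)) with respect to w."""
    margins = y * (X @ w)
    active = margins < 1.0  # only points inside the margin contribute
    return -(y[active, None] * X[active]).sum(axis=0)

w = np.array([1.0, 0.0])
X = np.array([[0.5, 1.0],    # margin 0.5 -> inside the margin ("support vector")
              [-0.3, 2.0],   # margin -0.3 -> inside the margin
              [5.0, -1.0]])  # margin 5.0 -> far from the boundary
y = np.array([1.0, 1.0, 1.0])

g_all = hinge_subgradient(w, X, y)
g_without_far_point = hinge_subgradient(w, X[:2], y[:2])
assert np.allclose(g_all, g_without_far_point)  # the far point changes nothing
```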
Posting a late reply, since there is a very simple answer which has not been mentioned yet.
> what are the prices we need to pay by using different "proxy loss functions", such as hinge loss and logistic loss?
When you replace the non-convex 0-1 loss function with a convex surrogate (e.g. hinge loss), you are actually solving a different problem from the one you intended to solve (which is to minimize the number of classification mistakes). You gain computational tractability (the problem becomes convex, meaning you can solve it efficiently with the tools of convex optimization), but in the general case there is no way to relate the error of the classifier that minimizes a "proxy" loss to the error of the classifier that minimizes the 0-1 loss. If what you truly care about is minimizing the number of misclassifications, I would argue that this really is a big price to pay.
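A toy illustration of the mismatch (the margins are made-up numbers, chosen to make the effect obvious): the two losses can rank a pair of classifiers in opposite order, because one confident mistake costs the hinge loss more than several marginal ones.

```python
import numpy as np

# Margins y_i * f(x_i) achieved by two candidate classifiers on the same five points.
margins_f1 = np.array([5.0, 5.0, 5.0, 5.0, -10.0])  # one confident mistake
margins_f2 = np.array([0.5, 0.5, 2.0, -0.2, -0.2])  # two marginal mistakes

def zero_one(m):
    return np.sum(m <= 0)  # number of misclassifications

def hinge(m):
    return np.sum(np.maximum(0.0, 1.0 - m))

# 0-1 loss prefers f1 (1 mistake vs 2), hinge loss prefers f2 (3.4 vs 11.0):
assert zero_one(margins_f1) < zero_one(margins_f2)
assert hinge(margins_f2) < hinge(margins_f1)
```

So a hinge-loss minimizer can settle on f2 even though f1 makes fewer mistakes.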
I should mention that this statement is worst-case, in the sense that it holds for any distribution $\mathcal D$. For some "nice" distributions there are exceptions to this rule. The key example is data distributions that have large margins w.r.t. the decision boundary; see Theorem 15.4 in Shalev-Shwartz, Shai, and Shai Ben-David. *Understanding Machine Learning: From Theory to Algorithms*. Cambridge University Press, 2014.