In machine learning, a loss function used for maximum-margin classification
In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs). (Wikipedia)
🌐
Wikipedia
en.wikipedia.org › wiki › Hinge_loss
Hinge loss - Wikipedia
January 26, 2026 - In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs). For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined ...
🌐
GeeksforGeeks
geeksforgeeks.org › machine learning › hinge-loss-relationship-with-support-vector-machines
Hinge-loss & Relationship with Support Vector Machines - GeeksforGeeks
August 21, 2025 - Its purpose is to penalize predictions that are incorrect or insufficiently confident in the context of binary classification. It is used in binary classification problems where the objective is to separate the data points in two classes typically ...
Discussions

machine learning - Confusion on hinge loss and SVM - Cross Validated
I'm reading a book on data science and get confused about how the book describes the hinge loss of SVM. Here is a figure from the book on Page 94: This figure shows the loss function of a NEGATIVE ... More on stats.stackexchange.com
🌐 stats.stackexchange.com
machine learning - What's the relationship between an SVM and hinge loss? - Stack Overflow
My colleague and I are trying to wrap our heads around the difference between logistic regression and an SVM. Clearly they are optimizing different objective functions. Is an SVM as simple as saying it's a discriminative classifier that simply optimizes the hinge loss? More on stackoverflow.com
🌐 stackoverflow.com
Is support vector machine just about simplifying logistic regression formula? If so, why this name?

No. The main difference between the cost functions is that cross-entropy loss (CEL) penalizes based on how far the predicted probability is from the answer. So if something is predicted with CEL as class 1 with probability 0.51 and it is actually class 1, it is penalized more strongly than if it had been predicted with probability 0.99. For the hinge loss used by SVMs, a confidently correct prediction costs the same (zero) whether you barely clear the margin or clear it by a lot. Both methods, however, are penalized by 'distance' when they predict the wrong answer.

More on reddit.com
🌐 r/learnmachinelearning
13
July 12, 2020
neural networks - How do I calculate the gradient of the hinge loss function? - Artificial Intelligence Stack Exchange
Hinge loss is difficult to work with when the derivative is needed because the derivative will be a piece-wise function. max has one non-differentiable point in its solution, and thus the derivative has the same. This was a very prominent issue with non-separable cases of SVM (and a good reason ... More on ai.stackexchange.com
🌐 ai.stackexchange.com
🌐
Medium
medium.com › the-modern-scientist › crossing-the-margin-why-hinge-loss-matters-in-machine-learning-aa2eaf7c029e
Crossing the Margin: Why Hinge Loss Matters in Machine Learning | by Everton Gomede, PhD | The Modern Scientist | Medium
December 6, 2024 - Approach: This essay explores hinge loss, a margin-based loss function used in Support Vector Machines (SVMs), detailing its mechanics, applications, and limitations.
Top answer
1 of 2
9

Searching for the quoted text, it seems the book is Data Science for Business (Provost and Fawcett), and they're describing the soft-margin SVM. Their description of the hinge loss is wrong. The problem is that it doesn't penalize misclassified points that lie within the margin, as you mentioned.

In SVMs, smaller weights correspond to larger margins. So, using this "version" of the hinge loss would have pathological consequences: We could achieve the minimum possible loss (zero) simply by choosing weights small enough such that all points lie within the margin. Even if every single point is misclassified. Because the SVM optimization problem contains a regularization term that encourages small weights (i.e. large margins), the solution will always be the zero vector. This means the solution is completely independent of the data, and nothing is learned. Needless to say, this wouldn't make for a very good classifier.
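That pathology is easy to check numerically. Below is a pure-Python sketch; `book_loss` is a hypothetical implementation of the book's (incorrect) version, which penalizes only misclassified points lying outside the margin. With zero weights every score is 0, every point sits inside the margin, and the book's loss is already at its minimum even though nothing was learned:

```python
def book_loss(w, b, X, y):
    # Hypothetical "book" hinge loss: penalize only points that are
    # misclassified AND outside the margin (t * score < -1).
    total = 0.0
    for x, t in zip(X, y):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        if t * score < -1:
            total += -t * score - 1
    return total

def correct_hinge_loss(w, b, X, y):
    # Standard hinge loss: max(0, 1 - t * score).
    total = 0.0
    for x, t in zip(X, y):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        total += max(0.0, 1.0 - t * score)
    return total

X = [(2.0,), (3.0,), (-1.0,), (-2.0,)]
y = [1, 1, -1, -1]

print(book_loss((0.0,), 0.0, X, y))           # 0.0 -- "optimal", yet nothing learned
print(correct_hinge_loss((0.0,), 0.0, X, y))  # 4.0 -- every point inside the margin
```

The standard hinge loss correctly reports that the zero-weight classifier is bad, while the book's version declares it perfect.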

The correct expression for the hinge loss for a soft-margin SVM is:

$$\ell(y) = \max(0,\ 1 - t \cdot y)$$

where $y$ is the output of the SVM given input $x$, and $t$ is the true class (-1 or 1). When the true class is -1 (as in your example), the hinge loss looks like this:

[plot of $\max(0, 1 + y)$: zero for $y \le -1$, increasing linearly for $y > -1$]

Note that the loss is nonzero for misclassified points, as well as correctly classified points that fall within the margin.

For a proper description of soft-margin SVMs using the hinge loss formulation, see The Elements of Statistical Learning (section 12.3.2) or the Wikipedia article.
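As a quick sanity check, here is a minimal pure-Python sketch evaluating $\max(0,\ 1 - t \cdot y)$ for a few scores with true class $t = -1$:

```python
def hinge(t, y):
    """Hinge loss for true label t in {-1, +1} and classifier score y."""
    return max(0.0, 1.0 - t * y)

t = -1
print(hinge(t, -2.0))  # 0.0  correctly classified, beyond the margin
print(hinge(t, -0.5))  # 0.5  correctly classified, but inside the margin
print(hinge(t,  1.5))  # 2.5  misclassified: the penalty grows linearly
```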

2 of 2
1

The (A) hinge function can be expressed as

$$y = \beta_1 \max(0,\ x - c) + \varepsilon$$

where:

  • $\beta_1$ is the change in slope after the hinge. In your example, this amounts to the slope following the hinge, since your hinge-only model (see below) assumes zero effect of $x$ on $y$ until the hinge.

  • $c$ is the point (in $x$) at which the hinge is located, and is a parameter estimated for the model. I believe your question is answered by considering that the location of the hinge is informed by the loss function.

  • $\varepsilon$ is some error term with some distribution.

Hinge functions can also be useful in changing any line:

$$y = \beta_0 + \beta_1 x + \beta_2 \max(0,\ x - c) + \varepsilon$$

where:

  • $\beta_0$ is the model constant, and the intercept of the curve before the hinge (i.e. for $x < c$). Of course, if $c < 0$, then the curve intersects the $y$-axis after the hinge, so $\beta_0$ will not necessarily be the $y$-intercept of the bent line.
  • $\beta_1$ is the slope of the line relating $y$ to $x$
  • $\beta_2$ is the change in slope after the hinge.

In addition, the hinge can be used to model how a functional relationship between $x$ and $y$ changes form, as in a model where the relationship becomes quadratic after the hinge.
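
A quick numeric illustration of that bent-line form (a pure-Python sketch with made-up coefficients $\beta_0 = 1$, $\beta_1 = 0.5$, $\beta_2 = 2$ and a hinge at $c = 3$): the slope is $\beta_1$ before the hinge and $\beta_1 + \beta_2$ after it.

```python
def bent_line(x, b0=1.0, b1=0.5, b2=2.0, c=3.0):
    # Bent line: intercept b0, slope b1 before the hinge at x = c,
    # and slope b1 + b2 afterwards (b2 is the change in slope).
    return b0 + b1 * x + b2 * max(0.0, x - c)

print(bent_line(2.0))  # 2.0 : before the hinge, 1 + 0.5*2
print(bent_line(4.0))  # 5.0 : after the hinge, 1 + 0.5*4 + 2*(4 - 3)
```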

Find elsewhere
🌐
scikit-learn
scikit-learn.org › stable › modules › generated › sklearn.metrics.hinge_loss.html
hinge_loss — scikit-learn 1.8.0 documentation
L1 AND L2 Regularization for Multiclass Hinge Loss Models by Robert C. Moore, John DeNero. ...
>>> from sklearn import svm
>>> from sklearn.metrics import hinge_loss
>>> X = [[0], [1]]
>>> y = [-1, 1]
>>> est = svm.LinearSVC(random_state=0)
>>> est.fit(X, y)
LinearSVC(random_state=0)
>>> pred_decision = est.decision_function([[-2], [3], [0.5]])
>>> pred_decision
array([-2.18, 2.36, 0.09])
>>> hinge_loss([-1, 1, 1], pred_decision)
0.30
Top answer
1 of 1
23

I will answer one thing at a time

Is an SVM as simple as saying it's a discriminative classifier that simply optimizes the hinge loss?

SVM is simply a linear classifier, optimizing hinge loss with L2 regularization.

Or is it more complex than that?

No, it is "just" that; however, there are different ways of looking at this model that lead to complex, interesting conclusions. In particular, this specific choice of loss function leads to extremely efficient kernelization, which is not true for log loss (logistic regression) or mse (linear regression). Furthermore, you can show very important theoretical properties, such as those related to the Vapnik–Chervonenkis dimension, leading to a smaller chance of overfitting.

Intuitively look at these three common losses:

  • hinge: max(0, 1-py)
  • log: log(1 + exp(-py))
  • mse: (p-y)^2

Only the first one has the property that once something is classified correctly (beyond the margin), it has 0 penalty. All the remaining ones still penalize your linear model even if it classifies samples correctly. Why? Because they are more related to regression than classification: they want a perfect prediction, not just a correct one.
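The contrast is easy to verify numerically. Below is a pure-Python sketch using margin-based forms of the three losses (for the log loss I use $\log(1 + e^{-py})$, the logistic loss on the margin $py$) for a confidently correct prediction:

```python
import math

def hinge(p, y):
    return max(0.0, 1.0 - p * y)

def logloss(p, y):
    # Logistic loss on the margin p*y
    return math.log(1.0 + math.exp(-p * y))

def mse(p, y):
    return (p - y) ** 2

p, y = 2.0, 1  # correctly classified, beyond the margin
print(hinge(p, y))              # 0.0    -- zero penalty past the margin
print(round(logloss(p, y), 4))  # 0.1269 -- still penalized
print(mse(p, y))                # 1.0    -- penalized for not being exactly 1
```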

How do the support vectors come into play?

Support vectors are simply samples placed near the decision boundary (loosely speaking). For the linear case this does not change much, but as most of the power of the SVM lies in its kernelization, SVs are extremely important there. Once you introduce a kernel, due to the hinge loss the SVM solution can be obtained efficiently, and support vectors are the only samples remembered from the training set, thus building a non-linear decision boundary with a subset of the training data.

What about the slack variables?

This is just another formulation of the hinge loss, more useful when you want to kernelize the solution and show convexity.

Why can't you have deep SVMs the way you can have a deep neural network with sigmoid activation functions?

You can; however, as the SVM is not a probabilistic model, its training might be a bit tricky. Furthermore, the whole strength of the SVM comes from its efficiency and global solution, both of which would be lost once you create a deep network. However, there are such models; in particular, the SVM (with squared hinge loss) is nowadays often the choice for the topmost layer of deep networks, so the whole optimization is actually a deep SVM. Adding more layers in between has nothing to do with the SVM or any other cost: those layers are defined completely by their activations, and you can for example use an RBF activation function, but it has been shown numerous times that it leads to weak models (too-local features are detected).

To sum up:

  • there are deep SVMs; this is simply a typical deep neural network with an SVM layer on top.
  • there is no such thing as putting an SVM layer "in the middle", as the training criterion is only applied to the output of the network.
  • using "typical" SVM kernels as activation functions is not popular in deep networks due to their locality (as opposed to the very global relu or sigmoid)
🌐
Analytics Vidhya
analyticsvidhya.com › home › what is hinge loss in machine learning?
What is Hinge loss in Machine Learning?
December 23, 2024 - Hinge loss is pivotal in classification tasks and widely used in Support Vector Machines (SVMs), quantifies errors by penalizing predictions near or across decision boundaries. By promoting robust margins between classes, it enhances model ...
🌐
GitHub
github.com › tejasmhos › Linear-SVM-Using-Squared-Hinge-Loss
GitHub - tejasmhos/Linear-SVM-Using-Squared-Hinge-Loss: This is an implementation, from scratch, of the linear SVM using squared hinge loss · GitHub
This is an implementation of a Linear SVM that uses a squared hinge loss. This algorithm was coded using Python. This is my submission for the polished code release for DATA 558 - Statistical Machine Learning.
Starred by 5 users
Forked by 3 users
Languages   Python
🌐
University of Oxford
robots.ox.ac.uk › ~az › lectures › ml › lect2.pdf pdf
Lecture 2: The SVM classifier
• Support Vector Machine (SVM) classifier · • Wide margin · • Cost function · • Slack variables · • Loss functions revisited · • Optimization · Binary Classification · Given training data (xi, yi) for i = 1 . . . N, with · xi ∈Rd and yi ∈{−1, 1}, learn a classifier f(x) such that ·
🌐
YouTube
youtube.com › rohan-paul-ai
What is the Hinge Loss in SVM in Machine Learning | Data Science Interview Questions - YouTube
What is the Hinge Loss in SVM in Machine LearningThe Hinge Loss is a loss function used in Support Vector Machine (SVM) algorithms for binary classification ...
Published   April 9, 2023
Views   1K
🌐
Soulpageit
soulpageit.com › home
Hinge Loss
June 30, 2023 - Hinge loss is commonly used in SVMs, where the goal is to find the hyperplane that separates the classes with the maximum margin. SVMs aim to minimize this loss while also incorporating a regularization term to control the complexity of the model.
🌐
Medium
koshurai.medium.com › understanding-hinge-loss-in-machine-learning-a-comprehensive-guide-0a1c82478de4
Understanding Hinge Loss in Machine Learning: A Comprehensive Guide | by KoshurAI | Medium
January 12, 2024 - Hinge loss, also known as max-margin loss, is a loss function that is particularly useful for training models in binary classification problems. It is designed to maximize the margin between classes, making it especially effective for support ...
Top answer
1 of 2
1

Hinge loss is difficult to work with when the derivative is needed, because the derivative is a piecewise function. max has one non-differentiable point, and thus so does the derivative. This was a very prominent issue with non-separable cases of SVM (and a good reason to use ridge regression).

Here's a slide (Original source from Zhuowen Tu, apologies for the title typo):

Where hinge loss is defined as max(0, 1-v) and v is the decision value produced by the SVM classifier. More can be found in the Hinge Loss Wikipedia article.

As for your equation: you can easily pick out the v of the equation, however without more context of those functions it's hard to say how to derive. Unfortunately I don't have access to the paper and cannot guide you any further...

2 of 2
1

I disagree with the earlier answer that this is difficult to calculate. If we have the function \begin{align*} \sum_{t\in\mathcal{T}} \max \{0, 1 - d(t) \, y(t, \theta)\} \end{align*} the gradient with respect to $\theta$ is \begin{align*} & \sum_{t\in\mathcal{T}}g(t) \\ & g(t) := \begin{cases} 0 & \text{ if }1 - d(t) y(t, \theta) < 0 \\ -d(t)\dfrac{\partial y}{\partial \theta} & \text{ otherwise} \\ \end{cases} \end{align*} Theoretically this is ok, it just means that the gradient is not continuous. However, the objective is still continuous assuming that $d$ and $y$ are both continuous.

In practice, it's not a problem either. Any automatic differentiation software (tensorflow, pytorch, jax) will handle something like this automatically and correctly.
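That piecewise rule is also easy to sketch directly. Below is a pure-Python illustration assuming (as a simplification) a scalar parameter and a linear model $y(t, \theta) = \theta\, x_t$, so $\partial y / \partial \theta = x_t$:

```python
def hinge_subgradient(theta, xs, ds):
    # Subgradient of sum_t max(0, 1 - d_t * y_t) with y_t = theta * x_t:
    # a term contributes 0 when 1 - d_t * y_t < 0 (beyond the margin),
    # and -d_t * x_t otherwise (since dy/dtheta = x_t).
    g = 0.0
    for x, d in zip(xs, ds):
        if 1.0 - d * theta * x >= 0.0:
            g += -d * x
    return g

xs = [2.0, -0.5, 0.1]
ds = [1, -1, 1]
print(hinge_subgradient(1.0, xs, ds))  # -0.6 : only the two in-margin points contribute
```

The point at x = 2.0 lies beyond the margin (1 - d·θ·x < 0), so it contributes nothing, exactly as in the piecewise definition above.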

🌐
Number Analytics
numberanalytics.com › blog › mastering-hinge-loss-svm-python
Mastering Hinge Loss SVM in Python
June 23, 2025 - You can load the dataset using Scikit-Learn's `datasets` module:

```python
iris = datasets.load_iris()
X = iris.data
y = iris.target
```

To convert this into a binary classification problem, we'll consider only two classes:

```python
X = X[y != 2]
y = y[y != 2]
y = np.where(y == 0, -1, 1)
```

Next, we'll split the dataset into training and testing sets:

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

### Creating a Hinge Loss SVM Classifier using Scikit-Learn

To create a Hinge Loss SVM classifier, you can use Scikit-Learn's `svm.SVC` class with the `kernel` parameter set to `'linear'` and the `C` parameter set to a suitable value.
🌐
GitHub
github.com › zotroneneis › machine_learning_basics › blob › master › support_vector_machines.ipynb
Support vector machines - Hard-Margin SVM
- If a training example ($y = 1$) is on the correct side of the decision hyperplane but lies within the margin (that is, $0 < f(\mathbf{x}) < 1$) the hinge loss will output a positive value.
Author   zotroneneis