Logarithmic loss minimization leads to well-behaved probabilistic outputs.

Hinge loss leads to some (not guaranteed) sparsity in the dual, but it doesn't help with probability estimation. Instead, it penalizes misclassifications (which is why it is so useful for determining margins): driving the hinge loss down drives down the number of points that violate the margin.

So, summarizing:

  • Logarithmic loss ideally leads to better probability estimation at the cost of not actually optimizing for accuracy

  • Hinge loss ideally leads to better accuracy and some sparsity at the cost of not actually estimating probabilities

In ideal scenarios, each respective method would excel in their domain (accuracy vs probability estimation). However, due to the No-Free-Lunch Theorem, it is not possible to know, a priori, if the model choice is optimal.
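As a concrete sketch of the summary, here are the two losses evaluated on a few margins $m = yf(x)$ (illustrative values only): hinge loss is exactly zero once the margin reaches 1, which is where the dual sparsity comes from, while logarithmic loss stays positive everywhere, so every point keeps pulling the fit toward calibrated probabilities.

```python
import numpy as np

m = np.array([-2.0, -0.5, 0.0, 0.5, 1.0, 2.0, 5.0])  # margins m = y * f(x)
hinge = np.maximum(0.0, 1.0 - m)      # exactly zero for every point at margin >= 1
log_loss = np.log1p(np.exp(-m))       # log(1 + e^{-m}); strictly positive for all m
```

Both losses decrease with the margin, but only the hinge loss actually reaches zero; the points with zero hinge loss can be discarded without changing the SVM solution.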

Answer from Firebug on Stack Exchange

Answer 2 of 3 (score 6)

@Firebug had a good answer (+1). In fact, I had a similar question here.

What are the impacts of choosing different loss functions in classification to approximate 0-1 loss

I just want to add more on another big advantage of logistic loss: its probabilistic interpretation. An example can be found in UCLA's Logit Regression | R Data Analysis Examples.

Specifically, logistic regression is a classical model in the statistics literature. (See What does the name "Logistic Regression" mean? for the naming.) There are many important concepts related to logistic loss, such as maximum likelihood estimation, likelihood ratio tests, and the binomial distribution assumptions. Here are some related discussions.

Likelihood ratio test in R

Why isn't Logistic Regression called Logistic Classification?

Is there i.i.d. assumption on logistic regression?

Difference between logit and probit models
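The probabilistic interpretation can be stated as an identity: minimizing logistic loss over margins $yf(x)$ is exactly maximum likelihood estimation for a Bernoulli model with $P(y=1 \mid x) = \sigma(f(x))$, since $-\log \sigma(yf) = \log(1 + e^{-yf})$. A small numerical check on made-up scores and labels:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
f = 3.0 * rng.normal(size=1000)           # raw classifier scores (made-up)
y = rng.choice([-1.0, 1.0], size=1000)    # labels in {-1, +1}

# Bernoulli negative log-likelihood with P(y=1|x) = sigmoid(f) ...
nll = -np.log(sigmoid(y * f))
# ... equals the logistic loss evaluated on the margin y*f.
log_loss = np.log1p(np.exp(-y * f))
```

So fitting under logistic loss is the same computation as maximizing the Bernoulli log likelihood, which is why the outputs inherit a probability interpretation.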

Top answer, 1 of 3 (score 19)

Some of my thoughts; they may not be correct, though.

I understand the reason we have such a design (for hinge and logistic loss) is that we want the objective function to be convex.

Convexity is surely a nice property, but I think the most important reason is we want the objective function to have non-zero derivatives, so that we can make use of the derivatives to solve it. The objective function can be non-convex, in which case we often just stop at some local optima or saddle points.

and interestingly, it also penalizes correctly classified instances if they are weakly classified. It is a really strange design.

I think such design sort of advises the model to not only make the right predictions, but also be confident about the predictions. If we don't want correctly classified instances to get punished, we can for example, move the hinge loss (blue) to the left by 1, so that they no longer get any loss. But I believe this often leads to worse result in practice.

what are the prices we need to pay by using different "proxy loss functions", such as hinge loss and logistic loss?

IMO by choosing different loss functions we are bringing different assumptions to the model. For example, the logistic regression loss (red) assumes a Bernoulli distribution, and the MSE loss (green) assumes Gaussian noise.


Following the least squares vs. logistic regression example in PRML, I added the hinge loss for comparison.

As shown in the figure, hinge loss and logistic regression / cross-entropy / log-likelihood / softplus give very close results, because their objective functions are close (figure below), while MSE is generally more sensitive to outliers. Hinge loss does not always have a unique solution because it is not strictly convex.

However, one important property of hinge loss is that data points far away from the decision boundary contribute nothing to the loss; the solution would be the same with those points removed.

The remaining points are called support vectors in the context of SVM. The SVM additionally uses a regularization term to ensure the maximum-margin property and a unique solution.
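This zero-contribution property is easy to verify directly: under hinge loss, any point with margin $\ge 1$ adds nothing to the loss or to its subgradient, so deleting such points changes neither. A minimal sketch with made-up data and a made-up candidate weight vector:

```python
import numpy as np

# Made-up 2-D data (third column is a bias feature) and a candidate weight vector.
X = np.array([[ 2.0,  1.0, 1.0],
              [ 3.0, -1.0, 1.0],
              [ 0.2,  0.1, 1.0],
              [-2.5,  0.5, 1.0],
              [-0.1, -0.3, 1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0])
w = np.array([1.0, 0.2, 0.0])

margins = y * (X @ w)
losses = np.maximum(0.0, 1.0 - margins)

def hinge_subgradient(X, y, w):
    # Only points violating the margin (y * w.x < 1) contribute a subgradient term.
    active = y * (X @ w) < 1.0
    return -(y[active, None] * X[active]).sum(axis=0)

far = margins >= 1.0                          # comfortably correct points
g_all = hinge_subgradient(X, y, w)
g_kept = hinge_subgradient(X[~far], y[~far], w)
```

The far points have exactly zero loss, and the subgradient computed with and without them is identical, which is the mechanism behind support vectors.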

Answer 2 of 3 (score 8)

Posting a late reply, since there is a very simple answer which has not been mentioned yet.

what are the prices we need to pay by using different "proxy loss functions", such as hinge loss and logistic loss?

When you replace the non-convex 0-1 loss function by a convex surrogate (e.g. hinge loss), you are actually now solving a different problem than the one you intended to solve (which is to minimize the number of classification mistakes). So you gain computational tractability (the problem becomes convex, meaning you can solve it efficiently using the tools of convex optimization), but in the general case there is actually no way to relate the error of the classifier that minimizes a "proxy" loss to the error of the classifier that minimizes the 0-1 loss. If what you truly cared about was minimizing the number of misclassifications, I argue that this really is a big price to pay.

I should mention that this statement is worst-case, in the sense that it holds for any distribution $\mathcal D$. For some "nice" distributions, there are exceptions to this rule. The key example is of data distributions that have large margins w.r.t the decision boundary - see Theorem 15.4 in Shalev-Shwartz, Shai, and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
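One relation does hold in general, and it is worth being explicit about: the hinge loss upper-bounds the 0-1 loss pointwise, $\max(0, 1-m) \ge \mathbb{1}[m \le 0]$, so the expected surrogate risk caps the expected 0-1 risk from above; what is missing in general is the reverse guarantee. A one-line check over a grid of margins:

```python
import numpy as np

m = np.linspace(-5.0, 5.0, 1001)     # margins y * f(x)
zero_one = (m <= 0).astype(float)    # 0-1 loss (counting m = 0 as an error)
hinge = np.maximum(0.0, 1.0 - m)     # convex surrogate
```

Since the bound holds pointwise, it holds in expectation under any distribution over margins.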

Top answer, 1 of 2 (score 5)

The example attributed to Olivier Bousquet doesn't work as clearly as depicted in the blog post mentioned in an answer to a related question, but I have been able to make it work to some extent in a more realistic setting, so I'll post it here in case it stimulates further (hopefully simpler or more informative) examples.

The (of course) adversarial learning task is shown below. The probability of membership of the positive class is given by

$$p(\mathcal{C}_+ \mid x) = \frac{0.5}{1 + \exp(-100x)} + 0.25 + 0.24\sin(20x)$$

Note there are features of the true probability of class membership that do not affect the decision boundary, so any model may be distracted into modelling those irrelevant undulations at the expense of accurately determining the optimal decision boundary. To demonstrate this, Bousquet's example builds a series of logistic-regression-style models of increasing complexity, based on Legendre polynomials (for numerical reasons). Here are the first seven basis functions:

The blog example is implemented in Mathematica, which is a language I don't know, but I have been able to replicate their results in MATLAB tolerably well. What I think they have done is to fit these Legendre polynomial models first using the cross-entropy metric and then using the hinge loss, but rather strangely they have fitted them directly to the true (sampled) probability of class membership, i.e. the response values lie between 0 and 1. Using the cross-entropy, I get this result:

This is broadly the same as the Mathematica implementation. Note that in attempting to model the undulations in the probability of class membership, the model has overshot on occasion, and so the accuracy is lower than we would get by simply placing a threshold at $x = 0$.

It wasn't completely clear how to implement the model with the hinge loss, as the logistic function clips the output to lie in the range 0 to 1, so instead I applied the hinge loss to the weighted sum of the Legendre basis functions and applied the logistic function afterwards. This is the result:

All the models now achieve the optimal accuracy, although the estimates of the posterior probability of class membership are clearly inferior (if not actually plain wrong!).

HOWEVER, this is not what we actually do when we have a classification task. If we knew the optimal posterior probability of class membership to determine the targets for the training data, we probably wouldn't need to build a classifier in the first place! So I then modified the code so that, instead of the response values being the sampled true probability, I generated random x values (uniformly distributed from -1 to +1) and then generated binary responses according to the probability of class membership.
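For reference, that label-sampling step can be sketched as follows (the constants come from the class-probability formula earlier in this answer; the seed and sample size here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024
x = rng.uniform(-1.0, 1.0, size=n)

# True P(class +1 | x): the "wavy step" from the task definition above.
p_pos = 0.5 / (1.0 + np.exp(-100.0 * x)) + 0.25 + 0.24 * np.sin(20.0 * x)

# Binary responses drawn from the conditional Bernoulli distribution.
y = np.where(rng.uniform(size=n) < p_pos, 1, -1)
```

Unlike fitting to the sampled probabilities directly, this produces positively and negatively labelled points on both sides of $x = 0$, which is what changes the behaviour of the hinge loss below.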

This is the result for the cross-entropy error metric, which is pretty much the same as before.

Here is the result for the hinge loss, which is very different.

So why the difference? Well, in the blog version, if we set the weight of the "linear" term to a very high value, then the weighted sum will be less than -1 for all of the data to the left of $x = 0$ and greater than +1 for all of the data to the right. In that case the hinge loss will be zero, and we will get a classifier with minimal error regardless of model complexity (the hinge loss cannot be negative). However, if we sample the labels from a conditional Bernoulli distribution, we will have both positively and negatively labelled data on both sides of $x = 0$, and points on the wrong side of $x = 0$ will have a non-zero hinge loss, so we will start penalising models with a large linear term increasingly harshly. The hinge-loss model does have some excess error caused by trying to model the right-most undulation, but it does seem to be more robust in terms of accuracy than the cross-entropy loss. So it is an example of how trying to classify the data directly, rather than estimating probabilities and then thresholding, just isn't as clear-cut as the example in the blog suggests.

Update #1: Here are the results with a larger dataset (so sampling noise is less likely to be a factor). For the previous results I used 1024 training patterns, and for these I used 65536 (I work in a computer science department ;o). It seems to improve things a bit for the hinge loss, but the cross-entropy results look broadly similar.

Cross-entropy loss:

Hinge loss:

It is interesting (i.e. worrying) that for some of the simpler models, the output does not go through $(0, 1/2)$...

FWIW, this is the most complex of the hinge-loss models without the logistic transformation (but with an offset of 0.5 to make it easier to compare with the probabilities).

Answer 2 of 2 (score 4)

The results you shared are fascinating. Here's some more exploration in a similar direction.

The idea behind the Bousquet example is to create a situation where optimizing the logistic loss prioritizes fitting the underlying probability distribution at the expense of accuracy. But, it's not clear to me that accuracy and fit to the distribution would have to be opposed here. For example, it seems like a model that exactly matches the underlying distribution should yield both optimal accuracy and optimal logistic loss (at least in expectation).

I'll build a similar example that tries to simplify things, with only two models in the hypothesis space. One gives better accuracy but worse fit to the underlying distribution, and the other does the opposite. The tension between these two objectives is explicitly baked into the problem. I'll work directly with expected losses, so issues related to finite samples and/or optimization won't play any role.

True distribution

Suppose each point $x$ is drawn i.i.d. from the uniform distribution on $[-1, 1]$ and its class label $y \in \{-1, +1\}$ is drawn from a Bernoulli distribution $p(y \mid x)$. Similar to the Bousquet example, the conditional probability of the positive class is a 'wavy step function' (see plot below):

$$p(y=1 \mid x) = .598 \, \sigma(100 x) + .201 + .2 \sin(20 x)$$

where $\sigma$ is the logistic sigmoid function.

Models

Suppose our hypothesis space contains only two models (with no free parameters so 'fitting' means choosing one or the other):

The first model is a step function:

$$\hat{p}_1(y=1 \mid x) = \begin{cases} \sigma(1) & x \ge 0 \\ \sigma(-1) & x < 0 \\ \end{cases}$$

The second is a wavy step function, similar to the true distribution, but with slightly different parameters:

$$\hat{p}_2(y=1 \mid x) = .398 \, \sigma(100 x) + .301 + .3 \sin(20 x)$$

Where needed (e.g. for computing the hinge loss), 'raw' classifier outputs are computed as $f(x) = \operatorname{logit}(\hat{p}(y=1 \mid x))$. Since we're interested in accuracy, point predictions are computed as the mode of the predicted distribution over class labels (equivalent to the sign of the 'raw' output), which is the optimal decision under the 0-1 loss.

Note that the hypothesis space doesn't contain the true distribution. We're forced to choose between two approximations that make different tradeoffs. The first model (step function) is designed to make more accurate point predictions, at the expense of fit to the underlying distribution. In contrast, the second model (wavy step) is designed to match the underlying distribution better, at the expense of accuracy. To confirm that these tradeoffs are indeed present, the table below shows the expected 0-1 loss (better for model 1) and expected KL divergence from the true distribution to the model (better for model 2).

Hinge vs. logistic loss

Here are the expected losses for each model, where the expectation is taken w.r.t. the true data generating process (calculated by numerical integration). The best model according to each loss function is shown in bold+parentheses:

$$\begin{array}{lcc} & \text{Model 1 (step)} & \text{Model 2 (wavy step)} \\ \text{0-1} & \mathbf{(.208)} & .269 \\ \text{KL} & .113 & \mathbf{(.044)} \\ \text{Hinge} & \mathbf{(.416)} & .580 \\ \text{Logistic} & .521 & \mathbf{(.474)} \\ \end{array}$$

Suppose we choose a model from our hypothesis space by minimizing the expected loss (i.e. what empirical risk minimization tries to do by proxy). As the table shows, minimizing the hinge loss gives the first model (prioritizing accuracy), whereas minimizing the logistic loss gives the second model (prioritizing fit to the underlying distribution).
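The preference pattern in the table can be checked by numerical integration. Below is a sketch with one assumption flagged loudly: the additive offsets in the two class-probability formulas are taken as +.201 and +.301 (with minus signs the expressions go negative and stop being probabilities). Expectations over $x \sim \mathrm{Uniform}[-1, 1]$ are approximated by averaging over a dense grid:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p) - np.log(1.0 - p)

# Dense grid standing in for x ~ Uniform[-1, 1].
x = np.linspace(-1.0, 1.0, 200_001)

# True P(y=1|x) and the two candidate models (offsets assumed positive).
p_true = 0.598 * sigma(100 * x) + 0.201 + 0.2 * np.sin(20 * x)
p1 = np.where(x >= 0, sigma(1.0), sigma(-1.0))               # model 1: step
p2 = 0.398 * sigma(100 * x) + 0.301 + 0.3 * np.sin(20 * x)   # model 2: wavy step

def expected(loss_pos, loss_neg):
    # E over x and y ~ Bernoulli(p_true(x)) of a loss given per-class values.
    return float(np.mean(p_true * loss_pos + (1 - p_true) * loss_neg))

def all_losses(p_hat):
    f = logit(p_hat)  # raw score; the point prediction is sign(f)
    return {
        "0-1":      expected((f < 0).astype(float), (f > 0).astype(float)),
        "KL":       expected(np.log(p_true / p_hat),
                             np.log((1 - p_true) / (1 - p_hat))),
        "hinge":    expected(np.maximum(0.0, 1.0 - f), np.maximum(0.0, 1.0 + f)),
        "logistic": expected(np.log1p(np.exp(-f)), np.log1p(np.exp(f))),
    }

m1, m2 = all_losses(p1), all_losses(p2)
```

Under these assumed constants, hinge and 0-1 prefer model 1 while logistic and KL prefer model 2. Two structural facts fall out as well: model 1's raw outputs are exactly ±1, so each mistake costs hinge loss 2 and each correct point 0, making its expected hinge loss exactly twice its expected 0-1 loss; and since expected logistic loss equals the (model-independent) expected entropy of the true conditional plus the expected KL term, the logistic ranking must agree with the KL ranking.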

Notes

In contrast to this example, accuracy and fit to the underlying distribution aren't always opposed. Even when such a conflict exists, the hinge and logistic losses may not necessarily behave as shown above.

For example, the amplitude of model 1's step function matters. It doesn't affect the accuracy (which only depends on the sign), but it does affect the hinge loss. Increasing the amplitude too far (overconfidence) incurs increasing penalties for misclassified points. And, shrinking it too far (underconfidence) penalizes correct predictions, which increasingly fall inside the margin. In both cases, the hinge loss will eventually favor the second model, thereby accepting a decrease in accuracy. This emphasizes that: 1) the hinge loss doesn't always agree with the 0-1 loss (it's only a convex surrogate) and 2) the effects in question depend on the hypothesis space.

In practice, I'd bet that regularization plays an important role too, together with the model selection algorithm. For example, regularization strength is often chosen to maximize validation set accuracy. Even if the logistic loss is used to fit the parameters, using the 0-1 loss for model selection might sacrifice fit to the underlying distribution in favor of accuracy.
