Logarithmic loss minimization leads to well-behaved probabilistic outputs.

Hinge loss leads to some (not guaranteed) sparsity in the dual, but it doesn't help with probability estimation. Instead, it punishes misclassifications (that's why it's so useful for determining margins): a diminishing hinge loss comes with fewer misclassifications across the margin.

So, summarizing:

  • Logarithmic loss ideally leads to better probability estimation at the cost of not actually optimizing for accuracy

  • Hinge loss ideally leads to better accuracy and some sparsity at the cost of not actually estimating probabilities

In ideal scenarios, each method would excel in its own domain (accuracy vs. probability estimation). However, due to the No-Free-Lunch theorem, it is not possible to know, a priori, whether the model choice is optimal.

Answer from Firebug on Stack Exchange
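
A small sketch of that contrast, using scikit-learn (the synthetic data and model settings below are my own illustrative choices, not part of the answer): a model trained with log loss exposes probability estimates, while a hinge-loss model exposes only margins.

    # Log loss (logistic regression) yields probability estimates;
    # hinge loss (linear SVM) yields only signed margins / class labels.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=500, n_features=5, random_state=0)

    logreg = LogisticRegression().fit(X, y)                # minimizes log loss
    svm = LinearSVC(loss="hinge", dual=True).fit(X, y)     # minimizes hinge loss

    x_new = X[:3]
    print(logreg.predict_proba(x_new))    # P(y|x) estimates, a byproduct of log loss
    print(svm.decision_function(x_new))   # distances to the hyperplane, no probabilities
    # LinearSVC has no predict_proba: hinge loss does not model P(y|x).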
Baeldung (baeldung.com)
Differences Between Hinge Loss and Logistic Loss | Baeldung on Computer Science
February 28, 2025 - Secondly, while hinge loss produces a hyperplane separating the classes, it doesn't give us any information on how certain it is about the membership of the samples. Logistic loss, on the other hand, models the conditional probability and thus, the probability estimates are its byproducts.

Another answer from the same Stack Exchange thread

@Firebug had a good answer (+1). In fact, I had a similar question here.

What are the impacts of choosing different loss functions in classification to approximate 0-1 loss

I just want to add more on another big advantage of logistic loss: its probabilistic interpretation. An example can be found in UCLA - Advanced Research - Statistical Methods and Data Analysis - Computing Logit Regression | R Data Analysis Examples.

Specifically, logistic regression is a classical model in the statistics literature. (See What does the name "Logistic Regression" mean? for the naming.) There are many important concepts related to logistic loss, such as maximum likelihood estimation, likelihood ratio tests, and the binomial assumption. Here are some related discussions (a small numerical sketch of the likelihood view follows the links).

Likelihood ratio test in R

Why isn't Logistic Regression called Logistic Classification?

Is there i.i.d. assumption on logistic regression?

Difference between logit and probit models
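
As a small numerical sketch of that probabilistic interpretation (my own illustration, not taken from the links above): the average log loss minimized by logistic regression is exactly the negative mean Bernoulli log-likelihood, so fitting by minimizing log loss is maximum likelihood estimation.

    # Average log loss == negative mean Bernoulli log-likelihood.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import log_loss

    X, y = make_classification(n_samples=300, random_state=1)
    p = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]   # P(y=1 | x)

    # Bernoulli log-likelihood of the observed labels under those probabilities
    loglik = np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    print(log_loss(y, p))   # scikit-learn's (mean) log loss
    print(-loglik)          # identical up to floating point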

Discussions

logistic regression - What's the relationship between an SVM and hinge loss? - Data Science Stack Exchange (datascience.stackexchange.com)
My colleague and I are trying to wrap our heads around the difference between logistic regression and an SVM. Clearly they are optimizing different objective functions. Is an SVM as simple as sayin...
machine learning - What are the impacts of choosing different loss functions in classification to approximate 0-1 loss - Cross Validated (stats.stackexchange.com)
We know that some objective functions are easier to optimize and some are hard. And there are many loss functions that we want to use but are hard to use, for example 0-1 loss. So we find some proxy loss functions to do the work. For example, we use hinge loss or logistic ...
Is support vector machine just about simplifying logistic regression formula? If so, why this name?

No. The main difference between the cost functions is that cross-entropy loss (CEL) penalizes based on how far the prediction is from the answer. So if something is predicted with CEL as class 1 with probability 0.51 and it is actually class 1, it is penalized more strongly than if it had been predicted with probability 0.99; but with the hinge loss used in SVMs, a correct prediction that clears the margin incurs the same (zero) loss whether you barely clear it or predict with high confidence. However, both methods are penalized by "distance" when they predict the wrong answer.

r/learnmachinelearning (reddit.com), July 12, 2020
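
To put numbers on that example (my own arithmetic with the standard formulas; the margins used on the hinge side are illustrative assumptions):

    # Log loss for a correct class-1 prediction at p = 0.51 vs p = 0.99,
    # and hinge loss for correct predictions at different margins.
    import math

    for p in (0.51, 0.99):
        print(f"log loss at p={p}: {-math.log(p):.3f}")
    # ~0.673 at p=0.51 vs ~0.010 at p=0.99: low confidence on the right answer still costs a lot.

    def hinge(margin):           # margin = y * f(x), with y in {-1, +1}
        return max(0.0, 1.0 - margin)

    for m in (1.1, 5.0):         # both correct and beyond the margin
        print(f"hinge loss at margin {m}: {hinge(m):.3f}")
    # Both are 0: once a correct prediction clears the margin, extra confidence changes nothing.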
Why do we use log-loss in logistic regression instead of just taking the absolute difference between expected probability and actual value for each instance?
You can try it and see if it works 🤷‍♂️ Absolute error is usually avoided because it makes a "V"-shaped loss; sharp corners are bad in general for gradient-based optimization. It's the same reason we use MSE or RMSE instead of absolute error for regression tasks.

r/learnmachinelearning (reddit.com), April 26, 2023
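
A tiny sketch of that gradient behaviour (my own illustration of the standard derivatives, not from the thread): the absolute error pushes with the same force no matter how close the prediction is, and flips sign abruptly at the corner, while the log-loss gradient scales with how wrong the prediction is.

    # Gradients with respect to the predicted probability p, for a true label y = 1.
    import numpy as np

    y = 1.0
    p = np.array([0.2, 0.5, 0.9, 0.99])

    # Absolute error |y - p|: gradient is a constant -1 for every p < y,
    # and jumps to +1 for p > y -- the sharp "V" corner sits exactly at p = y.
    abs_grad = -np.sign(y - p)

    # Log loss -log(p): gradient -1/p is large when the prediction is badly wrong
    # and approaches -1 smoothly as p -> 1, with no corner.
    log_grad = -1.0 / p

    print(abs_grad)   # [-1, -1, -1, -1]
    print(log_grad)   # roughly [-5, -2, -1.11, -1.01]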

Wikipedia (en.wikipedia.org)
Loss functions for classification - Wikipedia
January 12, 2026 - The square loss function is both convex and smooth. However, the square loss function tends to penalize outliers excessively, leading to slower convergence rates (with regards to sample complexity) than for the logistic loss or hinge loss functions.

Techkluster (techkluster.com)
Differences Between Hinge Loss and Logistic Loss – TechKluster
Information Gain: Logistic loss can be interpreted as measuring the information gain when the predicted probability aligns with the true label. Support Vector Machines (SVMs): Hinge loss is a natural fit for SVMs, where maximizing the margin between classes is a primary objective.

AIML.com (aiml.com)
How does hinge loss differ from logistic loss? - AIML.com
March 26, 2023 - As can be seen in the graphs above, hinge loss is non-differentiable at the margin, which means that the optimization problem is no longer smooth. Logistic, or cross-entropy, loss does not suffer from such a problem and also allows for the computation of predicted probabilities rather than just class labels, ...

Yuan Du (yuan-du.com)
Loss Functions in Machine Learning and LTR | Yuan Du
August 10, 2022 - The binomial log-likelihood loss ... Hinge loss, exponential loss, and logistic loss have very similar tails, giving zero penalty to points well inside their margin and linear or exponential penalty to points on the wrong side and far away...

Stanford University (cs229.stanford.edu, PDF)
CS229 Supplemental Lecture Notes, John Duchi - Binary classification
... machine learning procedures; in particular, the logistic loss ϕ_logistic is logistic regression, the hinge loss ϕ_hinge gives rise to so-called support vector machines, and the exponential loss gives rise to the classical version of boosting, both of which we will explore in more depth ...

Medium (medium.com)
Understanding loss functions: Hinge loss | by Kunal Chowdhury | Analytics Vidhya | Medium
January 18, 2024 - Almost all classification models are based on some kind of loss. E.g., logistic regression has logistic loss (Fig 4: exponential), SVM has hinge loss (Fig 4: Support Vector), etc.

NISER (niser.ac.in, PDF)
Hinge Loss in Support Vector Machines - Chandan Kumar Sahu and Maitrey Sharma
February 7, 2023 - Figure: The support vector loss function (hinge loss), compared to the negative log-likelihood loss (binomial deviance) for logistic regression, squared-error loss, and a "Huberized" version of the squared hinge loss.

Quora (quora.com)
What are the advantages of hinge loss over log loss? - Quora
Answer: Hinge loss is easier to compute than log loss. Ditto for its derivative or subgradient. Hinge loss also induces sparsity in the solution, if the ML weights are a linear combination of the training observations.

Quora (quora.com)
What is the advantage/disadvantage of Hinge-loss compared to cross-entropy? - Quora
Answer (1 of 2): Cross Entropy (or Log Loss), Hinge Loss (SVM Loss), Squared Loss etc. are different forms of loss functions. Log Loss in the classification context gives Logistic Regression, while the Hinge Loss is Support Vector Machines. Logistic ...

DataCamp (datacamp.com)
Loss Functions in Machine Learning Explained | DataCamp
December 4, 2024 - When BCE is utilized as a component ... training. Hinge Loss is a loss function utilized within machine learning to train classifiers that optimize to increase the margin between data points and the decision boundary....

Analytics Vidhya (analyticsvidhya.com)
What is Hinge loss in Machine Learning?
December 23, 2024 - Hinge loss in machine learning, a key loss function in SVMs, enhances model robustness by penalizing incorrect or marginal predictions.

Kaggle (kaggle.com)
Hands-On Guide To Loss Functions

Number Analytics (numberanalytics.com)
Hinge Loss: The Ultimate Guide for ML Practitioners
June 11, 2025 - Hinge loss is particularly useful when the goal is to maximize the margin between classes, whereas logistic loss is more suitable for problems that require a probabilistic interpretation. Hinge loss is widely used in binary and multi-class classification problems.

Texas A&M University (people.tamu.edu, PDF)
A Unified View of Loss Functions in Supervised Learning - Shuiwang Ji
The loss function used in logistic regression is $\ell(y_i, s_i) = \log(1 + e^{-y_i s_i})$. Hinge loss (support vector machines): the support vector machines employ hinge loss to obtain a classifier ...

Rohanvarma (rohanvarma.me)
Picking Loss Functions - A comparison between MSE, Cross Entropy, and Hinge Loss
The MSE loss is therefore better suited to regression problems, and the cross-entropy loss provides us with faster learning when our predictions differ significantly from our labels, as is generally the case during the first several iterations of model training. We've also compared and contrasted the cross-entropy loss and hinge loss, and discussed how using one over the other leads to our models learning in different ways.

Wikipedia (en.wikipedia.org)
Hinge loss - Wikipedia
January 26, 2026 - The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs). For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined as $\ell(y) = \max(0, 1 - t \cdot y)$.
Top answer (1 of 3) to "What are the impacts of choosing different loss functions in classification to approximate 0-1 loss" (Cross Validated)

Some of my thoughts, may not be correct though.

I understand the reason we have such design (for hinge and logistic loss) is we want the objective function to be convex.

Convexity is surely a nice property, but I think the most important reason is we want the objective function to have non-zero derivatives, so that we can make use of the derivatives to solve it. The objective function can be non-convex, in which case we often just stop at some local optima or saddle points.

and interestingly, it also penalizes correctly classified instances if they are weakly classified. It is a really strange design.

I think such a design sort of advises the model to not only make the right predictions, but also be confident about those predictions. If we don't want correctly classified instances to get punished, we can, for example, move the hinge loss (blue) to the left by 1, so that they no longer incur any loss. But I believe this often leads to worse results in practice.

what are the prices we need to pay by using different "proxy loss functions", such as hinge loss and logistic loss?

IMO, by choosing different loss functions we are bringing different assumptions into the model. For example, the logistic regression loss (red) assumes a Bernoulli distribution, while the MSE loss (green) assumes Gaussian noise.


Following the least squares vs. logistic regression example in PRML, I added the hinge loss for comparison.

As shown in the figure, hinge loss and logistic regression / cross entropy / log-likelihood / softplus have very close results, because their objective functions are close (figure below), while MSE is generally more sensitive to outliers. Hinge loss does not always have a unique solution because it's not strictly convex.

However, one important property of hinge loss is that data points far away from the decision boundary (on the correct side of the margin) contribute nothing to the loss, so the solution will be the same with those points removed.

In the context of SVMs, the remaining points are called support vectors. The SVM additionally uses a regularizer term to ensure the maximum-margin property and a unique solution.
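
As a quick numerical check of that support-vector property (a sketch using scikit-learn; the blob dataset and C value are my own illustrative choices): refitting a linear SVM on its support vectors alone should reproduce essentially the same decision boundary.

    # Points outside the margin carry zero hinge loss and zero dual weight,
    # so dropping them should not change the fitted hyperplane.
    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    X, y = make_blobs(n_samples=200, centers=2, random_state=0)

    full = SVC(kernel="linear", C=1.0).fit(X, y)
    sv = full.support_                                   # indices of the support vectors
    refit = SVC(kernel="linear", C=1.0).fit(X[sv], y[sv])

    print("full fit: w =", full.coef_[0], " b =", full.intercept_[0])
    print("SV only:  w =", refit.coef_[0], " b =", refit.intercept_[0])
    # The two solutions should agree up to solver tolerance.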

Another answer (2 of 3) to the same question

Posting a late reply, since there is a very simple answer which has not been mentioned yet.

what are the prices we need to pay by using different "proxy loss functions", such as hinge loss and logistic loss?

When you replace the non-convex 0-1 loss function by a convex surrogate (e.g. hinge loss), you are now actually solving a different problem than the one you intended to solve (which is to minimize the number of classification mistakes). So you gain computational tractability (the problem becomes convex, meaning you can solve it efficiently using tools of convex optimization), but in the general case there is actually no way to relate the error of the classifier that minimizes a "proxy" loss to the error of the classifier that minimizes the 0-1 loss. If what you truly cared about was minimizing the number of misclassifications, I argue that this really is a big price to pay.

I should mention that this statement is worst-case, in the sense that it holds for any distribution $\mathcal D$. For some "nice" distributions, there are exceptions to this rule. The key example is of data distributions that have large margins w.r.t the decision boundary - see Theorem 15.4 in Shalev-Shwartz, Shai, and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
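
One standard fact worth adding here (a supplement of mine, not part of the original answer): the hinge loss does at least upper-bound the 0-1 loss pointwise, since $\mathbf{1}[y f(x) \le 0] \le \max(0, 1 - y f(x))$, and therefore

$$ \mathbb{E}\,\mathbf{1}[Y f(X) \le 0] \;\le\; \mathbb{E}\,\max(0, 1 - Y f(X)). $$

So driving the surrogate risk down does control the misclassification rate from above; the catch, as this answer explains, is that the minimizer of the bound need not be the minimizer of the 0-1 risk itself.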