🌐
arXiv
arxiv.org › abs › 1206.6442
[1206.6442] Minimizing The Misclassification Error Rate Using a Surrogate Convex Loss
June 27, 2012 - We carefully study how well minimizing convex surrogate loss functions corresponds to minimizing the misclassification error rate for the problem of binary classification with linear predictors. In particular, we show that amongst all convex surrogate losses, the hinge loss gives essentially the best possible bound, of all convex loss functions, for the misclassification error rate of the resulting linear predictor in terms of the best possible margin error rate.
🌐
ICML
icml.cc › 2012 › papers › 917.pdf pdf
Minimizing The Misclassification Error Rate Using a Surrogate Convex Loss
Theorem 4. For any convex loss function φ, we have EG(φ, ν, B) ≥ min{ν(B + 1)/2, 1/2}. … The above theorem and Proposition 2 together show that hinge loss is optimal up to a constant factor of 2.
Concept in machine learning
Loss functions for classification - Wikipedia
In machine learning and mathematical optimization, loss functions for classification are computationally feasible loss functions representing the price paid for inaccuracy of predictions in classification problems (problems of identifying which category a …
🌐
Wikipedia
en.wikipedia.org › wiki › Loss_functions_for_classification
Loss functions for classification - Wikipedia
January 12, 2026 - The Tangent loss is quasi-convex and is bounded for large negative values which makes it less sensitive to outliers. Interestingly, the Tangent loss also assigns a bounded penalty to data points that have been classified "too correctly". This can help prevent over-training on the data set.
🌐
JMLR
jmlr.org › papers › volume10 › xiang09a › xiang09a.pdf pdf
Classification with Gaussians and Convex Loss
The classifier minimizing the misclassification error is called the Bayes rule fc and is given by ... Definition 1 We say that φ : R →R+ is a classifying loss (function) if it is convex, differentiable at
Top answer
1 of 3
19

Some of my thoughts; they may not be correct, though.

"I understand the reason we have such a design (for hinge and logistic loss) is that we want the objective function to be convex."

Convexity is surely a nice property, but I think the most important reason is that we want the objective function to have non-zero derivatives, so that we can make use of the derivatives to solve it. The 0-1 loss has zero derivative almost everywhere, so a gradient-based method gets no signal from it. The objective function can be non-convex, in which case we often just stop at some local optimum or saddle point.
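
To see this concretely, here is a minimal numpy sketch (my own illustration, not part of the original answer) comparing the (sub)derivatives of the 0-1, hinge, and logistic losses as functions of the margin m = y·f(x):

```python
import numpy as np

# Losses written as functions of the margin m = y * f(x).
def zero_one(m):   return (m <= 0).astype(float)         # 0-1 loss
def hinge(m):      return np.maximum(0.0, 1.0 - m)       # hinge loss
def logistic(m):   return np.log1p(np.exp(-m))           # logistic / log loss

# (Sub)derivatives with respect to the margin.
def d_zero_one(m): return np.zeros_like(m)                # zero almost everywhere
def d_hinge(m):    return np.where(m < 1.0, -1.0, 0.0)    # -1 until the margin of 1 is reached
def d_logistic(m): return -1.0 / (1.0 + np.exp(m))        # non-zero for every finite margin

margins = np.array([-2.0, -0.5, 0.5, 2.0])
for name, loss, grad in [("0-1", zero_one, d_zero_one),
                         ("hinge", hinge, d_hinge),
                         ("logistic", logistic, d_logistic)]:
    print(f"{name:8s} loss {np.round(loss(margins), 2)}  grad {np.round(grad(margins), 2)}")
```

A gradient-based optimizer gets no information from the 0-1 column: its derivative is zero whether a point is misclassified or not, so there is no direction in which to move.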

"And interestingly, it also penalizes correctly classified instances if they are weakly classified. It is a really strange design."

I think such a design sort of advises the model not only to make the right predictions, but also to be confident about those predictions. If we don't want correctly classified instances to get punished, we can, for example, move the hinge loss (blue) to the left by 1 so that they no longer incur any loss. But I believe this often leads to worse results in practice.
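
To make the "shift the hinge to the left by 1" idea concrete, here is a small sketch of my own (not from the original answer): the shifted loss max(0, −m) is exactly the perceptron loss, which charges nothing to any correctly classified point, no matter how small its margin.

```python
import numpy as np

def hinge(m):          # standard hinge: still penalizes correct points with margin < 1
    return np.maximum(0.0, 1.0 - m)

def shifted_hinge(m):  # hinge moved left by 1 == perceptron loss
    return np.maximum(0.0, -m)

margins = np.array([-1.0, 0.0, 0.2, 0.9, 1.5])   # m = y * f(x)
print("margin       :", margins)
print("hinge        :", hinge(margins))          # weakly correct points (0.2, 0.9) still pay
print("shifted hinge:", shifted_hinge(margins))  # any correct point pays nothing
```

With the shifted version, any weight vector that merely separates the data drives the loss to zero, so nothing encourages a confident, large-margin fit; the unshifted hinge keeps pushing until every point clears the margin of 1.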

"What are the prices we need to pay by using different 'proxy loss functions', such as hinge loss and logistic loss?"

IMO, by choosing different loss functions we are bringing different assumptions into the model. For example, the logistic regression loss (red) assumes a Bernoulli distribution over the labels, while the MSE loss (green) assumes Gaussian noise; minimizing each loss is maximum likelihood estimation under the corresponding model.
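
A quick numerical sanity check of that correspondence (my own sketch; labels are taken in {−1, +1}, and the MSE case assumes a unit-variance Gaussian): the logistic loss equals the negative log-likelihood of a Bernoulli model with p = σ(f), and the squared loss is the Gaussian negative log-likelihood up to an additive constant.

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.normal(size=5)                     # arbitrary scores f(x)
y = rng.choice([-1.0, 1.0], size=5)        # labels in {-1, +1}

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Logistic loss == Bernoulli negative log-likelihood with p = sigmoid(f).
logistic_loss = np.log1p(np.exp(-y * f))
bernoulli_nll = -np.log(np.where(y > 0, sigmoid(f), 1.0 - sigmoid(f)))
print(np.allclose(logistic_loss, bernoulli_nll))          # True

# Squared loss == Gaussian negative log-likelihood (unit variance) up to a constant.
squared_loss = 0.5 * (y - f) ** 2
gaussian_nll = 0.5 * (y - f) ** 2 + 0.5 * np.log(2.0 * np.pi)   # -log N(y | f, 1)
print(np.ptp(gaussian_nll - squared_loss) < 1e-12)        # difference is a constant offset
```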


Following the least squares vs. logistic regression example in PRML, I added the hinge loss for comparison.

As shown in the figure, the hinge loss and the logistic loss (equivalently, cross-entropy / negative log-likelihood / softplus) give very similar results, because the two loss functions are close (figure below), while MSE is generally more sensitive to outliers. The hinge loss does not always have a unique solution because it is not strictly convex.
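
For reference, here is a rough numerical version of that comparison (my own sketch of the curves in PRML Figure 7.5; the logistic loss is rescaled by 1/ln 2 so it passes through (0, 1) as in that figure):

```python
import numpy as np

# Losses as functions of the margin m = y * f(x), with y in {-1, +1}.
hinge    = lambda m: np.maximum(0.0, 1.0 - m)
logistic = lambda m: np.log1p(np.exp(-m)) / np.log(2.0)   # rescaled log loss
squared  = lambda m: (1.0 - m) ** 2                       # (f - y)^2 rewritten via the margin

margins = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print("margin  :", margins)
print("hinge   :", hinge(margins))                        # [4. 2. 1. 0. 0.]
print("logistic:", np.round(logistic(margins), 2))        # ≈ [4.4 1.89 1. 0.45 0.07]
print("squared :", squared(margins))                      # [16. 4. 1. 0. 4.]
```

Hinge and logistic stay close over the whole range, while the squared loss grows quadratically for badly misclassified points (hence the outlier sensitivity) and even penalizes points that are classified "too correctly" (margin 3 costs 4).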

However, one important property of the hinge loss is that data points on the correct side of, and far from, the decision boundary contribute nothing to the loss; the solution would be the same with those points removed.

The remaining points are called support vectors in the context of SVMs. The SVM additionally uses a regularization term to ensure the maximum-margin property and a unique solution.
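
Here is a small scikit-learn sketch of that support-vector property (my own toy data, assuming scikit-learn is available; an illustration, not part of the original answer): after fitting a linear SVM, we drop every point lying strictly outside the margin and refit, and the solution barely moves.

```python
import numpy as np
from sklearn.svm import SVC

# Hand-made, linearly separable 2-D toy data (hypothetical).
X = np.array([[-2.0, -1.5], [-1.5, -2.0], [-3.0, -1.0], [-2.5, -2.5], [-1.2, -1.1],
              [ 2.0,  1.5], [ 1.5,  2.0], [ 3.0,  1.0], [ 2.5,  2.5], [ 1.2,  1.1]])
y = np.array([-1, -1, -1, -1, -1, 1, 1, 1, 1, 1])

full = SVC(kernel="linear", C=1.0).fit(X, y)

# Points with y * f(x) > 1 sit strictly outside the margin and contribute
# nothing to the hinge loss at the optimum, so they can be discarded.
margins = y * full.decision_function(X)
keep = margins <= 1.0 + 1e-6
reduced = SVC(kernel="linear", C=1.0).fit(X[keep], y[keep])

print("kept", keep.sum(), "of", len(X), "points")
print("full    w =", full.coef_[0], "b =", full.intercept_[0])
print("reduced w =", reduced.coef_[0], "b =", reduced.intercept_[0])
```

The kept points are exactly the support vectors; the ‖w‖² regularization term in the SVM objective is what makes the solution unique and maximum-margin.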

2 of 3
8

Posting a late reply, since there is a very simple answer which has not been mentioned yet.

"What are the prices we need to pay by using different 'proxy loss functions', such as hinge loss and logistic loss?"

When you replace the non-convex 0-1 loss function by a convex surrogate (e.g., the hinge loss), you are actually now solving a different problem than the one you intended to solve (which is to minimize the number of classification mistakes). So you gain computational tractability (the problem becomes convex, meaning you can solve it efficiently with the tools of convex optimization), but in the general case there is actually no way to relate the error of the classifier that minimizes a "proxy" loss to the error of the classifier that minimizes the 0-1 loss. If what you truly cared about was minimizing the number of misclassifications, I argue that this really is a big price to pay.

I should mention that this statement is worst-case, in the sense that it holds for arbitrary data distributions. For some "nice" distributions there are exceptions to this rule. The key example is data distributions that have a large margin with respect to the decision boundary - see Theorem 15.4 in Shalev-Shwartz, Shai, and Shai Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, 2014.
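
A toy numerical illustration of that price (entirely my own construction, not from the answer): on a 1-D data set with a few points lying far out on the positive side, the threshold that minimizes a convex surrogate can be very different from the one that minimizes the 0-1 loss. I use the squared loss to provoke the effect (the hinge loss happens to be immune on this particular data set, which echoes the robustness point in the first answer); the classifier is simply sign(x − b) and everything is found by brute-force search, so no optimization subtleties are involved.

```python
import numpy as np

# 1-D toy data: a clean cluster of each class near the origin, plus two
# positive points far out on the (already correct) positive side.
x = np.array([-5., -4., -3., -2., -1., 1., 2., 3., 4., 5., 30., 40.])
y = np.array([-1., -1., -1., -1., -1., 1., 1., 1., 1., 1., 1., 1.])

def err_01(b):  return np.mean(y * (x - b) <= 0)                    # 0-1 error rate
def hinge(b):   return np.mean(np.maximum(0.0, 1.0 - y * (x - b)))  # convex surrogate
def squared(b): return np.mean((x - b - y) ** 2)                    # convex surrogate (f - y)^2

thresholds = np.linspace(-10.0, 50.0, 6001)        # brute-force search over the threshold b
for name, loss in [("0-1", err_01), ("hinge", hinge), ("squared", squared)]:
    b_best = thresholds[np.argmin([loss(b) for b in thresholds])]
    print(f"{name:8s} minimizer b = {b_best:6.2f}   0-1 error at that b = {err_01(b_best):.2f}")
```

On this data the squared-loss minimizer lands near b ≈ 5.7 and misclassifies the five ordinary positive points (0-1 error ≈ 0.42), even though a threshold near 0 makes no mistakes at all: the surrogate is being optimized, not the quantity we actually care about.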

🌐
Semantic Scholar
semanticscholar.org › papers › minimizing the misclassification error rate using a surrogate convex loss
[PDF] Minimizing The Misclassification Error Rate Using a Surrogate Convex Loss | Semantic Scholar
It is shown that the hinge loss is essentially optimal among all convex losses, and guarantees on the misclassification error of the loss-minimizer in terms of the margin error rate of the best predictor are investigated. We carefully study how well minimizing convex surrogate loss functions corresponds to minimizing the misclassification error rate for the problem of binary classification with linear predictors.
🌐
University of Massachusetts
people.cs.umass.edu › ~akshay › courses › cs690m › files › lec12.pdf pdf
Lecture 12: Surrogate Losses and Calibration Akshay Krishnamurthy
November 1, 2017 - some assurance that minimizing something like the hinge loss will give us a low misclassification rate? ... First of all, observe that the risk associated with the 0/1 loss is non-convex, which suggests that minimizing it
🌐
Ohio State University
asc.ohio-state.edu › lee.2272 › › 881 › consistency.pdf pdf
6 Consistency
(a) Misclassification (0-1) loss: L(f(x), y) = I(yf(x) ≤ 0); (b) Hinge loss for the SVM: L(f(x), y) = (1 − yf(x))+; (c) Negative log-likelihood for logistic regression: L(f(x), y) = log2{1 + exp(−yf(x))}. Let η(x) = P(Y = 1|X = x) and f(x) = log η(x)/(1 − η(x)), the logit function. Then ... Figure 7 compares the margin-based loss functions with the 0-1 loss. They are convex
🌐
arXiv
arxiv.org › html › 2408.08675
Misclassification excess risk bounds for PAC-Bayesian classification via convexified loss
August 16, 2024 - In classification tasks, due to the non-convex nature of the 0-1 loss, a convex surrogate loss is often used, and thus current PAC-Bayesian bounds are primarily specified for this convex surrogate. This work shifts its focus to providing misclassification excess risk bounds for PAC-Bayesian classification when using a convex surrogate loss.
🌐
Proceedings of Machine Learning Research
proceedings.mlr.press › v28 › nguyen13a.pdf pdf
Algorithms for Direct 0–1 Loss Optimization in Binary Classification
losses are plotted in Figure 2. 0–1 loss is robust to outliers since it is not affected by a misclassified point's distance from the margin, but this property also makes it non-convex; the convex squared, hinge, and log losses are not robust to outliers in this way since their …
🌐
Proceedings of Machine Learning Research
proceedings.mlr.press › v32 › yanga14.html
The Coherent Loss Function for Classification
January 27, 2014 - A prediction rule in binary classification that aims to achieve the lowest probability of misclassification involves minimizing over a non-convex, 0-1 loss function, which is typically a computationally intractable optimization problem. To address the intractability, previous methods consider minimizing the cumulative loss – the sum of convex surrogates of the 0-1 loss of each sample.
🌐
Carnegie Mellon University
cs.cmu.edu › ~yandongl › loss.html
Loss Function
Convexity ensures a global minimum and is computationally appealing. ... Figure 7.5 from Chris Bishop's PRML book. The Hinge Loss E(z) = max(0, 1 − z) is plotted in blue, the Log Loss in red, the Square Loss in green and the 0/1 error in black.
🌐
University of California, Berkeley
statistics.berkeley.edu › sites › default › files › tech-reports › 638.pdf pdf
Convexity, Classification, and Risk Bounds Peter L. Bartlett
We first derived a universal upper bound on the population misclassification risk of any thresholded measurable classifier in terms of its corresponding population φ-risk. The bound is governed by the ψ-transform, a convexified variational transform of φ.
🌐
ResearchGate
researchgate.net › figure › Loss-functions-of-the-margin-for-binary-classification-Zero-one-misclassification-loss_fig7_45130375
3.3 Loss functions of the margin for binary classification. Zero-one... | Download Scientific Diagram
July 5, 2022 - ... (surrogate) loss functions given in Table 1 are all convex functions of the margin ỹg which bound the zero-one misclassification loss from above, see Figure 3. The convexity of these surrogate loss functions is computationally important for empirical risk minimization; minimizing the empirical zero-one loss is computationally intractable.
🌐
Upenn
www-stat.wharton.upenn.edu › ~buja › PAPERS › paper-proper-scoring.pdf pdf
Loss Functions for Binary Class Probability Estimation and ...
The partial losses L1 and L0 are bounded below iff α, β > −1. As the following list suggests, the lowest values ever proposed (at least implicitly) are α = β = −1 ... We introduce "tailoring" of strict proper scoring rules to cost-weighted classification. Recall that cost-weighted misclassification loss with costs c and 1 − c for false positives and false
🌐
ScienceDirect
sciencedirect.com › topics › engineering › misclassification-rate
Misclassification Rate - an overview | ScienceDirect Topics
Masnadi-Shirazi and Vasconcelos (2011) propose a novel framework for designing cost-sensitive boosting algorithms that derives cost-sensitive losses in the functional space of convex combinations of weak learners using two necessary conditions; these losses can be minimized through gradient descent and ultimately produce boosting algorithms suitable for imbalanced classification.
🌐
JMLR
jmlr.org › papers › volume17 › 15-115 › 15-115.pdf pdf
Iterative Regularization for Learning with Convex Loss ...