🌐
arXiv
arxiv.org › abs › 1206.6442
[1206.6442] Minimizing The Misclassification Error Rate Using a Surrogate Convex Loss
June 27, 2012 - We carefully study how well minimizing convex surrogate loss functions corresponds to minimizing the misclassification error rate for the problem of binary classification with linear predictors. In particular, we show that amongst all convex surrogate losses, the hinge loss gives essentially the best possible bound, of all convex loss functions, for the misclassification error rate of the resulting linear predictor in terms of the best possible margin error rate.
🌐
ICML
icml.cc › 2012 › papers › 917.pdf pdf
Minimizing The Misclassification Error Rate Using a Surrogate Convex Loss
Theorem 4. For any convex loss function φ, we have EG(φ, ν, B) ≥ min{ν(B + 1)/2, 1/2}. … The above theorem and Proposition 2 together show that hinge loss is optimal up to a constant factor of 2.
Concept in machine learning
Loss functions for classification - Wikipedia
In machine learning and mathematical optimization, loss functions for classification are computationally feasible loss functions representing the price paid for inaccuracy of predictions in classification problems (problems of identifying which category a …
🌐
Wikipedia
en.wikipedia.org › wiki › Loss_functions_for_classification
Loss functions for classification - Wikipedia
January 12, 2026 - The Tangent loss is quasi-convex and is bounded for large negative values which makes it less sensitive to outliers. Interestingly, the Tangent loss also assigns a bounded penalty to data points that have been classified "too correctly". This can help prevent over-training on the data set.
🌐
JMLR
jmlr.org › papers › volume10 › xiang09a › xiang09a.pdf pdf
Classification with Gaussians and Convex Loss
The classifier minimizing the misclassification error is called the Bayes rule fc and is given by ... Definition 1 We say that φ : R →R+ is a classifying loss (function) if it is convex, differentiable at
Top answer
1 of 3
19

Some of my thoughts; they may not be correct, though.

"I understand the reason we have such a design (for hinge and logistic loss) is that we want the objective function to be convex."

Convexity is surely a nice property, but I think the most important reason is that we want the objective function to have non-zero derivatives, so that we can make use of the derivatives to solve it. The 0-1 loss has zero derivative almost everywhere, so a gradient-based method gets no signal from it. The objective function can be non-convex, in which case we often just stop at some local optimum or saddle point.
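
To see this concretely, here is a minimal numpy sketch (my own illustration, not part of the original answer) comparing the (sub)derivatives of the 0-1, hinge, and logistic losses as functions of the margin m = y·f(x):

```python
import numpy as np

# Losses written as functions of the margin m = y * f(x).
def zero_one(m):   return (m <= 0).astype(float)         # 0-1 loss
def hinge(m):      return np.maximum(0.0, 1.0 - m)       # hinge loss
def logistic(m):   return np.log1p(np.exp(-m))           # logistic / log loss

# (Sub)derivatives with respect to the margin.
def d_zero_one(m): return np.zeros_like(m)                # zero almost everywhere
def d_hinge(m):    return np.where(m < 1.0, -1.0, 0.0)    # -1 until the margin of 1 is reached
def d_logistic(m): return -1.0 / (1.0 + np.exp(m))        # non-zero for every finite margin

margins = np.array([-2.0, -0.5, 0.5, 2.0])
for name, loss, grad in [("0-1", zero_one, d_zero_one),
                         ("hinge", hinge, d_hinge),
                         ("logistic", logistic, d_logistic)]:
    print(f"{name:8s} loss {np.round(loss(margins), 2)}  grad {np.round(grad(margins), 2)}")
```

A gradient-based optimizer gets no information from the 0-1 column: its derivative is zero whether a point is misclassified or not, so there is no direction in which to move.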

"And interestingly, it also penalizes correctly classified instances if they are weakly classified. It is a really strange design."

I think such a design sort of advises the model not only to make the right predictions, but also to be confident about those predictions. If we don't want correctly classified instances to get punished, we can, for example, move the hinge loss (blue) to the left by 1 so that they no longer incur any loss. But I believe this often leads to worse results in practice.
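
To make the "shift the hinge to the left by 1" idea concrete, here is a small sketch of my own (not from the original answer): the shifted loss max(0, −m) is exactly the perceptron loss, which charges nothing to any correctly classified point, no matter how small its margin.

```python
import numpy as np

def hinge(m):          # standard hinge: still penalizes correct points with margin < 1
    return np.maximum(0.0, 1.0 - m)

def shifted_hinge(m):  # hinge moved left by 1 == perceptron loss
    return np.maximum(0.0, -m)

margins = np.array([-1.0, 0.0, 0.2, 0.9, 1.5])   # m = y * f(x)
print("margin       :", margins)
print("hinge        :", hinge(margins))          # weakly correct points (0.2, 0.9) still pay
print("shifted hinge:", shifted_hinge(margins))  # any correct point pays nothing
```

With the shifted version, any weight vector that merely separates the data drives the loss to zero, so nothing encourages a confident, large-margin fit; the unshifted hinge keeps pushing until every point clears the margin of 1.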

"What are the prices we need to pay by using different 'proxy loss functions', such as hinge loss and logistic loss?"

IMO, by choosing different loss functions we are bringing different assumptions into the model. For example, the logistic regression loss (red) assumes a Bernoulli distribution over the labels, while the MSE loss (green) assumes Gaussian noise; minimizing each loss is maximum likelihood estimation under the corresponding model.
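
A quick numerical sanity check of that correspondence (my own sketch; labels are taken in {−1, +1}, and the MSE case assumes a unit-variance Gaussian): the logistic loss equals the negative log-likelihood of a Bernoulli model with p = σ(f), and the squared loss is the Gaussian negative log-likelihood up to an additive constant.

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.normal(size=5)                     # arbitrary scores f(x)
y = rng.choice([-1.0, 1.0], size=5)        # labels in {-1, +1}

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Logistic loss == Bernoulli negative log-likelihood with p = sigmoid(f).
logistic_loss = np.log1p(np.exp(-y * f))
bernoulli_nll = -np.log(np.where(y > 0, sigmoid(f), 1.0 - sigmoid(f)))
print(np.allclose(logistic_loss, bernoulli_nll))          # True

# Squared loss == Gaussian negative log-likelihood (unit variance) up to a constant.
squared_loss = 0.5 * (y - f) ** 2
gaussian_nll = 0.5 * (y - f) ** 2 + 0.5 * np.log(2.0 * np.pi)   # -log N(y | f, 1)
print(np.ptp(gaussian_nll - squared_loss) < 1e-12)        # difference is a constant offset
```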


Following the least squares vs. logistic regression example in PRML, I added the hinge loss for comparison.

As shown in the figure, the hinge loss and the logistic loss (equivalently, cross-entropy / negative log-likelihood / softplus) give very similar results, because the two loss functions are close (figure below), while MSE is generally more sensitive to outliers. The hinge loss does not always have a unique solution because it is not strictly convex.
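
For reference, here is a rough numerical version of that comparison (my own sketch of the curves in PRML Figure 7.5; the logistic loss is rescaled by 1/ln 2 so it passes through (0, 1) as in that figure):

```python
import numpy as np

# Losses as functions of the margin m = y * f(x), with y in {-1, +1}.
hinge    = lambda m: np.maximum(0.0, 1.0 - m)
logistic = lambda m: np.log1p(np.exp(-m)) / np.log(2.0)   # rescaled log loss
squared  = lambda m: (1.0 - m) ** 2                       # (f - y)^2 rewritten via the margin

margins = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print("margin  :", margins)
print("hinge   :", hinge(margins))                        # [4. 2. 1. 0. 0.]
print("logistic:", np.round(logistic(margins), 2))        # ≈ [4.4 1.89 1. 0.45 0.07]
print("squared :", squared(margins))                      # [16. 4. 1. 0. 4.]
```

Hinge and logistic stay close over the whole range, while the squared loss grows quadratically for badly misclassified points (hence the outlier sensitivity) and even penalizes points that are classified "too correctly" (margin 3 costs 4).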

However, one important property of the hinge loss is that data points on the correct side of, and far from, the decision boundary contribute nothing to the loss; the solution would be the same with those points removed.

The remaining points are called support vectors in the context of SVMs. The SVM additionally uses a regularization term to ensure the maximum-margin property and a unique solution.
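
Here is a small scikit-learn sketch of that support-vector property (my own toy data, assuming scikit-learn is available; an illustration, not part of the original answer): after fitting a linear SVM, we drop every point lying strictly outside the margin and refit, and the solution barely moves.

```python
import numpy as np
from sklearn.svm import SVC

# Hand-made, linearly separable 2-D toy data (hypothetical).
X = np.array([[-2.0, -1.5], [-1.5, -2.0], [-3.0, -1.0], [-2.5, -2.5], [-1.2, -1.1],
              [ 2.0,  1.5], [ 1.5,  2.0], [ 3.0,  1.0], [ 2.5,  2.5], [ 1.2,  1.1]])
y = np.array([-1, -1, -1, -1, -1, 1, 1, 1, 1, 1])

full = SVC(kernel="linear", C=1.0).fit(X, y)

# Points with y * f(x) > 1 sit strictly outside the margin and contribute
# nothing to the hinge loss at the optimum, so they can be discarded.
margins = y * full.decision_function(X)
keep = margins <= 1.0 + 1e-6
reduced = SVC(kernel="linear", C=1.0).fit(X[keep], y[keep])

print("kept", keep.sum(), "of", len(X), "points")
print("full    w =", full.coef_[0], "b =", full.intercept_[0])
print("reduced w =", reduced.coef_[0], "b =", reduced.intercept_[0])
```

The kept points are exactly the support vectors; the ‖w‖² regularization term in the SVM objective is what makes the solution unique and maximum-margin.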

2 of 3
8

Posting a late reply, since there is a very simple answer which has not been mentioned yet.

"What are the prices we need to pay by using different 'proxy loss functions', such as hinge loss and logistic loss?"

When you replace the non-convex 0-1 loss function by a convex surrogate (e.g., the hinge loss), you are actually now solving a different problem than the one you intended to solve (which is to minimize the number of classification mistakes). So you gain computational tractability (the problem becomes convex, meaning you can solve it efficiently with the tools of convex optimization), but in the general case there is actually no way to relate the error of the classifier that minimizes a "proxy" loss to the error of the classifier that minimizes the 0-1 loss. If what you truly cared about was minimizing the number of misclassifications, I argue that this really is a big price to pay.

I should mention that this statement is worst-case, in the sense that it holds for arbitrary data distributions. For some "nice" distributions there are exceptions to this rule. The key example is data distributions that have a large margin with respect to the decision boundary - see Theorem 15.4 in Shalev-Shwartz, Shai, and Shai Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, 2014.
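
A toy numerical illustration of that price (entirely my own construction, not from the answer): on a 1-D data set with a few points lying far out on the positive side, the threshold that minimizes a convex surrogate can be very different from the one that minimizes the 0-1 loss. I use the squared loss to provoke the effect (the hinge loss happens to be immune on this particular data set, which echoes the robustness point in the first answer); the classifier is simply sign(x − b) and everything is found by brute-force search, so no optimization subtleties are involved.

```python
import numpy as np

# 1-D toy data: a clean cluster of each class near the origin, plus two
# positive points far out on the (already correct) positive side.
x = np.array([-5., -4., -3., -2., -1., 1., 2., 3., 4., 5., 30., 40.])
y = np.array([-1., -1., -1., -1., -1., 1., 1., 1., 1., 1., 1., 1.])

def err_01(b):  return np.mean(y * (x - b) <= 0)                    # 0-1 error rate
def hinge(b):   return np.mean(np.maximum(0.0, 1.0 - y * (x - b)))  # convex surrogate
def squared(b): return np.mean((x - b - y) ** 2)                    # convex surrogate (f - y)^2

thresholds = np.linspace(-10.0, 50.0, 6001)        # brute-force search over the threshold b
for name, loss in [("0-1", err_01), ("hinge", hinge), ("squared", squared)]:
    b_best = thresholds[np.argmin([loss(b) for b in thresholds])]
    print(f"{name:8s} minimizer b = {b_best:6.2f}   0-1 error at that b = {err_01(b_best):.2f}")
```

On this data the squared-loss minimizer lands near b ≈ 5.7 and misclassifies the five ordinary positive points (0-1 error ≈ 0.42), even though a threshold near 0 makes no mistakes at all: the surrogate is being optimized, not the quantity we actually care about.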

🌐
Semantic Scholar
semanticscholar.org › papers › minimizing the misclassification error rate using a surrogate convex loss
[PDF] Minimizing The Misclassification Error Rate Using a Surrogate Convex Loss | Semantic Scholar
It is shown that the hinge loss is essentially optimal among all convex losses, and guarantees on the misclassification error of the loss-minimizer in terms of the margin error rate of the best predictor are investigated. We carefully study how well minimizing convex surrogate loss functions corresponds to minimizing the misclassification error rate for the problem of binary classification with linear predictors.
🌐
University of Massachusetts
people.cs.umass.edu › ~akshay › courses › cs690m › files › lec12.pdf pdf
Lecture 12: Surrogate Losses and Calibration Akshay Krishnamurthy
November 1, 2017 - some assurance that minimizing something like the hinge loss will give us a low misclassification rate? ... First of all, observe that the risk associated with the 0/1 loss is non-convex, which suggests that minimizing it
🌐
Ohio State University
asc.ohio-state.edu › lee.2272 › › 881 › consistency.pdf pdf
6 Consistency
(a) Misclassification (0-1) loss: L(f(x), y) = I(yf(x) ≤ 0); (b) Hinge loss for the SVM: L(f(x), y) = (1 − yf(x))+; (c) Negative log-likelihood for logistic regression: L(f(x), y) = log2{1 + exp(−yf(x))}. Let η(x) = P(Y = 1|X = x) and f(x) = log η(x)/(1 − η(x)), the logit function. Then ... Figure 7 compares the margin-based loss functions with the 0-1 loss. They are convex
🌐
arXiv
arxiv.org › html › 2408.08675
Misclassification excess risk bounds for PAC-Bayesian classification via convexified loss
August 16, 2024 - In classification tasks, due to the non-convex nature of the 0-1 loss, a convex surrogate loss is often used, and thus current PAC-Bayesian bounds are primarily specified for this convex surrogate. This work shifts its focus to providing misclassification excess risk bounds for PAC-Bayesian classification when using a convex surrogate loss.
🌐
Proceedings of Machine Learning Research
proceedings.mlr.press › v28 › nguyen13a.pdf pdf
Algorithms for Direct 0–1 Loss Optimization in Binary Classification
losses are plotted in Figure 2. 0–1 loss is robust to outliers since it is not affected by a misclassified point's distance from the margin, but this property also makes it non-convex; the convex squared, hinge, and log losses are not robust to outliers in this way since their …
🌐
Proceedings of Machine Learning Research
proceedings.mlr.press › v32 › yanga14.html
The Coherent Loss Function for Classification
January 27, 2014 - A prediction rule in binary classification that aims to achieve the lowest probability of misclassification involves minimizing over a non-convex, 0-1 loss function, which is typically a computationally intractable optimization problem. To address the intractability, previous methods consider minimizing the cumulative loss – the sum of convex surrogates of the 0-1 loss of each sample.
🌐
Carnegie Mellon University
cs.cmu.edu › ~yandongl › loss.html
Loss Function
Convexity ensures a global minimum and is computationally appealing. ... Figure 7.5 from Chris Bishop's PRML book. The Hinge Loss E(z) = max(0, 1 − z) is plotted in blue, the Log Loss in red, the Square Loss in green and the 0/1 error in black.
🌐
University of California, Berkeley
statistics.berkeley.edu › sites › default › files › tech-reports › 638.pdf pdf
Convexity, Classification, and Risk Bounds Peter L. Bartlett
We first derived a universal upper bound on the population misclassification risk of any thresholded measurable classifier in terms of its corresponding population φ-risk. The bound is governed by the ψ-transform, a convexified variational transform of φ.
🌐
ResearchGate
researchgate.net › figure › Loss-functions-of-the-margin-for-binary-classification-Zero-one-misclassification-loss_fig7_45130375
3.3 Loss functions of the margin for binary classification. Zero-one... | Download Scientific Diagram
July 5, 2022 - ... (surrogate) loss functions given in Table 1 are all convex functions of the margin ỹg which bound the zero-one misclassification loss from above, see Figure 3. The convexity of these surrogate loss functions is computationally important for empirical risk minimization; minimizing the empirical zero-one loss is computationally intractable.
🌐
Upenn
www-stat.wharton.upenn.edu › ~buja › PAPERS › paper-proper-scoring.pdf pdf
Loss Functions for Binary Class Probability Estimation and ...
The partial losses L1 and L0 are bounded below iff α, β > −1. As the following list suggests, the lowest values ever proposed (at least implicitly) are α = β = −1 ... We introduce "tailoring" of strict proper scoring rules to cost-weighted classification. Recall that cost-weighted misclassification loss with costs c and 1 − c for false positives and false
🌐
ScienceDirect
sciencedirect.com › topics › engineering › misclassification-rate
Misclassification Rate - an overview | ScienceDirect Topics
Masnadi-Shirazi and Vasconcelos (2011) propose a novel framework for designing cost-sensitive boosting algorithms that derives cost-sensitive losses in the functional space of convex combinations of weak learners using two necessary conditions; these losses can be minimized through gradient descent and ultimately produce boosting algorithms suitable for imbalanced classification.
🌐
JMLR
jmlr.org › papers › volume17 › 15-115 › 15-115.pdf pdf
Iterative Regularization for Learning with Convex Loss ...