Quora
quora.com › What-is-a-rigorous-proof-that-the-hinge-loss-is-a-convex-loss-function
What is a rigorous proof that the hinge loss is a convex loss function? - Quora
Answer: The hinge loss is the maximum of two linear functions, so you can prove it in two steps: 1. Any linear function is convex. 2. The maximum of two convex functions is convex. If you don't immediately see how to prove either of those, it's worth taking the time to write it out. This is an ...
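A hedged sketch of those two steps in LaTeX, writing the hinge loss as $\ell(z) = \max(0, 1 - z)$ with $z = yf(x)$ (this notation is my own choice, not fixed by the answer): Step 1: $g(z) = 0$ and $h(z) = 1 - z$ are affine, hence convex. Step 2: for convex $g, h$, any $z_1, z_2$, and $t \in [0, 1]$,
\[
\max\{g,h\}(t z_1 + (1-t) z_2) \le \max\{t\,g(z_1) + (1-t)\,g(z_2),\; t\,h(z_1) + (1-t)\,h(z_2)\} \le t \max\{g,h\}(z_1) + (1-t) \max\{g,h\}(z_2),
\]
so $\ell = \max\{g, h\}$ is convex.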
in machine learning, a loss function used for maximum‐margin classification
In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs). For an intended … Wikipedia
Wikipedia
en.wikipedia.org › wiki › Hinge_loss
Hinge loss - Wikipedia
January 26, 2026 - The hinge loss is a convex function, so many of the usual convex optimizers used in machine learning can work with it.
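As a hedged illustration of that sentence (not taken from the article), here is a minimal NumPy sketch that fits a linear classifier by running subgradient descent, one of those usual convex optimizers, on the average hinge loss; the toy data, step size, and iteration count are placeholders:

import numpy as np

def hinge_subgradient_descent(X, y, lr=0.1, n_iters=200):
    """Minimize the mean hinge loss max(0, 1 - y_i * w^T x_i) by subgradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        margins = y * (X @ w)
        active = margins < 1                 # examples with nonzero hinge loss
        # One valid subgradient of the mean hinge loss:
        g = -(X[active] * y[active][:, None]).sum(axis=0) / len(y)
        w -= lr * g
    return w

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = hinge_subgradient_descent(X, y)
print("weights:", w, "mean hinge loss:", np.mean(np.maximum(0, 1 - y * (X @ w))))

Because the objective is convex, subgradient descent with a suitable step size converges to a global minimizer; that is the practical content of the quoted sentence.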
Hebrew University of Jerusalem
cs.huji.ac.il › ~shais › Lecture4.pdf pdf
Advanced Course in Machine Learning Spring 2010 Online Convex Optimization
Figure 1: An illustration of the hinge-loss function f(x) = max{0, 1 − x} and one of its sub-gradients at ... An equivalent definition is that the ℓ2 norm of all sub-gradients of f at points in A is bounded by ρ. More generally, we say that a convex function is V-Lipschitz w.r.t.
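For concreteness (my own hedged addition, not taken from the lecture notes), the subdifferential of that function can be written out explicitly:
\[
\partial f(x) = \begin{cases} \{-1\} & x < 1, \\ [-1, 0] & x = 1, \\ \{0\} & x > 1, \end{cases}
\]
so every subgradient has magnitude at most 1, which is the sense in which the hinge loss is 1-Lipschitz.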
UBC Computer Science
cs.ubc.ca › ~schmidtm › Courses › 340-F17 › L21.pdf pdf
CPSC 340: Machine Learning and Data Mining More Linear Classifiers Fall 2017
• This is called the hinge loss. – It's convex: max(constant, linear). – It's not degenerate: w = 0 now gives an error of 1 instead of 0. Hinge Loss: Convex Approximation to 0-1 Loss
Davidrosenberg
davidrosenberg.github.io › mlcourse › Archive › 2016 › Homework › hw6-multiclass › hw6.pdf pdf
Generalized Hinge Loss and Multiclass SVM
for the multiclass hinge loss. We can write this as ... We will now show that J(w) is a convex function of w.
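The snippet does not show the homework's exact definition; a commonly used generalized (multiclass) hinge loss, which I assume is close to what it works with, is
\[
\ell(w; x, y) = \max_{y' \in \mathcal{Y}} \Big[ \Delta(y, y') + \langle w, \Psi(x, y') \rangle - \langle w, \Psi(x, y) \rangle \Big],
\]
where $\Delta$ is a task loss and $\Psi$ a class-sensitive feature map (placeholder names here). For each fixed $y'$ the bracketed term is affine in $w$, so the loss is a maximum of affine functions and therefore convex, and adding a convex regularizer keeps $J(w)$ convex.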
Carnegie Mellon University
cs.cmu.edu › ~yandongl › loss.html
Loss Function
0/1 loss: $\min_\theta\sum_i L_{0/1}(\theta^Tx)$. We define $L_{0/1}(\theta^Tx) = 1$ if $y\cdot f \lt 0$, and $= 0$ o.w. Non-convex and very hard to optimize. Hinge loss: approximate 0/1 loss by $\min_\theta\sum_i H(\theta^Tx)$. We define $H(\theta^Tx) = \max(0, 1 - y\cdot f)$. Apparently $H$ is small if we classify correctly.
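A small hedged Python sketch of those two definitions, writing $f = \theta^T x$ and using made-up score values, makes the upper-bound relationship concrete:

import numpy as np

def zero_one_loss(y, f):
    """L_{0/1}: 1 if y * f < 0 (misclassified), else 0."""
    return (y * f < 0).astype(float)

def hinge_loss(y, f):
    """H: max(0, 1 - y * f); small when we classify correctly with a margin."""
    return np.maximum(0.0, 1.0 - y * f)

f = np.array([-2.0, -0.5, 0.3, 1.0, 2.5])   # placeholder scores theta^T x
y = np.ones_like(f)                          # suppose the true label is +1
print(zero_one_loss(y, f))                   # [1. 1. 0. 0. 0.]
print(hinge_loss(y, f))                      # [3.  1.5 0.7 0.  0. ]

Elementwise, the hinge loss is never below the 0/1 loss, which is why it serves as a convex surrogate for it.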
Tel Aviv University
tau.ac.il › ~mansour › advanced-agt+ml › scribe2_covex_func.pdf pdf
Lecture 2: November 7, 2011 2.1 Convex Learning Problems
November 7, 2011 - hyperplane (essentially, an hypothesis), x ∈ X and y ∈ [−1, 1]. The hinge loss is a maximum of linear functions and therefore convex.
Gabormelli
gabormelli.com › RKB › Hinge-Loss_Function
Hinge-Loss Function - GM-RKB - Gabor Melli
The hinge loss function is defined ... the hinge loss equals the 0–1 indicator function when $\operatorname{sgn}(f(\vec{x})) = y$ and $|yf(\vec{x})| \geq 1$ ....
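A brief hedged check of that claim, assuming the usual definition $\ell = \max(0, 1 - y f(\vec{x}))$ (the snippet truncates it): if $\operatorname{sgn}(f(\vec{x})) = y$ and $|y f(\vec{x})| \geq 1$, then $y f(\vec{x}) \geq 1$, so $\max(0, 1 - y f(\vec{x})) = 0$, which equals the 0–1 loss of a correctly classified example.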
arXiv
arxiv.org › abs › 1512.07797
[1512.07797] The Lovász Hinge: A Novel Convex Surrogate for Submodular Losses
May 15, 2017 - We propose instead a novel surrogate loss function for submodular losses, the Lovász hinge, which leads to O(p log p) complexity with O(p) oracle accesses to the loss function to compute a gradient or cutting-plane. We prove that the Lovász hinge is convex and yields an extension.
Inria
inria.hal.science › hal-01241626 › document pdf
The Lovász Hinge: A Novel Convex Surrogate for Submodular Losses
accesses to the loss function to compute a gradient or cutting-plane. We prove that the Lovász hinge is convex and yields an extension.
Caltech
courses.cms.caltech.edu › cs253 › slides › cs253-lec4-onlineSVM.pdf pdf
4.1 Online Convex Optimization
January 13, 2010 - CS/CNS/EE 253: Advanced Topics in Machine Learning · How can we gain insights from massive data sets
arXiv
arxiv.org › pdf › 2103.00233 pdf
Learning with Smooth Hinge Losses, by Junru Luo, Hong Qiao, and Bo Zhang
condition for a convex surrogate loss ℓ to be classification-calibrated, as stated ... Secondly, ψG(α; σ) and ψM(α; σ) are infinitely differentiable. By replacing the Hinge loss with these two smooth Hinge losses, we obtain two smooth support
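The snippet does not reproduce $\psi_G(\alpha; \sigma)$ or $\psi_M(\alpha; \sigma)$, so the sketch below instead uses a different, well-known smoothed variant (the quadratically smoothed hinge often attributed to Rennie and Srebro) purely to illustrate the idea of replacing the kink at margin 1 with a differentiable transition; it is not the loss defined in this paper:

import numpy as np

def smooth_hinge(z):
    """Quadratically smoothed hinge: linear for z <= 0, quadratic on (0, 1),
    zero for z >= 1; continuously differentiable everywhere."""
    return np.where(z >= 1, 0.0,
                    np.where(z <= 0, 0.5 - z, 0.5 * (1.0 - z) ** 2))

z = np.linspace(-2, 2, 5)          # placeholder margins y * f(x)
print(smooth_hinge(z))             # [2.5 1.5 0.5 0.  0. ]
print(np.maximum(0, 1 - z))        # plain hinge, for comparison: [3. 2. 1. 0. 0.]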
Davidrosenberg
davidrosenberg.github.io › mlcourse › Archive › 2017 › Homework › hw5.pdf pdf
Homework 5: Generalized Hinge Loss and Multiclass SVM
New homework on multiclass hinge loss and multiclass SVM · New homework on Bayesian methods, specifically the beta-binomial model, hierarchical models, empirical Bayes ML-II, MAP-II · New short lecture on correlated variables with L1, L2, and Elastic Net regularization · Added some details about subgradient methods, including a one-slide proof that subgradient descent moves us towards a minimizer of a convex ...
Stack Exchange
stats.stackexchange.com › questions › 416695 › proving-that-an-svm-problem-with-a-complex-loss-function-is-convex
machine learning - Proving that an SVM problem with a complex loss function is convex - Cross Validated
July 9, 2019 - The end goal, along with proving that the problem is convex, is to be able to get the problem into a form that can be coded in CVX. I have m positively labeled data points $x_i$ $\in$ $\mathbb{R}^n, i = 1,2,...m$ and a negative class summarized by a random variable $x$ $\in$ $\mathbb{R}^n$ with mean $\hat{x}$ $\in$ $\mathbb{R}^n$, and covariance matrix C · The objective function is of the form: $\min_{w,b} L(w,b) + p(w)$ The loss function L(w,b) is a sum of: 1) the mean empirical hinge-loss error on the positive class; and 2) the worst case (w.r.t. the class of random variables x with mean $\hat{x}$ and covariance matrix C) mean error on the negative class
Top answer (13 votes)

Here's my attempt to answer your questions:

  • Is an SVM as simple as saying it's a discriminative classifier that simply optimizes the hinge loss? Or is it more complex than that? Yes, you can say that. Also, don't forget that it regularizes the model too. I wouldn't say SVM is more complex than that, however, it is important to mention that all of those choices (e.g. hinge loss and $L_2$ regularization) have precise mathematical interpretations and are not arbitrary. That's what makes SVMs so popular and powerful. For example, hinge loss is a continuous and convex upper bound to the task loss which, for binary classification problems, is the $0/1$ loss. Note that $0/1$ loss is non-convex and discontinuous. Convexity of hinge loss makes the entire training objective of SVM convex. The fact that it is an upper bound to the task loss guarantees that the minimizer of the bound won't have a bad value on the task loss. $L_2$ regularization can be geometrically interpreted as the size of the margin.

  • How do the support vectors come into play? Support vectors play an important role in training SVMs. They identify the separating hyperplane. Let $D$ denote a training set and $SV(D) \subseteq D$ be the set of support vectors that you get by training an SVM on $D$ (assume all hyperparameters are fixed a priori). If we throw out all the non-SV samples from $D$ and train another SVM (with the same hyperparameter values) on the remaining samples (i.e. on $SV(D)$), we get the exact same classifier as before! (A small numerical sketch of this appears after this list.)

  • What about the slack variables? SVM was originally designed for problems where there exists a separating hyperplane (i.e. a hyperplane that perfectly separates the training samples from the two classes), and the goal was to find, among all separating hyperplanes, the hyperplane with the largest margin. The margin, denoted by $d(w, D)$, is defined for a classifier $w$ and a training set $D$. Assuming $w$ perfectly separates all the examples in $D$, we have $d(w, D) = \min_{(x, y) \in D} y \frac{w^Tx}{||w||_2}$, which is the distance of the closest training example from the separating hyperplane $w$. Note that $y \in \{+1, -1\}$ here. The introduction of slack variables made it possible to train SVMs on problems where either 1) a separating hyperplane does not exist (i.e. the training data is not linearly separable), or 2) you are happy to (or would like to) sacrifice making some error (higher bias) for better generalization (lower variance). However, this comes at the price of breaking some of the concrete mathematical and geometric interpretations of SVMs without slack variables (e.g. the geometrical interpretation of the margin).

  • Why can't you have deep SVM's? SVM objective is convex. More precisely, it is piecewise quadratic; that is because the $L_2$ regularizer is quadratic and the hinge loss is piecewise linear. The training objectives in deep hierarchical models, however, are much more complex. In particular, they are not convex. Of course, one can design a hierarchical discriminative model with hinge loss and $L_2$ regularization etc., but, it wouldn't be called an SVM. In fact, the hinge loss is commonly used in DNNs (Deep Neural Networks) for classification problems.
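Picking up the support-vector bullet above, here is a hedged scikit-learn sketch: fit a linear SVM, keep only its support vectors, refit with the same hyperparameters, and compare the two hyperplanes. The toy blobs and the value C=1.0 are placeholders:

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(40, 2) + [2, 2], rng.randn(40, 2) - [2, 2]])   # two separated blobs
y = np.array([1] * 40 + [-1] * 40)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

sv = clf.support_                                   # indices of the support vectors
clf_sv = SVC(kernel="linear", C=1.0).fit(X[sv], y[sv])

print("all data:        ", clf.coef_, clf.intercept_)
print("support vectors: ", clf_sv.coef_, clf_sv.intercept_)

If the answer's claim holds, the two printed hyperplanes should agree up to numerical tolerance, since the discarded points were not support vectors of the first model.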