machine learning - Confusion on hinge loss and SVM - Cross Validated
Is support vector machine just about simplifying logistic regression formula? If so, why this name?
No. The main difference between the cost functions is that cross-entropy loss (CEL) penalizes based on the distance of the prediction from the answer. If something is predicted with CEL as class 1 with probability 0.51 and it is actually class 1, it is penalized more strongly than if it had been predicted with probability 0.99. With the hinge loss used by SVMs, by contrast, a correct prediction costs the same whether you barely clear the margin or predict with high confidence. Both methods, however, penalize by 'distance' when they predict the wrong answer.
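The contrast can be sketched numerically. A minimal example (the function names are illustrative, not from any library):

```python
import math

def cross_entropy(p_true_class):
    # CEL for the true class predicted with probability p
    return -math.log(p_true_class)

def hinge(margin):
    # hinge loss on the signed margin v = t * y(x)
    return max(0.0, 1.0 - margin)

# Correct but barely-confident vs. highly confident prediction:
print(cross_entropy(0.51), cross_entropy(0.99))  # ~0.673 vs ~0.010: CEL still penalizes
print(hinge(1.2), hinge(3.0))                    # 0.0 vs 0.0: identical once past the margin
```

Cross-entropy keeps shrinking as confidence grows, while the hinge loss is exactly zero for every correct prediction beyond the margin.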
More on reddit.comneural networks - How do I calculate the gradient of the hinge loss function? - Artificial Intelligence Stack Exchange
Videos
Searching for the quoted text, it seems the book is Data Science for Business (Provost and Fawcett), and they're describing the soft-margin SVM. Their description of the hinge loss is wrong. The problem is that it doesn't penalize misclassified points that lie within the margin, as you mentioned.
In SVMs, smaller weights correspond to larger margins. So, using this "version" of the hinge loss would have pathological consequences: We could achieve the minimum possible loss (zero) simply by choosing weights small enough such that all points lie within the margin. Even if every single point is misclassified. Because the SVM optimization problem contains a regularization term that encourages small weights (i.e. large margins), the solution will always be the zero vector. This means the solution is completely independent of the data, and nothing is learned. Needless to say, this wouldn't make for a very good classifier.
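The pathology is easy to demonstrate. Below, `book_hinge` is one way to formalize the book's (incorrect) description: it only penalizes points misclassified *beyond* the margin, so any point whose score satisfies |y| < 1 costs nothing — even when every point is on the wrong side:

```python
# The book's (incorrect) "hinge": loss only when t*y < -1,
# i.e. misclassified AND outside the margin.
def book_hinge(t, y):
    return max(0.0, -t * y - 1.0)

labels = [1, 1, -1, -1]
scores = [-0.2, -0.3, 0.1, 0.4]   # every point misclassified, but inside the margin
print(sum(book_hinge(t, y) for t, y in zip(labels, scores)))  # 0.0: zero total loss
```

Shrinking the weights shrinks every score toward zero, so this "loss" can always be driven to its minimum without learning anything from the data.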
The correct expression for the hinge loss for a soft-margin SVM is:

$$\ell(y) = \max(0,\, 1 - t \cdot y)$$

where $y$ is the output of the SVM given input $x$, and $t$ is the true class (-1 or 1). When the true class is -1 (as in your example), the hinge loss looks like this:

[plot of $\max(0,\, 1 + y)$: zero for $y \le -1$, increasing linearly for $y > -1$]
Note that the loss is nonzero for misclassified points, as well as correctly classified points that fall within the margin.
For a proper description of soft-margin SVMs using the hinge loss formulation, see The Elements of Statistical Learning (section 12.3.2) or the Wikipedia article.
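A quick numerical check of the true-class-is-(-1) case, using the expression above:

```python
# Hinge loss max(0, 1 - t*y) for true class t and SVM output y.
def hinge_loss(t, y):
    return max(0.0, 1.0 - t * y)

t = -1
for y in [-2.0, -1.0, -0.5, 0.0, 0.5, 2.0]:
    print(y, hinge_loss(t, y))
# y <= -1: loss 0 (correct, outside the margin)
# -1 < y < 0: positive loss even though the sign is correct (inside the margin)
# y >= 0: misclassified, loss grows linearly
```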
The (A) hinge function can be expressed as

$$y = \beta_1 \max(0,\, x - c) + \varepsilon$$

where:

$\beta_1$ is the change in slope after the hinge. In your example, this amounts to the slope following the hinge, since your hinge-only model (see below) assumes zero effect of $x$ on $y$ until the hinge.

$c$ is the point (in $x$) at which the hinge is located, and is a parameter estimated for the model. I believe your question is answered by considering that the location of the hinge is informed by the loss function.

$\varepsilon$ is some error term with some distribution.
Hinge functions can also be useful in changing any line:

$$y = \beta_0 + \beta_1 x + \beta_2 \max(0,\, x - c) + \varepsilon$$

where:

$\beta_0$ is the model constant, and the intercept of the curve before the hinge (i.e. for $x < c$). Of course, if $c < 0$, then the curve intersects the $y$-axis after the hinge, so $\beta_0$ will not necessarily be the $y$-intercept of the bent line.

$\beta_1$ is the slope of the line relating $y$ to $x$.

$\beta_2$ is the change in slope after the hinge.
In addition, the hinge can be used to model how a functional relationship between $x$ and $y$ changes form, as in this model, where the relationship becomes quadratic after the hinge:

$$y = \beta_0 + \beta_1 x + \beta_2 \max(0,\, x - c)^2 + \varepsilon$$
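The bent-line model described above is simple to evaluate directly. A minimal sketch, with made-up coefficient values (the names b0, b1, b2, c mirror the parameters described above):

```python
# Bent-line model: y = b0 + b1*x + b2*max(0, x - c).
# Coefficient values below are illustrative only.
def bent_line(x, b0, b1, b2, c):
    return b0 + b1 * x + b2 * max(0.0, x - c)

b0, b1, b2, c = 1.0, 0.5, 2.0, 3.0
print(bent_line(2.0, b0, b1, b2, c))  # before the hinge: 1 + 0.5*2 = 2.0
print(bent_line(4.0, b0, b1, b2, c))  # after the hinge: 1 + 0.5*4 + 2*(4 - 3) = 5.0
```

Before the hinge the slope is b1; after it the slope is b1 + b2, with the line continuous at x = c.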
Hinge loss is difficult to work with when the derivative is needed, because the derivative is a piecewise function: max has one non-differentiable point, and the derivative inherits it. This was a very prominent issue with non-separable cases of SVM (and a good reason to use ridge regression).
Here's a slide (original source from Zhuowen Tu; apologies for the title typo):

[slide comparing common classification loss functions, including the hinge loss]
where the hinge loss is defined as max(0, 1 - v), and v is the raw output (score) of the SVM classifier. More can be found in the hinge loss Wikipedia article.
As for your equation: you can easily pick out the v in the equation, but without more context about those functions it's hard to say how to take the derivative. Unfortunately I don't have access to the paper and cannot guide you any further...
I disagree with the earlier answer that this is difficult to calculate. If we have the function \begin{align*} \sum_{t\in\mathcal{T}} \max \{0, 1 - d(t) \, y(t, \theta)\} \end{align*} the gradient with respect to $\theta$ is \begin{align*} & \sum_{t\in\mathcal{T}}g(t) \\ & g(t) := \begin{cases} 0 & \text{ if }1 - d(t) y(t, \theta) < 0 \\ -d(t)\dfrac{\partial y}{\partial \theta} & \text{ otherwise} \\ \end{cases} \end{align*} Theoretically this is ok, it just means that the gradient is not continuous. However, the objective is still continuous assuming that $d$ and $y$ are both continuous.
In practice, it's not a problem either. Any automatic differentiation software (tensorflow, pytorch, jax) will handle something like this automatically and correctly.
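The piecewise gradient above is also easy to sanity-check by hand. A minimal sketch with an assumed linear score y(t, θ) = θ·x_t (the data and names here are made up for illustration), comparing the analytic gradient against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                     # one row x_t per example t
d = np.where(rng.normal(size=20) > 0, 1.0, -1.0) # labels d(t) in {-1, +1}

def hinge_objective(theta):
    margins = d * (X @ theta)                    # d(t) * y(t, theta)
    return np.maximum(0.0, 1.0 - margins).sum()

def hinge_gradient(theta):
    # g(t) = 0 where 1 - d(t) y(t) < 0, else -d(t) * dy/dtheta = -d(t) * x_t
    margins = d * (X @ theta)
    active = (1.0 - margins) >= 0.0              # subgradient choice at the kink
    return -((d[active])[:, None] * X[active]).sum(axis=0)

theta = rng.normal(size=3)
g = hinge_gradient(theta)

# Central finite differences, one coordinate at a time
eps = 1e-6
g_num = np.array([
    (hinge_objective(theta + eps * e) - hinge_objective(theta - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(g, g_num, atol=1e-4))  # matches unless a margin sits exactly at the kink
```

The two agree everywhere except on the measure-zero set where some margin is exactly 1, which is precisely the non-differentiable point the answer describes.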