Hinge loss is difficult to work with when the derivative is needed, because the derivative is a piecewise function: max has one non-differentiable point, and the hinge loss inherits it. This was a prominent issue with non-separable cases of SVM (and a good reason to use ridge regression).
Here's a slide (Original source from Zhuowen Tu, apologies for the title typo):

Where hinge loss is defined as $\max(0, 1-v)$ and $v$ is the decision value of the SVM classifier. More can be found on the hinge loss Wikipedia page.
As for your equation: you can easily pick out the $v$ in it; however, without more context for those functions it's hard to say how to take the derivative. Unfortunately I don't have access to the paper and cannot guide you any further...
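To make the non-differentiability concrete: the one-sided difference quotients of $\max(0, 1-v)$ disagree at the kink $v=1$. A minimal numerical check (the helper name is mine):

```python
# One-sided difference quotients of h(v) = max(0, 1 - v) at the kink v = 1.
def h(v):
    return max(0.0, 1.0 - v)

eps = 1e-6
left = (h(1.0) - h(1.0 - eps)) / eps    # approaches -1 from the left
right = (h(1.0 + eps) - h(1.0)) / eps   # approaches 0 from the right
```

Since the two limits differ, no single derivative exists at $v=1$; any value in $[-1,0]$ is a valid subgradient there.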
I disagree with the earlier answer that this is difficult to calculate. If we have the function
\begin{align*}
\sum_{t\in\mathcal{T}} \max \{0, 1 - d(t) \, y(t, \theta)\}
\end{align*}
the gradient with respect to $\theta$ is
\begin{align*}
& \sum_{t\in\mathcal{T}}g(t) \\
& g(t) := \begin{cases}
0 & \text{ if }1 - d(t) y(t, \theta) < 0 \\
-d(t)\dfrac{\partial y}{\partial \theta} & \text{ otherwise} \\
\end{cases}
\end{align*}
Theoretically this is OK; it just means that the gradient is not continuous. However, the objective is still continuous, assuming that $d$ and $y$ are both continuous.
In practice, it's not a problem either: any automatic differentiation software (TensorFlow, PyTorch, JAX) will handle something like this automatically and correctly.
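As a sanity check, the case analysis above can be implemented directly for a linear model $y(t,\theta)=\theta\cdot x_t$ (so $\partial y/\partial\theta = x_t$) and compared against finite differences away from the kinks. A sketch with made-up data; all names are mine:

```python
import numpy as np

def total_loss(theta, X, d):
    # sum_t max(0, 1 - d_t * (theta . x_t))
    return np.sum(np.maximum(0.0, 1.0 - d * (X @ theta)))

def total_grad(theta, X, d):
    # g(t) = 0 where the margin is >= 1, else -d_t * x_t
    active = d * (X @ theta) < 1.0
    return -(d[active, None] * X[active]).sum(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
d = np.sign(rng.normal(size=20))
theta = rng.normal(size=3)

# central finite differences; valid as long as no margin sits exactly at 1
eps = 1e-6
fd = np.array([(total_loss(theta + eps*e, X, d) - total_loss(theta - eps*e, X, d)) / (2*eps)
               for e in np.eye(3)])
```

With continuous random data the probability of landing exactly on a kink is zero, so the analytic subgradient and the finite-difference estimate should agree.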
To get the gradient we differentiate the loss with respect to the $i$th component of $w$.
Rewrite the hinge loss in terms of $w$ as $f(g(w))$ where $f(z)=\max(0,1-y\ z)$ and $g(w)=\mathbf{x}\cdot \mathbf{w}$.
Using the chain rule we get
$$\frac{\partial}{\partial w_i} f(g(w))=\frac{\partial f}{\partial z} \frac{\partial g}{\partial w_i} $$
The first term, evaluated at $z=g(w)=\mathbf{x}\cdot \mathbf{w}$, becomes $-y$ when $y\ \mathbf{x}\cdot \mathbf{w}<1$, and 0 when $y\ \mathbf{x}\cdot \mathbf{w}>1$. The second term becomes $x_i$. So in the end you get $$ \frac{\partial f(g(w))}{\partial w_i} = \begin{cases} -y\ x_i &\text{if } y\ \mathbf{x}\cdot \mathbf{w} < 1 \\ 0&\text{if } y\ \mathbf{x}\cdot \mathbf{w} > 1 \end{cases} $$
Since $i$ ranges over the components of $x$, you can view the above as a vector quantity, and write $\frac{\partial}{\partial w}$ as shorthand for $(\frac{\partial}{\partial w_1},\frac{\partial}{\partial w_2},\ldots)$
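The case analysis can be checked numerically for a single sample (a sketch with made-up data; the names are mine):

```python
import numpy as np

def f(w, x, y):
    # hinge loss for one sample: max(0, 1 - y * (x . w))
    return max(0.0, 1.0 - y * np.dot(x, w))

def df(w, x, y):
    # -y * x componentwise when y * (x . w) < 1, else 0
    return -y * x if y * np.dot(x, w) < 1.0 else np.zeros_like(x)

x = np.array([2.0, -1.0, 0.5])
y = -1.0
w = np.array([0.3, 0.2, -0.1])   # here y * (x . w) = -0.35 < 1, so the hinge is active

# central finite differences, one component at a time
eps = 1e-6
fd = np.array([(f(w + eps*e, x, y) - f(w - eps*e, x, y)) / (2*eps) for e in np.eye(3)])
```

Away from the kink $y\,\mathbf{x}\cdot\mathbf{w}=1$ the analytic gradient and the finite-difference estimate agree componentwise.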
This is 3 years late, but still may be relevant for someone...
Let $S$ denote a sample of points $x_i \in \mathbb{R}^d$ with the set of corresponding labels $y_i \in \{-1,1\}$. We want to find a hyperplane $w$ that minimizes the total hinge loss: \begin{equation} w^* = \underset{w}{\text{argmin }} L^{hinge}_S(w) = \underset{w}{\text{argmin }} \sum_i{l_{hinge}(w,x_i,y_i)}= \underset{w}{\text{argmin }} \sum_i{\max{\{0,1-y_iw\cdot x_i}\}} \end{equation} To find $w^*$, take the derivative of the total hinge loss. The gradient of each term is: $$ \frac{\partial{l_{hinge}}}{\partial w}= \begin{cases} 0 & y_iw\cdot x_i \geq 1 \\ -y_ix_i & y_iw\cdot x_i < 1 \end{cases} $$
The gradient of the sum is the sum of the gradients: $$ \frac{\partial{L_S^{hinge}}}{\partial{w}}=\sum_i{\frac{\partial{l_{hinge}}}{\partial w}} $$ A Python example that uses gradient descent to find a hinge-loss-optimal separating hyperplane follows (it's probably not the most efficient code, but it works):
import numpy as np
import matplotlib.pyplot as plt

def hinge_loss(w, x, y):
    """Evaluates total hinge loss and its gradient at w.
    Rows of x are data points; y is a vector of labels.
    """
    loss, grad = 0, 0
    for (x_, y_) in zip(x, y):
        v = y_ * np.dot(w, x_)
        loss += max(0, 1 - v)
        grad += 0 if v > 1 else -y_ * x_
    return (loss, grad)

def grad_descent(x, y, w, step, thresh=0.001):
    ws = np.zeros((2, 0))
    ws = np.hstack((ws, w.reshape(2, 1)))
    step_num = 1
    delta = np.inf
    loss0 = np.inf
    while np.abs(delta) > thresh:
        loss, grad = hinge_loss(w, x, y)
        delta = loss0 - loss
        loss0 = loss
        grad_norm = np.linalg.norm(grad)
        if grad_norm == 0:  # every margin satisfied: nothing left to descend
            break
        grad_dir = grad / grad_norm
        w = w - step * grad_dir / step_num  # diminishing step size
        ws = np.hstack((ws, w.reshape((2, 1))))
        step_num += 1
    # return the average of the iterates
    return np.sum(ws, 1) / np.size(ws, 1)

def test1():
    # sample data points
    x1 = np.array((0, 1, 3, 4, 1))
    x2 = np.array((1, 2, 0, 1, 1))
    x = np.vstack((x1, x2)).T
    # sample labels
    y = np.array((1, 1, -1, -1, -1))
    w = grad_descent(x, y, np.array((0.0, 0.0)), 0.1)
    loss, grad = hinge_loss(w, x, y)
    plot_test(x, y, w)

def plot_test(x, y, w):
    plt.figure()
    x1, x2 = x[:, 0], x[:, 1]
    x1_min, x1_max = np.min(x1) * .7, np.max(x1) * 1.3
    x2_min, x2_max = np.min(x2) * .7, np.max(x2) * 1.3
    gridpoints = 2000
    x1s = np.linspace(x1_min, x1_max, gridpoints)
    x2s = np.linspace(x2_min, x2_max, gridpoints)
    gridx1, gridx2 = np.meshgrid(x1s, x2s)
    grid_pts = np.c_[gridx1.ravel(), gridx2.ravel()]
    predictions = np.array([np.sign(np.dot(w, x_)) for x_ in grid_pts]).reshape((gridpoints, gridpoints))
    plt.contourf(gridx1, gridx2, predictions, cmap=plt.cm.Paired)
    plt.scatter(x[:, 0], x[:, 1], c=y, cmap=plt.cm.Paired)
    plt.title('total hinge loss: %g' % hinge_loss(w, x, y)[0])
    plt.show()

if __name__ == '__main__':
    np.set_printoptions(precision=3)
    test1()
Let's use the example of the SVM loss function for a single datapoint:
$L_i = \sum_{j\neq y_i} \left[ \max(0, w_j^Tx_i - w_{y_i}^Tx_i + \Delta) \right]$
Where $\Delta$ is the desired margin.
We can differentiate the function with respect to the weights. For example, taking the gradient with respect to $w_{yi}$ we obtain:
$\nabla_{w_{y_i}} L_i = - \left( \sum_{j\neq y_i} \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) \right) x_i$
Where $\mathbb{1}$ is the indicator function, equal to one if the condition inside is true and zero otherwise. While the expression may look scary written out, when implementing this in code you simply count the number of classes that didn't meet the desired margin (and hence contributed to the loss function); the data vector $x_i$ scaled by this count is the gradient. Notice that this is the gradient only with respect to the row of $W$ that corresponds to the correct class. For the other rows, where $j \neq y_i$, the gradient is:
$\nabla_{w_j} L_i = \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) x_i$
Once you derive the expression for the gradient, it is straightforward to implement it and use it to perform the gradient update.
Taken from the Stanford CS231n optimization notes on GitHub.
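The counting recipe above can be sketched in NumPy for a single datapoint and checked against finite differences (the helper names are mine; $\Delta = 1$ assumed):

```python
import numpy as np

def svm_loss_single(W, x, yi, delta=1.0):
    # L_i = sum over j != yi of max(0, w_j.x - w_yi.x + delta)
    s = W @ x
    m = np.maximum(0.0, s - s[yi] + delta)
    m[yi] = 0.0
    return m.sum()

def svm_grad_single(W, x, yi, delta=1.0):
    scores = W @ x
    margins = scores - scores[yi] + delta
    margins[yi] = 0.0
    mask = (margins > 0).astype(float)  # indicator per class j != yi
    dW = np.outer(mask, x)              # rows j != yi get 1{...} * x_i
    dW[yi] = -mask.sum() * x            # row yi gets -(number of violators) * x_i
    return dW

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))   # 4 classes, 3 features
x = rng.normal(size=3)
yi = 2

# numerical gradient over every entry of W
eps = 1e-6
fd = np.zeros_like(W)
for idx in np.ndindex(W.shape):
    Wp, Wm = W.copy(), W.copy()
    Wp[idx] += eps
    Wm[idx] -= eps
    fd[idx] = (svm_loss_single(Wp, x, yi) - svm_loss_single(Wm, x, yi)) / (2 * eps)
```

With random continuous data no margin sits exactly at zero, so the analytic and numerical gradients should match entrywise.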
First of all, note that the multi-class hinge loss is a function of $W_r$:
\begin{equation}
l(W_r) = \max( 0, 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i - W_{y_i} \cdot x_i)
\end{equation}
Next, the max function is non-differentiable at $0$, so we need to calculate its subgradient.
\begin{equation}
\frac{\partial l(W_r)}{\partial W_r} =
\begin{cases}
\{0\}, & W_{y_i}\cdot x_i > 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i \\
\{x_i\}, & W_{y_i}\cdot x_i < 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i\\
\{\alpha x_i\}, & \alpha \in [0,1], W_{y_i}\cdot x_i = 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i
\end{cases}
\end{equation}
In the second case, $W_{y_i}$ is independent of $W_r$ (since $r \neq y_i$). This definition of the subgradient of the multi-class hinge loss is analogous to the subgradient of the binary hinge loss.
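A minimal numerical illustration of the first two cases, for the row $W_r$ attaining the inner max (my own names; the measure-zero tie case is not exercised):

```python
import numpy as np

def l(W, x, yi):
    # max(0, 1 + max_{r != yi} W_r.x - W_yi.x)
    s = W @ x
    return max(0.0, 1.0 + np.max(np.delete(s, yi)) - s[yi])

rng = np.random.default_rng(2)
W = rng.normal(size=(4, 3))
x = rng.normal(size=3)
yi = 0

s = W @ x
others = np.delete(np.arange(4), yi)
r = others[np.argmax(s[others])]  # the row attaining the inner max

# predicted subgradient w.r.t. W_r from the cases above (away from ties)
pred = np.zeros_like(x) if s[yi] > 1.0 + s[r] else x

# central finite differences over the entries of row W_r
eps = 1e-6
fd = np.zeros_like(x)
for i in range(3):
    Wp, Wm = W.copy(), W.copy()
    Wp[r, i] += eps
    Wm[r, i] -= eps
    fd[i] = (l(Wp, x, yi) - l(Wm, x, yi)) / (2 * eps)
```

When the margin is satisfied the loss is locally constant in $W_r$ and the numerical gradient is zero; when it is violated, perturbing the maximizing row moves the loss at rate $x_i$, matching the second case.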