Let's use the example of the SVM loss function for a single datapoint:
$L_i = \sum_{j\neq y_i} \left[ \max(0, w_j^Tx_i - w_{y_i}^Tx_i + \Delta) \right]$
where $\Delta$ is the desired margin.
We can differentiate the function with respect to the weights. For example, taking the gradient with respect to $w_{y_i}$ we obtain:
$\nabla_{w_{y_i}} L_i = - \left( \sum_{j\neq y_i} \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) \right) x_i$
where $\mathbb{1}$ is the indicator function, which is one if the condition inside is true and zero otherwise. While the expression may look scary when written out, when implementing this in code you simply count the number of classes that didn't meet the desired margin (and hence contributed to the loss), and the data vector scaled by this number is the gradient. Notice that this is the gradient only with respect to the row of $W$ that corresponds to the correct class. For the other rows, where $j \neq y_i$, the gradient is:
$\nabla_{w_j} L_i = \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) x_i$
Once you derive the expression for the gradient, it is straightforward to implement it and use it to perform the gradient update.
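As a concrete sketch (my own NumPy code, not from the notes), the loss and gradient for a single datapoint could look like this:

```python
import numpy as np

def svm_loss_single(W, x, y, delta=1.0):
    """SVM loss and gradient for one datapoint.

    W: (num_classes, D) weight matrix, x: (D,) datapoint,
    y: index of the correct class, delta: desired margin.
    """
    scores = W.dot(x)                      # (num_classes,)
    margins = scores - scores[y] + delta   # margin term for every class
    margins[y] = 0                         # the j == y term is excluded from the sum
    loss = np.sum(np.maximum(0, margins))

    dW = np.zeros_like(W)
    positive = margins > 0                 # classes that violated the margin
    dW[positive] = x                       # rows j != y_i: gradient is x
    dW[y] = -np.sum(positive) * x          # correct row: -(violation count) * x
    return loss, dW
```

The gradient rows match the two expressions above: each violating row gets $x_i$, and the correct row gets $-x_i$ times the number of violating classes.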
Taken from the Stanford CS231n optimization notes posted on GitHub.
First of all, note that the multi-class hinge loss is a function of $W_r$:
\begin{equation}
l(W_r) = \max( 0, 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i - W_{y_i} \cdot x_i)
\end{equation}
Next, the max function is non-differentiable at $0$, so we need to calculate its subgradient:
\begin{equation}
\frac{\partial l(W_r)}{\partial W_r} =
\begin{cases}
\{0\}, & W_{y_i}\cdot x_i > 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i \\
\{x_i\}, & W_{y_i}\cdot x_i < 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i\\
\{\alpha x_i\}, & \alpha \in [0,1], W_{y_i}\cdot x_i = 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i
\end{cases}
\end{equation}
In the second case, $W_{y_i} \cdot x_i$ is independent of $W_r$. The above definition of the subgradient of the multi-class hinge loss is similar to the subgradient of the binary-class hinge loss.
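A minimal NumPy sketch of this subgradient (names hypothetical; for the non-differentiable boundary case we simply pick $\alpha = 1$ from the allowed set $[0,1]$):

```python
import numpy as np

def hinge_subgradient(W, x, y):
    """Subgradient of l(W) = max(0, 1 + max_{r != y} W_r.x - W_y.x).

    W: (num_classes, D), x: (D,), y: correct class index.
    Returns (loss, dW). At the kink, any alpha * x with alpha in [0, 1]
    is a valid subgradient; we use alpha = 1 for simplicity.
    """
    scores = W.dot(x)
    scores_wrong = np.delete(scores, y)
    r = int(np.argmax(scores_wrong))       # most violated wrong class
    r = r if r < y else r + 1              # map back to an index over all classes
    margin = 1.0 + scores[r] - scores[y]
    loss = max(0.0, margin)

    dW = np.zeros_like(W)
    if margin >= 0:                        # margin active: {x} for W_r, {-x} for W_{y_i}
        dW[r] = x
        dW[y] = -x
    return loss, dW
```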
top_k has gradients, added in TensorFlow version 0.8.
Adding another implementation in three lines of code.

scores: unscaled scores, tensor, shape=(n_classes, batch_size), dtype=float32
classes: one-hot class labels, tensor, shape=(n_classes, batch_size), dtype=float32
This implements the loss above by choosing only the most violated class instead of considering all classes:
# H - hard negative (highest-scoring wrong class) for each sample
H = tf.reduce_max(scores * (1 - classes), 0)
L = tf.nn.relu((1 - scores + H) * classes)
final_loss = tf.reduce_mean(tf.reduce_max(L, 0))
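For reference, a NumPy re-implementation of the same max-violator loss (my own sketch, mirroring the TF lines above including their masking trick):

```python
import numpy as np

def max_violator_hinge(scores, classes):
    """NumPy equivalent of the three TF lines above.

    scores:  (n_classes, batch_size) unscaled scores
    classes: (n_classes, batch_size) one-hot labels
    """
    # H: highest score among the wrong classes, per sample
    # (classes masks the true-class entry out of the max)
    H = np.max(scores * (1 - classes), axis=0)
    # hinge on the true class only; multiplying by classes zeroes wrong rows
    L = np.maximum(0.0, (1.0 - scores + H) * classes)
    return np.mean(np.max(L, axis=0))
```

Note that `scores * (1 - classes)` implicitly clips $H$ at zero when every wrong-class score is negative, a quirk inherited from the original masking.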
Another implementation, where we sum over all negative classes:
# implements loss as sum_(j~=y) max(0, 1 - s(x, y) + s(x, j))
def multiclasshingeloss1(scores, classes):
    true_classes = tf.argmax(classes, 0)
    idx_flattened = tf.range(0, scores.get_shape()[1]) * scores.get_shape()[0] + \
                    tf.cast(true_classes, dtype=tf.int32)
    true_scores = tf.gather(tf.reshape(tf.transpose(scores), [-1]),
                            idx_flattened)
    L = tf.nn.relu((1 - true_scores + scores) * (1 - classes))
    final_loss = tf.reduce_mean(L)
    return final_loss
You can minimize the transposes here based on your implementation.
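As a sanity check on the gather/transpose logic, here is a NumPy version of the same sum-over-negatives loss (my own sketch; note that, like the TF code, it averages over all class-sample cells rather than summing over classes first):

```python
import numpy as np

def multiclass_hinge_sum(scores, classes):
    """NumPy version of multiclasshingeloss1 above.

    scores, classes: (n_classes, batch_size); classes is one-hot.
    Computes max(0, 1 - s(x, y) + s(x, j)) for each j != y
    (zero for j == y via the (1 - classes) mask), then means over all cells.
    """
    true_scores = np.sum(scores * classes, axis=0)   # s(x, y) per sample
    L = np.maximum(0.0, (1.0 - true_scores + scores) * (1 - classes))
    return np.mean(L)
```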