The hinge loss term in soft margin SVM penalizes misclassifications. In hard margin SVM there are, by definition, no misclassifications.
This indeed means that hard margin SVM tries to minimize $\|\mathbf{w}\|^2$. Due to the formulation of the SVM problem, the margin is $2/\|\mathbf{w}\|$. As such, minimizing the norm of $\mathbf{w}$ is geometrically equivalent to maximizing the margin. Exactly what we want!
Regularization is a technique to avoid overfitting by penalizing large coefficients in the solution vector. In hard margin SVM, $\|\mathbf{w}\|^2$ is both the loss function and a regularizer.
In soft-margin SVM, the hinge loss term also acts like a regularizer, but on the slack variables $\xi_i$ instead of $\mathbf{w}$, and in $\ell_1$ rather than $\ell_2$. $\ell_1$ regularization induces sparsity, which is why standard SVM is sparse in terms of support vectors (in contrast to least-squares SVM).
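To make the sparsity point concrete, here is a minimal NumPy sketch with made-up numbers (the weight vector, offset, and data points are illustrative, not from the text): the hinge loss $\max(0, 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b))$ is exactly zero for points outside the margin, so only margin violators contribute to the soft-margin objective.

```python
import numpy as np

# Hypothetical classifier and toy points (illustrative values only).
w = np.array([2.0, -1.0])
b = 0.5
X = np.array([[3.0, 1.0],    # far on the positive side
              [0.1, 0.1],    # inside the margin
              [-2.0, 1.0]])  # far on the negative side
y = np.array([1.0, 1.0, -1.0])

margins = y * (X @ w + b)                 # signed functional margins
hinge = np.maximum(0.0, 1.0 - margins)    # hinge loss per example

print(hinge)  # only the margin-violating point contributes a nonzero loss
```

Only the point with functional margin below 1 incurs any loss; the others are "inactive" and would not appear as support vectors.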
Strictly speaking, there is no "loss function" for hard-margin SVMs, but when we move to soft-margin SVMs, it turns out a loss function does exist.
Here is the detailed explanation:
When we talk about a loss function, what we really mean is a training objective that we want to minimize.
In the hard-margin SVM setting, the "objective" is to maximize the geometric margin subject to each training example lying on the correct side of the separating hyperplane, i.e. $$\begin{aligned} \max_{w, b}\quad &\frac{1}{\Vert w \Vert} \\ \text{s.t.}\quad &y_i(w^Tx_i+b) \ge 1 \end{aligned}$$ Note that this is a constrained quadratic programming problem, so we cannot solve it numerically with a direct gradient descent approach; that is, there is no single analytic "loss function" for hard-margin SVMs.
However, in the soft-margin SVM setting, we add slack variables to allow our SVM to make mistakes. We now try to solve
$$\begin{aligned}
\min_{w,b,\boldsymbol{\xi}}\quad &\frac{1}{2}\Vert w \Vert_2^2 + C\sum_i \xi_i \\
\text{s.t.}\quad &y_i(w^Tx_i+b) \ge 1-\xi_i \\
&\boldsymbol{\xi} \succeq \mathbf{0}
\end{aligned}
$$
This is the same as penalizing the misclassified (or margin-violating) training examples by adding $C\sum_i \xi_i$ to the objective to be minimized. Recall the hinge loss: $$\max\left(0,\ 1 - y_i(w^Tx_i+b)\right)$$ If a training example lies outside the margin, $\max(0,\ 1 - y_i(w^Tx_i+b))$ will be zero; it is nonzero only when the example falls into the margin region. Since the hinge loss is always nonnegative, we can rephrase our problem in unconstrained form: $$\min_{w,b}\ \frac{1}{2}\Vert w \Vert_2^2 + C\sum_i \max\left(0,\ 1 - y_i(w^Tx_i+b)\right)$$
We know that the hinge loss is convex and its subgradient is known everywhere, so we can solve the soft-margin SVM directly by (sub)gradient descent.
So the slack variable is just the hinge loss in disguise, and the properties of the hinge loss happen to absorb our optimization constraints (i.e. nonnegativity, and activating only when the functional margin is less than 1).
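The unconstrained hinge-loss objective above can be minimized with plain subgradient descent. Below is a minimal sketch; the synthetic two-cluster data, the value of $C$, the learning rate, and the iteration count are all illustrative choices, not prescribed by the text.

```python
import numpy as np

# Subgradient descent on (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.5, (20, 2)),     # positive cluster
               rng.normal(-2.0, 0.5, (20, 2))])   # negative cluster
y = np.array([1.0] * 20 + [-1.0] * 20)

C, lr = 1.0, 0.01          # illustrative hyperparameters
w, b = np.zeros(2), 0.0
for _ in range(500):
    margins = y * (X @ w + b)
    active = margins < 1   # examples whose hinge loss is nonzero
    # Subgradient: regularizer contributes w; each active example
    # contributes -C * y_i * x_i (and -C * y_i for the bias).
    grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
    grad_b = -C * y[active].sum()
    w -= lr * grad_w
    b -= lr * grad_b

acc = np.mean(np.sign(X @ w + b) == y)
print(acc)
```

Note how only the "active" examples (those violating the margin) enter the gradient, mirroring the sparsity discussion above.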
The gradient in this case is computed with calculus (analytically, NOT numerically!). Differentiating the loss function with respect to $w_{y_i}$ gives:

$$\nabla_{w_{y_i}} L_i = -\left(\sum_{j\neq y_i} \mathbb{1}\left(w_j^T x_i - w_{y_i}^T x_i + 1 > 0\right)\right) x_i$$

and with respect to $w_j$ for $j \neq y_i$:

$$\nabla_{w_j} L_i = \mathbb{1}\left(w_j^T x_i - w_{y_i}^T x_i + 1 > 0\right)\, x_i$$

The $\mathbb{1}(\cdot)$ is just the indicator function: it equals 1 when the condition holds and 0 otherwise, so the term drops out whenever the margin condition is not violated. When you write this in code, the example you provided is the answer.
Since you are using the cs231n example, you should definitely check the notes and videos if needed.
Hope this helps!
If the subtraction is less than zero, the loss is zero, so the gradient with respect to $W$ is also zero. If the subtraction is greater than zero, the gradient of $W$ is the partial derivative of the loss.
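The two gradient formulas can be sketched as follows for a single example. This is a hypothetical implementation in the cs231n style (weight matrix `W` with one column per class, margin $\Delta = 1$); the specific numbers are made up for illustration.

```python
import numpy as np

def svm_loss_grad(W, x, yi):
    """Multiclass SVM loss and analytic gradient for one example.

    W  : (D, K) weight matrix, one column per class
    x  : (D,) input vector
    yi : index of the correct class
    """
    scores = W.T @ x                                   # one score per class
    margins = np.maximum(0.0, scores - scores[yi] + 1.0)
    margins[yi] = 0.0                                  # correct class excluded
    loss = margins.sum()

    dW = np.zeros_like(W)
    active = margins > 0          # indicator 1(w_j.x - w_yi.x + 1 > 0)
    dW[:, active] = x[:, None]    # grad w.r.t. w_j for active j != yi
    dW[:, yi] = -active.sum() * x # grad w.r.t. w_yi: -(count of active) * x
    return loss, dW

# Toy numbers (illustrative only).
W = np.array([[0.2, -0.1, 0.4],
              [0.5, 0.3, -0.2]])
x = np.array([1.0, 2.0])
loss, dW = svm_loss_grad(W, x, yi=0)
print(loss, dW)
```

The indicator simply masks which columns receive a gradient: inactive classes get zero, each active class gets $x_i$, and the correct class accumulates $-x_i$ once per active class.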