Those two formulae are different things:

$\frac{1}{2} w^T w + C \sum_i \xi_i$ is one form of the objective function, the function which is minimized over $w$, $b$, and $\xi_i$ (subject to certain constraints, which are where the labels $y_i$ come in) to find the best SVM solution.

Once you've found the model (defined by $w$ and $b$), predictions on new data $x$ are made by finding their signed distance from the decision hyperplane, $w^T x + b$.

$w$ and $b$ define the decision hyperplane, which separates positives from negatives, $\{x : w^T x + b = 0\}$. So $w$ is perpendicular to that hyperplane. Each component $w_j$ is also the weight of the corresponding feature dimension: if $w_j = 0$, that feature is ignored, and if $|w_j|$ is high, that feature is important to the SVM's decision (assuming all the features are scaled similarly).
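To make the role of $w$ concrete, here's a small sketch (my own illustration, not part of the original answer) using scikit-learn's `LinearSVC` on synthetic data where only the first feature carries signal:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Two Gaussian blobs separated along feature 0; feature 1 is pure noise.
X = np.vstack([rng.normal([-2, 0], 1, (50, 2)),
               rng.normal([2, 0], 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = LinearSVC(C=1.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# Predictions are the sign of the score w^T x + b.
scores = X @ w + b
assert np.array_equal(scores > 0, clf.decision_function(X) > 0)

# The informative feature gets a much larger weight than the noise feature.
print(w)
```

If you print `w`, you'll see $|w_0| \gg |w_1|$: the noise dimension is effectively ignored, exactly as described above.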
SVMs are trained by maximizing the margin, which is the amount of space between the decision boundary and the nearest example. If your problem isn't linearly separable, though, there is no perfect decision boundary and so there's no "hard-margin" SVM solution. This is why the "soft-margin" SVM was introduced, which allows some points to be on the wrong side of the margin.
$\xi_i$ is the slack variable defining how much on the wrong side the $i$th training example is. If $\xi_i = 0$, the point was classified correctly and by enough of a margin; if it's between 0 and 1, the point was classified correctly but by less of a margin than the SVM wanted; if it's more than 1, the point was classified incorrectly. ($\xi_i$ isn't allowed to be negative.) Points with $\xi_i > 0$, as well as those with $\xi_i = 0$ closest to the decision boundary, are known as support vectors because they "support" the margin. These are important in a kernel SVM because they're the only ones you need to worry about when predicting on new data.
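For labels $y_i \in \{-1, +1\}$, the slack at the optimum works out to $\xi_i = \max(0,\, 1 - y_i(w^T x_i + b))$, the hinge loss. A tiny numeric sketch (the model $(w, b)$ and points here are hypothetical, just chosen to show the three regimes):

```python
import numpy as np

# Hypothetical fitted model (w, b) and points, all with label y = +1.
w, b = np.array([1.0, 0.0]), 0.0
X = np.array([[ 2.0, 0.0],   # well inside the correct side
              [ 0.5, 0.0],   # correct side, but inside the margin
              [-1.0, 0.0]])  # wrong side of the hyperplane
y = np.array([1, 1, 1])

# Slack: how far each point falls short of the functional margin of 1.
xi = np.maximum(0, 1 - y * (X @ w + b))
print(xi)  # [0.  0.5 2. ]
```

The three values land in exactly the three cases above: $\xi = 0$ (correct with full margin), $0 < \xi < 1$ (correct but inside the margin), and $\xi > 1$ (misclassified).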
$C$ is a parameter of the problem that defines how soft the margin should be. As $C \to \infty$, you get a hard-margin SVM; if $C = 0$, the SVM doesn't care about getting the right answer at all and will just choose $w = 0$. In practice, you usually try a few different values of $C$ and see how they perform.
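"Trying a few values of $C$" usually means a small grid with cross-validation. A minimal sketch, assuming scikit-learn and synthetic data (the grid values are arbitrary):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs: not linearly separable, so C matters.
X = np.vstack([rng.normal(-1, 1, (60, 2)), rng.normal(1, 1, (60, 2))])
y = np.array([-1] * 60 + [1] * 60)

# Small C = very soft margin; large C = nearly hard margin.
for C in [0.01, 1.0, 100.0]:
    acc = cross_val_score(SVC(kernel="linear", C=C), X, y, cv=5).mean()
    print(f"C={C:g}: mean CV accuracy {acc:.3f}")
```

You'd then keep whichever $C$ gives the best cross-validated score; in real use the grid is typically logarithmic, e.g. $10^{-3}$ to $10^{3}$.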
This picture (source) illustrates the different variables, though it's for a kernel SVM; just take the feature map to be the identity and it'll be a linear SVM.

One way that I think about the flatness is that it makes my predictions less sensitive to perturbations in the features. That is, if I am constructing a model of the form $$y = x^\top \theta + \epsilon,$$ where my feature vector $x$ has already been normalized, then smaller values in $\theta$ mean my model is less sensitive to errors in measurement/random shocks/non-stationarity of the features, $x$. Given two models (i.e. two possible values of $\theta$) which explain the data equally well, I prefer the 'flatter' one.
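This sensitivity argument is easy to check numerically. A toy illustration (my own, not from the answer): with perfectly correlated features there are many $\theta$ that fit the clean data identically, but when the features are perturbed, the flatter $\theta$'s predictions move far less.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(0, 1, 1000)
X = np.column_stack([x1, x1])          # two perfectly correlated features
y = X @ np.array([0.5, 0.5])           # noiseless targets

theta_flat = np.array([0.5, 0.5])      # small ||theta||
theta_steep = np.array([10.5, -9.5])   # fits the clean data identically
assert np.allclose(X @ theta_flat, X @ theta_steep)

# Perturb each feature independently; compare how much predictions shift.
noise = rng.normal(0, 0.1, X.shape)
shift_flat = np.std((X + noise) @ theta_flat - y)
shift_steep = np.std((X + noise) @ theta_steep - y)
print(shift_flat, shift_steep)
```

The shift scales with $\|\theta\|$: here the steep model's predictions spread roughly $\sqrt{10.5^2 + 9.5^2} / \sqrt{0.5}\approx 20$ times wider under the same feature noise, which is exactly why we prefer the flatter model.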
You can also think of Ridge Regression as performing the same thing without the kernel trick or the SVM 'tube' regression formulation.
edit: In response to @Yang's comments, some more explanation:
- Consider the linear case: $y = x^\top \theta + \epsilon$. Suppose the $x$ are drawn i.i.d. from some distribution, independent of $\theta$. By the dot product identity, we have $y = ||x|| ||\theta|| \cos\psi + \epsilon$, where $\psi$ is the angle between $\theta$ and $x$, presumably drawn from some spherically uniform distribution. Now note: the 'spread' (e.g. the sample standard deviation) of our predictions of $y$ is proportional to $||\theta||$. To get good MSE with the latent, noiseless versions of our observations, we want to shrink that $||\theta||$. Cf. the James–Stein estimator.
- Consider the linear case with lots of features. Consider the models $y = x^\top \theta_1 + \epsilon$, and $y = x^\top \theta_2 + \epsilon$. If $\theta_1$ has more zero elements in it than $\theta_2$, but about the same explanatory power, we would prefer it, based on Occam's razor, since it has dependencies on fewer variables (i.e. we have 'done feature selection' by setting some elements of $\theta_1$ to zero). Flatness is kind of a continuous version of this argument. If each marginal of $x$ has unit standard deviation, and $\theta_1$ has e.g. 2 elements which are 10, and the remaining $n-2$ are smaller than 0.0001, then, depending on your tolerance of noise, this is effectively 'selecting' the two features and zeroing out the remaining ones.
- When the kernel trick is employed, you are performing a linear regression in a high (sometimes infinite) dimensional vector space. Each element of $\theta$ now corresponds to one of your samples, not your features. If $k$ elements of $\theta$ are non-zero, and the remaining $m-k$ are zero, the feature vectors corresponding to the $k$ non-zero elements of $\theta$ are called your 'support vectors'. To store your SVM model, say on disk, you need only keep those $k$ feature vectors, and you can throw the rest of them away. Now flatness really matters, because having $k$ small reduces storage, transmission, etc. requirements. Again, depending on your tolerance for noise, you can probably zero out all elements of $\theta$ but the $l$ largest, for some $l$, after performing an SVM regression. Flatness here is equivalent to parsimony with respect to the number of support vectors.
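The sparsity point in the last bullet is easy to see with scikit-learn's `SVR` (my example, not the answer's): widening the $\epsilon$-tube zeroes out more dual coefficients, so fewer training points end up as support vectors and the stored model shrinks.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, (200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

# A wider epsilon-tube tolerates more error, so more dual coefficients
# are exactly zero and fewer training points become support vectors.
n_sv = {}
for eps in [0.01, 0.1, 0.5]:
    svr = SVR(kernel="rbf", epsilon=eps).fit(X, y)
    n_sv[eps] = len(svr.support_)
    print(f"epsilon={eps}: {n_sv[eps]} support vectors out of {len(X)}")
```

Only the rows of `X` indexed by `svr.support_` (plus their dual coefficients) are needed at prediction time; everything else can be discarded.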
shabbychef gave a very clear explanation from the perspective of the model complexity. I will try to understand this problem from another point of view in case it may help anyone.
Basically, we want to maximize the margin in SVC. The same idea applies in SVR: we want to maximize the distance of the training points from the regression line, within a defined precision $e$, for better generalization. If we minimized that distance instead of maximizing it, the predictions on unknown data would be more likely to overfit. Let us think about "maximizing the distance" in the one-dimensional case.
In the one-dimensional case, our goal is to maximize the distances from all points $(x_i,y_i)$ to the trend line $y=\omega x+b$, within $e$. Note that we set the precision constraint to $e$ so that we can maximize the distance, not minimize it. Then let us take a look at the very simple equation of the distance from a point to a line.
$$ \frac{\left|\omega x_i-y_i+b\right|}{\sqrt {\omega^2+1}} $$
Right now the numerator is limited to $e$. To maximize the distance, what we try to do is to minimize $\omega$.
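A quick numeric check of this claim (my own sketch): holding the residual $|\omega x_i - y_i + b|$ fixed at $e$, the line with the smaller slope puts the point geometrically farther away.

```python
import numpy as np

# Distance from (x_i, y_i) to the line y = w*x + b, per the formula above.
def dist(w, b, x, y):
    return abs(w * x - y + b) / np.sqrt(w**2 + 1)

# Two lines with the same residual |w*x - y + b| = 0.5 at x = 1:
e = 0.5
d_flat = dist(0.1, 0.0, 1.0, 0.1 * 1.0 - e)   # flat line, slope 0.1
d_steep = dist(3.0, 0.0, 1.0, 3.0 * 1.0 - e)  # steep line, slope 3
print(d_flat, d_steep)
```

Both points sit at the same vertical offset $e$ from their line, but the flat line's point is several times farther away in Euclidean distance, which is exactly why minimizing $\omega$ maximizes the geometric margin.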
One can easily extend the one-dimensional case to the N-dimensional case, as the distance equation will always be the Euclidean distance.
Additionally, we may have a review on the optimization problem in SVR for the comparison [1].
$$ \min \frac{1}{2} {\left\| \omega \right\|}^2 $$ $$ \text{s.t.} \begin{cases}y_i-\langle\omega,x_i\rangle-b \leq e\\\langle\omega,x_i\rangle+b-y_i \leq e\end{cases} $$
Thanks.
[1] Smola, A., and B. Schölkopf. A tutorial on support vector regression. Statistics and Computing, Vol. 14, No. 3, Aug. 2004, pp. 199–222.