Those two formulae are different things:

  • $\frac{1}{2} w^T w + C \sum \xi_i$ is one form of the objective function: the function which is minimized over $w$, $b$, and $\xi_i$ (subject to the constraints $y_i (w^T x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$, which are where the $\xi_i$ come in) to find the best SVM solution.

  • Once you've found the model (defined by $w$ and $b$), predictions on new data are done by finding their signed distance from the decision hyperplane, i.e. the sign of $w^T x + b$.

$w$ and $b$ define the decision hyperplane $w^T x + b = 0$, which separates positives from negatives. So $w$ is perpendicular to that hyperplane. Each component $w_j$ is also the weight of the corresponding feature dimension: if $w_j = 0$, that feature is ignored, and if $|w_j|$ is high, that feature is important to the SVM's decision (assuming all the features are scaled similarly).
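As a tiny sketch of what prediction looks like (the weights and point below are made-up, purely illustrative numbers):

```python
import numpy as np

# Hypothetical model parameters (made-up numbers, purely illustrative).
w = np.array([2.0, -0.5, 0.0])   # third weight is 0, so feature 3 is ignored
b = -1.0

def predict(x):
    """Predict by which side of the hyperplane w.x + b = 0 the point falls on."""
    return np.sign(w @ x + b)

x_new = np.array([1.0, 0.0, 99.0])
print(w @ x_new + b)     # decision value: 2*1 - 0.5*0 + 0*99 - 1 = 1.0
print(predict(x_new))    # 1.0 -> positive class
```

Note that changing the third feature of `x_new` can never change the prediction, since its weight is zero.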

SVMs are trained by maximizing the margin, which is the amount of space between the decision boundary and the nearest example. If your problem isn't linearly separable, though, there is no perfect decision boundary and so there's no "hard-margin" SVM solution. This is why the "soft-margin" SVM was introduced, which allows some points to be on the wrong side of the margin.

$\xi_i$ is the slack variable defining how far onto the wrong side the $i$th training example is. If $\xi_i = 0$, the point was classified correctly and by enough of a margin; if it's between 0 and 1, the point was classified correctly but by less of a margin than the SVM wanted; if it's more than 1, the point was classified incorrectly. ($\xi_i$ isn't allowed to be negative.) Points with $\xi_i > 0$, as well as those with $\xi_i = 0$ closest to the decision boundary, are known as support vectors because they "support" the margin. These are important in a kernel SVM because they're the only ones you need to worry about when predicting on new data.
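At the optimum the slack takes the hinge-loss form $\xi_i = \max(0,\ 1 - y_i(w^T x_i + b))$, which makes the three cases easy to compute directly (the model and points below are made-up illustrative values):

```python
import numpy as np

w, b = np.array([1.0, 1.0]), 0.0   # hypothetical trained model

def slack(x, y):
    """xi = max(0, 1 - y*(w.x + b)): 0 if correct with a full margin,
    in (0, 1) if correct but inside the margin, > 1 if misclassified."""
    return max(0.0, 1.0 - y * (w @ x + b))

print(slack(np.array([2.0, 2.0]), +1))    # 0.0 -> correct, outside the margin
print(slack(np.array([0.3, 0.3]), +1))    # 0.4 -> correct, inside the margin
print(slack(np.array([-1.0, -1.0]), +1))  # 3.0 -> misclassified
```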

$C$ is a parameter of the problem that defines how soft the margin should be. As $C \to \infty$, you get a hard-margin SVM; if $C = 0$, the SVM doesn't care about getting the right answer at all and will just choose $w = 0$. In practice, you usually try a few different values of $C$ and see how they perform.
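You can see the effect of $C$ with any SVM library; a sketch using scikit-learn's `SVC` on made-up overlapping data (the answer itself isn't tied to any library):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs: not linearly separable, so slack is needed.
X = np.vstack([rng.normal(-1.0, 1.2, (50, 2)), rng.normal(1.0, 1.2, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Softer margin (small C) tolerates more margin violations, so more
    # points end up as support vectors.
    print(C, int(clf.n_support_.sum()), clf.score(X, y))
```

Sweeping $C$ like this (usually on a log scale, with cross-validation) is the standard way to pick it.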

This picture (source) illustrates the different variables, though it's for a kernel SVM; just use the linear kernel $k(x, x') = x^T x'$ and it'll be a linear SVM.

Answer from Danica on Stack Exchange


Top answer (1 of 3)

One way that I think about the flatness is that it makes my predictions less sensitive to perturbations in the features. That is, if I am constructing a model of the form $$y = x^\top \theta + \epsilon,$$ where my feature vector $x$ has already been normalized, then smaller values in $\theta$ mean my model is less sensitive to errors in measurement/random shocks/non-stationarity of the features, $x$. Given two models (i.e. two possible values of $\theta$) which explain the data equally well, I prefer the 'flatter' one.

You can also think of Ridge Regression as performing the same thing, without the kernel trick or the SVM 'tube' regression formulation.

edit: In response to @Yang's comments, some more explanation:

  1. Consider the linear case: $y = x^\top \theta + \epsilon$. Suppose the $x$ are drawn i.i.d. from some distribution, independent of $\theta$. By the dot product identity, we have $y = ||x|| ||\theta|| \cos\psi + \epsilon$, where $\psi$ is the angle between $\theta$ and $x$, which is presumably spherically uniformly distributed. Now note: the 'spread' (e.g. the sample standard deviation) of our predictions of $y$ is proportional to $||\theta||$. To get good MSE with the latent, noiseless versions of our observations, we want to shrink that $||\theta||$. Cf. the James–Stein estimator.
  2. Consider the linear case with lots of features. Consider the models $y = x^\top \theta_1 + \epsilon$, and $y = x^\top \theta_2 + \epsilon$. If $\theta_1$ has more zero elements in it than $\theta_2$, but about the same explanatory power, we would prefer it, based on Occam's razor, since it has dependencies on fewer variables (i.e. we have 'done feature selection' by setting some elements of $\theta_1$ to zero). Flatness is kind of a continuous version of this argument. If each marginal of $x$ has unit standard deviation, and $\theta_1$ has e.g. 2 elements which are 10, and the remaining $n-2$ are smaller than 0.0001, then, depending on your tolerance of noise, this is effectively 'selecting' the two features and zeroing out the remaining ones.
  3. When the kernel trick is employed, you are performing a linear regression in a high (sometimes infinite) dimensional vector space. Each element of $\theta$ now corresponds to one of your samples, not your features. If $k$ elements of $\theta$ are non-zero, and the remaining $m-k$ are zero, the sample feature vectors corresponding to the $k$ non-zero elements of $\theta$ are called your 'support vectors'. To store your SVM model, say on disk, you need only keep those $k$ feature vectors, and you can throw the rest of them away. Now flatness really matters, because having $k$ small reduces storage, transmission, etc. requirements. Again, depending on your tolerance for noise, you can probably zero out all elements of $\theta$ but the $l$ largest, for some $l$, after performing an SVM regression. Flatness here is equivalent to parsimony with respect to the number of support vectors.
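Point 1 is easy to check numerically: with i.i.d. unit-variance features, the spread of the predictions $x^\top \theta$ scales directly with $||\theta||$. A NumPy sketch with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 5))     # i.i.d. features, unit variance

theta_flat  = np.full(5, 0.1)        # 'flat' model
theta_steep = np.full(5, 1.0)        # same direction, 10x larger norm

# The sample standard deviation of the predictions x.theta is proportional
# to ||theta||: scaling theta by 10 scales the spread by exactly 10.
s_flat  = (X @ theta_flat).std()
s_steep = (X @ theta_steep).std()
print(s_steep / s_flat)              # 10.0
```

Any perturbation of $x$ is amplified by the same factor, which is the sense in which the flatter model is less sensitive.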
Answer 2 of 3

shabbychef gave a very clear explanation from the perspective of model complexity. I will try to look at this problem from another point of view, in case it helps anyone.

Basically we want to maximize the margin in SVC. The goal is the same in SVR: we want to maximize the distance between the training points and the trend line, subject to a defined precision $e$, for better generalization. If we minimized that distance instead of maximizing it, the predictions on unknown data would be more likely to overfit. Let us think about this "maximize the distance" idea in the one-dimensional case.

In the one-dimensional case, our goal is to maximize the distances from all points $(x_i, y_i)$ to the trend line $y = \omega x + b$, while keeping the residuals within $e$. Note that we set the precision constraint $e$ precisely so that we can maximize this distance rather than minimize it. Now let us look at the simple equation for the distance from a point to a line.

$$ \frac{\left|\omega x_i-y_i+b\right|}{\sqrt {\omega^2+1}} $$

Right now the numerator is limited to $e$. So to maximize the distance, what we have to do is minimize $|\omega|$.

This one-dimensional case extends readily to the $N$-dimensional case, since the distance formula is always a Euclidean distance.
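That argument is easy to verify numerically: holding the numerator $|\omega x_i - y_i + b|$ fixed at the precision $e$ (an arbitrary value below), the point-to-line distance grows as $|\omega|$ shrinks:

```python
import math

e = 0.5                                   # residual |w*x - y + b| pinned at e

def distance(omega, residual=e):
    """Point-to-line distance |w*x - y + b| / sqrt(w^2 + 1)."""
    return residual / math.sqrt(omega ** 2 + 1)

for omega in (4.0, 1.0, 0.0):
    print(omega, distance(omega))
# The distance increases monotonically as |omega| shrinks,
# reaching its maximum value e when omega = 0 (a perfectly flat line).
```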

Additionally, we can review the optimization problem of SVR for comparison [1].

$$ \min \frac{1}{2} {\left| \left| \omega \right| \right|}^2 $$ $$ \text{s.t.} \begin{cases} y_i - \langle \omega, x_i \rangle - b \leq e \\ \langle \omega, x_i \rangle + b - y_i \leq e \end{cases} $$

Thanks.

[1] Smola, A., and B. Schölkopf. A tutorial on support vector regression. Statistics and Computing, Vol. 14, No. 3, Aug. 2004, pp. 199–222.
