Those two formulae are different things:

$\frac{1}{2} w^T w + C \sum_i \xi_i$ is one form of the objective function, the function which is minimized over $w$, $b$, and $\xi_i$ (subject to certain constraints, which are where the labels $y_i$ come in) to find the best SVM solution.

Once you've found the model (defined by $w$ and $b$), predictions on new data $x$ are made by finding their signed distance from the decision hyperplane, $w^T x + b$.

$w$ and $b$ define the decision hyperplane, which separates positives from negatives, $\{x : w^T x + b = 0\}$. So $w$ is perpendicular to that hyperplane. Each component $w_j$ is also the weight of the corresponding feature dimension: if $w_j = 0$, that feature is ignored, and if $|w_j|$ is high, that feature is important to the SVM's decision (assuming all the features are scaled similarly).
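To make the role of $w$ concrete, here's a small sketch (my own illustration, not part of the original answer) using scikit-learn's `LinearSVC` on synthetic data where only the first feature carries signal:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Two Gaussian blobs separated along feature 0; feature 1 is pure noise.
X = np.vstack([rng.normal([-2, 0], 1, (50, 2)),
               rng.normal([2, 0], 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = LinearSVC(C=1.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# Predictions are the sign of the score w^T x + b.
scores = X @ w + b
assert np.array_equal(scores > 0, clf.decision_function(X) > 0)

# The informative feature gets a much larger weight than the noise feature.
print(w)
```

If you print `w`, you'll see $|w_0| \gg |w_1|$: the noise dimension is effectively ignored, exactly as described above.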
SVMs are trained by maximizing the margin, which is the amount of space between the decision boundary and the nearest example. If your problem isn't linearly separable, though, there is no perfect decision boundary and so there's no "hard-margin" SVM solution. This is why the "soft-margin" SVM was introduced, which allows some points to be on the wrong side of the margin.
$\xi_i$ is the slack variable defining how much on the wrong side the $i$th training example is. If $\xi_i = 0$, the point was classified correctly and by enough of a margin; if it's between 0 and 1, the point was classified correctly but by less of a margin than the SVM wanted; if it's more than 1, the point was classified incorrectly. ($\xi_i$ isn't allowed to be negative.) Points with $\xi_i > 0$, as well as those with $\xi_i = 0$ closest to the decision boundary, are known as support vectors because they "support" the margin. These are important in a kernel SVM because they're the only ones you need to worry about when predicting on new data.
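For labels $y_i \in \{-1, +1\}$, the slack at the optimum works out to $\xi_i = \max(0,\, 1 - y_i(w^T x_i + b))$, the hinge loss. A tiny numeric sketch (the model $(w, b)$ and points here are hypothetical, just chosen to show the three regimes):

```python
import numpy as np

# Hypothetical fitted model (w, b) and points, all with label y = +1.
w, b = np.array([1.0, 0.0]), 0.0
X = np.array([[ 2.0, 0.0],   # well inside the correct side
              [ 0.5, 0.0],   # correct side, but inside the margin
              [-1.0, 0.0]])  # wrong side of the hyperplane
y = np.array([1, 1, 1])

# Slack: how far each point falls short of the functional margin of 1.
xi = np.maximum(0, 1 - y * (X @ w + b))
print(xi)  # [0.  0.5 2. ]
```

The three values land in exactly the three cases above: $\xi = 0$ (correct with full margin), $0 < \xi < 1$ (correct but inside the margin), and $\xi > 1$ (misclassified).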
$C$ is a parameter of the problem that defines how soft the margin should be. As $C \to \infty$, you get a hard-margin SVM; if $C = 0$, the SVM doesn't care about getting the right answer at all and will just choose $w = 0$. In practice, you usually try a few different values of $C$ and see how they perform.
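"Trying a few values of $C$" usually means a small grid with cross-validation. A minimal sketch, assuming scikit-learn and synthetic data (the grid values are arbitrary):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs: not linearly separable, so C matters.
X = np.vstack([rng.normal(-1, 1, (60, 2)), rng.normal(1, 1, (60, 2))])
y = np.array([-1] * 60 + [1] * 60)

# Small C = very soft margin; large C = nearly hard margin.
for C in [0.01, 1.0, 100.0]:
    acc = cross_val_score(SVC(kernel="linear", C=C), X, y, cv=5).mean()
    print(f"C={C:g}: mean CV accuracy {acc:.3f}")
```

You'd then keep whichever $C$ gives the best cross-validated score; in real use the grid is typically logarithmic, e.g. $10^{-3}$ to $10^{3}$.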
This picture (source) illustrates the different variables, though it's for a kernel SVM; just take the feature map to be the identity and it'll be a linear SVM.

One way that I think about the flatness is that it makes my predictions less sensitive to perturbations in the features. That is, if I am constructing a model of the form $$y = x^\top \theta + \epsilon,$$ where my feature vector $x$ has already been normalized, then smaller values in $\theta$ mean my model is less sensitive to errors in measurement/random shocks/non-stationarity of the features, $x$. Given two models (i.e. two possible values of $\theta$) which explain the data equally well, I prefer the 'flatter' one.
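This sensitivity argument is easy to check numerically. A toy illustration (my own, not from the answer): with perfectly correlated features there are many $\theta$ that fit the clean data identically, but when the features are perturbed, the flatter $\theta$'s predictions move far less.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(0, 1, 1000)
X = np.column_stack([x1, x1])          # two perfectly correlated features
y = X @ np.array([0.5, 0.5])           # noiseless targets

theta_flat = np.array([0.5, 0.5])      # small ||theta||
theta_steep = np.array([10.5, -9.5])   # fits the clean data identically
assert np.allclose(X @ theta_flat, X @ theta_steep)

# Perturb each feature independently; compare how much predictions shift.
noise = rng.normal(0, 0.1, X.shape)
shift_flat = np.std((X + noise) @ theta_flat - y)
shift_steep = np.std((X + noise) @ theta_steep - y)
print(shift_flat, shift_steep)
```

The shift scales with $\|\theta\|$: here the steep model's predictions spread roughly $\sqrt{10.5^2 + 9.5^2} / \sqrt{0.5}\approx 20$ times wider under the same feature noise, which is exactly why we prefer the flatter model.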
You can also think of Ridge Regression as performing the same thing without the kernel trick or the SVM 'tube' regression formulation.
edit: In response to @Yang's comments, some more explanation:
- Consider the linear case: $y = x^\top \theta + \epsilon$. Suppose the $x$ are drawn i.i.d. from some distribution, independent of $\theta$. By the dot product identity, we have $y = ||x|| ||\theta|| \cos\psi + \epsilon$, where $\psi$ is the angle between $\theta$ and $x$, presumably drawn from some spherically uniform distribution. Now note: the 'spread' (e.g. the sample standard deviation) of our predictions of $y$ is proportional to $||\theta||$. To get good MSE with the latent, noiseless versions of our observations, we want to shrink that $||\theta||$. Cf. the James–Stein estimator.
- Consider the linear case with lots of features. Consider the models $y = x^\top \theta_1 + \epsilon$, and $y = x^\top \theta_2 + \epsilon$. If $\theta_1$ has more zero elements in it than $\theta_2$, but about the same explanatory power, we would prefer it, based on Occam's razor, since it has dependencies on fewer variables (i.e. we have 'done feature selection' by setting some elements of $\theta_1$ to zero). Flatness is kind of a continuous version of this argument. If each marginal of $x$ has unit standard deviation, and $\theta_1$ has e.g. 2 elements which are 10, and the remaining $n-2$ are smaller than 0.0001, then, depending on your tolerance of noise, this is effectively 'selecting' the two features and zeroing out the remaining ones.
- When the kernel trick is employed, you are performing a linear regression in a high (sometimes infinite) dimensional vector space. Each element of $\theta$ now corresponds to one of your samples, not your features. If $k$ elements of $\theta$ are non-zero, and the remaining $m-k$ are zero, the feature vectors corresponding to the $k$ non-zero elements of $\theta$ are called your 'support vectors'. To store your SVM model, say on disk, you need only keep those $k$ feature vectors, and you can throw the rest of them away. Now flatness really matters, because having $k$ small reduces storage, transmission, etc. requirements. Again, depending on your tolerance for noise, you can probably zero out all elements of $\theta$ but the $l$ largest, for some $l$, after performing an SVM regression. Flatness here is equivalent to parsimony with respect to the number of support vectors.
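The sparsity point in the last bullet is easy to see with scikit-learn's `SVR` (my example, not the answer's): widening the $\epsilon$-tube zeroes out more dual coefficients, so fewer training points end up as support vectors and the stored model shrinks.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, (200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

# A wider epsilon-tube tolerates more error, so more dual coefficients
# are exactly zero and fewer training points become support vectors.
n_sv = {}
for eps in [0.01, 0.1, 0.5]:
    svr = SVR(kernel="rbf", epsilon=eps).fit(X, y)
    n_sv[eps] = len(svr.support_)
    print(f"epsilon={eps}: {n_sv[eps]} support vectors out of {len(X)}")
```

Only the rows of `X` indexed by `svr.support_` (plus their dual coefficients) are needed at prediction time; everything else can be discarded.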
shabbychef gave a very clear explanation from the perspective of the model complexity. I will try to understand this problem from another point of view in case it may help anyone.
Basically, we want to maximize the margin in SVC. The same idea applies in SVR: we want to maximize the distance of the training points from the regression line, within a defined precision $e$, for better generalization. If we minimized that distance instead of maximizing it, the predictions on unknown data would be more likely to overfit. Let us think about "maximizing the distance" in the one-dimensional case.
In the one-dimensional case, our goal is to maximize the distances from all points $(x_i,y_i)$ to the trend line $y=\omega x+b$, within $e$. Note that we set the precision constraint to $e$ so that we can maximize the distance, not minimize it. Then let us take a look at the very simple equation of the distance from a point to a line.
$$ \frac{\left|\omega x_i-y_i+b\right|}{\sqrt {\omega^2+1}} $$
Right now the numerator is limited to $e$. To maximize the distance, what we try to do is to minimize $\omega$.
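A quick numeric check of this claim (my own sketch): holding the residual $|\omega x_i - y_i + b|$ fixed at $e$, the line with the smaller slope puts the point geometrically farther away.

```python
import numpy as np

# Distance from (x_i, y_i) to the line y = w*x + b, per the formula above.
def dist(w, b, x, y):
    return abs(w * x - y + b) / np.sqrt(w**2 + 1)

# Two lines with the same residual |w*x - y + b| = 0.5 at x = 1:
e = 0.5
d_flat = dist(0.1, 0.0, 1.0, 0.1 * 1.0 - e)   # flat line, slope 0.1
d_steep = dist(3.0, 0.0, 1.0, 3.0 * 1.0 - e)  # steep line, slope 3
print(d_flat, d_steep)
```

Both points sit at the same vertical offset $e$ from their line, but the flat line's point is several times farther away in Euclidean distance, which is exactly why minimizing $\omega$ maximizes the geometric margin.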
One can easily extend the one-dimensional case to the N-dimensional case, as the distance equation will always be the Euclidean distance.
Additionally, we may have a review on the optimization problem in SVR for the comparison [1].
$$ \min \frac{1}{2} {\left\| \omega \right\|}^2 $$ $$ \text{s.t.} \begin{cases}y_i-\langle\omega,x_i\rangle-b \leq e\\\langle\omega,x_i\rangle+b-y_i \leq e\end{cases} $$
Thanks.
[1] Smola, A., and B. Schölkopf. A tutorial on support vector regression. Statistics and Computing, Vol. 14, No. 3, Aug. 2004, pp. 199–222.