Answer from Bitwise on Stack Exchange

For a general kernel it is difficult to interpret the SVM weights; however, for the linear SVM there is actually a useful interpretation:
1) Recall that in a linear SVM, the result is a hyperplane that separates the classes as well as possible. The weights represent this hyperplane by giving you the coordinates of a vector which is orthogonal to the hyperplane - these are the coefficients given by svm.coef_. Let's call this vector w.
2) What can we do with this vector? Its direction gives us the predicted class: if you take the dot product of any point with w, you can tell on which side of the hyperplane it lies. If the dot product is positive, the point belongs to the positive class; if it is negative, it belongs to the negative class.
3) Finally, you can even learn something about the importance of each feature. This is my own interpretation, so convince yourself first. Suppose the SVM found only one feature useful for separating the data; then the hyperplane would be orthogonal to that axis. So you could say that the absolute size of a coefficient, relative to the other ones, gives an indication of how important that feature was for the separation. For example, if only the first coordinate is used for the separation, w will be of the form (x, 0) where x is some non-zero number, and then |x| > 0.
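To make point 3 concrete, here is a minimal sketch (my own, on a made-up toy dataset) in which only the first feature separates the classes, so the learned w should come out close to the form (x, 0):

import numpy as np
from sklearn.svm import SVC

# Toy data: the classes differ only in the first feature; the second is uninformative.
X_toy = np.array([[2, 3], [1, -1], [2, -2],    # class -1
                  [8, 3], [9, -1], [8, -2]])   # class +1
y_toy = np.array([-1, -1, -1, 1, 1, 1])

clf_toy = SVC(kernel='linear', C=1e5).fit(X_toy, y_toy)
print(clf_toy.coef_)              # roughly [[0.33, 0.0]]: the second weight is (near) zero
print(np.abs(clf_toy.coef_[0]))   # absolute weights, read as relative feature importance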
I am trying to interpret the variable weights given by fitting a linear SVM.
A good way to understand how the weights are calculated, and how to interpret them in the case of a linear SVM, is to perform the calculations by hand on a very simple example.
Example
Consider the following dataset, which is linearly separable:
import numpy as np
X = np.array([[3, 4], [1, 4], [2, 3], [6, -1], [7, -1], [5, -3]])
y = np.array([-1, -1, -1, 1, 1, 1])

Solving the SVM problem by inspection
By inspection we can see that the boundary line that separates the points with the largest "margin" is the line $x_2 = x_1 - 3$. Since the weights of the SVM are proportional to the equation of this decision line (hyperplane in higher dimensions), using
$$ w^T x + b = 0 $$
a first guess of the parameters would be
$$ w = [1, -1] \quad b = -3 $$
SVM theory tells us that the "width" of the margin is given by $\frac{2}{||w||}$. Using the above guess we would obtain a width of $\frac{2}{\sqrt{2}} = \sqrt{2}$, which, by inspection, is incorrect: the width is $4\sqrt{2}$.
Recall that scaling the boundary by a factor of $c$ does not change the boundary line, hence we can generalize the equation as
$$ cx_1 - cx_2 - 3c = 0 $$
$$ w = [c, -c] \quad b = -3c $$
Plugging back into the equation for the width we get
$$ \begin{aligned} \frac{2}{||w||} & = 4 \sqrt{2} \\ \frac{2}{c\sqrt{2}} & = 4 \sqrt{2} \\ c & = \frac{1}{4} \end{aligned} $$
Hence the parameters (or coefficients) are in fact
$$ w = \left[ \frac{1}{4}, -\frac{1}{4} \right] \quad b = -\frac{3}{4} $$
(I'm using scikit-learn)
So am I; here's some code to check our manual calculations:
from sklearn.svm import SVC

# A large C approximates the hard-margin SVM solved by hand above
clf = SVC(C=1e5, kernel='linear')
clf.fit(X, y)

print('w = ', clf.coef_)
print('b = ', clf.intercept_)
print('Indices of support vectors = ', clf.support_)
print('Support vectors = ', clf.support_vectors_)
print('Number of support vectors for each class = ', clf.n_support_)
print('Coefficients of the support vector in the decision function = ', np.abs(clf.dual_coef_))
- w = [[ 0.25 -0.25]]
- b = [-0.75]
- Indices of support vectors = [2 3]
- Support vectors = [[ 2. 3.] [ 6. -1.]]
- Number of support vectors for each class = [1 1]
- Coefficients of the support vector in the decision function = [[0.0625 0.0625]]
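As a quick extra check (my own addition, reusing clf and np from above), the hand-derived margin width and the relation between the dual coefficients and w can both be verified from these outputs:

# The margin width 2/||w|| should equal the 4*sqrt(2) found by inspection.
print(2 / np.linalg.norm(clf.coef_))           # ~5.657 = 4*sqrt(2)

# w is the weighted sum of the support vectors: w = sum_i alpha_i * y_i * x_i.
# In scikit-learn, dual_coef_ already stores alpha_i * y_i, so:
print(clf.dual_coef_ @ clf.support_vectors_)   # [[ 0.25 -0.25]], matching clf.coef_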
Does the sign of the weight have anything to do with class?
Not really; the sign of the weights has to do with the equation of the boundary plane.
Source
https://ai6034.mit.edu/wiki/images/SVM_and_Boosting.pdf
In the linear case, the hyperplane can always be defined with d+1 numbers, where d is the dimension of the input space, while the number of actual support vectors may be much larger. By computing this hyperplane (let's call it w) you get a more compact model, which can then be used to perform classification:
cl(x) = sgn(w'x + b)
where w' is the transpose of w.
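As a minimal sketch of this decision rule (my own check, reusing clf, X, and np from the worked example above), the sign of w'x + b reproduces the classifier's predictions:

# Evaluate sgn(w'x + b) by hand and compare with scikit-learn's predictions.
w = clf.coef_[0]            # vector orthogonal to the separating hyperplane
b = clf.intercept_[0]       # bias term
scores = X @ w + b          # w'x + b for every training point
print(np.sign(scores).astype(int))   # should match...
print(clf.predict(X))                # ...the labels predicted by the model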
Things get much more tricky in the kernelized version, as w is expressed in terms of the feature-space projection, which may be unknown (or too expensive to compute), so one cannot get an explicit equation for such an object (it is no longer a hyperplane in the input space, but rather a hyperplane in a very rich feature space).
"Support vectors are the elements of the training set that would change the position of the dividing hyperplane if removed." The weights represent this hyperplane by providing the coordinates of a vector that is orthogonal to the hyperplane. "Computes the weighted sum of the support vectors" mathematically means sign(w'*x +b), when x is the support vectors and w' is the transpose of weight vectors, the value of w'x+b is 0 and it represents the decision boundary. When a new x reaches, the sign(w'x+b) will determine which class it belongs to.
For those x in the training sample that have a weight of 0, the sample does not contribute to the hyperplane; including such an x as a support vector would either increase the classification error or decrease the margin.
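To illustrate the quoted definition with the worked example above (my own check, reusing X, y, SVC, and np from earlier): removing a point that is not a support vector leaves the hyperplane untouched, while removing a support vector moves it.

# Point 0 ([3, 4]) is not a support vector (the support indices were [2 3]):
# refitting without it should leave w and b unchanged.
clf_drop = SVC(C=1e5, kernel='linear').fit(np.delete(X, 0, axis=0), np.delete(y, 0))
print(clf_drop.coef_, clf_drop.intercept_)        # still [[ 0.25 -0.25]] [-0.75]

# Point 2 ([2, 3]) is a support vector: refitting without it changes the hyperplane.
clf_drop_sv = SVC(C=1e5, kernel='linear').fit(np.delete(X, 2, axis=0), np.delete(y, 2))
print(clf_drop_sv.coef_, clf_drop_sv.intercept_)  # different weights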
Here is a reference tutorial with plenty of figures for more details.