The code examples listed here don't work with LibSVM 3.1, so I've more or less ported the example by mossplix:
from svmutil import *
svm_model.predict = lambda self, x: svm_predict([0], [x], self)[0][0]
prob = svm_problem([1,-1], [[1,0,1], [-1,0,-1]])
param = svm_parameter()
param.kernel_type = LINEAR
param.C = 10
m=svm_train(prob, param)
m.predict([1,1,1])
Answer from ShinNoNoir on Stack Overflow
This example demonstrates a one-class SVM classifier; it's about as simple as possible while still showing the complete LIBSVM workflow.
Step 1: Import NumPy & LIBSVM
import numpy as NP
from svm import *
Step 2: Generate synthetic data: for this example, 500 points within a given boundary (note: quite a few real data sets are provided on the LIBSVM website)
Data = NP.random.randint(-5, 5, 1000).reshape(500, 2)
Step 3: Now, choose some non-linear decision boundary for a one-class classifier:
rx = [1 if (x**2 + y**2) < 9 else 0 for (x, y) in Data]
Step 4: Next, arbitrarily partition the data with respect to this decision boundary:
Class I: those that lie within an arbitrary circle (here, radius 3 around the origin)
Class II: all points on or outside the decision boundary (circle)
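As a cross-check on the labelling rule from Step 3, here is a minimal pure-Python sketch (no NumPy) that applies the same test and counts the two classes. The helper name label_point and the fixed seed are my own assumptions, not part of the original answer.

```python
import random

def label_point(x, y, radius_squared=9):
    """Return 1 if (x, y) lies strictly inside the circle, else 0."""
    return 1 if x**2 + y**2 < radius_squared else 0

rng = random.Random(0)  # fixed seed so the run is repeatable
# randrange(-5, 5) mirrors NP.random.randint(-5, 5): integers from -5 to 4.
points = [(rng.randrange(-5, 5), rng.randrange(-5, 5)) for _ in range(500)]
labels = [label_point(x, y) for x, y in points]

inside = sum(labels)
outside = len(labels) - inside
print("inside:", inside, "outside:", outside)
print("label for (3, 1):", label_point(3, 1))
```

The test point [3, 1] used in Step 8 has x**2 + y**2 = 10, so its true label under this rule is 0.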
The SVM Model Building begins here; all steps before this one were just to prepare some synthetic data.
Step 5: Construct the problem description by calling svm_problem, passing in the class labels (rx) and the data, then bind this result to a variable.
px = svm_problem(rx, Data)
Step 6: Select a kernel function for the non-linear mapping
For this example, I chose RBF (radial basis function) as my kernel function
pm = svm_parameter(kernel_type=RBF)
Step 7: Train the classifier by calling svm_model, passing in the problem description (px) and the parameters (pm)
v = svm_model(px, pm)
Step 8: Finally, test the trained classifier by calling predict on the trained model object ('v')
v.predict([3, 1])
# returns the class label (either '1' or '0')
For the example above, I used version 3.0 of LIBSVM (the current stable release at the time this answer was posted).
Finally, with respect to the part of your question regarding the choice of kernel function: Support Vector Machines are not specific to a particular kernel function; I could have chosen a different kernel (linear, polynomial, sigmoid, etc.).
LIBSVM includes all of the most commonly used kernel functions, which is a big help because you can see all plausible alternatives. Selecting one for use in your model is just a matter of calling svm_parameter and passing in a value for kernel_type (a constant such as LINEAR, POLY, RBF, or SIGMOID).
Note that the kernel function you choose for training must match the kernel function used against the testing data.
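To make the kernel choice concrete, the sketch below writes out the four standard kernel formulas described in the LIBSVM README (linear, polynomial, RBF, sigmoid) in plain Python. The function names and the default parameter values are illustrative assumptions; LIBSVM computes these internally, so you never have to write them yourself.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    # K(u, v) = u . v
    return dot(u, v)

def polynomial_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    # K(u, v) = (gamma * u . v + coef0) ** degree
    return (gamma * dot(u, v) + coef0) ** degree

def rbf_kernel(u, v, gamma=0.5):
    # K(u, v) = exp(-gamma * |u - v|^2)
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    # K(u, v) = tanh(gamma * u . v + coef0)
    return math.tanh(gamma * dot(u, v) + coef0)

u, v = [1, 0, 1], [1, 1, 1]
print(linear_kernel(u, v), polynomial_kernel(u, v), rbf_kernel(u, v), sigmoid_kernel(u, v))
```

Only the RBF kernel depends on the distance between the two points; the other three depend on their inner product.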
If you have already downloaded libSVM, you will find some "useful" documentation inside two files:
- ./libsvm-3.xx/README, in the top directory, which covers the C/C++ API and also documents the binary executables svm-predict, svm-scale and svm-train.
- ./libsvm-3.xx/python/README, which deals with the Python interfaces (svm and svmutil), which I think is what you are looking for. The example there is quite naive, although it is a good beginning.
If you want to work with libSVM in Python, let me suggest the scikit-learn package: it implements SVM using libSVM underneath, is much easier to use and better documented, and lets you control the same libSVM parameters.
I think you might be approaching this the wrong way. You seem to be expecting to use LIBSVM as if it were ls: just run man ls to get the parameters and view the results. SVMs are more complicated than that.
The authors of LIBSVM publish a document (not a scientific paper!) called: A Practical Guide to Support Vector Classification. You need to read and understand all that the authors explain there. The appendix to that guide gives multiple examples on many datasets and how to train and how to search for parameters (all things that are very important).
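Two of the steps that guide stresses are scaling the features and searching over C and gamma on an exponentially spaced grid. The sketch below shows both ideas in plain Python. The helper name scale_columns is hypothetical, and the exponent ranges are the coarse starting grid as I recall it from the guide, so treat them as an assumption and check the guide itself.

```python
from itertools import product

def scale_columns(rows, lower=-1.0, upper=1.0):
    """Linearly rescale each feature column to the range [lower, upper]."""
    scaled_columns = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        if hi == lo:
            # Constant column: no spread to rescale, so use the midpoint (a choice made here).
            scaled_columns.append([(lower + upper) / 2.0] * len(col))
        else:
            scaled_columns.append(
                [lower + (upper - lower) * (v - lo) / (hi - lo) for v in col]
            )
    return [list(row) for row in zip(*scaled_columns)]

# Assumed coarse grid: C = 2^-5, 2^-3, ..., 2^15 and gamma = 2^-15, 2^-13, ..., 2^3.
c_values = [2.0 ** e for e in range(-5, 16, 2)]
gamma_values = [2.0 ** e for e in range(-15, 4, 2)]
grid = list(product(c_values, gamma_values))

print(scale_columns([[1, 10], [3, 20]]))
print(len(grid), "candidate (C, gamma) pairs")
```

Each (C, gamma) pair would then be scored with cross-validation on the training set, and the test data should be scaled with the minimum and maximum taken from the training data.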
There is a README file in the python directory of the LIBSVM distribution. If you understand Python and you read the practical guide, you should be able to use it. If not, you should probably start from the command line examples to learn SVM, or start with something easier (not SVMs!) to learn Python. After reading and understanding that, you should be able to use all the examples from the appendix and invoke them from Python.
Once you've tried this you should be up and running in no time. If not, this is a great place to ask specific questions about problems you run into.
Here's a step-by-step guide for how to train an SVM using your data and then evaluate it using the same dataset. It's also available at http://nbviewer.ipython.org/gist/anonymous/2cf3b993aab10bf26d5f. At that URL you can also see the output of the intermediate data and the resulting accuracy (it's an IPython notebook).
Step 0: Install dependencies
You need to install the following libraries:
- pandas
- scikit-learn
From command line:
pip install pandas
pip install scikit-learn
Step 1: Load the data
We will use pandas to load our data. pandas is a library for easily loading data. For illustration, we first save sample data to a csv and then load it.
We will train the SVM with train.csv and get test labels with test.csv
import pandas as pd
train_data_contents = """
class_label,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
N,12,0,0,1"""
with open('train.csv', 'w') as output:
    output.write(train_data_contents)
train_dataframe = pd.read_csv('train.csv')
Step 2: Process the data
We will convert our dataframe into numpy arrays, which is a format that scikit-learn understands.
We also need to convert the labels "B", "M", "C",... to numbers because the SVM does not understand strings.
Then we will train an SVM with the data.
import numpy as np
train_labels = train_dataframe.class_label
labels = list(set(train_labels))
train_labels = np.array([labels.index(x) for x in train_labels])
train_features = train_dataframe.iloc[:,1:]
train_features = np.array(train_features)
print("train labels: ")
print(train_labels)
print()
print("train features:")
print(train_features)
We see here that the length of train_labels (5) exactly matches the number of rows
we have in train_features. Each item in train_labels corresponds to a row.
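One caveat with list(set(...)) is that set ordering is not guaranteed, so the number assigned to each label can change between runs. Here is a minimal sketch of a deterministic alternative, assuming you are happy to number the labels in sorted order (the helper names are my own):

```python
def build_label_index(labels):
    """Map each distinct label to an integer, assigned in sorted order."""
    return {name: i for i, name in enumerate(sorted(set(labels)))}

def encode(labels, label_index):
    """Turn a list of label names into a list of integers."""
    return [label_index[name] for name in labels]

def decode(numbers, label_index):
    """Turn predicted integers back into label names."""
    inverse = {i: name for name, i in label_index.items()}
    return [inverse[n] for n in numbers]

raw = ["B", "M", "C", "S", "N"]
label_index = build_label_index(raw)
encoded = encode(raw, label_index)
print(label_index)
print(encoded)
print(decode(encoded, label_index))
```

Building the mapping once from the training labels and reusing it for the test labels keeps the numbering consistent between the two.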
Step 3: Train the SVM
from sklearn import svm
classifier = svm.SVC()
classifier.fit(train_features, train_labels)
Step 4: Evaluate the SVM on some testing data
test_data_contents = """
class_label,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
N,12,0,0,1
"""
with open('test.csv', 'w') as output:
    output.write(test_data_contents)
test_dataframe = pd.read_csv('test.csv')
test_labels = test_dataframe.class_label
# Reuse the label-to-number mapping (labels) built from the training data,
# so each class gets the same number at test time.
test_labels = np.array([labels.index(x) for x in test_labels])
test_features = test_dataframe.iloc[:,1:]
test_features = np.array(test_features)
results = classifier.predict(test_features)
num_correct = (results == test_labels).sum()
# This ratio is the accuracy (fraction of correct predictions); float() avoids integer division.
accuracy = float(num_correct) / len(test_labels)
print("model accuracy (%): ", accuracy * 100, "%")
Links & Tips
- Example code for how to load LinearSVC: http://scikit-learn.org/stable/modules/svm.html#svm
- Long list of scikit-learn examples: http://scikit-learn.org/stable/auto_examples/index.html. I've found these mildly helpful but often confusing myself.
- If you find that the SVM is taking a long time to train, try LinearSVC instead: http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
- Here's another tutorial on getting familiar with machine learning models: http://scikit-learn.org/stable/tutorial/basic/tutorial.html
You should be able to take this code and replace train.csv with your training data, test.csv with your testing data, and get predictions for your test data, along with accuracy results.
Note that since you're evaluating using the data you trained on, the accuracy will be unusually high.
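To get a more honest estimate, hold some rows back from training. The sketch below is a plain-Python shuffle-and-split; the function name split_rows, the 80/20 ratio and the seed are all chosen for illustration, and scikit-learn ships its own splitting utilities that do the same job.

```python
import random

def split_rows(features, labels, test_fraction=0.2, seed=0):
    """Shuffle row indices and split them into a training and a test part."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    # Keep at least one row for testing.
    n_test = max(1, int(round(len(indices) * test_fraction)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return (pick(features, train_idx), pick(labels, train_idx),
            pick(features, test_idx), pick(labels, test_idx))

features = [[i, i * 2] for i in range(10)]
labels = [i % 2 for i in range(10)]
train_x, train_y, test_x, test_y = split_rows(features, labels)
print(len(train_x), "training rows,", len(test_x), "test rows")
```

Train on train_x and train_y only, then report accuracy on test_x and test_y.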
I echo the comment of @MarcoPashkov but will try to elaborate on the LibSVM file format. I find the documentation comprehensive yet hard to find; for the Python lib I recommend the README on GitHub.
An important piece to recognize is that there is a Sparse format, where all features which are 0 get removed, and a Dense format, where features which are 0 are not removed. These two equivalent examples of each are taken from the README.
# Dense data
>>> y, x = [1,-1], [[1,0,1], [-1,0,-1]]
# Sparse data
>>> y, x = [1,-1], [{1:1, 3:1}, {1:-1,3:-1}]
The y variable stores a list of all the categories for the data.
The x variable stores the feature vector.
assert len(y) == len(x), "Both lists should be the same length"
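To show that the two layouts really carry the same information, here is a small sketch that converts between them. The function names are hypothetical helpers of mine, not part of svmutil.

```python
def dense_to_sparse(vector):
    """Keep only the non-zero entries; keys are 1-based feature indices."""
    return {i: v for i, v in enumerate(vector, start=1) if v != 0}

def sparse_to_dense(sparse, length):
    """Fill in zeros for every feature index missing from the dictionary."""
    return [sparse.get(i, 0) for i in range(1, length + 1)]

dense = [[1, 0, 1], [-1, 0, -1]]
sparse = [dense_to_sparse(row) for row in dense]
print(sparse)  # [{1: 1, 3: 1}, {1: -1, 3: -1}]
print([sparse_to_dense(row, 3) for row in sparse])  # [[1, 0, 1], [-1, 0, -1]]
```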
The format found in the Heart Scale Example is a Sparse format: the first value on each line is the category, and each following entry pairs a feature index (the dictionary key) with a feature value (the dictionary value).
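For illustration, a line in that layout can be parsed by hand as sketched below. The sample line is made up, and parse_libsvm_line is a hypothetical helper; in practice svmutil's svm_read_problem reads a whole file in this format for you.

```python
def parse_libsvm_line(line):
    """Split 'label index:value index:value ...' into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features

sample = "+1 1:0.5 3:-1 7:0.25"  # made-up line in the same layout
label, features = parse_libsvm_line(sample)
print(label, features)
```

Feature 2 and features 4 to 6 are absent from the sample line, which means their value is 0.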
The Sparse format is incredibly useful while using a Bag of Words Representation for your feature vector.
As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).
For instance a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.
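A rough sketch of how a Bag of Words vector ends up in the Sparse format: the vocabulary assigns each word a 1-based index, and each document only stores the words it actually contains. The tokenizer (lowercase, split on whitespace) and the function names are simplifying assumptions of mine.

```python
from collections import Counter

def build_vocabulary(documents):
    """Assign a 1-based index to every distinct word, in sorted order."""
    words = sorted({w for doc in documents for w in doc.lower().split()})
    return {w: i for i, w in enumerate(words, start=1)}

def to_sparse_counts(document, vocabulary):
    """Return {feature index: word count} for the words present in the document."""
    counts = Counter(document.lower().split())
    return {vocabulary[w]: c for w, c in counts.items() if w in vocabulary}

docs = ["the cat sat", "the dog sat on the mat"]
vocab = build_vocabulary(docs)
vectors = [to_sparse_counts(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

With a realistic vocabulary of 100,000 words, each dictionary would still hold only the few hundred entries for the words in that document.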
For an example using the feature vector you started with, I trained a basic LibSVM 3.20 model. This code isn't meant to be used but may help in showing how to create and test a model.
from collections import namedtuple
# Using namedtuples for descriptive purposes, in actual code a normal tuple would work fine.
Category = namedtuple("Category", ["index", "name"])
Feature = namedtuple("Feature", ["category_index", "distance_from_beginning", "distance_from_end", "contains_digit", "capitalized"])
# Separate up the set of categories, libsvm requires a numerical index so we associate each with an index.
categories = dict()
for index, name in enumerate("B M C S NA".split(' ')):
    # LibSVM expects index to start at 1, not 0.
    categories[name] = Category(index + 1, name)
categories
Out[0]: {'B': Category(index=1, name='B'),
         'C': Category(index=3, name='C'),
         'M': Category(index=2, name='M'),
         'NA': Category(index=5, name='NA'),
         'S': Category(index=4, name='S')}
# Faked set of CSV input for example purposes.
csv_input_lines = """category_index,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
NA,12,0,0,1""".split("\n")
# We just ignore the header.
header = csv_input_lines[0]
# A list of Feature namedtuples, this will be trained as lists.
features = list()
for line in csv_input_lines[1:]:
    split_values = line.split(',')
    # Create a Feature with the values converted to integers.
    features.append(Feature(categories[split_values[0]].index, *map(int, split_values[1:])))
features
Out[1]: [Feature(category_index=1, distance_from_beginning=1, distance_from_end=10, contains_digit=1, capitalized=0),
         Feature(category_index=2, distance_from_beginning=10, distance_from_end=1, contains_digit=0, capitalized=1),
         Feature(category_index=3, distance_from_beginning=2, distance_from_end=3, contains_digit=0, capitalized=1),
         Feature(category_index=4, distance_from_beginning=23, distance_from_end=2, contains_digit=0, capitalized=0),
         Feature(category_index=5, distance_from_beginning=12, distance_from_end=0, contains_digit=0, capitalized=1)]
# Y is the category index used in training for each Feature. It is a list (order is important) of all the trained indexes.
y = [f.category_index for f in features]
# X is the list of feature vectors; for each one we keep all of the named tuple's values except the category, which is at index 0.
x = [list(f)[1:] for f in features]
from svmutil import svm_parameter, svm_problem, svm_train, svm_predict
# Barebones defaults for SVM
param = svm_parameter()
# The (Y,X) parameters should be the train dataset.
prob = svm_problem(y, x)
model=svm_train(prob, param)
# For actual accuracy checking, the (Y,X) parameters should be the test dataset.
p_labels, p_acc, p_vals = svm_predict(y, x, model)
Out[3]: Accuracy = 100% (5/5) (classification)
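The accuracy figure printed above is simply the share of predicted labels that equal the true ones. A plain-Python sketch of that calculation, with a hypothetical helper name and made-up label lists:

```python
def accuracy_percent(true_labels, predicted_labels):
    """Percentage of positions where the predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both lists should be the same length")
    if not true_labels:
        raise ValueError("need at least one label")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return 100.0 * correct / len(true_labels)

print(accuracy_percent([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))  # 100.0
print(accuracy_percent([1, 2, 3, 4, 5], [1, 2, 3, 4, 4]))  # 80.0
```

Here the 100% comes from predicting on the same five rows the model was trained on, so it says little about how the model would do on new data.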
I hope this example helps. It shouldn't be used for your actual training: it is inefficient and is meant as an example only.
Instead of going through libsvm in order to access it with Python (I installed libsvm through MacPorts, and import svmutil fails), you might want to install the popular scikit-learn package, which contains an optimized version of libsvm with Python bindings.
The install is very simple with MacPorts: sudo port install py27-scikit-learn (adapt py27 to whatever version of Python you use).
Seems like an old thread. Hope it helps someone else in the future.
I had the same problem. The solution is:
- Run make in the libsvm-3.0 directory
- Run make in the libsvm-3.0/python directory
If you run make only in the libsvm-3.0 folder, you will face this issue. Run it in both folders; then it will work fine.