Why is RBF kernel used in SVM? - Cross Validated
RUser4512 gave the correct answer: the RBF kernel works well in practice and is relatively easy to tune. It's the SVM equivalent of "no one's ever been fired for estimating an OLS regression": it's accepted as a reasonable default method. Clearly OLS isn't perfect in every (or even many) scenarios, but it's a well-studied and widely understood method. Likewise, the RBF kernel is well-studied and widely understood, and many SVM packages include it as a default method.
But the RBF kernel has a number of other properties. In these types of questions, when someone is asking about "why do we do things this way", I think it's important to also draw contrasts to other methods to develop context.
It is a stationary kernel, which means that it is invariant to translation. Suppose you are computing $K(x, x')$. A stationary kernel will yield the same value for $K(x + c, x' + c)$, where $c$ may be vector-valued of dimension to match the inputs. For the RBF, this is accomplished by working on the difference of the two vectors. For contrast, note that the linear kernel does not have the stationarity property.
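The stationarity property is easy to check numerically. A minimal numpy sketch (the function names here are illustrative, not from any particular library) comparing the RBF kernel against the linear kernel under a common translation:

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    # K(x, y) = exp(-gamma * ||x - y||^2): depends only on the difference x - y
    return np.exp(-gamma * np.sum((x - y) ** 2))

def linear_kernel(x, y):
    # K(x, y) = <x, y>: NOT translation invariant
    return np.dot(x, y)

x  = np.array([1.0, 2.0])
xp = np.array([3.0, 0.5])
c  = np.array([10.0, -4.0])   # an arbitrary translation

# The RBF kernel is unchanged when both inputs are shifted by c...
print(np.isclose(rbf_kernel(x, xp), rbf_kernel(x + c, xp + c)))      # True
# ...while the linear kernel changes value.
print(np.isclose(linear_kernel(x, xp), linear_kernel(x + c, xp + c)))  # False
```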
The single-parameter version of the RBF kernel, $K(x, x') = \exp\left(-\gamma \|x - x'\|^2\right)$, has the property that it is isotropic, i.e. the scaling by $\gamma$ occurs the same amount in all directions. This can be easily generalized, though, by slightly tweaking the RBF kernel to $K(x, x') = \exp\left(-(x - x')^\top \Gamma (x - x')\right)$, where $\Gamma$ is a p.s.d. matrix.
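A small sketch of the generalized (anisotropic) form, verifying that choosing $\Gamma = \gamma I$ recovers the single-parameter kernel (helper name is mine, for illustration):

```python
import numpy as np

def rbf_aniso(x, y, Gamma):
    # Generalized RBF: K(x, y) = exp(-(x - y)^T Gamma (x - y)),
    # where Gamma must be positive semi-definite.
    d = x - y
    return np.exp(-d @ Gamma @ d)

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])

# Isotropic special case: Gamma = gamma * I gives the usual RBF kernel.
gamma = 0.5
iso = rbf_aniso(x, y, gamma * np.eye(2))
print(np.isclose(iso, np.exp(-gamma * np.sum((x - y) ** 2))))  # True

# Anisotropic case: one direction is weighted much more than the other.
print(rbf_aniso(x, y, np.diag([2.0, 0.1])))
```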
Another property of the RBF kernel is that it is infinitely smooth. This is aesthetically pleasing, and somewhat satisfying visually, but perhaps it is not the most important property. Compare the RBF kernel to the Matérn kernel and you'll see that some kernels are quite a bit more jagged!
The moral of the story is that kernel-based methods are very rich, and with a little bit of work, it's very practical to develop a kernel suited to your particular needs. But if you use the RBF kernel as a default, you'll have a reasonable benchmark for comparison.
I think the good reason to use the RBF kernel is that it works well in practice and is relatively easy to calibrate, as opposed to other kernels.
The polynomial kernel has three parameters (offset, scaling, degree). The RBF kernel has one parameter, and there are good heuristics to find it. See, for example: SVM rbf kernel - heuristic method for estimating gamma
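One widely used rule of thumb for choosing the RBF bandwidth (not necessarily the exact heuristic in the linked thread) is the median heuristic: set $\gamma$ from the median pairwise squared distance in the training data. A minimal numpy sketch, with a function name of my own choosing:

```python
import numpy as np

def gamma_median_heuristic(X):
    # Median heuristic: gamma = 1 / median(||x_i - x_j||^2) over distinct pairs,
    # so a "typical" pair of points gets a kernel value of about exp(-1).
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    med = np.median(sq[np.triu_indices_from(sq, k=1)])  # off-diagonal pairs only
    return 1.0 / med

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
print(gamma_median_heuristic(X))
```

The appeal is that it adapts to the scale of the data without any cross-validation; it is usually treated as a starting point to refine by grid search.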
Linear separability in the feature space may not be the reason. Indeed, it is easy, with a Gaussian kernel, to enforce separability and perfect accuracy on the train set (by setting $\gamma$ to a large value). However, these models have very bad generalization.
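You can see why generalization fails directly in the Gram matrix: as $\gamma$ grows, every training point becomes similar only to itself, so the kernel matrix collapses toward the identity and the model effectively memorizes the training set. A numpy sketch under assumed random data:

```python
import numpy as np

def rbf_gram(X, gamma):
    # Pairwise Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))

# Moderate gamma: points share similarity with their neighbours.
K_small = rbf_gram(X, gamma=0.5)
# Huge gamma: each point is only similar to itself -> Gram matrix ~ identity,
# which is why train accuracy is perfect but generalization is terrible.
K_large = rbf_gram(X, gamma=1e6)

print(np.allclose(K_large, np.eye(len(X))))  # True
print(np.allclose(K_small, np.eye(len(X))))  # False
```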
Edit.
This short video shows the influence of increasing the bandwidth parameter on the decision boundary.
Hey everyone. I'm starting to learn machine learning and I'm having some conceptual issues learning what an RBF kernel is. Would anyone be able to give me an intuitive explanation of what these kernels are, since I'm also just beginning to learn linear algebra? Thanks!
EDIT: I would also love an intuitive explanation of the gamma parameter for SVMs. What is it exactly, and how does it affect the results? I'm classifying a Parkinson's disease dataset that consists of voice features (23 attributes), and changing the gamma value to 1 immediately increases my accuracy by 7%. Why does that happen?
The kernel function is a measure of similarity between two sets of features. So in this case, $x'$ and $x$ will both be $5\times 1$ feature vectors (not necessarily the same). $K(x,x')$ is a scalar that represents the similarity between $x$ and $x'$, and the kernel matrix $[K(x,x')]_{x\in X, x'\in X}$ is a $100\times 100$ matrix which represents the pairwise similarities.
The kernel function can be thought of as a cheap way of computing an infinite dimensional inner product - this 'kernel trick' is described in more detail in these notes. This lets the algorithm learn arbitrarily complex functions (though it may take an infinite number of samples for it to learn).
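For the RBF kernel this "infinite dimensional inner product" can be made concrete in one dimension: expanding $e^{2\gamma x x'}$ as a Taylor series gives an explicit (infinite) feature map whose truncated inner product converges to the kernel value. A sketch, with a feature-map helper named by me:

```python
import numpy as np
from math import factorial

gamma = 0.7
x, y = 0.8, -0.3

def phi(z, n_terms):
    # Truncated explicit feature map for the 1-D RBF kernel:
    # phi_k(z) = exp(-gamma*z^2) * sqrt((2*gamma)^k / k!) * z^k, k = 0, 1, ...
    # The full (infinite) map satisfies <phi(x), phi(y)> = exp(-gamma*(x-y)^2).
    return np.array([np.exp(-gamma * z * z) *
                     np.sqrt((2 * gamma) ** k / factorial(k)) * z ** k
                     for k in range(n_terms)])

exact  = np.exp(-gamma * (x - y) ** 2)   # kernel evaluated directly (cheap)
approx = phi(x, 20) @ phi(y, 20)         # finite inner product of 20 features
print(abs(exact - approx) < 1e-10)       # True: the series converges quickly
```

The point of the kernel trick is that the left-hand side costs one `exp` call, while the right-hand side would require infinitely many features to be exact.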
Here's my attempt at giving a non-mathy explanation of the RBF kernel. The kernel function gives you the distance between $x$ and $x'$. It's not like the regular Euclidean distance, though. I imagine living in an RBF kernel world as walking around in a little bubble, where anything within arm's reach of me follows normal Euclidean distances, but anything farther than that (outside the bubble) is warped to be extremely far away. The way this works with the math is that $\sigma$ is like the size of the bubble, and the $\exp$ function is what pushes everything to be really far away if it is not already close.
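The bubble intuition above can be sketched numerically, using the $\sigma$-parameterized form $K = \exp(-d^2 / 2\sigma^2)$: similarity stays near 1 inside the bubble and is crushed toward 0 outside it.

```python
import numpy as np

def rbf(d, sigma):
    # Similarity as a function of Euclidean distance d and "bubble size" sigma.
    return np.exp(-d ** 2 / (2 * sigma ** 2))

sigma = 1.0
for d in [0.1, 0.5, 1.0, 3.0, 10.0]:
    print(f"distance {d:5.1f} -> similarity {rbf(d, sigma):.6f}")
# Inside the bubble (d << sigma), similarity stays near 1;
# well outside it (d >> sigma), similarity collapses toward 0.
```

Note this uses the $\sigma$ parameterization from the answer; it relates to the earlier $\gamma$ form via $\gamma = 1 / (2\sigma^2)$.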