machine learning kernel function
In machine learning, the radial basis function kernel, or RBF kernel, is a popular kernel function used in various kernelized learning algorithms. In particular, it is commonly used in support vector machine … Wikipedia
Radial basis function kernel - Wikipedia
3 weeks ago - Because support vector machines and other models employing the kernel trick do not scale well to large numbers of training samples or large numbers of features in the input space, several approximations to the RBF kernel (and similar kernels) have been introduced.
The RBF kernel in SVM: A Complete Guide - Quark Machine Learning
April 6, 2025 - The RBF kernel works by mapping the data into a high-dimensional space by finding the dot products and squares of all the features in the dataset and then performing the classification using the basic idea of Linear SVM. For projecting the data into a higher dimensional space, the RBF kernel uses the so-called radial basis function which can be written as:
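The snippet's formula is cut off, but the standard form (with the gamma convention used in the answers further down this page) is K(x, x') = exp(-gamma ||x - x'||^2). A minimal pure-Python sketch of computing it:

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel: K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Identical points have similarity 1; similarity decays with distance.
print(rbf_kernel([0, 0], [0, 0]))   # 1.0
print(rbf_kernel([0, 0], [1, 1]))   # exp(-2), about 0.135
```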
Discussions

Why is RBF kernel used in SVM? - Cross Validated
I learned that due to infinite series expansion of exponential function Radial Basis Kernel projects input feature space to infinite feature space. Is it due to this fact that we use this kernel o... More on stats.stackexchange.com
September 15, 2015
kernel - Explanation of how a radial basis function works in support vector machines - Stack Overflow
The RBF kernel of Support Vector Machine - Stack Overflow
The non-linear kernels allow the SVM to separate non-linear data linearly in a high dimensional space. The RBF kernel is probably the most popular non-linear kernel. I was told that the RBF kerne... More on stackoverflow.com
RBF SVM parameters — scikit-learn 1.8.0 documentation
The radius of the RBF kernel alone acts as a good structural regularizer. Increasing C further doesn’t help, likely because there are no more training points in violation (inside the margin or wrongly classified), or at least no better solution can be found. Scores being equal, it may make sense to use the smaller C values, since very high C values typically increase fitting time.
Top answer
1 of 2

RUser4512 gave the correct answer: the RBF kernel works well in practice and it is relatively easy to tune. It's the SVM equivalent of "no one's ever been fired for estimating an OLS regression": it's accepted as a reasonable default method. Clearly OLS isn't perfect in every (or even many) scenarios, but it's a well-studied method, and widely understood. Likewise, the RBF kernel is well-studied and widely understood, and many SVM packages include it as a default method.

But the RBF kernel has a number of other properties. In these types of questions, when someone is asking about "why do we do things this way", I think it's important to also draw contrasts to other methods to develop context.

It is a stationary kernel, which means that it is invariant to translation. Suppose you are computing $K(x, x')$. A stationary kernel will yield the same value for $K(x + c, x' + c)$, where $c$ may be vector-valued, with dimension matching the inputs. For the RBF, this is accomplished by working on the difference of the two vectors. For contrast, note that the linear kernel does not have the stationarity property.
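That contrast is easy to check numerically. A sketch (helper names are my own; the linear kernel here is the plain dot product):

```python
import math

def rbf(x, y, gamma=1.0):
    """RBF kernel: depends only on the difference x - y."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def linear(x, y):
    """Linear kernel: plain dot product, not stationary."""
    return sum(a * b for a, b in zip(x, y))

x, y, c = [1.0, 2.0], [3.0, -1.0], [5.0, 5.0]
shift = lambda v: [vi + ci for vi, ci in zip(v, c)]

# Translating both inputs by c leaves the RBF value unchanged:
assert math.isclose(rbf(x, y), rbf(shift(x), shift(y)))

# ...but changes the linear kernel value:
assert linear(x, y) != linear(shift(x), shift(y))
```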

The single-parameter version of the RBF kernel has the property that it is isotropic, i.e. the scaling by $\gamma$ occurs by the same amount in all directions. This can be easily generalized, though, by slightly tweaking the RBF kernel to $K(x, x') = \exp\left(-(x - x')^\top \Gamma (x - x')\right)$, where $\Gamma$ is a p.s.d. matrix.
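A sketch of that generalization, with a diagonal (hence p.s.d.) $\Gamma$ chosen purely for illustration:

```python
import math

def anisotropic_rbf(x, y, Gamma):
    """K(x, y) = exp(-(x - y)^T Gamma (x - y)) for a p.s.d. matrix Gamma."""
    d = [a - b for a, b in zip(x, y)]
    quad = sum(d[i] * Gamma[i][j] * d[j]
               for i in range(len(d)) for j in range(len(d)))
    return math.exp(-quad)

# Diagonal Gamma scales each direction differently (anisotropic);
# Gamma = gamma * I recovers the ordinary isotropic RBF kernel.
Gamma = [[2.0, 0.0], [0.0, 0.5]]
print(anisotropic_rbf([1.0, 0.0], [0.0, 0.0], Gamma))  # exp(-2): tight direction
print(anisotropic_rbf([0.0, 1.0], [0.0, 0.0], Gamma))  # exp(-0.5): loose direction
```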

Another property of the RBF kernel is that it is infinitely smooth. This is aesthetically pleasing, and somewhat satisfying visually, but perhaps it is not the most important property. Compare the RBF kernel to the Matérn kernel and you'll see that some kernels are quite a bit more jagged!

The moral of the story is that kernel-based methods are very rich, and with a little bit of work, it's very practical to develop a kernel suited to your particular needs. But if one is using an RBF kernel as a default, you'll have a reasonable benchmark for comparison.

2 of 2

I think the good reasons to use the RBF kernel are that it works well in practice and it is relatively easy to calibrate, as opposed to other kernels.

The polynomial kernel has three parameters (offset, scaling, degree). The RBF kernel has one parameter, and there are good heuristics to find it. See, for example: SVM rbf kernel - heuristic method for estimating gamma
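One widely used heuristic of this kind is the median heuristic: set the bandwidth from the median pairwise squared distance in the data. A sketch under that assumption (the exact constant varies by convention; this is not necessarily the method from the linked post):

```python
from itertools import combinations
from statistics import median

def median_heuristic_gamma(X):
    """gamma = 1 / median(||x_i - x_j||^2) over all pairs (one common convention)."""
    sq_dists = [sum((a - b) ** 2 for a, b in zip(x, y))
                for x, y in combinations(X, 2)]
    return 1.0 / median(sq_dists)

X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
print(median_heuristic_gamma(X))  # pairwise sq. distances are 1, 4, 5 -> gamma = 0.25
```

The appeal is that the resulting kernel values are neither all near 1 (gamma too small) nor all near 0 (gamma too large) on typical pairs from the data.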

Linear separability in the feature space may not be the reason. Indeed, with a Gaussian kernel it is easy to enforce separability and perfect accuracy on the training set (by setting $\gamma$ to a large value). However, these models generalize very badly.
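The mechanism behind that overfitting is visible directly in the kernel matrix: as gamma grows, every off-diagonal similarity collapses toward 0, so each training point is similar only to itself and the model effectively memorizes the training set. A small sketch:

```python
import math

def rbf(x, y, gamma):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

X = [[0.0], [1.0], [2.0]]

for gamma in (0.1, 1.0, 100.0):
    K = [[rbf(x, y, gamma) for y in X] for x in X]
    off_diag = max(K[i][j] for i in range(3) for j in range(3) if i != j)
    print(f"gamma={gamma:6.1f}  largest off-diagonal similarity: {off_diag:.2e}")
# As gamma grows, K approaches the identity matrix: perfect separability
# on the training set, but no information shared between points.
```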

Edit.

This short video shows the influence of increasing the bandwidth parameter on the decision boundary.

What is RBF kernel in SVM? - Quora
Apart from the classic linear kernel which assumes that the different classes are separated by a straight line, a RBF (radial basis function) kernel is used when the boundaries are hypothesized to be curve-shaped.
r/MachineLearning on Reddit: ELI5 what's a RBF kernel and how does it work?
February 1, 2015

Hey everyone. I'm starting to learn machine learning and I'm having some conceptual issues learning what an RBF kernel is. Would anyone be able to give me an intuitive explanation of what these kernels are, since I'm also beginning to learn linear algebra? Thanks!

EDIT: I would also love to have an intuitive explanation of the gamma parameter for SVMs. What is it exactly, and how does it affect the results? I'm classifying a Parkinson's Disease dataset that consists of voice features (23 attributes), and changing the gamma value to 1 immediately increases my accuracy by 7%. Why does that happen?

Top answer
1 of 5
Good question! The RBF kernel is a very common and useful concept in ML, particularly when used in SVMs. It may be helpful to refer to the wikipedia entry on the RBF kernel: http://en.wikipedia.org/wiki/Radial_basis_function_kernel

First, a kernel function in general is of the form K: X × X -> R where K(x, x') = K(x', x). You can think of it as a measure of similarity between x and x'. This is useful because classifiers in high-dimensional spaces can often use the kernel function to look for instances in a "neighborhood" of a labeled instance. In effect, you transform to a basis where each component is the similarity to a known instance. This represents the data in a space where it is often possible to linearly separate the classified instances.

The RBF kernel is a standard kernel function on R^n because it has just one free parameter (gamma, which I'll get to in a second) and satisfies the condition K(x, x') = K(x', x). More specifically, one way to think of the RBF kernel is this: if we assume x' is characteristic of some Gaussian distribution (it is the mean value of that distribution), then RBF(x, x') is the probability that x is another sample from that distribution. In this interpretation, gamma is related to the tunable variance of that distribution.

When you say you changed gamma to 1 and your accuracy increased, that means that the "true" underlying distributions your classifier is attempting to learn have a variance that is better explained by gamma = 1. Without knowing what your original gamma was (before you changed it to 1), it's hard for me to explain why this provides higher accuracy.

You can think of the RBF as defining a sort of ball or cloud around known instances, where the density of the cloud corresponds to how similar to the instance you are. As gamma gets bigger, the cloud gets smaller and more localized around the known instance. Hope this helps
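The "cloud" picture is easy to check numerically: for a fixed distance between x and x', increasing gamma shrinks the similarity, i.e. the cloud around each known instance gets tighter. A minimal sketch:

```python
import math

def rbf(x, y, gamma):
    """RBF(x, x') = exp(-gamma * ||x - x'||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

x, y = [0.0, 0.0], [1.0, 1.0]          # squared distance 2
for gamma in (0.1, 1.0, 10.0):
    print(f"gamma={gamma:4.1f}  RBF(x, x') = {rbf(x, y, gamma):.4f}")
# Larger gamma: smaller similarity at the same distance, i.e. a tighter cloud.
```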
2 of 5
http://www.oneweirdkerneltrick.com/
RBF SVM Parameters in Scikit Learn - GeeksforGeeks
April 28, 2025 - Therefore, it is important to carefully choose the value of gamma based on the specific dataset and problem at hand. Significance: In Support Vector Machines (SVM), the kernel function plays a vital role in the classification of data.
Random radial basis function kernel-based support vector machine - ScienceDirect
October 21, 2021 - We prove the universal approximation capability of a SVM with the RRBF kernel, proposing a simple model selection algorithm. The experiments on benchmark datasets show that SVM with the RRBF kernel clearly outperforms the traditional RBF kernel and other popular kernels, and the results are quite insensitive to
Introduction to RBF SVM: A Powerful Machine Learning Algorithm for Non-Linear Data | by Sahel Eskandar | Medium
March 26, 2023 - RBF SVM works by mapping the input ... a kernel function, such as the Radial Basis Function, to measure the similarity between pairs of data points in the feature space....
Top answer
1 of 1

(so an rbf is the correct choice?)

It depends. RBF is a very simple, generic kernel which might be used, but there are dozens of others. Take a look, for example, at the ones included in pykernels: https://github.com/gmum/pykernels

When the SVM is trained it will plot a hyperplane (which I think is like a plane in 3d but with more dimensions?) that best separates the data.

Let's avoid some confusion. Nothing is plotted here. The SVM will look for a hyperplane defined by v (normal vector) and b (bias, distance from the origin), which is simply the set of points x such that <v, x> = b. In 2D a hyperplane is a line, in 3D it is a plane; in (d+1)-dimensional space it is a d-dimensional object, always one dimension lower than the space (a line is 1D, a plane is 2D).

When tuning, changing the value of gamma changes the surface of the hyperplane (also called the decision boundary?).

Now this is a common mistake. The decision boundary is not a hyperplane. The decision boundary is a projection of the hyperplane onto the input space. You cannot observe the actual hyperplane, as it is often of very high dimension. You can express this hyperplane as a functional equation, but nothing more. The decision boundary, on the other hand, "lives" in your input space; if the input is low-dimensional, you can even plot this object. But it is not a hyperplane, it is just the way the hyperplane intersects your input space. This is why the decision boundary is often curved or even discontinuous even though the hyperplane is always linear and continuous: you are just seeing a nonlinear section through it.

Now what is gamma doing? The RBF kernel leads to optimization in the space of continuous functions. There are plenty of these (there is a continuum of such objects). However, the SVM can express only a tiny fraction of them: linear combinations of kernel values at the training points. Fixing a particular gamma limits the set of functions to consider: the bigger the gamma, the narrower the kernels, thus the functions being considered consist of linear combinations of such "spiky" distributions. So gamma itself does not change the surface, it changes the space of considered hypotheses.

So an increase in the value of gamma results in a Gaussian which is narrower. Is this like saying that the bumps on the plane (if plotted in 3d) are allowed to be narrower to fit the training data better? Or in 2D, is this like saying gamma defines how bendy the line that separates the data can be?

I think I answered this with the previous point: high gamma means that you only consider hyperplanes of the form

<v, x> - b = SUM_i alpha_i K_gamma(x_i, x) - b

where K_gamma(x_i, x) = exp(-gamma ||x_i - x||^2), thus you will get very "spiky" elements of your basis. This will lead to a very tight fit to your training data. The exact shape of the decision boundary is hard to estimate, as it depends on the optimal Lagrange multipliers alpha_i selected during training.
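A sketch of evaluating that decision function in 1D, with made-up multipliers alpha_i (in a real SVM they come out of training), showing how large gamma makes each basis element a narrow spike around its support vector:

```python
import math

def rbf(x, y, gamma):
    return math.exp(-gamma * (x - y) ** 2)

def decision(x, support, alphas, b, gamma):
    """f(x) = sum_i alpha_i * K_gamma(x_i, x) - b  (1-D toy example)."""
    return sum(a * rbf(xi, x, gamma) for xi, a in zip(support, alphas)) - b

# Hypothetical values, chosen only to illustrate the shape of f:
support, alphas, b = [0.0, 2.0], [1.0, -1.0], 0.0

for gamma in (0.5, 50.0):
    at_sv = decision(0.0, support, alphas, b, gamma)
    nearby = decision(0.5, support, alphas, b, gamma)
    print(f"gamma={gamma:5.1f}  f(0.0)={at_sv:+.3f}  f(0.5)={nearby:+.3f}")
# With gamma=50 the response at x=0.5 is already near zero: the basis
# functions are spikes, and f is a tight fit around the training points.
```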

I'm also very confused about how this can lead to an infinite-dimensional representation from a finite number of features? Any good analogies would help me greatly.

The "infinite representation" comes from the fact that, in order to work with vectors and hyperplanes, each of your points is actually mapped to a continuous function. So the SVM, internally, is not really working with d-dimensional points anymore; it is working with functions.

Consider the 2D case: you have points [0,0] and [1,1]. This is a simple 2D problem. When you apply an SVM with the RBF kernel here, you will instead work with an unnormalized Gaussian distribution centered at [0,0] and another one at [1,1]. Each such Gaussian is a function from R^2 to R, which expresses its probability density function (pdf). It is a bit confusing because the kernel looks like a Gaussian too, but this is only because the dot product of two functions is usually defined as the integral of their product, and the integral of the product of two Gaussians is... a Gaussian too!

So where is this infinity? Remember that you are supposed to work with vectors. How do you write down a function as a vector? You would have to list all its values; thus, if you have a function f(x) = 1/sqrt(2*pi*sigma^2) exp(-||x-m||^2 / (2*sigma^2)), you would have to list an infinite number of such values to fully define it. And this is the concept of infinite dimension: you are mapping points to functions, functions are infinite-dimensional in terms of vector spaces, thus your representation is infinitely dimensional.

One good example might be a different mapping. Consider a 1D dataset of the numbers 1,2,3,4,5,6,7,8,9,10. Let's assign the odd numbers a different label than the even ones. You cannot linearly separate these guys. But you can instead map each point (number) to a kind of characteristic function, a function of the form

f_x(y) = 1 iff x ∈ [y-0.5, y+0.5]

Now, in the space of all such functions, I can easily linearly separate the ones created from odd x's from the rest by simply building a hyperplane with the equation

<v, f_x> = SUM_{v odd} <f_v, f_x> = SUM_{v odd} INTEGRAL (f_v * f_x)(y) dy

And this will equal 1 iff x is odd, as only then will one of the integrals be nonzero. Obviously I am just using a finite number of training points (the odd v's here), but the representation itself is infinite-dimensional. Where is this additional "information" coming from? From my assumptions: the way I defined the mapping introduces a particular structure into the space I am considering. Similarly with RBF: you get infinite dimension, but it does not mean you are actually considering every continuous function; you are limiting yourself to linear combinations of Gaussians centered at the training points. Similarly, you could use a sinusoidal kernel, which limits you to combinations of sinusoidal functions. The choice of a particular "best" kernel is a whole other story, complex and without clear answers. Hope this helps a bit.
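The odd/even construction above can be run directly: the inner product of two such box functions is the overlap length of their supports, which is 1 exactly when the two centers coincide and 0 otherwise. A sketch:

```python
def box_inner(a, b):
    """<f_a, f_b> = integral of f_a * f_b = overlap of [a-0.5, a+0.5] and [b-0.5, b+0.5]."""
    return max(0.0, min(a + 0.5, b + 0.5) - max(a - 0.5, b - 0.5))

def score(x, odd_points):
    """<v, f_x> where v is the sum of f_v over the odd training points."""
    return sum(box_inner(v, x) for v in odd_points)

odds = [1, 3, 5, 7, 9]
for x in range(1, 11):
    label = "odd" if score(x, odds) > 0.5 else "even"
    print(x, label)
# Every odd number scores 1, every even number scores 0: the classes are
# linearly separable in the (infinite-dimensional) function space.
```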

Radial Basis Function (RBF) Kernel: The Go-To Kernel | Towards Data Science
January 21, 2025 - You’re working on a Machine Learning algorithm like Support Vector Machines for non-linear datasets and you can’t seem to figure out the right feature transform or the right kernel to use. Well, fear not because Radial Basis Function (RBF) Kernel is your savior.
Top answer
1 of 1

1) Could anyone explain why the dimension of the feature space after mapping corresponds to the derivatives of the kernel? I am not clear on this part.

It has nothing to do with being differentiable; the linear kernel is also infinitely differentiable and does not map to any higher-dimensional space. Whoever told you that this is the reason either lied or did not understand the math behind it. The infinite dimension comes from the mapping

phi(x) = Nor(x, sigma^2)

in other words, you are mapping your point to a function, a Gaussian distribution, which is an element of the L^2 space, an infinite-dimensional space of continuous functions, where the scalar product is defined as the integral of the product of the functions, so

<f,g> = int f(a)g(a) da

and as such

<phi(x),phi(y)> = int Nor(x,sigma^2)(a)Nor(y,sigma^2)(a) da 
                = X exp(-(x-y)^2 / (4sigma^2) )

for some normalising constant X (which is completely unimportant here). In other words, the Gaussian kernel is a scalar product between two functions, which live in an infinite-dimensional space.
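That inner product can be checked numerically. A sketch using midpoint-rule integration; in this normalization the constant X works out to 1/(2 sigma sqrt(pi)):

```python
import math

def gauss_pdf(a, m, sigma):
    """Nor(m, sigma^2) evaluated at a."""
    return math.exp(-(a - m) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def inner(x, y, sigma, lo=-20.0, hi=20.0, n=80_000):
    """Midpoint-rule approximation of int Nor(x, s^2)(a) Nor(y, s^2)(a) da."""
    h = (hi - lo) / n
    return sum(gauss_pdf(lo + (i + 0.5) * h, x, sigma)
               * gauss_pdf(lo + (i + 0.5) * h, y, sigma)
               for i in range(n)) * h

x, y, sigma = 1.0, 3.0, 1.0
closed_form = math.exp(-(x - y) ** 2 / (4 * sigma ** 2)) / (2 * sigma * math.sqrt(math.pi))
print(inner(x, y, sigma), closed_form)  # the two values agree closely
```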

2) There are many non-linear kernels, such as the polynomial kernel, and I believe they are also able to map the data from a low-dimensional space to an infinite-dimensional space. But why is the RBF kernel more popular than them?

The polynomial kernel maps into a feature space with O(d^p) dimensions, where d is the input space dimension and p is the polynomial degree, so it is far from infinite. Why is the Gaussian kernel popular? Because it works, is quite easy to use, and is fast to compute. From a theoretical point of view, it also has guarantees of learning any arbitrary set of points (with small enough variances used).

Radial Basis Function Kernel - Machine Learning - GeeksforGeeks
July 12, 2025 - The versatility and effectiveness ... the RBF kernel is commonly used to map data points into a higher-dimensional space where a linear decision boundary can be constructed to separate classes....
SVM Classifier and RBF Kernel - How to Make Better Models in Python | Towards Data Science
January 23, 2025 - Using this three-dimensional space with x, y, and z coordinates, we can now draw a hyperplane (flat 2D surface) to separate red and black points. Hence, the SVM classification algorithm can now be used. RBF is the default kernel used within the sklearn’s SVM classification algorithm and can ...
The RBF Kernel | R
The RBF kernel is a decreasing function of the distance between two points. It thus serves to simulate the principle of k Nearest Neighbors, namely that the closer two points are to each other in terms of attributes, the more likely it is that they are similar.
SVM with RBF Kernel | Data Platform and Data Science
July 10, 2021 - For example, distance of 3 from ... SVM with RBF Kernel is a machine learning algorithm which is capable to classify data points separated with radial based shapes like this:...
SVM RBF Kernel Parameters With Code Examples
July 28, 2020 - When the data set is linearly inseparable or in other words, the data set is non-linear, it is recommended to use kernel functions such as RBF. For a linearly separable dataset (linear dataset) one could use linear kernel function (kernel="linear"). ...
Top answer
1 of 1

You can possibly start by looking at one of my answers here:
Non-linear SVM classification with RBF kernel

In that answer, I attempt to explain what a kernel function is trying to do. Once you get a grasp of that, as a follow-up, you can read my answer to a question on Quora: https://www.quora.com/Machine-Learning/Why-does-the-RBF-radial-basis-function-kernel-map-into-infinite-dimensional-space/answer/Arun-Iyer-1

Reproducing the content of the answer on Quora, in case you don't have a Quora account.

Question: Why does the RBF (radial basis function) kernel map into infinite dimensional space? Answer: Consider the polynomial kernel of degree 2 defined by $$k(x, y) = (x^Ty)^2,$$ where $x, y \in \mathbb{R}^2$ and $x = (x_1, x_2), y = (y_1, y_2)$.

Thus, the kernel function can be written as $$k(x, y) = (x_1y_1 + x_2y_2)^2 = x_{1}^2y_{1}^2 + 2x_1x_2y_1y_2 + x_{2}^2y_{2}^2.$$ Now, let us try to come up with a feature map $\Phi$ such that the kernel function can be written as $k(x, y) = \Phi(x)^T\Phi(y)$.

Consider the following feature map, $$\Phi(x) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)$$ Basically, this feature map is mapping the points in $\mathbb{R}^2$ to points in $\mathbb{R}^3$. Also, notice that, $$\Phi(x)^T\Phi(y) = x_1^2y_1^2 + 2x_1x_2y_1y_2 + x_2^2y_2^2$$ which is essentially our kernel function.

This means that our kernel function is actually computing the inner/dot product of points in $\mathbb{R}^3$. That is, it is implicitly mapping our points from $\mathbb{R}^2$ to $\mathbb{R}^3$.
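The identity above can be verified directly in a few lines; a minimal sketch:

```python
import math

def poly2_kernel(x, y):
    """k(x, y) = (x^T y)^2 for x, y in R^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    """Explicit feature map into R^3: (x1^2, sqrt(2) x1 x2, x2^2)."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, -1.0)
print(poly2_kernel(x, y), dot(phi(x), phi(y)))  # equal up to floating-point error
```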

Exercise Question: If your points are in $\mathbb{R}^n$, a polynomial kernel of degree 2 will implicitly map them to some vector space F. What is the dimension of this vector space F? Hint: Everything I did above is a clue.

Now, coming to RBF.

Let us consider the RBF kernel again for points in $\mathbb{R}^2$. Then, the kernel can be written as $$k(x, y) = \exp(-\|x - y\|^2) = \exp(- (x_1 - y_1)^2 - (x_2 - y_2)^2)$$ $$= \exp(- x_1^2 + 2x_1y_1 - y_1^2 - x_2^2 + 2x_2y_2 - y_2^2) $$ $$ = \exp(-\|x\|^2) \exp(-\|y\|^2) \exp(2x^Ty)$$ (assuming gamma = 1). Using the Taylor series, you can write this as $$k(x, y) = \exp(-\|x\|^2) \exp(-\|y\|^2) \sum_{n = 0}^{\infty} \frac{(2x^Ty)^n}{n!}.$$ Now, if we were to come up with a feature map $\Phi$ just like we did for the polynomial kernel, you would realize that the feature map would map every point in $\mathbb{R}^2$ to an infinite vector. Thus, RBF implicitly maps every point to an infinite dimensional space.
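The Taylor-series factorization can be checked numerically: truncating the series at a modest number of terms already reproduces the kernel value. A sketch (with gamma = 1, as in the derivation):

```python
import math

def rbf(x, y):
    """k(x, y) = exp(-||x - y||^2), gamma = 1."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)))

def rbf_via_series(x, y, terms=30):
    """exp(-||x||^2) exp(-||y||^2) sum_n (2 x^T y)^n / n!  (truncated)."""
    xy = sum(a * b for a, b in zip(x, y))
    nx = sum(a * a for a in x)
    ny = sum(b * b for b in y)
    series = sum((2 * xy) ** n / math.factorial(n) for n in range(terms))
    return math.exp(-nx) * math.exp(-ny) * series

x, y = [0.5, 1.0], [1.0, 0.5]
print(rbf(x, y), rbf_via_series(x, y))  # the truncated series matches closely
```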

Exercise Question: Derive the first few components of the feature map for the RBF kernel in the above case.

Now, from the above answer, we can conclude something:

  • It may be quite hard to predict, in general, what the mapping function $\Phi$ looks like for an arbitrary kernel, though for some cases, like the polynomial and RBF kernels, we can see what it looks like.
  • Even when we know the mapping function, the exact effect the kernel will have on our set of points may be hard to predict. However, in certain cases we can say some things. For example, look at the $\Phi$ map given above for the degree-2 polynomial kernel on $\mathbb{R}^2$. It looks like $\Phi(x) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)$. From this we can determine that the map collapses diametrically opposite quadrants, i.e. the first and third quadrants are mapped to the same set of points, and the second and fourth quadrants are mapped to the same set of points. Therefore, this kernel allows us to solve the XOR problem! In general, however, it may be harder to predict such behaviour for multidimensional spaces, and it gets harder still in the case of RBF kernels.
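The quadrant-collapsing claim, and the resulting XOR separation, can be checked in a few lines:

```python
import math

def phi(x):
    """Feature map of the degree-2 polynomial kernel on R^2."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

# Diametrically opposite quadrants collapse: phi(x) == phi(-x).
assert phi((1.0, 2.0)) == phi((-1.0, -2.0))

# XOR becomes linearly separable: the sign of the middle component
# (sqrt(2) * x1 * x2) alone splits the two XOR classes.
xor_data = {(1, 1): +1, (-1, -1): +1, (1, -1): -1, (-1, 1): -1}
for point, label in xor_data.items():
    assert (1 if phi(point)[1] > 0 else -1) == label
print("XOR separated by a single linear component in feature space")
```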