I am going to present four different methods for computing such a kernel, followed by a comparison of their run-times.
Using pure numpy
Here, I use the fact that ||x-y||^2 = ||x||^2 + ||y||^2 - 2 * x^T * y, which follows from expanding (x-y)^T * (x-y).
import numpy as np

# squared L2 norm of each row of X
X_norm = np.sum(X ** 2, axis = -1)
# broadcast the norms into pairwise squared distances, then apply the kernel
K = var * np.exp(-gamma * (X_norm[:,None] + X_norm[None,:] - 2 * np.dot(X, X.T)))
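As a quick sanity check (not part of the benchmark), the broadcasting trick can be compared against a naive double loop on a small random input; rbf_naive and X_small below are hypothetical helper names for this check only:

import numpy as np

def rbf_naive(X, gamma, var):
    # reference implementation: explicit loop over all sample pairs
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = var * np.exp(-gamma * np.sum((X[i] - X[j]) ** 2))
    return K

X_small = np.random.randn(50, 8)
gamma, var = 0.01, 5.0
X_norm = np.sum(X_small ** 2, axis = -1)
K = var * np.exp(-gamma * (X_norm[:,None] + X_norm[None,:] - 2 * np.dot(X_small, X_small.T)))
assert np.allclose(K, rbf_naive(X_small, gamma, var))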
Using numexpr
numexpr is a Python package that allows for efficient and parallelized array operations on numpy arrays. We can use it as follows to perform the same computation as above:
import numpy as np
import numexpr as ne
X_norm = np.sum(X ** 2, axis = -1)
K = ne.evaluate('v * exp(-g * (A + B - 2 * C))', {
'A' : X_norm[:,None],
'B' : X_norm[None,:],
'C' : np.dot(X, X.T),
'g' : gamma,
'v' : var
})
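Since numexpr parallelizes across threads, its speed-up depends on how many worker threads it uses. A small sketch of how one could control that (set_num_threads is part of the numexpr API; the count 4 here is just an example):

import numexpr as ne

# limit numexpr to 4 worker threads; returns the previous setting.
# Alternatively, set the NUMEXPR_NUM_THREADS environment variable.
prev = ne.set_num_threads(4)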
Using scipy.spatial.distance.pdist
We could also use scipy.spatial.distance.pdist to compute a condensed (non-redundant) array of pairwise squared Euclidean distances, apply the kernel to that array, and then expand it to a square matrix:
import numpy as np
from scipy.spatial.distance import pdist, squareform
K = squareform(var * np.exp(-gamma * pdist(X, 'sqeuclidean')))
# squareform leaves zeros on the diagonal, but the kernel value there is var * exp(0) = var
np.fill_diagonal(K, var)
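If the detour through the condensed distance array is undesirable, scipy.spatial.distance.cdist can produce the full square matrix directly, which avoids both squareform and the diagonal fix; a sketch:

import numpy as np
from scipy.spatial.distance import cdist

# cdist returns the full n x n distance matrix, including the zero diagonal
K = var * np.exp(-gamma * cdist(X, X, 'sqeuclidean'))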
Using sklearn.metrics.pairwise.rbf_kernel
sklearn provides a built-in function that computes an RBF kernel directly:
from sklearn.metrics.pairwise import rbf_kernel

# rbf_kernel has no variance parameter, so scale its result by var
K = var * rbf_kernel(X, gamma = gamma)
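Note that rbf_kernel computes exp(-gamma * ||x-y||^2) without a variance factor, which is why the result is scaled by var. A quick equivalence check against the pure numpy variant (on a small, purely illustrative input):

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X_small = np.random.randn(100, 16)
gamma, var = 0.01, 5.0
X_norm = np.sum(X_small ** 2, axis = -1)
K_np = var * np.exp(-gamma * (X_norm[:,None] + X_norm[None,:] - 2 * np.dot(X_small, X_small.T)))
assert np.allclose(K_np, var * rbf_kernel(X_small, gamma = gamma))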
Run-time comparison
I use 25,000 random samples of 512 dimensions for testing and perform experiments on an Intel Core i7-7700HQ (4 cores @ 2.8 GHz). More precisely:
X = np.random.randn(25000, 512)
gamma = 0.01
var = 5.0
Each method is run 7 times, and the mean and standard deviation of the time per execution are reported.
| Method                              | Time (mean ± std over 7 runs) |
|-------------------------------------|-------------------------------|
| numpy | 24.2 s ± 1.06 s |
| numexpr | 8.89 s ± 314 ms |
| scipy.spatial.distance.pdist | 2min 59s ± 312 ms |
| sklearn.metrics.pairwise.rbf_kernel | 13.9 s ± 757 ms |
First of all, scipy.spatial.distance.pdist is surprisingly slow, presumably because it computes the distances pair by pair in a single-threaded loop instead of delegating the bulk of the work to a BLAS matrix product.
numexpr is almost 3 times faster than the pure numpy method, but this speed-up factor will vary with the number of available CPUs.
sklearn.metrics.pairwise.rbf_kernel is not the fastest option: it is roughly 1.5 times slower than numexpr, though still noticeably faster than pure numpy.
Well, you are doing a lot of optimizations in your answer post. I would like to add a few more (mostly tweaks). I will build upon the winner from the answer post, which seems to be the numexpr-based method.
Tweak #1
First off, np.sum(X ** 2, axis = -1) could be optimized with np.einsum. This part isn't the biggest overhead, but optimization of any sort won't hurt. So, that summation could be expressed as -
X_norm = np.einsum('ij,ij->i',X,X)
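A quick check that the einsum form agrees with the original row-wise sum (random data, illustrative only):

import numpy as np

X = np.random.randn(1000, 512)
assert np.allclose(np.einsum('ij,ij->i', X, X), np.sum(X ** 2, axis = -1))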
Tweak #2
Secondly, we could leverage the BLAS functions exposed by SciPy and, if permissible, use a single-precision dtype for a noticeable performance improvement over double precision. Hence, np.dot(X, X.T) could be computed with SciPy's sgemm like so -
sgemm(alpha=1.0, a=X, b=X, trans_b=True)
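One caveat worth noting: sgemm works in single precision, so it casts float64 input down to float32 and returns a float32 result, which is where the speed-up (and a small accuracy loss) comes from. A minimal check, with a loosened tolerance to account for the lower precision:

import numpy as np
from scipy.linalg.blas import sgemm

X = np.random.randn(1000, 512)
C = sgemm(alpha=1.0, a=X, b=X, trans_b=True)
print(C.dtype)  # float32
assert np.allclose(C, np.dot(X, X.T), rtol=1e-4, atol=1e-3)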
A few more tweaks: rearranging the negative sign with gamma lets us feed more of the work into sgemm, and we can also fold gamma into the alpha term.
Tweaked implementations
Thus, with these two optimizations, we would have two more variants (if I could put it that way) of the numexpr method, listed below -
import numpy as np
import numexpr as ne
from scipy.linalg.blas import sgemm

def app1(X, gamma, var):
    # negate the norms so the whole exponent can be written with '+' signs
    X_norm = -np.einsum('ij,ij->i',X,X)
    return ne.evaluate('v * exp(g * (A + B + 2 * C))', {
        'A' : X_norm[:,None],
        'B' : X_norm[None,:],
        'C' : np.dot(X, X.T),
        'g' : gamma,
        'v' : var
    })

def app2(X, gamma, var):
    # fold gamma into the norms and into sgemm's alpha term
    X_norm = -gamma*np.einsum('ij,ij->i',X,X)
    return ne.evaluate('v * exp(A + B + C)', {
        'A' : X_norm[:,None],
        'B' : X_norm[None,:],
        'C' : sgemm(alpha=2.0*gamma, a=X, b=X, trans_b=True),
        'v' : var
    })
Runtime test
Numexpr based one from your answer post -
def app0(X, gamma, var):
    X_norm = np.sum(X ** 2, axis = -1)
    return ne.evaluate('v * exp(-g * (A + B - 2 * C))', {
        'A' : X_norm[:,None],
        'B' : X_norm[None,:],
        'C' : np.dot(X, X.T),
        'g' : gamma,
        'v' : var
    })
Timings and verification -
In [165]: # Setup
...: X = np.random.randn(10000, 512)
...: gamma = 0.01
...: var = 5.0
In [166]: %timeit app0(X, gamma, var)
...: %timeit app1(X, gamma, var)
...: %timeit app2(X, gamma, var)
1 loop, best of 3: 1.25 s per loop
1 loop, best of 3: 1.24 s per loop
1 loop, best of 3: 973 ms per loop
In [167]: np.allclose(app0(X, gamma, var), app1(X, gamma, var))
Out[167]: True
In [168]: np.allclose(app0(X, gamma, var), app2(X, gamma, var))
Out[168]: True
