GPU operations have to additionally get memory to/from the GPU
The problem is that a GPU operation always has to copy its input into GPU memory and then copy the results back, and that transfer is quite costly.
NumPy, on the other hand, directly processes the data from the CPU/main memory, so there is almost no delay here. Additionally, your matrices are extremely small, so even in the best-case scenario, there should only be a minute difference.
This is also partially the reason why you use mini-batches when training on a GPU in neural networks: Instead of having several extremely small operations, you now have "one big bulk" of numbers that you can process in parallel.
Also note that GPU clock speeds are generally way lower than CPU clocks, so the GPU only really shines because it has way more cores. If your matrix does not utilize all of them fully, you are also likely to see a faster result on your CPU.
TL;DR: If your matrix is big enough, you will eventually see a speed-up with CUDA over NumPy, even with the additional cost of the GPU transfer.
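To check this on your own machine, here is a minimal timing sketch. It assumes NumPy is installed and treats PyTorch and a CUDA device as optional; the function names and matrix sizes are made up for illustration, and the numbers you get will depend entirely on your hardware. The GPU timing deliberately includes the copy to the device and the copy back.

```python
import time

import numpy as np

try:
    import torch
except ImportError:  # PyTorch is optional for this sketch
    torch = None


def time_numpy_matmul(n, repeats=10):
    """Average seconds per n x n float32 matmul in NumPy (CPU, main memory)."""
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    return (time.perf_counter() - start) / repeats


def time_cuda_matmul(n, repeats=10):
    """Average seconds per matmul on the GPU, including both transfers.

    Returns None when PyTorch or a CUDA device is not available.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    a = torch.rand(n, n)
    b = torch.rand(n, n)
    # Warm-up so one-time CUDA initialisation is not counted.
    (a.cuda() @ b.cuda()).cpu()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        # host -> device, compute, device -> host
        (a.cuda() @ b.cuda()).cpu()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats


for n in (8, 512):
    print(n, time_numpy_matmul(n), time_cuda_matmul(n))
```

For tiny matrices the transfer usually dominates the GPU timing; the crossover point, if there is one, has to be measured rather than guessed.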
Pytorch tensor constructor speed vs numpy
Tensors vs Numpy Arrays
Is pytorch faster than numpy on a single CPU?
Numerical differences between numpy and pytorch?
As I'm trying to learn PyTorch, I notice a heavy focus on tensors as a datatype, but I'm not really clear on what their advantages are over NumPy arrays. After all, NumPy arrays can be 0-, 1-, 2-, and even 3-dimensional, so I'm just unclear on the advantage of tensors, and I wanted to ask you guys, "When and why do we use tensors instead of NumPy arrays?"
A while ago I benchmarked PyTorch against NumPy for fairly basic matrix operations (broadcast, multiplication, inversion). I didn't run the benchmark for a variety of sizes, though. PyTorch seemed markedly faster than NumPy, possibly because it was using more than one core (the hardware had a dozen cores). Is that a general rule even when constraining PyTorch to a single core?
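One way to test the single-core question is to pin PyTorch's intra-op thread count with torch.set_num_threads and rerun the benchmark. The sketch below is only an illustration: the helper name and sizes are made up, PyTorch is treated as optional, and it does not constrain NumPy, whose BLAS threading is typically controlled by environment variables such as OMP_NUM_THREADS set before NumPy is imported (which variable applies depends on the BLAS backend).

```python
import time

try:
    import torch
except ImportError:  # PyTorch is optional for this sketch
    torch = None


def time_torch_cpu_matmul(n, num_threads, repeats=10):
    """Average seconds per n x n CPU matmul with a fixed intra-op thread count.

    Returns None when PyTorch is not installed.
    """
    if torch is None:
        return None
    previous = torch.get_num_threads()
    torch.set_num_threads(num_threads)
    try:
        a = torch.rand(n, n)
        b = torch.rand(n, n)
        a @ b  # warm-up
        start = time.perf_counter()
        for _ in range(repeats):
            a @ b
        return (time.perf_counter() - start) / repeats
    finally:
        # Restore the previous setting so other code is unaffected.
        torch.set_num_threads(previous)


if torch is not None:
    default_threads = torch.get_num_threads()
    print("1 thread:", time_torch_cpu_matmul(512, 1))
    print(default_threads, "threads:", time_torch_cpu_matmul(512, default_threads))
```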
Unfortunately there's really no way to specifically speed up torch's method of computing the outer product torch.ger() without a vast amount of effort.
Explanation and Options
NumPy's np.outer() is so fast because the heavy lifting is done in C. You can see the function here: https://github.com/numpy/numpy/blob/7e3d558aeee5a8a5eae5ebb6aef03de892a92ebd/numpy/core/numeric.py#L1123 where it uses operations from the umath C source code.
Pytorch's torch.ger() function is written in C++ here: https://github.com/pytorch/pytorch/blob/7ce634ebc2943ff11d2ec727b7db83ab9758a6e0/aten/src/ATen/native/LinearAlgebra.cpp#L142 which makes it ever so slightly slower as you can see in your example.
Your options to "speed up computing outer product in PyTorch" would be to add a C implementation for outer product in pytorch's native code, or make your own outer product function while interfacing with C using something like Cython if you really don't want to use numpy (which wouldn't make much sense).
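Before going down either route, it may be worth checking that the outer product is really what you need to optimize. As a small sketch (vector sizes are arbitrary, PyTorch is treated as optional), the outer product can also be written as a plain broadcast multiply in both libraries, which is easy to benchmark against np.outer() on your own data. Note that the PyTorch documentation lists torch.ger() as a deprecated alias of torch.outer(), so the sketch uses the latter.

```python
import numpy as np

try:
    import torch
except ImportError:  # PyTorch is optional for this sketch
    torch = None

a = np.arange(3, dtype=np.float32)
b = np.arange(4, dtype=np.float32)

# NumPy: the dedicated function and the equivalent broadcast multiply.
outer_np = np.outer(a, b)
outer_np_broadcast = a[:, None] * b[None, :]

if torch is not None:
    ta = torch.from_numpy(a)
    tb = torch.from_numpy(b)
    # PyTorch: torch.outer and the same broadcast formulation.
    outer_torch = torch.outer(ta, tb)
    outer_torch_broadcast = ta[:, None] * tb[None, :]
    print(torch.equal(outer_torch, outer_torch_broadcast))

print(np.array_equal(outer_np, outer_np_broadcast))
```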
P.S.
Also, just as an aside: a GPU would only speed up the parallel computation itself, and that gain may not outweigh the time required to transfer data between RAM and GPU memory.
A very nice solution is to combine both.
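    import numpy as np

    class LazyFrames(object):
        def __init__(self, frames):
            # Keep references to the individual frames; nothing is copied here.
            self._frames = frames

        def __array__(self, dtype=None):
            # Concatenate only when NumPy asks for an actual array.
            out = np.concatenate(self._frames, axis=0)
            if dtype is not None:
                out = out.astype(dtype)
            return out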
Here, frames could simply be your PyTorch tensors, for instance.
This object ensures that frames shared between observations are stored only once. It exists purely to optimize memory usage, which can be huge (e.g. DQN's 1M-frame replay buffers). The object should be converted to a NumPy array only right before being passed to the model.
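A short usage sketch of the idea, assuming NumPy frames of shape (1, 84, 84); the frame shape, the number of frames, and the variable names are made up for illustration. The class is repeated so the snippet runs on its own, and the extra copy=None parameter is an addition here (newer NumPy versions may pass it to __array__), not part of the original.

```python
import numpy as np


class LazyFrames(object):
    def __init__(self, frames):
        self._frames = frames

    def __array__(self, dtype=None, copy=None):
        # `copy` is accepted only for compatibility with newer NumPy; it is unused.
        out = np.concatenate(self._frames, axis=0)
        if dtype is not None:
            out = out.astype(dtype)
        return out


# Five hypothetical grayscale frames, each with a leading axis of length 1.
frames = [np.full((1, 84, 84), i, dtype=np.uint8) for i in range(5)]

# Two consecutive observations of four stacked frames each.
obs = LazyFrames(frames[0:4])
next_obs = LazyFrames(frames[1:5])

# The overlapping frames are the very same array objects, not copies.
shares_frames = obs._frames[1] is next_obs._frames[0]

# Materialize a real array only when the model needs it.
batch = np.asarray(obs, dtype=np.float32)
print(shares_frames, batch.shape, batch.dtype)
```

Each observation holds only references, so a replay buffer of such objects stores every frame once, however many observations include it.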
Reference : https://github.com/Shmuma/ptan/blob/master/ptan/common/wrappers.py