GPU operations additionally have to move data to and from GPU memory
The problem is that your GPU operation always has to copy its input into GPU memory and then copy the result back, which is quite costly.
NumPy, on the other hand, processes the data directly from CPU/main memory, so there is almost no delay. Additionally, your matrices are extremely small, so even in the best case there should only be a minute difference.
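To make the trade-off concrete, here is a rough back-of-the-envelope model for an n x n float32 matrix multiplication. Every number in it (CPU and GPU throughput, PCIe bandwidth, fixed per-call overhead) is a made-up round figure for illustration, not a measurement, and the function names are hypothetical. It only sketches the shape of the argument: the GPU pays a fixed cost plus a transfer cost before its faster arithmetic can win.

```python
# Toy cost model; all default rates below are ASSUMED round numbers,
# not measurements of any real hardware.

def cpu_seconds(n, cpu_flops=100e9):
    """Estimated time for an n x n matmul on the CPU (about 2*n^3 FLOPs)."""
    return 2 * n**3 / cpu_flops


def gpu_seconds(n, gpu_flops=10e12, pcie_bytes_per_s=10e9,
                fixed_overhead_s=20e-6):
    """Estimated time on the GPU, including host <-> device copies."""
    compute = 2 * n**3 / gpu_flops
    # Two float32 inputs go to the device, one float32 result comes back.
    transfer = 3 * n * n * 4 / pcie_bytes_per_s
    return fixed_overhead_s + transfer + compute


def break_even_size(max_n=1 << 14):
    """Smallest n at which the modelled GPU time beats the CPU time."""
    for n in range(1, max_n + 1):
        if gpu_seconds(n) < cpu_seconds(n):
            return n
    return None


for n in (4, 64, 1024, 4096):
    print(f"n={n:5d}  cpu={cpu_seconds(n):.2e}s  gpu={gpu_seconds(n):.2e}s")
print("break-even n under these assumptions:", break_even_size())
```

With different assumed rates the break-even point moves, but for tiny matrices the fixed overhead dominates the GPU side of the model regardless.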
This is also part of the reason why you use mini-batches when training neural networks on a GPU: instead of having several extremely small operations, you now have "one big bulk" of numbers that you can process in parallel.
Also note that GPU clock speeds are generally way lower than CPU clocks, so the GPU only really shines because it has way more cores. If your matrix does not utilize all of them fully, you are also likely to see a faster result on your CPU.
TL;DR: If your matrix is big enough, you will eventually see a speed-up with CUDA over NumPy, even with the additional cost of the GPU transfer.
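If you want to check this on your own machine, a minimal timing sketch could look like the following. It assumes NumPy and PyTorch are installed; `best_of` and `compare` are hypothetical helper names, not part of either library. The GPU timing deliberately includes the host-to-device and device-to-host copies, since that is the cost discussed above.

```python
import time


def best_of(fn, repeats=5):
    """Return the fastest wall-clock time of `repeats` calls to fn, in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare(n):
    """Time an n x n float32 matmul in NumPy and, if available, on CUDA."""
    # Imported lazily so the timing helper above works without either library.
    import numpy as np
    import torch

    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    results = {"numpy": best_of(lambda: a @ b)}

    if torch.cuda.is_available():
        def on_gpu():
            # Host -> device copy, multiply, device -> host copy.
            # The final .cpu() blocks until the GPU has finished.
            ta = torch.from_numpy(a).to("cuda")
            tb = torch.from_numpy(b).to("cuda")
            return (ta @ tb).cpu()

        on_gpu()  # warm-up: the first CUDA call pays one-time initialisation
        results["cuda incl. transfer"] = best_of(on_gpu)
    return results
```

Calling `compare(n)` for a few sizes, say 3, 256 and 4096, should show NumPy ahead for the tiny case. Where the crossover lands depends entirely on your hardware.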
Pytorch tensor constructor speed vs numpy
Is pytorch faster than numpy on a single CPU?
Tensors vs Numpy Arrays
Numerical differences between numpy and pytorch?
A while ago I benchmarked PyTorch against NumPy for fairly basic matrix operations (broadcast, multiplication, inversion). I didn't run the benchmark for a variety of sizes though. PyTorch seemed markedly faster than NumPy, possibly because it was using more than one core (the hardware had a dozen cores). Is that a general rule even when PyTorch is constrained to a single core?
As I'm trying to learn PyTorch, I notice a heavy focus on tensors as a datatype, but I'm not really clear what their advantages are over NumPy arrays. After all, NumPy arrays can be 0, 1, 2, and even 3 dimensional, so I'm just unclear on the advantage of tensors, and I wanted to ask you guys, "When and why do we use tensors instead of NumPy arrays?"