python - How to make sure PyTorch has deallocated GPU memory? - Stack Overflow
I don't think the other answer is correct. Allocation and deallocation definitely happen at runtime. The thing to note is that CPU code runs asynchronously from GPU code, so you need to wait for any deallocation to complete if you want to reserve more memory afterwards. Take a look at this:
import torch
a = torch.zeros(100,100,100).cuda()
print(torch.cuda.memory_allocated())
del a
torch.cuda.synchronize()
print(torch.cuda.memory_allocated())
Outputs
4000256
0
So you should del the tensors you don't need and call torch.cuda.synchronize() to make sure that the deallocation goes through before your CPU code continues to run.
In your specific case, after your function trn_l returns, any variables that were local to that function, and do not have references elsewhere, will be deallocated along with the corresponding GPU tensors. All you need to do is wait for this to happen by calling torch.cuda.synchronize() after the function call.
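A minimal sketch of that pattern, assuming a CUDA device is available. Here train_step is a hypothetical stand-in for trn_l, not the asker's actual function; it only creates a temporary tensor so that the counter has something to show.

```python
import torch


def train_step():
    # Hypothetical stand-in for trn_l: `tmp` is local, so its last reference
    # disappears when the function returns.
    tmp = torch.zeros(100, 100, 100, device="cuda")
    return float(tmp.sum())  # return a plain Python float, not a GPU tensor


def allocated_around_call():
    """Return (before, after) byte counts, or None when there is no GPU."""
    if not torch.cuda.is_available():
        return None
    before = torch.cuda.memory_allocated()
    train_step()
    torch.cuda.synchronize()  # wait for pending GPU work, as suggested above
    after = torch.cuda.memory_allocated()
    return before, after


result = allocated_around_call()
print(result)
```

If the function returned the tensor itself, or stored it in a list or on an object that outlives the call, the reference would survive and the second number would stay higher than the first.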
So, PyTorch does not allocate and deallocate GPU memory during training.
From https://pytorch.org/docs/stable/notes/faq.html#my-gpu-memory-isn-t-freed-properly:
PyTorch uses a caching memory allocator to speed up memory allocations. As a result, the values shown in nvidia-smi usually don’t reflect the true memory usage. See Memory management for more details about GPU memory management.
If your GPU memory isn’t freed even after Python quits, it is very likely that some Python subprocesses are still alive. You may find them via ps -elf | grep python and manually kill them with kill -9 [pid].
You can call torch.cuda.empty_cache() to free all unused memory (however, that is not really good practice, as memory re-allocation is time-consuming). Docs for empty_cache(): https://pytorch.org/docs/stable/cuda.html#torch.cuda.empty_cache
torch.cuda.memory_allocated() returns the memory that has been allocated, not the memory that has been "used".
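To see the difference between live tensors and the allocator's cache, one option is to print torch.cuda.memory_allocated() next to torch.cuda.memory_reserved(). This is a rough sketch that assumes a CUDA device; snapshot and demo are names made up for this example.

```python
import torch


def snapshot():
    # allocated: bytes occupied by live tensors
    # reserved:  bytes the caching allocator holds from the driver (live + cached)
    return torch.cuda.memory_allocated(), torch.cuda.memory_reserved()


def demo():
    """Return three (allocated, reserved) pairs, or None when there is no GPU."""
    if not torch.cuda.is_available():
        return None
    a = torch.zeros(100, 100, 100, device="cuda")
    with_tensor = snapshot()
    del a                     # the block goes back to PyTorch's cache
    after_del = snapshot()
    torch.cuda.empty_cache()  # unused cached blocks go back to the driver
    after_empty = snapshot()
    return with_tensor, after_del, after_empty


stages = demo()
if stages is not None:
    labels = ("tensor alive", "after del", "after empty_cache")
    for label, (allocated, reserved) in zip(labels, stages):
        print(f"{label:>18}: allocated={allocated} reserved={reserved}")
```

I would expect the allocated figure to drop right after del while the reserved figure only drops after empty_cache(). Tools such as nvidia-smi see the reserved memory (plus the CUDA context), which is why they tend to report more than memory_allocated().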
In a typical GPU compute pipeline, you would record operations in a queue along with whatever synchronization primitives your API offers. The GPU will then dequeue and execute those operations, respecting the enqueued synchronization primitives. However, GPU memory allocation is not usually an operation which even goes on the queue. Rather, there's usually some sort of fundamental instruction that the CPU can issue to the GPU in order to allocate memory, just as recording operations is another fundamental instruction. This means that the memory necessary for a GPU operation has to be allocated before the operation has even been enqueued; there is no "allocate memory" operation in the queue to synchronize with.
Consider Vulkan as a simple example. Rendering operations are enqueued on a graphics queue. However, memory is typically allocated via calls to vkAllocateMemory(), which does not accept any sort of queue at all; it only accepts the device handle and information about the allocation (size, memory type, etc). From my understanding, the allocation is done "immediately" / synchronously (the memory is safe to use by the time the function call returns on the CPU).
I don't know enough about GPUs to explain why this is the case, but I'm sure there's a good reason. And perhaps the limitations vary from device to device. But if I were to guess, memory allocation probably has to be a fairly centralized operation; it can't be done by just any core executing recorded operations on a queue. This would make sense, at least; the space of GPU memory is usually shared across cores.
Let's apply this knowledge to answer your question: When you call l.append(x**i), you're trying to record a compute operation. That operation will require memory to store the result, and so PyTorch is likely allocating the memory prior to enqueuing the operation. This explains the behavior you're seeing.
However, this doesn't invalidate PyTorch's claims about asynchronous compute. The memory might be allocated synchronously, but it won't be populated with the result of the operation until the operation has been dequeued and completed by the GPU, which indeed happens asynchronously.
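A small experiment along those lines, assuming a CUDA device is available. The function name and the choice of x ** 3 are only for illustration: the idea is to read the allocation counter once right after the operation is issued and once after waiting for the GPU.

```python
import torch


def allocation_vs_completion():
    """Return (before, after_enqueue, after_sync) byte counts, or None without a GPU."""
    if not torch.cuda.is_available():
        return None
    x = torch.randn(1000, 1000, device="cuda")
    torch.cuda.synchronize()  # start from an idle GPU
    before = torch.cuda.memory_allocated()
    y = x ** 3                # the kernel is queued; the call returns right away
    after_enqueue = torch.cuda.memory_allocated()
    torch.cuda.synchronize()  # now the kernel has definitely finished
    after_sync = torch.cuda.memory_allocated()
    del y
    return before, after_enqueue, after_sync


counts = allocation_vs_completion()
print(counts)
```

If the reasoning above holds, the second and third numbers are equal and both exceed the first: the result buffer is accounted for as soon as the operation is issued, and waiting for the kernel changes nothing about the allocation.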
I was able to reproduce your problem. I cannot really tell you why it behaves like that. I just think the (randomly) initialized tensor needs a certain amount of memory. For instance, if you call x = torch.randn(0,0, device='cuda'), the tensor does not allocate any GPU memory, while x = torch.zeros(1000,1000, device='cuda') allocates 4000256 bytes, as in your example.
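The figure 4000256 is consistent with one million float32 elements rounded up to the allocator's block granularity. The sketch below assumes 4 bytes per element (float32 is PyTorch's default dtype) and a 512-byte rounding step; the 512 is my understanding of the caching allocator's minimum block size, not something stated in the question.

```python
def rounded_allocation(num_elements, bytes_per_element=4, granularity=512):
    # Raw payload size, rounded up to the next multiple of `granularity`.
    raw = num_elements * bytes_per_element
    return -(-raw // granularity) * granularity


print(rounded_allocation(1000 * 1000))  # 4000256
print(rounded_allocation(0))            # 0
```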
To load the tensors lazily, I suggest you create them on the CPU and move them to the GPU shortly before using them. It is a kind of speed/memory tradeoff. I changed your code accordingly:
import torch

def unnecessary_compute():
    x = torch.randn(1000, 1000, device='cpu')
    l = []
    for i in range(5):
        print(i, torch.cuda.memory_allocated())
        l.append(x**i)
    print("Move to cuda")
    for i, tensor_x in enumerate(l):
        l[i] = tensor_x.to('cuda')
        print(i, torch.cuda.memory_allocated())

unnecessary_compute()
This produced the following output:
0 0
1 0
2 0
3 0
4 0
Move to cuda
0 4000256
1 8000512
2 12000768
3 16001024
4 20971520
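The last line reads 20971520 (exactly 20 MiB) rather than 5 × 4000256 = 20001280. One plausible explanation is the caching allocator's block-splitting policy. The sketch below is a toy model, not PyTorch code: the constants (512-byte rounding, a 20 MiB segment for requests of this size, no split when the leftover would be 1 MiB or less) are assumptions based on my reading of the allocator and may differ between versions.

```python
MIB = 1024 * 1024
GRANULARITY = 512        # assumed rounding step for each request
SEGMENT_SIZE = 20 * MIB  # assumed segment size for requests between 1 MiB and 10 MiB
MIN_LEFTOVER = 1 * MIB   # assumed: a leftover this small or smaller is not split off


def simulate(request_bytes, count):
    """Toy model: running 'allocated' totals for `count` equal requests in one segment."""
    size = -(-request_bytes // GRANULARITY) * GRANULARITY
    free = SEGMENT_SIZE
    allocated = 0
    totals = []
    for _ in range(count):
        if size > free:
            raise ValueError("toy model only covers a single segment")
        leftover = free - size
        # Hand out the whole free block when the leftover is too small to split.
        taken = size if leftover > MIN_LEFTOVER else free
        free -= taken
        allocated += taken
        totals.append(allocated)
    return totals


print(simulate(1000 * 1000 * 4, 5))
# [4000256, 8000512, 12000768, 16001024, 20971520]
```

Under these assumptions the fifth request would leave only 970240 bytes in the segment, so the model hands out the whole remaining block and the counter lands on the segment size.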