I don’t think a custom CUDA allocator would help here as your use case sounds more like CPU-offloading. This post might be helpful. Answer from ptrblck on discuss.pytorch.org
PyTorch Forums
discuss.pytorch.org › t › how-do-i-rewrite-the-gpu-memory-allocation-algorithm-of-pytorch › 179979
How do I rewrite the GPU memory allocation algorithm of PyTorch? - PyTorch Forums
May 15, 2023 - Hi, from my current browsing of the documentation, it seems that the only way to provide a custom CUDA memory allocator is by the CUDAPluggableAllocator class, correct? What I want to achieve is that given a simple linear model: in-> A->B->C->D->E-> out I want to be able to control where the GPU memory of these 5 nodes(A~E) will be allocated/stored.(in fact, it will be great if I can control the allocation of weights between these nodes too) It’s related to the gradient-checkpointing techniqu...
Zdevito
zdevito.github.io › 2022 › 08 › 04 › cuda-caching-allocator.html
A guide to PyTorch’s CUDA Caching Allocator
August 4, 2022 - To accomplish its goal, the caching allocator requests blocks of memory from CUDA and figures out ways to split up and reuse these blocks without returning them to CUDA. Why not just request all GPU memory and manage it inside PyTorch? PyTorch is not the only library to use the CUDA APIs.
Discussions

python - How to make sure PyTorch has deallocated GPU memory? - Stack Overflow
I want to have after it is called ... have 0 allocated GPU memory or as low as possible. What I have tried: do retain_graph=False and .cpu().detach() everywhere - no positive effects. ...
How to allocate more memory to pytorch - Stack Overflow
RuntimeError: CUDA out of memory. Tried to allocate 92.00 MiB (GPU 0; 24.00 GiB total capacity; 6.90 GiB already allocated; 14.90 GiB free; 6.98 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
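The hint at the end of that message depends on the gap between reserved and allocated memory, so it can help to compute the gap directly. The following is a minimal sketch that parses the older message format quoted above; `parse_oom_message` and its regular expressions are hypothetical helpers written for this example, not part of PyTorch, and newer PyTorch versions word the message differently.

```python
import re

# Binary units as they appear in the quoted error message
UNITS = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30}

def parse_oom_message(message):
    """Extract byte counts from an old-format 'CUDA out of memory' message.

    Hypothetical helper: the patterns only cover the wording quoted above.
    """
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (KiB|MiB|GiB)",
        "total": r"([\d.]+) (KiB|MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (KiB|MiB|GiB) already allocated",
        "free": r"([\d.]+) (KiB|MiB|GiB) free",
        "reserved": r"([\d.]+) (KiB|MiB|GiB) reserved",
    }
    fields = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            fields[name] = float(match.group(1)) * UNITS[match.group(2)]
    return fields

message = (
    "RuntimeError: CUDA out of memory. Tried to allocate 92.00 MiB "
    "(GPU 0; 24.00 GiB total capacity; 6.90 GiB already allocated; "
    "14.90 GiB free; 6.98 GiB reserved in total by PyTorch)"
)
fields = parse_oom_message(message)

# Memory cached by PyTorch but not backing any live tensor
slack_mib = (fields["reserved"] - fields["allocated"]) / 2**20
print(f"reserved but unallocated: {slack_mib:.2f} MiB")
```

In this particular message the gap is small compared with the reserved total, so reserved memory is not ">>" allocated memory and fragmentation inside PyTorch's cache is probably not the main cause here.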
python - Pytorch GPU memory allocation - Stack Overflow
I have self-modified car recognition model, taken from https://github.com/Helias/Car-Model-Recognition. I use Cuda and Pytorch:1.4.0. But even though I tried to use the answer on my previous questi...
February 19, 2020
APXML
apxml.com › courses › advanced-pytorch › chapter-1-pytorch-internals-autograd › memory-management
PyTorch Memory Management Strategies
Allocating and deallocating memory on GPUs using CUDA APIs (cudaMalloc, cudaFree) can be slow. To mitigate this, PyTorch employs a caching memory allocator for GPU tensors. When a tensor is freed (e.g., goes out of scope and its reference count drops to zero), the memory it occupied isn't ...
PyTorch
pytorch.org › blog › understanding-gpu-memory-1
Understanding GPU Memory 1: Visualizing All Allocations over Time – PyTorch
December 14, 2023 - In this Memory Timeline collected using the Memory Profiler, we have the same training example as before. We can observe the gradients in blue are now being cleared from iteration to iteration. We can also notice that the optimizer state in yellow is allocated after the first iteration, and is kept constant for the rest of the job. This optimizer state is the reason behind the increase of GPU memory from the first iteration to the second.
Medium
iamholumeedey007.medium.com › memory-management-using-pytorch-cuda-alloc-conf-dabe7adec130
Memory Management using PYTORCH_CUDA_ALLOC_CONF | by Shittu Olumide Ayodeji | Medium
June 24, 2023 - It is designed to optimize GPU memory allocation and improve performance during training and inference processes. It enables users to fine-tune the memory management behavior by configuring various aspects of CUDA memory allocation.
Codecademy
codecademy.com › docs › pytorch › gpu acceleration with cuda › memory management
PyTorch | GPU Acceleration with CUDA | Memory Management | Codecademy
February 7, 2025 - Learn how to use PyTorch to build, train, and test artificial neural networks in this course. ... .max_memory_allocated(): Returns the peak GPU memory usage since the start of the program or last reset.
Top answer
1 of 2
3

I don't think the other answer is correct. Allocation and deallocation definitely happen at runtime. The thing to note is that the CPU code runs asynchronously from the GPU code, so you need to wait for any deallocation to finish if you want to reserve more memory after it. Take a look at this:

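import torch

# 100 * 100 * 100 float32 values = 4,000,000 bytes of tensor data
a = torch.zeros(100,100,100).cuda()

print(torch.cuda.memory_allocated())  # bytes currently held by live tensors

del a  # drop the only reference to the tensor
torch.cuda.synchronize()  # wait for queued GPU work to finish
print(torch.cuda.memory_allocated())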

Outputs

4000256
0

So you should del the tensors you don't need and call torch.cuda.synchronize() to make sure that the deallocation goes through before your CPU code continues to run.

In your specific case, after your function trn_l returns, any variables that were local to that function, and do not have references elsewhere, will be deallocated along with the corresponding GPU tensors. All you need to do is wait for this to happen by calling torch.cuda.synchronize() after the function call.

2 of 2
0

So PyTorch does not allocate and deallocate memory from the GPU at training time.

From https://pytorch.org/docs/stable/notes/faq.html#my-gpu-memory-isn-t-freed-properly:

PyTorch uses a caching memory allocator to speed up memory allocations. As a result, the values shown in nvidia-smi usually don’t reflect the true memory usage. See Memory management for more details about GPU memory management.

If your GPU memory isn’t freed even after Python quits, it is very likely that some Python subprocesses are still alive. You may find them via ps -elf | grep python and manually kill them with kill -9 [pid].

You can call torch.cuda.empty_cache() to free all unused memory (however, that is not really good practice, as memory re-allocation is time-consuming). Docs of empty_cache(): https://pytorch.org/docs/stable/cuda.html#torch.cuda.empty_cache
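The difference between memory held by live tensors and memory merely cached by the allocator can be observed directly. The following is a minimal sketch, assuming PyTorch is installed; `report` and `demo` are hypothetical helper names, and the guarded import only exists so the snippet degrades gracefully on machines without PyTorch or without a CUDA device. Exact byte counts depend on the PyTorch version and on what else has run in the process, so none are shown.

```python
try:
    import torch
except ImportError:  # allows the sketch to run where PyTorch is missing
    torch = None

def report(label):
    """Print and return (allocated, reserved) byte counts for the current device."""
    allocated = torch.cuda.memory_allocated()  # bytes backing live tensors
    reserved = torch.cuda.memory_reserved()    # bytes held by the caching allocator
    print(f"{label:>20}: allocated={allocated:>12,} reserved={reserved:>12,}")
    return allocated, reserved

def demo():
    if torch is None or not torch.cuda.is_available():
        print("CUDA not available; nothing to measure")
        return None
    a = torch.zeros(100, 100, 100, device="cuda")
    alive = report("tensor alive")
    del a  # the block goes back to PyTorch's cache, not to the driver
    after_del = report("after del")
    torch.cuda.empty_cache()  # hand unused cached blocks back to CUDA
    after_empty = report("after empty_cache")
    return alive, after_del, after_empty

result = demo()
```

On a CUDA machine one would expect `allocated` to drop after `del` while `reserved` stays put until `empty_cache()` is called, which is why `nvidia-smi` can report more usage than the tensors account for.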

GitHub
github.com › pytorch › pytorch › blob › main › c10 › cuda › CUDACachingAllocator.cpp
pytorch/c10/cuda/CUDACachingAllocator.cpp at main · pytorch/pytorch
". GPU ", static_cast<int>(device_id), " has a total capacity of ", format_size(device_total), " of which ", format_size(device_free), " is free. ", proc_info, allowed_info, "Of the allocated memory ", format_size(allocated_bytes + allocated_in_private_pools), " is allocated by PyTorch, ", private_pool_msg, "and ", format_size( reserved_bytes - allocated_bytes - allocated_in_private_pools), " is reserved by PyTorch but unallocated.", " If reserved but unallocated memory is large try setting", " PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid" " fragmentation.
Author   pytorch
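The error text above suggests setting `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. A minimal sketch of doing that from Python follows; `build_alloc_conf` is a hypothetical helper written for this example, and it assumes the variable takes comma-separated `key:value` pairs and is read before the first CUDA allocation, so it is set before `torch` is imported.

```python
import os

def build_alloc_conf(**options):
    """Join allocator options into a comma-separated key:value string.

    Hypothetical helper; it does not validate option names.
    """
    return ",".join(f"{key}:{value}" for key, value in options.items())

# Set the variable before importing torch so the allocator sees it
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = build_alloc_conf(expandable_segments=True)
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])

# import torch  # import only after the environment variable is in place
```

Setting the variable in the shell that launches the script has the same effect and avoids any dependence on import order.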
GitHub
github.com › pytorch › pytorch › blob › main › torch › cuda › memory.py
pytorch/torch/cuda/memory.py at main · pytorch/pytorch
allocator so that those can be used in other GPU application and visible in `nvidia-smi`. Note: `torch.cuda.empty_cache()` doesn't increase the amount of GPU memory available for PyTorch. However, it may help reduce fragmentation
Author   pytorch
PyTorch Forums
discuss.pytorch.org › t › how-to-allocate-more-gpu-memory-to-be-reserved-by-pytorch-to-avoid-runtimeerror-cuda-out-of-memory › 149037
How to allocate more GPU memory to be reserved by PyTorch to avoid "RuntimeError: CUDA out of memory"? - PyTorch Forums
April 13, 2022 - Hello, I’m not experienced in PyTorch very well and perhaps asking a weird question. I’m running my PyTorch script in a docker container and I’m using GPU that has 48 GB. Although it has a larger capacity, somehow PyTorch is only using smaller than 10GiB and causing the “CUDA out of ...
MangoHost
mangohost.net › mangohost blog › pytorch memory management and multi-gpu debugging
PyTorch Memory Management and Multi-GPU Debugging
August 1, 2025 - Unlike some other frameworks, PyTorch ... passes. ... GPU Memory Allocator: PyTorch uses a caching allocator that requests large chunks of memory from CUDA and subdivides them internally...
Kshitij12345
kshitij12345.github.io › python, › pytorch › 2023 › 02 › 26 › External-CUDA-Allocator-With-PyTorch.html
External CUDA Allocator with PyTorch | Hacker’s Getaway
February 26, 2023 - PyTorch now supports external allocators which conform to the interface. RMM provides an implementation compatible with PyTorch. Users can now use multiple libraries which compute on GPU with tighter control on GPU memory.
Stack Overflow
stackoverflow.com › questions › 76199688 › how-to-allocate-more-memory-to-pytorch
How to allocate more memory to pytorch - Stack Overflow
Looks like something is stopping torch from accessing more than 7GB of memory on your card. Try running torch.cuda.empty_cache() at the beginning of your script; this will release all memory that can be safely freed.
Stack Overflow
stackoverflow.com › questions › 60305788 › pytorch-gpu-memory-allocation
python - Pytorch GPU memory allocation - Stack Overflow
February 19, 2020 - If I increase my BATCH_SIZE, PyTorch gives me more, but not enough. With BATCH_SIZE=256: CUDA out of memory. Tried to allocate 196.00 MiB (GPU 0; 4.00 GiB total capacity; 2.85 GiB already allocated; 93.80 MiB free; 2.87 GiB reserved in total by PyTorch)
Top answer
1 of 2
4

torch.cuda.memory_allocated() returns the memory that has been allocated, not the memory that has been "used".

In a typical GPU compute pipeline, you would record operations in a queue along with whatever synchronization primitives your API offers. The GPU will then dequeue and execute those operations, respecting the enqueued synchronization primitives. However, GPU memory allocation is not usually an operation which even goes on the queue. Rather, there's usually some sort of fundamental instruction that the CPU can issue to the GPU in order to allocate memory, just as recording operations is another fundamental instruction. This means that the memory necessary for a GPU operation has to be allocated before the operation has even been enqueued; there is no "allocate memory" operation in the queue to synchronize with.

Consider Vulkan as a simple example. Rendering operations are enqueued on a graphics queue. However, memory is typically allocated via calls to vkAllocateMemory(), which does not accept any sort of queue at all; it only accepts the device handle and information about the allocation (size, memory type, etc). From my understanding, the allocation is done "immediately" / synchronously (the memory is safe to use by the time the function call returns on the CPU).

I don't know enough about GPUs to explain why this is the case, but I'm sure there's a good reason. And perhaps the limitations vary from device to device. But if I were to guess, memory allocation probably has to be a fairly centralized operation; it can't be done by just any core executing recorded operations on a queue. This would make sense, at least; the space of GPU memory is usually shared across cores.

Let's apply this knowledge to answer your question: When you call l.append(x**i), you're trying to record a compute operation. That operation will require memory to store the result, and so PyTorch is likely allocating the memory prior to enqueuing the operation. This explains the behavior you're seeing.

However, this doesn't invalidate PyTorch's claims about asynchronous compute. The memory might be allocated synchronously, but it won't be populated with the result of the operation until the operation has been dequeued and completed by the GPU, which indeed happens asynchronously.

2 of 2
0

I was able to reproduce your problem. I cannot really tell you why it behaves like that. I just think the (randomly) initialized tensor needs a certain amount of memory. For instance, if you call x = torch.randn(0,0, device='cuda') the tensor does not allocate any GPU memory, whereas x = torch.zeros(1000,1000, device='cuda') allocates 4000256 bytes, as in your example.

To load the tensors lazily, I suggest you create them on the CPU and send them to the GPU shortly before using them. It's a kind of speed/memory tradeoff. I changed your code accordingly:

import torch

def unnecessary_compute():
    # Build the tensors on the CPU so no GPU memory is allocated yet
    x = torch.randn(1000, 1000, device='cpu')
    l = []
    for i in range(5):
        print(i, torch.cuda.memory_allocated())
        l.append(x**i)
    print("Move to cuda")
    # Transfer each tensor to the GPU only when it is needed
    for i, tensor_x in enumerate(l):
        l[i] = tensor_x.to('cuda')
        print(i, torch.cuda.memory_allocated())

unnecessary_compute()

that produced the following output:

0 0
1 0
2 0
3 0
4 0
Move to cuda
0 4000256
1 8000512
2 12000768
3 16001024
4 20971520
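The last step in that output grows by 4970496 bytes rather than 4000256, ending on exactly 20 MiB. One plausible explanation is the caching allocator's rounding and block-splitting policy. The following is a toy model in pure Python, not PyTorch code: the function names are hypothetical, and the constants (512-byte rounding, a 1 MiB small-allocation threshold, 20 MiB segments for mid-sized requests) are assumptions based on one reading of CUDACachingAllocator.cpp that may differ between versions. It also keeps a single free list, whereas the real allocator has separate small and large pools.

```python
# Assumed constants; check the allocator source for the version in use
MIN_BLOCK = 512              # requests are rounded up to a multiple of this
SMALL_SIZE = 1 << 20         # requests up to 1 MiB count as "small"
SMALL_BUFFER = 2 << 20       # segment size requested for small allocations
LARGE_BUFFER = 20 << 20      # segment size for requests between 1 and 10 MiB
MIN_LARGE_ALLOC = 10 << 20   # larger requests get their own rounded segment
ROUND_LARGE = 2 << 20

def round_up(value, multiple):
    return -(-value // multiple) * multiple

def round_size(nbytes):
    """Size actually charged for a request of nbytes."""
    return max(MIN_BLOCK, round_up(nbytes, MIN_BLOCK))

def segment_size(size):
    """Size of the segment requested from CUDA when no cached block fits."""
    if size <= SMALL_SIZE:
        return SMALL_BUFFER
    if size < MIN_LARGE_ALLOC:
        return LARGE_BUFFER
    return round_up(size, ROUND_LARGE)

def should_split(block_size, size):
    """A block is split only if the leftover piece is worth keeping."""
    remaining = block_size - size
    if size <= SMALL_SIZE:
        return remaining >= MIN_BLOCK
    return remaining > SMALL_SIZE

def simulate(request_bytes, count):
    """Return the running 'allocated' total after each of count requests."""
    free_blocks, allocated, totals = [], 0, []
    for _ in range(count):
        size = round_size(request_bytes)
        fitting = [b for b in free_blocks if b >= size]
        if fitting:
            block = min(fitting)          # best fit among cached blocks
            free_blocks.remove(block)
        else:
            block = segment_size(size)    # stand-in for a cudaMalloc call
        if should_split(block, size):
            free_blocks.append(block - size)
            block = size
        allocated += block                # the whole block is charged
        totals.append(allocated)
    return totals

# Five float32 tensors of shape (1000, 1000), 4,000,000 bytes each
print(simulate(1000 * 1000 * 4, 5))
```

Under these assumptions the fifth request finds a cached block whose leftover would be under 1 MiB, so the block is handed out whole and the total lands on the full 20 MiB segment, which matches the figures printed above.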
PyTorch Forums
discuss.pytorch.org › memory format
CPU Memory Allocation/Virtual Memory Allocation per GPU - Memory Format - PyTorch Forums
September 14, 2022 - When we run torch.cuda.is_available() it allocates 11GB for one GPU and 44.2GB when we use six GPUs. Then when we start the workers in the training loop that CPU allocation is copied to each worker, so we see this massive memory use.
Torch for R
torch.mlverse.org › docs › articles › memory-management
Memory management • torch
CUDA memory tends to be scarcer than CPU memory, also, allocation must be faster otherwise allocation overhead can counterbalance the speed up of GPU. To make allocations very fast and to avoid segmentation, LibTorch uses a caching allocator to manage the GPU memory, ie.
PyTorch Forums
discuss.pytorch.org › t › cpu-memory-allocation-when-using-a-gpu › 29478
CPU memory allocation when using a GPU - PyTorch Forums
November 13, 2018 - Hi, I have a question regarding ... up to about 10GB, and 135M in RAM (from almost non-existing). If I then run torch.rand((256, 256)).cuda() ......