I don’t think a custom CUDA allocator would help here as your use case sounds more like CPU-offloading. This post might be helpful. Answer from ptrblck on discuss.pytorch.org
Zdevito
zdevito.github.io › 2022 › 08 › 04 › cuda-caching-allocator.html
A guide to PyTorch’s CUDA Caching Allocator
August 4, 2022 - To accomplish its goal, the caching allocator requests blocks of memory from CUDA and figures out ways to split up and reuse these blocks without returning them to CUDA. Why not just request all GPU memory and manage it inside PyTorch? PyTorch is not the only library to use the CUDA APIs.
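The split-and-reuse idea described in that guide can be illustrated with a toy model. This is a minimal sketch in plain Python, not PyTorch's actual algorithm: the class name `ToyCachingAllocator`, its first-fit policy, and the fixed segment size are all assumptions made for illustration, and it never coalesces neighbouring free blocks.

```python
class ToyCachingAllocator:
    """Toy model of a caching allocator (hypothetical; not PyTorch's real code)."""

    def __init__(self, segment_size=1024):
        self.segment_size = segment_size  # bytes requested from the "driver" at a time
        self.free_blocks = []             # cached (offset, size) blocks available for reuse
        self.reserved = 0                 # total bytes obtained from the driver
        self.allocated = 0                # bytes currently handed out to callers
        self.driver_calls = 0             # stands in for expensive cudaMalloc calls

    def _request_segment(self, size):
        # Simulate one expensive driver allocation of a whole segment.
        size = max(size, self.segment_size)
        offset = self.reserved
        self.reserved += size
        self.driver_calls += 1
        self.free_blocks.append((offset, size))

    def malloc(self, size):
        # First fit: reuse a cached block if one is large enough.
        for i, (offset, block_size) in enumerate(self.free_blocks):
            if block_size >= size:
                del self.free_blocks[i]
                if block_size > size:
                    # Split the block and keep the unused tail in the cache.
                    self.free_blocks.append((offset + size, block_size - size))
                self.allocated += size
                return (offset, size)
        # Nothing cached fits, so ask the driver for a segment and retry.
        self._request_segment(size)
        return self.malloc(size)

    def free(self, block):
        # The block returns to the cache; nothing goes back to the driver.
        self.free_blocks.append(block)
        self.allocated -= block[1]


alloc = ToyCachingAllocator(segment_size=1024)
a = alloc.malloc(256)
b = alloc.malloc(256)
alloc.free(a)
c = alloc.malloc(128)  # served from cached memory, no new driver call
print(alloc.allocated, alloc.reserved, alloc.driver_calls)  # 384 1024 1
```

Four requests are served by a single simulated driver call, and the reserved total stays at 1024 bytes even after `a` is freed. This is the same reason memory held by a real caching allocator still looks "used" from outside the process.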
PyTorch Forums
discuss.pytorch.org › t › how-do-i-rewrite-the-gpu-memory-allocation-algorithm-of-pytorch › 179979
How do I rewrite the GPU memory allocation algorithm of PyTorch? - PyTorch Forums
May 15, 2023 - Hi, from my current browsing of the documentation, it seems that the only way to provide a custom CUDA memory allocator is via the CUDAPluggableAllocator class, correct? What I want to achieve is this: given a simple linear model in -> A -> B -> C -> D -> E -> out, I want to control where the GPU memory of these 5 nodes (A~E) is allocated/stored. (In fact, it would be great if I could control the allocation of weights between these nodes too.) It’s related to the gradient-checkpointing techniqu...
Medium
iamholumeedey007.medium.com › memory-management-using-pytorch-cuda-alloc-conf-dabe7adec130
Memory Management using PYTORCH_CUDA_ALLOC_CONF | by Shittu Olumide Ayodeji | Medium
June 24, 2023 - One key advantage of PYTORCH_CUDA_ALLOC_CONF is its ability to dynamically allocate and manage memory based on memory usage patterns during runtime. It supports dynamic memory allocation, allowing the framework to allocate memory on-demand and ...
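As a rough illustration of how this variable is typically set, here is a minimal sketch. The helper `build_alloc_conf` is hypothetical (it is not part of PyTorch), and the option values are placeholders rather than recommendations. `max_split_size_mb` and `garbage_collection_threshold` are documented allocator options, and the variable takes a comma-separated list of `key:value` pairs.

```python
import os


def build_alloc_conf(**options):
    """Format options as the comma-separated key:value string the variable expects.

    Hypothetical helper for illustration; not a PyTorch API.
    """
    return ",".join(f"{key}:{value}" for key, value in options.items())


# Placeholder values chosen for illustration; tune them for your workload.
conf = build_alloc_conf(max_split_size_mb=128, garbage_collection_threshold=0.8)

# Set this before CUDA is initialized in the process; the simplest way
# is to do it before importing torch at all.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = conf
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
# max_split_size_mb:128,garbage_collection_threshold:0.8
```

Exporting the same string in the shell before launching Python has the same effect and avoids any import-order concerns.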
APXML
apxml.com › courses › advanced-pytorch › chapter-1-pytorch-internals-autograd › memory-management
PyTorch Memory Management Strategies
Allocating and deallocating memory on GPUs using CUDA APIs (cudaMalloc, cudaFree) can be slow. To mitigate this, PyTorch employs a caching memory allocator for GPU tensors. When a tensor is freed (e.g., goes out of scope and its reference count drops to zero), the memory it occupied isn't ...
GitHub
github.com › pytorch › pytorch › blob › main › torch › cuda › memory.py
pytorch/torch/cuda/memory.py at main · pytorch/pytorch
``native`` (PyTorch's native caching allocator) and ``cudaMallocAsync`` (CUDA's built-in asynchronous allocator). · .. note:: See :ref:`cuda-memory-management` for details on choosing the allocator backend. """ return torch._C._cuda_getAllocatorBackend() ·
Author   pytorch
GitHub
github.com › zhuzilin › pytorch-malloc
GitHub - zhuzilin/pytorch-malloc: An external memory allocator example for PyTorch. · GitHub
An external memory allocator example for PyTorch. Contribute to zhuzilin/pytorch-malloc development by creating an account on GitHub.
Starred by 16 users
Forked by 3 users
Languages   C++ 78.1% | Python 21.3% | Makefile 0.6%
GitHub
github.com › pytorch › pytorch › blob › main › c10 › core › Allocator.h
pytorch/c10/core/Allocator.h at main · pytorch/pytorch
* total_allocated corresponds to total allocated memory. * * total_reserved corresponds to total size of memory pool, both used and · * unused, if applicable. */ virtual void reportMemoryUsage( void* ptr, int64_t alloc_size, size_t total_allocated, size_t total_reserved, Device device) = 0; ·
Author   pytorch
GitHub
github.com › pytorch › pytorch › issues › 43144
Using external memory allocator with PyTorch · Issue #43144 · pytorch/pytorch
August 17, 2020 - 🚀 Feature It would be useful to configure PyTorch to use an external memory allocator for its allocations. Motivation When working on GPUs, memory can be a somewhat limited resource. Particularly when using multiple libraries each handl...
Author   pytorch
GitHub
github.com › pytorch › pytorch › blob › main › c10 › cuda › CUDACachingAllocator.cpp
pytorch/c10/cuda/CUDACachingAllocator.cpp at main · pytorch/pytorch
The allocator now has an ... "The kernel on this machine does not support the pidfd_open syscall needed to use IPC for CUDA tensors when expandable_segments:True is set. " "Consider using expandable_segments:False via torch.cuda.memory._set_allocator_settings('expandabl...
Author   pytorch
DEV Community
dev.to › shittu_olumide_ › memory-management-using-pytorchcudaallocconf-5afh
Memory Management using PYTORCH_CUDA_ALLOC_CONF - DEV Community
June 24, 2023 - One key advantage of PYTORCH_CUDA_ALLOC_CONF is its ability to dynamically allocate and manage memory based on memory usage patterns during runtime. It supports dynamic memory allocation, allowing the framework to allocate memory on-demand and ...
Kshitij12345
kshitij12345.github.io › python, › pytorch › 2023 › 02 › 26 › External-CUDA-Allocator-With-PyTorch.html
External CUDA Allocator with PyTorch | Hacker’s Getaway
February 26, 2023 - In this case, cuDF has its own allocator, which allocates memory for the dataframe. After processing, when we create Tensors from that dataframe, PyTorch allocates using its own allocator. The cuDF allocator then marks the dataframe's memory as free but keeps holding it for future requests.
Top answer
1 of 2
3

I don't think the other answer is correct. Allocation and deallocation definitely happen at runtime. The thing to note is that CPU code runs asynchronously from GPU code, so you need to wait for any deallocation to finish before reserving more memory. Take a look at this:

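import torch

# 100 * 100 * 100 float32 values: about 4 MB on the GPU
a = torch.zeros(100, 100, 100).cuda()

print(torch.cuda.memory_allocated())

# Drop the only reference, then wait for pending GPU work to finish
del a
torch.cuda.synchronize()
print(torch.cuda.memory_allocated())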

Outputs

4000256
0

So you should del the tensors you don't need and call torch.cuda.synchronize() to make sure that the deallocation goes through before your CPU code continues to run.

In your specific case, after your function trn_l returns, any variables that were local to that function, and do not have references elsewhere, will be deallocated along with the corresponding GPU tensors. All you need to do is wait for this to happen by calling torch.cuda.synchronize() after the function call.

2 of 2
0

So, PyTorch does not allocate and deallocate GPU memory during training.

From https://pytorch.org/docs/stable/notes/faq.html#my-gpu-memory-isn-t-freed-properly:

PyTorch uses a caching memory allocator to speed up memory allocations. As a result, the values shown in nvidia-smi usually don’t reflect the true memory usage. See Memory management for more details about GPU memory management.

If your GPU memory isn’t freed even after Python quits, it is very likely that some Python subprocesses are still alive. You may find them via ps -elf | grep python and manually kill them with kill -9 [pid].

You can call torch.cuda.empty_cache() to free all unused cached memory (however, that is not really good practice, as memory re-allocation is time-consuming). Docs for empty_cache(): https://pytorch.org/docs/stable/cuda.html#torch.cuda.empty_cache
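The difference between memory that tensors occupy and memory the caching allocator merely holds can be observed with `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()`. The sketch below assumes a CUDA-capable machine; the function name `cache_demo` is made up for illustration, and the import guard is only there so the snippet runs as a no-op where PyTorch or a GPU is missing. Exact byte counts depend on the device and PyTorch version, so none are shown.

```python
try:
    import torch
except ImportError:  # lets the sketch run as a no-op where PyTorch is absent
    torch = None


def cache_demo():
    """Return (allocated, reserved) byte counts at three points, or None without CUDA."""
    if torch is None or not torch.cuda.is_available():
        return None
    stats = {}
    x = torch.zeros(100, 100, 100, device="cuda")
    stats["live"] = (torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
    del x  # the block goes back to PyTorch's cache, not to the driver
    stats["after_del"] = (torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
    torch.cuda.empty_cache()  # hand unused cached blocks back to the driver
    stats["after_empty_cache"] = (torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
    return stats


result = cache_demo()
print(result)
```

On a GPU machine, the allocated count should drop after `del` while the reserved count stays put until `empty_cache()` is called. That gap is why nvidia-smi tends to report more memory than the tensors actually use.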

PyTorch
pytorch.org › blog › understanding-gpu-memory-1
Understanding GPU Memory 1: Visualizing All Allocations over Time – PyTorch
December 14, 2023 - For further reference, see https://pytorch.org/docs/main/profiler.html. The Memory Profiler automatically generates categories based on the graph of tensor operations recorded during profiling. In this Memory Timeline collected using the Memory Profiler, we have the same training example as before. We can observe the gradients in blue are now being cleared from iteration to iteration. We can also notice that the optimizer state in yellow is allocated ...
Codecademy
codecademy.com › docs › pytorch › gpu acceleration with cuda › memory management
PyTorch | GPU Acceleration with CUDA | Memory Management | Codecademy
February 7, 2025 - Learn how to use PyTorch to build, train, and test artificial neural networks in this course. ... .max_memory_allocated(): Returns the peak GPU memory usage since the start of the program or last reset.
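A common way to use `.max_memory_allocated()` is to reset the peak counter, run a piece of work, and read the peak afterwards. The following is a minimal sketch under the assumption that a CUDA device is present; `peak_bytes` and `workload` are hypothetical names, and the import guard only exists so the snippet degrades to a no-op elsewhere.

```python
try:
    import torch
except ImportError:  # lets the sketch run as a no-op where PyTorch is absent
    torch = None


def peak_bytes(fn):
    """Run fn() and return the peak bytes allocated on the current CUDA device.

    Returns None when PyTorch or a CUDA device is unavailable.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    torch.cuda.reset_peak_memory_stats()  # start a fresh measurement window
    fn()
    torch.cuda.synchronize()  # make sure the queued GPU work has finished
    return torch.cuda.max_memory_allocated()


def workload():
    # Two 1024 x 1024 float32 tensors are live at once: the input and the product.
    a = torch.randn(1024, 1024, device="cuda")
    return a @ a


peak = peak_bytes(workload)
print(peak)
```

The value reported is the absolute peak since the reset, so it includes any tensors that were already allocated before `fn()` ran.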