Another one, a mix between 1.i) and 1.ii): if you append tensors with computed gradients to python lists for tracking purposes, the gradients also get inserted in the list and it grows a bit more than expected! Also, leaks can find their way in computer memory (RAM, not GPU mem), so it can be usefu… Answer from alex.veuthey on discuss.pytorch.org
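A minimal sketch of the list-tracking pitfall described above, assuming a toy `torch.nn.Linear` model and random data (both hypothetical, chosen only so the snippet runs on CPU). Appending the loss tensor itself keeps its autograd history reachable, whereas appending `loss.item()` stores a plain Python float:

```python
import torch

# Hypothetical toy setup; any model and loss would behave the same way.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

leaky_history = []  # tensors that still carry a grad_fn
safe_history = []   # plain Python floats

for step in range(3):
    x = torch.randn(8, 4)
    y = torch.randn(8, 1)

    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()

    leaky_history.append(loss)        # keeps the tensor and its autograd node alive
    safe_history.append(loss.item())  # copies the scalar value out of the tensor

print(type(leaky_history[0]), type(safe_history[0]))
```

If the tensor itself is needed later, `loss.detach()` is the usual alternative to `.item()`; it keeps the tensor (and its device memory) but drops the link to the graph.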
PyTorch Forums
discuss.pytorch.org › t › memory-leak-debugging-and-common-causes › 67339
Memory Leak Debugging and Common Causes - PyTorch Forums
January 22, 2020 - Most of the memory leak threads I found were unhelpful so I wanted to throw together a few tips here. causes of leaks: i) most threads talk about leaks caused by creating an array that holds tensors, if you continually add tensors to this array, ...
Reddit
reddit.com › r/pytorch › need help debugging memory leaks in pytorch
r/pytorch on Reddit: Need Help Debugging Memory Leaks in PyTorch
February 6, 2023 -

Hi everyone,

I'm working on a deep learning project using PyTorch and I'm running into some issues with memory leaks. I've attached a screenshot of my GPU memory usage which shows a steady increase over time, indicating that there is a leak somewhere in my code.

I would greatly appreciate any advice or tips on how to track down and fix these memory leaks. I've tried a few different approaches so far, but I'm still not sure where the problem lies.

If you have any experience with debugging memory leaks in PyTorch, I would love to hear from you. Any suggestions or code snippets would be especially helpful!

Thanks in advance for your help!

screenshot of GPU memory usage

nvtop stats

During the first training epoch, memory usage is roughly 40-45%. When training finishes, it suddenly starts increasing non-stop until the end of the validation step, then stabilizes during the 2nd training epoch. Finally, I got an OOM at the start of the 2nd validation step.

Code: Training Loop

    for epoch in range(train_conf.epochs):
        running_training_loss = 0.0
        model.train()
        for batch_idx, batch in enumerate(train_loader): #, total=len(train_loader), desc="Training"):
            train_loss = training_step(model=model, compute_loss=loss_fn, optimizer=optimizer,
                                       scheduler=scheduler, batch=batch, device=device)
            running_training_loss += train_loss

            if batch_idx % log_every_n_steps == 0:
                experim.log({
                    "train/loss": train_loss,
                    "train/running_loss": running_training_loss,
                    "train/last_lr": scheduler.get_last_lr()
                })
                GPUtil.showUtilization()

        train_loss = running_training_loss / len(train_loader)

        running_score = 0.0
        running_loss = 0.0

        metric = evaluate.load('bleu')

        model.eval()
        for batch_idx, batch in tqdm(enumerate(valid_loader), desc="Validation", total=len(valid_loader)):
            loss, bleu_score = validation_step(model=model, compute_loss=loss_fn, batch=batch,
                                               device=device, tgt_tokenizer=tgt_tokenizer, metric=metric)
            running_score += bleu_score
            running_loss += loss

            if batch_idx % log_every_n_steps == 0:
                experim.log({
                    "valid/loss": loss,
                    "valid/running_loss": running_loss,
                    "valid/bleu": bleu_score
                })

        validation_bleu = running_score / len(valid_loader)
        validation_loss = running_loss / len(valid_loader)

        experim.log({
            "epoch": epoch,
            "train_loss": train_loss,
            "validation_loss": validation_loss,
            "validation_bleu": validation_bleu
        })

Training step code:

    def training_step(model, compute_loss, optimizer, scheduler, batch, device):
        optimizer.zero_grad()

        src, src_mask, tgt, tgt_mask, tgt_y, seq_len = batch(device=device)

        output = model(src, src_mask, tgt, tgt_mask)
        loss = compute_loss(output, tgt, norm=seq_len)

        loss.backward()
        optimizer.step()
        scheduler.step()

        return loss

Validation step code:

    def validation_step(model, compute_loss, batch, device, tgt_tokenizer, metric):
        src, src_mask, tgt, tgt_mask, tgt_y, seq_len = batch(device=device)

        output = model(src, src_mask, tgt, tgt_mask)
        loss = compute_loss(output, tgt, norm=seq_len)

        output = model.validation_step(src, src_mask, tgt, tgt_mask)

        preds = torch.argmax(output, dim=-1)

        preds = tgt_tokenizer.decode(preds)
        references = tgt_tokenizer.decode(tgt)

        score = metric.compute(predictions=preds, references=references)

        return loss, score['bleu']

Top answer
1 of 2
3
Hi everyone, I was finally able to solve the issue and I wanted to share my solution with others who may be facing a similar problem.

The issue was with memory usage during validation. Memory was accumulating because the forward pass was still building a computation graph for every batch, and those graphs were being retained. To resolve this, I added the @torch.no_grad() decorator to the validation_step function. This disables gradient tracking, so no graph is built during validation and the leak goes away.

Code:

    @torch.no_grad()
    def validation_step(model, compute_loss, batch, device, tgt_tokenizer, metric):
        src, src_mask, tgt, tgt_mask, tgt_y, seq_len = batch(device=device)

        output = model(src, src_mask, tgt, tgt_mask)
        loss = compute_loss(output, tgt, norm=seq_len)

        output = model.validation_step(src, src_mask, tgt, tgt_mask)

        preds = torch.argmax(output, dim=-1)

        preds = tgt_tokenizer.decode(preds)
        references = tgt_tokenizer.decode(tgt)

        score = metric.compute(predictions=preds, references=references)

        return loss, score['bleu']

PS: the explanation paragraph was generated by ChatGPT :)
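To illustrate why the decorator helps, here is a small sketch that runs on CPU. The `torch.nn.Linear` model, `MSELoss`, and random batches are stand-ins (assumptions, not the poster's code). It compares a running total built from graph-attached losses with one built under `torch.no_grad()` using `.item()`:

```python
import torch

# Hypothetical stand-ins for the real model, loss function and validation loader.
model = torch.nn.Linear(4, 1)
loss_fn = torch.nn.MSELoss()
batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(3)]

model.eval()

# Leaky pattern: each loss carries a graph, and the running total chains them
# together, so every batch's graph stays reachable until the total is dropped.
running_leaky = 0.0
for x, y in batches:
    running_leaky += loss_fn(model(x), y)

# Safe pattern: no graph is built under no_grad, and .item() yields a float.
running_safe = 0.0
with torch.no_grad():
    for x, y in batches:
        loss = loss_fn(model(x), y)
        running_safe += loss.item()

print(type(running_leaky), running_leaky.requires_grad)
print(type(running_safe))
```

In the original loop, `running_loss += loss` follows the first pattern, which is consistent with memory climbing for the whole validation phase. Accumulating `loss.item()` would likely be a sensible addition even with the decorator in place.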
2 of 2
3
Memory leaks in deep learning models can have several causes, such as holding references to large tensors longer than needed, keeping tensors that are still attached to the computation graph, or forgetting to clear gradients between iterations. To track down the source of the leak, there are a few strategies you can try:

Monitor GPU memory usage: You can use the GPUtil library to monitor GPU memory usage during training. Seeing when usage starts to increase can help narrow down the source of the leak.

Check tensor sizes: Check the sizes of the tensors used during training and make sure none are kept around longer than needed. torch.cuda.max_memory_allocated() reports the peak amount of GPU memory occupied by tensors.

Clear gradients: Gradients accumulate in each parameter's .grad across backward() calls, so call optimizer.zero_grad() before each iteration.

Use torch.no_grad(): Use torch.no_grad() to evaluate the model during validation. This prevents the computation graph from being built, which reduces memory usage.

Release cached GPU memory: torch.cuda.empty_cache() returns unused cached blocks to the GPU. It does not free memory held by tensors that are still referenced, so on its own it will not fix a leak.

Reduce batch size: Try a smaller batch size. If usage still grows over time, the problem is a leak rather than a batch that is simply too large.

It may take some trial and error to find the source of the leak, but these tips should give you a good starting point.
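The monitoring step can also be done with PyTorch's own counters instead of GPUtil. The helper below is a sketch: the name `cuda_memory_mib` is made up for illustration, and it returns `None` on machines without CUDA so the snippet still runs there.

```python
import torch

MIB = 1024 ** 2

def cuda_memory_mib(device=None):
    """Return current, reserved and peak tensor memory in MiB, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    return {
        "allocated": torch.cuda.memory_allocated(device) / MIB,           # live tensors
        "reserved": torch.cuda.memory_reserved(device) / MIB,             # held by the caching allocator
        "peak_allocated": torch.cuda.max_memory_allocated(device) / MIB,  # high-water mark
    }

stats = cuda_memory_mib()
if stats is None:
    print("CUDA not available; nothing to report")
else:
    print(", ".join(f"{name}: {value:.1f} MiB" for name, value in stats.items()))
```

Calling this once per logging step and watching whether `allocated` keeps rising between iterations is usually enough to tell a leak from a one-off spike.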
Discussions

How to debug causes of GPU memory leaks?
Hi, I have been trying to figure out why my code crashes after several batches because of cuda memory error. I understand that probably there is some variable(s) that is not freed because I keep it in the graph. The question is how to debug that kind of thing? More on discuss.pytorch.org
August 26, 2017
Memory Leak in v2.0.1
🐛 Describe the bug In pytorch v2.0.1, I encountered a memory leak when trying to input tensors of different shapes to the model. Here is code to reproduce: from torchvision.models import resnet ... More on github.com
February 2, 2024
GPU memory leak
I run out of GPU memory when training my model. The leak seems to be happening at the first call of loss.backward(). I guess that somehow a copy of the graph remains in memory, but I can’t see where it happens or what to do about it. Here’s my fit function: val_loss_best = np.inf losses ... More on discuss.pytorch.org
December 12, 2023
PyTorch memory leak reference cycle in for loop
I am facing a memory leak when iteratively updating tensors in PyTorch on my Mac M1 GPU using the PyTorch mps interface. The following is a minimal reproducible example that replicates the behavior: import torch def leak_example(p1, device): t1 = torch.rand_like(p1, device = device) # ... More on discuss.pytorch.org
December 23, 2022
Medium
medium.com › @raghadalghonaim › memory-leakage-with-pytorch-23f15203faa4
Memory Leakage with PyTorch - by Raghad Alghonaim
April 4, 2020 - This changes if you make the NumPy array explicitly of type object, which makes it start behaving like a regular Python list (only storing references to (string) objects). The same “problems” with memory consumption now appear.”
PyTorch Forums
discuss.pytorch.org › t › how-to-debug-causes-of-gpu-memory-leaks › 6741
How to debug causes of GPU memory leaks? - PyTorch Forums
August 26, 2017 - Hi, I have been trying to figure out why my code crashes after several batches because of cuda memory error. I understand that probably there is some variable(s) that is not freed because I keep it in the graph. The que…
GitHub
github.com › pytorch › pytorch › issues › 118991
Memory Leak in v2.0.1 · Issue #118991 · pytorch/pytorch
February 2, 2024 - When executing the code, memory occupancy will rise to more than 10GB. And if you continuously input more shapes, there will finally be an OOM. On the contrary, there will not be an OOM in earlier versions of pytorch (e.g.
Author   pytorch
PyTorch Forums
discuss.pytorch.org › t › gpu-memory-leak › 193572
GPU memory leak - PyTorch Forums
December 12, 2023 - I run out of GPU memory when training my model. The leak seems to be happening at the first call of loss.backward(). I guess that somehow a copy of the graph remains in memory, but I can’t see where it happens or what to do about it. Here’s my fit function: val_loss_best = np.inf losses = [] # Prepare loss history for epoch in range(epochs): for idx_batch, (x, y) in enumerate(dataloader_train): optimizer.zero_grad() ...
PyTorch Forums
discuss.pytorch.org › t › pytorch-memory-leak-reference-cycle-in-for-loop › 168908
PyTorch memory leak reference cycle in for loop - PyTorch Forums
December 23, 2022 - I am facing a memory leak when iteratively updating tensors in PyTorch on my Mac M1 GPU using the PyTorch mps interface. The following is a minimal reproducible example that replicates the behavior: import torch def leak_example(p1, device): t1 = torch.rand_like(p1, device = device) # torch.cat((torch.diff(ubar.detach(), dim=0).detach().clone(), torch.zeros_like(ubar.detach()[:1,:,:,:], dtype = torch.float32)), dim = 0) u1 = p1.detach() + 2 * (t1.detach()) B = torch.rand...
PyTorch Forums
discuss.pytorch.org › t › tips-tricks-on-finding-cpu-memory-leaks › 115971
Tips/Tricks on finding CPU memory leaks - PyTorch Forums
March 25, 2021 - Hi All, I was wondering if there are any tips or tricks when trying to find CPU memory leaks? I’m currently running a model, and every epoch the RAM usage (as calculated via psutil.Process(os.getpid()).memory_info()[0]/…
GitHub
github.com › pytorch › pytorch › issues › 71495
Memory Leak in PyTorch 1.10.1 · Issue #71495 · pytorch/pytorch
January 19, 2022 - 🐛 Describe the bug PyTorch is leaking CUDA memory. I have a model (Faster R-CNN) that produces a tensor called 'proposals' that is computed from the output of a model stage and then detache...
Author   pytorch
PyTorch Forums
discuss.pytorch.org › t › help-with-pytorch-memory-leak-on-cpu › 194395
Help with pytorch "memory leak" on CPU - PyTorch Forums
December 27, 2023 - Hi, I’m currently developing a differentiable physics engine using pytorch (2.1.0) that combines physics equations and machine learning. I am however seeing a memory leak (running on cpu, haven’t tried on gpu) where the memory continues to increase epoch after epoch.
GitHub
github.com › pytorch › pytorch › issues › 49394
Apparent Memory Leak with torch.as_tensor · Issue #49394 · pytorch/pytorch
December 15, 2020 - In the function fill_buf replace the two fake_data_batches assignment lines with the commented out fake_data_batches line. Observe that memory usage is no longer correlated with the value of repro_tensor_allocations.
Author   pytorch
GitHub
github.com › pytorch › pytorch › issues › 108334
memory leak in torch>=2.0.0 · Issue #108334 · pytorch/pytorch
August 31, 2023 -

    import random
    import torch
    from torch.utils.data import Dataset, DataLoader
    import torchvision
    import psutil
    import os

    class ExampleDataset(Dataset):
        def __init__(self):
            self.num_samples = 1000

        def __len__(self):
            return self.num_samples

        def __getitem__(self, idx):
            h = random.randint(666, 999)
            w = random.randint(666, 999)
            tensor = torch.ones((3, h, w))
            return tensor

    dataset = ExampleDataset()
    data_loader = DataLoader(dataset, batch_size=1, num_workers=0)
    model = getattr(torchvision.models, 'resnet18')().to('cuda')

    for epoch in range(10):
        for tensor in data_loader:
            tensor = tensor.to('cuda')
            output = model(tensor)
            print('memory used:', psutil.Process(os.getpid()).memory_info().rss / 1024.0 / 1024.0)
Author   pytorch
PyTorch Forums
discuss.pytorch.org › vision
RAM/CPU memory leak with transforms - vision - PyTorch Forums
January 3, 2022 - Hello, I have been trying to debug an issue where, when working with a dataset, my RAM is filling up quickly. It turns out this is caused by the transformations I am doing to the images, using transforms. My code is very simple: for dir1 in os.listdir(img_folder): for file in os.listdir(os.path.join(img_folder, dir1)): image_path = os.path.join(img_folder, dir1, file) with Image.open(image_path) as img_pil: normalize = transforms.Normalize(mean=mean,st...
GitHub
github.com › pytorch › pytorch › issues › 91368
PyTorch memory leak reference cycle in for loop, Mac M1 · Issue #91368 · pytorch/pytorch
December 23, 2022 -

    import torch

    def leak_example(p1, ubar, device):
        t1 = torch.cat((torch.diff(ubar, dim=0), torch.zeros_like(ubar[:1,:,:,:], dtype = torch.float32)), dim = 0)
        u1 = p1 + 2 * (t1)
        B = torch.rand_like(u1, device = device)
        mask = u1 < B
        a1 = u1
        a1[~mask] = torch.rand_like(a1)[~mask]
        return a1

    if torch.cuda.is_available(): # cuda gpus
        device = torch.device("cuda")
    elif torch.backends.mps.is_available(): # mac gpus
        device = torch.device("mps")

    torch.set_grad_enabled(False)
    p1 = torch.rand(5, 5, 224, 224, device = device)
    ubar = torch.rand(5, 5, 224, 224, device = device)
    for i in range(10000):
        p1 = leak_example(p1, ubar, device)

My Mac's GPU memory steadily grows when I execute this loop.
Author   pytorch
GitHub
github.com › pytorch › pytorch › issues › 55607
pytorch inference lead to memory leak in cpu · Issue #55607 · pytorch/pytorch
April 8, 2021 - 🐛 Bug When I run inference with a PyTorch model, I get a memory leak. My code is as follows: import torch import torch.nn as nn from memory_profiler import profile from memory_profiler import memory_usage @profile(func=None, stream=open('./resne...
Author   pytorch
GitHub
github.com › pytorch › pytorch › issues › 102334
There is a memory leak in torch.load · Issue #102334 · pytorch/pytorch
May 26, 2023 - There is a memory leak in torch.load and cannot be freed · Collecting environment information... PyTorch version: 2.0.1 Is debug build: False CUDA used to build PyTorch: None ROCM used to build PyTorch: N/A
Author   pytorch
Haoxiang
blog.haoxiang.org › 2021 › 04 › a-pytorch-gpu-memory-leak-example
A PyTorch GPU Memory Leak Example - Haoxiang's Blog
April 7, 2021 - Kicking off the training, it shows constantly increasing allocated GPU memory.

    Using cache found in /home/haoxiangli/.cache/torch/hub/pytorch_vision_v0.9.0
    Loss, current batch -0.001, moving average -0.001
    GPU Memory Allocated 51.64306640625 MB
    Loss, current batch -0.001, moving average -0.001
    GPU Memory Allocated 119.24072265625 MB
    Loss, current batch -0.001, moving average -0.001
    GPU Memory Allocated 186.83837890625 MB
    Loss, current batch -0.001, moving average -0.001
    GPU Memory Allocated 254.43603515625 MB
    Loss, current batch -0.001, moving average -0.001
    GPU Memory Allocated 322.03369140625 MB
    Loss, current batch -0.001, moving average -0.001
    GPU Memory Allocated 389.63134765625 MB
    Loss, current batch -0.001, moving average -0.001
Reddit
reddit.com › r/pytorch › how to find a memory leak?
r/pytorch on Reddit: How to find a memory leak?
April 26, 2022 -

Like a lot of people here, it would seem, I regularly hit the infamous RuntimeError: CUDA out of memory after a few epochs, and it drives me crazy. Every time, people post their code on forums and someone points out a missing .item() or .detach() somewhere. But is there a way to see, at each epoch, the size of each tensor or a memory usage breakdown, so I can track these issues down myself instead of always asking online?
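One commonly suggested approach to this kind of breakdown is to walk the objects tracked by Python's garbage collector and tally the tensors among them. The sketch below assumes that approach; `live_tensor_bytes` and the `probe` tensor are hypothetical names used only for illustration.

```python
import gc
from collections import Counter

import torch

def live_tensor_bytes():
    """Sum bytes of tensors visible to the garbage collector, keyed by (shape, dtype, device)."""
    totals = Counter()
    for obj in gc.get_objects():
        try:
            if isinstance(obj, torch.Tensor):
                key = (tuple(obj.shape), str(obj.dtype), str(obj.device))
                totals[key] += obj.element_size() * obj.nelement()
        except Exception:
            # Some tracked objects raise on inspection; skip them.
            continue
    return totals

probe = torch.zeros(123, 457)  # hypothetical tensor so the report has a known entry
report = live_tensor_bytes()

# Print the five largest groups; calling this once per epoch shows what is growing.
for (shape, dtype, device), nbytes in report.most_common(5):
    print(f"{nbytes / 1024:10.1f} KiB  {shape}  {dtype}  {device}")
```

This only counts tensors that have a live Python object, and views that share storage are counted more than once, so treat the totals as a rough guide. On CUDA devices, `torch.cuda.memory_summary()` gives an allocator-level view that complements it.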