Hi everyone,
I'm working on a deep learning project using PyTorch and I'm running into some issues with memory leaks. I've attached a screenshot of my GPU memory usage which shows a steady increase over time, indicating that there is a leak somewhere in my code.
I would greatly appreciate any advice or tips on how to track down and fix these memory leaks. I've tried a few different approaches so far, but I'm still not sure where the problem lies.
If you have any experience with debugging memory leaks in PyTorch, I would love to hear from you. Any suggestions or code snippets would be especially helpful!
Thanks in advance for your help!
[Screenshot of GPU memory usage: nvtop stats]
During the first training epoch, memory usage is roughly 40-45%. When training finishes, it suddenly starts increasing non-stop until the end of the validation step, then stabilizes during the second training epoch. Finally, I get an OOM error at the start of the second validation step.
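To put numbers on the pattern described above, one option is to log PyTorch's own allocator counters at phase boundaries (end of training, every few validation batches, end of validation). The sketch below is only an illustration: `log_cuda_memory` is a hypothetical helper name, not something from my code, and it assumes a single CUDA device. Note that nvtop shows what the process holds, which includes memory the caching allocator has reserved but is not currently using, so the "allocated" figure is usually the more telling one.

```python
import torch


def log_cuda_memory(tag: str) -> dict:
    """Print and return allocator statistics (in MiB) for the current CUDA device.

    Returns an empty dict when CUDA is not available, so the helper is safe
    to leave in code that also runs on CPU.
    """
    if not torch.cuda.is_available():
        print(f"[{tag}] CUDA not available")
        return {}

    mib = 1024 ** 2
    stats = {
        # Memory currently occupied by live tensors.
        "allocated_mib": torch.cuda.memory_allocated() / mib,
        # Memory held by the caching allocator (closer to what nvtop shows).
        "reserved_mib": torch.cuda.memory_reserved() / mib,
        # Highest "allocated" value since the last reset of the peak stats.
        "peak_allocated_mib": torch.cuda.max_memory_allocated() / mib,
    }
    print(f"[{tag}] " + ", ".join(f"{k}={v:.1f}" for k, v in stats.items()))
    return stats


# Example: call at phase boundaries, e.g. log_cuda_memory("end of training").
log_cuda_memory("startup")
```

If the allocated figure climbs batch after batch during validation, a common cause is accumulating loss tensors that still reference their autograd graph. Running validation under `torch.no_grad()` and accumulating `loss.item()` instead of the tensor is the usual thing to try first.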
Code: Training Loop
for epoch in range(train_conf.epochs):
    running_training_loss = 0.0
    model.train()
    for batch_idx, batch in enumerate(train_loader):  # , total=len(train_loader), desc="Training"):
        train_loss = training_step(model=model, compute_loss=loss_fn, optimizer=optimizer,
                                   scheduler=scheduler, batch=batch, device=device)
        running_training_loss += train_loss
        if batch_idx % log_every_n_steps == 0:
            experim.log({
                "train/loss": train_loss,
                "train/running_loss": running_training_loss,
                "train/last_lr": scheduler.get_last_lr()
            })
            GPUtil.showUtilization()
    train_loss = running_training_loss / len(train_loader)

    running_score = 0.0
    running_loss = 0.0
    metric = evaluate.load('bleu')
    model.eval()
    for batch_idx, batch in tqdm(enumerate(valid_loader), desc="Validation", total=len(valid_loader)):
        loss, bleu_score = validation_step(model=model, compute_loss=loss_fn, batch=batch,
                                           device=device, tgt_tokenizer=tgt_tokenizer, metric=metric)
        running_score += bleu_score
        running_loss += loss
        if batch_idx % log_every_n_steps == 0:
            experim.log({
                "valid/loss": loss,
                "valid/running_loss": running_loss,
                "valid/bleu": bleu_score
            })
    validation_bleu = running_score / len(valid_loader)
    validation_loss = running_loss / len(valid_loader)
    experim.log({
        "epoch": epoch,
        "train_loss": train_loss,
        "validation_loss": validation_loss,
        "validation_bleu": validation_bleu
    })

Training step code:
def training_step(model, compute_loss, optimizer, scheduler, batch, device):
    optimizer.zero_grad()
    src, src_mask, tgt, tgt_mask, tgt_y, seq_len = batch(device=device)
    output = model(src, src_mask, tgt, tgt_mask)
    loss = compute_loss(output, tgt, norm=seq_len)
    loss.backward()
    optimizer.step()
    scheduler.step()
    return loss

Validation step code:
def validation_step(model, compute_loss, batch, device, tgt_tokenizer, metric):
    src, src_mask, tgt, tgt_mask, tgt_y, seq_len = batch(device=device)
    output = model(src, src_mask, tgt, tgt_mask)
    loss = compute_loss(output, tgt, norm=seq_len)
    output = model.validation_step(src, src_mask, tgt, tgt_mask)
    preds = torch.argmax(output, dim=-1)
    preds = tgt_tokenizer.decode(preds)
    references = tgt_tokenizer.decode(tgt)
    score = metric.compute(predictions=preds, references=references)
    return loss, score['bleu']

How to debug causes of GPU memory leaks?
Memory Leak in v2.0.1
GPU memory leak
PyTorch memory leak reference cycle in for loop
Like a lot of people on this thread, it seems, I regularly hit the infamous RuntimeError: CUDA out of memory after a few epochs, and it drives me crazy. Every time, people post their code on forums and someone points out a missing .item() or .detach() somewhere. But is there a way to see, at each epoch, the size of each tensor or a breakdown of memory usage, so that I can track down these issues myself instead of always asking online?
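One rough way to get the per-tensor breakdown asked about here is to walk the objects tracked by Python's garbage collector and group the tensors by device, dtype, and shape. The sketch below is a minimal illustration, not a complete profiler: `live_tensor_report` is a hypothetical helper name, it only sees tensors that are reachable as Python objects (buffers saved inside an autograd graph on the C++ side will not show up individually), and views that share storage are counted once per view, so the byte totals can overestimate.

```python
import gc
from collections import Counter

import torch


def live_tensor_report(top_k: int = 10):
    """Return the top_k groups of live tensors, largest total size first.

    Each entry is ((device, dtype, shape), tensor_count, total_bytes).
    """
    total_bytes = Counter()
    counts = Counter()
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                key = (str(obj.device), str(obj.dtype), tuple(obj.shape))
                total_bytes[key] += obj.element_size() * obj.nelement()
                counts[key] += 1
        except Exception:
            # Some tracked objects raise on inspection; skip them.
            continue
    return [(key, counts[key], nbytes) for key, nbytes in total_bytes.most_common(top_k)]


# Example: print the largest groups, e.g. at the end of each epoch.
for (device, dtype, shape), count, nbytes in live_tensor_report(top_k=5):
    print(f"{device} {dtype} {shape}: {count} tensor(s), {nbytes / 1024 ** 2:.2f} MiB")
```

Calling this at the end of every epoch and comparing the reports shows which shapes keep multiplying; a group whose count grows with the number of batches often points at tensors being appended or summed without `.item()` or `.detach()`. For an allocator-level view, `torch.cuda.memory_summary()` prints a breakdown of allocated and reserved memory on the GPU.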