🌐
DeepSpeed
deepspeed.ai › home › tutorials
Using PyTorch Profiler with DeepSpeed for performance debugging - DeepSpeed
1 week ago - The Profiler assumes that the training ... on_trace_ready, with_stack, etc. In the example below, the profiler will skip the first 5 steps, use the next 2 steps as the warm up, and actively record the next 6 steps....
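The step counts in this snippet can be sketched with `torch.profiler.schedule`. This is a minimal sketch, not the tutorial's own code: the snippet does not show which argument expresses the five skipped steps, so `wait=5` is an assumption, and `repeat=1` is added here so the cycle runs only once.

```python
from torch.profiler import ProfilerAction, schedule

# Assumption: the 5 skipped steps are modelled with wait=5; repeat=1 is
# added for this sketch so the wait/warmup/active cycle runs a single time.
profiler_schedule = schedule(wait=5, warmup=2, active=6, repeat=1)

# The schedule is a callable that maps a step index to a ProfilerAction.
actions = [profiler_schedule(step) for step in range(14)]

for step, action in enumerate(actions):
    print(step, action.name)
```

Passing this object as `schedule=` to `torch.profiler.profile` and calling `prof.step()` once per training iteration moves the profiler through these phases. The last active step is reported as `RECORD_AND_SAVE`, which is the point at which an `on_trace_ready` callback is invoked.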
🌐
PyTorch
docs.pytorch.org › recipes › pytorch profiler
PyTorch Profiler — PyTorch Tutorials 2.12.0+cu130 documentation
July 20, 2022 - Created On: Jan 29, 2021 | Last Updated: Jul 09, 2025 | Last Verified: Not Verified ... This recipe explains how to use the PyTorch profiler to measure the time and memory consumption of a model's operators.
Discussions

Pytorch Profiler
I use it to capture a trace of the run. It is very useful for identifying the performance bottleneck in your training loop and coming up with optimizations. There is a bit of a learning curve to master this technique: you need some understanding of how the GPU and CPU work together (e.g., GPU kernels are asynchronous; when the CPU and GPU synchronize with each other; what CUDA streams are; what a GPU can do in parallel). Definitely recommended if you need to understand the performance of your training or inference code. Nsight can be an additional tool, since it can provide more information than the standard profiler. The PyTorch team gives an example of using traces and the profiler to iteratively optimize a model's performance: https://pytorch.org/blog/accelerating-generative-ai/
🌐 r/pytorch
7
3
June 3, 2024
FIVE WAYS TO INCREASE MODEL PERF W/ PYTORCH PROFILER!
🌐 r/pytorch
gpu in pytorch good resource for general guidelines/advice? I feel very lost with the tutorial afterthought-like treatment
If you don't see any GPU usage, make sure that your device is actually a CUDA device. Plenty of install problems can cause CUDA to be unavailable. If you expect to always run this code on a GPU, an assert would be fitting; falling back to the CPU pretty much means failing silently. As for implicit global GPU usage, there has been plenty of discussion on the subject; as far as I know it can't be done (yet). The general consensus is that it's better for the user to be fully aware of what is going on and where and when data is moved.
🌐 r/pytorch
6
4
July 24, 2019
Bloomberg just Open sourced Memray a memory profiler for Python
I have no idea what that is, but if it's from Bloomberg you can bet it is serious stuff of the kind that used to cost an arm and a leg.
🌐 r/Python
33
622
April 20, 2022
🌐
Hugging Face
huggingface.co › blog › torch-profiler
Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler
May 29, 2026 - We begin with ProfileStep#<id> which encapsulates the profiling step. Due to us annotating the step, we see the matmul_add row. The matmul_add consists of two aten calls, one for matrix multiplication and one for matrix addition. The aten::matmul is the ATen-level dispatch that those user-facing PyTorch matmul calls land on.
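The annotation this snippet describes can be sketched with `torch.profiler.record_function`. This is a minimal CPU-only sketch under assumptions: the function body, tensor shapes, and variable names are illustrative and not taken from the blog post, and no profiler schedule is used, so no `ProfileStep#<id>` row is expected here.

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function


def matmul_add(x, w, b):
    # The label passed to record_function shows up as its own row in the
    # profiler output, wrapping the ATen ops executed inside the block.
    with record_function("matmul_add"):
        return torch.matmul(x, w) + b


# Illustrative inputs (shapes are an assumption).
x = torch.randn(64, 64)
w = torch.randn(64, 64)
b = torch.randn(64)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    matmul_add(x, w, b)

# Collect the names of all aggregated events recorded during the run.
event_names = {event.key for event in prof.key_averages()}
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

The printed table should list the `matmul_add` label alongside the `aten::matmul` and `aten::add` calls made inside it.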
🌐
H-huang
h-huang.github.io › tutorials › recipes › recipes › profiler_recipe.html
PyTorch Profiler — PyTorch Tutorials 1.8.1+cu102 documentation
This recipe explains how to use PyTorch profiler and measure the time and memory consumption of the model’s operators.
🌐
GitHub
gist.github.com › mingfeima › e08310d7e7bb9ae2a693adecf2d8a916
How to do performance profiling on PyTorch · GitHub
Usually the first step in performance optimization is profiling, e.g. to identify the performance hotspots of a workload. This gist covers the basics of performance profiling on PyTorch; you will get: ... This tutorial takes one of my recent projects, pssp-transformer, as an example to guide you through the path of PyTorch CPU performance optimization.
🌐
GitHub
gist.github.com › ma7dev › 8aad93386e2b3364edbf0759514af5cc
Quick and play PyTorch Profiler example · GitHub
It wasn't obvious from PyTorch's documentation how to use PyTorch Profiler (as of today, 8/12/2021), so I spent some time working out how to use it; this gist contains a simple example.
🌐
PyTorch
docs.pytorch.org › reference api › torch.profiler
torch.profiler — PyTorch main documentation
For example, if skip_first is 10 and wait is 20, the first cycle will wait 10 + 20 = 30 steps before warmup if skip_first_wait is zero, but will wait only 10 steps if skip_first_wait is non-zero.
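The arithmetic in this snippet can be restated as a small helper. This is only a sketch of the documented numbers, not the library's implementation; the function name `steps_before_first_warmup` is hypothetical.

```python
def steps_before_first_warmup(skip_first: int, wait: int, skip_first_wait: int = 0) -> int:
    """Number of steps that pass before the first warmup step.

    Hypothetical helper that mirrors the documented example: when
    skip_first_wait is zero the first cycle waits skip_first + wait steps;
    when it is non-zero the first cycle waits only skip_first steps.
    """
    if skip_first_wait != 0:
        return skip_first
    return skip_first + wait


default_delay = steps_before_first_warmup(skip_first=10, wait=20)
shortened_delay = steps_before_first_warmup(skip_first=10, wait=20, skip_first_wait=1)
print(default_delay, shortened_delay)
```

With the documented values this gives 30 steps in the default case and 10 steps when `skip_first_wait` is non-zero.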
🌐
PyTorch Lightning
pytorch-lightning.readthedocs.io › en › 1.5.10 › advanced › profiler.html
Performance and Bottleneck Profiler — PyTorch Lightning 1.5.10 documentation
This profiler will record training_step_and_backward, training_step, backward, validation_step, test_step, and predict_step by default. The output below shows the profiling for the action training_step_and_backward. The user can provide PyTorchProfiler(record_functions={...}) to extend the scope of profiled functions.
🌐
Lightning AI
lightning.ai › docs › pytorch › 1.6.3 › advanced › profiler.html
Profiling — PyTorch Lightning 1.6.3 documentation
The following is a simple example that profiles the first occurrence and total calls of each action:

    from pytorch_lightning.profiler import Profiler
    from collections import defaultdict
    import time

    class ActionCountProfiler(Profiler):
        def __init__(self, dirpath=None, filename=None):
            super().__init__(dirpath=dirpath, filename=filename)
            self._action_count = defaultdict(int)
            self._action_first_occurrence = {}

        def start(self, action_name):
            if action_name not in self._action_first_occurrence:
                self._action_first_occurrence[action_name] = time.strftime("%m/%d/%Y, %H:%M:%S")

        def stop(self, action_name)
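Since the snippet above is cut off at `stop`, the idea can be illustrated with a standalone sketch that does not depend on Lightning. Everything here is an assumption made for illustration: the class `ActionCounter` is hypothetical and does not subclass Lightning's `Profiler`, and the bodies of `stop` and `summary` are guesses at the intent (count completed actions, then report them), not the documentation's code.

```python
import time
from collections import defaultdict


class ActionCounter:
    """Hypothetical standalone stand-in for the truncated example."""

    def __init__(self):
        self._action_count = defaultdict(int)
        self._action_first_occurrence = {}

    def start(self, action_name):
        # Record a timestamp only the first time an action is seen.
        if action_name not in self._action_first_occurrence:
            self._action_first_occurrence[action_name] = time.strftime("%m/%d/%Y, %H:%M:%S")

    def stop(self, action_name):
        # Assumption: each completed action increments its call count.
        self._action_count[action_name] += 1

    def summary(self):
        # Assumption: one report line per action, in first-seen order.
        lines = [
            f"{name}: {self._action_count[name]} call(s), first seen {first_seen}"
            for name, first_seen in self._action_first_occurrence.items()
        ]
        return "\n".join(lines)


counter = ActionCounter()
for _ in range(3):
    counter.start("training_step")
    counter.stop("training_step")
counter.start("validation_step")
counter.stop("validation_step")

report = counter.summary()
print(report)
```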
🌐
PyTorch
pytorch.org › blog › introducing-pytorch-profiler-the-new-and-improved-performance-tool
Introducing PyTorch Profiler – the new and improved performance tool – PyTorch
March 25, 2021 - There were standard performance debugging tools that provide GPU hardware level information but missed PyTorch-specific context of operations. In order to recover missed information, users needed to combine multiple tools together or manually add minimum correlation information to make sense of the data. There was also the autograd profiler (torch.autograd.profiler) which can capture information about PyTorch operations but does not capture detailed GPU hardware-level information and cannot provide support for visualization.
🌐
Modal
modal.com › docs › examples › torch_profiling
Tracing and profiling GPU-accelerated PyTorch programs on Modal | Modal Docs
In this example, we demonstrate how to wrap a Modal Function with PyTorch’s built-in profiler, which captures events on both CPUs & GPUs.
🌐
GitHub
github.com › Quentin-Anthony › torch-profiling-tutorial
GitHub - Quentin-Anthony/torch-profiling-tutorial · GitHub
This tutorial seeks to teach users about using profiling tools such as nsys, rocprof, and the torch profiler in a simple transformers training loop. We will cover how to use the PyTorch profiler to identify performance bottlenecks, understand GPU efficiency metrics, and perform initial ...
Starred by 579 users
Forked by 32 users
Languages   Python
🌐
Habana
docs.habana.ai › en › latest › Profiling › Profiling_with_PyTorch.html
Profiling with PyTorch — Gaudi Documentation 1.23.0 documentation
To monitor HPU memory during training, set the profile_memory argument to True in the torch.profiler.profile function. The below shows a usage example:
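The snippet ends before its usage example, so here is a minimal sketch of the `profile_memory` argument. It is not the Gaudi documentation's example: it assumes a CPU-only run (`ProfilerActivity.CPU`) so that it works without HPU hardware, and the model and input shapes are illustrative.

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Illustrative model and input (assumptions, chosen to keep the run small).
model = torch.nn.Linear(128, 64)
inputs = torch.randn(32, 128)

# profile_memory=True asks the profiler to track tensor allocations and
# releases made by the recorded operators.
with profile(
    activities=[ProfilerActivity.CPU],  # assumption: CPU stands in for the HPU
    profile_memory=True,
    record_shapes=True,
) as prof:
    model(inputs)

memory_table = prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10)
print(memory_table)
```

Sorting by `self_cpu_memory_usage` puts the operators that allocated the most memory themselves at the top of the table.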
🌐
PyTorch Lightning
pytorch-lightning.readthedocs.io › en › 1.6.5 › advanced › profiler.html
Profiling — PyTorch Lightning 1.6.5 documentation
The following is a simple example that profiles the first occurrence and total calls of each action:

    from pytorch_lightning.profiler import Profiler
    from collections import defaultdict
    import time

    class ActionCountProfiler(Profiler):
        def __init__(self, dirpath=None, filename=None):
            super().__init__(dirpath=dirpath, filename=filename)
            self._action_count = defaultdict(int)
            self._action_first_occurrence = {}

        def start(self, action_name):
            if action_name not in self._action_first_occurrence:
                self._action_first_occurrence[action_name] = time.strftime("%m/%d/%Y, %H:%M:%S")

        def stop(self, action_name)