🌐
PyTorch
pytorch.org › blog › introducing-pytorch-profiler-the-new-and-improved-performance-tool
Introducing PyTorch Profiler – the new and improved performance tool – PyTorch
March 25, 2021 - The new PyTorch Profiler (torch.profiler) is a tool that brings both types of information together and then builds experience that realizes the full potential of that information. This new profiler collects both GPU hardware and PyTorch related information, correlates them, performs automatic ...
🌐
GitHub
github.com › NVIDIA › PyProf
GitHub - NVIDIA/PyProf: A GPU performance profiling tool for PyTorch models · GitHub
June 30, 2021 - PyProf is a tool that profiles and analyzes the GPU performance of PyTorch models.
Starred by 511 users
Forked by 50 users
Languages   Python 95.8% | Shell 3.6% | Dockerfile 0.6%
🌐
Harvard
handbook.eng.kempnerinstitute.harvard.edu › s5_ai_scaling_and_engineering › scalability › gpu_profiling.html
19.3. GPU Profiling — Kempner Institute Computing Handbook
The following steps are performed ... Profiling Loop to Optimize Code: PyTorch profiler is a tool that facilitates collecting different performance metrics at runtime to better understand what happens behind the scenes...
🌐
ROCm Blogs
rocm.blogs.amd.com › artificial-intelligence › torch_profiler › README.html
Unveiling performance insights with PyTorch Profiler on an AMD GPU — ROCm Blogs
May 29, 2024 - PyTorch Profiler is a performance analysis tool that enables developers to examine various aspects of model training and inference in PyTorch. It allows users to collect and analyze detailed profiling information, including GPU/CPU utilization, ...
🌐
Modal
modal.com › docs › examples › torch_profiling
Tracing and profiling GPU-accelerated PyTorch programs on Modal | Modal Docs
GPUs are high-performance computing devices. For high-performance computing, tools for measuring and investigating performance are as critical as tools for testing and confirming correctness in typical software. In this example, we demonstrate how to wrap a Modal Function with PyTorch’s built-in profiler, which captures events on both CPUs & GPUs.
🌐
Medium
medium.com › @dayashankar.bhakuni › best-practices-for-profiling-pytorch-models-on-gpus-7a791d17e2b9
Best Practices for Profiling PyTorch Models on GPUs | by Daya shankar | Medium
December 15, 2025 - Profiling PyTorch models on GPUs is a loop: benchmark, profile, change one thing, re-benchmark, and repeat. The best results come from disciplined measurement and good trace hygiene.
🌐
Medium
medium.com › @alishafique3 › pytorch-training-optimizations-5-throughput-with-gpu-profiling-and-memory-analysis-31cb2b1f95cc
PyTorch training optimizations: 5× throughput with GPU profiling and memory analysis. | by Ali Shafique | Medium
April 30, 2024 - PyTorch Profiler and the PyTorch TensorBoard plugin are used to identify a bottleneck in the training step. ... As we can see, the step time is 117 msec and the GPU utilization is 73.28%. The average memory used in each training step can be ...
🌐
PyTorch
docs.pytorch.org › user guide › torch.compiler › performance › torchinductor gpu profiling
TorchInductor GPU Profiling — PyTorch main documentation
July 28, 2023 - You can zoom in and out to check the profile. We report the percentage of GPU time relative to the wall time in a log line like: ... Sometimes you may see a value larger than 100%. This is because PyTorch uses the kernel execution time with profiling enabled while using wall time with profiling ...
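The percentage this snippet refers to is a ratio of summed kernel time to wall time. A minimal sketch follows, showing how the value can exceed 100% when the two times are measured separately. The helper name `gpu_time_percent` and all numbers are hypothetical and do not come from the TorchInductor documentation.

```python
def gpu_time_percent(kernel_times_ms, wall_time_ms):
    """Return the summed kernel time as a percentage of the wall time."""
    if wall_time_ms <= 0:
        raise ValueError("wall_time_ms must be positive")
    return 100.0 * sum(kernel_times_ms) / wall_time_ms


# Hypothetical measurements, not taken from any real trace.
kernel_times_ms = [4.0, 3.5, 3.0]  # per-kernel times, measured with profiling enabled
wall_time_ms = 10.0                # wall time, measured in a separate run

percent = gpu_time_percent(kernel_times_ms, wall_time_ms)
print(f"{percent:.1f}%")  # prints "105.0%"
```

Because the numerator and denominator come from different measurements, nothing bounds the ratio at 100%.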
🌐
DeepSpeed
deepspeed.ai › home › tutorials
Using PyTorch Profiler with DeepSpeed for performance debugging - DeepSpeed
1 week ago - ProfilerActivity.CPU - PyTorch operators, TorchScript functions and user-defined code labels (record_function). ProfilerActivity.CUDA - on-device CUDA kernels. Note that CUDA profiling incurs non-negligible overhead. The example below profiles both the CPU and GPU activities in the model forward pass and prints the summary table sorted by total CUDA time.
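As a minimal sketch of what this snippet describes (not the DeepSpeed tutorial's own code), the following profiles a forward pass with `torch.profiler` and prints a summary table. The toy `nn.Linear` model, the input shape, and the `forward_pass` label are hypothetical. CUDA activity is recorded only when a GPU is available; otherwise the table is sorted by total CPU time instead.

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile, record_function

# Hypothetical toy model and input, used only for illustration.
model = nn.Linear(64, 32)
inputs = torch.randn(8, 64)

# Always record CPU-side operators; add CUDA kernels only when a GPU exists.
activities = [ProfilerActivity.CPU]
use_cuda = torch.cuda.is_available()
if use_cuda:
    activities.append(ProfilerActivity.CUDA)
    model, inputs = model.cuda(), inputs.cuda()

with profile(activities=activities) as prof:
    # record_function attaches a user-defined label to this region.
    with record_function("forward_pass"):
        with torch.no_grad():
            model(inputs)

# Sort by total CUDA time when it was collected, otherwise by total CPU time.
sort_key = "cuda_time_total" if use_cuda else "cpu_time_total"
summary = prof.key_averages().table(sort_by=sort_key, row_limit=10)
print(summary)
```

The `forward_pass` row in the printed table aggregates everything executed inside the labeled region, which makes it easy to compare against the individual operator rows.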
🌐
Sigma2
documentation.sigma2.no › code_development › guides › pytorch_profiler.html
Profiling GPU-accelerated Deep Learning — Sigma2 documentation
We present an introduction to profiling GPU-accelerated Deep Learning (DL) models using PyTorch Profiler. Profiling is a necessary step in code development, as it permits identifying bottlenecks in an application.
🌐
Christian Mills
christianjmills.com › posts › cuda-mode-notes › lecture-001
GPU MODE Lecture 1: How to profile CUDA kernels in PyTorch – Christian Mills
April 26, 2024 - Lecture #1 provides a practical introduction to integrating and profiling custom CUDA kernels within PyTorch programs, using tools like load_inline, Triton, and NVIDIA Nsight Compute.
🌐
PyTorch Lightning
pytorch-lightning.readthedocs.io › en › 1.2.10 › advanced › profiler.html
Performance and Bottleneck Profiler — PyTorch Lightning 1.2.10 documentation
This profiler uses PyTorch’s Autograd Profiler and lets you inspect the cost of different operators inside your model - both on the CPU and GPU
🌐
PyTorch Forums
discuss.pytorch.org › t › cuda-memory-profiling › 182065
CUDA Memory Profiling - PyTorch Forums
June 14, 2023 - I’m currently using the torch.profiler.profile to analyze peak memory on my GPUs. I first used the on_trace_ready argument to generate a TensorBoard trace and read the information by hand, but now I want to read that information directly in my code. So I’ve set up my profiler as: self.prof = torch.profiler.profile( activities=[ torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA ], record_shapes=True, profile_memory=True ) And then I used the f...
🌐
GitHub
github.com › Quentin-Anthony › torch-profiling-tutorial
GitHub - Quentin-Anthony/torch-profiling-tutorial · GitHub
This tutorial seeks to teach users about using profiling tools such as nvsys, rocprof, and the torch profiler in a simple transformers training loop. We will cover how to use the PyTorch profiler to identify performance bottlenecks, understand GPU efficiency metrics, and perform initial optimizations.
Starred by 579 users
Forked by 32 users
Languages   Python
🌐
GitHub
github.com › pytorch › kineto
GitHub - pytorch/kineto: A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters. · GitHub
A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters. - pytorch/kineto
Starred by 967 users
Forked by 256 users
Languages   C++ 90.3% | Cuda 4.1% | CMake 2.8% | Python 2.3%
🌐
PyTorch Forums
discuss.pytorch.org › t › how-to-profiling-entire-pytorch-code-when-gpus-are-present › 102866
How to profiling ENTIRE pytorch code when GPUs are present? - PyTorch Forums
November 15, 2020 - I want to profile my entire training and eval pytorch code. I am using custom dataloaders (e.g. torchmeta library) and novel pytorch libraries (e.g. higher library) and I see very significant performance slow down from what other libraries reported (despite me using better GPUs e.g. I use v100 ...
🌐
Rastringer
rastringer.github.io › gpu_cuda_book › nsight_attention.html
4 Profiling and optimizing PyTorch training – Introduction to GPUs and CUDA programming
Let’s use the torch.nn.functional.scaled_dot_product_attention function, optimized for GPUs. This method uses the Flash Attention algorithm when available. For more on this mechanism, see the research paper. %%writefile profiler.py import torch import torch.nn as nn import torch.nn.functional as F class OptimizedAttention(nn.Module): def __init__(self, embed_dim): super().__init__() self.query = nn.Linear(embed_dim, embed_dim) self.key = nn.Linear(embed_dim, embed_dim) self.value = nn.Linear(embed_dim, embed_dim) self.scale = embed_dim ** -0.5 def forward(self, x): q = self.query(x) k = self
🌐
AMD ROCm
rocm.docs.amd.com › en › docs-6.1.1 › how-to › llm-fine-tuning-optimization › profiling-and-debugging.html
Profiling and debugging - ROCm Documentation - AMD
PyTorch Profiler can be invoked inside Python scripts, letting you collect CPU and GPU performance metrics while the script is running.