For this you want to use the PyTorch Profiler, which gives you details on both CPU and memory consumption.
For more details:
https://pytorch.org/blog/introducing-pytorch-profiler-the-new-and-improved-performance-tool/
https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html
Answer from Mradul Karmodiya on Stack Overflow
PyTorch
pytorch.org › blog › understanding-gpu-memory-1
Understanding GPU Memory 1: Visualizing All Allocations over Time – PyTorch
December 14, 2023 - The Memory Profiler is an added feature of the PyTorch Profiler that categorizes memory usage over time.
PyTorch
docs.pytorch.org › recipes › pytorch profiler
PyTorch Profiler — PyTorch Tutorials 2.12.0+cu130 documentation
July 20, 2022 -
    import torch
    import torchvision.models as models
    from torch.profiler import profile, ProfilerActivity

    model = models.resnet18()
    inputs = torch.randn(5, 3, 224, 224)

    # Profile CPU activity and track the tensor memory each operator allocates.
    with profile(
        activities=[ProfilerActivity.CPU], profile_memory=True, record_shapes=True
    ) as prof:
        model(inputs)

    print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))

    # (omitting some columns)
    # ---------------------------------  ------------  ------------  ------------
    #                              Name       CPU Mem  Self CPU Mem    # of Calls
    # ---------------------------------  ------------  ------------  ------------
    #                       aten::empty      94.79 Mb      94.79 Mb           121
    #     aten::max_pool2d_with_indices      11.48 Mb      11.48 Mb             1
    #                       aten::addmm      19.53 Kb      19.53 Kb             1
    # aten...
How to get pytorch's memory stats on CPU / main memory? - Stack Overflow
Now my question is: Why does this only work for the GPU? I couldn't find anything like torch.cpu.memory_stats(). What is the counterpart for this when running on a CPU? ... For this you want to use the PyTorch Profiler, which gives you details on both CPU and memory consumption. More on stackoverflow.com
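As a rough illustration of the answer above, the profiler's per-operator averages can be read as numbers rather than as a printed table. This is a minimal sketch, not an equivalent of torch.cuda.memory_stats(): `top_cpu_memory_ops` and `workload` are hypothetical names chosen for this example, and the figures only cover tensor memory allocated while the profiler is active.

```python
import torch
from torch.profiler import profile, ProfilerActivity


def top_cpu_memory_ops(fn, row_limit=5):
    """Run fn() under the profiler and return (op name, self CPU bytes) pairs."""
    with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
        fn()
    # Each averaged event exposes the bytes it allocated itself on the CPU.
    ranked = sorted(
        prof.key_averages(), key=lambda e: e.self_cpu_memory_usage, reverse=True
    )
    return [(e.key, e.self_cpu_memory_usage) for e in ranked[:row_limit]]


def workload():
    # Hypothetical workload: one matrix multiply on freshly allocated tensors.
    a = torch.randn(256, 256)
    b = torch.randn(256, 256)
    return a @ b


stats = top_cpu_memory_ops(workload)
for name, nbytes in stats:
    print(f"{name:<30} {nbytes / 1024:.1f} KiB")
```

Sorting in Python instead of passing sort_by to table() keeps the result available as plain tuples for logging or assertions.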
CUDA Memory Profiling
I’m currently using torch.profiler.profile to analyze peak memory on my GPUs. I first used the on_trace_ready argument to generate a TensorBoard trace and read the information by hand, but now I want to read that information directly in my code. So I’ve set up my profiler as: self.prof ... More on discuss.pytorch.org
Pytorch Profiler
I use it to capture a trace of the run. It is very useful for identifying the performance bottlenecks of your training loop and coming up with optimizations. There is a bit of a learning curve to master this technique. You need some understanding of how the GPU and CPU work together (e.g., GPU kernels are asynchronous; when do the CPU and GPU sync with each other? What are CUDA streams? What can be done in parallel by a GPU?). Definitely recommend it if you need to understand the performance of your training or inference code. Nsight can be an additional tool, since it can provide more information than the standard profiler. This is an example of using traces and the profiler to iteratively optimize a model's efficiency, by the PyTorch team: https://pytorch.org/blog/accelerating-generative-ai/ More on reddit.com
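Capturing a trace as described in this comment could look roughly like the sketch below. It assumes a CPU-only run and a throwaway matrix multiply as the workload; `capture_trace` and the output path are hypothetical names for this example. A GPU run would also list ProfilerActivity.CUDA in the activities.

```python
import os
import tempfile

import torch
from torch.profiler import profile, ProfilerActivity


def capture_trace(fn, path):
    """Profile fn() on the CPU and write a Chrome-format trace to `path`."""
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        fn()
    prof.export_chrome_trace(path)
    return path


# Hypothetical workload and output location.
trace_path = os.path.join(tempfile.mkdtemp(), "trace.json")
capture_trace(lambda: torch.randn(128, 128) @ torch.randn(128, 128), trace_path)
print("trace written to", trace_path)
```

The resulting JSON file can be opened in a trace viewer such as chrome://tracing or Perfetto to inspect the timeline.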
[D] So... Pytorch vs Tensorflow: what's the verdict on how they compare? What are their individual strong points?
I've been meaning to do a project in tensorflow so I can make a candid, three-way comparison between Theano+Lasagne, PyTorch, and Tensorflow, but I can give some rambling thoughts here about the first two. Background: I started with Theano+Lasagne almost exactly a year ago and used it for two of my papers. I switched over to PyTorch last week, and have reimplemented two of my key current projects which were previously in Theano. API: The way Theano's graph construction and compilation works was a bit of a steep learning curve for me, but once I got the hang of it everything clicked (this took about two months, but I was still learning python and basic neural net stuff so take that with a grain of salt). Lasagne's API, to me, is elegant as Catherine the Great riding an orca into battle, which is to say I love it to death. I've always said that it's the library I would write if I knew ahead of time exactly how I wanted a theano topper library to work, and it drastically eases a lot of the gruntwork. PyTorch's API, on the other hand, feels a little bit more raw, but there's a couple of qualifiers around that, which I'll get to in a moment. If you just want to do standard tasks (implement a ResNet or VGG) I don't think you'll ever have an issue, but I've been lightly butting heads with it because all I ever do is weird, weird, shit. For example, in my current project I've had to make do with several hacky workarounds because strided tensor indexing isn't yet implemented, and while the current indexing techniques are flexible, they're a lot less intuitive than being able to just use numpy-style indexing. The central qualifier is that they literally just released the friggin' framework; of course not everything is implemented and there's still some kinks to work out. Theano is old and well-established, and I wasn't really around to observe any of its or Lasagne's growing pains.
Newness aside, my biggest "complaint" with pytorch is basically that "things aren't put together the way I would have put them together" on the neural net API side. Specifically, I really like Lasagne's "layers" paradigm--but a little bit of critical thinking should lead you to the conclusion that that paradigm is specifically and exactly unsuited to a dynamic graph framework. I'm completely used to thinking and optimizing my thought processes around static graph definition, so making the switch API-wise is a minor pain-point. This is really critical--I've spent so long thinking about "Okay, exactly how would I define this graph in Theano, because I can't just write it as I would a regular ole program with my standard flow control" that I've become really strong in that avenue of thinking. Dynamic graphs, however, necessitate an API which is fundamentally different from the "define+run," and while I personally don't find it as intuitive, in the last week alone the ability to do define-by-run stuff has, as CJ said, opened my mind and given me half a dozen project ideas which previously would have been impossible. I also imagine that if you do anything with RNNs where you want to, say, implement dynamic computation time without wasted computation, the imperative nature of the interface is going to make it a lot easier to do so. Speed: So I haven't done extensive benchmarks, but I was surprised to find that PyTorch was, out of the box, 100% faster at training time than theano+lasagne on single-GPU for my current project. I've tested this on a 980 and on a Titan X, with two implementations of my network which I have confirmed to be identical to within a reasonable margin of error. One. Hundred. Percent. Literally going from (in the simplest case) 5 mins/epoch to 2.5 mins/epoch on CIFAR100, and in some cases going down to 2 mins/epoch (i.e. more than twice as fast). 
This is with identical boilerplate code, using identical data fetchers (I can't unironically say "fetcher" without thinking "DIE, FETCHER!"), identical everything else other than the actual code that trains and runs the network. This surprised the hell out of me because I was under the impression that Theano's extensive and aggressive memory optimizations (which, in this case, you pay for with several minutes of compilation time when you start training) meant that it was crazy fast on single GPU. I don't know what leads to the improved speed, either, because they're both using the latest version of cuDNN (I've explicitly checked to make sure this is so), so all those gains must be in the overhead somewhere, but sweet christmas I have no idea where. Relatedly, I've never been able to get multi-GPU or half-precision floats working with theano, ever. I've spent multiple days trying to get libgpuarray working and I've tinkered a bit with platoon, but each time I've come away exhausted (assuming I can even get the damn sources to compile, which was already a pain point). Out of the box, however, PyTorch's data-parallelism (single node, 4 GPUs) and half-precision (pseudo-FP16 for convolutions, which means it's not any faster but it uses way less memory) just...worked. I was stunned by this as well. Dev Interactions: My interactions with the core dev teams of both frameworks have been obscenely pleasant. I've come to the Lasagne and Theano guys with difficulties and questions about weird stuff many, many times and they've always promptly and succinctly helped me figure out what was wrong (usually what I didn't understand). The PyTorch team has been just as helpful--I've been bringing up bugs or issues I encounter and getting near-immediate responses, often accompanied by same-day fixes, workarounds, or issue trackers.
I haven't worked in Keras or in Tensorflow, but I have taken a look at their "Issues" dockets and some usergroups and just due to the sheer volume of users these frameworks have it doesn't look like it's possible to get that kind of individual attention--it almost feels like I'm going to Cal Poly (where the faculty:student ratio is really high and you rarely have any more than 20 students in a class) while looking over at people in a 1,000 people lecture hall at Berkeley. That's not at all to condemn the Cal kids or imply in any way that the analogical berk doesn't work, but if you're someone like me who's into non-standard neural net stuff (we're talking Chuck Tingle weird) then having the ability to get quick responses from the guys who actually build the framework is invaluable. Misc: The singular issue I'm worried about (and why I'm planning on picking up TensorFlow this year and having all three in my pocket) is that neither Theano nor PyTorch seem designed for deployment, and it doesn't look like that's a planned central focus on the PyTorch roadmap (though I could be wrong on this front, I vaguely recall reading a forum post about this). I'd like to practice deploying some stuff onto a website or droid app (mostly for fun, but I've been crazy focused on research and I think it would be a real useful skill to be able to actually get something I made onto a device), and I'm just not sure that the other frameworks support that quite as well. Relatedly, PyTorch's distributed framework is still experimental, and last I heard TensorFlow was designed with distributed in mind (if it rhymes, it must be true; the sky is green, the grass is blue [brb rewriting this entire post as beat poetry]), so if you need to run truly large-scale experiments TF might still be your best bet. 
TL;DR: I'm not really trying to recommend one framework over another; I love Lasagne to death (and beyond), but I've been finding that the flexibility of dynamic graphs and the sheer, incomprehensible speed gains I've been getting with PyTorch just in the last week alone and with very little relative time invested into learning the framework mean that I'm making the switch and I'm not likely to look back. I don't know much about TensorFlow yet, but the individual attention I can get from the pytorch devs is a big point for me as I look to do weird researchy stuff, but I'm also likely to pick up tensorflow for some projects later in the year. This post is pretty rambly, but hopefully if you're reading it you can pick up some impressions. Please take this for what it is: my experience, not a hard-and-fast "this is how it is, you will definitely feel the same way." More on reddit.com
Videos
Lightning Talk: Profiling and Memory Debugging Tools for Distributed ... (09:30)
Profiling and Tuning PyTorch Models - Shagun Sodhani | PyData Global ... (27:07)
PROFILING AND OPTIMIZING PYTORCH APPLICATIONS WITH THE PYTORCH ... (12:08)
Five Ways To Increase Your Model Performance Using PyTorch Profiler ... (03:01)
PyTorch Community Voices | PyTorch Profiler | Sabrina & Geeta - ... (55:03)
GitHub
github.com › Stonesjtu › pytorch_memlab
GitHub - Stonesjtu/pytorch_memlab: Profiling and inspecting memory in pytorch · GitHub
May 28, 2019 - In this repo, I'm going to share ... is interested in. The memory profiler is a modification of Python's line_profiler; it gives the memory usage info for each line of code in the specified function/method....
Starred by 1.1K users
Forked by 39 users
Languages Python 56.2% | Jupyter Notebook 43.8%
Zdevito
zdevito.github.io › 2022 › 12 › 09 › memory-traces.html
Visualizing PyTorch memory usage over time | Zach’s Blog
December 9, 2022 - Memory snapshots are a way to dump and visualize the state of CUDA memory allocation in PyTorch. They are useful for debugging out of memory (OOM) errors by showing stack traces for allocated memory and how the allocated memory fits in the caches used by the caching allocator.
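A memory snapshot of the kind described here might be recorded roughly as follows. This sketch assumes a recent PyTorch release (2.1 or later) where the underscore-prefixed helpers below exist with these arguments; they are private APIs and may change. `dump_cuda_snapshot`, the file name, and the workload are hypothetical, and the function does nothing on machines without a CUDA device.

```python
import torch


def dump_cuda_snapshot(path="snapshot.pickle", max_entries=100_000):
    """Record allocator history around a small workload and dump a snapshot.

    Returns the path written, or None when no CUDA device is available.
    """
    if not torch.cuda.is_available():
        return None
    # Private APIs (assumed PyTorch >= 2.1): start recording allocation events.
    torch.cuda.memory._record_memory_history(max_entries=max_entries)
    try:
        x = torch.randn(1024, 1024, device="cuda")  # hypothetical workload
        y = x @ x
        torch.cuda.synchronize()
        torch.cuda.memory._dump_snapshot(path)
    finally:
        # Passing enabled=None stops recording.
        torch.cuda.memory._record_memory_history(enabled=None)
    return path


result = dump_cuda_snapshot()
print(result or "CUDA not available; nothing recorded")
```

The pickle file can then be loaded into the interactive viewer at pytorch.org/memory_viz to browse allocations and their stack traces.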
APXML
apxml.com › courses › advanced-pytorch › chapter-4-deployment-performance-optimization › pytorch-profiler
Using the PyTorch Profiler for Bottleneck Analysis
Kernel Launches and GPU Utilization: Tracks CUDA kernel launches and their execution times on the GPU, helping visualize parallelism and identify periods where the GPU might be idle. Memory Usage (Optional): When enabled, it tracks memory allocations and deallocations on both CPU and GPU devices, ...
H-huang
h-huang.github.io › tutorials › recipes › recipes › profiler_recipe.html
PyTorch Profiler — PyTorch Tutorials 1.8.1+cu102 documentation
PyTorch profiler can also show the amount of memory (used by the model’s tensors) that was allocated (or released) during the execution of the model’s operators. In the output below, ‘self’ memory corresponds to the memory allocated (released) by the operator, excluding the children ...
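The distinction between 'self' memory and total memory can be illustrated with a toy call tree. This is a plain-Python sketch with made-up numbers, not profiler output; `OpNode` is a hypothetical stand-in for one profiled operator call.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OpNode:
    """Hypothetical stand-in for one profiled operator call."""
    name: str
    self_bytes: int  # bytes allocated (+) or released (-) by this op directly
    children: List["OpNode"] = field(default_factory=list)

    def total_bytes(self) -> int:
        # Total memory = the op's own allocations plus those of its children.
        return self.self_bytes + sum(c.total_bytes() for c in self.children)


# Made-up numbers: a conv op that allocates nothing itself but calls aten::empty.
empty = OpNode("aten::empty", self_bytes=4096)
conv = OpNode("aten::conv2d", self_bytes=0, children=[empty])

print(conv.name, "self:", conv.self_bytes, "total:", conv.total_bytes())
# prints: aten::conv2d self: 0 total: 4096
```

This is why low-level operators such as aten::empty tend to dominate a table sorted by self memory, while the high-level operators that call them show up when sorting by total memory.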
Ohio Supercomputer Center
osc.edu › book › export › html › 6407
HOWTO: Estimating and Profiling GPU Memory Usage for Generative AI
See documentation here for more information on how to snapshot GPU memory usage while running PyTorch code. "PyTorch Profiler is a tool that allows the collection of performance metrics during training and inference.
PyPI
pypi.org › project › pytorch-memlab
pytorch-memlab · PyPI
The Line Profiler profiles the memory usage of CUDA device 0 by default; you may want to switch the device to profile with set_target_gpu. The GPU selection is global, which means you have to remember which GPU you are profiling on during the ...
» pip install pytorch-memlab
CodeGenes
codegenes.net › blog › pytorch-memory-profiler
PyTorch Memory Profiler: A Comprehensive Guide — codegenes.net
Allocated Memory: This refers to the amount of memory that has been allocated by PyTorch for storing tensors and other data structures. Peak Memory: The maximum amount of memory used during the execution of the profiled code. This metric is crucial as it helps in determining if the system has enough memory to handle the operations. Memory Usage Over Time: By analyzing how memory usage changes over time, we can identify memory-intensive parts of the code, such as large tensor creations or operations that lead to excessive memory consumption.
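The three metrics above are related in a simple way, which the following plain-Python sketch tries to show. It assumes a hypothetical list of allocation deltas in bytes (positive for an allocation, negative for a free) instead of real profiler data; `summarize_timeline` is a name made up for this example.

```python
from typing import Iterable, List, Tuple


def summarize_timeline(deltas: Iterable[int]) -> Tuple[int, int, List[int]]:
    """Return (final allocated bytes, peak bytes, usage after each event)."""
    current = 0
    peak = 0
    usage = []
    for delta in deltas:
        current += delta           # positive = allocation, negative = free
        peak = max(peak, current)  # peak is the highest running total seen
        usage.append(current)
    return current, peak, usage


# Made-up event stream, in bytes.
events = [400, 600, -400, 1000, -600, -1000]
allocated, peak, usage = summarize_timeline(events)
print("allocated:", allocated, "peak:", peak)
print("over time:", usage)
# prints: allocated: 0 peak: 1600
# prints: over time: [400, 1000, 600, 1600, 1000, 0]
```

Note that the final allocated value is 0 here even though the peak was 1600 bytes, which is why peak memory, not the end-of-run figure, decides whether a workload fits.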
PyTorch Forums
discuss.pytorch.org › t › cuda-memory-profiling › 182065
CUDA Memory Profiling - PyTorch Forums
June 14, 2023 - I’m currently using torch.profiler.profile to analyze peak memory on my GPUs. I first used the on_trace_ready argument to generate a TensorBoard trace and read the information by hand, but now I want to read that information directly in my code. So I’ve set up my profiler as:
    self.prof = torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        record_shapes=True,
        profile_memory=True,
    )
And then I used the f...
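If the goal is only to read the peak GPU memory from code, one alternative to the profiler (a different technique from the one in the post) is to query the CUDA caching allocator's peak counter. The sketch below assumes that approach; `peak_cuda_bytes` and the workload are hypothetical, and the function returns None when no CUDA device is present.

```python
import torch


def peak_cuda_bytes(fn, device=0):
    """Return the peak bytes allocated on `device` while running fn().

    Returns None when CUDA is not available. The peak counter is reset to the
    currently allocated amount first, so tensors that already live on the
    device are included in the result.
    """
    if not torch.cuda.is_available():
        return None
    torch.cuda.reset_peak_memory_stats(device)
    fn()
    torch.cuda.synchronize(device)
    return torch.cuda.max_memory_allocated(device)


# Hypothetical workload: allocate one tensor on the GPU and reduce it.
peak = peak_cuda_bytes(lambda: torch.randn(1024, 1024, device="cuda").sum())
print("peak bytes:", peak)
```

This gives a single number per measured region and no per-operator breakdown, so it complements the profiler output instead of replacing it.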
Plain English
python.plainenglish.io › 5-hidden-pytorch-memory-profiling-tricks-every-ml-engineer-needs-for-gpu-intensive-models-94b3bf5574fa
5 Hidden PyTorch Memory Profiling Tricks Every ML Engineer Needs for GPU-Intensive Models | by Nithin Bharadwaj | Python in Plain English
September 15, 2025 - Reviewing these snapshots can show you how memory evolves over time. What if your memory usage keeps creeping up with each epoch? When working with multiple GPUs or overlapping operations, custom CUDA streams can introduce hidden bottlenecks. Profiling stream-specific activities helps isolate these issues.
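A quick check for memory that creeps up with each epoch might look like the sketch below. It assumes one memory reading per epoch, for example from torch.cuda.memory_allocated() at the end of each epoch; the readings here are made up, and `grows_every_epoch` is a hypothetical helper.

```python
def grows_every_epoch(samples, tolerance=0):
    """True if each reading exceeds the previous one by more than `tolerance` bytes."""
    return len(samples) > 1 and all(
        later - earlier > tolerance
        for earlier, later in zip(samples, samples[1:])
    )


# Made-up per-epoch readings, in bytes.
leaky = [100, 120, 150, 190]
steady = [100, 100, 100, 100]

print(grows_every_epoch(leaky))   # prints: True
print(grows_every_epoch(steady))  # prints: False
```

A steady increase like this often points to tensors being kept alive across iterations, for instance by accumulating a loss tensor that still carries its autograd graph instead of a plain Python number.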