pytorch memory usage profiling

How to get pytorch's memory stats on CPU / main memory?

stackoverflow.com › questions › 71188895 › how-to-get-pytorchs-memory-stats-on-cpu-main-memory

For this, you want to use the PyTorch Profiler, which gives you details on both CPU and memory consumption.

For more details:

https://pytorch.org/blog/introducing-pytorch-profiler-the-new-and-improved-performance-tool/

https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html

Answer from Mradul Karmodiya on Stack Overflow
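As a rough illustration of the answer above, the profiler's per-operator CPU memory figures can also be read in code instead of from a printed table. This is a minimal sketch, not the answerer's own code: the helper name `top_cpu_memory_ops` and the toy model are hypothetical, and it assumes a PyTorch build where `torch.profiler` supports `profile_memory=True` on CPU.

```python
import torch
from torch import nn
from torch.profiler import profile, ProfilerActivity


def top_cpu_memory_ops(model, inputs, limit=5):
    """Return (operator name, self CPU bytes) pairs, largest allocators first.

    Hypothetical helper: it wraps one forward pass in the profiler and ranks
    the aggregated operator events by the memory each one allocated itself.
    """
    with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
        with torch.no_grad():
            model(inputs)
    events = prof.key_averages()  # one aggregated entry per operator name
    ranked = sorted(events, key=lambda e: e.self_cpu_memory_usage, reverse=True)
    return [(e.key, e.self_cpu_memory_usage) for e in ranked[:limit]]


# Toy model and input, chosen only to keep the example small (an assumption).
toy_model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 10))
toy_inputs = torch.randn(64, 256)
cpu_memory_report = top_cpu_memory_ops(toy_model, toy_inputs)

for name, nbytes in cpu_memory_report:
    print(f"{name:<30} {nbytes / 1024:10.1f} KiB")
```

These numbers cover tensor memory requested through PyTorch's allocator during the profiled region only; they are not a process-wide figure such as resident set size.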

PyTorch

pytorch.org › blog › understanding-gpu-memory-1

Understanding GPU Memory 1: Visualizing All Allocations over Time – PyTorch

December 14, 2023 - The Memory Profiler is an added feature of the PyTorch Profiler that categorizes memory usage over time.

PyTorch

docs.pytorch.org › recipes › pytorch profiler

PyTorch Profiler — PyTorch Tutorials 2.12.0+cu130 documentation

July 20, 2022 -

import torch
import torchvision.models as models
from torch.profiler import profile, ProfilerActivity

model = models.resnet18()
inputs = torch.randn(5, 3, 224, 224)

# Record CPU activity, including memory allocated and released by each operator.
with profile(
    activities=[ProfilerActivity.CPU],
    profile_memory=True,
    record_shapes=True
) as prof:
    model(inputs)

print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))

# (omitting some columns)
# ---------------------------------  ------------  ------------  ------------
#                              Name       CPU Mem  Self CPU Mem    # of Calls
# ---------------------------------  ------------  ------------  ------------
#                       aten::empty      94.79 Mb      94.79 Mb           121
#     aten::max_pool2d_with_indices      11.48 Mb      11.48 Mb             1
#                       aten::addmm      19.53 Kb      19.53 Kb             1
# aten

Discussions

How to get pytorch's memory stats on CPU / main memory? - Stack Overflow

Now my question is: Why does this only work for the GPU? I couldn't find anything like torch.cpu.memory_stats(). What is the counterpart for this when running on a CPU? ... For this, you want to use the PyTorch Profiler, which gives you details on both CPU and memory consumption. More on stackoverflow.com

stackoverflow.com

CUDA Memory Profiling

I’m currently using torch.profiler.profile to analyze peak memory on my GPUs. I first used the on_trace_ready argument to generate a TensorBoard trace and read the information by hand, but now I want to read that information directly in my code. So I’ve set up my profiler as: self.prof ... More on discuss.pytorch.org

discuss.pytorch.org

June 14, 2023
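One way to get a peak-memory number directly in code, without going through a trace viewer, is to read the CUDA caching allocator's counters rather than the profiler. The sketch below swaps in that technique; it is not the poster's setup. The helper `peak_cuda_memory_bytes` and the workload `hypothetical_step` are made-up names, and the helper simply returns `None` for the peak when no CUDA device is present.

```python
import torch


def peak_cuda_memory_bytes(fn, *args, device=None, **kwargs):
    """Run fn and return (result, peak bytes occupied by tensors on the GPU).

    Hypothetical helper. The peak is None when CUDA is not available.
    """
    if not torch.cuda.is_available():
        return fn(*args, **kwargs), None
    torch.cuda.reset_peak_memory_stats(device)  # start a fresh peak measurement
    result = fn(*args, **kwargs)
    torch.cuda.synchronize(device)  # let queued kernels finish before reporting
    return result, torch.cuda.max_memory_allocated(device)


def hypothetical_step():
    # Stand-in workload: one matrix multiply on the GPU if present, else on CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(1024, 1024, device=device)
    return (x @ x).sum().item()


value, peak = peak_cuda_memory_bytes(hypothetical_step)
print("peak bytes allocated on GPU:", peak)
```

`torch.cuda.max_memory_allocated` reports memory occupied by tensors; memory that the caching allocator has reserved but not handed out is tracked separately by `torch.cuda.max_memory_reserved`.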

Pytorch Profiler

I use it to capture a trace of the run. It is very useful for identifying the performance bottlenecks of your training loop and coming up with optimizations. There is a bit of a learning curve to mastering this technique. You need some understanding of how the GPU and CPU work together (e.g., GPU kernels are asynchronous; when do the CPU and GPU sync with each other; what are CUDA streams; what can a GPU do in parallel). Definitely recommended if you need to understand the performance of your training or inference code. Nsight can be an additional tool, since it can provide more information than the standard profiler. This is an example from the PyTorch team of using traces and the profiler to iteratively optimize a model's efficiency: https://pytorch.org/blog/accelerating-generative-ai/ More on reddit.com

r/pytorch

June 3, 2024
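To make the "capture a trace" workflow from the comment above concrete, here is a minimal sketch that records a few forward passes and writes a Chrome-format trace file. The model, tensor sizes, and output path are assumptions made for illustration. Because GPU kernels are launched asynchronously, the sketch synchronizes before leaving the profiled region when a GPU is in use.

```python
import os
import tempfile

import torch
from torch import nn
from torch.profiler import profile, ProfilerActivity

# Profile CPU activity always, and CUDA activity only when a GPU is present.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical tiny model and batch, just to produce some operator events.
model = nn.Linear(128, 128).to(device)
batch = torch.randn(32, 128, device=device)

with profile(activities=activities) as prof:
    for _ in range(3):
        model(batch)
    if device == "cuda":
        # Kernel launches return immediately; wait so the kernels complete
        # inside the profiled region.
        torch.cuda.synchronize()

# Write the trace to a temporary directory (the path is an assumption).
trace_path = os.path.join(tempfile.mkdtemp(), "trace.json")
prof.export_chrome_trace(trace_path)
print("trace written to", trace_path)
```

The resulting JSON file can be opened in a trace viewer such as chrome://tracing or Perfetto to see how CPU-side operator calls line up with GPU kernels.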

[D] So... Pytorch vs Tensorflow: what's the verdict on how they compare? What are their individual strong points?

I've been meaning to do a project in TensorFlow so I can make a candid, three-way comparison between Theano+Lasagne, PyTorch, and TensorFlow, but I can give some rambling thoughts here about the first two.

Background: I started with Theano+Lasagne almost exactly a year ago and used it for two of my papers. I switched over to PyTorch last week, and have reimplemented two of my key current projects which were previously in Theano.

API: The way Theano's graph construction and compilation works was a bit of a steep learning curve for me, but once I got the hang of it everything clicked (this took about two months, but I was still learning Python and basic neural net stuff so take that with a grain of salt). Lasagne's API, to me, is elegant as Catherine the Great riding an orca into battle, which is to say I love it to death. I've always said that it's the library I would write if I knew ahead of time exactly how I wanted a Theano topper library to work, and it drastically eases a lot of the gruntwork.

PyTorch's API, on the other hand, feels a little bit more raw, but there's a couple of qualifiers around that, which I'll get to in a moment. If you just want to do standard tasks (implement a ResNet or VGG) I don't think you'll ever have an issue, but I've been lightly butting heads with it because all I ever do is weird, weird, shit. For example, in my current project I've had to make do with several hacky workarounds because strided tensor indexing isn't yet implemented, and while the current indexing techniques are flexible, they're a lot less intuitive than being able to just use NumPy-style indexing. The central qualifier is that they literally just released the friggin' framework, of course not everything is implemented and there's still some kinks to work out. Theano is old and well-established, and I wasn't really around to observe any of its or Lasagne's growing pains.

Newness aside, my biggest "complaint" with PyTorch is basically that "things aren't put together the way I would have put them together" on the neural net API side. Specifically, I really like Lasagne's "layers" paradigm--but a little bit of critical thinking should lead you to the conclusion that that paradigm is specifically and exactly unsuited to a dynamic graph framework. I'm completely used to thinking and optimizing my thought processes around static graph definition, so making the switch API-wise is a minor pain-point. This is really critical--I've spent so long thinking about "Okay, exactly how would I define this graph in Theano, because I can't just write it as I would a regular ole program with my standard flow control" that I've become really strong in that avenue of thinking. Dynamic graphs, however, necessitate an API which is fundamentally different from the "define+run," and while I personally don't find it as intuitive, in the last week alone the ability to do define-by-run stuff has, as CJ said, opened my mind and given me half a dozen project ideas which previously would have been impossible. I also imagine that if you do anything with RNNs where you want to, say, implement dynamic computation time without wasted computation, the imperative nature of the interface is going to make it a lot easier to do so.

Speed: So I haven't done extensive benchmarks, but I was surprised to find that PyTorch was, out of the box, 100% faster at training time than Theano+Lasagne on single-GPU for my current project. I've tested this on a 980 and on a Titan X, with two implementations of my network which I have confirmed to be identical to within a reasonable margin of error. One. Hundred. Percent. Literally going from (in the simplest case) 5 mins/epoch to 2.5 mins/epoch on CIFAR100, and in some cases going down to 2 mins/epoch (i.e. more than twice as fast). This is with identical boilerplate code, using identical data fetchers (I can't unironically say "fetcher" without thinking "DIE, FETCHER!"), identical everything else other than the actual code that trains and runs the network. This surprised the hell out of me because I was under the impression that Theano's extensive and aggressive memory optimizations (which, in this case, you pay for with several minutes of compilation time when you start training) meant that it was crazy fast on single GPU. I don't know what leads to the improved speed, either, because they're both using the latest version of cuDNN (I've explicitly checked to make sure this is so), so all those gains must be in the overhead somewhere, but sweet christmas I have no idea where.

Relatedly, I've never been able to get multi-GPU or half-precision floats working with Theano, ever. I've spent multiple days trying to get libgpuarray working and I've tinkered a bit with platoon, but each time I've come away exhausted (assuming I can even get the damn sources to compile, which was already a pain point). Out of the box, however, PyTorch's data-parallelism (single node, 4 GPUs) and half-precision (pseudo-FP16 for convolutions, which means it's not any faster but it uses way less memory) just...worked. I was stunned by this as well.

Dev Interactions: My interactions with the core dev teams of both frameworks have been obscenely pleasant. I've come to the Lasagne and Theano guys with difficulties and questions about weird stuff many, many times and they've always promptly and succinctly helped me figure out what was wrong (usually what I didn't understand). The PyTorch team has been just as helpful--I've been bringing up bugs or issues I encounter and getting near-immediate responses, often accompanied by same-day fixes, workarounds, or issue trackers. I haven't worked in Keras or in TensorFlow, but I have taken a look at their "Issues" dockets and some usergroups and just due to the sheer volume of users these frameworks have it doesn't look like it's possible to get that kind of individual attention--it almost feels like I'm going to Cal Poly (where the faculty:student ratio is really high and you rarely have any more than 20 students in a class) while looking over at people in a 1,000-person lecture hall at Berkeley. That's not at all to condemn the Cal kids or imply in any way that the analogical berk doesn't work, but if you're someone like me who's into non-standard neural net stuff (we're talking Chuck Tingle weird) then having the ability to get quick responses from the guys who actually build the framework is invaluable.

Misc: The singular issue I'm worried about (and why I'm planning on picking up TensorFlow this year and having all three in my pocket) is that neither Theano nor PyTorch seem designed for deployment, and it doesn't look like that's a planned central focus on the PyTorch roadmap (though I could be wrong on this front, I vaguely recall reading a forum post about this). I'd like to practice deploying some stuff onto a website or droid app (mostly for fun, but I've been crazy focused on research and I think it would be a real useful skill to be able to actually get something I made onto a device), and I'm just not sure that the other frameworks support that quite as well. Relatedly, PyTorch's distributed framework is still experimental, and last I heard TensorFlow was designed with distributed in mind (if it rhymes, it must be true; the sky is green, the grass is blue [brb rewriting this entire post as beat poetry]), so if you need to run truly large-scale experiments TF might still be your best bet.

TL;DR: I'm not really trying to recommend one framework over another; I love Lasagne to death (and beyond), but I've been finding that the flexibility of dynamic graphs and the sheer, incomprehensible speed gains I've been getting with PyTorch just in the last week alone and with very little relative time invested into learning the framework mean that I'm making the switch and I'm not likely to look back. I don't know much about TensorFlow yet, and the individual attention I can get from the PyTorch devs is a big point for me as I look to do weird researchy stuff, though I'm also likely to pick up TensorFlow for some projects later in the year. This post is pretty rambly, but hopefully if you're reading it you can pick up some impressions. Please take this for what it is: my experience, not a hard-and-fast "this is how it is, you will definitely feel the same way." More on reddit.com

r/MachineLearning


February 25, 2017

Videos