perf record - Brave Search

How does perf record (or other profilers) pick which instruction to count as costing time?

stackoverflow.com › questions › 69351189 › how-does-perf-record-or-other-profilers-pick-which-instruction-to-count-as-cos

(A quick, not super detailed answer; a more detailed one would be good if someone wants to write one.)

perf just uses the CPU's own hardware performance counters, which can be put into a mode where they record an event when the counter counts down to zero or up to a threshold, either by raising an interrupt or by writing an event into a buffer in memory (with PEBS precise events).

That event will include a code address that the CPU picked to associate with the event (i.e. the point at which the interrupt was raised), even for events like cycles which, unlike instructions, don't inherently have a specific instruction associated. The out-of-order exec back-end can have a couple hundred instructions in flight when the counter wraps, but has to pick exactly one for any given sample.

Generally the CPU "blames" the instruction that was waiting for a slow-to-produce result, not the one producing it, especially cache-miss loads.

For an example with Intel x86 CPUs, see "Why is this jump instruction so expensive when performing pointer chasing?", which also appears to depend on the effect of letting the last instruction in the ROB retire when an interrupt is raised. (Intel CPUs at least do seem to do that; it makes sense for ensuring forward progress even with a potentially slow instruction.)

In general there can be "skew", where the instruction that gets blamed is later than the one actually taking the time, possibly with different causes. (Perhaps especially for uncore events, since they happen asynchronously to the core clock.)
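The blame pattern described above can be sketched with a toy model. This is only an illustration under stated assumptions: the loop body, the cycle counts, and the rule "let the oldest instruction retire, then report the next address" are all hypothetical simplifications, not a description of how any particular CPU actually behaves.

```python
from collections import Counter

# Hypothetical loop body: (mnemonic, cycles spent as the oldest un-retired instruction).
# The numbers are invented for illustration, not measured.
LOOP = [("mov rax, [rax]", 100), ("add rcx, 1", 1), ("jnz loop", 1)]


def blame_counts(loop):
    """Count which instruction gets blamed if one sample lands on every cycle.

    Assumed model: when the sampling interrupt arrives, the oldest instruction
    is allowed to retire first, so the reported address is the following one.
    """
    counts = Counter()
    for index, (_, cycles_at_head) in enumerate(loop):
        blamed = loop[(index + 1) % len(loop)][0]
        # Every cycle this instruction spends at the head is charged to its successor.
        counts[blamed] += cycles_at_head
    return counts


if __name__ == "__main__":
    for name, hits in blame_counts(LOOP).most_common():
        print(f"{hits:4d}  {name}")
```

In this model nearly all samples land on the instruction after the cache-miss load, which is the kind of skew the answer describes.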

Other related Q&As with interesting examples or other things

Inconsistent `perf annotate` memory load/store time reporting
Linux perf reporting cache misses for unexpected instruction
https://travisdowns.github.io/blog/2019/08/20/interrupts.html - some experiments into which instructions tend to get counts on Skylake.

Answer from Peter Cordes on Stack Overflow

Linux Man Pages

man7.org › linux › man-pages › man1 › perf-record.1.html

perf-record(1) - Linux manual page

If other --filter exists, the new filter expression will be combined with them by &&.
--latency
    Enable data collection for latency profiling. Use perf report --latency for latency-centric profile.
-a, --all-cpus
    System-wide collection from all CPUs (default if no target is specified).
-p, --pid=
    Record events on existing process ID (comma separated list).

brendangregg.com › perf.html

Linux perf Examples

# perf record -e block:block_rq_issue -ag
^C
# ls -l perf.data
-rw------- 1 root root 3458162 Jan 26 03:03 perf.data
# perf report
[...]
# Samples: 2K of event 'block:block_rq_issue'
# Event count (approx.): 2216
#
# Overhead Command Shared Object Symbol
# ........

Videos

How to Use Perf Performance Analysis Tool on Ubuntu 22.04 - YouTube

Linux perf tool metrics - Ian Rogers - YouTube

November 29, 2023

Demystifying Perf - YouTube

Linux perf Tutorial: Profiling User-Space Apps with DWARF Debug ...

December 17, 2025

Fastware - perf - How to analyse the performance of my program! ...

Arch Linux Man Pages

man.archlinux.org › man › perf-record.1.en

perf-record(1) — Arch manual pages

Record events in threads owned by uid. Name or number. ... Collect data with this RT SCHED_FIFO priority. ... Collect data without buffering. ... Event period to sample. ... Output file name. ... Child tasks do not inherit counters. ... Profile at this frequency. Use max to use the currently maximum allowed frequency, i.e. the value in the kernel.perf_event_max_sample_rate sysctl.

docs.redhat.com › en › documentation › red_hat_enterprise_linux › 10 › html › monitoring_and_managing_system_status_and_performance › recording-and-analyzing-performance-profiles-with-perf

Chapter 13. Recording and analyzing performance profiles with perf | Monitoring and managing system status and performance | Red Hat Enterprise Linux | 10 | Red Hat Documentation

You have the perf user space tool installed. For more information, see Installing perf. ... This command samples and records performance data of the processes with the process IDs ID1 and ID2 for the number of seconds passed to the sleep command.
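A minimal sketch of how such a command line could be assembled from a script. The helper name `perf_record_argv` and the example PIDs are hypothetical; only the `-p` flag with a comma-separated PID list and the trailing `sleep` command come from the snippets above. The sketch only prints the command and does not run perf.

```python
import shlex


def perf_record_argv(pids, seconds):
    """Build the argv for: perf record -p ID1,ID2 sleep <seconds>."""
    pid_list = ",".join(str(pid) for pid in pids)
    return ["perf", "record", "-p", pid_list, "sleep", str(seconds)]


if __name__ == "__main__":
    # Example PIDs are made up for illustration.
    print(shlex.join(perf_record_argv([1234, 5678], 10)))
```

The resulting list could be passed to `subprocess.run` on a machine where perf is installed and the caller has permission to attach to those processes.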

easyperf.net › blog › 2018 › 08 › 26 › Basics-of-profiling-with-perf

Basics of profiling with perf. | Easyperf

We have 2451 samples, that’s 1 sample per millisecond. And that’s a default behaviour: the perf tool defaults the frequency to 1000Hz, or 1000 samples/sec. It’s also equivalent to run perf record -F 1000. Perf will stop our program 1000 times per second and see where the IP (instruction pointer) is.
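The arithmetic behind that statement can be checked with a short sketch. The function name is hypothetical, and the estimate assumes the samples come from a single thread that stayed on-CPU for the whole run at the default rate.

```python
def approx_runtime_seconds(samples, frequency_hz=1000):
    """Estimate on-CPU time from a sample count at a fixed sampling frequency."""
    return samples / frequency_hz


if __name__ == "__main__":
    # 2451 samples at 1000 samples/sec is roughly 2.451 seconds of CPU time.
    print(approx_runtime_seconds(2451))
```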

smalldatum.blogspot.com › 2022 › 04 › i-previously-wrote-about-generating.html

Small Datum: Becoming less confused about perf record

April 4, 2022 - So in that case perf record does sampling of one process for the cycles event. Disclaimer -- I am still waiting for an expert to review this. Update - a fix was pushed With frequency-based sampling (-F $frequency) perf record tries to generate $frequency samples per second.

hpc-wiki.info › hpc › Perf

Perf - HPC Wiki

September 23, 2019 - You can show the profile of the created log file with perf report. Depending on the installation, perf report will open a TUI or a GTK interface. If you are using pipes to process the output, it will provide a stdio output. Within the profile (TUI), you can zoom and look at the executed code (function level, or instruction level, i.e. assembler).

Linux Man Pages

linux.die.net › man › 1 › perf-record

perf-record(1) - Linux man page

perf-record - Run a command and record its profile into perf.data

dev.to › etcwilde › perf---perfect-profiling-of-cc-on-linux-of

Perf - Perfect Profiling of C/C++ on Linux - DEV Community

November 19, 2017 - Perf uses statistical profiling, where it polls the program and sees what function is working. This is less accurate, but has less of a performance hit than something like Callgrind, which tracks every call.

Linux Man Pages

man7.org › linux › man-pages › man1 › perf-report.1.html

perf-report(1) - Linux manual page

If the keys start with a prefix '+', then it will append the specified field(s) to the default field order. For example: perf report -F +period,sample. -p, --parent=<regex> A regex filter to identify parent. The parent is a caller of this function and searched through the callchain, thus it requires callchain information recorded.

phoenixnap.com › home › kb › sysadmin › linux perf: how to use the command and profiler

Linux perf: How to Use the Command and Profiler | phoenixNAP KB

August 24, 2023 - The output collects and displays performance statistics for the given process. ... After five seconds, the output displays all system-wide calls and their count. CPU cycles are a hardware event.

docs.redhat.com › en › documentation › red_hat_enterprise_linux › 8 › html › monitoring_and_managing_system_status_and_performance › recording-and-analyzing-performance-profiles-with-perf_monitoring-and-managing-system-status-and-performance

Chapter 21. Recording and analyzing performance profiles with perf | Monitoring and managing system status and performance | Red Hat Enterprise Linux | 8 | Red Hat Documentation

Replace command with the command ... it by pressing Ctrl+C. ... You can configure the perf record tool so that it records which function is calling other functions in the performance profile....

mankier.com › home › perf

perf: Performance analysis tools for Linux | Man Page | Commands | perf | ManKier

Display system-wide real-time performance counter profile: sudo perf top · Run a command and record its profile into perf.data: sudo perf record command · Record the profile of an existing process into perf.data: sudo perf record [-p|--pid] pid · Read perf.data (created by perf record) and ...

android.googlesource.com › kernel › msm › + › android-wear-5.1.1_r0.12 › tools › perf › Documentation › perf-record.txt

tools/perf/Documentation/perf-record.txt - kernel/msm - Git at Google

android/kernel/msm/android-wear-5.1.1_r0.12/./tools/perf/Documentation/perf-record.txt ·

stackoverflow.com › questions › 38727879 › what-does-perf-record-p-do

sample - What does perf record -P do? - Stack Overflow

@Alina,

If you read the man page for perf record, you can see that perf record -P will be used to record the sample period, and not specify it.

If you want to record more/less samples and modify the period, you have to specify the command like perf record -c 2 (--count=) where 2 is the sampling period. This will mean that for every 2 occurrences of the event that you are measuring, you will have a sample for that. You can then modify the sampling period and test various values.

The other way to express the sampling period is to specify the average rate of samples per second (frequency), which you can do using perf record -F. So perf record -F 1000 will record around 1000 samples per second, and these samples will be generated when the hardware/PMU counter corresponding to the event overflows. This means that the kernel will dynamically adjust the sampling period.

The sampling period can be specified with the -c option, though there is also a -F option to specify the sampling frequency. The defaults are 1000 samples/sec or 1000Hz according to the perf wiki:

Period and rate

The perf_events interface allows two modes to express the sampling period:

the number of occurrences of the event (period)

the average rate of samples/sec (frequency)

The perf tool defaults to the average rate. It is set to 1000Hz, or 1000 samples/sec. That means that the kernel is dynamically adjusting the sampling period to achieve the target average rate. The adjustment in period is reported in the raw profile data. In contrast, with the other mode, the sampling period is set by the user and does not vary between samples. There is currently no support for sampling period randomization.
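The two modes quoted above can be contrasted with a small sketch. Both function names are hypothetical, and `next_period` is a deliberately simplified stand-in for the kernel's dynamic adjustment, not its actual algorithm. The 3 GHz figure is an assumed example.

```python
def samples_fixed_period(event_count, period):
    """Fixed-period mode (-c): one sample per `period` occurrences of the event."""
    return event_count // period


def next_period(events_last_second, target_hz):
    """Toy frequency-mode (-F) adjustment.

    Picks the period that would have produced `target_hz` samples over the
    last second. The real kernel logic is more involved; this is a sketch.
    """
    return max(1, events_last_second // target_hz)


if __name__ == "__main__":
    # With -c 2, ten event occurrences yield five samples.
    print(samples_fixed_period(10, 2))
    # Assumed: a 3 GHz core running flat out counts about 3e9 cycles per second,
    # so a 1000 Hz target corresponds to a period of about 3 million cycles.
    print(next_period(3_000_000_000, 1000))
```

In fixed-period mode the sample count scales with the event count, while in frequency mode the period is what changes so that the sample rate stays near the target.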

As for what the -P option is doing, the commit message for perf (and the related kernel patch) contains some background. If I interpret it correctly, the option means that for efficiency reasons many equal samples can be merged into a single event that also contains the sample period. The original intent is to reduce the number of generated samples to avoid hitting a "rate limit" that would result in lost samples.

Mark Hansen's Blog

markhansen.co.nz › profiler-uis

Linux perf Profiler UIs

October 7, 2021 - perf can interrupt threads to record thread's stack traces, triggered by an event (e.g. a thread context switch, or a syscall) or on a regular schedule (e.g.

unix.stackexchange.com › questions › 18559 › how-to-analyze-profile-data-from-perf-record-a-system-wide-collection

linux - How to analyze profile data from `perf record --a` (system-wide collection)? - Unix & Linux Stack Exchange

If you are distributing the computations with MPI, then using an MPI-aware tool would give you more sensible results: with a distributed application, you might have issues of load imbalance, where one MPI process is idle waiting for data to come from other processes. If you happen to be profiling exactly that MPI process, your performance profile will be all wrong.

So, the first step is usually to find out about the communication and load balance pattern of your program, and identify a sample input that gives you the workload you want (e.g., CPU-intensive on rank 0) For instance, mpiP is an MPI profiling tool that can produce a very complete report about the communication pattern, how much time each MPI call took, etc.

Then you can run a code profiling tool on one or more selected MPI ranks. Anyway, using perf on a single MPI rank is likely not a good idea because its measurements will contain also the performance of the MPI library code, which is probably not what you are looking for.

perf report does not need the -a switch to report the results of perf record -a. You can simply type perf report.

That said, analyzing 14G of profiling data to track a crash seems odd. How about attaching a debugger?

perfwiki.github.io › main › tutorial

Introduction - perf: Linux profiling with performance counters

The perf tool can be used to collect profiles on per-thread, per-process and per-cpu basis. There are several commands associated with sampling: record, report, annotate. You must first collect the samples using perf record. This generates an output file called perf.data.