(Quick, not super detailed answer; a more detailed one would be good if someone wants to write one.)

perf just uses the CPU's own hardware performance counters, which can be put into a mode where they record an event when the counter counts down to zero or up to a threshold, either by raising an interrupt or by writing an event record into a buffer in memory (with PEBS precise events).

That event will include a code address that the CPU picked to associate with the event (i.e. the point at which the interrupt was raised), even for events like cycles which, unlike instructions, don't inherently have a specific instruction associated. The out-of-order exec back-end can have a couple hundred instructions in flight when the counter wraps, but has to pick exactly one for any given sample.
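As a rough illustration of the overflow mechanism described above, here is a toy Python model (not how the hardware or the kernel is implemented). It assumes a counter preloaded with `-period` that counts up by one per event and records the address of the event that caused the wrap; the function name, the addresses, and the event stream are all made up for the example. Real hardware may record a different, nearby address.

```python
from collections import Counter


def sample_on_overflow(event_addresses, period):
    """Toy model of counter-overflow sampling.

    The counter starts at -period and counts up by one per event.
    When it reaches zero, one sample is recorded for the address of
    the event that caused the wrap, and the counter is rearmed.
    """
    samples = []
    counter = -period
    for addr in event_addresses:
        counter += 1
        if counter == 0:          # overflow: record a sample and rearm
            samples.append(addr)
            counter = -period
    return samples


# Hypothetical event stream: 3 events at 0x10, then 1 at 0x20, 100 times.
stream = ([0x10] * 3 + [0x20]) * 100
samples = sample_on_overflow(stream, period=7)
histogram = Counter(samples)

for addr, count in sorted(histogram.items()):
    print(f"{addr:#x}: {count} samples")
```

Only one event in every `period` is sampled, yet the histogram ends up roughly proportional to how often each address generated events, which is the statistical basis for the percentages perf reports.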

Generally the CPU "blames" the instruction that was waiting for a slow-to-produce result, not the one producing it; this is especially noticeable with cache-miss loads.

For an example with Intel x86 CPUs, see "Why is this jump instruction so expensive when performing pointer chasing?", which also appears to depend on the effect of letting the last instruction in the ROB retire when an interrupt is raised. (Intel CPUs at least do seem to do that; it makes sense for ensuring forward progress even with a potentially slow instruction.)

In general there can be "skew", where a later instruction is blamed instead of the one actually taking the time, possibly with different causes. (Perhaps especially for uncore events, since they happen asynchronously to the core clock.)
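To make the skew concrete, here is a deliberately simplified Python model. It is only a sketch under stated assumptions: instructions retire in order, `head_cycles[i]` is a made-up number of cycles that instruction `i` spends as the oldest un-retired instruction, and a `skid` of 1 models the idea that the oldest instruction is allowed to retire before the interrupt is taken, so the recorded address is the next instruction. Real CPUs are far more complicated than this.

```python
from collections import Counter


def attribute_samples(head_cycles, period, iterations, skid=1):
    """Count which instruction index each cycle-based sample lands on.

    head_cycles[i]: toy number of cycles instruction i spends as the
    oldest un-retired instruction in one loop iteration.
    skid: how many instructions later the sample is recorded
    (0 = blame the stalled instruction itself; an assumption of the model).
    """
    blamed = Counter()
    n = len(head_cycles)
    cycle = 0
    next_sample = period
    for _ in range(iterations):
        for i, cost in enumerate(head_cycles):
            cycle += cost
            # Every counter overflow that happened while instruction i
            # was the oldest one is charged to instruction i + skid.
            while next_sample <= cycle:
                blamed[(i + skid) % n] += 1
                next_sample += period
    return blamed


# Hypothetical 3-instruction loop: a cache-miss load, a dependent add, a jump.
names = ["load", "add", "jmp"]
head_cycles = [200, 1, 1]

no_skid = attribute_samples(head_cycles, period=97, iterations=1000, skid=0)
with_skid = attribute_samples(head_cycles, period=97, iterations=1000, skid=1)

for label, result in (("skid=0", no_skid), ("skid=1", with_skid)):
    print(label, {names[i]: c for i, c in sorted(result.items())})
```

In this model the time is spent in the load in both runs, but with `skid=1` nearly all samples are recorded against the `add` that follows it, which is the kind of pattern that shows up in `perf annotate` output.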

Other related Q&As and links with interesting examples:

  • Inconsistent `perf annotate` memory load/store time reporting
  • Linux perf reporting cache misses for unexpected instruction
  • https://travisdowns.github.io/blog/2019/08/20/interrupts.html - some experiments into which instructions tend to get counts on Skylake.
Answer from Peter Cordes on Stack Overflow
Linux Man Pages
man7.org › linux › man-pages › man1 › perf-record.1.html
perf-record(1) - Linux manual page
If other --filter exists, the new filter expression will be combined with them by &&. --latency Enable data collection for latency profiling. Use perf report --latency for latency-centric profile. -a, --all-cpus System-wide collection from all CPUs (default if no target is specified). -p, --pid= Record events on existing process ID (comma separated list).
Brendan Gregg
brendangregg.com › perf.html
Linux perf Examples
    # perf record -e block:block_rq_issue -ag
    ^C
    # ls -l perf.data
    -rw------- 1 root root 3458162 Jan 26 03:03 perf.data
    # perf report
    [...]
    # Samples: 2K of event 'block:block_rq_issue'
    # Event count (approx.): 2216
    #
    # Overhead Command Shared Object Symbol
    # ........
Arch Linux Man Pages
man.archlinux.org › man › perf-record.1.en
perf-record(1) — Arch manual pages
Record events in threads owned by uid. Name or number. ... Collect data with this RT SCHED_FIFO priority. ... Collect data without buffering. ... Event period to sample. ... Output file name. ... Child tasks do not inherit counters. ... Profile at this frequency. Use max to use the currently maximum allowed frequency, i.e. the value in the kernel.perf_event_max_sample_rate sysctl.
Red Hat
docs.redhat.com › en › documentation › red_hat_enterprise_linux › 10 › html › monitoring_and_managing_system_status_and_performance › recording-and-analyzing-performance-profiles-with-perf
Chapter 13. Recording and analyzing performance profiles with perf | Monitoring and managing system status and performance | Red Hat Enterprise Linux | 10 | Red Hat Documentation
You have the perf user space tool installed. For more information, see Installing perf. ... This command samples and records performance data of the processes with the process IDs ID1 and ID2 for a time period of `seconds` seconds, as set by the sleep command.
Easyperf
easyperf.net › blog › 2018 › 08 › 26 › Basics-of-profiling-with-perf
Basics of profiling with perf. | Easyperf
We have 2451 samples, that’s 1 sample per millisecond. And that’s a default behaviour: the perf tool defaults the frequency to 1000Hz, or 1000 samples/sec. It’s also equivalent to run perf record -F 1000. Perf will stop our program 1000 times per second and see where the IP (instruction pointer) is.
Blogger
smalldatum.blogspot.com › 2022 › 04 › i-previously-wrote-about-generating.html
Small Datum: Becoming less confused about perf record
April 4, 2022 - So in that case perf record does sampling of one process for the cycles event. Disclaimer -- I am still waiting for an expert to review this. Update - a fix was pushed With frequency-based sampling (-F $frequency) perf record tries to generate $frequency samples per second.
HPC Wiki
hpc-wiki.info › hpc › Perf
Perf - HPC Wiki
September 23, 2019 - You can show the profile of the created log file with perf report. Depending on the installation, perf report will open a TUI or a GTK interface. If you are using pipes to process the output, it will provide a stdio output. Within the profile (TUI), you can zoom and look at the executed code (function level, instruction level == assembler).
Linux Man Pages
linux.die.net › man › 1 › perf-record
perf-record(1) - Linux man page
perf-record - Run a command and record its profile into perf.data
DEV Community
dev.to › etcwilde › perf---perfect-profiling-of-cc-on-linux-of
Perf - Perfect Profiling of C/C++ on Linux - DEV Community
November 19, 2017 - Perf uses statistical profiling, where it polls the program and sees what function is working. This is less accurate, but has less of a performance hit than something like Callgrind, which tracks every call.
Linux Man Pages
man7.org › linux › man-pages › man1 › perf-report.1.html
perf-report(1) - Linux manual page
If the keys starts with a prefix '+', then it will append the specified field(s) to the default field order. For example: perf report -F +period,sample. -p, --parent=<regex> A regex filter to identify parent. The parent is a caller of this function and searched through the callchain, thus it requires callchain information recorded.
PhoenixNAP
phoenixnap.com › home › kb › sysadmin › linux perf: how to use the command and profiler
Linux perf: How to Use the Command and Profiler | phoenixNAP KB
August 24, 2023 - The output collects and displays performance statistics for the given process. ... After five seconds, the output displays all system-wide calls and their count. CPU cycles are a hardware event.
ManKier
mankier.com › home › perf
perf: Performance analysis tools for Linux | Man Page | Commands | perf | ManKier
Display system-wide real-time performance counter profile: sudo perf top · Run a command and record its profile into perf.data: sudo perf record command · Record the profile of an existing process into perf.data: sudo perf record [-p|--pid] pid · Read perf.data (created by perf record) and ...
Top answer
1 of 3
3

@Alina,

If you read the man page for perf record, you can see that perf record -P records the sample period in each sample; it does not set it.

If you want to record more or fewer samples by changing the period, you have to specify it with something like perf record -c 2 (--count=), where 2 is the sampling period. This means that for every 2 occurrences of the event that you are measuring, you will get one sample. You can then modify the sampling period and test various values.

The other way to express the sampling period is to specify the average rate of samples per second (frequency), which you can do using perf record -F. So perf record -F 1000 will record around 1000 samples per second, and these samples will be generated when the hardware/PMU counter corresponding to the event overflows. In this mode the kernel dynamically adjusts the sampling period.
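The relationship between the two modes is simple arithmetic, sketched below in Python. The helper names and the 3 GHz event rate are only illustrative assumptions, not anything perf exposes.

```python
def samples_for_period(total_events, period):
    """Fixed-period mode (-c): one sample per `period` event occurrences."""
    return total_events // period


def period_for_frequency(events_per_second, frequency_hz):
    """Frequency mode (-F): the average period needed to hit the target rate."""
    return max(1, round(events_per_second / frequency_hz))


# -c 2: every 2 occurrences of the event produce one sample.
fixed_mode_samples = samples_for_period(total_events=1_000_000, period=2)

# -F 1000 on a hypothetical core retiring 3e9 cycles per second.
avg_period = period_for_frequency(events_per_second=3_000_000_000,
                                  frequency_hz=1000)

print(fixed_mode_samples)   # 500000
print(avg_period)           # 3000000
```

A tiny period such as 2 produces an enormous number of samples for a frequent event like cycles, which is why a frequency target is usually the more practical way to bound overhead.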

2 of 3
1

The sampling period can be specified with the -c option, though there is also a -F option to specify the sampling frequency. According to the perf wiki, the default is 1000 samples/sec (1000 Hz), though the default in newer perf versions appears to be higher (4000 Hz):

Period and rate

The perf_events interface allows two modes to express the sampling period:

  • the number of occurrences of the event (period)
  • the average rate of samples/sec (frequency)

The perf tool defaults to the average rate. It is set to 1000Hz, or 1000 samples/sec. That means that the kernel is dynamically adjusting the sampling period to achieve the target average rate. The adjustment in period is reported in the raw profile data. In contrast, with the other mode, the sampling period is set by the user and does not vary between samples. There is currently no support for sampling period randomization.
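The dynamic adjustment described in the quote can be pictured as a feedback loop. The Python sketch below is a toy model, not the kernel's actual algorithm: it assumes the period is re-estimated once per one-second interval from the event rate just observed, and the function name, the starting period, and the event rates are all hypothetical.

```python
def simulate_frequency_mode(event_rates, target_hz, initial_period=1000):
    """Toy feedback loop for frequency-based sampling.

    event_rates: events per second in each successive one-second interval.
    Returns a list of (period_used, samples_taken) per interval.
    """
    period = initial_period
    history = []
    for rate in event_rates:
        samples = rate // period          # samples taken with the current period
        history.append((period, samples))
        # Re-estimate the period so that the rate just observed would have
        # produced about target_hz samples.
        period = max(1, rate // target_hz)
    return history


# Hypothetical workload whose event rate quadruples halfway through.
history = simulate_frequency_mode(
    [1_000_000, 1_000_000, 4_000_000, 4_000_000], target_hz=1000)

for period, samples in history:
    print(f"period={period:>5}  samples={samples}")
```

In the interval where the rate jumps, the old period briefly yields too many samples before the period catches up. That fits the quote's remark that the adjusted period is reported in the raw profile data: each sample has to be weighted by the period in effect when it was taken.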

As for what the -P option is doing, the commit message for perf (and the related kernel patch) contains some background. If I interpret it correctly, the option means that for efficiency reasons many equal samples can be merged into a single event that also contains the sample period. The original intent is to reduce the number of generated samples to avoid hitting a "rate limit" that would result in lost samples.

Mark Hansen's Blog
markhansen.co.nz › profiler-uis
Linux perf Profiler UIs
October 7, 2021 - perf can interrupt threads to record thread's stack traces, triggered by an event (e.g. a thread context switch, or a syscall) or on a regular schedule (e.g.
Perfwiki
perfwiki.github.io › main › tutorial
Introduction - perf: Linux profiling with performance counters
The perf tool can be used to collect profiles on per-thread, per-process and per-cpu basis. There are several commands associated with sampling: record, report, annotate. You must first collect the samples using perf record. This generates an output file called perf.data.