(This is a quick, not very detailed answer; a more detailed one would be welcome if someone wants to write one.)
perf just uses the CPU's own hardware performance counters, which can be put into a mode where they record an event when the counter counts down to zero or up to a threshold, either by raising an interrupt or by writing an event into a buffer in memory (with PEBS precise events).
That event will include a code address that the CPU picked to associate with the event (i.e. the point at which the interrupt was raised), even for events like cycles which, unlike instructions, don't inherently have a specific instruction associated with them. The out-of-order exec back-end can have a couple hundred instructions in flight when the counter wraps, but it has to pick exactly one for any given sample.
Generally the CPU "blames" the instruction that was waiting for a slow-to-produce result, not the instruction producing it; this is especially noticeable with cache-miss loads.
For an example with Intel x86 CPUs, see Why is this jump instruction so expensive when performing pointer chasing? which also appears to depend on the effect of letting the last instruction in the ROB retire when an interrupt is raised. (Intel CPUs at least do seem to do that; makes sense for ensuring forward progress even with a potentially slow instruction.)
In general there can be "skew", where a later instruction is blamed instead of the one that actually took the time, possibly with different causes. (Perhaps especially for uncore events, since they happen asynchronously to the core clock.)
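To make the overflow mechanism concrete, here is a toy model in Python. It is only a sketch, not how perf or any real PMU is implemented: the function name `sample_points` and the fixed `skid` offset are hypothetical, and real skew is neither constant nor this simple.

```python
def sample_points(total_events, period, skid=0):
    """Return the event indices that a toy overflow counter would report.

    The counter is preloaded with -period and incremented once per event.
    When it reaches zero it "overflows": a sample is taken and the counter
    is re-armed. `skid` is a hypothetical fixed offset standing in for the
    distance between the overflowing event and the address that gets blamed.
    """
    samples = []
    counter = -period
    for event_index in range(total_events):
        counter += 1
        if counter == 0:
            samples.append(event_index + skid)
            counter = -period  # re-arm for the next sample
    return samples


if __name__ == "__main__":
    print(sample_points(total_events=10, period=3))          # [2, 5, 8]
    print(sample_points(total_events=10, period=3, skid=1))  # [3, 6, 9]
```

With `skid=1` every sample lands one event later than the one that caused the overflow, which is the kind of shift you see when a later instruction gets the counts.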
Other related Q&As with interesting examples or other things
- Inconsistent `perf annotate` memory load/store time reporting
- Linux perf reporting cache misses for unexpected instruction
- https://travisdowns.github.io/blog/2019/08/20/interrupts.html - some experiments into which instructions tend to get counts on Skylake.
@Alina,
If you read the man page for perf record, you can see that perf record -P records the sample period in each sample; it does not set the period.
If you want to record more or fewer samples by changing the period, you have to pass it explicitly, e.g. perf record -c 2 (--count=), where 2 is the sampling period. This means that one sample is taken for every 2 occurrences of the event you are measuring. You can then modify the sampling period and test various values.
The other way to express the sampling period is to specify the average rate of samples per second (frequency), which you can do using perf record -F. So perf record -F 1000 will record around 1000 samples per second, and these samples will be generated when the hardware/PMU counter corresponding to the event overflows. In this mode the kernel dynamically adjusts the sampling period to hit the requested rate.
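The relationship between the two modes is just arithmetic, which can be sketched in Python. The helper names are hypothetical, and the event rate of 3 billion cycles per second is an assumed example value, not a measurement.

```python
def samples_for_period(event_count, period):
    """Fixed-period mode (-c): one sample per `period` occurrences of the event."""
    return event_count // period


def average_period_for_frequency(events_per_second, samples_per_second):
    """Frequency mode (-F): the period that would yield the target rate on average."""
    return events_per_second / samples_per_second


if __name__ == "__main__":
    # 1,000,000 events sampled with -c 2 give 500,000 samples.
    print(samples_for_period(1_000_000, 2))
    # Assumed rate of 3e9 cycles/sec with -F 1000 implies a period of about 3,000,000.
    print(average_period_for_frequency(3_000_000_000, 1000))
```

This also shows why a very small -c value can produce an enormous number of samples for a frequent event such as cycles.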
The sampling period can be specified with the -c option, though there is also a -F option to specify the sampling frequency. The defaults are 1000 samples/sec or 1000Hz according to the perf wiki:
Period and rate
The perf_events interface allows two modes to express the sampling period:
- the number of occurrences of the event (period)
- the average rate of samples/sec (frequency)
The perf tool defaults to the average rate. It is set to 1000Hz, or 1000 samples/sec. That means that the kernel is dynamically adjusting the sampling period to achieve the target average rate. The adjustment in period is reported in the raw profile data. In contrast, with the other mode, the sampling period is set by the user and does not vary between samples. There is currently no support for sampling period randomization.
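The dynamic adjustment described in the quote can be pictured as a feedback loop. The Python below is a deliberately simplified sketch and not the kernel's actual algorithm: it assumes one adjustment per one-second interval, and the function name and input rates are made up for illustration.

```python
def simulate_frequency_mode(event_rates, target_hz, initial_period=1):
    """Toy feedback loop for frequency-based sampling.

    event_rates: number of events observed in each one-second interval.
    Returns a list of (period_used, samples_taken) tuples, one per interval.
    """
    period = initial_period
    history = []
    for events in event_rates:
        samples = events // period
        history.append((period, samples))
        # Choose the period that would have produced target_hz samples
        # in the interval that just ended, and use it for the next one.
        period = max(1, events // target_hz)
    return history


if __name__ == "__main__":
    rates = [1_000_000, 1_000_000, 4_000_000, 4_000_000]
    for period, samples in simulate_frequency_mode(rates, target_hz=1000):
        print(f"period={period:>7} samples={samples}")
```

In this model the sample count overshoots whenever the event rate jumps and settles back to the target one interval later, which is why the period can differ between samples in the raw profile data.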
As for what the -P option is doing, the commit message for perf (and the related kernel patch) contains some background. If I interpret it correctly, the option means that for efficiency reasons many equal samples can be merged into a single event that also contains the sample period. The original intent is to reduce the number of generated samples to avoid hitting a "rate limit" that would result in lost samples.
If you are distributing the computations with MPI, then using an MPI-aware tool would give you more sensible results: with a distributed application, you might have issues of load imbalance, where one MPI process is idle waiting for data to come from other processes. If you happen to be profiling exactly that MPI process, your performance profile will be all wrong.
So, the first step is usually to find out about the communication and load balance pattern of your program, and identify a sample input that gives you the workload you want (e.g., CPU-intensive on rank 0). For instance, mpiP is an MPI profiling tool that can produce a very complete report about the communication pattern, how much time each MPI call took, etc.
Then you can run a code profiling tool on one or more selected MPI ranks. Anyway, using perf on a single MPI rank is likely not a good idea because its measurements will also include the time spent in the MPI library code, which is probably not what you are looking for.
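If you do want to profile only selected ranks, one common approach is a small wrapper that the MPI launcher starts in place of the application. The Python sketch below only builds the command line. It assumes the launcher exports the rank as `OMPI_COMM_WORLD_RANK` (Open MPI) or `PMI_RANK` (MPICH-style launchers); the helper names, the output file naming, and `./my_app` are hypothetical. The caveat about MPI library time showing up in the profile still applies.

```python
import os

# Environment variables that common MPI launchers set to the process rank
# (assumption: your launcher uses one of these).
RANK_ENV_VARS = ("OMPI_COMM_WORLD_RANK", "PMI_RANK")


def detect_rank(environ):
    """Return the MPI rank found in `environ`, or None if no known variable is set."""
    for name in RANK_ENV_VARS:
        if name in environ:
            return int(environ[name])
    return None


def build_command(app_argv, rank, profiled_ranks):
    """Prefix the application with `perf record` only on the selected ranks."""
    if rank in profiled_ranks:
        # One output file per rank so the profiles do not overwrite each other.
        return ["perf", "record", "-o", f"perf.rank{rank}.data", "--"] + list(app_argv)
    return list(app_argv)


if __name__ == "__main__":
    rank = detect_rank(os.environ)
    # "./my_app" is a placeholder for the real application and its arguments.
    print(build_command(["./my_app"], rank, profiled_ranks={0}))
```

A real wrapper would pass the resulting list to something like `os.execvp` instead of printing it, and you would afterwards open the per-rank file with perf report -i.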
perf report does not need the -a switch to report the results of perf record -a. You can simply type perf report.
That said, analyzing 14G of profiling data to track a crash seems odd. How about attaching a debugger?