How to measure Flop/s etc. of massively parallel jobs with perf stat

Hardware performance counters (HWPC) can be measured with tools like Score-P and Extrae. However, using such tools introduces some technical and labor overhead. An alternative is the command line tool "perf stat", which ships with most Linux distributions and directly uses the kernel's perf_events interface for HWPC measurements. Note, though, that the approach described here does not provide accurate readings due to the methodology used (cf. below). So use it only to get a rough idea of counter values!

To use it, prefix the actual binary with the command "perf stat". This starts one instance of perf stat for every MPI process, each writing its output to STDOUT. For massively parallel jobs, these (potentially interleaved) outputs are hard or impossible to interpret. We therefore recommend measuring only individual processes of a parallel job. This can be done by using mpirun in its "MPMD" (multiple program, multiple data) form (cf. below). Since only the STDOUT of the head node is collected in the job's STDOUT file, you need to redirect the output of perf stat to a file (option -o, as shown below) when measuring processes on other nodes.

mpirun -np <n_1> <binary> : -np 1 perf stat -e <event> -o <desired perf stat output file> <binary> : -np <n_2> <binary>

This starts <n_1> + 1 + <n_2> processes in total, so this sum should equal the number of processes you want in your job.
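As a worked example of this arithmetic, the following sketch builds (but does not run) such an MPMD command line for a given total process count and the rank to be measured. It assumes that ranks are numbered in the order of the colon-separated segments, which is the usual mpirun behavior but worth checking for your MPI. The function name build_mpmd_cmd, the binary ./my_app and the output file name are hypothetical placeholders.

```shell
#!/usr/bin/env bash
# Print an MPMD mpirun command line in which exactly one rank runs under
# "perf stat". Dry run only: the command is echoed, not executed.
# Assumption: ranks are numbered in the order of the ":"-separated segments.
build_mpmd_cmd() {
  local total=$1 rank=$2 bin=$3 event=$4 out=$5
  if (( rank < 0 || rank >= total )); then
    echo "rank must be in [0, total-1]" >&2
    return 1
  fi
  local n1=$rank                    # ranks 0 .. rank-1 run the plain binary
  local n2=$(( total - rank - 1 ))  # ranks after the measured one
  local cmd="mpirun"
  if (( n1 > 0 )); then cmd+=" -np $n1 $bin :"; fi
  cmd+=" -np 1 perf stat -e $event -o $out $bin"
  if (( n2 > 0 )); then cmd+=" : -np $n2 $bin"; fi
  echo "$cmd"                       # n1 + 1 + n2 == total
}

# Placeholder values: 128 processes in total, measure rank 1.
build_mpmd_cmd 128 1 ./my_app fp_ret_sse_avx_ops.all perf_rank1.txt
```

For 128 processes and rank 1 this prints a line with -np 1, -np 1 (under perf stat) and -np 126, i.e. 1 + 1 + 126 = 128 processes.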

Some remarks:

This method measures HWPC values of a single process only. If you need values for an entire node or job, extrapolate accordingly (e.g. multiply by the number of processes, which assumes that all processes behave like the measured one).
However, all threads of the measured process are included.
We suggest not measuring the first rank, as it often does extra work and is therefore not representative of the entire job.
The available <event> names can be listed with perf list. On Hawk, use the event fp_ret_sse_avx_ops.all to measure floating-point operations.
If a distinct identifier <proc-id> for every MPI process could be obtained on the mpirun command line (most probably not possible, as the processes are not yet running at that point), the output of perf stat could be redirected to an individual file per process by means of mpirun -np <np> perf stat -o perf_stat_out_<proc-id> <binary>. This would allow measuring HWPC values for every process in a parallel job.
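The extrapolation mentioned in the remarks can be sketched as dividing the measured event count by the elapsed time reported by perf stat, then multiplying by the number of processes. The helper below is a hypothetical sketch that parses a perf stat output file in its default human-readable layout. That layout (thousands separators, event name suffixes) can differ between perf versions and locales, and the scaling assumes all processes behave like the measured one. Following the remarks, the fp_ret_sse_avx_ops.all count is taken as the number of floating-point operations.

```shell
#!/usr/bin/env bash
# Estimate Flop/s from a perf stat output file (default, human-readable format).
# Assumptions: the event line looks like "<count>  <event>" with optional ","
# thousands separators, and the file contains a "seconds time elapsed" line.
flops_from_perf() {
  local file=$1 event=$2 nprocs=$3
  awk -v ev="$event" -v np="$nprocs" '
    $2 == ev { gsub(",", "", $1); count = $1 }   # event count, separators removed
    /seconds time elapsed/ { secs = $1 }          # wall-clock time of the process
    END {
      if (count == "" || secs == "") { print "event or time not found" > "/dev/stderr"; exit 1 }
      rate = count / secs                          # Flop/s of the measured process
      printf "process: %.3f GFlop/s, job (x%d): %.3f GFlop/s\n", rate / 1e9, np, rate * np / 1e9
    }' "$file"
}

# Usage (hypothetical file name, 128 processes in the job):
#   flops_from_perf perf_rank1.txt fp_ret_sse_avx_ops.all 128
```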
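One possible way around the limitation in the last remark is to let mpirun start a small wrapper script, so that each process determines its own rank at runtime, when it is already running. The sketch below assumes the MPI launcher exports the rank in one of the listed environment variables (OMPI_COMM_WORLD_RANK for Open MPI, PMI_RANK for MPICH's Hydra launcher, PMIX_RANK for PMIx-based launchers, SLURM_PROCID for srun). Whether any of these is set on Hawk depends on the MPI installation and should be checked. The script name perf_wrap.sh is hypothetical. Keep in mind that it writes one output file per process, which for massively parallel jobs means a very large number of files.

```shell
#!/usr/bin/env bash
# perf_wrap.sh (hypothetical name): run the given command under "perf stat"
# and write the results to perf_stat_out_<rank>.
# Assumption: the launcher exports the MPI rank in one of these variables.
mpi_rank() {
  local v
  for v in OMPI_COMM_WORLD_RANK PMI_RANK PMIX_RANK SLURM_PROCID; do
    if [ -n "${!v:-}" ]; then
      echo "${!v}"
      return 0
    fi
  done
  return 1  # no known rank variable is set
}

# Only start perf when a command is given, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
  rank=$(mpi_rank) || rank="pid$$"  # fall back to the process ID
  exec perf stat -e fp_ret_sse_avx_ops.all -o "perf_stat_out_${rank}" "$@"
fi
```

It would be launched as mpirun -np <np> ./perf_wrap.sh <binary>, with the script made executable beforehand.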
