
Workflow for Profiling with Extrae and Paraver


This page is work in progress!!!


Introduction

This page describes a basic workflow for performance analysis based on Extrae and Paraver. The best practices presented here are tailored to HLRS' Hawk system.

More specifically, we describe steps and commands necessary for

  1. setting up a suitable use-case,
  2. determining the non-instrumented performance,
  3. configuration of Extrae,
  4. obtaining traces,
  5. determining instrumentation overhead,
  6. quick efficiency metrics,
  7. trace visualization with Paraver.

If you get stuck or need further explanation, please get in touch with HLRS user support.

On Hawk, load the required modules with

$ module load extrae bsc_tools


Setting up a suitable use-case

Tracing can produce a huge amount of performance analysis data. Typically, when tracing, it is sufficient to run your code for a few timesteps/iterations only. In most cases, it is good practice to run the code for between 1 and 10 minutes.

However, the performance characteristics of a code depend critically on the scale, i.e. the number of cores used, and on the problem size. Try to keep your performance analysis use-case as close as possible to a realistic use-case of your interest. Where practical, reduce the execution time (and thus the tracing data volume) by reducing the number of timesteps/iterations, not by reducing the problem size.
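As a purely hypothetical illustration, assuming your code reads the number of timesteps from a plain-text input file called input.cfg (the file name and parameter name are made up), the analysis run could be shortened like this while leaving the problem size untouched:

$ grep timesteps input.cfg
timesteps = 100000
$ sed -i 's/^timesteps = .*/timesteps = 200/' input.cfg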

Determine the non-instrumented performance

Running your application under the control of a performance analysis tool can incur significant overhead, i.e. your code will take noticeably longer to execute. At the same time, such overhead has an impact on the quality of your performance analysis and the robustness of your conclusions. Always be aware of the amount of overhead and try to keep it small where possible. In many cases it is possible to reduce the overhead to below 5% of the execution time, which is of the same order of magnitude as the expected performance variability between runs. If your overhead is larger, be aware that performance metrics may be off by at least as much.

It is therefore important to measure the performance of your code for the particular use-case before applying any performance analysis tools. We refer to this as non-instrumented performance.
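As a simple, hypothetical example of how you will later quantify the overhead: if the non-instrumented run takes 315 s and the instrumented run takes 330 s, the overhead is (330 s - 315 s) / 315 s ≈ 4.8%, which is just within the 5% range mentioned above.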

At the very least, you should determine the elapsed time of the run. For instance, run

$ time mpirun ... ./app

and record the "real" (wall-clock) time reported in the output.
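The output of the time command will look roughly as follows (the numbers are purely illustrative):

real    5m23.482s
user    2m11.050s
sys     0m4.731s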

Many codes keep track of an application-specific performance metric, such as iterations per second or similar. Often, this is a better measure than the raw elapsed time, as it disregards initialisation and shutdown phases, which are negligible for longer production runs but not for short analysis use-cases. If your code reports such a metric, record it in addition to the elapsed time. Consider adding an application-specific metric to your code if one is not available yet.

Consider doing not just one run, but several, to get a feeling for the variation of the non-instrumented performance across runs.
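A minimal sketch of such a repetition in a job script might look as follows (the rank count, application name and log file names are placeholders):

for i in 1 2 3; do
    { time mpirun -n 128 ./app app_arg1 app_arg2 ; } > run_${i}.log 2> time_${i}.log
done
grep real time_*.log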

Configuration of Extrae

Extrae is a library which is able to record a wide range of relevant performance metrics. It is configured through an XML configuration file which needs to be specified via the environment variable EXTRAE_CONFIG_FILE. At HLRS we have prepared a template which should be OK for most users, at least initially. Let's have a look at it:

$ cat $HLRS_EXTRAE_ROOT/../share/extrae_detail.xml

This template is set up to record events related to MPI and OpenMP as well as some useful hardware counters. It does not record events related to Pthreads, memory usage, call stack information, etc. If you need any of those, copy the template into your working directory, adapt it, and point EXTRAE_CONFIG_FILE to your copy. If the defaults are fine, you do not need a copy of the configuration file.
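To give an idea of its structure, a heavily abridged and purely illustrative excerpt of such an Extrae configuration could look like the following; the actual template on Hawk may differ in its details, so always inspect the file itself:

<?xml version='1.0'?>
<trace enabled="yes" initial-mode="detail" type="paraver">
  <mpi enabled="yes">            <!-- record MPI events -->
    <counters enabled="yes" />
  </mpi>
  <openmp enabled="yes">         <!-- record OpenMP events -->
    <counters enabled="yes" />
  </openmp>
  <pthread enabled="no" />       <!-- Pthread events disabled -->
  <counters enabled="yes">       <!-- hardware (PAPI) counters -->
    <cpu enabled="yes">
      ...
    </cpu>
  </counters>
  ...
</trace>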

Obtaining traces

Extrae does not require instrumentation of the source code. It attaches to the binary via LD_PRELOAD. Usually, this is done by a tracing wrapper script. Again, we have prepared a template for the wrapper script which should be sufficient for most users. You can inspect it with:

$ cat $HLRS_EXTRAE_ROOT/../share/trace_extrae.sh

Again, most users will not have to change or even copy it, as the wrapper script is already in your $PATH.

To obtain traces, you just need to place the wrapper script in front of your application binary. For instance, suppose your job script contains:

time mpirun -n XX ... ./app app_arg1 app_arg2

just replace this with

module load extrae
time mpirun -n XX ... trace_extrae.sh ./app app_arg1 app_arg2

Note that you need to load the extrae module in your job script.
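Putting this together, a minimal PBS job script for a traced run might look as follows; the resource request (node type, node and process counts) and the application arguments are illustrative assumptions, so keep whatever your production job script already uses:

#!/bin/bash
#PBS -N extrae_trace
#PBS -l select=2:node_type=rome:mpiprocs=128
#PBS -l walltime=00:20:00

cd $PBS_O_WORKDIR

module load extrae bsc_tools

time mpirun -n 256 trace_extrae.sh ./app app_arg1 app_arg2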

After running your job, you will find a few files and directories in your working directory

$ ls -ld TRACE.* set-*/

These directories contain intermediate trace files. As noted above, traces can get very large. Check the size of your traces with

$ du -h -s set-*
10G set-0

Anything above a total size of 10 GB is probably too large for you to analyze. Consider reducing the trace volume as described in the section #Setting up a suitable use-case.

The intermediate trace files need to be merged with the command

$ export TMPDIR=$(pwd -P); export MPI2PRV_TMP_DIR=$TMPDIR
$ mpi2prv -maxmem 50000 -f TRACE.mpits -o app_my_trace.prv

Make sure to replace app_my_trace with a meaningful name for the resulting trace. If you used the wrapper script above, it will suggest using the name of the binary plus a timestamp; feel free to choose any other name. Also, please make sure that the variable TMPDIR points to a suitable directory, preferably on a workspace.

Merging will produce the files

 app_my_trace.prv
 app_my_trace.pcf
 app_my_trace.row

where app_my_trace.prv contains the actual trace, and the others contain meta-data. After successful merging, you may delete the intermediate files TRACE.* set-*/.
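For example, once you have verified that the resulting .prv, .pcf and .row files are complete, the intermediate data can be removed with

$ rm -rf TRACE.* set-*/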

Quick efficiency metrics

Basic performance metrics can be determined from the trace using the tool modelfactors.py, which is developed by the POP project.

export TMPDIR=$(pwd -P)
modelfactors.py -tmd prv app_my_trace.prv [second_trace.prv]


If your trace is larger than 1024 MiB, please add the option -ms TRACE_SIZE, where TRACE_SIZE is larger than the size of your trace in MiB. Please note that traces larger than 1024 MiB can take several hours to process. If your code uses MPI + OpenMP, please add the option -m hybrid. Finally, make sure to set TMPDIR to a suitable directory, preferably on a workspace, as trace analysis produces significant amounts of intermediate data.
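For example, a hypothetical invocation for two hybrid MPI+OpenMP traces of roughly 4 GiB each (the trace names are placeholders) could look like this:

$ export TMPDIR=$(pwd -P)
$ modelfactors.py -tmd prv -ms 8192 -m hybrid app_trace_0128.prv app_trace_0256.prv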

The tool modelfactors.py will produce output similar to the following.

Overview of the Efficiency metrics:
==============================================
                      Trace mode |        MPI
         Processes [Trace Order] |     128[1]
==============================================
Global efficiency                 |     78.93%
-- Parallel efficiency            |     78.93%
   -- Load balance                |     87.56%
   -- Communication efficiency    |     90.14%
      -- Serialization efficiency |     93.91%
      -- Transfer efficiency      |     95.99%
-- Computation scalability        |  Non-Avail
   -- IPC scalability             |  Non-Avail
   -- Instruction scalability     |  Non-Avail
   -- Frequency scalability       |  Non-Avail
==============================================

In a nutshell, modelfactors.py produces a hierarchy of efficiency metrics, where each higher-level efficiency is the product of the lower-level ones. Efficiencies below 80% require your attention and possibly further analysis and code optimization. Most users should aim for efficiencies well above 90%.
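For the example output above, this means: Load balance × Communication efficiency = 0.8756 × 0.9014 ≈ 0.7893, which is the reported Parallel efficiency of 78.93%; likewise, Serialization efficiency × Transfer efficiency = 0.9391 × 0.9599 ≈ 0.9014, which is the reported Communication efficiency of 90.14%.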

Parallel efficiency quantifies the overhead of MPI and/or OpenMP. More specifically, it is the time spent in user code divided by the total execution time. The rest is spent executing code in the MPI or OpenMP runtime, which is considered non-useful overhead. Time spent in MPI/OpenMP may be due to load imbalances or communication.

(In the following, we will speak of MPI only. Interpretation of the efficiencies in the context of OpenMP or MPI+OpenMP is more difficult to explain.)

Load balance (efficiency) quantifies how well user-code computations are balanced across MPI ranks. A value of 100% means perfect load balance, while 0% means that effectively only a single rank does useful computations. In practice, load balance is often the most serious performance issue.

Communication efficiency quantifies how well MPI communication is working. This includes the actual efficiency of data transfers (transfer efficiency), but also the quality of the communication pattern (serialization efficiency). Note that communication efficiency is only broken down further if tracing was done in detailed mode.

The tool can process more than one trace at the same time. Typically, these will be traces obtained with different numbers of MPI ranks or cores. If more than one trace is provided, the tool shows the efficiencies for each core count in a separate column.

In addition, it will also compute some scalability metrics. Note that the trace with the lowest number of cores is taken as the baseline for these scalabilities.

Computational scalability quantifies how much the user-code computation time increases with increasing core counts. If the total computation time remains constant (think ideal scaling: no additional instructions are executed due to parallelisation), the computational scalability is 100%. Any additional computation time reduces the computational scalability to values below 100%. More specifically, the computational scalability is the total computation time of the baseline run divided by the total computation time of the run of interest. Note that this excludes any effects of MPI communication, which are covered by the parallel efficiencies above.
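As a hypothetical example: if the baseline run on 128 cores spends a total of 1000 s in useful computation (summed over all ranks) and the 256-core run spends 1100 s, the computational scalability of the 256-core run is 1000 s / 1100 s ≈ 91%.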

Unlike the efficiencies, scalabilities can take values above 100%. This is actually not unusual for strong-scaling experiments. Consider a problem that, at a small number of cores, is large and does not fit into the cache. Increasing the number of cores increases the total number of instructions (e.g. because of additional calculations at domain boundaries) and thus the execution time, which leads to scalabilities below 100%. As the problem per core gets smaller, you will eventually start benefiting from the caches and actually run faster than at smaller scale. This effect can lead to computational scalabilities larger than 100%. Other effects may also lead to super-linear scalabilities.

Computational scalability can be broken down into instruction scalability and instructions-per-cycle (IPC) scalability (and frequency scalability, which will be 100% in most cases). Instruction scalability measures how your algorithm (and its implementation) scales. In virtually all cases, this will be below 100%. Values below 80% are a reason for concern and require attention. IPC scalability measures how the number of instructions per clock cycle scales. Cache effects and others may lead to values above 100%.
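Since these factors multiply up to the computational scalability, a hypothetical run with an instruction scalability of 95%, an IPC scalability of 105% and a frequency scalability of 100% would end up with a computational scalability of 0.95 × 1.05 × 1.00 ≈ 99.8%, i.e. the cache gains almost compensate for the additional instructions.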

Further information

If you need further information on Extrae and Paraver, please have a look at the following resources: