
Workflow for Profiling with Extrae and Paraver


This page is work in progress!!!


Introduction

This page describes a basic workflow for performance analysis based on Extrae and Paraver. The best practices presented here are tailored to HLRS' Hawk system.

More specifically, we describe steps and commands necessary for

  1. setting up a suitable use-case,
  2. determining the non-instrumented performance,
  3. configuring Extrae,
  4. obtaining traces,
  5. determining instrumentation overhead,
  6. computing quick efficiency metrics,
  7. visualizing traces with Paraver.

If you get stuck or need further explanation, please get in touch with HLRS user support.

On Hawk, load the required modules with

$ module load extrae bsc_tools


Setting up a suitable use-case

Tracing can produce a huge amount of performance analysis data. Typically, when tracing, it is sufficient to run your code for only a few timesteps/iterations. In most cases, it is good practice to run the code for between 1 and 10 minutes.

However, the performance characteristics of a code depend critically on the scale, i.e. the number of cores used, and the problem size. Try to keep your performance analysis use-case as close as possible to a realistic use-case of your interest. Where practical, reduce the execution time (and thus the tracing data volume) by reducing the number of timesteps/iterations, not by reducing the problem size.

Determine the non-instrumented performance

Running your application under the control of a performance analysis tool can incur significant overhead, i.e. your code will take noticeably longer to execute. At the same time, such overhead will have an impact on the quality of your performance analysis and the robustness of your conclusions. Always be aware of the amount of overhead and try to keep it small where possible. In many cases it is possible to reduce the overhead to below 5% of the execution time, which is of the same order of magnitude as the expected performance variability between runs. If your overhead is larger, be aware that performance metrics may be off by at least as much.
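For example, if the non-instrumented run takes 300 seconds and the same run under Extrae takes 315 seconds, the instrumentation overhead is (315 - 300) / 300 = 5%, and efficiency metrics derived from the trace should be interpreted with a corresponding uncertainty in mind.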

It is therefore important to measure the performance of your code for the particular use-case before applying any performance analysis tools. We refer to this as non-instrumented performance.

At the very least, you should determine the elapsed time of a run, for instance via

$ time mpirun ... ./app

and record the elapsed ("real") time from the output.
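The output of time looks roughly as follows (the exact format depends on your shell; the numbers are purely illustrative):

 real    5m12.437s
 user    0m0.212s
 sys     0m0.180s

The "real" line is the elapsed wall-clock time.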

Many codes keep track of an application-specific performance metric, such as iterations per second or similar. Often, this is a better measure than the raw elapsed time, as it disregards initialisation and shutdown phases, which are negligible for long production runs but not for short analysis use-cases. If your code reports such a metric, record it in addition to the elapsed time. Consider adding an application-specific metric to your code if one is not available yet.

Consider doing not just one run, but several to get a feeling for the variation of the non-instrumented performance across runs.
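If your time budget allows, a simple sketch is to repeat the run a few times within one job and compare the timings (the loop below is only an illustration; adapt it to your job script):

 for i in 1 2 3; do
     time mpirun -n XX ... ./app
 done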

Configuration of Extrae

Extrae is a library which is able to record a wide range of relevant performance metrics. It is configured through an XML configuration file which needs to be specified via the environment variable EXTRAE_CONFIG_FILE. At HLRS we have prepared a template which should be OK for most users, at least initially. Let's have a look at it:

$ cat $EXTRAE_HOME/../share/extrae_detail.xml

This template is set up to record events related to MPI, OpenMP and some useful hardware counters. It does not record events related to Pthreads, memory usage, call stack information, etc. If you need any of those, take a copy of the template into your working directory. If the defaults are fine, you will not need a copy of the configuration file.
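If you do need to adapt the configuration, a typical sequence (assuming the tracing wrapper script respects a pre-set EXTRAE_CONFIG_FILE) is to copy the template into your working directory, edit the copy, and point EXTRAE_CONFIG_FILE at it in your job script:

 $ cp $EXTRAE_HOME/../share/extrae_detail.xml .
 $ export EXTRAE_CONFIG_FILE=$(pwd)/extrae_detail.xml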

Obtaining traces

Extrae does not require instrumentation of the source code. It attaches to the binary via LD_PRELOAD. Usually, this is done by a tracing wrapper script. Again, we have prepared a wrapper script template which should be sufficient for most users. You can inspect it with:

$ cat $EXTRAE_HOME/../share/trace_extrae.sh

Again, most users will not have to change or even copy it.

To obtain traces, you just need to place the wrapper script in front of your application binary. For instance, suppose your job script contains:

time mpirun -n XX ... ./app app_arg1 app_arg2

just replace this with

module load extrae

export MPI_SHEPHERD=1 # if using MPT; not necessary for OpenMPI

time mpirun -n XX ... $EXTRAE_HOME/../share/trace_extrae.sh ./app app_arg1 app_arg2

Note that you need to load the extrae module in your job script. If you are using MPT, please set the environment variable MPI_SHEPHERD as indicated.
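Putting it all together, a minimal job script for a traced run on Hawk might look roughly as follows (select statement, walltime, process count and application arguments are placeholders; adapt them to your use-case):

 #!/bin/bash
 #PBS -l select=2:node_type=rome:mpiprocs=128
 #PBS -l walltime=00:20:00

 cd $PBS_O_WORKDIR

 module load extrae
 export MPI_SHEPHERD=1 # if using MPT; not necessary for OpenMPI

 time mpirun -n 256 ... $EXTRAE_HOME/../share/trace_extrae.sh ./app app_arg1 app_arg2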

After running your job, you will find a few files and directories in your working directory

$ ls -ld TRACE.* set-?/

These contain intermediate trace files, which need to be merged with the command

$ export TMPDIR=$(pwd -P)
$ mpi2prv -f TRACE.mpits -o app_my_trace.prv

Make sure to replace app_my_trace with a meaningful name for the resulting trace. If you used the wrapper script above, it will suggest the name of the binary plus a timestamp. Feel free to choose any other name. Also, please make sure that the variable TMPDIR points to a suitable directory, preferably on a workspace.

Merging will produce the files

 app_my_trace.prv
 app_my_trace.pcf
 app_my_trace.row

where app_my_trace.prv contains the actual trace, and the others contain meta-data. After successful merging, you may delete the intermediate files and directories TRACE.* set-?/.
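For example, after verifying that the merged trace is usable:

 $ rm -r TRACE.* set-?/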

Default tracing mode: burst or detailed?

Quick efficiency metrics

Basic performance metrics can be determined from the trace using the tool modelfactors.py, which is developed by the POP project.

$ export TMPDIR=$(pwd -P)
$ modelfactors.py -tmd prv app_my_trace.prv


If your trace is larger than 1024 MiB, please add the option -ms TRACE_SIZE, where TRACE_SIZE is larger than the size of your trace in MiB. Please note that traces larger than 1024 MiB take several hours to process. If your code uses MPI + OpenMP, please add the option -m hybrid. Finally, make sure to set TMPDIR to a suitable directory, preferably on a workspace. Trace analysis produces significant amounts of intermediate data.
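For example, for a hypothetical hybrid MPI + OpenMP trace of roughly 2 GiB (2048 MiB), one might run:

 $ export TMPDIR=$(pwd -P)
 $ modelfactors.py -tmd prv -ms 3000 -m hybrid app_my_trace.prv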

The tool modelfactors.py will produce output similar to

Overview of the Efficiency metrics:
==============================================
                      Trace mode |        MPI
         Processes [Trace Order] |     128[1]
==============================================
Global efficiency                 |     78.93%
-- Parallel efficiency            |     78.93%
   -- Load balance                |     87.56%
   -- Communication efficiency    |     90.14%
      -- Serialization efficiency |     93.91%
      -- Transfer efficiency      |     95.99%
-- Computation scalability        |  Non-Avail
   -- IPC scalability             |  Non-Avail
   -- Instruction scalability     |  Non-Avail
   -- Frequency scalability       |  Non-Avail
==============================================

Further information

If you need further information on Extrae and Paraver, please have a look at the following resources: