- Infos im HLRS Wiki sind nicht rechtsverbindlich und ohne Gewähr -
- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

Workflow for Profiling with mpiP

From HLRS Platforms

This page is work in progress!!!


Introduction

This page describes a basic workflow for performance analysis based on mpiP. The best practices presented here are tailored to HLRS' Hawk system.

More specifically, we describe steps and commands necessary for

  1. setting up a suitable use-case,
  2. determining the non-instrumented performance,
  3. configuring mpiP,
  4. obtaining profiles,
  5. determining the instrumentation overhead,
  6. computing quick efficiency metrics.

If you get stuck or need further explanation, please get in touch with HLRS user support.

On Hawk, load the required module with

$ module load mpip


Setting up a suitable use-case

In contrast to full tracing, profiling generally does not produce huge amounts of performance analysis data. It is therefore in most cases not necessary to tailor use-cases towards a low volume of performance data. Often, profiling can be done on the same configuration as production runs. In many cases, however, users will still want to create special use-cases of short duration for profiling in order to save compute resources.

However, the performance characteristics of a code depend critically on the scale, i.e. the number of cores used, and the problem size. Try to keep your performance analysis use-case as close as possible to a realistic use-case of your interest. Where practical, reduce the execution time (and thus the amount of performance data) by reducing the number of timesteps/iterations, not by reducing the problem size.

On the other hand, the number of timesteps/iterations should not be too small. For profiling in particular, it is important to make sure that the total execution time is dominated by the main computational loop, while the initialisation and shutdown phases take only a small fraction of the execution time. As a rule of thumb, aim for less than 5% of the execution time spent in initialisation and shutdown where possible.

Determine the non-instrumented performance

Running your application under the control of a performance analysis tool can incur significant overhead, i.e. your code will take noticeably longer to execute. At the same time, such overhead will have an impact on the quality of your performance analysis and the robustness of your conclusions. Always be aware of the amount of overhead and try to keep it small where possible. In many cases it is possible to reduce the overhead to below 5% of the execution time, which is the same order of magnitude as the expected performance variability between runs. If your overhead is larger, be aware that performance metrics may be off by at least as much.
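As a sketch, the relative overhead can be computed from the elapsed times of a non-instrumented and an instrumented run. The times below are made-up example values, not measurements:

```shell
# Made-up example values in seconds; replace with your own measurements.
t_base=100.0     # elapsed time of the non-instrumented run
t_instr=104.5    # elapsed time of the same run under mpiP
# Relative overhead in percent: (t_instr - t_base) / t_base * 100
overhead=$(awk -v a="$t_base" -v b="$t_instr" \
    'BEGIN { printf "%.1f", (b - a) / a * 100 }')
echo "Instrumentation overhead: ${overhead}%"
```

With these example values, the run under mpiP is 4.5% slower, i.e. just below the 5% rule of thumb mentioned above.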

It is therefore important to measure the performance of your code for the particular use-case before applying any performance analysis tools. We refer to this as non-instrumented performance.

At the very least you should determine the elapsed time of a run. For instance, do

$ time mpirun ... ./app

and record the "real" portion of the output, i.e. the elapsed wall-clock time.

Many codes keep track of an application-specific performance metric, such as iterations per second or similar. Often, this is a better metric than the raw elapsed time, as it disregards the initialisation and shutdown phases, which are negligible for longer production runs, but not for short analysis use-cases. If your code reports such a metric, record it in addition to the elapsed time. Consider adding an application-specific metric to your code if one is not available yet.

Consider doing not just one run, but several to get a feeling for the variation of the non-instrumented performance across runs.
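Such repeated measurements are easy to script. The sketch below uses a placeholder command (`sleep 1`) where your actual `mpirun ... ./app` line would go:

```shell
# Placeholder; replace with: mpirun -n XX ... ./app app_arg1 app_arg2
CMD="sleep 1"
for i in 1 2 3; do
    start=$(date +%s.%N)
    $CMD
    end=$(date +%s.%N)
    # Print the elapsed wall-clock time of this repetition
    awk -v s="$start" -v e="$end" -v i="$i" \
        'BEGIN { printf "run %d: %.2f s elapsed\n", i, e - s }'
done
```

Comparing the printed times across the repetitions gives you a feeling for the run-to-run variability, against which the instrumentation overhead should later be judged.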

Configuration of mpiP

mpiP is configured by setting the environment variable MPIP. All available settings are listed here. Note that the wrapper script introduced in the next section already sets a sensible default configuration. Most users will not need to change it.
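For illustration, a custom configuration could be set as follows before launching the job. The `-k` and `-t` flags are documented mpiP options; the particular values here are just an example, not a recommendation:

```shell
# -k 2   : record call sites with a stack-trace depth of 2
# -t 10.0: only report call sites contributing at least 10% of MPI time
export MPIP="-k 2 -t 10.0"
```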


Obtaining profiles

mpiP does not need instrumentation of the source code. It attaches to the binary through LD_PRELOAD. Usually, this is done by a wrapper script. Again, we have prepared a template wrapper script which should be sufficient for most users. You can inspect it with:

$ cat $HLRS_MPIP_ROOT/../share/trace_mpiP.sh

Again, most users will not have to change or even copy it, as the wrapper script is found via your $PATH.

To obtain profiles, you just need to place the wrapper script in front of your application binary. For instance, suppose your job script contains:

time mpirun -n XX ... ./app app_arg1 app_arg2

just replace this with

module load mpip
time mpirun -n XX ... trace_mpiP.sh ./app app_arg1 app_arg2

Note that you need to load the mpip module in your job script.
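Putting these pieces together, a complete job script might look like the following sketch. The PBS directives, node selection, rank count and application name are illustrative assumptions; adjust them to your own case:

```shell
#!/bin/bash
#PBS -N mpip_profile
#PBS -l select=1:mpiprocs=128
#PBS -l walltime=00:20:00

# Run from the directory the job was submitted from
cd "$PBS_O_WORKDIR"

module load mpip

# trace_mpiP.sh LD_PRELOADs mpiP with a sensible default configuration
time mpirun -n 128 trace_mpiP.sh ./app app_arg1 app_arg2
```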

After running your job, you will find one or two files in your working directory

$ ls -l *.?.mpiP

The name of the file `APP.NRANKS.UNIQUEID.?.mpiP` consists of the name of your application binary, the number of MPI ranks, and a unique ID, and either the ending `1.mpiP` for a short concise report or `2.mpiP` for a longer detailed report.

Quick efficiency metrics

A set of basic performance efficiency metrics can be calculated from an mpiP profile with the command mpip2POP. More specifically, this command produces the metrics developed by the POP project (Performance Optimisation and Productivity).

Invoking the command as

module load mpip
mpip2POP.py --scaling weak first_profile.mpiP [second_profile.mpiP]

will produce output similar to the following.

---------------------------------
                |  128  |  512  |
---------------------------------
 GE             | 0.91  | 0.89  |
   PE           | 0.91  | 0.87  |
     LB         | 0.92  | 0.94  |
     CE         | 0.98  | 0.93  |
   CScal        | 1.00  | 1.02  |
---------------------------------
 Elapsed time   |  7.5  |  7.7  |
 Average useful |  6.8  |  6.7  |
 Max useful     |  7.4  |  7.2  |
---------------------------------

Here, each column corresponds to a profile, with the number of MPI ranks indicated at the top (here 128 and 512, respectively), while the rows show the values of the individual metrics. See here for a brief explanation of POP metrics.
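Assuming the standard POP definitions (Parallel Efficiency = average useful computation time / elapsed time; Load Balance = average useful / maximum useful time), some of the values above can be reproduced directly from the raw times. The sketch below checks this for the 128-rank column of the example output:

```shell
# Raw times from the 128-rank column of the example output above.
elapsed=7.5; avg_useful=6.8; max_useful=7.4
# Parallel Efficiency = average useful / elapsed time
pe=$(awk -v e="$elapsed" -v a="$avg_useful" 'BEGIN { printf "%.2f", a / e }')
# Load Balance = average useful / maximum useful time
lb=$(awk -v a="$avg_useful" -v m="$max_useful" 'BEGIN { printf "%.2f", a / m }')
echo "PE = $pe"   # matches the 0.91 in the table
echo "LB = $lb"   # matches the 0.92 in the table
```

Note that the times in the table are rounded to one decimal place, so metrics recomputed this way can differ from the reported ones in the last digit.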