- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

Workflow for Profiling and Tracing with Score-P and Scalasca


This page is still work in progress!

Introduction

This page describes a basic workflow for performance analysis based on Score-P and Scalasca. The best practices presented here are tailored to HLRS' Hawk system.

More specifically, we describe steps and commands necessary for

  1. setting up a suitable use-case,
  2. determining the non-instrumented performance,
  3. instrumenting your code,
  4. getting an initial profile,
  5. determining instrumentation overhead,
  6. scoring for filtering and trace file size determination,
  7. filtering,
  8. profiling with Scalasca,
  9. tracing with Scalasca.

If you get stuck or need further explanation, please get in touch with HLRS user support.

Setting up a suitable use-case

Profiling and in particular tracing can produce a huge amount of performance analysis data. Typically, when profiling/tracing it is sufficient to run your code for a few timesteps/iterations only. In most cases, it is good practice to run the code for between 1 and 10 minutes.

However, the performance characteristics of a code depend critically on the scale, i.e. the number of cores used, and the problem size. Try to keep your performance analysis use-case as close as possible to a realistic use-case of your interest. Where practical, reduce the execution time (and thus the profiling/tracing data volume) by reducing the number of timesteps/iterations, not by reducing the problem size. If you are interested only in profiling, but not in tracing, the amount of data is usually much smaller, allowing longer runs if necessary.

Determine the non-instrumented performance

Running your application under the control of a performance analysis tool can incur significant overhead, i.e. your code will take noticeably longer to execute. At the same time, such overhead will have an impact on the quality of your performance analysis and the robustness of your conclusions. Always be aware of the amount of overhead and try to keep it small where possible. In many cases it is possible to reduce the overhead to below 5% of the execution time, which is of the same order of magnitude as the expected performance variability between runs. If your overhead is larger, be aware that performance metrics may be off by at least as much.

It is therefore important to measure the performance of your code for the particular use-case before applying any performance analysis tools. We refer to this as non-instrumented performance. For a Score-P based workflow, this means that you need to measure the code before even compiling with Score-P.

At the very least you should determine the elapsed time of a run. Do for instance

$ time mpirun ... ./app_non-instrumented

and record the "real" (wall-clock) portion of the output.
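
The output will look similar to the following (illustrative values); the "real" line gives the elapsed wall-clock time:

real	2m13.54s
user	0m0.21s
sys	0m0.09s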

Many codes keep track of an application-specific performance metric, such as for instance iterations per second, or similar. Often this is better than the raw elapsed time, as it disregards initialisation and shutdown phases, which are negligible for longer production runs, but not for short analysis use-cases. If your code reports such a metric, record it as well in addition to the elapsed time. You may consider adding an application-specific metric to your code, if one is not available yet.

Consider doing not just one run, but several to get a feeling for the variation of the non-instrumented performance across runs.
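
For instance, to collect three timings in a row (a sketch; the mpirun arguments are placeholders as above):

$ for i in 1 2 3; do time mpirun ... ./app_non-instrumented; done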

Basic instrumentation of the code

Next you need to instrument your code by re-compiling it with Score-P. Essentially, you need to prefix every compile and link command with the corresponding Score-P compiler wrapper; instrumentation happens both at compile and at link time, so the link command must be wrapped as well. Score-P provides a number of such compiler wrappers.

If you are using Makefiles to build your code, we recommend adding "scorep " (including the space " ") in front of every compiler and linker command, as for instance

# Makefile

%.o: %.f90
	scorep mpif90 -c -o $@ $<

Note that the recipe line must be indented with a tab character, as is usual in Makefiles.

For build systems relying on CMake or autotools, it is easier to use more specific wrappers such as "scorep-mpicc" (note the dash, it is not a space). For instance

# autotools
$ MPICC="scorep-mpicc" ./configure ...

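For CMake, the following sketch shows one common approach (assuming the scorep-mpicc wrapper from above; setting SCOREP_WRAPPER=off disables the wrapper during CMake's configure checks, and it is active again during the subsequent make):

# CMake
$ SCOREP_WRAPPER=off cmake -DCMAKE_C_COMPILER=scorep-mpicc ..
$ make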

For further instrumentation options, see the Score-P documentation (https://www.vi-hps.org/projects/score-p/).


Initial Score-P profile

To get an initial profile, just run your application as usual. Make sure to use the instrumented binary and record the execution time. The main purpose of this initial profile is to determine the instrumentation overhead and to estimate the expected size of a full trace, as explained in the following sections.

$ time mpirun ... ./app_instrumented

$ ls scorep-app_instrumented-20210419

MANIFEST.md profile.cubex scorep.cfg


Score-P will create a directory with the naming pattern scorep-APPNAME-TIMESTAMP, where APPNAME is the name of the executable, i.e. app_instrumented in the example above, and TIMESTAMP the time of profiling. Inside this directory Score-P will create the following files:

  1. MANIFEST.md: describes the contents of the directory
  2. profile.cubex: the actual profile; it can be viewed with Cube
  3. scorep.cfg: the Score-P configuration used for this run

Overhead

Compare the execution time (or the application-specific performance metric, if available) of the non-instrumented run with the execution time of the instrumented binary obtained in the previous step.
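
For example, with a non-instrumented run of 120 s and an instrumented run of 126 s (hypothetical values), the overhead amounts to 5%:

$ echo "(126 - 120) * 100 / 120" | bc
5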

Overheads of 5% or less are acceptable in most cases; the run-to-run variability is often of the same order of magnitude. The next two sections on scoring and filtering describe techniques which reduce the overhead in many cases. However, there might be situations where the overhead cannot be reduced further without significant effort. In those cases, proceed with the analysis, but be aware that performance metrics will be affected by the overhead, possibly to an unexpectedly large extent.

Scoring

Before even looking at the profile, it should be "scored". Scoring will summarise the profile and estimate the trace size based on the number of invocations of user-code functions, as well as of functions belonging to MPI or OpenMP. To score your profile, do

$ scorep-score -r -c 2 scorep-APP-TIMESTAMP/profile.cubex > scorep-APP-TIMESTAMP/scorep.score
$ cat scorep-APP-TIMESTAMP/scorep.score


The command above will produce a detailed report (-r), including a breakdown to function level, and estimate the trace size including space for two hardware counters (-c 2).

Trace size calculation

The expected size of a full trace can be read off near the top of the scoring output:

$ cat scorep-APP-TIMESTAMP/scorep.score

...
Estimated aggregate size of event trace: 38MB

hint: set SCOREP_TOTAL_MEMORY=4097kB

In the example above, a full trace is estimated to have a size of 38MB. Anything below 10GB is fine for a trace. Larger traces will take significant time to process during performance analysis with Scalasca or Vampir. The trace size can be reduced by filtering events, as explained in the next section on filtering.

Please note that Score-P suggests setting the environment variable SCOREP_TOTAL_MEMORY for subsequent runs. This reserves memory for Score-P and avoids intermediate flushes of Score-P's trace buffer, which is relatively small by default. In practice it is easier to round up to a simple number such as 10MB rather than use 4097kB exactly. Note that this amount of memory is taken by each process, not per compute node as a whole.
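
For example, rounding the suggested 4097kB up (the mpirun arguments are again placeholders):

$ export SCOREP_TOTAL_MEMORY=10MB
$ mpirun ... ./app_instrumented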

Filtering

There are two reasons for filtering: reducing instrumentation overhead and reducing the trace size.
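
As a starting point, the following is a minimal sketch of a Score-P filter file; the excluded region names are hypothetical, see the Score-P documentation for the full syntax:

# filter.txt
SCOREP_REGION_NAMES_BEGIN
  EXCLUDE
    small_helper_function
    tiny_kernel_*
SCOREP_REGION_NAMES_END

The effect of the filter on the estimated trace size can be checked without re-running the application via scorep-score -f filter.txt, and the filter is applied at run time by setting the environment variable SCOREP_FILTERING_FILE=filter.txt.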