Workflow for Profiling and Tracing with Score-P and Scalasca

Introduction

This page describes a basic workflow for performance analysis based on Score-P and Scalasca. The best practices presented here are tailored to HLRS' Hawk system.

More specifically, we describe steps and commands necessary for

  1. setting up a suitable use-case,
  2. determining the non-instrumented performance,
  3. instrumenting your code,
  4. getting an initial profile,
  5. determining the instrumentation overhead,
  6. scoring for filtering and trace file size determination,
  7. filtering,
  8. profiling with Scalasca,
  9. tracing with Scalasca.

If you get stuck or need further explanation, please get in touch with HLRS user support.

On Hawk load the required modules with

$ module load scorep scalasca cube vampir


Setting up a suitable use-case

Profiling, and in particular tracing, can produce a huge amount of performance analysis data. Typically, when profiling or tracing, it is sufficient to run your code for a few timesteps/iterations only. In most cases, it is good practice to run the code for between 1 and 10 minutes.

However, the performance characteristics of a code depend critically on the scale, i.e. the number of cores used, and on the problem size. Try to keep your performance analysis use-case as close as possible to a realistic use-case of interest. Where practical, reduce the execution time (and thus the profiling/tracing data volume) by reducing the number of timesteps/iterations, not by reducing the problem size. If you are interested only in profiling, but not in tracing, the amount of data is usually much smaller, which allows longer runs if necessary.

Determine the non-instrumented performance

Running your application under the control of a performance analysis tool can incur significant overhead, i.e. your code will take noticeably longer to execute. Such overhead affects the quality of your performance analysis and the robustness of your conclusions. Always be aware of the amount of overhead and try to keep it small where possible. In many cases it is possible to reduce the overhead below 5% of the execution time, which is of the same order of magnitude as the expected performance variability between runs. If your overhead is larger, be aware that performance metrics may be off by at least as much.

It is therefore important to measure the performance of your code for the particular use-case before applying any performance analysis tools. We refer to this as non-instrumented performance. For a Score-P based workflow, this means that you need to measure the code before even compiling with Score-P.

At the very least you should determine the elapsed time of the run. For instance, run

$ time mpirun ... ./app_non-instrumented

and record the "real" (elapsed wall-clock) time from the output, not the "user" CPU time.

Many codes keep track of an application-specific performance metric, such as iterations per second. Often this is a better measure than the raw elapsed time, as it disregards initialisation and shutdown phases, which are negligible for longer production runs but not for short analysis use-cases. If your code reports such a metric, record it in addition to the elapsed time. If no such metric is available yet, consider adding one to your code.

Consider doing not just one run, but several to get a feeling for the variation of the non-instrumented performance across runs.
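The variation across runs can be summarised with a few lines of shell. The following is a minimal sketch: the function name run_stats is hypothetical, and the elapsed times (in seconds) are assumed to have been collected by hand from several runs.

```shell
# Summarise elapsed times (seconds) of several non-instrumented runs.
# Prints mean, min, max and the relative spread (max - min) / mean in percent.
# run_stats is a hypothetical helper, not part of Score-P or Scalasca.
run_stats() {
    [ "$#" -gt 0 ] || { echo "usage: run_stats TIME [TIME ...]" >&2; return 1; }
    printf '%s\n' "$@" | awk '
        { v = $1 + 0 }                       # force numeric interpretation
        NR == 1 { min = v; max = v }
        { sum += v; if (v < min) min = v; if (v > max) max = v }
        END {
            mean = sum / NR
            printf "mean=%.2f min=%.2f max=%.2f spread=%.1f%%\n",
                   mean, min, max, 100 * (max - min) / mean
        }'
}
```

For example, `run_stats 100 102 98` summarises three runs. A spread of a few percent gives a sense of how small an instrumentation overhead can be before it disappears in the noise.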

Basic instrumentation of the code

Next you need to instrument your code by re-compiling it with Score-P. Essentially, you need to replace every invocation of the compiler with the corresponding Score-P compiler wrapper. This includes the link step: there is no separate scorep-ld, so linking has to go through the wrapped compiler driver as well. Score-P provides a number of such compiler wrappers.

If you are using Makefiles to build your code, we recommend adding "scorep " (including the space " ") in front of every compiler command, as for instance

# Makefile (the recipe line must start with a tab character)
%.o: %.f90
	scorep mpif90 -c -o $@ $<

For build systems relying on CMake or autotools, it is easier to use more specific wrappers such as "scorep-mpicc" (note the dash, it is not a space). For instance

# autotools
$ MPICC="scorep-mpicc" ./configure ...

(TODO: add CMake example above)

(TODO: add link to Score-P instrumentation docs page)

Initial Score-P profile

To get an initial profile, just run your application as usual. Make sure to use the instrumented binary and record the execution time. The main purpose of this initial profile is to determine the instrumentation overhead and the expected size of a full trace, as explained in the following sections.

$ time mpirun ... ./app_instrumented

$ ls scorep-app_instrumented-20210419

MANIFEST.md profile.cubex scorep.cfg


Score-P will create a directory with the naming pattern scorep-APPNAME-TIMESTAMP, where APPNAME is the name of the executable, i.e. app_instrumented in the example above, and TIMESTAMP the time of profiling. Inside this directory Score-P will create the following files:

  1. MANIFEST.md: manifest of this directory
  2. profile.cubex: actual Score-P profile; use Cube (module load cube) to display
  3. scorep.cfg: Score-P configuration for this experiment

Overhead

Compare the execution time (or the application-specific performance metric, if available) of the non-instrumented run with that of the instrumented binary obtained in the previous step.

Overheads of 5% or less are acceptable in most cases; the run-to-run variability is often in the same order of magnitude. The next two sections on scoring and filtering will describe techniques which will reduce the overhead in many cases. However, there might be situations where the overhead cannot be reduced further without significant effort. In those cases, proceed with the analysis, but be aware that performance metrics will be affected by the overhead, possibly to an unexpectedly large extent.
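The comparison boils down to a relative difference. The following is a minimal sketch: overhead_percent is a hypothetical helper, and both arguments are assumed to be elapsed times in seconds, measured on the same use-case.

```shell
# Relative instrumentation overhead in percent:
#   100 * (t_instrumented - t_reference) / t_reference
# overhead_percent is a hypothetical helper. Arguments: reference
# (non-instrumented) time, then instrumented time, both in seconds.
overhead_percent() {
    awk -v ref="$1" -v inst="$2" 'BEGIN {
        if (ref + 0 <= 0) {
            print "reference time must be positive" > "/dev/stderr"
            exit 1
        }
        printf "%.1f\n", 100 * (inst - ref) / ref
    }'
}
```

For example, `overhead_percent 120.0 126.0` reports the overhead of a run that took 126 s instead of 120 s. For a metric where higher is better, such as iterations per second, the two arguments would have to be swapped.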

Scoring

Before even looking at the profile, it should be "scored". Scoring summarises the profile and estimates the trace size based on the number of invocations of user-code functions as well as of functions belonging to MPI or OpenMP. To score your profile, do

$ scorep-score -r -c 3 scorep-APP-TIMESTAMP/profile.cubex > scorep-APP-TIMESTAMP/scorep.score
$ cat scorep-APP-TIMESTAMP/scorep.score


The command above will produce a detailed report (-r), including a breakdown to function level, and estimate the trace size including space for three hardware counters (-c 3).

Trace size calculation

The expected size of a full trace can be read off near the top of the scoring output

$ cat scorep-APP-TIMESTAMP/scorep.score

...
Estimated aggregate size of event trace: 38MB

hint: set SCOREP_TOTAL_MEMORY=4097kB

In the example above, a full trace is estimated to have a size of 38MB. Anything below 10GB is fine for a trace. Larger traces will take significant time to process during performance analysis with Scalasca or Vampir. The trace size can be reduced by filtering events, as explained in the next section on filtering.
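The 10GB rule of thumb can be checked automatically. The following is a minimal sketch: trace_size_ok is a hypothetical helper, and it assumes that the report contains a line of the form shown above, with the size written as a number directly followed by kB, MB, GB or TB and with 1GB taken as 1024MB.

```shell
# Extract the estimated trace size from a scorep-score report and compare
# it with a limit in GB (default 10). trace_size_ok is a hypothetical
# helper. Exit status: 0 = within limit, 1 = too large, 2 = not found.
trace_size_ok() {
    awk -v limit_gb="${2:-10}" '
        /Estimated aggregate size of event trace:/ {
            size  = $NF                  # e.g. "38MB"
            value = size + 0             # numeric prefix, e.g. 38
            unit  = size
            sub(/^[0-9.]+/, "", unit)    # remaining unit, e.g. "MB"
            if      (unit == "kB") gb = value / (1024 * 1024)
            else if (unit == "MB") gb = value / 1024
            else if (unit == "GB") gb = value
            else if (unit == "TB") gb = value * 1024
            else { print "unknown unit: " unit > "/dev/stderr"; exit 2 }
            found = 1
            rc = (gb <= limit_gb + 0) ? 0 : 1
            printf "%s (%.3f GB): %s\n", size, gb, (rc == 0) ? "ok" : "too large, consider filtering"
            exit rc
        }
        END { if (!found) exit 2 }' "$1"
}
```

A typical call would be `trace_size_ok scorep-APP-TIMESTAMP/scorep.score`, optionally followed by a different limit in GB.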

Please note that Score-P suggests setting the environment variable SCOREP_TOTAL_MEMORY for subsequent runs. This reserves memory for Score-P and avoids intermediate flushes of Score-P's trace buffer, which is relatively small by default. In practice it is easier to round up to a simple number such as 10MB rather than 4097kB. Note that this amount of memory will be taken by each process, not by the compute node as a whole. Flushes should be avoided in any case, as they would affect the observed performance!
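The rounding can be scripted as follows. This is a minimal sketch: round_total_memory is a hypothetical helper, and it assumes that the hint appears in the report as SCOREP_TOTAL_MEMORY=<number> followed by kB, MB or GB.

```shell
# Turn the SCOREP_TOTAL_MEMORY hint of a scorep-score report into a value
# rounded up to the next multiple of STEP megabytes (default 10).
# round_total_memory is a hypothetical helper. Usage:
#   round_total_memory REPORT [STEP]
round_total_memory() {
    step=${2:-10}
    hint=$(sed -n 's/.*SCOREP_TOTAL_MEMORY=\([0-9][0-9]*\)\([kMG]B\).*/\1 \2/p' "$1" | head -n 1)
    [ -n "$hint" ] || { echo "no SCOREP_TOTAL_MEMORY hint found" >&2; return 1; }
    value=${hint% *}
    case ${hint#* } in
        kB) mb=$(( (value + 1023) / 1024 )) ;;   # kB -> MB, rounded up
        MB) mb=$value ;;
        GB) mb=$(( value * 1024 )) ;;
    esac
    echo "$(( (mb + step - 1) / step * step ))MB"
}
```

It could then be used as `export SCOREP_TOTAL_MEMORY=$(round_total_memory scorep-APP-TIMESTAMP/scorep.score)`. Keep in mind that the resulting amount is reserved per process.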

Filtering

Filtering is a way to reduce the number of events recorded while your application executes under the control of Score-P. There are two reasons to do so:

  1. reducing the overhead by discarding events as soon as possible
  2. reducing the size or memory requirements of a trace by storing fewer events.

With Score-P, filtering is controlled at the level of functions/methods in the user code. For each function, one chooses either to record its invocations or to discard them. The output of scoring is the place to look for candidates for filtering. See the example below.

flt     type max_buf[B]    visits time[s] time[%] time/visit[us]  region
        ALL    714,914 1,050,818 2398.23   100.0        2282.25  ALL
        USR    690,030 1,026,658 2020.74    84.3        1968.27  USR
        MPI     17,729    15,008   75.65     3.2        5040.69  MPI
        COM      7,055     9,024  301.82    12.6       33446.60  COM
     SCOREP        100       128    0.02     0.0         129.98  SCOREP

        USR    279,480   411,264    1.14     0.0           2.78  timing::cpu_time_measure
        USR    122,400   184,320 1465.16    61.1        7948.99  lbm_functions::stream_collide_bgk
        USR    122,400   184,320    0.42     0.0           2.26  lbm_step_tiled::lb_step_tile_task
        USR    122,400   184,320    0.13     0.0           0.70  lbm_step_tiled::lb_step_tile
        USR     12,240    18,432    0.02     0.0           1.24  lbm_step_tiled::allocate_tile
        USR     12,240    18,432  400.54    16.7       21730.45  lbm_step_tiled::localize_tile
        MPI      9,300     6,080   65.78     2.7       10819.51  MPI_Sendrecv
        COM      5,100     6,080    0.02     0.0           3.39  mpl_set::mpl_communicate_buffer
        USR      5,100     6,080   10.65     0.4        1751.09  mpl_set::mpl_read_buffer
        USR      5,100     6,080   14.21     0.6        2337.34  mpl_set::mpl_fill_buffer
        MPI      4,064     4,096    4.33     0.2        1055.97  MPI_Reduce
        MPI      2,794     2,816    0.12     0.0          41.16  MPI_Bcast

The first 6 lines show a summary profile of the application split into user code (USR), MPI runtime (MPI), OpenMP runtime (OMP, not in this example), and functions invoking MPI operations (COM), and all together (ALL). Columns are estimated trace buffer size (max_buf), number of function invocations (visits), time spent in function (time[s]), fraction of total execution time (time[%]), and average duration per invocation (time/visit[us]). The second block starting at line 8 shows the same information for each function (given in the region column).

To reduce the trace size or overhead, one should filter functions which are invoked frequently and are of short duration. Conveniently, frequently invoked functions are at the top of the list. Starting from the top, take note of functions with short duration. Functions with a duration of less than 1-2us per visit are likely to cause overhead and should be filtered. In this particular example, one would filter

  • timing::cpu_time_measure (largest contribution to trace buf_size, negligible %time)
  • lbm_step_tiled::lb_step_tile (large contribution to trace buf_size, very short <1us -> overhead likely)
  • lbm_step_tiled::lb_step_tile_task (large contribution to trace buf_size, negligible %time)

These three functions account for roughly 73% of the estimated trace buffer size (524,280 of 714,914 bytes), and one of them is probably a source of overhead.
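The selection of candidates from the scoring output can be sketched in awk. This is a minimal sketch under several assumptions: filter_candidates is a hypothetical helper, the report has the column layout shown above with an empty flt column, region names contain no spaces, and the default thresholds (100000 bytes of max_buf, 3us per visit) are illustrative choices rather than Score-P rules.

```shell
# List filter candidates from a "scorep-score -r" report: USR regions with
# a large trace-buffer contribution and a short average time per visit.
# filter_candidates is a hypothetical helper. Usage:
#   filter_candidates REPORT [MIN_BUF_BYTES] [MAX_US_PER_VISIT]
filter_candidates() {
    awk -v min_buf="${2:-100000}" -v max_us="${3:-3}" '
        # Columns: type max_buf[B] visits time[s] time[%] time/visit[us] region
        $1 == "USR" && NF == 7 && $7 != "USR" {
            buf = $2; gsub(/,/, "", buf)     # strip thousands separators
            us  = $6; gsub(/,/, "", us)
            if (buf + 0 >= min_buf + 0 && us + 0 < max_us + 0) print $7
        }' "$1"
}
```

Applied to the report above with the default thresholds, this selects the same three functions as the manual inspection. The result is only a starting point: always check that a candidate is not a function you actually want to see in the analysis.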

Then produce a filter file APP-scorep.filt with content similar to:

$ cat APP-scorep.filt
 SCOREP_REGION_NAMES_BEGIN
     EXCLUDE
         timing::cpu_time_measure
         lbm_step_tiled::lb_step_tile
         lbm_step_tiled::lb_step_tile_task
 SCOREP_REGION_NAMES_END

To apply the filter file during profiling/tracing you need to set the environment variable SCOREP_FILTERING_FILE, e.g.

export SCOREP_FILTERING_FILE=APP-scorep.filt
time mpirun ... ./app_instrumented


Verify the overhead and redo the scoring for the trace size calculation. Repeat filtering and scoring until both the overhead and the estimated trace size are acceptable.

Summary profiling with Scalasca

After setting up filtering, etc, it is time to take a profile with Scalasca. You need to specify the amount of memory available to Score-P (see section Trace size calculation), filter file, and the hardware counters that you want to record by setting appropriate environment variables. Typically it looks similar to

export SCOREP_TOTAL_MEMORY=10MB

export SCOREP_FILTERING_FILE=APP-scorep.filt
export SCOREP_METRIC_PAPI=PAPI_TOT_INS,PAPI_TOT_CYC,PERF_COUNT_HW_STALLED_CYCLES_BACKEND
scalasca -analyze -s mpirun ... ./app_instrumented

ls scorep_app_instrumented_NCORES_sum

MANIFEST.md profile.cubex scorep.cfg scorep.filter scorep.log


The command scalasca -analyze will execute the code and take a profile (-s). Scalasca will place the profile in a directory called scorep_APP_NCORES_sum, where APP is the name of the binary and NCORES the total number of cores used for the run.

If you need omplace with MPT, please do the following instead

export SCOREP_TOTAL_MEMORY=10MB

export SCOREP_FILTERING_FILE=APP-scorep.filt
export SCOREP_METRIC_PAPI=PAPI_TOT_INS,PAPI_TOT_CYC,PERF_COUNT_HW_STALLED_CYCLES_BACKEND
export SCAN_TARGET=./app_instrumented

scalasca -analyze -s mpirun omplace ... $SCAN_TARGET

otherwise Scalasca will actually analyse the binary omplace, not your application.

To produce the summary profile report, run scalasca -examine on the profile directory.

scalasca -examine -s scorep_app_instrumented_NCORES_sum

ls scorep_app_instrumented_NCORES_sum

... scorep.score summary.cubex

This will add a Score-P scoring file scorep.score and the summary profile summary.cubex. The latter contains more metrics (for instance load balance) and may be viewed with Cube. The option -s suppresses opening the summary profile in Cube automatically; remove it if you would like to view the profile immediately. Cube will also be able to calculate basic POP metrics from summary profiles.

Tracing with Scalasca

Doing full tracing, rather than just profiling, allows Scalasca to do more analysis on your application and to calculate further performance metrics. The procedure is similar to profiling with Scalasca, but the scalasca command takes the option -t (tracing) rather than -s (summary profiling).

export SCOREP_TOTAL_MEMORY=10MB

export SCOREP_FILTERING_FILE=APP-scorep.filt
export SCOREP_METRIC_PAPI=PAPI_TOT_INS,PAPI_TOT_CYC,PERF_COUNT_HW_STALLED_CYCLES_BACKEND
scalasca -analyze -t mpirun ... ./app_instrumented

ls scorep_app_instrumented_NCORES_trace
MANIFEST.md profile.cubex scorep.cfg scorep.filter scorep.log
traces.def trace.stat traces traces.otf2

scout.log scout.cubex


The command scalasca -analyze will execute the code and take a full trace (-t). Scalasca will place the results in a directory called scorep_APP_NCORES_trace, where APP is the name of the binary and NCORES the total number of cores used for the run.

If you need omplace on Hawk, please do the following instead

export SCOREP_TOTAL_MEMORY=10MB

export SCOREP_FILTERING_FILE=APP-scorep.filt
export SCOREP_METRIC_PAPI=PAPI_TOT_INS,PAPI_TOT_CYC,PERF_COUNT_HW_STALLED_CYCLES_BACKEND
export SCAN_TARGET=./app_instrumented

scalasca -analyze -t mpirun omplace ... $SCAN_TARGET

otherwise Scalasca will actually analyse the binary omplace, not your application.

To produce the summary trace report, run scalasca -examine on the trace directory.

scalasca -examine -s scorep_app_instrumented_NCORES_trace

ls scorep_app_instrumented_NCORES_trace

... scorep.score summary.cubex trace.cubex

This will add a Score-P scoring file scorep.score, the summary profile summary.cubex, and the summary trace report trace.cubex. The summary trace report contains more metrics (for instance load balance and critical path analysis) and may be viewed with Cube. The option -s suppresses opening the summary trace report in Cube automatically; remove it if you would like to view the report immediately. Cube will also be able to calculate full POP metrics from full traces.

Displaying POP efficiency metrics

Recent Scalasca versions can calculate POP efficiency metrics. To do so, load either a summary or a trace report in Cube. Then select your focus of analysis (area of interest) in the central panel. Select the tab "General" at the rightmost edge of the window, select the tab "Advisor" at the top of the right panel, and click "Recalculate".

For summary reports, this will show the POP metrics Parallel Efficiency (time lost due to MPI and/or OpenMP), Load Balance Efficiency (time lost due to computational imbalances in user code), and Communication Efficiency (time lost in MPI communication or OpenMP synchronisation). If you loaded a trace, it will additionally break down Communication Efficiency into Transfer Efficiency (time spent in the actual transfer of data) and Serialisation Efficiency (time lost due to the communication pattern).
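To make the relation between these metrics concrete, the following minimal sketch evaluates the standard POP definitions for the MPI-only case. It is an illustration, not the Cube Advisor's implementation, which may differ in detail. pop_efficiencies is a hypothetical helper, and its inputs (total runtime and per-process useful computation time, in seconds) are assumed to be known.

```shell
# POP efficiencies for the MPI-only case (standard POP definitions):
#   Load Balance Efficiency  = avg(useful) / max(useful)
#   Communication Efficiency = max(useful) / runtime
#   Parallel Efficiency      = Load Balance * Communication Efficiency
# pop_efficiencies is a hypothetical helper. Usage:
#   pop_efficiencies RUNTIME USEFUL_TIME [USEFUL_TIME ...]
pop_efficiencies() {
    runtime=$1
    shift
    [ "$#" -gt 0 ] || { echo "need at least one computation time" >&2; return 1; }
    printf '%s\n' "$@" | awk -v runtime="$runtime" '
        { v = $1 + 0; sum += v; if (v > max) max = v }
        END {
            lb   = (sum / NR) / max
            comm = max / runtime
            printf "load_balance=%.2f communication=%.2f parallel=%.2f\n",
                   lb, comm, lb * comm
        }'
}
```

For example, `pop_efficiencies 10 8 4` describes two processes that compute for 8 s and 4 s during a 10 s run. With a trace, Communication Efficiency is further split into the product of Serialisation Efficiency and Transfer Efficiency.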

The POP plugin usually also calculates the metric "Stalled resources". For this to work for Hawk traces, you need to execute the following command before loading the trace:

MYCUBEX=summary.cubex

cube_derive -t postderived -e "metric::PERF_COUNT_HW_STALLED_CYCLES_BACKEND()" \
    -p root PAPI_RES_STL $MYCUBEX -o $MYCUBEX

cube_derive -t postderived -e "metric::PERF_COUNT_HW_STALLED_CYCLES_BACKEND() / metric::PAPI_TOT_CYC()" \
    -p root stalled_cycles_per_cycle $MYCUBEX -o $MYCUBEX

Replace summary.cubex with trace.cubex if necessary.

Further information

If you need further information on Score-P, Scalasca and Cube, please have a look at the following resources: