Workflow for Profiling and Tracing with Score-P and Scalasca
Introduction
This page describes a basic workflow for performance analysis based on Score-P and Scalasca. The best practices presented here are tailored to HLRS' Hawk system.
More specifically, we describe steps and commands necessary for
- setting up a suitable use-case,
- determining the non-instrumented performance,
- instrumenting your code,
- getting an initial profile,
- determining instrumentation overhead,
- scoring for filtering and trace file size determination,
- filtering,
- profiling with Scalasca,
- tracing with Scalasca.
If you get stuck or need further explanation, please get in touch with HLRS user support.
On Hawk load the required modules with
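(The exact module names can vary between software stacks; check with module avail. A minimal sketch, assuming default Score-P, Scalasca, and Cube modules:)
module load scorep scalasca cube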
Setting up a suitable use-case
Profiling and in particular tracing can produce a huge amount of performance analysis data. Typically, when doing profiling/tracing it is sufficient to run your code for a few timesteps/iterations only. In most cases, it is good practice to run the code for between 1 and 10 minutes.
However, the performance characteristics of a code depend critically on the scale, i.e. the number of cores used, and on the problem size. Try to keep your performance analysis use-case as close as possible to a realistic use-case of your interest. Where practical, reduce the execution time (and thus the profiling/tracing data volume) by reducing the number of timesteps/iterations, not by reducing the problem size. If you are interested only in profiling, but not in tracing, the amount of data is usually much smaller, allowing longer runs if necessary.
Determine the non-instrumented performance
Running your application under the control of a performance analysis tool can incur significant overhead, i.e. your code will take noticeably longer to execute. At the same time, such overhead will have an impact on the quality of your performance analysis and the robustness of your conclusions. Always be aware of the amount of overhead and try to keep it small where possible. In many cases it is possible to reduce the overhead to below 5% of the execution time, which is of the same order of magnitude as the expected performance variability between runs. If your overhead is larger, be aware that performance metrics may be off by at least as much.
It is therefore important to measure the performance of your code for the particular use-case before applying any performance analysis tools. We refer to this as non-instrumented performance. For a Score-P based workflow, this means that you need to measure the code before even compiling with Score-P.
At the very least you should determine the elapsed time of a run. Do for instance
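(a minimal sketch, assuming an MPI application launched with mpirun; the core count and binary name are placeholders, so adapt the line to your job script)
time mpirun -np 128 ./app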
and record the "User time" portion of the output.
Many codes keep track of an application-specific performance metric, such as iterations per second or similar. Often this is better than the raw elapsed time, as it disregards initialisation and shutdown phases, which are negligible for longer production runs but not for short analysis use-cases. If your code reports such a metric, record it as well in addition to the elapsed time. You may consider adding an application-specific metric to your code if one is not available yet.
Consider doing not just one run, but several to get a feeling for the variation of the non-instrumented performance across runs.
Basic instrumentation of the code
Next you need to instrument your code by re-compiling it with Score-P. Essentially, you need to replace every invocation of the compiler with the corresponding Score-P compiler wrapper; the link step is handled the same way by prefixing the link command (there is no separate scorep-ld). Score-P provides a number of such compiler wrappers.
If you are using Makefiles to build your code, we recommend adding "scorep " (including the space " ") in front of every compiler command, as for instance
%.o: %.f90
	scorep mpif90 -c -o $*.o $<
(Note that the recipe line must be indented with a tab, as required by make.)
For build systems relying on CMake or autotools, it is easier to use more specific wrappers such as "scorep-mpicc" (note the dash, it is not a space). For instance
$ MPICC="scorep-mpicc" ./configure ...
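For CMake, one possibility is the following sketch (project layout and wrapper names are assumptions; depending on the installation the wrappers may be named differently). Setting SCOREP_WRAPPER=off during configuration keeps CMake's configure-time test programs uninstrumented, as described in the Score-P documentation; the subsequent build then uses the wrappers normally:
$ SCOREP_WRAPPER=off cmake .. -DCMAKE_C_COMPILER=scorep-mpicc -DCMAKE_Fortran_COMPILER=scorep-mpif90
$ make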
(TODO: add link to Score-P instrumentation docs page)
Manual instrumentation of the code
1. Start and stop measurements
SCOREP_RECORDING_OFF()
SCOREP_RECORDING_ON()
Important: the initial SCOREP_RECORDING_OFF() should be placed after MPI_Init (presumably done by initialize_MPI_lib), and SCOREP_RECORDING_ON() should be called again before MPI_Finalize (presumably finalize_MPI).
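For orientation, a minimal Fortran sketch of this placement (initialize_MPI_lib, solver_loop, and finalize_MPI are illustrative names; the source file must include scorep/SCOREP_User.inc and be compiled with preprocessing enabled; switching recording on only around the phase of interest is one possible usage):
call initialize_MPI_lib()   ! calls MPI_Init
SCOREP_RECORDING_OFF()      ! switch recording off right after initialisation
! ... setup that should not be measured ...
SCOREP_RECORDING_ON()       ! record only the phase of interest
call solver_loop()
SCOREP_RECORDING_OFF()
! ... teardown that should not be measured ...
SCOREP_RECORDING_ON()       ! recording must be on again before MPI_Finalize
call finalize_MPI()         ! calls MPI_Finalize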
2. Nonstandard type declaration
Score-P uses the nonstandard (GNU extension) type declaration INTEGER*8 for its handles, which is not accepted when F2008 standard conformance is required (-std=f2008):
Error: GNU Extension: Nonstandard type declaration INTEGER*8
As a workaround you can try adding a redefinition with the proper type:
#include "scorep/SCOREP_User.inc" #undef SCOREP_USER_REGION_HANDLE #define SCOREP_USER_REGION_HANDLE integer(8)
You should redefine the SCOREP_USER_REGION_HANDLE macro (rather than the SCOREP_USER_REGION_DEFINE macro) immediately after including SCOREP_User.inc. You need this in each source module where SCOREP_USER_REGION macros are used.
3. SCOREP_USER_REGION_DEFINE macros and their location
You need to ensure that the SCOREP_USER_REGION_DEFINE macros for handles are placed with the other declarations, prior to the first executable statement within the function/subroutine, i.e. before any assignment statements.
SCOREP_USER_REGION_DEFINE(region_handle)
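A minimal sketch of a manually instrumented subroutine (all names are illustrative; again, the file must include scorep/SCOREP_User.inc and be compiled with preprocessing enabled):
#include "scorep/SCOREP_User.inc"
subroutine compute_step(a, n)
  implicit none
  integer, intent(in) :: n
  real, intent(inout) :: a(n)
  SCOREP_USER_REGION_DEFINE(region_handle)   ! placed with the other declarations
  SCOREP_USER_REGION_BEGIN(region_handle, "compute_step", SCOREP_USER_REGION_TYPE_COMMON)
  a = a + 1.0
  SCOREP_USER_REGION_END(region_handle)
end subroutine compute_step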
Initial Score-P profile
To get an initial profile just run your application as usual. Make sure to use the instrumented binary and record the execution time. The main purpose of this initial profile is to determine the instrumentation overhead and the expected size of a full trace, as explained in the next sections.
$ ls scorep-app_instrumented-20210419
MANIFEST.md profile.cubex scorep.cfg
Score-P will create a directory with the naming pattern scorep-APPNAME-TIMESTAMP, where APPNAME is the name of the executable, i.e. app_instrumented in the example above, and TIMESTAMP the time of profiling.
Inside this directory Score-P will create the following files:
- MANIFEST.md: manifest of this directory
- profile.cubex: actual Score-P profile; use Cube (module load cube) to display it
- scorep.cfg: Score-P configuration for this experiment
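The profile can then be inspected interactively, for instance as follows (the experiment directory name will differ for your run):
module load cube
cube scorep-app_instrumented-20210419/profile.cubex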
Overhead
Compare the execution time (or application specific performance metric if available) of the non-instrumented run with the execution time of the instrumented binary obtained in the previous step.
Overheads of 5% or less are acceptable in most cases; the run-to-run variability is often in the same order of magnitude. The next two sections on scoring and filtering will describe techniques which will reduce the overhead in many cases. However, there might be situations where the overhead cannot be reduced further without significant effort. In those cases, proceed with the analysis, but be aware that performance metrics will be affected by the overhead, possibly to an unexpectedly large extent.
By default, automatic instrumentation of user-level source routines by the compiler is enabled (equivalent to passing --compiler to the scorep wrapper). The compiler instrumentation can be disabled with --nocompiler. This will reduce the overhead, but then information is obtained only about MPI calls.
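Both are options of the scorep instrumentation wrapper at compile time; a sketch of a compile command without compiler instrumentation (file names are placeholders):
scorep --nocompiler mpif90 -c -o foo.o foo.f90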
Scoring
Before even looking at the profile, it should be "scored". Scoring will summarise the profile and estimate the size based on the number of invocations of user-code functions, but also functions belonging to MPI or OpenMP. To score your profile, do
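The scoring tool shipped with Score-P is scorep-score; a typical invocation might look like the following sketch (saving the output to a file named scorep.score, as read back below, is optional):
$ scorep-score -r -c 3 scorep-APP-TIMESTAMP/profile.cubex > scorep-APP-TIMESTAMP/scorep.score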
$ cat scorep-APP-TIMESTAMP/scorep.score
The command above will produce a detailed report (-r), including a breakdown to function level, and estimate the trace size including space for three hardware counters (-c 3).
Trace size calculation
The expected size of a full trace can be read off near the top of the scoring output
...
Estimated aggregate size of event trace: 38MB
In the example above, a full trace is estimated to have a size of 38MB. Anything below 10GB is fine for a trace. Larger values will take significant time to process during performance analysis with Scalasca or Vampir. The trace size can be reduced by filtering events as explained in the next section on filtering.
Please note that Score-P suggests setting the environment variable SCOREP_TOTAL_MEMORY for subsequent runs. This will reserve memory for Score-P and avoid intermediate flushes of Score-P's trace buffer, which is relatively small by default. In practice it is easier to round up to a simple number such as 10MB rather than 4097kB. Note that this amount of memory is taken by each process, not per compute node as a whole. Furthermore, please note that flushes should be avoided in any case, as they would affect the observed performance!
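For the example above, this could look like the following sketch:
export SCOREP_TOTAL_MEMORY=10MB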
Filtering
Filtering is a way to reduce the amount of information / events recorded during execution of your application under the control of Score-P. There are two reasons to do so:
- reducing the overhead by discarding events as soon as possible
- reducing the size or memory requirements of a trace by storing fewer events.
With Score-P, filtering is controlled at the level of functions/methods in the user code. One chooses to either record invocations of particular functions or to discard them. The output of scoring is the place to look for candidates for filtering. See the example below.
flt  type   max_buf[B]     visits  time[s]  time[%]  time/visit[us]  region
     ALL       714,914  1,050,818  2398.23    100.0         2282.25  ALL
     USR       690,030  1,026,658  2020.74     84.3         1968.27  USR
     MPI        17,729     15,008    75.65      3.2         5040.69  MPI
     COM         7,055      9,024   301.82     12.6        33446.60  COM
  SCOREP           100        128     0.02      0.0          129.98  SCOREP

     USR       279,480    411,264     1.14      0.0            2.78  timing::cpu_time_measure
     USR       122,400    184,320  1465.16     61.1         7948.99  lbm_functions::stream_collide_bgk
     USR       122,400    184,320     0.42      0.0            2.26  lbm_step_tiled::lb_step_tile_task
     USR       122,400    184,320     0.13      0.0            0.70  lbm_step_tiled::lb_step_tile
     USR        12,240     18,432     0.02      0.0            1.24  lbm_step_tiled::allocate_tile
     USR        12,240     18,432   400.54     16.7        21730.45  lbm_step_tiled::localize_tile
     MPI         9,300      6,080    65.78      2.7        10819.51  MPI_Sendrecv
     COM         5,100      6,080     0.02      0.0            3.39  mpl_set::mpl_communicate_buffer
     USR         5,100      6,080    10.65      0.4         1751.09  mpl_set::mpl_read_buffer
     USR         5,100      6,080    14.21      0.6         2337.34  mpl_set::mpl_fill_buffer
     MPI         4,064      4,096     4.33      0.2         1055.97  MPI_Reduce
     MPI         2,794      2,816     0.12      0.0           41.16  MPI_Bcast
The first 6 lines show a summary profile of the application split into user code (USR), MPI runtime (MPI), OpenMP runtime (OMP, not present in this example), functions invoking MPI operations (COM), and everything together (ALL). The columns are the estimated trace buffer size (max_buf), the number of function invocations (visits), the time spent in the function (time[s]), the fraction of total execution time (time[%]), and the average duration per invocation (time/visit[us]). The second block, starting at line 8, shows the same information for each function (given in the region column).
To reduce the trace size or overhead, one should filter functions which are invoked frequently and are of short duration. Conveniently, frequently invoked functions are at the top of the list. Starting from the top, take note of functions with short duration. Functions with a duration of less than 1-2us are likely to cause overhead and should be filtered. In this particular example, one would filter
- timing::cpu_time_measure (largest contribution to trace buf_size, negligible %time)
- lbm_step_tiled::lb_step_tile (large contribution to trace buf_size, very short <1us -> overhead likely)
- lbm_step_tiled::lb_step_tile_task (large contribution to trace buf_size, negligible %time)
These three functions cause roughly 60% of the trace buffer volume, and one of them is probably a source of overhead.
Then produce a filter file APP-scorep.filt with content similar to:
$ cat APP-scorep.filt
SCOREP_REGION_NAMES_BEGIN
  EXCLUDE
    timing::cpu_time_measure
    lbm_step_tiled::lb_step_tile
    lbm_step_tiled::lb_step_tile_task
SCOREP_REGION_NAMES_END
To apply the filter file during profiling/tracing you need to set the environment variable SCOREP_FILTERING_FILE, e.g.
export SCOREP_FILTERING_FILE=APP-scorep.filt
time mpirun ... ./app_instrumented
Verify the overhead, and redo the scoring for trace size calculation. Repeat filtering and scoring until happy.
Summary profiling with Scalasca
After setting up filtering etc., it is time to take a profile with Scalasca. You need to specify the amount of memory available to Score-P (see section Trace size calculation), the filter file, and the hardware counters that you want to record by setting appropriate environment variables. Typically it looks similar to
export SCOREP_TOTAL_MEMORY=10MB
export SCOREP_FILTERING_FILE=APP-scorep.filt
export SCOREP_METRIC_PAPI=PAPI_TOT_INS,PAPI_TOT_CYC,PERF_COUNT_HW_STALLED_CYCLES_BACKEND
scalasca -analyze -s mpirun ... ./app_instrumented
ls scorep_app_instrumented_NCORES_sum
The command scalasca -analyze will execute the code and take a profile (-s).
Scalasca will place the profile in a directory called scorep_APP_NCORES_sum, where APP is the name of the binary and NCORES the total number of cores used for the run.
If you need omplace with MPT, please do the following instead
export SCOREP_FILTERING_FILE=APP-scorep.filt
export SCOREP_METRIC_PAPI=PAPI_TOT_INS,PAPI_TOT_CYC,PERF_COUNT_HW_STALLED_CYCLES_BACKEND
export SCAN_TARGET=./app_instrumented
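followed by the usual launch, now with omplace inside the command line (a sketch; the placement arguments are omitted here):
scalasca -analyze -s mpirun ... omplace ./app_instrumented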
otherwise Scalasca will actually analyse the binary omplace, not your application.
To produce the summary profile report, run scalasca -examine on the profile directory.
scalasca -examine -s scorep_app_instrumented_NCORES_sum
ls scorep_app_instrumented_NCORES_sum
... scorep.score summary.cubex

This will add a Score-P scoring file scorep.score and the summary profile summary.cubex. The latter contains more metrics (for instance load balance) and may be viewed with Cube. The option -s suppresses opening the summary profile in Cube automatically; remove it if you would like to view it immediately. Cube will also be able to calculate basic POP metrics from summary profiles.
Generating a summary profile may take a long time and a lot of RAM, and may fail if there is not enough. Rough estimates:
time[h] = 6*profile_size_in_GB
RAM needed = 10*profile_size
Tracing with Scalasca
Doing full tracing, rather than just profiling, will allow Scalasca to do more analysis on your application and calculate further performance metrics. The procedure is similar to profiling with Scalasca, but the scalasca command looks a bit different and takes the option -t (tracing) rather than -s (summary profiling).
export SCOREP_TOTAL_MEMORY=10MB
export SCOREP_FILTERING_FILE=APP-scorep.filt
export SCOREP_METRIC_PAPI=PAPI_TOT_INS,PAPI_TOT_CYC,PERF_COUNT_HW_STALLED_CYCLES_BACKEND
scalasca -analyze -t mpirun ... ./app_instrumented
ls scorep_app_instrumented_NCORES_trace
MANIFEST.md profile.cubex scorep.cfg scorep.filter scorep.log
traces.def trace.stat traces traces.otf2
The command scalasca -analyze will execute the code and take a full trace (-t).
Scalasca will place the results in a directory called scorep_APP_NCORES_trace, where APP is the name of the binary and NCORES the total number of cores used for the run.
If you need omplace on Hawk, please do the following instead
export SCOREP_FILTERING_FILE=APP-scorep.filt
export SCOREP_METRIC_PAPI=PAPI_TOT_INS,PAPI_TOT_CYC,PERF_COUNT_HW_STALLED_CYCLES_BACKEND
export SCAN_TARGET=./app_instrumented
otherwise Scalasca will actually analyse the binary omplace, not your application.
To produce the summary profile report and the trace report, run scalasca -examine on the trace directory.
scalasca -examine -s scorep_app_instrumented_NCORES_trace
ls scorep_app_instrumented_NCORES_trace
... scorep.score summary.cubex trace.cubex

This will add a Score-P scoring file scorep.score, the summary profile summary.cubex, and the trace report trace.cubex. The trace report contains more metrics (for instance load balance and critical path analysis) and may be viewed with Cube. The option -s suppresses opening the trace report in Cube automatically; remove it if you would like to view it immediately. Cube will also be able to calculate full POP metrics from full traces.
Displaying POP efficiency metrics
Recent Scalasca versions can calculate POP efficiency metrics. To do so, load either a summary or trace report in Cube. Then select your focus of analysis (area of interest) in the central panel. Select the tab "General" at the right most edge of the window, select the tab "Advisor" at the top of the right panel, and click "Recalculate".
For summary reports, this will show the POP metrics Parallel Efficiency (time lost due to MPI and/or OpenMP), Loadbalance Efficiency (time lost due to computational imbalances in user code), and Communication Efficiency (time lost in MPI communication or OpenMP Synchronisation). If you loaded a trace, it will additionally break down Communication Efficiency into Transfer Efficiency (time spent in actual transfer of data) and Serialisation Efficiency (time lost due to communication pattern).
The POP plugin usually also calculates the metric "Stalled resources". For this to work for Hawk traces, you need to execute the following command before loading the trace:
MYCUBEX=summary.cubex
cube_derive -t postderived -e "metric::PERF_COUNT_HW_STALLED_CYCLES_BACKEND()" \
    -p root PAPI_RES_STL $MYCUBEX -o $MYCUBEX
cube_derive -t postderived -e "metric::PERF_COUNT_HW_STALLED_CYCLES_BACKEND() / metric::PAPI_TOT_CYC()" \
Replace summary.cubex with trace.cubex if necessary.
Further information
If you need further information on Score-P, Scalasca, and Cube, please have a look at the following resources:
- Slides presented at the 35th VI-HPS Tools Workshop (https://www.vi-hps.org/training/tws/tw35.html), sections "Day 3: Wednesday 16 September" and "Day 4: Thursday 17 September"
- Videos of talks given at the 34th VI-HPS Tools Workshop:
  - Intro to the tools: https://www.youtube.com/watch?v=dy_xwvYJIqE
  - Hands-on session: https://www.youtube.com/watch?v=ZwQ77vQ6Eg4
- Score-P User Manual: http://perftools.pages.jsc.fz-juelich.de/cicd/scorep/tags/scorep-7.1/pdf/scorep.pdf
- Scalasca User Guide: http://apps.fz-juelich.de/scalasca/releases/scalasca/2.6/docs/UserGuide.pdf
- Cube User Guide: https://apps.fz-juelich.de/scalasca/releases/cube/4.6/docs/CubeUserGuide.pdf
- Score-P and Scalasca Cheatsheet: https://kb.hlrs.de/platforms/upload/Score-P_and_Scalasca_Cheatsheet.pdf