Workflow for Profiling and Tracing with Score-P and Scalasca
Introduction
This page describes a basic workflow for performance analysis based on Score-P and Scalasca. The best practices presented here are tailored to HLRS' Hawk system.
More specifically, we describe steps and commands necessary for
- setting up a suitable use-case,
- determining the non-instrumented performance,
- instrumenting your code,
- getting an initial profile,
- determining instrumentation overhead,
- scoring for filtering and trace file size determination,
- filtering,
- profiling with Scalasca,
- tracing with Scalasca.
If you get stuck or need further explanation, please get in touch with HLRS user support.
On Hawk load the required modules with
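(The exact module names can vary between software stacks; check with module avail. A minimal sketch, assuming default Score-P, Scalasca, and Cube modules:)
module load scorep scalasca cube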
Setting up a suitable use-case
Profiling and in particular tracing can produce a huge amount of performance analysis data. Typically, when doing profiling/tracing it is sufficient to run your code for a few timesteps/iterations only. In most cases, it is good practice to run the code for between 1 and 10 minutes.
However, the performance characteristics of a code depend critically on the scale, i.e. the number of cores used, and on the problem size. Try to keep your performance analysis use-case as close as possible to a realistic use-case of your interest. Where practical, reduce the execution time (and thus the profiling/tracing data volume) by reducing the number of timesteps/iterations, not by reducing the problem size. If you are interested only in profiling, but not in tracing, the amount of data is usually much smaller, allowing longer runs if necessary.
Determine the non-instrumented performance
Running your application under the control of a performance analysis tool can incur significant overhead, i.e. your code will take noticeably longer to execute. At the same time, such overhead will have an impact on the quality of your performance analysis and the robustness of your conclusions. Always be aware of the amount of overhead and try to keep it small where possible. In many cases it is possible to reduce the overhead to below 5% of the execution time, which is of the same order of magnitude as the expected performance variability between runs. If your overhead is larger, be aware that performance metrics may be off by at least as much.
It is therefore important to measure the performance of your code for the particular use-case before applying any performance analysis tools. We refer to this as non-instrumented performance. For a Score-P based workflow, this means that you need to measure the code before even compiling with Score-P.
At the very least you should determine the elapsed time of a run. Do for instance
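(a minimal sketch, assuming an MPI application launched with mpirun; the core count and binary name are placeholders, so adapt the line to your job script)
time mpirun -np 128 ./app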
and record the "User time" portion of the output.
Many codes keep track of an application-specific performance metric, such as iterations per second or similar. Often this is better than the raw elapsed time, as it disregards initialisation and shutdown phases, which are negligible for longer production runs but not for short analysis use-cases. If your code reports such a metric, record it as well in addition to the elapsed time. You may consider adding an application-specific metric to your code if one is not available yet.
Consider doing not just one run, but several to get a feeling for the variation of the non-instrumented performance across runs.
Basic instrumentation of the code
Next you need to instrument your code by re-compiling it with Score-P. Essentially, you need to replace every invocation of the compiler with the corresponding Score-P compiler wrapper; the link step is handled the same way by prefixing the link command (there is no separate scorep-ld). Score-P provides a number of such compiler wrappers.
If you are using Makefiles to build your code, we recommend adding "scorep " (including the space " ") in front of every compiler command, as for instance
%.o: %.f90
	scorep mpif90 -c -o $*.o $<
(Note that the recipe line must be indented with a tab, as required by make.)
For build systems relying on CMake or autotools, it is easier to use more specific wrappers such as "scorep-mpicc" (note the dash, it is not a space). For instance
$ MPICC="scorep-mpicc" ./configure ...
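For CMake, one possibility is the following sketch (project layout and wrapper names are assumptions; depending on the installation the wrappers may be named differently). Setting SCOREP_WRAPPER=off during configuration keeps CMake's configure-time test programs uninstrumented, as described in the Score-P documentation; the subsequent build then uses the wrappers normally:
$ SCOREP_WRAPPER=off cmake .. -DCMAKE_C_COMPILER=scorep-mpicc -DCMAKE_Fortran_COMPILER=scorep-mpif90
$ make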
(TODO: add link to Score-P instrumentation docs page)
Manual instrumentation of the code
1. Start and stop measurements
SCOREP_RECORDING_OFF()
SCOREP_RECORDING_ON()
Important: the initial SCOREP_RECORDING_OFF() should be placed after MPI_Init (presumably done by initialize_MPI_lib), and SCOREP_RECORDING_ON() should be called again before MPI_Finalize (presumably finalize_MPI).
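For orientation, a minimal Fortran sketch of this placement (initialize_MPI_lib, solver_loop, and finalize_MPI are illustrative names; the source file must include scorep/SCOREP_User.inc and be compiled with preprocessing enabled; switching recording on only around the phase of interest is one possible usage):
call initialize_MPI_lib()   ! calls MPI_Init
SCOREP_RECORDING_OFF()      ! switch recording off right after initialisation
! ... setup that should not be measured ...
SCOREP_RECORDING_ON()       ! record only the phase of interest
call solver_loop()
SCOREP_RECORDING_OFF()
! ... teardown that should not be measured ...
SCOREP_RECORDING_ON()       ! recording must be on again before MPI_Finalize
call finalize_MPI()         ! calls MPI_Finalize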
2. Nonstandard type declaration
Score-P uses the nonstandard (GNU extension) type declaration INTEGER*8 for its handles, which is not accepted when F2008 standard conformance is required (-std=f2008):
Error: GNU Extension: Nonstandard type declaration INTEGER*8
As a workaround you can try adding a redefinition with the proper type:
#include "scorep/SCOREP_User.inc" #undef SCOREP_USER_REGION_HANDLE #define SCOREP_USER_REGION_HANDLE integer(8)
You should redefine the SCOREP_USER_REGION_HANDLE macro (rather than the SCOREP_USER_REGION_DEFINE macro) immediately after including SCOREP_User.inc. You need this in each source module where SCOREP_USER_REGION macros are used.
3. SCOREP_USER_REGION_DEFINE macros and their location
You need to ensure that the SCOREP_USER_REGION_DEFINE macros for handles are placed with the other declarations, prior to the first executable statement within the function/subroutine, i.e. before any assignment statements.
SCOREP_USER_REGION_DEFINE(region_handle)
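A minimal sketch of a manually instrumented subroutine (all names are illustrative; again, the file must include scorep/SCOREP_User.inc and be compiled with preprocessing enabled):
#include "scorep/SCOREP_User.inc"
subroutine compute_step(a, n)
  implicit none
  integer, intent(in) :: n
  real, intent(inout) :: a(n)
  SCOREP_USER_REGION_DEFINE(region_handle)   ! placed with the other declarations
  SCOREP_USER_REGION_BEGIN(region_handle, "compute_step", SCOREP_USER_REGION_TYPE_COMMON)
  a = a + 1.0
  SCOREP_USER_REGION_END(region_handle)
end subroutine compute_step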
Initial Score-P profile
To get an initial profile just run your application as usual. Make sure to use the instrumented binary and record the execution time. The main purpose of this initial profile is to determine the instrumentation overhead and the expected size of a full trace, as explained in the next sections.
$ ls scorep-app_instrumented-20210419
MANIFEST.md profile.cubex scorep.cfg
Score-P will create a directory with the naming pattern scorep-APPNAME-TIMESTAMP, where APPNAME is the name of the executable, i.e. app_instrumented in the example above, and TIMESTAMP the time of profiling.
Inside this directory Score-P will create the following files:
- MANIFEST.md: manifest of this directory
- profile.cubex: actual Score-P profile; use Cube (module load cube) to display it
- scorep.cfg: Score-P configuration for this experiment
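The profile can then be inspected interactively, for instance as follows (the experiment directory name will differ for your run):
module load cube
cube scorep-app_instrumented-20210419/profile.cubex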
Overhead
Compare the execution time (or application specific performance metric if available) of the non-instrumented run with the execution time of the instrumented binary obtained in the previous step.
Overheads of 5% or less are acceptable in most cases; the run-to-run variability is often in the same order of magnitude. The next two sections on scoring and filtering will describe techniques which will reduce the overhead in many cases. However, there might be situations where the overhead cannot be reduced further without significant effort. In those cases, proceed with the analysis, but be aware that performance metrics will be affected by the overhead, possibly to an unexpectedly large extent.
By default, automatic instrumentation of user-level source routines by the compiler is enabled (equivalent to passing --compiler to the scorep wrapper). The compiler instrumentation can be disabled with --nocompiler. This will reduce the overhead, but then information is obtained only about MPI calls.
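Both are options of the scorep instrumentation wrapper at compile time; a sketch of a compile command without compiler instrumentation (file names are placeholders):
scorep --nocompiler mpif90 -c -o foo.o foo.f90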
Scoring
Before even looking at the profile, it should be "scored". Scoring will summarise the profile and estimate the size based on the number of invocations of user-code functions, but also functions belonging to MPI or OpenMP. To score your profile, do
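The scoring tool shipped with Score-P is scorep-score; a typical invocation might look like the following sketch (saving the output to a file named scorep.score, as read back below, is optional):
$ scorep-score -r -c 3 scorep-APP-TIMESTAMP/profile.cubex > scorep-APP-TIMESTAMP/scorep.score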
$ cat scorep-APP-TIMESTAMP/scorep.score
The command above will produce a detailed report (-r), including a breakdown to function level, and estimate the trace size including space for three hardware counters (-c 3).
Trace size calculation
The expected size of a full trace can be read off near the top of the scoring output
...
Estimated aggregate size of event trace: 38MB
In the example above, a full trace is estimated to have a size of 38MB. Anything below 10GB is fine for a trace. Larger values will take significant time to process during performance analysis with Scalasca or Vampir. The trace size can be reduced by filtering events as explained in the next section on filtering.
Please note that Score-P suggests setting the environment variable SCOREP_TOTAL_MEMORY for subsequent runs. This will reserve memory for Score-P and avoid intermediate flushes of Score-P's trace buffer, which is relatively small by default. In practice it is easier to round up to a simple number such as 10MB rather than 4097kB. Note that this amount of memory is taken by each process, not per compute node as a whole. Furthermore, please note that flushes should be avoided in any case, as they would affect the observed performance!
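For the example above, this could look like the following sketch:
export SCOREP_TOTAL_MEMORY=10MB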
Filtering
Filtering is a way to reduce the amount of information / events recorded during execution of your application under the control of Score-P. There are two reasons to do so:
- reducing the overhead by discarding events as soon as possible
- reducing the size or memory requirements of a trace by storing fewer events.
With Score-P, filtering is controlled at the level of functions/methods in the user code. One chooses to either record invocations of particular functions or to discard them. The output of scoring is the place to look for candidates for filtering. See the example below.
flt  type   max_buf[B]     visits  time[s]  time[%]  time/visit[us]  region
     ALL       714,914  1,050,818  2398.23    100.0         2282.25  ALL
     USR       690,030  1,026,658  2020.74     84.3         1968.27  USR
     MPI        17,729     15,008    75.65      3.2         5040.69  MPI
     COM         7,055      9,024   301.82     12.6        33446.60  COM
  SCOREP           100        128     0.02      0.0          129.98  SCOREP

     USR       279,480    411,264     1.14      0.0            2.78  timing::cpu_time_measure
     USR       122,400    184,320  1465.16     61.1         7948.99  lbm_functions::stream_collide_bgk
     USR       122,400    184,320     0.42      0.0            2.26  lbm_step_tiled::lb_step_tile_task
     USR       122,400    184,320     0.13      0.0            0.70  lbm_step_tiled::lb_step_tile
     USR        12,240     18,432     0.02      0.0            1.24  lbm_step_tiled::allocate_tile
     USR        12,240     18,432   400.54     16.7        21730.45  lbm_step_tiled::localize_tile
     MPI         9,300      6,080    65.78      2.7        10819.51  MPI_Sendrecv
     COM         5,100      6,080     0.02      0.0            3.39  mpl_set::mpl_communicate_buffer
     USR         5,100      6,080    10.65      0.4         1751.09  mpl_set::mpl_read_buffer
     USR         5,100      6,080    14.21      0.6         2337.34  mpl_set::mpl_fill_buffer
     MPI         4,064      4,096     4.33      0.2         1055.97  MPI_Reduce
     MPI         2,794      2,816     0.12      0.0           41.16  MPI_Bcast
The first 6 lines show a summary profile of the application split into user code (USR), MPI runtime (MPI), OpenMP runtime (OMP, not present in this example), functions invoking MPI operations (COM), and everything together (ALL). The columns are the estimated trace buffer size (max_buf), the number of function invocations (visits), the time spent in the function (time[s]), the fraction of total execution time (time[%]), and the average duration per invocation (time/visit[us]). The second block, starting at line 8, shows the same information for each function (given in the region column).
To reduce the trace size or overhead, one should filter functions which are invoked frequently and are of short duration. Conveniently, frequently invoked functions are at the top of the list. Starting from the top, take note of functions with short duration. Functions with a duration of less than 1-2us are likely to cause overhead and should be filtered. In this particular example, one would filter
- timing::cpu_time_measure (largest contribution to trace buf_size, negligible %time)
- lbm_step_tiled::lb_step_tile (large contribution to trace buf_size, very short <1us -> overhead likely)
- lbm_step_tiled::lb_step_tile_task (large contribution to trace buf_size, negligible %time)
These three functions cause roughly 60% of the trace buffer volume, and one of them is probably a source of overhead.
Then produce a filter file APP-scorep.filt with content similar to:
$ cat APP-scorep.filt
SCOREP_REGION_NAMES_BEGIN
  EXCLUDE
    timing::cpu_time_measure
    lbm_step_tiled::lb_step_tile
    lbm_step_tiled::lb_step_tile_task
SCOREP_REGION_NAMES_END
To apply the filter file during profiling/tracing you need to set the environment variable SCOREP_FILTERING_FILE, e.g.
export SCOREP_FILTERING_FILE=APP-scorep.filt
time mpirun ... ./app_instrumented
Verify the overhead, and redo the scoring for trace size calculation. Repeat filtering and scoring until happy.
Summary profiling with Scalasca
After setting up filtering etc., it is time to take a profile with Scalasca. You need to specify the amount of memory available to Score-P (see section Trace size calculation), the filter file, and the hardware counters that you want to record by setting appropriate environment variables. Typically it looks similar to
export SCOREP_TOTAL_MEMORY=10MB
export SCOREP_FILTERING_FILE=APP-scorep.filt
export SCOREP_METRIC_PAPI=PAPI_TOT_INS,PAPI_TOT_CYC,PERF_COUNT_HW_STALLED_CYCLES_BACKEND
scalasca -analyze -s mpirun ... ./app_instrumented
ls scorep_app_instrumented_NCORES_sum
The command scalasca -analyze will execute the code and take a profile (-s).
Scalasca will place the profile in a directory called scorep_APP_NCORES_sum, where APP is the name of the binary and NCORES the total number of cores used for the run.
If you need omplace with MPT, please do the following instead
export SCOREP_FILTERING_FILE=APP-scorep.filt
export SCOREP_METRIC_PAPI=PAPI_TOT_INS,PAPI_TOT_CYC,PERF_COUNT_HW_STALLED_CYCLES_BACKEND
export SCAN_TARGET=./app_instrumented
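followed by the usual launch, now with omplace inside the command line (a sketch; the placement arguments are omitted here):
scalasca -analyze -s mpirun ... omplace ./app_instrumented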
otherwise Scalasca will actually analyse the binary omplace, not your application.
To produce the summary profile report, run scalasca -examine on the profile directory.
scalasca -examine -s scorep_app_instrumented_NCORES_sum
ls scorep_app_instrumented_NCORES_sum
... scorep.score summary.cubex

This will add a Score-P scoring file scorep.score and the summary profile summary.cubex. The latter contains more metrics (for instance load balance) and may be viewed with Cube. The option -s suppresses opening the summary profile in Cube automatically; remove it if you would like to view it immediately. Cube will also be able to calculate basic POP metrics from summary profiles.
Generating a summary profile may take a long time and a lot of RAM, and may fail if there is not enough. Rough estimates:
time[h] = 6*profile_size_in_GB
RAM needed = 10*profile_size
Tracing with Scalasca
Doing full tracing, rather than just profiling, will allow Scalasca to do more analysis on your application and calculate further performance metrics. The procedure is similar to profiling with Scalasca, but the scalasca command looks a bit different and takes the option -t (tracing) rather than -s (summary profiling).
export SCOREP_TOTAL_MEMORY=10MB
export SCOREP_FILTERING_FILE=APP-scorep.filt
export SCOREP_METRIC_PAPI=PAPI_TOT_INS,PAPI_TOT_CYC,PERF_COUNT_HW_STALLED_CYCLES_BACKEND
scalasca -analyze -t mpirun ... ./app_instrumented
ls scorep_app_instrumented_NCORES_trace
MANIFEST.md profile.cubex scorep.cfg scorep.filter scorep.log
traces.def trace.stat traces traces.otf2
The command scalasca -analyze will execute the code and take a full trace (-t).
Scalasca will place the results in a directory called scorep_APP_NCORES_trace, where APP is the name of the binary and NCORES the total number of cores used for the run.
If you need omplace on Hawk, please do the following instead
export SCOREP_FILTERING_FILE=APP-scorep.filt
export SCOREP_METRIC_PAPI=PAPI_TOT_INS,PAPI_TOT_CYC,PERF_COUNT_HW_STALLED_CYCLES_BACKEND
export SCAN_TARGET=./app_instrumented
otherwise Scalasca will actually analyse the binary omplace, not your application.
To produce the summary profile report and the trace report, run scalasca -examine on the trace directory.
scalasca -examine -s scorep_app_instrumented_NCORES_trace
ls scorep_app_instrumented_NCORES_trace
... scorep.score summary.cubex trace.cubex

This will add a Score-P scoring file scorep.score, the summary profile summary.cubex, and the trace report trace.cubex. The trace report contains more metrics (for instance load balance and critical path analysis) and may be viewed with Cube. The option -s suppresses opening the trace report in Cube automatically; remove it if you would like to view it immediately. Cube will also be able to calculate full POP metrics from full traces.
Displaying POP efficiency metrics
Recent Scalasca versions can calculate POP efficiency metrics. To do so, load either a summary or trace report in Cube. Then select your focus of analysis (area of interest) in the central panel. Select the tab "General" at the right most edge of the window, select the tab "Advisor" at the top of the right panel, and click "Recalculate".
For summary reports, this will show the POP metrics Parallel Efficiency (time lost due to MPI and/or OpenMP), Loadbalance Efficiency (time lost due to computational imbalances in user code), and Communication Efficiency (time lost in MPI communication or OpenMP Synchronisation). If you loaded a trace, it will additionally break down Communication Efficiency into Transfer Efficiency (time spent in actual transfer of data) and Serialisation Efficiency (time lost due to communication pattern).
The POP plugin usually also calculates the metric "Stalled resources". For this to work for Hawk traces, you need to execute the following command before loading the trace:
MYCUBEX=summary.cubex
cube_derive -t postderived -e "metric::PERF_COUNT_HW_STALLED_CYCLES_BACKEND()" \
    -p root PAPI_RES_STL $MYCUBEX -o $MYCUBEX
cube_derive -t postderived -e "metric::PERF_COUNT_HW_STALLED_CYCLES_BACKEND() / metric::PAPI_TOT_CYC()" \
Replace summary.cubex with trace.cubex if necessary.
Further information
If you need further information on Score-P, Scalasca, and Cube, please have a look at the following resources:
- Slides presented at the 35th VI-HPS Tools Workshop (https://www.vi-hps.org/training/tws/tw35.html), sections "Day 3: Wednesday 16 September" and "Day 4: Thursday 17 September"
- Videos of talks given at the 34th VI-HPS Tools Workshop:
  - Intro to the tools: https://www.youtube.com/watch?v=dy_xwvYJIqE
  - Hands-on session: https://www.youtube.com/watch?v=ZwQ77vQ6Eg4
- Score-P User Manual: http://perftools.pages.jsc.fz-juelich.de/cicd/scorep/tags/scorep-7.1/pdf/scorep.pdf
- Scalasca User Guide: http://apps.fz-juelich.de/scalasca/releases/scalasca/2.6/docs/UserGuide.pdf
- Cube User Guide: https://apps.fz-juelich.de/scalasca/releases/cube/4.6/docs/CubeUserGuide.pdf
- Score-P and Scalasca Cheatsheet: https://kb.hlrs.de/platforms/upload/Score-P_and_Scalasca_Cheatsheet.pdf