Debugging On XC40
This article briefly lists some tools for debugging and monitoring your application.
ATP
In case of a segmentation fault, Cray's Abnormal Termination Processing (ATP) can print the stack trace of the failing process at the moment of the error, with minimal effort. Only an environment variable has to be set:
export ATP_ENABLED=1
More information can be found using:
man intro_atp
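As a minimal sketch (the file name and layout are assumptions, not part of the HLRS documentation), the following MPI program provokes a segmentation fault on rank 0; with ATP_ENABLED=1 set in the batch script, ATP reports the stack trace of the failing process when the application aborts:
/* segv_demo.c - hypothetical example: rank 0 dereferences a NULL
 * pointer, triggering the segmentation fault that ATP reports. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        int *p = NULL;
        *p = 42;                     /* segmentation fault here */
    }
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
The sketch would be built with the Cray compiler wrapper (cc segv_demo.c -o segv_demo.exe) and launched with aprun as usual.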
STAT
In case of a hanging application, Cray's Stack Trace Analysis Tool (STAT) can help identify deadlocks by presenting a merged stack trace of all processes. The tool can simply be attached to the running application by:
module load stat
STATGUI <JOBID>
Additional information can be found using:
module load stat; man intro_stat
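As an illustrative sketch (file name and numbers are made up), the following program hangs: rank 0 blocks in an MPI_Recv that is never matched, while all other ranks wait in MPI_Barrier. Attaching STATGUI to such a job would show one merged stack tree, with rank 0 stuck in MPI_Recv and the remaining ranks in MPI_Barrier:
/* hang_demo.c - hypothetical deadlock: rank 0 waits for a message
 * that no rank ever sends, so the barrier can never complete. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, buf = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        MPI_Recv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE); /* no matching send: blocks forever */
    MPI_Barrier(MPI_COMM_WORLD);     /* remaining ranks wait here */
    MPI_Finalize();
    return 0;
}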
Allinea DDT
For more complex issues, or in case of wrong results, a debugger can be utilized to monitor the application's behavior. Allinea DDT is a powerful parallel debugger. DDT is described in detail in the User Guide; its command line options are listed by “ddt --help”. Allinea DDT has a user-friendly graphical user interface, which can also start the batch job or connect to a running one. Nevertheless, in this article we concentrate on offline debugging, which avoids the need for an interactive session.
Start DDT in offline mode
DDT offline mode is started and controlled using the following command line options in the batch script, for example on three fully populated nodes:
#PBS -l nodes=3:ppn=24
...
cd $PBS_O_WORKDIR    # change into the directory the job was started from
module load ddt      # make DDT available
ddt --offline report.txt aprun -n 72 -N24 a.out param1
# alternatively, the DDT output file can be selected as an html version, e.g. report.html
Sessions
In the DDT GUI a session can be saved via the menu; it stores, e.g., break- and tracepoints in a file such as my.session. Thus, debugging settings from a small run can be transferred to a large-scale run, using the option:
ddt [...] --session=my.session
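For example, a session saved during a small interactive run could be replayed in a large offline run (job size and file names here are hypothetical):
ddt --offline report.txt --session=my.session aprun -n 1536 -N24 a.out param1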
Some options
Breakpoints can be defined using:
ddt [...] --break-at="main.c:22 if rank==0"
Conditions can additionally be set; in the example above the breakpoint only triggers when the variable rank equals 0.
Tracepoints can be set in a similar manner:
ddt [...] --trace-at=main.c:22,var1,var2
Here, the variables var1 and var2 are additionally logged.
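To make these options concrete, here is a hypothetical sketch of a main.c; the marked line stands in for line 22 from the option examples, and var1/var2 match the traced variables (the actual line number depends on the real file):
/* main.c - hypothetical sketch for the break-/tracepoint examples. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i, var1 = 0, var2 = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < 10; i++) {
        var1 += rank;
        var2 = var1 * i;   /* assumed line 22: break-/tracepoint target */
    }
    printf("rank %d: var1=%d var2=%d\n", rank, var1, var2);
    MPI_Finalize();
    return 0;
}
With --break-at="main.c:22 if rank==0" only rank 0 stops at the marked line, while --trace-at=main.c:22,var1,var2 logs both variables each time any process passes it.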
Memory debugging can be enabled, which will activate memory leak reports. Before the debugging run, the application has to be re-linked with the dmalloc library:
module load ddt-memdebug
make    # OR: cc ...
Then the memory debugging session is started using the following options:
ddt [...] --mem-debug=(fast|balanced|thorough)
ddt [...] --check-bounds=(after|before)
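As a hypothetical sketch of what these options detect, the following program leaks one allocation and writes one element past the end of a buffer; after re-linking as above, the leak appears in DDT's memory leak report and the off-by-one write is caught by --check-bounds=after:
/* memdebug_demo.c - hypothetical example with a memory leak and an
 * out-of-bounds write for DDT's memory debugging to detect. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    double *leaked = malloc(100 * sizeof(double)); /* never freed: leak */
    leaked[0] = 1.0;
    int *arr = malloc(10 * sizeof(int));
    arr[10] = 7;   /* one past the end: flagged by --check-bounds=after */
    free(arr);
    MPI_Finalize();
    return 0;
}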
More documentation can be found in the User Guide or using:
ddt --help
Example
In the following, a test application is analyzed with a breakpoint and a tracepoint (printing two variables); the report is written as a text file, using the command:
ddt --offline report.txt --verbose --break-at=dbgMPIapp_corr.c:51 --trace-at=dbgMPIapp_corr.c:25,rank,randomNr aprun -n 24 -N 24 dbgMPIapp.exe 10 9
Besides general job information, the output contains the tracepoint information:
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 5 to 95  rank: from 1 to 23
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 1 to 100 rank: from 1 to 23
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 4 to 99  rank: from 1 to 23
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 2 to 96  rank: from 1 to 23
...
where we can see how the variables change over time.
Furthermore, the breakpoint information is printed, e.g.:
message (6-23): Process stopped at breakpoint in bar (dbgMPIapp_corr.c:51).
message: Stacks
message: Processes  Function
message: 6-23       main (dbgMPIapp_corr.c:181)
message: 6-23       worker (dbgMPIapp_corr.c:120)
message: 6-23       bar (dbgMPIapp_corr.c:51)
message (n/a): Select process 6
message: Current Stack
message: #2 main (argc=3, argv=0x7fffffff2738) at /lustre/[...]/dbgMPIapp_corr.c:181 (at 0x00000000200017cb)
message: #1 worker (rank=32767, loopCount=-55504, burro=0) at /lustre/[...]/dbgMPIapp_corr.c:120 (at 0x000000002000154b)
message: #0 bar (loopCount=10, rank=6, burro=-2080374782) at /lustre/[...]/dbgMPIapp_corr.c:51 (at 0x0000000020001204)
message: Locals
message: buf:
         burro: -2080374782
         g_rank: 0 (from 0 to 17)
         g_size: 18
         i: 10922
         loopCount: 10
         rank: 6 (from 6 to 23)
         sum: -1293743216
         val: 0