- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -
Debugging On XC40
This article covers some tools for debugging and monitoring your application.
ATP
In case of a segmentation fault, Cray's Abnormal Termination Processing (ATP) provides a stack trace of the affected process at the moment of the error. It takes minimal effort: only one environment variable has to be set before running the application:
export ATP_ENABLED=1
More information can be found using
man intro_atp
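As a minimal sketch, the variable can also be set for the launch command alone instead of being exported for the whole job. The helper name run_with_atp is hypothetical and only serves as an illustration; it relies on nothing beyond the ATP_ENABLED variable described above.

```shell
# Hypothetical helper (not part of ATP): run one command with ATP enabled.
# The variable assignment applies only to the environment of that command.
run_with_atp() {
    ATP_ENABLED=1 "$@"
}

# Harmless demonstration: print the value seen by the child process.
run_with_atp sh -c 'echo "ATP_ENABLED=$ATP_ENABLED"'
```

In a job script, the launch line would then be prefixed with the helper, for example: run_with_atp aprun -n 72 -N 24 a.out param1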
STAT
If an application hangs, Cray's Stack Trace Analysis Tool (STAT) can help identify deadlocks by presenting a merged stack trace of all processes. The tool can simply be attached to the running application:
module load stat
STATGUI <JOBID>
Additional information can be found using:
module load stat; man intro_stat
Allinea DDT
More complex issues or wrong results can be investigated using a parallel debugger to monitor the application's behavior. Allinea DDT is described in detail in the User Guide, and its command line options are listed with "ddt --help". DDT has a user-friendly graphical interface, which can also start a batch job or connect to a running one. Furthermore, DDT can be run in batch mode (without an interactive session) using the offline debugging feature.
Start DDT in offline mode
DDT offline mode can be started and controlled using the following command line options in the batch script. For example, a job on three fully populated nodes:
#PBS -l nodes=3:ppn=24
...
cd $PBS_O_WORKDIR   # change to the directory the job was submitted from
module load forge   # make DDT available
ddt --offline report.txt aprun -n 72 -N24 a.out param1
# alternatively, the DDT output file can be written as HTML, e.g. report.html
This creates an ASCII text file containing the program output and the debugging report. Alternatively, the output format can be changed to HTML simply by giving the file the corresponding extension (e.g. report.html), which yields a more structured report including small graphics.
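The rank count passed to aprun follows from the node request: three nodes with 24 processes each give 72 ranks. The following small sketch derives that value from two shell variables and prints the resulting command line instead of executing it. NODES and PPN are illustrative names chosen for this example, not variables provided by PBS.

```shell
NODES=3     # must match the nodes= value of the #PBS request
PPN=24      # processes per node, must match the ppn= value
NRANKS=$((NODES * PPN))

# Print the launch line rather than running it, so it can be checked first.
echo "ddt --offline report.txt aprun -n ${NRANKS} -N ${PPN} a.out param1"
```

Keeping the two numbers in one place avoids a mismatch between the resource request and the aprun arguments when the job is scaled up.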
Some options
The File menu of the DDT GUI provides an option to save the session information, such as settings, breakpoints, and tracepoints. This way, a debugging session can be defined interactively in a small run and transferred to a large run using the option:
ddt [...] --session=my.session
Breakpoints can be defined in multiple ways, e.g. specifying the position in the source code:
ddt [...] --break-at="main.c:22 if rank==0"
A condition can also be attached, as in the example above, where the breakpoint only triggers when the variable rank equals 0.
Tracepoints can be specified in a similar manner:
ddt [...] --trace-at=myFunction,2:3:16,var1,var2
Here the function myFunction is traced and the variables var1 and var2 are recorded. The additional option 2:3:16 limits the tracing to every 3rd pass, starting from the 2nd pass up to the 16th pass.
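Taking the description above at face value (start at pass 2, record every 3rd pass, stop at pass 16), the recorded passes can be enumerated with a short loop. This is only an illustration of that reading of 2:3:16, not output produced by DDT, and the variable names are made up for the example.

```shell
start=2     # first pass that is recorded
every=3     # record every 3rd pass after that
stop=16     # no pass beyond this one is recorded

pass=$start
passes=""
while [ "$pass" -le "$stop" ]; do
    # Append the pass number, separated by a space after the first entry.
    passes="${passes}${passes:+ }${pass}"
    pass=$((pass + every))
done

echo "$passes"
```

Under this reading, five passes would be recorded: 2, 5, 8, 11, and 14.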
Memory debugging can be enabled, which will activate memory leak reports. Before the debugging run, the application has to be re-linked with the dmalloc library:
module load ddt-memdebug
make # OR cc ...
Then the memory debugging session is started using the following options:
ddt [...] --mem-debug=(fast|balanced|thorough)
ddt [...] --check-bounds=(after|before)
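A possible way to keep the memory debugging level configurable in a job script is sketched below. It assumes that --mem-debug is combined with the offline invocation shown earlier on this page; the function name ddt_memdebug_cmd is hypothetical, and the command line is only printed, not executed.

```shell
# Hypothetical helper: print a DDT offline command line with memory debugging.
# $1 is the level; only the three values listed above are accepted.
ddt_memdebug_cmd() {
    case "$1" in
        fast|balanced|thorough)
            echo "ddt --offline report.txt --mem-debug=$1 aprun -n 72 -N 24 a.out param1"
            ;;
        *)
            echo "unknown --mem-debug level: $1" >&2
            return 1
            ;;
    esac
}

ddt_memdebug_cmd balanced
```

Rejecting unknown levels early avoids spending queue time on a job that would stop at the option parser.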
More documentation can be found in the User Guide or using:
ddt --help
Example
In the following, a test application is investigated using a breakpoint and a tracepoint (printing two variables, rank and randomNr), and the output is written to report.txt:
ddt --offline report.txt --verbose --break-at=dbgMPIapp_corr.c:51 --trace-at=dbgMPIapp_corr.c:25,rank,randomNr aprun -n 24 -N 24 dbgMPIapp.exe 10 9
Besides general job information, the output contains the tracepoint information:
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 5 to 95 rank: from 1 to 23
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 1 to 100 rank: from 1 to 23
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 4 to 99 rank: from 1 to 23
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 2 to 96 rank: from 1 to 23
...
where we can see how the variables change from pass to pass.
Further, the breakpoint information is printed like:
message (6-23): Process stopped at breakpoint in bar (dbgMPIapp_corr.c:51).
message: Stacks
message: Processes Function
message: 6-23 main (dbgMPIapp_corr.c:181)
message: 6-23 worker (dbgMPIapp_corr.c:120)
message: 6-23 bar (dbgMPIapp_corr.c:51)
message (n/a): Select process 6
message: Current Stack
message: #2 main (argc=3, argv=0x7fffffff2738) at /lustre/[...]/dbgMPIapp_corr.c:181 (at 0x00000000200017cb)
message: #1 worker (rank=32767, loopCount=-55504, burro=0) at /lustre/[...]/dbgMPIapp_corr.c:120 (at 0x000000002000154b)
message: #0 bar (loopCount=10, rank=6, burro=-2080374782) at /lustre/[...]/dbgMPIapp_corr.c:51 (at 0x0000000020001204)
message: Locals
message: buf: burro: -2080374782 g_rank: 0 (from 0 to 17) g_size: 18 i: 10922 loopCount: 10 rank: 6 (from 6 to 23) sum: -1293743216 val: 0
where we can also inspect the arguments passed to the function calls on the stack.
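For long reports, standard text tools can give a quick overview before the file is read in full. The sketch below assumes that the text report keeps one record per line, that tracepoint records start with "tracepoint", and that breakpoint stops contain the phrase "Process stopped at breakpoint", as in the excerpts above. The function name summarize_report is hypothetical.

```shell
# Hypothetical helper: count tracepoint records and breakpoint stops in a
# DDT offline text report (file name passed as $1).
summarize_report() {
    report=$1
    # grep -c prints the number of matching lines; "|| true" keeps the
    # function usable under "set -e" when there are no matches.
    ntrace=$(grep -c '^tracepoint' "$report" || true)
    nbreak=$(grep -c 'Process stopped at breakpoint' "$report" || true)
    echo "tracepoint records: $ntrace"
    echo "breakpoint stops: $nbreak"
}
```

It would be called after the job has finished, for example as: summarize_report report.txt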