- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -
Debugging On XC40
This article covers some tools for debugging and monitoring your application.
ATP
In case of a segmentation fault, Cray's Abnormal Termination Processing (ATP) provides a stack trace of the affected process at the moment of the error. It takes minimal effort: only one environment variable has to be set before running the application:
export ATP_ENABLED=1
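As a small sketch of where the variable has to be visible, the helper below sets ATP_ENABLED only for the command it launches. The function name run_with_atp is hypothetical and not part of ATP, and a plain sh call stands in for the real launch line so that the sketch runs on any machine.

```shell
# Hypothetical helper (not part of ATP): run one command with ATP enabled.
run_with_atp() {
    ATP_ENABLED=1 "$@"
}

# Stand-in for the real launch line so the sketch runs anywhere.
# In a job script this would be e.g.:
#   run_with_atp aprun -n 72 -N 24 a.out param1
run_with_atp sh -c 'echo "ATP_ENABLED=$ATP_ENABLED"'
```

Compared to a global export, this keeps the setting limited to the one launch command; either way the variable must be set before the application starts.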
More information can be found using
man intro_atp
STAT
In case of a hanging application, Cray's Stack Trace Analysis Tool (STAT) can help identify deadlocks by presenting a merged stack trace of all processes. The tool can simply be attached to the running application by:
module load stat
STATGUI <JOBID>
Additional information can be found using:
module load stat; man intro_stat
Allinea DDT
More complex issues or wrong results can be investigated with a parallel debugger that monitors the application's behavior. Allinea DDT is described in detail in the User Guide, and its command line options are listed by “ddt --help”. DDT has a user-friendly graphical interface, which can also start the batch job or connect to a running one. This article, however, concentrates on offline debugging, which removes the need for an interactive session.
Start DDT in offline mode
DDT offline mode can be started and controlled using the following command line options in the batch script. For example, a job on three fully populated nodes:
#PBS -l nodes=3:ppn=24
...
cd $PBS_O_WORKDIR # change to the directory the job was started from
module load ddt # make DDT available
ddt --offline report.txt aprun -n 72 -N24 a.out param1
# alternatively the DDT output file can be selected as html version, e.g. report.html.
This creates an ASCII text file containing the program output and the debugging report. Alternatively, the output format can be changed to HTML simply by giving the file the corresponding extension. The result is a more structured report that includes small graphics.
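The process count in the launch line follows from the node request: 3 nodes with 24 processes each give 72 processes. The sketch below only prints the resulting command and does not run it. The function name ddt_offline_cmd and its argument order are hypothetical; the ddt and aprun options are the ones used above.

```shell
# Hypothetical helper: print (not run) the offline DDT command for a job
# that fills every node. Arguments: nodes, processes per node, report file,
# then the application and its arguments.
ddt_offline_cmd() {
    nodes=$1
    ppn=$2
    report=$3
    shift 3
    total=$((nodes * ppn))   # e.g. 3 nodes * 24 processes per node = 72
    echo "ddt --offline $report aprun -n $total -N $ppn $*"
}

# Plain-text report, then the HTML variant selected by the file extension.
ddt_offline_cmd 3 24 report.txt a.out param1
ddt_offline_cmd 3 24 report.html a.out param1
```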
Some options
The File menu of the DDT GUI provides an option to save the session information, such as settings, breakpoints and tracepoints. A debugging session can thus be defined interactively in a small run and then transferred to a large run by using the option:
ddt [...] --session=my.session
Breakpoints can be defined in multiple ways, e.g. specifying the position in the source code:
ddt [...] --break-at="main.c:22 if rank==0"
A condition can be attached as well, so that in this example the breakpoint only triggers when the variable rank equals 0.
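The quotes around the breakpoint specification matter because the condition contains spaces. The following sketch uses a hypothetical helper, count_args, to show how the shell splits the option with and without quotes; it does not call DDT.

```shell
# Hypothetical helper: report how many arguments the shell passes on.
count_args() {
    echo "$#"
}

# Quoted: location and condition arrive as one argument.
count_args --break-at="main.c:22 if rank==0"

# Unquoted: the shell splits the specification into three arguments.
count_args --break-at=main.c:22 if rank==0
```

The first call prints 1 and the second prints 3, so without the quotes the condition would no longer be part of the --break-at value.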
Tracepoints can be specified in a similar manner:
ddt [...] --trace-at=myFunction,2:3:16,var1,var2
Here the function myFunction is traced, recording the variables var1 and var2. The additional option 2:3:16 limits the tracing to every 3rd pass, starting from the 2nd pass and ending with the 16th pass.
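To make the pass selection concrete, the sketch below expands such a specification into the pass numbers it would cover. The helper traced_passes is hypothetical, and it assumes that both the first and the last pass are inclusive, which the description above does not state explicitly.

```shell
# Hypothetical helper: expand START:STEP:END into the traced pass numbers.
# Assumption: START and END are both inclusive.
traced_passes() {
    start=${1%%:*}      # text before the first colon
    rest=${1#*:}        # text after the first colon
    step=${rest%%:*}
    end=${rest#*:}
    pass=$start
    out=""
    while [ "$pass" -le "$end" ]; do
        out="$out${out:+ }$pass"
        pass=$((pass + step))
    done
    echo "$out"
}

traced_passes 2:3:16    # prints: 2 5 8 11 14
```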
Memory debugging can be enabled, which will activate memory leak reports. Before the debugging run, the application has to be re-linked with the dmalloc library:
module load ddt-memdebug
make # OR cc ...
Then the memory debugging session is started using the following options:
ddt [...] --mem-debug=(fast|balanced|thorough)
ddt [...] --check-bounds=(after|before)
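The notation (a|b|c) above means that exactly one of the listed values is chosen. As a sketch, a job script could validate the chosen values before building the option string. The helper mem_debug_opts is hypothetical, and passing both options on the same command line is an assumption that is not confirmed here.

```shell
# Hypothetical helper: validate the values and print the two options.
# Assumption: --mem-debug and --check-bounds may be combined in one call.
mem_debug_opts() {
    level=$1
    bounds=$2
    case "$level" in
        fast|balanced|thorough) ;;
        *) echo "unknown --mem-debug level: $level" >&2; return 1 ;;
    esac
    case "$bounds" in
        after|before) ;;
        *) echo "unknown --check-bounds value: $bounds" >&2; return 1 ;;
    esac
    echo "--mem-debug=$level --check-bounds=$bounds"
}

mem_debug_opts balanced after
```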
More documentation can be found in the User Guide or using:
ddt --help
Example
In the following, a test application is investigated using a breakpoint and a tracepoint (printing two variables, rank and randomNr), and the output is written to report.txt:
ddt --offline report.txt --verbose --break-at=dbgMPIapp_corr.c:51 --trace-at=dbgMPIapp_corr.c:25,rank,randomNr aprun -n 24 -N 24 dbgMPIapp.exe 10 9
Besides general job information, the output contains the tracepoint information:
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 5 to 95 rank: from 1 to 23
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 1 to 100 rank: from 1 to 23
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 4 to 99 rank: from 1 to 23
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 2 to 96 rank: from 1 to 23
...
where we can see how the variables change from pass to pass.
The breakpoint information is printed as follows:
message (6-23): Process stopped at breakpoint in bar (dbgMPIapp_corr.c:51).
message: Stacks
message: Processes Function
message: 6-23 main (dbgMPIapp_corr.c:181)
message: 6-23 worker (dbgMPIapp_corr.c:120)
message: 6-23 bar (dbgMPIapp_corr.c:51)
message (n/a): Select process 6
message: Current Stack
message: #2 main (argc=3, argv=0x7fffffff2738) at /lustre/[...]/dbgMPIapp_corr.c:181 (at 0x00000000200017cb)
message: #1 worker (rank=32767, loopCount=-55504, burro=0) at /lustre/[...]/dbgMPIapp_corr.c:120 (at 0x000000002000154b)
message: #0 bar (loopCount=10, rank=6, burro=-2080374782) at /lustre/[...]/dbgMPIapp_corr.c:51 (at 0x0000000020001204)
message: Locals
message: buf: burro: -2080374782 g_rank: 0 (from 0 to 17) g_size: 18 i: 10922 loopCount: 10 rank: 6 (from 6 to 23) sum: -1293743216 val: 0
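For longer runs, the per-pass ranges can be pulled out of the text report with standard tools. The sketch below assumes that each tracepoint record sits on a single line in the format of the records above; the helper name randomnr_ranges is hypothetical, and two records from this example serve as inline input instead of a real report.txt.

```shell
# Hypothetical helper: print "min max" of randomNr for each tracepoint record.
# Assumption: one tracepoint record per input line.
randomnr_ranges() {
    sed -n 's/^tracepoint .*randomNr: from \([0-9]*\) to \([0-9]*\).*/\1 \2/p'
}

# Inline sample copied from the report above; with a real report this
# would be: randomnr_ranges < report.txt
randomnr_ranges <<'EOF'
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 5 to 95 rank: from 1 to 23
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 1 to 100 rank: from 1 to 23
EOF
```

Lines that are not tracepoint records are skipped, so the job information and breakpoint messages in the same report do not disturb the result.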
where we can also inspect the arguments passed to each function call on the stack.