- Infos im HLRS Wiki sind nicht rechtsverbindlich und ohne Gewähr -
- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

Debugging On XC40

From HLRS Platforms
Jump to navigationJump to search

This article treat some tools for debugging and monitoring you application.

ATP

In case of segmentation faults, CRAYs Abnormal Termination Processing ( ATP) provides an application stack trace at the moment of the error of the related process. It can be obtained with a minimal amount of effort, only one environmental variable has to be set, before running the appliaction:

export ATP_ENABLED=1

More information can be found using

 man intro_atp 

STAT

In case of a hanging application, CRAYs Stack Trace Analysis Tool ( STAT) can help identifying dead locks by presenting a merged Stack Trace of all processes. The tool can simply attached to the running application by:

module load stat
STATGUI <JOBID> 

Additional information can be found using:

 module load stat; man intro_stat 

Allinea DDT

More complex issues or wrong results can be investigated using a parallel debugger to monitor the applications behavior. Allinea DDT is described in detail in the | User Guide and command line options are listed using “ddt --help”. DDT has a user-friendly graphical interface, which also has the capability to start the batch job or connect to a running one. Furthermore, DDT can be run in batch mode (not in an interactive session) using the offline debugging feature.

Start DDT in offline mode

DDT offline mode can be started and controlled using the following command line options in the batch script. For example, a job on three fully populated nodes:

#PBS -l nodes=3:ppn=24
... 
cd $PBS_O_WORKDIR # change in the directory where the job is started from
module load forge   # make DDT available

ddt --offline report.txt aprun -n 72 -N24 a.out param1  # alternatively the DDT output file can be selected as html version, e.g. report.html.

This creates a ASCII text file with the program output and the debugging report. Alternatively, the output file format can be changed to html, simply by specifying the related file ending. As a result, a more structured output including small graphics is presented.

Some options

The file menu of the DDT GUI provides an option to save the session information, like settings, break- and tracepoints. Therewith a debugging session can be defined interactively in a small run and be transferred to large run, by using the option:

 ddt [...] --session=my.session 

Breakpoints can be defined in multiple ways, e.g. specifying the position in the source code:

 ddt [...] --break-at="main.c:22 if rank==0"

Furthermore, conditions can be set additionally, e.g. only triggering when variable rank equals 0.

Tracepoints can be specified in a similar manner:

 ddt [...] --trace-at=myFunction,2:3:16,var1,var2 

Here the function myFunction is traced, especially the variables var1 and var2. The tracing is here limited with the additional option to every 3 pass starting from the 2 pass up to the 16 pass.

Memory debugging can be enabled, which will activate memory leak reports. Before the debugging run, the application has to be re-linked with the dmalloc library:

 module load ddt-memdebug
 make # OR cc ... 

Then the memory debugging session is started using the following options:

 ddt [...] --mem-debug=(fast|balanced|thorough) 
 ddt [...] --check-bounds=(after|before) 


More documentation can be found in the User Guide or using:

ddt --help

Example

In the following, a test application is investigates using a breakpoint, a tracepoint (printing two variables, rank and randonNr), and the output is written in report.txt: "ddt --offline report.txt --verbose --break-at=dbgMPIapp_corr.c:51 --trace-at=dbgMPIapp_corr.c:25,rank,randomNr aprun -n 24 -N 24 dbgMPIapp.exe 10 9". Beside general job information, the output consists of the tracepoint information:

tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 5 to 95 rank: from 1 to 23
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 1 to 100 rank: from 1 to 23
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 4 to 99 rank: from 1 to 23
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 2 to 96 rank: from 1 to 23
...

where we can see how the variables changes from pass to pass.

Further, the breakpoint information are printed like:

message (6-23): Process stopped at breakpoint in bar (dbgMPIapp_corr.c:51).
message: Stacks
message: Processes Function                            
message: 6-23      main (dbgMPIapp_corr.c:181)         
message: 6-23        worker (dbgMPIapp_corr.c:120) 
message: 6-23          bar (dbgMPIapp_corr.c:51)   
message (n/a): Select process 6
message: Current Stack
message: #2 main (argc=3, argv=0x7fffffff2738) at /lustre/[...]/dbgMPIapp_corr.c:181 (at 0x00000000200017cb)
message: #1 worker (rank=32767, loopCount=-55504, burro=0) at /lustre/[...]/dbgMPIapp_corr.c:120 (at 0x000000002000154b)
message: #0 bar (loopCount=10, rank=6, burro=-2080374782) at /lustre/[...]/dbgMPIapp_corr.c:51 (at 0x0000000020001204)
message: Locals
message: buf: burro: -2080374782 g_rank: 0 (from 0 to 17) g_size: 18 i: 10922 loopCount: 10 rank: 6 (from 6 to 23) sum: -1293743216 val: 0

where we can also inspect the handed variables in the function calls of the stack.