- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

Debugging On XC40


This article briefly lists some tools for debugging and monitoring your application.

ATP

In case of a segmentation fault, Cray's Abnormal Termination Processing (ATP) can print a stack trace of the failing process at the moment of the error, with minimal effort. Only an environment variable has to be set:

export ATP_ENABLED=1

More information can be found using

 man intro_atp 
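
As a minimal sketch, a batch job with ATP enabled could look as follows (the node count and the executable name a.out are assumptions, not HLRS-specific settings):

#PBS -l nodes=1:ppn=24
#PBS -l walltime=00:10:00

cd $PBS_O_WORKDIR    # run from the submission directory
export ATP_ENABLED=1 # let ATP act on abnormal termination
aprun -n 24 -N 24 a.out

On a crash, ATP prints the stack trace of the first failing process to stderr and typically also writes a merged backtrace file that can be viewed with STAT.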

STAT

In case of a hanging application, Cray's Stack Trace Analysis Tool (STAT) can help identify deadlocks by presenting a merged stack trace of all processes. The tool can simply be attached to a running application by:

module load stat
STATGUI <JOBID> 

Additional information can be found using:

 module load stat; man intro_stat 
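
A typical workflow could look like this (the job ID 123456 is hypothetical):

qstat -u $USER # find the ID of the hanging job, e.g. 123456
module load stat
STATGUI 123456 # attach to the job and display the merged stack traces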

Allinea DDT

For more complex issues, or in case of wrong results, a debugger can be used to monitor the application's behavior. Allinea DDT is a powerful parallel debugger. DDT is described in detail in its User Guide, and its command line options are listed by “ddt --help”. Allinea DDT has a user-friendly graphical user interface, which can also start a batch job or connect to a running one. Nevertheless, this article concentrates on offline debugging, which avoids the need for an interactive session.

Start DDT in offline mode

DDT's offline mode is started and controlled via command line options in the batch script, for example on three fully populated nodes:

#PBS -l nodes=3:ppn=24
... 
cd $PBS_O_WORKDIR # change to the directory the job was submitted from
module load ddt   # make DDT available

ddt --offline report.txt aprun -n 72 -N 24 a.out param1

Alternatively, the DDT output file can be written as HTML by choosing an .html name, e.g. report.html.
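
The job is then submitted as usual, e.g. (the script name job_debug.pbs is hypothetical):

 qsub job_debug.pbs

After the job has finished, the generated report.txt (or report.html) contains the collected debugging output.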

Sessions

In the DDT GUI a session can be saved via the menu; this stores, e.g., break- and tracepoints in a file such as my.session. Thus debugging settings prepared in a small run can be transferred to a large-scale run, using the option:

 ddt [...] --session=my.session 
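
For example, a session saved from a small run in the GUI could be reused in a large offline run (file names and process counts are hypothetical):

 ddt --offline report.html --session=my.session aprun -n 1536 -N 24 a.out param1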

Some options

Breakpoints can be defined using:

 ddt [...] --break-at="main.c:22 if rank==0"

Conditions can additionally be set; the breakpoint above, for example, only triggers when the variable rank equals 0.

Tracepoints can be set in a similar manner:

 ddt [...] --trace-at=main.c:22,var1,var2 

Here, the variables var1 and var2 are additionally logged.

Memory debugging can be enabled, which activates memory leak reports. Before the debugging run, the application has to be re-linked against the dmalloc library:

 module load ddt-memdebug
 make # OR cc ... 

Then the memory debugging session is started using the following options:

 ddt [...] --mem-debug=(fast|balanced|thorough) 
 ddt [...] --check-bounds=(after|before) 
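
Putting this together, a memory-debugging workflow could look like this (the compile line and file names are assumptions):

module load ddt-memdebug
cc -o a.out main.c # re-link so the dmalloc library is picked up
ddt --offline report.txt --mem-debug=thorough --check-bounds=after aprun -n 24 -N 24 a.out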


More documentation can be found in the User Guide or using:

ddt --help

Example

In the following, a test application is analyzed with a breakpoint and a tracepoint (printing two variables), and the report is written as a text file, using the command:

 ddt --offline report.txt --verbose --break-at=dbgMPIapp_corr.c:51 --trace-at=dbgMPIapp_corr.c:25,rank,randomNr aprun -n 24 -N 24 dbgMPIapp.exe 10 9

Besides general job information, the output contains the tracepoint information:

tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 5 to 95 rank: from 1 to 23
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 1 to 100 rank: from 1 to 23
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 4 to 99 rank: from 1 to 23
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 2 to 96 rank: from 1 to 23
...

where we can see how the variables change over time.

Further, the breakpoint information is printed, e.g.:

message (6-23): Process stopped at breakpoint in bar (dbgMPIapp_corr.c:51).
message: Stacks
message: Processes Function                            
message: 6-23      main (dbgMPIapp_corr.c:181)         
message: 6-23        worker (dbgMPIapp_corr.c:120) 
message: 6-23          bar (dbgMPIapp_corr.c:51)   
message (n/a): Select process 6
message: Current Stack
message: #2 main (argc=3, argv=0x7fffffff2738) at /lustre/[...]/dbgMPIapp_corr.c:181 (at 0x00000000200017cb)
message: #1 worker (rank=32767, loopCount=-55504, burro=0) at /lustre/[...]/dbgMPIapp_corr.c:120 (at 0x000000002000154b)
message: #0 bar (loopCount=10, rank=6, burro=-2080374782) at /lustre/[...]/dbgMPIapp_corr.c:51 (at 0x0000000020001204)
message: Locals
message: buf: burro: -2080374782 g_rank: 0 (from 0 to 17) g_size: 18 i: 10922 loopCount: 10 rank: 6 (from 6 to 23) sum: -1293743216 val: 0