- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

Debugging On XC40

This article describes some tools for debugging and monitoring your application.

ATP

In case of segmentation faults, Cray's Abnormal Termination Processing (ATP) provides an application stack trace of the affected process at the moment of the error. It can be obtained with minimal effort; only one environment variable has to be set before running the application:

export ATP_ENABLED=1

More information can be found using

 man intro_atp 
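
As a minimal sketch of ATP in context (node count and executable name are placeholders), a PBS batch job could enable it like this:

#PBS -l nodes=1:ppn=24
cd $PBS_O_WORKDIR          # change to the directory the job was started from
export ATP_ENABLED=1       # enable Abnormal Termination Processing
aprun -n 24 -N 24 a.out    # on a segmentation fault, ATP prints a stack trace of the crashing process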

STAT

In case of a hanging application, Cray's Stack Trace Analysis Tool (STAT) can help identify deadlocks by presenting a merged stack trace of all processes. The tool can simply be attached to the running application by:

module load stat
STATGUI <JOBID> 

Additional information can be found using:

 module load stat; man intro_stat 
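
For example (the job ID shown is hypothetical), attaching STAT to a hanging batch job could look like:

module load stat
qstat -u $USER    # look up the ID of the hanging job
STATGUI 123456    # attach to that job and inspect the merged stack trace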

Allinea DDT

More complex issues or wrong results can be investigated using a parallel debugger to monitor the application's behavior. Allinea DDT is described in detail in the User Guide, and command line options are listed using "ddt --help". DDT has a user-friendly graphical interface, which can also start a batch job or connect to a running one. Nevertheless, this article concentrates on offline debugging, which removes the requirement of an interactive session.

Start DDT in offline mode

DDT offline mode can be started and controlled using the following command line options in the batch script. For example, a job on three fully populated nodes:

#PBS -l nodes=3:ppn=24
... 
cd $PBS_O_WORKDIR # change in the directory where the job is started from
module load ddt   # make DDT available

ddt --offline report.txt aprun -n 72 -N 24 a.out param1  # the output file can also be written as HTML, e.g. report.html

This creates an ASCII text file with the program output and the debugging report. Alternatively, the output file format can be changed to HTML simply by specifying the corresponding file extension. As a result, a more structured report including small graphics is produced.

Some options

The File menu of the DDT GUI provides an option to save session information such as settings, breakpoints, and tracepoints. This way, a debugging session can be defined interactively in a small run and transferred to a large run, using the option:

 ddt [...] --session=my.session 
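
A possible workflow (executable name and session file are assumptions): record the session interactively in a small run, then reuse it in a large offline run:

ddt aprun -n 24 -N 24 a.out   # small interactive run; save my.session via the File menu
ddt --offline report.txt --session=my.session aprun -n 72 -N 24 a.out   # large run reusing the saved settings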

Breakpoints can be defined in multiple ways, e.g. specifying the position in the source code:

 ddt [...] --break-at="main.c:22 if rank==0"

Furthermore, conditions can be added, e.g. triggering only when the variable rank equals 0.

Tracepoints can be specified in a similar manner:

 ddt [...] --trace-at=myFunction,2:3:16,var1,var2 

Here the function myFunction is traced, in particular the variables var1 and var2. The additional option 2:3:16 limits the tracing to every 3rd pass, starting from the 2nd pass up to the 16th pass.

Memory debugging can be enabled, which activates memory leak reports. Before the debugging run, the application has to be re-linked against the dmalloc library:

 module load ddt-memdebug
 make # OR cc ... 

Then the memory debugging session is started using the following options:

 ddt [...] --mem-debug=(fast|balanced|thorough) 
 ddt [...] --check-bounds=(after|before) 
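
Putting both steps together, a complete memory-debugging run might look like this (source and report file names are placeholders):

module load ddt-memdebug
cc -o a.out myApp.c   # re-link the application against the dmalloc library
ddt --offline report.html --mem-debug=thorough --check-bounds=after aprun -n 24 -N 24 a.out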


More documentation can be found in the User Guide or using:

ddt --help

Example

In the following, a test application is investigated using a breakpoint and a tracepoint (printing two variables, rank and randomNr), and the output is written to report.txt: "ddt --offline report.txt --verbose --break-at=dbgMPIapp_corr.c:51 --trace-at=dbgMPIapp_corr.c:25,rank,randomNr aprun -n 24 -N 24 dbgMPIapp.exe 10 9". Besides general job information, the output consists of the tracepoint information:

tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 5 to 95 rank: from 1 to 23
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 1 to 100 rank: from 1 to 23
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 4 to 99 rank: from 1 to 23
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 2 to 96 rank: from 1 to 23
...

where we can see how the variables change from pass to pass.

Furthermore, the breakpoint information is printed like:

message (6-23): Process stopped at breakpoint in bar (dbgMPIapp_corr.c:51).
message: Stacks
message: Processes Function                            
message: 6-23      main (dbgMPIapp_corr.c:181)         
message: 6-23        worker (dbgMPIapp_corr.c:120) 
message: 6-23          bar (dbgMPIapp_corr.c:51)   
message (n/a): Select process 6
message: Current Stack
message: #2 main (argc=3, argv=0x7fffffff2738) at /lustre/[...]/dbgMPIapp_corr.c:181 (at 0x00000000200017cb)
message: #1 worker (rank=32767, loopCount=-55504, burro=0) at /lustre/[...]/dbgMPIapp_corr.c:120 (at 0x000000002000154b)
message: #0 bar (loopCount=10, rank=6, burro=-2080374782) at /lustre/[...]/dbgMPIapp_corr.c:51 (at 0x0000000020001204)
message: Locals
message: buf: burro: -2080374782 g_rank: 0 (from 0 to 17) g_size: 18 i: 10922 loopCount: 10 rank: 6 (from 6 to 23) sum: -1293743216 val: 0

where we can also inspect the variables handed to the function calls of the stack.