- Information contained in the HLRS Wiki is not legally binding and is provided without warranty; HLRS is not responsible for any damages that might result from its use -

Debugging On XC40: Difference between revisions

No edit summary
 
(8 intermediate revisions by one other user not shown)
Line 1: Line 1:
In this article some tools are briefly listed to debug and monitor you application.  
This article treat some tools for debugging and monitoring you application.  


== ATP ==
== ATP ==
In case of a segmentation fault, CRAYs Abnormal Termination Processing ([[CRAY_XC40_Tools#ATP : Abnormal Termination Processing | ATP]]) can print an application stack trace at the moment of the error of this process, with a minimal amount of effort.  Only an environmental variable has to be set:
In case of '''segmentation faults''', CRAYs Abnormal Termination Processing ([[CRAY_XC40_Tools#ATP : Abnormal Termination Processing | ATP]]) provides an application stack trace at the moment of the error of the related process. It can be obtained with a minimal amount of effort, only one environmental variable has to be set, before running the appliaction:
<pre>export ATP_ENABLED=1</pre>
<pre>export ATP_ENABLED=1</pre>
More information can be found using  
More information can be found using  
Line 8: Line 8:


== STAT ==
== STAT ==
In case of a hanging application, CRAYs Stack Trace Analysis Tool ([[CRAY_XC40_Tools#STAT : Stack Trace Analysis Tool | STAT]]) can help identifying dead locks by presenting merged Stack Trace of all processes. The tool can simply attached to the running application by:
In case of a '''hanging application''', CRAYs Stack Trace Analysis Tool ([[CRAY_XC40_Tools#STAT : Stack Trace Analysis Tool | STAT]]) can help identifying '''dead locks''' by presenting a merged Stack Trace of all processes. The tool can simply attached to the running application by:
<pre>module load stat
<pre>module load stat
STATGUI <JOBID> </pre>
STATGUI <JOBID> </pre>
Line 15: Line 15:


== Allinea DDT ==
== Allinea DDT ==
For more complex issues or in case of wrong results a debugger can be utilized to monitor the applications behavior. [https://www.allinea.com/products/develop-allinea-forge  Allinea DDT] is a powerful parallel debugger. DDT is described in detail in the [http://content.allinea.com/downloads/userguide-forge.pdf| User Guide] and command line options using “ddt --help”. Allinea DDT has a user-friendly graphical user interface, which also has the capability to start the batch job or connect to a running one.  
More complex issues or wrong results can be investigated using a parallel debugger to monitor the applications behavior. [https://www.allinea.com/products/develop-allinea-forge  Allinea DDT] is described in detail in the [http://content.allinea.com/downloads/userguide-forge.pdf | User Guide] and command line options are listed using “ddt --help”. DDT has a user-friendly graphical interface, which also has the capability to start the batch job or connect to a running one.  
Nevertheless, in this article we concentrate on ''offline debugging'', thus we get rid of the requirements of a interactive session.  
Furthermore, DDT can be run in batch mode (not in an interactive session) using the '''offline debugging''' feature.


=== Start DDT in offline mode ===
=== Start DDT in offline mode ===
DDT offline mode is started and controlled using the following command line options in the batch script, for example on three fully populated nodes:
DDT offline mode can be started and controlled using the following command line options in the batch script. For example, a job on three fully populated nodes:
<pre>
<pre>
#PBS -l nodes=3:ppn=24
#PBS -l nodes=3:ppn=24
...  
...  
cd $PBS_O_WORKDIR # change in the directory where the job is started from
cd $PBS_O_WORKDIR # change in the directory where the job is started from
module load ddt   # make DDT available
module load forge   # make DDT available


ddt --offline report.txt aprun -n 72 -N24 a.out param1  # alternatively the DDT output file can be selected as html version, e.g. report.html.
ddt --offline report.txt aprun -n 72 -N24 a.out param1  # alternatively the DDT output file can be selected as html version, e.g. report.html.
</pre>
</pre>
This creates a ASCII text file with the program output and the debugging report. Alternatively, the output file format can be changed to html, simply by specifying the related file ending. As a result, a more structured output including small graphics is presented.


=== Sessions ===
=== Some options ===
In the DDT GUI a session can be saved in the menu, which stores, e.g. break- and tracepoints into a file, e.g. my.session. Thus debugging steps of a small run can be transferred to a large scale run, using the option:
The file menu of the DDT GUI provides an option to save the '''session information''', like settings, break- and tracepoints. Therewith a debugging session can be defined interactively in a small run and be transferred to large run, by using the option:
<pre> ddt [...] --session=my.session </pre>
<pre> ddt [...] --session=my.session </pre>


=== Some options ===
'''Breakpoints''' can be defined in multiple ways, e.g. specifying the position in the source code:  
Breakpoints can be defined using:  
<pre> ddt [...] --break-at="main.c:22 if rank==0"</pre>
<pre> ddt [...] --break-at="main.c:22 if rank==0"</pre>
Further conditions can be additionally set, e.g. only triggering when variable rank equals 0.
Furthermore, conditions can be set additionally, e.g. only triggering when variable rank equals 0.


Tracepoints can be set in a similar manner:  
'''Tracepoints''' can be specified in a similar manner:  
<pre> ddt [...] --trace-at=main.c:22,var1,var2 </pre>
<pre> ddt [...] --trace-at=myFunction,2:3:16,var1,var2 </pre>
Here additionally variables '''var1''' and '''var2''' are logged.  
Here the function ''myFunction'' is traced, especially the variables var1 and var2. The tracing is here limited with the additional option to every 3 pass starting from the 2 pass up to the 16 pass.  
<!--On program pause expressions can be evaluated using, e.g.
<!--On program pause expressions can be evaluated using, e.g.
  <pre> ddt [...] --evaluate="rank / i" </pre> -->
  <pre> ddt [...] --evaluate="rank / i" </pre> -->


Memory debugging can be enabled, which will activate memory leak reports. Before the debugging run, the application has to be re-linked with the dmalloc library:
'''Memory debugging''' can be enabled, which will activate memory leak reports. Before the debugging run, the application has to be re-linked with the dmalloc library:
<pre> module load ddt-memdebug
<pre> module load ddt-memdebug
  make # OR cc ... </pre>
  make # OR cc ... </pre>
Line 57: Line 57:


=== Example ===
=== Example ===
In the following a test application is analyzed with a breakpoint, a tracepoint (printing two variables), which is written as txt, using the command: "ddt --offline report.txt --verbose --break-at=dbgMPIapp_corr.c:51 --trace-at=dbgMPIapp_corr.c:25,rank,randomNr aprun -n 24 -N 24 dbgMPIapp.exe 10 9".  
In the following, a test application is investigates using a breakpoint, a tracepoint (printing two variables, rank and randonNr), and the output is written in report.txt: "ddt --offline report.txt --verbose --break-at=dbgMPIapp_corr.c:51 --trace-at=dbgMPIapp_corr.c:25,rank,randomNr aprun -n 24 -N 24 dbgMPIapp.exe 10 9".  
The output consists beside general job information on the tracepoint information:
Beside general job information, the output consists of the tracepoint information:
<pre>
<pre>
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 5 to 95 rank: from 1 to 23
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 5 to 95 rank: from 1 to 23
Line 66: Line 66:
...
...
</pre>
</pre>
where we can see how the variables changes from time to time.  
where we can see how the variables changes from pass to pass.  


Further, the breakpoint information are printed, e.g:
Further, the breakpoint information are printed like:
<pre>
<pre>
message (6-23): Process stopped at breakpoint in bar (dbgMPIapp_corr.c:51).
message (6-23): Process stopped at breakpoint in bar (dbgMPIapp_corr.c:51).
Line 84: Line 84:
message: buf: burro: -2080374782 g_rank: 0 (from 0 to 17) g_size: 18 i: 10922 loopCount: 10 rank: 6 (from 6 to 23) sum: -1293743216 val: 0
message: buf: burro: -2080374782 g_rank: 0 (from 0 to 17) g_size: 18 i: 10922 loopCount: 10 rank: 6 (from 6 to 23) sum: -1293743216 val: 0
</pre>
</pre>
where we can also inspect the handed variables in the function calls of the stack.

Latest revision as of 15:04, 26 July 2017

This article covers some tools for debugging and monitoring your application.

ATP

In case of segmentation faults, Cray's Abnormal Termination Processing (ATP) provides an application stack trace of the affected process at the moment of the error. It takes minimal effort: only one environment variable has to be set before running the application:

export ATP_ENABLED=1

More information can be found using

 man intro_atp 
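
As a minimal sketch of where the variable goes in a batch job: the node layout and the aprun call are borrowed from the DDT example further down, while the executable name ./a.out and the fallback branch for machines without aprun are illustrative assumptions.

```shell
#!/bin/bash
#PBS -l nodes=3:ppn=24

# Change to the submission directory (falls back to the current directory
# when the script is run outside of PBS).
cd "${PBS_O_WORKDIR:-.}"

# Enable ATP before the application is launched.
export ATP_ENABLED=1

if command -v aprun >/dev/null 2>&1; then
    # ./a.out and param1 are placeholders for the real application.
    aprun -n 72 -N 24 ./a.out param1
else
    # Not on a Cray login/MOM node: only report what would be set.
    echo "aprun not found; ATP_ENABLED=$ATP_ENABLED is set for the launch"
fi
```

The export has to come before the aprun line so that the launched processes inherit it.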

STAT

In case of a hanging application, Cray's Stack Trace Analysis Tool (STAT) can help identify deadlocks by presenting a merged stack trace of all processes. The tool can simply be attached to the running application by:

module load stat
STATGUI <JOBID> 

Additional information can be found using:

 module load stat; man intro_stat 

Allinea DDT

More complex issues or wrong results can be investigated using a parallel debugger to monitor the application's behavior. Allinea DDT is described in detail in the User Guide, and its command line options are listed by “ddt --help”. DDT has a user-friendly graphical interface, which can also start the batch job or connect to a running one. Furthermore, DDT can be run in batch mode (without an interactive session) using the offline debugging feature.

Start DDT in offline mode

DDT offline mode is started and controlled through command line options in the batch script. For example, for a job on three fully populated nodes:

#PBS -l nodes=3:ppn=24
... 
cd $PBS_O_WORKDIR # change to the directory the job was submitted from
module load forge   # make DDT available

ddt --offline report.txt aprun -n 72 -N24 a.out param1  # alternatively, an HTML report can be requested by naming the file accordingly, e.g. report.html

This creates an ASCII text file containing the program output and the debugging report. Alternatively, the output format can be switched to HTML simply by giving the report file an .html extension, which yields a more structured report including small graphics.
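
The rank count passed to aprun follows from the node request (3 nodes × 24 processes per node = 72). The following sketch derives it and only prints the resulting command line instead of running it; the variable names and the executable ./a.out are illustrative assumptions.

```shell
# Values matching "#PBS -l nodes=3:ppn=24" from the batch script above.
nodes=3
ppn=24

# Total number of ranks for a fully populated job.
total=$((nodes * ppn))

# Report file: the .html ending selects the HTML report format.
report=report.html

# Dry run: print the command line instead of executing it.
cmd="ddt --offline $report aprun -n $total -N $ppn ./a.out param1"
echo "$cmd"
```

Printing the command first is a cheap way to check the numbers before the job spends time in the queue.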

Some options

The File menu of the DDT GUI provides an option to save the session information, such as settings, breakpoints, and tracepoints. A debugging session can thus be defined interactively in a small run and transferred to a large run by using the option:

 ddt [...] --session=my.session 

Breakpoints can be defined in multiple ways, e.g. by specifying the position in the source code:

 ddt [...] --break-at="main.c:22 if rank==0"

A condition can be added as well; here the breakpoint only triggers when the variable rank equals 0.

Tracepoints can be specified in a similar manner:

 ddt [...] --trace-at=myFunction,2:3:16,var1,var2 

Here the function myFunction is traced and the variables var1 and var2 are logged. The additional range 2:3:16 limits the tracing to every 3rd pass, starting from the 2nd pass up to the 16th pass.
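
To make the range concrete, the passes it selects can be listed with seq. This is only an illustration of the start:step:end reading given above, not a DDT feature, and the variable names are hypothetical.

```shell
# Range from the tracepoint above, read as start:step:end.
start=2
step=3
end=16

# List the selected passes on one line.
passes=$(seq "$start" "$step" "$end" | paste -sd' ' -)
echo "traced passes: $passes"
```

With these values the tracepoint would log passes 2, 5, 8, 11 and 14.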

Memory debugging can be enabled, which will activate memory leak reports. Before the debugging run, the application has to be re-linked with the dmalloc library:

 module load ddt-memdebug
 make # OR cc ... 

Then the memory debugging session is started using the following options:

 ddt [...] --mem-debug=(fast|balanced|thorough) 
 ddt [...] --check-bounds=(after|before) 
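
Since --mem-debug accepts only one of three levels, a batch script can check the value before submitting a long run. The helper below is a hypothetical sketch (mem_debug_opt is not part of DDT), and the final line only prints a command line, with ./a.out as a placeholder executable.

```shell
# Return the --mem-debug option for a valid level, fail otherwise.
mem_debug_opt() {
    case "$1" in
        fast|balanced|thorough)
            printf '%s\n' "--mem-debug=$1"
            ;;
        *)
            echo "unknown memory debugging level: $1" >&2
            return 1
            ;;
    esac
}

# Dry run: print the resulting offline command line.
opt=$(mem_debug_opt balanced) &&
    echo "ddt --offline report.txt $opt aprun -n 72 -N 24 ./a.out param1"
```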


More documentation can be found in the User Guide or using:

ddt --help

Example

In the following, a test application is investigated using a breakpoint and a tracepoint (printing two variables, rank and randomNr), with the output written to report.txt: "ddt --offline report.txt --verbose --break-at=dbgMPIapp_corr.c:51 --trace-at=dbgMPIapp_corr.c:25,rank,randomNr aprun -n 24 -N 24 dbgMPIapp.exe 10 9". Besides general job information, the output contains the tracepoint information:

tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 5 to 95 rank: from 1 to 23
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 1 to 100 rank: from 1 to 23
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 4 to 99 rank: from 1 to 23
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 2 to 96 rank: from 1 to 23
...

where we can see how the variables change from pass to pass.

Further, the breakpoint information is printed, for example:

message (6-23): Process stopped at breakpoint in bar (dbgMPIapp_corr.c:51).
message: Stacks
message: Processes Function                            
message: 6-23      main (dbgMPIapp_corr.c:181)         
message: 6-23        worker (dbgMPIapp_corr.c:120) 
message: 6-23          bar (dbgMPIapp_corr.c:51)   
message (n/a): Select process 6
message: Current Stack
message: #2 main (argc=3, argv=0x7fffffff2738) at /lustre/[...]/dbgMPIapp_corr.c:181 (at 0x00000000200017cb)
message: #1 worker (rank=32767, loopCount=-55504, burro=0) at /lustre/[...]/dbgMPIapp_corr.c:120 (at 0x000000002000154b)
message: #0 bar (loopCount=10, rank=6, burro=-2080374782) at /lustre/[...]/dbgMPIapp_corr.c:51 (at 0x0000000020001204)
message: Locals
message: buf: burro: -2080374782 g_rank: 0 (from 0 to 17) g_size: 18 i: 10922 loopCount: 10 rank: 6 (from 6 to 23) sum: -1293743216 val: 0

where we can also inspect the arguments passed to each function call on the stack.
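
Because the text report is line oriented, standard tools are enough for a first overview. The sketch below counts tracepoint records and breakpoint stops in a small sample built from the excerpts above; it assumes a real report.txt uses the same line prefixes, which may differ between DDT versions.

```shell
# Sample lines copied from the excerpts above; a real report is longer.
report=$(mktemp)
cat > "$report" <<'EOF'
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 5 to 95 rank: from 1 to 23
tracepoint (1-23): [calc (dbgMPIapp_corr.c:25)] randomNr: from 1 to 100 rank: from 1 to 23
message (6-23): Process stopped at breakpoint in bar (dbgMPIapp_corr.c:51).
EOF

# Count tracepoint records and breakpoint stops.
n_trace=$(grep -c '^tracepoint' "$report")
n_break=$(grep -c 'Process stopped at breakpoint' "$report")
echo "tracepoint records: $n_trace, breakpoint stops: $n_break"

rm -f "$report"
```

For an actual run, the temporary sample file would be replaced by the report.txt written by ddt --offline.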