- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

CRAY XC40 Tools

Cray provides several official tools. Below is a list of some of them; you can get more information in the online manual pages (man atp, for example).

At HLRS, Cray additionally supports some tools with limited or no support. Besides CrayPAT, the Cray Profiler is currently also available.

ATP : Abnormal Termination Processing

This tool can be used when the application crashes, e.g. with a segmentation fault. Abnormal Termination Processing (ATP) is a system that monitors Cray system user applications. If an application takes a system trap, ATP performs analysis on the dying application. A stack walkback of the crashing rank is printed to stderr. In the following example, rank 1 crashes:

Application 5408137 is crashing. ATP analysis proceeding...

ATP Stack walkback for Rank 1 starting:
  _start@start.S:113
  __libc_start_main@libc-start.c:242
  main@0x200015f6
  ConverseInit@0x20255a59
  _processHandler(void*, CkCoreState*)@0x201a320d
  CkDeliverMessageFree@0x2019d402
  CkArray::recvBroadcast(CkMessage*)@0x201c50f7
  CkArrayBroadcaster::deliver(CkArrayMessage*, ArrayElement*, bool)@0x201c4c30
  CkIndex_TreePiece::_call_drift_marshall51(void*, void*)@0x20051cb9
  TreePiece::drift(double, int, int, double, double, int, bool, CkCallback const&)@0x200169eb
  ArrayElement::contribute(int, void const*, CkReduction::reducerType, CkCallback const&, unsigned short)@0x201c213a
  CkReductionMsg::buildNew(int, void const*, CkReduction::reducerType, CkReductionMsg*)@0x201cdf20
  memcpy@memcpy.S:196
ATP Stack walkback for Rank 1 done
Process died with signal 11: 'Segmentation fault'
Forcing core dump of rank 1
View application merged backtrace tree with: stat-view atpMergedBT.dot
You may need to: module load stat

_pmiu_daemon(SIGCHLD): [NID 07469] [c2-3c2s11n1] [Fri Sep 23 10:37:51 2016] PE RANK 0 exit signal Killed
[NID 07469] 2016-09-23 10:37:51 Apid 5408137: initiated application termination

In this example, the memcpy called from CkReductionMsg::buildNew seems to have an issue.

In addition to the text output, stack backtraces of ALL the application processes are gathered into a merged stack backtrace tree and written to disk as the file atpMergedBT.dot. The stack backtrace tree for the first process to die is sent to stderr, as is the number of the signal that caused the application to fail. If Linux core dumping is enabled (see ulimit or limit in your shell documentation), a heuristically selected set of processes also dumps their cores.

The atpMergedBT.dot file can be viewed with stat-view (the Stack Trace Analysis Tool viewer), which is included in the Cray Debugger Support Tools (module load stat), or alternatively with the file viewer dotty, which can be found on most Linux systems. The merged stack backtrace tree provides a concise yet comprehensive view of what the application was doing at the time of its termination.

At HLRS the ATP module is loaded by default. To use it you have to set

export ATP_ENABLED=1

in your batch script. ATP also provides a few important core files if you additionally raise the core-file size limit, e.g. in bash:

ulimit -c unlimited
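
For illustration, a minimal sketch of a batch script with ATP enabled; the node counts and the application name are placeholders, following the generic job script used later on this page:

#!/bin/bash
#PBS -l nodes=16:ppn=24
#PBS -l walltime=00:10:00
#PBS -j oe
#PBS -o job.out

cd $PBS_O_WORKDIR
export ATP_ENABLED=1     # activate ATP analysis on abnormal termination
ulimit -c unlimited      # allow core dumps of the heuristically selected ranks
aprun -n 384 -N 24 ./app.exe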

STAT : Stack Trace Analysis Tool

Stack Trace Analysis Tool (STAT) is a cross-platform tool from the University of Wisconsin-Madison. It gathers and merges stack traces from a running application's parallel processes and creates call-graph prefix trees, a compressed representation that allows scalable visualization and scalable analysis. It is very useful when an application seems to be stuck or hung. Full information, including use cases, is available at http://www.paradyn.org/STAT/STAT.html. STAT scales to many thousands of concurrent processes.

To use it, you simply load the module and attach it to your running/hanging application.

$> module load stat
$> qsub  job.pbs
	#start the application e.g. using a batch script
	#Wait until application reaches the suspicious state
$> STATGUI <JOBID> 
	#Launches the graphical interface
	#Attach to the job
	#Shows the calltree
$> qdel <JOBID>
	#Terminate the running application

IOBUF - I/O buffering library

IOBUF is an I/O buffering library that can reduce the I/O wait time for programs that read or write large files sequentially. IOBUF intercepts I/O system calls such as read and open and adds a layer of buffering, thus improving program performance by enabling asynchronous prefetching and caching of file data.

IOBUF can also gather runtime statistics and print a summary report of I/O activity for each file.

In general, no program source changes are needed in order to take advantage of IOBUF. Instead, IOBUF is implemented by following these steps:

Load the IOBUF module:

% module load iobuf

Relink the program. Set the IOBUF_PARAMS environment variable as needed.

% export IOBUF_PARAMS='*:verbose'

Execute the program.
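
As an illustration, the complete sequence might look as follows; the file pattern and the count/size values are only assumptions for this sketch (see man iobuf for the supported parameters):

% module load iobuf
% make clean; make                                       # relink the application with the iobuf module loaded
% export IOBUF_PARAMS='*.dat:count=4:size=8M:verbose'    # buffer files matching *.dat with four 8 MB buffers
% aprun -n 24 -N 24 ./app.exe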

If a memory allocation error occurs, buffering is reduced or disabled for that file and a diagnostic is printed to stderr. When the file is opened, a single buffer is allocated if buffering is enabled. The allocation of additional buffers is done when a buffer is needed. When a file is closed, its buffers are freed (unless asynchronous I/O is pending on the buffer and lazyclose is specified).

Please check the complete manual and all available environment variables by reading the man page (man iobuf, after loading the iobuf module).

 IMPORTANT NOTICE : As iobuf is written for serial IO, its behavior is undefined 
 when used for parallel I/O into a single file. 

You should never use IOBUF when several parallel processes operate on a single file.

Perftools : Performance Analysis Tool Kit

The Cray Performance Measurement and Analysis Tools (or CrayPat) are a suite of optional utilities that enable you to capture and analyze performance data generated during the execution of your program on a Cray system. The information collected and the analysis produced by these tools can help you answer two fundamental programming questions: How fast is my program running? And how can I make it run faster? Detailed documentation about CrayPAT can be found in document S-2376-622 (http://docs.cray.com/books/S-2376-622/S-2376-622.pdf). Here a short summary is presented, concentrating on usage.

Profiling distinguishes between two main run cases, sampling and tracing:

Sampling
  Advantages
    • Only need to instrument the main routine
    • Low overhead, depends only on the sampling frequency
    • Smaller volumes of data produced
  Disadvantages
    • Only statistical averages available
    • Limited information from performance counters

Tracing
  Advantages
    • More accurate and more detailed information
    • Data collected from every traced function call, not statistical averages
  Disadvantages
    • Increased overhead as the number of function calls increases
    • Huge volumes of data generated

Using the fully adjustable CrayPAT, Automatic Profiling Analysis (APA) offers guided tracing that combines the advantages of sampling and tracing. Furthermore, event tracing can be enhanced by loop profiling.


perftools-base should be loaded as a starting point. It provides access to man pages, Reveal, Cray Apprentice2, and the instrumentation modules, and it can be kept loaded without impact on applications. The following instrumentation modules are available:

  • perftools-lite (sampling experiments)
  • perftools-lite-events (tracing experiments)
  • perftools-lite-loops (collects data for auto-parallelization / loop estimates in Reveal)
  • perftools-lite-gpu (GPU kernels and data movements)
  • perftools (fully adjustable CrayPAT, using pat_build and pat_report)

GENERAL REMARKS: The instrumented runs MUST run on Lustre! Always check that the instrumented binary has not notably affected the run time compared to the original. Collecting event traces on large numbers of frequently called functions, or setting the sampling interval very low, can introduce a lot of overhead (check the trace-text-size option of pat_build). The runtime analysis can be modified through environment variables of the form PAT_RT_*.


CrayPAT

The perftools-lite modules provide a user-friendly way to auto-instrument your application for various profiling cases, while the perftools module provides CrayPAT's full functionality; as described below, instrumentation and report generation are then triggered manually with various options. In the following descriptions we assume a simple batch job script:

$> cat job.pbs
#!/bin/bash
#PBS -l nodes=16:ppn=24
#PBS -l walltime=00:10:00
#PBS -j oe
#PBS -o job.out

cd $PBS_O_WORKDIR
aprun -n 384 -N 24 <exe>

An application is instrumented and run using the following commands:

$> module load perftools-base
$> module load <CrayPAT-lite-module>
$> make clean; make # or what is necessary to rebuild your application
$> qsub job.pbs     # no changes needed for aprun inside this script 
$> less job.out

As a result a *.rpt and a *.ap2 file are created and the report is additionally printed to stdout.

Additional information and representations can be generated using pat_report with the produced *.ap2 file.

$> pat_report <option> *.ap2 

Descriptions of the available options can be obtained using man pat_report.

You can visually inspect the created self-contained .ap2 file using Apprentice2.

REMEMBER: After the experiment is complete, unload the perftools-lite-XXX module to prevent further program instrumentation. The perftools-base module can be kept loaded.
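
For example, after a tracing experiment (assuming perftools-lite-events was the module loaded):

$> module unload perftools-lite-events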

perftools-base

The perftools-base module provides access to man pages, utilities such as Reveal, Cray Apprentice2 and grid_order, and the instrumentation modules. It does not add compiler flags to enable performance data collection (such as symbol table information), as the earlier perftools or perftools-lite modules did and as the instrumentation modules listed above do. It is a low-impact module that does not alter program behavior and can be left loaded even when building and running programs without CrayPat instrumentation.


perftools-lite

This module provides the default CrayPat-lite profiling. It enables sampling of the application.

Besides other information, the profile by function group and function is presented in the report:

Table 1:  Profile by Function Group and Function (top 8 functions shown)

  Samp% |  Samp |  Imb. |  Imb. |Group
        |       |  Samp | Samp% | Function
        |       |       |       |  PE=HIDE
       
 100.0% | 263.4 |    -- |    -- |Total
|----------------------------------------------------------------------
|  78.0% | 205.3 |    -- |    -- |MPI
||---------------------------------------------------------------------
||  62.4% | 164.4 | 115.6 | 42.2% |mpi_bcast
||  10.4% |  27.4 | 186.6 | 89.1% |MPI_ALLREDUCE
||   4.7% |  12.4 |  86.6 | 89.3% |MPI_IPROBE
||=====================================================================
|  13.1% |  34.5 |    -- |    -- |USER
||---------------------------------------------------------------------
...
|======================================================================

Here the stack traces of all processes are merged, and the combined information is presented as relative and absolute values of the counted samples per group/function, together with the imbalance between processes.

perftools-lite-events

This module enables CrayPAT's event tracing of applications. After loading the module, re-compiling/re-linking the application, and submitting the job as usual, the report is written in the way described above. In contrast to sampling, event tracing reports the real time spent in groups/functions.

perftools-lite-loops

This module enables CrayPat-lite loop work estimates. It must be used with the Cray compiler (CCE). After proceeding in the way described above, loop work estimates are sent to stdout and to a .ap2 file. The performance data can be combined with source code information and compiler annotations by opening the .ap2 file in Reveal. The module modifies the compile and link steps to include CCE's -h profile_generate option and instruments the program for tracing (pat_build -w). Remember that -h profile_generate reduces compiler optimization levels. After the experiment is complete, unload perftools-lite-loops to prevent further program instrumentation.
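
A minimal sketch of the workflow, reusing the generic job.pbs from above:

$> module load perftools-base
$> module load perftools-lite-loops
$> make clean; make      # CCE adds -h profile_generate and the tracing instrumentation automatically
$> qsub job.pbs          # aprun call unchanged
$> less job.out          # loop work estimates; also written to the generated .ap2 file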

Table 1:  Inclusive and Exclusive Time in Loops (from -hprofile_generate)
  Loop | Loop Incl |      Time |    Loop |  Loop |  Loop |  Loop |Function=/.LOOP[.]
  Incl |      Time |     (Loop |     Hit | Trips | Trips | Trips | PE=HIDE
 Time% |           |     Adj.) |         |   Avg |   Min |   Max |
|-----------------------------------------------------------------------------
| 93.0% | 19.232051 |  0.000849 |       2 |  26.5 |     3 |    50 |jacobi.LOOP.1.li.236 
| 77.8% | 16.092021 |  0.001350 |      53 | 255.0 |   255 |   255 |jacobi.LOOP.2.li.240 
| 77.8% | 16.090671 |  0.110827 |   13515 | 255.0 |   255 |   255 |jacobi.LOOP.3.li.241 
| 77.3% | 15.979844 | 15.979844 | 3446325 | 511.0 |   511 |   511 |jacobi.LOOP.4.li.242 
| 14.1% |  2.906115 |  0.001238 |      53 | 255.0 |   255 |   255 |jacobi.LOOP.5.li.263

perftools

In contrast to the perftools-lite modules, which instrument and report automatically, the perftools module requires manual instrumentation and report generation:

$> module load perftools-base
$> module load perftools
$> make clean; make	# If your application is already built with perftools loaded you do not have to rebuild when switching the experiment.
$> pat_build <pat_options> app.exe	# pat_options are described below; Creates instrumented binary app.exe+pat
$> qsub job.pbs         # ATTENTION: now you have to use the new instrumented binary "aprun <options> ./app.exe+pat"
$> pat_report -o myrep.txt app.exe+pat+*  # .xf file or related directory

Running the “+pat” binary creates a data file or directory. pat_report reads that data file and prints lots of human-readable performance data. It also creates an *.ap2 file which contains all profiling data. (The app.exe+pat+* file/directory can be deleted after the creation of the .ap2 file)

The instrumentation can be adjusted using pat_build options, which are listed in man pat_build; a few commonly used options are:

pat_build option    Description
(no option)         Sampling profile (the default experiment)
-u                  Trace functions in source files owned by the user
-w                  Make tracing the default experiment
-T <func>           Specifies a function which will be traced
-t <file>           All functions in the specified file will be traced
-g <group>          Instrument all functions belonging to the specified trace function group, e.g. blas, io, mpi, netcdf, syscall
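
For example, a possible invocation tracing user functions together with the MPI and I/O trace groups (this particular option combination is only an illustration; see man pat_build for details):

$> pat_build -u -g mpi,io app.exe     # creates the instrumented binary app.exe+pat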

It should be noted that only true function calls can be traced. Functions that are inlined by the compiler or that have local scope in a compilation unit cannot be traced.

The pat_report tool combines information from the *.xf output (raw data files, optimized for writing to disk). During conversion the instrumented binary must still exist. As a result, a *.ap2 file is produced, which is a compressed performance file optimized for visualization and analysis. The .ap2 file is the input for subsequent pat_report calls and for Reveal or Apprentice2. Once the .ap2 file has been generated, the *.xf files and the instrumented binary can be removed. Many options for sorting, slicing, or dicing the data in the tables are provided using

$> pat_report -O <table option> *.ap2
$> pat_report -O help (list of available profiles)

The volume and type of information depends on sampling vs. tracing. Several output formats {plot | rpt | ap2 | ap2-xml | ap2-txt | xf-xml | xf-txt | html} are available through the -f option. Furthermore, gathered data can be filtered using

$> pat_report -sfilter_input='condition' ...

where the 'condition' could be an expression involving 'pe', such as 'pe<1024' or 'pe%2==0'.
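
A possible post-processing session, where the .xf and .ap2 file names are placeholders for whatever the experiment actually produced:

$> pat_report -f ap2 app.exe+pat+12345-678s.xf                                   # convert the raw .xf data into a .ap2 file
$> pat_report -sfilter_input='pe<1024' -o myrep.txt app.exe+pat+12345-678s.ap2   # report restricted to ranks 0-1023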

Loop work estimation can be collected by using the CCE compiler option -h profile_generate and the tracing experiment described above. It is recommended to turn off OpenMP and OpenACC for the loop work estimates via -h noomp -h noacc.

Hardware counter selection can be enabled using export PAT_RT_PERFCTR=<group> | <event list>, where the related groups and events can be listed using man hwpc and papi_avail, or via pat_help -> counters.
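
For example, a sketch selecting two common PAPI events for the instrumented run (check papi_avail on the system for the exact event names available):

export PAT_RT_PERFCTR=PAPI_TOT_INS,PAPI_L1_DCM   # total instructions and L1 data-cache misses
aprun -n 384 -N 24 ./app.exe+pat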

Energy information can be gathered using pat_report -O program_energy *.ap2.


Automatic Profiling Analysis (APA)

The advantages of sampling and tracing are combined in the guided profiling APA. The targets are large, long-running programs (in general, a full trace would inject considerable overhead). The goal is to limit tracing to those functions that consume the most time. As a procedure, a preliminary sampling experiment is used to determine the functions consuming the most time; these are then instrumented for tracing.

$> module load perftools
$> make clean; make
$> pat_build app.exe                      # The APA is the default experiment. No option needed.
$> qsub job.pbs                           # using the new instrumented binary in "aprun <option> ./app.exe+pat"
$> pat_report -o myrep.txt app.exe+pat+*  # generates a report and the *.apa file

$> vi *.apa                               # The *.apa file contains instructions for the next instrumentation step. Modify it according to your needs.
$> pat_build -O *.apa                     # Generates an instrumented binary *.exe+apa for tracing
$> qsub job.pbs                           # using the new instrumented binary in "aprun <option> ./app.exe+apa"
$> pat_report -o myrep.txt app.exe+apa+*  # .xf file or related directory

Reveal

Reveal is Cray's next-generation integrated performance analysis and code optimization tool. Its main features are:

  • inspecting a combined view of loop work estimations with source code (compiler annotations)
  • assisting an OpenMP port

For an OpenMP port a developer has to understand the scoping of the variables, i.e. whether variables are shared or private. Reveal assists by navigating through the source code using whole-program analysis (data provided by the Cray compilation environment; listing files) and couples this with performance data collected during execution by CrayPAT. It identifies which high-level serial loops could benefit from parallelism, gathers and presents dependency information for targeted loops, and assists users in optimizing code by providing variable scoping feedback and suggested compiler directives.

Usage:

$> module load perftools-base
$> ftn -O3 -hpl=my_program.pl -c my_program_file1.f90   # compile each source file with -hpl= to build up the program library
$> ftn -O3 -hpl=my_program.pl -c my_program_file2.f90
# run an instrumented binary to gather performance data using loop work estimation (see above)
$> reveal my_program.pl my_program.ap2 &

You can omit the *.ap2 file and inspect only the compiler feedback. Note that the -h profile_generate option disables most automatic compiler optimizations, which is why Cray recommends generating this data separately from generating the program library file.

Apprentice2

Cray Apprentice2 is a post-processing performance data visualization tool, which takes *.ap2 files as input.

Main features are:

  • Call graph profile
  • Communication statistics
  • Time-line view for Communication and IO.
  • Activity view
  • Pair-wise communication statistics
  • Text reports

It helps identify:

  • Load imbalance
  • Excessive communication
  • Network contention
  • Excessive serialization
  • I/O Problems

Usage:

$> module load perftools-base
$> app2 *.ap2 &

If the full trace is enabled (using the environment variable PAT_RT_SUMMARY=0), a timeline view is activated, which helps to spot communication bottlenecks. Please use it only for small experiments!
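
For instance, the relevant lines in the job script could look like this (the binary name is a placeholder):

export PAT_RT_SUMMARY=0          # keep the full trace instead of a summary; expect large data volumes
aprun -n 384 -N 24 ./app.exe+pat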

You can install Apprentice2 on your local machine. The installers are available on a Cray login node:

  • module load perftools-base
  • Go to: $CRAYPAT_ROOT/share/desktop_installers/
  • Download the .dmg or .exe installer to your laptop
  • Double-click the installer and follow the directions to install

Cray Profiler

The Cray profiler library is deprecated, but still available on the system. A description can be found on the CrayProfiler page of this wiki.