- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

CRAY XC40 Tools

== Cray provided tools ==
Cray provides several official tools. Below is a list of some of them; more information is available in the online manual pages (e.g. '''man atp''').


Jump to [[CRAY_XC40_Tools#ATP : Abnormal Termination Processing|ATP]], [[CRAY_XC40_Tools#STAT :  Stack Trace Analysis Tool|STAT]], [[CRAY_XC40_Tools#IOBUF - I/O buffering library|IOBUF]] and [[CRAY_XC40_Tools#Perftools : Performance Analysis Tool Kit|Perftools]].
= ATP : Abnormal Termination Processing =
<!--{{Warning|text= Doesn't work yet.}} -->
This tool can be used when the application crashes, e.g. with a segmentation fault.  
Abnormal Termination Processing (ATP) is a system that monitors Cray system user applications. If an application takes a system trap, ATP
performs analysis on the dying application.
A stack walkback of the crashing rank is written to stderr. In the following example, rank 1 crashes:
<pre>
Application 5408137 is crashing. ATP analysis proceeding...

ATP Stack walkback for Rank 1 starting:
  _start@start.S:113
  __libc_start_main@libc-start.c:242
  main@0x200015f6
  ConverseInit@0x20255a59
  _processHandler(void*, CkCoreState*)@0x201a320d
  CkDeliverMessageFree@0x2019d402
  CkArray::recvBroadcast(CkMessage*)@0x201c50f7
  CkArrayBroadcaster::deliver(CkArrayMessage*, ArrayElement*, bool)@0x201c4c30
  CkIndex_TreePiece::_call_drift_marshall51(void*, void*)@0x20051cb9
  TreePiece::drift(double, int, int, double, double, int, bool, CkCallback const&)@0x200169eb
  ArrayElement::contribute(int, void const*, CkReduction::reducerType, CkCallback const&, unsigned short)@0x201c213a
  CkReductionMsg::buildNew(int, void const*, CkReduction::reducerType, CkReductionMsg*)@0x201cdf20
  memcpy@memcpy.S:196
ATP Stack walkback for Rank 1 done
Process died with signal 11: 'Segmentation fault'
Forcing core dump of rank 1
View application merged backtrace tree with: stat-view atpMergedBT.dot
You may need to: module load stat

_pmiu_daemon(SIGCHLD): [NID 07469] [c2-3c2s11n1] [Fri Sep 23 10:37:51 2016] PE RANK 0 exit signal Killed
[NID 07469] 2016-09-23 10:37:51 Apid 5408137: initiated application termination
</pre>
In this example, ''memcpy'' called from ''CkReductionMsg::buildNew'' seems to have an issue.
 
In addition to the text output, stack backtraces of '''ALL''' the application processes are gathered into a merged stack backtrace tree and written to disk as the file atpMergedBT.dot. The stack backtrace tree for the first process to die is sent to stderr, as is the number of the signal that caused the application to fail. If Linux core dumping is enabled (see ulimit or limit in your shell documentation), core files are also written for a selected set of processes. The merged stack backtrace tree provides a concise yet comprehensive view of what the application was doing at the time of its termination.


At HLRS the ATP module is loaded by default. To enable ATP you have to set
<pre>export ATP_ENABLED=1</pre>
in your batch script.
ATP additionally forces core dumps of a few important ranks if core dumping is enabled, e.g. via
<pre>ulimit -c unlimited</pre>
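
A minimal batch script putting this together could look as follows (a sketch: the resource requests and the application name app.exe are placeholders, adapt them to your job):
<pre>#!/bin/bash
#PBS -l nodes=1:ppn=24
#PBS -l walltime=00:10:00

cd $PBS_O_WORKDIR

export ATP_ENABLED=1    # let ATP analyse the application if it crashes
ulimit -c unlimited     # allow core dumps (bash syntax)

aprun -n 24 ./app.exe
</pre>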


= STAT :  Stack Trace Analysis Tool =
<!-- {{Warning|text= Doesn't work yet.}} -->
The Stack Trace Analysis Tool (STAT) is a cross-platform tool from the University of Wisconsin-Madison. It gathers and merges stack traces from the parallel processes of a running application and builds a call graph prefix tree, a compressed representation that allows scalable visualization and scalable analysis. STAT scales to many thousands of concurrent processes and is very useful when an application seems to be stuck or hung. The collected output is viewed with '''stat-view'''; both tools are part of the module stat. Full information including use cases is available at [http://www.paradyn.org/STAT/STAT.html paradyn].


STAT needs the process id (apid) of the aprun command which runs your program. As this apid is not available on the login nodes, we have written a wrapper called '''STAT_hermit'''.
To use it, simply load the module and attach it to your running/hanging application.
<pre>$> module load stat
$> qsub  job.pbs
#start the application e.g. using a batch script
#Wait until application reaches the suspicious state
$> STATGUI <JOBID>
#Launches the graphical interface
#Attach to the job
#Shows the calltree
$> qdel <JOBID>
#Terminate the running application
</pre>


Instead of the apid, the wrapper uses the id of your batch job (use qstat to get it) and tries to find the corresponding aprun command. If there are several possibilities, it will show you a list and ask you to select the one you want to trace.
= IOBUF - I/O buffering library =
 


IOBUF is an I/O buffering library that can reduce the I/O wait time for programs that read or write large files sequentially. IOBUF intercepts I/O system calls such as read and open and adds a layer of buffering, thus improving program performance by enabling asynchronous prefetching and caching of file data.

You should never use IOBUF when several parallel processes operate on a single file.
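
A typical way to use IOBUF is sketched below (the iobuf module and the IOBUF_PARAMS variable are described in '''man iobuf'''; the file pattern and buffer sizes shown here are only examples):
<pre>$> module load iobuf
$> make clean; make        # relink the application so the IOBUF intercept layer is included

# in the batch script, select which files are buffered and how:
export IOBUF_PARAMS='*.dat:count=4:size=16M'
aprun -n 24 ./app.exe
</pre>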


= Perftools : Performance Analysis Tool Kit =

The Cray Performance Measurement and Analysis Tools (CrayPat) are a suite of optional utilities that enable you to capture and analyze performance data generated during the execution of your program on a Cray system. The information collected and the analysis produced by these tools can help you answer two fundamental programming questions: How fast is my program running? And how can I make it run faster?
Detailed documentation about CrayPat can be found in document [http://docs.cray.com/books/S-2376-622/S-2376-622.pdf S-2376-622].
Here a short summary is presented, concentrating on the usage.

Profiling is mainly distinguished between two run cases, sampling and tracing:
{|border="1" cellpadding="2"
!width="250"|Sampling
!width="250"|Tracing
|-
|Advantages
*Only need to instrument main routine
*Low Overhead – depends only on sampling frequency
*Smaller volumes of data produced
|Advantages
*More accurate and more detailed information
*Data collected from every traced function call, not statistical averages
|-
|Disadvantages
*Only statistical averages available
*Limited information from performance counters
|Disadvantages
*Increased overheads as number of function calls increases
*Huge volumes of data generated
|}
The Automatic Profiling Analysis (APA) of the fully adjustable CrayPat is a guided tracing approach that combines the advantages of sampling and tracing.
Furthermore, event tracing can be enhanced by using loop profiling.


'''[[CRAY_XC40_Tools#perftools-base|perftools-base]]''' should be loaded as a starting place. It provides access to man pages, Reveal, Cray Apprentice2, and the new instrumentation modules. This module can be kept loaded without impact on applications.
The following instrumentation modules are available:
* '''[[CRAY_XC40_Tools#perftools-lite|perftools-lite]]''' (sampling experiments)
* '''[[CRAY_XC40_Tools#perftools-lite-events|perftools-lite-events]]''' (tracing experiments)
* '''[[CRAY_XC40_Tools#perftools-lite-loops|perftools-lite-loops]]''' (collect data for auto-parallelization / loop estimates in Reveal)
* '''perftools-lite-gpu''' (GPU kernels and data movements)
* '''[[CRAY_XC40_Tools#perftools|perftools]]''' (fully adjustable CrayPAT, using pat_build and pat_report)
 
GENERAL REMARKS: '''Instrumented binaries MUST run on Lustre!''' Always check that instrumentation has not notably affected the run time compared to the original binary. Collecting event traces for large numbers of frequently called functions, or setting the sampling interval very low, can introduce a lot of overhead (check the trace-text-size option of pat_build). The runtime analysis can be modified through environment variables of the form PAT_RT_*.


== CrayPAT ==
The perftools-lite modules provide a user-friendly way to auto-instrument your application for various profiling cases. The perftools module provides CrayPat's full functionality; as described below, instrumentation and report generation are then triggered manually, specifying various options.
In the following descriptions we assume a simple batch job script:
<pre>$> cat job.pbs
#!/bin/bash
#PBS -l nodes=1:ppn=24
#PBS -l walltime=00:10:00
#PBS -j oe
#PBS -o job.out

cd $PBS_O_WORKDIR
aprun -n 384 -N 24 <exe>
</pre>
An application is instrumented and run using the following commands:
<pre>$> module load perftools-base
$> module load <CrayPAT-lite-module>
$> make clean; make   # or whatever is necessary to rebuild your application
$> qsub job.pbs       # no changes needed for aprun inside this script
$> less job.out
</pre>
As a result a *.rpt and an *.ap2 file are created, and the report is additionally printed to stdout.

Additional information and representations can be gathered by running '''pat_report''' on the produced *.ap2 file:
<pre>$> pat_report <option> *.ap2 </pre>
Descriptions of the available options can be obtained using ''man pat_report''.
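
For example, assuming the experiment produced a file my_app.ap2 (an illustrative name), a call tree view written to a text file could be requested like this:
<pre>$> pat_report -O calltree -o calltree.txt my_app.ap2</pre>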
You can visually inspect the created self-contained ap2 file using [[CRAY_XC40_Tools#Apprentice2|Apprentice2]].
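
A minimal sketch (app2 is the usual Apprentice2 launcher provided with perftools-base; my_app.ap2 is a placeholder and X forwarding is assumed):
<pre>$> module load perftools-base
$> app2 my_app.ap2 &
</pre>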
REMEMBER: After the experiment is complete, unload the perftools-lite-XXX module to prevent further program instrumentation. The perftools-base module can be kept loaded.
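
For example, after a tracing experiment:
<pre>$> module unload perftools-lite-events</pre>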
=== perftools-base ===
The perftools-base module provides access to man pages, utilities such as Reveal, Cray Apprentice2 and grid_order, and the instrumentation modules. It does not add compiler flags to enable performance data collection (such as symbol table information), as the earlier perftools or perftools-lite modules did and as the newly available instrumentation modules do. It is a low-impact module that does not alter program behavior and can be left loaded even when building and running programs without CrayPat instrumentation.

=== perftools-lite ===
This module provides the default CrayPat-lite profiling. It enables sampling of the application.


Besides other information, the Profile by Function Group and Function is presented in the report:
<pre>Table 1:  Profile by Function Group and Function (top 8 functions shown)

  Samp% |  Samp |  Imb. |  Imb. |Group
        |       |  Samp | Samp% | Function
        |       |       |       |  PE=HIDE

 100.0% | 263.4 |    -- |    -- |Total
|----------------------------------------------------------------------
|  78.0% | 205.3 |    -- |    -- |MPI
||---------------------------------------------------------------------
||  62.4% | 164.4 | 115.6 | 42.2% |mpi_bcast
||  10.4% |  27.4 | 186.6 | 89.1% |MPI_ALLREDUCE
||   4.7% |  12.4 |  86.6 | 89.3% |MPI_IPROBE
||=====================================================================
|  13.1% |  34.5 |    -- |    -- |USER
||---------------------------------------------------------------------
...
|======================================================================
</pre>
Here the stack traces of all processes are merged, and the combined information is presented as relative and absolute values of the counted samples per group/function, together with the imbalance between processes.


=== perftools-lite-events ===
This module enables CrayPat's event tracing of applications. After loading the module, recompiling/relinking the application and submitting the job as usual, the report is written in the way described above.
In contrast to sampling, event tracing reports the real time spent in groups / functions.


=== perftools-lite-loops ===
This module enables CrayPat-lite loop work estimates. It must be used with the Cray compiler (CCE). After proceeding in the way described above, loop work estimates are sent to stdout and to an .ap2 file. Performance data can be combined with source code information and compiler annotations by loading the .ap2 file into Reveal.
The module modifies the compile and link steps to include CCE's -h profile_generate option and instruments the program for tracing (pat_build -w). Remember that -h profile_generate reduces the compiler optimization level. After the experiment is complete, unload perftools-lite-loops to prevent further program instrumentation.

An example of the resulting loop work estimate table:
<pre>Table 1:  Inclusive and Exclusive Time in Loops (from -hprofile_generate)

  Loop | Loop Incl |      Time |    Loop |  Loop |  Loop |  Loop |Function=/.LOOP[.]
  Incl |      Time |     (Loop |     Hit | Trips | Trips | Trips | PE=HIDE
 Time% |           |     Adj.) |         |   Avg |   Min |   Max |
|-----------------------------------------------------------------------------
| 93.0% | 19.232051 |  0.000849 |       2 |  26.5 |     3 |    50 |jacobi.LOOP.1.li.236
| 77.8% | 16.092021 |  0.001350 |      53 | 255.0 |   255 |   255 |jacobi.LOOP.2.li.240
| 77.8% | 16.090671 |  0.110827 |   13515 | 255.0 |   255 |   255 |jacobi.LOOP.3.li.241
| 77.3% | 15.979844 | 15.979844 | 3446325 | 511.0 |   511 |   511 |jacobi.LOOP.4.li.242
| 14.1% |  2.906115 |  0.001238 |      53 | 255.0 |   255 |   255 |jacobi.LOOP.5.li.263
</pre>
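
The corresponding workflow might look like this (a sketch; it assumes the Cray programming environment, e.g. PrgEnv-cray, since CCE is required):
<pre>$> module load perftools-base
$> module load perftools-lite-loops
$> make clean; make                  # rebuilt with CCE; -h profile_generate is added automatically
$> qsub job.pbs
$> module unload perftools-lite-loops
</pre>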


=== perftools ===
In contrast to the perftools-lite modules, which instrument and report automatically, the perftools module requires manual instrumentation and report generation:
<pre>$> module load perftools-base
$> module load perftools
$> make clean; make                        # if your application was already built with perftools loaded, you do not have to rebuild when switching the experiment
$> pat_build <pat_options> app.exe         # pat_options are described below; creates the instrumented binary app.exe+pat
$> qsub job.pbs                            # ATTENTION: now you have to use the new instrumented binary "aprun <options> ./app.exe+pat"
$> pat_report -o myrep.txt app.exe+pat+*   # .xf file or related directory
</pre>
Running the "+pat" binary creates a data file or directory. ''pat_report'' reads that data file and prints a large amount of human-readable performance data. It also creates an *.ap2 file which contains all profiling data (the app.exe+pat+* file/directory can be deleted after the creation of the .ap2 file).

The instrumentation can be adjusted using ''pat_build'' options, which are listed in '''man pat_build'''; a few commonly used options are:
{|border="1" cellpadding="2"
!width="150"|pat_build Option
!width="350"|Description
|-
| style="text-align:center;"| (none)
|Sampling profile (default experiment)
|-
| style="text-align:center;"| -u
|tracing of functions in source file owned by the user
|-
| style="text-align:center;"| -w
|Tracing is default experiment
|-
| style="text-align:center;"| -T <func>
| Specifies a function which will be traced
|-
| style="text-align:center;"| -t <file>
|All functions in the specified file will be traced
|-
| style="text-align:center;"| -g <group>
|Instrument all functions belonging to the specified trace function group, e.g. blas, io, mpi, netcdf, syscall
|}
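
For example, to trace the user functions and the MPI group (app.exe is a placeholder):
<pre>$> pat_build -u -g mpi app.exe       # creates app.exe+pat with user functions and MPI functions traced</pre>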


It should be noted that only true function calls can be traced. Functions that are inlined by the compiler or that have local scope in a compilation unit cannot be traced.

The '''pat_report''' tool combines information from the *.xf output (raw data files, optimized for writing to disk). During this conversion the instrumented binary must still exist. The result is an *.ap2 file, a compressed performance file optimized for visualization and analysis. The ap2 file is the input for subsequent ''pat_report'' calls and for ''Reveal'' or ''Apprentice2''. Once the ap2 file is generated, the *.xf files and the instrumented binary can be removed.
Many options for sorting, slicing or dicing the data in the tables are provided via
<pre>$> pat_report -O <table option> *.ap2
$> pat_report -O help     # list of available profiles</pre>
The volume and type of information depends on sampling vs. tracing. Several output formats {plot | rpt | ap2 | ap2-xml | ap2-txt | xf-xml | xf-txt | html} are available through the -f option. Furthermore, the gathered data can be filtered using
<pre>$> pat_report -sfilter_input='condition' ... </pre>
where the 'condition' can be an expression involving 'pe', such as 'pe<1024' or 'pe%2==0'.
'''Loop Work Estimation''' can be collected by using the CCE compiler option ''-h profile_generate'' and the tracing experiment described above. It is recommended to turn off OpenMP and OpenACC for the loop work estimates via ''-h noomp -h noacc''.
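
With the perftools module this amounts to, for example (the source file name is a placeholder; the flags follow the recommendation above):
<pre>$> ftn -h profile_generate -h noomp -h noacc -c jacobi.f90
$> ftn -h profile_generate -o app.exe jacobi.o
$> pat_build -w app.exe      # tracing experiment as described above
</pre>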
 
'''Hardware counter selection''' can be enabled using ''export PAT_RT_PERFCTR=<group> | <event list>'', where the available groups and events can be listed using ''man hwpc'', ''papi_avail'' or pat_help -> counters.
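
For example (the counter group number and the PAPI event names are only illustrative; check what is available on the system first):
<pre>$> export PAT_RT_PERFCTR=1                          # a predefined counter group
$> export PAT_RT_PERFCTR=PAPI_TOT_INS,PAPI_L1_DCM   # or an explicit event list
</pre>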
 
'''Energy information''' can be gathered using ''pat_report -O program_energy *.ap2''.
==== Automatic Profiling Analysis (APA) ====
The advantages of sampling and tracing are combined in the guided profiling APA.
The targets are large, long-running programs (in general, a full trace would inject considerable overhead). The goal is to limit tracing to those functions that consume the most time.
As a procedure, a preliminary sampling experiment is used to determine, and then instrument, the functions consuming the most time:
<pre>$> module load perftools
$> make clean; make
$> pat_build app.exe                     # the APA is the default experiment, no option needed
$> qsub job.pbs                          # using the new instrumented binary in "aprun <option> ./app.exe+pat"
$> pat_report -o myrep.txt app+pat+*

$> vi *.apa                              # the *.apa file contains instructions for the next instrumentation step; modify it according to your needs
$> pat_build -O *.apa                    # generates an instrumented binary *.exe+apa for tracing
$> qsub job.pbs                          # using the new instrumented binary in "aprun <option> ./app.exe+apa"
$> pat_report -o myrep.txt app+apa+*     # .xf file or related directory
</pre>
== Reveal ==
<p>
Reveal is Cray’s next-generation integrated performance analysis and code optimization tool.  
If the label <b>KB</b> is used, the value is in units of kilobytes and is scaled by 2**10 (1024).
main features:
If the label <b>MB</b> is used, the value is in units of megabytes and is scaled by 2**20 (1048576).
* inspecting combined view of loop work estimations with source code (compiler annotations)
If the label <b>GB</b> is used, the value is in units of megabytes and is scaled by 2**30 (1073741824).
* assist an OpenMP port
Otherwise, the value may be printed with a trailing character to indicate the scale factor.
<b>K</b> is 10**3.
<b>M</b> is 10**6.
<b>G</b> is 10**9.
If the value is printed with no trailing scale character and no scale label, then the value is unscaled.
</p>


<h4>Example Summary Report</h4>
For an OpenMP port a developer has to understand the scoping of the variables, i.e. whether variables are shared or private. Reveal assists by navigating through the source code using whole program analysis (data provided by the Cray compilation environment; listing files). Reveal couples with performance data collected during execution by CrayPAT. It understand which high level serial loops could benefit from parallelism. It gathers and present dependency information for targeted loops, assist users optimize code by providing variable scoping feedback and suggested compile directives.  
<pre>
 
Profile of allpair
Usage:
  Number of processes          ,            2
<pre>$> module load perftools-base
  Profiling started            , Tue Aug  1 16:16:21 2006
$> ftn -O3 -hpl=my_program.pl -c my_program_file1.f90
  Profiling ended              , Tue Aug  1 16:16:21 2006
$> ftn -O3 -hpl=my_program.pl -c my_program_file1.f90 #Recompile to generate program library
System summary                  ,          min,          max,          avg
# run instrumented binary to gather performance data using loop work estimation (see above)
  Wall clock time              ,        0.379,        0.379,        0.379
$> reveal my_program.pl my_program.ap2 &
  User CPU time                ,        0.379,        0.379,        0.379
  Processor clock (GHz)         ,        2.400,        2.400,        2.400
PAPI summary                    ,          min,          max,          avg
  Total processor cycles        ,    791855041,    791857657,    791856349
  User processor cycles        ,    791855669,    791858555,    791857112
  FP add pipe ops              ,      383795,      1275119,      829457
  FP multiply pipe ops          ,      158558,      830361,      494459
  Packed SSE instructions      ,      419725,      444353,      432039
  L1 data cache accesses        ,    313840770,    337758406,    325799588
  Processor clock (GHz)        ,        2.400,        2.400,       2.400
STDIO summary                  ,          min,          max,          avg
  Total I/O time                ,        0.000,        0.322,        0.161
  TOtal I/O (MB)                ,        0.000,        0.001,        0.001
  Total write time              ,        0.000,        0.322,        0.161
  Total write (MB)              ,        0.000,        0.001,        0.001
  fwrite calls                  ,            0,          330,          165
  fwrite time                  ,        0.000,        0.322,        0.161
  fwrite total (MB)            ,        0.000,        0.001,        0.001
Heap summary                    ,          min,          max,          avg
  Maximum heap size (MB)       ,      66.527,      66.542,      66.534
  Average block size (KB)      ,      140.310,      240.756,      190.533
  Maximum blocks allocated      ,          127,          129,          128
  Total blocks allocated        ,          283,          486,          384
MPI summary                    ,          min,          max,          avg
  Elapsed time                  ,        0.330,        0.330,        0.330
  Communication time            ,        0.002,        0.207,        0.105
  Wait time                    ,        0.001,        0.060,        0.031
  MPI_Send total calls          ,          10,          50,          30
  MPI_Send average bytes        ,    1600.000,    1705.600,    1652.800
  MPI_Send total bytes          ,        16000,        85280,        50640
  MPI_Ssend total calls        ,          10,          10,          10
  MPI_Ssend average bytes      ,        2664,        2664,        2664
  MPI_Ssend total bytes        ,        26640,        26640,        26640
  MPI_Rsend total calls        ,            0,          10,            5
  MPI_Rsend average bytes      ,            0,        2664,        1332
  MPI_Rsend total bytes        ,            0,        26640,        13320
  MPI_Recv total calls          ,          40,          60,          50
  MPI_Recv average time        ,        0.000,        0.003,        0.001
  MPI_Recv total time          ,        0.000,        0.177,        0.089
  MPI_Recv average bytes        ,        6000,        8000,        7000
  MPI_Recv total bytes          ,      240000,      480000,      360000
</pre>
</pre>
<h4>Example Detail Report</h4>
You can omit the *.ap2 and inspect only compiler feedback.
<pre>
Note that the ''-h profile_generate'' option disables most automatic compiler optimizations, which is why Cray recommends generating this data '''separately''' from generating the program_library file.
Process 0: System details
 
  Wall clock time              ,        0.427
== Apprentice2 ==
  User CPU time                ,        0.427
Cray Apprentice2 is a post-processing performance data visualization tool, which takes *.ap2 files as input.
  Processor clock (GHz)        ,       2.400
 
  Hostname                      ,      salmon
Main features are:
Process 0: PAPI details
*Call graph profile
  Total processor cycles        ,    791855041
*Communication statistics
  User processor cycles        ,    791855669
*Time-line view for Communication and IO.  
  FP add pipe ops              ,      383795
*Activity view
  FP multiply pipe ops          ,      158558
*Pair-wise communication statistics
  Packed SSE instructions      ,      444353
*Text reports
  L1 data cache accesses        ,    337758406
It helps identify:
  Processor clock (GHz)        ,       2.400
*Load imbalance
Process 0: STDIO details
*Excessive communication
  Total I/O time                ,        0.322
*Network contention
  TOtal I/O (MB)                ,        0.001
*Excessive serialization
  Total write time              ,        0.322
*I/O Problems
  Total write (MB)              ,        0.001
 
  fwrite calls                  ,          330
<pre>$> module load perftools-base
  fwrite time                  ,        0.322
$> app2 *.ap2 & </pre>
  fwrite total (MB)            ,        0.001
 
Process 0: Heap details
If the full trace is enabled (using the environment variable ''PAT_RT_SUMMARY=0''), a time line view is activated, which helps to see communication bottlenecks. But please use it only for small experiments !
  Maximum heap size (MB)        ,      66.542
  Average block size (KB)      ,      140.310
  Maximum blocks allocated      ,          129
  Total blocks allocated        ,          486
Process 0: MPI details
  Elapsed time                  ,        0.330
  Communication time            ,        0.002
  Wait time                    ,        0.001
  MPI_Send total calls          ,          10
  MPI_Send average bytes        ,        1600
  MPI_Send total bytes          ,        16000
  MPI_Ssend total calls        ,          10
  MPI_Ssend average bytes      ,        2664
  MPI_Ssend total bytes        ,        26640
  MPI_Rsend total calls        ,          10
  MPI_Rsend average bytes      ,        2664
  MPI_Rsend total bytes        ,        26640
  MPI_Recv total calls          ,          40
  MPI_Recv average bytes        ,        6000
  MPI_Recv total bytes          ,      240000
</pre>
<h4>Ignored Processes</h4>
<p>
By default, no profile is written for some processes.
These processes are part of regular job launch on Cray XT systems.
</p>
<ul>
<li>PMI shepherd process in MPI programs.</li>
<li>The ALPS aprun command.</li>
<li>The ALPS gzip utility (/usr/bin/gzip).</li>
</ul>
<br>


<hr>
You can istall Apprentice2 on you local machine. It is available from a Cray login node
*module load perftools-base
* Go to: $CRAYPAT_ROOT/share/desktop_installers/
* Download .dmg or .exe installer to laptop
* Double click on installer and follow directions to install


</body>
== Cray Profiler ==
</html>
The Cray profiler library is deprecated, but still available on the system. A description can be found [[ CrayProfiler | here ]]

Latest revision as of 15:36, 9 October 2016

Cray provides several official tools. Below is a list of some of the tools; you can get more information about them in the online manual (man atp, for example).

At HLRS, Cray also supports some tools with limited or no official support. Besides CrayPAT, the Cray Profiler is currently also available.

ATP : Abnormal Termination Processing

This tool can be used when the application crashes, e.g. with a segmentation fault. Abnormal Termination Processing (ATP) is a system that monitors Cray system user applications. If an application takes a system trap, ATP performs analysis on the dying application. In the stderr a stack walkback of the crashing rank is presented. In the following example, rank 1 crashes:

Application 5408137 is crashing. ATP analysis proceeding...

ATP Stack walkback for Rank 1 starting:
  _start@start.S:113
  __libc_start_main@libc-start.c:242
  main@0x200015f6
  ConverseInit@0x20255a59
  _processHandler(void*, CkCoreState*)@0x201a320d
  CkDeliverMessageFree@0x2019d402
  CkArray::recvBroadcast(CkMessage*)@0x201c50f7
  CkArrayBroadcaster::deliver(CkArrayMessage*, ArrayElement*, bool)@0x201c4c30
  CkIndex_TreePiece::_call_drift_marshall51(void*, void*)@0x20051cb9
  TreePiece::drift(double, int, int, double, double, int, bool, CkCallback const&)@0x200169eb
  ArrayElement::contribute(int, void const*, CkReduction::reducerType, CkCallback const&, unsig
ned short)@0x201c213a
  CkReductionMsg::buildNew(int, void const*, CkReduction::reducerType, CkReductionMsg*)@0x201cd
f20
  memcpy@memcpy.S:196
ATP Stack walkback for Rank 1 done
Process died with signal 11: 'Segmentation fault'
Forcing core dump of rank 1
View application merged backtrace tree with: stat-view atpMergedBT.dot
You may need to: module load stat

_pmiu_daemon(SIGCHLD): [NID 07469] [c2-3c2s11n1] [Fri Sep 23 10:37:51 2016] PE RANK 0 exit sign
al Killed
[NID 07469] 2016-09-23 10:37:51 Apid 5408137: initiated application termination

In this example, memcpy called from CkReductionMsg::buildNew seems to have an issue.

In addition to the text output, stack backtraces of ALL the application processes are gathered into a merged stack backtrace tree and written to disk as the file, atpMergedBT.dot. The stack backtrace tree for the first process to die is sent to stderr as is the number of the signal that caused the application to fail. If Linux core dumping is enabled (see ulimit or limit in your shell documentation), a heuristically selected set of processes also dump their cores.

The atpMergedBT.dot file can be viewed with stat-view (the Stack Trace Analysis Tool viewer), which is included in the Cray Debugger Support Tools (module load stat), or alternatively with the file viewer dotty, which can be found on most Linux systems. The merged stack backtrace tree provides a concise yet comprehensive view of what the application was doing at the time of its termination.

At HLRS the ATP module is loaded by default. To use it you have to set

export ATP_ENABLED=1

in your batch script. ATP also writes core files for a few important ranks, provided core dumps are enabled, e.g. via

ulimit -c unlimited
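
A minimal job script sketch combining both settings (the aprun options and the executable name are placeholders):

#!/bin/bash
#PBS -l nodes=1:ppn=24
#PBS -l walltime=00:10:00
#PBS -j oe
#PBS -o job.out

cd $PBS_O_WORKDIR
export ATP_ENABLED=1      # let ATP analyze the application on abnormal termination
ulimit -c unlimited       # allow core dumps so that ATP can force them for selected ranks
aprun -n 24 ./app.exe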

STAT : Stack Trace Analysis Tool

Stack Trace Analysis Tool (STAT) is a cross-platform tool from the University of Wisconsin-Madison. It gathers and merges stack traces from a running application's parallel processes and creates a call-graph prefix tree, a compressed representation that allows scalable visualization and scalable analysis. It is very useful when an application seems to be stuck/hung. Full information including use cases is available at http://www.paradyn.org/STAT/STAT.html. STAT scales to many thousands of concurrent processes.

To use it, you simply load the module and attach it to your running/hanging application.

$> module load stat
$> qsub  job.pbs
	#start the application e.g. using a batch script
	#Wait until application reaches the suspicious state
$> STATGUI <JOBID> 
	#Launches the graphical interface
	#Attach to the job
	#Shows the calltree
$> qdel <JOBID>
	#Terminate the running application

IOBUF - I/O buffering library

IOBUF is an I/O buffering library that can reduce the I/O wait time for programs that read or write large files sequentially. IOBUF intercepts I/O system calls such as read and open and adds a layer of buffering, thus improving program performance by enabling asynchronous prefetching and caching of file data.

IOBUF can also gather runtime statistics and print a summary report of I/O activity for each file.

In general, no program source changes are needed in order to take advantage of IOBUF. Instead, IOBUF is implemented by following these steps:

Load the IOBUF module:

% module load iobuf

Relink the program. Set the IOBUF_PARAMS environment variable as needed.

% export IOBUF_PARAMS='*:verbose'

Execute the program.

If a memory allocation error occurs, buffering is reduced or disabled for that file and a diagnostic is printed to stderr. When the file is opened, a single buffer is allocated if buffering is enabled. The allocation of additional buffers is done when a buffer is needed. When a file is closed, its buffers are freed (unless asynchronous I/O is pending on the buffer and lazyclose is specified).

Please check the complete manual and all environment variables available by reading the man page (man iobuf, after loading the iobuf module)
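
For illustration, buffering can also be tuned per file pattern via IOBUF_PARAMS; the pattern and values below are only an example, and the exact keywords (count, size, verbose, ...) are documented in the man page:

export IOBUF_PARAMS='*.dat:count=4:size=16M,*:verbose'   # e.g. four 16 MB buffers for *.dat files, I/O statistics for all files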

 IMPORTANT NOTICE : As iobuf is written for serial IO, its behavior is undefined 
 when used for parallel I/O into a single file. 

You should never use IOBUF when several parallel processes operate on the same file.

Perftools : Performance Analysis Tool Kit

The Cray Performance Measurement and Analysis Tools (CrayPat) are a suite of optional utilities that enable you to capture and analyze performance data generated during the execution of your program on a Cray system. The information collected and the analysis produced by these tools can help you answer two fundamental programming questions: How fast is my program running? And how can I make it run faster? Detailed documentation about CrayPAT can be found in document S-2376-622. Here a short summary is presented, concentrating on usage.

Profiling mainly distinguishes between two run cases, sampling and tracing:

Sampling
  Advantages
  • Only need to instrument the main routine
  • Low overhead – depends only on the sampling frequency
  • Smaller volumes of data produced
  Disadvantages
  • Only statistical averages available
  • Limited information from performance counters

Tracing
  Advantages
  • More accurate and more detailed information
  • Data collected from every traced function call, not statistical averages
  Disadvantages
  • Increased overhead as the number of function calls increases
  • Huge volumes of data generated

Automatic Profiling Analysis (APA), based on the fully adjustable CrayPAT, is a guided tracing approach that combines the advantages of sampling and tracing. Furthermore, event tracing can be enhanced by loop profiling.


perftools-base should be loaded as a starting point. It provides access to man pages, Reveal, Cray Apprentice2, and the instrumentation modules. This module can be kept loaded without impact on applications. The following instrumentation modules are available:

GENERAL REMARKS: The instrumented binary MUST run on a Lustre file system! Always check that the instrumented binary has not affected the run time notably compared to the original. Collecting event traces of large numbers of frequently called functions, or setting the sampling interval very low, can introduce a lot of overhead (check the trace-text-size option of pat_build). The runtime analysis can be modified through environment variables of the form PAT_RT_*.
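
For example, the full-trace option mentioned in the Apprentice2 section below is such a runtime variable; it is exported in the batch script before the aprun call:

export PAT_RT_SUMMARY=0            # collect the full trace instead of a summary (small experiments only!)
aprun -n 384 -N 24 ./app.exe+pat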


CrayPAT

The perftools-lite modules provide a user-friendly way to auto-instrument your application for various profiling cases. The perftools module provides CrayPAT's full functionality. As described below, instrumentation and report generation can be triggered manually by specifying various options. In the following descriptions we assume a simple batch job script:

$> cat job.pbs
#!/bin/bash
#PBS -l nodes=1:ppn=24
#PBS -l walltime=00:10:00
#PBS -j oe
#PBS -o job.out

cd $PBS_O_WORKDIR
aprun -n 384 -N 24 <exe>

An application is instrumented and run using the following commands:

$> module load perftools-base
$> module load <CrayPAT-lite-module>
$> make clean; make # or what is necessary to rebuild your application
$> qsub job.pbs     # no changes needed for aprun inside this script 
$> less job.out

As a result a *.rpt and a *.ap2 file are created and the report is additionally printed to stdout.

Additional information and representation can be gathered using pat_report with the produced *.ap2 file.

$> pat_report <option> *.ap2 

Descriptions of the available options can be obtained using man pat_report.

You can inspect visually the created self-contained ap2 file using Apprentice2.

REMEMBER: After the experiment is complete, unload the perftools-lite-XXX module to prevent further program instrumentation. The perftools-base module can be kept loaded.

perftools-base

The perftools-base module provides access to man pages, utilities such as Reveal, Cray Apprentice2 and grid_order, and instrumentation modules. It does not add compiler flags to enable performance data collection (such as symbol table information), as the earlier perftools or perftools-lite did or the newly available instrumentation modules do. It is a low-impact module that does not alter program behavior and can be left loaded even when building and running programs without CrayPat instrumentation.


perftools-lite

This module provides the default CrayPat-lite profiling. It enables sampling of the application.

Besides other information, the profile by function group and function is presented in the report:

Table 1:  Profile by Function Group and Function (top 8 functions shown)

  Samp% |  Samp |  Imb. |  Imb. |Group
        |       |  Samp | Samp% | Function
        |       |       |       |  PE=HIDE
       
 100.0% | 263.4 |    -- |    -- |Total
|----------------------------------------------------------------------
|  78.0% | 205.3 |    -- |    -- |MPI
||---------------------------------------------------------------------
||  62.4% | 164.4 | 115.6 | 42.2% |mpi_bcast
||  10.4% |  27.4 | 186.6 | 89.1% |MPI_ALLREDUCE
||   4.7% |  12.4 |  86.6 | 89.3% |MPI_IPROBE
||=====================================================================
|  13.1% |  34.5 |    -- |    -- |USER
||---------------------------------------------------------------------
...
|======================================================================

The stack traces of all processes are merged, and the combined information is presented as relative and absolute values of the counted samples per group/function, together with the imbalance between processes.

perftools-lite-events

This module enables CrayPAT's event tracing of applications. After loading the module, recompiling/relinking the application and submitting the job as usual, the report is written in the way described above. In contrast to sampling, event tracing reports the real time spent in groups/functions.
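
The workflow is the same as for the other lite modules, only the instrumentation module differs, e.g.:

$> module load perftools-base
$> module load perftools-lite-events
$> make clean; make     # rebuild so that event tracing is compiled/linked in
$> qsub job.pbs
$> less job.out         # tracing report: real time per function group / function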

perftools-lite-loops

This module enables CrayPat-lite loop work estimates. It must be used with the Cray compiler. After proceeding in the way described above, loop work estimates are sent to stdout and to an .ap2 file. Performance data can be combined with source code information and compiler annotations by loading the .ap2 file into Reveal. The module modifies the compile and link steps to include CCE's -h profile_generate option and instruments the program for tracing (pat_build -w). Remember that -h profile_generate reduces compiler optimization levels. After the experiment is complete, unload perftools-lite-loops to prevent further program instrumentation.

Table 1:  Inclusive and Exclusive Time in Loops (from -hprofile_generate)
  Loop | Loop Incl |      Time |    Loop |  Loop |  Loop |  Loop |Function=/.LOOP[.]
  Incl |      Time |     (Loop |     Hit | Trips | Trips | Trips | PE=HIDE
 Time% |           |     Adj.) |         |   Avg |   Min |   Max |
|-----------------------------------------------------------------------------
| 93.0% | 19.232051 |  0.000849 |       2 |  26.5 |     3 |    50 |jacobi.LOOP.1.li.236 
| 77.8% | 16.092021 |  0.001350 |      53 | 255.0 |   255 |   255 |jacobi.LOOP.2.li.240 
| 77.8% | 16.090671 |  0.110827 |   13515 | 255.0 |   255 |   255 |jacobi.LOOP.3.li.241 
| 77.3% | 15.979844 | 15.979844 | 3446325 | 511.0 |   511 |   511 |jacobi.LOOP.4.li.242 
| 14.1% |  2.906115 |  0.001238 |      53 | 255.0 |   255 |   255 |jacobi.LOOP.5.li.263

perftools

In contrast to the perftools-lite modules, which instrument and report automatically, the perftools module requires manual instrumentation and report generation:

$> module load perftools-base
$> module load perftools
$> make clean; make	# If your application is already built with perftools loaded you do not have to rebuild when switching the experiment.
$> pat_build <pat_options> app.exe	# pat_options are described below; Creates instrumented binary app.exe+pat
$> qsub job.pbs         # ATTENTION: now you have to use the new instrumented binary "aprun <options> ./app.exe+pat"
$> pat_report -o myrep.txt app.exe+pat+*  # .xf file or related directory

Running the “+pat” binary creates a data file or directory. pat_report reads that data file and prints lots of human-readable performance data. It also creates an *.ap2 file which contains all profiling data. (The app.exe+pat+* file/directory can be deleted after the creation of the .ap2 file)

The instrumentation can be adjusted using pat_build options, which are listed in man pat_build, some few commonly used options are:

pat_build option   Description
(default)          Sampling profile
-u                 Trace functions in source files owned by the user
-w                 Make tracing the default experiment
-T <func>          Trace the specified function
-t <file>          Trace all functions in the specified file
-g <group>         Instrument all functions belonging to the specified trace function group, e.g. blas, io, mpi, netcdf, syscall
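
For illustration, these options can be combined; the function name below is a placeholder:

$> pat_build -u -g mpi app.exe           # trace the user's own functions plus the MPI function group
$> pat_build -w -T my_solver_ app.exe    # tracing experiment, additionally tracing the function my_solver_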

It should be noted that only true function calls can be traced. Functions that are inlined by the compiler or that have local scope in a compilation unit cannot be traced.

The pat_report tool combines information from the *.xf output (raw data files, optimized for writing to disk). During this conversion the instrumented binary must still exist. As a result, an *.ap2 file is produced, which is a compressed performance file optimized for visualization and analysis. The ap2 file is the input for subsequent pat_report calls and for Reveal or Apprentice2. Once the ap2 file is generated, the *.xf files and the instrumented binary can be removed. Many options for sorting, slicing or dicing the data in the tables are provided using

$> pat_report -O <table option> *.ap2
$> pat_report -O help (list of available profiles)

The volume and type of information depend on sampling vs. tracing. Several output formats {plot | rpt | ap2 | ap2-xml | ap2-txt | xf-xml | xf-txt | html} are available through the -f option. Furthermore, the gathered data can be filtered using

$> pat_report -sfilter_input='condition' ...

where the 'condition' could be an expression involving 'pe' such as 'pe<1024' or 'pe%2==0'.
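
A combined example (the table name and the rank filter are illustrative; use pat_report -O help for the actual list of tables):

$> pat_report -O profile -sfilter_input='pe<1024' -o first_ranks.txt *.ap2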

Loop work estimates can be collected by using the CCE compiler option -h profile_generate together with the tracing experiment described above. It is recommended to turn off OpenMP and OpenACC for the loop work estimates via -h noomp -h noacc.

Hardware counter selection can be enabled using export PAT_RT_PERFCTR=<group> | <event list>, where the related groups and events can be listed using man hwpc and papi_avail, or pat_help -> counters.
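
For example (the chosen PAPI events are illustrative; availability depends on the processor, check papi_avail):

export PAT_RT_PERFCTR=PAPI_TOT_INS,PAPI_L1_DCM    # instructions completed and L1 data cache misses, set before the aprun call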

Energy information can be gathered using pat_report -O program_energy *.ap2.


Automatic Profiling Analysis (APA)

The advantages of sampling and tracing are combined in the guided profiling approach APA. Its targets are large, long-running programs (in general, a full trace would inject considerable overhead). The goal is to limit tracing to those functions that consume the most time. As a procedure, a preliminary sampling experiment is used to determine and then instrument the functions consuming the most time.

$> module load perftools
$> make clean; make
$> pat_build app.exe                      # The APA is the default experiment. No option needed.
$> qsub job.pbs                           # using the new instrumented binary in "aprun <option> ./app.exe+pat"
$> pat_report -o myrep.txt app.exe+pat+*

$> vi *.apa                               # The *.apa file contains instructions for the next instrumentation step. Modify it according to your needs.
$> pat_build -O *.apa                     # Generates an instrumented binary app.exe+apa for tracing
$> qsub job.pbs                           # using the new instrumented binary in "aprun <option> ./app.exe+apa"
$> pat_report -o myrep.txt app.exe+apa+*  # .xf file or related directory

Reveal

Reveal is Cray's next-generation integrated performance analysis and code optimization tool. Its main features are:

  • inspecting a combined view of loop work estimates and source code (compiler annotations)
  • assisting an OpenMP port

For an OpenMP port a developer has to understand the scoping of the variables, i.e. whether variables are shared or private. Reveal assists by navigating through the source code using whole-program analysis (data provided by the Cray compilation environment via listing files) and couples this with performance data collected during execution by CrayPAT. It identifies which high-level serial loops could benefit from parallelism. It gathers and presents dependency information for the targeted loops and assists users in optimizing their code by providing variable scoping feedback and suggested compiler directives.

Usage:

$> module load perftools-base
$> ftn -O3 -hpl=my_program.pl -c my_program_file1.f90
$> ftn -O3 -hpl=my_program.pl -c my_program_file2.f90 # Compile every source file with -hpl to build up the program library
# run instrumented binary to gather performance data using loop work estimation (see above)
$> reveal my_program.pl my_program.ap2 &

You can omit the *.ap2 and inspect only compiler feedback. Note that the -h profile_generate option disables most automatic compiler optimizations, which is why Cray recommends generating this data separately from generating the program_library file.

Apprentice2

Cray Apprentice2 is a post-processing performance data visualization tool, which takes *.ap2 files as input.

Main features are:

  • Call graph profile
  • Communication statistics
  • Time-line view for Communication and IO.
  • Activity view
  • Pair-wise communication statistics
  • Text reports

It helps identify:

  • Load imbalance
  • Excessive communication
  • Network contention
  • Excessive serialization
  • I/O Problems

$> module load perftools-base
$> app2 *.ap2 &

If the full trace is enabled (using the environment variable PAT_RT_SUMMARY=0), a time-line view is activated, which helps to identify communication bottlenecks. But please use it only for small experiments!

You can install Apprentice2 on your local machine. It is available from a Cray login node:

  • module load perftools-base
  • Go to: $CRAYPAT_ROOT/share/desktop_installers/
  • Download .dmg or .exe installer to laptop
  • Double click on installer and follow directions to install

Cray Profiler

The Cray profiler library is deprecated, but still available on the system. A description can be found here