- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

CRAY XC40 Tools

Cray provided tools

Cray provides several official tools. Below is a list of some of them; you can get more information about each in the online manuals (man atp, for example).


At HLRS, Cray also makes some tools available with limited or no support. Currently available is the Cray Profiler.

ATP : Abnormal Termination Processing

Warning: Doesn't work yet.

Abnormal Termination Processing (ATP) is a system that monitors Cray system user applications. If an application takes a system trap, ATP performs analysis on the dying application. All stack backtraces of the application processes are gathered into a merged stack backtrace tree and written to disk as the file atpMergedBT.dot. The stack backtrace tree for the first process to die is sent to stderr, as is the number of the signal that caused the application to fail. If Linux core dumping is enabled (see ulimit or limit in your shell documentation), a heuristically selected set of processes also dump their cores.

The atpMergedBT.dot file can be viewed with statview (the Stack Trace Analysis Tool viewer), which is included in the Cray Debugger Support Tools (module load stat), or alternatively with the file viewer dotty, which can be found on most Linux systems. The merged stack backtrace tree provides a concise yet comprehensive view of what the application was doing at the time of its termination.

At HLRS ATP is disabled by default. To use it you have to set ATP_ENABLED=1 in your batch script.
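A minimal batch-script sketch is shown below; the resource request line is only a placeholder and should be adjusted to your job:

 #!/bin/bash
 #PBS -l nodes=1:ppn=24,walltime=00:20:00
 cd $PBS_O_WORKDIR
 export ATP_ENABLED=1        # ATP is disabled by default at HLRS
 aprun -n 24 ./a.out
 # after an abnormal termination, inspect the merged backtrace with:
 #   statview atpMergedBT.dot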

STAT : Stack Trace Analysis Tool

Warning: Doesn't work yet.

STAT is a tool for collecting tracebacks of a running program. You use statview to view the output of STAT; both tools are part of the stat module.

STAT needs the process id of the aprun (apid) command which runs your program.
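A hypothetical session could look like the following; the exact STAT command name and its output location are assumptions, so check the documentation after loading the module:

 module load stat
 # determine the process id of the aprun command that launched your job (e.g. with ps)
 STAT <pid of aprun>              # gather and merge the stack backtraces (command name assumed)
 statview <resulting .dot file>   # view the merged backtrace tree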


IOBUF - I/O buffering library

IOBUF is an I/O buffering library that can reduce the I/O wait time for programs that read or write large files sequentially. IOBUF intercepts I/O system calls such as read and open and adds a layer of buffering, thus improving program performance by enabling asynchronous prefetching and caching of file data.

IOBUF can also gather runtime statistics and print a summary report of I/O activity for each file.

In general, no program source changes are needed in order to take advantage of IOBUF. Instead, IOBUF is implemented by following these steps:

Load the IOBUF module:

% module load iobuf

Relink the program. Set the IOBUF_PARAMS environment variable as needed.

% export IOBUF_PARAMS='*:verbose'

Execute the program.
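Putting these steps together, a minimal sketch (using ftn as an example link command and only the verbose parameter shown above) is:

 module load iobuf
 ftn -o my_prog my_prog.o          # relink so the buffering layer is added
 export IOBUF_PARAMS='*:verbose'   # request a per-file I/O summary
 aprun -n 24 ./my_prog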

If a memory allocation error occurs, buffering is reduced or disabled for that file and a diagnostic is printed to stderr. When the file is opened, a single buffer is allocated if buffering is enabled. The allocation of additional buffers is done when a buffer is needed. When a file is closed, its buffers are freed (unless asynchronous I/O is pending on the buffer and lazyclose is specified).

Please check the complete manual and all available environment variables by reading the man page (man iobuf, after loading the iobuf module).

 IMPORTANT NOTICE : As iobuf is written for serial IO, its behavior is undefined 
 when used for parallel I/O into a single file. 

You should never use IOBUF when several parallel processes operate on a single file.

Cray Profiler : Function-level instrumentation for MPI, SHMEM, heap, and I/O

To use this library, load the "tools/cray_profiler" module and relink your application.

When an application is run with the profiler library, the profiler becomes active when MPI_Init() is called, MPI calls are timed during the run, and a "profile.txt" file is written when MPI_Finalize() is called.


THE SOFTWARE IS PROVIDED "AS IS", WITH ALL FAULTS AND WITHOUT WARRANTY
OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
WARRANTIES OF MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.
THIS SOFTWARE MAY NOT BE REDISTRIBUTED OR SUBLICENSED WITHOUT WRITTEN
PERMISSION FROM CRAY INC.

Usage

module load tools/cray_profiler
relink your executable
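A minimal sketch of these two steps (the ftn wrapper and file names are placeholders; the module is expected to add the profiler library to the link line):

 module load tools/cray_profiler
 ftn -o my_prog my_prog.o      # relink with the profiler module loaded
 aprun -n 24 ./my_prog         # profile.txt is written when MPI_Finalize is called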

Description

Profiler is a library which intercepts calls to other libraries, and collects statistics to be printed at the end of the run to the profile.txt report file.

A summary showing minimum, maximum, and average values for the processes in the run is included, followed by per-process details. The summary is printed when MPI_Finalize or shmem_finalize is called, and includes activity up to that point. The per-process details are printed when the program terminates, and includes activity for the full run. The summary report is omitted for single process jobs. For MPMD jobs, the summary is repeated for each application in the job. Summaries on arbitrary ranges of processes can be printed by setting PROFILER_GROUPS. The report format has comma-separated fields for easy parsing or importing into spreadsheet tools such as Excel.

The intercepts which are implemented are: MPI (including MPI-IO), SHMEM (including symmetric heap usage), standard I/O, POSIX I/O, heap allocation, Global Arrays, and ARMCI. In addition, system statistics such as user CPU time are collected, and hardware performance counters are optionally collected. Profiler is compatible with IOBUF.

Additional features are available and are detailed below.

Linking Profiler

Since the profiler implementation is based on adding a layer of intercept functions to gather many of its statistics, profiler must be linked between a reference to a routine and its definition. This is easily done in cases where an application contains calls to a library such as MPI by linking profiler after the application object files but before the MPI library.

On Cray XT/XE/XC systems, linking can often be done implicitly by loading the cray_profiler module before linking the application. To ensure proper linking order, load the cray_profiler module after all other library modules. If module load order can't be guaranteed, adding explicit library references to the link command can define the order. For example, using -lprofiler -lmpich will link profiler before MPICH.

For the case where a library makes calls to another library, adding an extra link option to include profiler between the two libraries is necessary. For example, to instrument the calls to ARMCI made from the Global Arrays library, use link options like -lga -lprofiler -larmci.

For some programs, it is necessary to include an extra -Wl,-uprofiler option when linking. This option forces profiler to be included. Some types of programs, in particular pure-Fortran serial programs, have no profiled function intercepts at the program level, but including -lprofiler after the appropriate libraries is difficult. This link option is included automatically when the cray_profiler module is loaded.
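The link-order rules above can also be given explicitly on the command line; a sketch (compiler wrapper and object names are placeholders):

 # link profiler between the application objects and the MPI library
 cc -o app app.o -lprofiler -lmpich
 # instrument the ARMCI calls made from the Global Arrays library
 cc -o app app.o -lga -lprofiler -larmci
 # force inclusion of profiler when no intercepts exist at program level
 cc -o app app.o -Wl,-uprofiler -lprofiler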

Dynamic Linking

The profiler module references the static profiler library by linking as -lprofiler. This is the preferred way to include profiler in dynamically linked executables. Profiler can also be included by the dynamic linker at run time by setting the LD_PRELOAD environment variable. LD_PRELOAD only works with dynamically linked executables.

Care must be taken when setting LD_PRELOAD, because it affects all dynamically linked executables. This includes system commands like cp and mv, and also includes aprun. The preferred way to run a compute-node program launched by aprun is to use the LD_PRELOAD feature to include profiler in a prelinked executable. To make this easier, the cray_profiler module provides an environment variable named CRAY_PROFILER_PRELOAD, which is used as in the following command:

aprun -n X ... $CRAY_PROFILER_PRELOAD a.out

This method ensures that only the application program (a.out in the example above) will be profiled. If LD_PRELOAD is set before the aprun command, aprun and all dynamically linked executables started by aprun will be profiled. Profiler attempts to identify this situation and will avoid writing a profile if the command name is aprun or gzip. So while not recommended, the following syntax also works:

env LD_PRELOAD=/opt/hlrs/tools/cray_profiler/201303/libprofiler-shared.so aprun ... a.out

If profiler has been explicitly linked into an executable, do not use the LD_PRELOAD method to include profiler. This may result in two versions of profiler executing simultaneously, with unpredictable results. The most likely outcome is that profiles are written by both the static and dynamic versions of profiler, resulting in garbled, truncated, or incorrect output.

Environment Variables

The following environment variables can be set to change the behavior of profiler.

Environment Variables
Variable Values Default Description
PROFILER_BIN 0, 1, or module list 0 (disabled) If enabled, prints a breakdown of messages or I/O based on size. Nine bins are defined, starting with zero and 1-16 bytes, then increasing as multiples of 16 to a maximum bin size of 4 GB. Also, specifying samp will report IP sampling by source file line number.
PROFILER_CALLER 0, 1, or module list 0 (not enabled) If enabled, captures an extra level of detail by reporting timings based on the routine caller via a call stack traceback. PROFILER_COLLECT must be enabled for the module.
PROFILER_COLLECT 0, 1, or module list 1 (enabled) Control timing and summary reporting of profiled routines. If timing is turned off, the profiled routines are still intercepted, but no data collection is performed. When the profile report is printed, any modules which had timing collection disabled are skipped. The samp module is off by default, so it must be explicitly enabled by setting PROFILER_COLLECT to samp or all,samp.
PROFILER_COLLECTIVE_SYNC 0, 1, or module list 1 (enabled) If enabled, all collectives are preceded with a barrier to measure collective synchronization time. Can be used with mpi and shmem.
PROFILER_DEBUG 0, 1, or module list 0 (disabled) Print profiler debugging information.
PROFILER_DETAIL 0, 1, or module list 1 (enabled) Print per-process profiling report. The default is 1 (enabled) if the parallel run size is less than 100 processes, otherwise the default is 0 (disabled). The summary report will still be printed for a parallel run even when the detail report is disabled. This setting is ignored for serial runs; the detail report is always printed.
PROFILER_DISABLE 0 or 1 0 If set to 1, then profiling is disabled and no report is printed. This allows an executable linked with profiler to be run with no profiling.
PROFILER_GROUPS ranks list No groups Each range of ranks x-y given results in a separate summary report giving the min, max, and average values for the processes within the range. See Profiler summary groups below.

PROFILER_LABEL 0, 1 0 (disabled) If enabled, adds MPI process rank labels to the beginning of every line printed to stdout (standard output). If only labels are desired, and no profiling, also set PROFILER_DISABLE=1.
PROFILER_OUTPUT filename profile.txt Profile report output file name. Special tokens are expanded at run time and can be used to create unique output files for multiple runs (see Profiler output file name below).

PROFILER_PAPI counter list A comma-separated list of PAPI event names (preset or native). This is ignored if the PAPI module is not active. The default and maximum list size depends upon the processor type, but usually includes the number of floating point operations.
PROFILER_PARTNERS integer 0 The number of top point-to-point message partners reported. If set to all, all destination processes are listed.
PROFILER_RANKS ranks list - (all ranks) List of processes enabled for recording (see PROFILER_RECORD). The value is a semicolon-separated list of ranks or rank ranges. A rank is a single integer, and a range x-y includes the ranks from x to y. If x is omitted, then all ranks less than or equal to y are selected. If y is omitted, then all ranks equal to or greater than x are selected. For example, 0 selects rank 0 only, while 1- selects all ranks except 0.
PROFILER_PLAYBACK filename None (disabled) This is the name of the input file for playback (a record.rank trace file written by a run with PROFILER_RECORD enabled).
PROFILER_RECORD 0, 1, or module list 0 (disabled) If enabled, each process rank writes a file name record.rank for each supported routine. The file contains the results of the routine call (return value and modified argument variables) along with the time to execute the call. A subset of all processes can be selected by setting PROFILER_RANKS.
PROFILER_TRACE 0, 1, or module list 0 (disabled) Print a trace of profiled routines when they are called, including arguments and return values. The traces are printed to stderr.
PROFILER_SHEPHERD 0 or 1 1 (enabled) By default, no profile is written for a PMI shepherd process. A PMI shepherd process is identified as a program which contains MPI calls but does not call any MPI functions. In some cases programs are mistakenly identified as PMI shepherd processes and no profile is written. If set to 0, the check for a PMI shepherd process is skipped, so a profile is written.
PROFILER_TRAP 0 or 1 1 (enabled) By default, a set of signals is caught when a fault condition occurs, and a profile is written. The signals caught are: SIGHUP, SIGINT, SIGILL, SIGQUIT, SIGABRT, SIGIOT, SIGBUS, SIGFPE, SIGSEGV, SIGPIPE, and SIGTERM. If set to 0, the signals are not caught and no profile is written when a fault occurs.
PROFILER_UNWIND_LEVELS integer 0 The number of call stack levels to be included in the profile by caller report. This can have the effect of expanding the number of results in the report, because each unique call path is reported separately. Requires caller profiling to be enabled (PROFILER_CALLER set).
PROFILER_WRAPPER_LEVELS integer 0 The number of call stack levels to be discarded when unwinding the call stack. This is used when a wrapper library is in use by the application, but the call sites to the wrapper functions are of interest. Wrappers are sometimes used in programs which support multiple communications packages, such as MPI and shmem, where the package is a compile-time option for the wrappers. Requires caller profiling to be enabled (PROFILER_CALLER set).
PROFILER_WARN 0, 1, or module list 1 (enabled) Print warning messages for profiler errors.
PROFILER_VERBOSE 0, 1, or module list 0 (disabled) All available statistics are printed. The default is to suppress values which are zero.

A setting of 1 selects that feature for all modules, and 0 deselects it. A synonym for 1 is all, while !all is the same as 0.

A module list provides more selectivity when selecting features. The value is a comma-separated list of one or more of the following keywords. If a ! precedes a keyword, the feature is deselected for that module. If a module is not given in the list, the feature is deselected for that module even if the feature is enabled by default. As a shorthand, if the list begins with a deselected module, then unlisted modules are selected by default. So mpi selects only the MPI module, while !mpi selects all modules except MPI.

Not all keywords are significant for all environment variables.
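For example, the module list syntax can be used with PROFILER_COLLECT as follows (only keywords documented below are used):

 export PROFILER_COLLECT=mpi        # collect data for the MPI module only
 export PROFILER_COLLECT=!mpi       # collect data for all modules except MPI
 export PROFILER_COLLECT=all,samp   # all default modules plus IP sampling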

Module Names
Keyword Description
armci ARMCI data transfer functions. Requires that the application also calls MPI or shmem.
all Select all modules. Useful to change the default for unlisted modules from deselected to selected. Should always be specified as the first keyword. The samp module is not part of all, so it needs to be explicitly selected.
caller Profile by caller module extension.
ga Global Arrays data transfer functions. Requires that the application also calls MPI or shmem.
heap Heap functions (malloc, free, etc.)
mpi MPI functions (except MPI I/O).
mpio MPI I/O functions.
papi Hardware performance counters.
partners MPI point-to-point message partners. Use with PROFILER_VERBOSE to get per-function reports.
pio POSIX I/O calls (read, write, open, close, etc.).
prof The profiler itself. Used with PROFILER_DISABLE, PROFILER_WARN and PROFILER_DEBUG.
samp Instruction pointer sampling.
shmem Shmem functions (except symmetric heap allocation).
stdio Fortran I/O and standard I/O functions (fopen, fread, printf, etc.)
symheap Symmetric heap functions (shmalloc, etc.).
sys System statistics. This is a default module for all runs, and includes values like total run time, memory usage, and hardware configuration. Also includes floating point exception reporting.
syscall System calls.

Profiler summary groups

The default behavior for parallel runs is to print a summary report showing the minimum, maximum, and average value for each profiler result for all of the processes in the run. Setting PROFILER_GROUPS creates additional summary reports for the specified groups of processes. This feature requires MPI, and does not work with shmem-only programs.

PROFILER_GROUPS is set to comma-separated list of process rank ranges. A range can be a single integer, or a range x-y which includes the ranks from x to y. If x is omitted, then all ranks less than or equal to y are selected. If y is omitted, then all ranks equal to or greater than x are selected. For example, 0 selects rank 0 only, while 1- selects all ranks except 0.

Setting PROFILER_GROUPS=0-7,8-15 for a run with 16 processes will print three summary reports. The first report summarizes all 16 processes, the second report summarizes processes 0 through 7, and the third report summarizes processes 8 through 15.
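On the command line, the example above corresponds to:

 export PROFILER_GROUPS=0-7,8-15
 aprun -n 16 ./a.out      # one overall summary plus one summary per group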

An optional :stride suffix can be included on any process rank range. The default stride is 1.

Up to 10 process group ranges can be specified, and the PROFILER_GROUPS value can be at most 127 characters long.

Profiler output file name

The PROFILER_OUTPUT environment variable can be set to change the default profile output file name (profile.txt). The value can include a relative path from the current directory, or an absolute path (if the string begins with a "/" character). The value can contain special tokens which are expanded at run time to create unique output files for multiple runs.

Filename Tokens
Token Substituted string
%h host name (node name)
%j batch job id (PBS or Moab/Torque)
%r process rank (MPI_COMM_WORLD rank)
%p process id (unique for every process on a node, but may be duplicated across nodes)
%s number of processes (MPI_COMM_WORLD size)
%t number of threads per process (OpenMP or pthreads)
%x executable name

Note that rank 0 writes the summary report, while every process writes detail reports (if enabled). This means that using %h, %p, or %r can result in multiple profile output files. For example, setting PROFILER_OUTPUT=profile_%r.txt and PROFILER_DETAIL=all will result in detail reports written to file profile_0.txt for rank 0, and so on.
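For instance, per-rank detail reports tagged with the batch job id can be requested as follows (a sketch using the %j and %r tokens from the table above):

 export PROFILER_OUTPUT=profile_%j_%r.txt
 export PROFILER_DETAIL=all
 aprun -n 16 ./a.out      # e.g. rank 0 writes profile_<jobid>_0.txt, and so on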

Profiler with PBS

The default report output file name includes the PBS job id when profiler is running under PBS or Torque. The file name format is profiler.jobid.txt.

Using the record and playback feature

The record and playback feature is meant to address a performance estimation issue when a large benchmark job can be run on an existing computer system, but only a much smaller configuration of a new system is available. It is not possible to run the full large job on the small system, but playback of a small portion of the large job on the small system is possible. Other possible uses are to assist in single-process optimization or debugging of a large job. Rather than rerunning the large job repeatedly, a process can be recorded and replayed many times as incremental changes are made, so long as the MPI communication is not altered.

In earlier releases of profiler, the procedure was to link with -lprofiler for the record step, but to link with -lplayback for the playback step. This difference was required because the profiler library supports the record feature, but playback was only supported by the playback version of the library. This is no longer the case: both record and playback are supported with -lprofiler, and the use of -lplayback is deprecated.

An important note is that for both the record and the playback steps, the program must also be linked with the native MPI library. Although MPI is not used for communication during playback (since playback is always for a single process with MPI messages coming from the trace file), the library is still necessary to resolve some remaining issues with playback.

Setting the PROFILER_RECORD and PROFILER_RANKS variables allows a small portion of a larger job to be captured in trace files. The trace file data can then be replayed on another system by setting PROFILER_PLAYBACK for one or more processes. With these steps, it is possible to obtain performance information from a small portion of a large job during the replay step.
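A minimal sketch of the two steps (the process counts and the choice of rank 49 are only illustrative, following the example further below):

 # record step: run the full job and trace the MPI calls of rank 49 only
 export PROFILER_RECORD=mpi
 export PROFILER_RANKS=49
 aprun -n 1024 ./a.out            # writes the trace file record.49

 # playback step: replay rank 49 as a single process on the target system
 unset PROFILER_RECORD
 export PROFILER_PLAYBACK=record.49
 aprun -n 1 ./a.out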

Currently only the mpi module implements the record and playback feature. When recording calls to MPI routines, the results of each call are saved to the trace file. Arguments to the call which are used as input data by the routine are not saved. Writing the trace file introduces overhead, so the job executes more slowly than a non-recording run.

During playback, each process executes the identical calls as the process did in the large job. The trace file provides the output variable values for each of the calls. In this way, the replay process experiences an environment which exactly replicates the original process.

For example, consider a large MPI job running on a system. If MPI process rank 49 calls MPI_Comm_rank(MPI_COMM_WORLD,rank) while PROFILER_RECORD is active, the rank returned by the call is saved to the trace file record.49. When the same program is run as a single process with PROFILER_PLAYBACK=record.49, the call to MPI_Comm_rank(MPI_COMM_WORLD,rank) will read the trace file and set rank equal to 49. If the MPI process calls MPI_Recv with PROFILER_RECORD active, the result of the call (the message buffer contents) is saved to the trace file. On playback, the same call to MPI_Recv will read the trace file and fill the buffer with the correct message data.

Because of timing differences between the recorded calls and the playback calls, the performance of the program is impacted. In order to get a useful performance estimate for the overall job, the program run time needs to be divided into categories such as compute, communication, and I/O. The compute time is least impacted by the profiler instrumentation, so the compute time of the single process can be compared directly to the compute time of the corresponding MPI process of the full job. When PROFILER_RECORD=mpi is used, communication time for record and playback is not useful. Communication time without record or playback should be captured with a separate run. Depending upon the systems involved, it may make more sense to use the I/O time from either the record or the playback system.

Trace file size can be an issue when using this procedure. Long-running and communication-intensive programs can generate trace files with many gigabytes of data. In most cases, a job with the full problem size and run time will not be possible.

There are possible issues with the playback environment that are outside the scope of profiler. All files read by the playback process must be available as in the original job, including temporary files written by another process (e.g., if an input file is partitioned by a master process into individual input files for each process). Also, environment variables needed by the process must be provided.

Run-time dynamic data collection control

Data collection can be controlled dynamically during program execution by calling profiler_enable(flag), where flag is 1 to enable data collection and 0 to disable it. When collection is disabled, statistics other than those reported by the System module are not collected. By the nature of the System module, some of the values reported (wall clock time, user CPU time, etc.) are for the whole program execution. However, if collection is disabled for some of the execution time, an additional statistic, Enabled wall time, is reported.

When collection is enabled, only those modules which were enabled via PROFILER_COLLECT (by default all modules) are enabled. If a subset of modules were selected via PROFILER_COLLECT, then only that subset is enabled and other modules remain disabled.

Regardless of the data collection status at the end of the run (enabled or disabled), statistics are still reported as long as PROFILER_DISABLE was not set. If PROFILER_DISABLE was set, no data collection and no reporting occurs.

Usage with IOBUF

Profiler can work with iobuf, but there are some quirks. iobuf combines all types of read and write calls, so individual call types like fprintf and fgets are not reported. The report sections from the profiler iobuf module are otherwise similar to the stdio module. Tracing of standard I/O calls is not possible when iobuf is in use.

Usage with PAPI

To select a non-default set of PAPI counter events, set PROFILER_PAPI to a comma-separated list of PAPI preset or native events. See the papi_opteron documentation for a list of Opteron native events.

The default set of hardware counters for Barcelona Opteron (quad core) and later are double precision floating point operations, single precision floating point operations, Packed SSE instructions, and SSE Merge MOV micro-ops (specified as RETIRED_SSE_OPERATIONS:DOUBLE_ADD_SUB_OPS:DOUBLE_MUL_OPS:DOUBLE_DIV_OPS:OP_TYPE, RETIRED_SSE_OPERATIONS:SINGLE_ADD_SUB_OPS:SINGLE_MUL_OPS:SINGLE_DIV_OPS:OP_TYPE, RETIRED_MMX_AND_FP_INSTRUCTIONS:PACKED_SSE_AND_SSE2, and RETIRED_MOVE_OPS:LOW_QW_MOVE_UOPS:HIGH_QW_MOVE_UOPS:ALL_OTHER_MERGING_MOVE_UOPS). The default set of hardware counters for Opteron prior to Barcelona (quad core) are Add pipe ops, Multiply pipe ops, Packed SSE instructions, and L1 data cache accesses (specified as PROFILER_PAPI=FP_ADD_PIPE,FP_MULT_PIPE,FR_FPU_SSE_SSE2_PACKED,DC_ACCESS). The maximum number of active counters is 4.
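A sketch of selecting a non-default counter set; the two event names are standard PAPI presets and are only assumed to be available on this processor:

 export PROFILER_PAPI=PAPI_FP_OPS,PAPI_L1_DCM
 aprun -n 16 ./a.out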

As an additional feature, when PAPI is available, an extra test is run on Catamount to determine the virtual memory page size (2 MB large pages or 4 KB small pages) by counting the number of TLB misses for a memory access loop. This heuristic provides the page size notation in the report.

MPI persistent request handling

Calls to create MPI persistent requests (MPI_Send_init, MPI_Recv_init, etc.) are reported, including the byte counts of the requests. Additionally, calls to the start, wait, and test routines include the total byte counts. For example, if a persistent request is created via MPI_Send_init with a message size of 100 bytes, then MPI_Start is called three times with this request, then 100 bytes will be reported for Send_init and 300 bytes will be reported for Start.

Reports for programs which exit early

If an MPI program exits before MPI_Finalize is called, if a fault such as segmentation violation occurs, or if an asynchronous signal such as SIGTERM from qdel is received, a profile is still written. In this case, no summary can be created because there is no way to coordinate the processes. All processes which encounter the condition will write individual detailed reports.

Point-to-point message report

When PROFILER_PARTNERS is set to an integer value, a report is generated showing the top destination process ranks for the point-to-point send calls made by each process. The point-to-point send calls are MPI_Send, MPI_Isend, MPI_Sendrecv, etc. MPI one-sided get/put, persistent requests, and shmem calls are not supported. If a process calls no point-to-point send functions, it is listed as "No messages sent" in the report.

The report lists each process rank and its top partners, giving the number of messages sent, the total number of bytes sent, and the time for the send operations. A simple example of the report using 2 processes and PROFILER_PARTNERS=1 is


Point-to-point message partners report.
   Src, Dest,        Count,        Bytes,         Time
     0,    1,          120,       250480,        0.002
     1,    0,          130,       250480,        0.030

Instruction pointer (IP) sampling

A profile breakdown by time spent in routines can be included in the report by setting PROFILER_COLLECT=samp (sampling only) or PROFILER_COLLECT=all,samp (regular profile report plus sampling). This feature is similar to the CrayPat default report for the sampling experiment. The program's instruction pointer is sampled at 10 millisecond intervals. Each sample taken while a routine is executing contributes to its total.

For an extra level of detail, specifying PROFILER_BIN=samp will add reporting by source code line number. For line number reporting, the routine must be compiled with -g or with the craypat module loaded (which automatically adds the -g option).
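A sketch of enabling sampling with per-line detail (the ftn wrapper and file names are placeholders):

 ftn -g -O3 -o my_prog my_prog.f90   # keep -g so source line numbers are available
 export PROFILER_COLLECT=all,samp    # regular profile report plus IP sampling
 export PROFILER_BIN=samp            # break the samples down by source line
 aprun -n 4 ./my_prog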

Below is an example of the IP sampling summary report.


IP samples:  34846 total samples
   Min,    Max,    Avg, Min PE, Max PE, Function
 17.0%,  17.2%,  17.1%,      2,      3, ngb_treefind_variable, ngb.c
 17.0%,  17.0%,  17.0%,      3,      0, ngb_treefind_pairs, ngb.c
 12.9%,  13.2%,  13.1%,      0,      2, force_treeevaluate_shortrange, forcetree.c
  8.7%,   8.8%,   8.7%,      1,      3, pmforce_periodic, pm_periodic.c
  6.6%,   6.8%,   6.7%,      1,      3, hydro_evaluate, hydra.c
  5.6%,   5.9%,   5.7%,      3,      2, density_evaluate, density.c
  2.5%,   2.7%,   2.6%,      1,      3, __exp, w_exp.c
  1.3%,   1.4%,   1.4%,      2,      0, __erfc, s_erf.c
  1.0%,   1.2%,   1.1%,      2,      1, ewald_force, forcetree.c

The percentages reported correspond to the relative number of samples for that routine, and roughly correspond to the program run time percentage. By default, only routines with 1% or more of the samples are listed, and source lines with 0.05% or more are printed. If PROFILER_VERBOSE=samp is given, all samples are printed.

Profile by caller

A profile breakdown by caller can be included in the report by setting the PROFILER_CALLER environment variable.

To get source file line numbers in the profile, use the -g compiler option. Since -g sets the optimization level to -O0, list all optimization flags after -g (e.g., -g -O3). One way to get the proper compile options is by loading the xt-craypat module.

The profile by caller report lists the calling routine for profiled functions in the mpi, mpio, shmem, and stdio modules.
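A sketch of requesting the caller breakdown for MPI calls (the extra unwind level is only an illustration):

 export PROFILER_CALLER=mpi        # profile MPI routines by calling routine
 export PROFILER_UNWIND_LEVELS=1   # include one additional call-stack level
 aprun -n 16 ./a.out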

Per-file I/O reporting

Files read or written by a process are tracked and statistics for each file are given in the detail report. The report lists the number of I/O calls, the total amount of I/O, and the I/O time, along with the average I/O rate. Also included is a breakdown of the I/O sizes (minimum, maximum, and average).

Below is an example of a per-file I/O report.


Process 0: File I/O details
   2005 writes,      0.084 MB total,      0.013 sec, 6.550 MB/sec, stdout
      2 reads,       0.000 MB total,      0.001 sec, 0.030 MB/sec, fd(6)
     73 writes,      0.063 MB total,      0.001 sec, 51.931 MB/sec, ./cpu.txt
     73 writes,      0.007 MB total,      0.006 sec, 1.062 MB/sec, ./info.txt
     74 writes,      0.023 MB total,      0.014 sec, 1.680 MB/sec, ./timings.txt
     73 writes,      0.013 MB total,      0.003 sec, 3.800 MB/sec, ./balance.txt
     73 writes,      0.001 MB total,      0.005 sec, 0.265 MB/sec, ./sfr.txt
     11 writes,     22.021 MB total,      0.029 sec, 769.798 MB/sec, ./snap_smallset_003
write size (KB),      0.000 min,      0.467 max,      0.042 avg,   stdout
read  size (KB),      0.000 min,      0.012 max,      0.012 avg,   fd(6)
write size (KB),      0.854 min,      0.860 max,      0.860 avg,   ./cpu.txt
write size (KB),      0.065 min,      0.094 max,      0.093 avg,   ./info.txt
write size (KB),      0.293 min,      0.316 max,      0.308 avg,   ./timings.txt
write size (KB),      0.171 min,      0.956 max,      0.182 avg,   ./balance.txt
write size (KB),      0.013 min,      0.018 max,      0.018 avg,   ./sfr.txt
write size (KB),   1049.096 min,   2097.152 max,   2001.874 avg,   ./snap_smallset_003

fd(6) in the report indicates an unnamed file (most likely a pipe(2)) with that file descriptor number. One source of such unnamed pipes is the SMP device implemented within MPT3.

Up to 64 files are tracked simultaneously, and up to 256 files are included in the report. Files beyond these limits are included in the process I/O totals, but are not reported individually. This report is generated by the pio module. Disabling the module (PROFILER_COLLECT=!pio) will disable this report.

OpenMP and POSIX thread reporting

Extra statistics are reported for programs using POSIX threads, including OpenMP from compilers whose OpenMP implementation is compatible, i.e. based on pthreads. The OpenMP implementations in PGI and CCE are compatible, but Pathscale OpenMP is not.

The system details report section includes CPU affinity settings for each thread created during the run.

The PAPI details report section includes the PAPI counter totals for all threads and the counters for each individual thread when more than one thread is active during the run. Note that the PAPI counters reported in the PAPI summary report section only include the master thread counters. This is because the counters for the other threads are not available until those threads terminate. Most OpenMP implementations keep a pool of threads waiting for parallel execution throughout the run, so the threads are not terminated until after the summary has been printed.

Notes on units

If the label KB is used, the value is in units of kilobytes and is scaled by 2**10 (1024). If the label MB is used, the value is in units of megabytes and is scaled by 2**20 (1048576). If the label GB is used, the value is in units of gigabytes and is scaled by 2**30 (1073741824). Otherwise, the value may be printed with a trailing character to indicate the scale factor. K is 10**3. M is 10**6. G is 10**9. If the value is printed with no trailing scale character and no scale label, then the value is unscaled.

Example Summary Report

Profile of allpair
  Number of processes           ,            2
  Profiling started             , Tue Aug  1 16:16:21 2006
  Profiling ended               , Tue Aug  1 16:16:21 2006
System summary                  ,          min,          max,          avg
  Wall clock time               ,        0.379,        0.379,        0.379
  User CPU time                 ,        0.379,        0.379,        0.379
  Processor clock (GHz)         ,        2.400,        2.400,        2.400
PAPI summary                    ,          min,          max,          avg
  Total processor cycles        ,    791855041,    791857657,    791856349
  User processor cycles         ,    791855669,    791858555,    791857112
  FP add pipe ops               ,       383795,      1275119,       829457
  FP multiply pipe ops          ,       158558,       830361,       494459
  Packed SSE instructions       ,       419725,       444353,       432039
  L1 data cache accesses        ,    313840770,    337758406,    325799588
  Processor clock (GHz)         ,        2.400,        2.400,        2.400
STDIO summary                   ,          min,          max,          avg
  Total I/O time                ,        0.000,        0.322,        0.161
  TOtal I/O (MB)                ,        0.000,        0.001,        0.001
  Total write time              ,        0.000,        0.322,        0.161
  Total write (MB)              ,        0.000,        0.001,        0.001
  fwrite calls                  ,            0,          330,          165
  fwrite time                   ,        0.000,        0.322,        0.161
  fwrite total (MB)             ,        0.000,        0.001,        0.001
Heap summary                    ,          min,          max,          avg
  Maximum heap size (MB)        ,       66.527,       66.542,       66.534
  Average block size (KB)       ,      140.310,      240.756,      190.533
  Maximum blocks allocated      ,          127,          129,          128
  Total blocks allocated        ,          283,          486,          384
MPI summary                     ,          min,          max,          avg
  Elapsed time                  ,        0.330,        0.330,        0.330
  Communication time            ,        0.002,        0.207,        0.105
  Wait time                     ,        0.001,        0.060,        0.031
  MPI_Send total calls          ,           10,           50,           30
  MPI_Send average bytes        ,     1600.000,     1705.600,     1652.800
  MPI_Send total bytes          ,        16000,        85280,        50640
  MPI_Ssend total calls         ,           10,           10,           10
  MPI_Ssend average bytes       ,         2664,         2664,         2664
  MPI_Ssend total bytes         ,        26640,        26640,        26640
  MPI_Rsend total calls         ,            0,           10,            5
  MPI_Rsend average bytes       ,            0,         2664,         1332
  MPI_Rsend total bytes         ,            0,        26640,        13320
  MPI_Recv total calls          ,           40,           60,           50
  MPI_Recv average time         ,        0.000,        0.003,        0.001
  MPI_Recv total time           ,        0.000,        0.177,        0.089
  MPI_Recv average bytes        ,         6000,         8000,         7000
  MPI_Recv total bytes          ,       240000,       480000,       360000

Example Detail Report

Process 0: System details
  Wall clock time               ,        0.427
  User CPU time                 ,        0.427
  Processor clock (GHz)         ,        2.400
  Hostname                      ,       salmon
Process 0: PAPI details
  Total processor cycles        ,    791855041
  User processor cycles         ,    791855669
  FP add pipe ops               ,       383795
  FP multiply pipe ops          ,       158558
  Packed SSE instructions       ,       444353
  L1 data cache accesses        ,    337758406
  Processor clock (GHz)         ,        2.400
Process 0: STDIO details
  Total I/O time                ,        0.322
  TOtal I/O (MB)                ,        0.001
  Total write time              ,        0.322
  Total write (MB)              ,        0.001
  fwrite calls                  ,          330
  fwrite time                   ,        0.322
  fwrite total (MB)             ,        0.001
Process 0: Heap details
  Maximum heap size (MB)        ,       66.542
  Average block size (KB)       ,      140.310
  Maximum blocks allocated      ,          129
  Total blocks allocated        ,          486
Process 0: MPI details
  Elapsed time                  ,        0.330
  Communication time            ,        0.002
  Wait time                     ,        0.001
  MPI_Send total calls          ,           10
  MPI_Send average bytes        ,         1600
  MPI_Send total bytes          ,        16000
  MPI_Ssend total calls         ,           10
  MPI_Ssend average bytes       ,         2664
  MPI_Ssend total bytes         ,        26640
  MPI_Rsend total calls         ,           10
  MPI_Rsend average bytes       ,         2664
  MPI_Rsend total bytes         ,        26640
  MPI_Recv total calls          ,           40
  MPI_Recv average bytes        ,         6000
  MPI_Recv total bytes          ,       240000

Ignored Processes

By default, no profile is written for some processes. These processes are part of regular job launch on Cray XT systems.

  • PMI shepherd process in MPI programs.
  • The ALPS aprun command.
  • The ALPS gzip utility (/usr/bin/gzip).



Third party tools

Gnu-Tools

The module gnu-tools collects more recent versions of basic tools, including the GNU build system (autoconf, automake, libtool, m4), as well as bash, cmake, gperf, git, gawk, swig, and bison. The currently installed versions can be listed using

% module whatis tools/gnu-tools

To use the current version of bash with full support of the module environment, you can simply call

% bash -l myScript.sh

or define the absolute path in the first line of your script

#!/opt/hlrs/tools/gnu-tools/generic/bin/bash -l

Octave

GNU Octave is a high-level interpreted language, primarily intended for numerical computations. It provides capabilities for the numerical solution of linear and nonlinear problems, and for performing other numerical experiments. It also provides extensive graphics capabilities for data visualization and manipulation. GNU Octave is normally used through its interactive interface (CLI and GUI), but it can also be used to write non-interactive programs. The GNU Octave language is quite similar to Matlab so that most programs are easily portable.

Octave is compiled to run on the compute nodes and can be launched e.g. in an interactive session:

% qsub -I [options]
% module load tools/octave 
% aprun -n 1 -N 1 octave octave.script

PARPACK

The module hlrs_PARPACK provides ARPACK, a collection of Fortran 77 routines designed to solve large-scale eigenvalue problems, together with its parallel version PARPACK. To link these libraries you only have to load the module

 numlib/hlrs_PARPACK 

Important Features of ARPACK:

  • Reverse Communication Interface.
  • Single and Double Precision Real Arithmetic Versions for Symmetric, Non-symmetric, Standard or Generalized Problems.
  • Single and Double Precision Complex Arithmetic Versions for Standard or Generalized Problems.
  • Routines for Banded Matrices - Standard or Generalized Problems.
  • Routines for The Singular Value Decomposition.
  • Example driver routines that may be used as templates to implement numerous Shift-Invert strategies for all problem types, data types and precision.
Warning: after swapping the PrgEnv this module has to be (re)loaded again (module load numlib/hlrs_PARPACK).
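After loading the module (and reloading it after any PrgEnv swap), a minimal compile-and-link sketch looks like this (source and output names are placeholders; the module is expected to provide the required link flags):

 module load numlib/hlrs_PARPACK
 ftn my_eigen_solver.f -o my_solver.exe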


Python

Current versions of Python can be used by loading the module tools/python.

SLEPc

SLEPc (the Scalable Library for Eigenvalue Problem Computations) is an extension of PETSc for solving linear eigenvalue problems in either standard or generalized form. Furthermore, SLEPc can compute the partial SVD of a large, sparse, rectangular matrix and solve nonlinear eigenvalue problems (polynomial or general). Additionally, SLEPc provides solvers for the computation of the action of a matrix function on a vector. SLEPc can be used with real (default) and complex arithmetic; therefore two different modules are provided:

  module load numlib/hlrs_SLEPc    # default version

OR

  module load numlib/hlrs_SLEPc/3.5.3-complex 

As usual, the module provides all compiler and linker flags, thus ex1.c (containing SLEPc calls) can simply be compiled by

 
  cc ex1.c -o ex1.exe
Warning: Please select the desired PrgEnv first, or reload this module after swapping the PrgEnv (module load numlib/hlrs_SLEPc). Supported programming environments are PrgEnv-cray, PrgEnv-gnu, and PrgEnv-intel.


Utilities for processing netcdf files

The module tools/netcdf_utils contains the following tools:

Third party scientific software

CP2K

CP2K is a freely available (GPL) program to perform atomistic and molecular simulations of solid state, liquid, molecular, and biological systems. It provides a general framework for different methods such as density functional theory (DFT) using a mixed Gaussian and plane waves approach (GPW), and classical pair and many-body potentials. It is written in well-structured, standards-conforming Fortran 95, parallelized with MPI and, in parts, optionally with hybrid OpenMP+MPI.

CP2K provides state-of-the-art methods for efficient and accurate atomistic simulations; its sources are freely available and actively improved. It has an active international development team, with its unofficial headquarters at the University of Zürich.

The molecular simulation package is installed, optimized for the present architecture, and compiled with gfortran using optimized versions of libxc, libint, and libsmm.

 module load chem/cp2k 

provides four binaries with different kinds of parallelization:

 cp2k.ssmp  - only OpenMP
 cp2k.popt  - only MPI 
 cp2k.psmp  - hybrid MPI + OpenMP
 cp2k.pdbg  - only MPI compiled with debug flags

After loading the related module (chem/cp2k), the binary can be directly called in the job submission script, e.g.:

aprun -n 24 -N 24 cp2k.psmp myCp2kInputFile.inp > myOutput.out

Some examples of CP2K input files are provided on the CP2K homepage, where the input reference is also available.

Gromacs

GROMACS (GROningen MAchine for Chemical Simulations) is a molecular dynamics package which can be used after loading

 module load chem/gromacs 


LAMMPS

LAMMPS "LAMMPS Molecular Dynamics Simulator" is a molecular dynamics package which can be used by

 module load chem/lammps 

The executable is called lmp_CrayXC.

NAMD

NAMD (Scalable Molecular Dynamics) is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems, based on Charm++ parallel objects. The package can be loaded using

 module load chem/namd 

A tutorial can be found here.

OpenFOAM

OpenFOAM (Open Field Operation and Manipulation) is an open source CFD software package. Multiple versions of OpenFOAM are available, compiled with the GNU and Intel compilers. Available versions can be listed using

 module avail cae/openfoam 

OpenFOAM can be used with PrgEnv-gnu and PrgEnv-intel, e.g.

 
module swap PrgEnv-cray PrgEnv-gnu
module load cae/openfoam

Furthermore, Foam-extend is available but only for PrgEnv-gnu

 
module swap PrgEnv-cray PrgEnv-gnu
module load cae/openfoam/3.0-extend

As a first example a test case of incompressible laminar flow in a cavity using blockMesh and icoFoam is provided, which can be found in the directory

 /opt/hlrs/cae/fluid/OPENFOAM/ESM/CRAY-Versionen/hornet-example 

To run this example you have to copy the directory and submit the prepareOF and runOF jobs.

It is also possible to use CrayPAT profiling for certain versions of OpenFOAM. For this purpose, specialized modules exist which provide the relevant versions as cae/openfoam/xxx-perftools, where xxx is a version number. The related binaries still have to be instrumented using

 
pat_build $FOAM_APPBIN/icoFoam

As a result, a binary icoFoam+pat is generated in the current directory. Using this binary in the batch script, the profiling is performed. To analyze the resulting profiling data, pat_report and further tools can be used (Cray Performance Tools). If during the execution of your instrumented binary you notice that MPI is not recognized, i.e. you see replicated output or several *.xf files that are not collected in a single directory in your workspace, you can export PAT_BUILD_PROG_MODELS="0x1" in your shell and run the pat_build command again after removing the instrumented binary. Please file a ticket if this does not work for you.
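A sketch of that workaround, using icoFoam as in the example above:

 rm icoFoam+pat                        # remove the previously instrumented binary
 export PAT_BUILD_PROG_MODELS="0x1"
 pat_build $FOAM_APPBIN/icoFoam        # regenerate icoFoam+pat so that MPI is recognized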