- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

CRAY XC40 Using the Batch System: Difference between revisions

From HLRS Platforms
 
(74 intermediate revisions by 5 users not shown)
Line 6: Line 6:
* the scheduler '''moab'''
* the scheduler '''moab'''


Additionally you have to know that on CRAY XE6/XC30 systems the user applications are always launched on the compute nodes using the application launcher, '''aprun''', which submits applications to the Application Level Placement Scheduler '''(ALPS)''' for placement and execution.
Additionally you have to know that on CRAY XE6/XC40 systems the user applications are always launched on the compute nodes using the application launcher, '''aprun''', which submits applications to the Application Level Placement Scheduler '''(ALPS)''' for placement and execution.


<font color=red>Detailed information for '''CRAY XE6'''</font> about how to use this system and many examples can be found in [http://docs.cray.com/cgi-bin/craydoc.cgi?mode=View;id=S-2396-601;right=/books/S-2396-601/html-S-2396-601//chapter-djg9hyw1-brbethke.html Cray Application Developer's Environment User's Guide] and [http://docs.cray.com/cgi-bin/craydoc.cgi?mode=View;id=S-2496-4001;right=/books/S-2496-4001/html-S-2496-4001//chapter-3vnhd83p-oswald-runningbatchjobs.html Workload Management and Application Placement for the Cray Linux Environment].
<font color=red>Detailed information for '''CRAY XC40'''</font> about how to use this system and many examples can be found in [http://docs.cray.com/books/S-2529-116/ Cray Programming Environment User's Guide] and [http://docs.cray.com/books/S-2496-5202/ Workload Management and Application Placement for the Cray Linux Environment].
 
<font color=red>Detailed information for '''CRAY XC30'''</font> about how to use this system and many examples can be found in [http://docs.cray.com/cgi-bin/craydoc.cgi?mode=Show;q=2529;f=/books/S-2529-103/html-S-2529-103/chapter-djg9hyw1-brbethke.html Cray Programming Environment User's Guide] and [http://docs.cray.com/cgi-bin/craydoc.cgi?mode=Show;q=2496;f=/books/S-2496-5001/html-S-2496-5001/S-2496-5001-toc.html Workload Management and Application Placement for the Cray Linux Environment].


* '''ALPS''' is always used for scheduling a job on the compute nodes. It does not care about the programming model you used. So we need a few general definitions :
* '''ALPS''' is always used for scheduling a job on the compute nodes. It does not care about the programming model you used. So we need a few general definitions :
** '''PE''' : Processing Elements,  basically an Unix ‘Process’, can be a MPI Task, CAF image, UPC tread, ...
** '''PE''' : Processing Element, basically a Unix ‘process’; it can be an MPI task, CAF image, UPC thread, ...
** '''numa_node''' The cores and memory on a node with ‘flat’ memory access, basically one of the 2 Dies on the Intel and the direct attach memory.
** '''numa_node''' The cores and memory on a node with ‘flat’ memory access, basically one of the 2 Dies on the Intel and the direct attach memory.
** '''Thread'''  A thread is contained inside a process. Multiple threads can exist within the same process and share resources such as memory, while different PEs do not share these resources. Most likely you will use OpenMP threads.
** '''Thread'''  A thread is contained inside a process. Multiple threads can exist within the same process and share resources such as memory, while different PEs do not share these resources. Most likely you will use OpenMP threads.
Line 25: Line 23:
*** The total number of PEs: -n
*** The total number of PEs: -n
*** The number of PEs per node: -N
*** The number of PEs per node: -N
*** The number of OpenMP threads: -d    (the 'stride' between 2 PEs in a node
*** The number of OpenMP threads: -d    (the 'stride' between 2 PEs in a node)
** see also [[CRAY_XE6_and_XC30_Using_the_Batch_System#Understanding_aprun | understanding aprun]]
** see also [[CRAY_XE6_and_XC40_Using_the_Batch_System#Understanding_aprun | understanding aprun]]
* '''qsub''' is the torque submission command for batch job scripts.
* '''qsub''' is the torque submission command for batch job scripts.


Line 47: Line 45:
#!/bin/bash
#!/bin/bash
#PBS -N job_name
#PBS -N job_name
#PBS -l nodes=2:ppn=32
#PBS -l nodes=2:ppn=24
#PBS -l walltime=00:20:00             
#PBS -l walltime=00:20:00             
    
    
Line 54: Line 52:


# Launch the parallel job to the allocated compute nodes
# Launch the parallel job to the allocated compute nodes
aprun -n 64 -N 32 ./my_mpi_executable arg1 arg2 > my_output_file 2>&1
aprun -n 48 -N 24 ./my_mpi_executable arg1 arg2 > my_output_file 2>&1
</pre>
</pre>
* The job is submitted by the '''qsub''' command (all script head parameters #PBS can also be adjusted directly by '''qsub''' command options).  
* The job is submitted by the '''qsub''' command (all script head parameters #PBS can also be adjusted directly by '''qsub''' command options).  
   qsub my_batchjob_script.pbs
   qsub my_batchjob_script.pbs
* Setting qsub options on the command line will overwrite the settings given in the batch script:
* Setting qsub options on the command line will overwrite the settings given in the batch script:
   qsub -N other_name -l nodes=2:ppn=32,walltime=00:20:00 my_batchjob_script.pbs
   qsub -N other_name -l nodes=2:ppn=24,walltime=00:20:00 my_batchjob_script.pbs
* The batch script is not necessarily granted resources immediately, it may sit in the queue of pending jobs for some time before its required resources become available.   
* The batch script is not necessarily granted resources immediately, it may sit in the queue of pending jobs for some time before its required resources become available.   
* At the end of the execution output and error files are returned to the submission directory
* At the end of the execution output and error files are returned to the submission directory
* This example will run your executable "my_mpi_executable" in parallel with 64 MPI processes. Torque will allocate 2 nodes to your job for a maximum time of 20 minutes and place 32 processes on each node (one per core). The batch systems allocates nodes exclusively only for one job. After the walltime limit is exceeded, the batch system will terminate your job. The job launcher for the XE6/XC30 parallel jobs (both MPI and OpenMP) is '''aprun'''. This needs to be started from a subdirectory of the /mnt/lustre_server (your workspace). The '''aprun''' example above will start the parallel executable "my_mpi_executable" with the arguments "arg1" and "arg2". The job will be started using 64 MPI processes with 32 processes placed on each of your allocated nodes (remember that a node consists of 32 cores in the XE6 system and only 16 cores in the XC30 system). You need to have nodes allocated by the batch system before (qsub).
* This example runs the executable "my_mpi_executable" in parallel with 48 MPI processes. Torque allocates 2 nodes to the job for a maximum of 20 minutes and places 24 processes on each node (one per core). The batch system allocates each node exclusively to one job. Once the walltime limit is exceeded, the batch system terminates the job. The job launcher for XC40 parallel jobs (both MPI and OpenMP) is '''aprun'''. It needs to be started from a subdirectory of /mnt/lustre_server (your workspace). The '''aprun''' example above starts the parallel executable "my_mpi_executable" with the arguments "arg1" and "arg2", using 48 MPI processes with 24 processes placed on each allocated node (remember that a node of the XC40 system has 24 cores). The nodes must have been allocated by the batch system (qsub) before aprun is started.
To query further options of '''aprun''', please use  
To query further options of '''aprun''', please use  
   man aprun  
   man aprun  
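The relation between the ''nodes''/''ppn'' request and the '''aprun''' options '''-n''' and '''-N''' used in the example above can be sketched as a small helper. The function name ''aprun_geometry'' is hypothetical and not part of any HLRS or Cray tool; it only prints the command line and does not launch anything.

```shell
#!/bin/bash
# Hypothetical helper (an assumption for illustration only): derive the
# aprun options from the node count and the number of PEs per node,
# matching a "#PBS -l nodes=<nodes>:ppn=<ppn>" request.
aprun_geometry() {
    local nodes=$1 ppn=$2
    local total=$(( nodes * ppn ))   # total number of PEs, passed as -n
    echo "aprun -n ${total} -N ${ppn}"
}

# Two XC40 nodes with 24 PEs each, as in the batch script above.
aprun_geometry 2 24
```

For the request ''nodes=2:ppn=24'' this reproduces the options of the example, 48 PEs in total with 24 per node.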
Line 71: Line 69:
=== Interactive batch Mode ===
=== Interactive batch Mode ===
Interactive mode is typically used for debugging or optimizing code but not for running production code. To begin an interactive session, use the "qsub -I" command:
Interactive mode is typically used for debugging or optimizing code but not for running production code. To begin an interactive session, use the "qsub -I" command:
   qsub -I -l nodes=2:ppn=32,walltime=00:30:00
   qsub -I -l nodes=2:ppn=24,walltime=00:30:00


If the requested resources are available and free (in the example above: 2 nodes/32 cores, 30 minutes), then you will get a new session on the mom node for your requested resources.
If the requested resources are available and free (in the example above: 2 nodes/24 cores, 30 minutes), then you will get a new session on the mom node for your requested resources.
Now you have to use the '''aprun''' command to launch your application to the allocated compute nodes.
Now you have to use the '''aprun''' command to launch your application to the allocated compute nodes.
When you are finished, enter '''logout''' to exit the batch system and return to the normal command line.
When you are finished, enter '''logout''' to exit the batch system and return to the normal command line.
Line 137: Line 135:
   showstart -h
   showstart -h
{{Note|text=  be aware that the output of all these commands show a state of the system at the moment when the command is issued. The starting time of jobs for instance also depends on other events like jobs submitted in the future which may fit better into the scheduling of the machine, on the shape of the hardware, other queues and reservations...}}
{{Note|text=  be aware that the output of all these commands show a state of the system at the moment when the command is issued. The starting time of jobs for instance also depends on other events like jobs submitted in the future which may fit better into the scheduling of the machine, on the shape of the hardware, other queues and reservations...}}
*  [[CRAY_XC40_Resource_Utilization_Reporting | Resource Utilization Reporting (RUR) ]] is a tool for gathering statistics on how system resources are being used by applications. At HLRS, RUR is configured to write a single file, rur.out, in the user's home directory. The file contains the output of each plugin used by RUR. The plugins are: "taskstats", "energy" and "timestamp".
* At the end of your job output file you will find resource information such as:
<pre>
Application 6640730 resources: utime ~49s, stime ~5s, Rss ~8504, inblocks ~15127, outblocks ~2498
</pre>
where:
** utime:  user time used
** stime:  system time used
** Rss:    maximum resident set size (memory)
** inblocks:  block input operations
** outblocks: block output operations
The values are summed from all app processes.
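If these values are needed in a script, a single field can be pulled out of the summary line with a bash regular expression. This is a minimal sketch, assuming the line has exactly the format shown above; the function name ''resource_field'' is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: extract the numeric value of one field
# (e.g. utime, stime, Rss) from a resource summary line.
resource_field() {
    local line=$1 field=$2
    # Matches e.g. "utime ~49s" or "Rss ~8504" and captures the digits.
    local re="${field} ~([0-9]+)"
    if [[ $line =~ $re ]]; then
        echo "${BASH_REMATCH[1]}"
    fi
}

# Sample line copied from the job output shown above.
line='Application 6640730 resources: utime ~49s, stime ~5s, Rss ~8504, inblocks ~15127, outblocks ~2498'
echo "utime: $(resource_field "$line" utime)"
echo "Rss:   $(resource_field "$line" Rss)"
```

The function prints nothing if the requested field does not occur in the line.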


=== Limitations ===
=== Limitations ===
* see the [[CRAY_XE6_Batch_System_Layout_and_Limits| Batch System Layout and Limits for CRAY XE6]]
<!-- * see the [[CRAY_XE6_Batch_System_Layout_and_Limits| Batch System Layout and Limits for CRAY XE6]]-->
* see the [[CRAY_XC30_Batch_System_Layout_and_Limits| Batch System Layout and Limits for CRAY XC30]]
* see the [[CRAY_XC40_Batch_System_Layout_and_Limits| Batch System Layout and Limits for CRAY XC40]]


=== Understanding aprun ===
=== Understanding aprun ===
Line 154: Line 166:
To get core specialization use '''aprun -r'''
To get core specialization use '''aprun -r'''
   aprun -r1 -n 100 a.out
   aprun -r1 -n 100 a.out
highest numbered cores will be used, starting with 31 on current nodes. (independent on aprun -j setting)
Thus one core per node is used for system work and 23 cores are available for computation.


apcount provided to compute total number of cores required
Furthermore, the tool '''apcount''' computes the total number of cores required for a given setup that uses specialization cores. For further instructions see
   man apcount
   man apcount


==== Hyperthreading only for XC30 system !====
==== Hyperthreading ====
Cray XC30 compute nodes are always booted with hyperthreading switched ON.
Cray XC40 compute nodes are always booted with hyperthreading switched ON.
Users can choose to run with one or two PEs or threads per core. The default is to run with 1. You can make your choice at runtime :
Users can choose to run with one or two PEs or threads per core. The default is to run with 1. You can make your choice at runtime :


aprun -n### -j1 …    ->  Single Stream mode, one rank per core
aprun -n### -j1 …    ->  Single Stream mode, one rank per core (default)


aprun -n### -j2 …    ->  Dual Stream mode, two ranks per core
aprun -n### -j2 …    ->  Dual Stream mode, two ranks per core


The numbering of the cores in single stream mode is 0-7 for die 0 and 8-15 for die 1. If using dual stream mode the numbering of the first 15 cores stays the same and cores 16-23 are on die 0 and 24-31 on die 1. Note that this makes the numbering of the cores in hypterthread mode not contiguous :
The numbering of the cores in single stream mode is 0-11 for die 0 and 12-23 for die 1. In dual stream mode the numbering of the first 24 cores stays the same, and cores 24-35 are on die 0 and 36-47 on die 1. Note that this makes the numbering of the cores in hyperthread mode non-contiguous:


{|class="wikitable"
{|class="wikitable"
Line 189: Line 201:
{|
{|
|-
|-
| 0-7
| 0-11
|-
|-
| 8-15
| 0-11, 24-35
|-
|-
|}
|}
Line 199: Line 211:
{|
{|
|-
|-
| 0-7,16-23
| 12-23
|-
|-
| 8-15,24-31
| 12-23, 36-47
|-
|-
|}
|}


|}
|}
Note: the cores are assigned consecutively, which in hyperthread mode means: 0,24,1,25,...,11,35,12,36,...,23,47.
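The assignment order in hyperthread mode can be written out in full with a short loop. This is only an illustration of the numbering described above, assuming 24 physical cores per node; the function name ''ht_cpu_order'' is hypothetical.

```shell
#!/bin/bash
# Print the logical CPU numbers in assignment order for dual stream mode:
# physical core i is followed by its hyperthread sibling i + cores_per_node.
# The default of 24 cores per node is the XC40 value given in the text.
ht_cpu_order() {
    local cores_per_node=${1:-24}
    local -a order=()
    local i
    for (( i = 0; i < cores_per_node; i++ )); do
        order+=( "$i" "$(( i + cores_per_node ))" )
    done
    local IFS=,
    echo "${order[*]}"   # join the list with commas
}

ht_cpu_order 24
```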


==== aprun CPU Affinity control ====
==== aprun CPU Affinity control ====
Line 218: Line 232:


==== aprun Memory Affinity control ====
==== aprun Memory Affinity control ====
Cray XC30 systems use dual-socket compute nodes with 2 dies.
Cray XC40 systems use dual-socket compute nodes with 2 dies.
For 16-CPU Cray XC30 compute node processors, NUMA nodes 0 and 1 have eight CPUs each (logical CPUs 0-7, 8-15 respectively). If your applications use Intel Hyperthreading Technology, it is possible to use up to 32 processing elements (logical CPUs 16-23 are on NUMA node 0 and CPUs 24-31 are on NUMA node 1).
For 24-CPU Cray XC40 compute node processors, NUMA nodes 0 and 1 have 12 CPUs each (logical CPUs 0-11 and 12-23 respectively). If your applications use Intel Hyperthreading Technology, it is possible to use up to 48 processing elements (logical CPUs 0-11 and 24-35 are on NUMA node 0; CPUs 12-23 and 36-47 are on NUMA node 1).
Even if you PE and threads are bound to a specific numa_node, the memory used does not have to be ‘local’
Even if your PE and threads are bound to a specific numa_node, the memory used does not have to be ‘local’
* aprun memory affinity options (see also man apron)
* aprun memory affinity options (see also man aprun)
** Suggested setting is -ss (a PE can only allocate the memory local to its assigned NUMA node. If this is not possible, your application will crash.)
** Suggested setting is -ss (a PE can only allocate the memory local to its assigned NUMA node. If this is not possible, your application will crash.)
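The mapping from a logical CPU number to its NUMA node follows directly from the XC40 numbering given above (0-11 and 24-35 on NUMA node 0, 12-23 and 36-47 on NUMA node 1). A minimal sketch, with the hypothetical function name ''numa_node_of_cpu'':

```shell
#!/bin/bash
# Map a logical CPU number (0-47) of an XC40 node to its NUMA node.
# cpu % 24 folds the hyperthread sibling back onto its physical core,
# and the integer division by 12 selects the die.
numa_node_of_cpu() {
    local cpu=$1
    echo $(( (cpu % 24) / 12 ))
}

for cpu in 0 11 12 23 24 35 36 47; do
    echo "cpu ${cpu} -> numa_node $(numa_node_of_cpu "${cpu}")"
done
```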


==== Some basic aprun examples ====
==== Some basic aprun examples ====
Assuming a XC30 with Sandybridge nodes (32 cores per node with Hyperthreading)
Assuming an XC40 with Haswell nodes (48 logical cores per node with Hyperthreading)


===== Pure MPI application , using all the available cores in a node =====
===== Pure MPI application , using all the available cores in a node =====
   aprun -n 32 -j2 ./a.out
   aprun -n 48 -N 48 -j2 ./a.out


===== Pure MPI application, using only 1 core per node =====
===== Pure MPI application, using only 1 core per node =====
32 MPI tasks, 32 nodes with 32*32 core allocated can be done to increase the available memory for the MPI tasks
24 MPI tasks on 24 nodes (24*24 cores allocated). This can be done to increase the memory available to each MPI task.
   aprun -N 1 -n 32 -d 32 -j2 ./a.out
   aprun -n 24 -N 1 -d24 ./a.out


===== Hybrid MPI/OpenMP application, 4 MPI ranks per node =====
===== Hybrid MPI/OpenMP application, 4 MPI ranks per node =====
32 MPI tasks, 8 OpenMP threads each need to set OMP_NUM_THREADS
24 MPI tasks with 12 OpenMP threads each; OMP_NUM_THREADS needs to be set
   export OMP_NUM_THREADS=8
   export OMP_NUM_THREADS=12
   aprun -n 32 -N 4 -d $OMP_NUM_THREADS -j2
   aprun -n 24 -N 4 -d $OMP_NUM_THREADS -j2 ./a.out
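Before submitting a hybrid job it can help to check that the chosen values fit on a node, i.e. that the '''-N''' value times the '''-d''' value does not exceed the number of logical CPUs per node (24 with '''-j1''', 48 with '''-j2'''). The following is a sketch of that check only; the function name ''fits_on_node'' is hypothetical and the limit of 24 physical cores is the XC40 value from the text.

```shell
#!/bin/bash
# Hypothetical check: does "-N <pes_per_node> -d <depth> -j <streams>"
# fit on one XC40 node with 24 physical cores?
fits_on_node() {
    local pes_per_node=$1 depth=$2 streams=${3:-1}
    local cpus=$(( 24 * streams ))        # 24 with -j1, 48 with -j2
    (( pes_per_node * depth <= cpus ))    # exit status 0 if it fits
}

# The hybrid example above: 4 ranks per node, 12 threads each, -j2.
if fits_on_node 4 12 2; then
    echo "-N 4 -d 12 -j2 fits on a node"
fi
# The same layout without Hyperthreading would oversubscribe the node.
if ! fits_on_node 4 12 1; then
    echo "-N 4 -d 12 -j1 does not fit on a node"
fi
```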


===== MPI and OpenMP with ''Intel PE'' =====
===== MPI and OpenMP with ''Intel PE'' =====
Intel RTE creates one extra thread when spawning the worker threads. This makes the pinning for aprun more difficult.
Intel RTE creates one extra thread when spawning the worker threads. This makes correct and efficient pinning more difficult for aprun. In the default setting (''OMP_NUM_THREADS=$omps'' and ''aprun -d $num_d'') the threads are scheduled round robin: the extra thread is scheduled as the second thread and lands on the second cpu, so that at the end two application threads (the first and the last one) are both placed on the first cpu. This results in a significant performance degradation.
 
But this extra thread usually has no significant workload. Thus, this extra thread does not influence the performance of an application thread, when it is located on the same cpu.
 
Thus, we suggest adding the -cc depth option. As a result, all threads can migrate with respect to the specified cpumask. For example when using:
  export OMP_NUM_THREADS=$omps
  aprun -n $npes -N $ppn -d $OMP_NUM_THREADS -cc depth a.out
all $omps computational threads will be located each on a single cpu and the extra thread on one of these.  


Suggestions:
<!--
* Running when “depth” divides evenly into the number of “cpus” on a socket
Thus suggest to specify not more threads that reserving cpus for them, which means: $omps <= $num_d
  export OMP_NUM_THREADS=“<=depth”
-->
  aprun -n npes -d “depth” -cc numa_node a.out
* Running when “depth” does not divide evenly into the number of “cpus” on a socket
  export OMP_NUM_THREADS=“<=depth”
  aprun -n npes -d “depth” -cc none a.out


===== Multiple Program Multiple Data (MPMD) =====
===== Multiple Program Multiple Data (MPMD) =====
aprun supports MPMD – Multiple Program Multiple Data.
aprun supports MPMD – Multiple Program Multiple Data.
* Launching several executables which all are part of the same MPI_COMM_WORLD
* Launching several executables which all are part of the same MPI_COMM_WORLD
   aprun –n 128 exe1 : -n 64 exe2 : -n 64 exe3
   aprun -n 96 exe1 : -n 48 exe2 : -n 48 exe3
* Notice : Each exacutable needs a dedicated node, exe1 and exe2 cannot share a node.  
* Notice : Each executable needs a dedicated node, exe1 and exe2 cannot share a node.  


Example : The following commands needs 3 nodes
Example : The following command needs 3 nodes
   aprun -n 1 exe1 : -n 1 exe2 : -n 1 exe3
   aprun -n 1 exe1 : -n 1 exe2 : -n 1 exe3
* Use a script to start several serial jobs on a node :
* Use a script to start several serial jobs on a node :
   aprun –a xt –n 1 –d 32 –cc none script.sh
   aprun -a xt -n 1 -d 24 -cc none script.sh


   >cat script.sh  
   >cat script.sh  
Line 267: Line 284:
   wait  
   wait  
   >
   >


===== cpu_lists for each PE =====
===== cpu_lists for each PE =====
Line 275: Line 291:
Here an example with 3 threads :
Here an example with 3 threads :
   aprun -n 4 -N 4 -cc <font color=red>1,3,5</font>:<font color=green>7,9,11</font>:<font color=blue>13,15,17</font>:<font color=yellow>19,21,23</font>
   aprun -n 4 -N 4 -cc <font color=red>1,3,5</font>:<font color=green>7,9,11</font>:<font color=blue>13,15,17</font>:<font color=yellow>19,21,23</font>


== Job Examples ==
== Job Examples ==


This batch script template serves as basis for the aprun expamples given later.
This batch script template serves as the basis for the aprun examples given later.
<!-- {{ Warning| text=Please note the addition of ppn=32. This is only necessary for the XE6 Platform Hermit. For the XC40 the ''ppn'' has to be omitted.}}.-->
<pre>
<pre>
#! /bin/bash
#! /bin/bash


#PBS -N <job_name>
#PBS -N <job_name>
#PBS -l nodes=<number_of_nodes>
#PBS -l nodes=<number_of_nodes>:ppn=24
#PBS -l walltime=00:01:00
#PBS -l walltime=00:01:00


Line 304: Line 319:


The following parameters for the template above should cover the vast majority of applications
The following parameters for the template above should cover the vast majority of applications
and are given for both the XE6 and XC30 platform at HRLS. The ''<exe>'' keword should be replaced
and are given for both the XE6 and XC40 platform at HLRS. The ''<exe>'' keyword should be replaced
by your application.
by your application.


 
=== Job types ===
=== XE6 Platform ===
The compute nodes of the XC40 platform Hazelhen feature two Haswell processors with 12 cores and one NUMA domain each  
The nodes of the XE6 features two Interlagos processors with 16 cores each resulting in a total of  
resulting in a total of 24 cores and 2 NUMA domains per node.  
32 cores per node. Each Interlagos processor forms two NUMA domains of size 8 resulting in totally
One conceptual difference between the Interlagos nodes on the XE6 and the Haswell nodes on the XC40 is the Hyperthreading
four NUMA domains per node.  
feature of the Haswell processor. Hyperthreading is always booted and whether it is used or not is controlled via the
'''-j''' option to aprun. Using '''-j 2''' enables Hyperthreading while '''-j 1''' (the default) does not. With Hyperthreading enabled, the compute node
on the XC40 provides 48 logical cores instead of 24.


* <font color=green>'''Description: Serial application (no MPI or OpenMP)'''</font>
* <font color=green>'''Description: Serial application (no MPI or OpenMP)'''</font>
Line 320: Line 337:
* <font color=green>'''Description: Pure OpenMP application (no MPI)'''</font>
* <font color=green>'''Description: Pure OpenMP application (no MPI)'''</font>
   <number_of_nodes>:  1
   <number_of_nodes>:  1
   <nt>: 32
   <nt>: 24
   <aprun_command>:  aprun -n 1 -d $OMP_NUM_THREADS <exe>
   <aprun_command>:  aprun -n 1 -d $OMP_NUM_THREADS <exe>
Comment: You can vary the number of threads from 1-32.
Comment: You can vary the number of threads from 1-24.
 
 
* <font color=green>'''Description: Pure MPI application on two nodes fully packed (no OpenMP)'''</font>
  <number_of_nodes>:  2
  <aprun_command>:  aprun -n 64 -N 32 <exe>
Comment: The '''-n''' specifies the total number of processing elements (PE) and -N the PEs per node.
The '''-n''' has to be less or equal to 32*''<number_of_nodes>'' and '''-N''' less or equal to ''<number_of_nodes>''.
Finally, the '''-n''' value divided by the '''-N''' value has to be less or equal than the ''<number_of_nodes>''. You can
increase the number of nodes as needed and vary the remaining parameters accoridingly.
 
 
* <font color=green>'''Description: Pure MPI application on two nodes in wide-AVX mode (no OpenMP)'''</font>
  <number_of_nodes>:  2
  <aprun_command>:  aprun -n 32 -N 16 -d 2 <exe>
Comment: The '''-d 2''' is used to place the PEs evenly among the cores on the node.
This doubles the memory bandwidth and floating point unit per PE.
 
 
* <font color=green>'''Description: Mixed (Hybrid) MPI OpenMP application on two nodes'''</font>
  <number_of_nodes>:  2
  <nt>: 8
  <aprun_command>:  aprun -n 8 -N 4 -d $OMP_NUM_THREADS <exe>
Comment: In addition to the constraints mentioned above, the '''-d''' value times the '''-N''' value has to be less or equal to 32.
This configuration runs one processing element per NUMA domain and each PE spawns 8 threads.
 
=== XC30 Platform ===
The compute nodes of the XC30 platform Hornet feature two SandyBridge processors with 8 cores and one NUMA domain each
resulting in a total of 16 cores and 2 NUMA domains per node.
One conceptual difference between the Interlagos nodes on the XE6 and the SandyBridge nodes on the XC30 is the Hyperthreading
feature of the SandyBridge processor. Hyperthreading is always booted and whether it is used or not is controlled via the
'''-j''' option to aprun. Using -j 2 enables Hyperthreding while '''-j 1''' (the default) does not. With Hyperthreading enabled, the compute node
on the XC30 disposes 32 cores instead of 16.
 


* <font color=green>'''Description: Pure MPI application on two nodes fully packed (no OpenMP) with Hyperthreads'''</font>
* <font color=green>'''Description: Pure MPI application on two nodes fully packed (no OpenMP) with Hyperthreads'''</font>
   <number_of_nodes>:  2
   <number_of_nodes>:  2
   <aprun_command>:  aprun -n 64 -N 32 -j 2 <exe>
   <aprun_command>:  aprun -n 96 -N 48 -j 2 <exe>




* <font color=green>'''Description: Pure MPI application on two nodes fully packed (no OpenMP) without Hyperthreads'''</font>
* <font color=green>'''Description: Pure MPI application on two nodes fully packed (no OpenMP) without Hyperthreads'''</font>
   <number_of_nodes>:  2
   <number_of_nodes>:  2
   <aprun_command>:  aprun -n 32 -N 16 -j 1 <exe>
   <aprun_command>:  aprun -n 48 -N 24 -j 1 <exe>
Comment: Here you can also omit the '''-j 1''' option as it is the default. This configuration corresponds to the wide-AVX case on the  
Comment: Here you can also omit the '''-j 1''' option as it is the default. This configuration corresponds to the wide-AVX case on the  
XE6 nodes.
XE6 nodes.
Line 372: Line 356:
   <number_of_nodes>:  2
   <number_of_nodes>:  2
   <nt>: 2
   <nt>: 2
   <aprun_command>:  aprun -n 32 -N 16 -d $OMP_NUM_THREADS -j 2 <exe>
   <aprun_command>:  aprun -n 48 -N 24 -d $OMP_NUM_THREADS -j 2 <exe>




Line 379: Line 363:
----
----


=== General remarks for both platforms ===
=== General remarks ===
The <tt>aprun</tt> allows to start an application with more OpenMP threads than compute cores available. This oversubscription results in a substantial performance degradation. The same happens if the <tt>-d</tt> value is smaller than the number of OpenMP threads used by the application. Furthermore, for the Intel programming environment an additional helper thread per processing element is spawned which can lead to an oversubscription. Here, one can use the <tt>-cc numa_node</tt> or the <tt>-cc none</tt> option to <tt>aprun</tt> to avoid this obersubscription of hardware. The default behavrior, i.e. if no <tt>-cc</tt> is specified, is as if <tt>-cc cpu</tt> is used which means that each processing element and thread is pinned to a processor. Please consult the aprun man page. Another popular option to <tt>aprun</tt> is <tt>-ss</tt> which forces memory allocation to be constrained in the same node as the processing element or thread is constrained. One can use the <tt>xthi.c</tt> utility to check the affinity of threads and processing elements.
The <tt>aprun</tt> allows to start an application with more OpenMP threads than compute cores available. This oversubscription results in a substantial performance degradation. The same happens if the <tt>-d</tt> value is smaller than the number of OpenMP threads used by the application. Furthermore, for the Intel programming environment an additional helper thread per processing element is spawned which can lead to an oversubscription. Here, one can use the <tt>-cc numa_node</tt> or the <tt>-cc none</tt> option to <tt>aprun</tt> to avoid this oversubscription of hardware. The default behavior, i.e. if no <tt>-cc</tt> is specified, is as if <tt>-cc cpu</tt> is used which means that each processing element and thread is pinned to a processor. Please consult the aprun man page. Another popular option to <tt>aprun</tt> is <tt>-ss</tt> which forces memory allocation to be constrained in the same node as the processing element or thread is constrained. One can use the <tt>xthi.c</tt> utility to check the affinity of threads and processing elements.




Line 393: Line 377:




   qsub -l nodes=2:ppn=32:mem32gb <myjobscript>
   qsub -l nodes=2:ppn=24 <myjobscript>


(replaces: qsub -l mppwidth=64,mppnppn=32,feature=mem32gb)
(replaces: qsub -l mppwidth=48,mppnppn=24)
* nodes: replacement for mppwidth/mppnppn
* nodes: replacement for mppwidth/mppnppn
* ppn: replacement for mppnppn
* ppn: replacement for mppnppn
* mem32gb: replacement for feature=mem32gb
Please note that in the examples above the keywords such as ''nodes'' or ''ppn'' have been specified directly in the script via the #PBS string.
{{Note | text=omitting argument ''ppn=32'' will also allocate all 32 cores of a node}}  
For the example in this warning box the keywords are specified on the command line which is also allowed. But you cannot specify the keywords both in
the script and on the command line.
}}


}}
{{ Warning| text=Independent of the selected number of processes per node (ppn), you will always be charged for full nodes (comparable to ppn=24).}}
----
----


== Special Jobs / Special Nodes ==
== Special Jobs / Special Nodes ==
=== MPP nodes with different memory (ONLY for CRAY XE6!) ===
=== Pre- and Postprocessing/Visualization nodes with large memory ===
<font color=green>3072</font> nodes of total <font color=red>3552</font> nodes are installed with <font color=green>32GB</font> memory; <font color=blue>480</font> nodes are installed with <font color=blue>64GB</font> memory.
11 visualisation nodes are integrated into the external nodes of Hazelhen.
 
* 3 nodes are equipped with 512 GB of main memory. Access to a single node is possible by using the node feature mem512gb.
==== 32 GB nodes or 64 GB nodes ====
* 3 nodes are equipped with 128 GB of main memory. Access to a single node is possible by using the node feature mem128gb.
* If your job has not defined any node feature, then your job gets a default feature '''"<font color=green>mem32gb</font>"''' which will allocate nodes with 32GB memory (see job examples above).  
* 5 nodes are equipped with 256 GB of main memory. Access to a single node is possible by using the node feature mem256gb.
 
<pre>
* If you want one or more of the 64GB nodes, then you have to specify the node feature '''"<font color=blue>mem64gb</font>"''':
user@eslogin00X> qsub -I -lnodes=1:mem512gb
 
</pre>
  qsub -l nodes=1:mem64gb <my_batchjob_script.pbs>
 
Or inside a simple script ''my_batchjob_script.pbs''


=== Shared node SMP with very large memory ===
Two multi user smp nodes are equipped with 1.5 TB of memory and a 3rd node is equipped with 1TB of memory. Access to these nodes is possible by using the node feature smp and the smp queue. Access is scheduled by the amount of required memory which has to be specified by the vmem feature.
<pre>
<pre>
#!/bin/bash
user@eslogin00X> qsub -I -lnodes=1:smp:ppn=1 -q smp -lvmem=5gb
#PBS -N job_name
#PBS -l nodes=1:mem64gb
#PBS -l walltime=00:20:00           
 
# Change to the direcotry that the job was submitted from
cd $PBS_O_WORKDIR
 
# Launch the parallel job to the allocated compute nodes
aprun -n 64 -N 32 ./my_mpi_executable arg1 arg2 > my_output_file 2>&1
</pre>
</pre>


==== Jobs with mixed node features ====
=== Test jobs ===
If your job needs some of the 64GB nodes and some of the 32GB nodes at the same time, then your job submission <font color=red>options</font> looks <font color=red>totally different</font>. <font color=red>Do not specify additional ''-l feature=<nodefeature>''!</font>
A special queue ''test'' is available for small and short test jobs with very high priority.
You only need to specify '''resource name''' ''<font color=blue>nodes=<node count>:ppn=<process count per node>:mem64gb+<node count>:ppn=<proc count per node>:mem32gb</font>'':
Limits are:  
 
* 1 job per user
  qsub -l nodes=1:ppn=32:mem64gb+64:ppn=32:mem32gb,walltime=3600 my_batchjob_script.pbs
* 25 minutes walltime
* 384 nodes per job
* only 400 nodes in total
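The per-job limits listed above can be checked in a submission wrapper before calling qsub. This is a minimal sketch covering only the node and walltime limits; the limits on jobs per user and on the total of 400 nodes depend on the current state of the queue and cannot be checked locally. The function name ''fits_test_queue'' is hypothetical.

```shell
#!/bin/bash
# Hypothetical pre-submission check against the per-job limits of the
# test queue: at most 384 nodes and at most 25 minutes walltime.
fits_test_queue() {
    local nodes=$1 walltime_min=$2
    (( nodes <= 384 && walltime_min <= 25 ))   # exit status 0 if within limits
}

if fits_test_queue 16 20; then
    echo "16 nodes for 20 minutes: within the test queue limits"
fi
if ! fits_test_queue 400 10; then
    echo "400 nodes: exceeds the per-job node limit"
fi
```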


The example above will allocate 65 nodes to your job for a maximum time of 3600 seconds and can place 32 processes on one node with 64GB memory and 32 processes on each of the 64 allocated nodes with 32GB memory. Important is option ''ppn=32'' to get all cores of the allocated mpp nodes.
Now you need to select your different allocated nodes for your ''aprun'' command in your script ''my_batchjob_script.pbs'':
<pre>
<pre>
#!/bin/bash
user@eslogin00X> qsub -lnodes=16:ppn=24 -q test mybatchjobscript
#PBS -N mixed_job
</pre>
#PBS -l nodes=1:ppn=32:mem64gb+2:ppn=32:mem32gb
#PBS -l walltime=300


### defining the number of PEs (processes per node ( max 32 for hermit | max 16 for hornet) ###
# p32: number of PEs (Processing Elements) on 32GB nodes
# p64: number of PEs (Processing Elements) on 64GB nodes
#-------------------------------------------------------
p32=32
p64=16


# Change to the directory that the job was submitted from
=== CCM jobs ===
cd $PBS_O_WORKDIR
The [http://docs.cray.com/cgi-bin/craydoc.cgi?mode=View;id=S-2496-5202;idx=books_search;this_sort=;q=2496;type=books;title=Workload%20Management%20and%20Application%20Placement%20for%20the%20Cray%20Linux%20Environment Cluster Compatibility Mode (CCM)] is a software solution that provides the services needed to run most cluster-based independent software vendor (ISV) applications out-of-the-box with some configuration adjustments.


<font color red>'''Note:'''</font> CCM is only available for some users, not by default!
If you need this feature, please ask your project manager for access to CCM.


### selecting nodes with different memory ###
=== separated output ===
#---------------------------------------
If you need to separate the output of your job, you can write a separate file for each rank using a wrapper shell script:
# 1. getting all nodes of my job
<pre>
nids=$(/opt/hlrs/system/tools/getjhostlist)
export PMI_NO_FORK=1
aprun -n24 -N24 bash -c "<exe> >& log.\$ALPS_APP_PE"
</pre>


# 2. getting the nodes with feature mem32gb of my job
OR
nid32=$(/opt/hlrs/system/tools/hostlistf mem32gb "$nids")
# how many nodes do I have with mem32gb:
i32=$(/opt/hlrs/system/tools/cntcommastr "$nid32")


# 3. getting the nodes with feature mem64gb of my job
use the ALPS_STD..._SPEC variable:
nid64=$(/opt/hlrs/system/tools/hostlistf mem64gb "$nids")
<pre>
# how many nodes do I have with mem64gb:
export PMI_NO_FORK=1
i64=$(/opt/hlrs/system/tools/cntcommastr "$nid64")
export ALPS_STDOUTERR_SPEC=<Outputdir>
 
aprun ...
 
(( P32 = $i32 * $p32 ))
(( P64 = $i64 * $p64 ))
(( D32 = 32 / $p32 ))
(( D64 = 32 / $p64 ))
 
# Launch the parallel job to the allocated compute nodes using
# Multi Program, Multi Data (MPMD) mode (see "man aprun")
# -------------------------------------------------
# $nid64 : node list with 64GB memory
# $i64    : number of nodes with 64GB memory
# $p64    : number of PEs per node on nodes with 64GB
# $P64    : total number of PEs (processing elements) on nodes with 64GB
# ----
# $nid32 : node list with 32GB memory
# $i32    : number of nodes with 32GB memory
# $p32    : number of PEs per node on nodes with 32GB
# $P32    : total number of PEs on nodes with 32GB
# ----------
# The "env OMP_NUM_THREADS=...." parts of the aprun command below are only useful for OpenMP (hybrid) programs.
#
aprun -L $nid64 -n $P64 -N $p64 -d $D64 env OMP_NUM_THREADS=$D64 ./my_executable1 : -L $nid32 -n $P32 -N $p32 -d $D32 env OMP_NUM_THREADS=$D32 ./my_executable2
</pre>
</pre>
You will then get the following files in your output directory:


By defining p64 and p32 in the example above you can control the number of processes on each node for the different node types (64GB memory and 32GB memory). Important to know is the maximum value is 32, the number of cores of each mpp node.
oe00000, oe00001, .... oe00099


----
It is also possible to separate stdout and stderr:
 
 
=== Nodes with Kepler accelerator cards (ONLY for CRAY XE6!) ===
<font color=green>28</font> nodes of total <font color=red>3552</font> nodes are installed with <font color=green>32GB</font> memory, <font color=blue> 16 </font> cores each and with an additional <font color=blue>Kepler accelerator card</font> installed.
 
There are 2 ways to submit a job allocating the nodes with the Kepler accelerators.
* Using the queue '''gpgpu''': 
  qsub -q gpgpu -l nodes=1:ppn=16 <myjobscript>
This allocates 1 node (16 cores) with Kepler accelerator, respectively nodes with feature '''tesla''' .
* Using the node feature '''tesla''':
  qsub -l nodes=1:ppn=16:tesla <myjobscript>
This also allocates 1 node (16 cores) with Kepler accelerator.
 
'''For ccm jobs you need to use the second way submitting to the queue <tt> ccm </tt>:'''
  qsub -q ccm -l nodes=1:ppn=16:tesla <myjobscript>
 
 
 
 
----
 
=== Pre- and Postprocessing/Visualization nodes with large memory (ONLY for CRAY XE6!) ===
A few number of external nodes are standard cluster nodes with 128GB memory and 32 cores of Intel Xeon CPU X7550.
These nodes are not connected to the GEMINI interconnect network of the CRAY mpp compute nodes. But these nodes have the same workspace and home filesystem mounted and they have a graphic NVIDIA Quadro 6000 installed for visualization.
To get one of the pre-postprocessing node, following qsub options are needed:
  qsub -l nodes=1:mem128gb,walltime=3600 ./my_batchjob_script.pbs
This allocates one of the pre-postprocessing node for 3600 seconds. It's not possible to get more than 1 of those nodes in one job.
 
Detail information for visualization are available on [[CRAY_XE6_Graphic_Environment | Graphic Environment]]. There you will find wrapper scripts and environment settings for your visualization work.
 
 
 
----
=== Shared node SMP with very large memory (ONLY for CRAY XE6!) ===
One external node has 1TB memory and 64 cores of Intel Xeon CPU X7550.
This node is not connected to the GEMINI interconnect network of the CRAY XE6 mpp compute nodes. But this node has the same workspace and home filesystem mounted. <font color=red>This node is shared by several users and jobs at the same time!</font>
Users need to request the number of cores and the maximum of total memory needed by all processes you want to run. Otherwise the requested job gets very small defaults. The requested memory will be enforced, which means the job will be killed in case the job allocates more memory than requested.


To submit a job to this node you need following qsub options:
ALPS_STDOUT_SPEC=<output dir> -> Files with extension 'o'.
  qsub -q smp -l nodes=1:smp:ppn=2,vmem=100gb,walltime=3600 ./my_batchjob_script.pbs
ALPS_STDERR_SPEC=<output dir> -> Files with extension 'e'.
This allocates 2 core (ppn=2) of the SMP node for 3600 seconds and limits the total memory used by all processes to 100GByte.  {{Note| text=All your processess will be grouped on 2 cores on this node. (see Linux cgroups mechanism)}}

Latest revision as of 18:32, 25 January 2017

Introduction

The only way to start a parallel job on the compute nodes of this system is to use the batch system. The installed batch system is based on

  • the resource management system torque and
  • the scheduler moab

In addition, on CRAY XE6/XC40 systems user applications are always launched on the compute nodes with the application launcher aprun, which submits them to the Application Level Placement Scheduler (ALPS) for placement and execution.

Detailed information about how to use the CRAY XC40, together with many examples, can be found in the Cray Programming Environment User's Guide and in Workload Management and Application Placement for the Cray Linux Environment.

  • ALPS is always used to schedule a job on the compute nodes, regardless of the programming model you use. A few general definitions are therefore needed:
    • PE: Processing Element, basically a Unix process; it can be an MPI task, a CAF image, a UPC thread, ...
    • numa_node: the cores and memory on a node with ‘flat’ memory access, basically one of the 2 dies on the Intel node together with its directly attached memory.
    • Thread: a thread is contained inside a process. Multiple threads can exist within the same process and share resources such as memory, while different PEs do not share these resources. Most likely you will use OpenMP threads.

Torque-aprun-cray.jpg

  • aprun is the ALPS (Application Level Placement Scheduler) application launcher
    • It must be used to run applications on the XE/XC compute nodes, both interactively and in a batch job
    • If aprun is not used, the application is launched on the MOM node (and will most likely fail)
    • The aprun man page contains several useful examples. There are at least 3 important parameters to control:
      • The total number of PEs: -n
      • The number of PEs per node: -N
      • The number of OpenMP threads per PE: -d (the 'stride' between 2 PEs in a node)
    • See also Understanding aprun
  • qsub is the torque submission command for batch job scripts.


Writing a submission script is typically the most convenient way to submit your job to the batch system. You generally interact with the batch system in two ways: through options specified in job submission scripts (these are detailed below in the examples) and by using torque or moab commands on the login nodes. There are three key commands used to interact with torque:

  • qsub
  • qstat
  • qdel

Check the man page of torque for more advanced commands and options

 man pbs

Requesting Resources using batch system TORQUE and ALPS

Batch Mode

Production jobs are typically run in batch mode. Batch scripts are shell scripts containing flags and commands to be interpreted by a shell and are used to run a set of commands in sequence.

  • The number of required nodes and cores, the wall time and more are set by "#PBS" lines in the job script header, placed before any executable command in the script.
#!/bin/bash
#PBS -N job_name
#PBS -l nodes=2:ppn=24
#PBS -l walltime=00:20:00             
  
# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR

# Launch the parallel job to the allocated compute nodes
aprun -n 48 -N 24 ./my_mpi_executable arg1 arg2 > my_output_file 2>&1
  • The job is submitted by the qsub command (all script head parameters #PBS can also be adjusted directly by qsub command options).
 qsub my_batchjob_script.pbs
  • Options set on the qsub command line override the settings given in the batch script:
 qsub -N other_name -l nodes=2:ppn=24,walltime=00:20:00 my_batchjob_script.pbs
  • The batch script is not necessarily granted resources immediately; it may sit in the queue of pending jobs for some time before the required resources become available.
  • At the end of the execution, the output and error files are returned to the submission directory.
  • This example runs the executable "my_mpi_executable" in parallel with 48 MPI processes. Torque allocates 2 nodes to your job for at most 20 minutes, and aprun places 24 processes on each node (one per core; a node of the XC40 system has 24 cores). The batch system allocates nodes exclusively to one job. Once the walltime limit is exceeded, the batch system terminates your job. The launcher for parallel XC40 jobs (both MPI and OpenMP) is aprun. It has to be started from a subdirectory of /mnt/lustre_server (your workspace), and the nodes must already have been allocated by the batch system (qsub). In the example above, aprun starts "my_mpi_executable" with the arguments "arg1" and "arg2".
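
The relation between the "#PBS -l nodes=...:ppn=..." request and the aprun arguments in the example above can be sketched in a few lines of shell. The variable names are illustrative only and are not part of Torque or ALPS; the aprun command is printed, not executed.

```shell
# Hypothetical helper variables; adapt them to your job.
nodes=2    # value used in "#PBS -l nodes=2:ppn=24"
ppn=24     # PEs per node (one per core on an XC40 node)

# The total number of PEs is the product of both values.
total_pes=$((nodes * ppn))

# Print the matching aprun call instead of running it (dry run).
aprun_cmd="aprun -n $total_pes -N $ppn ./my_mpi_executable arg1 arg2"
echo "$aprun_cmd"
```

Keeping both numbers in one place avoids a mismatch between the requested nodes and the -n/-N values when the job size changes.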

To query further options of aprun, please use

 man aprun 
 aprun -h
Warning: You have to change into a subdirectory of /mnt/lustre_server (your workspace), before calling aprun.


Interactive batch Mode

Interactive mode is typically used for debugging or optimizing code but not for running production code. To begin an interactive session, use the "qsub -I" command:

 qsub -I -l nodes=2:ppn=24,walltime=00:30:00

If the requested resources are available and free (in the example above: 2 nodes with 24 cores each, for 30 minutes), you will get a new session on the MOM node for your requested resources. You then have to use the aprun command to launch your application on the allocated compute nodes. When you are finished, enter logout to exit the batch system and return to the normal command line.

Notes

  • Remember that you use aprun within the context of a batch session, and the maximum size of the job is determined by the resources you requested when you launched the batch session. You cannot use aprun to claim more resources than you reserved with qsub. Once a batch session begins, you can only use the resources initially requested, or fewer.
  • While your job is running (in Batch Mode), STDOUT and STDERR are written to a file or files in a system directory, and the output is copied to your submission directory only after the job completes. Specifying the "qsub -j oe" option and redirecting the output to a file (see the examples above) lets you view STDOUT and STDERR while the job is running.

Run job on other Account ID

Unix groups are associated with the project account ID (ACID). To run a job on a non-default project budget (associated with a secondary group), the group name of this project has to be passed in the group_list:

qsub -W group_list=<groupname> ...

To get your available groups:

id -Gn
Warning: this procedure is neither applicable nor necessary for the default project (associated with the primary group), which is printed by "id -gn".
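
As a small sketch of the procedure above, the following shell fragment lists the secondary groups of the current user and prints one qsub call for each of them. The function name secondary_groups and the script name are hypothetical, group names are assumed to contain no whitespace, and qsub is only echoed.

```shell
# Print the secondary groups of the current user, one per line.
# Assumes group names contain no whitespace.
secondary_groups() {
    primary=$(id -gn)
    for group in $(id -Gn); do
        # The primary group is the default project and needs no option.
        if [ "$group" != "$primary" ]; then
            echo "$group"
        fi
    done
}

# Dry run: show the qsub call for every non-default project.
for group in $(secondary_groups); do
    echo "qsub -W group_list=$group my_batchjob_script.pbs"
done
```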


Usage of a Reservation

For nodes which are reserved for special groups or users, you need to name the reservation with an additional option.

E.g. a reservation named john.1 is used with the following command:
qsub -W x=FLAGS:ADVRES:john.1 ...

Deleting a Batch Job

 qdel <jobID>
 canceljob <jobID>

These commands allow you to remove jobs from the job queue. If the job is running, qdel will abort it. You can obtain the job ID from the output of "qstat", or from the output that qsub printed when you submitted the job.
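
One way to keep the job ID at hand is to store the output of qsub in a variable. The sketch below assumes that qsub prints an identifier of the form <number>.<server>; the sample value "1234567.sdb" is made up for illustration, and the qstat/qdel calls are only printed.

```shell
# Hypothetical sample of what qsub prints; on the real system use
#   jobid_full=$(qsub my_batchjob_script.pbs)
jobid_full="1234567.sdb"

# Strip everything from the first dot to get the numeric job ID.
jobid=${jobid_full%%.*}

echo "qstat $jobid"   # check the job
echo "qdel $jobid"    # remove it from the queue
```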

Status Information

  • Status of jobs:
 qstat
 qstat -a
 showq
  • Status of queues:
 qstat -q
 qstat -Q
 
  • Status of job scheduling
 checkjob <jobID>
 showstart <jobID>
  • Status of backfill. This can help you build small jobs that can be backfilled immediately while you wait for the resources of your larger jobs to become available:
 showbf
 xtnodestat
 apstat

Note: for further details type on the login node:

 man qstat
 man apstat
 man xtnodestat
 showbf -h
 showq -h
 checkjob -h
 showstart -h
Note: be aware that the output of all these commands shows the state of the system at the moment the command is issued. The start time of a job, for instance, also depends on other events such as jobs submitted in the future that may fit better into the scheduling of the machine, on the state of the hardware, and on other queues and reservations.


  • Resource Utilization Reporting (RUR) is a tool for gathering statistics on how system resources are being used by applications. At HLRS, RUR is configured to write a single file in the user's home directory: rur.out. The file contains the output of each plugin used by RUR. The plugins are "taskstats", "energy" and "timestamp".
  • At the end of your job output file you will find resource information such as:
Application 6640730 resources: utime ~49s, stime ~5s, Rss ~8504, inblocks ~15127, outblocks ~2498

where:

    • utime: user time used
    • stime: system time used
    • Rss: maximum resident set size (memory)
    • inblocks: block input operations
    • outblocks: block output operations

The values are summed from all app processes.
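
The resource line can be split into its fields with standard tools. The following sketch parses the sample line shown above with sed; the function name rur_value is hypothetical, and the pattern assumes the exact "key ~value" layout of that sample.

```shell
# Sample line copied from the text above.
line='Application 6640730 resources: utime ~49s, stime ~5s, Rss ~8504, inblocks ~15127, outblocks ~2498'

# Print the number that follows "<key> ~" in the sample line.
rur_value() {
    printf '%s\n' "$line" | sed -n "s/.* $1 ~\([0-9][0-9]*\).*/\1/p"
}

utime=$(rur_value utime)
stime=$(rur_value stime)
echo "CPU time: $((utime + stime))s, Rss: $(rur_value Rss)"
```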

Limitations

Understanding aprun

Running Applications


Core specialization

System 'noise' on compute nodes may significantly degrade scalability for some applications. Core specialization can mitigate this problem:

  • 1 core per node will be dedicated for system work (service core)
  • As many system interrupts as possible will be forced to execute on the service core
  • The application will not run on the service core

To get core specialization use aprun -r

 aprun -r1 -n 100 a.out

With -r1, one core per node is used for system work and 23 cores remain available for computation.

Furthermore, the tool apcount computes the total number of cores required for a given setup that uses specialization cores. For further instructions see

 man apcount
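
The arithmetic behind such a setup can be sketched by hand. The fragment below is not the apcount implementation; it only illustrates, with made-up variable names, how many nodes the example "aprun -r1 -n 100" needs when 23 of 24 cores per node remain usable.

```shell
cores_per_node=24   # XC40 node without hyperthreading
spec_cores=1        # value passed to "aprun -r"
pes=100             # value passed to "aprun -n"

usable=$((cores_per_node - spec_cores))
# Integer ceiling division: nodes needed to host all PEs.
nodes=$(( (pes + usable - 1) / usable ))

echo "usable cores per node: $usable"
echo "nodes to request: $nodes (ppn=$cores_per_node)"
```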

Hyperthreading

Cray XC40 compute nodes are always booted with hyperthreading switched ON. Users can choose to run with one or two PEs or threads per core; the default is one. You make the choice at runtime:

aprun -n### -j1 … -> Single Stream mode, one rank per core (default)

aprun -n### -j2 … -> Dual Stream mode, two ranks per core

The numbering of the cores in single stream mode is 0-11 for die 0 and 12-23 for die 1. In dual stream mode the numbering of the first 24 cores stays the same, and cores 24-35 are on die 0 and 36-47 on die 1. Note that this makes the numbering of the cores per die in hyperthread mode non-contiguous:

Mode            cores on die 0    cores on die 1
Single Stream   0-11              12-23
Dual Stream     0-11, 24-35       12-23, 36-47

Note: the cores are assigned consecutively, which means in hyperthread mode: 0,24,1,25,...,11,35,12,36,...,23,47.
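
The assignment order from the note above can be generated with a short loop, which may help when a custom CPU list is needed. This is only an illustration of the numbering scheme described here, using made-up variable names.

```shell
order=""
core=0
while [ "$core" -lt 24 ]; do
    # Each physical core contributes its first and second hardware thread.
    order="$order${order:+,}$core,$((core + 24))"
    core=$((core + 1))
done
echo "$order"
```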

aprun CPU Affinity control

CLE can dynamically distribute work by allowing PEs and threads to migrate from one CPU to another within a node. In some cases, moving PEs or threads from CPU to CPU increases cache and translation lookaside buffer (TLB) misses and therefore reduces performance. The CPU affinity options let you bind a PE or thread to a particular CPU or to a subset of CPUs on a node.

  • aprun CPU affinity options (see also man aprun)
    • Default setting: -cc cpu (PEs are bound to a specific core, depending on the -d setting)
    • Binding PEs to a specific NUMA node: -cc numa_node (PEs are not bound to a specific core but cannot ‘leave’ their numa_node)
    • No binding: -cc none
    • Custom binding: -cc 0,4,3,2,1,16,18,31,9,....

aprun Memory Affinity control

Cray XC40 systems use dual-socket compute nodes, i.e. 2 dies per node. On the 24-CPU Cray XC40 compute nodes, NUMA nodes 0 and 1 have 12 CPUs each (logical CPUs 0-11 and 12-23, respectively). If your application uses Intel Hyperthreading Technology, up to 48 processing elements can be used (logical CPUs 0-11 and 24-35 are on NUMA node 0, CPUs 12-23 and 36-47 are on NUMA node 1). Even if your PEs and threads are bound to a specific numa_node, the memory they use does not have to be ‘local’.

  • aprun memory affinity options (see also man aprun)
    • Suggested setting is -ss (a PE can only allocate memory local to its assigned NUMA node. If this is not possible, your application will crash.)
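
To see which NUMA node a logical CPU belongs to, the numbering given above can be turned into a small lookup. The function name numa_node_of is hypothetical; the mapping simply restates the CPU ranges from the text.

```shell
# Print the NUMA node (0 or 1) of a logical CPU number 0-47,
# following the numbering given in the text above.
numa_node_of() {
    physical=$(( $1 % 24 ))   # hyperthread siblings share a physical core
    if [ "$physical" -lt 12 ]; then echo 0; else echo 1; fi
}

for cpu in 0 11 12 23 24 35 36 47; do
    echo "CPU $cpu -> NUMA node $(numa_node_of "$cpu")"
done
```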

Some basic aprun examples

Assuming an XC40 with Haswell nodes (48 logical cores per node with Hyperthreading)

Pure MPI application, using all the available cores in a node
  aprun -n 48 -N 48 -j2 ./a.out
Pure MPI application, using only 1 core per node

24 MPI tasks on 24 nodes, i.e. 24*24 cores are allocated. This can be done to increase the memory available to each MPI task.

 aprun -n 24 -N 1 -d24 ./a.out
Hybrid MPI/OpenMP application, 4 MPI ranks per node

24 MPI tasks with 12 OpenMP threads each; OMP_NUM_THREADS needs to be set.

 export OMP_NUM_THREADS=12
 aprun -n 24 -N 4 -d $OMP_NUM_THREADS -j2 ./a.out
MPI and OpenMP with Intel PE

Intel RTE creates one extra thread when spawning the worker threads. This makes correct and efficient pinning more difficult for aprun. In the default setting (OMP_NUM_THREADS=$omps and aprun -d $num_d) the threads are scheduled round robin: the extra thread is scheduled as the second thread and lands on the second CPU, while at the end two application threads (the first and the last one) are both placed on the first CPU. This results in a significant performance degradation.

However, this extra thread usually has no significant workload, so it does not affect the performance of an application thread located on the same CPU.

We therefore suggest adding the -cc depth option. As a result, all threads can migrate within the specified cpumask. For example, when using:

 export OMP_NUM_THREADS=$omps
 aprun -n $npes -N $ppn -d $OMP_NUM_THREADS -cc depth a.out

each of the $omps computational threads will be located on its own CPU, and the extra thread on one of these.


Multiple Program Multiple Data (MPMD)

aprun supports MPMD – Multiple Program Multiple Data.

  • Launching several executables which are all part of the same MPI_COMM_WORLD
 aprun -n 96 exe1 : -n 48 exe2 : -n 48 exe3
  • Note: each executable needs a dedicated node; exe1 and exe2 cannot share a node.

Example: the following command needs 3 nodes

 aprun -n 1 exe1 : -n 1 exe2 : -n 1 exe3
  • Use a script to start several serial jobs on a node:
 aprun -a xt -n 1 -d 24 -cc none script.sh
 >cat script.sh 
 ./exe1& 
 ./exe2& 
 ./exe3&
 wait 
 >
cpu_lists for each PE

CLE was updated to allow threads and processing elements more flexibility in placement. This is ideal for processor architectures whose cores share resources and may have to wait for them. Separating cpu_lists by colons (:) lets the user specify the cores used by each processing element and its child processes or threads, which gives finer control over placement.

Here is an example with 3 threads per PE:

 aprun -n 4 -N 4 -cc 1,3,5:7,9,11:13,15,17:19,21,23
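
Writing such a list by hand is error-prone, so it can be generated. The sketch below rebuilds the cpu_list of the example above from four made-up parameters and prints the resulting aprun call; the executable name ./a.out is a placeholder.

```shell
pes=4        # PEs on the node (aprun -N)
threads=3    # CPUs per PE
stride=2     # distance between two CPUs (2 = every second core)
cpu=1        # first CPU to use

cc=""
pe=0
while [ "$pe" -lt "$pes" ]; do
    list=""
    t=0
    while [ "$t" -lt "$threads" ]; do
        list="$list${list:+,}$cpu"      # comma-separated CPUs of one PE
        cpu=$((cpu + stride))
        t=$((t + 1))
    done
    cc="$cc${cc:+:}$list"               # colon-separated lists, one per PE
    pe=$((pe + 1))
done
echo "aprun -n $pes -N $pes -cc $cc ./a.out"
```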

Job Examples

This batch script template serves as the basis for the aprun examples given later.

#! /bin/bash

#PBS -N <job_name>
#PBS -l nodes=<number_of_nodes>:ppn=24
#PBS -l walltime=00:01:00

cd $PBS_O_WORKDIR # This is the directory where this script and the executable are located. 
# You can choose any other directory on the lustre file system.

export OMP_NUM_THREADS=<nt>
<aprun_command>


The keywords <job_name>, <number_of_nodes>, <nt>, and <aprun_command> have to be replaced and the walltime adapted accordingly (one minute is given in the template above). The OMP_NUM_THREADS environment variable is only relevant for applications using OpenMP. Please note that OpenMP directives are recognized by default by the Cray compiler and can be turned off with the -hnoomp option. For the Intel, GNU, and PGI compilers you have to use the corresponding flag to enable OpenMP recognition.

The following parameters for the template above should cover the vast majority of applications and are given for both the XE6 and the XC40 platform at HLRS. The <exe> keyword should be replaced by your application.

Job types

The compute nodes of the XC40 platform Hazelhen feature two Haswell processors with 12 cores and one NUMA domain each, resulting in a total of 24 cores and 2 NUMA domains per node. One conceptual difference between the Interlagos nodes on the XE6 and the Haswell nodes on the XC40 is the Hyperthreading feature of the Haswell processor. Hyperthreading is always booted, and whether it is used or not is controlled via the -j option to aprun. Using -j 2 enables Hyperthreading while -j 1 (the default) does not. With Hyperthreading enabled, a compute node of the XC40 provides 48 logical cores instead of 24.

  • Description: Serial application (no MPI or OpenMP)
 <number_of_nodes>:  1
 <aprun_command>:  aprun -n 1 <exe>


  • Description: Pure OpenMP application (no MPI)
 <number_of_nodes>:  1
 <nt>: 24
 <aprun_command>:  aprun -n 1 -d $OMP_NUM_THREADS <exe>

Comment: You can vary the number of threads from 1-24.

  • Description: Pure MPI application on two nodes fully packed (no OpenMP) with Hyperthreads
 <number_of_nodes>:  2
 <aprun_command>:  aprun -n 96 -N 48 -j 2 <exe>


  • Description: Pure MPI application on two nodes fully packed (no OpenMP) without Hyperthreads
 <number_of_nodes>:  2
 <aprun_command>:  aprun -n 48 -N 24 -j 1 <exe>

Comment: Here you can also omit the -j 1 option as it is the default. This configuration corresponds to the wide-AVX case on the XE6 nodes.


  • Description: Mixed (Hybrid) MPI OpenMP application on two nodes with Hyperthreading.
 <number_of_nodes>:  2
 <nt>: 2
 <aprun_command>:  aprun -n 48 -N 24 -d $OMP_NUM_THREADS -j 2 <exe>




General remarks

aprun allows you to start an application with more OpenMP threads than there are compute cores available. This oversubscription results in a substantial performance degradation. The same happens if the -d value is smaller than the number of OpenMP threads used by the application. Furthermore, the Intel programming environment spawns an additional helper thread per processing element, which can lead to oversubscription. Here, you can use the -cc numa_node or the -cc none option of aprun to avoid oversubscribing the hardware. The default behavior, i.e. if no -cc is specified, is the same as -cc cpu, which means that each processing element and thread is pinned to a processor. Please consult the aprun man page. Another popular aprun option is -ss, which constrains memory allocation to the same NUMA node to which the processing element or thread is constrained. You can use the xthi.c utility to check the affinity of threads and processing elements.
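
A simple pre-check can catch one source of oversubscription before the job is submitted: the product of -N and -d should not exceed the logical CPUs of a node. The function check_fit below is a hypothetical helper based on the 24 cores per node stated above; it does not reproduce any check that aprun itself performs.

```shell
# check_fit PES_PER_NODE DEPTH J
# Succeeds if N * d fits into the logical CPUs of one node.
check_fit() {
    cpus=$((24 * $3))              # 24 cores, times 1 or 2 streams (-j)
    [ $(( $1 * $2 )) -le "$cpus" ]
}

check_fit 24 2 2 && echo "-N 24 -d 2 -j 2: fits"
check_fit 24 2 1 || echo "-N 24 -d 2 -j 1: oversubscribed"
```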



Warning: Deprecated CRAY qsub syntax using mppwidth, mppnppn, mppdepth, feature, ...

The qsub arguments specific to CRAY XE6 systems (mppwidth, mppnppn, mppdepth, feature) are deprecated in this batch system version. Most of their functionality is still available. Nevertheless, we recommend not using these qsub arguments anymore. Please use the syntax shown in all examples of this document:


 qsub -l nodes=2:ppn=24 <myjobscript>

(replaces: qsub -l mppwidth=48,mppnppn=24)

  • nodes: replaces mppwidth divided by mppnppn (the number of nodes)
  • ppn: replaces mppnppn
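
The conversion from the old to the new syntax is a small calculation. The following sketch uses the values of the example above and only prints the resulting qsub call; rounding up for widths that are not a multiple of mppnppn is an assumption of this sketch.

```shell
mppwidth=48    # old syntax: total number of PEs
mppnppn=24     # old syntax: PEs per node

ppn=$mppnppn
nodes=$(( (mppwidth + mppnppn - 1) / mppnppn ))   # round up to full nodes

echo "qsub -l nodes=$nodes:ppn=$ppn <myjobscript>"
```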

Please note that in the examples above keywords such as nodes or ppn are specified directly in the script via #PBS lines. In the example in this warning box the keywords are specified on the command line, which is also allowed. But you cannot specify the keywords both in the script and on the command line.


Warning: Independent of the selected number of processes per node (ppn), you will always be charged for full nodes (as with ppn=24).

Special Jobs / Special Nodes

Pre- and Postprocessing/Visualization nodes with large memory

11 visualisation nodes are integrated into the external nodes of Hazelhen.

  • 3 nodes are equipped with 512 GB of main memory. Access to a single node is possible by using the node feature mem512gb.
  • 3 nodes are equipped with 128 GB of main memory. Access to a single node is possible by using the node feature mem128gb.
  • 5 nodes are equipped with 256 GB of main memory. Access to a single node is possible by using the node feature mem256gb.
user@eslogin00X> qsub -I -lnodes=1:mem512gb

Shared node SMP with very large memory

Two multi user smp nodes are equipped with 1.5 TB of memory and a 3rd node is equipped with 1TB of memory. Access to these nodes is possible by using the node feature smp and the smp queue. Access is scheduled by the amount of required memory which has to be specified by the vmem feature.

user@eslogin00X> qsub -I -lnodes=1:smp:ppn=1 -q smp -lvmem=5gb

Test jobs

A special queue test is available for small and short test jobs with very high priority. Limits are:

  • 1 job per user
  • 25 minutes walltime
  • 384 nodes per job
  • only 400 nodes in total
user@eslogin00X> qsub -lnodes=16:ppn=24 -q test mybatchjobscript
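
Before submitting to the test queue, a request can be compared with the per-job limits listed above. The helper check_test_job is hypothetical and only covers the node and walltime limits; the qsub call is printed, not executed.

```shell
# check_test_job NODES WALLTIME_MINUTES
# Succeeds if the request stays within 384 nodes and 25 minutes.
check_test_job() {
    [ "$1" -le 384 ] && [ "$2" -le 25 ]
}

if check_test_job 16 20; then
    echo "qsub -l nodes=16:ppn=24,walltime=00:20:00 -q test mybatchjobscript"
fi
```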


CCM jobs

The Cluster Compatibility Mode (CCM) is a software solution that provides the services needed to run most cluster-based independent software vendor (ISV) applications out-of-the-box with some configuration adjustments.

Note: CCM is only available for some users, not by default! If you need this feature, please ask your project manager for access to CCM.

separated output

If you need to separate the output of your job, you can write a separate file for each rank using a wrapper shell script:

export PMI_NO_FORK=1
aprun -n24 -N24 bash -c "<exe> >& log.\$ALPS_APP_PE"

OR

use the ALPS_STD..._SPEC variable:

export PMI_NO_FORK=1
export ALPS_STDOUTERR_SPEC=<Outputdir>
aprun ...

You will then get the following files in your output directory:

oe00000, oe00001, .... oe00099

It is also possible to separate stdout and stderr:

ALPS_STDOUT_SPEC=<output dir> -> Files with extension 'o'.
ALPS_STDERR_SPEC=<output dir> -> Files with extension 'e'.
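
After a run with the wrapper variant, the per-rank files log.<rank> often need to be merged again. The sketch below creates a few demo files in a scratch directory (in a real job they are written by the aprun wrapper) and concatenates them in numeric rank order.

```shell
# Demo setup: create a few per-rank logs in a scratch directory.
dir=$(mktemp -d)
for rank in 0 1 2 10; do
    echo "output of rank $rank" > "$dir/log.$rank"
done

# Concatenate in numeric rank order
# (a plain glob would sort log.10 before log.2).
merged=$(cd "$dir" && ls log.* | sort -t . -k 2 -n | xargs cat)
echo "$merged"
rm -r "$dir"
```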