- Information in the HLRS Wiki is not legally binding and is provided without warranty -

CRAY XE6 notes for the upgraded Batch System

The batch system on the CRAY XE6 (hermit) was upgraded on Tue, 6 May 2014. Most functionality can be used exactly as in the old version. However, some things have changed, and some users may need to modify their batch job submission scripts on hermit. Please take a look at the following points:

Jobs with mixed node features

See also the old version of the batch system for comparison.

Warning: the old example will not work after the upgrade!
Note: The new example works on both batch system versions, so you can already use it now.

Here is a new example batch job for requests with mixed node features for the new version of the batch system on hermit:

You need to specify the resource nodes=<node count>:ppn=<process count per node>:mem64gb+<node count>:ppn=<process count per node>:mem32gb:

 qsub -l nodes=1:ppn=32:mem64gb+64:ppn=32:mem32gb,walltime=3600 my_batchjob_script.pbs

The example above allocates 65 nodes to your job for a maximum of 3600 seconds. It can place 32 processes on the one node with 64GB memory and 32 processes on each of the 64 allocated nodes with 32GB memory. The option ppn=32 is important: it is needed to get all cores of the allocated mpp nodes.
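To make the arithmetic behind such a request explicit, here is a minimal sketch that sums up the nodes and cores described by a nodes= resource string. The function summarize_request is a hypothetical helper written for this illustration and is not part of the batch system. It assumes that a group without ppn gets one core per node.

```shell
#!/bin/bash
# Sketch only: summarize_request is a hypothetical helper, not a batch system tool.
# It splits the resource string at '+' into node groups and at ':' into fields.
summarize_request() {
    local spec="$1" nodes=0 cores=0 group field count ppn old_ifs="$IFS"
    IFS='+'
    for group in $spec; do
        count="${group%%:*}"       # leading number = node count of this group
        ppn=1                      # assumption: without ppn, one core per node
        IFS=':'
        for field in $group; do
            case "$field" in
                ppn=*) ppn="${field#ppn=}" ;;
            esac
        done
        IFS='+'
        nodes=$(( nodes + count ))
        cores=$(( cores + count * ppn ))
    done
    IFS="$old_ifs"
    echo "$nodes nodes, $cores cores"
}

# The request from the qsub example above
summarize_request "1:ppn=32:mem64gb+64:ppn=32:mem32gb"   # prints "65 nodes, 2080 cores"
```

The node features (mem64gb, mem32gb) are ignored here because they do not change the counts.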

Now you need to select your different allocated nodes for the aprun command in your script my_batchjob_script.pbs. An example for the new batch system:

#PBS -N mixed_job
#PBS -l nodes=1:ppn=32:mem64gb+2:ppn=32:mem32gb
#PBS -l walltime=300

### defining the number of PEs (processes) per node (max 32 for hermit | max 16 for hornet) ###
# p32: number of PEs (Processing Elements) on 32GB nodes
# p64: number of PEs (Processing Elements) on 64GB nodes
p32=32   # example value
p64=32   # example value

# Change to the directory that the job was submitted from
cd "$PBS_O_WORKDIR"

### selecting nodes with different memory ###
# 1. getting all nodes of my job

# 2. getting the nodes with feature mem32gb of my job
nid32=$(/opt/hlrs/system/tools/hostlistf mem32gb "$nids")
# how many nodes do I have with mem32gb:
i32=$(/opt/hlrs/system/tools/cntcommastr "$nid32")

# 3. getting the nodes with feature mem64gb of my job
nid64=$(/opt/hlrs/system/tools/hostlistf mem64gb "$nids")
# how many nodes do I have with mem64gb:
i64=$(/opt/hlrs/system/tools/cntcommastr "$nid64")

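# total number of PEs per node type
(( P32 = i32 * p32 ))
(( P64 = i64 * p64 ))
# cores per PE (aprun -d): 32 cores per node divided by the PEs per node
(( D32 = 32 / p32 ))
(( D64 = 32 / p64 ))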

# Launch the parallel job to the allocated compute nodes using
# Multi Program, Multi Data (MPMD) mode (see "man aprun")
# -------------------------------------------------
# $nid64 : node list with 64GB memory
# $i64     : number of nodes with 64GB memory
# $p64    : number of PEs per node on nodes with 64GB
# $P64    : total number of PEs (processing elements) on nodes with 64GB
# ----
# $nid32 : node list with 32GB memory
# $i32     : number of nodes with 32GB memory
# $p32    : number of PEs per node on nodes with 32GB
# $P32    : total number of PEs on nodes with 32GB
# ----------
# The "env OMP_NUM_THREADS=...." parts of the aprun command below are only useful for OpenMP (hybrid) programs.
aprun -L $nid64 -n $P64 -N $p64 -d $D64 env OMP_NUM_THREADS=$D64 ./my_executable1 : -L $nid32 -n $P32 -N $p32 -d $D32 env OMP_NUM_THREADS=$D32 ./my_executable2

By defining p64 and p32 in the example above, you can control the number of processes on each node for the different node types (64GB memory and 32GB memory). This corresponds to the qsub job option "-l mppnppn=32" for mpp jobs with a single node type (see the examples in the previous chapters). Note that the maximum value is 32, the number of cores of each mpp node.
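The following sketch walks through the arithmetic of the script with made-up values. The node lists are invented, and count_entries is a hypothetical stand-in for /opt/hlrs/system/tools/cntcommastr, which is assumed here to count the entries of a comma-separated list. The aprun line is only printed, not executed.

```shell
#!/bin/bash
# Hypothetical stand-in for cntcommastr (assumption: it counts comma-separated entries).
count_entries() {
    local list="$1" old_ifs="$IFS" n=0 item
    IFS=','
    for item in $list; do n=$(( n + 1 )); done
    IFS="$old_ifs"
    echo "$n"
}

# Invented node lists, standing in for what hostlistf might return
nid64="12"
nid32="40,41"
p64=8      # PEs per 64GB node
p32=16     # PEs per 32GB node

i64=$(count_entries "$nid64")
i32=$(count_entries "$nid32")
P64=$(( i64 * p64 ))   # total PEs on 64GB nodes
P32=$(( i32 * p32 ))   # total PEs on 32GB nodes
D64=$(( 32 / p64 ))    # cores per PE on 64GB nodes
D32=$(( 32 / p32 ))    # cores per PE on 32GB nodes

# Print the resulting command instead of running it
echo "aprun -L $nid64 -n $P64 -N $p64 -d $D64 env OMP_NUM_THREADS=$D64 ./my_executable1 :" \
     "-L $nid32 -n $P32 -N $p32 -d $D32 env OMP_NUM_THREADS=$D32 ./my_executable2"
```

With these values the first executable gets 8 PEs with 4 cores each on the single 64GB node, and the second gets 32 PEs with 2 cores each on the two 32GB nodes.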

Deprecated CRAY qsub syntax using mppwidth, mppnppn, mppdepth, feature, ...

The qsub arguments available specifically for CRAY XE6 systems (mppwidth, mppnppn, mppdepth, feature) are deprecated in the new batch system version. Most of their functionality is still available in this version. Nevertheless, we recommend that you no longer use these qsub arguments. Please use the following syntax:

 qsub -l nodes=2:ppn=32:mem32gb <myjobscript>

(replaces: qsub -l mppwidth=64,mppnppn=32,feature=mem32gb)

  • nodes: replacement for mppwidth/mppnppn
  • ppn: replacement for mppnppn
  • mem32gb: replacement for feature=mem32gb
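As a rough illustration of the mapping listed above, the following sketch converts the old arguments into the new resource string. The function to_nodes_syntax is hypothetical. Rounding the node count up when mppwidth is not a multiple of mppnppn is an assumption of this sketch, not documented behavior.

```shell
#!/bin/bash
# Sketch only: to_nodes_syntax is a hypothetical helper.
# Arguments: mppwidth, mppnppn, optional feature (e.g. mem32gb)
to_nodes_syntax() {
    local mppwidth="$1" mppnppn="$2" feature="$3"
    # nodes = mppwidth / mppnppn, rounded up (assumption for uneven values)
    local nodes=$(( (mppwidth + mppnppn - 1) / mppnppn ))
    echo "nodes=${nodes}:ppn=${mppnppn}${feature:+:$feature}"
}

to_nodes_syntax 64 32 mem32gb   # prints "nodes=2:ppn=32:mem32gb"
```

The printed string can then be passed to qsub via -l.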

Please note that in the examples below, keywords such as nodes or ppn are specified directly in the script via #PBS lines. In the example in this subsection, the keywords are specified on the command line, which is also allowed. However, you cannot specify the keywords both in the script and on the command line.

Note: Omitting the argument ppn=32 will allocate only one core per node on the XE6, but the user will be charged for the entire node.

Parallel Job Examples

This batch script template serves as the basis for the aprun examples given later.

Warning: Please note the addition of ppn=32. This is only necessary for the XE6 platform Hermit. For the XC30, ppn has to be omitted.

#! /bin/bash

#PBS -N <job_name>
#PBS -l nodes=<number_of_nodes>:ppn=32
#PBS -l walltime=00:01:00

cd "$PBS_O_WORKDIR" # The directory the job was submitted from. This template assumes that this script and the executable are located there.
# You can choose any other directory on the lustre file system.

export OMP_NUM_THREADS=<nt>

<aprun_command>

The keywords <job_name>, <number_of_nodes>, <nt>, and <aprun_command> have to be replaced and the walltime adapted accordingly (one minute is given in the template above). The OMP_NUM_THREADS environment variable is only important for applications using OpenMP. Please note that OpenMP directives are recognized by default by the Cray compiler and can be turned off with the -hnoomp option. For the Intel, GNU, and PGI compilers, one has to use the corresponding flag to enable OpenMP recognition.

The following parameters for the template above should cover the vast majority of applications and are given for both the XE6 and the XC30 platform at HLRS. The <exe> keyword should be replaced by your application.

XE6 Platform

The nodes of the XE6 feature two Interlagos processors with 16 cores each, resulting in a total of 32 cores per node. Each Interlagos processor forms two NUMA domains of 8 cores each, resulting in a total of four NUMA domains per node.

  • Description: Serial application (no MPI or OpenMP)
 <number_of_nodes>:  1
 <aprun_command>:  aprun -n 1 <exe>

  • Description: Pure OpenMP application (no MPI)
 <number_of_nodes>:  1
 <nt>: 32
 <aprun_command>:  aprun -n 1 -d $OMP_NUM_THREADS <exe>

Comment: You can vary the number of threads from 1-32.

  • Description: Pure MPI application on two nodes fully packed (no OpenMP)
 <number_of_nodes>:  2
 <aprun_command>:  aprun -n 64 -N 32 <exe>

Comment: The -n option specifies the total number of processing elements (PEs) and -N the number of PEs per node. The -n value has to be less than or equal to 32*<number_of_nodes> and the -N value less than or equal to 32. Finally, the -n value divided by the -N value has to be less than or equal to <number_of_nodes>. You can increase the number of nodes as needed and vary the remaining parameters accordingly.

  • Description: Pure MPI application on two nodes in wide-AVX mode (no OpenMP)
 <number_of_nodes>:  2
 <aprun_command>:  aprun -n 32 -N 16 -d 2 <exe>

Comment: The -d 2 option reserves two cores per PE, which places the PEs evenly among the cores of the node. This doubles the memory bandwidth and the floating-point unit resources available per PE.

  • Description: Mixed (Hybrid) MPI OpenMP application on two nodes
 <number_of_nodes>:  2
 <nt>: 8
 <aprun_command>:  aprun -n 8 -N 4 -d $OMP_NUM_THREADS <exe>

Comment: In addition to the constraints mentioned above, the -d value times the -N value has to be less than or equal to 32. This configuration runs one processing element per NUMA domain, and each PE spawns 8 threads.
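The constraints on -n, -N and -d stated in the comments above can be checked before submitting a job. The following is a minimal sketch. The function check_aprun and its messages are hypothetical and not part of aprun. The number of cores per node defaults to 32 for the XE6.

```shell
#!/bin/bash
# Sketch only: check_aprun is a hypothetical helper, not an aprun feature.
# Arguments: <number_of_nodes> <-n value> <-N value> [<-d value>] [<cores per node>]
check_aprun() {
    local nodes="$1" n="$2" N="$3" d="${4:-1}" cores="${5:-32}"
    # -N times -d must fit on one node (with d=1 this is the plain -N limit)
    if [ $(( N * d )) -gt "$cores" ]; then
        echo "-N times -d exceeds $cores cores per node"; return 1
    fi
    # -n must not exceed the cores of all allocated nodes
    if [ "$n" -gt $(( cores * nodes )) ]; then
        echo "-n exceeds $(( cores * nodes )) cores in total"; return 1
    fi
    # -n divided by -N (rounded up) must not exceed the node count
    if [ $(( (n + N - 1) / N )) -gt "$nodes" ]; then
        echo "-n / -N needs more than $nodes nodes"; return 1
    fi
    echo "ok"
}

check_aprun 2 64 32             # fully packed pure MPI: ok
check_aprun 2 32 16 2           # wide-AVX mode: ok
check_aprun 2 8 4 8             # hybrid, one PE per NUMA domain: ok
check_aprun 2 8 4 16 || true    # 4 PEs x 16 threads do not fit on 32 cores
```

The first three calls correspond to the XE6 examples listed above.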

XC30 Platform

The compute nodes of the XC30 platform Hornet feature two SandyBridge processors with 8 cores and one NUMA domain each, resulting in a total of 16 cores and 2 NUMA domains per node. One conceptual difference between the Interlagos nodes on the XE6 and the SandyBridge nodes on the XC30 is the Hyperthreading feature of the SandyBridge processor. Hyperthreading is always enabled at boot, and whether it is used is controlled via the -j option to aprun. Using -j 2 enables Hyperthreading, while -j 1 (the default) does not. With Hyperthreading enabled, a compute node on the XC30 provides 32 logical cores instead of 16.

  • Description: Pure MPI application on two nodes fully packed (no OpenMP) with Hyperthreads
 <number_of_nodes>:  2
 <aprun_command>:  aprun -n 64 -N 32 -j 2 <exe>

  • Description: Pure MPI application on two nodes fully packed (no OpenMP) without Hyperthreads
 <number_of_nodes>:  2
 <aprun_command>:  aprun -n 32 -N 16 -j 1 <exe>

Comment: Here you can also omit the -j 1 option as it is the default. This configuration corresponds to the wide-AVX case on the XE6 nodes.

  • Description: Mixed (Hybrid) MPI OpenMP application on two nodes with Hyperthreading.
 <number_of_nodes>:  2
 <nt>: 2
 <aprun_command>:  aprun -n 32 -N 16 -d $OMP_NUM_THREADS -j 2 <exe>
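The XC30 values above follow from the number of logical cores per node (16 with -j 1, 32 with -j 2). As a sketch, the hypothetical helper xc30_layout derives a fully packed aprun line from the node count, the number of threads per PE and the -j value. It only prints the command.

```shell
#!/bin/bash
# Sketch only: xc30_layout is a hypothetical helper.
# Arguments: <number_of_nodes> <threads per PE> [<-j value>]
xc30_layout() {
    local nodes="$1" threads="$2" j="${3:-1}"
    local cpus=$(( 16 * j ))        # 16 cores per node, 32 logical cores with -j 2
    local N=$(( cpus / threads ))   # PEs per node when the node is fully packed
    echo "aprun -n $(( nodes * N )) -N $N -d $threads -j $j <exe>"
}

xc30_layout 2 2 2   # prints "aprun -n 32 -N 16 -d 2 -j 2 <exe>"
xc30_layout 2 1 1   # prints "aprun -n 32 -N 16 -d 1 -j 1 <exe>"
```

The first call reproduces the hybrid example with <nt> set to 2.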

General remarks for both platforms

The aprun command allows you to start an application with more OpenMP threads than there are compute cores available. This oversubscription results in a substantial performance degradation. The same happens if the -d value is smaller than the number of OpenMP threads used by the application. Furthermore, for the Intel programming environment an additional helper thread per processing element is spawned, which can lead to oversubscription. Here, one can use the -cc numa_node or the -cc none option to aprun to avoid this oversubscription of hardware. The default behavior, i.e. if no -cc is specified, is the same as -cc cpu, which means that each processing element and thread is pinned to a processor. Please consult the aprun man page. Another popular aprun option is -ss, which restricts memory allocation to the same NUMA node to which the processing element or thread is constrained. One can use the xthi.c utility to check the affinity of threads and processing elements.
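As an illustration of how these options combine, the sketch below assembles an aprun line for the hybrid XE6 example and picks the binding option depending on the programming environment. The helper binding_opts and its argument values are hypothetical. Adding -ss is a choice made for this sketch, not a recommendation from the text above. The command is only printed.

```shell
#!/bin/bash
# Sketch only: binding_opts is a hypothetical helper.
# Argument: name of the programming environment (e.g. intel, cray, gnu)
binding_opts() {
    case "$1" in
        intel) echo "-cc numa_node" ;;   # leaves room for the Intel helper thread
        *)     echo "-cc cpu" ;;         # same as the aprun default
    esac
}

nt=8   # OpenMP threads per PE
echo "aprun -n 8 -N 4 -d $nt $(binding_opts intel) -ss <exe>"
# prints "aprun -n 8 -N 4 -d 8 -cc numa_node -ss <exe>"
```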

Run job on other Account ID

There are Unix groups associated with the project account ID (ACID). To run a job on a non-default project budget, the groupname of this project has to be passed in the group_list:

 qsub -W group_list=<groupname> ...

To get your available groups:
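The command is not shown here. As an assumption, the standard id utility can be used to list the Unix groups of the current user. Which of these groups correspond to project accounts is site-specific.

```shell
#!/bin/bash
# Assumption: the standard "id" utility is used to list group memberships.
# -G lists all group IDs of the current user, -n prints names instead of numbers.
my_groups=$(id -Gn)
echo "$my_groups"
```

One of the printed group names can then be used as <groupname> in the qsub call above.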

Warning: It is no longer possible to use your primary group as groupname!

Shared node SMP with very large memory

One external node has 1TB of memory and 64 cores (Intel Xeon X7550 CPUs). This node is not connected to the GEMINI interconnect network of the CRAY mpp compute nodes, but it has the same workspace and home filesystems mounted. The node is shared by several users and jobs at the same time! New since the batch system upgrade: users need to request the number of cores and the maximum total memory needed by all processes they want to run. Otherwise the job gets very small defaults. The requested memory is enforced, which means the job will be killed if it allocates more memory than requested.

To submit a job to this node you need the following qsub options:

 qsub -q smp -l nodes=1:smp:ppn=2,vmem=100gb,walltime=3600 ./my_batchjob_script.pbs

This allocates 2 cores (ppn=2) of the SMP node for 3600 seconds and limits the total memory used by all processes to 100GByte.

Note: All your processes will be grouped on 2 cores on this node (see the Linux cgroups mechanism).
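The same request can also be written into the job script via #PBS lines. The following sketch does that and then sizes the thread count to the cores actually available to the job. It assumes that the core restriction is visible through the CPU affinity of the job, which is what nproc reports. Whether that holds on this node is not confirmed here.

```shell
#!/bin/bash
#PBS -q smp
#PBS -l nodes=1:smp:ppn=2,vmem=100gb,walltime=3600

# nproc prints the number of cores available to the current process.
# Assumption: inside the job this reflects the cores granted via ppn.
ncpus=$(nproc)
echo "usable cores: $ncpus"

# Avoid starting more threads than cores granted to the job
export OMP_NUM_THREADS=$ncpus
```

Because the #PBS lines already carry the resource request, this script would be submitted with a plain qsub call without -q and -l on the command line.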