- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -
CRAY XE6 notes for the upgraded Batch System
The batch system on the CRAY XE6 (hermit) will be upgraded on Tuesday, 6th May 2014. Most functionality can be used identically to the old version, but some things will change, and users need to modify their batch job submission scripts on hermit after the batch system upgrade has been completed. Please take a look at the following points:
Jobs with mixed node features
See also the old version of the batch system for comparison (the old example will not work after the upgrade!).
Here is a new example batch job request with mixed node features for the new version of the batch system on hermit:
You need to specify the resource as nodes=<node count>:ppn=<process count per node>:mem64gb+<node count>:ppn=<process count per node>:mem32gb:
qsub -l nodes=1:ppn=32:mem64gb+64:ppn=32:mem32gb,walltime=3600 my_batchjob_script.pbs
The example above will allocate 65 nodes to your job for a maximum time of 3600 seconds and can place 32 processes on the one node with 64GB memory and 32 processes on each of the 64 allocated nodes with 32GB memory. The option ppn=32 is important in order to get all cores of the allocated MPP nodes.
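The allocation arithmetic above can be checked with a short sketch (the variable names are made up for illustration):

```shell
# Allocation requested by the qsub example above:
# 1 node with 64GB memory + 64 nodes with 32GB memory, 32 processes per node.
nodes64=1
nodes32=64
ppn=32
total_nodes=$(( nodes64 + nodes32 ))
total_pes=$(( total_nodes * ppn ))
echo "nodes=${total_nodes} PEs=${total_pes}"   # nodes=65 PEs=2080
```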
Now you need to select your different allocated node types for the aprun command in your script my_batchjob_script.pbs. A new example for the new batch system:
#!/bin/bash
#PBS -N mixed_job
#PBS -l nodes=1:ppn=16:mem64gb+2:ppn=32:mem32gb
#PBS -l walltime=300

### defining the number of PEs per node (max. 32 for hermit, max. 16 for hornet) ###
# p32: number of PEs (Processing Elements) on 32GB nodes
# p64: number of PEs (Processing Elements) on 64GB nodes
#-------------------------------------------------------
p32=32
p64=16

# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR

### selecting nodes with different memory ###
#---------------------------------------
# 1. getting all nodes of my job
nids=$(/opt/hlrs/system/tools/getjhostlist)

# 2. getting the nodes with feature mem32gb of my job
nid32=$(/opt/hlrs/system/tools/hostlistf mem32gb "$nids")
# how many nodes do I have with mem32gb:
i32=$(/opt/hlrs/system/tools/cntcommastr "$nid32")

# 3. getting the nodes with feature mem64gb of my job
nid64=$(/opt/hlrs/system/tools/hostlistf mem64gb "$nids")
# how many nodes do I have with mem64gb:
i64=$(/opt/hlrs/system/tools/cntcommastr "$nid64")

(( P32 = $i32 * $p32 ))
(( P64 = $i64 * $p64 ))
(( D32 = 32 / $p32 ))
(( D64 = 32 / $p64 ))

# Launch the parallel job on the allocated compute nodes using
# Multiple Program, Multiple Data (MPMD) mode (see "man aprun")
# -------------------------------------------------
# $nid64 : node list with 64GB memory
# $i64   : number of nodes with 64GB memory
# $p64   : number of PEs per node on nodes with 64GB
# $P64   : total number of PEs (processing elements) on nodes with 64GB
# ----
# $nid32 : node list with 32GB memory
# $i32   : number of nodes with 32GB memory
# $p32   : number of PEs per node on nodes with 32GB
# $P32   : total number of PEs on nodes with 32GB
# ----------
# The "env OMP_NUM_THREADS=..." parts of the aprun command below are only
# useful for OpenMP (hybrid) programs.
aprun -L $nid64 -n $P64 -N $p64 -d $D64 env OMP_NUM_THREADS=$D64 ./my_executable1 : \
      -L $nid32 -n $P32 -N $p32 -d $D32 env OMP_NUM_THREADS=$D32 ./my_executable2
By defining p64 and p32 in the example above you can control the number of processes per node for the two node types (64GB and 32GB memory). This corresponds to the qsub option "-l mppnppn=32" for single-node-type MPP jobs (see the examples in the previous chapters). Note that the maximum value is 32, the number of cores of each MPP node.
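As a worked example, plugging the values from the #PBS request of the script above (one 64GB node, two 32GB nodes) into its arithmetic gives:

```shell
# Values from the script above: p64/p32 are PEs per node,
# i64/i32 are the node counts from the #PBS request.
p32=32; p64=16
i32=2;  i64=1
P32=$(( i32 * p32 ))   # total PEs on the 32GB nodes
P64=$(( i64 * p64 ))   # total PEs on the 64GB nodes
D32=$(( 32 / p32 ))    # depth (-d) on the 32GB nodes
D64=$(( 32 / p64 ))    # depth (-d) on the 64GB nodes
echo "P32=${P32} P64=${P64} D32=${D32} D64=${D64}"   # P32=64 P64=16 D32=1 D64=2
```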
Deprecated CRAY qsub syntax using mppwidth, mppnppn, mppdepth, feature
The qsub arguments specific to the CRAY XE6 system (mppwidth, mppnppn, mppdepth, feature) are deprecated in the new batch system version. Most of their functionality is still available in the new version; nevertheless, we recommend not using these arguments any longer. Please use the following syntax instead:
qsub -l nodes=2:ppn=32:mem32gb <myjobscript>
(replaces: qsub -l mppwidth=64,mppnppn=32,feature=mem32gb)
- nodes: replaces mppwidth (the node count equals mppwidth divided by mppnppn)
- ppn: replaces mppnppn
- mem32gb: replaces feature=mem32gb
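The conversion from the deprecated syntax can be sketched as a small calculation (the resource values are those from the example above):

```shell
# Convert a deprecated request "-l mppwidth=64,mppnppn=32,feature=mem32gb"
# to the new syntax: the node count is mppwidth divided by mppnppn.
mppwidth=64
mppnppn=32
nodes=$(( mppwidth / mppnppn ))
echo "qsub -l nodes=${nodes}:ppn=${mppnppn}:mem32gb myjobscript"
```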
Parallel Jobs Examples
This batch script template serves as the basis for the aprun examples given below.
#! /bin/bash
#PBS -N <job_name>
#PBS -l nodes=<number_of_nodes>
#PBS -l walltime=00:01:00

# This is the directory where this script and the executable are located.
# You can choose any other directory on the Lustre file system.
cd $PBS_O_WORKDIR

export OMP_NUM_THREADS=<nt>

<aprun_command>
The keywords <job_name>, <number_of_nodes>, <nt>, and <aprun_command>
have to be replaced and the walltime adapted accordingly (one minute is given in the
template above). The OMP_NUM_THREADS environment variable is only important for
applications using OpenMP. Please note that OpenMP directives are recognized by default
by the Cray compiler and can be turned off with the -hnoomp option. For the Intel, GNU, and PGI
compilers one has to use the corresponding flag to enable OpenMP recognition.
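As a sketch, the OpenMP-related flags for the four compilers might look as follows (flag names as commonly documented for these compiler versions; verify against the man page of the loaded PrgEnv):

```shell
# All compilation on the Cray systems goes through the cc/ftn/CC wrappers;
# the flags below are assumptions from the compilers' documentation.
cc -hnoomp  my_prog.c   # Cray: OpenMP is on by default; -hnoomp disables it
cc -openmp  my_prog.c   # Intel: enable OpenMP (newer releases use -qopenmp)
cc -fopenmp my_prog.c   # GNU: enable OpenMP
cc -mp      my_prog.c   # PGI: enable OpenMP
```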
The following parameters for the template above should cover the vast majority of applications and are given for both the XE6 and XC30 platforms at HLRS. The <exe> keyword should be replaced by your application.
XE6 Platform
The nodes of the XE6 feature two Interlagos processors with 16 cores each, resulting in a total of 32 cores per node. Each Interlagos processor forms two NUMA domains of 8 cores, resulting in a total of four NUMA domains per node.
- Description: Serial application (no MPI or OpenMP)
  <number_of_nodes>: 1
  <aprun_command>: aprun -n 1 <exe>
- Description: Pure OpenMP application (no MPI)
  <number_of_nodes>: 1
  <nt>: 32
  <aprun_command>: aprun -n 1 -d $OMP_NUM_THREADS <exe>
Comment: You can vary the number of threads from 1 to 32.
- Description: Pure MPI application on two nodes fully packed (no OpenMP)
  <number_of_nodes>: 2
  <aprun_command>: aprun -n 64 -N 32 <exe>
Comment: The -n specifies the total number of processing elements (PEs) and -N the PEs per node. The -n value has to be less than or equal to 32 times <number_of_nodes>, and the -N value less than or equal to 32, the number of cores per node. Finally, the -n value divided by the -N value has to be less than or equal to <number_of_nodes>. You can increase the number of nodes as needed and vary the remaining parameters accordingly.
- Description: Pure MPI application on two nodes in wide-AVX mode (no OpenMP)
  <number_of_nodes>: 2
  <aprun_command>: aprun -n 32 -N 16 -d 2 <exe>
Comment: The -d 2 is used to place the PEs evenly among the cores of the node. This doubles the memory bandwidth and floating point resources available per PE.
- Description: Mixed (Hybrid) MPI OpenMP application on two nodes
  <number_of_nodes>: 2
  <nt>: 8
  <aprun_command>: aprun -n 8 -N 4 -d $OMP_NUM_THREADS <exe>
Comment: In addition to the constraints mentioned above, the -d value times the -N value has to be less than or equal to 32. This configuration runs one processing element per NUMA domain, and each PE spawns 8 threads.
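Filling the template above with the hybrid values gives, as a sketch (the job name and the executable name ./my_app are placeholders):

```shell
#!/bin/bash
#PBS -N hybrid_example
#PBS -l nodes=2
#PBS -l walltime=00:01:00

cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=8

# one PE per NUMA domain (4 per node on 2 nodes), 8 threads per PE
aprun -n 8 -N 4 -d $OMP_NUM_THREADS ./my_app
```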
Run job on another Account ID
There are Unix groups associated with each project account ID (ACID). To run a job on a non-default project budget, the group name of that project has to be passed in the group_list:
qsub -W group_list=<groupname> ...
To get your available groups:
id
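As a sketch, the group names in the output of id can be extracted and passed to qsub (the user and group names below are made up for illustration):

```shell
# Sample output of "id" (names are hypothetical); the group names in
# parentheses are what -W group_list=... expects.
sample='uid=12345(myuser) gid=1001(abc10000) groups=1001(abc10000),1002(xyz20000)'

# strip everything up to "groups=", split on commas, keep the name in parentheses
groups=$(echo "$sample" | sed 's/.*groups=//' | tr ',' '\n' | sed 's/.*(\(.*\))/\1/')
echo "$groups"
# then e.g.: qsub -W group_list=xyz20000 my_batchjob_script.pbs
```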