- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -
CRAY XC40 Using the Batch System
Introduction
The only way to start a parallel job on the compute nodes of this system is to use the batch system. The installed batch system is based on
- the resource management system torque and
- the scheduler moab
In addition, note that on the CRAY XE6/XC30 user applications are always launched on the compute nodes with the application launcher aprun, which submits them to the Application Level Placement Scheduler (ALPS) for placement and execution.
For the CRAY XE6, detailed information on how to use this system, along with many examples, can be found in the Cray Application Developer's Environment User's Guide and in Workload Management and Application Placement for the Cray Linux Environment.
For the CRAY XC30, detailed information on how to use this system, along with many examples, can be found in the Cray Programming Environment User's Guide and in Workload Management and Application Placement for the Cray Linux Environment.
- ALPS is always used to schedule a job on the compute nodes, regardless of the programming model you use. A few general definitions are therefore needed:
- PE: Processing Element, basically a Unix ‘process’; it can be an MPI task, a CAF image, a UPC thread, ...
- numa_node: the cores and memory on a node with ‘flat’ memory access, basically one of the 2 dies on the Intel node together with its directly attached memory.
- Thread: a thread is contained inside a process. Multiple threads can exist within the same process and share resources such as memory, while different PEs do not share these resources. Most likely you will use OpenMP threads.
- aprun is the ALPS (Application Level Placement Scheduler) application launcher
- It must be used to run applications on the XE/XC compute nodes, both interactively and in a batch job
- If aprun is not used, the application is launched on the MOM node (and will most likely fail)
- The aprun man page contains several useful examples. There are at least 3 important parameters to control:
- The total number of PEs: -n
- The number of PEs per node: -N
- The number of OpenMP threads: -d (the 'stride' between 2 PEs in a node)
- See also the section "Understanding aprun" on this page
- qsub is the torque submission command for batch job scripts.
Writing a submission script is typically the most convenient way to submit your job to the batch system.
You generally interact with the batch system in two ways: through options specified in job submission scripts (these are detailed below in the examples) and by using torque or moab commands on the login nodes. There are three key commands used to interact with torque:
- qsub
- qstat
- qdel
Check the torque man page for more advanced commands and options:
man pbs
Requesting Resources using batch system TORQUE and ALPS
Batch Mode
Production jobs are typically run in batch mode. Batch scripts are shell scripts containing flags and commands to be interpreted by a shell and are used to run a set of commands in sequence.
- The required number of nodes and cores, the wall time and more are set by parameters in the job script header. These lines start with "#PBS" and must come before any executable command in the script.
#!/bin/bash
#PBS -N job_name
#PBS -l nodes=2:ppn=32
#PBS -l walltime=00:20:00

# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR

# Launch the parallel job to the allocated compute nodes
aprun -n 64 -N 32 ./my_mpi_executable arg1 arg2 > my_output_file 2>&1
- The job is submitted with the qsub command (all #PBS header parameters can also be passed directly as qsub command-line options).
qsub my_batchjob_script.pbs
- Overriding the script options on the qsub command line:
qsub -N other_name -l nodes=2:ppn=32,walltime=00:20:00 my_batchjob_script.pbs
- The batch script is not necessarily granted resources immediately. It may sit in the queue of pending jobs for some time before the required resources become available.
- At the end of the execution, the output and error files are returned to the submission directory.
- This example runs your executable "my_mpi_executable" in parallel with 64 MPI processes. Torque allocates 2 nodes to your job for a maximum of 20 minutes and places 32 processes on each node (one per core). The batch system allocates nodes exclusively to one job. Once the walltime limit is exceeded, the batch system terminates your job.
- The job launcher for XE6/XC30 parallel jobs (both MPI and OpenMP) is aprun. It must be started from a subdirectory of /mnt/lustre_server (your workspace), and the nodes must already have been allocated by the batch system (qsub). The aprun call above starts the parallel executable "my_mpi_executable" with the arguments "arg1" and "arg2", using 64 MPI processes with 32 processes placed on each allocated node (remember that a node consists of 32 cores in the XE6 system and only 16 cores in the XC30 system).
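The relation between the total number of PEs (-n), the PEs per node (-N) and the number of nodes to request can be checked with a little shell arithmetic. The following is only a sketch: the helper name nodes_needed is hypothetical and not part of torque or ALPS, and it assumes one PE per core.

```shell
# nodes_needed TOTAL_PES PES_PER_NODE
# Hypothetical helper: prints ceil(TOTAL_PES / PES_PER_NODE),
# i.e. the number of nodes to request with "#PBS -l nodes=...".
nodes_needed() {
    total_pes=$1
    pes_per_node=$2
    echo $(( (total_pes + pes_per_node - 1) / pes_per_node ))
}

# Values taken from the example job script: 64 PEs, 32 PEs per node.
n=64
N=32
nodes=$(nodes_needed "$n" "$N")

# Print the matching resource request and launch line (nothing is submitted).
echo "#PBS -l nodes=${nodes}:ppn=${N}"
echo "aprun -n ${n} -N ${N} ./my_mpi_executable"
```

For 64 PEs at 32 per node this yields 2 nodes, matching the job script header above. A PE count that is not a multiple of the PEs per node is rounded up to the next full node.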
To query further options of aprun, please use
man aprun
aprun -h
Interactive batch Mode
Interactive mode is typically used for debugging or optimizing code but not for running production code. To begin an interactive session, use the "qsub -I" command:
qsub -I -l nodes=2:ppn=32,walltime=00:30:00
If the requested resources are available and free (in the example above: 2 nodes/32 cores, 30 minutes), you will get a new session on the MOM node for your requested resources. You then have to use the aprun command to launch your application on the allocated compute nodes. When you are finished, enter logout to exit the batch system and return to the normal command line.
Notes
- Remember, you use aprun within the context of a batch session and the maximum size of the job is determined by the resources you requested when you launched the batch session. You cannot use the aprun command to use more resources than you reserved using the qsub command. Once a batch session begins, you can only use fewer resources than initially requested.
- While your job is running (in Batch Mode), STDOUT and STDERR are written to a file or files in a system directory and the output is copied to your submission directory only after the job completes. Specifying the "qsub -j oe" option here and redirecting the output to a file (see examples above) makes it possible for you to view STDOUT and STDERR while the job is running.
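The first note can be turned into a quick sanity check before calling aprun. The sketch below is an illustration only: fits_allocation is a hypothetical helper, and it assumes that the ppn value of the qsub request equals the number of CPUs usable per node.

```shell
# fits_allocation NODES PPN TOTAL_PES PES_PER_NODE DEPTH
# Hypothetical helper: succeeds (exit status 0) if an aprun request
# "-n TOTAL_PES -N PES_PER_NODE -d DEPTH" fits into "nodes=NODES:ppn=PPN".
fits_allocation() {
    nodes=$1; ppn=$2; n=$3; N=$4; d=$5
    # Each node must hold N PEs with d CPUs each.
    [ $(( N * d )) -le "$ppn" ] || return 1
    # All PEs must fit on the allocated nodes.
    [ "$n" -le $(( nodes * N )) ] || return 1
    return 0
}

# Allocation from the interactive example: nodes=2:ppn=32.
if fits_allocation 2 32 64 32 1; then
    echo "aprun -n 64 -N 32 fits into the allocation"
fi
if ! fits_allocation 2 32 96 32 1; then
    echo "aprun -n 96 -N 32 needs more nodes than were requested"
fi
```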
Run job on other Account ID
Each project account ID (ACID) has an associated Unix group. To run a job on a non-default project budget, the group name of this project has to be passed in the group_list:
qsub -W group_list=<groupname> ...
To get your available groups:
id
Usage of a Reservation
For nodes which are reserved for special groups or users, you need to specify an additional option for this reservation:
- E.g. a reservation named john.1 is used with the following command:
qsub -W x=FLAGS:ADVRES:john.1 ...
Deleting a Batch Job
qdel <jobID>
canceljob <jobID>
These commands enable you to remove jobs from the job queue. If the job is running, qdel will abort it. You can obtain the job ID from the output of the "qstat" command, or from the output that qsub printed when you submitted the job.
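Since qsub prints the job ID on submission, a script can store it for a later qdel. The sketch below assumes the usual torque form of the ID, a number followed by a server suffix. The value 12345.sdb and the helper job_number are hypothetical and only there so the sketch runs without a batch system.

```shell
# job_number JOBID
# Hypothetical helper: strips the server suffix from a job ID
# such as "12345.sdb" and prints the numeric part.
job_number() {
    echo "${1%%.*}"
}

# On a login node the ID would come from qsub itself, for example:
#   jobid=$(qsub my_batchjob_script.pbs)
#   qdel "$jobid"
jobid="12345.sdb"   # hypothetical ID, stands in for the qsub output

echo "full job ID: $jobid"
echo "job number:  $(job_number "$jobid")"
```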
Status Information
- Status of jobs:
qstat
qstat -a
showq
- Status of queues:
qstat -q
qstat -Q
- Status of job scheduling:
checkjob <jobID>
showstart <jobID>
- Status of backfill. This can help you to build small jobs that can be backfilled immediately while you are waiting for the resources to become available for your larger jobs
showbf
- Status of Nodes/System (see also Gathering Application Status and Information on the Cray System)
xtnodestat
apstat
Note: for further details type on the login node:
man qstat
man apstat
man xtnodestat
showbf -h
showq -h
checkjob -h
showstart -h
Limitations
- See the Batch System Layout and Limits for CRAY XE6
- See the Batch System Layout and Limits for CRAY XC30
Understanding aprun
Core specialization
System 'noise' on compute nodes may significantly degrade scalability for some applications. Core specialization can mitigate this problem:
- 1 core per node will be dedicated for system work (service core)
- As many system interrupts as possible will be forced to execute on the service core
- The application will not run on the service core
To get core specialization use aprun -r:
aprun -r1 -n 100 a.out
The highest-numbered cores are used as service cores, starting with core 31 on current nodes (independent of the aprun -j setting).
The apcount tool is provided to compute the total number of cores required:
man apcount
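The effect of core specialization on the node count can be illustrated with plain shell arithmetic. This is only a sketch and no replacement for apcount: the helper name is hypothetical, and it assumes 32 cores per node and one PE per core.

```shell
# nodes_with_core_spec TOTAL_PES CORES_PER_NODE SERVICE_CORES
# Hypothetical helper: nodes needed when SERVICE_CORES cores per node
# are reserved for system work (aprun -r) and unavailable to the PEs.
nodes_with_core_spec() {
    usable=$(( $2 - $3 ))
    echo $(( ($1 + usable - 1) / usable ))
}

# 64 PEs on 32-core nodes, without and with one service core per node.
echo "without -r:  $(nodes_with_core_spec 64 32 0) nodes"
echo "with -r1:    $(nodes_with_core_spec 64 32 1) nodes"
```

With one service core only 31 cores per node remain for the application, so 64 PEs no longer fit on 2 nodes and a third node has to be requested.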
Hyperthreading (XC30 system only)
Cray XC30 compute nodes are always booted with hyperthreading ON. Users can choose to run one or two PEs or threads per core. The default is one. You make the choice at run time:
aprun -n### -j1 … -> Single Stream mode, one rank per core
aprun -n### -j2 … -> Dual Stream mode, two ranks per core
In single stream mode the cores are numbered 0-7 on die 0 and 8-15 on die 1. In dual stream mode the numbering of the first 16 cores stays the same, and cores 16-23 are on die 0 and 24-31 on die 1. Note that the core numbering on each die is therefore not contiguous in hyperthreading mode:
Mode | cores on die 0 | cores on die 1
---|---|---
Single stream (-j1) | 0-7 | 8-15
Dual stream (-j2) | 0-7, 16-23 | 8-15, 24-31
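The numbering described above follows a simple pattern: blocks of eight logical CPUs alternate between the two dies. A small sketch makes this explicit. The helper die_of_cpu is hypothetical and assumes the XC30 layout given here (2 dies with 8 cores each, logical CPUs 0-31).

```shell
# die_of_cpu CPU
# Hypothetical helper: prints the die (0 or 1) that a logical CPU
# number belongs to, for the XC30 numbering described above.
die_of_cpu() {
    echo $(( ($1 / 8) % 2 ))
}

# First and last CPU of each block of eight.
for cpu in 0 7 8 15 16 23 24 31; do
    echo "CPU $cpu -> die $(die_of_cpu "$cpu")"
done
```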
aprun CPU Affinity control
CLE can dynamically distribute work by allowing PEs and threads to migrate from one CPU to another within a node. In some cases, moving PEs or threads from CPU to CPU increases cache and translation lookaside buffer (TLB) misses and therefore reduces performance. The CPU affinity options let you bind a PE or thread to a particular CPU or to a subset of CPUs on a node.
- aprun CPU affinity options (see also man aprun)
- Default setting: -cc cpu (PEs are bound to a specific core, depending on the -d setting)
- Binding PEs to a specific numa node: -cc numa_node (PEs are not bound to a specific core but cannot ‘leave’ their numa_node)
- No binding: -cc none
- Custom binding: -cc 0,4,3,2,1,16,18,31,9,....
aprun Memory Affinity control
Cray XC30 systems use dual-socket compute nodes with 2 dies. For 16-CPU Cray XC30 compute node processors, NUMA nodes 0 and 1 have eight CPUs each (logical CPUs 0-7 and 8-15 respectively). If your application uses Intel Hyperthreading Technology, it is possible to use up to 32 processing elements (logical CPUs 16-23 are on NUMA node 0 and CPUs 24-31 are on NUMA node 1). Even if your PEs and threads are bound to a specific numa_node, the memory they use does not have to be ‘local’.
- aprun memory affinity options (see also man aprun)
- Suggested setting is -ss (a PE can only allocate memory local to its assigned NUMA node. If this is not possible, your application will crash.)
Some basic aprun examples
Assuming an XC30 with Sandy Bridge nodes (32 cores per node with Hyperthreading)
Pure MPI application, using all the available cores in a node
aprun -n 32 -j2 ./a.out
Pure MPI application, using only 1 core per node
32 MPI tasks on 32 nodes, with 32*32 cores allocated. This can be done to increase the memory available to each MPI task.
aprun -N 1 -n 32 -d 32 -j2 ./a.out
Hybrid MPI/OpenMP application, 4 MPI ranks per node
32 MPI tasks with 8 OpenMP threads each. OMP_NUM_THREADS needs to be set.
export OMP_NUM_THREADS=8
aprun -n 32 -N 4 -d $OMP_NUM_THREADS -j2 ./a.out
MPI and OpenMP with Intel PE
Intel RTE creates one extra thread when spawning the worker threads. This makes the pinning for aprun more difficult.
Suggestions:
- Running when "depth" divides evenly into the number of "cpus" on a socket
export OMP_NUM_THREADS="<=depth"
aprun -n npes -d "depth" -cc numa_node a.out
- Running when "depth" does not divide evenly into the number of "cpus" on a socket
export OMP_NUM_THREADS="<=depth"
aprun -n npes -d "depth" -cc none a.out
Multiple Program Multiple Data (MPMD)
aprun supports MPMD – Multiple Program Multiple Data.
- Launching several executables which all are part of the same MPI_COMM_WORLD
aprun -n 128 exe1 : -n 64 exe2 : -n 64 exe3
- Notice: each executable needs a dedicated node; exe1 and exe2 cannot share a node.
Example: the following command needs 3 nodes
aprun -n 1 exe1 : -n 1 exe2 : -n 1 exe3
- Use a script to start several serial jobs on a node:
aprun -a xt -n 1 -d 32 -cc none script.sh
> cat script.sh
./exe1 &
./exe2 &
./exe3 &
wait
cpu_lists for each PE
CLE was updated to give threads and processing elements more flexibility in placement. This is useful on processor architectures whose cores share resources and may therefore have to wait for them. Separating cpu_lists by colons (:) allows the user to specify the cores used by each processing element and its child processes or threads, which gives finer control over placement per processing element.
Here is an example with 4 PEs and 3 threads each:
aprun -n 4 -N 4 -cc 1,3,5:7,9,11:13,15,17:19,21,23
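Writing such lists by hand is error-prone for larger PE counts, so they can be generated. The following is a sketch only: make_cpu_lists is a hypothetical helper, not an aprun feature, and ./a.out is a placeholder executable. It assigns consecutive CPUs with a fixed stride and starts a new colon-separated list for each PE.

```shell
# make_cpu_lists PES THREADS START STRIDE
# Hypothetical helper: prints one comma-separated cpu_list per PE,
# joined by colons, in the form expected by "aprun -cc".
make_cpu_lists() {
    pes=$1; threads=$2; cpu=$3; stride=$4
    out=""
    pe=0
    while [ "$pe" -lt "$pes" ]; do
        list=""
        t=0
        while [ "$t" -lt "$threads" ]; do
            list="${list:+$list,}$cpu"      # append CPU, comma-separated
            cpu=$(( cpu + stride ))
            t=$(( t + 1 ))
        done
        out="${out:+$out:}$list"            # append list, colon-separated
        pe=$(( pe + 1 ))
    done
    echo "$out"
}

# 4 PEs with 3 threads each, starting at CPU 1 and using every second CPU.
# The command is only printed, not executed.
echo "aprun -n 4 -N 4 -cc $(make_cpu_lists 4 3 1 2) ./a.out"
```

Called with 4 PEs, 3 threads, start 1 and stride 2, the helper reproduces the list from the example above.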