- Infos im HLRS Wiki sind nicht rechtsverbindlich und ohne Gewähr -
- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

CRAY XC30 Using the Batch System SLURM

From HLRS Platforms
Jump to navigationJump to search

The only way to start a parallel job on the compute nodes of this system is to use the batch system. The installed batch system is based on

  • the resource management system SLURM (Simple Linux Utility for Resource Management)

SLURM is designed to operate as a job scheduler over Cray's Application Level Placement Scheduler (ALPS). Use SLURM's sbatch or salloc commands to create a resource allocation in ALPS. Then use ALPS' aprun command to launch parallel jobs within the resource allocation. Additional you have to know on CRAY XC30 the user applications are always launched on the compute nodes using the application launcher, aprun, which submits applications to the Application Level Placement Scheduler (ALPS) for placement and execution!

Detailed information about how to use this system and many examples can be found in Cray Programming Environment User's Guide and Workload Management and Application Placement for the Cray Linux Environment.



Writing a submission script is typically the most convenient way to submit your job to the batch system. You generally interact with the batch system in two ways: through options specified in job submission scripts (these are detailed below in the examples) and by using slurm commands on the login nodes. There are some main commands used to interact with slurm:

  • sbatch is used to submit a job script for later execution.
  • scancel is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.
  • sinfo reports the state of partitions and nodes managed by SLURM. It has a wide variety of filtering, sorting, and formatting options.
  • squeue reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.
  • srun is used to submit a job for execution or initiate job steps in real time.
  • salloc is used to allocate resources for a job in real time.
  • sacct is used to report job or job step accounting information about active or completed jobs.

Check the man page of slurm for more advanced commands and options

 man slurm

or read the SLURM Documentation




Requesting Resources with SLURM

Requesting the required resources for a batch job

  • The number of required nodes and cores can be determined by the parameters in the job script header with "#SBATCH" before any executable commands in the script.
 #SBATCH --job-name=MYJOB
 #SBATCH --nodes=1
 #SBATCH --time=00:10:00
  • The job is submitted by the sbatch command (all script head parameters #SBATCH can also be submitted directly by sbatch command options).
  • The batch script is not necessarily granted resources immediately, it may sit in the queue of pending jobs for some time before its required resources become available.
  • At the end of the execution output and error files are returned to submission directory

Other SLURM options

  • File names of stdout and stderr
 #SBATCH --output=std.out
 #SBATCH --error=std.err
  • Start N PEs (tasks)
 #SBATCH --ntasks=N
  • Start M tasks per node. Total Nodes used are N/M (+1 if mod(N,M)!=0). Max of M=32
 #SBATCH --ntasks-per-node=M
  • Number of threads per tasks. Most used together with OMP_NUM_THREADS
 #SBATCH --cpus-per-task=8

SLURM and aprun

The following SLURM options are translated to these aprun options. SLURM options not listed below have no equivalent aprun translation and while the option can be used for the allocation within Slurm it will not be propagated to aprun.

SLURM options aprun optioins
--ntasks=$PE
--cpu-per-task=$threads
--ntasks-per-node=$N
--ntasks-per-socket=$S
-n $PE
-d $threads
-N $N
-S $S
Number of PE to start
#threads/PE
#(PEs per node)
#(PEs per numa_node)
  • The following aprun options have no equivalent in srun and must be specified by using the SLURM --launcher-opts

options: -a, -b, -B, -cc, -f, -r, and -sl.

  • -B will provide aprun with the SLURM settings for -n, -N , -d and -m
 aprun -B ./a.out


Core specialization

System 'noise' on compute nodes may significantly degrade scalability for some applications. The Core Specialization can mitigate this problem.

  • 1 core per node will be dedicated for system work (service core)
  • As many system interrupts as possible will be forced to execute on the service core
  • The application will not run on the service core

To get core specialization use aprun -r

 aprun -r1 -n 100 a.out

highest numbered cores will be used, starting with 31 on current nodes. (independent on aprun -j setting)

apcount provided to compute total number of cores required

 man apcount





Running a batch application with SLURM

Submitting batch jobs

The number of required nodes, required runtime (wall clock time), job name, threads,.... can be specified in the job header or using sbatch options on command line. Here is a very simple Hybrid MPI + OpenMP job example how to do this (assuming the batch job script is named "myjob.script":

#!/bin/bash
#SBATCH --jobname=hybrid
#SBATCH --ntasks=64
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4

export OMP_NUM_THREADS=4
aprun –n64 –d4 –N4 –j1 a.out

The job is submitted by the sbatch command:

 sbatch --time=00:20:00 myjob.script

This job will allocate 64 PEs for 20 minutes in total. Each PE will run 4 OMP threads. Only 4 PEs are running on each node. On our system hornet, the nodes are allocated exclusively (on each node only 1 batch job). So, the job above allocates 16 nodes.


Monitoring batch jobs

  • squeue
    • shows batch jobs
  • sinfo
    • view information about SLURM nodes and partitions.
  • xtnodestat
    • Shows XC nodes allocation and aprun processes
    • All allocations are shown: both job running with aprun and batch dedicated nodes
  • apstat
    • Shows aprun processes status
    • apstat overview
    • apstat -a[ apid] info about all the applications or a specific one
    • apstat -n info about the status of the nodes

Deleting batch jobs

  • scancel





Examples

Starting 512 MPI tasks (PEs)

#!/bin/bash

#SBATCH --job-name=MPIjob
#SBATCH --ntasks=512
#SBATCH --ntasks-per-node=32
#SBATCH --time=01:00:00

export MPICH_ENV_DISPLAY=1
export MPICH_VERSION_DISPLAY=1 
export MPICH_RANK_REORDER_DISPLAY=1
export MALLOC_MMAP_MAX_=0 
export MALLOC_TRIM_THRESHOLD_=-1

aprun  -n 512 –cc cpu –ss –j2 ./a.out

Starting an OpenMP program, using a single node

#!/bin/bash

#SBATCH --job-name=OpenMP
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --time=01:00:00

export MPICH_ENV_DISPLAY=1
export MPICH_VERSION_DISPLAY=1 
export MPICH_RANK_REORDER_DISPLAY=1
export MALLOC_MMAP_MAX_=0
export MALLOC_TRIM_THRESHOLD_=536870912 # large value
export OMP_NUM_THREADS=32

aprun –n1 –d $OMP_NUM_THREADS –cc cpu –ss –j2 ./a.out

Starting a hybrid job single node, 4 MPI tasks, each with 8 threads

#!/bin/bash

#SBATCH --job-name=hybrid
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=4 
#SBATCH --cpus-per-task=8 
#SBATCH --time=01:00:00

export MPICH_ENV_DISPLAY=1
export MPICH_VERSION_DISPLAY=1 
export MPICH_RANK_REORDER_DISPLAY=1
export MALLOC_MMAP_MAX_=0 
export MALLOC_TRIM_THRESHOLD_=-1
export OMP_NUM_THREADS=8

aprun –n4 –N4 –d $OMP_NUM_THREADS –cc cpu –ss –j2 ./a.out

Starting a hybrid job single node, 8 MPI tasks, each with 2 threads

#!/bin/bash

#SBATCH --job-name=hybrid
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=8 
#SBATCH --cpus-per-task=2 
#SBATCH --time=01:00:00

export MPICH_ENV_DISPLAY=1
export MPICH_VERSION_DISPLAY=1 
export MPICH_RANK_REORDER_DISPLAY=1
export MALLOC_MMAP_MAX_=0 
export MALLOC_TRIM_THRESHOLD_=-1
export OMP_NUM_THREADS=2

aprun –n8 –N8 –d $OMP_NUM_THREADS –cc cpu –ss –j1 ./a.out

Starting an MPI job on two nodes using only every second integer core

#!/bin/bash

#SBATCH --job-name=mpi
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=8 
#SBATCH --cpus-per-task=2 
#SBATCH --time=01:00:00

export MPICH_ENV_DISPLAY=1
export MPICH_VERSION_DISPLAY=1 
export MPICH_RANK_REORDER_DISPLAY=1

aprun –n32 –N8 –d 2 –cc cpu –ss –j1 ./a.out

Starting a hybrid job on two nodes using only every second integer core

#!/bin/bash

#SBATCH --job-name=hybrid
#SBATCH --ntasks=32
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=2
#SBATCH --time=01:00:00

export MPICH_ENV_DISPLAY=1
export MPICH_VERSION_DISPLAY=1 
export MPICH_RANK_REORDER_DISPLAY=1
export OMP_NUM_THREADS=2

aprun –j2 –n32 –N16 –d $OMP_NUM_THREADS –cc 0,2:4,6:8,10:12,14:16,18:20,22:24,26:28,30 –ss ./a.out


Starting a MPMD job using 1 master with 32 threads and 16 workers each with 8 threads

#!/bin/bash

#SBATCH --job-name=mpmd
#SBATCH --nodes=5 ! Note : 5 nodes * 32 cores = 160 cores 
#SBATCH --ntasks-per-node=32
#SBATCH --time=01:00:00

export MPICH_ENV_DISPLAY=1
export MPICH_VERSION_DISPLAY=1 
export MPICH_RANK_REORDER_DISPLAY=1
export MALLOC_MMAP_MAX_=0 
export MALLOC_TRIM_THRESHOLD_=-1
export OMP_NUM_THREADS=8

aprun env OMP_NUM_THREADS=16 –n1 –d16 –N1 –j1 ./master.exe : -n16 –N4 –d$OMP_NUM_THREADS –cc cpu –ss –j2 ./worker.exe