CRAY XC30 Using the Batch System SLURM

The only way to start a parallel job on the compute nodes of this system is to use the batch system. The installed batch system is based on

  • the resource management system SLURM (Simple Linux Utility for Resource Management)

Additionally, you should know that on the CRAY XC30 user applications are always launched on the compute nodes using the application launcher, aprun, which submits applications to the Application Level Placement Scheduler (ALPS) for placement and execution.
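For example, inside a batch job a parallel application is not started directly but via aprun (a minimal sketch; the executable name my_app and the core count 32 are placeholders for your own application and resources):

 aprun -n 32 ./my_app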

Detailed information about how to use this system, along with many examples, can be found in the Cray Programming Environment User's Guide and in Workload Management and Application Placement for the Cray Linux Environment.



Writing a submission script is typically the most convenient way to submit your job to the batch system. You generally interact with the batch system in two ways: through options specified in job submission scripts (detailed in the examples below) and by using SLURM commands on the login nodes. The main commands used to interact with SLURM are the following (a short usage example is given after the list):

  • sbatch is used to submit a job script for later execution.
  • scancel is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.
  • sinfo reports the state of partitions and nodes managed by SLURM. It has a wide variety of filtering, sorting, and formatting options.
  • squeue reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.
  • srun is used to submit a job for execution or initiate job steps in real time.
  • salloc is used to allocate resources for a job in real time.
  • sacct is used to report job or job step accounting information about active or completed jobs.
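As an illustration, a typical session on a login node might look like the following sketch (the script name myjob.sh and the job ID 123456 are placeholders):

 sbatch myjob.sh    # submit the job script; SLURM prints the assigned job ID
 squeue -u $USER    # list your own pending and running jobs
 scancel 123456     # cancel the job with ID 123456
 sacct -j 123456    # report accounting information for that job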

Check the slurm man page for more advanced commands and options

 man slurm

or read the SLURM Documentation (http://slurm.schedmd.com/documentation.html).


Resource Allocation with SLURM

Defining resources for a batch job

  • The number of required nodes and cores is specified by parameters in the job script header
 #SBATCH --job-name=MYJOB
 #SBATCH --nodes=1
 #SBATCH --time=00:10:00
  • The job is submitted with the sbatch command, as shown in the complete example below.
  • At the end of execution, the output and error files are returned to the submission directory.
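Putting these pieces together, a complete minimal job script might look as follows (a sketch: my_app and the core count passed to aprun -n are placeholders that must be adapted to your application and node type):

 #!/bin/bash
 #SBATCH --job-name=MYJOB
 #SBATCH --nodes=1
 #SBATCH --time=00:10:00

 # launch the application on the allocated compute nodes via aprun/ALPS
 aprun -n 16 ./my_app

The script is then submitted from a login node with

 sbatch myjob.sh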