- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

CRAY XE6 Using the Batch System

Warning: Please read the CRAY XE6 notes for the upgraded Batch System first!

The only way to start a parallel job on the compute nodes of this system is to use the batch system. The installed batch system is based on

  • the resource management system torque and
  • the scheduler moab

Additionally, you have to know that on the CRAY XE6 user applications are always launched on the compute nodes using the application launcher, aprun, which submits applications to the Application Level Placement Scheduler (ALPS) for placement and execution.

Detailed information about how to use this system, together with many examples, can be found in the Cray Application Developer's Environment User's Guide (http://docs.cray.com/cgi-bin/craydoc.cgi?mode=View;id=S-2396-601;right=/books/S-2396-601/html-S-2396-601//chapter-djg9hyw1-brbethke.html) and in Workload Management and Application Placement for the Cray Linux Environment (http://docs.cray.com/cgi-bin/craydoc.cgi?mode=View;id=S-2496-4001;right=/books/S-2496-4001/html-S-2496-4001//chapter-3vnhd83p-oswald-runningbatchjobs.html).



Running Jobs

Writing a submission script is typically the most convenient way to submit your job to the batch system. You generally interact with the batch system in two ways: through options specified in job submission scripts (these are detailed below in the examples) and by using torque or moab commands on the login nodes. There are three key commands used to interact with torque:

  • qsub
  • qstat
  • qdel

Check the man page of torque for more advanced commands and options:

 man pbs 

Submitting a Job / allocating resources

Batch Mode

Production jobs are typically run in batch mode. Batch scripts are shell scripts containing flags and commands to be interpreted by a shell and are used to run a set of commands in sequence. To submit a job, type

 qsub my_batchjob_script.pbs

This will submit your job script "my_batchjob_script.pbs" to the job-queues.

A simple MPI job submission script for the XE6 would look like:

#!/bin/bash
#PBS -N job_name
#PBS -l mppwidth=64
#PBS -l mppnppn=32
#PBS -l walltime=00:20:00             
  
# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR

# Launch the parallel job to the allocated compute nodes
aprun -n 64 -N 32 ./my_mpi_executable arg1 arg2 > my_output_file 2>&1

This will run your executable "my_mpi_executable" in parallel with 64 MPI processes. Torque will allocate 2 nodes to your job for a maximum time of 20 minutes and place 32 processes on each node (one per core). The batch system allocates nodes exclusively to a single job. Once the walltime limit is exceeded, the batch system will terminate your job.

Important: You have to change into a subdirectory of /mnt/lustre_server (your workspace) before calling aprun.

All torque options start with a "#PBS"-string.

You can override these options on the qsub command line:

 qsub -N other_name -A myother_account -l mppwidth=64,mppnppn=32,walltime=01:00:00 my_batchjob_script.pbs

To have the same environment settings (exported environment) of your current session available in your batch job, the qsub command needs the option -V:

 qsub -V my_batchjob_script.pbs

The individual options are explained in:

 man qsub

The job launcher for XE6 parallel jobs (both MPI and OpenMP) is aprun. It needs to be started from a subdirectory of /mnt/lustre_server (your workspace). The aprun example above will start the parallel executable "my_mpi_executable" with the arguments "arg1" and "arg2". The job will be started using 64 MPI processes, with 32 processes placed on each of your allocated nodes (remember that a node consists of 32 cores in the XE6 system). You need to have nodes allocated by the batch system (qsub) beforehand. To query further options, please use

 man aprun 
 aprun -h

An example OpenMP job submission script for the XE6 nodes is shown below.

#!/bin/bash
#PBS -N job_name
# Request the number of cores that you need in total
#PBS -l mppwidth=16
#PBS -l mppnppn=16
# Request the time you need for computation
#PBS -l walltime=00:03:00

# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR

# Set the number of OpenMP threads per node
export OMP_NUM_THREADS=16

# Launch the OpenMP job to the allocated compute node
aprun -n 1 -N 1 -d $OMP_NUM_THREADS ./my_openmp_executable.x arg1 arg2 > my_output_file 2>&1

This will run your executable "my_openmp_executable" using 16 threads on one node. We set the environment variable OMP_NUM_THREADS to 16.
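For hybrid MPI/OpenMP programs the two launch modes above can be combined: -N sets the number of MPI processes per node and -d reserves the cores used by the OpenMP threads of each process. The following is only a minimal sketch, assuming 32-core nodes as in the MPI example; the executable name and the chosen process/thread split are placeholders, and on some installations the depth may additionally have to be requested at submission time (e.g. via -l mppdepth).

#!/bin/bash
#PBS -N hybrid_job_name
# Request 2 nodes (64 cores in total, 32 cores per node), as in the MPI example above
#PBS -l mppwidth=64
#PBS -l mppnppn=32
#PBS -l walltime=00:20:00

# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR

# 8 MPI processes per node, each using 4 OpenMP threads (8 x 4 = 32 cores per node)
export OMP_NUM_THREADS=4

# -n: total number of PEs, -N: PEs per node, -d: cores reserved per PE (one per OpenMP thread)
aprun -n 16 -N 8 -d $OMP_NUM_THREADS ./my_hybrid_executable arg1 arg2 > my_output_file 2>&1

This starts 16 MPI processes in total (8 on each of the 2 allocated nodes), each running 4 OpenMP threads.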

Interactive Mode

Interactive mode is typically used for debugging or optimizing code but not for running production code. To begin an interactive session, use the "qsub -I" command:

 qsub -I -l mppwidth=64,mppnppn=32,walltime=00:30:00

If the requested resources are available and free (in the example above: 2 nodes / 64 cores, 30 minutes), you will get a new session on the login node for your requested resources. Now you have to use the aprun command to launch your application on the allocated compute nodes. When you are finished, enter logout to exit the batch system and return to the normal command line.
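As an illustration only (a sketch with placeholder names for the workspace directory and the executable), the commands inside such an interactive session could look like this:

 # inside the interactive session started with "qsub -I ..." above:
 cd /mnt/lustre_server/<your_workspace>     # change into your workspace before calling aprun
 aprun -n 64 -N 32 ./my_mpi_executable arg1 arg2
 logout                                     # ends the interactive session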

Notes

  • In interactive mode you must use the "-l mppwidth=" and "-l mppnppn=" options to request at least one core when you start the session. If you do not, your request for an interactive session will pause indefinitely.
  • Remember that you use aprun within the context of a batch session, and the maximum size of the job is determined by the resources you requested when you launched that session. You cannot use the aprun command to use more resources than you reserved with the qsub command. Once a batch session begins, you can only use fewer resources than initially requested.
  • While your job is running (in batch mode), STDOUT and STDERR are written to a file or files in a system directory, and the output is copied to your submission directory only after the job completes. Specifying the "qsub -j oe" option and redirecting the output to a file (see the examples above) makes it possible for you to view STDOUT and STDERR while the job is running; a sketch follows below.
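A minimal sketch of that combination, using the same resources as the MPI example above (the workspace path and the executable name are placeholders):

#!/bin/bash
#PBS -N job_name
#PBS -l mppwidth=64
#PBS -l mppnppn=32
#PBS -l walltime=00:20:00
# join STDERR into STDOUT
#PBS -j oe

# change into your workspace (placeholder path) before calling aprun
cd /mnt/lustre_server/<your_workspace>

# the explicit redirection writes the application output to a file in the workspace,
# so it can already be inspected while the job is running
aprun -n 64 -N 32 ./my_mpi_executable arg1 arg2 > my_output_file 2>&1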


Run job on other Account ID

There are Unix groups associated with the project account ID (ACID). To run a job on a non-default project budget (associated with a secondary group), the group name of this project has to be passed in the group_list:

qsub -W group_list=<groupname> ...

To get your available groups:

id -Gn

Note that this procedure is neither applicable nor necessary for the default project (associated with the primary group), which is printed by "id -gn".
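For example (a sketch only, with <groupname> as a placeholder), this option can be combined with the resource requests from the batch examples above:

 qsub -W group_list=<groupname> -l mppwidth=64,mppnppn=32,walltime=00:20:00 my_batchjob_script.pbs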

Usage of a Reservation

For nodes which are reserved for special groups or users, you need to specify an additional option for this reservation:

E.g. a reservation named john.1 will be used with the following command:
qsub -W x=FLAGS:ADVRES:john.1 ...

Deleting a Batch Job

 qdel <jobID>
 canceljob <jobID>

These commands enable you to remove jobs from the job queue. If the job is running, qdel will abort it. You can obtain the job ID from the output of the "qstat" command or from the output that your qsub command printed when you submitted the job.

Status Information

  • Status of jobs:
 qstat
 qstat -a
 showq
  • Status of Queues:
 qstat -q
 qstat -Q
 
  • Status of job scheduling
 checkjob <jobID>
 showstart <jobID>
  • Status of backfill. This can help you to build small jobs that can be backfilled immediately while you are waiting for the resources to become available for your larger jobs
 showbf
  • Status of nodes/system (see also "Gathering Application Status and Information on the Cray System" in the Cray documentation)
 xtnodestat
 apstat

Note: for further details, type the following on the login node:

 man qstat
 man apstat
 man xtnodestat
 showbf -h
 showq -h
 checkjob -h
 showstart -h

Limitations

See the Batch System Layout and Limits page.


Special Jobs / Special Nodes

MPP nodes with different memory

3072 of the 3552 nodes in total are installed with 32GB memory; 480 nodes are installed with 64GB memory.

32 GB nodes or 64 GB nodes

  • If your job does not define any node feature, it gets the default feature "mem32gb", which allocates nodes with 32GB memory (see the job examples above).
  • If you want one or more of the 64GB nodes, then you have to specify the node feature "mem64gb":
 qsub -l feature=mem64gb <my_batchjob_script.pbs>

Or inside a simple script my_batchjob_script.pbs:

#!/bin/bash
#PBS -N job_name
#PBS -l feature=mem64gb
#PBS -l mppwidth=64
#PBS -l mppnppn=32
#PBS -l walltime=00:20:00             
  
# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR

# Launch the parallel job to the allocated compute nodes
aprun -n 64 -N 32 ./my_mpi_executable arg1 arg2 > my_output_file 2>&1


Jobs with mixed node features

If your job needs some of the 64GB nodes and some of the 32GB nodes at the same time, then your job submission options look completely different. Do not specify an additional -l feature=<nodefeature>! You only need to specify the resource name nodes=<node count>:ppn=<processes per node>:mem64gb+<node count>:ppn=<processes per node>:mem32gb:

 qsub -l nodes=1:ppn=32:mem64gb+64:ppn=32:mem32gb,walltime=3600 my_batchjob_script.pbs

The example above will allocate 65 nodes to your job for a maximum time of 3600 seconds and can place 32 processes on one node with 64GB memory and 32 processes on each of the 64 allocated nodes with 32GB memory. The option ppn=32 is important in order to get all cores of the allocated MPP nodes. Now you need to select your different allocated nodes for the aprun command in your script my_batchjob_script.pbs:

#!/bin/bash
#PBS -N job_name
#PBS -l nodes=1:ppn=32:mem64gb+64:ppn=32:mem32gb
#PBS -l walltime=3600          
  
# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR

### defining the number of PEs (processes per node, max 32) ###
# p32: number of PEs (Processing Elements) on 32GB nodes
# p64: number of PEs (Processing Elements) on 64GB nodes
#-------------------------------------------------------
p32=32
p64=16

### selecting nodes with different memory ###
#---------------------------------------
resid=$(qstat -f $PBS_JOBID | grep BATCH_PARTITION_ID | sed -e 's/.*_ID=//' -e 's/,.*//')
sleep 5
nids=$(apstat -n -R $resid | grep UP | awk '{print $1}')

let i32=0
let i64=0
filetmp=./tmp.$$
xtprocadmin -A | grep compute > $filetmp
for n in $nids; do
   let size=$(grep -w $n  $filetmp | awk '{print $9}' )
   if (( $size == 32768 )); then
      let i32+=1
      nid32="$nid32 $n"
   else
      let i64+=1
      nid64="$nid64 $n"
   fi
done
rm $filetmp
nid32=$(echo $nid32 | tr ' ' ',')
nid64=$(echo $nid64 | tr ' ' ',')
(( P32 = $i32 * $p32 ))
(( P64 = $i64 * $p64 ))
(( D32 = 32 / p32 ))
(( D64 = 32 / p64 ))

# Launch the parallel job to the allocated compute nodes using 
# Multi Program, Multi Data (MPMD) mode (see "man aprun")
# -------------------------------------------------
# $nid64 : node list with 64GB memory
# $i64     : number of nodes with 64GB memory
# $p64    : number of PEs per node on nodes with 64GB
# $P64    : total number of PEs (processing elements) on nodes with 64GB
# ----
# $nid32 : node list with 32GB memory
# $i32     : number of nodes with 32GB memory
# $p32    : number of PEs per node on nodes with 32GB
# $P32    : total number of PEs on nodes with 32GB
# ----------
# The "env OMP_NUM_THREADS=...." parts of the aprun command below are only useful for OpenMP (hybrid) programs.
#
aprun -L $nid64 -n $P64 -N $p64 -d $D64 env OMP_NUM_THREADS=$D64 ./my_executable1 : -L $nid32 -n $P32 -N $p32 -d $D32 env OMP_NUM_THREADS=$D32 ./my_executable2

By defining p64 and p32 in the example above, you can control the number of processes on each node for the two node types (64GB memory and 32GB memory). This corresponds to the qsub job option "-l mppnppn=32" for single-node-type MPP jobs (see the examples in the previous chapters above). Keep in mind that the maximum value is 32, the number of cores of each MPP node.
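For example, with the values set in the script above (p32=32 and p64=16), the computed depths are D32 = 32/32 = 1 and D64 = 32/16 = 2: each 32GB node runs 32 PEs with one core each, while each 64GB node runs 16 PEs with 2 cores each (and, for hybrid programs, 2 OpenMP threads per PE).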


Nodes with Kepler accelerator cards

28 of the 3552 nodes in total are installed with 32GB memory and 16 cores each, plus an additional Kepler accelerator card.

There are 2 ways to submit a job allocating the nodes with the Kepler accelerators.

  • Using the queue gpgpu:
 qsub -q gpgpu -l mppwidth=16,mppnppn=16 <myjobscript> 

This allocates 1 node (16 cores) with a Kepler accelerator, i.e. a node with the feature tesla.

  • Using the node feature tesla:
 qsub -l mppwidth=16,mppnppn=16,feature=tesla <myjobscript> 

This also allocates 1 node (16 cores) with a Kepler accelerator.

For CCM jobs you need to combine the second way (feature=tesla) with submitting to the queue ccm:

  qsub -q ccm -l mppwidth=16,mppnppn=16,feature=tesla <myjobscript>

Pre- and Postprocessing/Visualization nodes with large memory

A small number of external nodes are standard cluster nodes with 128GB memory and 32 cores (Intel Xeon CPU X7550). These nodes are not connected to the GEMINI interconnect network of the CRAY MPP compute nodes, but they have the same workspace and home filesystems mounted and an NVIDIA Quadro 6000 graphics card installed for visualization. To get one of the pre-/postprocessing nodes, the following qsub options are needed:

 qsub -l nodes=1:mem128gb,walltime=3600 ./my_batchjob_script.pbs

This allocates one of the pre-/postprocessing nodes for 3600 seconds. It is not possible to get more than one of these nodes in one job.

Detailed information about visualization is available on the Graphic Environment page. There you will find wrapper scripts and environment settings for your visualization work.

Shared node SMP with very large memory

One external node has 1TB memory and 64 cores (Intel Xeon CPU X7550). This node is not connected to the GEMINI interconnect network of the CRAY MPP compute nodes, but it has the same workspace and home filesystems mounted. This node is shared by several users and jobs at the same time! To submit a job to this node you need the following qsub options:

 qsub -q smp -l nodes=1:smp:ppn=1,walltime=3600 ./my_batchjob_script.pbs

This allocates one process/core (ppn=1) of the SMP node for 3600 seconds. Please be considerate of other users on this node. The maximum number of processes on this node is limited to 64! Please do not start more processes on this node than requested.