- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

Batch system: Difference between revisions

From HLRS Platforms

Revision as of 10:13, 28 November 2008

Job Examples

A batch job starts with a few comments, giving information to the batch system about the nature of the job.

Please note: The account code is mandatory. It is not your login name but a code used for accounting. Each user has at least one account code and may have several. The chosen account code is used for billing the job.


Sample large job; it will be executed in '?multi' on v901-v907:

#PBS -q dq
#PBS -l cpunum_job=16           # CPUs per node
#PBS -b 2                       # number of nodes, max 4 at the moment
#PBS -l elapstim_req=12:00:00   # max wallclock time
#PBS -l cputim_job=192:00:00    # max accumulated cputime per node
#PBS -l cputim_prc=11:55:00     # max cputime per process
#PBS -l memsz_job=500gb         # memory per node
#PBS -A <acctcode>              # your account code, see login message, without <>
#PBS -j o                       # join stdout/stderr
#PBS -T mpisx                   # job type: mpisx for MPI
#PBS -N MyJob                   # job name
#PBS -M MyMail@mydomain         # you should always specify your email

Sample small job; it will be executed in '?single' on v900 in shared mode, so other jobs will run on the same node:

#PBS -q dq
#PBS -l cpunum_job=8            # CPUs per node
#PBS -b 1                       # number of nodes
#PBS -l elapstim_req=12:00:00   # max wallclock time
#PBS -l cputim_job=192:00:00    # max accumulated cputime per node
#PBS -l cputim_prc=11:55:00     # max cputime per process
#PBS -l memsz_job=64gb          # memory per node
#PBS -A <acctcode>              # your account code, see login message, without <>
#PBS -j o                       # join stdout/stderr
#PBS -T mpisx                   # job type: mpisx for MPI
#PBS -N MyJob                   # job name
#PBS -M MyMail@mydomain         # you should always specify your email

Sample test job; it will be executed in 'test', where one job always runs, no matter how loaded the node is:

#PBS -q dq
#PBS -l cpunum_job=4            # CPUs per node
#PBS -b 1                       # number of nodes
#PBS -l elapstim_req=1200       # max wallclock time
#PBS -l cputim_job=600          # max accumulated cputime per node
#PBS -l cputim_prc=599          # max cputime per process
#PBS -l memsz_job=16gb          # memory per node
#PBS -A <acctcode>              # your account code, see login message, without <>
#PBS -j o                       # join stdout/stderr
#PBS -T mpisx                   # job type: mpisx for MPI
#PBS -N MyJob                   # job name
#PBS -M MyMail@mydomain         # you should always specify your email


Please note: The time and memory limits above are the maximum values that can be specified. You should specify values close to what your job really needs, because the scheduler uses them as input to select jobs. Smaller jobs can fit into holes in the schedule, so realistic, smaller values increase the probability that a job starts early.
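One way to arrive at a realistic accumulated CPU time is to multiply the expected wall-clock time by the number of CPUs per node. The following is a minimal sketch under that assumption (all CPUs busy for the whole run); the helper name cpu_time_bound is hypothetical and not part of NQSII. For the large job sample, 12 hours on 16 CPUs gives the 192 hours used there.

```shell
#!/bin/sh
# Hypothetical helper (not an NQSII command): multiply a wall-clock time
# given as HH:MM:SS by a CPU count and print the product as HH:MM:SS.
# Assumption: every requested CPU is busy for the whole wall-clock time.
cpu_time_bound() {
    echo "$1" | awk -F: -v n="$2" '{
        total = ($1 * 3600 + $2 * 60 + $3) * n        # CPU seconds
        printf "%02d:%02d:%02d\n", int(total / 3600), int((total % 3600) / 60), total % 60
    }'
}

# Large job sample: 12 hours wall-clock time on 16 CPUs per node.
cpu_time_bound 12:00:00 16
```

The printed value can be used for cputim_job; a job that is expected to finish earlier can pass a shorter wall-clock time to get a smaller request.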


Contents of Job

A typical job will create a workspace (see workspace mechanism), copy some data, run the application and save some data at the end.

Multithreaded job

ws=`ws_allocate myimportantdata 10`    # get a workspace for 10 days
cd $ws                                 # go there
cp $HOME/input/file.dat .              # get some data
export OMP_NUM_THREADS=8               # use 8 OMP threads
export F_PROGINF=DETAIL                # get some performance information after the run
$HOME/bin/myApp                        # run my application
cp output.dat $HOME/output

MPI job

ws=`ws_allocate myimportantdata 10`    # get a workspace for 10 days
cd $ws                                 # go there
cp $HOME/input/file.dat .              # get some data
export MPIPROGINF=DETAIL
mpirun -nn 2 -nnp 16 $HOME/bin/myApp   # run my application on 2 nodes, 16 CPUs each (32 total)
cp output.dat $HOME/output

Hybrid OpenMP and MPI job

SCR=`ws_allocate MyWorkspace 2`    
cd $SCR
export OMP_NUM_THREADS=16              # 16 threads
export MPIPROGINF=YES
export MPIMULTITASKMIX=YES
MPIEXPORT="OMP_NUM_THREADS"            # make this environment variable known to all nodes
export MPIEXPORT
mpirun -nn 4 -nnp 1 $HOME/bin/mycode   # run on 4 nodes using 1 process per node, with 16 threads each
cp outfile $HOME/output
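In a hybrid job, the number of processes per node (-nnp) times OMP_NUM_THREADS should normally not exceed the CPUs requested per node (cpunum_job). The sketch below checks this before mpirun is called; the helper check_layout is hypothetical, and the rule it enforces is an assumption about sensible usage rather than something the batch system is documented to require.

```shell
#!/bin/sh
# Hypothetical helper: verify that processes per node times threads per
# process fits into the CPUs requested per node (cpunum_job).
check_layout() {
    cpus_per_node=$1
    procs_per_node=$2
    threads_per_proc=$3
    used=$((procs_per_node * threads_per_proc))
    if [ "$used" -gt "$cpus_per_node" ]; then
        echo "oversubscribed: $used threads on $cpus_per_node CPUs"
        return 1
    fi
    echo "ok: $used of $cpus_per_node CPUs used per node"
}

# 16 CPUs per node (as in the large job sample), 1 process per node, 16 threads.
check_layout 16 1 16
```

In a job script the check could be placed directly before the mpirun line, for example as check_layout 16 1 "$OMP_NUM_THREADS" || exit 1.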


Usage

To submit a job use

qsub jobfile


To monitor system usage, you can use the qstat command of NQSII to see your requests, or the qs script on ontake/yari to see all running and pending requests.

qstat output looks like:

RequestID       ReqName  UserName Queue     Pri STT S   Memory      CPU   Elapse R H Jobs
--------------- -------- -------- -------- ---- --- - -------- -------- -------- - - ----

For a detailed description of all fields, please see the qstat manpage on a1. Jobs shows the number of nodes a job requested. Please note that the CPU time shown is the CPU time of the currently running process within the job. If you want to see the accumulated time of the whole request, use qstat -c 1.

To learn on which nodes your job is running, use qstat -J or qs.

To see all jobs in the system, use qs on ontake or yari (not available on SX):

STAT REQ-ID  OWNER     NAME    QUEUE  NODES    TIME       TIME ESTIMATIONS         HOSTS
---- ------ -------- -------- -------- --- ------------ -------------------------- -----

qs shows the requests in the order in which the scheduler will start them. For privacy reasons, you cannot see all details of other users' requests, but you can see all requests in the system, waiting or running.

The time and memory numbers shown are the current consumption and the requested limit. The ESTIMATIONS column gives the estimated time when the job will start; this estimate is based on a 72-hour prediction.

To delete a job, use the qdel command. Please note: qdel of NQSII sends a SIGTERM first, followed by a SIGKILL after 5 seconds. You can change the number of seconds using the -g option. With qdel -g -1, SIGKILL is sent immediately.

Please avoid writing large amounts of data to stdout; redirect stdout and stderr of your application into a file in your job's directory. Writing large stdout requires spool space of unpredictable size, and always causes problems when those files are stored back into users' home directories.
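A minimal sketch of such a redirection follows. The function my_app is only a stand-in for the real application (for example $HOME/bin/myApp), and the temporary directory stands in for the job's workspace; both are assumptions made to keep the example self-contained.

```shell
#!/bin/sh
# Stand-in for the real application (assumption for this sketch).
my_app() {
    echo "step 1 done"            # normal output goes to stdout
    echo "warning: demo" >&2      # diagnostics go to stderr
}

# Stand-in for the job directory; in a real job this would be the workspace.
jobdir=$(mktemp -d)
cd "$jobdir" || exit 1

# Send both stdout and stderr into one log file in the job directory, so that
# the application output does not end up in the batch system's spool space.
my_app > myApp.log 2>&1
```

The order matters: "> myApp.log" must come before "2>&1", otherwise stderr still goes to the job's original stderr.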

Tip: If you want to make sure a batch request is able to clean up when it hits a time limit, specify a second limit. In addition to cputim_job you can specify cputim_prc. Set that limit a few minutes shorter; the process hitting the limit (probably your simulation) will then be killed first, and your batch job has some time to clean up. The same applies to the elapsed time limit.
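The shorter limit can be derived from the longer one by subtracting a margin. The following sketch does this for limits written as HH:MM:SS; the helper shorten_limit is hypothetical and not part of NQSII, and the 5 minute margin is just the value that reproduces the 11:55:00 used in the job samples.

```shell
#!/bin/sh
# Hypothetical helper (not an NQSII command): subtract a safety margin,
# given in minutes, from a limit written as HH:MM:SS.
shorten_limit() {
    echo "$1" | awk -F: -v m="$2" '{
        total = $1 * 3600 + $2 * 60 + $3 - m * 60     # remaining seconds
        if (total < 0) total = 0                       # never go below zero
        printf "%02d:%02d:%02d\n", int(total / 3600), int((total % 3600) / 60), total % 60
    }'
}

# A 12 hour limit with 5 minutes reserved for cleanup.
shorten_limit 12:00:00 5
```

In the job script, the commands that follow the application (such as copying output.dat back to $HOME/output) then run in the remaining minutes after the application has been stopped by the shorter limit.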