- Information in the HLRS Wiki is not legally binding and is provided without warranty -
NEC SX-ACE Batch System
The central entry point is the queue multi. The maximum elapsed time of a job is 24 hours, and the maximum memory per node is 61 GB.
Job granularity is the node: you always get at least one whole dedicated node.
The batch system is NEC NQSII. Its directives follow the POSIX standard for batch systems and look very much like PBS or Sun Grid Engine directives.
A batch job starts with a few comments that tell the batch system about the nature of the job. All of these can also be given on the qsub command line.
Sample job script for a 2-node job:
#!/usr/bin/bash
#PBS -b 2                        # number of nodes
#PBS -l elapstim_req=12:00:00    # max wallclock time
#PBS -j o                        # join stdout/stderr
#PBS -T mpisx                    # Job type: mpisx for MPI
#PBS -N MyJob                    # job name
#PBS -M MyMail@mydomain          # you should always specify your email
You should specify an elapsed time limit that comes close to reality, because the scheduler uses this value to select jobs. Smaller jobs can fit into holes, so realistic, smaller values increase the probability that the job starts early.
Contents of Job
A typical job will create a workspace (see workspace mechanism), copy some data, run the application and save some data at the end.
ws=`ws_allocate myimportantdata 10`   # get a workspace for 10 days
cd $ws                                # go there
cp $HOME/input/file.dat .             # get some data
export OMP_NUM_THREADS=8              # use 8 OMP threads
export F_PROGINF=DETAIL               # get some performance information after the run
$HOME/bin/myApp                       # run my application
cp output.dat $HOME/output

ws=`ws_allocate myimportantdata 10`   # get a workspace for 10 days
cd $ws                                # go there
cp $HOME/input/file.dat .             # get some data
export MPIPROGINF=DETAIL
mpirun -nn 2 -nnp 4 $HOME/bin/myApp   # run my application on 2 nodes, 4 CPUs each (8 total)
cp output.dat $HOME/output
Hybrid OpenMP and MPI job
SCR=`ws_allocate MyWorkspace 2`
cd $SCR
export OMP_NUM_THREADS=4              # 4 threads
export MPIPROGINF=YES
export MPIMULTITASKMIX=YES
MPIEXPORT="OMP_NUM_THREADS"           # make this environment known to all nodes
export MPIEXPORT
mpirun -nn 4 -nnp 1 $HOME/bin/mycode  # run on 4 nodes using 1 process per node, but 4 threads
cp outfile $HOME/output
More about mpirun
The basic syntax for a job on multiple nodes, with X nodes and Y processes per node, is
mpirun -nn X -nnp Y app
NQSII sets the variable $_MPINNODES to the number of nodes requested with #PBS -b N (or qsub -b N). Passing it to mpirun -nn avoids inconsistencies between the request and the launch.
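As a sketch of how $_MPINNODES can be used this way, the helper below passes it to mpirun -nn. The function name run_on_all_nodes and the value of 4 processes per node in the usage line are illustrative assumptions, not part of NQSII.

```shell
# Hypothetical helper for a job script: start an MPI program on every
# node of the current request.
#   $1      processes per node (passed to mpirun -nnp)
#   $2...   program and its arguments
run_on_all_nodes() {
    procs_per_node=$1
    shift
    # _MPINNODES is set by NQSII to the node count given with #PBS -b
    mpirun -nn "$_MPINNODES" -nnp "$procs_per_node" "$@"
}
```

Inside a job script, a call such as run_on_all_nodes 4 $HOME/bin/myApp then follows whatever node count was requested with #PBS -b, so changing the request does not require editing the mpirun line.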
Redirect stdout/stderr of each MPI process into separate files
mpirun -nn X -nnp Y /usr/lib/mpi/mpisep.sh app
The environment variable $MPISEPSELECT determines whether stdout and stderr are separated or merged.
To submit a job, use the qsub command.
(Use qlogin to get an interactive node.)
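A minimal sketch of a submission, assuming the job script is stored in a file named job.sh (a hypothetical name). The wrapper submit_job is likewise only illustrative; it gives on the qsub command line the same options that the 2-node sample sets with #PBS comments.

```shell
# Hypothetical wrapper: submit a script with the options of the 2-node
# sample given on the qsub command line instead of as #PBS comments.
#   $1  path of the job script
submit_job() {
    qsub -b 2 -l elapstim_req=12:00:00 -j o -T mpisx -N MyJob "$1"
}
```

Calling submit_job job.sh should then request the same resources as a plain qsub job.sh on a script that carries these directives itself.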
To monitor system usage, you can use the qstat command of NQSII to see your requests, or the qs script on kabuki to see all running and pending requests.
qstat output looks like:
RequestID       ReqName  UserName Queue    Pri  STT S Memory   CPU      Elapse   R H Jobs
--------------- -------- -------- -------- ---- --- - -------- -------- -------- - - ----
For a detailed description of all fields, please see the qstat manpage on kabuki. Jobs shows the number of nodes a job requested. Please note that CPU is the CPU time of the currently running process within the job. To see the accumulated time of the whole request, use qstat -c 1.
To learn on which nodes your job is running, use qstat -J or qs.
To see all jobs in the system, please use qs on kabuki:
STAT REQ-ID OWNER    NAME     QUEUE    NODES TIME       TIME ESTIMATIONS           HOSTS
---- ------ -------- -------- -------- --- ------------ -------------------------- -----
qs shows the requests in the order in which the scheduler will start them. For privacy reasons, you cannot see all details of other users' requests, but you can see all requests in the system, waiting or running.
The time and memory numbers shown are current consumption and requested limit. The ESTIMATIONS column gives the estimated time when the job will start; this estimate is based on a 72-hour prediction.
To delete a job, use the qdel command. Please note: qdel of NQSII sends a SIGTERM first, followed by a SIGKILL after 5 seconds. You can change the number of seconds with the -g option. With qdel -g -1, SIGKILL is sent immediately.
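A job script can use the interval between the two signals to save partial results. The following is a minimal sketch, assuming that the SIGTERM reaches the shell running the job script; the handler name cleanup and the flag cleaned_up are illustrative, and the file names follow the job samples on this page.

```shell
#!/usr/bin/bash
# Sketch: save partial results when the request is deleted with qdel.
# Assumption: the SIGTERM sent by qdel is delivered to this shell.
cleaned_up=no

cleanup() {
    # Keep this short: by default only 5 seconds remain before SIGKILL.
    if [ -f output.dat ]; then
        cp output.dat "$HOME/output"
    fi
    cleaned_up=yes
}

# Call cleanup when the shell receives SIGTERM.
trap cleanup TERM
```

Note that the shell runs a trap handler only after the current foreground command has returned, so the handler acts once the application itself has stopped. If copying takes longer than 5 seconds, a larger value for the -g option of qdel leaves more time.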
Please avoid writing large amounts of data to stdout. Instead, redirect stdout and stderr of your application into a file in your job's directory. Large stdout requires spool space of unpredictable size and always causes problems when the files are copied back into users' home directories.
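The redirection can be written directly in the job script or wrapped in a small helper. The sketch below is one possible form; the function name run_logged and the log file name in the usage line are illustrative assumptions.

```shell
# Hypothetical helper: run a command with stdout and stderr collected
# in one log file instead of the batch system's spool space.
#   $1      log file, e.g. a file in the workspace directory
#   $2...   command and its arguments
run_logged() {
    logfile=$1
    shift
    "$@" > "$logfile" 2>&1
}
```

After cd $ws, a call such as run_logged myApp.log $HOME/bin/myApp keeps the application output in the workspace. Without the helper, the same effect is obtained with $HOME/bin/myApp > myApp.log 2>&1.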
Tip: If you want to make sure a batch request can clean up when it hits a time limit, specify a second limit. In addition to cputim_job you can specify cputim_prc. Set that limit a few minutes shorter; the process hitting the limit (probably your simulation) is then killed first, and your batch job has some time to clean up. The same applies to the elapsed time limit.
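A sketch of such a pair of limits as job directives. The resource names cputim_job and cputim_prc come from the tip above; the -l name=value syntax and the hh:mm:ss format are assumed by analogy with elapstim_req in the 2-node sample, and the values are purely illustrative.

```shell
#PBS -l cputim_job=12:00:00     # CPU time limit of the whole job (illustrative value)
#PBS -l cputim_prc=11:50:00     # per-process limit, a few minutes shorter (illustrative value)
```

With these values, a process that runs into its limit leaves the job script about ten minutes of CPU time to copy results back.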
The deployed scheduler (except for the test queue) uses a fairshare and backfilling strategy:
- new users with little usage in recent weeks have high priority
- small jobs can overtake large jobs to fill gaps
- large jobs have priority in general (but have a harder time finding resources)
- jobs age: long-waiting jobs gain priority
- the scheduler does not interrupt running jobs; all jobs in HOLD state are held by the user or an administrator
- jobs can be checkpointed and restarted; a job that was running when it was put in HOLD will continue after being released.