- Infos im HLRS Wiki sind nicht rechtsverbindlich und ohne Gewähr -
- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

CRAY XC40 Batch System Layout and Limits


Latest revision as of 15:01, 19 November 2019

There are different types of queues configured on this system. They are used to set proper priorities for different jobs and to account for user permissions and resource reservations of different user groups. The configuration is laid out such that usually all you need to do is request the number of processes you need (along with the number of processes per node) and the time (walltime) for your job. Users should always specify a realistic walltime: jobs with a shorter walltime get a higher priority and may be used for backfilling. (Users usually specify 24h, which is the maximum time limit on HLRS systems. If your job typically runs in 4h 17min and you specify 5h, your job can be selected whenever nodes are free for that timeframe while the job scheduler is still collecting nodes for a larger job.)


If you don't specify a queue on your job submission command, your job will be routed into a default queue.


Note:

  • In general the max. walltime for jobs is 24h.
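Jobs on this class of system are typically submitted with `qsub`. The following job script is a minimal sketch only; the resource names (`mppwidth`, `mppnppn`), the 24-cores-per-node assumption, and the application name are illustrative guesses based on common Cray XC batch setups and should be checked against the system's own documentation:

```
#!/bin/bash
#PBS -N example_job           # job name (hypothetical)
#PBS -l mppwidth=48           # total number of processes (assumption: Cray-style resource names)
#PBS -l mppnppn=24            # processes per node (assumption: 24 cores per XC40 node)
#PBS -l walltime=05:00:00     # a realistic walltime: shorter jobs get higher priority

cd $PBS_O_WORKDIR             # change to the directory the job was submitted from
aprun -n 48 -N 24 ./my_app    # launch the (hypothetical) application on the compute nodes
```

Submitted with `qsub job.sh`; since no `-q` option is given, the job is routed into a default queue as described above.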

Job Run Limitations

  • The maximum time limit for a job is 24 hours.
  • User limits:
    • the number of jobs of one user that can run at the same time is limited
  • User group limits:
    • the number of jobs of users in the same group that can run at the same time is limited
  • Batch queue limits for all user jobs:
    • 4488 nodes are available in total
      • single node job queues: max. 200 nodes in total
      • multi node job queues: max. 4488 nodes in total (max. node count per job: 2560)
        • small jobs (requesting fewer than 48 nodes): max. 400 nodes in total
    • the default walltime is 10 minutes
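The limits actually configured for the individual queues can usually be inspected directly on the system. A sketch, assuming a PBS/Torque-style batch system as suggested by the `qsub`-based workflow on this machine:

```
qstat -q     # one-line overview of all queues with their run/queued limits
qstat -Q -f  # full settings of every queue, including node and walltime limits
```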

Queues with extended wall time limits

are not available for general use. These queues are intended for jobs which cannot run within the 24h timeframe. Access is only granted after passing an evaluation process. The following rules apply to this queue:

  • Jobs may be killed for operational reasons at any time.
  • Jobs will be accounted for in any case. This also holds if the job has to be terminated for operational reasons.
  • The number of jobs per group is limited.
  • The number of jobs per user is limited.
  • Low scheduling priority
  • Max. walltime: 96h
  • Jobs which can run in a normal queue (by walltime, or because the job can be split into subjobs) have to be processed in a standard queue.

Again: this queue is not for convenience, but for running jobs which cannot produce a result in any other way!
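One common way to split a long-running job into subjobs, as the rules above require, is a chain of dependent batch jobs. A sketch assuming PBS/Torque dependency syntax and hypothetical script names, where each part writes a restart file that the next part reads:

```
first=$(qsub part1.sh)                    # qsub prints the job id of the submitted job
qsub -W depend=afterok:$first part2.sh    # part2 starts only after part1 completed successfully
```

Each subjob then fits within the normal 24h walltime limit, and the chain as a whole replaces one extended-walltime job.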