- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

Workflow and Job monitoring

From HLRS Platforms
Latest revision as of 12:07, 24 March 2020

For projects that want to implement an automatic workflow using the high-performance computer Hawk, or any other system at HLRS, this page provides guidelines on how to implement job monitoring and/or workflow handling.

  • Please be aware that you are not the only user on the system. Many other users perform tasks similar to yours, so even if one individual's job behaves correctly, the sheer number of such tasks can cause severe malfunctions.
  • If you implement a monitoring system at your local site, do not run one instance (monitor task) per job. This will fail due to the large number of ssh requests and qstat commands hitting the compute server at HLRS.
  • Do not start an ssh session (e.g. to execute qstat) on the HLRS compute server in a loop without a delay. Wait at least 3 minutes between two commands (yes, minutes, NOT seconds!).
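The rules above can be sketched as a single polling loop: one monitoring process for the whole workflow, one qstat call per poll, and at least three minutes between remote commands. This is only an illustrative sketch; the host name `hawk.hlrs.de`, the `QSTAT_CMD` variable, and the function names are assumptions, not an HLRS-confirmed interface.

```shell
#!/bin/bash
# Sketch: ONE central monitoring loop for all jobs of a workflow,
# instead of one ssh/qstat instance per job.
# QSTAT_CMD and the host "hawk.hlrs.de" are placeholders -- substitute
# your actual login node and remote command.
QSTAT_CMD=${QSTAT_CMD:-"ssh hawk.hlrs.de qstat -u $USER"}
POLL_INTERVAL=${POLL_INTERVAL:-180}   # at least 3 minutes between two remote commands

check_jobs() {
    # A single qstat call covers every job of the workflow.
    $QSTAT_CMD
}

monitor() {
    while output=$(check_jobs); do
        [ -z "$output" ] && break     # no jobs left in the queue: stop polling
        echo "$output"                # hand the status to your workflow logic here
        sleep "$POLL_INTERVAL"
    done
}

# monitor   # uncomment to start the polling loop
```

Running one such loop per workflow, rather than one per job, keeps the number of ssh requests on the HLRS login nodes constant regardless of how many jobs you have queued.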

Additional comments

  • Many jobs in the queue make the system confusing, and troubleshooting becomes complicated.
  • Jobs in user hold do not gain any priority advantage!