- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

How to detect out of memory events

From HLRS Platforms

Out-of-memory (OOM) events can be detected by running dmesg on every node of the job and grepping the output for "oom" or "Memory cgroup out of memory: Killed process". However, to avoid false positives from messages logged outside the job's runtime, one first has to restrict the output of dmesg to the runtime of the job.
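The filtering described above could look roughly like the following Python sketch for a single node. It is not an HLRS-provided tool. It assumes that dmesg prints its default "[seconds.microseconds]" prefix relative to boot, that the boot time can be read from the btime line in /proc/stat, that the job's start and end are known as Unix epoch seconds (for example from the batch system), and that the user is allowed to read the kernel log. The function names are hypothetical. The conversion from boot-relative to wall-clock time is only approximate, so a small safety margin around the job window may be sensible.

```python
import re
import subprocess

# The two search strings named in the text; matching is a plain substring test.
OOM_PATTERNS = ("oom", "Memory cgroup out of memory: Killed process")

# Default dmesg prefix: "[  123.456789] message", seconds since boot.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def read_boot_epoch(path="/proc/stat"):
    """Return the boot time in seconds since the epoch (the 'btime' line)."""
    with open(path) as stat:
        for line in stat:
            if line.startswith("btime "):
                return int(line.split()[1])
    raise RuntimeError(f"no btime line found in {path}")


def filter_oom_lines(dmesg_text, boot_epoch, job_start, job_end):
    """Keep OOM-related lines whose timestamp lies inside the job's runtime."""
    hits = []
    for line in dmesg_text.splitlines():
        match = TIMESTAMP.match(line)
        if match is None:
            continue  # no timestamp, so the line cannot be attributed to the job
        wall_clock = boot_epoch + float(match.group(1))
        in_job = job_start <= wall_clock <= job_end
        if in_job and any(pattern in line for pattern in OOM_PATTERNS):
            hits.append(line)
    return hits


def oom_lines_for_job(job_start, job_end):
    """Run dmesg on the local node and filter its output (assumes read access)."""
    output = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    ).stdout
    return filter_oom_lines(output, read_boot_epoch(), job_start, job_end)
```

Calling oom_lines_for_job(start, end) on a node returns the matching kernel log lines for that node. An empty list means no OOM message was logged there during the job.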

Although it is possible to make all the dmesg calls by means of ssh <node> "dmesg ...", this would serialize the collection. It would be better to make them via calls to system() in an MPI program, so that all nodes are queried in parallel. Who volunteers to implement this?
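As a starting point for such an implementation, here is a minimal sketch in Python. It uses mpi4py and subprocess instead of a C program calling system(), which is a substitution made for brevity and not the approach named above. It assumes mpi4py is installed, that the job is launched with at least one rank on every node, and that dmesg is readable by the user. The function names are hypothetical, and restricting the output to the job's runtime is left out here.

```python
import socket
import subprocess
import sys

# Shell pipeline run on every node; the patterns are the ones named in the text.
OOM_GREP = "dmesg | grep -e 'oom' -e 'Memory cgroup out of memory: Killed process'"


def run_shell(command):
    """Run a shell command and return its stdout (empty if grep finds nothing)."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout


def collect_oom_reports(comm, command=OOM_GREP, run=run_shell):
    """Each rank runs `command` locally; rank 0 gets {hostname: [lines]}."""
    local = (socket.gethostname(), run(command).splitlines())
    gathered = comm.gather(local, root=0)  # None on every rank except the root
    if gathered is None:
        return None
    report = {}
    for host, lines in gathered:
        # Several ranks may share a node; keep one copy per host.
        report.setdefault(host, lines)
    return report


def main():
    try:
        from mpi4py import MPI  # imported lazily so the helpers work without MPI
    except ImportError:
        print("mpi4py is needed to collect from all nodes", file=sys.stderr)
        return
    report = collect_oom_reports(MPI.COMM_WORLD)
    if report is not None:  # only rank 0 prints
        for host in sorted(report):
            for line in report[host]:
                print(f"{host}: {line}")


if __name__ == "__main__":
    main()
```

Started through the MPI launcher inside the job allocation, all ranks run dmesg at the same time and only rank 0 prints the combined result, so the collection is no longer serialized as it would be with a loop over ssh.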