- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

User monitoring

From HLRS Platforms
== Long term aggregations for end users ==
The long term job-specific aggregations for end users consist of eight metrics from the domains of energy consumption, performance characterization, and I/O:
- The total energy consumption, per job
- The achieved memory bandwidth, averaged, per node
- The achieved floating-point performance, averaged, per node
- The total amount of data written to or read from the Lustre file systems, per job
- The achieved peak write bandwidth to the Lustre file systems, averaged, per node
- The achieved peak read bandwidth from the Lustre file systems, averaged, per node
- The total number of metadata operations on the Lustre file systems, per job
- The achieved peak rate of metadata operations on the Lustre file systems, averaged, per node
 


=== How is data aggregated (default) ===
For the energy consumption:
# Collection of data over a time bucket and, depending on the metric, a calculation to derive the new metric. The formulas for each metric follow those from: https://github.com/RRZE-HPC/likwid/tree/master/groups/zen2
# Calculation of the median, min, and max over each node or CPU, where relevant.
# Calculation of the average and standard deviation of the medians, the 10th percentile of the minima, and the 90th percentile of the maxima.

Revision as of 12:41, 1 August 2024

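The three aggregation steps above can be sketched as follows. This is an illustrative sketch only, not the actual HLRS implementation; the function and variable names are assumptions.

```python
import statistics

def aggregate_metric(samples_per_node):
    """samples_per_node maps a node name to its per-time-bucket metric values
    (step 1 is assumed to have already produced these derived values)."""
    # Step 2: median, min, and max per node
    medians = [statistics.median(v) for v in samples_per_node.values()]
    minima = [min(v) for v in samples_per_node.values()]
    maxima = [max(v) for v in samples_per_node.values()]
    # Step 3: average and standard deviation of the medians,
    # 10th percentile of the minima, 90th percentile of the maxima
    return {
        "avg_of_medians": statistics.mean(medians),
        "std_of_medians": statistics.stdev(medians) if len(medians) > 1 else 0.0,
        "p10_of_minima": statistics.quantiles(minima, n=10, method="inclusive")[0],
        "p90_of_maxima": statistics.quantiles(maxima, n=10, method="inclusive")[-1],
    }
```

The percentile-of-extremes step damps the influence of single outlier nodes while the median/average pair characterizes typical behavior.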

=== Aggregated metrics ===

- Bandwidth: Total memory bandwidth on a per-socket basis. The two memory controllers of a socket are added together. The data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for users.
- Flops: Number of floating-point operations per second. The group does not differentiate between single- and double-precision rates. The data is saved in MFlops/s in
- Energy sum: The total amount of energy consumed by each node. It is measured in W. It is calculated by first adding the RAPL counters of both sockets together and then adding a constant of 220 W for the energy consumption of other parts of the node.
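The per-node calculations described above can be sketched as below. The 220 W constant and the summing of the two memory controllers and two sockets come from the text; the function names, the 1e9 conversion factor for GBytes/s, and the multiplication of the baseline by the job duration are assumptions.

```python
NODE_BASELINE_W = 220  # stated constant for the rest of the node

def socket_memory_bandwidth(ctrl0_bytes_s, ctrl1_bytes_s):
    """Bandwidth per socket: both memory controllers added together.
    Returns (Bytes/s as stored in the TimescaleDB, GBytes/s as in the user JSON)."""
    total = ctrl0_bytes_s + ctrl1_bytes_s
    return total, total / 1e9  # 1e9 factor is an assumption (GB vs. GiB)

def node_energy_sum(rapl_socket0_j, rapl_socket1_j, job_duration_s):
    """Energy per node: RAPL counters of both sockets plus the 220 W baseline,
    here assumed to be integrated over the job duration."""
    return rapl_socket0_j + rapl_socket1_j + NODE_BASELINE_W * job_duration_s
```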