- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -
User monitoring: Difference between revisions
Revision as of 12:14, 1 August 2024
Long-term aggregations for end users
How data is aggregated (default):
- Collect data over a time bucket and, depending on the metric, compute a derived metric from it. The formula for each metric follows the LIKWID performance groups for zen2: https://github.com/RRZE-HPC/likwid/tree/master/groups/zen2
- Calculate the median, min, and max for each node, or for each CPU where relevant.
- Calculate the average and standard deviation of the medians, the 10th percentile of the mins, and the 90th percentile of the maxes.
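The last two steps can be sketched in a few lines of Python. This is only an illustration, not the HLRS implementation: the function names, the input layout (a dict mapping a node name to the derived metric values of one job), the use of the population standard deviation, and the linear-interpolation percentile are all assumptions.

```python
import statistics


def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100] (assumed method)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() needs at least one value")
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def aggregate_job(samples_by_node):
    """Aggregate one metric of one job.

    samples_by_node: hypothetical layout, {node_name: [metric values]}.
    """
    # Step 2: median, min, and max per node.
    medians = [statistics.median(v) for v in samples_by_node.values()]
    mins = [min(v) for v in samples_by_node.values()]
    maxs = [max(v) for v in samples_by_node.values()]

    # Step 3: statistics across nodes.
    return {
        "median_avg": statistics.fmean(medians),
        # Population standard deviation is an assumption; the source
        # does not say which variant is used.
        "median_std": statistics.pstdev(medians),
        "min_p10": percentile(mins, 10),
        "max_p90": percentile(maxs, 90),
    }
```

Calling `aggregate_job({"n1": [1, 2, 3], "n2": [3, 4, 5]})` returns one dictionary with four values for the job, regardless of how many nodes or samples it had.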
Aggregated metrics:
- Bandwidth: Total memory bandwidth per socket. The values of the two memory controllers of a socket are added together. The data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for users.
- Flops: Number of floating point operations per second. The group does not differentiate between single and double precision rates. The data is saved in MFlops/s in
- Energy sum: The total power drawn by each node, measured in W (despite the name, the value is a power, not an energy). It is calculated by adding the RAPL counters of both sockets together and then adding a constant of 220 W for the consumption of the other parts of the node.
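The arithmetic behind the bandwidth and energy sum metrics is simple enough to show as a minimal sketch. The function names are hypothetical, and the factor of 10^9 for GBytes is an assumption (the source does not say whether decimal or binary units are used); only the 220 W constant and the summing rules come from the description above.

```python
# Constant added for the parts of the node not covered by RAPL (from the text).
NODE_BASELINE_W = 220.0

# Assumption: decimal gigabytes (1 GByte = 10**9 Bytes).
BYTES_PER_GBYTE = 1e9


def socket_bandwidth_bytes_per_s(controller_bandwidths):
    """Add the bandwidths of the memory controllers of one socket (Bytes/s)."""
    return sum(controller_bandwidths)


def bytes_to_gbytes_per_s(bandwidth_bytes_per_s):
    """Convert the stored Bytes/s value to the GBytes/s shown to users."""
    return bandwidth_bytes_per_s / BYTES_PER_GBYTE


def node_power_w(socket_rapl_w):
    """Add the RAPL values of all sockets, then the fixed 220 W baseline."""
    return sum(socket_rapl_w) + NODE_BASELINE_W
```

For example, two sockets reporting 180 W and 175 W give `node_power_w([180.0, 175.0])`, that is 180 + 175 + 220 = 575 W for the node.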