- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

User monitoring: Difference between revisions

From HLRS Platforms
== Long term aggregations for end users ==
The long term job-specific aggregations for end users consist of eight metrics from the domains of energy consumption, performance characterization and I/O.

# the total energy consumption per job [Wh]
# the achieved memory bandwidth, averaged, per node
# the achieved floating point performance, averaged, per node
# the total amount of data written to or read from the Lustre file systems, per job
# the achieved peak write bandwidth to the Lustre file systems, averaged, per node
# the achieved peak read bandwidth from the Lustre file systems, averaged, per node
# the total number of metadata operations on the Lustre file systems, per job
# the achieved peak rate of metadata operations on the Lustre file systems, averaged, per node


=== How is data aggregated (default): ===
For the energy consumption:
# Data is collected over a time bucket and, depending on the metric, a calculation is performed to derive the metric value. The formula for each metric follows the formulas from: https://github.com/RRZE-HPC/likwid/tree/master/groups/zen2
# Calculate the median, minimum and maximum of each node or CPU, where relevant.
# Calculate the average and standard deviation of the medians, the 10th percentile of the minima and the 90th percentile of the maxima.
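The three aggregation steps above can be sketched as follows; this is an illustrative reconstruction using numpy, not the actual HLRS pipeline code, and it assumes step 1 has already produced one derived metric time series per node:

```python
import numpy as np

def aggregate(per_node):
    """per_node: list of per-node metric time series (output of step 1)."""
    # Step 2: median, min and max of each node's time series.
    medians = np.array([np.median(ts) for ts in per_node])
    minima = np.array([np.min(ts) for ts in per_node])
    maxima = np.array([np.max(ts) for ts in per_node])
    # Step 3: average and standard deviation of the medians,
    # 10th percentile of the minima, 90th percentile of the maxima.
    return {
        "avg_of_medians": float(np.mean(medians)),
        "std_of_medians": float(np.std(medians)),
        "p10_of_min": float(np.percentile(minima, 10)),
        "p90_of_max": float(np.percentile(maxima, 90)),
    }
```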


=== How is data aggregated: ===


For the total energy consumption per job:
# We calculate the median of the power consumption per node over the timeline of the job and sum up the respective contributions from all compute nodes of the job.
# We factor in static contributions from the admin and storage infrastructure, averaged over all compute nodes and assigned per compute node of the job.
# We add a static contribution from the cooling distribution units, averaged over all compute nodes and assigned per compute node of the job.
# We include the overhead corresponding to the efficiency rating of the power supply units.
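The four steps can be sketched as below. The per-node overhead values and the PSU efficiency are made-up placeholders for illustration only, not actual HLRS figures:

```python
import numpy as np

# Placeholder constants -- assumed values, not real HLRS infrastructure data.
INFRA_W_PER_NODE = 30.0    # admin + storage share per compute node (assumed)
COOLING_W_PER_NODE = 20.0  # cooling distribution unit share per node (assumed)
PSU_EFFICIENCY = 0.95      # efficiency rating of the power supply units (assumed)

def job_energy_wh(power_samples_w, duration_h):
    """power_samples_w: one time series of power readings [W] per compute node."""
    # Step 1: median power per node over the job timeline, summed over all nodes.
    compute_w = sum(float(np.median(node)) for node in power_samples_w)
    # Steps 2 and 3: static infrastructure and cooling shares per compute node.
    n_nodes = len(power_samples_w)
    static_w = n_nodes * (INFRA_W_PER_NODE + COOLING_W_PER_NODE)
    # Step 4: overhead from the power supply unit efficiency rating.
    total_w = (compute_w + static_w) / PSU_EFFICIENCY
    return total_w * duration_h  # energy in Wh
```

Whether the PSU overhead applies to the static contributions as well is an assumption of this sketch.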


=== Aggregated metrics: ===
For the achieved memory bandwidth:
# This is the rate at which data can be read from or stored to the main memory.
# To calculate the achieved memory bandwidth, we calculate the median of the memory bandwidth per node as reported by Likwid[1] over the timeline of the job. We then average over the respective contributions from all compute nodes of the job.
 
For the achieved floating point performance:
# This is the number of floating point operations per second. While this metric does not discriminate between single and double precision floating point operations, it takes into account the SIMD width [2] of the floating point instructions.
# To calculate the achieved floating point performance, we calculate the median of the floating point operations per second as reported by Likwid[1] over the timeline of the job. We then average over the respective contributions from all compute nodes of the job.
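Both metrics use the same aggregation shape: median over the job timeline per node, then the average over all compute nodes. A minimal sketch with illustrative input values:

```python
import numpy as np

def per_node_average(samples_per_node):
    """samples_per_node: one Likwid-style metric time series per compute node.

    Returns the job-level value: median over each node's timeline,
    averaged over all compute nodes of the job.
    """
    node_medians = [float(np.median(ts)) for ts in samples_per_node]
    return float(np.mean(node_medians))
```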
 
 
 
[1] Likwid: https://github.com/RRZE-HPC/likwid
[2] SIMD: https://en.wikipedia.org/wiki/Single_instruction,_multiple_data

Revision as of 13:49, 1 August 2024
