- Infos im HLRS Wiki sind nicht rechtsverbindlich und ohne Gewähr -
- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

User monitoring

Long term aggregations on a job basis


How is data aggregated (default):

 1. Collect the data over a timebucket and, depending on the metric, perform a calculation to derive a new metric. The formulas for each metric follow those from https://github.com/RRZE-HPC/likwid/tree/master/groups/zen2
 2. Calculate the median, min and max of each node or CPU, where relevant.
 3. Calculate the average and standard deviation of the medians, the 10th percentile of the minima and the 90th percentile of the maxima.
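A minimal sketch of one plausible reading of these three steps, assuming the derived metric is already available as one time series per node (all names and the data layout are illustrative, not the actual implementation):

<syntaxhighlight lang="python">
import statistics


def aggregate_default(samples_per_node):
    """Default aggregation over a job, given a derived metric sampled per
    timebucket for every node: samples_per_node maps node -> list of values."""
    medians, minima, maxima = [], [], []
    for values in samples_per_node.values():
        medians.append(statistics.median(values))  # median per node
        minima.append(min(values))                 # min per node
        maxima.append(max(values))                 # max per node

    def percentile(data, q):
        # simple nearest-rank percentile, adequate for a sketch
        data = sorted(data)
        idx = max(0, min(len(data) - 1, round(q / 100 * (len(data) - 1))))
        return data[idx]

    return {
        "avg_of_medians": statistics.mean(medians),
        "stddev_of_medians": statistics.pstdev(medians),
        "p10_of_minima": percentile(minima, 10),
        "p90_of_maxima": percentile(maxima, 90),
    }


# Example: one derived metric, two nodes, four timebuckets each
print(aggregate_default({"n1": [3.0, 4.0, 5.0, 4.5], "n2": [2.5, 3.5, 6.0, 4.0]}))
</syntaxhighlight>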


Data aggregation of InfiniBand counter metrics:

If there are two controllers per node, their values are first added together. Then, for each pair of consecutive timestamps, the difference in seconds and the difference in the counter value are calculated. Dividing the value difference by the time difference gives the change per second, on which we perform our default aggregations.
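A small sketch of this counter-to-rate conversion, with hypothetical timestamps and counter readings:

<syntaxhighlight lang="python">
def counters_to_rates(timestamps, ctrl0, ctrl1):
    """Convert two per-controller cumulative counters into a per-second rate.

    timestamps: sample times in seconds
    ctrl0, ctrl1: cumulative counter readings of the two controllers
    """
    totals = [a + b for a, b in zip(ctrl0, ctrl1)]   # add both controllers
    rates = []
    for i in range(1, len(totals)):
        dt = timestamps[i] - timestamps[i - 1]       # time difference in s
        dv = totals[i] - totals[i - 1]               # counter difference
        rates.append(dv / dt)                        # change per second
    return rates                                     # fed into the default aggregations


print(counters_to_rates([0, 10, 20], [100, 600, 1300], [200, 700, 1200]))
</syntaxhighlight>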


Storing the aggregated data:

After each job, every metric is calculated and aggregated. The data of every metric is then inserted into the TimescaleDB. A predefined subset of metrics is also inserted into the accounting database and made available to the user in the form of a JSON file.
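A rough sketch of this storage step, assuming a hypothetical TimescaleDB hypertable job_aggregates reached through a standard PostgreSQL driver; the real schema, connection details and user-facing metric subset are not documented here:

<syntaxhighlight lang="python">
import json

import psycopg2  # assumption: TimescaleDB is accessed via a regular PostgreSQL driver

USER_SUBSET = {"energy_wh", "mem_bw_gbytes_s", "flops_gflops_s"}  # hypothetical subset


def store_job_aggregates(conn, job_id, aggregates):
    """Insert all aggregated metrics into TimescaleDB and dump the
    user-facing subset as a JSON file (table and column names are made up)."""
    with conn.cursor() as cur:
        for metric, value in aggregates.items():
            cur.execute(
                "INSERT INTO job_aggregates (job_id, metric, value) VALUES (%s, %s, %s)",
                (job_id, metric, value),
            )
    conn.commit()

    subset = {k: v for k, v in aggregates.items() if k in USER_SUBSET}
    with open(f"{job_id}.json", "w") as fh:
        json.dump(subset, fh, indent=2)
</syntaxhighlight>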


Aggregated metrics:

* Bandwidth: Total memory bandwidth on a socket basis. The two memory controllers of a socket are added together. The data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for users.
* L3 bandwidth: Total L3 cache bandwidth on a socket basis. The data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for users.
* L3 miss rate: The percentage of L3 cache accesses that result in a miss. A miss is when data is not in the cache when accessed. A lower miss rate is better. The data is saved in the TimescaleDB and in the JSON for users.
* Flops: Number of floating point operations per second. The group does not differentiate between single and double precision rates. The data is saved in MFlops/s in the TimescaleDB and in GFlops/s in the JSON for users.
* Instructions per cycle (IPC): A measure of the efficiency of the CPU. It represents the average number of instructions executed per clock cycle. A higher IPC means more efficient execution.
* (Cache) Miss rate: The percentage of cache accesses that result in a miss. A cache miss occurs when the CPU looks for data in the cache and it isn't there. The data cache miss rate gives a measure of how often cache lines had to be fetched from higher levels of the memory hierarchy. A lower miss rate is better, as it means data is retrieved from the cache more often.
* (Cache) Miss ratio: The data cache miss ratio tells you how many of your memory references required a cache line to be loaded from a higher level. It is similar to the cache miss rate but is a ratio rather than a percentage: the number of cache misses divided by the total number of cache accesses. While the data cache miss rate might be dictated by your algorithm, you should try to keep the data cache miss ratio as low as possible by increasing your cache reuse.
* Energy sum: The total amount of energy consumed by each node, measured in W. It is calculated by first adding the RAPL counters of both sockets and then adding a constant of 220 W for the energy consumption of the other parts of the node.
* Mem free: The amount of free memory available on the node. The data is saved in Bytes in the TimescaleDB and in GBytes in the JSON for users.
* Mem dirty: The amount of memory per node that has been modified but not yet written back to disk. The data is saved in Bytes in the TimescaleDB and in GBytes in the JSON for users.
* CPU usage system: The percentage of CPU time used by system tasks, on a node basis.
* CPU usage user: The percentage of CPU time used by user tasks, on a node basis.
* Rcv data: The amount of data received over the network per node. The data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for users. The data is calculated following the InfiniBand counter metric scheme.
* Xmit data: The amount of data transmitted over the network per node. The data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for users. The data is calculated following the InfiniBand counter metric scheme.
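As an illustration of how one of these node-level values is formed, a minimal sketch of the energy-sum metric described above (the per-socket readings are hypothetical example values):

<syntaxhighlight lang="python">
NODE_BASELINE_W = 220  # constant for the remaining parts of the node, as stated above


def node_energy_sum(rapl_socket0_w, rapl_socket1_w):
    """Per-node energy-sum metric: both sockets' RAPL readings plus a
    220 W constant for the rest of the node (sketch, per timebucket)."""
    return rapl_socket0_w + rapl_socket1_w + NODE_BASELINE_W


# Example timebucket: 180 W and 165 W reported for the two sockets
print(node_energy_sum(180.0, 165.0))  # -> 565.0
</syntaxhighlight>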


Long term aggregations for end users

The long term job-specific aggregations for end users consist of nine metrics from the domains of energy consumption, performance characterization and I/O.

  1. the total energy consumption, per job.
  2. the achieved memory bandwidth, averaged, per node.
  3. the achieved floating point performance, averaged, per node.
  4. the total amount of data written to the Lustre file systems, per job.
  5. the total amount of data read from the Lustre file systems, per job.
  6. the achieved peak write bandwidth to the Lustre file systems, per job.
  7. the achieved peak read bandwidth to the Lustre file systems, per job.
  8. the total number of metadata operations on the Lustre file systems, per job.
  9. the achieved peak rate of metadata operations on the Lustre file systems, per job.

What the metrics represent and how the data is aggregated:

1. The total energy consumption, per job [in Wh]

 1. we calculate the integral of the power consumption per node over the timeline of the job. We then sum up the respective energy contributions from all compute nodes of the job.
 2. we include the overhead corresponding to the efficiency rating of the power supply units. 
 3. we also factor in static contributions from the admin and storage infrastructure, averaged over all compute nodes, assigned per compute node of the job.
 4. we add a static contribution from the cooling distribution units, averaged over all compute nodes, assigned per compute node of the job.
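A condensed sketch of these four steps, assuming per-node power samples in W taken at a fixed interval and hypothetical values for the PSU efficiency and the static per-node infrastructure and cooling shares:

<syntaxhighlight lang="python">
PSU_EFFICIENCY = 0.95      # hypothetical efficiency rating of the power supply units
INFRA_W_PER_NODE = 30.0    # hypothetical static admin/storage share per compute node
COOLING_W_PER_NODE = 20.0  # hypothetical static cooling-distribution share per compute node


def job_energy_wh(power_per_node, dt_s):
    """Total job energy in Wh from per-node power samples taken every dt_s seconds."""
    total_wh = 0.0
    for samples in power_per_node.values():         # one time series per compute node
        # 1. integrate node power over the job timeline (rectangle rule as a sketch)
        node_wh = sum(samples) * dt_s / 3600.0
        # 2. overhead corresponding to the efficiency of the power supply units
        node_wh /= PSU_EFFICIENCY
        # 3. + 4. static infrastructure and cooling contributions, assigned per node
        node_wh += (INFRA_W_PER_NODE + COOLING_W_PER_NODE) * len(samples) * dt_s / 3600.0
        total_wh += node_wh                          # sum over all nodes of the job
    return total_wh


# two nodes, samples every 60 s for one hour at roughly 400 W each
print(job_energy_wh({"n1": [400.0] * 60, "n2": [410.0] * 60}, 60))
</syntaxhighlight>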

2. The achieved memory bandwidth, averaged, per node [in GByte/s]:

 1. this is the rate at which data can be read from or stored to the main memory.
 2. to calculate the achieved memory bandwidth, we calculate the mean of the memory bandwidth per node as reported by Likwid [https://github.com/RRZE-HPC/likwid] over the timeline of the job. We then average over all compute nodes of the job.

3. The achieved floating point performance, averaged, per node [in GFlop/s]:

 1. this is the number of floating point operations per second. While this metric does not discriminate between single or double precision floating point operations, it does take into account the SIMD width [https://en.wikipedia.org/wiki/Single_instruction,_multiple_data] of the floating point instructions.
 2. to calculate the amount of floating point operations per second, we calculate the mean of the amount of floating point operations per second as reported by Likwid over the timeline of the job. We then average over all compute nodes of the job.
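Metrics 2 and 3 use the same two-stage averaging (mean over the job timeline per node, then average over all compute nodes); a minimal sketch with hypothetical sample data:

<syntaxhighlight lang="python">
from statistics import mean


def time_mean_then_node_average(samples_per_node):
    """Mean over the job timeline per node, then averaged over all compute
    nodes, as used for the memory bandwidth and floating point metrics."""
    per_node_means = [mean(samples) for samples in samples_per_node.values()]
    return mean(per_node_means)


# e.g. memory bandwidth samples in GByte/s for two nodes
print(time_mean_then_node_average({"n1": [150.0, 160.0, 155.0], "n2": [140.0, 145.0, 150.0]}))
</syntaxhighlight>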

4. The total amount of data written to the Lustre file systems, per job [in GByte]:

 1. the amount of data written to the workspaces of the Lustre storage within the job.

5. The total amount of data read from the Lustre file systems, per job [in GByte]:

 1. the amount of data read from the workspaces of the Lustre storage within the job.

6. The achieved peak write bandwidth to the Lustre file systems, per job [in GByte/s]:

 1. this is the peak rate at which the job writes to the workspaces of the Lustre storage, over the lifetime of the job.

7. The achieved peak read bandwidth to the Lustre file systems, per job [in GByte/s]:

 1. this is the peak rate at which the job reads from the workspaces of the Lustre storage, over the lifetime of the job.

8. The total number of metadata operations on the Lustre file systems, per job [in metadata ops]:

 1. this is the number of metadata operations [status, open, close, rename, unlink, etc.] on the Lustre storage which can be assigned to the job.

9. The achieved peak rate of metadata operations on the Lustre storage, per job [in metadata ops/s]:

 1. this is the peak rate of metadata operations on the Lustre storage which can be assigned to the job, over the lifetime of the job.
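The page does not describe how the Lustre totals and peaks (metrics 4 to 9 above) are obtained from the underlying counters; the following is only a plausible sketch, assuming cumulative per-job byte and metadata-operation counters sampled over the job lifetime:

<syntaxhighlight lang="python">
def lustre_totals_and_peaks(timestamps, write_bytes, read_bytes, md_ops):
    """Derive job totals and peak rates from cumulative Lustre counters
    (hypothetical method: totals as last minus first counter value, peaks
    as the maximum per-interval rate over the job lifetime)."""
    def rates(counter):
        return [
            (counter[i] - counter[i - 1]) / (timestamps[i] - timestamps[i - 1])
            for i in range(1, len(counter))
        ]

    gbyte = 1e9  # user-facing values are reported in GByte and GByte/s
    return {
        "write_total_gbyte": (write_bytes[-1] - write_bytes[0]) / gbyte,  # metric 4
        "read_total_gbyte": (read_bytes[-1] - read_bytes[0]) / gbyte,     # metric 5
        "write_peak_gbyte_s": max(rates(write_bytes)) / gbyte,            # metric 6
        "read_peak_gbyte_s": max(rates(read_bytes)) / gbyte,              # metric 7
        "md_ops_total": md_ops[-1] - md_ops[0],                           # metric 8
        "md_ops_peak_per_s": max(rates(md_ops)),                          # metric 9
    }


# three samples, 60 s apart (hypothetical counter values)
print(lustre_totals_and_peaks(
    [0, 60, 120],
    [0, 2e9, 5e9],
    [0, 1e9, 1.5e9],
    [0, 1200, 1800],
))
</syntaxhighlight>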