- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

Monitoring

From HLRS Platforms
Revision as of 12:41, 16 April 2024 by Hpcchsim (talk | contribs) (→‎Aggregated metrics:)

Long term aggregations on a job basis

How is data aggregated (default):

  1. Collect the data over a time bucket and, depending on the metric, compute a derived metric from it. The formula for each metric follows the LIKWID performance groups for zen2: https://github.com/RRZE-HPC/likwid/tree/master/groups/zen2
  2. Calculate the median, minimum and maximum for each node, or for each CPU where relevant.
  3. Calculate the average and standard deviation of the medians, the 10th percentile of the minima and the 90th percentile of the maxima.
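
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the HLRS implementation: the function names, the result keys, the use of the population standard deviation and the linear-interpolation percentile are all choices made for this example.

```python
import statistics


def percentile(values, q):
    """Linear-interpolation percentile (q in 0..100) of a non-empty list.

    Assumption: the interpolation method is chosen for this sketch only.
    """
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def aggregate_job(samples_per_node):
    """Aggregate one metric for one job.

    samples_per_node maps a (hypothetical) node name to the list of
    per-time-bucket values of that node.
    """
    # Step 2: median, minimum and maximum per node.
    medians = [statistics.median(v) for v in samples_per_node.values()]
    minima = [min(v) for v in samples_per_node.values()]
    maxima = [max(v) for v in samples_per_node.values()]
    # Step 3: aggregate the per-node statistics across the job.
    return {
        "median_avg": statistics.fmean(medians),
        "median_std": statistics.pstdev(medians),  # assumption: population std
        "min_p10": percentile(minima, 10),
        "max_p90": percentile(maxima, 90),
    }


job = {
    "node1": [1.0, 2.0, 3.0],
    "node2": [2.0, 4.0, 6.0],
    "node3": [3.0, 6.0, 9.0],
}
result = aggregate_job(job)
```

With the sample data the per-node medians are 2, 4 and 6, so `result["median_avg"]` is 4.0.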

Data aggregation of InfiniBand counter metrics:

If a node has two controllers, their values are first added together. Then, for each pair of consecutive timestamps, the difference in time (in seconds) and the difference in counter value are calculated. Dividing the value difference by the time difference gives the change per second, on which the default aggregations are performed.
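
A minimal sketch of this counter-to-rate conversion follows. The function name and the sample data are hypothetical, and the sketch assumes that all controllers are sampled at the same timestamps; counter wraparound is not handled.

```python
def counter_rates(timestamps, controller_counters):
    """Turn cumulative counters into per-second rates.

    timestamps: sample times in seconds.
    controller_counters: one list of cumulative counter values per
    controller, each aligned with timestamps.
    """
    # Add the controllers of the node together, sample by sample.
    totals = [sum(values) for values in zip(*controller_counters)]
    rates = []
    for i in range(1, len(timestamps)):
        dt = timestamps[i] - timestamps[i - 1]
        if dt <= 0:
            continue  # skip duplicate or out-of-order samples
        rates.append((totals[i] - totals[i - 1]) / dt)
    return rates


timestamps = [0, 10, 20, 40]
controller0 = [0, 1000, 3000, 3000]
controller1 = [0, 500, 500, 4500]
rates = counter_rates(timestamps, [controller0, controller1])
```

The resulting list of per-second values is what the default aggregation (median, minimum, maximum and so on) would then operate on.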

Storing the aggregated data:

After each job, every metric is calculated and aggregated. The data for every metric is then inserted into the TimescaleDB. A predefined subset of metrics is also inserted into the accounting database and made available to the user in the form of a JSON file.
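
As an illustration of the last step, the sketch below converts values stored in Bytes/s into GBytes/s and serializes them as JSON. The layout of the JSON file, the metric and key names, and the factor of 10^9 bytes per GByte are assumptions made for this example; the actual file may differ.

```python
import json

BYTES_PER_GBYTE = 1e9  # assumption: decimal gigabytes


def to_user_json(aggregates_bytes_per_s):
    """Convert aggregated Bytes/s values to GBytes/s and dump them as JSON.

    aggregates_bytes_per_s maps a (hypothetical) metric name to a dict of
    aggregated statistics in Bytes/s.
    """
    converted = {
        metric: {stat: value / BYTES_PER_GBYTE for stat, value in stats.items()}
        for metric, stats in aggregates_bytes_per_s.items()
    }
    return json.dumps(converted, indent=2, sort_keys=True)


stored = {"bandwidth": {"median_avg": 2.5e10, "max_p90": 4.0e10}}
user_json = to_user_json(stored)
```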

Aggregated metrics:

  • Bandwidth: Total memory bandwidth per socket. The values of the two memory controllers of a socket are added together. The data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for users.
  • L3 bandwidth: Total L3 cache bandwidth per socket. The data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for users.
  • L3 miss rate: The percentage of L3 cache accesses that result in a miss. A miss occurs when the requested data is not in the cache. A lower miss rate is better.
  • Flops: Number of floating point operations per second. The group does not differentiate between single and double precision. The data is saved in MFlops/s in the TimescaleDB and in GFlops/s in the JSON for users.
  • Instructions per cycle (IPC): This is a measure of the efficiency of the CPU. It represents the average number of instructions executed for each clock cycle. A higher IPC means a more efficient execution.
  • (Cache) Miss rate: This is the percentage of cache accesses that result in a miss. A cache miss occurs when the CPU looks for data in the cache and it is not there. The data cache miss rate measures how often cache lines had to be fetched from higher levels of the memory hierarchy. A lower miss rate is better, as it means data is retrieved from the cache more often.
  • (Cache) Miss ratio: The data cache miss ratio tells you how many of your memory references required a cache line to be loaded from a higher level. It is similar to the cache miss rate but is a ratio rather than a percentage. It is calculated as the number of cache misses divided by the total number of cache accesses. While the data cache miss rate may be determined by your algorithm, you should try to get the data cache miss ratio as low as possible by increasing cache reuse.
  • Energy sum: This is the total power drawn by each node, measured in W. It is calculated by adding the RAPL counters of both sockets together and then adding a constant of 220 W for the power consumption of the other parts of the node.
  • Mem free: This is the amount of free memory available in the node. The data is saved in Bytes in the TimescaleDB and in GBytes in the JSON for Users.
  • Mem dirty: This is the amount of memory that has been modified but not yet written back to disk per node. The data is saved in Bytes in the TimescaleDB and in GBytes in the JSON for Users.
  • CPU usage system: This is the percentage of CPU time used by system tasks, per node.
  • CPU usage user: This is the percentage of CPU time used by user tasks, per node.
  • Rcv data: This is the amount of data received over the network per node. The data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for users. It is calculated using the InfiniBand counter scheme described above.
  • Xmit data: This is the amount of data transmitted over the network per node. The data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for users. It is calculated using the InfiniBand counter scheme described above.
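
Some of the derived metrics in the list above amount to simple arithmetic, sketched below. The function and parameter names are hypothetical and the inputs are assumed to be already collected per time bucket; only the 220 W constant and the formulas themselves are taken from the descriptions above.

```python
OTHER_COMPONENTS_W = 220.0  # constant for the other parts of the node


def node_power_w(rapl_socket0_w, rapl_socket1_w):
    """Energy sum: RAPL values of both sockets plus the node constant, in W."""
    return rapl_socket0_w + rapl_socket1_w + OTHER_COMPONENTS_W


def ipc(instructions, cycles):
    """Instructions per cycle: average instructions executed per clock cycle."""
    return instructions / cycles


def miss_ratio(misses, accesses):
    """Cache miss ratio: cache misses divided by total cache accesses."""
    return misses / accesses


def miss_percentage(misses, accesses):
    """The same quantity expressed as a percentage."""
    return 100.0 * misses / accesses


power = node_power_w(180.0, 175.0)
cycles_efficiency = ipc(3_000_000, 2_000_000)
ratio = miss_ratio(50, 1000)
```

For example, two sockets drawing 180 W and 175 W give a node value of 575 W.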