- Infos im HLRS Wiki sind nicht rechtsverbindlich und ohne Gewähr -
- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

Monitoring: Difference between revisions

From HLRS Platforms

Revision as of 12:40, 16 April 2024

Long-term aggregations on a job basis

How is data aggregated (default):

  1. Collect data over a time bucket and, depending on the metric, perform a calculation to derive the metric value. The formulas for each metric follow those from: https://github.com/RRZE-HPC/likwid/tree/master/groups/zen2
  2. Calculate the median, minimum, and maximum for each node or CPU, as relevant.
  3. Calculate the average and standard deviation of the medians, the 10th percentile of the minima, and the 90th percentile of the maxima.
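The three steps above can be sketched as follows. This is a minimal illustration, not the actual aggregation code; the function names and the nearest-rank percentile method are assumptions.

```python
from statistics import mean, median, stdev

def percentile(values, p):
    """Nearest-rank percentile of a list of numbers (0 < p <= 100)."""
    ordered = sorted(values)
    k = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[k]

def aggregate(per_node_samples):
    """Default aggregation sketch.

    per_node_samples maps a node name to its list of samples for one
    metric within a job. Step 2: median/min/max per node; step 3:
    statistics across those per-node values.
    """
    medians = [median(s) for s in per_node_samples.values()]
    mins = [min(s) for s in per_node_samples.values()]
    maxs = [max(s) for s in per_node_samples.values()]
    return {
        "avg": mean(medians),
        "std": stdev(medians) if len(medians) > 1 else 0.0,
        "p10_min": percentile(mins, 10),
        "p90_max": percentile(maxs, 90),
    }
```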

Data aggregation of InfiniBand counter metrics:

If there are two controllers per node, their values are first added together. Then, for each pair of consecutive timestamps, the difference in seconds and the difference in counter values are calculated. Dividing the value difference by the time difference yields the change per second, on which the default aggregations are then performed.

Storing the aggregated data:

After each job, every metric is calculated and aggregated. The data of every metric is then inserted into the TimescaleDB. A predefined subset of metrics is also inserted into the accounting database and made available to the user in the form of a JSON file.
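The unit conversion between the TimescaleDB and the user-facing JSON (Bytes to GBytes, MFlops/s to GFlops/s, as described for the metrics below) could be sketched as follows. The dictionary keys and the use of decimal prefixes (1 GB = 10^9 B) are assumptions; the wiki does not state whether decimal or binary prefixes are meant.

```python
import json

# Per-metric divisor from TimescaleDB units to user-facing JSON units.
# Decimal prefixes are assumed; metric names are illustrative.
TO_JSON_UNITS = {
    "bandwidth": 1e9,   # Bytes/s  -> GBytes/s
    "mem_free": 1e9,    # Bytes    -> GBytes
    "flops": 1e3,       # MFlops/s -> GFlops/s
}

def to_user_json(db_values):
    """Convert a dict of aggregated DB values into the user JSON string."""
    converted = {
        metric: value / TO_JSON_UNITS.get(metric, 1.0)
        for metric, value in db_values.items()
    }
    return json.dumps(converted)
```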

Aggregated metrics:

  • Bandwidth: Total memory bandwidth on a socket basis. The two memory controllers of a socket are added together. The data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for users.
  • L3 bandwidth: Total L3 cache bandwidth on a socket basis. The data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for users.
  • L3 miss rate: The percent rate of L3 misses to L3 cache accesses. A miss is when data is not in the cache when accessed. A lower miss rate is better. The data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for users.
  • Flops: Number of floating-point operations per second. The group does not differentiate between single- and double-precision rates. The data is saved in MFlops/s in the TimescaleDB and in GFlops/s in the JSON for users.
  • Instructions per cycle (IPC): A measure of the efficiency of the CPU. It represents the average number of instructions executed per clock cycle. A higher IPC means more efficient execution.
  • (Cache) Miss rate: The percentage of cache accesses that result in a miss. A cache miss occurs when the CPU looks for data in the cache and it isn't there. The data cache miss rate gives a measure of how often it was necessary to fetch cache lines from higher levels of the memory hierarchy. A lower miss rate is better, as it means data is successfully retrieved from the cache more often.
  • (Cache) Miss ratio: The data cache miss ratio tells you how many of your memory references required a cache line to be loaded from a higher level. It is similar to the cache miss rate but is a ratio rather than a percentage. It is calculated as the number of cache misses divided by the total number of cache accesses. While the data cache miss rate may be dictated by your algorithm, you should try to keep the data cache miss ratio as low as possible by increasing your cache reuse.
  • Energy sum: The total amount of energy consumed by each node, measured in W. It is calculated by first adding the RAPL counters of both sockets together and then adding a constant of 220 W for the energy consumption of other parts of the node.
  • Mem free: The amount of free memory available on the node. The data is saved in Bytes in the TimescaleDB and in GBytes in the JSON for users.
  • Mem dirty: The amount of memory per node that has been modified but not yet written back to disk. The data is saved in Bytes in the TimescaleDB and in GBytes in the JSON for users.
  • CPU usage system: The percentage of CPU time used by system tasks, on a node basis.
  • CPU usage user: The percentage of CPU time used by user tasks, on a node basis.
  • Rcv data: The amount of data received over the network per node. The data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for users. The data is calculated following the InfiniBand counter metric scheme.
  • Xmit data: The amount of data transmitted over the network per node. The data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for users. The data is calculated following the InfiniBand counter metric scheme.
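To make the distinction between miss rate and miss ratio concrete, here is a minimal sketch; the function names are illustrative, not part of the monitoring stack:

```python
def miss_rate_percent(misses, accesses):
    """Cache miss rate: the percentage of accesses that miss."""
    return 100.0 * misses / accesses

def miss_ratio(misses, accesses):
    """Cache miss ratio: the same quantity expressed as a fraction."""
    return misses / accesses
```

For example, 25 misses out of 1000 accesses give a miss rate of 2.5 % and a miss ratio of 0.025.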