- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

Monitoring

Latest revision as of 12:41, 16 April 2024

Long term aggregations on a job basis

How data is aggregated (default):

  1. Collect data over a timebucket and, depending on the metric, perform a calculation to derive a new metric. The formulas for each metric follow those from: https://github.com/RRZE-HPC/likwid/tree/master/groups/zen2
  2. Calculate the median, min, and max of each node (or CPU, where relevant).
  3. Calculate the average and standard deviation of the medians, the 10th percentile of the mins, and the 90th percentile of the maxes.
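Steps 2 and 3 above can be sketched in Python. This is a minimal illustration with made-up sample values; the node names and numbers are hypothetical, and the `percentile` helper (linear interpolation) only stands in for whatever the actual pipeline uses. Step 1, the LIKWID-style per-metric formula, is assumed to have run already.

```python
import statistics

def percentile(data, p):
    """Percentile (0-100) via linear interpolation between sorted values."""
    s = sorted(data)
    k = (len(s) - 1) * p / 100
    f = int(k)
    c = min(f + 1, len(s) - 1)
    return s[f] + (k - f) * (s[c] - s[f])

# Hypothetical per-node metric samples collected over one timebucket.
samples = {
    "node01": [1.2, 1.4, 1.3, 1.5],
    "node02": [0.9, 1.1, 1.0, 1.2],
    "node03": [2.0, 2.2, 2.1, 2.3],
}

# Step 2: median, min, max of each node.
medians = [statistics.median(v) for v in samples.values()]
mins = [min(v) for v in samples.values()]
maxs = [max(v) for v in samples.values()]

# Step 3: average and standard deviation of the medians,
# 10th percentile of the mins, 90th percentile of the maxes.
result = {
    "avg_median": statistics.mean(medians),
    "std_median": statistics.pstdev(medians),
    "p10_min": percentile(mins, 10),
    "p90_max": percentile(maxs, 90),
}
```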

Data aggregation of InfiniBand counter metrics:

If a node has two controllers, their values are first added together. Then, for each pair of consecutive timestamps, the difference in counter value and the difference in time (in seconds) are calculated. Dividing the value difference by the time difference gives the change per second, on which the default aggregations are performed.
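A minimal sketch of this counter scheme, with hypothetical cumulative counter readings (value, Unix timestamp) for the two controllers of one node; all numbers are illustrative only:

```python
# Cumulative (value, timestamp) readings for both controllers of one node.
ctrl_a = [(1000, 0), (5000, 10), (12000, 20)]
ctrl_b = [(2000, 0), (4000, 10), (9000, 20)]

# First add the two controllers together at each timestamp.
summed = [(va + vb, ta) for (va, ta), (vb, _) in zip(ctrl_a, ctrl_b)]

# Then divide the value difference by the time difference for each
# pair of consecutive timestamps to get the change per second.
rates = [(v1 - v0) / (t1 - t0) for (v0, t0), (v1, t1) in zip(summed, summed[1:])]
# These per-second rates are what the default aggregations operate on.
```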

Storing the aggregated data:

After each job, every metric is calculated and aggregated, and the data of every metric is inserted into the TimescaleDB. A predefined subset of metrics is also inserted into the accounting database and made available to the user in the form of a JSON file.
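The unit conversion for the user-facing JSON (Bytes/s in the TimescaleDB, GBytes/s in the JSON, as described for the metrics below) can be sketched as follows. The field names are hypothetical, and decimal GBytes (1e9 bytes) is an assumption of this sketch:

```python
import json

# Hypothetical aggregated values for one job as stored in the
# TimescaleDB, in Bytes/s.
db_row = {"bandwidth": 123_456_789_000.0, "xmit_data": 987_654_321.0}

# Subset exposed to users, converted to GBytes/s for the JSON file
# (assuming decimal GBytes, i.e. 1e9 bytes).
user_json = {k: v / 1e9 for k, v in db_row.items()}
serialized = json.dumps(user_json)
```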

Aggregated metrics:

  • Bandwidth: Total memory bandwidth on a socket basis. The two memory controllers of a socket are added together. The data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for Users.
  • L3 bandwidth: Total L3 Cache bandwidth on socket basis. The data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for Users.
  • L3 miss rate: The percentage of L3 cache accesses that result in a miss, i.e. the data is not in the cache when accessed. A lower miss rate is better. The data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for Users.
  • Flops: Number of floating point operations per second. The group does not differentiate between single- and double-precision rates. The data is saved in MFlops/s in the TimescaleDB and in GFlops/s in the JSON for Users.
  • Instructions per cycle (IPC): This is a measure of the efficiency of the CPU. It represents the average number of instructions executed for each clock cycle. A higher IPC means a more efficient execution.
  • (Cache) Miss rate: This is the percentage of cache accesses that result in a miss. A cache miss occurs when the CPU looks for data in the cache and it isn’t there. The data cache miss rate gives a measure of how often it was necessary to get cache lines from higher levels of the memory hierarchy. A lower miss rate is better, as it means data is being successfully retrieved from the cache more often.
  • (Cache) Miss ratio: The data cache miss ratio tells you how many of your memory references required a cache line to be loaded from a higher level. It is similar to the cache miss rate but is a fraction rather than a percentage. It is calculated as the number of cache misses divided by the total number of cache accesses. While the data cache miss rate might be determined by your algorithm, you should try to keep the data cache miss ratio as low as possible by increasing your cache reuse.
  • Energy sum: This is the total power consumption of each node, measured in W. It is calculated by first adding the RAPL counters of both sockets together and then adding a constant of 220 W for the power consumption of other parts of the node.
  • Mem free: This is the amount of free memory available in the node. The data is saved in Bytes in the TimescaleDB and in GBytes in the JSON for Users.
  • Mem dirty: This is the amount of memory that has been modified but not yet written back to disk per node. The data is saved in Bytes in the TimescaleDB and in GBytes in the JSON for Users.
  • CPU usage system: The percentage of CPU time used by system tasks, on a node basis.
  • CPU usage user: The percentage of CPU time used by user tasks, on a node basis.
  • Rcv data: The amount of data received over the network per node. The data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for Users. The data is calculated according to the InfiniBand counter metric scheme above.
  • Xmit data: The amount of data transmitted over the network per node. The data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for Users. The data is calculated according to the InfiniBand counter metric scheme above.
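Two of the definitions above, the miss rate vs. miss ratio distinction and the energy sum, can be illustrated with made-up numbers (the 220 W constant comes from the Energy sum description; all other values here are hypothetical):

```python
# (Cache) miss rate vs. miss ratio: same inputs, percentage vs. fraction.
misses, accesses = 200, 10_000
miss_ratio = misses / accesses        # fraction of accesses that miss
miss_rate = 100 * misses / accesses   # the same quantity as a percentage

# Energy sum: RAPL counters of both sockets, plus a constant 220 W
# for the other parts of the node. Per-socket values are made up.
rapl_socket0, rapl_socket1 = 95.0, 102.5
energy_sum = rapl_socket0 + rapl_socket1 + 220.0
```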