Monitoring: Difference between revisions

Revision as of 12:40, 16 April 2024

Long-term aggregations on a per-job basis

How is data aggregated (default):

Data is collected over a time bucket and, depending on the metric, a calculation is performed to derive a new metric. The formula for each metric follows the formulas from: https://github.com/RRZE-HPC/likwid/tree/master/groups/zen2
The median, minimum and maximum are calculated for each node or CPU, where relevant.
From these, the average and standard deviation of the medians, the 10th percentile of the minima and the 90th percentile of the maxima are calculated.
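
The default aggregation steps above can be sketched as follows. This is only a minimal illustration, not the production code: the function names, the result keys, the linear-interpolation percentile and the use of the population standard deviation are all assumptions that are not stated on this page.

```python
from statistics import mean, median, pstdev


def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() needs at least one value")
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def aggregate_job(samples_per_node):
    """Aggregate one metric of one job.

    samples_per_node maps a node (or CPU) name to the list of values
    collected for it, one value per time bucket.
    """
    # Step 1: median, min and max per node (or CPU).
    medians = [median(v) for v in samples_per_node.values()]
    minima = [min(v) for v in samples_per_node.values()]
    maxima = [max(v) for v in samples_per_node.values()]
    # Step 2: summarize the per-node values over the whole job.
    return {
        "median_avg": mean(medians),
        "median_std": pstdev(medians),  # population std is an assumption
        "min_p10": percentile(minima, 10),
        "max_p90": percentile(maxima, 90),
    }
```

A call could look like aggregate_job({"node1": [1, 2, 3], "node2": [3, 4, 5]}), where the node names and values are made up.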

Data aggregation of InfiniBand counter metrics:

If a node has two controllers, their values are first added together. Then, for each pair of consecutive timestamps, the difference in time (in seconds) and the difference in counter value are calculated. Dividing the value difference by the time difference gives the change per second, on which the default aggregations are performed.
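
The rate calculation described above can be sketched like this. It is a simplified illustration under assumptions: the function and parameter names are hypothetical, timestamps are assumed to be strictly increasing, and counter resets or wrap-arounds are not handled.

```python
def counter_rates(timestamps, controller_counters):
    """Turn cumulative counter readings into per-second rates.

    timestamps: sample times in seconds, strictly increasing.
    controller_counters: one list of cumulative readings per controller,
    each with the same length as timestamps.
    """
    # Add the controllers of a node together, sample by sample.
    totals = [sum(values) for values in zip(*controller_counters)]
    rates = []
    for i in range(1, len(timestamps)):
        dt = timestamps[i] - timestamps[i - 1]  # time difference in seconds
        dv = totals[i] - totals[i - 1]          # counter value difference
        rates.append(dv / dt)                   # change per second
    return rates
```

The returned list has one entry fewer than the input, since each rate is formed from two consecutive samples. These rates are what the default aggregations would then be applied to.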

Storing the aggregated data:

After each job, every metric is calculated and aggregated. The data of every metric is then inserted into the TimescaleDB. A predefined subset of metrics is also inserted into the accounting database and made available to the user in the form of a JSON file.
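
How the predefined subset might be turned into the JSON for users can be sketched as below. The metric names in USER_METRICS, the function name and the JSON layout are purely hypothetical, since the actual subset and file format are not documented on this page. The database inserts are left out on purpose.

```python
import json

# Hypothetical subset; the real list of exported metrics is not given here.
USER_METRICS = ("bandwidth", "flops", "ipc")


def build_user_json(job_id, aggregated):
    """Serialize the predefined subset of a job's aggregated metrics.

    aggregated maps a metric name to its aggregation results.
    """
    subset = {name: aggregated[name] for name in USER_METRICS if name in aggregated}
    return json.dumps({"job_id": job_id, "metrics": subset}, indent=2, sort_keys=True)
```

Metrics outside the subset are simply dropped from the user file in this sketch. They would still be stored with all other metrics in the TimescaleDB.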

Aggregated metrics:

Bandwidth: Total memory bandwidth on a per-socket basis. The values of the two memory controllers of a socket are added together. The data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for users.

L3 bandwidth: Total L3 cache bandwidth on a per-socket basis. The data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for users.
L3 miss rate: The percentage of L3 cache accesses that result in a miss. A miss occurs when data is not in the cache when it is accessed. A lower miss rate is better. The data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for users.

Flops: Number of floating-point operations per second. The group does not differentiate between single-precision and double-precision rates. The data is saved in MFlops/s in the TimescaleDB and in GFlops/s in the JSON for users.

Instructions per cycle (IPC): This is a measure of the efficiency of the CPU. It represents the average number of instructions executed for each clock cycle. A higher IPC means a more efficient execution.

(Cache) Miss rate: This is the percentage of cache accesses that result in a miss. A cache miss occurs when the CPU looks for data in the cache and it isn’t there. The data cache miss rate measures how often it was necessary to get cache lines from higher levels of the memory hierarchy. A lower miss rate is better, as it means data is being successfully retrieved from the cache more often.

(Cache) Miss ratio: The data cache miss ratio tells you how many of your memory references required a cache line to be loaded from a higher level. It is similar to the cache miss rate but is a ratio rather than a percentage. It’s calculated as the number of cache misses divided by the total number of cache accesses. While the data cache miss rate might be determined by your algorithm, you should try to get the data cache miss ratio as low as possible by increasing your cache reuse.

Energy sum: This is the total energy consumption of each node, expressed as power in W. It is calculated by first adding the RAPL counters of both sockets together and then adding a constant of 220 W for the energy consumption of the other parts of the node.

Mem free: This is the amount of free memory available in the node. The data is saved in Bytes in the TimescaleDB and in GBytes in the JSON for users.
Mem dirty: This is the amount of memory per node that has been modified but not yet written back to disk. The data is saved in Bytes in the TimescaleDB and in GBytes in the JSON for users.

CPU usage system: This is the percentage of CPU time used by system tasks, on a per-node basis.
CPU usage user: This is the percentage of CPU time used by user tasks, on a per-node basis.
Rcv data: This is the amount of data received over the network per node. The data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for users. The data is calculated according to the InfiniBand counter metric scheme.

Xmit data: This is the amount of data transmitted over the network per node. The data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for users. The data is calculated according to the InfiniBand counter metric scheme.
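
Some of the arithmetic behind the metrics above can be illustrated with a short sketch. The function names are hypothetical, the socket inputs are assumed to be already converted to W, and the decimal factor of 1e9 for GBytes is an assumption (a binary factor of 2**30 would also be conceivable).

```python
OTHER_COMPONENTS_W = 220.0  # constant from the text for the rest of the node


def energy_sum_watts(socket0_w, socket1_w):
    """Energy sum: RAPL values of both sockets plus the 220 W constant."""
    return socket0_w + socket1_w + OTHER_COMPONENTS_W


def miss_ratio(misses, accesses):
    """Cache miss ratio: cache misses divided by cache accesses."""
    return misses / accesses


def to_giga(value, factor=1e9):
    """Convert Bytes or Bytes/s (TimescaleDB) to GBytes or GBytes/s (JSON).

    The decimal factor is an assumption.
    """
    return value / factor


def mflops_to_gflops(value):
    """Convert MFlops/s (TimescaleDB) to GFlops/s (JSON)."""
    return value / 1000.0
```

For example, two sockets at 95 W and 105 W would give an energy sum of 420 W with this sketch.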


@@ Line 16: / Line 16: @@
 === Aggregated metrics: ===
--Bandwidth: Total memory bandwidth on socket basis. The two memory controller
+* Bandwidth: Total memory bandwidth on socket basis. The two memory controller of a socket are added together. The data is saved in Bytes/s in the TimescaleDB
-of a socket are added together. The data is saved in Bytes/s in the TimescaleDB
 and in GBytes/s in the JSON for Users.
--L3 bandwidth: Total L3 Cache bandwidth on socket basis. The data is saved in
+* L3 bandwidth: Total L3 Cache bandwidth on socket basis. The data is saved in  Bytes/s in the TimescaleDB and in GBytes/s in the JSON for Users.
-Bytes/s in the TimescaleDB and in GBytes/s in the JSON for Users.
+* L3 miss rate: The percent rate of L3 misses to L3 cache accesses. A miss is when data is not in the cache when accessed. A lower miss rate is better. The
--L3 miss rate: The percent rate of L3 misses to L3 cache accesses. A miss is
+data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for Users.
-when data is not in the cache when accessed. A lower miss rate is better. The
+* Flops:  Amount of floating point operations per second. The group does not differentiate between singe and double point precision rate.   The data is saved
-data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for
-Users.
--Flops:  Amount of floating point operations per second. The group does not
-differentiate between singe and double point precision rate.   The data is saved
 in MFlops/s in the TimescaleDB and in GFlops/s in the JSON for Users.
--Instructions per cycle (IPC): This is a measure of the efficiency of the CPU. It
+* Instructions per cycle (IPC): This is a measure of the efficiency of the CPU. It represents the average number of instructions executed for each clock cycle. A
-represents the average number of instructions executed for each clock cycle. A
 higher IPC means a more efficient execution.
--(Cache) Miss rate: This is the percentage of cache accesses that result in a miss.
+* (Cache) Miss rate: This is the percentage of cache accesses that result in a miss. A cache miss occurs when the CPU looks for data in the cache and it isn’t there.
-A cache miss occurs when the CPU looks for data in the cache and it isn’t there.
+The data cache miss rate gives a measure how often it was necessary to get cache lines from higher levels of the memory hierarchy. A lower miss rate is
-The data cache miss rate gives a measure how often it was necessary to get
+better, as it means data is being successfully retrieved from the cache more often.
-cache lines from higher levels of the memory hierarchy. A lower miss rate is
+* (Cache) Miss ratio: The data cache miss ratio tells you how many of your memory references required a cache line to be loaded from a higher level. It is
-better, as it means data is being successfully retrieved from the cache more
+similar to the cache miss rate but is a ratio rather than a percentage.  It’s calculated as the number of cache misses divided by the total number of cache
-often.
+accesses. While the data cache miss rate might be given by your algorithm you should try to get data cache miss ratio as low as possible by increasing your
--(Cache) Miss ratio: The data cache miss ratio tells you how many of your
-memory references required a cache line to be loaded from a higher level. It is
-similar to the cache miss rate but is a ratio rather than a percentage.  It’s
-calculated as the number of cache misses divided by the total number of cache
-accesses. While the data cache miss rate might be given by your algorithm you
-should try to get data cache miss ratio as low as possible by increasing your
 cache reuse.
--Energy sum: This is the total amount of energy consumed by each node. It’s
+* Energy sum: This is the total amount of energy consumed by each node. It’s measured in W. It is calculated by first adding the rapl counters of both sockets
-measured in W. It is calculated by first adding the rapl counters of both sockets
+together and then adding a constant of 220 W for the energy consumption of other parts of the node.
-together and then adding a constant of 220 W for the energy consumption of
+* Mem free: This is the amount of free memory available in the node. The data is saved in Bytes in the TimescaleDB and in GBytes in the JSON for Users.
-other parts of the node.
+* Mem dirty: This is the amount of memory that has been modified but not yet written back to disk per node. The data is saved in Bytes in the TimescaleDB and
--Mem free: This is the amount of free memory available in the node. The data is
-saved in Bytes in the TimescaleDB and in GBytes in the JSON for Users.
--Mem dirty: This is the amount of memory that has been modified but not yet
-written back to disk per node. The data is saved in Bytes in the TimescaleDB and
 in GBytes in the JSON for Users.
--CPU usage system: This is the percentage which is used by system tasks on a
+* CPU usage system: This is the percentage which is used by system tasks on a node basis.
-node basis.
+* CPU usage user: This is the percentage which is used by user tasks on a node basis.
--CPU usage user: This is the percentage which is used by user tasks on a node
+* Rcv data: This is the amount of data received over the network per node. The data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for
-basis.
--Rcv data: This is the amount of data received over the network per node. The
-data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for
 Users. The data is calculated after the infiniband counter metric scheme.
--Xmit data: This is the amount of data transmitted over the network per node.
+* Xmit data: This is the amount of data transmitted over the network per node. The data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for
-The data is saved in Bytes/s in the TimescaleDB and in GBytes/s in the JSON for
 Users. The data is calculated after the infiniband counter metric scheme.