- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

User monitoring

Latest revision as of 10:21, 29 August 2024

Long term aggregations for end users

The long term job-specific aggregations for end users consist of nine metrics from the domains of energy consumption, performance characterization and I/O.

  1. the total energy consumption, per job.
  2. the achieved memory bandwidth, averaged, per node.
  3. the achieved floating point performance, averaged, per node.
  4. the total amount of data written to the Lustre file systems, per job.
  5. the total amount of data read from the Lustre file systems, per job.
  6. the achieved peak write bandwidth to the Lustre file systems, per job.
  7. the achieved peak read bandwidth to the Lustre file systems, per job.
  8. the total number of metadata operations on the Lustre file systems, per job.
  9. the achieved peak rate of metadata operations on the Lustre file systems, per job.

What the metrics represent and how the data is aggregated:

1. The total energy consumption, per job [in Wh]:

 1. we calculate the integral of the power consumption per node over the timeline of the job. We then sum up the respective energy contributions from all compute nodes of the job.
 2. we include the overhead corresponding to the efficiency rating of the power supply units. 
 3. we also factor in static contributions from the admin and storage infrastructure, averaged over all compute nodes, assigned per compute node of the job.
 4. we add a static contribution from the cooling distribution units, averaged over all compute nodes, assigned per compute node of the job.
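As a rough illustration, the four steps above can be sketched in Python. The PSU efficiency and the static per-node overheads below are made-up placeholder values, and the sampling-based trapezoidal integration is an assumption about the mechanism, not the actual HLRS implementation:

```python
# Sketch of the per-job energy aggregation described above.
# All constants are illustrative placeholders, not real HLRS figures.

def job_energy_wh(power_samples_w, sample_interval_s,
                  psu_efficiency=0.95,
                  infra_overhead_wh_per_node=10.0,
                  cooling_overhead_wh_per_node=5.0):
    """power_samples_w: one list of power readings [W] per compute node."""
    total_wh = 0.0
    for node_samples in power_samples_w:
        # Step 1: integrate power over the job timeline (trapezoidal rule),
        # then convert watt-seconds to watt-hours.
        joules = sum((a + b) / 2.0 * sample_interval_s
                     for a, b in zip(node_samples, node_samples[1:]))
        total_wh += joules / 3600.0
    # Step 2: account for power-supply conversion losses.
    total_wh /= psu_efficiency
    # Steps 3 and 4: add static admin/storage and cooling shares per node.
    n_nodes = len(power_samples_w)
    total_wh += n_nodes * (infra_overhead_wh_per_node + cooling_overhead_wh_per_node)
    return total_wh
```

For example, two nodes drawing a constant 100 W for one hour contribute 200 Wh before the PSU and infrastructure corrections are applied.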

2. The achieved memory bandwidth, averaged, per node [in GByte/s]:

 1. this is the rate at which data can be read from or stored to the main memory.
 2. to calculate the achieved memory bandwidth, we calculate the mean of the memory bandwidth per node, as reported by Likwid [https://github.com/RRZE-HPC/likwid], over the timeline of the job. We then average over all compute nodes of the job.
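The mean-over-timeline, then average-over-nodes aggregation can be sketched generically (this same pattern applies to any per-node time series; the sample values are hypothetical, not Likwid output):

```python
# Aggregation pattern described above: mean over the job timeline per
# node, then an average over all compute nodes of the job.

def mean_over_nodes(samples_per_node):
    """samples_per_node: one list of per-timestep readings per node
    (e.g. memory bandwidth in GByte/s)."""
    per_node_means = [sum(s) / len(s) for s in samples_per_node]
    return sum(per_node_means) / len(per_node_means)
```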

3. The achieved floating point performance, averaged, per node [in GFlop/s]:

 1. this is the number of floating point operations executed per second. While this metric does not discriminate between single and double precision floating point operations, it does take into account the SIMD width [https://en.wikipedia.org/wiki/Single_instruction,_multiple_data] of the floating point instructions.
 2. to calculate the achieved floating point performance, we calculate the mean of the floating point rate per node, as reported by Likwid, over the timeline of the job. We then average over all compute nodes of the job.
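To illustrate how the SIMD width enters the count: a vector instruction contributes as many floating point operations as its vector width (e.g. four doubles per 256-bit AVX instruction). The counting model below is a simplified assumption for illustration, not Likwid's internal accounting:

```python
# Illustrative FLOP counting: each SIMD instruction counts as
# simd_width floating point operations, scalar instructions as one.
# This is a simplified model, not Likwid's actual event formula.

def gflops(scalar_ops, vector_ops, simd_width, runtime_s):
    flops = scalar_ops + vector_ops * simd_width
    return flops / runtime_s / 1e9
```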

4. The total amount of data written to the Lustre file systems, per job [in GByte]:

 1. the amount of data written to the workspaces of the Lustre storage within the job.

5. The total amount of data read from the Lustre file systems, per job [in GByte]:

 1. the amount of data read from the workspaces of the Lustre storage within the job.
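I/O totals like these are typically derived from cumulative, monotonically increasing byte counters sampled over the job. Assuming such counters (an assumption about the mechanism, not a description of the HLRS pipeline), the total is simply the difference between the last and first sample, summed over all nodes:

```python
# Sketch: per-job I/O total from cumulative byte counters,
# assuming monotonically increasing counters per node.

def total_gbytes(counter_samples_per_node):
    """counter_samples_per_node: per node, a list of cumulative byte
    counters sampled over the lifetime of the job."""
    total_bytes = sum(s[-1] - s[0] for s in counter_samples_per_node)
    return total_bytes / 1e9
```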

6. The achieved peak write bandwidth to the Lustre file systems, per job [in GByte/s]:

 1. this is the peak rate at which the job writes to the workspaces of the Lustre storage, over the lifetime of the job.

7. The achieved peak read bandwidth to the Lustre file systems, per job [in GByte/s]:

 1. this is the peak rate at which the job reads from the workspaces of the Lustre storage, over the lifetime of the job.
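Under the same assumption of sampled cumulative byte counters, a peak bandwidth is the largest counter increment between two consecutive samples, divided by the sampling interval (the interval length is a hypothetical parameter here):

```python
# Sketch: peak transfer rate over the job from cumulative byte
# counters, assuming a fixed sampling interval.

def peak_gbytes_per_s(counter_samples, sample_interval_s):
    deltas = [b - a for a, b in zip(counter_samples, counter_samples[1:])]
    return max(deltas) / sample_interval_s / 1e9
```

Note that the resolution of the sampling interval bounds how sharp a burst this metric can resolve.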

8. The total number of metadata operations on the Lustre file systems, per job [in metadata ops]:

 1. this is the number of metadata operations (status, open, close, rename, unlink, etc.) on the Lustre storage which can be assigned to the job.

9. The achieved peak rate of metadata operations on the Lustre storage, per job [in metadata ops/s]:

 1. this is the peak rate of metadata operations on the Lustre storage which can be assigned to the job, over the lifetime of the job.
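The metadata total and peak rate combine the two counter-based aggregations sketched above. Assuming a cumulative per-job operation counter and a fixed sampling interval (both assumptions for illustration):

```python
# Sketch: total and peak rate of metadata operations from a
# cumulative per-job operation counter, sampled at fixed intervals.

def metadata_stats(op_counter_samples, sample_interval_s):
    """op_counter_samples: cumulative count of metadata operations
    (open, close, rename, ...) attributable to the job.
    Returns (total ops, peak ops/s)."""
    total = op_counter_samples[-1] - op_counter_samples[0]
    deltas = [b - a for a, b in zip(op_counter_samples,
                                    op_counter_samples[1:])]
    peak = max(deltas) / sample_interval_s
    return total, peak
```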