- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

Barreleye

From HLRS Platforms
Revision as of 15:30, 15 November 2023 by Hpcralf

Lustre Server CPU Usage

For each CPU usage state, the values of all servers are reported in a separate table, i.e. measurement. The reported usage states are:

idle

Is reported in aggregation.cpu-average.cpu.idle. When there is really nothing the kernel can do, it just has to waste away this slice of time. Technically, when the runnable queue is empty and there are no I/O operations going on, the CPU usage is marked as idle.

system

Is reported in aggregation.cpu-average.cpu.system. This means the CPU is running kernel code. This includes device drivers and kernel modules.

user

Is reported in aggregation.cpu-average.cpu.user. The CPU is running code in user-mode. This includes your application code. Note that if an application tries to read from disk or write to network, it actually goes to sleep while the kernel performs that work, and wakes up the application again.

steal

Is reported in aggregation.cpu-average.cpu.steal. DDN Lustre servers are virtual machines. In a virtualized environment, the hypervisor may “steal” cycles that are meant for your CPUs and give them to another, for various reasons. This time is accounted for as steal.

nice

Is reported in aggregation.cpu-average.cpu.nice. The user code can be executed in “normal” priority, or various degrees of “below normal” priority. You can, for example, run some kind of report generation process at a lower priority and interactive processes at normal priority. Nice is when the CPU is executing a user task having below-normal priority.

wait

Is reported in aggregation.cpu-average.cpu.wait. Sometimes the CPU has only one thing to do: wait for the results of a disk/network read/write. This isn’t as uncommon as you’d think. A file server, for example, would spend nearly all its life waiting for disk reads and network writes to complete. I/O wait is when the CPU is waiting for an I/O operation to complete, and the CPU can’t be used for anything else.

interrupt & softirq

interrupt is reported in aggregation.cpu-average.cpu.interrupt, softirq is reported in aggregation.cpu-average.cpu.softirq. In both cases the kernel is servicing interrupt requests.
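
As an illustration of how such usage states are typically derived, the following sketch computes the share of each state from two snapshots of cumulative per-state tick counters, such as the ones Linux exposes in /proc/stat. The function name and the sample numbers are made up for illustration. How the values in the aggregation.cpu-average.cpu.* tables are actually computed is not specified on this page.

```python
# Usage states as named on this page.
STATES = ("user", "nice", "system", "idle", "wait", "interrupt", "softirq", "steal")


def usage_percent(before, after):
    """Return the percentage of time spent in each state between two snapshots.

    Both arguments map a state name to a cumulative tick counter.
    """
    deltas = {state: after[state] - before[state] for state in STATES}
    total = sum(deltas.values())
    if total == 0:
        # No ticks elapsed between the snapshots.
        return {state: 0.0 for state in STATES}
    return {state: 100.0 * delta / total for state, delta in deltas.items()}


# Invented sample snapshots (not measured data).
before = {"user": 100, "nice": 0, "system": 50, "idle": 800,
          "wait": 40, "interrupt": 5, "softirq": 5, "steal": 0}
after = {"user": 150, "nice": 0, "system": 70, "idle": 900,
         "wait": 60, "interrupt": 5, "softirq": 5, "steal": 10}

shares = usage_percent(before, after)
```

Because every tick is attributed to exactly one state, the shares of all states add up to 100 percent.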

Visualizations

The data collected in the tables are visualized in the Grafana dashboard "CPU usage by Type per Server" on mon-login01.

Table Structure

Each of the tables has the same structure.

Measurement: aggregation.cpu-average.cpu.<usage_state>
Key | Value | Explanation
cluster | exafs |
fqdn | hawk-mds01, hawk-mds02, hawk-oss01, hawk-oss02, hawk-oss03, hawk-oss04, hawk-oss05, hawk-oss06, hawk-oss07, hawk-oss08 |
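
The measurement name for a usage state can be built from the pattern above. The sketch below assumes that the tables can be queried with an InfluxQL-style syntax, which this page does not confirm. The helper names and the one-hour time window are hypothetical.

```python
USAGE_STATES = ("idle", "system", "user", "steal", "nice", "wait", "interrupt", "softirq")


def cpu_measurement(state):
    """Build the measurement name for one CPU usage state."""
    if state not in USAGE_STATES:
        raise ValueError(f"unknown usage state: {state}")
    return f"aggregation.cpu-average.cpu.{state}"


def cpu_query(state, fqdn):
    """Build a query string for one state and one server.

    Assumption: InfluxQL-style syntax, where identifiers are double-quoted
    and tag values are single-quoted.
    """
    measurement = cpu_measurement(state)
    host = f"'{fqdn}'"
    return f'SELECT * FROM "{measurement}" WHERE "fqdn" = {host} AND time > now() - 1h'


query = cpu_query("wait", "hawk-oss01")
```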

Please note that the explanation of the CPU usage states was taken from https://www.opsdash.com/blog/cpu-usage-linux.html

Network Data

The InfiniBand counters of the Lustre servers are collected in four tables, i.e. measurements. Two of them, counters_error and counters_info, report port-based metrics, whereas the other two, hw_counters_error and hw_counters_info, report function-based metrics.

Visualizations

The data collected in the tables are visualized in the Grafana dashboard Network Metrics by Server (Selectable) on mon-login01.

Table Structure

The four tables have the same structure with respect to the tag keys. The counters are differentiated by optype.

Measurement: counters_error
Key | Value | Explanation
cluster | exafs |
driver_index | 0, 1 |
driver_type | mlx5 |
fqdn | hawk-mds01, hawk-mds02, hawk-oss01, hawk-oss02, hawk-oss03, hawk-oss04, hawk-oss05, hawk-oss06, hawk-oss07, hawk-oss08 |
optype | VL15_dropped, link_downed, link_error_recovery, local_link_integrity_errors, port_rcv_constraint_errors, port_rcv_remote_physical_errors, port_rcv_switch_relay_errors, port_xmit_constraint_errors, port_xmit_discards, symbol_error |
port_number | 1 |

Measurement: counters_info
Key | Value | Explanation
cluster | exafs |
driver_index | 0, 1 |
driver_type | mlx5 |
fqdn | hawk-mds01, hawk-mds02, hawk-oss01, hawk-oss02, hawk-oss03, hawk-oss04, hawk-oss05, hawk-oss06, hawk-oss07, hawk-oss08 |
optype | excessive_buffer_overrun_errors, port_rcv_data, port_rcv_errors, port_rcv_packets, port_xmit_data, port_xmit_packets |
port_number | 1 |

Measurement: hw_counters_error
Key | Value | Explanation
cluster | exafs |
driver_index | 0, 1 |
driver_type | mlx5 |
fqdn | hawk-mds01, hawk-mds02, hawk-oss01, hawk-oss02, hawk-oss03, hawk-oss04, hawk-oss05, hawk-oss06, hawk-oss07, hawk-oss08 |
optype | duplicate_request, implied_nak_seq_err, local_ack_timeout_err, out_of_buffer, out_of_sequence, packet_seq_err, req_cqe_error, req_cqe_flush_error, req_remote_access_errors, req_remote_invalid_request, resp_cqe_error, resp_cqe_flush_error, resp_local_length_error, resp_remote_access_errors, rnr_nak_retry_err, rx_icrc_encapsulated |
port_number | 1 |

Measurement: hw_counters_info
Key | Value | Explanation
cluster | exafs |
driver_index | 0, 1 |
driver_type | mlx5 |
fqdn | hawk-mds01, hawk-mds02, hawk-oss01, hawk-oss02, hawk-oss03, hawk-oss04, hawk-oss05, hawk-oss06, hawk-oss07, hawk-oss08 |
optype | lifespan, roce_adp_retrans, roce_adp_retrans_to, roce_slow_restart, roce_slow_restart_cnps, roce_slow_restart_trans, rx_atomic_requests, rx_dct_connect, rx_read_requests, rx_write_requests |
port_number | 1 |

For more information about InfiniBand counters, please see https://enterprise-support.nvidia.com/s/article/understanding-mlx5-linux-counters-and-status-parameters.
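
When interpreting port_rcv_data and port_xmit_data, keep in mind that the InfiniBand port counters count data in units of four octets (32-bit words), so raw values have to be multiplied by 4 to obtain bytes. The sketch below converts two raw counter readings into a byte rate. It assumes that the stored values are the unmodified cumulative counters, which this page does not state. The function name and the sample numbers are illustrative.

```python
def ib_data_rate_bytes_per_s(count_before, count_after, interval_s):
    """Byte rate derived from two readings of port_rcv_data or port_xmit_data.

    Assumption: both readings are raw cumulative counter values taken
    interval_s seconds apart, with no counter reset in between.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    words = count_after - count_before  # counter unit: 4 octets
    return words * 4 / interval_s


# Invented readings, 10 seconds apart.
rate = ib_data_rate_bytes_per_s(1_000_000, 1_250_000, 10.0)
```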

Meta Data Server (MDS) and Meta Data Target (MDT) Metrics

While the table cq_md_stats_by_optype reports meta data operations against the full file system, all other measurements collected from the Lustre meta data servers and their meta data targets can in principle be grouped into three categories.

  1. Metrics by MDS and MDT.
    • md_stats
    • md_stats_max
    • md_stats_min
    • md_stats_sum
    • md_stats_sumsq
    • mdt_filesinfo_free
    • mdt_filesinfo_total
    • mdt_filesinfo_used
    • mdt_kbytesinfo_free
    • mdt_kbytesinfo_total
    • mdt_kbytesinfo_used
  2. Metrics differentiating between user, group, and job ID.
    • cq_mdt_acctuser_samples_by_user_id
    • cq_mdt_jobstats_samples_by_ll_job_gid
    • cq_mdt_jobstats_samples_by_ll_job_id
    • cq_mdt_jobstats_samples_by_ll_job_uid
    • mdt_acctuser_samples
    • mdt_jobstats_max
    • mdt_jobstats_min
    • mdt_jobstats_samples
    • mdt_jobstats_sum
    • mdt_jobstats_sumsq
  3. Metrics differentiating between clients.
    • exp_md_stats
    • exp_md_stats_max_latency
    • exp_md_stats_min_latency
    • exp_md_stats_sum_latency
    • exp_md_stats_sumsq_latency

Total Meta Data Operations

The table cq_md_stats_by_optype collects the total sum of meta data operations against the complete file system in a continuous query.

Visualizations

The Dashboard ws10-barreleye uses the table in the panel "Lustre Aggregated Metadata".

Table Structure

Measurement: cq_md_stats_by_optype
Key | Value | Explanation
optype | close, getattr, getxattr, mkdir, mknod, open, rename, rmdir, setattr, setxattr, statfs, unlink |
sum | float |
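
Conceptually, such a continuous query sums the per-target values over all servers and targets for each operation type. The sketch below mimics that aggregation on a handful of points shaped like md_stats entries. The sample values are invented, and the actual continuous query definition, including its source measurement, is not shown on this page.

```python
from collections import defaultdict


def sum_by_optype(points):
    """Sum the 'value' of all points that share the same 'optype' tag."""
    totals = defaultdict(float)
    for point in points:
        totals[point["optype"]] += point["value"]
    return dict(totals)


# Invented sample points (not measured data).
points = [
    {"fqdn": "hawk-mds01", "mdt_index": "MDT0000", "optype": "open", "value": 120.0},
    {"fqdn": "hawk-mds01", "mdt_index": "MDT0001", "optype": "open", "value": 30.0},
    {"fqdn": "hawk-mds02", "mdt_index": "MDT0002", "optype": "close", "value": 45.0},
    {"fqdn": "hawk-mds02", "mdt_index": "MDT0003", "optype": "open", "value": 50.0},
]

totals = sum_by_optype(points)
```

The result no longer carries the fqdn and mdt_index tags, which matches the reduced tag set of the table above.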

Metrics by MDS and MDT.

Meta Data Operations grouped by MDS and MDT.

The table md_stats collects meta data operations per meta data target. The table shares its structure with

  • md_stats_max
  • md_stats_min
  • md_stats_sum
  • md_stats_sumsq

In which ....

Visualizations

The Dashboard ws10-barreleye uses the table in the panel "Lustre Aggregated Metadata".

Table Structure
Measurement: md_stats
Key | Value | Explanation
cluster | exafs |
fqdn | hawk-mds01, hawk-mds02 |
fs_name | exafs |
mdt_index | MDT0000, MDT0001, MDT0002, MDT0003 |
optype | close, getattr, getxattr, mkdir, mknod, open, rename, rmdir, setattr, setxattr, statfs, unlink |
value | float |

Inode and filespace usage by MDS and MDT.

The tables mdt_filesinfo_* collect information about the number of free, total and used inodes on each MDT, while the tables mdt_kbytesinfo_* collect information about the free, total and used file space on each MDT.

All six tables share the same structure.

Visualizations

...

Table Structure
Measurement: mdt_[files|kbytes]info_[free|total|used]
Key | Value | Explanation
cluster | exafs |
fqdn | hawk-mds01, hawk-mds02 |
fs_name | exafs |
mdt_index | MDT0000, MDT0001, MDT0002, MDT0003 |
value | float |
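
From the used and total values, a fill level per MDT can be derived. This is a minimal sketch with invented numbers and a hypothetical helper name; it works the same way for inode counts and for kilobytes.

```python
def fill_percent(used, total):
    """Fill level in percent, given matching 'used' and 'total' values of one MDT."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


# Invented inode counts for a single MDT.
inode_fill = fill_percent(used=250_000.0, total=1_000_000.0)
```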

Metrics differentiating between user, group, and job ID

Measurement: cq_mdt_acctuser_samples_by_user_id
Key | Value | Explanation
optype | usage_inodes, usage_kbytes |
user_id | 0, 1001, 1002, 11932, 12266, 12356, 12448, 12499, 13468, 13967, ... |
sum | float |

Measurement: cq_mdt_jobstats_samples_by_ll_job_gid
Key | Value | Explanation
ll_job_gid | 0, 0:0, 0:0:, 11142, 12793, 12801, 12803, 12812, 12831, 12833, ... |
optype | close, crossdir_rename, getattr, getxattr, link, mkdir, mknod, open, punch, read_bytes, rename, rmdir, samedir_rename, setattr, setxattr, statfs, sync, unlink, write_bytes |

Measurement: cq_mdt_jobstats_samples_by_ll_job_id
Key | Value | Explanation
ll_job_id | 0, 1, 10, 100010.cl1intern__1, 100010.cl1intern__I, 100010.cl1intern__S, 100010.cl1intern__a, 100010.cl1intern__c, 100010.cl1intern__d, 100010.cl1intern__f, ... |
optype | close, crossdir_rename, getattr, getxattr, link, mkdir, mknod, open, punch, read_bytes, rename, rmdir, samedir_rename, setattr, setxattr, statfs, sync, unlink, write_bytes |

Measurement: cq_mdt_jobstats_samples_by_ll_job_uid
Key | Value | Explanation
ll_job_uid | 0, 0:, 11932, 12266, 12356, 12448, 12499, 13468, 13967, 14207, ... |
optype | close, crossdir_rename, getattr, getxattr, link, mkdir, mknod, open, punch, read_bytes, rename, rmdir, samedir_rename, setattr, setxattr, statfs, sync, unlink, write_bytes |

Measurement: mdt_acctuser_samples
Key | Value | Explanation
cluster | exafs |
fqdn | hawk-mds01, hawk-mds02 |
fs_name | exafs |
mdt_index | MDT0000, MDT0001, MDT0002, MDT0003 |
optype | usage_inodes, usage_kbytes |
user_id | 0, 1001, 1002, 11932, 12266, 12356, 12448, 12499, 13468, 13967 |

Measurement: mdt_jobstats_max
Key | Value | Explanation
cluster | exafs |
fqdn | hawk-mds01, hawk-mds02 |
fs_name | exafs |
job_id | 0:0:2072485.hawk-pbs5, 0:0:2073250.hawk-pbs5, 0:0:2073899.hawk-pbs5, 0:0:2074114.hawk-pbs5, 0:0:2075166.hawk-pbs5, 0:0:2075906.hawk-pbs5, 0:0:2077436.hawk-pbs5, 0:0:2079673.hawk-pbs5, 0:0:2081442.hawk-pbs5, 0:0:2081474.hawk-pbs5 |
ll_job_gid | 0, 0:0, 0:0:, 11142, 12793, 12801, 12803, 12812, 12831, 12833 |
ll_job_id | 0, 1, 10, 100010.cl1intern__1, 100010.cl1intern__I, 100010.cl1intern__S, 100010.cl1intern__a, 100010.cl1intern__c, 100010.cl1intern__d, 100010.cl1intern__f |
ll_job_uid | 0, 0:, 11932, 12266, 12356, 12448, 12499, 13468, 13967, 14207 |
mdt_index | MDT0000, MDT0001, MDT0002, MDT0003 |
optype | max_punch, max_read_bytes, max_write_bytes |

Measurement: mdt_jobstats_min
Key | Value | Explanation
cluster | exafs |
fqdn | hawk-mds01, hawk-mds02 |
fs_name | exafs |
job_id | 0:0:2072485.hawk-pbs5, 0:0:2073250.hawk-pbs5, 0:0:2073899.hawk-pbs5, 0:0:2074114.hawk-pbs5, 0:0:2075166.hawk-pbs5, 0:0:2075906.hawk-pbs5, 0:0:2077436.hawk-pbs5, 0:0:2079673.hawk-pbs5, 0:0:2081442.hawk-pbs5, 0:0:2081474.hawk-pbs5 |
ll_job_gid | 0, 0:0, 0:0:, 11142, 12793, 12801, 12803, 12812, 12831, 12833 |
ll_job_id | 0, 1, 10, 100010.cl1intern__1, 100010.cl1intern__I, 100010.cl1intern__S, 100010.cl1intern__a, 100010.cl1intern__c, 100010.cl1intern__d, 100010.cl1intern__f |
ll_job_uid | 0, 0:, 11932, 12266, 12356, 12448, 12499, 13468, 13967, 14207 |
mdt_index | MDT0000, MDT0001, MDT0002, MDT0003 |
optype | min_punch, min_read_bytes, min_write_bytes |

Measurement: mdt_jobstats_samples
Key | Value | Explanation
cluster | exafs |
fqdn | hawk-mds01, hawk-mds02 |
fs_name | exafs |
job_id | 0:0:2072485.hawk-pbs5, 0:0:2073250.hawk-pbs5, 0:0:2073899.hawk-pbs5, 0:0:2074114.hawk-pbs5, 0:0:2075166.hawk-pbs5, 0:0:2075906.hawk-pbs5, 0:0:2077436.hawk-pbs5, 0:0:2079673.hawk-pbs5, 0:0:2081442.hawk-pbs5, 0:0:2081474.hawk-pbs5 |
ll_job_gid | 0, 0:0, 0:0:, 11142, 12793, 12801, 12803, 12812, 12831, 12833 |
ll_job_id | 0, 1, 10, 100010.cl1intern__1, 100010.cl1intern__I, 100010.cl1intern__S, 100010.cl1intern__a, 100010.cl1intern__c, 100010.cl1intern__d, 100010.cl1intern__f |
ll_job_uid | 0, 0:, 11932, 12266, 12356, 12448, 12499, 13468, 13967, 14207 |
mdt_index | MDT0000, MDT0001, MDT0002, MDT0003 |
optype | close, crossdir_rename, getattr, getxattr, link, mkdir, mknod, open, punch, read_bytes, rename, rmdir, samedir_rename, setattr, setxattr, statfs, sync, unlink, write_bytes |

Measurement: mdt_jobstats_sum
Key | Value | Explanation
cluster | exafs |
fqdn | hawk-mds01, hawk-mds02 |
fs_name | exafs |
job_id | 0:0:2072485.hawk-pbs5, 0:0:2073250.hawk-pbs5, 0:0:2073899.hawk-pbs5, 0:0:2074114.hawk-pbs5, 0:0:2075166.hawk-pbs5, 0:0:2075906.hawk-pbs5, 0:0:2077436.hawk-pbs5, 0:0:2079673.hawk-pbs5, 0:0:2081442.hawk-pbs5, 0:0:2081474.hawk-pbs5 |
ll_job_gid | 0, 0:0, 0:0:, 11142, 12793, 12801, 12803, 12812, 12831, 12833 |
ll_job_id | 0, 1, 10, 100010.cl1intern__1, 100010.cl1intern__I, 100010.cl1intern__S, 100010.cl1intern__a, 100010.cl1intern__c, 100010.cl1intern__d, 100010.cl1intern__f |
ll_job_uid | 0, 0:, 11932, 12266, 12356, 12448, 12499, 13468, 13967, 14207 |
mdt_index | MDT0000, MDT0001, MDT0002, MDT0003 |
optype | sum_punch, sum_read_bytes, sum_write_bytes |

Measurement: mdt_jobstats_sumsq
Key | Value | Explanation
cluster | exafs |
fqdn | hawk-mds01, hawk-mds02 |
fs_name | exafs |
job_id | 0:0:2072485.hawk-pbs5, 0:0:2073250.hawk-pbs5, 0:0:2073899.hawk-pbs5, 0:0:2074114.hawk-pbs5, 0:0:2075166.hawk-pbs5, 0:0:2075906.hawk-pbs5, 0:0:2077436.hawk-pbs5, 0:0:2079673.hawk-pbs5, 0:0:2081442.hawk-pbs5, 0:0:2081474.hawk-pbs5 |
ll_job_gid | 0, 0:0, 0:0:, 11142, 12793, 12801, 12803, 12812, 12831, 12833 |
ll_job_id | 0, 1, 10, 100010.cl1intern__1, 100010.cl1intern__I, 100010.cl1intern__S, 100010.cl1intern__a, 100010.cl1intern__c, 100010.cl1intern__d, 100010.cl1intern__f |
ll_job_uid | 0, 0:, 11932, 12266, 12356, 12448, 12499, 13468, 13967, 14207 |
mdt_index | MDT0000, MDT0001, MDT0002, MDT0003 |
optype | sumsq_punch, sumsq_read_bytes, sumsq_write_bytes |
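
The samples, sum and sumsq values are sufficient to derive a mean and a standard deviation without access to the individual observations. The sketch below shows the standard formulas. It assumes that sum and sumsq are the sum and the sum of squares of the same observations that samples counts, which this page does not spell out. The function name and the numbers are illustrative.

```python
import math


def mean_and_stddev(samples, total, total_sq):
    """Mean and population standard deviation from count, sum and sum of squares."""
    if samples <= 0:
        raise ValueError("samples must be positive")
    mean = total / samples
    # Clamp at zero to guard against small negative values from rounding.
    variance = max(total_sq / samples - mean * mean, 0.0)
    return mean, math.sqrt(variance)


# Three invented observations 2, 4 and 6: sum = 12, sum of squares = 56.
mean, stddev = mean_and_stddev(3, 12.0, 56.0)
```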

Metrics differentiating between clients

Measurement: exp_md_stats
Key | Value | Explanation
cluster | exafs |
exp_client | 0, 10.148.0.32, 10.148.0.33, 10.148.0.34, 10.148.0.36, 10.148.0.37, 10.148.0.38, 10.148.0.39, 10.148.0.40, 10.148.0.41, ... |
exp_type | lo, o2ib20, o2ib43, o2ib44 |
fqdn | hawk-mds01, hawk-mds02 |
fs_name | exafs |
mdt_index | MDT0000, MDT0001, MDT0002, MDT0003 |
optype | close, getattr, getxattr, link, mkdir, mknod, open, rename, rmdir, setattr, setxattr, statfs, sync, unlink |

Measurement: exp_md_stats_max_latency
Key | Value | Explanation
cluster | exafs |
exp_client | 0, 10.148.0.32, 10.148.0.33, 10.148.0.34, 10.148.0.36, 10.148.0.37, 10.148.0.38, 10.148.0.39, 10.148.0.40, 10.148.0.41, ... |
exp_type | lo, o2ib20, o2ib43, o2ib44 |
fqdn | hawk-mds01, hawk-mds02 |
fs_name | exafs |
mdt_index | MDT0000, MDT0001, MDT0002, MDT0003 |
optype | close, getattr, getxattr, link, mkdir, mknod, open, rename, rmdir, setattr, setxattr, statfs, sync, unlink |

Measurement: exp_md_stats_min_latency
Key | Value | Explanation
cluster | exafs |
exp_client | 0, 10.148.0.32, 10.148.0.33, 10.148.0.34, 10.148.0.36, 10.148.0.37, 10.148.0.38, 10.148.0.39, 10.148.0.40, 10.148.0.41, ... |
exp_type | lo, o2ib20, o2ib43, o2ib44 |
fqdn | hawk-mds01, hawk-mds02 |
fs_name | exafs |
mdt_index | MDT0000, MDT0001, MDT0002, MDT0003 |
optype | close, getattr, getxattr, link, mkdir, mknod, open, rename, rmdir, setattr, setxattr, statfs, sync, unlink |

Measurement: exp_md_stats_sum_latency
Key | Value | Explanation
cluster | exafs |
exp_client | 0, 10.148.0.32, 10.148.0.33, 10.148.0.34, 10.148.0.36, 10.148.0.37, 10.148.0.38, 10.148.0.39, 10.148.0.40, 10.148.0.41, ... |
exp_type | lo, o2ib20, o2ib43, o2ib44 |
fqdn | hawk-mds01, hawk-mds02 |
fs_name | exafs |
mdt_index | MDT0000, MDT0001, MDT0002, MDT0003 |
optype | close, getattr, getxattr, link, mkdir, mknod, open, rename, rmdir, setattr, setxattr, statfs, sync, unlink |

Measurement: exp_md_stats_sumsq_latency
Key | Value | Explanation
cluster | exafs |
exp_client | 0, 10.148.0.32, 10.148.0.33, 10.148.0.34, 10.148.0.36, 10.148.0.37, 10.148.0.38, 10.148.0.39, 10.148.0.40, 10.148.0.41, ... |
exp_type | lo, o2ib20, o2ib43, o2ib44 |
fqdn | hawk-mds01, hawk-mds02 |
fs_name | exafs |
mdt_index | MDT0000, MDT0001, MDT0002, MDT0003 |
optype | close, getattr, getxattr, link, mkdir, mknod, open, rename, rmdir, setattr, setxattr, statfs, sync, unlink |
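
The exp_client and exp_type tags resemble the two halves of a Lustre network identifier (NID), which is written as address@network. Assuming this interpretation holds, which is not confirmed on this page, a NID string could be reassembled as follows. The function name is hypothetical, and the pairing of the sample address with o2ib20 is chosen arbitrarily for illustration.

```python
def to_nid(exp_client, exp_type):
    """Join an address and an LNet network name into a NID-style string."""
    if not exp_client or not exp_type:
        raise ValueError("both exp_client and exp_type are required")
    return f"{exp_client}@{exp_type}"


# Arbitrary pairings of tag values listed above (assumed, not measured).
nids = [to_nid("10.148.0.32", "o2ib20"), to_nid("0", "lo")]
```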

Object Storage Server (OSS) and Object Storage Target (OST) Metrics

While the tables

  • cq_ost_brw_stats_rpc_bulk_samples_by_size
  • cq_ost_kbytesinfo_used_by_fs_name
  • cq_ost_stats_bytes_by_optype

report operation and usage statistics of the full file system, all other measurements collected from the Lustre object storage servers and their object storage targets can in principle be grouped into four categories.

  1. Metrics by OSS
    • ost_io_stats_ost_punch_max
    • ost_io_stats_ost_punch_mean
    • ost_io_stats_ost_punch_mean_square
    • ost_io_stats_ost_punch_min
    • ost_io_stats_ost_punch_samples
    • ost_io_stats_ost_punch_sum
    • ost_io_stats_ost_punch_sum_square
    • ost_io_stats_ost_read_max
    • ost_io_stats_ost_read_mean
    • ost_io_stats_ost_read_mean_square
    • ost_io_stats_ost_read_min
    • ost_io_stats_ost_read_samples
    • ost_io_stats_ost_read_sum
    • ost_io_stats_ost_read_sum_square
    • ost_io_stats_ost_write_max
    • ost_io_stats_ost_write_mean
    • ost_io_stats_ost_write_mean_square
    • ost_io_stats_ost_write_min
    • ost_io_stats_ost_write_samples
    • ost_io_stats_ost_write_sum
    • ost_io_stats_ost_write_sum_square
    • ost_io_stats_req_active_max
    • ost_io_stats_req_active_mean
    • ost_io_stats_req_active_mean_square
    • ost_io_stats_req_active_min
    • ost_io_stats_req_active_samples
    • ost_io_stats_req_active_sum
    • ost_io_stats_req_active_sum_square
    • ost_io_stats_req_qdepth_max
    • ost_io_stats_req_qdepth_mean
    • ost_io_stats_req_qdepth_mean_square
    • ost_io_stats_req_qdepth_min
    • ost_io_stats_req_qdepth_samples
    • ost_io_stats_req_qdepth_sum
    • ost_io_stats_req_qdepth_sum_square
    • ost_io_stats_req_timeout_max
    • ost_io_stats_req_timeout_mean
    • ost_io_stats_req_timeout_mean_square
    • ost_io_stats_req_timeout_min
    • ost_io_stats_req_timeout_samples
    • ost_io_stats_req_timeout_sum
    • ost_io_stats_req_timeout_sum_square
    • ost_io_stats_req_waittime_max
    • ost_io_stats_req_waittime_mean
    • ost_io_stats_req_waittime_mean_square
    • ost_io_stats_req_waittime_min
    • ost_io_stats_req_waittime_samples
    • ost_io_stats_req_waittime_sum
    • ost_io_stats_req_waittime_sum_square
    • ost_io_stats_reqbuf_avail_max
    • ost_io_stats_reqbuf_avail_mean
    • ost_io_stats_reqbuf_avail_mean_square
    • ost_io_stats_reqbuf_avail_min
    • ost_io_stats_reqbuf_avail_samples
    • ost_io_stats_reqbuf_avail_sum
    • ost_io_stats_reqbuf_avail_sum_square
  2. Metrics by OSS and OST
    • ost_brw_stats_block_discontiguous_rpc_cum
    • ost_brw_stats_block_discontiguous_rpc_percentage
    • ost_brw_stats_block_discontiguous_rpc_samples
    • ost_brw_stats_fragmented_io_cum
    • ost_brw_stats_fragmented_io_percentage
    • ost_brw_stats_fragmented_io_samples
    • ost_brw_stats_io_in_flight_cum
    • ost_brw_stats_io_in_flight_percentage
    • ost_brw_stats_io_in_flight_samples
    • ost_brw_stats_io_size_cum
    • ost_brw_stats_io_size_percentage
    • ost_brw_stats_io_size_samples
    • ost_brw_stats_page_discontiguous_rpc_cum
    • ost_brw_stats_page_discontiguous_rpc_percentage
    • ost_brw_stats_page_discontiguous_rpc_samples
    • ost_brw_stats_rpc_bulk_cum
    • ost_brw_stats_rpc_bulk_percentage
    • ost_brw_stats_rpc_bulk_samples
    • ost_filesinfo_free
    • ost_filesinfo_total
    • ost_filesinfo_used
    • ost_kbytesinfo_free
    • ost_kbytesinfo_total
    • ost_kbytesinfo_used
    • ost_stats_bytes
    • ost_stats_max_latency
    • ost_stats_min_latency
    • ost_stats_samples
    • ost_stats_sum_latency
    • ost_stats_sumsq_latency
  3. Metrics differentiating between user, group, and job ID
    • cq_ost_acctuser_samples_by_user_id
    • cq_ost_jobstats_bytes_by_ll_job_gid
    • cq_ost_jobstats_bytes_by_ll_job_id
    • cq_ost_jobstats_bytes_by_ll_job_uid
    • ost_acctuser_samples
    • ost_jobstats_bytes
    • ost_jobstats_samples
  4. Metrics differentiating between clients.
    • exp_ost_stats_bytes
    • exp_ost_stats_samples

Metrics by OSS

All measurements that are stored by OSS share the same table structure.

Measurement: ost_io_stats_<operation>_<aggregation>
Key | Value | Explanation
cluster | exafs |
fqdn | hawk-oss01, hawk-oss02, hawk-oss03, hawk-oss04, hawk-oss05, hawk-oss06, hawk-oss07, hawk-oss08 |
value | float |
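
The measurement names listed under "Metrics by OSS" follow the pattern ost_io_stats_<operation>_<aggregation>, with eight operations and seven aggregations. The sketch below regenerates the 56 names from those two sets, which can be handy when iterating over all of them in a script. The variable names are illustrative.

```python
from itertools import product

# Operations and aggregations as they appear in the measurement list above.
OPERATIONS = ("ost_punch", "ost_read", "ost_write", "req_active",
              "req_qdepth", "req_timeout", "req_waittime", "reqbuf_avail")
AGGREGATIONS = ("max", "mean", "mean_square", "min", "samples", "sum", "sum_square")

measurements = [
    f"ost_io_stats_{operation}_{aggregation}"
    for operation, aggregation in product(OPERATIONS, AGGREGATIONS)
]
```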

Metrics by OSS and OST

To be done

Metrics differentiating between user, group, and job ID

Measurement: cq_ost_acctuser_samples_by_user_id
Key | Value | Explanation
optype | usage_inodes, usage_kbytes |
user_id | 0, 1001, 1002, 11363, 11932, 12266, 12356, 12448, 12499, 13468, ... |
sum | float |

Measurement: cq_ost_jobstats_bytes_by_ll_job_gid
Key | Value | Explanation
ll_job_gid | 0, 00145, 00277, 00279, 00967, 01141, 01142, 01392, 01540, 02073, ... |
optype | sum_read_bytes, sum_write_bytes |
sum | float |

Measurement: cq_ost_jobstats_bytes_by_ll_job_id
Key | Value | Explanation
ll_job_id | ., .hawk-pbs5, .hawk-pbs5__, 0, 0-bin, 00, 01, 01].hawk-pbs, 02, 02].hawk-pbs, ... |
optype | sum_read_bytes, sum_write_bytes |
sum | float |

Measurement: cq_ost_jobstats_bytes_by_ll_job_uid
Key | Value | Explanation
ll_job_uid | .kworker/10, .kworker/101, .kworker/104, .kworker/106, .kworker/107, .kworker/108, .kworker/112, .kworker/113, .kworker/114, .kworker/116, ... |
optype | sum_read_bytes, sum_write_bytes |
sum | float |

Measurement: ost_acctuser_samples
Key | Value | Explanation
cluster | exafs |
fqdn | hawk-oss01, hawk-oss02, hawk-oss03, hawk-oss04, hawk-oss05, hawk-oss06, hawk-oss07, hawk-oss08 |
fs_name | exafs |
optype | usage_inodes, usage_kbytes |
ost_index | OST0000, OST0001, OST0002, OST0003, OST0004, OST0005, OST0006, OST0007, OST0008, OST0009, OST000a, OST000b, OST000c, OST000d, OST000e, OST000f, OST0010, OST0011, OST0012, OST0013, OST0014, OST0015, OST0016, OST0017, OST0018, OST0019, OST001a, OST001b, OST001c, OST001d, OST001e, OST001f, OST0020, OST0021, OST0022, OST0023, OST0024, OST0025, OST0026, OST0027, OST0028, OST0029, OST002a, OST002b, OST002c, OST002d, OST002e, OST002f |
user_id | 0, 1001, 1002, 11363, 11932, 12266, 12356, 12448, 12499, 13420, ... |
value | float |

Measurement: ost_jobstats_bytes
Key | Value | Explanation
cluster | exafs |
fqdn | hawk-oss01, hawk-oss02, hawk-oss03, hawk-oss04, hawk-oss05, hawk-oss06, hawk-oss07, hawk-oss08 |
fs_name | exafs |
job_id | 00145:31716:2169120.hawk-pbs5, 00277:29017:1955996.hawk-pbs5, 00279:15448:1963897.hawk-pbs5, 00279:15448:2056529.hawk-pbs5, 00279:15448:__ResultCombine.e, 00967:30969:1983785.hawk-pbs5, 00967:34627:2209444.hawk-pbs5__, 00967:34627:2225903.hawk-pbs5__, 00967:34627:2237056.hawk-pbs5__, 01141:32275:2121954.hawk-pbs5, ... |
ll_job_gid | 0, 00145, 00277, 00279, 00967, 01141, 01142, 01392, 01540, 02073, ... |
ll_job_id | ., .hawk-pbs5, .hawk-pbs5__, 0, 0-bin, 00, 01, 01].hawk-pbs, 02, 02].hawk-pbs, ... |
ll_job_uid | .kworker/10, .kworker/101, .kworker/104, .kworker/106, .kworker/107, .kworker/108, .kworker/112, .kworker/113, .kworker/114, .kworker/116, ... |
optype | sum_read_bytes, sum_write_bytes |
ost_index | OST0000, OST0001, OST0002, OST0003, OST0004, OST0005, OST0006, OST0007, OST0008, OST0009, OST000a, OST000b, OST000c, OST000d, OST000e, OST000f, OST0010, OST0011, OST0012, OST0013, OST0014, OST0015, OST0016, OST0017, OST0018, OST0019, OST001a, OST001b, OST001c, OST001d, OST001e, OST001f, OST0020, OST0021, OST0022, OST0023, OST0024, OST0025, OST0026, OST0027, OST0028, OST0029, OST002a, OST002b, OST002c, OST002d, OST002e, OST002f |
value | float |

Measurement: ost_jobstats_samples
Key | Value | Explanation
cluster | exafs |
fqdn | hawk-oss01, hawk-oss02, hawk-oss03, hawk-oss04, hawk-oss05, hawk-oss06, hawk-oss07, hawk-oss08 |
fs_name | exafs |
job_id | 00145:31716:2169120.hawk-pbs5, 00277:29017:1955996.hawk-pbs5, 00279:15448:1963897.hawk-pbs5, 00279:15448:2056529.hawk-pbs5, 00279:15448:__ResultCombine.e, 00967:30969:1983785.hawk-pbs5, 00967:34627:2209444.hawk-pbs5__, 00967:34627:2225903.hawk-pbs5__, 00967:34627:2237056.hawk-pbs5__, 01141:32275:2121954.hawk-pbs5, ... |
ll_job_gid | 0, 00145, 00277, 00279, 00967, 01141, 01142, 01392, 01540, 02073, ... |
ll_job_id | ., .hawk-pbs5, .hawk-pbs5__, 0, 0-bin, 00, 01, 01].hawk-pbs, 02, 02].hawk-pbs, ... |
ll_job_uid | .kworker/10, .kworker/101, .kworker/104, .kworker/106, .kworker/107, .kworker/108, .kworker/112, .kworker/113, .kworker/114, .kworker/116, ... |
optype | create, destroy, get_info, getattr, punch, quotactl, read, read_samples, set_info, setattr, statfs, sync, write, write_samples |
ost_index | OST0000, OST0001, OST0002, OST0003, OST0004, OST0005, OST0006, OST0007, OST0008, OST0009, OST000a, OST000b, OST000c, OST000d, OST000e, OST000f, OST0010, OST0011, OST0012, OST0013, OST0014, OST0015, OST0016, OST0017, OST0018, OST0019, OST001a, OST001b, OST001c, OST001d, OST001e, OST001f, OST0020, OST0021, OST0022, OST0023, OST0024, OST0025, OST0026, OST0027, OST0028, OST0029, OST002a, OST002b, OST002c, OST002d, OST002e, OST002f |
value | float |
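
Judging from the sample values, a job_id appears to be composed of three colon-separated parts that correspond to ll_job_gid, ll_job_uid and the batch job identifier. This is an inference from the examples above, not a documented format. Under that assumption, a job_id can be split as sketched below; the class and function names are hypothetical.

```python
from typing import NamedTuple


class JobId(NamedTuple):
    """Parts of a job_id, under the assumed layout <gid>:<uid>:<job>."""
    gid: str
    uid: str
    job: str


def parse_job_id(job_id):
    """Split a job_id at its first two colons; the remainder is the job name."""
    parts = job_id.split(":", 2)
    if len(parts) != 3:
        raise ValueError(f"unexpected job_id layout: {job_id!r}")
    return JobId(*parts)


parsed = parse_job_id("00145:31716:2169120.hawk-pbs5")
```

The parts are kept as strings on purpose, since values such as 00145 carry leading zeros that an integer conversion would drop.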

Metrics differentiating between clients

Measurement: exp_ost_stats_bytes
Key | Value | Explanation
cluster | exafs |
exp_client | 10.148.0.32, 10.148.0.33, 10.148.0.34, 10.148.0.36, 10.148.0.37, 10.148.0.38, 10.148.0.39, 10.148.0.40, 10.148.0.41, 10.148.0.42, ... |
exp_type | o2ib20, o2ib43, o2ib44 |
fqdn | hawk-oss01, hawk-oss02, hawk-oss03, hawk-oss04, hawk-oss05, hawk-oss06, hawk-oss07, hawk-oss08 |
fs_name | exafs |
optype | read, write |
ost_index | OST0000, OST0001, OST0002, OST0003, OST0004, OST0005, OST0006, OST0007, OST0008, OST0009, OST000a, OST000b, OST000c, OST000d, OST000e, OST000f, OST0010, OST0011, OST0012, OST0013, OST0014, OST0015, OST0016, OST0017, OST0018, OST0019, OST001a, OST001b, OST001c, OST001d, OST001e, OST001f, OST0020, OST0021, OST0022, OST0023, OST0024, OST0025, OST0026, OST0027, OST0028, OST0029, OST002a, OST002b, OST002c, OST002d, OST002e, OST002f |
value | float |

Measurement: exp_ost_stats_samples
Key | Value | Explanation
cluster | exafs |
exp_client | 10.148.0.32, 10.148.0.33, 10.148.0.34, 10.148.0.36, 10.148.0.37, 10.148.0.38, 10.148.0.39, 10.148.0.40, 10.148.0.41, 10.148.0.42, ... |
exp_type | o2ib20, o2ib43, o2ib44 |
fqdn | hawk-oss01, hawk-oss02, hawk-oss03, hawk-oss04, hawk-oss05, hawk-oss06, hawk-oss07, hawk-oss08 |
fs_name | exafs |
optype | read, write |
ost_index | OST0000, OST0001, OST0002, OST0003, OST0004, OST0005, OST0006, OST0007, OST0008, OST0009, OST000a, OST000b, OST000c, OST000d, OST000e, OST000f, OST0010, OST0011, OST0012, OST0013, OST0014, OST0015, OST0016, OST0017, OST0018, OST0019, OST001a, OST001b, OST001c, OST001d, OST001e, OST001f, OST0020, OST0021, OST0022, OST0023, OST0024, OST0025, OST0026, OST0027, OST0028, OST0029, OST002a, OST002b, OST002c, OST002d, OST002e, OST002f |
value | float |
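
The ost_index values are four-digit lowercase hexadecimal numbers prefixed with OST, so the 48 targets listed above run from OST0000 to OST002f. The sketch below converts between the label and its numeric index, for example to sort or enumerate targets in a script. The helper names are hypothetical.

```python
def ost_label(index):
    """Format a numeric target index as an ost_index label, e.g. 10 -> OST000a."""
    if not 0 <= index <= 0xFFFF:
        raise ValueError("index must fit into four hexadecimal digits")
    return f"OST{index:04x}"


def ost_number(label):
    """Parse an ost_index label back into its numeric index."""
    if not label.startswith("OST"):
        raise ValueError(f"not an OST label: {label!r}")
    return int(label[3:], 16)


# The 48 labels that appear in the tables above.
labels = [ost_label(i) for i in range(48)]
```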