HPE Hawk Hardware and Architecture: Difference between revisions

Latest revision as of 13:42, 18 June 2024

Node/Processor

Hawk compute nodes

CPU type: AMD EPYC 7742
2 CPUs / node
64 cores / CPU
CPU frequency: 2.25 GHz
Memory: 256 GB / node
5632 nodes (until 2024-06-18), 4096 nodes (from 2024-06-18)
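
The per-node figures above can be combined into aggregate numbers. The following is a minimal sketch, assuming the 256 GB figure refers to main memory per node; all constant and function names are illustrative and not part of any HLRS tooling.

```python
# Figures copied from the compute node list above.
CPUS_PER_NODE = 2
CORES_PER_CPU = 64
MEMORY_PER_NODE_GB = 256  # assumption: main memory per node
NODE_COUNTS = {"until 2024-06-18": 5632, "from 2024-06-18": 4096}


def cores_per_node() -> int:
    """Physical cores available on one compute node."""
    return CPUS_PER_NODE * CORES_PER_CPU


def memory_per_core_gb() -> float:
    """Memory share per core if all cores of a node are used."""
    return MEMORY_PER_NODE_GB / cores_per_node()


def total_cores(nodes: int) -> int:
    """Physical cores across the given number of nodes."""
    return nodes * cores_per_node()


if __name__ == "__main__":
    print(f"cores per node: {cores_per_node()}")
    print(f"memory per core: {memory_per_core_gb()} GB")
    for period, nodes in NODE_COUNTS.items():
        print(f"{period}: {total_cores(nodes)} cores on {nodes} nodes")
```

A job that uses every core of a node therefore has roughly 2 GB of memory per core to work with, before accounting for the operating system.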

For details of the processor deployed in Hawk, please refer to these slides: https://kb.hlrs.de/platforms/upload/Processor.pdf

Pre- and post-processing

Within the HLRS simulation environment, special nodes for pre- and post-processing tasks are available. Such nodes can be requested via the batch system using the smp queue. The available nodes are:

    4 nodes: 2 TB memory, 2-socket AMD EPYC 7702 64-core, shared usage model
    1 node:  4 TB memory, 2-socket AMD EPYC 7702 64-core, shared usage model

More specialized nodes (e.g. graphics, vector, data analytics, ...) are available in the Vulcan cluster.
If you need such specialized nodes on the Vulcan cluster for pre- or post-processing within a project located on Hawk resources, please ask your project manager for access to Vulcan.

AI nodes

For AI compute jobs, a dedicated set of GPU nodes is available:

Number of nodes: 24
CPU type: 2-socket AMD EPYC 7702 64-core
Memory per node: 1 TB
GPUs per node: 8
GPU type: NVIDIA A100
GPU memory: 20 nodes with 40 GB, and 4 nodes with 80 GB
Node-to-node interconnect: dual-rail InfiniBand HDR200
Local disk capacity per node: 15 TB

These nodes can be used via the batch system by requesting one of the special node_type values (rome-ai, nv-a100-40gb, nv-a100-80gb).
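
To get a feel for the size of the AI partition, the list above can be turned into totals. This is a rough sketch, assuming the 40 GB and 80 GB figures are per GPU (the two memory sizes in which the A100 is sold); the names used are illustrative only.

```python
# Figures copied from the AI node list above.
GPUS_PER_NODE = 8
# Assumption: key = memory per GPU in GB, value = number of such nodes.
NODES_BY_GPU_MEMORY_GB = {40: 20, 80: 4}


def total_nodes() -> int:
    """Number of AI nodes across both GPU memory variants."""
    return sum(NODES_BY_GPU_MEMORY_GB.values())


def total_gpus() -> int:
    """Number of GPUs in the whole AI partition."""
    return total_nodes() * GPUS_PER_NODE


def gpu_memory_per_node_gb(memory_per_gpu_gb: int) -> int:
    """Combined GPU memory of one node of the given variant."""
    return memory_per_gpu_gb * GPUS_PER_NODE


def total_gpu_memory_gb() -> int:
    """Combined GPU memory of all AI nodes."""
    return sum(
        gpu_memory_per_node_gb(memory) * nodes
        for memory, nodes in NODES_BY_GPU_MEMORY_GB.items()
    )


if __name__ == "__main__":
    print(f"AI nodes: {total_nodes()}, GPUs: {total_gpus()}")
    print(f"GPU memory in total: {total_gpu_memory_gb()} GB")
```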

Interconnect

Hawk deploys an InfiniBand HDR based interconnect with an 8-dimensional hypercube topology; see https://kb.hlrs.de/platforms/upload/Interconnect.pdf for details of the topology. InfiniBand HDR has a bandwidth of 200 Gbit/s and an MPI latency of ~1.3 µs per link. The full bandwidth of 200 Gbit/s is available when communicating between the 16 nodes connected to the same vertex (node) of the hypercube (cf. the same slides). Within the hypercube, the higher the dimension, the less bandwidth is available.

Topology-aware scheduling is used to avoid major performance fluctuations. This means that, in regular operation, larger jobs can only be requested with defined node counts (64, 128, 256, 512, 1024). This restriction ensures optimal system utilization while exploiting the network topology. Jobs with fewer than 128 nodes are processed in a special partition. Jobs with 2048 nodes or more are processed at special times (called XXL days). Please ask if needed.
For further details, please refer to the interconnect slides referenced above.
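
The interconnect figures above lend themselves to some back-of-the-envelope arithmetic. The sketch below makes several assumptions that the page does not state: every hypercube vertex connects exactly 16 nodes, a message crosses a single link, and transfer time follows a simple latency-plus-bandwidth model that ignores the reduced bandwidth in higher dimensions. All function names are hypothetical.

```python
# Figures copied from the interconnect description above.
HYPERCUBE_DIMENSIONS = 8
NODES_PER_VERTEX = 16           # assumption: the same for every vertex
LINK_BANDWIDTH_BIT_PER_S = 200e9
LINK_LATENCY_S = 1.3e-6
REGULAR_JOB_SIZES = (64, 128, 256, 512, 1024)
XXL_THRESHOLD_NODES = 2048


def nodes_in_hypercube(dimensions: int = HYPERCUBE_DIMENSIONS) -> int:
    """A d-dimensional hypercube has 2**d vertices."""
    return (2 ** dimensions) * NODES_PER_VERTEX


def transfer_time_seconds(message_bytes: int) -> float:
    """Single-link estimate: latency plus size divided by bandwidth."""
    return LINK_LATENCY_S + message_bytes * 8 / LINK_BANDWIDTH_BIT_PER_S


def is_regular_job_size(nodes: int) -> bool:
    """True if the node count is one of the defined sizes."""
    return nodes in REGULAR_JOB_SIZES


def needs_xxl_day(nodes: int) -> bool:
    """True if the job is large enough to be run at special times."""
    return nodes >= XXL_THRESHOLD_NODES


if __name__ == "__main__":
    print(f"nodes reachable in the hypercube: {nodes_in_hypercube()}")
    one_mib = 1024 * 1024
    print(f"1 MiB over one link: {transfer_time_seconds(one_mib) * 1e6:.1f} us")
    for size in (100, 256, 2048):
        print(size, is_regular_job_size(size), needs_xxl_day(size))
```

Under these assumptions, 2^8 vertices with 16 nodes each give 4096 nodes, which is consistent with the compute node count listed from 2024-06-18. For small messages the latency term dominates, while for large messages the bandwidth term does.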

Filesystem

On Hawk, two different Lustre filesystems are available:

ws10:
- available storage capacity: 22 PB
- Lustre devices: 2 MDS, 4 MDT, 8 OSS, 48 OST
- performance: < 100 GiB/s
ws11:
- available storage capacity: 15 PB
- Lustre devices: 2 MDS, 2 MDT, 20 OSS, 40 OST
- performance: ~200 GiB/s
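
The two filesystems can be compared with a few derived numbers. The following sketch treats the quoted performance values as aggregate peak rates (an upper bound for ws10, an approximate value for ws11), so the resulting transfer times are best-case estimates only; the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LustreFilesystem:
    name: str
    capacity_pb: int
    oss: int                # object storage servers
    ost: int                # object storage targets
    peak_gib_per_s: float   # assumption: aggregate peak rate


# Figures copied from the filesystem list above.
WS10 = LustreFilesystem("ws10", 22, 8, 48, 100.0)
WS11 = LustreFilesystem("ws11", 15, 20, 40, 200.0)


def total_capacity_pb(filesystems) -> int:
    """Combined capacity of the given filesystems."""
    return sum(fs.capacity_pb for fs in filesystems)


def osts_per_oss(fs: LustreFilesystem) -> float:
    """Average number of storage targets served by one server."""
    return fs.ost / fs.oss


def best_case_transfer_seconds(fs: LustreFilesystem, size_gib: float) -> float:
    """Lower bound on the time to move size_gib at the peak rate."""
    return size_gib / fs.peak_gib_per_s


if __name__ == "__main__":
    print(f"total capacity: {total_capacity_pb([WS10, WS11])} PB")
    for fs in (WS10, WS11):
        seconds = best_case_transfer_seconds(fs, 1024)
        print(f"{fs.name}: {osts_per_oss(fs)} OST per OSS, "
              f"1 TiB in at least {seconds:.2f} s")
```

In practice, a single job rarely reaches the aggregate rate, since throughput also depends on striping, file sizes, and concurrent load from other users.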

Additionally, a central HOME and project file server is mounted on Hawk. Some special nodes have a local disk installed which can be used as localscratch.

@@ Line 1: / Line 1: @@
 === Node/Processor ===
+==== hawk compute nodes ====
+* CPU type: AMD EPYC 7742
+* 2 CPU's / node
+* 64 Cores / CPU
+* CPU frequency: 2.25 GHz
+* 256 GB / node
+* 5632 nodes (until 2024-06-18), 4096 nodes (from 2024-06-18)
-[https://kb.hlrs.de/platforms/upload/Processor.pdf Slides]
+With respect to details of the processor deployed in Hawk, please refer to this [https://kb.hlrs.de/platforms/upload/Processor.pdf Slides].
 ==== Pre- and post processing ====
-Within HLRS simulation environment special nodes for pre- and post processing tasks are available. Such nodes could be requested via the batch system (follow this link for more info).
+Within HLRS simulation environment special nodes for pre- and post processing tasks are available. Such nodes could be requested via the [[Batch_System_PBSPro_(Hawk) | batch systems]] using the smp queue.
 Available nodes are
-   table...
+nodes 2 TB Memory, 2 Socket AMD EPYC 7702 64-Core, shared usage model
-nodes 2 TB Memory 2 Socket AMD ...x TB local storage   shared usage model
+Node  4 TB Memory, 2 Socket AMD EPYC 7702 64-Core, shared usage model
-Node  4 TB Memory 2 Socket AMD    x TB local storage   shared usage model
-more specialized nodes e.g. graphics, vector, DataAnalytics, ... are available in the [[NEC_Cluster_Hardware_and_Architecture_(vulcan)|Vulcan cluster]]
+more specialized nodes e.g. graphics, vector, DataAnalytics, ... are available in the [[NEC_Cluster_Hardware_and_Architecture_(vulcan)|Vulcan cluster]].<BR>
+If you need such specialized nodes on vulcan cluster for pre- or postprocessing inside your project located on hawk resources, please ask your project manager for access to vulcan.
+==== AI nodes ====
+For AI compute jobs there is a special part of nodes with GPU's available:
+* number nodes: 24
+* CPU TYPE: 2 Socket AMD EPYC 7702 64-Core
+* memory per node: 1TB
+* GPU's per node: 8
+* GPU type: NVIDIA A100
+* GPU memory: 20 nodes with 40 GB, and 4 nodes with 80 GB
+* Node to node interconnect: Dual Rail InfiniBand HDR200
+* local disk capacity per node: 15TB
+Nodes can be used via the [[Batch_System_PBSPro_(Hawk) | batch systems]] by requesting the special node_type's (rome-ai, nv-a100-40gb, nv-a100-80gb).
+see also [[Big_Data,_AI_Aplications_and_Frameworks]]
 <br>
 === Interconnect ===
-Hawk deploys an Infiniband HDR based interconnect with a 9-dimensional enhanced hypercube topology. Please refer to [https://kb.hlrs.de/platforms/upload/Interconnect_topology.pdf here] with respect to the latter. Infiniband HDR has a bandwidth of 200 Gbit/s and a MPI latency of ~1.3us per link. The full bandwidth of 200 Gbit/s can be used if communicating between the 16 nodes connected to the same node of the hypercube (cf. [https://kb.hlrs.de/platforms/upload/Interconnect_topology.pdf here]). Within the hypercube, the higher the dimension, the less bandwidth is available.
+Hawk deploys an Infiniband HDR based interconnect with a 8-dimensional hypercube topology. Please refer to [https://kb.hlrs.de/platforms/upload/Interconnect.pdf here] with respect to the latter. Infiniband HDR has a bandwidth of 200 Gbit/s and a MPI latency of ~1.3us per link. The full bandwidth of 200 Gbit/s can be used if communicating between the 16 nodes connected to the same node of the hypercube (cf. [https://kb.hlrs.de/platforms/upload/Interconnect.pdf here]). Within the hypercube, the higher the dimension, the less bandwidth is available.
-Topology aware scheduling is used to exclude major performance fluctuations. This means that larger jobs can only be requested with defined node numbers (64, 128, 256, 512, 1024, 2048 and 4096) in regular operation. This restriction ensures optimal system utilization while simultaneously exploiting the network topology. Jobs with a node number of < 128 nodes are processed in a special partition. Jobs over 4096 nodes are processed at special times.
+Topology aware scheduling is used to exclude major performance fluctuations. This means that larger jobs can only be requested with defined node numbers (64, 128, 256, 512, 1024) in regular operation. This restriction ensures optimal system utilization while simultaneously exploiting the network topology. Jobs with a node number of < 128 nodes are processed in a special partition. Jobs from 2048 nodes and more are processed at special times (called XXL days). Please ask if needed. <br>
+With respect to further details, please refer to the [https://kb.hlrs.de/platforms/upload/Interconnect.pdf Slides] already referenced above.
 <br>
 === Filesystem ===
+On hawk there are 2 different lustre filesystems available:
+* ws10:
+** availabe storage capacity: 22 PB
+** lustre devices: 2 MDS, 4 MDT, 8 OSS, 48 OST
+** performance: < 100 GiB/s
+* ws11:
+** available storage capacity: 15 PB
+** lustre devices: 2 MDS, 2 MDT, 20 OSS, 40 OST
+** performance: ~200 GiB/s
+Additional an central HOME and project fileserver is also mounted on hawk.
+Some special nodes have a local disk installed which can be uses as localscratch.
+See also [[Storage_(Hawk)| Storage (Hawk)]]
 <br>
