- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -
Big Data, AI Applications and Frameworks
This guide provides a technical description of the hardware and software environments for high-performance data analytics (HPDA) and AI applications.
Hardware and Storage Overview
AI and HPDA workflows can require local storage. However, most HPC nodes do not have a local drive; local storage is available only on the nodes listed below. Alternatively, you can use the RAM disk mounted at /run/user/${UID}. Note that /tmp is a minimal in-memory filesystem unless it is mounted on a local SSD. For more information on the HOME and SCRATCH directories, please refer to the dedicated documentation for Hawk and Vulcan.
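As a quick illustration, the following sketch stages input data into the per-user RAM disk; the paths and dataset name are placeholders, and the RAM disk shares the node's memory, so check the free space before copying large inputs.

```bash
# Stage input data into the per-user RAM disk (placeholder paths).
RAMDISK="/run/user/${UID}"
df -h "$RAMDISK"                    # check available space first
cp -r "$HOME/dataset" "$RAMDISK/"   # copy the input data into memory
export DATA_DIR="$RAMDISK/dataset"  # point the application at the fast copy
```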
Hawk
Hawk is primarily a CPU-based supercomputer, but its GPU partition is well suited to HPDA and AI applications.
The rome-ai partition contains 24 nodes and 192 GPUs in total. Resources per node:
- CPU: 2x AMD EPYC 7742
- GPU: 8x NVIDIA A100-SXM4
  - 20 nodes with the 40 GB version
  - 4 nodes with the 80 GB version
- RAM: 1 TB
- 15 TB local storage mounted at /localscratch (see the job script sketch below)
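A job script for this partition could look like the following sketch; the select/node_type syntax and the script contents are assumptions, so consult the Hawk batch system documentation for the exact resource names.

```bash
#!/bin/bash
# Sketch of a GPU job on the rome-ai partition. The node_type value and
# select syntax are assumptions; verify them in the Hawk documentation.
#PBS -N ai-training
#PBS -l select=1:node_type=rome-ai
#PBS -l walltime=01:00:00

# Stage input data onto the node-local SSD for fast I/O.
cp -r "$HOME/dataset" /localscratch/

python train.py --data /localscratch/dataset

# Copy results back to a persistent filesystem; node-local storage is
# typically cleaned up after the job ends.
cp -r /localscratch/results "$HOME/"
```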
The rome partition contains 4,096 nodes and 524,288 compute cores in total. Resources per node:
- CPU: 2x AMD EPYC 7742
- RAM: 256 GB
Vulcan
Vulcan has dedicated partitions to accelerate AI and HPDA workloads.
The genoa-a30 partition contains 16 nodes and 16 GPUs, available via the regular queue. Resources per node:
- CPU: 2x AMD EPYC 9124 (Genoa), 3.0 GHz
- GPU: 1x NVIDIA A30 (24 GB HBM2)
- RAM: 768 GB
clx-21 is an 8-node CPU-based partition with local storage. Resources per node:
- CPU: 2x Intel Xeon Gold 6230
- RAM: 384 GB
- 1.9 TB local storage mounted at /localscratch
rome256gc32c is a 3-node CPU-based partition with local storage. Resources per node:
- CPU: 2x AMD EPYC 7302
- RAM: 512 GB
- 3.5 TB local storage mounted at /localscratch (see the interactive job sketch below)
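For quick experiments on these CPU partitions, an interactive job can be requested as in the following sketch; the node_type value is an assumption based on the partition name above, so check the Vulcan batch system documentation.

```bash
# Sketch of an interactive job on a Vulcan partition with local storage.
# The node_type value is an assumption; verify it in the Vulcan docs.
qsub -I -l select=1:node_type=clx-21 -l walltime=00:30:00

# Once on the node, check the local scratch space:
df -h /localscratch
```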
Software
The only way to access the compute nodes is by using the batch system from the login nodes. For more information, please refer to the dedicated documentation for Hawk and Vulcan.
Conda
Only the main and r channels are available via the Conda module. If you require custom Conda packages, our guide explains how to transfer local Conda environments to the clusters. Additionally, the documentation demonstrates how to use the default Conda module to create Conda environments, as in the sketch below.
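A minimal sketch, assuming the module is named conda; the environment name and packages are placeholders.

```bash
# Create and activate an environment using the Conda module.
# The module name "conda" is an assumption; verify with `module avail`.
module load conda

# Only the main and r channels are reachable, so request packages from them.
conda create --name my-env --channel main python=3.11 numpy pandas
conda activate my-env
```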
Containers
For security reasons, only udocker is available, since it can execute containers without requiring sudo permissions or user namespace support. Our documentation contains a guide explaining AI containers on GPU-accelerated partitions; a short sketch follows below.
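As an illustration, the following sketch pulls and runs a container with udocker; the image name and paths are placeholders, and the guide mentioned above describes the recommended workflow in detail.

```bash
# Sketch of running a containerized AI workload with udocker.
# The image name and paths are illustrative placeholders.
udocker pull nvcr.io/nvidia/pytorch:24.01-py3
udocker create --name=pytorch-ai nvcr.io/nvidia/pytorch:24.01-py3
udocker setup --nvidia pytorch-ai   # enable GPU access inside the container
udocker run --volume="$HOME/project:/workspace" pytorch-ai \
    python /workspace/train.py
```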
Frameworks
You can install PyTorch and TensorFlow in a custom Conda environment or container. Template project repositories for widely recognized data processing and machine learning frameworks are available at https://code.hlrs.de under the SiVeGCS organization, illustrating their usage on the HLRS systems.
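After installation, a quick check such as the following sketch (assuming PyTorch) confirms that the framework sees the node's GPUs:

```bash
# Run inside your Conda environment or container; assumes PyTorch is installed.
python -c "import torch; print(torch.cuda.device_count(), 'GPU(s) visible')"
```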