- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

Big Data, AI Applications and Frameworks

This guide provides a technical description of the hardware and software environments for high-performance data analytics (HPDA) and AI applications.

Hardware and Storage Overview

AI and HPDA workflows often require fast local storage, but most HPC nodes have no local drive; local storage is available only on the nodes listed below. Alternatively, you can use the RAM disk mounted at /run/user/${UID}. For more information on the HOME and SCRATCH directories, please refer to the dedicated documentation for Hawk and Vulcan.

Warning: Make sure your application uses the correct paths for local files: unless it is mounted on a local SSD, /tmp is a minimal in-memory filesystem.
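
As a sketch of how to direct temporary files to the right place, the following job-script fragment prefers the node-local SSD and falls back to the per-user RAM disk; the script name and the --cache-dir option are hypothetical stand-ins for your own application.

  # Prefer the node-local SSD where available; fall back to the per-user RAM disk.
  if [ -d /localscratch ]; then
      export TMPDIR="/localscratch/${USER}"
  else
      export TMPDIR="/run/user/${UID}"
  fi
  mkdir -p "$TMPDIR"

  # Many tools (Python's tempfile, GNU sort, ...) honor TMPDIR automatically;
  # pass the path explicitly to applications that do not.
  python my_training_script.py --cache-dir "$TMPDIR"   # hypothetical application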

Hawk

Hawk is primarily a CPU-based supercomputer, but its GPU partition is well suited to HPDA and AI applications.

The rome-ai partition contains 24 nodes with 192 GPUs in total. Resources per node:

  • CPU: 2x AMD EPYC 7742
  • GPU: 8x NVIDIA A100-SXM4
    • 20 nodes with the 40 GB version
    • 4 nodes with the 80 GB version
  • RAM: 1 TB
  • 15 TB local storage mounted to /localscratch

The rome partition contains 5,632 nodes with 720,896 compute cores in total. Resources per node:

  • CPU: 2x AMD EPYC 7742
  • RAM: 256 GB

Vulcan

Vulcan has two dedicated partitions to accelerate AI and HPDA workloads.

The clx-ai partition contains 4 nodes with 32 GPUs in total. Resources per node:

  • CPU: 2x Intel Xeon Gold 6240
  • GPU: 8x NVIDIA V100-SXM2 32 GB
  • RAM: 768 GB
  • 7.3 TB local storage mounted to /localscratch

The clx-21 partition is an 8-node CPU-based partition with local storage. Resources per node:

  • CPU: 2x Intel Xeon Gold 6230
  • RAM: 384 GB
  • 1.9 TB local storage mounted to /localscratch

Software

Compute nodes can only be accessed through the batch system from the login nodes. For more information, please refer to the dedicated documentation for Hawk and Vulcan.
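
As an illustration, a minimal PBS job script could look like the sketch below. The node_type value and walltime are placeholders, and the exact resource names differ between the systems, so please check the batch documentation linked above before submitting with qsub.

  #!/bin/bash
  #PBS -N ai-job
  #PBS -l select=1:node_type=rome-ai   # placeholder; see the Hawk/Vulcan batch docs
  #PBS -l walltime=01:00:00

  cd "$PBS_O_WORKDIR"                  # directory the job was submitted from
  ./run_training.sh                    # hypothetical application launcher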

Warning: For security reasons, the compute nodes have no internet connectivity.
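
In practice this means that anything a job needs from the internet, such as datasets, model weights, or packages, must be staged from a login node beforehand. A minimal sketch, with an illustrative URL and paths:

  # On a login node (internet access available): stage the data once.
  wget -P "$HOME/datasets" https://example.org/dataset.tar.gz
  tar -xzf "$HOME/datasets/dataset.tar.gz" -C "$HOME/datasets"

  # In the job script (no internet): read only from pre-staged paths.
  python train.py --data-dir "$HOME/datasets"   # hypothetical script and option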

Conda

Only the main and r channels are available with the Conda module. If you require custom Conda packages, our guide explains how to transfer local Conda environments to the clusters. The same guide also demonstrates how to create Conda environments with the default Conda module.
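
For example, creating an environment with the default Conda module might look like the sketch below; the module name is an assumption, so check module avail for the exact name, and keep in mind that only packages from the main and r channels will resolve.

  module load conda                                 # module name may differ; see 'module avail'
  conda create -n my-env python=3.10 numpy pandas   # packages resolved from the main channel
  conda activate my-env                             # or 'source activate my-env', depending on the shell setup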

Containers

For security reasons, only udocker is available: it can execute containers without sudo permissions or user namespace support. Our documentation contains a guide explaining how to use AI containers on the GPU-accelerated partitions.
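
As a rough sketch of the udocker workflow (the image name and tag are illustrative; see the guide for the supported procedure):

  # On a login node (internet access required):
  udocker pull nvcr.io/nvidia/pytorch:24.01-py3   # illustrative image
  udocker create --name=pt nvcr.io/nvidia/pytorch:24.01-py3
  udocker setup --nvidia pt                       # make the host NVIDIA driver visible in the container

  # On a GPU node:
  udocker run pt python -c 'import torch; print(torch.cuda.is_available())'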

Frameworks

You can install PyTorch and TensorFlow in a custom Conda environment or container. Template project repositories are available at https://code.hlrs.de under the SiVeGCS organization for widely recognized data processing and machine learning frameworks, illustrating their usage on the HLRS systems.
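
For instance, after building a Conda environment with PyTorch locally and moving it to the cluster as described in the Conda section, a quick check on a GPU node could look like this (the environment name is illustrative):

  conda activate torch-env                        # hypothetical environment with PyTorch installed
  python -c 'import torch; print(torch.cuda.device_count())'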