
Big Data, AI Applications and Frameworks

This guide provides a technical description of the hardware and software environments for high-performance data analytics (HPDA) and AI applications.  


== Hardware and Storage Overview ==


AI and HPDA workflows can require local storage. However, HPC nodes usually do not have any local drive, except for particular nodes. Local storage is available only on the nodes mentioned below. You can also use the RAM disk mounted at <code>/run/user/${UID}</code>. For more information on the HOME and SCRATCH directories, please refer to the dedicated documentation for [[Storage_(Hawk)|Hawk]] and [[NEC_Cluster_Disk_Storage_(vulcan)|Vulcan]].

{{Warning|text=Ensure your application uses the correct paths for local files (cache, scratch). <code>/tmp</code> is usually a very small in-memory filesystem unless it is mounted as a local SSD.}}
 
=== Vulcan ===

Vulcan has two dedicated partitions to accelerate AI and HPDA workloads.

The <code>clx-ai</code> partition (tuned for AI) contains 4 nodes and 32 GPUs. Resources per node:
* CPU: 2x Intel Xeon Gold 6240
* GPU: 8x NVIDIA V100-SXM2 32 GB
* RAM: 768 GB
* 7.3 TB local storage mounted to <code>/localscratch</code>

<code>clx-21</code> (tuned for HPDA, no GPUs) is an 8-node CPU-based partition with local storage. Resources per node:
* CPU: 2x Intel Xeon Gold 6230
* RAM: 384 GB
* 1.9 TB local storage mounted to <code>/localscratch</code>

The Cray Urika-CS container can be executed on the <code>clx-ai</code> and <code>clx-21</code> nodes. For more information, please read the [[Urika_CS|corresponding Urika-CS page]].


=== Hawk ===


Hawk is primarily a CPU-based supercomputer, but its GPU partition is well suited for HPDA and AI applications.


The <code>rome-ai</code> partition (tuned for AI) contains 24 nodes and 192 GPUs. Resources per node:
* CPU: 2x AMD EPYC 7742
* GPU: 8x NVIDIA A100-SXM4
** 20 nodes with the 40 GB version
** 4 nodes with the 80 GB version
* RAM: 1 TB
* 15 TB local storage mounted to <code>/localscratch</code>


The <code>rome</code> partition contains 5,632 nodes and 720,896 compute cores in total. Resources per node:
* CPU: 2x AMD EPYC 7742
* RAM: 256 GB


== Python and Python packages ==

The three most popular ways to manage Python packages for the projects you work on are:
 
# Using Conda
# Using virtualenv
# Installing globally for your user with <code>pip install --user</code>
 
{{Note|text=Installing packages globally with pip is less reproducible and may cause trouble when working on several (sub-)projects.}}
 
=== Conda modules ===
 
Miniconda is available as a module and can be used with packages from the <code>main</code> and <code>r</code> channels.
 
Miniconda itself is distributed under the 3-clause BSD license, but it allows users to install third-party software with proprietary licenses. You will have to explicitly accept this license when using Miniconda for the first time. Please read the license carefully, as well as the third-party licenses mentioned there (including the NVIDIA cuDNN license).
 
First-time use is slightly different on Vulcan and Hawk (TL;DR: call <code>module load bigdata/conda/miniconda-4.10.3</code> and follow the instructions).
 
==== Vulcan ====
 
<source lang="bash">module load bigdata/conda/miniconda-4.10.3</source>
When the module is loaded, a pager program starts and displays the license terms. After reading them, exit the pager (by default by pressing <code>q</code>) and enter <code>yes</code> or <code>no</code> to accept or decline the license. After you accept the license, the module is loaded. The next time, the module loads without further actions.
 
==== Hawk ====
 
Similar to Vulcan, but with a separate module and an additional command at first usage:
 
<source lang="bash">module load bigdata/conda/miniconda-4.10.3-license
conda_license</source>
After accepting the license, use <code>module load bigdata/conda/miniconda-4.10.3</code> to load the module.
 
=== Conda environments ===
 
After the Conda module is loaded, you need to initialize Conda with <code>source activate [env-name]</code>. If you omit <code>env-name</code>, Conda activates the default (read-only) <code>base</code> environment with a minimal set of packages.
 
Use Conda as usual. Only <code>main</code> and <code>r</code> channels are available.
 
In environment files, you will need to delete the <code>channels:</code> section.
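For illustration, a minimal environment file without a <code>channels:</code> section, and the command to create an environment from it, might look like this (the environment name and package list are only placeholders):

<source lang="bash"># Write a minimal environment file without a channels: section
# (environment name and packages are placeholders)
cat > environment.yml <<'EOF'
name: my-analysis
dependencies:
  - python=3.9
  - pandas
EOF
conda env create -f environment.yml</source>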
 
Conda creates environments in <code>~/.conda/envs</code> and caches packages under <code>~/.conda/pkgs</code>. These folders can become quite big and exhaust your quota. The environment variables <code>CONDA_ENVS_PATH</code> and <code>CONDA_PKGS_DIRS</code> can be used to relocate them.
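For example, both variables can point to a workspace so that environments and the package cache are kept outside of your HOME directory (a minimal sketch; the workspace path is a placeholder):

<source lang="bash"># Keep Conda environments and the package cache outside of $HOME
# (the path below is a placeholder, adapt it to your own workspace)
export CONDA_ENVS_PATH=/path/to/your/workspace/conda/envs
export CONDA_PKGS_DIRS=/path/to/your/workspace/conda/pkgs
mkdir -p "$CONDA_ENVS_PATH" "$CONDA_PKGS_DIRS"</source>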
 
Here is a complete example:
 
<source lang="bash">module load bigdata/conda   # load the default Miniconda module
source activate             # initialize Conda (base environment)
conda env list              # list existing environments
conda create -n my-jupyter jupyter tensorflow-gpu pandas   # create a new environment
rm -r ~/.conda/pkgs         # delete the package cache to free quota
conda activate my-jupyter   # switch to the new environment</source>
Please note: Conda packages (e.g. TensorFlow) are compiled for a generic CPU.
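If you rely on GPU builds (such as <code>tensorflow-gpu</code> in the example above), it is worth verifying inside a GPU job that the framework actually sees the devices. A quick check, assuming a TensorFlow 2.x environment, is:

<source lang="bash"># Should list the GPUs visible to TensorFlow; an empty list means CPU-only
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"</source>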
 
=== virtualenv and pip ===
 
Packages installed with <code>pip</code> are often compiled during installation. <code>pip</code> can be used both with Conda and with virtualenv.
 
Here is an example of how to create a virtual environment (without Conda):
 
<source lang="bash">module load python/3.8 # Load required python module (you can also use the system one, but this is less reproducible)
mkdir ~/venvs # directory for your environments
python3 -m venv ~/venvs/myproject # create the environment
source ~/venvs/myproject/bin/activate # activate environment to use it
which python3 # verify that you are using your environment</source>
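Once the environment is activated, packages are installed with <code>pip</code> as usual; pinning them in a requirements file keeps the setup reproducible. The package names below are only examples, and downloading packages requires network access (see the offline approach in the next section):

<source lang="bash">pip install --upgrade pip       # update pip inside the environment
pip install numpy pandas        # example packages, adapt to your project
pip freeze > requirements.txt   # record exact versions for reproducibility</source>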
=== pip offline ===
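This section is still a stub. A common approach (a sketch, not an official HLRS procedure) is to download the packages on a machine with internet access and to install them offline on the cluster; the paths and the requirements file are placeholders:

<source lang="bash"># On a machine with internet access: download all dependencies
# (the packages must match the target platform and Python version)
pip download -r requirements.txt -d ./wheelhouse

# Copy the downloaded packages to the cluster
scp -r ./wheelhouse vulcan:/path/to/your/workspace/

# On the cluster, inside your virtual environment: install without network access
pip install --no-index --find-links /path/to/your/workspace/wheelhouse -r requirements.txt</source>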
 
 
=== Installing custom Conda packages ===
 
In the example below, we install <code>music21</code> and its dependencies from <code>conda-forge</code>.
 
Docker is used to quickly create a clean Conda setup, but you can also use your existing Conda installation; just clean the <code>CONDA_PKGS_DIRS</code> directory before you begin.
 
==== 1. Locally ====
 
<source lang="bash">mkdir -p ./pkgs
docker run -it --rm -v `pwd`/pkgs:/host conda/miniconda3
# List all dependencies
conda create --dry-run --json -n dummy -c conda-forge music21
# Download and install packages
conda create -n dummy -c conda-forge music21
# Pack the cache into one file
tar -czf /host/pkgs-forge.tgz -C /usr/local/ pkgs
exit</source>
Copy the archive to Vulcan or Hawk:
 
<source lang="bash">scp ./pkgs/pkgs-forge.tgz vulcan:/path/to/your/workspace/</source>
 
==== 2. Vulcan/Hawk ====
 
Create a temp dir and extract your packages there:
 
<source lang="bash">mkdir -p "/tmp/${USER}-conda"   # temporary directory, matches CONDA_PKGS_DIRS below
cd "/tmp/${USER}-conda"
tar -xzf /path/to/your/workspace/pkgs-forge.tgz
cd -
 
module load bigdata/conda
source activate
# Conda uses this path for packages cache
export CONDA_PKGS_DIRS="/tmp/${USER}-conda/pkgs"
</source>
 
===== Install into an existing environment =====
 
<source lang="bash">conda env list
conda activate env-name
conda install music21 --offline
# Don't forget to clean up
unset CONDA_PKGS_DIRS
rm -rf "/tmp/${USER}-conda"
</source>
===== Or create a new environment =====
 
<source lang="bash">conda env list
conda create -n env-name music21 --offline
conda activate env-name
# Don't forget to clean up
unset CONDA_PKGS_DIRS
rm -rf "/tmp/${USER}-conda"
</source>
 
== Spark ==
 
=== Vulcan ===
 
==== Cray Urika-CS ====
 
Spark is deployed when you run Urika-CS in interactive mode (with <code>start_analytics</code>).
 
==== Bare-metal setup ====
 
Available on all compute nodes of Vulcan. '''This is a test installation, which has not been tuned yet.'''
 
To deploy Spark, run in your job script (or interactively):
 
<source lang="bash">module load bigdata/spark_cluster
init-spark</source>
This deploys the Spark master on the current node and Spark workers on the remaining nodes of the job. Spark is started in the background, but you will see its output in the console.
 
The <code>init-spark</code> script also creates the <code>$HOME/bigdata/$PBS_JOBID/</code> directory with configs and logs.
 
<code>/tmp/${USER}_spark</code> is used as <code>SPARK_WORKER_DIR</code> (local scratch for spark). '''Be aware:''' on most nodes <code>/tmp</code> is a ram-disk, and is quite small.
 
On <code>clx-21</code> nodes you must set <code>SPARK_WORKER_DIR</code> before running <code>init-spark</code>:
 
<source lang="bash">module load bigdata/spark_cluster
export SPARK_WORKER_DIR="/localscratch/${PBS_JOBID}/spark_worker_dir"
mkdir -p "$SPARK_WORKER_DIR"
init-spark</source>
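Once the cluster is up, applications can be submitted against the deployed master. The sketch below assumes that <code>spark-submit</code> is on your <code>PATH</code> after loading the module, that the master listens on the default port 7077, and that <code>my_app.py</code> is a placeholder for your own application:

<source lang="bash"># The master was started by init-spark on the current node
export SPARK_MASTER_URL="spark://$(hostname):7077"    # 7077 is the Spark default port
spark-submit --master "$SPARK_MASTER_URL" my_app.py   # my_app.py is a placeholder</source>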
== Containers ==
 
=== Singularity ===
 
Singularity was created as an HPC-aware container platform. For additional security, we run Singularity in rootless mode; in this mode, SIF images are extracted into a sandbox directory, which requires nodes with local storage. Make sure to set the <code>SINGULARITY_TMPDIR</code> and <code>SINGULARITY_CACHEDIR</code> environment variables and to create the corresponding directories on a local drive.
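A minimal setup inside a job on a node with local storage might look like this (the directory names are only examples; the job-specific <code>/localscratch/${PBS_JOBID}</code> directory is described below):

<source lang="bash"># Place Singularity's temporary and cache directories on the node-local drive
export SINGULARITY_TMPDIR="/localscratch/${PBS_JOBID}/singularity_tmp"
export SINGULARITY_CACHEDIR="/localscratch/${PBS_JOBID}/singularity_cache"
mkdir -p "$SINGULARITY_TMPDIR" "$SINGULARITY_CACHEDIR"</source>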
 
==== Vulcan ====
 
To use Singularity containers (e.g. for Cray Urika-CS), add <code>UNS=true</code> to your <code>qsub</code> selectors (see the example below). Currently, only a version of Singularity preconfigured for Cray Urika-CS is preinstalled. Nodes with Singularity support are <code>clx-21</code> and <code>clx-ai</code>. Local NVMe drives are mounted as <code>/localscratch</code> on these nodes.
 
Please create a working directory with <code>mkdir -p "/localscratch/${PBS_JOBID}"</code> to keep it consistent with Hawk (see below).


'''Be aware local storage is wiped after your job ends!'''
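For illustration, an interactive job on a Vulcan AI node with user namespaces enabled could be requested as follows; the <code>select</code> statement and walltime are only examples, so please check the batch system documentation for the exact syntax:

<source lang="bash"># Request one clx-ai node with user namespace support (UNS=true, see above);
# the node_type value and walltime are examples - check the batch system documentation
qsub -I -l select=1:node_type=clx-ai:UNS=true -l walltime=01:00:00

# Then, on the allocated node, create the job-specific working directory
mkdir -p "/localscratch/${PBS_JOBID}"</source>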


==== Hawk ====


Singularity containers on Hawk can only be executed on the AI nodes. To use Singularity containers add <code>UNS=true</code> to <code>qsub</code> selectors.


Nodes with Singularity support are the <code>rome-ai</code> nodes. Local NVMe drives are mounted as <code>/localscratch</code> on these nodes. Users have write permissions under <code>/localscratch/${PBS_JOBID}/</code>.

'''Be aware local storage is wiped after your job ends!'''


Singularity binaries will be available soon.


=== Docker ===


Docker is not supported. You can convert your image to a Singularity image, or alternatively try uDocker.
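For example, a Docker image can be converted into a Singularity image file (SIF) on a machine where both Docker and Singularity are available; the image names below are placeholders:

<source lang="bash"># Build a SIF image directly from a registry image...
singularity build my-image.sif docker://python:3.9-slim

# ...or from a locally saved Docker image archive
docker save my-local-image:latest -o my-local-image.tar
singularity build my-image.sif docker-archive://my-local-image.tar</source>

Afterwards, copy the resulting <code>.sif</code> file to the cluster, e.g. with <code>scp</code>.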


=== uDocker ===


Some users have reported successfully running their containers with uDocker.


uDocker is not yet preinstalled.
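Since it is not preinstalled, users who want to try it typically install udocker themselves, for example into a Python environment. A sketch (the image name is a placeholder, and the installation steps require network access):

<source lang="bash">pip install --user udocker                    # install udocker for your user
udocker pull python:3.9-slim                  # pull an example image
udocker create --name=mypy python:3.9-slim    # create a container from the image
udocker run mypy python3 --version            # run a command inside the container</source>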
== Software ==

The only way to access the compute nodes is by using the batch system from the login nodes. For more information, please refer to the dedicated documentation for [[Batch_System_PBSPro_(Hawk)|Hawk]] and [[Batch_System_PBSPro_(vulcan)|Vulcan]].

{{Warning|text=For security reasons, the compute nodes have no internet connectivity.}}

=== Conda ===

Only the <code>main</code> and <code>r</code> channels are available when using the Conda module. If you require custom Conda packages, [https://kb.hlrs.de/platforms/index.php/How_to_use_Conda_environments_on_the_clusters our guide] explains how to transfer local Conda environments to the clusters. Additionally, the documentation demonstrates the use of the default Conda module for creating Conda environments.

=== Containers ===

Only udocker is available, for security reasons: it can execute container runtimes without sudo permissions and without user namespace support. Our documentation contains [[How_to_use_AI_containers_on_GPU-accelerated_compute_partitions%3F|a guide]] explaining how to use AI containers on the GPU-accelerated compute partitions.

=== Frameworks ===

You can install PyTorch and TensorFlow in a custom Conda environment or container. Template project repositories for widely recognized data processing and machine learning frameworks are available at https://code.hlrs.de under [https://code.hlrs.de/SiVeGCS the SiVeGCS organization], illustrating their usage on the HLRS systems.
