- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

Big Data, AI Applications and Frameworks

From HLRS Platforms
Revision as of 11:40, 4 November 2022 by Hpcchris (talk | contribs) (Minor text improvements)
Note: This page is being actively edited. Please don't link to its sections yet, as the content structure may change.


Hardware overview

AI and Big Data (HPDA) workflows often require local storage. However, HPC nodes usually do not have any local drive, so special nodes with local storage are provided; they are listed below. Make sure that your application uses the correct paths for local files (cache, scratch).

Warning: /tmp (unless mounted as local SSD) is usually a very small in-memory filesystem.


Vulcan

The following nodes are to be used for AI and HPDA jobs:

  • clx-21 - tuned for HPDA, no GPUs
  • clx-ai - tuned for AI, 8 × V100 GPUs per node

The Cray Urika-CS container can be executed on clx-ai and clx-21 nodes. For more information, please read the corresponding Urika-CS page.

Hawk

The following nodes can be used for AI jobs:

  • hawk-ai - tuned for AI, 8 × A100 GPUs per node

Python and Python packages

The three most popular ways to manage Python packages for the projects you work on are:

  1. Using Conda
  2. Using Virtualenv
  3. Installing globally (for user) with pip install --user
Note: Using pip install --user is less reproducible and may cause trouble when you work on several (sub-)projects.
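For completeness, a minimal sketch of the third option; pandas is only a placeholder package name here:

```shell
# Install a package into your user site-packages (~/.local) -- no admin rights needed.
# "pandas" is just an example package.
python3 -m pip install --user pandas
# List the packages installed in your user site-packages
python3 -m pip list --user
```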


Conda modules

Miniconda is available as a module and can be used with packages from the main and r channels.

Miniconda itself is distributed under the 3-clause BSD License, but it allows users to install third-party software with proprietary licenses. You will have to accept this license explicitly the first time you use Miniconda. Please read the license carefully, including the third-party licenses mentioned there (such as the Nvidia cuDNN license).

First-time use differs slightly between Vulcan and Hawk. (TL;DR: call module load bigdata/conda/miniconda-4.10.3 and follow the instructions.)

Vulcan

module load bigdata/conda/miniconda-4.10.3

When the module is loaded for the first time, a pager program displays the license terms. After reading them, exit the pager (by default, press q) and enter yes or no to accept or decline the license. Once you accept the license, the module is loaded. On subsequent loads, no further action is required.

Hawk

The procedure is similar to Vulcan, but uses a separate module and command on first use:

module load bigdata/conda/miniconda-4.10.3-license
conda_license

After accepting the license use module load bigdata/conda/miniconda-4.10.3 to load the module.

Conda environments

After the Conda module is loaded, initialize Conda with source activate [env-name]. If you omit env-name, Conda activates the default (read-only) base environment with a minimal set of packages.

Use Conda as usual. Only main and r channels are available.

If you use environment files, delete the channels: section from them.

Conda creates environments in ~/.conda/envs and caches packages under ~/.conda/pkgs. These folders can become quite big and exhaust your quota. The environment variables CONDA_ENVS_PATH and CONDA_PKGS_DIRS can be used to relocate them.
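For example, to keep environments and the package cache in a workspace instead of your home directory (the path below is a placeholder for your own workspace), you could set:

```shell
# Placeholder paths -- point these at your own workspace
export CONDA_ENVS_PATH=/path/to/your/workspace/conda/envs
export CONDA_PKGS_DIRS=/path/to/your/workspace/conda/pkgs
```

Set these before creating environments; Conda picks them up for all subsequent commands.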

Here is a typical example:

module load bigdata/conda
source activate
conda env list
conda create -n my-jupyter jupyter tensorflow-gpu pandas
rm -r ~/.conda/pkgs # delete cache
conda activate my-jupyter

Please note: Conda packages (e.g. TensorFlow) are compiled for a generic CPU.

virtualenv and pip

Packages installed with pip are often compiled during installation. pip can be used both with Conda and with virtualenv.

Here is an example of how to create a virtual environment (without Conda):

module load python/3.8 # Load required python module (you can also use the system one, but this is less reproducible)
mkdir ~/venvs # directory for your environments
python3 -m venv ~/venvs/myproject # create the environment
source ~/venvs/myproject/bin/activate # activate environment to use it
which python3 # verify that you are using your environment
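Once the environment is active, packages install into it with pip; recording the exact versions keeps the setup reproducible. A short sketch (numpy and pandas are just example packages):

```shell
# Inside the activated environment:
pip install --upgrade pip          # upgrade pip within the venv
pip install numpy pandas           # install example packages
pip freeze > requirements.txt      # record exact versions for reproducibility
# Later, or in a fresh environment:
# pip install -r requirements.txt
```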

pip offline

Installing custom Conda packages

In the example below we install music21 and its dependencies from conda-forge.

Docker is used to quickly create a clean Conda setup, but you can also use your existing Conda installation; just clean the CONDA_PKGS_DIRS directory before you begin.

1. Locally

mkdir -p ./pkgs
docker run -it --rm -v `pwd`/pkgs:/host conda/miniconda3
# All following commands run inside the container
# List all dependencies
conda create --dry-run --json -n dummy -c conda-forge music21
# Download and install packages
conda create -n dummy -c conda-forge music21
# Pack the cache into one file
tar -czf /host/pkgs-forge.tgz -C /usr/local/ pkgs
exit

Copy the archive to Vulcan/Hawk

scp ./pkgs/pkgs-forge.tgz vulcan:/path/to/your/workspace/

2. Vulcan/Hawk

Create a temp dir and extract your packages there:

mkdir -p /tmp/${USER}-conda
cd /tmp/${USER}-conda
tar -xzf /path/to/your/workspace/pkgs-forge.tgz
cd -

module load bigdata/conda
source activate
# Conda uses this path as its package cache
export CONDA_PKGS_DIRS="/tmp/${USER}-conda/pkgs"

# Option A: install into an existing environment
conda env list
conda activate env-name
conda install music21 --offline

# Option B: create a new environment
conda env list
conda create -n env-name music21 --offline
conda activate env-name

# In either case, don't forget to clean up
unset CONDA_PKGS_DIRS
rm -rf "/tmp/${USER}-conda"

Spark

Vulcan

Cray Urika-CS

Spark is deployed when you run Urika-CS in interactive mode (with start_analytics).

Bare-metal setup

Spark is available on all compute nodes at Vulcan. This is a test installation, which has not been tuned yet.

To deploy Spark, run in your job script (or interactively):

module load bigdata/spark_cluster
init-spark

This deploys the Spark master on the current node and Spark workers on the remaining nodes of the job. Spark is started in the background, but its output appears in the console.

The init-spark script also creates the directory $HOME/bigdata/$PBS_JOBID/ with configs and logs.

/tmp/${USER}_spark is used as SPARK_WORKER_DIR (local scratch for Spark). Be aware: on most nodes, /tmp is a RAM disk and quite small.

On clx-21 nodes you must set SPARK_WORKER_DIR before running init-spark:

module load bigdata/spark_cluster
export SPARK_WORKER_DIR="/localscratch/${PBS_JOBID}/spark_worker_dir"
mkdir -p "$SPARK_WORKER_DIR"
init-spark
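As a sketch of how the deployed cluster might then be used — this assumes spark-submit is on your PATH after loading the module and that the master listens on Spark's default port 7077 on the current node; check the generated configs under $HOME/bigdata/$PBS_JOBID/ for the actual master URL:

```shell
# Hypothetical usage sketch: run the bundled SparkPi example on the cluster.
# Verify the real master URL in $HOME/bigdata/$PBS_JOBID/ before relying on this.
spark-submit \
  --master "spark://$(hostname):7077" \
  --class org.apache.spark.examples.SparkPi \
  "$SPARK_HOME"/examples/jars/spark-examples_*.jar 100
```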

Containers

Singularity

Singularity was created as an HPC-aware container platform. For additional security, we run Singularity in rootless mode. In this mode, SIF images are extracted into a sandbox directory, which requires nodes with local storage. Make sure to set the SINGULARITY_TMPDIR and SINGULARITY_CACHEDIR environment variables and create the corresponding directories on a local drive.
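On nodes where the local drive is mounted at /localscratch (see below), the setup might look like this; the subdirectory names are only a suggestion:

```shell
# Assumes a node whose local drive is mounted at /localscratch (see below);
# the singularity_tmp/singularity_cache names are arbitrary examples.
export SINGULARITY_TMPDIR="/localscratch/${PBS_JOBID}/singularity_tmp"
export SINGULARITY_CACHEDIR="/localscratch/${PBS_JOBID}/singularity_cache"
mkdir -p "$SINGULARITY_TMPDIR" "$SINGULARITY_CACHEDIR"
```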

Vulcan

To use Singularity containers (e.g. for Cray Urika-CS), add UNS=true to your qsub selectors. Currently, only a version of Singularity preconfigured for Cray Urika-CS is preinstalled. Nodes with Singularity support are clx-21 and clx-ai. Local NVMe drives are mounted as /localscratch on these nodes.

Please create a working directory with mkdir -p "/localscratch/${PBS_JOBID}" to keep your setup consistent with Hawk (see below).

Be aware local storage is wiped after your job ends!

Hawk

Singularity containers on Hawk can only be executed on the AI nodes. To use Singularity containers add UNS=true to qsub selectors.

Nodes with Singularity support are rome-ai. Local NVMe drives are mounted as /localscratch on these nodes. Users have write permissions under /localscratch/${PBS_JOBID}/.

Be aware local storage is wiped after your job ends!

Singularity binaries will be available soon.

Docker

Docker is not supported. You can convert your image to a Singularity image, or alternatively try uDocker.
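A typical conversion, run on a system where you can execute Singularity yourself (the image name is only an example):

```shell
# Build a Singularity SIF image from a Docker Hub image.
# "ubuntu:22.04" is just an example; substitute your own image,
# then copy the resulting .sif file to your workspace on the cluster.
singularity build ubuntu.sif docker://ubuntu:22.04
```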

uDocker

Some users have reported successfully running their containers with uDocker.

uDocker is not yet preinstalled.