- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

How to use AI containers on GPU-accelerated compute partitions?


This guide shows you how to work with containers on the GPU-accelerated partitions of the HLRS compute platforms. For security reasons, namespace support is disabled and the compute nodes have no internet connectivity. Container runtimes (e.g., udocker) are executed without sudo permissions and therefore cannot mount container images; an image must be extracted first. This guide shows how to build a container locally, transfer it to the target system, and use suitable filesystems to run it there.

This guide assumes:

  • Docker is available on your local machine. If not, follow this guide to set up Docker.
  • You have access to the Vulcan or Hawk cluster.
  • You have a workspace configured in the Lustre filesystem (if not, see the example below this list).
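
If you still need a workspace, one can usually be allocated with the HLRS workspace tools; a minimal sketch, assuming the workspace name my_workspace and a 30-day lifetime (adjust both to your needs):

ws_allocate my_workspace 30 # allocate a Lustre workspace for 30 days
ws_find my_workspace # print the absolute path of the workspace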

Set up a container

On your local machine, create a Dockerfile. Here, we use an NVIDIA base image, install Python, and add some additional libraries.

  • Warning: Please read the license and third-party licenses carefully (including the NVIDIA CUDA and cuDNN licenses).
  • Warning: Check carefully for compatible versions of PyTorch, TensorFlow, Python, CUDA, and cuDNN, and adjust the base image or library versions accordingly.
  • Warning: Conda/pip downloads and installs precompiled binaries that match the architecture of the local environment, and might compile from source when necessary for the local architecture. Such packages may behave differently, or not run at all, on the target system.

Content of sample-tensorflow-container.dockerfile:

FROM nvidia/cuda:11.2.0-cudnn8-devel-rockylinux8
# Install Python 3.8 and clean up the package caches to keep the image small
RUN dnf install -y python38 \
    && dnf clean all \
    && rm -rf /var/cache/yum
# Install TensorFlow and common data science libraries
RUN pip3 install --no-cache-dir tensorflow \
    pandas \
    scikit-learn \
    matplotlib \
    jupyterlab

Build the container:

docker build --pull --rm -f "sample-tensorflow-container.dockerfile" -t sample-tensorflow-container:latest .
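
If your local machine's CPU architecture differs from the x86_64 compute nodes (for example, an Apple Silicon laptop), you may want to pin the target platform at build time; a sketch, assuming your Docker installation uses BuildKit:

docker build --platform linux/amd64 --pull --rm -f "sample-tensorflow-container.dockerfile" -t sample-tensorflow-container:latest .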

(Optional) Test the container:

docker run -it --rm sample-tensorflow-container # the following commands are executed inside the container shell
nvcc --version # show the CUDA version
cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2 # show the cuDNN version
pip3 list # list the installed Python packages
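
Still inside the container shell, a quick sanity check that TensorFlow imports correctly (no GPU is expected on your local machine):

python3 -c "import tensorflow as tf; print(tf.__version__)" # verify the TensorFlow installation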

Save the container and transfer the container archive to your workspace in the Lustre filesystem:

cd ~/Desktop
docker save --output sample-tensorflow-container.tar sample-tensorflow-container
scp sample-tensorflow-container.tar vulcan.hww.hlrs.de:/lustre/path/to/your/workspace/container_archives/sample-tensorflow-container.tar # Don't forget to adjust the host and Lustre workspace path
rm sample-tensorflow-container.tar # remove the local image archive
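
For large image archives, rsync can be used instead of scp, since it shows progress and can resume interrupted transfers; a sketch with the same placeholder paths:

rsync -av --progress sample-tensorflow-container.tar vulcan.hww.hlrs.de:/lustre/path/to/your/workspace/container_archives/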

Load the container image on the compute system

We use the Lustre filesystem to store the image layers because the layers can be quite large.

  • Warning: The Lustre filesystem is only used to store the image layers; do not extract or execute containers there. The examples below use the node-local storage to extract and execute the containers.

module load bigdata/udocker/1.3.4
export WS_DIR=/lustre/path/to/your/workspace # Don't forget to adjust Lustre workspace path
export UDOCKER_DIR="$WS_DIR/.udocker/"
udocker images -l
udocker rmi sample-tensorflow-container:latest # results in error if the image does not exist
udocker load -i $WS_DIR/container_archives/sample-tensorflow-container.tar sample-tensorflow-container
rm $WS_DIR/container_archives/sample-tensorflow-container.tar
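
You can verify that the image was imported into the udocker repository in your Lustre workspace:

udocker images # sample-tensorflow-container:latest should now be listed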

Interactive examples that use one GPU node

We use the node-local /localscratch filesystem to extract and execute the container.

  • Warning: We use the locally mounted /localscratch to extract and execute the containers by setting the UDOCKER_CONTAINERS environment variable.
  • Warning: Be aware that the node-local storage is wiped after your job ends!

Vulcan:

qsub -I -l select=1:node_type=clx-ai -l walltime=00:15:00
module load bigdata/udocker/1.3.4
ws_list # list the workspaces
export WS_DIR=/lustre/path/to/your/workspace # Don't forget to adjust workspace directory
export UDOCKER_DIR="$WS_DIR/.udocker/"
export UDOCKER_CONTAINERS=/localscratch/$PBS_JOBID/udocker/containers
mkdir -p $UDOCKER_CONTAINERS
udocker create --name=sample-tensorflow-container sample-tensorflow-container:latest

Hawk:

qsub -I -l select=1:node_type=rome-ai -l walltime=00:15:00
module load bigdata/udocker/1.3.4
ws_list # list the workspaces
export WS_DIR=/lustre/path/to/your/workspace # Don't forget to adjust workspace directory
export UDOCKER_DIR="$WS_DIR/.udocker/"
export UDOCKER_CONTAINERS=/localscratch/$PBS_JOBID/udocker/containers
mkdir -p $UDOCKER_CONTAINERS
udocker create --name=sample-tensorflow-container sample-tensorflow-container:latest
udocker setup --nvidia sample-tensorflow-container
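
On either system, you can check that the container was extracted into the node-local scratch before running it; a short verification sketch:

udocker ps # the container name should appear in the list
ls $UDOCKER_CONTAINERS # the extracted container resides on the node-local scratch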

Example: Run an IPython shell

Vulcan:

# Assuming $WS_DIR is set
udocker run --rm \
    --volume=/opt/system/nvidia/ALL.ALL.525.60.13/usr:/usr/local/nvidia \
    --volume=/opt/system/nvidia/ALL.ALL.525.60.13/usr/lib:/usr/local/nvidia/lib \
    --volume=/opt/system/nvidia/ALL.ALL.525.60.13/usr/lib64:/usr/local/nvidia/lib64 \
    --volume=$WS_DIR:/workspace --workdir=/workspace \
    sample-tensorflow-container ipython
import tensorflow as tf
physical_devices = tf.config.list_physical_devices('GPU') # might take some time
print(physical_devices)
!nvidia-smi

Hawk:

# Assuming $WS_DIR is set
udocker run --rm --volume=$WS_DIR:/workspace --workdir=/workspace sample-tensorflow-container ipython
import tensorflow as tf
physical_devices = tf.config.list_physical_devices('GPU') # might take some time
print(physical_devices)
!nvidia-smi
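
To confirm that TensorFlow actually executes work on the GPU, you can run a small computation in either IPython session; a minimal sketch:

with tf.device('/GPU:0'):
    x = tf.random.normal([1000, 1000])
    print(tf.reduce_sum(tf.matmul(x, x))) # completes on the GPU if one was listed above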

Clear localscratch before your job ends:

rm -r /localscratch/$PBS_JOBID/udocker