- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

How to launch a Ray cluster on HLRS compute platforms


This guide shows you how to launch a Ray cluster on HLRS compute platforms.

Using conda environments

On your local computer, create a new conda environment. Here, we use conda to install a particular Python version and create an isolated environment. Then we use pip to install the remaining packages, because the Ray documentation recommends installing Ray from the Python Package Index.

Warning: Read the best practices for using pip in a conda environment. For example, install packages with pip only after installing everything you can with conda.

conda create -n ray_env python=3.9 # Choose the Python version that best matches your requirements.
conda activate ray_env
pip3 install "ray[default]==2.2.0" # Choose the packages, options, and versions that best match your requirements.

Use conda-pack to transfer the environment to the HLRS compute platform.
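For example, a minimal sketch of this step (the user name, host name, and target directory below are placeholders to adjust; the archive name matches the one used in the job script further down):

conda install -c conda-forge conda-pack # Or: pip3 install conda-pack
conda pack -n ray_env -o ray_env.tar.gz # Pack the environment into a single relocatable archive.
scp ray_env.tar.gz <username>@<hlrs-platform>:/path/to/source_directory/ # Copy the archive next to your job script and Python code.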

Prepare scripts to launch a Ray cluster on the compute nodes

Let us estimate the value of π as an example application.

Prepare a bash script called ray-start-worker.sh to start the workers:

#!/bin/bash

if [ $# -ne 3 ]; then
    echo "Usage: $0 <env_archive> <ray_address> <redis_password>"
    exit 1
fi

export ENV_ARCHIVE=$1
export RAY_ADDRESS=$2
export REDIS_PASSWORD=$3

# printenv | grep 'ENV_ARCHIVE\|RAY_ADDRESS\|REDIS_PASSWORD' # uncomment this line for debugging

export ENV_PATH=/run/user/$PBS_JOBID/ray_env  # We use the ram disk to extract the environment packages since a large number of files decreases the performance of the parallel file system.

mkdir -p $ENV_PATH
tar -xzf $ENV_ARCHIVE -C $ENV_PATH
source $ENV_PATH/bin/activate
conda-unpack

ray start --address=$RAY_ADDRESS \
	--redis-password=$REDIS_PASSWORD \
	--block

rm -rf $ENV_PATH # It's nice to clean up before you terminate the job

Add execution permissions to ray-start-worker.sh:

chmod +x ray-start-worker.sh

Prepare a job script submit-ray-job.pbs to launch the Ray head node and the workers. Here, we set the number of CPUs on the head node to zero, following the best practice from the Ray documentation. To change the cluster size, you only need to modify the select statement in line 3 of the script.

#!/bin/bash
#PBS -N ray-test
#PBS -l select=3:node_type=clx-25
#PBS -l walltime=01:00:00

export SRC_DIR=/path/to/source_directory # Adjust the path. The source directory contains the environment archive, the source codes, and the bash script to launch ray workers.

export ENV_ARCHIVE=$SRC_DIR/ray_env.tar.gz
export PYTHON_FILE=$SRC_DIR/ray_monte_carlo_pi.py
export START_RAY_WORKER=$SRC_DIR/ray-start-worker.sh

export ENV_PATH=/run/user/$PBS_JOBID/ray_env  # We use the ram disk to extract the environment packages since a large number of files decreases the performance of the parallel file system.

mkdir -p $ENV_PATH

tar -xzf $ENV_ARCHIVE -C $ENV_PATH # This line extracts the packages to ram disk

source $ENV_PATH/bin/activate

conda-unpack

export IP_ADDRESS=$(ip addr show ib0 | grep -oP '(?<=inet\s)\d+(\.\d+){3}' | awk '{print $1}')

export RAY_ADDRESS=$IP_ADDRESS:6379
export REDIS_PASSWORD=$(openssl rand -base64 32)

ray start --disable-usage-stats \
	--head \
	--node-ip-address=$IP_ADDRESS \
	--port=6379 \
	--dashboard-host=127.0.0.1 \
	--redis-password=$REDIS_PASSWORD \
	--num-cpus=0 # We set the number of cpus to 0 on the head node following the Ray documentation best practice.

export NUM_NODES=$(sort $PBS_NODEFILE | uniq | wc -l)

for ((i=1;i<$NUM_NODES;i++)); do
	pbsdsh -n $i -- bash -l -c "'$START_RAY_WORKER' '$ENV_ARCHIVE' '$RAY_ADDRESS' '$REDIS_PASSWORD'" &
done

python3 $PYTHON_FILE

ray stop

rm -rf $ENV_PATH # It's nice to clean up before you terminate the job.

The Python code should wait for all the nodes in the cluster to become available:

import os
import time

import ray


def wait_for_nodes(expected_num_nodes: int):
    while True:
        num_nodes = len(ray.nodes())
        if num_nodes >= expected_num_nodes:
            break
        print(f'Currently {num_nodes} nodes connected. Waiting for more...')
        time.sleep(5)  # wait for 5 seconds before checking again

if __name__ == "__main__":

    num_nodes = int(os.environ["NUM_NODES"])
    assert num_nodes > 1, "If the environment variable NUM_NODES is set, it should be greater than 1."

    redis_password = os.environ["REDIS_PASSWORD"]
    ray.init(address="auto", _redis_password=redis_password)

    wait_for_nodes(num_nodes)

    cluster_resources = ray.available_resources()
    print(cluster_resources)
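The rest of ray_monte_carlo_pi.py then performs the actual π estimation. The following is only a minimal sketch of such a Monte Carlo computation (the function name, the number of tasks, and the samples per task are illustrative choices, not the exact code behind this guide). It needs import random next to the other imports; the remote function goes above the __main__ block, and the driver lines are appended to that block:

@ray.remote
def count_points_inside(num_samples: int) -> int:
    # Sample points uniformly in the unit square and count those
    # that fall inside the quarter circle of radius 1.
    inside = 0
    for _ in range(num_samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside

# ... inside the __main__ block, after wait_for_nodes(num_nodes):
num_tasks = 100                  # number of Ray tasks to launch across the workers
samples_per_task = 1_000_000     # random samples drawn by each task
counts = ray.get([count_points_inside.remote(samples_per_task) for _ in range(num_tasks)])
pi_estimate = 4 * sum(counts) / (num_tasks * samples_per_task)
print(f"Estimated value of pi: {pi_estimate:.6f}")

Once the environment archive, the Python file, and both scripts are in the source directory, submit the job from a login node:

qsub submit-ray-job.pbs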