- Infos im HLRS Wiki sind nicht rechtsverbindlich und ohne Gewähr -
- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -
How to launch a Ray cluster on HLRS compute platforms
This guide shows you how to launch a Ray cluster on HLRS compute platforms.
Using conda environments
On your local computer, create a new conda environment. Here, we use conda to install a particular Python version and create an isolated environment. Then we use pip to install the rest of the packages, because Ray Documentation recommends to install Ray from the Python Package Index.
Warning: Read the best practices for using pip in a conda environment. For example, use pip only after conda.
conda create -n ray_env python=3.9 # Choose the python version that best matches your requirements.
conda activate ray
pip3 install "ray[default]"==2.2.0 # Choose the packages, options, and versions that best match your requirements.
Use conda-pack to transfer the environment to HLRS compute platform.
Prepare scripts to launch a Ray cluster on the compute nodes
Let us estimate the value of π as an example application.
Prepare a bash script called ray-start-worker.sh
to start the workers:
#!/bin/bash
if [ $# -ne 3 ]; then
echo "Usage: $0 <env_archive> <ray_address> <redis_password>"
exit 1
fi
export ENV_ARCHIVE=$1
export RAY_ADDRESS=$2
export REDIS_PASSWORD=$3
# printenv | grep 'ENV_ARCHIVE\|RAY_ADDRESS\|REDIS_PASSWORD' # uncomment this line for debugging
export ENV_PATH=/run/user/$PBS_JOBID/ray_env # We use the ram disk to extract the environment packages since a large number of files decreases the performance of the parallel file system.
mkdir -p $ENV_PATH
tar -xzf $ENV_ARCHIVE -C $ENV_PATH
source $ENV_PATH/bin/activate
conda-unpack
ray start --address=$RAY_ADDRESS \
--redis-password=$REDIS_PASSWORD \
--block
rm -rf $ENV_PATH # It's nice to clean up before you terminate the job
Add execution permissions to ray-start-worker.sh
:
chmod +x ray-start-worker.sh
Prepare a job script submit-ray-job.pbs
to launch the head and worker nodes. Here, we set the number of CPUs on the head node to zero, following the best practice. You only need to modify the select statement in line 3 to change the cluster size.
#!/bin/bash
#PBS -N ray-test
#PBS -l select=3:node_type=clx-25
#PBS -l walltime=01:00:00
export SRC_DIR=/path/to/source_directory # Adjust the path. The source directory contains the environment archive, the source codes, and the bash script to launch ray workers.
export ENV_ARCHIVE=$SRC_DIR/ray_env.tar.gz
export PYTHON_FILE=$SRC_DIR/ray_monte_carlo_pi.py
export START_RAY_WORKER=$SRC_DIR/ray-start-worker.sh
export ENV_PATH=/run/user/$PBS_JOBID/ray_env # We use the ram disk to extract the environment packages since a large number of files decreases the performance of the parallel file system.
mkdir -p $ENV_PATH
tar -xzf $ENV_ARCHIVE -C $ENV_PATH # This line extracts the packages to ram disk
source $ENV_PATH/bin/activate
conda-unpack
export IP_ADDRESS=`ip addr show ib0 | grep -oP '(?<=inet\s)\d+(\.\d+){3}' | awk '{print $1}'`
export RAY_ADDRESS=$IP_ADDRESS:6379
export REDIS_PASSWORD=$(openssl rand -base64 32)
ray start --disable-usage-stats \
--head \
--node-ip-address=$IP_ADDRESS \
--port=6379 \
--dashboard-host=127.0.0.1 \
--redis-password=$REDIS_PASSWORD \
--num-cpus=0 # We set the number of cpus to 0 on the head node following the Ray documentation best practice.
export NUM_NODES=$(sort $PBS_NODEFILE |uniq | wc -l)
for ((i=1;i<$NUM_NODES;i++)); do
pbsdsh -n $i -- bash -l -c "'$START_RAY_WORKER' '$ENV_ARCHIVE' '$RAY_ADDRESS' '$REDIS_PASSWORD'" &
done
python3 $PYTHON_FILE
ray stop
rm -rf $ENV_PATH # It's nice to clean up before you terminate the job.
The Python code should wait for all the nodes in the cluster to become available:
def wait_for_nodes(expected_num_nodes: int):
while True:
num_nodes = len(ray.nodes())
if num_nodes >= expected_num_nodes:
break
print(f'Currently {num_nodes} nodes connected. Waiting for more...')
time.sleep(5) # wait for 5 seconds before checking again
if __name__ == "__main__":
num_nodes = int(os.environ["NUM_NODES"])
assert num_nodes > 1, "If the environment variable NUM_NODES is set, it should be greater than 1."
redis_password = os.environ["REDIS_PASSWORD"]
ray.init(address="auto", _redis_password=redis_password)
wait_for_nodes(num_nodes)
cluster_resources = ray.available_resources()
print(cluster_resources)