This guide shows you how to launch a Ray cluster on HLRS compute platforms.

Please follow [https://code.hlrs.de/SiVeGCS/ray_template the updated guide] at https://code.hlrs.de under [https://code.hlrs.de/SiVeGCS the SiVeGCS organization].

== Using conda environments ==

On your local computer, create a new conda environment. Here, we use conda to install a particular Python version and to create an isolated environment. We then use pip to install the remaining packages, because the [https://docs.ray.io/en/latest/ray-overview/installation.html#installing-from-conda-forge Ray documentation] recommends installing Ray from the Python Package Index.

'''Warning:''' Read the best practices for using [https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#pip-in-env pip in a conda environment]. For example, use pip only after conda.

<source lang="bash">
conda create -n ray_env python=3.9  # Choose the Python version that best matches your requirements.
conda activate ray_env
pip3 install "ray[default]"==2.2.0  # Choose the packages, options, and versions that best match your requirements.
</source>

Use [[How_to_move_local_conda_environments_to_the_clusters|conda-pack]] to transfer the environment to the HLRS compute platform.

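The linked page describes the details; for reference, a minimal sketch of the typical workflow looks like this (the archive name, username, host, and target path below are placeholders, adjust them to your setup):

<source lang="bash">
# On your local computer: install conda-pack and pack the environment into a single archive.
conda install -c conda-forge conda-pack
conda pack -n ray_env -o ray_env.tar.gz

# Copy the archive to the cluster (hostname and path are placeholders).
scp ray_env.tar.gz username@cluster.hlrs.de:/path/to/source_directory/
</source>
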
=== Prepare scripts to launch a Ray cluster on the compute nodes ===

Let us [https://docs.ray.io/en/latest/ray-core/examples/monte_carlo_pi.html estimate the value of π] as an example application.

Prepare a bash script called <code>ray-start-worker.sh</code> to start the workers:

<source lang="bash">
#!/bin/bash

if [ $# -ne 3 ]; then
    echo "Usage: $0 <env_archive> <ray_address> <redis_password>"
    exit 1
fi

export ENV_ARCHIVE=$1
export RAY_ADDRESS=$2
export REDIS_PASSWORD=$3

# printenv | grep 'ENV_ARCHIVE\|RAY_ADDRESS\|REDIS_PASSWORD'  # Uncomment this line for debugging.

export ENV_PATH=/run/user/$PBS_JOBID/ray_env  # We use the RAM disk to extract the environment packages, since a large number of files decreases the performance of the parallel file system.

mkdir -p $ENV_PATH
tar -xzf $ENV_ARCHIVE -C $ENV_PATH
source $ENV_PATH/bin/activate
conda-unpack

ray start --address=$RAY_ADDRESS \
    --redis-password=$REDIS_PASSWORD \
    --block

rm -rf $ENV_PATH  # It's nice to clean up before you terminate the job.
</source>

Add execution permissions to <code>ray-start-worker.sh</code>:

<source lang="bash">
chmod +x ray-start-worker.sh
</source>

Prepare a job script <code>submit-ray-job.pbs</code> to launch the head and worker nodes. Here, we set the number of CPUs on the head node to zero, following [https://docs.ray.io/en/latest/cluster/vms/user-guides/large-cluster-best-practices.html#vms-large-cluster-configure-head-node the best practice]. To change the cluster size, you only need to modify the <code>select</code> statement in line 3 of the script.

<source lang="bash">
#!/bin/bash
#PBS -N ray-test
#PBS -l select=3:node_type=clx-25
#PBS -l walltime=01:00:00

export SRC_DIR=/path/to/source_directory  # Adjust the path. The source directory contains the environment archive, the source code, and the bash script to launch the Ray workers.

export ENV_ARCHIVE=$SRC_DIR/ray_env.tar.gz
export PYTHON_FILE=$SRC_DIR/ray_monte_carlo_pi.py
export START_RAY_WORKER=$SRC_DIR/ray-start-worker.sh

export ENV_PATH=/run/user/$PBS_JOBID/ray_env  # We use the RAM disk to extract the environment packages, since a large number of files decreases the performance of the parallel file system.

mkdir -p $ENV_PATH
tar -xzf $ENV_ARCHIVE -C $ENV_PATH  # Extract the packages to the RAM disk.
source $ENV_PATH/bin/activate
conda-unpack

# Use the IP address of the InfiniBand interface (ib0) for the Ray head node.
export IP_ADDRESS=$(ip addr show ib0 | grep -oP '(?<=inet\s)\d+(\.\d+){3}' | awk '{print $1}')

export RAY_ADDRESS=$IP_ADDRESS:6379
export REDIS_PASSWORD=$(openssl rand -base64 32)

ray start --disable-usage-stats \
    --head \
    --node-ip-address=$IP_ADDRESS \
    --port=6379 \
    --dashboard-host=127.0.0.1 \
    --redis-password=$REDIS_PASSWORD \
    --num-cpus=0  # We set the number of CPUs to 0 on the head node, following the Ray documentation best practice.

export NUM_NODES=$(sort $PBS_NODEFILE | uniq | wc -l)

# Start a Ray worker on every node except the head node.
for ((i=1; i<$NUM_NODES; i++)); do
    pbsdsh -n $i -- bash -l -c "'$START_RAY_WORKER' '$ENV_ARCHIVE' '$RAY_ADDRESS' '$REDIS_PASSWORD'" &
done

python3 $PYTHON_FILE

ray stop

rm -rf $ENV_PATH  # It's nice to clean up before you terminate the job.
</source>

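Once the environment archive, the Python file, and both scripts are in the source directory, submit the job with the usual PBS commands (assuming the standard qsub workflow on the HLRS systems):

<source lang="bash">
qsub submit-ray-job.pbs   # Submit the job to the PBS scheduler.
qstat -u $USER            # Check the status of your jobs.
</source>
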
The Python code should wait for all the nodes in the cluster to become available:

<source lang="python">
import os
import time

import ray


def wait_for_nodes(expected_num_nodes: int):
    """Block until the expected number of nodes has joined the Ray cluster."""
    while True:
        num_nodes = len(ray.nodes())
        if num_nodes >= expected_num_nodes:
            break
        print(f'Currently {num_nodes} nodes connected. Waiting for more...')
        time.sleep(5)  # Wait for 5 seconds before checking again.


if __name__ == "__main__":
    num_nodes = int(os.environ["NUM_NODES"])
    assert num_nodes > 1, "If the environment variable NUM_NODES is set, it should be greater than 1."

    redis_password = os.environ["REDIS_PASSWORD"]
    ray.init(address="auto", _redis_password=redis_password)

    wait_for_nodes(num_nodes)

    cluster_resources = ray.available_resources()
    print(cluster_resources)
</source>
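
Once all nodes are connected, the actual workload can run on the cluster. The linked Ray documentation example shows the full Monte Carlo estimation of π; a condensed sketch of the idea (the function name and sample counts below are illustrative, not part of the scripts above) could look like this:

<source lang="python">
import random

import ray


@ray.remote
def count_hits(num_samples: int) -> int:
    """Count how many random points fall inside the unit quarter circle."""
    hits = 0
    for _ in range(num_samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(num_tasks: int = 100, samples_per_task: int = 1_000_000) -> float:
    # Launch the sampling tasks in parallel across the Ray cluster.
    futures = [count_hits.remote(samples_per_task) for _ in range(num_tasks)]
    total_hits = sum(ray.get(futures))
    return 4.0 * total_hits / (num_tasks * samples_per_task)
</source>

Such a function would be called from the <code>__main__</code> block after <code>wait_for_nodes</code> returns.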