- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

How to launch a Ray cluster on HLRS compute platforms

This guide shows you how to launch a Ray cluster on HLRS compute platforms.
'''Note:''' Please follow [https://code.hlrs.de/SiVeGCS/ray_template the updated guide] at https://code.hlrs.de under [https://code.hlrs.de/SiVeGCS the SiVeGCS organization].
 
== Using conda environments ==
 
On your local computer, create a new conda environment. Here, we use conda to install a particular Python version and to create an isolated environment. We then use pip to install the remaining packages, because the [https://docs.ray.io/en/latest/ray-overview/installation.html#installing-from-conda-forge Ray documentation] recommends installing Ray from the Python Package Index.
 
'''Warning:''' Read the best practices for using [https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#pip-in-env pip in a conda environment]. For example, use pip only after conda.
 
<source lang="bash">
conda create -n ray_env python=3.9 # Choose the python version that best matches your requirements.
conda activate ray_env
pip3 install "ray[default]"==2.2.0 # Choose the packages, options, and versions that best match your requirements.
</source>
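
Before packing the environment, you can verify that Ray was installed into it. A minimal check, assuming the environment name <code>ray_env</code> from the commands above:

<source lang="bash">
conda activate ray_env
python3 -c "import ray; print(ray.__version__)" # Should print the installed version, e.g. 2.2.0.
</source>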
 
Use [[How_to_move_local_conda_environments_to_the_clusters|conda-pack]] to transfer the environment to the HLRS compute platform.
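
The linked page describes the conda-pack workflow in detail. As a rough sketch, assuming the environment name <code>ray_env</code> and placeholder values for the user name, login node, and target directory:

<source lang="bash">
conda install -c conda-forge conda-pack # Install conda-pack if it is not available yet.
conda pack -n ray_env -o ray_env.tar.gz # Pack the environment into a single archive.
scp ray_env.tar.gz <username>@<login-node>:/path/to/source_directory/ # Copy the archive to the cluster; adjust host and path.
</source>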
 
=== Prepare scripts to launch a Ray cluster on the compute nodes ===
 
Let us [https://docs.ray.io/en/latest/ray-core/examples/monte_carlo_pi.html estimate the value of π] as an example application.
 
Prepare a bash script called <code>ray-start-worker.sh</code> to start the workers:
 
<source lang="bash">
#!/bin/bash
 
if [ $# -ne 3 ]; then
    echo "Usage: $0 <env_archive> <ray_address> <redis_password>"
    exit 1
fi
 
export ENV_ARCHIVE=$1
export RAY_ADDRESS=$2
export REDIS_PASSWORD=$3
 
# printenv | grep 'ENV_ARCHIVE\|RAY_ADDRESS\|REDIS_PASSWORD' # uncomment this line for debugging
 
export ENV_PATH=/run/user/$PBS_JOBID/ray_env  # We extract the environment into the RAM disk because a large number of small files degrades the performance of the parallel file system.
 
mkdir -p $ENV_PATH
tar -xzf $ENV_ARCHIVE -C $ENV_PATH
source $ENV_PATH/bin/activate
conda-unpack
 
ray start --address=$RAY_ADDRESS \
    --redis-password=$REDIS_PASSWORD \
    --block
 
rm -rf $ENV_PATH # Clean up the RAM disk before the job terminates.
</source>
 
Add execution permissions to <code>ray-start-worker.sh</code>:
 
<source lang="bash">
chmod +x ray-start-worker.sh
</source>
 
Prepare a job script <code>submit-ray-job.pbs</code> to launch the head and worker nodes. Following [https://docs.ray.io/en/latest/cluster/vms/user-guides/large-cluster-best-practices.html#vms-large-cluster-configure-head-node the best practice], we set the number of CPUs on the head node to zero. To change the cluster size, you only need to modify the select statement in line 3 of the script.
 
<source lang="bash">
#!/bin/bash
#PBS -N ray-test
#PBS -l select=3:node_type=clx-25
#PBS -l walltime=01:00:00
 
export SRC_DIR=/path/to/source_directory # Adjust the path. The source directory contains the environment archive, the source code, and the bash script that launches the Ray workers.
 
export ENV_ARCHIVE=$SRC_DIR/ray_env.tar.gz
export PYTHON_FILE=$SRC_DIR/ray_monte_carlo_pi.py
export START_RAY_WORKER=$SRC_DIR/ray-start-worker.sh
 
export ENV_PATH=/run/user/$PBS_JOBID/ray_env  # We extract the environment into the RAM disk because a large number of small files degrades the performance of the parallel file system.
 
mkdir -p $ENV_PATH
 
tar -xzf $ENV_ARCHIVE -C $ENV_PATH # Extract the environment packages to the RAM disk.
 
source $ENV_PATH/bin/activate
 
conda-unpack
 
export IP_ADDRESS=$(ip addr show ib0 | grep -oP '(?<=inet\s)\d+(\.\d+){3}' | awk '{print $1}') # IP address of the InfiniBand interface ib0
 
export RAY_ADDRESS=$IP_ADDRESS:6379
export REDIS_PASSWORD=$(openssl rand -base64 32)
 
ray start --disable-usage-stats \
    --head \
    --node-ip-address=$IP_ADDRESS \
    --port=6379 \
    --dashboard-host=127.0.0.1 \
    --redis-password=$REDIS_PASSWORD \
    --num-cpus=0 # We set the number of CPUs to 0 on the head node, following the Ray documentation best practice.
 
export NUM_NODES=$(sort $PBS_NODEFILE | uniq | wc -l)
 
for ((i=1;i<$NUM_NODES;i++)); do
    # Start a Ray worker on every node except the head node (index 0).
    pbsdsh -n $i -- bash -l -c "'$START_RAY_WORKER' '$ENV_ARCHIVE' '$RAY_ADDRESS' '$REDIS_PASSWORD'" &
done
 
python3 $PYTHON_FILE
 
ray stop
 
rm -rf $ENV_PATH # Clean up the RAM disk before the job terminates.
</source>
 
The Python code should wait for all the nodes in the cluster to become available:
 
<source lang="python">
import os
import time

import ray


def wait_for_nodes(expected_num_nodes: int):
    """Block until the expected number of nodes has joined the cluster."""
    while True:
        num_nodes = len(ray.nodes())
        if num_nodes >= expected_num_nodes:
            break
        print(f'Currently {num_nodes} nodes connected. Waiting for more...')
        time.sleep(5)  # wait for 5 seconds before checking again


if __name__ == "__main__":
    num_nodes = int(os.environ["NUM_NODES"])
    assert num_nodes > 1, "The environment variable NUM_NODES must be set and greater than 1."

    redis_password = os.environ["REDIS_PASSWORD"]
    ray.init(address="auto", _redis_password=redis_password)

    wait_for_nodes(num_nodes)

    cluster_resources = ray.available_resources()
    print(cluster_resources)
</source>
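
Finally, submit the job script with <code>qsub</code>. The commands below are a sketch; the output file name assumes the job name <code>ray-test</code> from the script above and the default PBS output naming (job name, then <code>.o</code>, then the job ID):

<source lang="bash">
qsub submit-ray-job.pbs # Submit the job to the scheduler.
qstat -u $USER # Check the status of your jobs.
cat ray-test.o<job_id> # Inspect the output after the job has finished; replace <job_id> with the actual job ID.
</source>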
