This guide shows you how to launch a Ray cluster on HLRS compute platforms.

Please follow [https://code.hlrs.de/SiVeGCS/ray_template the updated guide] at https://code.hlrs.de under [https://code.hlrs.de/SiVeGCS the SiVeGCS organization].

== Using conda environments ==

On your local computer, create a new conda environment. Here, we use conda to install a particular Python version and to create an isolated environment. We then use pip to install the remaining packages, because the [https://docs.ray.io/en/latest/ray-overview/installation.html#installing-from-conda-forge Ray documentation] recommends installing Ray from the Python Package Index.

'''Warning:''' Read the best practices for using [https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#pip-in-env pip in a conda environment]. For example, use pip only after conda.

<source lang="bash">
conda create -n ray_env python=3.9  # Choose the Python version that best matches your requirements.
conda activate ray_env
pip3 install "ray[default]"==2.2.0  # Choose the packages, options, and versions that best match your requirements.
</source>

Use [[How_to_move_local_conda_environments_to_the_clusters|conda-pack]] to transfer the environment to the HLRS compute platform.

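The linked page describes the details; for reference, a minimal sketch of the typical workflow looks like this (the archive name, username, host, and target path below are placeholders, adjust them to your setup):

<source lang="bash">
# On your local computer: install conda-pack and pack the environment into a single archive.
conda install -c conda-forge conda-pack
conda pack -n ray_env -o ray_env.tar.gz

# Copy the archive to the cluster (hostname and path are placeholders).
scp ray_env.tar.gz username@cluster.hlrs.de:/path/to/source_directory/
</source>
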
=== Prepare scripts to launch a Ray cluster on the compute nodes ===

Let us [https://docs.ray.io/en/latest/ray-core/examples/monte_carlo_pi.html estimate the value of π] as an example application.

Prepare a bash script called <code>ray-start-worker.sh</code> to start the workers:

<source lang="bash">
#!/bin/bash

if [ $# -ne 3 ]; then
    echo "Usage: $0 <env_archive> <ray_address> <redis_password>"
    exit 1
fi

export ENV_ARCHIVE=$1
export RAY_ADDRESS=$2
export REDIS_PASSWORD=$3

# printenv | grep 'ENV_ARCHIVE\|RAY_ADDRESS\|REDIS_PASSWORD'  # Uncomment this line for debugging.

export ENV_PATH=/run/user/$PBS_JOBID/ray_env  # We use the RAM disk to extract the environment packages, since a large number of files decreases the performance of the parallel file system.

mkdir -p $ENV_PATH
tar -xzf $ENV_ARCHIVE -C $ENV_PATH
source $ENV_PATH/bin/activate
conda-unpack

ray start --address=$RAY_ADDRESS \
    --redis-password=$REDIS_PASSWORD \
    --block

rm -rf $ENV_PATH  # It's nice to clean up before you terminate the job.
</source>

Add execution permissions to <code>ray-start-worker.sh</code>:

<source lang="bash">
chmod +x ray-start-worker.sh
</source>

Prepare a job script <code>submit-ray-job.pbs</code> to launch the head and worker nodes. Here, we set the number of CPUs on the head node to zero, following [https://docs.ray.io/en/latest/cluster/vms/user-guides/large-cluster-best-practices.html#vms-large-cluster-configure-head-node the best practice]. To change the cluster size, you only need to modify the <code>select</code> statement in line 3 of the script.

<source lang="bash">
#!/bin/bash
#PBS -N ray-test
#PBS -l select=3:node_type=clx-25
#PBS -l walltime=01:00:00

export SRC_DIR=/path/to/source_directory  # Adjust the path. The source directory contains the environment archive, the source code, and the bash script to launch the Ray workers.

export ENV_ARCHIVE=$SRC_DIR/ray_env.tar.gz
export PYTHON_FILE=$SRC_DIR/ray_monte_carlo_pi.py
export START_RAY_WORKER=$SRC_DIR/ray-start-worker.sh

export ENV_PATH=/run/user/$PBS_JOBID/ray_env  # We use the RAM disk to extract the environment packages, since a large number of files decreases the performance of the parallel file system.

mkdir -p $ENV_PATH
tar -xzf $ENV_ARCHIVE -C $ENV_PATH  # Extract the packages to the RAM disk.
source $ENV_PATH/bin/activate
conda-unpack

# Use the IP address of the InfiniBand interface (ib0) for the Ray head node.
export IP_ADDRESS=$(ip addr show ib0 | grep -oP '(?<=inet\s)\d+(\.\d+){3}' | awk '{print $1}')

export RAY_ADDRESS=$IP_ADDRESS:6379
export REDIS_PASSWORD=$(openssl rand -base64 32)

ray start --disable-usage-stats \
    --head \
    --node-ip-address=$IP_ADDRESS \
    --port=6379 \
    --dashboard-host=127.0.0.1 \
    --redis-password=$REDIS_PASSWORD \
    --num-cpus=0  # We set the number of CPUs to 0 on the head node, following the Ray documentation best practice.

export NUM_NODES=$(sort $PBS_NODEFILE | uniq | wc -l)

# Start a Ray worker on every node except the head node.
for ((i=1; i<$NUM_NODES; i++)); do
    pbsdsh -n $i -- bash -l -c "'$START_RAY_WORKER' '$ENV_ARCHIVE' '$RAY_ADDRESS' '$REDIS_PASSWORD'" &
done

python3 $PYTHON_FILE

ray stop

rm -rf $ENV_PATH  # It's nice to clean up before you terminate the job.
</source>

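Once the environment archive, the Python file, and both scripts are in the source directory, submit the job with the usual PBS commands (assuming the standard qsub workflow on the HLRS systems):

<source lang="bash">
qsub submit-ray-job.pbs   # Submit the job to the PBS scheduler.
qstat -u $USER            # Check the status of your jobs.
</source>
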
The Python code should wait for all the nodes in the cluster to become available:

<source lang="python">
import os
import time

import ray


def wait_for_nodes(expected_num_nodes: int):
    """Block until the expected number of nodes has joined the Ray cluster."""
    while True:
        num_nodes = len(ray.nodes())
        if num_nodes >= expected_num_nodes:
            break
        print(f'Currently {num_nodes} nodes connected. Waiting for more...')
        time.sleep(5)  # Wait for 5 seconds before checking again.


if __name__ == "__main__":
    num_nodes = int(os.environ["NUM_NODES"])
    assert num_nodes > 1, "If the environment variable NUM_NODES is set, it should be greater than 1."

    redis_password = os.environ["REDIS_PASSWORD"]
    ray.init(address="auto", _redis_password=redis_password)

    wait_for_nodes(num_nodes)

    cluster_resources = ray.available_resources()
    print(cluster_resources)
</source>
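
Once all nodes are connected, the actual workload can run on the cluster. The linked Ray documentation example shows the full Monte Carlo estimation of π; a condensed sketch of the idea (the function name and sample counts below are illustrative, not part of the scripts above) could look like this:

<source lang="python">
import random

import ray


@ray.remote
def count_hits(num_samples: int) -> int:
    """Count how many random points fall inside the unit quarter circle."""
    hits = 0
    for _ in range(num_samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(num_tasks: int = 100, samples_per_task: int = 1_000_000) -> float:
    # Launch the sampling tasks in parallel across the Ray cluster.
    futures = [count_hits.remote(samples_per_task) for _ in range(num_tasks)]
    total_hits = sum(ray.get(futures))
    return 4.0 * total_hits / (num_tasks * samples_per_task)
</source>

Such a function would be called from the <code>__main__</code> block after <code>wait_for_nodes</code> returns.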