Urika CS
Urika-CS nodes
Urika-CS containers can only be used on CS-Storm (clx-ai) and CS500 (clx-21) nodes.
To be able to start a Singularity container, the nodes have to be prepared before your batch job starts. You can do this with a resource request (UNS=True) at batch submission:
qsub -l select=1:node_type=clx-21:UNS=True,walltime=00:30:00 <mybatchjob>
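The same resource request can also be placed inside <mybatchjob> itself as PBS directives. A minimal sketch (node type and walltime here are only illustrative and have to be adapted to your job):
#!/bin/bash
#PBS -l select=1:node_type=clx-21:UNS=True
#PBS -l walltime=00:30:00
# ... the commands of your batch job follow here ...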
The module singularity/3.5.2-cs-patch has some hard-coded settings for the Urika-CS stack.
The Urika-CS User Manual for the installed version can be found here. The latest version will be installed soon.
The only module that needs to be loaded is bigdata/analytics - it will load all dependencies and wrappers.
Quick Start
Install cuDNN
Due to license limitations you have to install cuDNN yourself, provided you accept the cuDNN license.
- Download NVIDIA cuDNN for CUDA 10.0 (tgz version) from https://developer.nvidia.com/cudnn
- Check the license you accepted during download
- Check the license included in the downloaded file
- If you accept them, upload the archive to a login node
Extract the archive
mkdir -p ~/cudnn/7.6.5.32/cuda10.0
tar -xzf ~/cudnn-10.0-linux-x64-v7.6.5.32.tgz --strip-components=1 -C ~/cudnn/7.6.5.32/cuda10.0
Create a link to your cuDNN version
ln -s ~/cudnn/7.6.5.32/cuda10.0 ~/cudnn/default
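To verify the link and the unpacked libraries, a quick check like the following can be used (the exact file names depend on the cuDNN version you downloaded):
ls -l ~/cudnn/default
ls ~/cudnn/default/lib64/   # should list libcudnn.so and related libraries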
Reserve a node (interactive GPU job in this example)
qsub -I -l select=1:node_type=clx-ai:UNS=true -l walltime=01:00:00
Load environment
module load bigdata/analytics
Start interactive Urika-CS container (this will also deploy Spark)
start_analytics --cudnn-libs ~/cudnn/default/lib64/
# Activate Conda environment with Tensorflow
Singularity> source activate py37_tf_gpu
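To check that TensorFlow inside the container actually sees the GPUs, a quick test along the following lines can be run in the activated environment (the exact call depends on the TensorFlow version shipped in py37_tf_gpu; in TensorFlow 2.x, tf.config.list_physical_devices('GPU') can be used instead):
Singularity> python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"   # prints True if a GPU is usable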
Run Urika-CS container in batch mode
run_training --no-node-list --cudnn-libs ~/cudnn/default/lib64/ -n 1 -e py37_tf_gpu "python test.py"
run_training here has the following arguments:
- --no-node-list - do not pass the node list to CMD
- -n 1 - number of MPI processes; this should usually correspond to the number of reserved nodes (run_training executes: mpirun singularity ...)
- -e py37_tf_gpu - this conda environment will be activated in the container
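Putting the pieces together, a complete batch job could look like the following sketch. The node type, walltime, and the training script test.py are placeholders and have to be replaced with your own values:
#!/bin/bash
#PBS -l select=1:node_type=clx-ai:UNS=True
#PBS -l walltime=00:30:00

# change to the directory the job was submitted from
cd $PBS_O_WORKDIR

# load the Urika-CS analytics stack and its wrappers
module load bigdata/analytics

# run the training script inside the Urika-CS container
run_training --no-node-list --cudnn-libs ~/cudnn/default/lib64/ -n 1 -e py37_tf_gpu "python test.py"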