Julia
If you have questions regarding the use of Julia at HLRS, please get in touch with Michael Schlottke-Lakemper.
For users
Getting started
Create SSH SOCKS proxy to install packages
Follow the instructions here to set up the SSH SOCKS proxy that is required to install Julia packages. Then log in to Hawk; all following steps should be executed on a login node.
Load the Julia module
For now, just copy and paste the code found in the Module setup section below.
TODO: Replace this with an actual module command.
Install MPI.jl and CUDA.jl
To install MPI.jl, execute
julia -e 'using Pkg; Pkg.add("MPI")'
This will download and precompile MPI.jl and all of its dependencies. If the installation process gets stuck at a very early stage (e.g., after outputting only Updating registry at `~/.julia/registries/General.toml`), you have not properly set up your SOCKS proxy or forgot to set the appropriate environment variables.
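Since Pkg in Julia 1.6 and later downloads through libcurl, which honors the usual proxy environment variables, you can quickly check your proxy setup from within Julia. This is only a minimal sketch; the exact variable names and values depend on how you configured your SOCKS proxy:
# Print the proxy-related environment variables that Pkg/libcurl would use
for var in ("http_proxy", "https_proxy", "all_proxy")
    println(var, " = ", get(ENV, var, "<not set>"))
end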
You can check that MPI.jl was properly configured by executing
julia -e 'using MPI; println(MPI.identify_implementation())'
This should give you an output similar to
(MPI.OpenMPI, v"4.0.5")
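If the output instead reports a bundled MPI implementation (typically MPICH) rather than the system OpenMPI, MPI.jl has not picked up the system library. A possible fix, sketched here under the assumption that the environment from the Module setup section below is loaded, is to rebuild MPI.jl against the system MPI:
using Pkg
ENV["JULIA_MPI_BINARY"] = "system"  # also exported by the module setup below
Pkg.build("MPI"; verbose=true)      # relink MPI.jl against the loaded OpenMPI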
If you also want to use the GPUs with Julia, install CUDA.jl by executing
julia -e 'using Pkg; Pkg.add("CUDA")'
Note that you should not attempt to use or test CUDA.jl on the login nodes: they do not have GPUs, so CUDA is not available there and anything CUDA-related will fail.
Verify that MPI works
Start an interactive session on a compute node by executing
qsub -I -l select=1:node_type=rome:ncpus=128:mpiprocs=128 -l walltime=00:20:00
Once your interactive job has been allocated, run a simple test program from the shell by executing
mpirun -np 5 julia mpi_test.jl
The code for `mpi_test.jl` is as follows:
# mpi_test.jl
using MPI
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
size = MPI.Comm_size(comm)
dst = mod(rank+1, size)
src = mod(rank-1, size)
println("rank=$rank, size=$size, dst=$dst, src=$src")
# allocate memory
N = 4
send_mesg = Array{Float64}(undef, N)
recv_mesg = Array{Float64}(undef, N)
fill!(send_mesg, Float64(rank))
# pass buffers into MPI functions
MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
println("recv_mesg on proc $rank: $recv_mesg")
If everything is working OK, it should give you an output similar to
rank=0, size=5, dst=1, src=4
rank=1, size=5, dst=2, src=0
rank=2, size=5, dst=3, src=1
rank=3, size=5, dst=4, src=2
rank=4, size=5, dst=0, src=3
recv_mesg on proc 2: [1.0, 1.0, 1.0, 1.0]
recv_mesg on proc 1: [0.0, 0.0, 0.0, 0.0]
recv_mesg on proc 3: [2.0, 2.0, 2.0, 2.0]
recv_mesg on proc 0: [4.0, 4.0, 4.0, 4.0]
recv_mesg on proc 4: [3.0, 3.0, 3.0, 3.0]
Verify that CUDA works
To test CUDA, you need to leave your interactive session on a CPU node and get an interactive session on a GPU node by running
qsub -I -l select=1:node_type=nv-a100-40gb:mpiprocs=8 -l walltime=00:20:00
The first test is to check whether CUDA.jl can find all relevant drivers and GPUs. Start the Julia REPL by running julia. Then, execute
julia> using CUDA
julia> CUDA.versioninfo()
CUDA toolkit 11.4, local installation
NVIDIA driver 470.57.2, for CUDA 11.4
CUDA driver 11.4
Libraries:
- CUBLAS: 11.5.4
- CURAND: 10.2.5
- CUFFT: 10.5.1
- CUSOLVER: 11.2.0
- CUSPARSE: 11.6.0
- CUPTI: 14.0.0
- NVML: 11.0.0+470.57.2
- CUDNN: missing
- CUTENSOR: missing
Toolchain:
- Julia: 1.7.2
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80
Environment:
- JULIA_CUDA_USE_MEMORY_POOL: none
- JULIA_CUDA_USE_BINARYBUILDER: false
8 devices:
0: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
1: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
2: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
3: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
4: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
5: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
6: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
7: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
As you can see, all 8 NVIDIA A100 GPUs have been correctly detected.
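Besides CUDA.versioninfo(), a quick programmatic check can be handy, e.g., at the start of a job script. This is only a minimal sketch using CUDA.jl's CUDA.functional() and CUDA.devices():
julia> CUDA.functional()  # true if driver, toolkit, and at least one GPU are usable
true

julia> length(CUDA.devices())  # number of visible GPUs, 8 on a full AI node
8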
Next, we will test whether computing on the GPU is actually faster than on the CPU, to make sure that real computations work. For this, paste the following snippet into the Julia REPL; approximate timings are included as a reference:
A = rand(2000, 2000);
B = rand(2000, 2000);
@time A*B; # 1.296624 seconds (2.52 M allocations: 155.839 MiB, 23.66% gc time, 65.33% compilation time)
@time A*B; # 0.341631 seconds (2 allocations: 30.518 MiB)
Agpu = CuArray(A); # move matrix to gpu
Bgpu = CuArray(B); # move matrix to gpu
@time Agpu*Bgpu; # 1.544657 seconds (1.54 M allocations: 81.926 MiB, 2.16% gc time, 59.89% compilation time)
@time Agpu*Bgpu; # 0.000627 seconds (32 allocations: 640 bytes)
As you can see, the matrix-matrix multiplication on the GPU is much faster than on the CPU.
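Note that GPU operations in CUDA.jl are asynchronous, so @time on its own mainly measures the kernel launch rather than the full computation. For a fairer timing, synchronize before the timer stops; a small sketch (timings will differ slightly from the ones above):
@time CUDA.@sync Agpu*Bgpu;  # waits for the GPU to finish before @time stops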
For admins
Module setup
It would be great if we could have julia/1.7.2-style namespacing to support different Julia versions, where the default should be the latest stable version. The following commands should be executed when loading the Julia module:
# Julia-related settings
export JULIA_DEPOT_PATH="$HOME/.julia/$SITE_NAME/$SITE_PLATFORM_NAME"
# MPI-related settings
module load openmpi
export JULIA_MPI_BINARY=system
# CUDA-related settings
export CUDA_PATH=/usr/local/cuda
export JULIA_CUDA_USE_BINARYBUILDER=false
export JULIA_CUDA_USE_MEMORY_POOL=none
# Other settings
export UCX_WARN_UNUSED_ENV_VARS=n # suppress UCX warnings
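To verify that a future Julia module picks up these settings, a quick sanity check from within Julia could look like the following sketch (the expected values assume the exports above took effect):
using MPI
println(first(DEPOT_PATH))              # should point into $HOME/.julia/<site>/<platform>
println(MPI.identify_implementation())  # should report OpenMPI when JULIA_MPI_BINARY=system is honored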
MPT not working with Julia
Right now there seems to be an issue with using HPE's MPT as the MPI backend in MPI.jl, resulting in an abort when trying to run even a simple MPI program on a single rank (see also https://github.com/JuliaLang/julia/issues/44969). It would be great if this could be fixed. In the meantime, we can use OpenMPI (see module code above).
CUDA-aware MPI
At the moment, OpenMPI does not seem to support CUDA-aware MPI on the Hawk AI nodes. Instead, the execution crashes with a segmentation fault. To reproduce, log in to one of the AI nodes and execute
mpirun -np 5 julia cuda_mpi_test.jl
where cuda_mpi_test.jl is given as follows:
# cuda_mpi_test.jl
using MPI
using CUDA
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
size = MPI.Comm_size(comm)
dst = mod(rank+1, size)
src = mod(rank-1, size)
println("rank=$rank, size=$size, dst=$dst, src=$src")
# allocate memory on the GPU
N = 4
send_mesg = CuArray{Float64}(undef, N)
recv_mesg = CuArray{Float64}(undef, N)
fill!(send_mesg, Float64(rank))
# pass GPU buffers (CuArrays) into MPI functions
MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
println("recv_mesg on proc $rank: $recv_mesg")
This will crash with an error similar to this:
rank=4, size=5, dst=0, src=3
rank=0, size=5, dst=1, src=4
rank=2, size=5, dst=3, src=1
rank=1, size=5, dst=2, src=0
rank=3, size=5, dst=4, src=2
[hawk-ai01:263609:0:263609] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0xa02000000)
[hawk-ai01:263605:0:263605] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0xa02000000)
[hawk-ai01:263607:0:263607] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0xa02000000)
[hawk-ai01:263606:0:263606] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0xa02000000)
[hawk-ai01:263608:0:263608] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0xa02000000)
==== backtrace (tid: 263606) ====
0 0x00000000000532f9 ucs_debug_print_backtrace() ???:0
1 0x0000000000012b20 .annobin_sigaction.c() sigaction.c:0
2 0x000000000015dd3b __memcpy_avx_unaligned() :0
3 0x0000000000043f4f ucp_wireup_select_sockaddr_transport() ???:0
4 0x00000000000148c9 uct_mm_ep_am_bcopy() ???:0
5 0x0000000000043fcb ucp_wireup_select_sockaddr_transport() ???:0
6 0x000000000003a74a ucp_tag_send_nbr() ???:0
7 0x00000000001c7e4f mca_pml_ucx_send() ???:0
8 0x00000000000bba69 PMPI_Sendrecv() ???:0
9 0x00000000000c4e0a _jl_invoke() /buildworker/worker/package_linux64/build/src/gf.c:2247
10 0x00000000000e3e96 jl_apply() /buildworker/worker/package_linux64/build/src/julia.h:1788
11 0x00000000000e390e eval_value() /buildworker/worker/package_linux64/build/src/interpreter.c:215
12 0x00000000000e46d2 eval_stmt_value() /buildworker/worker/package_linux64/build/src/interpreter.c:166
13 0x00000000000e46d2 eval_stmt_value() /buildworker/worker/package_linux64/build/src/interpreter.c:167
14 0x00000000000e46d2 eval_body() /buildworker/worker/package_linux64/build/src/interpreter.c:587
15 0x00000000000e52f8 jl_interpret_toplevel_thunk() /buildworker/worker/package_linux64/build/src/interpreter.c:731
16 0x00000000001027a4 jl_toplevel_eval_flex() /buildworker/worker/package_linux64/build/src/toplevel.c:885
17 0x00000000001029e5 jl_toplevel_eval_flex() /buildworker/worker/package_linux64/build/src/toplevel.c:830
18 0x000000000010462a jl_toplevel_eval_in() /buildworker/worker/package_linux64/build/src/toplevel.c:944
19 0x000000000115a83b eval() ./boot.jl:373
20 0x000000000115a83b japi1_include_string_40536() ./loading.jl:1196
21 0x00000000000c4e0a _jl_invoke() /buildworker/worker/package_linux64/build/src/gf.c:2247
22 0x000000000124a35b japi1__include_32082() ./loading.jl:1253
23 0x0000000000d67c16 japi1_include_36299() ./Base.jl:418
24 0x00000000000c4e0a _jl_invoke() /buildworker/worker/package_linux64/build/src/gf.c:2247
25 0x00000000012d064c julia_exec_options_33549() ./client.jl:292
26 0x0000000000d8a0f8 julia__start_38731() ./client.jl:495
27 0x0000000000d8a269 jfptr__start_38732.clone_1() text:0
28 0x00000000000c4e0a _jl_invoke() /buildworker/worker/package_linux64/build/src/gf.c:2247
29 0x00000000001282d6 jl_apply() /buildworker/worker/package_linux64/build/src/julia.h:1788
30 0x0000000000128c7d jl_repl_entrypoint() /buildworker/worker/package_linux64/build/src/jlapi.c:701
31 0x00000000004007d9 main() /buildworker/worker/package_linux64/build/cli/loader_exe.c:42
32 0x00000000000237b3 __libc_start_main() ???:0
33 0x0000000000400809 _start() ???:0
=================================
signal (11): Segmentation fault
in expression starting at /zhome/academic/HLRS/hlrs/hpcschlo/cuda_mpi_test.jl:21
__memmove_avx_unaligned at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x147c167e3f4e)
uct_mm_ep_am_bcopy at /lib64/libuct.so.0 (unknown line)
unknown function (ip: 0x147c167e3fca)
ucp_tag_send_nbr at /lib64/libucp.so.0 (unknown line)
mca_pml_ucx_send at /opt/hlrs/non-spack/mpi/openmpi/4.0.5-gcc-9.2.0/lib/libmpi.so (unknown line)
PMPI_Sendrecv at /opt/hlrs/non-spack/mpi/openmpi/4.0.5-gcc-9.2.0/lib/libmpi.so (unknown line)
Sendrecv! at /zhome/academic/HLRS/hlrs/hpcschlo/.julia/HLRS/hawk/packages/MPI/08SPr/src/pointtopoint.jl:380 [inlined]
Sendrecv! at /zhome/academic/HLRS/hlrs/hpcschlo/.julia/HLRS/hawk/packages/MPI/08SPr/src/pointtopoint.jl:389
unknown function (ip: 0x147c1a1062fb)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1788 [inlined]
do_call at /buildworker/worker/package_linux64/build/src/interpreter.c:126
[...]
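When investigating this, it can help to first check whether the loaded OpenMPI build advertises CUDA support at all. Recent MPI.jl versions expose this via MPI.has_cuda(); a quick check, independent of the crash above:
using MPI
MPI.Init()
println(MPI.has_cuda())  # false if the MPI library was built without CUDA support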