- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

Julia: Difference between revisions

Revision as of 23:10, 13 April 2022

If you have questions regarding the use of Julia at HLRS, please get in touch with Michael Schlottke-Lakemper.

For users

Getting started

Create SSH SOCKS proxy to install packages

Follow the instructions here to be able to install Julia packages. Log in to Hawk. All following steps should be executed on a login node.

Load the Julia module

For now, copy and paste the commands from the Module setup section below.

TODO: Replace by actual module command.

Install MPI.jl and CUDA.jl

To install MPI.jl, execute

julia -e 'using Pkg; Pkg.add("MPI")'

This will download and precompile MPI.jl and all of its dependencies. If the installation procedure gets stuck at a very early stage (e.g., after outputting only Updating registry at `~/.julia/registries/General.toml`), you have either not set up your SOCKS proxy properly or forgotten to set the appropriate environment variables.
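
A quick way to rule out missing environment variables is to check which proxy variables are set in the current shell. The sketch below assumes the conventional variable names http_proxy, https_proxy and ALL_PROXY; the variables required by the HLRS proxy instructions may differ. The helper report_proxy_vars is hypothetical and not part of any HLRS tooling.

```shell
# Hypothetical helper: report which of the conventional proxy variables are set.
# The variable names are an assumption; adjust them to the HLRS proxy instructions.
report_proxy_vars() {
    missing=0
    for name in http_proxy https_proxy ALL_PROXY; do
        # Indirect lookup of the variable called $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -n "$value" ]; then
            echo "$name=$value"
        else
            echo "$name is not set"
            missing=1
        fi
    done
    # Return 0 if all variables are set, 1 otherwise.
    return "$missing"
}

report_proxy_vars || echo "At least one proxy variable is missing."
```

If a variable is reported as not set, revisit the proxy instructions before retrying the package installation.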

You can check that MPI.jl was configured properly by executing

julia -e 'using MPI; println(MPI.identify_implementation())'

This should give you an output similar to

(MPI.OpenMPI, v"4.0.5")

If you also want to use the GPUs with Julia, install CUDA.jl by executing

julia -e 'using Pkg; Pkg.add("CUDA")'

Note that you should not attempt to use or test CUDA.jl on the login nodes: they do not have GPUs, so CUDA is not available there and anything CUDA-related will fail.

Verify that MPI works

Start an interactive session on a compute node by executing

qsub -I -l select=1:node_type=rome:ncpus=128:mpiprocs=128 -l walltime=00:20:00

Once your interactive job has been allocated, run a simple test program from the shell by executing

mpirun -np 5 julia mpi_test.jl

The code for `mpi_test.jl` is as follows:

# mpi_test.jl
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
size = MPI.Comm_size(comm)

dst = mod(rank+1, size)
src = mod(rank-1, size)
println("rank=$rank, size=$size, dst=$dst, src=$src")

# allocate memory
N = 4
send_mesg = Array{Float64}(undef, N)
recv_mesg = Array{Float64}(undef, N)
fill!(send_mesg, Float64(rank))

# pass buffers into MPI functions
MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
println("recv_mesg on proc $rank: $recv_mesg")

If everything is working OK, it should give you an output similar to

rank=0, size=5, dst=1, src=4
rank=1, size=5, dst=2, src=0
rank=2, size=5, dst=3, src=1
rank=3, size=5, dst=4, src=2
rank=4, size=5, dst=0, src=3
recv_mesg on proc 2: [1.0, 1.0, 1.0, 1.0]
recv_mesg on proc 1: [0.0, 0.0, 0.0, 0.0]
recv_mesg on proc 3: [2.0, 2.0, 2.0, 2.0]
recv_mesg on proc 0: [4.0, 4.0, 4.0, 4.0]
recv_mesg on proc 4: [3.0, 3.0, 3.0, 3.0]
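
The output follows from the ring pattern in mpi_test.jl: each rank sends its own rank number to rank+1 and receives from rank-1, wrapping around at the ends. As a sketch, the same neighbour arithmetic can be reproduced in plain shell without MPI, which can help when checking the expected output for a different -np value. The function name ring_neighbours is made up for this illustration.

```shell
# Illustrative only: reproduce the dst/src arithmetic of mpi_test.jl without MPI.
ring_neighbours() {
    size=$1
    rank=0
    while [ "$rank" -lt "$size" ]; do
        dst=$(( (rank + 1) % size ))
        # Add size before taking the remainder: the shell % operator can return
        # negative values, whereas Julia's mod() with a positive divisor cannot.
        src=$(( (rank - 1 + size) % size ))
        echo "rank=$rank, size=$size, dst=$dst, src=$src"
        rank=$(( rank + 1 ))
    done
}

ring_neighbours 5
```

Each rank then receives a buffer filled with the rank number of its src, which is why rank 0 prints 4.0 and rank 1 prints 0.0. The order of the lines in the real MPI output may vary between runs.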


Verify that CUDA works

To test CUDA, you need to leave your interactive session on a CPU node and get an interactive session on a GPU node by running

qsub -I -l select=1:node_type=nv-a100-40gb:mpiprocs=8 -l walltime=00:20:00

The first test is to check whether CUDA.jl can find all relevant drivers and GPUs. Start the Julia REPL by running julia. Then, execute

julia> using CUDA

julia> CUDA.versioninfo()
CUDA toolkit 11.4, local installation
NVIDIA driver 470.57.2, for CUDA 11.4
CUDA driver 11.4

Libraries:
- CUBLAS: 11.5.4
- CURAND: 10.2.5
- CUFFT: 10.5.1
- CUSOLVER: 11.2.0
- CUSPARSE: 11.6.0
- CUPTI: 14.0.0
- NVML: 11.0.0+470.57.2
- CUDNN: missing
- CUTENSOR: missing

Toolchain:
- Julia: 1.7.2
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

Environment:
- JULIA_CUDA_USE_MEMORY_POOL: none
- JULIA_CUDA_USE_BINARYBUILDER: false

8 devices:
  0: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  1: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  2: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  3: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  4: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  5: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  6: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  7: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)

As you can see, all 8 NVIDIA A100 GPUs have been detected correctly.

Next, we will test if computing on the GPU is actually faster than on the CPU, to ensure that actual computations work. For this, paste the following snippet in the Julia REPL. Approximate timings are included as reference for you:

A = rand(2000, 2000);
B = rand(2000, 2000);
@time A*B; # 1.296624 seconds (2.52 M allocations: 155.839 MiB, 23.66% gc time, 65.33% compilation time)
@time A*B; # 0.341631 seconds (2 allocations: 30.518 MiB)

Agpu = CuArray(A); # move matrix to gpu
Bgpu = CuArray(B); # move matrix to gpu
@time Agpu*Bgpu; # 1.544657 seconds (1.54 M allocations: 81.926 MiB, 2.16% gc time, 59.89% compilation time)
@time Agpu*Bgpu; # 0.000627 seconds (32 allocations: 640 bytes)

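As you can see, the matrix-matrix multiplication on the GPU appears much faster than on the CPU. Note that CUDA.jl launches GPU operations asynchronously, so a plain @time on the GPU mainly measures the time needed to queue the operation; for a fair comparison, wrap the call in CUDA.@sync (or use CUDA.@time).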

For admins

Module setup

It would be great if the module used julia/1.7.2-style version namespacing to support different Julia versions, with the latest stable version as the default.

The following commands should be executed when loading the Julia module:

# Julia-related settings
export JULIA_DEPOT_PATH="$HOME/.julia/$SITE_NAME/$SITE_PLATFORM_NAME"

# MPI-related settings
module load openmpi
export JULIA_MPI_BINARY=system

# CUDA-related settings
export CUDA_PATH=/usr/local/cuda
export JULIA_CUDA_USE_BINARYBUILDER=false
export JULIA_CUDA_USE_MEMORY_POOL=none

# Other settings
export UCX_WARN_UNUSED_ENV_VARS=n # suppress UCX warnings
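
To illustrate the JULIA_DEPOT_PATH setting above: it gives each site and platform its own depot directory under ~/.julia, presumably so that packages installed and precompiled on one system are kept apart from those of another. The sketch below uses placeholder values for the site and platform names, since the actual values of SITE_NAME and SITE_PLATFORM_NAME on Hawk are not stated here. The function depot_path_for is hypothetical.

```shell
# Hypothetical helper: build the depot path from a site and a platform name.
depot_path_for() {
    # $1 = site name, $2 = platform name (placeholder inputs in this sketch)
    if [ -z "$1" ] || [ -z "$2" ]; then
        # An empty component would collapse the path to "$HOME/.julia//".
        echo "site and platform name must be non-empty" >&2
        return 1
    fi
    echo "$HOME/.julia/$1/$2"
}

depot_path_for "EXAMPLE_SITE" "example_platform"
```

The guard against empty names is a design choice of this sketch: if either variable were unset when the module is loaded, all platforms would end up sharing one depot directory.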