- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -
Hunter Porting Tutorial
How to start with porting a code for Hunter
Although Hunter is not available yet, you can begin porting your code using Hawk. Here’s how to get started.
Hunter will be equipped with AMD MI300A GPUs, which are sometimes referred to as APUs (Accelerated Processing Units). The MI300A is a new type of GPU with unified shared memory: the GPUs and the CPUs share the same HBM main memory, which removes one of the biggest chores of GPU programming: the need to copy data to GPU memory before processing, and back to CPU memory afterwards. More information about MI300A capabilities is available on the AMD website:
https://www.amd.com/en/products/accelerators/instinct/mi300/mi300a.html
Hunter will also use the Cray Programming Environment (CPE), like the old Hazel Hen system we had at HLRS. The default compilers will therefore be the Cray compilers, and the default MPI implementation will be Cray MPICH.
Complete documentation about CPE (including profiling tools, libraries, compiler flags, etc.) is available on the HPE website:
https://cpe.ext.hpe.com/docs/latest/
Disclaimer: Regarding Optimization and Performance
This tutorial's main objective is to help people with a CPU-only code start porting it to Hunter, so that they can work on the new machine as soon as possible. Therefore, we will not go into detail regarding performance or optimizing GPU vs. CPU usage and efficiency. Having most of the code ported to use the GPUs is a necessary step before starting the optimization process. We propose here a simple way to do so, one that gives good performance and does not require access to Hunter or another GPU machine.
First step: Toolchain migration
Hawk has a CPE environment installed. The first step in porting a code is to compile and run its current version on Hawk within the CPE environment. To do so, load it with:
module load CPE/23.12
Note that this will remove all your currently loaded modules and give access to a new set of available modules specific to CPE. By default, CPE uses PrgEnv-cray (the Cray compilers). Ideally, one should try to compile with those Cray compilers. If that is not possible, one can swap modules to use another PrgEnv-*** (gnu, for example). Regardless of the PrgEnv module loaded, CPE provides wrappers for compiler invocations. Makefiles/CMake files will have to be modified to use these wrappers:
CC for the C++ compiler
cc for the C compiler
ftn for the Fortran compiler
Also note that the Cray compilers use a different set of flags, so you might have to change some of them in your Makefiles.
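For example, a minimal Makefile fragment using the wrappers might look like the following sketch (the flags shown are illustrative, not Cray-specific recommendations):
# Use the CPE compiler wrappers instead of calling compilers directly
CC  = cc    # C compiler wrapper
CXX = CC    # C++ compiler wrapper
FC  = ftn   # Fortran compiler wrapper
CFLAGS = -O2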
Once the code compiles and runs, porting can start.
Optional (but recommended): Perftools-lite profile
Perftools is CPE's main profiling tool and will be the default profiling tool used on Hunter. Profiling before porting will help you identify the hotspots of your code, and therefore the first functions/loops to port.
module load perftools-lite-gpu
Then recompile your code to create an instrumented version of it, and run that version normally. The overhead should be minimal. You will get an output directory which can be opened with the following command:
pat_report {output directory}
The first table shows the time spent in each function, except functions that take less than 1% of the runtime. This limit can be removed with:
pat_report -T {output directory}
Second step: Porting loops with OpenMP offloading
Important: Validation Strategy
The ported version will need to be validated in terms of results. We advise having a strategy in place to check the accuracy of the results regularly (after porting a few loops). GPU porting can lead to massive errors (porting a non-parallelizable loop, triggering data-race conditions, etc.), and catching those early is always better. Also note that GPU codes are not 100% bit-to-bit reproducible: there will be (small) differences due to rounding and the order of operations, which are dynamically scheduled on the GPU.
Programming Models
There are many ways to port a code to GPUs: you can use native languages (HIP/CUDA), OpenACC or OpenMP directives, standard parallelism in C++, DO CONCURRENT in Fortran, etc. We call these “programming models”.
We recommend using OpenMP directives, which, in our tests, offer good performance and ease of implementation compared to native GPU languages. They are well supported by most vendors/compilers and can be used on both NVIDIA and AMD GPUs.
Porting loops
Porting a loop to make a kernel on the GPU is very easy: you just have to add the following directive right before the loop:
In C/C++:
#pragma omp target teams loop
or
#pragma omp target teams distribute parallel for simd
In Fortran:
!$omp target teams loop
or
!$omp target teams distribute parallel do simd
As of today, both versions behave identically on AMD GPUs; “teams loop” is more concise (and slightly better with NVIDIA compilers).
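To make this concrete, here is a minimal sketch in C (the arrays a, b, c and the size n are illustrative assumptions):
// Offload an independent-iteration loop to the GPU.
// a, b, c are assumed to be pre-allocated arrays of length n.
#pragma omp target teams loop
for (int i = 0; i < n; i++) {
    c[i] = a[i] + 2.0 * b[i];  // each iteration is independent: safe to port
}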
A few things must be checked before porting a loop:
First, it should be parallelizable, which means every iteration should be independent of the others (no x(i) = x(i-1) + …). Such loops are very difficult to port efficiently to GPUs. There are some solutions, but if one of the code's hotspots looks like this, you should ask for help as soon as possible. A complete change of algorithm might become necessary.
Second, two different iterations should never WRITE (reading is totally fine) to the same memory address, which would create a data race. There are again ways to deal with such loops, like a reduction or, in the worst case, an atomic directive; a sketch of the atomic case follows.
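Here is a sketch of a histogram-style loop protected by an atomic directive (classify() and hist are hypothetical names for this illustration):
// Two iterations may hit the same address hist[bin]:
// the atomic update avoids the data race, at some performance cost.
#pragma omp target teams loop
for (int i = 0; i < n; i++) {
    int bin = classify(a[i]);  // hypothetical helper, executed on the GPU
    #pragma omp atomic update
    hist[bin] += 1;
}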
Hybrid codes
If you already have an MPI+OpenMP code with parallel for/do directives (on the CPUs), the rules for kernels are the same as for parallelization with CPU OpenMP threads: every loop which already has such a parallel construct should be safe to port.
Reduction
A reduction is a loop where every iteration uses the same scalar variable to aggregate the results (the best example is a sum over an array).
The clause reduction(operator : variable) can be added to the directive to perform a reduction safely and efficiently on the GPU, avoiding the data race inherent to these algorithms. “operator” can be +, *, a logical operator (and, or, …), and can even be self-defined.
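A minimal sketch of a sum over an array in C (names are illustrative):
double sum = 0.0;
#pragma omp target teams loop reduction(+:sum)
for (int i = 0; i < n; i++) {
    sum += a[i];  // the reduction clause avoids the data race on sum
}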
Collapse clause
One important clause (of the target teams… directive) should be understood when porting to the GPU: collapse(n). It is necessary to collapse multiple (n) nested loops into a single big kernel, exposing all the levels of parallelism. The loops must follow each other directly, with no other instructions in between.
Most codes will have some construct like this (sometimes with more levels):
for (int i = 0; i < n; i++) {
    // do some stuff
    for (int j = 0; j < m; j++) {
        // do other stuff
    }
}
It is usually beneficial to move the “some stuff” part into the inner loop and use collapse to expose more parallelism:
#pragma omp target teams loop collapse(2)
for (int i = 0; i < n; i++) {
    for (int j = 0; j < m; j++) {
        // do some stuff
        // do other stuff
    }
}
Despite doing the “some stuff” part multiple times, the GPU is going to be much faster, because it can now parallelize the work over a lot more threads instead of executing the inner loop serially.
Data-sharing clauses
Since the loop is executed in parallel, some information must be given to the GPU about the variables that are used. In a serial code, you could reuse temporary variables at each iteration without any issue, but now those iterations run in parallel. The data-sharing clauses are there to solve this issue. There are three of them:
shared(list): the variable is shared by all threads. This will mostly be used for arrays and is the DEFAULT, with a few exceptions (but you can list variables explicitly anyway for safety). Shared variables can be accessed by any thread at the same time.
private(list): private variables are allocated for each thread and are only visible to that thread. Every temporary scalar that is reused at each iteration should be declared private to avoid data races. Note that loop indexes are private by default, and so are variables declared inside the loop construct.
firstprivate(list): firstprivate is the same as private, but the value of the variable is broadcast to every thread. This is useful for initialization: you can initialize a scalar before the loop, then firstprivate it so that every thread starts with the same value.
The right combination of shared and private variables gives the kernel the correct behaviour on the GPU. Note: declaring large arrays as private can lead to excessive memory usage, because each thread needs its own copy, which can quickly exhaust GPU memory.
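A minimal sketch combining these clauses (all names are illustrative):
double tmp;          // scratch scalar reused at each iteration
double scale = 0.5;  // initialized once on the host
#pragma omp target teams loop private(tmp) firstprivate(scale) shared(a, b)
for (int i = 0; i < n; i++) {
    tmp = a[i] * a[i];   // each thread writes its own private copy of tmp
    b[i] = scale * tmp;  // a and b are shared arrays (the default)
}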
Map clauses: Because Hunter will have unified shared memory, it is not necessary to use map clauses in the directives. However, you might want to add them (or already have them) if your code also runs on another system with discrete GPUs. There is no need to remove them; they will simply be ignored on Hunter.
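For reference, on a system with discrete GPUs the same kind of loop might carry map clauses like this (a sketch; array names are illustrative):
#pragma omp target teams loop map(to: a[0:n]) map(from: b[0:n])
for (int i = 0; i < n; i++) {
    b[i] = 2.0 * a[i];  // a is copied to the GPU, b is copied back
}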
Declare target
One can call functions (or subroutines) from inside a kernel (assuming they follow the parallelization rules). These functions need to be preceded by the following directive, which tells the compiler that they are supposed to be executed on the accelerator:
C/C++:
#pragma omp declare target
Fortran:
!$omp declare target
One can use the device_type(any) clause if one needs to call the same function both from inside a kernel and from CPU code.
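A minimal C sketch (the function is a hypothetical example):
#pragma omp declare target
double square(double x) { return x * x; }  // callable from inside GPU kernels
#pragma omp end declare target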
Things not allowed on the GPU
There are a few limitations on what can be called from inside a kernel:
- MPI_ABORT is not allowed, because it is not thread-safe. You should call it after the kernel, in a CPU region.
- break statements and exceptions are not possible either.
- Input/output operations (cout, printf, WRITE in Fortran, etc.) cannot be performed inside a GPU kernel.
Since most codes rely on one of these mechanisms to handle errors at runtime, they need to be changed without affecting performance.
We advise adding a (shared) error variable: a single integer set to 0. Inside the kernel, if something goes wrong, instead of calling an error-handling function (which would try to write a message), set this integer to 1 (with an atomic directive). Then check the integer after the end of the kernel and call the error-handling function there.
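A sketch of this error-flag pattern in C (handle_error() and the error condition are hypothetical):
int err = 0;  // shared error flag, 0 means no error
#pragma omp target teams loop shared(err)
for (int i = 0; i < n; i++) {
    if (a[i] < 0.0) {          // some runtime error condition
        #pragma omp atomic write
        err = 1;               // record the error instead of printing
    }
    // ... normal work ...
}
if (err != 0)
    handle_error();            // CPU-side error handling after the kernel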
“Bad for a GPU”?
One may have heard that branching (if...then...else or select...case) and “small” loops are “bad” on the GPU. While this is technically true, it does not mean that a loop with branches or a “small” loop should not be ported.
Regarding “small” loops: “small” is actually extremely small, and even loops over boundary cells are long enough to be (a lot) faster on a GPU. While this does not always mean that the GPU is used efficiently, it will be a lot easier to port every loop (with a range that depends on the problem size) to the GPU than to try to guess what will be optimal on Hunter.
Branching is less tricky: a GPU executes every instruction in all the branches of an IF/SELECT construct, using masks so that the operations only take effect on the threads for which the condition is TRUE. This means that it will take much more time than a CPU to treat one element, depending on the number of possible cases, hence the “bad for GPUs”. However, let's not forget that an MI300A can treat ~500k elements in parallel at the same time, while a CPU core treats... one. So if the branch construct is so complex that it has thousands of cases, then maybe the question needs to be considered; in all other cases, even with ~100 cases, the GPU is going to be faster, because it treats many elements at the same time compared to a CPU.
Third step: Running and testing
While Hunter is not yet available, one can still run the ported code on Hawk. Within the CPE environment, you can use the following module:
module load craype-accel-host
This module allows you to simulate GPU execution on the CPU, ensuring the ported code runs correctly even if actual GPU hardware is unavailable. The kernels will run serially on the CPU.
In a perftools profile, you can check that the kernels are indeed working by looking for ACC_REGIONS among the functions called. You can also do a first validation of the results with this version. Note, however, that since the code still runs serially, problems like data-race conditions or missing private clauses might not be visible on Hawk.
Preparing for Hunter Nodes
Each Hunter node will have 4 APUs, each with 24 CPU cores. At first, we advise using only one CPU core as the “host” of each APU. This means that there will be only 4 MPI ranks on a Hunter node. Each of those 4 MPI ranks will have access to more compute capability than multiple Hawk nodes, and to 128 GB of memory.
An application on Hunter should be adapted to run with fewer MPI ranks, where each rank processes more data (elements/cells) than on Hawk. This is the best way to get the best performance from the GPU while fitting in the memory of the node.
Test cases can already be tweaked to use fewer MPI ranks with bigger “subdomains”, to remove “caching” optimizations (those do not exist on GPUs), etc.
Having an idea of the memory footprint of your code will also be very useful.