Hunter Porting Tutorial
How to start with porting a code for Hunter
Although Hunter is not available yet, you can begin porting your code using Hawk. Here’s how to get started.
Hunter will be equipped with AMD MI300A GPUs, which are sometimes referred to as APUs (Accelerated Processing Units). The MI300A is a new type of GPU with unified shared memory: the GPUs and the CPUs share the same HBM main memory, thus removing one of the biggest issues of GPUs - the need to copy data to GPU memory before processing, and back to CPU memory afterwards. More information about MI300A capabilities is available on the AMD website.
Hunter will also use the Cray Programming Environment (CPE), like the old Hazelhen system we had at HLRS. The default compilers will therefore be the Cray compilers, and the default MPI implementation will be Cray MPI (MPICH compatible).
Complete documentation about CPE (including profiling tools, libraries, compiler flags etc…) is available on the HPE website: https://cpe.ext.hpe.com/docs/latest/.
Disclaimer: Regarding Optimization and Performance
The main objective of this tutorial is to help people with a CPU-only code start porting it to Hunter, so that they can work on the new machine as soon as possible. Therefore, we will not go into detail regarding performance or optimizing GPU vs CPU usage. Having most of the code ported to use the GPUs is a necessary step before starting the optimization process. We propose here a simple way to do so that gives good performance and does not require access to Hunter or another GPU machine.
First step: Toolchain migration
A CPE environment is installed on Hawk. The first step in porting a code is to compile and run its current version on Hawk with the CPE environment. For that, you need to load it with:
module load CPE/23.12
Note that it will remove all your currently loaded modules and give access to a new set of available modules, specific to CPE. By default, CPE will use PrgEnv-cray (the Cray compilers). Ideally, one should try to compile with those Cray compilers. If that is not possible, one can swap modules to use another PrgEnv-*** (gnu, for example). Regardless of the PrgEnv module loaded, CPE provides wrappers for compiler invocations. Makefiles/CMake will have to be modified to use these wrappers:
CC for C++ compiler
cc for C compiler
ftn for Fortran compiler
Also note that the Cray compilers use a different set of flags, so you might have to change some of those in your Makefiles.
Once the code compiles and runs, porting can start.
Optional (but recommended): Perftools-lite profile
Perftools is CPE's main profiling tool and will be the default profiling tool used on Hunter. Doing a profile before porting will help you identify the hotspots of your code, and therefore the first functions/loops to port.
module load perftools-lite
You can instead load perftools-lite-gpu (for code already ported to GPUs) or perftools-lite-loops (which gives more information about CPU loops at the cost of more overhead).
Recompile your code to create an instrumented version of it, then run it normally. The overhead should be minimal.
You will get an output directory which can be opened with the following command:
pat_report {output directory}
The first table shows the time spent in each function, excluding functions that take less than 1% of the runtime – this limit can be removed with
pat_report -T {output directory}
Second step: Porting loops with OpenMP offloading
Important: Validation Strategy
The ported version will need to be validated (in terms of results). We advise having a strategy to check the accuracy of results regularly (after porting a few loops). GPU porting can lead to massive errors (porting a non-parallelizable loop or triggering data-race conditions…) and catching those early is always better. Also note that GPU codes are not 100% bit-to-bit reproducible: there will be (small) differences due to rounding and the “order” of operations which are dynamically scheduled on the GPU.
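Because results will not match bit-for-bit, a tolerance-based comparison against a CPU reference run is more useful than an exact one. Below is a minimal sketch of such a check in C; the function name, tolerance, and arrays are illustrative, not part of any HLRS tooling.
#include <math.h>
#include <stdio.h>
/* Compare GPU results against a CPU reference within a relative tolerance. */
int check_results(const double *ref, const double *gpu, int n, double rtol)
{
    int errors = 0;
    for (int i = 0; i < n; i++) {
        double denom = fmax(fabs(ref[i]), 1e-300); /* avoid division by zero */
        if (fabs(ref[i] - gpu[i]) / denom > rtol) {
            printf("Mismatch at %d: ref=%e gpu=%e\n", i, ref[i], gpu[i]);
            errors++;
        }
    }
    return errors; /* 0 means the ported version validates */
}
A relative tolerance around 1e-12 is a common starting point for double precision; reductions over many elements may need a looser bound.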
Programming Models
There are a lot of ways to port a code to GPUs: you can use native languages (HIP/CUDA), OpenACC or OpenMP directives, standard parallelism in C++, DO CONCURRENT in Fortran, etc. We call those “programming models”.
We recommend using OpenMP directives, which from our tests offer good performance and ease of implementation compared to native GPU languages. It is well-supported by most vendors/compilers and can be used on both Nvidia and AMD GPUs.
Porting loops
Porting a loop to make a kernel on the GPU is very easy: you just have to add the following directive right before the loop:
in C/C++:
#pragma omp target teams loop
or
#pragma omp target teams distribute parallel for simd
In Fortran:
!$omp target teams loop
or
!$omp target teams distribute parallel do simd
As of today, both versions behave identically on AMD GPUs; “teams loop” is more concise (and slightly better with Nvidia compilers).
A few things must be checked before porting a loop:
First, it should be parallelizable, which means every iteration should be independent of the others (no x(i) = x(i-1) + …). Such loops are very difficult to port efficiently to GPUs. There are some solutions, but if one of the code's hotspots looks like this, you should ask for help as soon as possible. A complete change of the algorithm might become necessary.
Second, two different iterations should never WRITE (reading is totally fine) to the same memory address, creating a data race. There are, again, ways to deal with such loops, like reductions or, in the worst case, an atomic directive.
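As a minimal sketch (with illustrative array names), here is a loop that passes both checks - every iteration is independent and writes only to its own element - ported with a single directive:
#pragma omp target teams loop
for (int i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i]; /* each iteration writes only y[i]: no data race */
}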
Hybrid codes:
If you already had an MPI+OpenMP code with parallel for/do directives (on the CPUs), the rules for kernels are the same as for parallelization with CPU OpenMP threads: every loop which already has those parallel constructs should be safe to port.
Reduction
A reduction is a loop where every iteration uses the same scalar variable to aggregate the results (the best example is a sum over an array).
The clause reduction (operator: variable) can be added to the directive to perform a reduction safely and efficiently on the GPU, avoiding the data race inherent to those algorithms. “operator” can be +, *, a logical operator (and, or, …), and can even be user-defined.
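As a minimal sketch, a sum over an array would look like this; without the reduction clause, every iteration would race on “sum”:
double sum = 0.0;
#pragma omp target teams loop reduction(+:sum)
for (int i = 0; i < n; i++) {
    sum += x[i]; /* safe: the runtime combines the partial sums */
}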
Collapse clause
One important clause (of the target teams… directive) should be understood when porting to the GPU: collapse(n). It is necessary to collapse multiple (n) nested loops into a single big kernel, expressing all the levels of parallelism. The loops should follow each other without other instructions in between.
Most codes will have some construct like this (sometimes with more levels):
for (int i = 0; i < n; i++) {
    /* do some stuff */
    for (int j = 0; j < m; j++) {
        /* do other stuff */
    }
}
It is usually beneficial to move {some stuff} to the inner loop and use collapse to parallelize more:
for (int i = 0; i < n; i++) {
    for (int j = 0; j < m; j++) {
        /* do some stuff (now repeated for each j) */
        /* do other stuff */
    }
}
Despite doing {some stuff} multiple times, the GPU is going to be much faster because it can now parallelize the work over a lot more threads, instead of doing the inner loop “in serial”.
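Putting it together, a minimal sketch of the restructured nest with both levels of parallelism exposed to the GPU:
#pragma omp target teams loop collapse(2)
for (int i = 0; i < n; i++) {
    for (int j = 0; j < m; j++) {
        /* do some stuff (recomputed for each j) */
        /* do other stuff */
    }
}
The collapsed kernel now exposes n*m iterations to the GPU instead of n.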
Data-sharing clauses
Because the kernel is executed in parallel, the compiler must be told how each variable is used. In a serial code, you could reuse temporary variables for each iteration without any issue, but now those iterations will run in parallel. The data-sharing clauses are there to solve this issue. There are three of them:
shared (list): the variable is shared by all threads; this will mostly be used for arrays and is the DEFAULT with a few exceptions (but you can list variables explicitly anyway, for safety). Shared variables can be accessed by any thread at the same time.
private (list): private variables are allocated for each thread and visible only to that thread. Every temporary scalar that was reused at each iteration should be declared private to avoid data races. Note that loop indices are private by default, and so are variables declared inside the loop construct.
firstprivate (list): firstprivate is the same as private, but the value of the variable is broadcast to every thread. This is useful for initialization: you can initialize a scalar before the loop, then firstprivate it so that every thread has the same value.
The right combination of shared and private variables allows the kernel to have the correct behaviour on the GPUs. Note: declaring large arrays as private can lead to excessive memory usage because each thread would need its own copy, which can quickly exhaust GPU memory.
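A minimal sketch combining the three clauses (variable names are illustrative; no map clauses are needed on Hunter, see below):
double scale = 2.0; /* initialized once on the CPU */
double tmp;         /* scratch variable reused by every iteration */
#pragma omp target teams distribute parallel for private(tmp) firstprivate(scale) shared(x, y)
for (int i = 0; i < n; i++) {
    tmp = x[i] * x[i];  /* safe: each thread has its own copy of tmp */
    y[i] = scale * tmp; /* every thread received the same value of scale */
}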
Map clauses: Because Hunter will have Unified Shared Memory, it is not necessary to use the map clauses in the directives. However, you might want to add them/already have them if your code is also running on another system with discrete GPUs. There is no need to remove them, they will simply be ignored on Hunter.
Declare target
One can call functions (or subroutines) from inside a kernel (assuming they follow the parallelization rules). Those functions need to be preceded by the following directive, to tell the compiler that they are supposed to be executed on the accelerator:
C/C++:
#pragma omp declare target
Fortran:
!$omp declare target
One can use the device_type(any) clause if one needs to call the same function both from inside a kernel and from CPU code.
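A minimal sketch in C (the function name is illustrative):
#pragma omp declare target
double squared(double v) { return v * v; } /* compiled for the accelerator */
#pragma omp end declare target

#pragma omp target teams loop
for (int i = 0; i < n; i++) {
    y[i] = squared(x[i]); /* called from inside the kernel */
}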
Things not allowed on the GPU
There are a few limitations on what can be called from inside a kernel:
-MPI_ABORT is not allowed, because it is not thread-safe. You should call it after the kernel in a CPU region.
-breaks and exceptions are not possible either.
-Input/output operations (cout, printf, WRITE in Fortran etc…) cannot be performed inside a GPU kernel.
Since most codes rely on one of those mechanisms to handle errors at runtime, they need to be replaced without affecting performance.
We advise adding a (shared) error variable: a single integer set to 0. Inside the kernel, if something goes wrong, instead of calling an error-handling function (which would try to write a message), change this integer to 1 (with an atomic directive). Then check the integer after the end of the kernel and call the error-handling function.
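A minimal sketch of this pattern (handle_error stands in for the code's own error handler; the map clause ensures the scalar flag is written back, since scalars are firstprivate by default in target regions, and it is harmless on Hunter):
int err = 0; /* error flag, initialized on the CPU */
#pragma omp target teams distribute parallel for map(tofrom: err)
for (int i = 0; i < n; i++) {
    if (x[i] < 0.0) { /* some failure condition */
        #pragma omp atomic write
        err = 1;      /* record the error without any I/O */
    }
    /* ... normal work ... */
}
if (err != 0) handle_error(); /* report from CPU code, after the kernel */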
“Bad for a GPU”?
One may have heard that branching (if...then...else or select...case) and “small” loops are “bad” on the GPU. While this is technically true, it doesn’t mean that a loop with branches or a “small” loop should not be ported.
Regarding “small” loops, “small” is actually extremely small, and even loops over boundary cells are long enough to be (a lot) faster on a GPU. While this doesn’t always mean that the GPU is used efficiently, it will be a lot easier to port every loop (with a range that depends on the problem size) to the GPU than to try to guess what will be optimal on Hunter.
Branching is less tricky: a GPU executes every instruction in all the branches of an IF/SELECT construct, using masks to only perform the operations on the threads for which the condition is TRUE. This means that it will take a lot more time than a CPU to treat one element, depending on the number of possible cases - hence the “bad for GPUs”. However, let's not forget that an MI300A can treat ~500k elements in parallel at the same time, while a CPU core can treat...1. So if the branch construct is so complex that it has thousands of cases, then maybe the question needs to be considered; but in all other cases, even with ~100 cases, the GPU is going to be faster, because it can treat many elements at the same time.
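For instance, a minimal sketch of a branchy loop that is still worth porting (update_cell is illustrative and assumed to be marked with declare target):
#pragma omp target teams loop
for (int i = 0; i < n; i++) {
    if (flag[i]) {
        y[i] = update_cell(x[i]); /* threads where flag[i] is false are masked off */
    } else {
        y[i] = x[i];
    }
}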
Third Step: Running and testing
While Hunter is not yet available, one can still run the ported code on Hawk. Using the CPE environment, you can use the following module:
module load craype-accel-host
This module allows you to simulate GPU execution on a CPU, ensuring the ported code runs correctly, even if actual GPU hardware is unavailable. The kernels will run serially on the CPU.
On a perftools profile, you can check that kernels are indeed working by seeing some ACC_REGIONS in the functions called. You can also do a first validation of the results with this version. Note however that since the code is still run serially, things like data-race conditions or missing private clauses might not be visible on Hawk.
Preparing for Hunter Nodes
Each Hunter node will have 4 APUs, each with 24 CPU cores. At first, we advise using only one CPU core as the “host” of each APU. This means that there will be only 4 MPI ranks on a Hunter node. Each of those 4 MPI ranks will have access to more compute capability than multiple Hawk nodes, and 128 GB of memory.
An application on Hunter should be adapted to run with fewer MPI ranks, where each rank processes more data (elements/cells) than on Hawk. This is the best way to get good performance from the GPU while fitting in the memory of the node.
Test cases can already be tweaked to use fewer MPI ranks with bigger “subdomains”, to remove “caching” optimizations (those do not exist on GPUs), etc.
Having an idea of the memory footprint of your code will also be very useful.