
Hunter Porting Tutorial


How to start porting a code for Hunter

Although Hunter is not available yet, you can begin porting your code using Hawk. Here’s how to get started.

Hunter will be equipped with AMD MI300A GPUs, which are sometimes referred to as APUs (Accelerated Processing Units). The MI300A is a new type of GPU with unified shared memory: the GPUs and the CPUs share the same HBM main memory, which removes one of the biggest issues of GPUs, namely the need to copy data to GPU memory before processing and back to CPU memory afterwards. More information about MI300A capabilities is available on the AMD website: https://www.amd.com/en/products/accelerators/instinct/mi300/mi300a.html

Hunter will also use the Cray Programming Environment (CPE), like the old Hazel Hen system we had at HLRS. The default compilers will therefore be the Cray compilers, and the default MPI implementation will be Cray MPI (MPICH compatible).

Complete documentation about CPE (including profiling tools, libraries, compiler flags, etc.) is available on the HPE website: https://cpe.ext.hpe.com/docs/latest/.


Disclaimer: Regarding Optimization and Performance

The main objective of this tutorial is to help people with a CPU-only code start porting it to Hunter, so that they can work on the new machine as soon as possible. Therefore, we will not go into detail regarding performance or optimizing GPU vs. CPU usage and efficiency. Having most of the code ported to use the GPUs is a necessary step before starting the optimization process. We propose here a simple way to do so that gives good performance and does not require access to Hunter or another GPU machine.

First step: Toolchain migration

Hawk has a CPE environment installed. The first step in porting a code is to compile and run its current version on Hawk with the CPE environment. To do so, load it with

module load CPE/23.12

Note that this will remove all your currently loaded modules and give access to a new set of available modules, specific to CPE. By default, CPE uses PrgEnv-cray (the Cray compilers). Ideally one should try to compile with those Cray compilers. If that is not possible, one can swap modules to use another PrgEnv-*** (gnu, for example). Regardless of the PrgEnv module loaded, CPE provides wrappers for compiler invocations. Makefiles/CMake will have to be modified to use these wrappers:

CC for C++ compiler

cc for C compiler

ftn for Fortran compiler

Also note that the Cray compilers use a different set of flags, so you might have to change some of those in your Makefiles.
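
A minimal sketch of what this can look like in practice (the PrgEnv swap is optional, and the Makefile variable names are just the usual conventions, not something CPE prescribes):

module swap PrgEnv-cray PrgEnv-gnu   # only if the Cray compilers cannot build your code

# In the Makefile, point the compiler variables at the CPE wrappers:
CC  = cc     # C
CXX = CC     # C++
FC  = ftn    # Fortran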

Once the code compiles and runs, porting can start.

Optional (but recommended): Perftools-lite profile

Perftools is CPE's main profiling tool and will be the default profiling tool on Hunter. Doing a profile before porting will help you identify the hotspots of your code and therefore the first functions/loops to port.

module load perftools-lite

You can instead load perftools-lite-gpu (for code already ported to GPUs) or perftools-lite-loops (which gives more information about CPU loops at the cost of more overhead).

Then recompile your code to create an instrumented version of it, and then run it normally. Overheads should be minimal.

You will get an output directory which can be opened with the following command:

pat_report {output directory}

The first table shows the time spent in each function, except functions that take less than 1% of the runtime; this limit can be removed with

pat_report -T {output directory}

Second step: Porting loops with OpenMP offloading

Important: Validation Strategy

The ported version will need to be validated (in terms of results). We advise having a strategy to check the accuracy of results regularly (after porting a few loops). GPU porting can lead to massive errors (porting a non-parallelizable loop or triggering data-race conditions…) and catching those early is always better. Also note that GPU codes are not 100% bit-to-bit reproducible: there will be (small) differences due to rounding and the “order” of operations which are dynamically scheduled on the GPU.
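
Since exact equality cannot be expected, a tolerance-based comparison is a practical validation check. Below is a minimal, illustrative C helper (not part of any HLRS tooling; the names and the tolerance handling are assumptions) that compares a ported result against a CPU reference within a relative tolerance:

#include <math.h>
#include <stdio.h>

/* Illustrative check: returns 0 if every ported value matches the reference
 * within a relative tolerance, 1 otherwise. */
static int check_results(const double *ref, const double *ported, int n, double rel_tol)
{
    for (int i = 0; i < n; i++) {
        double denom = fabs(ref[i]) > 1e-30 ? fabs(ref[i]) : 1.0;  /* avoid division by ~0 */
        if (fabs(ref[i] - ported[i]) / denom > rel_tol) {
            printf("Mismatch at index %d: ref=%g ported=%g\n", i, ref[i], ported[i]);
            return 1;
        }
    }
    return 0;
}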

Programming Models

There are a lot of ways to port a code to GPUs: you can use native languages (HIP/CUDA), OpenACC or OpenMP directives, standard parallelism in C++, DO CONCURRENT in Fortran, etc. We call these “programming models”.

We recommend using OpenMP directives, which from our tests offer good performance and ease of implementation compared to native GPU languages. It is well-supported by most vendors/compilers and can be used on both Nvidia and AMD GPUs.

Porting loops

Porting a loop to make a kernel on the GPU is very easy: you just have to add the following directive right before the loop:

in C/C++:

#pragma omp target teams loop

or

#pragma omp target teams distribute parallel for simd

In Fortran:

!$omp target teams loop

or

!$omp target teams distribute parallel do simd

As of today, both versions behave identically on AMD GPUs; “teams loop” is shorter to write (and slightly better with Nvidia compilers).


A few things must be checked before porting a loop:

First, it should be parallelizable, which means every iteration should be independent of the others (no x(i) = x(i-1) + …). Such loops are very difficult to port efficiently to GPUs. There are some solutions, but if one of the code hotspots is like this, you should ask for help as soon as possible. A complete change of the algorithm might become necessary.

Second, two different iterations should never WRITE (reading is totally fine) to the same memory address, creating a data race. There are again ways to deal with such loops, like a reduction, or in the worst case an atomic directive.

Hybrid codes:

If you already had an MPI+OpenMP code with parallel for/do directives (on the CPUs), the rules for kernels are the same as for parallelization with CPU OpenMP threads: every loop that already has those parallel constructs should be safe to port.
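
As an illustrative sketch (array names and bounds are made up), an existing CPU-parallel loop and its offloaded counterpart look like this; on Hunter no map clauses are needed thanks to the unified shared memory:

/* Existing CPU version (hybrid MPI+OpenMP code) */
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
}

/* Ported GPU version: same loop body, offloading directive instead */
#pragma omp target teams loop
for (int i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
}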

Reduction

A reduction is a loop where every iteration uses the same scalar variable to aggregate the results (the best example is a sum over an array).

The clause reduction(operator : variable) can be added to the directive to perform a reduction safely and efficiently on the GPU, avoiding the data race inherent to those algorithms. “operator” can be +, *, a logical operator (and, or, …) and can even be user-defined.
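
For example, a sum over an array could be offloaded like this (a minimal sketch with illustrative names):

double sum = 0.0;
#pragma omp target teams loop reduction(+:sum)
for (int i = 0; i < n; i++) {
    sum += x[i];   /* each thread accumulates safely; partial results are combined */
}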

Collapse clause

One important clause (of the target teams… directive) should be understood when porting to the GPU: collapse(n). This is necessary to collapse multiple (n) nested loops into a single big kernel, expressing all the levels of parallelism. The loops should follow each other, without other instructions in between.

Most codes will have some construct like this (sometimes with more levels):

for (i = 0; i < n; i++) {
    /* some stuff */
    for (j = 0; j < m; j++) {
        /* other stuff */
    }
}

It is usually beneficial to move {some stuff} to the inner loop and use collapse to parallelize more:

for (i = 0; i < n; i++) {
    for (j = 0; j < m; j++) {
        /* some stuff */
        /* other stuff */
    }
}

Despite doing {some stuff} multiple times, the GPU is going to be much faster, because it can now parallelize the work over a lot more threads, instead of doing the inner loop “in serial”.
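
With the directive and the collapse clause, the restructured nest above could look like this (an illustrative sketch, assuming flattened 2D arrays of size n*m):

#pragma omp target teams loop collapse(2)
for (int i = 0; i < n; i++) {
    for (int j = 0; j < m; j++) {
        /* some stuff, now repeated for every (i, j) */
        c[i*m + j] = a[i*m + j] + b[i*m + j];   /* other stuff */
    }
}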

Data-sharing clauses

Because a kernel is executed in parallel, the GPU must be told how the variables used inside it are to be treated. In a serial code, you could reuse temporary variables for each iteration without any issue, but now those iterations will run in parallel. The data-sharing clauses are there to solve this issue. There are three of them:


shared (list): the variable is shared by all threads. This will mostly be used for arrays and is the DEFAULT with a few exceptions (but you can specify variables explicitly anyway, to be safe). Shared variables can be accessed by any thread at the same time.

private (list): private variables are allocated for each thread and are only visible to that thread. Every temporary scalar that was reused at each iteration should be declared private to avoid data races. Note that loop indices are private by default, and so are variables declared inside the loop construct.

firstprivate (list): firstprivate is the same as private, but the value of the variable is broadcast to every thread. This is useful for initialization: you can initialize a scalar before the loop, then firstprivate it so that every thread starts with the same value.


The right combination of shared and private variables allows the kernel to have the correct behaviour on the GPUs. Note: declaring large arrays as private can lead to excessive memory usage because each thread would need its own copy, which can quickly exhaust GPU memory.
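
A small illustrative sketch combining the three clauses (variable names are made up; on Hunter the arrays need no map clauses):

double tmp;            /* temporary reused by every iteration */
double scale = 2.0;    /* initialized once on the CPU */
#pragma omp target teams loop shared(a, b) private(tmp) firstprivate(scale)
for (int i = 0; i < n; i++) {
    tmp = scale * a[i];   /* every thread works on its own copies of tmp and scale */
    b[i] = b[i] + tmp;    /* a and b are shared by all threads */
}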

Map clauses: Because Hunter will have Unified Shared Memory, it is not necessary to use the map clauses in the directives. However, you might want to add them/already have them if your code is also running on another system with discrete GPUs. There is no need to remove them, they will simply be ignored on Hunter.

Declare target

One can call functions (or subroutines) from inside a kernel (assuming they follow the parallelization rules). Those functions need to be preceded by the following directive, to tell the compiler that they are supposed to be executed on the accelerator:

C/C++:

#pragma omp declare target

Fortran:

!$omp declare target

One can use the device_type(any) clause if one needs to call the same function both from inside a kernel and from CPU code.
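
A minimal C sketch (the function name is illustrative) of marking a function for device execution and calling it from a kernel:

/* Mark the function as compilable for the accelerator */
#pragma omp declare target
static double squared(double v)
{
    return v * v;
}
#pragma omp end declare target

#pragma omp target teams loop
for (int i = 0; i < n; i++) {
    y[i] = squared(x[i]);   /* call from inside the kernel */
}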

Things not allowed on the GPU

There are a few limitations on what can be called from inside a kernel:

- MPI_ABORT is not allowed, because it is not thread-safe. You should call it after the kernel in a CPU region.

- breaks and exceptions are not possible either.

- Input/output operations (cout, printf, WRITE in Fortran, etc.) cannot be performed inside a GPU kernel.

Since most codes rely on one of these things to handle errors at runtime, they need to be changed without affecting the performance.

We advise adding a (shared) error variable: a single integer set to 0. Inside the kernel, if something goes wrong, instead of calling an error-handling function (which would try to write a message), set this integer to 1 (with an atomic directive). Then check the integer after the end of the kernel and call the error-handling function there.
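
An illustrative sketch of this pattern (the error condition and the handler are placeholders; the distribute parallel for form of the directive is used here since atomic is clearly allowed inside it):

int err = 0;
/* map(tofrom:) keeps the scalar flag visible after the kernel
 * (scalars are firstprivate on a target construct by default) */
#pragma omp target teams distribute parallel for map(tofrom: err)
for (int i = 0; i < n; i++) {
    if (x[i] < 0.0) {              /* some error condition detected on the GPU */
        #pragma omp atomic write
        err = 1;                   /* record the error instead of printing/aborting */
    } else {
        y[i] = sqrt(x[i]);
    }
}
if (err != 0) {
    /* back on the CPU: print a message, call the error-handling routine, MPI_Abort, ... */
}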

“Bad for a GPU”?

One may have heard that branching (if...then...else or select...cases) and “small” loops are “bad” on the GPU. While this is technically true, it doesn’t mean that a loop with branches or a “small” loop should not be ported.

Regarding “small” loops, “small” is actually extremely small, and even loops over boundary cells are long enough to be (a lot) faster on a GPU. While this doesn’t always mean that the GPU is used efficiently, it will be a lot easier to port every loop (within a range that depends on the problem sizes) to the GPU than to try to guess what will be optimal on Hunter.

Branching is less tricky: a GPU executes every instruction in all the branches of an IF/SELECT construct, using masks to only perform the operations on the threads for which the condition is TRUE. This means that it will take a lot more time than a CPU to treat 1 element, depending on the number of possible cases, hence the “bad for GPUs”. However, let's not forget that an MI300A can treat ~500k elements in parallel at the same time, while a CPU can treat...1. So if the branch construct is so complex that it has thousands of cases, then maybe the question needs to be considered, but in all the other cases, even with ~100 cases, the GPU is going to be faster, because it can treat many elements at the same time compared to a CPU.

Third Step: Running and testing

While Hunter is not yet available, one can still run the ported code on Hawk. Within the CPE environment, you can load the following module:

module load craype-accel-host

This module allows you to simulate GPU execution on a CPU, ensuring the ported code runs correctly, even if actual GPU hardware is unavailable. The kernels will run serially on the CPU.

On a perftools profile, you can check that kernels are indeed working by seeing some ACC_REGIONS in the functions called. You can also do a first validation of the results with this version. Note however that since the code is still run serially, things like data-race conditions or missing private clauses might not be visible on Hawk.

Preparing for Hunter Nodes

Each Hunter node will have 4 APUs and 24 CPUs. At first, we advise using only one CPU as the “host” of each APU. This means that there will be only 4 MPI ranks on a Hunter node. Each of those 4 MPI ranks will have access to more compute capability than multiple Hawk nodes, and 128 GB of memory.

An application on Hunter should be adapted to run with fewer MPI ranks, but each rank will process more data (elements/cells) than on Hawk. This is the best way to get good performance from the GPU while fitting in the memory of the node.

Test cases can already be tweaked to use fewer MPI ranks with bigger “subdomains”, to remove “caching” optimizations (which do not exist on GPUs), etc.

Having an idea of the memory footprint of your code will also be very useful.