- Infos im HLRS Wiki sind nicht rechtsverbindlich und ohne Gewähr -
- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

Compiler(Hawk)

From HLRS Platforms
In order to build MPI applications, please use the compiler wrappers mpif77 / mpif90 / mpif08 / mpicc / mpicxx.

== Available compilers ==
We '''highly''' recommend trying as many different compilers as possible and comparing the performance of the generated code! If you code according to language standards, this is almost free but can give you a significant speedup! There is no such thing as an "ideal" compiler! One suits application A better, another suits application B (cf. [https://prace-ri.eu/wp-content/uploads/Best-Practice-Guide_AMD.pdf Best Practice Guide AMD EPYC (Naples)]).


<br>
<br>


Please note that compilers do not use optimization flags by default at the moment. Hence, please refer to [https://www.amd.com/system/files/documents/compiler-options-guide-amd-epyc-7xx1-series-processors.pdf Compiler Options Quick Reference Guide] and set the respective flags on your own (with znver1 for Naples and znver2 for Rome nodes). [https://www.amd.com/system/files/TechDocs/32035.pdf Compiler Usage Guidelines for AMD64 Platforms] might also be a source of inspiration w.r.t. optimization flags.
Default compiler flags for the GCC and AOCC compilers are currently:
<pre>-march=znver2 -mtune=znver2 -O3</pre>
Default compiler flags for the Intel compilers are currently:
<pre>-march=core-avx2 -mtune=core-avx2 -O3</pre>
 
Providing a different -O'''X''' value will override the -O3 value.


<br>
<br>
=== GCC ===
Make sure to load a more up-to-date version of the GNU Compiler Collection than the one preinstalled on the system
<pre>module load gcc/9.2.0</pre>


Then compile with
<pre>gcc|g++|gfortran</pre>


<br>
<br>
=== AOCC ===
AOCC is the AMD Optimizing C/C++ Compiler based on LLVM. It contains a Fortran compiler (flang) as well.


Load the aocc module
<pre>module load aocc/2.1.0</pre>


Compile with
<pre>clang|clang++|flang</pre>


AOCC comes with a couple of exclusive compiler flags that are not part of LLVM and allow more aggressive optimizations; they are documented in the [https://developer.amd.com/wp-content/resources/AOCC-2.1-Clang-the%20C%20C++%20Compiler.pdf#page=4 C/C++] and [https://developer.amd.com/wp-content/resources/AOCC-2.1-Flang-the%20Fortran%20Compiler.pdf#page=6 Fortran] compiler manuals.


<br>
<br>


=== Intel ===
Load the Intel compiler module
<pre>module load intel/19.1.0</pre>

Compile with
<pre>icc|ifort</pre>

Do <font color="red">'''not'''</font> use
<pre><compiler> -xHOST
or
<compiler> -xCORE-AVX2</pre>
since these can '''crash the compiler''' in some cases, or the resulting binary '''will refuse to start'''.


In some cases, compiling with '''-check <arg>''' has resulted in the '''compiler crashing''' as well. If you encounter this, try removing the option.




== Compiler Options for High Performance Computing ==
This section shows compiler flags for GNU-compatible compilers (gnu, aocc, intel); other compilers may have different options for the described functionality.


<br>
<br>


=== Static Linking ===
<font color="red">'''Attention: Building static binaries (usually compiled with the "-static" flag) is currently not supported on Hawk. This section describes how to link static libraries into a binary that uses dynamic linking (run-time linker).'''</font>
Large jobs with thousands of processes can overload the file systems connected to the cluster during startup if the binary is linked to (many) shared libraries that are stored on these file systems.


To avoid this issue, and to improve performance by reducing the overhead of potentially frequent function calls into shared libraries, compiling dependencies statically into the binary is recommended.


During link-time, you can set the compiler to look for static libraries instead of shared libraries in the library search path with
<pre>
# Link libhdf5 + zlib statically, set back to look for shared libraries again after (default)
<compiler> ... -Wl,-Bstatic -lhdf5_fortran -lhdf5_f90cstub -lhdf5 -lz -Wl,-Bdynamic
</pre>


You can also specify a static library filename in the library search path directly
<pre>
# Statically link hdf5 + zlib
<compiler> ... -l:libhdf5_fortran.a -l:libhdf5_f90cstub.a -l:libhdf5.a -l:libz.a
</pre>


Or provide the full path to the static library, as with other object files
<pre>
# Statically link hdf5 + zlib
<compiler> ... /path/to/static/lib/libhdf5_fortran.a /path/to/static/lib/libhdf5_f90cstub.a /path/to/static/lib/libhdf5.a /path/to/static/lib/libz.a
</pre>


Keep in mind that all the symbols referenced in the static library need to be resolved during linking. Thus, linking to additional (static) libraries may be required. In some cases the order of the linked static libraries is important, as in the hdf5 example above.


<br>
<br>


=== Link-Time Optimization (LTO), Interprocedural Optimization (IPO), Whole Program Optimization (WPO) ===
These techniques allow the compiler to optimize the code at link time. During this, further rearrangement of the code from separate object files is performed.

An article about LTO performance comparison with GCC 10: https://www.phoronix.com/scan.php?page=article&item=gcc10-lto-tr

The option needs to be set at '''compile time''' and '''link time''':

'''GCC''', '''AOCC''':
<pre>
# Compile with LTO in mind (generate metadata in object files)
<compiler> -flto -o component1.o -c component1.c
<compiler> -flto -o component2.o -c component2.c

# Link with LTO
<compiler> -flto -o program component1.o component2.o
</pre>


Hint: With GCC you can specify the number of processes that perform the actual link-time optimization with
<pre>
# Link with LTO
gcc|g++|gfortran -flto=<n_procs> -o program component1.o component2.o
</pre>

Keep in mind that LLVM (AOCC) produces LLVM bitcode files instead of ELF object files when using LTO. Tools like objdump, readelf, strip, etc. will not work on these files, nor will linking them with other compilers.

More information here: https://www.llvm.org/docs/LinkTimeOptimization.html
'''Intel''':
<pre>
# Compile with IPO in mind (generate metadata in object files)
<compiler> -ipo -o component1.o -c component1.c
<compiler> -ipo -o component2.o -c component2.c
# Link with IPO
<compiler> -ipo -o program component1.o component2.o
</pre>
https://software.intel.com/en-us/fortran-compiler-developer-guide-and-reference-ipo-qipo
Linking with LTO/IPO takes considerably longer than normal linking.


<br>
<br>
=== Profile Guided Optimization (PGO) ===
This optimization can lead to a 10-20% boost in performance in some cases. It collects information about how the program actually runs and uses it to improve the assumptions about which code paths are more likely to be taken.
An article about PGO performance comparison with GCC 10: https://www.phoronix.com/scan.php?page=news_item&px=GCC-10-PGO-3960X-Xmas-Eve


This requires the code to be compiled twice, with the program being run on a representative use case in between.
A good example for GCC can be found here:
https://developer.ibm.com/articles/gcc-profile-guided-optimization-to-accelerate-aix-applications/


PGO documentation for LLVM: <br>
https://clang.llvm.org/docs/UsersManual.html#profiling-with-instrumentation


PGO documentation for the Intel Compiler: <br>
https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-profile-guided-optimization-pgo
<br>
== Compiler Related Environment Variables ==
The compiler modules set [https://www.gnu.org/software/make/manual/html_node/Implicit-Variables.html implicit environment variables] according to established coding practices. Properly set up build tools (GNU Autotools, CMake, etc.) use these to automatically pick the currently loaded compiler from the environment variables
<pre>
${CC}
${CXX}
${FC}
${F77}
${F90}
</pre>
and base compiler/linker flags from the environment variables
<pre>
${CFLAGS}
${CXXFLAGS}
${FFLAGS}
${LDFLAGS}
</pre>
<br>
In a manually set up build process, it is good practice to read from these environment variables as well; for example:
<pre>
#!/usr/bin/env bash
# configure script
[...]
# Flags for the GCC compiler
if [[ ${CC} == *"gcc"* ]]; then
    CFLAGS="${CFLAGS} -flto"
    LDFLAGS="${LDFLAGS} -flto=16 -l:libamdlibm.a -lm"
# Flags for the AOCC compiler
elif [[ ${CC} == *"clang"* ]]; then
    CFLAGS="${CFLAGS} -flto -finline-aggressive -mllvm -vectorize-memory-aggressively"
    LDFLAGS="${LDFLAGS} -flto -finline-aggressive -l:libamdlibm.a -lm"
fi
[...]
echo "CC = ${CC}" > make.cfg
echo "CFLAGS = ${CFLAGS}" >> make.cfg
echo "LDFLAGS = ${LDFLAGS}" >> make.cfg
</pre>
<pre>
# Makefile (recipe lines must be indented with a tab)
include make.cfg
[...]
program: component1.o component2.o
    $(CC) -o $@ component1.o component2.o $(LDFLAGS)
%.o: %.c
    $(CC) -o $@ -c $(CFLAGS) $<
[...]
</pre>
For large codebases, the use of the build tools mentioned above is strongly recommended for maintainable and portable code.
<br>

Revision as of 11:29, 15 October 2020
