- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -
Revision as of 17:23, 26 November 2021
Open MPI is a Message Passing Interface (MPI) library project combining technologies and resources from several other projects (FT-MPI, LA-MPI, LAM/MPI, and PACX-MPI).
Examples
Simple example
This example shows the basic steps to use the global Open MPI installations on the systems at HLRS.
Load the necessary module to set up the environment
module load openmpi # on hawk
Then compile your application using the Open MPI compiler wrappers mpicc, mpic++ or mpifort:
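Since the compile command itself is not shown here, a minimal sketch (source-file and program names are placeholders):

```shell
# Compile a C source with the Open MPI wrapper (your_app.c is a placeholder)
mpicc -O2 -o your_app your_app.c

# Fortran and C++ sources use the corresponding wrappers
mpifort -O2 -o your_app your_app.f90
mpic++  -O2 -o your_app your_app.cpp
```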
The wrapper compilers pass any necessary options, e.g. include paths or libraries to be linked, on to the underlying base compiler and linker (e.g. GNU, Intel, AOCC).
To display the options Open MPI passes to the base compiler and linker, add the following option:
To get only the options passed to the compiler or only those passed to the linker, use -showme:compile or -showme:link, respectively.
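For illustration, the -showme family of wrapper options can be used like this:

```shell
# Show the full command line the wrapper would execute
mpicc -showme

# Show only the compile-time or only the link-time options
mpicc -showme:compile
mpicc -showme:link
```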
Now you can allocate compute nodes via the Batch_system and run your application with mpirun.
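For example, a minimal launch inside a batch job might look like this (the process count is a placeholder; adjust it to your allocation):

```shell
# Launch 128 MPI processes of your application
mpirun -n 128 ./your_app
```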
The most important options you may need on our systems are described in the following sections. For in-depth information on all options, see the mpirun man page.
Specifying the number of processes per node or socket
Open MPI divides resources into so-called 'slots'. By default, Open MPI's mpirun will use all slots provided by the Batch_system.
If you want to use fewer processes, e.g. because you are restricted by memory requirements, or you want to control the placement of a hybrid parallel application that uses MPI+OpenMP, you will need to provide additional options.
By default, Open MPI will try to fill sockets and nodes with processes before moving to the next socket or node.
To avoid this behaviour, you can use the --map-by option.
To run 2 processes per node the following can be used:
To run 2 processes per socket the following can be used:
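The placement described above can be expressed with Open MPI's ppr (processes per resource) mapping syntax; the total process count of 16 is a placeholder:

```shell
# Run 2 processes per node
mpirun -n 16 --map-by ppr:2:node ./your_app

# Run 2 processes per socket
mpirun -n 16 --map-by ppr:2:socket ./your_app
```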
Process binding/pinning
To restrict the movement of processes, Open MPI supports binding to resources at various levels. By default, Open MPI chooses a binding based on the mapping used, e.g., if you map processes by socket they will also be bound to the socket.
If you want to specify the process binding explicitly you can use the --bind-to option.
For example, mapping processes by socket but binding them to cores (instead of the socket default) can be done with
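A sketch of such a launch (the process count is a placeholder):

```shell
# Map processes round-robin over sockets, but bind each one to a single core
mpirun -n 16 --map-by socket --bind-to core ./your_app
```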
The process binding Open MPI applies for the provided options can easily be checked with the following command
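Open MPI's --report-bindings option prints the binding of each process at startup, for example:

```shell
# Print the binding chosen for each process before the application starts
mpirun -n 4 --map-by socket --bind-to core --report-bindings ./your_app

# To inspect bindings only, a trivial program can stand in for the application
mpirun -n 4 --map-by socket --bind-to core --report-bindings /bin/true
```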
Thread binding/pinning for hybrid MPI+OpenMP applications
To run a hybrid MPI+OpenMP application with fine-grained control over the thread binding, one can combine the binding options of Open MPI with OpenMP's environment variables for thread placement.
For example, an application shall be run using 16 MPI processes, each itself using 4 OpenMP threads. Each process shall be mapped/bound to the l3cache and each thread shall be bound to a single core (in case of SMT all hwthreads belonging to this core). To achieve this one can use the following command:
{{Command|command=mpirun -x OMP_NUM_THREADS=4 -x OMP_PLACES=cores -n 16 --map-by l3cache --bind-to l3cache your_app}}
Another example, to run an application with 16 MPI processes times 8 OpenMP threads, binding each thread to its own core:
{{Command|command=mpirun -x OMP_NUM_THREADS=8 -x OMP_PLACES=cores -n 16 --map-by node:PE=8 --bind-to core your_app}}
To check the correct pinning of MPI+OpenMP applications beforehand one may use, e.g., the xthi program.
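Assuming an xthi binary is available (it must be built or provided on the system; the path here is a placeholder), it can simply replace the application in the intended launch command:

```shell
# Run xthi with the same mapping/binding options planned for the real job;
# it reports the core affinity of every MPI rank and OpenMP thread
mpirun -x OMP_NUM_THREADS=4 -x OMP_PLACES=cores -n 16 --map-by l3cache --bind-to l3cache ./xthi
```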
Valgrind Debugging
You may easily debug your application with the memory-error detector valgrind.
This will detect errors such as use of uninitialized memory, buffer over-runs, double frees, and memory leaks.
To run with Open MPI, pass valgrind just before specifying the application:
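A minimal sketch of such a launch (process count and application name are placeholders):

```shell
# Each MPI process runs inside its own valgrind instance
mpirun -np 2 valgrind ./your_app
```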
This will show many false positives from Open MPI itself -- e.g. memory communicated via TCP/IP
with known uninitialized memory, or buffers copied from the kernel to the InfiniBand Verbs library.
Valgrind offers a way to suppress these false positives. Open MPI provides a suppression file installed
in the default location:
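Open MPI ships the suppression file openmpi-valgrind.supp under share/openmpi in its installation prefix; the exact prefix depends on the installation, so $OPENMPI_PREFIX below is a placeholder:

```shell
# Suppress known Open MPI false positives using the shipped suppression file
mpirun -np 2 valgrind \
  --suppressions=$OPENMPI_PREFIX/share/openmpi/openmpi-valgrind.supp ./your_app
```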
Common Problems
InfiniBand retry count
I get an error message about timeouts, what can I do?
If your parallel programs sometimes crash with an error message like this:
--------------------------------------------------------------------------
The InfiniBand retry count between two MPI processes has been exceeded.
"Retry count" is defined in the InfiniBand spec 1.2 (section 12.7.38):

    The total number of times that the sender wishes the receiver to retry
    timeout, packet sequence, etc. errors before posting a completion error.

This error typically means that there is something awry within the InfiniBand
fabric itself. You should note the hosts on which this error has occurred;
it has been observed that rebooting or removing a particular host from the
job can sometimes resolve this issue.

Two MCA parameters can be used to control Open MPI's behavior with respect
to the retry count:

* btl_openib_ib_retry_count - The number of times the sender will attempt
  to retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted to 10).
  The actual timeout value used is calculated as:

    4.096 microseconds * (2^btl_openib_ib_timeout)

  See the InfiniBand spec 1.2 (section 12.7.34) for more details.
--------------------------------------------------------------------------
This means that the MPI messages cannot pass through our InfiniBand switches before the btl_openib_ib_timeout expires. How often this occurs also depends on the traffic on the network. We have adjusted the parameters such that it should normally work, but if you have compiled your own Open MPI, perhaps as part of another program package, you might not have adjusted this value correctly. However, you can specify it when calling mpirun:
mpirun -mca btl_openib_ib_timeout 20 -np ... your-program ...
You can check the preconfigured parameters of the currently loaded module with:
ompi_info --param btl openib
where you can grep for the above-mentioned parameters.
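For example, to filter the output down to the two parameters discussed above (on newer Open MPI versions, ompi_info may additionally need a --level option to show these parameters):

```shell
# Show only the retry count and timeout parameters of the openib btl
ompi_info --param btl openib | grep -E 'retry_count|ib_timeout'

# On newer Open MPI versions, raise the verbosity level if nothing is shown
ompi_info --param btl openib --level 9 | grep -E 'retry_count|ib_timeout'
```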