- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

Open MPI
Latest revision as of 15:35, 15 November 2023

Open MPI is a Message Passing Interface (MPI) library project combining technologies and resources from several other projects (FT-MPI, LA-MPI, LAM/MPI, and PACX-MPI).

Logo: Open-mpi-logo.png
Developer: Open MPI Development Team
Platforms: HPE_Hawk, Vulcan
Category: MPI
License: New BSD license
Website: Open MPI homepage (http://www.open-mpi.org/)


Examples

Simple example

This example shows the basic steps to use the global Open MPI installations on the systems at HLRS.

Load the necessary module to set up the environment

module load openmpi # on vulcan and hawk


Then compile your application using the Open MPI compiler wrappers mpicc, mpic++ or mpifort:

mpicc your_app.c -o your_app


The wrapper compilers pass any necessary options, e.g., for include files or libraries to be linked, to the underlying base compiler and linker (e.g., GCC, Intel, AOCC). To display the options that the Open MPI wrapper passes to the base compiler and linker, use the following option:

mpicc -showme

To see only the options passed to the compiler or to the linker, use -showme:compile or -showme:link, respectively.

Now you can allocate compute nodes via the Batch_system and run your application with

mpirun <OPTIONS> your_app

The most important options you may need on our systems are described in the following sections. For in-depth information about all options, we refer you to the mpirun man page.

Warning: You will most likely have to use process mapping and binding to achieve the best performance for your MPI application. Therefore, please make sure to read the sections about process mapping and binding!


Specifying the number of processes per node or socket for pure MPI applications

Open MPI divides resources into units called 'slots'. By default, Open MPI's mpirun will use all slots provided by the Batch_system.

Note: On Hawk, the number of slots per node is determined by the property mpiprocs passed to PBS's qsub command in the select statement (default: 128).
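As an illustrative sketch of how slots relate to the batch request (the exact select syntax and available resource options depend on the system, and the walltime value is arbitrary; mpiprocs is the property described in the note above):

```shell
# Hypothetical interactive job request: 2 nodes with 128 MPI slots each.
qsub -I -l select=2:mpiprocs=128 -l walltime=00:30:00

# Inside the job, mpirun fills all provided slots by default:
mpirun ./your_app    # starts 2 * 128 = 256 processes
```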


If you want to use fewer processes, e.g., because your application is limited by memory requirements, or if you want to control the placement of a hybrid parallel application that uses MPI+OpenMP, you will need to provide additional options.

By default, Open MPI will try to fill sockets and nodes with processes before moving on to the next socket or node. To achieve a different behaviour, use the --map-by option.

To run your application with X processes and 1 process per node the following can be used:

mpirun -n X --map-by ppr:1:node your_app


To run your application with X processes and 1 process per socket, e.g., for MPI+OpenMP, the following can be used:

mpirun -n X --map-by ppr:1:socket your_app


Warning: For hybrid (MPI+OpenMP) applications it is necessary to adapt the bindings, as Open MPI will bind each process to a single core by default! (see #Thread binding/pinning for hybrid MPI+OpenMP applications)


Process binding/pinning

To restrict the movement of processes, Open MPI supports binding to resources at various levels. By default, Open MPI chooses a binding based on the mapping used, e.g., if you map processes by socket, they will also be bound by socket.

If you want to specify the process binding explicitly, you can use the --bind-to option. For example, mapping processes by socket but binding them to cores (instead of the default binding by socket) can be done with

mpirun -n X --map-by socket --bind-to core your_app


The process binding that Open MPI applies for the given options can easily be checked with the following command:

mpirun ... --report-bindings /bin/true


Warning: On Hawk, the default behaviour of Open MPI 4.0.x and 4.1.x is to map by NUMA partition, contrary to what the Open MPI documentation states (map by socket)! If you want to bind by socket, you have to specify this explicitly with --bind-to socket. (For details see the related Open MPI issue: https://github.com/open-mpi/ompi/issues/9773)


Note: The --map-by and --bind-to options provide many predefined placement policies. However, if you need full manual control over process placement and binding, you can use the pe-list and rankfile arguments of --map-by.


Thread binding/pinning for hybrid MPI+OpenMP applications

To run a hybrid MPI+OpenMP application with fine-grained control over thread binding, one can combine Open MPI's binding options with OpenMP's environment variables for thread placement.

For example, suppose an application shall be run with 16 MPI processes, each of which uses 4 OpenMP threads. Each process shall be mapped and bound to an L3 cache, and each thread shall be bound to a single core (in case of SMT, to all hardware threads belonging to that core). To achieve this, one can use the following command:

mpirun -x OMP_NUM_THREADS=4 -x OMP_PLACES=cores -n 16 --map-by l3cache --bind-to l3cache your_app


As another example, to run an application with 16 MPI processes times 8 OpenMP threads, binding each thread to its own core:

mpirun -x OMP_NUM_THREADS=8 -x OMP_PLACES=cores -n 16 --map-by node:PE=8 --bind-to core your_app


To check the correct pinning of MPI+OpenMP applications beforehand, one may use, e.g., the xthi program (https://github.com/olcf/XC30-Training/blob/master/affinity/Xthi.c).

Valgrind Debugging

You can easily debug your application with the memory-error detector valgrind. It detects errors such as use of uninitialized memory, buffer overruns, double frees, and memory leaks. To run under Open MPI, simply have mpirun launch valgrind with your application as its argument:

module load tools/valgrind
mpirun -n X valgrind your_app


This will show many false positives originating from Open MPI itself, e.g., memory communicated via TCP/IP with known-uninitialized contents, or buffers copied from the kernel by the InfiniBand verbs library. Valgrind allows suppressing such false positives, and Open MPI provides a suppression file installed in its default location:

mpirun -n X valgrind --suppressions=`dirname $(dirname $( which mpirun ))`/share/openmpi/openmpi-valgrind.supp your_app
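The backtick expression above merely derives the Open MPI installation prefix from the location of the mpirun binary. As a self-contained sketch of the same logic (the installation path below is a made-up example, not an actual HLRS path):

```shell
#!/bin/sh
# Mirror the `dirname $(dirname $(which mpirun))` expression:
# strip /bin/mpirun from the path to obtain the installation prefix,
# then append the well-known location of the suppression file.
ompi_supp_file() {
    prefix=$(dirname "$(dirname "$1")")
    echo "$prefix/share/openmpi/openmpi-valgrind.supp"
}

# Hypothetical installation path, for illustration only:
ompi_supp_file /opt/openmpi/4.1/bin/mpirun
# -> /opt/openmpi/4.1/share/openmpi/openmpi-valgrind.supp
```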


Common Problems

InfiniBand retry count

I get an error message about timeouts, what can I do?

    If your parallel programs sometimes crash with an error message like this:
    --------------------------------------------------------------------------
    The InfiniBand retry count between two MPI processes has been
    exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
    (section 12.7.38):
    
        The total number of times that the sender wishes the receiver to
        retry timeout, packet sequence, etc. errors before posting a
        completion error.
    
    This error typically means that there is something awry within the
    InfiniBand fabric itself.  You should note the hosts on which this
    error has occurred; it has been observed that rebooting or removing a
    particular host from the job can sometimes resolve this issue.  
    
    Two MCA parameters can be used to control Open MPI's behavior with
    respect to the retry count:
    
    * btl_openib_ib_retry_count - The number of times the sender will
      attempt to retry (defaulted to 7, the maximum value).
    
    * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
      to 10).  The actual timeout value used is calculated as:
    
         4.096 microseconds * (2^btl_openib_ib_timeout)
    
      See the InfiniBand spec 1.2 (section 12.7.34) for more details.
    --------------------------------------------------------------------------
    

    This means that MPI messages could not pass through our InfiniBand switches before the btl_openib_ib_timeout expired. How often this occurs also depends on the traffic on the network. We have adjusted the parameters so that it should normally work, but if you have compiled your own Open MPI, perhaps as part of another software package, you might not have set this value correctly. However, you can specify it when calling mpirun:

    mpirun -mca btl_openib_ib_timeout 20 -np ... your-program ...
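To get a feeling for what a given btl_openib_ib_timeout value means in wall-clock time, the formula from the error message above (4.096 microseconds * 2^btl_openib_ib_timeout) can be evaluated with a short sketch:

```shell
#!/bin/sh
# Effective InfiniBand ACK timeout in seconds:
# 4.096 microseconds * 2^btl_openib_ib_timeout
ib_timeout_seconds() {
    awk -v t="$1" 'BEGIN { printf "%.3f\n", 4.096e-6 * 2 ^ t }'
}

ib_timeout_seconds 10    # default value 10 -> prints 0.004
ib_timeout_seconds 20    # increased value 20 -> prints 4.295
```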
    

    You can check the preconfigured parameters of the currently loaded module with:

     ompi_info --param btl openib 
    

    where you can grep for the parameters mentioned above.
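For example, to filter for the two parameters discussed above (the exact output format varies between Open MPI versions, so treat this as a sketch):

```shell
ompi_info --param btl openib | grep -E 'ib_(retry_count|timeout)'
```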


See also

External links

    Open MPI homepage: http://www.open-mpi.org/
    Valgrind homepage: http://www.valgrind.org/