- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

Open MPI
{{Infobox software
| description = '''Open MPI''' is a Message Passing Interface (MPI) library project combining technologies and resources from several other projects (FT-MPI, LA-MPI, LAM/MPI, and PACX-MPI).
| logo = [[Image:open-mpi-logo.png|75px]]
| developer              = Open MPI Development Team
| available on      = [[NEC Nehalem Cluster]]
| category                  = [[:Category:MPI | MPI]]
| license                = New BSD license
| website                = [http://www.open-mpi.org/ Open MPI homepage]
}}
 
== Examples ==
 
==== Simple example ====
This example shows the basic steps to use the global Open MPI installations on the systems at HLRS.
 
Load the necessary module to set up the environment
{{Command|command =
module load mpi/openmpi  # on vulcan<br>
module load openmpi      # on hawk
}}
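To verify which Open MPI installation is active after loading the module, you can run, for example,
{{Command|command =
mpirun --version
}}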
 
Then compile your application using the Open MPI compiler wrappers ''mpicc'', ''mpic++'' or ''mpifort'':
{{Command|command =
mpicc your_app.c -o your_app
}}
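For illustration, a minimal MPI program that could serve as <code>your_app.c</code> prints the rank and size of the job:
{{File|filename=your_app.c|content=<pre>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                /* initialize the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                        /* shut down MPI */
    return 0;
}
</pre>
}}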
 
The wrapper compilers will pass any necessary options, e.g. for include files or libraries to be linked, to the underlying base compiler and linker (e.g. GNU, Intel, AOCC).
To display the options Open MPI passes to the base compiler and linker, add the following option:
{{Command|command =
mpicc -showme
}}
To display only the options passed to the compiler or to the linker, use <code>-showme:compile</code> or <code>-showme:link</code>, respectively.
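For example, to print the flags used at compile time and at link time:
{{Command|command =
mpicc -showme:compile<br>
mpicc -showme:link
}}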
 
Now you can allocate compute nodes via the [[Batch_system]] and run your application with
{{Command | command =
mpirun <OPTIONS> your_app
}}
The most important options you may need on our systems are described in the following sections. For in-depth information on all options, we refer you to the man page of mpirun.
 
{{warning
|text=You will most likely have to use process mapping and binding to achieve the best performance for your MPI application. Therefore, please make sure to read the sections about process mapping and binding!
}}
 
==== Specifying the number of processes per node ====
Open MPI divides resources into so-called 'slots'. By default, Open MPI's mpirun will use all slots provided by the [[Batch_system]].
 
If you want to use fewer processes, e.g. because you are restricted by memory requirements, or if you want to control process placement for a hybrid parallel application using MPI and OpenMP, you will need to provide additional options.
 
In general, Open MPI will always try to fill sockets and nodes with processes before moving on to the next socket or node. To change this behaviour, you can use the <code>-N</code> option, which limits the number of processes started per node. 
{{Command
| command = mpirun -n X -N 2 your_app
}}
This would start 2 processes per node.
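For example, if 4 nodes are allocated, the following starts 8 processes in total, 2 on each node:
{{Command
| command = mpirun -n 8 -N 2 your_app
}}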
 
==== Process pinning ====
Open MPI allows pinning processes via the <code>--bind-to</code> option.
 
Binding processes to CPU cores is done with
{{Command
| command = mpirun -n X --bind-to core your_app
}}
{{Warning
| text = Do not use the option <code>--mca mpi_paffinity_alone 1</code> with newer versions of Open MPI, as it may not do what you expect.
}}
 
Binding processes to a socket is done with
{{Command
| command = mpirun -n X --bind-to socket your_app
}}
 
The actual binding policy of Open MPI can be displayed with
{{Command
| command = mpirun -n X --report-bindings --bind-to socket /bin/true
}}
 
=== Thread pinning ===
For pinning of hybrid MPI/OpenMP applications, use the following wrapper script
{{File|filename=thread_pin_wrapper.sh|content=<pre>
#!/bin/bash
export KMP_AFFINITY=verbose,scatter    # Intel-specific environment variable
export OMP_NUM_THREADS=4

# Determine the MPI rank of this process (Open MPI launcher or PMI fallback)
RANK=${OMPI_COMM_WORLD_RANK:=$PMI_RANK}

# Pin even ranks (and their OpenMP threads) to the first NUMA node,
# odd ranks to the second one
if [ $((RANK % 2)) -eq 0 ]
then
    export GOMP_CPU_AFFINITY=0-3       # GNU OpenMP thread placement
    numactl --preferred=0 --cpunodebind=0 "$@"
else
    export GOMP_CPU_AFFINITY=4-7
    numactl --preferred=1 --cpunodebind=1 "$@"
fi
</pre>
}}
 
Run your application with the following command
{{Command
| command = mpirun -np X -npernode 2 thread_pin_wrapper.sh your_app
}}
 
{{Warning| text =
Do not use the mpi_paffinity_alone option in this case!
}}
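With newer Open MPI versions, a similar hybrid placement can often be achieved without a wrapper script by using mpirun's mapping and binding options directly. A sketch, assuming one process per socket with 4 cores per process:
{{Command
| command = mpirun -n X --map-by ppr:1:socket:pe=4 --bind-to core your_app
}}
This relies on Open MPI's <code>ppr</code> (processes per resource) and <code>pe</code> (processing elements per process) mapping modifiers; check <code>mpirun --help</code> for the exact syntax supported by your version.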
 
=== Valgrind Debugging ===
You may easily debug your application with the memory-error detector <code>valgrind</code>.
It detects errors such as use of uninitialized memory, buffer overruns, double frees, and memory leaks.
To run it under Open MPI, just pass it right before the application:
 
{{Command | command =
module load tools/valgrind
mpirun -np X valgrind your_app
}}
 
This will show '''many''' false positives from Open MPI itself, e.g. memory communicated via TCP/IP
with known uninitialized content, or buffers copied from the kernel to the InfiniBand Verbs library.
Valgrind allows suppressing these false positives. Open MPI provides a suppression file installed
in the default location:
{{Command | command =
mpirun -np X valgrind --suppressions=`dirname $(dirname $( which mpirun ))`/share/openmpi/openmpi-valgrind.supp your_app
}}
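Standard Valgrind options can be combined with this as usual; for example, to additionally get a detailed leak report:
{{Command | command =
mpirun -np X valgrind --leak-check=full your_app
}}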
 
== Common Problems ==
==== InfiniBand retry count ====
I get an error message about timeouts, what can I do?
If your parallel programs sometimes crash with an error message like this:
<pre>
--------------------------------------------------------------------------
The InfiniBand retry count between two MPI processes has been
exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):
 
    The total number of times that the sender wishes the receiver to
    retry timeout, packet sequence, etc. errors before posting a
    completion error.
 
This error typically means that there is something awry within the
InfiniBand fabric itself.  You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue. 
 
Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:
 
* btl_openib_ib_retry_count - The number of times the sender will
  attempt to retry (defaulted to 7, the maximum value).
 
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
  to 10).  The actual timeout value used is calculated as:
 
    4.096 microseconds * (2^btl_openib_ib_timeout)
 
  See the InfiniBand spec 1.2 (section 12.7.34) for more details.
--------------------------------------------------------------------------
</pre>
This means that the MPI messages cannot pass through our InfiniBand switches
before the <code>btl_openib_ib_timeout</code> expires. How often this occurs
also depends on the traffic on the network. We have adjusted the parameters such
that it should normally work, but if you have compiled your own Open MPI,
possibly as part of another program package, you might not have set
this value correctly. However, you can specify it when calling mpirun:
{{Command | command =
mpirun -mca btl_openib_ib_timeout 20 -np ... your-program ...
}}
You can check the preconfigured parameters of the currently loaded module with
{{Command | command =
ompi_info --param btl openib
}}
and grep for the above-mentioned parameters.
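For example, to filter for the two parameters mentioned above:
{{Command | command =
ompi_info --param btl openib {{!}} grep btl_openib_ib
}}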
 
 
== See also ==
* [[Software Development Tools, Compilers & Libraries]]


== External links ==
* [http://www.open-mpi.org/ Open MPI homepage]
* [http://www.valgrind.org/ Valgrind homepage]
[[Category:MPI]]
