- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -
Open MPI
Open MPI is a Message Passing Interface (MPI) library project combining technologies and resources from several other projects (FT-MPI, LA-MPI, LAM/MPI, and PACX-MPI).
Examples
Simple example
This example shows the basic steps when using Open MPI.
Load the necessary module:
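For example (the exact module name below is an assumption; check module avail for the Open MPI modules installed on the system):
module load mpi/openmpi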
Compile your application using the MPI wrapper compilers mpicc, mpic++ or mpif90:
mpicc your_app.c -o your_app
The wrapper compiler passes any other options through to the underlying base compiler (e.g. GNU, Intel, PGI). To see which default options Open MPI adds for the base compiler, use
mpicc -showme
and its variants -showme:compile, -showme:link and the like.
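Roughly speaking, -showme:compile prints only the flags added for the compile step and -showme:link only those added for the link step, which can be convenient when filling in the compile and link flags of an existing makefile, for example:
mpicc -showme:compile
mpicc -showme:link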
Now we run our application using 128 processes spread across 16 nodes in an interactive job (-I option):
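A minimal sketch of such a run, assuming a PBS interactive job with 16 nodes and 8 slots per node (the exact resource request may differ on your system):
qsub -I -l nodes=16:ppn=8
mpirun -np 128 your_app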
Specifying the number of processes per node
Open MPI divides resources into so-called 'slots'. By specifying ppn:X to the batch system, you set the number of slots per node. For a simple MPI job with 8 processes per node (= 1 process per core), ppn:8 is the best choice, as in the example above. Details can be specified on the mpirun command line. The PBS setup is adjusted for ppn:8, please do not use other values.
If you want to use fewer processes per node, e.g. because you are restricted by memory requirements, or because you have a hybrid parallel application using MPI and OpenMP, MPI would by default put the first 8 processes on the first node, the second 8 on the second node, and so on. To avoid this, you can use the -npernode option.
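For example (the total of 32 processes over 16 nodes is only an illustration):
mpirun -npernode 2 -np 32 your_app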
This would start 2 processes per node. This way, you can use a larger number of nodes with a smaller number of processes, or you can e.g. start additional threads from within the processes.
Process pinning
If you want to pin your processes to a CPU (and enable NUMA memory affinity), use
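A sketch using the mpi_paffinity_alone MCA parameter of Open MPI 1.x (the same option that the thread pinning note below tells you to avoid for hybrid runs); the process count is only an illustration:
mpirun -np 128 -mca mpi_paffinity_alone 1 your_app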
Thread pinning
For pinning of hybrid MPI/OpenMP applications, use the following wrapper script (thread_pin_wrapper.sh):
#!/bin/bash
export KMP_AFFINITY=verbose,scatter   # Intel specific environment variable
export OMP_NUM_THREADS=4

# MPI rank of this process (Open MPI or PMI naming)
RANK=${OMPI_COMM_WORLD_RANK:=$PMI_RANK}

# Place even ranks on NUMA node 0 (cores 0-3) and odd ranks on NUMA node 1 (cores 4-7)
if [ $(expr $RANK % 2) = 0 ]
then
    export GOMP_CPU_AFFINITY=0-3
    numactl --preferred=0 --cpunodebind=0 $@
else
    export GOMP_CPU_AFFINITY=4-7
    numactl --preferred=1 --cpunodebind=1 $@
fi
Run your application with the following command. Do not use the mpi_paffinity_alone option in this case!
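A sketch of the call, assuming 2 MPI processes per node on 16 nodes and the wrapper script in the working directory (these numbers are only an illustration):
mpirun -np 32 -npernode 2 ./thread_pin_wrapper.sh your_app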
Valgrind Debugging
You can easily debug your application with the memory-error detector valgrind. It will detect errors such as use of uninitialized memory, buffer overruns, double frees and memory leaks. To run under Open MPI, just put valgrind on the command line right before your application:
module load tools/valgrind
mpirun -np X valgrind your_app
This will show up many false positives from Open MPI itself, e.g. memory communicated via TCP/IP with known uninitialized bytes, or buffers copied from the kernel to the InfiniBand verbs library. Valgrind offers a way to suppress these false positives, and Open MPI provides a suppression file installed in its default location:
mpirun -np X valgrind --suppressions=`dirname $(dirname $(which mpirun))`/share/openmpi/openmpi-valgrind.supp your_app
Common Problems
InfiniBand retry count
I get an error message about timeouts. What can I do?
If your parallel programs sometimes crash with an error message like this:
--------------------------------------------------------------------------
The InfiniBand retry count between two MPI processes has been
exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):

    The total number of times that the sender wishes the receiver to
    retry timeout, packet sequence, etc. errors before posting a
    completion error.

This error typically means that there is something awry within the
InfiniBand fabric itself.  You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.

Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:

* btl_openib_ib_retry_count - The number of times the sender will
  attempt to retry (defaulted to 7, the maximum value).

* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
  to 10).  The actual timeout value used is calculated as:

     4.096 microseconds * (2^btl_openib_ib_timeout)

  See the InfiniBand spec 1.2 (section 12.7.34) for more details.
--------------------------------------------------------------------------
This means that the MPI messages cannot pass through our InfiniBand switches before the btl_openib_ib_timeout expires. How often this occurs also depends on the traffic on the network. We have adjusted the parameters such that it should normally work, but if you have compiled your own Open MPI, possibly as part of another program package, you might not have set this value correctly. However, you can specify it when calling mpirun:
mpirun -mca btl_openib_ib_timeout 20 -np ... your-program ...
You can check the preconfigured parameters of the currently loaded module with:
ompi_info --param btl openib
where you can grep for the above-mentioned parameters.
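For example (the grep is only a convenience; the parameter names come from the error message above):
ompi_info --param btl openib | grep -E 'retry_count|ib_timeout'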
External links
Open MPI homepage: http://www.open-mpi.org/
Valgrind homepage: http://www.valgrind.org/
Category: MPI