NEC Cluster Using MPI
OpenMPI example
simple example
To use Open MPI with the Intel compiler, create a .modulerc file in your home directory with the following contents:
#%Module1.0#
set version 1.0
module load compiler/intel/11.0
module load mpi/openmpi/1.3-intel-11.0
For compilation, use the MPI wrapper scripts such as mpicc/mpic++/mpif90.
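A hedged illustration of such a compile step (my_app.c and my_app are placeholder names, not part of the original example):
mpicc -O2 -o my_app my_app.c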
The following example is for a pure MPI job, using 16 nodes (128 processes). For illustration, this is done using an interactive session (-I option).
First step: Batch submit to get the nodes
qsub -l nodes=16:nehalem:ppn=8,walltime=6:00:00 -I # get the 16 nodes
In the interactive session that you will get after some time, the application is started with
mpirun -np 128 PathToYourApp
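The same job can also be submitted non-interactively with a job script; the following is a minimal sketch (the script name job.pbs and the binary name ./your_app are assumptions, not part of the original example):
#!/bin/bash
#PBS -l nodes=16:nehalem:ppn=8
#PBS -l walltime=6:00:00
cd $PBS_O_WORKDIR              # run in the directory the job was submitted from
mpirun -np 128 ./your_app
Submit it with qsub job.pbs.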
more complex examples
Open MPI divides resources into so-called 'slots'. By specifying ppn:X to the batch system, you set the number of slots per node. For a simple MPI job with 8 processes per node (= 1 process per core), ppn:8 is the best choice, as in the example above. Details can be specified on the mpirun command line. The PBS setup is adjusted for ppn:8; please do not use other values.
If you want to use fewer processes per node, e.g. because you are restricted by the memory per process, or because you have a hybrid parallel application using OpenMP and MPI, note that MPI would always put the first 8 processes on the first node, the second 8 on the second, and so on. To avoid this, you can use the -npernode option:
mpirun -np X -npernode 2 your_app
This would start 2 processes per node. This way you can use a larger number of nodes with a smaller number of processes, or you can e.g. start threads out of the processes.
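For the hybrid case, a hedged sketch combining -npernode with an exported thread count (the values of 32 processes / 4 threads and the name ./hybrid_app are assumptions):
export OMP_NUM_THREADS=4
mpirun -np 32 -npernode 2 -x OMP_NUM_THREADS ./hybrid_app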
If you want to pin your processes to a CPU (and enable NUMA memory affinity), use
mpirun -np X --mca mpi_paffinity_alone 1 your_app
Warning: This will not behave as expected for hybrid multithreaded applications, as the threads will be pinned to a single CPU as well! Use this only in the case of one process per core, with no extra threads.
For pinning of hybrid OpenMP/MPI applications, you can use the wrapper from the Intel MPI example below; do not use the mpi_paffinity_alone switch, but instead
mpirun -np X -npernode 2 /path/to/wrapper.sh /path/to/app
Intel MPI example
simple example
Load the necessary modules
Run your application with
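A hedged sketch of these two steps, assuming the module names from the .modulerc shown in the more complex example below and a placeholder application name ./your_app (check module avail for the exact versions):
module load compiler/intel/11.0
module load mpi/impi/intel-11.0.074-impi-3.2.0.011
mpirun -np 128 ./your_app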
more complex example
As the Nehalem system is a two-socket system with locally attached ccNUMA memory, memory and process placement can be crucial. Here is an example of a 16-node job, using 1 process per socket and 4 threads per socket, with optimum NUMA placement of processes and memory.
Prerequisite: use Intel MPI and the best Intel compiler. To set up the environment for this, use this .modulerc file in your home directory:
#%Module1.0#
set version 1.0
module load compiler/intel/11.0
module load mpi/impi/intel-11.0.074-impi-3.2.0.011
Compile your application using mpicc/mpicxx/mpif90 (GNU compiler) or mpiicc/mpiicpc/mpiifort (Intel compiler).
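As a hedged illustration (the file names are placeholders), a hybrid OpenMP/MPI code could be built with the Intel wrappers like this:
mpiicc -openmp -O2 -o hybrid_app hybrid_app.c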
First step: Batch submit to get the nodes
Second step: make a hostlist
Third step: make a process ring to be used by MPI later
Fourth step: start MPI application
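A hedged sketch of these four steps, assuming Intel MPI's mpd-based startup (mpdboot/mpiexec) and a hypothetical hostlist file named machines; the exact commands on the system may differ:
qsub -l nodes=16:nehalem:ppn=8,walltime=6:00:00 -I            # first step: get the 16 nodes
sort -u $PBS_NODEFILE > machines                              # second step: one hostlist entry per node
mpdboot -n 16 -f machines -r ssh                              # third step: start the MPD process ring
mpiexec -perhost 2 -np 32 /path/to/wrapper.sh /path/to/app    # fourth step: 2 processes per node, 32 in total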
With wrapper.sh looking like this
#!/bin/bash
# Intel OpenMP: report pinning verbosely, scatter threads across cores
export KMP_AFFINITY=verbose,scatter
export OMP_NUM_THREADS=4
# works for Open MPI (OMPI_COMM_WORLD_RANK) as well as Intel MPI/MVAPICH2 (PMI_RANK)
RANK=${OMPI_COMM_WORLD_RANK:=$PMI_RANK}
if [ $(expr $RANK % 2) = 0 ]
then
    # even ranks: socket 0, cores 0-3, prefer local memory
    export GOMP_CPU_AFFINITY=0-3
    numactl --preferred=0 --cpunodebind=0 $@
else
    # odd ranks: socket 1, cores 4-7, prefer local memory
    export GOMP_CPU_AFFINITY=4-7
    numactl --preferred=1 --cpunodebind=1 $@
fi
The result is an application running on 16 nodes, using 32 processes spawning 128 threads. On each node, one set of 4 threads is pinned to one socket and the other set of 4 threads to the other socket.
MVAPICH2 example
simple example
Load the necessary module
Run your application with
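A hedged sketch of these two steps; the module name and the application name ./your_app are assumptions, not taken from this page (check module avail for the real module name):
module load mpi/mvapich2/1.2-intel-11.0
mpirun_rsh -np 128 -hostfile $PBS_NODEFILE ./your_app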