- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

NEC Cluster Using MPI

== OpenMPI example ==

see also [[Open MPI]]

=== simple example ===

To use OpenMPI with the Intel compiler, create a .modulerc file in your home directory with the following content:


  #%Module1.0#
  set version 1.0
  module load compiler/intel/11.0
  module load mpi/openmpi/1.3-intel-11.0

For compilation, use the MPI wrapper scripts like mpicc/mpic++/mpif90.
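
A minimal compile session might look like this (a sketch: myapp.c and myapp.f90 are placeholder source files, the optimisation flag is only an example):

  mpicc  -O2 -o myapp myapp.c         # C source
  mpif90 -O2 -o myapp myapp.f90       # Fortran source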


The following example is for a pure MPI job, using 16 nodes (128 processes).
For illustration, this is done using an interactive session (-I option).

First step: Batch submit to get the nodes

  qsub -l nodes=16:nehalem:ppn=8,walltime=6:00:00 -I            # get the 16 nodes
 
In the session that you will get after some time, the application is started with

  mpirun -np 128 PathToYourApp
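
For production runs, the same job can also be submitted non-interactively. A minimal job script sketch (the script name and PathToYourApp are placeholders; resource limits as in the interactive example):

  #!/bin/bash
  #PBS -l nodes=16:nehalem:ppn=8
  #PBS -l walltime=6:00:00
  cd $PBS_O_WORKDIR                   # directory the job was submitted from
  mpirun -np 128 PathToYourApp

Submit it with qsub jobscript.sh.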
 
=== more complex examples ===
 
OpenMPI organizes the resources in something called 'slots'.
By specifying ppn=X to the batch system, the number of slots per node is specified.
So for a simple MPI job with 8 processes per node (= 1 process per core), ppn=8
is the best choice, as in the above example. Details can be specified on the mpirun command line.

If you want fewer processes per node, e.g. because you are restricted by the memory per process,
note that MPI would by default put the first 8 processes on the first node, the second 8 on the second, and so on.
To avoid this, you can do

  mpirun -np X -npernode 2 /path/to/app

This would start 2 processes per node. Like this, you can use a larger number of nodes
with a smaller number of processes, or you can e.g. start threads out of the processes.
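
For example, for the 16-node job above, a layout with 2 MPI processes per node and 4 OpenMP threads per process could be started like this (a sketch; -x forwards the variable to all processes, /path/to/app is a placeholder):

  export OMP_NUM_THREADS=4                                     # 4 threads per MPI process
  mpirun -np 32 -npernode 2 -x OMP_NUM_THREADS /path/to/app    # 16 nodes x 2 processes per node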
 
 
 
 
 
 
 
 
== Intel MPI example ==

see also [[Intel MPI]]

As the Nehalem system is a two-socket system with locally attached ccNUMA memory,
memory and process placement can be crucial.
 
Here is an example that shows a 16-node job, using 1 process per socket and 4 threads
per socket, with optimal NUMA placement of processes and memory.

Prerequisite: Use Intel MPI and the Intel compiler.
To set up the environment for this, use this .modulerc file in your home directory:
 
  #%Module1.0#
  set version 1.0
  module load compiler/intel/11.0
  module load mpi/impi/intel-11.0.074-impi-3.2.0.011
 
And compile your application using mpicc/mpif90.
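
Since the example below runs OpenMP threads inside each MPI process, compile with OpenMP enabled. A sketch (yourGloriousApp.c is a placeholder source file; use -openmp for the Intel compiler, -fopenmp for GCC):

  mpicc -openmp -O2 -o yourGloriousApp yourGloriousApp.c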
 
First step: Batch submit to get the nodes
 
  qsub -l nodes=16:nehalem:ppn=8,walltime=6:00:00 -I          # get the 16 nodes
 
Second step: make a hostlist
 
  sort -u  $PBS_NODEFILE  > m
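
The file m now contains one hostname per node; a quick check (optional):

  wc -l m        # should print 16 for this 16-node job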
 
Third step: make a process ring to be used by MPI later
 
  mpdboot -n 16 -f m -r ssh
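
Whether the ring is up can be verified with mpdtrace, which lists the participating hosts (optional):

  mpdtrace       # should list all 16 nodes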
 
Fourth step: start the MPI application

  mpiexec -perhost 2 -genv I_MPI_PIN 0 -np 32 ./wrapper.sh ./yourGloriousApp
 
With wrapper.sh looking like this:

  #!/bin/bash
  # executed once per MPI rank: pins the rank and its OpenMP threads to one socket
  export KMP_AFFINITY=verbose,scatter      # Intel OpenMP runtime: report and scatter thread binding
  export OMP_NUM_THREADS=4                 # 4 threads per MPI process
  if [ $(expr $PMI_RANK % 2) = 0 ]
  then
          # even ranks: socket 0 (cores 0-3), allocate memory preferably on NUMA node 0
          export GOMP_CPU_AFFINITY=0-3
          numactl --preferred=0 --cpunodebind=0 "$@"
  else
          # odd ranks: socket 1 (cores 4-7), allocate memory preferably on NUMA node 1
          export GOMP_CPU_AFFINITY=4-7
          numactl --preferred=1 --cpunodebind=1 "$@"
  fi
 
 
The result is an application running on 16 nodes, using 32 processes that spawn
128 threads in total. On each node, one set of 4 threads is pinned to one socket, the other set of 4 threads to the other socket.

== MVAPICH2 ==

see [[MVAPICH2]]

== MPI I/O ==

see [[MPI-IO]]