- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -
NEC Cluster Using MPI
== OpenMPI example ==
To use OpenMPI with the Intel compiler, create a .modulerc file in your home directory
with the following contents:
#%Module1.0#
set version 1.0
module load compiler/intel/11.0
module load mpi/openmpi/1.3-intel-11.0
For compilation use the MPI wrapper scripts like mpicc/mpic++/mpif90.
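As a minimal sketch (the source file name, output name and -O2 flag are placeholders, not taken from this page), compiling a C or Fortran code could look like:
mpicc  -O2 -o myapp myapp.c     # C code; the wrapper adds the MPI include and library paths
mpif90 -O2 -o myapp myapp.f90   # Fortran code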
The following example is for a pure MPI job, using 16 nodes (128 processes).
For illustration, this is done using an interactive session (-I option).
First step: Batch submit to get the nodes
qsub -l nodes=16:nehalem:ppn=8,walltime=6:00:00 -I   # get the 16 nodes
In the session that you will get after some time, the application is started with
mpirun -np 128 PathToYourApp
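For production runs the same job is usually submitted non-interactively. A sketch of such a batch script (the script name job.pbs is a placeholder; it assumes the modules from the .modulerc above are loaded in the batch shell):
#!/bin/bash
#PBS -l nodes=16:nehalem:ppn=8,walltime=6:00:00
cd $PBS_O_WORKDIR               # directory the job was submitted from
mpirun -np 128 PathToYourApp    # same command as in the interactive session
Submit it with qsub job.pbs instead of the interactive qsub -I call.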
== Intel MPI example ==
As the Nehalem system is a two-socket system with locally attached ccNUMA memory, memory and process placement can be crucial.
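To see the two NUMA nodes of a compute node and how much memory is local to each socket, the topology can be inspected with numactl (the exact output layout depends on the numactl version):
numactl --hardware   # lists the NUMA nodes, their CPU cores and local memory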
Here is an example that shows a 16-node job, using 1 process per socket and 4 threads per socket, with optimum NUMA placement of processes and memory.
Prerequisite: Use Intel MPI and the best Intel compiler. To set up the environment for this, use this .modulerc file in your home directory:
#%Module1.0#
set version 1.0
module load compiler/intel/11.0
module load mpi/impi/intel-11.0.074-impi-3.2.0.011
And compile your application using mpicc/mpif90.
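Because each MPI process spawns OpenMP threads in this example, OpenMP support has to be enabled at compile time. A sketch (source and binary names are placeholders; which flag applies depends on the compiler behind the mpicc wrapper):
mpicc -openmp  -O2 -o yourGloriousApp yourApp.c    # Intel compilers (icc/ifort) use -openmp
# mpicc -fopenmp -O2 -o yourGloriousApp yourApp.c  # GCC would need -fopenmp instead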
First step: Batch submit to get the nodes
qsub -l nodes=16:nehalem:ppn=8,walltime=6:00:00 -I # get the 16 nodes
Second step: make a hostlist
sort -u $PBS_NODEFILE > m
Third step: make a process ring to be used by MPI later
mpdboot -n 16 -f m -r ssh
Fourth step: start MPI application
mpiexec -perhost 2 -genv I_MPI_PIN 0 -np 32 ./wrapper.sh ./yourGloriousApp
With wrapper.sh looking like this:
#!/bin/bash
# Pin threads and memory by MPI rank: even ranks go to socket 0 (cores 0-3),
# odd ranks to socket 1 (cores 4-7).
export KMP_AFFINITY=verbose,scatter
export OMP_NUM_THREADS=4
if [ $(expr $PMI_RANK % 2) = 0 ]
then
   export GOMP_CPU_AFFINITY=0-3
   numactl --preferred=0 --cpunodebind=0 "$@"
else
   export GOMP_CPU_AFFINITY=4-7
   numactl --preferred=1 --cpunodebind=1 "$@"
fi
The result is an application running on 16 nodes, using 32 processes spawning 128 threads. One set of 4 threads is pinned to one socket, the other set of 4 threads to the other socket.
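The four steps can also be put into a single batch script for non-interactive runs. A sketch (the script name hybrid.pbs is a placeholder; it assumes the Intel MPI modules from the .modulerc above and the wrapper.sh shown before):
#!/bin/bash
#PBS -l nodes=16:nehalem:ppn=8,walltime=6:00:00
cd $PBS_O_WORKDIR
sort -u $PBS_NODEFILE > m        # one line per node
mpdboot -n 16 -f m -r ssh        # start the MPD process ring
mpiexec -perhost 2 -genv I_MPI_PIN 0 -np 32 ./wrapper.sh ./yourGloriousApp
mpdallexit                       # shut the MPD ring down again
Submit it with qsub hybrid.pbs.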