- Infos im HLRS Wiki sind nicht rechtsverbindlich und ohne Gewähr -

Difference between revisions of "NEC Cluster Using MPI"

From HLRS Platforms
Jump to navigationJump to search
(Moved description for Open MPI to the corresponding page)
Line 86: Line 86:
=== MVAPICH2 example ===
=== MVAPICH2 example ===
==== simple example ====
see [[MVAPICH2]]
Load the necessary module
| command = module load mpi/mvapich2
Run your application with
| command = mpirun_rsh -np 8 -hostfile $PBS_NODEFILE your_app

Revision as of 14:28, 26 February 2010

OpenMPI example

see Open MPI

Intel MPI example

simple example

Load the necessary modules

module load mpi/impi

Run your application with

mpirun -r ssh -np 8 your_app

more complex example

As Nehalem system is a two socket system with local attached ccNUMA memory, memory and process placement can be crucial.

Here is an example that shows a 16 node Job, using 1 process per socket and 4 threads per socket and optimum NUMA placement of processes and memory.

Prerequisite: Use intel MPI and best intel compiler To setup environment for this, use this .modulerc file in your home:

File: .modulerc
set version 1.0
module load compiler/intel/11.0
module load mpi/impi/intel-11.0.074-impi-

And compile your application using mpicc/mpicxx/mpif90 (GNU compiler) or mpiicc/mpiicpc/mpiifort (Intel compiler).

First step: Batch submit to get the nodes

qsub -l nodes=16:nehalem:ppn=8,walltime=6:00:00 -I # get the 16 nodes

Second step: make a hostlist

sort -u $PBS_NODEFILE > m

Third step: make a process ring to be used by MPI later

mpdboot -n 16 -f m -r ssh

Fourth step: start MPI application

mpiexec -perhost 2 -genv I_MPI_PIN 0 -np 32 ./wrapper.sh ./yourGloriousApp

With wrapper.sh looking like this

File: wrapper.sh
export KMP_AFFINITY=verbose,scatter
if [ $(expr $RANK % 2) = 0  ]
     export GOMP_CPU_AFFINITY=0-3
     numactl --preferred=0 --cpunodebind=0 $@
     export GOMP_CPU_AFFINITY=4-7
     numactl --preferred=1 --cpunodebind=1 $@

Result is an application running on 16 nodes, using 32 processes spawning 128 threads. One set of 4 threads is pinned to the one socket, the other set of 4 threads to the other socket.

MVAPICH2 example