- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

NEC Cluster Using MPI

From HLRS Platforms
Revision as of 19:39, 17 July 2009 by Hwwnec5 (talk | contribs)

Intel MPI example

Because a Nehalem node is a two-socket system with locally attached ccNUMA memory, memory and process placement can be crucial for performance.

Here is an example of a 16-node job that runs 1 process per socket with 4 threads per process, with optimal NUMA placement of processes and memory.

First step: submit an interactive batch job to get the nodes

 qsub -l nodes=16:nehalem:ppn=8,walltime=6:00:00 -I           # get the 16 nodes

Second step: make a hostlist with one entry per node

 sort -u  $PBS_NODEFILE  > m
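The number of unique hosts in the list should match the node count given to mpdboot in the next step. A minimal sketch of that check follows. It assumes that the batch system typically lists each node once per requested core in $PBS_NODEFILE, and make_hostlist is a hypothetical helper name, not part of PBS or Intel MPI.

```shell
# make_hostlist is a hypothetical helper, not part of PBS or Intel MPI.
# It writes one line per unique host to the output file
# and prints the number of unique hosts.
make_hostlist() {
    nodefile=$1
    hostlist=$2
    sort -u "$nodefile" > "$hostlist"
    grep -c . "$hostlist"
}

# Inside a batch job PBS_NODEFILE is set by the batch system.
# Outside a job this block does nothing.
if [ -n "${PBS_NODEFILE:-}" ]; then
    nnodes=$(make_hostlist "$PBS_NODEFILE" m)
    echo "unique nodes: $nnodes"    # expected to equal the value passed to mpdboot -n
fi
```

If the printed count differs from 16, the mpdboot call in the next step would need the actual number instead.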

Third step: start the MPD process ring that MPI will use later

 mpdboot -n 16 -f m -r ssh

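Fourth step: start the MPI application with 2 processes per node. Intel MPI's own pinning is switched off (I_MPI_PIN 0) so that the wrapper script controls the placement.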

 mpiexec -perhost 2 -genv I_MPI_PIN 0 -np 32 ./wrapper.sh ./yourGloriousApp

where wrapper.sh looks like this:

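#!/bin/bash
# Thread count per process and affinity for the Intel OpenMP runtime (KMP_*)
export KMP_AFFINITY=verbose,scatter
export OMP_NUM_THREADS=4
# With -perhost 2 each node hosts one even and one odd rank:
# even ranks go to socket 0, odd ranks to socket 1.
# The core ranges assume that cores 0-3 sit on socket 0 and cores 4-7 on socket 1.
if [ $(( PMI_RANK % 2 )) -eq 0 ]
then
       # GOMP_CPU_AFFINITY is read by the GNU OpenMP runtime
       export GOMP_CPU_AFFINITY=0-3
       numactl --preferred=0 --cpunodebind=0 "$@"
else
       export GOMP_CPU_AFFINITY=4-7
       numactl --preferred=1 --cpunodebind=1 "$@"
fi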
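The rank-to-socket mapping used by the wrapper can be checked without starting MPI. The following dry run is only a sketch: socket_for_rank and cores_for_socket are hypothetical helper names, and the layout of 4 consecutive cores per socket is the same assumption the wrapper makes.

```shell
# socket_for_rank is a hypothetical helper that mirrors the parity test in wrapper.sh:
# even ranks map to socket 0, odd ranks to socket 1.
socket_for_rank() {
    echo $(( $1 % 2 ))
}

# cores_for_socket mirrors the core ranges used in wrapper.sh.
# Assumption: 4 consecutively numbered cores per socket.
cores_for_socket() {
    first=$(( $1 * 4 ))
    echo "${first}-$(( first + 3 ))"
}

# Print the placement for the first four ranks (two nodes with -perhost 2).
for rank in 0 1 2 3; do
    s=$(socket_for_rank "$rank")
    echo "rank $rank -> socket $s, cores $(cores_for_socket "$s")"
done
```

On a node with a different core numbering, the ranges would have to be adapted. `numactl --hardware` shows which cores belong to which NUMA node.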


The result is an application running on 16 nodes with 32 processes and a total of 128 threads. On each node, one set of 4 threads is pinned to one socket and the other set of 4 threads to the other socket.
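The numbers used above follow from three values: nodes, processes per node and threads per process. A small sketch of the arithmetic, with illustrative variable names, may help when adapting the example to another job size:

```shell
# Job geometry taken from the example above. The variable names are illustrative.
nodes=16
procs_per_node=2       # value of -perhost, one process per socket
threads_per_proc=4     # value of OMP_NUM_THREADS

np=$(( nodes * procs_per_node ))                          # value for mpiexec -np
threads=$(( np * threads_per_proc ))                      # total threads in the job
threads_per_node=$(( procs_per_node * threads_per_proc )) # matches ppn in the qsub request

echo "np=$np threads=$threads threads_per_node=$threads_per_node"
```

For the values above this prints np=32 threads=128 threads_per_node=8, which is why the qsub request asks for ppn=8.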