NEC Cluster Using MPI

== OpenMPI example ==

To use OpenMPI with the Intel compiler, create a .modulerc file in your home directory with the following contents:

 #%Module1.0#
 set version 1.0
 module load compiler/intel/11.0
 module load mpi/openmpi/1.3-intel-11.0

For compilation, use the MPI wrapper scripts such as mpicc/mpic++/mpif90.
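
For example, compiling a C or Fortran source could look like this (myapp.c, myapp.f90 and the executable name are placeholders):

 mpicc  -O2 -o myapp myapp.c        # C source
 mpif90 -O2 -o myapp myapp.f90      # Fortran source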

The following example is for a pure MPI job, using 16 nodes (128 processes). For illustration, this is done using an interactive session (-I option).

First step: Batch submit to get the nodes

 qsub -l nodes=16:nehalem:ppn=8,walltime=6:00:00 -I            # get the 16 nodes
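
Once the interactive session has started, the allocation can optionally be verified; this quick check only assumes the standard PBS environment variables:

 wc -l < $PBS_NODEFILE       # should print 128 (16 nodes x 8 cores per node)
 sort -u $PBS_NODEFILE       # lists the 16 distinct node names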

The application itself is then started with

 mpirun -np 128 PathToYourApp
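
The same run can also be submitted as a regular (non-interactive) batch job. A minimal script sketch, where myapp is a placeholder to adapt (the module load lines may be redundant if your .modulerc is already evaluated in batch jobs):

 #!/bin/bash
 #PBS -l nodes=16:nehalem:ppn=8,walltime=6:00:00
 module load compiler/intel/11.0
 module load mpi/openmpi/1.3-intel-11.0
 cd $PBS_O_WORKDIR
 mpirun -np 128 ./myapp

Submit the script with qsub as usual.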


== Intel MPI example ==

As the Nehalem system is a two-socket system with locally attached ccNUMA memory, memory and process placement can be crucial.
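
The topology can be inspected directly on a compute node, for example with numactl:

 numactl --hardware      # shows the two NUMA nodes (sockets) and their local memory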

Here is an example that shows a 16-node job, using 1 process per socket with 4 threads per process, and optimal NUMA placement of processes and memory.

Prerequisite: use Intel MPI and the Intel compiler. To set up the environment for this, use this .modulerc file in your home directory:

 #%Module1.0#
 set version 1.0
 module load compiler/intel/11.0
 module load mpi/impi/intel-11.0.074-impi-3.2.0.011

And compile your application using mpicc/mpif90.
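
Since every process is supposed to spawn OpenMP threads, the application must be built with OpenMP enabled. A possible compile line, assuming the wrapper resolves to the Intel compiler from the module above (myapp.c is a placeholder; with a GNU compiler the flag would be -fopenmp instead of -openmp):

 mpicc -openmp -O2 -o yourGloriousApp myapp.c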

First step: Batch submit to get the nodes

 qsub -l nodes=16:nehalem:ppn=8,walltime=6:00:00 -I           # get the 16 nodes

Second step: create a host list

 sort -u  $PBS_NODEFILE  > m
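
The file m should now contain one line per node, i.e. 16 lines:

 wc -l m        # should print 16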

Third step: start the MPD ring (daemons) that MPI will use later

 mpdboot -n 16 -f m -r ssh
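
Whether the ring is up on all nodes can be checked with mpdtrace:

 mpdtrace       # should list the 16 node names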

Fourth step: start the MPI application

 mpiexec -perhost 2 -genv I_MPI_PIN 0 -np 32 ./wrapper.sh ./yourGloriousApp

with wrapper.sh looking like this:

 #!/bin/bash
 # Pin each MPI rank, together with its 4 OpenMP threads, to one socket.
 export KMP_AFFINITY=verbose,scatter   # thread placement for the Intel OpenMP runtime
 export OMP_NUM_THREADS=4              # 4 threads per MPI process
 if [ $(expr $PMI_RANK % 2) = 0 ]
 then
        # even ranks: socket 0 (cores 0-3) and its local memory
        export GOMP_CPU_AFFINITY=0-3   # thread placement for the GNU OpenMP runtime
        numactl --preferred=0 --cpunodebind=0 "$@"
 else
        # odd ranks: socket 1 (cores 4-7) and its local memory
        export GOMP_CPU_AFFINITY=4-7
        numactl --preferred=1 --cpunodebind=1 "$@"
 fi


The result is an application running on 16 nodes, using 32 processes that spawn 128 threads in total. One set of 4 threads is pinned to one socket, the other set of 4 threads to the other socket.
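
When the run has finished, the MPD ring can be shut down again with:

 mpdallexit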