NEC Cluster Using MPI

OpenMPI example

simple example

To use Open MPI with the Intel compiler, create a .modulerc file in your home directory with the following contents:

File: .modulerc

#%Module1.0#
set version 1.0
module load compiler/intel
module load mpi/openmpi

For compilation, use the MPI wrapper scripts such as mpicc, mpic++ or mpif90.

The following example is for a pure MPI job using 16 nodes (128 processes). For illustration, it is run in an interactive session (-I option).

First step: Batch submit to get the nodes

qsub -l nodes=16:nehalem:ppn=8,walltime=6:00:00 -I            # get the 16 nodes

Once the interactive session starts, which may take some time, launch the application with

mpirun -np 128 PathToYourApp
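
Rather than typing 128 by hand, the process count can be derived from the node file provided by the batch system. The following is a minimal sketch, assuming that $PBS_NODEFILE lists each host once per requested slot (which the `sort -u $PBS_NODEFILE` step in the Intel MPI example suggests). The function name count_slots, the host names and the demo file are illustrative only, and the mpirun command is printed rather than executed.

```shell
#!/bin/bash
# Count the slots (non-empty lines) in a PBS-style node file.
# Assumption: each host appears once per slot, i.e. ppn times per node.
count_slots() {
    grep -c . "$1"
}

# Demo with a fake node file: 2 hypothetical hosts x 8 slots each.
demo_nodefile=$(mktemp)
for host in n001 n002; do
    for slot in 1 2 3 4 5 6 7 8; do echo "$host"; done
done > "$demo_nodefile"

NP=$(count_slots "$demo_nodefile")
echo "mpirun -np $NP PathToYourApp"
rm -f "$demo_nodefile"
```

Inside a real job, the call would be count_slots "$PBS_NODEFILE" instead of the demo file.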

more complex examples

Open MPI divides resources into so-called 'slots'. Specifying ppn=X to the batch system sets the number of slots per node. For a simple MPI job with 8 processes per node (one process per core), ppn=8 is therefore the best choice, as in the example above. Details can be specified on the mpirun command line. The PBS setup is adjusted for ppn=8; please do not use other values.

By default, Open MPI fills the slots node by node: the first 8 processes go to the first node, the next 8 to the second, and so on. If you want fewer processes per node, e.g. because you are limited by memory per process or because you run a hybrid parallel application using OpenMP and MPI, use the -npernode option to change this.

mpirun -np X -npernode 2 your_app

This starts 2 processes per node. In this way you can spread a smaller number of processes over a larger number of nodes, or start threads from within each process, for example.
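
The value of X follows from the number of allocated nodes. Below is a small sketch of the arithmetic, assuming 16 nodes with 8 cores each as in the example above. The variable names are illustrative, and the command is only printed, not run.

```shell
#!/bin/bash
# Illustrative values: 16 nodes, 2 MPI processes per node, 8 cores per node.
NODES=16
NPERNODE=2
CORES_PER_NODE=8

NP=$(( NODES * NPERNODE ))                         # total MPI processes (X)
THREADS_PER_PROC=$(( CORES_PER_NODE / NPERNODE ))  # cores available to each process

echo "mpirun -np $NP -npernode $NPERNODE your_app"
echo "threads per process (hybrid case): $THREADS_PER_PROC"
```

With these values, X is 32, and each process could run up to 4 threads without oversubscribing the cores.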

If you want to pin your processes to a CPU (and enable NUMA memory affinity), use

mpirun -np X --mca mpi_paffinity_alone 1   your_app

Warning: This will not behave as expected for hybrid multithreaded applications, because the threads are pinned to the same single CPU as their process! Use it only with one process per core and no extra threads.

To pin hybrid OpenMP/MPI applications, do not use the mpi_paffinity_alone switch. Instead, use the wrapper from the Intel MPI example:

mpirun -np X -npernode 2 /path/to/wrapper.sh /path/to/app

Intel MPI example

simple example

Load the necessary module

module load mpi/impi

Run your application with

mpirun -r ssh -np 8 your_app

more complex example

Because a Nehalem node is a two-socket system with locally attached ccNUMA memory, memory and process placement can be crucial.

Here is an example of a 16-node job that uses 1 process per socket, 4 threads per socket, and optimal NUMA placement of processes and memory.

Prerequisite: Use Intel MPI and the best Intel compiler. To set up the environment for this, use this .modulerc file in your home directory:

File: .modulerc

#%Module1.0#
set version 1.0
module load compiler/intel/11.0
module load mpi/impi/intel-11.0.074-impi-3.2.0.011

Then compile your application using mpicc/mpicxx/mpif90 (GNU compiler) or mpiicc/mpiicpc/mpiifort (Intel compiler).

First step: Batch submit to get the nodes

qsub -l nodes=16:nehalem:ppn=8,walltime=6:00:00 -I           # get the 16 nodes

Second step: make a hostlist

sort -u  $PBS_NODEFILE  > m

Third step: make a process ring to be used by MPI later

mpdboot  -n 16 -f m -r ssh
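
The host count passed to mpdboot -n can also be taken from the host list instead of being typed by hand. The sketch below uses a fake node file with hypothetical host names in place of $PBS_NODEFILE, and only prints the mpdboot command. It assumes the node file repeats each host once per slot.

```shell
#!/bin/bash
# Fake node file standing in for $PBS_NODEFILE (hypothetical host names):
# 3 hosts, each listed 8 times, in arbitrary order.
nodefile=$(mktemp)
for host in n003 n001 n002; do
    for slot in 1 2 3 4 5 6 7 8; do echo "$host"; done
done > "$nodefile"

# Same idea as the second step: keep one line per distinct host.
HOSTS=$(sort -u "$nodefile")
NHOSTS=$(printf '%s\n' "$HOSTS" | grep -c .)
rm -f "$nodefile"

printf '%s\n' "$HOSTS"                    # the content that would go into file m
echo "mpdboot -n $NHOSTS -f m -r ssh"     # printed only, not executed
```

For the 16-node job on this page, the same count would come out as 16.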

Fourth step: start MPI application

mpiexec -perhost 2 -genv I_MPI_PIN 0  -np 32 ./wrapper.sh ./yourGloriousApp

where wrapper.sh looks like this:

File: wrapper.sh

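#!/bin/bash
# Pin each MPI process and its OpenMP threads to one socket.
export KMP_AFFINITY=verbose,scatter   # read by the Intel OpenMP runtime
export OMP_NUM_THREADS=4
# Rank as set by Open MPI, otherwise as set by Intel MPI (PMI)
RANK=${OMPI_COMM_WORLD_RANK:-$PMI_RANK}
if [ $(( RANK % 2 )) -eq 0 ]
then
     # even ranks: NUMA node 0
     export GOMP_CPU_AFFINITY=0-3     # read by the GNU OpenMP runtime
     numactl --preferred=0 --cpunodebind=0 "$@"
else
     # odd ranks: NUMA node 1
     export GOMP_CPU_AFFINITY=4-7
     numactl --preferred=1 --cpunodebind=1 "$@"
fi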

The result is an application running on 16 nodes, with 32 processes spawning 128 threads in total. On each node, one set of 4 threads is pinned to one socket and the other set of 4 threads to the other socket.
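
The even/odd decision in wrapper.sh and the resulting totals can be checked with a short dry run. This is only a sketch: socket_for_rank is an illustrative helper that mirrors the wrapper's test without calling numactl, and it assumes that consecutive ranks are placed on the same node, as -perhost 2 is expected to do.

```shell
#!/bin/bash
# Mirror of the even/odd test in wrapper.sh: even rank -> socket 0, odd -> socket 1.
socket_for_rank() {
    echo $(( $1 % 2 ))
}

NODES=16      # nodes in the job
PERHOST=2     # MPI processes per node (-perhost 2)
THREADS=4     # OMP_NUM_THREADS in wrapper.sh

NP=$(( NODES * PERHOST ))
TOTAL_THREADS=$(( NP * THREADS ))

for rank in 0 1 2 3; do
    echo "rank $rank -> socket $(socket_for_rank "$rank")"
done
echo "processes: $NP, threads: $TOTAL_THREADS"
```

The totals match the figures given above: 32 processes and 128 threads.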

MVAPICH2 example

simple example

Load the necessary module

module load mpi/mvapich2

Run your application with

mpirun_rsh -np 8 -hostfile $PBS_NODEFILE your_app
