
NEC Cluster Using MPI

=== OpenMPI example ===

==== simple example ====

To use OpenMPI with the Intel compiler, create a .modulerc file in your home directory with the following content:

 #%Module1.0#
 set version 1.0
 module load compiler/intel/11.0
 module load mpi/openmpi/1.3-intel-11.0

For compilation, use the MPI wrapper scripts like mpicc/mpic++/mpif90.
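
For example, a compile line could look like the following sketch (app.c and the -O2 flag are just placeholders):

 mpicc -O2 -o app app.c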

The following example is for a pure MPI job using 16 nodes (128 processes). For illustration, this is done using an interactive session (-I option).

First step: Batch submit to get the nodes

 qsub -l nodes=16:nehalem:ppn=8,walltime=6:00:00 -I            # get the 16 nodes

In the interactive session, which you will get after some time, start the application with

 mpirun -np 128 PathToYourApp
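
The same run can also be submitted non-interactively with a job script, for example like this minimal sketch (the script name job.pbs and the application path are placeholders):

 #!/bin/bash
 #PBS -l nodes=16:nehalem:ppn=8,walltime=6:00:00
 cd $PBS_O_WORKDIR                 # directory the job was submitted from
 mpirun -np 128 ./PathToYourApp

Submit it with

 qsub job.pbs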

==== more complex examples ====

OpenMPI manages the resources in something called 'slots'. By specifying ppn=X to the batch system, the number of slots per node is set. So for a simple MPI job with 8 processes per node (= 1 process per core), ppn=8 is the best choice, as in the example above. Details can be specified on the mpirun command line. The PBS setup is adjusted for ppn=8; please do not use other values.

If you want to use fewer processes per node, e.g. because you are restricted by memory per process, or you have a hybrid parallel application using OpenMP and MPI, MPI would still put the first 8 processes on the first node, the second 8 on the second, and so on. To avoid this, you can do

 mpirun -np X -npernode 2 /path/to/app

This starts 2 processes per node. This way you can use a larger number of nodes with a smaller number of processes, or you can e.g. start threads out of the processes.
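
For a hybrid run, the thread count can additionally be exported to all processes with OpenMPI's -x option, e.g. (a sketch; 32 processes and 4 threads are placeholders):

 mpirun -np 32 -npernode 2 -x OMP_NUM_THREADS=4 /path/to/app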

If you want to pin your processes to a CPU (and enable NUMA memory affinity), use


 mpirun -np X --mca mpi_paffinity_alone 1 /path/to/app

Warning: This will not behave as expected for hybrid multithreaded applications, as the threads will be pinned to a single CPU as well! Use this only in the case of one process per core with no extra threads.

For pinning of hybrid OpenMP/MPI applications, you can use the wrapper script from the Intel MPI example below; do not use the mpi_paffinity_alone switch, but run

 mpirun -np X -npernode 2 /path/to/wrapper.sh /path/to/app




=== Intel MPI example ===

==== simple example ====

First load the necessary module:

 module load mpi/impi/intel-11.0.074-impi-3.2.0.011

Run your application with

 mpirun -r ssh -np 8 app
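
Put together, a complete interactive session for this simple case could look like the following sketch (node count, walltime and the application name ./app are placeholders):

 qsub -l nodes=1:nehalem:ppn=8,walltime=1:00:00 -I     # get one node interactively
 module load mpi/impi/intel-11.0.074-impi-3.2.0.011
 mpirun -r ssh -np 8 ./app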


==== more complex example ====

As the Nehalem system is a two-socket system with locally attached ccNUMA memory, memory and process placement can be crucial.

Here is an example that shows a 16-node job using 1 process per socket, 4 threads per socket, and optimal NUMA placement of processes and memory.

Prerequisite: use Intel MPI and the best Intel compiler. To set up the environment for this, use this .modulerc file in your home directory:

File: .modulerc
 #%Module1.0#
 set version 1.0
 module load compiler/intel/11.0
 module load mpi/impi/intel-11.0.074-impi-3.2.0.011


Then compile your application using mpicc/mpicxx/mpif90 (GNU compilers) or mpiicc/mpiicpc/mpiifort (Intel compilers).
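
For the hybrid setup below, the application has to be built with OpenMP enabled, for example (a sketch; app.c is a placeholder and -openmp is the OpenMP flag of the Intel 11 compilers):

 mpiicc -O2 -openmp -o yourGloriousApp app.c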

First step: Batch submit to get the nodes

 qsub -l nodes=16:nehalem:ppn=8,walltime=6:00:00 -I     # get the 16 nodes


Second step: make a hostlist

 sort -u $PBS_NODEFILE > m


Third step: make a process ring to be used by MPI later

 mpdboot -n 16 -f m -r ssh


Fourth step: start MPI application

 mpiexec -perhost 2 -genv I_MPI_PIN 0 -np 32 ./wrapper.sh ./yourGloriousApp


With wrapper.sh looking like this:

File: wrapper.sh
 #!/bin/bash
 # Pin each MPI rank and its OpenMP threads to one socket of the node.
 # KMP_AFFINITY is read by the Intel OpenMP runtime, GOMP_CPU_AFFINITY by GNU OpenMP.
 export KMP_AFFINITY=verbose,scatter
 export OMP_NUM_THREADS=4
 # Rank comes from Open MPI if set, otherwise from Intel MPI (PMI)
 RANK=${OMPI_COMM_WORLD_RANK:=$PMI_RANK}
 if [ $(expr $RANK % 2) = 0 ]
 then
      # even ranks: cores 0-3, memory preferably from NUMA node 0
      export GOMP_CPU_AFFINITY=0-3
      numactl --preferred=0 --cpunodebind=0 "$@"
 else
      # odd ranks: cores 4-7, memory preferably from NUMA node 1
      export GOMP_CPU_AFFINITY=4-7
      numactl --preferred=1 --cpunodebind=1 "$@"
 fi


The result is an application running on 16 nodes, using 32 processes spawning 128 threads. One set of 4 threads is pinned to one socket, the other set of 4 threads to the other socket.
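
The four interactive steps can also be combined into a single non-interactive job script, roughly like this sketch (it assumes wrapper.sh and yourGloriousApp are located in the submit directory):

 #!/bin/bash
 #PBS -l nodes=16:nehalem:ppn=8,walltime=6:00:00
 cd $PBS_O_WORKDIR                 # directory the job was submitted from
 sort -u $PBS_NODEFILE > m         # one hostname per node
 mpdboot -n 16 -f m -r ssh         # start the MPD process ring
 mpiexec -perhost 2 -genv I_MPI_PIN 0 -np 32 ./wrapper.sh ./yourGloriousApp
 mpdallexit                        # shut the MPD ring down again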

=== MVAPICH2 example ===

==== simple example ====

To use MVAPICH2, first load the necessary module:

 module load mpi/mvapich2

Run your application with

 mpirun_rsh -np 8 -hostfile $PBS_NODEFILE app
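
Within a batch job on 16 nodes with ppn=8, the same command scales up to 128 processes; environment variables can be passed on the mpirun_rsh command line right before the executable (a sketch; OMP_NUM_THREADS=1 is only an illustration):

 mpirun_rsh -np 128 -hostfile $PBS_NODEFILE OMP_NUM_THREADS=1 ./app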