NEC Cluster Using MPI: Difference between revisions

Latest revision as of 14:44, 12 June 2013

@@ Line 1: / Line 1: @@
-=== OpenMPI example ===
+=== OpenMPI ===
-==== simple example ====
+see [[Open MPI]]
-To use OpenMPI with the Intel compiler, create a .modulerc in your home directory
+=== Intel MPI ===
-with these contents:
- #%Module1.0#
+see [[Intel MPI]]
- set version 1.0
- module load compiler/intel/11.0
- module load mpi/openmpi/1.3-intel-11.0
-For compilation, use the MPI wrapper scripts such as mpicc/mpic++/mpif90.
+=== MVAPICH2 ===
-The following example is for a pure MPI job, using 16 nodes (128 processes).
+see [[MVAPICH2]]
-For illustration, this is done using an interactive session (-I option).
-First step: Batch submit to get the nodes
+=== MPI I/O ===
-  qsub -l nodes=16:nehalem:ppn=8,walltime=6:00:00 -I            # get the 16 nodes
+see [[MPI-IO]]
-In the session you will get after some time, the application is started with
- mpirun -np 128 PathToYourApp
-==== more complex examples ====
-OpenMPI divides the resources into units called 'slots'.
-By specifying 'ppn=X' to the batch system, the number of slots per node is specified.
-So for a simple MPI job with 8 processes per node (= 1 process per core), ppn=8
-is the best choice, as in the above example. Details can be specified on the mpirun command line.
-The PBS setup is adjusted for ppn=8; please do not use other values.
-If you want to use fewer processes per node, e.g. because you are restricted by memory per process
-or you have a hybrid parallel application using OpenMP and MPI,
-MPI would by default still put the first 8 processes on the first node, the second 8 on the second, and so on.
-To avoid this, you can do
- mpirun -np X -npernode 2 /path/to/app
-This would start 2 processes per node. In this way, you can use a larger number of nodes
-with a smaller number of processes, or you can e.g. start threads from within the processes.
-If you want to pin your processes to a CPU (and enable NUMA memory affinity) use
- mpirun -np X --mca mpi_paffinity_alone 1   /path/to/app
-Warning: This will not behave as expected for hybrid multithreaded applications,
-as the threads will be pinned to a single CPU as well! Use this only in case
-of one process per core, no extra threads.
-For pinning of hybrid OpenMP/MPI applications, you can use the wrapper from the Intel MPI example.
-Do not use the mpi_paffinity_alone switch in that case, but
- mpirun -np X -npernode 2 /path/to/wrapper.sh /path/to/app
-=== Intel MPI example ===
-==== simple example ====
-First load the necessary modules
- module load mpi/impi/intel-11.0.074-impi-3.2.0.011
-Run your application with
- mpirun -r ssh -np 8 app
-==== more complex example ====
-As the Nehalem system is a two-socket system with locally attached ccNUMA memory,
-memory and process placement can be crucial.
-Here is an example that shows a 16-node job, using 1 process per socket and 4 threads
-per socket, with optimum NUMA placement of processes and memory.
-Prerequisite: use Intel MPI and the Intel compiler.
-To set up the environment for this, use this .modulerc file in your home directory:
-{{File
-| filename = .modulerc
-| content = <pre>
-#%Module1.0#
-set version 1.0
-module load compiler/intel/11.0
-module load mpi/impi/intel-11.0.074-impi-3.2.0.011
-</pre>
-}}
-And compile your application using mpicc/mpicxx/mpif90 (GNU compiler) or mpiicc/mpiicpc/mpiifort (Intel compiler).
-First step: Batch submit to get the nodes
-{{Command
-| command = qsub -l nodes=16:nehalem:ppn=8,walltime=6:00:00 -I           # get the 16 nodes
-}}
-Second step: make a hostlist
-{{Command
-| command = sort -u  $PBS_NODEFILE  > m
-}}
-Third step: make a process ring to be used by MPI later
-{{Command
-| command = mpdboot  -n 16 -f m -r ssh
-}}
-Fourth step: start MPI application
-{{Command
-| command = mpiexec -perhost 2 -genv I_MPI_PIN 0  -np 32 ./wrapper.sh ./yourGloriousApp
-}}
-With wrapper.sh looking like this
-{{File
-| filename = wrapper.sh
-| content =<pre>
-#!/bin/bash
-export KMP_AFFINITY=verbose,scatter
-export OMP_NUM_THREADS=4
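-# Use the OpenMPI rank if set, otherwise fall back to PMI_RANK
-RANK=${OMPI_COMM_WORLD_RANK:=$PMI_RANK}
-if [ $((RANK % 2)) -eq 0 ]
-then
-     # even ranks: cores 0-3, NUMA node 0
-     export GOMP_CPU_AFFINITY=0-3
-     numactl --preferred=0 --cpunodebind=0 "$@"
-else
-     # odd ranks: cores 4-7, NUMA node 1
-     export GOMP_CPU_AFFINITY=4-7
-     numactl --preferred=1 --cpunodebind=1 "$@"
-fi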
-</pre>
-}}
-The result is an application running on 16 nodes, using 32 processes, each spawning 4 threads.
-On each node, one set of 4 threads is pinned to one socket and the other set of 4 threads to the other socket.
-=== MVAPICH2 example ===
-==== simple example ====
-To use MVAPICH2, first load the necessary module
- module load mpi/mvapich2
-You run your application with
- mpirun_rsh -np 8 -hostfile $PBS_NODEFILE app
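
The placement arithmetic in the removed Intel MPI example (the rank-to-socket mapping in wrapper.sh and the process count passed to mpiexec) can be checked without launching MPI. The following is a minimal sketch, assuming two sockets per node with cores 0-3 on socket 0 and cores 4-7 on socket 1, as in wrapper.sh. The function names `socket_for_rank`, `cores_for_socket` and `total_ranks` are hypothetical helpers introduced for illustration; they are not part of any MPI distribution.

```shell
#!/bin/bash
# Hypothetical helpers illustrating the placement arithmetic; not part of any MPI package.

# Socket index for a rank: even ranks go to socket 0, odd ranks to socket 1.
socket_for_rank() {
    echo $(( $1 % 2 ))
}

# Core range for a socket, assuming 4 cores per socket numbered 0-3 and 4-7.
cores_for_socket() {
    echo "$(( $1 * 4 ))-$(( $1 * 4 + 3 ))"
}

# Total number of MPI ranks for a given node count and ranks per node.
total_ranks() {
    echo $(( $1 * $2 ))
}

# Print the placement for the two ranks of one node.
for rank in 0 1; do
    socket=$(socket_for_rank "$rank")
    echo "rank $rank -> socket $socket, cores $(cores_for_socket "$socket")"
done
echo "16 nodes x 2 ranks per node = $(total_ranks 16 2) ranks"
```

The even/odd split only puts one rank on each socket if consecutive ranks land on the same node, which is what the example relies on when it passes `-perhost 2`. The same arithmetic gives the 128 processes of the pure MPI example (16 nodes with 8 ranks each).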
