|
|
(14 intermediate revisions by 2 users not shown) |
Line 1: |
Line 1: |
| == OpenMPI example == | | === OpenMPI === |
|
| |
|
| === simple example ===
| | see [[Open MPI]] |
|
| |
|
| To use OpenMPI with the Intel compiler, create a .modulerc file in your home directory
| | === Intel MPI === |
| with the following contents:
| |
|
| |
|
| #%Module1.0#
| | see [[Intel MPI]] |
| set version 1.0
| |
| module load compiler/intel/11.0
| |
| module load mpi/openmpi/1.3-intel-11.0
| |
|
| |
|
| For compilation use the mpi wrapper scripts like mpicc/mpic++/mpif90.
| | === MVAPICH2 === |
|
| |
|
| The following example is for a pure MPI job, using 16 nodes (128 processes).
| | see [[MVAPICH2]] |
| For illustration, this is done using an interactive session (-I option).
| |
|
| |
|
| First step: Batch submit to get the nodes
| | === MPI I/O === |
|
| |
|
| qsub -l nodes=16:nehalem:ppn=8,walltime=6:00:00 -I # get the 16 nodes
| | see [[MPI-IO]] |
| | |
| In the session you will get after some time, the application is started with
| |
|
| |
| mpirun -np 128 PathToYourApp
| |
| | |
| === more complex examples ===
| |
| | |
| OpenMPI manages the resources in units called 'slots'.
| |
| By specifying 'ppn=X' to the batch system, you set the number of slots per node.
| |
| So for a simple MPI job with 8 processes per node (= 1 process per core), ppn=8
| |
| is the best choice, as in the example above. Details can be specified on the mpirun command line.
| |
| The PBS setup is adjusted for ppn=8; please do not use other values.
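The slot count can be checked from inside a job. The following is a minimal sketch, assuming a Torque/PBS-style node file in which each node is listed once per requested slot (so 16 nodes with ppn=8 would give 128 lines). The function name `count_slots` is hypothetical and not part of OpenMPI or PBS.

```shell
#!/bin/bash
# count_slots FILE: print the number of non-empty lines in a PBS-style node file.
# Assumption: the batch system writes one line per slot.
count_slots() {
    grep -c . "$1"
}

# Inside a job, $PBS_NODEFILE is set by the batch system; outside a job
# this block does nothing.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "slots available: $(count_slots "$PBS_NODEFILE")"
fi
```

If the assumption holds, the printed number is the largest value that makes sense for `-np` when running one process per core.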
| |
| | |
| If you want to use fewer processes per node, e.g. because you are restricted by memory per process,
| |
| or because you have a hybrid parallel application using OpenMP and MPI,
| |
| note that MPI would by default put the first 8 processes on the first node, the second 8 on the second, and so on.
| |
| To avoid this, you can do
| |
| | |
| mpirun -np X -npernode 2 /path/to/app
| |
| | |
| This starts 2 processes per node. This way, you can use a larger number of nodes
| |
| with a smaller number of processes, or you can e.g. spawn threads from the processes.
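The value of X for a given `-npernode` setting follows from the number of distinct nodes in the job. The following is a sketch under the assumption that `$PBS_NODEFILE` lists every allocated node at least once. The helper name `np_for_npernode` is hypothetical, and the script only prints the resulting command instead of running it.

```shell
#!/bin/bash
# np_for_npernode FILE N: print (number of distinct nodes in FILE) * N.
np_for_npernode() {
    local nodes
    nodes=$(sort -u "$1" | grep -c .)
    echo $(( nodes * $2 ))
}

# Inside a job, show the mpirun call for 2 processes per node.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    NP=$(np_for_npernode "$PBS_NODEFILE" 2)
    echo "mpirun -np $NP -npernode 2 /path/to/app"
fi
```

For the 16-node allocation used in the examples, this would yield `-np 32`.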
| |
| | |
| If you want to pin your processes to a CPU (and enable NUMA memory affinity) use
| |
| | |
| | |
| mpirun -np X --mca mpi_paffinity_alone 1 /path/to/app
| |
| | |
| Warning: This will not behave as expected for hybrid multithreaded applications,
| |
| as the threads will be pinned to a single CPU as well! Use this only in case
| |
| of one process per core, no extra threads.
| |
| | |
| For pinning hybrid OpenMP/MPI applications, you can use the wrapper from the Intel MPI example;
| |
| do not use the mpi_paffinity_alone switch, but instead run
| |
| | |
| mpirun -np X -npernode 2 /path/to/wrapper.sh /path/to/app
| |
| | |
| | |
| | |
| | |
| | |
| | |
| == Intel MPI example ==
| |
| | |
| As the Nehalem system is a two-socket system with locally attached ccNUMA memory,
| |
| memory and process placement can be crucial.
| |
| | |
| Here is an example that shows a 16-node job, using 1 process per socket and 4 threads
| |
| per socket and optimum NUMA placement of processes and memory.
| |
| | |
| Prerequisite: use Intel MPI and the best available Intel compiler.
| |
| To set up the environment for this, use this .modulerc file in your home directory:
| |
| | |
| #%Module1.0#
| |
| set version 1.0
| |
| module load compiler/intel/11.0
| |
| module load mpi/impi/intel-11.0.074-impi-3.2.0.011
| |
| | |
| And compile your application using mpicc/mpicxx/mpif90 (GNU compiler) or mpiicc/mpiicpc/mpiifort (Intel compiler).
| |
| | |
| First step: Batch submit to get the nodes
| |
| | |
| qsub -l nodes=16:nehalem:ppn=8,walltime=6:00:00 -I # get the 16 nodes
| |
| | |
| Second step: make a hostlist
| |
| | |
| sort -u $PBS_NODEFILE > m
| |
| | |
| Third step: make a process ring to be used by MPI later
| |
| | |
| mpdboot -n 16 -f m -r ssh
| |
| | |
| Fourth step: start MPI application
| |
| | |
| mpiexec -perhost 2 -genv I_MPI_PIN 0 -np 32 ./wrapper.sh ./yourGloriousApp
| |
| | |
| With wrapper.sh looking like this
| |
| | |
| | |
| #!/bin/bash
| |
| export KMP_AFFINITY=verbose,scatter
| |
| export OMP_NUM_THREADS=4
| |
| RANK=${OMPI_COMM_WORLD_RANK:=$PMI_RANK}
| |
| if [ $(expr $RANK % 2) = 0 ]
| |
| then
| |
| export GOMP_CPU_AFFINITY=0-3
| |
| numactl --preferred=0 --cpunodebind=0 "$@"
| |
| else
| |
| export GOMP_CPU_AFFINITY=4-7
| |
| numactl --preferred=1 --cpunodebind=1 "$@"
| |
| fi
| |
| | |
| | |
| The result is an application running on 16 nodes, using 32 processes spawning
| |
| 128 threads. One set of 4 threads is pinned to one socket, the other set of 4 threads to the other socket.
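The arithmetic behind these numbers, and the even/odd rank test used by the wrapper, can be sketched as follows. The helper name `socket_for_rank` is hypothetical. The sketch assumes that with `-perhost 2` consecutive ranks are placed on the same node, so that each node receives one even and one odd rank.

```shell
#!/bin/bash
# socket_for_rank RANK: print the socket (0 or 1) that the wrapper's
# even/odd test would select for the given MPI rank.
socket_for_rank() {
    echo $(( $1 % 2 ))
}

# Values taken from the example: 16 nodes, 2 processes per node,
# 4 OpenMP threads per process.
NODES=16
PERHOST=2
THREADS=4
PROCS=$(( NODES * PERHOST ))
echo "processes: $PROCS, threads: $(( PROCS * THREADS ))"
```

Run as is, this prints `processes: 32, threads: 128`, matching the `-np 32` on the mpiexec line and `OMP_NUM_THREADS=4` in the wrapper.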
| |