|
|
Line 3: |
Line 3: |
| see [[Open MPI]] | | see [[Open MPI]] |
|
| |
|
| === Intel MPI example ===
| | see [[Intel MPI]] |
| | |
| ==== simple example ====
| |
| | |
| Load the necessary modules
| |
| {{Command
| |
| | command = module load mpi/impi
| |
| }}
| |
| | |
| Run your application with
| |
| {{Command
| |
| | command = mpirun -r ssh -np 8 your_app
| |
| }}
| |
| | |
| ==== more complex example ====
| |
| | |
| As each Nehalem node is a two-socket system with locally attached ccNUMA memory,
| |
| memory and process placement can be crucial.
| |
| | |
| The following example runs a 16-node job with 1 process per socket, 4 threads
| |
| per process, and NUMA-aware placement of processes and memory.
| |
| | |
| Prerequisite: use Intel MPI together with the Intel compiler.
| |
| To set up the environment for this, use the following .modulerc file in your home directory:
| |
| | |
| {{File
| |
| | filename = .modulerc
| |
| | content = <pre>
| |
| #%Module1.0#
| |
| set version 1.0
| |
| module load compiler/intel/11.0
| |
| module load mpi/impi/intel-11.0.074-impi-3.2.0.011
| |
| </pre>
| |
| }}
| |
| | |
| Then compile your application using mpicc/mpicxx/mpif90 (GNU compilers) or mpiicc/mpiicpc/mpiifort (Intel compilers).
| |
| | |
| First step: submit an interactive batch job to get the nodes
| |
| | |
| {{Command
| |
| | command = qsub -l nodes=16:nehalem:ppn=8,walltime=6:00:00 -I # get the 16 nodes
| |
| }}
| |
| | |
| Second step: create a host list with one entry per node
| |
| {{Command
| |
| | command = sort -u $PBS_NODEFILE > m
| |
| }}
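The `sort -u` step matters because, with `ppn=8`, Torque/PBS normally writes each node into `$PBS_NODEFILE` once per requested slot, while the host list should contain each node only once. The following is a minimal sketch of that effect, not output from the real system: the host names `n001` and `n002` and the temporary files standing in for `$PBS_NODEFILE` and `m` are assumptions made purely for illustration.

```shell
#!/bin/bash
# Hypothetical stand-in for $PBS_NODEFILE: 2 nodes, 8 slots each.
nodefile=$(mktemp)
for host in n001 n002; do
    for slot in 1 2 3 4 5 6 7 8; do
        echo "$host"
    done
done > "$nodefile"

# Same command as in the step above, writing to a temporary file instead of ./m
hostlist=$(mktemp)
sort -u "$nodefile" > "$hostlist"

# Arithmetic expansion strips the padding some wc implementations add.
raw_entries=$(wc -l < "$nodefile");  raw_entries=$((raw_entries))
unique_hosts=$(wc -l < "$hostlist"); unique_hosts=$((unique_hosts))

echo "raw entries: $raw_entries, unique hosts: $unique_hosts"

rm -f "$nodefile" "$hostlist"
```

On the real 16-node job the same reasoning gives 128 raw entries and 16 unique hosts, which is the value passed to `mpdboot -n`.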
| |
| | |
| Third step: start the MPD process ring that MPI will use later
| |
| {{Command
| |
| | command = mpdboot -n 16 -f m -r ssh
| |
| }}
| |
| | |
| Fourth step: start the MPI application
| |
| {{Command
| |
| | command = mpiexec -perhost 2 -genv I_MPI_PIN 0 -np 32 ./wrapper.sh ./yourGloriousApp
| |
| }}
| |
| | |
| where wrapper.sh looks like this:
| |
| | |
| {{File
| |
| | filename = wrapper.sh
| |
| | content =<pre>
| |
| #!/bin/bash
| |
| export KMP_AFFINITY=verbose,scatter
| |
| export OMP_NUM_THREADS=4
| |
| # Use the Open MPI rank if set, otherwise the PMI rank provided by Intel MPI
| RANK=${OMPI_COMM_WORLD_RANK:-$PMI_RANK}
| |
| if [ $((RANK % 2)) -eq 0 ]
| |
| then
| |
| export GOMP_CPU_AFFINITY=0-3
| |
| numactl --preferred=0 --cpunodebind=0 "$@"
| |
| else
| |
| export GOMP_CPU_AFFINITY=4-7
| |
| numactl --preferred=1 --cpunodebind=1 "$@"
| |
| fi
| |
| </pre>
| |
| }}
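The even/odd test in wrapper.sh relies on `-perhost 2` placing two consecutive ranks on each node, so that every node gets one even and one odd rank. The sketch below only prints the placement the wrapper would choose and does not call `numactl`. The function name `placement_for_rank` is hypothetical, and the core ranges are taken from the `GOMP_CPU_AFFINITY` settings above, assuming cores 0-3 sit on socket 0 and cores 4-7 on socket 1.

```shell
#!/bin/bash
# Print the placement wrapper.sh would choose for a given MPI rank.
# Assumption: cores 0-3 belong to socket 0 and cores 4-7 to socket 1.
placement_for_rank() {
    local rank=$1
    if [ $((rank % 2)) -eq 0 ]; then
        echo "socket=0 cores=0-3"
    else
        echo "socket=1 cores=4-7"
    fi
}

# With -perhost 2, ranks 0 and 1 share the first node, ranks 2 and 3 the second.
for rank in 0 1 2 3; do
    echo "rank $rank -> $(placement_for_rank "$rank")"
done
```

If the core numbering on a node differs from this assumption (for example, cores interleaved across sockets), the `GOMP_CPU_AFFINITY` ranges would need to be adjusted; `numactl --hardware` shows the actual layout.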
| |
| | |
| The result is an application running on 16 nodes with 32 processes and a total of
| |
| 128 threads. On each node, one set of 4 threads is pinned to one socket and the other set of 4 threads to the other socket.
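The process and thread counts follow directly from the job parameters. As a quick sanity check, the arithmetic can be sketched as below; the variable names are illustrative only, and the values are the ones used in the `qsub`, `mpiexec` and wrapper.sh steps above.

```shell
#!/bin/bash
nodes=16              # qsub -l nodes=16
procs_per_node=2      # mpiexec -perhost 2 (one process per socket)
threads_per_proc=4    # OMP_NUM_THREADS=4 in wrapper.sh

total_procs=$((nodes * procs_per_node))              # value passed to -np
total_threads=$((total_procs * threads_per_proc))
cores_used_per_node=$((procs_per_node * threads_per_proc))

echo "processes: $total_procs"
echo "threads:   $total_threads"
echo "cores used per node: $cores_used_per_node (ppn=8 requested)"
```

When changing the node count, `-np` and the `-n` argument of `mpdboot` need to be changed consistently with it.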
| |
|
| |
|
| === MVAPICH2 example === | | === MVAPICH2 example === |
|
| |
|
| see [[MVAPICH2]] | | see [[MVAPICH2]] |