PACX-MPI on clusters

There are several ways to run PACX-MPI on local clusters:

  • Locally, as N machines within the same cluster,
  • Distributed across several machines.

Running N machines on the local machine

Running locally requires distinguishing the machines from one another: each needs its own .hostfile, which gives that particular machine a distinct host number within the metacomputer.

 cat HOST1/.hostfile
 localhost      4
 127.0.0.1      4
 cat HOST2/.hostfile
 127.0.0.1      4
 localhost      4

This results in two machines in the metacomputer, both running on this cluster, each with 4 user processes (plus the two daemons).
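
The two directories shown above could be prepared with a few shell commands. This is only a sketch: the directory names HOST1 and HOST2 and the file contents are taken from the listing above, while the idea that each simulated machine is later started from its own directory is an assumption, not something stated here.

```shell
# Sketch: one working directory per simulated machine, each with its own
# .hostfile. Contents are copied from the listing above.
mkdir -p HOST1 HOST2

# First machine: localhost is listed first.
printf '%s\n' 'localhost      4' '127.0.0.1      4' > HOST1/.hostfile

# Second machine: the same two entries in reverse order.
printf '%s\n' '127.0.0.1      4' 'localhost      4' > HOST2/.hostfile
```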

Distributed metacomputer using clusters with local IP addresses

To couple several clusters that use local IP addresses (10.x.y.z, 172.16.x.y, 192.168.x.y), make sure that processes 0 and 1 run on a node with a global IP address that is routable to the outside world.

There are several ways to tell the MPI implementation that special nodes need to be added or that processes must be bound to specific nodes. The easiest is the so-called machinefile.

To create the machinefile for MPICH or Open MPI, write the internal name of the head node twice, followed by the standard list of nodes (with PBS, the contents of $PBS_NODEFILE), to a file called machinefile. For cacau.hww.de, one could use:

 echo "cacau1" >  machinefile
 echo "cacau1" >> machinefile
 cat $PBS_NODEFILE >> machinefile
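
A slightly more defensive variant of the commands above, as a sketch: it wraps the three steps in a function and refuses to write anything if the node list cannot be read. The function name make_machinefile is introduced here for illustration only; cacau1 is the head-node name from the example above.

```shell
# make_machinefile HEADNODE NODEFILE OUTFILE
# Writes HEADNODE twice (so that processes 0 and 1 land on the head node),
# followed by the batch system's node list.
make_machinefile() {
    headnode=$1
    nodefile=$2
    outfile=$3
    if [ ! -r "$nodefile" ]; then
        echo "node list '$nodefile' is not readable" >&2
        return 1
    fi
    {
        echo "$headnode"
        echo "$headnode"
        cat "$nodefile"
    } > "$outfile"
}

# Inside a PBS job, $PBS_NODEFILE points at the list of allocated nodes.
if [ -n "${PBS_NODEFILE:-}" ]; then
    make_machinefile cacau1 "$PBS_NODEFILE" machinefile
fi
```

Outside a PBS job the final block does nothing, so the function can also be tried by hand with any file that lists one node name per line.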

The standard way to start (PACX-)MPI applications with MPICH is described on the standard web pages. For Open MPI, one has to bypass the PBS-specific startup plugin and select the rsh-based startup instead. Because of our installation, the name of the SSH program must also be overridden with the original name, ssh.orig. In addition, set the name under which the server is known externally (here cacau) and, since the local domain name is not set, set that as well. Owing to further configuration quirks of cacau, the TCP interfaces must also be restricted to eth0 only. The actual commands would be:

 export PACX_HOST_ALIAS=cacau
 export LOCALDOMAIN=hww.de
 mpirun -x PACX_HOST_ALIAS -x LOCALDOMAIN \
  --mca btl_tcp_if_include eth0 \
  --prefix ~/WORK/OPENMPI/openmpi-1.2.3/COMPILE/usr/ \
  --mca pls rsh --mca pls_rsh_assume_same_shell 0 --mca pls_rsh_agent ssh.orig \
  -machinefile machinefile \
  -np 6 ../mpi_stub
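
The value 6 given to -np above appears to be the 4 user processes plus the two daemons mentioned earlier. Assuming that reading is right, the process count for a different number of user processes could be computed as follows; the variable names are hypothetical.

```shell
# Assumption: mpirun has to start the user processes plus the two daemons.
user_procs=4
daemons=2
np=$((user_procs + daemons))

# Print the option as it would be passed to mpirun.
echo "-np $np"
```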


To make life a lot easier, use the startupserver method to distribute the information about the metacomputer: one globally addressable (public) machine runs the startupserver, with the number of computers in the metacomputer as its argument.

Each .hostfile now names the server and gives the number of this machine within the metacomputer (counting from zero). With startupserver running on global.hlrs.de, the .hostfile of the first host would contain:

 Server global.hlrs.de 0

The second .hostfile would contain:

 Server global.hlrs.de 1
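
Since these files differ only in the trailing number, they could be generated in a loop. A sketch, assuming one directory per machine named machine0, machine1, and so on (a naming scheme chosen here purely for illustration) and the server name global.hlrs.de from the example above:

```shell
# write_server_hostfiles SERVER COUNT
# Creates machine<i>/.hostfile for i = 0 .. COUNT-1. Each file names the
# startupserver and the number of that machine within the metacomputer.
write_server_hostfiles() {
    server=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        mkdir -p "machine$i"
        echo "Server $server $i" > "machine$i/.hostfile"
        i=$((i + 1))
    done
}

# Two machines, as in the example above.
write_server_hostfiles global.hlrs.de 2
```

Each generated file would then still have to be copied to the machine it belongs to.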