Best Practices for IO, Parallel IO and MPI-IO

Best practices for I/O

Do not generate Output. Kidding aside, there are a few hints... TBD

File size restriction on lustre file systems

File/data segment size is currently limited to 2TB per OST. If you have files which are larger than 2TB please ensure that the striping distribute your data in a way that the per OST limit is not reached. For more details see http://wiki.lustre.org/index.php/FAQ_-_Sizing.

Best practices of using MPI I/O

The best way to use parallel MPI I/O is to

make as few MPI I/O calls in general in order to create/write/read
perform big data requests
have as few meta-data accesses (seeks, queries or changes of file-size)

MPI I/O best practice example

Note: The following example shows only the usage of the Info object but the best values for your application are very likely different!

The following code-fragment makes usage of the collective MPI_File_write_at_all call and MPI I/O info hints. If this is taken to the extreme, all MPI processes - which have some data to write - will participate in one collective write-request to one file.

File: mpi-io-fragment

/*
 * The following code fragment writes all elements from the buffers of all MPI
 * processes into one contigous file. For demonstration purpose we assume that
 * buffer elements are of type char.
 * In order to know, at which OFFSET each MPI process has to write into the
 * file, the number of elements written by the previous processors
 * is computed via an exclusive prefix reduction using MPI_Exscan.
 */
MPI_Exscan(&num_elements, &offset, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

MPI_Info info = MPI_INFO_NULL;
  
/* Set a few MPI_Info key-values in order to improve the write-speed */
MPI_Info_create (&info);
MPI_Info_set(info, "cb_align", "2");             /* Default: OMPI: none, CrayMPT: 2 */
MPI_Info_set(info, "cb_nodes_list", "*:*");      /* Default: OMPI: *:1, CrayMPT: *:* */
MPI_Info_set(info, "direct_io", "false");        /* Default: OMPI: none, CrayMPT: false */
MPI_Info_set(info, "romio_ds_read", "disable");  /* Default: OMPI: none, CrayMPT: disable */
MPI_Info_set(info, "romio_ds_write", "disable"); /* Default: OMPI: none, CrayMPT: disable */
/* Let's reduce the number of aggregators, should be roughly 2 to 4 times the stripe-factor */
MPI_Info_set(info, "cb_nodes", "8");             /* Default: OMPI: set automatically to the number of distinct nodes; However TOO High */

MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, filename, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
MPI_File_write_at_all (fh, offset, buffer, num_elements, MPI_CHAR, MPI_STATUS_IGNORE);
MPI_File_close (&fh);

The write offset for each MPI process is pre-determined using MPI_Excan and (if the file is large enough) will reduce the number of MPI I/O aggregators (then processes collecting data and writing to the OSTs). Please note

In this case, data is contiguous, all data written per process and the sum fit into 2GB (well for the MPI_INT on this platform).
The defaults for Cray MPI's ROMIO were good -- however the striping was too high.
Striping information is set, when a file is creating; mostly the default is fine, e.g. a stripe-factor of four.

However, sometimes this default needs to be changed using

Lustre-Tools from the Shell: touch /mnt/lustre/file ; lfs getstripe /mnt/lustre/file and lfs setstripe.
Consider using Ken Matney's Lustre Utility Library (LUT) to set the information from Your code (see lut_putl)...

Adapting HDF5's MPI I/O parameters to prevent locking on Lustre

The HDF5 library and it's use of MPI I/O exposes a problem on file systems that do not support locking, or are configured without (e.g. most Lustre installations). When creating files using H5Fcreate() on these file systems the MPI I/O layer (in most MPI-implementations ROMIO) causes the file system to hang.

To eliminate the problem: when opening the file one must attach an info parameter, which disables ROMIO's data-sieving ds_read and ds_write and enables ROMIO's collective-buffering. (thanks to Sebastian Lange for the suggestion!)

 hid_t file;
 hid_t plist_id;
 MPI_Info info;
 
 /* Create info to be attached to HDF5 file */
 MPI_Info_create(&info);
 
 /* Disables ROMIO's data-sieving */
 MPI_Info_set(info, "romio_ds_read", "disable");
 MPI_Info_set(info, "romio_ds_write", "disable");
 
 /* Enable ROMIO's collective buffering */
 MPI_Info_set(info, "romio_cb_read", "enable");
 MPI_Info_set(info, "romio_cb_write", "enable");
 
 /* Attach above info as access properties */
 plist_id = H5Pcreate(H5P_FILE_ACCESS);
 /* Make an MPI_Dup() of MPI_COMM_WORLD and attach info, this causes HDF5 to use collective calls */
 H5Pset_fapl_mpio(plist_id, MPI_COMM_WORLD, info);
 file = H5Fcreate(filename, H5F_ACC_TRUNC, H5P_DEFAULT, plist_id);
 
 ...
 
 H5Pclose (plist_id);

Optimal striping on `ws9` & choice of multiplier

ws9 uses the Lustre file system (LFS) which saves data in /stripes/ on Object Storage Targets (OSTs). The number of OSTs used for this is called /stripe count/ and is one of the main factors for getting your I/O to scale, as more disks to write to in parallel are available with higher stripe counts. Analysis of I/O scaling results have shown that the stripe count should be chosen as SC = 5/node + 3 for up to 8 nodes, after which I/O performance levels off for ws9. In order to scale beyond 8 nodes, the locking mode needs to be changed to use Lustre Lock-Ahead (LLA), which allows multiple clients (cores) to write to the same OST without contention (optimally). LLA is activated by setting the MPI-I/O hint cray_cb_write_lock_mode=2 and the number of clients per OST x is set via cray_cb_nodes_multiplier=x. I/O scaling showed that higher write rates are achieved by first using all available OSTs and then activating LLA. The choice of multiplier x however is not so straightforward, as the optimal multiplier is dependent on the data written per worker as well as the number of employed cores. For large data sizes per core (>=4MB) and above 1536 cores, multipliers of 16-32 showed the highest performance. Smaller data sizes generally require smaller multipliers (2-4) to show an increase of performance beyond simply using all OSTs.

Another factor of the striping employed by LFS is the /stripe size/, i.e. how much data is stored in each stripe. This correlates with the number of necessary writes and optimally all writes are of stripe size. Choosing the stripe size to be a small divisor (1-16) of the data per process typically yields above 90% stripe sized writes which correlates to good I/O performance.

For the more adventurous and technically inclined, a lot of information on I/O is given if the environment variables MPICH_MPIIO_STATS and MPICH_MPIIO_TIMERS are set to 1 while running your program. If everything is set nicely, the statistics part may look like this:

  MPIIO write access patterns for goodfile
  independent writes      = 0 // 0 if there is no header
  collective writes       = 3840
  independent writers     = 0
  aggregators             = 54 // should be stripe count * multiplier
  stripe count            = 54
  stripe size             = 4063232
  system writes           = 3890
  stripe sized writes     = 3851 // optimally close to above
  aggregators active      = 0,0,0,3840 (1, <= 27, > 27, 54)  // all writes should be in the rightmost bin
  total bytes for writes  = 15728640000 = 15000 MiB = 14 GiB
  ave system write size   = 4043352
  read-modify-write count = 0 // from here
  read-modify-write bytes = 0
  number of write gaps    = 0 // to here should be 0
  ave write gap size      = NA // for headerless files

MPI-IO

Contents

Best Practices for IO, Parallel IO and MPI-IO