Revision as of 10:09, 11 February 2020

Best Practices for IO, Parallel IO and MPI-IO

Best practices for I/O

Do not generate Output. Kidding aside, there are a few hints... TBD

File size restriction on lustre file systems

File/data segment size is currently limited to 2TB per OST. If you have files which are larger than 2TB please ensure that the striping distribute your data in a way that the per OST limit is not reached. For more details see http://wiki.lustre.org/index.php/FAQ_-_Sizing.

Best practices of using MPI I/O

The best way to use parallel MPI I/O is to

make as few file I/O calls in general in order to create
big data requests and have
as few meta-data accesses (seeks, query or changing of file-size).

Note: The following example shows only the usage of the Info object but the best values for your application are very likely different!

If this is taken to the extreme, all processes having to write data will participate in one collective write-request to one file. The following code-fragment used on Cray Jaguar on Lustre for a performance tracing library makes usage of the collective write call and MPI I/O info hints:

   /*
    * In order to know, at which OFFSET we are writing, let's figure out the previous processor's lengths
    * We need two more slots for comm_rank and for mpistat_unexpected_queue_avg_time_num.
    */
   MPI_Scan (&buffer_pos, &position, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
   /* Scan is inclusive, reduce by our input */
   position -= buffer_pos;
   MPI_Allreduce (&buffer_pos, &file_length, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
   
   /* Set a few MPI_Info key-values, in order to improve the write-speed */
   info = MPI_INFO_NULL;
   if (file_length > 4*1024*1024 || 256 < mpistat_comm_size) {
       MPI_Info_create (&info);
       MPI_Info_set (info, "cb_align", "2");             /* Default: OMPI: none, CrayMPT: 2 */
       MPI_Info_set (info, "cb_nodes_list", "*:*");      /* Default: OMPI: *:1, CrayMPT: *:* */
       MPI_Info_set (info, "direct_io", "false");        /* Default: OMPI: none, CrayMPT: false */
       MPI_Info_set (info, "romio_ds_read", "disable");  /* Default: OMPI: none, CrayMPT: disable */
       MPI_Info_set (info, "romio_ds_write", "disable"); /* Default: OMPI: none, CrayMPT: disable */
       /* Let's reduce the number of aggregators, should be roughly 2 to 4 times the stripe-factor */
       MPI_Info_set (info, "cb_nodes", "8");             /* Default: OMPI: set automatically to the number of distinct nodes; However TOO High */
   }
   
   MPI_File_open (MPI_COMM_WORLD, fn, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
   MPI_File_write_at_all (fh, position, buffer, buffer_pos, MPI_CHAR, MPI_STATUS_IGNORE);
   MPI_File_close (&fh);

The length in Bytes per process is pre-determined MPI_Scan and (if the file is large enough) will reduce the number of MPI I/O aggregators (then processes collecting data and writing to the OSTs). Please note

In this case, data is contiguous, all data written per process and the sum fit into 2GB (well for the MPI_INT on this platform).
The defaults for Cray MPI's ROMIO were good -- however the striping was too high.
Striping information is set, when a file is creating; mostly the default is fine, e.g. a stripe-factor of four.

However, sometimes this default needs to be changed using

Lustre-Tools from the Shell: touch /mnt/lustre/file ; lfs getstripe /mnt/lustre/file and lfs setstripe.
Consider using Ken Matney's Lustre Utility Library (LUT) to set the information from Your code (see lut_putl)...

Adapting HDF5's MPI I/O parameters to prevent locking on Lustre

The HDF5 library and it's use of MPI I/O exposes a problem on file systems that do not support locking, or are configured without (e.g. most Lustre installations). When creating files using H5Fcreate() on these file systems the MPI I/O layer (in most MPI-implementations ROMIO) causes the file system to hang.

To eliminate the problem: when opening the file one must attach an info parameter, which disables ROMIO's data-sieving ds_read and ds_write and enables ROMIO's collective-buffering. (thanks to Sebastian Lange for the suggestion!)

 hid_t file;
 hid_t plist_id;
 MPI_Info info;
 
 /* Create info to be attached to HDF5 file */
 MPI_Info_create(&info);
 
 /* Disables ROMIO's data-sieving */
 MPI_Info_set(info, "romio_ds_read", "disable");
 MPI_Info_set(info, "romio_ds_write", "disable");
 
 /* Enable ROMIO's collective buffering */
 MPI_Info_set(info, "romio_cb_read", "enable");
 MPI_Info_set(info, "romio_cb_write", "enable");
 
 /* Attach above info as access properties */
 plist_id = H5Pcreate(H5P_FILE_ACCESS);
 /* Make an MPI_Dup() of MPI_COMM_WORLD and attach info, this causes HDF5 to use collective calls */
 H5Pset_fapl_mpio(plist_id, MPI_COMM_WORLD, info);
 file = H5Fcreate(filename, H5F_ACC_TRUNC, H5P_DEFAULT, plist_id);
 
 ...
 
 H5Pclose (plist_id);

Optimal striping on `ws9` & choice of multiplier

ws9 uses the Lustre file system (LFS) which saves data in /stripes/ on Object Storage Targets (OSTs). The number of OSTs used for this is called /stripe count/ and is one of the main factors for getting your I/O to scale, as more disks to write to in parallel are available with higher stripe counts. Analysis of I/O scaling results have shown that the stripe count should be chosen as SC = 5/node + 3 for up 8 nodes, after which I/O performance levels off for ws9. In order to scale beyond 8 nodes, the locking mode needs to be changed to use Lustre Lock-Ahead (LLA), which allows multiple clients (cores) to write to the same OST without contention (optimally). LLA is activated by setting the MPI-I/O hint cray_cb_write_lock_mode=2 and the number of clients per OST xis set via cray_cb_nodes_multiplier=x. I/O scaling showed that higher write rates are achieved by first using all available OSTs and then activating LLA. The choice of multiplier x however is not so straightforward, as the optimal multiplier is dependent on the data written per worker as well as the number of employed cores. For large data sizes per core (>=4MB) and above 1536 cores, multipliers of 16-32 showed the highest performance. Smaller data sizes generally require smaller multipliers (2-4) to show an increase of performance beyond simply using all OSTs.

Another factor of the striping employed by LFS is the /stripe size/, i.e. how much data is stored in each stripe. This correlates with the number of necessary writes and optimally all writes are of stripe size. Choosing the stripe size to be a small divisor (1-16) of the data per process typically yields above 90% stripe sized writes which correlates to good I/O performance.

For the more adventurous and technically inclined, a lot of information on I/O is given if the environment variables MPICH_MPIIO_STATS and MPICH_MPIIO_TIMERS are set to 1 while running your program. If everything is set nicely, the statistics part may look like this:

  MPIIO write access patterns for goodfile
  independent writes      = 0 // 0 if there is no header
  collective writes       = 3840
  independent writers     = 0
  aggregators             = 54 // should be stripe count * multiplier
  stripe count            = 54
  stripe size             = 4063232
  system writes           = 3890
  stripe sized writes     = 3851 // optimally close to above
  aggregators active      = 0,0,0,3840 (1, <= 27, > 27, 54)  // all writes should be in the rightmost bin
  total bytes for writes  = 15728640000 = 15000 MiB = 14 GiB
  ave system write size   = 4043352
  read-modify-write count = 0 // from here
  read-modify-write bytes = 0
  number of write gaps    = 0 // to here should be 0
  ave write gap size      = NA // for headerless files

MPI-IO: Difference between revisions

Revision as of 10:09, 11 February 2020

Contents

Best Practices for IO, Parallel IO and MPI-IO