- Infos im HLRS Wiki sind nicht rechtsverbindlich und ohne Gewähr -

MPI-IO

From HLRS Platforms

Best Practices for IO, Parallel IO and MPI-IO

Best practices for I/O

Do not generate Output. Kidding aside, there are a few hints... TBD


File size restriction on lustre file systems

File/data segment size is currently limited to 2TB per OST. If you have files which are larger than 2TB please ensure that the striping distribute your data in a way that the per OST limit is not reached. For more details see http://wiki.lustre.org/index.php/FAQ_-_Sizing.

Best practices of using MPI I/O

The best way to use parallel MPI I/O is to

  • make as few file I/O calls in general in order to create
  • big data requests and have
  • as few meta-data accesses (seeks, query or changing of file-size).
Note: The following example shows only the usage of the Info object but the best values for your application are very likely different!


If this is taken to the extreme, all processes having to write data will participate in one collective write-request to one file. The following code-fragment used on Cray Jaguar on Lustre for a performance tracing library makes usage of the collective write call and MPI I/O info hints:

   /*
    * In order to know, at which OFFSET we are writing, let's figure out the previous processor's lengths
    * We need two more slots for comm_rank and for mpistat_unexpected_queue_avg_time_num.
    */
   MPI_Scan (&buffer_pos, &position, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
   /* Scan is inclusive, reduce by our input */
   position -= buffer_pos;
   MPI_Allreduce (&buffer_pos, &file_length, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
   
   /* Set a few MPI_Info key-values, in order to improve the write-speed */
   info = MPI_INFO_NULL;
   if (file_length > 4*1024*1024 || 256 < mpistat_comm_size) {
       MPI_Info_create (&info);
       MPI_Info_set (info, "cb_align", "2");             /* Default: OMPI: none, CrayMPT: 2 */
       MPI_Info_set (info, "cb_nodes_list", "*:*");      /* Default: OMPI: *:1, CrayMPT: *:* */
       MPI_Info_set (info, "direct_io", "false");        /* Default: OMPI: none, CrayMPT: false */
       MPI_Info_set (info, "romio_ds_read", "disable");  /* Default: OMPI: none, CrayMPT: disable */
       MPI_Info_set (info, "romio_ds_write", "disable"); /* Default: OMPI: none, CrayMPT: disable */
       /* Let's reduce the number of aggregators, should be roughly 2 to 4 times the stripe-factor */
       MPI_Info_set (info, "cb_nodes", "8");             /* Default: OMPI: set automatically to the number of distinct nodes; However TOO High */
   }
   
   MPI_File_open (MPI_COMM_WORLD, fn, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
   MPI_File_write_at_all (fh, position, buffer, buffer_pos, MPI_CHAR, MPI_STATUS_IGNORE);
   MPI_File_close (&fh);

The length in Bytes per process is pre-determined MPI_Scan and (if the file is large enough) will reduce the number of MPI I/O aggregators (then processes collecting data and writing to the OSTs). Please note

  1. In this case, data is contiguous, all data written per process and the sum fit into 2GB (well for the MPI_INT on this platform).
  2. The defaults for Cray MPI's ROMIO were good -- however the striping was too high.
  3. Striping information is set, when a file is creating; mostly the default is fine, e.g. a stripe-factor of four.

However, sometimes this default needs to be changed using

  1. Lustre-Tools from the Shell: touch /mnt/lustre/file ; lfs getstripe /mnt/lustre/file and lfs setstripe.
  2. Consider using Ken Matney's Lustre Utility Library (LUT) to set the information from Your code (see lut_putl)...

Adapting HDF5's MPI I/O parameters to prevent locking on Lustre

The HDF5 library and it's use of MPI I/O exposes a problem on file systems that do not support locking, or are configured without (e.g. most Lustre installations). When creating files using H5Fcreate() on these file systems the MPI I/O layer (in most MPI-implementations ROMIO) causes the file system to hang.

To eliminate the problem: when opening the file one must attach an info parameter, which disables ROMIO's data-sieving ds_read and ds_write and enables ROMIO's collective-buffering. (thanks to Sebastian Lange for the suggestion!)

 hid_t file;
 hid_t plist_id;
 MPI_Info info;
 
 /* Create info to be attached to HDF5 file */
 MPI_Info_create(&info);
 
 /* Disables ROMIO's data-sieving */
 MPI_Info_set(info, "romio_ds_read", "disable");
 MPI_Info_set(info, "romio_ds_write", "disable");
 
 /* Enable ROMIO's collective buffering */
 MPI_Info_set(info, "romio_cb_read", "enable");
 MPI_Info_set(info, "romio_cb_write", "enable");
 
 /* Attach above info as access properties */
 plist_id = H5Pcreate(H5P_FILE_ACCESS);
 /* Make an MPI_Dup() of MPI_COMM_WORLD and attach info, this causes HDF5 to use collective calls */
 H5Pset_fapl_mpio(plist_id, MPI_COMM_WORLD, info);
 file = H5Fcreate(filename, H5F_ACC_TRUNC, H5P_DEFAULT, plist_id);
 
 ...
 
 H5Pclose (plist_id);