- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

MPI-IO


Best practices for using MPI I/O

The best way to use parallel MPI I/O is to

  • make as few file I/O calls as possible, in order to
  • create big data requests, and to have
  • as few meta-data accesses as possible (seeks, querying or changing of the file size).

If this is taken to the extreme, all processes that have data to write participate in one collective write request to a single file. The following code fragment, which was used on the Cray Jaguar system with Lustre for a performance-tracing library, makes use of the collective write call and of MPI I/O info hints:

   /*
    * In order to know at which OFFSET we are writing, sum up the lengths of all previous processes.
    * We need two more slots for comm_rank and for mpistat_unexpected_queue_avg_time_num.
    */
   MPI_Scan (&buffer_pos, &position, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
   /* MPI_Scan is inclusive; subtract our own contribution to obtain the exclusive prefix sum */
   position -= buffer_pos;
   MPI_Allreduce (&buffer_pos, &file_length, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
   
   /* Set a few MPI_Info key-values, in order to improve the write-speed */
   info = MPI_INFO_NULL;
   if (file_length > 4*1024*1024 || 256 < mpistat_comm_size) {
       MPI_Info_create (&info);
       MPI_Info_set (info, "cb_align", "2");             /* Default: OMPI: none, CrayMPT: 2 */
       MPI_Info_set (info, "cb_nodes_list", "*:*");      /* Default: OMPI: *:1, CrayMPT: *:* */
       MPI_Info_set (info, "direct_io", "false");        /* Default: OMPI: none, CrayMPT: false */
       MPI_Info_set (info, "romio_ds_read", "disable");  /* Default: OMPI: none, CrayMPT: disable */
       MPI_Info_set (info, "romio_ds_write", "disable"); /* Default: OMPI: none, CrayMPT: disable */
       /* Reduce the number of aggregators; it should be roughly 2 to 4 times the stripe factor */
       MPI_Info_set (info, "cb_nodes", "8");             /* Default: OMPI: set automatically to the number of distinct nodes; however, that is too high here */
   }
   
   MPI_File_open (MPI_COMM_WORLD, fn, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
   if (info != MPI_INFO_NULL)
       MPI_Info_free (&info);   /* the hints are copied by MPI_File_open, so the info object can be freed here */
   MPI_File_write_at_all (fh, position, buffer, buffer_pos, MPI_CHAR, MPI_STATUS_IGNORE);
   MPI_File_close (&fh);

The write offset of each process is pre-determined with MPI_Scan over the per-process lengths in bytes, and (if the file is large enough) the info hints reduce the number of MPI I/O aggregators (the processes that collect the data and write it to the OSTs). Please note:

  1. In this case the data is contiguous, and both the amount of data written per process and the total file size fit into 2 GB (the range of MPI_INT on this platform); a 64-bit variant is sketched after this list.
  2. The defaults for Cray MPI's ROMIO were good; however, the striping was too high.
  3. Striping information is set when a file is created; mostly the default is fine, e.g. a stripe factor of four.
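
The 2 GB limit in point 1 comes from the use of MPI_INT for the prefix sum; the offset argument of MPI_File_write_at_all itself is an MPI_Offset and therefore 64 bit. A minimal sketch of a 64-bit variant, reusing buffer, buffer_pos and fh from the fragment above (the other variable names are illustrative only), uses MPI_Exscan, which directly yields the exclusive prefix sum so the subtraction is not needed:

   long long my_bytes  = (long long)buffer_pos;  /* bytes this rank will write */
   long long my_offset = 0;
   long long total_bytes;
   int       rank;

   MPI_Comm_rank (MPI_COMM_WORLD, &rank);

   /* Exclusive prefix sum over 64-bit lengths; the result is undefined on rank 0 */
   MPI_Exscan (&my_bytes, &my_offset, 1, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);
   if (rank == 0)
       my_offset = 0;

   /* Total file size, e.g. to decide on the info hints as above */
   MPI_Allreduce (&my_bytes, &total_bytes, 1, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);

   /* The offset argument of MPI_File_write_at_all is an MPI_Offset, i.e. 64 bit */
   MPI_File_write_at_all (fh, (MPI_Offset)my_offset, buffer, buffer_pos,
                          MPI_CHAR, MPI_STATUS_IGNORE);

Note that the count argument of MPI_File_write_at_all is still an int, so the amount written per process must remain below 2 GB; only the offsets and the total file size may grow beyond that.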

However, sometimes this striping default needs to be changed, using one of the following (an MPI info-hint alternative is sketched after the list):

  1. Lustre tools from the shell: touch /mnt/lustre/file ; lfs getstripe /mnt/lustre/file and lfs setstripe.
  2. Consider using Ken Matney's Lustre Utility Library (LUT) to set the information from your code (see lut_putl).
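
With ROMIO-based MPI I/O implementations (as in Cray MPT and Open MPI), the striping can often also be requested directly from the code via the reserved info hints striping_factor and striping_unit. Whether these hints are honoured depends on the implementation and the file system, so the following is only a sketch; the path and the values are examples. The hints are evaluated only when the file is created:

   MPI_Info info;
   MPI_File fh;

   MPI_Info_create (&info);
   MPI_Info_set (info, "striping_factor", "4");       /* stripe count, i.e. number of OSTs */
   MPI_Info_set (info, "striping_unit", "1048576");   /* stripe size in bytes, here 1 MiB  */

   /* Striping hints only take effect at file-creation time */
   MPI_File_open (MPI_COMM_WORLD, "/mnt/lustre/file",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
   MPI_Info_free (&info);
   /* ... collective writes as in the fragment above ... */
   MPI_File_close (&fh);

MPI_File_get_info can be used on the open file to check which hints were actually accepted.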