- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

MPI-IO


Best practices for using MPI I/O

The best way to use parallel MPI I/O is to

  • make as few file I/O calls as possible, in order to
  • create big data requests, and to have
  • as few meta-data accesses as possible (seeks, querying or changing of the file size).

If this is taken to the extreme, all processes that have data to write participate in one collective write request to a single file. The following code fragment, which was used on the Cray Jaguar system with Lustre for a performance-tracing library, makes use of the collective write call and of MPI I/O info hints:

   /*
    * In order to know at which OFFSET we are writing, sum up the lengths of all previous processes.
    * We need two more slots for comm_rank and for mpistat_unexpected_queue_avg_time_num.
    */
   MPI_Scan (&buffer_pos, &position, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
   /* MPI_Scan is inclusive; subtract our own contribution to obtain the exclusive prefix sum */
   position -= buffer_pos;
   MPI_Allreduce (&buffer_pos, &file_length, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
   
   /* Set a few MPI_Info key-values, in order to improve the write-speed */
   info = MPI_INFO_NULL;
   if (file_length > 4*1024*1024 || 256 < mpistat_comm_size) {
       MPI_Info_create (&info);
       MPI_Info_set (info, "cb_align", "2");             /* Default: OMPI: none, CrayMPT: 2 */
       MPI_Info_set (info, "cb_nodes_list", "*:*");      /* Default: OMPI: *:1, CrayMPT: *:* */
       MPI_Info_set (info, "direct_io", "false");        /* Default: OMPI: none, CrayMPT: false */
       MPI_Info_set (info, "romio_ds_read", "disable");  /* Default: OMPI: none, CrayMPT: disable */
       MPI_Info_set (info, "romio_ds_write", "disable"); /* Default: OMPI: none, CrayMPT: disable */
       /* Reduce the number of aggregators; it should be roughly 2 to 4 times the stripe factor */
       MPI_Info_set (info, "cb_nodes", "8");             /* Default: OMPI: set automatically to the number of distinct nodes; however, that is too high here */
   }
   
   MPI_File_open (MPI_COMM_WORLD, fn, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
   if (info != MPI_INFO_NULL)
       MPI_Info_free (&info);   /* the hints are copied by MPI_File_open, so the info object can be freed here */
   MPI_File_write_at_all (fh, position, buffer, buffer_pos, MPI_CHAR, MPI_STATUS_IGNORE);
   MPI_File_close (&fh);

The write offset of each process is pre-determined with MPI_Scan over the per-process lengths in bytes, and (if the file is large enough) the info hints reduce the number of MPI I/O aggregators (the processes that collect the data and write it to the OSTs). Please note:

  1. In this case the data is contiguous, and both the amount of data written per process and the total file size fit into 2 GB (the range of MPI_INT on this platform); a 64-bit variant is sketched after this list.
  2. The defaults for Cray MPI's ROMIO were good; however, the striping was too high.
  3. Striping information is set when a file is created; mostly the default is fine, e.g. a stripe factor of four.
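
The 2 GB limit in point 1 comes from the use of MPI_INT for the prefix sum; the offset argument of MPI_File_write_at_all itself is an MPI_Offset and therefore 64 bit. A minimal sketch of a 64-bit variant, reusing buffer, buffer_pos and fh from the fragment above (the other variable names are illustrative only), uses MPI_Exscan, which directly yields the exclusive prefix sum so the subtraction is not needed:

   long long my_bytes  = (long long)buffer_pos;  /* bytes this rank will write */
   long long my_offset = 0;
   long long total_bytes;
   int       rank;

   MPI_Comm_rank (MPI_COMM_WORLD, &rank);

   /* Exclusive prefix sum over 64-bit lengths; the result is undefined on rank 0 */
   MPI_Exscan (&my_bytes, &my_offset, 1, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);
   if (rank == 0)
       my_offset = 0;

   /* Total file size, e.g. to decide on the info hints as above */
   MPI_Allreduce (&my_bytes, &total_bytes, 1, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);

   /* The offset argument of MPI_File_write_at_all is an MPI_Offset, i.e. 64 bit */
   MPI_File_write_at_all (fh, (MPI_Offset)my_offset, buffer, buffer_pos,
                          MPI_CHAR, MPI_STATUS_IGNORE);

Note that the count argument of MPI_File_write_at_all is still an int, so the amount written per process must remain below 2 GB; only the offsets and the total file size may grow beyond that.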

However, sometimes this striping default needs to be changed, using one of the following (an MPI info-hint alternative is sketched after the list):

  1. Lustre tools from the shell: touch /mnt/lustre/file ; lfs getstripe /mnt/lustre/file and lfs setstripe.
  2. Consider using Ken Matney's Lustre Utility Library (LUT) to set the information from your code (see lut_putl).
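
With ROMIO-based MPI I/O implementations (as in Cray MPT and Open MPI), the striping can often also be requested directly from the code via the reserved info hints striping_factor and striping_unit. Whether these hints are honoured depends on the implementation and the file system, so the following is only a sketch; the path and the values are examples. The hints are evaluated only when the file is created:

   MPI_Info info;
   MPI_File fh;

   MPI_Info_create (&info);
   MPI_Info_set (info, "striping_factor", "4");       /* stripe count, i.e. number of OSTs */
   MPI_Info_set (info, "striping_unit", "1048576");   /* stripe size in bytes, here 1 MiB  */

   /* Striping hints only take effect at file-creation time */
   MPI_File_open (MPI_COMM_WORLD, "/mnt/lustre/file",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
   MPI_Info_free (&info);
   /* ... collective writes as in the fragment above ... */
   MPI_File_close (&fh);

MPI_File_get_info can be used on the open file to check which hints were actually accepted.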