- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

Best Practices for IO, Parallel IO and MPI-IO

Best practices for I/O

Do not generate output. Kidding aside, there are a few hints... TBD


File size restriction on Lustre file systems

File/data segment size is currently limited to 2 TB per OST. If you have files larger than 2 TB, please ensure that the striping distributes your data in a way that the per-OST limit is not reached, e.g. by increasing the stripe count with lfs setstripe so that no single OST holds more than 2 TB. For more details see http://wiki.lustre.org/index.php/FAQ_-_Sizing.

Best practices of using MPI I/O

The best way to use parallel MPI I/O is to

  • make as few MPI I/O calls as possible to create/write/read,
  • perform big data requests, and
  • have as few meta-data accesses (seeks, queries or changes of file size) as possible.


MPI I/O best practice example

Note: The following example shows only the usage of the Info object, but the best values for your application are very likely different!


The following code fragment makes use of the collective MPI_File_write_at_all call and MPI I/O info hints. Taken to the extreme, all MPI processes that have data to write participate in one collective write request to one file.

File: mpi-io-fragment
/*
 * The following code fragment writes all elements from the buffers of all MPI
 * processes into one contiguous file. For demonstration purposes we assume that
 * buffer elements are of type char; num_elements, buffer and the file name are
 * assumed to be defined elsewhere.
 * In order to know at which OFFSET each MPI process has to write into the
 * file, the number of elements written by the previous processes
 * is computed via an exclusive prefix reduction using MPI_Exscan.
 */
int my_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

int offset = 0;
MPI_Exscan(&num_elements, &offset, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
/* The MPI_Exscan result on rank 0 is undefined, so rank 0 explicitly writes at offset 0 */
if (my_rank == 0) offset = 0;

MPI_Info info = MPI_INFO_NULL;
  
/* Set a few MPI_Info key-values in order to improve the write-speed */
MPI_Info_create (&info);
MPI_Info_set(info, "cb_align", "2");             /* Default: OMPI: none, CrayMPT: 2 */
MPI_Info_set(info, "cb_nodes_list", "*:*");      /* Default: OMPI: *:1, CrayMPT: *:* */
MPI_Info_set(info, "direct_io", "false");        /* Default: OMPI: none, CrayMPT: false */
MPI_Info_set(info, "romio_ds_read", "disable");  /* Default: OMPI: none, CrayMPT: disable */
MPI_Info_set(info, "romio_ds_write", "disable"); /* Default: OMPI: none, CrayMPT: disable */
/* Reduce the number of aggregators; it should be roughly 2 to 4 times the stripe factor */
MPI_Info_set(info, "cb_nodes", "8");             /* Default: OMPI: set automatically to the number of distinct nodes; however, this is often too high */

MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, filename, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
MPI_File_write_at_all (fh, offset, buffer, num_elements, MPI_CHAR, MPI_STATUS_IGNORE);
MPI_File_close (&fh);


The write offset for each MPI process is pre-determined using MPI_Exscan, and (if the file is large enough) the info hints reduce the number of MPI I/O aggregators (the processes collecting data and writing to the OSTs). Please note

  1. In this case, the data is contiguous, and both the data written per process and the summed offsets fit into 2 GB (the range of the MPI_INT used on this platform); a sketch for handling larger amounts of data follows below this list.
  2. The defaults for Cray MPI's ROMIO were good; however, the striping was too high.
  3. Striping information is set when a file is created; mostly the default is fine, e.g. a stripe factor of four.
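
If the amount of data can grow beyond 2 GB, the offsets have to be computed with a 64-bit type instead of int. The following is a minimal sketch under that assumption; the variable names follow the fragment above, and only the offset computation changes:

 long long my_count  = (long long) num_elements;
 long long my_offset = 0;
 /* Exclusive prefix sum over 64-bit counts; the result on rank 0 is undefined
    and is therefore reset to 0, since rank 0 writes at the start of the file. */
 MPI_Exscan(&my_count, &my_offset, 1, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);
 if (my_rank == 0) my_offset = 0;

 /* MPI_File_write_at_all takes an MPI_Offset, so offsets beyond 2 GB are fine. */
 MPI_File_write_at_all(fh, (MPI_Offset) my_offset, buffer, num_elements,
                       MPI_CHAR, MPI_STATUS_IGNORE);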

However, sometimes this striping default needs to be changed using

  1. the Lustre tools from the shell: touch /mnt/lustre/file ; lfs getstripe /mnt/lustre/file and lfs setstripe, or
  2. Ken Matney's Lustre Utility Library (LUT) to set the information from your code (see lut_putl)...
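
Striping can also be requested at file-creation time through MPI I/O info hints; ROMIO-based MPI implementations typically honour striping_factor and striping_unit, which only take effect when the file is created. A minimal sketch, assuming the file name from the example above; the values shown are placeholders:

 /* Request a specific Lustre striping at file creation (placeholder values).
    Implementations that do not know these hints will silently ignore them. */
 MPI_Info sinfo;
 MPI_Info_create(&sinfo);
 MPI_Info_set(sinfo, "striping_factor", "8");      /* stripe count (number of OSTs) */
 MPI_Info_set(sinfo, "striping_unit", "4194304");  /* stripe size in bytes (4 MiB) */

 MPI_File sfh;
 MPI_File_open(MPI_COMM_WORLD, filename, MPI_MODE_CREATE | MPI_MODE_WRONLY, sinfo, &sfh);
 /* ... write collectively as in the example above ... */
 MPI_File_close(&sfh);
 MPI_Info_free(&sinfo);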

Adapting HDF5's MPI I/O parameters to prevent locking on Lustre

HDF5's use of MPI I/O exposes a problem on file systems that do not support locking or are configured without it (e.g. most Lustre installations). When creating files with H5Fcreate() on these file systems, the MPI I/O layer (ROMIO in most MPI implementations) causes the file system to hang.

To eliminate the problem, attach an info object when opening the file that disables ROMIO's data sieving (romio_ds_read, romio_ds_write) and enables ROMIO's collective buffering (thanks to Sebastian Lange for the suggestion!):

 hid_t file;
 hid_t plist_id;
 MPI_Info info;
 
 /* Create info to be attached to HDF5 file */
 MPI_Info_create(&info);
 
 /* Disables ROMIO's data-sieving */
 MPI_Info_set(info, "romio_ds_read", "disable");
 MPI_Info_set(info, "romio_ds_write", "disable");
 
 /* Enable ROMIO's collective buffering */
 MPI_Info_set(info, "romio_cb_read", "enable");
 MPI_Info_set(info, "romio_cb_write", "enable");
 
 /* Attach above info as access properties */
 plist_id = H5Pcreate(H5P_FILE_ACCESS);
 /* H5Pset_fapl_mpio duplicates MPI_COMM_WORLD (via MPI_Comm_dup) and the info internally;
    this causes HDF5 to use collective MPI I/O calls */
 H5Pset_fapl_mpio(plist_id, MPI_COMM_WORLD, info);
 file = H5Fcreate(filename, H5F_ACC_TRUNC, H5P_DEFAULT, plist_id);
 /* The info object has been duplicated by HDF5 and can be released */
 MPI_Info_free(&info);
 
 ...
 
 H5Pclose (plist_id);
 H5Fclose (file);

Optimal striping on ws9 & choice of multiplier

ws9 uses the Lustre file system (LFS), which saves data in stripes on Object Storage Targets (OSTs). The number of OSTs used for this is called the stripe count and is one of the main factors for getting your I/O to scale, as higher stripe counts provide more disks to write to in parallel. Analysis of I/O scaling results has shown that the stripe count should be chosen as SC = 5 * (number of nodes) + 3 for up to 8 nodes, after which I/O performance levels off for ws9.

In order to scale beyond 8 nodes, the locking mode needs to be changed to use Lustre Lock-Ahead (LLA), which (optimally) allows multiple clients (cores) to write to the same OST without contention. LLA is activated by setting the MPI-I/O hint cray_cb_write_lock_mode=2, and the number of clients per OST x is set via cray_cb_nodes_multiplier=x. I/O scaling showed that higher write rates are achieved by first using all available OSTs and then activating LLA. The choice of the multiplier x, however, is not so straightforward, as the optimal multiplier depends on the amount of data written per worker as well as the number of cores employed. For large data sizes per core (>= 4 MB) and more than 1536 cores, multipliers of 16-32 showed the highest performance. Smaller data sizes generally require smaller multipliers (2-4) to show an increase in performance beyond simply using all OSTs.
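
A minimal sketch of how these Cray MPICH hints could be passed when opening a file; the multiplier value and the file name are placeholders and should be tuned as described above:

 /* Request Lustre Lock-Ahead via Cray MPICH MPI-I/O hints (placeholder values). */
 MPI_Info lla_info;
 MPI_Info_create(&lla_info);
 MPI_Info_set(lla_info, "cray_cb_write_lock_mode", "2");   /* enable Lock-Ahead locking */
 MPI_Info_set(lla_info, "cray_cb_nodes_multiplier", "4");  /* clients per OST, tune per workload */

 MPI_File lla_fh;
 MPI_File_open(MPI_COMM_WORLD, filename, MPI_MODE_CREATE | MPI_MODE_WRONLY, lla_info, &lla_fh);
 /* ... collective writes as in the MPI I/O example above ... */
 MPI_File_close(&lla_fh);
 MPI_Info_free(&lla_info);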

Another factor of the striping employed by LFS is the stripe size, i.e. how much data is stored in each stripe. This correlates with the number of necessary writes, and optimally all writes are of stripe size. Choosing the stripe size so that the data per process is a small multiple (1-16) of it typically yields above 90% stripe-sized writes, which correlates with good I/O performance; for example, if each process writes 64 MiB, stripe sizes between 64 MiB and 4 MiB fall into this range.

For the more adventurous and technically inclined, a lot of information on I/O is given if the environment variables MPICH_MPIIO_STATS and MPICH_MPIIO_TIMERS are set to 1 while running your program. If everything is set nicely, the statistics part may look like this:

  MPIIO write access patterns for goodfile
  independent writes      = 0 // 0 if there is no header
  collective writes       = 3840
  independent writers     = 0
  aggregators             = 54 // should be stripe count * multiplier
  stripe count            = 54
  stripe size             = 4063232
  system writes           = 3890
  stripe sized writes     = 3851 // optimally close to above
  aggregators active      = 0,0,0,3840 (1, <= 27, > 27, 54)  // all writes should be in the rightmost bin
  total bytes for writes  = 15728640000 = 15000 MiB = 14 GiB
  ave system write size   = 4043352
  read-modify-write count = 0 // from here
  read-modify-write bytes = 0
  number of write gaps    = 0 // to here should be 0
  ave write gap size      = NA // for headerless files