Best Practices for I/O, Parallel I/O and MPI-IO
Best practices for I/O
Do not generate output. Kidding aside, there are a few hints... TBD
File size restriction on Lustre file systems
File/data segment size is currently limited to 2 TB per OST. If you have files which are larger than 2 TB, please ensure that the striping distributes your data in a way that the per-OST limit is not reached. For more details see http://wiki.lustre.org/index.php/FAQ_-_Sizing.
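As a minimal sketch of how to request such striping from MPI-IO itself: the MPI standard reserves the info keys striping_factor and striping_unit, which ROMIO-based implementations pass on to the file system at file-creation time. The file name, helper name and values below are illustrative, not a recommendation:

#include <mpi.h>

/* Sketch (illustrative values): create a file striped over 32 OSTs with a
 * 4 MB stripe size, so no single OST has to hold more than its 2 TB share. */
void create_striped_file(const char *fn)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "32");    /* number of OSTs to stripe over */
    MPI_Info_set(info, "striping_unit", "4194304"); /* stripe size in bytes: 4 MB */

    /* Striping hints only take effect when the file is created */
    MPI_File_open(MPI_COMM_WORLD, fn, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_File_close(&fh);
    MPI_Info_free(&info);
}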
Best practices for using MPI I/O
The best way to use parallel MPI I/O is to
- make as few file I/O calls as possible, in order to create
- big data requests, and have
- as few meta-data accesses as possible (seeks, querying or changing the file size).
If this is taken to the extreme, all processes that have to write data participate in one collective write request to one file. The following code fragment, used on the Cray Jaguar on Lustre for a performance-tracing library, makes use of the collective write call and of MPI I/O info hints:
/*
 * In order to know at which OFFSET we are writing, let's figure out
 * the previous processors' lengths. We need two more slots for
 * comm_rank and for mpistat_unexpected_queue_avg_time_num.
 */
MPI_Scan (&buffer_pos, &position, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
/* Scan is inclusive, reduce by our input */
position -= buffer_pos;
MPI_Allreduce (&buffer_pos, &file_length, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

/* Set a few MPI_Info key-values, in order to improve the write-speed */
info = MPI_INFO_NULL;
if (file_length > 4*1024*1024 || 256 < mpistat_comm_size) {
    MPI_Info_create (&info);
    MPI_Info_set (info, "cb_align", "2");             /* Default: OMPI: none, CrayMPT: 2 */
    MPI_Info_set (info, "cb_nodes_list", "*:*");      /* Default: OMPI: *:1, CrayMPT: *:* */
    MPI_Info_set (info, "direct_io", "false");        /* Default: OMPI: none, CrayMPT: false */
    MPI_Info_set (info, "romio_ds_read", "disable");  /* Default: OMPI: none, CrayMPT: disable */
    MPI_Info_set (info, "romio_ds_write", "disable"); /* Default: OMPI: none, CrayMPT: disable */
    /* Let's reduce the number of aggregators; should be roughly 2 to 4 times the stripe-factor */
    MPI_Info_set (info, "cb_nodes", "8");             /* Default: OMPI: set automatically to the number of distinct nodes; however too high */
}
MPI_File_open (MPI_COMM_WORLD, fn, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
MPI_File_write_at_all (fh, position, buffer, buffer_pos, MPI_CHAR, MPI_STATUS_IGNORE);
MPI_File_close (&fh);
The length in bytes per process is pre-determined using MPI_Scan, and (if the file is large enough) the code reduces the number of MPI I/O aggregators (the processes collecting data and writing to the OSTs). Please note:
- In this case, data is contiguous; the data written per process and the total sum fit into 2 GB (the limit of MPI_INT on this platform).
- The defaults for Cray MPI's ROMIO were good; however, the striping was too high.
- Striping information is set when a file is created; mostly the default is fine, e.g. a stripe factor of four.
However, sometimes this default needs to be changed using
- Lustre tools from the shell: touch /mnt/lustre/file ; lfs getstripe /mnt/lustre/file and lfs setstripe.
- Ken Matney's Lustre Utility Library (LUT) to set the information from your code (see lut_putl); a stand-in sketch follows this list...
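LUT's exact interface is not reproduced here; as a minimal sketch of setting the striping from code, the stand-in below uses llapi_file_create() from Lustre's own liblustreapi instead, assuming the signature from lustreapi.h. All values are illustrative:

#include <lustre/lustreapi.h>

/* Sketch (stand-in for LUT's lut_putl): create a file with a 4 MB stripe
 * size, any starting OST (-1), a stripe count of 8 and the default
 * (RAID0) pattern. Must be called before data is written to the file. */
int make_striped(const char *path)
{
    return llapi_file_create(path, 4 * 1024 * 1024, -1, 8, 0);
}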
Adapting HDF5's MPI I/O parameters to prevent locking on Lustre
The HDF5 library and its use of MPI I/O expose a problem on file systems that do not support locking, or that are configured without it (e.g. most Lustre installations). When creating files using H5Fcreate() on these file systems, the MPI I/O layer (in most MPI implementations, ROMIO) causes the file system to hang.
To eliminate the problem, one must attach an info parameter when opening the file, which disables ROMIO's data-sieving (romio_ds_read and romio_ds_write) and enables ROMIO's collective buffering. (Thanks to Sebastian Lange for the suggestion!)
hid_t file;
hid_t plist_id;
MPI_Info info;

/* Create info to be attached to the HDF5 file */
MPI_Info_create(&info);
/* Disable ROMIO's data-sieving */
MPI_Info_set(info, "romio_ds_read", "disable");
MPI_Info_set(info, "romio_ds_write", "disable");
/* Enable ROMIO's collective buffering */
MPI_Info_set(info, "romio_cb_read", "enable");
MPI_Info_set(info, "romio_cb_write", "enable");

/* Attach the above info as access properties */
plist_id = H5Pcreate(H5P_FILE_ACCESS);
/* HDF5 duplicates MPI_COMM_WORLD (MPI_Comm_dup) and attaches the info;
 * this causes HDF5 to use collective calls */
H5Pset_fapl_mpio(plist_id, MPI_COMM_WORLD, info);
file = H5Fcreate(filename, H5F_ACC_TRUNC, H5P_DEFAULT, plist_id);
...
H5Pclose (plist_id);
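Note that the info above only affects how the file is opened; to make the actual dataset writes collective as well, a transfer property list is needed. A minimal sketch, assuming a parallel HDF5 build and that dset, memspace, filespace and data have been set up elsewhere:

/* Sketch: collective write of an existing dataset; dset, memspace,
 * filespace and data are assumed to be created elsewhere. */
hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE); /* request collective MPI I/O */
H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, data);
H5Pclose(dxpl);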