- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -
MPI-IO
Best practices for using MPI I/O
The best way to use parallel MPI I/O is to
- make as few file I/O calls as possible,
- issue large data requests, and
- keep the number of meta-data accesses (seeks, querying or changing the file size) low.
Taken to the extreme, all processes that have data to write participate in one collective write request to a single file. The following code fragment, used on the Cray Jaguar system with the Lustre file system for a performance-tracing library, makes use of a collective write call and of MPI I/O info hints:
/*
 * In order to know at which OFFSET we are writing, let's figure out the previous processes' lengths.
 * We need two more slots for comm_rank and for mpistat_unexpected_queue_avg_time_num.
 */
MPI_Scan (&buffer_pos, &position, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
position -= buffer_pos;                               /* Scan is inclusive, reduce by our input */
MPI_Allreduce (&buffer_pos, &file_length, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

/* Set a few MPI_Info key-values, in order to improve the write-speed */
info = MPI_INFO_NULL;
if (file_length > 4*1024*1024 || 256 < mpistat_comm_size) {
    MPI_Info_create (&info);
    MPI_Info_set (info, "cb_align", "2");             /* Default: OMPI: none, CrayMPT: 2       */
    MPI_Info_set (info, "cb_nodes_list", "*:*");      /* Default: OMPI: *:1, CrayMPT: *:*      */
    MPI_Info_set (info, "direct_io", "false");        /* Default: OMPI: none, CrayMPT: false   */
    MPI_Info_set (info, "romio_ds_read", "disable");  /* Default: OMPI: none, CrayMPT: disable */
    MPI_Info_set (info, "romio_ds_write", "disable"); /* Default: OMPI: none, CrayMPT: disable */
    /* Let's reduce the number of aggregators, should be roughly 2 to 4 times the stripe-factor */
    MPI_Info_set (info, "cb_nodes", "8");             /* Default: OMPI: number of distinct nodes, which is too high here */
}
MPI_File_open (MPI_COMM_WORLD, fn, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
MPI_File_write_at_all (fh, position, buffer, buffer_pos, MPI_CHAR, MPI_STATUS_IGNORE);
MPI_File_close (&fh);
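For reference, the fragment assumes declarations roughly along the following lines (a sketch only; the original library's actual declarations are not shown here):

int       buffer_pos;          /* number of bytes this rank writes           */
int       position;            /* this rank's byte offset within the file    */
int       file_length;         /* total number of bytes written by all ranks */
int       mpistat_comm_size;   /* size of MPI_COMM_WORLD                     */
char     *buffer;              /* the trace data to be written               */
char     *fn;                  /* output file name                           */
MPI_Info  info;
MPI_File  fh;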
The length in bytes per process is pre-determined using MPI_Scan, and (if the file is large enough) the hints reduce the number of MPI I/O aggregators (the processes that collect the data and write it to the OSTs). Please note:
- In this case the data is contiguous, and both the data written per process and the total sum fit into 2 GB (the range of MPI_INT on this platform); a sketch for larger files follows after this list.
- The defaults of Cray MPI's ROMIO were good; however, the striping was too high.
- Striping information is set when a file is created; mostly the default is fine, e.g. a stripe factor of four.
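The 2 GB limitation mentioned above stems from computing the offsets with MPI_INT. If the aggregate file size can grow beyond that, the same prefix-sum pattern can be expressed with 64-bit integers and an MPI_Offset. A minimal sketch under that assumption (variable names are illustrative; fh, buffer and buffer_pos are reused from above, and the per-rank count still has to fit into an int):

/* Sketch for files whose total size may exceed 2 GB:
 * compute the write offset as a 64-bit prefix sum instead of an int. */
long long my_bytes    = (long long) buffer_pos;  /* bytes written by this rank         */
long long my_offset   = 0;                       /* resulting byte offset of this rank */
long long total_bytes = 0;                       /* total file size in bytes           */

MPI_Scan (&my_bytes, &my_offset, 1, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);
my_offset -= my_bytes;                           /* Scan is inclusive, subtract our own contribution */
MPI_Allreduce (&my_bytes, &total_bytes, 1, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);

MPI_File_write_at_all (fh, (MPI_Offset) my_offset, buffer, buffer_pos,
                       MPI_CHAR, MPI_STATUS_IGNORE);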
However, sometimes this default striping needs to be changed, using one of the following:
- Lustre tools from the shell: touch /mnt/lustre/file ; lfs getstripe /mnt/lustre/file and lfs setstripe.
- Consider using Ken Matney's Lustre Utility Library (LUT) to set the striping information from your code (see lut_putl), or request the striping through MPI info hints as sketched below.
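As an alternative to the shell tools, striping can also be requested through the MPI info object passed to MPI_File_open. The keys "striping_factor" and "striping_unit" are reserved hints in the MPI standard, although an implementation is free to ignore them, and they only take effect when the file is created. A minimal sketch (file name and values chosen for illustration):

/* Sketch: request a stripe count of 8 and a 1 MiB stripe size at file creation.
 * Whether these reserved hints are honoured depends on the MPI implementation
 * and the file system; they must be set before the file exists. */
MPI_Info sinfo;
MPI_File sfh;

MPI_Info_create (&sinfo);
MPI_Info_set (sinfo, "striping_factor", "8");        /* number of OSTs (stripe count) */
MPI_Info_set (sinfo, "striping_unit", "1048576");    /* stripe size in bytes          */

MPI_File_open (MPI_COMM_WORLD, "/mnt/lustre/file",
               MPI_MODE_CREATE | MPI_MODE_WRONLY, sinfo, &sfh);
/* ... collective writes as in the fragment above ... */
MPI_File_close (&sfh);
MPI_Info_free (&sinfo);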