National Institute for Computational Sciences is a UT/ORNL Partnership

I/O Tips - Lustre Striping and Parallel I/O

Tips for getting better I/O performance on Kraken's Lustre file system




The Lustre file system on Kraken exists across a set of 336 block storage devices, referred to as Object Storage Targets (OSTs), that are managed by 48 service nodes serving as Object Storage Servers (OSSs). Each file in a Lustre file system is broken into chunks and stored on a subset of the OSTs. A single service node serving as the Metadata Server (MDS) assigns and tracks all of the storage locations associated with each file in order to direct file I/O requests to the correct set of OSTs and corresponding OSSs. When a compute node needs to create or access a file, it requests the associated storage locations from the MDS and then communicates directly with all of the OSSs that manage the provided storage locations to complete the file transaction in parallel across the OSTs.
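
As a rough illustration of how a file is broken into chunks, the Python sketch below assumes a simple round-robin layout in which consecutive stripe-size chunks go to consecutive OSTs in the file's set. The function name is hypothetical, and the "slot" it returns is only a position within the file's subset of OSTs, not a real OST number; the actual OSTs are assigned by the MDS. The defaults (1 MB stripes over 4 OSTs) are the ones quoted in the Striping section.

```python
def ost_slot_for_offset(offset, stripe_size=1 << 20, stripe_count=4):
    """Return (slot, chunk) for a byte offset, assuming round-robin striping.

    slot:  position within the file's set of OSTs (0 .. stripe_count - 1)
    chunk: index of the stripe-size chunk that holds the byte
    """
    if offset < 0 or stripe_size <= 0 or stripe_count <= 0:
        raise ValueError("offset must be >= 0; size and count must be positive")
    chunk = offset // stripe_size
    return chunk % stripe_count, chunk


# Walk through the first six megabytes of a file with the default layout.
for mb in range(6):
    slot, chunk = ost_slot_for_offset(mb * (1 << 20))
    print(f"offset {mb} MB -> chunk {chunk}, OST slot {slot}")
```

Under these assumptions, a large sequential write touches every OST in the set in turn, which is where the extra bandwidth of striping comes from.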


Striping

Storing a single file across multiple OSTs (referred to as striping) offers two benefits: 1) an increase in the bandwidth available when accessing the file and 2) an increase in the available disk space for storing the file. However, striping is not without disadvantages, namely: 1) increased overhead due to network operations and server contention and 2) increased risk of file damage due to hardware malfunction. Given the tradeoffs involved, the Lustre file system allows users to specify the striping policy for each file or directory of files using the lfs utility. The default stripe width as of July 2008 is 4.

Two commonly used lfs suboptions are getstripe and setstripe. The command lfs getstripe can be used to get striping information on files and directories, while the command lfs setstripe can be used to set the striping (and Lustre stripe buffer size).

The setstripe usage is as follows:

lfs setstripe <filename|dirname> <stripe size> <stripe index> <stripe count>

where

stripe size = the number of bytes on each OST (0 indicating default of 1 MB) specified with k, m, or g to indicate units of KB, MB, or GB, respectively,
stripe index = the OST index of first stripe (-1 indicating default), and
stripe count = the number of OSTs to stripe over (0 indicating default of 4 and -1 indicating all OSTs).

For example, the command

lfs setstripe <dir> 0 -1 1

sets the stripe count (width) to 1 on a directory.

The recommended striping configuration for a file in the Lustre file system on Kraken is as follows:

  • a stripe size of 512 KB to 4 GB that is an integer multiple of the size of an individual write used to output into the file,
  • a stripe index of -1 to use the default placement algorithm, and
  • a stripe count set to the smallest of
    1. 336 divided by the number of files in simultaneous use by the job,
    2. the average size, in gigabytes (GB), of the files in simultaneous use by the job, and
    3. the maximum of 160 imposed by Lustre
    (rounded down to the nearest integer, subject to a minimum value of 1).
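
The recommendation above can be worked through as a small calculation. The following Python sketch is only an illustration of that arithmetic; the function names are hypothetical, and it assumes 1 KB = 1024 bytes when checking the stripe size range.

```python
import math

TOTAL_OSTS = 336          # OSTs on Kraken, as stated above
LUSTRE_MAX_STRIPES = 160  # per-file maximum quoted above
KB, GB = 1024, 1024 ** 3  # assumption: binary units


def recommended_stripe_count(files_in_use, avg_file_size_gb):
    """Smallest of the three limits, rounded down, with a minimum of 1."""
    if files_in_use < 1 or avg_file_size_gb < 0:
        raise ValueError("need at least one file and a non-negative size")
    limit = min(TOTAL_OSTS / files_in_use, avg_file_size_gb, LUSTRE_MAX_STRIPES)
    return max(1, math.floor(limit))


def stripe_size_is_suitable(stripe_bytes, write_bytes):
    """True if the stripe size is within 512 KB to 4 GB and is an
    integer multiple of the size of one application write."""
    in_range = 512 * KB <= stripe_bytes <= 4 * GB
    return in_range and write_bytes > 0 and stripe_bytes % write_bytes == 0


# Four 500 GB files in use at once: 336 / 4 = 84 is the smallest limit.
print(recommended_stripe_count(4, 500))
# One thousand small files: every limit is below 1, so the minimum applies.
print(recommended_stripe_count(1000, 0.2))
```

For a job that writes one file per process, the first limit usually dominates, so a stripe count of 1 is the typical result.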

Please do not use a stripe index other than the default.

Further suggestions:

  • Use the default stripe size of 1 MB unless specific application benchmarks support an altered setting.

Please contact NICS user support for further guidance, if needed.


Parallel I/O

With the /lustre/scratch file system, high parallel I/O bandwidths can be achieved using MPI I/O to either a shared file or a file per process. Similar performance can be achieved, for example, with Fortran writes within an MPI program in which each process writes to its own file.

What does it mean for the user that a particular range of process counts gives the best I/O performance? It means:

  • If you are in that range or below, your current parallel I/O method is probably fine. That is, whether you write to one shared file or create one file per process, your I/O performance will probably be acceptable, provided that you set your striping appropriately.
  • If you are running beyond the high end of the range (>2k), then you might want to consider using a subset of your MPI processes to do I/O. Yes, this will require code changes, but the changes are not difficult in principle, and the performance gain can be nearly an order of magnitude. This may be necessary only if your I/O takes more than 5% of your runtime, or if you would like to do more I/O but don't because of the cost. Please contact the NICS User Assistance Center if you would like help with your parallel I/O. The following Fortran example creates an MPI communicator that includes only the ionodes:
! listofionodes is an array of the ranks of writers/readers
call MPI_COMM_GROUP(MPI_COMM_WORLD, WORLD_GROUP, ierr)
call MPI_GROUP_INCL(WORLD_GROUP, nionodes, listofionodes, IO_GROUP, ierr)
! collective over MPI_COMM_WORLD; ranks outside IO_GROUP receive MPI_COMM_NULL
call MPI_COMM_CREATE(MPI_COMM_WORLD, IO_GROUP, MPI_COMM_IO, ierr)
if (MPI_COMM_IO /= MPI_COMM_NULL) then
  ! open (collective over the ionodes only)
  call MPI_FILE_OPEN(MPI_COMM_IO, trim(filename), filemode, finfo, mpifh, ierr)
  ! read/write; offset must be of kind MPI_OFFSET_KIND, bufsize counts MPI_REAL8 elements
  call MPI_FILE_WRITE_AT(mpifh, offset, iobuf, bufsize, MPI_REAL8, status, ierr)
  !   OR
  ! call MPI_FILE_SET_VIEW(mpifh, disp, MPI_REAL8, MPI_REAL8, "native", finfo, ierr)
  ! call MPI_FILE_WRITE_ALL(mpifh, iobuf, bufsize, MPI_REAL8, status, ierr)
  ! close
  call MPI_FILE_CLOSE(mpifh, ierr)
end if
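
The Fortran example takes the list of I/O ranks as given. One possible way to build it, shown here in Python for brevity, is to pick evenly spaced ranks. The function name and the even-spacing policy are assumptions made for illustration, not a NICS recommendation; the best choice depends on how ranks map to nodes in a particular job.

```python
def pick_io_ranks(nprocs, nionodes):
    """Return nionodes distinct, evenly spaced ranks out of 0 .. nprocs - 1."""
    if not 1 <= nionodes <= nprocs:
        raise ValueError("need 1 <= nionodes <= nprocs")
    # Integer arithmetic keeps the ranks distinct because nprocs >= nionodes.
    return [(i * nprocs) // nionodes for i in range(nionodes)]


# Example: 8 writers out of 4096 processes, i.e. every 512th rank.
print(pick_io_ranks(4096, 8))
```

The resulting list plays the role of listofionodes, and its length the role of nionodes, in the call to MPI_GROUP_INCL.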

Note: If you cannot implement a subsetting approach, it is still to your advantage to limit the number of simultaneous opens, say to 100, even if you cannot limit the writes/reads. This keeps a large number of requests from hitting the metadata server (of which there is only one) at the same time.
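
One possible way to limit simultaneous opens is to have the ranks open their files in waves. The Python sketch below only computes which ranks belong to which wave; the function names are hypothetical, and the scheme itself is an assumption rather than a prescribed method. In an MPI program, each rank would wait (for example at a barrier) until the preceding waves have finished their opens.

```python
def wave_of(rank, max_concurrent=100):
    """Index of the wave in which a given rank opens its file."""
    if rank < 0 or max_concurrent < 1:
        raise ValueError("rank must be >= 0 and max_concurrent >= 1")
    return rank // max_concurrent


def open_waves(nprocs, max_concurrent=100):
    """Split ranks 0 .. nprocs - 1 into consecutive waves of bounded size."""
    if nprocs < 0 or max_concurrent < 1:
        raise ValueError("nprocs must be >= 0 and max_concurrent >= 1")
    return [range(start, min(start + max_concurrent, nprocs))
            for start in range(0, nprocs, max_concurrent)]


# 250 processes with at most 100 opens in flight gives three waves.
for i, wave in enumerate(open_waves(250)):
    print(f"wave {i}: ranks {wave.start} to {wave.stop - 1}")
```

The cost is some added latency before the last wave starts, which is usually small compared with the slowdown from overloading the metadata server.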