The National Institute for Computational Sciences

I/O and Lustre Usage


Lustre Fundamentals


The Lustre file system on Kraken exists across a set of 336 block storage devices, referred to as Object Storage Targets (OSTs), that are managed by 48 service nodes serving as Object Storage Servers (OSSs). Each file in a Lustre file system is broken into chunks and stored on a subset of the OSTs. A single service node serving as the Metadata Server (MDS) assigns and tracks all of the storage locations associated with each file in order to direct file I/O requests to the correct set of OSTs and corresponding OSSs. The metadata itself is stored on a block storage device referred to as the MDT.

Lustre Components


The Lustre file system is made up of an underlying set of I/O servers called Object Storage Servers (OSSs) and disks called Object Storage Targets (OSTs). The file metadata is controlled by a Metadata Server (MDS) and stored on a Metadata Target (MDT). A single Lustre file system consists of one MDS and one MDT. The functions of each of these components are described in the following list:

  • Object Storage Servers (OSSs) manage a small set of OSTs by controlling I/O access and handling network requests to them. OSSs contain some metadata about the files stored on their OSTs. They typically serve between 2 and 8 OSTs, up to 16 TB in size each.
  • Object Storage Targets (OSTs) are block storage devices that store user file data. An OST may be thought of as a virtual disk, though it often consists of several physical disks, in a RAID configuration for instance. User file data is stored in one or more objects, with each object stored on a separate OST. The number of objects per file is user configurable and can be tuned to optimize performance for a given workload.
  • The Metadata Server (MDS) is a single service node that assigns and tracks all of the storage locations associated with each file in order to direct file I/O requests to the correct set of OSTs and corresponding OSSs. Once a file is opened, the MDS is not involved with I/O to the file. This is different from many block-based clustered file systems where the MDS controls block allocation, eliminating it as a source of contention for file I/O.
  • The Metadata Target (MDT) stores metadata (such as filenames, directories, permissions and file layout) on storage attached to an MDS. Storing the metadata on an MDT provides an efficient division of labor between computing and storage resources. Each file entry on the MDT contains the layout of the associated data file, including the OST numbers and object identifiers, and points to one or more objects associated with the data file.

Figure 1 shows the interaction among Lustre components in a basic cluster.

Figure 1: View of the Lustre File System. The route for data movement from application process memory to disk is shown by arrows.

When a compute node needs to create or access a file, it requests the associated storage locations from the MDS and the associated MDT. I/O operations then occur directly with the OSSs and OSTs associated with the file, bypassing the MDS. For read operations, file data flows from the OSTs to memory. Each OST and MDT maps to a distinct subset of the RAID devices. The total storage capacity of a Lustre file system is the sum of the capacities provided by the OSTs.

Back to Contents

File Striping Basics


A key feature of the Lustre file system is its ability to distribute the segments of a single file across multiple OSTs using a technique called file striping. A file is said to be striped when its linear sequence of bytes is separated into small chunks, or stripes, so that read and write operations can access multiple OSTs concurrently.

A file is a linear sequence of bytes lined up one after another. Figure 2 shows a logical view of a single file, File A, broken into five segments and lined up in sequence.

Figure 2: Logical view of a file.

A physical view of File A striped across four OSTs in five distinct pieces is shown in Figure 3.

Figure 3: Physical view of a file.

Storing a single file across multiple OSTs (referred to as striping) offers two benefits: 1) an increase in the bandwidth available when accessing the file and 2) an increase in the available disk space for storing the file. However, striping is not without disadvantages, namely: 1) increased overhead due to network operations and server contention and 2) increased risk of file damage due to hardware malfunction. Given the tradeoffs involved, the Lustre file system allows users to specify the striping policy for each file or directory of files using the lfs utility. Usage of the lfs utility is described in the Basic Lustre User Commands section.

Back to Contents

Stripe Alignment


Performance concerns related to file striping include resource contention on the block device (OST) and request contention on the OSS associated with the OST. This contention is minimized when processes that access the file in parallel access file locations that reside on different stripes.

Additionally, performance can be improved by minimizing the number of OSTs with which a process must communicate. An effective strategy to accomplish this is to stripe align your I/O requests: ensure that processes access the file at offsets that correspond to stripe boundaries. Stripe settings should take into account the I/O pattern used to access the file.

Aligned Stripes

In Figure 3 we gave an example of a single file spread across four OSTs in five distinct pieces. Now, we add information to that example to show how the stripes are aligned in the logical view of File A. Since the file is spread across 4 OSTs, the stripe count is 4. If File A has 9 MB of data and the stripe size is set to 1 MB, it can be segmented into 9 equally sized stripes that will be accessed concurrently. The physical and logical views of File A are shown in Figure 4.

Figure 4: Physical and Logical Views of File A.

In this example, the I/O requests are stripe aligned, meaning that the processes access the file at offsets that correspond to stripe boundaries.
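
The mapping from a file offset to a stripe, and from a stripe to an OST, can be worked out with simple arithmetic. The short C sketch below reproduces the File A example (1 MB stripe size, stripe count of 4, 9 MB of data); it assumes, for illustration only, a round-robin layout that starts at the first OST assigned to the file, which it labels index 0. An I/O request is stripe aligned when its starting offset is a multiple of the stripe size.

#include <stdio.h>

int main(void)
{
    const long long stripe_size  = 1LL << 20;  /* 1 MB stripe size, as in the File A example */
    const int       stripe_count = 4;          /* File A is striped over 4 OSTs              */
    const long long file_size    = 9LL << 20;  /* 9 MB of data -> 9 stripes                  */

    for (long long offset = 0; offset < file_size; offset += stripe_size) {
        long long stripe = offset / stripe_size;          /* which stripe holds this offset  */
        int       ost    = (int)(stripe % stripe_count);  /* round-robin placement (assumed) */
        printf("stripe %lld (offset %lld MB) -> OST index %d\n",
               stripe, offset >> 20, ost);
    }
    return 0;
}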

Non-aligned Stripes

Next, we give an example where the stripes are not aligned. Four processes write different amounts of data to a single shared File B that is 5 MB in size. The file is striped across 4 OSTs and the stripe size is 1 MB, meaning that the file will require 5 stripes. Each process writes its data as a single contiguous region in File B. No overlaps or gaps between these regions should be present; otherwise the data in the file would be corrupted. The sizes of the four writes and their corresponding offsets are as follows:

  • Process 0 writes 0.6 MB starting at offset 0 MB
  • Process 1 writes 1.8 MB starting at offset 0.6 MB
  • Process 2 writes 1.2 MB starting at offset 2.4 MB
  • Process 3 writes 1.4 MB starting at offset 3.6 MB

The logical and physical views of File B are shown in Figure 5.

Figure 5: Logical and Physical Views of File B.

None of the four writes fits the stripe size exactly, so Lustre will split each of them into pieces. Since these writes cross stripe (object) boundaries, they are not stripe aligned as in our previous example. When writes are not stripe aligned, some of the OSTs simultaneously receive data from more than one process. In our non-aligned example, OST 0 simultaneously receives data from processes 0, 1 and 3; OST 2 simultaneously receives data from processes 1 and 2; and OST 3 simultaneously receives data from processes 2 and 3. This creates resource contention on the OSTs and request contention on the OSSs associated with them. This contention is a significant performance concern related to striping; it is minimized when processes that access the file in parallel access file locations that reside on different stripes, as in our stripe-aligned example.
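
The contention described above can be checked with the same arithmetic: a write touches every stripe between its first and last byte, and each stripe maps to one OST. The sketch below assumes the Figure 5 layout (1 MB stripe size, stripe count of 4, round-robin placement starting at the first OST, labelled 0 for illustration) and counts how many of the four writers touch each OST; its output matches the contention pattern listed above.

#include <stdio.h>

int main(void)
{
    const long long stripe_size  = 1LL << 20;   /* 1 MB stripes        */
    const int       stripe_count = 4;           /* striped over 4 OSTs */

    /* offset and size of each process's write in File B, in MB */
    const double offset_mb[4] = { 0.0, 0.6, 2.4, 3.6 };
    const double size_mb[4]   = { 0.6, 1.8, 1.2, 1.4 };

    int writers_per_ost[4] = { 0 };

    for (int rank = 0; rank < 4; rank++) {
        long long first_byte = (long long)(offset_mb[rank] * (1 << 20));
        long long last_byte  = (long long)((offset_mb[rank] + size_mb[rank]) * (1 << 20)) - 1;

        printf("process %d touches OSTs:", rank);
        for (long long s = first_byte / stripe_size; s <= last_byte / stripe_size; s++) {
            printf(" %lld", s % stripe_count);     /* stripe s lives on OST s mod 4 */
            writers_per_ost[s % stripe_count]++;
        }
        printf("\n");
    }

    for (int ost = 0; ost < stripe_count; ost++)
        printf("OST %d receives data from %d process(es)\n", ost, writers_per_ost[ost]);

    return 0;
}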

Back to Contents

I/O Benchmarks


The purpose of this section is to convey tips for getting better performance with your I/O on the Kraken Lustre file systems. You can also view our list of I/O Best Practices.

Serial I/O


Serial I/O includes those application I/O patterns in which one process performs I/O operations to one or more files. In general, serial I/O is not scalable.

Figure 6: Write Performance for serial I/O at various Lustre stripe counts. File size is 32 MB per OST utilized and write operations are 32 MB in size. Utilizing more OSTs does not increase write performance. The best performance is seen by utilizing a stripe size which matches the size of write operations.

Figure 7: Write Performance for serial I/O at various Lustre stripe sizes and I/O operation sizes. File utilized is 256 MB written to a single OST. Performance is limited by small operation sizes and small stripe sizes. Either can become a limiting factor in write performance. The best performance is obtained in each case when the I/O operation and stripe sizes are similar.

  • Serial I/O is limited by the single process which performs I/O. I/O operations can only occur as quickly as that single process can read/write data to disk.

  • Parallelism in the Lustre file system cannot be exploited to increase I/O performance.

  • Larger I/O operations and matching Lustre stripe settings may improve performance by amortizing the latency of each I/O operation over more data.
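
For reference, the sketch below shows the serial pattern measured above: a single process writing one file with large, fixed-size operations (32 MB, matching the figures). The file name and sizes are placeholders, and the stripe settings mentioned in the comment are only one reasonable choice.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define WRITE_SIZE (32L * 1024 * 1024)   /* 32 MB operations, as in Figures 6 and 7 */
#define NUM_WRITES 8                     /* 8 x 32 MB = 256 MB file                 */

int main(void)
{
    char *buf = malloc(WRITE_SIZE);
    if (buf == NULL) return 1;
    memset(buf, 'x', WRITE_SIZE);

    /* A single process performs all I/O; only large operations and matching
     * stripe settings (e.g. lfs setstripe -s 32m -c 1) help this pattern. */
    int fd = open("serial_out.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (int i = 0; i < NUM_WRITES; i++)
        if (write(fd, buf, WRITE_SIZE) != WRITE_SIZE) { perror("write"); return 1; }

    close(fd);
    free(buf);
    return 0;
}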

Back to Contents

File-per-Process


File-per-process is a communication pattern in which each process of a parallel application writes its data to a private file. This pattern creates N or more files for an application run of N processes. The performance of each process's file write is governed by the statements made above for serial I/O. However, this pattern is the simplest implementation of parallel I/O, since the independent file writes can exploit the parallelism of the file system.

Figure 8: Write performance of a file-per-process I/O pattern as a function of number of files/processes. The file size is 128 MB with 32 MB sized write operations. Performance increases as the number of processes/files increases until OST and metadata contention hinder performance improvements.

  • Each file is subject to the limitations of serial I/O.

  • Improved performance can be obtained from a parallel file system such as Lustre. However, at large process counts (large number of files) metadata operations may hinder overall performance. Additionally, at large process counts (large number of files) OSS and OST contention will hinder overall performance.
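
A minimal sketch of the file-per-process pattern follows: each MPI rank writes its own file, named by rank, into a directory that would typically be given a stripe count of 1 (see the striping advice later in this document). The output directory, file names, and sizes are placeholders.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE (32 * 1024 * 1024)   /* 32 MB write per process, as in Figure 8 */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(BUF_SIZE);
    memset(buf, 'x', BUF_SIZE);

    /* One private file per process; no coordination between ranks is required. */
    char fname[256];
    snprintf(fname, sizeof(fname), "output/rank_%06d.dat", rank);

    FILE *fp = fopen(fname, "wb");
    if (fp == NULL) MPI_Abort(MPI_COMM_WORLD, 1);
    fwrite(buf, 1, BUF_SIZE, fp);
    fclose(fp);

    free(buf);
    MPI_Finalize();
    return 0;
}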

Back to Contents

Single-shared-file


A single shared file I/O pattern involves multiple application processes that either independently or concurrently share access to the same file. This particular I/O pattern can take advantage of both process and file system parallelism to achieve high levels of performance. However, at large process counts, contention for file system resources (OSTs) can hinder performance gains.

Figure 9: Two possible shared file layouts. The aggregate file size in both cases is 1 and 2 GB depending on which block size is utilized. The major difference in file layouts is the locality of the data from each process. Layout #1 keeps data from a process in a contiguous block, while Layout #2 strides this data throughout the file. Thirty-two (32) processes will concurrently access this shared file.

Figure 10: Write performance utilizing a single shared file accessed by 32 processes. Stripe counts utilized are 32 (1 GB file) and 64 (2 GB file) with stripe sizes of 32 MB and 1 MB. A 1 MB stripe size on Layout #1 results in the lowest performance due to OST contention: each OST is accessed by every process. The highest performance is seen with a 32 MB stripe size on Layout #1, where each OST is accessed by only one process. A 1 MB stripe size gives better performance with Layout #2, where each OST is again accessed by only one process; however, the overall performance is lower due to the increased latency of the smaller I/O operations. With a stripe count of 64, each process communicates with 2 OSTs.

Figure 11: Write Performance of a single shared file as the number of processes increases. A file size of 32 MB per process is utilized with 32 MB write operations. For each I/O library (POSIX, MPI-IO, and HDF5), performance levels off at high core counts.

  • The layout of the single shared file and its interaction with Lustre settings is particularly important with respect to performance.

  • At large core counts file system contention limits the performance gains of utilizing a single shared file. The major limitation is the 160 OST limit on the striping of a single file.
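
The sketch below illustrates Layout #1 with MPI-IO: each rank writes one contiguous, stripe-aligned block at offset rank * BLOCK_SIZE in a single shared file. The 32 MB block size mirrors the best-performing case in Figure 10; the file name is a placeholder, and the file (or its directory) is assumed to have been striped so that the stripe size matches BLOCK_SIZE.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE (32 * 1024 * 1024)   /* 32 MB per rank; match the Lustre stripe size */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(BLOCK_SIZE);
    memset(buf, 'x', BLOCK_SIZE);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Layout #1: rank r owns the contiguous block [r*BLOCK_SIZE, (r+1)*BLOCK_SIZE).
     * With a 32 MB stripe size and stripe count >= number of ranks, each OST
     * is written by only one process. */
    MPI_Offset offset = (MPI_Offset)rank * BLOCK_SIZE;
    MPI_File_write_at(fh, offset, buf, BLOCK_SIZE, MPI_CHAR, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}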

Back to Contents

Basic Lustre User Commands


Lustre's lfs utility provides several options for monitoring and configuring your Lustre environment. In this section, we describe the basic options that enable you to:

  • List OSTs in the File System
  • Search the Directory Tree
  • Check Disk Space Usage
  • Get Striping Information
  • Set Striping Patterns

For a complete list of available options, type help at the lfs prompt.

$ lfs help

To get more information on a specific option, type help along with the option name.

$ lfs help option-name

Back to Contents

List OSTs in the File System


The lfs osts command lists all OSTs available on a file system, which can vary from one system to another. The usage for the command is:

lfs osts [path]

If a path is specified, only OSTs belonging to the specified path are displayed.

The lfs osts command displays the IDs of all available OSTs in the file system along with the default path to the file system, stripe count, stripe size, and stripe offset. The listing below shows the output produced by the lfs osts command on NICS's Kraken supercomputer:

$ lfs osts
OBDS:
0: scratch-OST0000_UUID ACTIVE
1: scratch-OST0001_UUID ACTIVE
.............//...............
334: scratch-OST014e_UUID ACTIVE
335: scratch-OST014f_UUID ACTIVE
/lustre/scratch
stripe_count: 4 stripe_size: 0 stripe_offset: -1

From this output you can see that Kraken (as of 1/15/12) has 336 OSTs, numbered from 0 to 335. In addition, the lfs osts command gives the path to the file system, /lustre/scratch, along with its default stripe settings: a stripe count of 4, a stripe size of 0 (which means the file system default of 1 MB), and a stripe offset of -1 (which lets the MDS choose the starting OST). With these defaults, a file is striped over 4 OSTs in 1 MB stripes starting at an OST chosen by the MDS.

Back to Contents

Search the Directory Tree


The lfs find command searches the directory tree rooted at the given directory/filename for files that match the specified parameters. The usage for the lfs find command is:

lfs find [[!] --atime|-A [-+]N] [[!] --mtime|-M [-+]N]
         [[!] --ctime|-C [-+]N] [--maxdepth|-D N] [--name|-n <pattern>]
         [--print|-p] [--print0|-P] [[!] --obd|-O <uuid[s]>]
         [[!] --size|-S [+-]N[kMGTPE]] [[!] --type|-t {bcdflpsD}]
         [[!] --gid|-g|--group|-G <gid>|<gname>]
         [[!] --uid|-u|--user|-U <uid>|<uname>]
         <dirname|filename>

Note that it is usually more efficient to use lfs find rather than the GNU find when searching for files on Lustre.

Descriptions of the optional parameters for the lfs find command are given in the following list:

  • --atime: File was last accessed N*24 hours ago. (There is no guarantee that atime is kept coherent across the cluster.)
  • --mtime: File status was last modified N*24 hours ago.
  • --ctime: File status was last changed N*24 hours ago.
  • --maxdepth: Limits find to descend at most N levels of the directory tree.
  • --print / --print0: Prints the full filename, followed by a newline or NULL character, respectively.
  • --obd: File has an object on the specified OST(s).
  • --size: File has a size in bytes, or kilo-, Mega-, Giga-, Tera-, Peta-, or Exabytes if a suffix is given.
  • --type: File has the type (block, character, directory, pipe, file, symlink, socket or Door [Solaris]).
  • --gid: File has a specific group ID.
  • --group: File belongs to a specific group (numeric group ID allowed).
  • --uid: File has a specific numeric user ID.
  • --user: File is owned by a specific user (numeric user ID allowed).

Using an exclamation point “!” before an option negates its meaning (files NOT matching the parameter). Using a plus sign “+” before a numeric value means files with the parameter OR MORE. Using a minus sign “-” before a numeric value means files with the parameter OR LESS.

Consider an example of a 3-level directory tree: ROOTDIR contains level_1_file and LEVEL_1_DIR; LEVEL_1_DIR contains level_2_file and LEVEL_2_DIR; and LEVEL_2_DIR contains level_3_file.

Results from using the lfs find command with various parameters are shown below.

Listing the entire tree:

$ lfs find /ROOTDIR
/ROOTDIR
/ROOTDIR/level_1_file
/ROOTDIR/LEVEL_1_DIR
/ROOTDIR/LEVEL_1_DIR/level_2_file
/ROOTDIR/LEVEL_1_DIR/LEVEL_2_DIR
/ROOTDIR/LEVEL_1_DIR/LEVEL_2_DIR/level_3_file

Limiting the search to one level (with or without --print, the output is the same):

$ lfs find /ROOTDIR --maxdepth 1
$ lfs find /ROOTDIR --maxdepth 1 --print
/ROOTDIR
/ROOTDIR/level_1_file
/ROOTDIR/LEVEL_1_DIR

Using --print0 to terminate each name with a NULL character instead of a newline:

$ lfs find /ROOTDIR --maxdepth 1 --print0
/ROOTDIR/ROOTDIR/level_1_file/ROOTDIR/LEVEL_1_DIR

The following example of using the -mtime parameter will result in a recursive list of all regular files in the directory /lustre/scratch/$USER that are more than 30 days old:

$ lfs find /lustre/scratch/$USER -mtime +30 -type f -print

Back to Contents

Check Disk Space Usage


The lfs df command displays the file system disk space usage. Additional parameters can be specified to display inode usage of each MDT/OST or a subset of OSTs. The usage for the lfs df command is:

lfs df [-i] [-h] [--pool|-p <fsname>[.<pool>]] [path]

By default, the usage of all mounted Lustre file systems is displayed. Otherwise, if a path is specified the usage of the specified file system is displayed.

Descriptions of the optional parameters are given in the following list:

  • -i: Lists inode usage per OST and MDT.
  • -h: Prints output in human-readable format, using base-2 suffixes for Mega-, Giga-, Tera-, Peta-, or Exabytes.
  • --pool|-p <fsname>[.<pool>]: Lists space or inode usage for the specified OST pool.

The lfs df command executed on NICS’s Kraken supercomputer produces the following output:

$ lfs df
UUID                 1K-blocks      Used Available  Use% Mounted on
scratch-MDT0000_UUID 3062330704 113107132 2774219516    3% /lustre/scratch[MDT:0]
scratch-OST0000_UUID 7691221300 5666708192 1633600912   73% /lustre/scratch[OST:0]
scratch-OST0001_UUID 7691221300 5472788640 1827596444   71% /lustre/scratch[OST:1]
...
scratch-OST014d_UUID 7691221300 5823378472 1477030256   75% /lustre/scratch[OST:333]
scratch-OST014e_UUID 7691221300 5456738732 1843642728   70% /lustre/scratch[OST:334]
scratch-OST014f_UUID 7691221300 5352581988 1947801800   69% /lustre/scratch[OST:335]

filesystem summary:  2584250356800 1840567550908 612364259352   71% /lustre/scratch

You can see from this output that the file system is fairly balanced, with none of the OSTs near 100% full. However, there are times when a Lustre file system becomes unbalanced, meaning that one or more of its OSTs become 100% utilized. An OST may become 100% utilized even if there is space available elsewhere on the file system. Examples of when this may occur include when stripe settings are not specified correctly or when very large files are not striped over multiple OSTs. If an OST is full and you attempt to write to the file system, you will get an error message.

An individual user can run

$ lfs quota -u username /lustre/scratch

to see their own usage; it will not show other users' usage. Sample output:

Disk quotas for user XXX (uid 1579):
     Filesystem  kbytes  quota  limit  grace  files  quota  limit  grace
/lustre/scratch     397      0      0      -      3      0      0      -

Back to Contents

Get Striping Information


The lfs getstripe option lists the striping information for a file or directory. The syntax for the getstripe option is:

lfs getstripe [--obd|-O <uuid>] [--quiet|-q] [--verbose|-v]
              [--count|-c] [--index|-i | --offset|-o]
              [--size|-s] [--pool|-p] [--directory|-d]
              [--recursive|-r] <dirname|filename> ...

When querying a directory, the default striping parameters set for files created in that directory are listed. When querying a file, the OSTs over which the file is striped are listed.

Several parameters are available for retrieving specific striping information. These are listed and described below:

  • --obd: Lists only files that have an object on a specific OST.
  • --quiet: Lists details about the file's object ID information.
  • --verbose: Prints additional striping information.
  • --count: Lists the stripe count (how many OSTs to use).
  • --index: Lists the index for each OST in the file system.
  • --offset: Lists the OST index on which file striping starts.
  • --pool: Lists the pools to which a file belongs.
  • --size: Lists the stripe size (how much data to write to one OST before moving to the next OST).
  • --directory: Lists entries about a specified directory instead of its contents (in the same manner as ls -d).
  • --recursive: Recurses into all sub-directories.

The following example shows that file1 has a stripe count of six on OSTs 19, 59, 70, 54, 39, and 28.

$ lfs getstripe dir/file1

...
dir/file1
        obdidx           objid          objid            group
            19        28675008      0x1b58bc0                0
            59        28592466      0x1b44952                0
            70        28656421      0x1b54325                0
            54        28652653      0x1b5346d                0
            39        28850966      0x1b83b16                0
            28        28854363      0x1b8485b                0

Now observe how the --quiet parameter is used to list only information about a file’s object ID.

$ lfs getstripe --quiet dir/file1

            19        28675008      0x1b58bc0                0
            59        28592466      0x1b44952                0
            70        28656421      0x1b54325                0
            54        28652653      0x1b5346d                0
            39        28850966      0x1b83b16                0
            28        28854363      0x1b8485b                0

The next example shows the output when querying a directory.

$ lfs getstripe dir1

...
dir1
stripe_count: 6 stripe_size: 0 stripe_offset: -1
dir1/file1
        obdidx           objid          objid            group
            19        28675008      0x1b58bc0                0
            59        28592466      0x1b44952                0
            70        28656421      0x1b54325                0
            54        28652653      0x1b5346d                0
            39        28850966      0x1b83b16                0
            28        28854363      0x1b8485b                0

In order to prevent unnecessary information from appearing you can use the following trick.

$ lfs getstripe dir1 | grep stripe

stripe_count: 6 stripe_size: 0 stripe_offset: -1

Back to Contents

Set Striping Patterns


Files and directories inherit striping patterns from the parent directory. However, you can change them for a single file, multiple files, or a directory using the lfs setstripe command. The lfs setstripe command creates a new file with a specified stripe configuration or sets a default striping configuration for files created in a directory. The usage for the command is:

lfs setstripe [--size|-s stripe_size] [--count|-c stripe_cnt]
              [--index|-i|--offset|-o start_ost_index]
              [--pool|-p <pool>]
              <dirname|filename>

Descriptions of the optional parameters are given in the following list:

  • --size stripe_size: Number of bytes to store on an OST before moving to the next OST. A stripe_size of 0 uses the file system's default stripe size (default is 1 MB). The value can be given with a suffix of k (KB), m (MB), or g (GB).
  • --count stripe_cnt: Number of OSTs over which to stripe a file. A stripe_cnt of 0 uses the file system-wide default stripe count (default is 1). A stripe_cnt of -1 stripes over all available OSTs, subject to the 160-OST limit on the striping of a single file.
  • --index start_ost_index (or --offset start_ost_index): The OST index (base 10, starting at 0) on which to start striping for the file. The default value of -1 allows the MDS to choose the starting index.

Shorter versions of these sub-options are also available, namely -s, -c, -o and -i, as given in the usage above.

Although lfs setstripe allows you to specify option values by position, it is best to use the explicit rather than the positional options; the positional form is error-prone and often misused. For example, it is best to use the following command:

$ lfs setstripe $NAME -s 1m -c 16

rather than

$ lfs setstripe $NAME 1m -1 16

Note that not specifying an option keeps the current value.

Setting the Striping Pattern for a Single File

You can specify the striping pattern of a file by using the lfs setstripe command to create it. This enables you to tune the file layout to your application. For example, the following command will create a new zero-length file named file1 with a stripe size of 2 MB and a stripe count of 40:

$ lfs setstripe file1 -s 2m -c 40

Note that you cannot alter the striping pattern of an existing file with the lfs setstripe command. If you try to execute this command on an existing file, it will fail. Instead, you can create a new file with the desired attributes using lfs setstripe and then copy the existing file to the newly created file.
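
Striping can also be set programmatically at file creation time through Lustre's liblustreapi, the library behind lfs setstripe. The sketch below repeats the 2 MB / 40-OST example in C; treat it as an assumption-laden illustration, since the header name (lustre/lustreapi.h here, lustre/liblustreapi.h on older installations) and the availability of llapi_file_create() depend on the Lustre version installed on your system. Link with -llustreapi.

#include <lustre/lustreapi.h>   /* may be <lustre/liblustreapi.h> on older Lustre versions */
#include <stdio.h>

int main(void)
{
    /* Same layout as the lfs setstripe example above: 2 MB stripes on 40 OSTs,
     * starting OST chosen by the MDS (-1), default striping pattern (0). */
    int rc = llapi_file_create("file1",
                               2 * 1024 * 1024,  /* stripe size   */
                               -1,               /* stripe offset */
                               40,               /* stripe count  */
                               0);               /* stripe pattern */
    if (rc) {
        fprintf(stderr, "llapi_file_create failed: %d\n", rc);
        return 1;
    }
    return 0;
}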

Setting the Striping Pattern for a Directory

Invoking the lfs setstripe command on an existing directory sets a default striping configuration for any new files created in the directory. Existing files in the directory are not affected. The usage is the same as lfs setstripe for creating a file, except that the directory must already exist. For example, to limit the number of OSTs to 2 for all new files to be created in an existing directory dir1 you can use the following command:

$ lfs setstripe dir1 -c 2

Setting the Striping Pattern for Multiple Files

You can't directly alter the stripe patterns of a large number of existing files with lfs setstripe, but you can take advantage of the fact that files inherit the striping settings of the directory in which they are created. First, create a new directory and set its striping pattern to the desired settings using the lfs setstripe command. Then copy the files into the new directory, and the copies will inherit the directory settings that you specified.

Using the Non-striped Option

There are times when striping will not help your application's I/O performance. In those cases, it is recommended that you use Lustre's non-striped option. You can set the non-striped option by using a stripe count of 1 along with the default values for stripe index and stripe size. The lfs setstripe command for the non-striped option is as follows:

$ lfs setstripe dir1 -c 1

Striping across all OSTs

You can stripe across all or a subset of the OSTs by using a stripe count of -1 along with the default values for stripe index and stripe size. The lfs setstripe command for striping across all OSTs is as follows:

$ lfs setstripe dir1 -c -1

Back to Contents

I/O Best Practices


Lustre is a resource shared by all users on the system. Optimizing your I/O performance will not only lessen the load on Lustre, it will save you compute time as well. Here are some pointers to improve your code's performance.

Working with large files on Lustre


 

Lustre determines the striping configuration for a file at the time it is created. Although users can specify striping parameters, it is common to rely on the system default values. In many cases, the default striping parameters are reasonable, and users do not think about the striping of their files. However, when creating large files, proper striping becomes very important.

The default stripe count on NICS's Lustre file systems is small and is not suitable for very large files. Creating large files with low stripe counts can cause I/O performance bottlenecks. It can also cause one or more OSTs (Object Storage Targets, or "disks") to fill up, resulting in I/O errors when writing data to those OSTs.

When dealing with large Lustre files, it is a good practice to create a special directory with a large stripe count to contain those files. Files transferred to (e.g., scp/cp/gridftp) or created in (e.g., tar) this larger striped directory will inherit the stripe count of the directory. Below is an example showing how to create a large striped directory on Kraken.

$ cd /lustre/scratch/$USER
$ mkdir LARGE_FILES
$ lfs setstripe -c 50 LARGE_FILES/

In the above example, the default stripe count for the directory is set to 50. This is a reasonable value for files up to 2-3 TB in size. If larger files will be created, the stripe count can be increased accordingly.

Examples

Creating a tar file within the larger striped directory

$ cd /lustre/scratch/$USER/data
$ tar  -cf /lustre/scratch/$USER/LARGE_FILES/my_sims.tar  my_sims/

This will tar up the my_sims directory and place it in the larger-striped directory. Note that one can add the "j" flag for bzip2 compression (the file name would then be my_sims.tar.bz2).

Conversely, if one has a large tar file in the LARGE_FILES directory, and this tar file contains many smaller files, it can be extracted to a separate directory with a smaller stripe count:

$ lfs getstripe /lustre/scratch/$USER/data
/lustre/scratch/$USER/data
stripe_count:   1 stripe_size:    1048576 stripe_offset:  -1
$ cd /lustre/scratch/$USER/LARGE_FILES
$ tar -xf my_sims.tar -C /lustre/scratch/$USER/data

This will extract the tar file into a directory with a default stripe count of 1.

For HPSS transfers, the syntax is

hsi {put | get} local_file : hpss_file

To retrieve a large HPSS file (named large.tar) and place it onto lustre, one would run the following command:

hsi get /lustre/scratch/$USER/LARGE_FILES/large.tar : /home/$USER/large.tar 

Back to Contents

Opening and checking file status


 

Open files read-only whenever possible

If a file to be opened is not subject to write(s), it should be opened as read-only. Furthermore, if the access time on the file does not need to be updated, the open flags should be O_RDONLY | O_NOATIME. If the file is opened this way by all processes in the application, the master process (rank 0) should open it O_RDONLY with all of the non-master processes (rank > 0) opening it O_RDONLY | O_NOATIME.
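
A minimal sketch of this pattern is shown below. It assumes MPI and a Linux client, where O_NOATIME requires _GNU_SOURCE; the file name is a placeholder.

#define _GNU_SOURCE            /* needed for O_NOATIME on Linux */
#include <fcntl.h>
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, fd;
    const char *path = "input.dat";   /* placeholder file name */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Rank 0 updates the access time once; all other ranks skip the update. */
    if (rank == 0)
        fd = open(path, O_RDONLY);
    else
        fd = open(path, O_RDONLY | O_NOATIME);

    if (fd < 0) perror("open");
    /* ... read from fd ... */

    if (fd >= 0) close(fd);
    MPI_Finalize();
    return 0;
}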

Limit the number of files in a single directory using a directory hierarchy

For large scale applications that are going to write large numbers of files using private data, it is best to implement a subdirectory structure to limit the number of files in a single directory. A suggested approach is a two-level directory structure with sqrt(N) directories each containing sqrt(N) files, where N is the number of tasks.
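
One way to realize such a hierarchy is sketched below, assuming MPI: with N tasks, rank r writes its private file into subdirectory r / ceil(sqrt(N)), giving roughly sqrt(N) directories of roughly sqrt(N) files each. The parent directory, names, and creation scheme are placeholders (compile with -lm).

#include <math.h>
#include <mpi.h>
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Roughly sqrt(N) directories, each holding roughly sqrt(N) files. */
    int files_per_dir = (int)ceil(sqrt((double)nprocs));
    int dir_index     = rank / files_per_dir;

    char dirpath[256], filepath[512];
    snprintf(dirpath, sizeof(dirpath), "out/dir_%04d", dir_index);
    snprintf(filepath, sizeof(filepath), "%s/rank_%06d.dat", dirpath, rank);

    /* The first rank assigned to each directory creates it. */
    if (rank % files_per_dir == 0)
        mkdir(dirpath, 0755);
    MPI_Barrier(MPI_COMM_WORLD);

    FILE *fp = fopen(filepath, "wb");
    /* ... write this task's private data ... */
    if (fp) fclose(fp);

    MPI_Finalize();
    return 0;
}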

Stat files from a single task

If many processes need the information from stat on a single file, it is most efficient to have a single process perform the stat call, then broadcast the results. This can be achieved by modifying:

C Example:

From

int iRC;
struct stat sB;

iRC=lstat( PathName, &sB );

To

int iRC;
int iRank;
struct stat sB;

MPI_Comm_rank( MPI_COMM_WORLD, &iRank );
if(!iRank)
{
  iRC=lstat( PathName, &sB );
}
/* broadcast the stat buffer as raw bytes from rank 0 to all other ranks */
MPI_Bcast( &sB, sizeof(struct stat), MPI_CHAR, 0, MPI_COMM_WORLD );

 

FORTRAN Example:

From

      INTEGER*4 sB(13)

      CALL LSTAT(PathName, sB, ierr)

To

      INTEGER iRank
      INTEGER*4 sB(13)
      INTEGER ierr

      CALL MPI_COMM_RANK(MPI_COMM_WORLD, iRank, ierr)
      IF (iRank .eq. 0) THEN
          CALL LSTAT(PathName, sB, ierr)
      ENDIF
      CALL MPI_BCAST(sB, 13, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)

Avoid opening and closing files frequently

Excessive overhead is created when file I/O is performed by:

  • Opening a file in append mode
  • Writing a small amount of data
  • Closing the file

 

If you will be writing to a file many times throughout the application run, it is more efficient to open the file once at the beginning. Data can then be written to the file during the course of the run. The file can be closed at the end of the application run.

Use ls -l only where absolutely necessary

Consider that ls -l must communicate with every OST assigned to each file being listed, which makes it a very expensive operation. Depending on Kraken's usage at that moment, and on how the files in that directory are distributed among the OSSs, there is a good chance that one of the OSSs is busy, which causes ls -l to hang. It also causes excessive overhead for other users.

In contrast, a basic ls only has to contact the metadata server (MDS), not the object storage servers (OSSs), where the bottleneck often occurs. Many users alias ls to give additional information, for example using different colors for different file types, which requires contacting the OSSs. You can bypass this by using /bin/ls. When there are many files in the same directory and you don't need the output to be sorted, /bin/ls -U works even faster.

You can also use the Lustre utility lfs find to look for files. For example, the syntax to emulate a regular ls in any directory is

lfs find  -D 0  *

For convenience, you may want to add an alias definition to your login config files. For example Bash users can add to their ~/.bashrc the following line to create an alias called lls.

alias lls="/bin/ls -U"

Avoid using wild cards with GNU commands, such as tar and rm on a large number of files

Several GNU commands, such as tar and rm, are inefficient when operating on a large number of files on Lustre. For example, with millions of files, rm -rf * may take days and have a considerable impact on Lustre for other users. The reason lies in the time it takes to expand the wild card. A better way is to generate a list of the files to be removed or tarred, and to act on them one at a time or in small sets.

A good way to review files before they are deleted is the following:

$ lfs find <dir> -t f > rmlist.txt
$ vi rmlist.txt
$ sed -e 's:^:/bin/rm :' rmlist.txt > rmlist.sh
$ sh rmlist.sh
# the directory structure will remain, but unless there are very many
# directories, we can simply delete it:
$ rm -rf <dir>

Back to Contents

Use the appropriate striping technique


 

Place small files on single OST

If only one process will read/write the file and the amount of data in the file is small (< 1 MB to 1 GB), performance will be improved by limiting the file to a single OST on creation. This can be done as shown below:

$ lfs setstripe PathName -s 1m -i -1 -c 1

Place directories containing many small files on single OSTs

If you are going to create many small files in a single directory, greater efficiency will be achieved if you have the directory default to 1 OST on creation:

$ lfs setstripe DirPathName -s 1m -i -1 -c 1

All files created in this directory will inherit the 1 OST setting.

This is especially effective when extracting source code distributions from a tarball:

$ lfs setstripe DirPathName -s 1m -i -1 -c 1
$ cd DirPathName
$ tar -x -f TarballPathName

All of the source files, header files, etc. span only one OST, and when you build the code, all of the object files also use only one OST. The resulting binary will likewise span only one OST; if desired, you can re-stripe the binary by copying it as follows:

$ lfs setstripe NewBin -s 1m -i -1 -c 4
$ rm -f OldBinPath
$ mv NewBin OldBinPath

or you can modify the Makefile along the lines of:

OldBinPath: ...
	rm -f OldBinPath
	lfs setstripe OldBinPath -s 1m -i -1 -c 4
	cc -o OldBinPath ...

Set the stripe count and size appropriately for shared files

Single shared files should have a stripe count equal to the number of processes that access the file. If the number of processes accessing the file is greater than 160, the stripe count should be set to -1 (the maximum of 160 will be used). The stripe size should be set to allow as much stripe alignment as possible: a single process should not need to access stripes on all utilized OSTs. Take into account the structure of the shared file, the number of processes, and the size of the I/O operations in order to choose a stripe size that maximizes stripe-aligned I/O.

Set the stripe count appropriately for applications which utilize a file-per-process

Files utilized within a file-per-process I/O pattern should use a stripe count of 1. Due to the large number of files/processes possible, it is necessary to limit potential OST contention by limiting each file to a single OST. At large scales, even when a stripe count of 1 is used, OST contention may still adversely affect performance. The most effective implementation is to set the stripe count on a directory to 1 and write all files within this directory.

Back to Contents

I/O considerations


 

Read small, shared files from a single task

Instead of reading a small file from every task, it is advisable to read the entire file from one task and broadcast the contents to all other tasks.

C Example:

From

int  iFD;
int  iRead;
char cBuf[SIZE];

iFD=open( PathName, O_RDONLY | O_NOATIME, 0444 );
//      Check file descriptor

iRead=read( iFD, cBuf, SIZE );
//      Check number of bytes read

To

int  iFD;
int  iRank;
int  iRead;
char cBuf[SIZE];

MPI_Comm_rank( MPI_COMM_WORLD, &iRank );
if(!iRank) {

  iFD=open( PathName, O_RDONLY | O_NOATIME, 0444 );
//          Check file descriptor

  iRead=read( iFD, cBuf, SIZE );
//          Check number of bytes read
}
/* broadcast the file contents from rank 0 to all other ranks */
MPI_Bcast( cBuf, SIZE, MPI_CHAR, 0, MPI_COMM_WORLD );

 

FORTRAN Example:

From

      INTEGER iRead
      CHARACTER cBuf(SIZE)

      OPEN(UNIT=1,FILE=PathName,ACTION='READ')
      READ(1,*) cBuf

To

      INTEGER iRank
      INTEGER iRead
      INTEGER ierr
      CHARACTER cBuf(SIZE)

      CALL MPI_COMM_RANK(MPI_COMM_WORLD, iRank, ierr)
      IF (iRank .eq. 0) THEN
          OPEN(UNIT=1,FILE=PathName,ACTION='READ')
          READ(1,*) cBuf
      ENDIF
      CALL MPI_BCAST(cBuf, SIZE, MPI_CHARACTER, 0, MPI_COMM_WORLD, ierr)

Use large and stripe-aligned I/O where possible

I/O requests should be large, e.g., a full stripe width or greater. In addition, you will get better performance by making these stripe aligned, where possible. If the amount of data generated or required from the file on a client is small, a group of processes should be selected to perform the actual I/O request with those processes performing data aggregation.
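
The sketch below shows one simple form of such aggregation, assuming MPI and that the number of ranks is a multiple of AGG: every group of AGG ranks gathers its small buffers to one aggregator, and each aggregator issues a single large, contiguous write. The chunk size, group size, and file name are placeholders.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK (1 * 1024 * 1024)   /* each rank produces 1 MB of data      */
#define AGG   16                  /* ranks per aggregator -> 16 MB writes */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *mydata = malloc(CHUNK);
    memset(mydata, 'x', CHUNK);

    /* Split the ranks into groups of AGG; the first rank of each group aggregates. */
    MPI_Comm agg_comm;
    MPI_Comm_split(MPI_COMM_WORLD, rank / AGG, rank, &agg_comm);

    int agg_rank;
    MPI_Comm_rank(agg_comm, &agg_rank);

    char *gathered = (agg_rank == 0) ? malloc((size_t)CHUNK * AGG) : NULL;
    MPI_Gather(mydata, CHUNK, MPI_CHAR, gathered, CHUNK, MPI_CHAR, 0, agg_comm);

    /* Only the aggregators open the shared file. */
    MPI_Comm writers;
    MPI_Comm_split(MPI_COMM_WORLD, (agg_rank == 0) ? 0 : MPI_UNDEFINED, rank, &writers);

    if (agg_rank == 0) {
        MPI_File fh;
        MPI_File_open(writers, "aggregated.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* One large, contiguous, stripe-friendly write per group of AGG ranks. */
        MPI_Offset offset = (MPI_Offset)(rank / AGG) * CHUNK * AGG;
        MPI_File_write_at(fh, offset, gathered, CHUNK * AGG, MPI_CHAR, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Comm_free(&writers);
        free(gathered);
    }

    MPI_Comm_free(&agg_comm);
    free(mydata);
    MPI_Finalize();
    return 0;
}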

Standard output and standard error

Avoid excessive use of stdout and stderr I/O streams from parallel processes. These I/O streams are serialized by aprun. Limit output to these streams to one process in production jobs. Debugging messages which originate from each process should be disabled in production runs. Frequent buffer flushes on these streams should be avoided.

Back to Contents

Application level


 

Subsetting I/O

At large core counts I/O performance can be hindered by the collection of metadata operations (File-per-process) or file system contention (Single-shared-file). One solution is to use a subset of application processes to perform I/O. This action will limit the number of files (File-per-process) or limit the number of processes accessing file system resources (Single-shared-file).

  • An example follows which creates an MPI communicator that includes only the I/O processes (a subset of the total number of processes). This example also shows independent and collective I/O with MPI-IO.

! listofionodes is an array of the ranks of writers/readers
  call MPI_COMM_GROUP(MPI_COMM_WORLD, WORLD_GROUP, ierr)
  call MPI_GROUP_INCL(WORLD_GROUP, nionodes, listofionodes, IO_GROUP,ierr)
  call MPI_COMM_CREATE(MPI_COMM_WORLD,IO_GROUP, MPI_COMMio, ierr)
! open
  call MPI_FILE_OPEN
&  (MPI_COMMio, trim(filename), filemode, finfo, mpifh, ierr)
! read/write
  call MPI_FILE_WRITE_AT
&  (mpifh, offset, iobuf, bufsize, MPI_REAL8, status, ierr)
! OR utilizing collective writes
!  call MPI_FILE_SET_VIEW
!&  (mpifh, disp, MPI_REAL8, MPI_REAL8, "native", finfo, ierr)
!  call MPI_FILE_WRITE_ALL
!&  (mpifh, iobuf, bufsize, MPI_REAL8, status, ierr)
! close
  call MPI_FILE_CLOSE(mpifh, ierr)
  • If you cannot implement a subsetting approach, it would still be to your advantage to limit the number of synchronous file opens. This is useful for limiting the number of requests hitting the metadata server (of which there is only one).

Parallel libraries

Managing I/O can also be handled at the application level, either by re-tooling one's code or by using additional libraries. Some examples of such middleware libraries are ADIOS, HDF5, and MPI-IO.

Recognize situations where file system contention may limit performance

When an I/O pattern is scaled to large core counts, performance degradation may occur due to file system contention. This situation arises when many more processes than available file system resources request I/O nearly simultaneously. Examples include file-per-process I/O patterns with over ten thousand processes/files and single-shared-file I/O patterns with over five thousand processes accessing a single file. Potential solutions involve decreasing the number of processes that perform I/O simultaneously. For a file-per-process pattern, this may mean allowing only a subset of processes to perform I/O at any particular time. For a single-shared-file pattern, it may mean using more than one shared file, each accessed by a subset of the processes. Additionally, some I/O libraries such as MPI-IO support collective buffering, which aggregates I/O from the running processes onto a subset of processes that perform the actual I/O.
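
When MPI-IO is used, collective buffering is typically requested through hints. The sketch below sets two common ROMIO hints, enabling collective buffering for writes and capping the number of aggregator processes; hint names and accepted values are implementation dependent, so treat them as examples rather than guaranteed settings.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);

    /* Common ROMIO hints (implementation dependent): turn on collective
     * buffering for writes and limit the number of aggregator processes. */
    MPI_Info_set(info, "romio_cb_write", "enable");
    MPI_Info_set(info, "cb_nodes", "64");

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* ... collective writes such as MPI_File_write_all() go here ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}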

Back to Contents