• National Institute for Computational Sciences is a UT/ORNL Partnership

Kraken FAQ

General Frequently Asked Questions

Table of Contents

Compiling/Linking

Running Jobs

Lustre File System

Runtime Messages/Errors

Miscellaneous

Compiling/Linking

What compilers do you support? Can I use fill-in-compiler?

We support the PGI, GNU, and PathScale compilers, which should be more than sufficient; it is unlikely that we will add compilers such as Intel or Borland. More information on compiling may be found on the Compiling page. When compiling for the compute nodes, do not invoke the compilers directly; instead, use the Cray compiler wrappers (cc, CC, ftn). See Modules.

We are investigating the Cray compiler, which supports Co-Array Fortran and Unified Parallel C, as well as some new profiling features for standard MPI programs. We are unlikely to purchase the compiler unless there is strong demand for it, so if you want it, please let us know at help@teragrid.org.

Why does my compile fail with “/usr/bin/ld: cannot find -lsma”?

This error message occurs when using the mpi* compiler wrappers (mpicc, mpif90, etc.). These are intermediate wrappers that should not be called directly by users. Instead, users should compile with either ftn, cc, or CC. The ftn, cc, and CC scripts will do the necessary setup and then automatically call the appropriate intermediate scripts and ultimately the compilers.

Why does my compile fail with the message “relocation truncated to fit: R_X86_64_PC32”?

The default memory model for the PGI compilers is the “small” model. This requires that the object be smaller than 2 GB in size. The PGI compilers support the “medium” memory model, which allows objects to be larger than 2 GB. Unfortunately, for a code to use the medium memory model, all objects and static libraries must be compiled under the medium memory model. Several system libraries are not, so in general, executables on Kraken must use the small memory model.

The “relocation truncated” error message occurs when an object file or executable is too large for the memory model. To work around this error, you should reduce the static memory usage for your code. Common ways to do this include the following:

  • Remove (either by deleting or via compiler directives) subroutines that are not used on the XT platform.
  • Remove static variables (especially large arrays) that are not used on the XT platform.
  • Use allocatable arrays instead of static arrays. Because the memory model applies to only static size, allocatable arrays can be larger than 2 GB with the small memory model.

How do I link a C program that calls Fortran routines?

Use the pgf90 compiler to link and provide the -Mnomain option.

What does “multiple definition of main” and/or “undefined reference to MAIN_” mean?

This most likely means you have a C program that calls Fortran, and you are linking with the Portland Group Fortran compiler. The Fortran compiler has its own default “main,” and now there is a second main from the C source. You may need to add the -Mnomain flag during link time to fix this.

If this fails, another option is to link with the C/C++ compilers and define 'main' manually. The -Wl option tells pgcc/pgCC to pass the following comma-delimited list to the linker, and --defsym defines a symbol (here, main as an alias for MAIN_). Thus, the following should allow your Fortran-with-C program to compile and link.

pgcc -Wl,--defsym,main=MAIN_ ...

What do I do with “configure: error: linking to Fortran libraries from C fails”?

That message sometimes comes as a result of using configure on the XT3 with the FC=ftn and CC=cc compilers. The error usually shows up in the configure log with the following output:

checking how to get verbose linking output from ftn... -v
checking for Fortran libraries of ftn...  -L/opt/acml/2.7/pgi64/lib/cray/cnos64 -llapacktimers -L/opt/xt-mpt/1.3.15/mpich2-64/P2/lib -L/opt/acml/2.7/pgi64/lib -L/opt/xt-libsci/1.3.15/pgi/cnos64/lib -L/opt/xt-mpt/1.3.15/sma/lib -L/opt/xt-tools/papi/3.2.1/lib/cnos64 -lpapi -lperfctr -L/opt/xt-lustre-ss/1.3.15/lib64 -L/opt/xt-catamount/1.3.15/lib/cnos64 -L/opt/xt-pe/1.3.15/lib/cnos64 -L/opt/xt-libc/1.3.15/amd64/lib -L/opt/xt-os/1.3.15/lib/cnos64 -L/opt/xt-service/1.3.15/lib/cnos64 -L/opt/pgi/6.1.1/linux86-64/6.1/lib -L/opt/gcc/3.2.3/lib/gcc-lib/x86_64-suse-linux/3.2.3/ -lacml -lmpichf90 -lsci -lmpich -llustre -lpgf90 -lpgf90_rpm1 -lpgf902 -lpgf90rtl -lpgftnrtl -lpgc -lm -lcatamount -lsysio -lportals -lC -lcrtend' -lcrtend
checking for dummy main to link with Fortran libraries... unknown
configure: error: linking to Fortran libraries from C fails
See 'config.log' for more details.

If you look at the end of the Fortran libraries line, you will see “-lcrtend' -lcrtend”: there is an extra “'” character and a duplicated “-lcrtend”. To get around this, specify the long list of Fortran libraries in an environment variable such as FLIBS or FCLIBS, with the extra “'” and the extra “-lcrtend” removed.
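As a sketch of that cleanup, the stray quote and the duplicated trailing -lcrtend can be stripped with sed before exporting the result. The library line below is a shortened, hypothetical stand-in for the real configure output, and whether your package reads FLIBS or FCLIBS depends on its configure script.

```shell
# Shortened, hypothetical stand-in for the "Fortran libraries of ftn" line.
raw_flibs="-L/opt/example/lib -lpgf90 -lpgc -lm -lcrtend' -lcrtend"

# 1) delete the stray single quote, 2) drop the duplicated trailing -lcrtend.
FLIBS=$(printf '%s\n' "$raw_flibs" | sed -e "s/'//g" -e 's/ -lcrtend$//')
export FLIBS

echo "$FLIBS"
```

With the variable exported, rerun configure so that it uses the cleaned list instead of probing ftn again.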

My code compiles without any trouble, but fails in the link step.

Internally, the compilers use several variables/macros even if they're not specified on the command line. These include F90FLAGS, FFLAGS, CFLAGS, and others. If your makefile defines these variables with flags not intended for the link step, the link may fail. For example, if they contain the -c flag, which tells the compiler to skip the link step, the link will fail.
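One quick way to spot this is to test a flag variable for a standalone -c before linking. The sketch below uses a hypothetical CFLAGS value; substitute whatever your makefile actually sets.

```shell
# Hypothetical flag string; in practice this comes from your makefile.
CFLAGS="-O2 -c"

# Pad with spaces so "-c" only matches as a whole word, not as a prefix
# of some longer option.
case " $CFLAGS " in
  *" -c "*) link_safe=no  ;;
  *)        link_safe=yes ;;
esac
echo "CFLAGS safe for the link step: $link_safe"
```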

Can I use the 1.5 programming environments on the CNL system?

The 1.5 programming environments are available on the CNL system. However, they will build for Catamount and should not be used on the CNL system. The 2. and greater programming environment versions should be used on the CNL system.

How do I link a C++ object with ftn? It worked on the Catamount system without modification.

Under the 1.5 programming environments used under Catamount, ftn linked in libC.a. Under the 2. programming environments used under CNL, ftn does not link in libC.a. Fortran codes that link in libraries that contain C++ objects will need to add -lC to the link line.

Note that libc.a (the C library) is still added to the link under 2. as it was under 1.5, so adding -lc to the link will result in multiple definition warnings.

Why do I see the message: SEEK_SET is #defined but must not be for the C++ binding of MPI?

The following error message:

#error "SEEK_SET is #defined but must not be for the C++ binding of MPI" 

is the result of a name conflict between stdio.h and the MPI C++ binding. Users should place the mpi.h include before the stdio.h and iostream includes.

Users may also see the following error messages as a result of including stdio or iostream before mpi:

#error "SEEK_CUR is #defined but must not be for the C++ binding of MPI" 
#error "SEEK_END is #defined but must not be for the C++ binding of MPI"

When profiling with TAU, you may get this message regardless of the order. In this case, you can add -DMPICH_IGNORE_CXX_SEEK to the compile line to remove the error (in fact, this fix should work generally).

I cannot build my program because the autoconf (Autotools) hangs indefinitely.

The Autotools programs hang on the NFS Linux server that serves the home areas because of a known and unresolved issue. Two workarounds are available:

  • Run the autotools programs in the Lustre scratch area. Please note that this will be slower.
  • A better solution is to create a file in your home directory called ~/.autom4te.cfg with the following contents:
    begin-language: "Autoconf-without-aclocal-m4"
    args: --no-cache
    end-language: "Autoconf-without-aclocal-m4"
    

    This will disable the file caching of autom4te which is usually the culprit.
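The configuration file can be created in one step with a here-document. This is a minimal sketch: the AUTOM4TE_CFG variable is a name introduced here for illustration, and it defaults to a temporary file so the snippet is safe to try. Set it to "$HOME/.autom4te.cfg" to install the file for real.

```shell
# AUTOM4TE_CFG is a hypothetical helper variable, not something autom4te reads.
AUTOM4TE_CFG="${AUTOM4TE_CFG:-$(mktemp)}"

# Quoted 'EOF' keeps the contents literal (no variable expansion).
cat > "$AUTOM4TE_CFG" <<'EOF'
begin-language: "Autoconf-without-aclocal-m4"
args: --no-cache
end-language: "Autoconf-without-aclocal-m4"
EOF

echo "wrote $AUTOM4TE_CFG"
```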

Why do I get the error: "/usr/include/c++/4.1.2/backward/backward_warning.h:32:2: warning: #warning This file includes at least one deprecated or antiquated header. Please consider using one of the 32 headers found in section 17.4.1.2 of the C++ standard. Examples include substituting the <X> header for the <X.h> header for C++ includes, or <iostream> instead of the deprecated header <iostream.h>. To disable this warning use -Wno-deprecated." when I try to compile my code?

#include <iostream> is the Standard C++ way to include header files. The 'iostream' is an identifier that maps to the file iostream.h. In older C++ versions you had to specify the file name of the header file, hence #include <iostream.h>. Older compilers may not recognize the modern method, but newer compilers will accept both methods even though the old method is obsolete.

Similarly, fstream.h became fstream, vector.h became vector, string.h became string, etc.

So although the <iostream.h> library has been deprecated for several years, many C++ users still use it in new code instead of using the newer, standard-compliant <iostream> library. What are the differences between the two? First, the .h notation of the standard header files was deprecated more than 5 years ago. Using deprecated features in new code is never a good idea. Second, in terms of functionality, <iostream> contains a set of templatized I/O classes which support both narrow and wide characters. By contrast, <iostream.h> classes are confined to char exclusively. Third, the C++ standard specification of iostream's interface was changed in many subtle aspects. Consequently, the interfaces and implementation of <iostream> differ from <iostream.h>. Finally, <iostream> components are declared in namespace std, whereas <iostream.h> components are declared in the global scope. Because of these substantial differences, you cannot mix the two libraries in one program. As a rule, use <iostream> in new code and stick to <iostream.h> in legacy code that is incompatible with the new library.

Running Jobs

How do I find out what nodes I am using?

There are a couple of easy ways to find out what nodes are assigned to your batch job. The easiest is to issue checkjob <jobid>. Part of the output will return a list of nodes like the following:

Allocated Nodes:      

[84:1][85:1][86:1][87:1][88:1][89:1][90:1][91:1]

This method returns a logical numbering of the nodes. A physical numbering of the nodes, as well as the pid layout, can be obtained by setting the PMI_DEBUG variable to 1.

> setenv PMI_DEBUG 1
> aprun -n4 ./a.out
Detected aprun CNOS interface
MPI rank order: Using default aprun rank ordering
rank 0 is on nid00015 pid 76; originally was on nid00015 pid 76
rank 1 is on nid00015 pid 77; originally was on nid00015 pid 77
rank 2 is on nid00016 pid 69; originally was on nid00016 pid 69
rank 3 is on nid00016 pid 70; originally was on nid00016 pid 70

From within your code, you can reference PMI_CNOS_Get_nid to get the physical number for each process.

#include <stdio.h>
#include "mpi.h"

/* Prototype assumed from the call below; no header is included for it here. */
int PMI_CNOS_Get_nid(int rank, int *nid);

int main(int argc, char *argv[])
{
  int rank, nproc, nid;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nproc);

  /* Look up the physical node ID (nid) that hosts this rank. */
  PMI_CNOS_Get_nid(rank, &nid);
  printf("  Rank: %10d  NID: %10d  Total: %10d \n", rank, nid, nproc);

  MPI_Finalize();
  return 0;
}

The output with four cores would be as follows:

aprun -n4 ./hello-mpi.x
  Rank:          1  NID:         15  Total:          4
  Rank:          0  NID:         15  Total:          4
  Rank:          2  NID:         16  Total:          4
  Rank:          3  NID:         16  Total:          4
Application 13390 resources: utime 0, stime 0

The aprun -q option can be used to run commands outside of a code as shown below.

> aprun -q -n4 /bin/hostname
nid00015
nid00015
nid00016
nid00016
>

Or

> aprun -q -n4 /bin/cat /proc/cray_xt/nid

15
15
16
16
>

Why do I get the error “qsub: Job exceeds queue resource limits MSG=cannot satisfy server max mem requirement” when submitting a job?

The queuing system on Kraken does not allow memory requests with the #PBS -lmem= flag. Jobs requesting memory will be rejected with the error message shown above.

Memory on Kraken is not shared between nodes. Each node has access to 16 GB of memory, or 1 1/3 GB per core if all cores are used. Thus, the memory available to a job is directly related to the number of processors requested. Because the memory is not shared, it does not make sense to request memory directly via PBS. (It is implicitly requested based on the #PBS -l size=... request.)
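To make the arithmetic concrete, the sketch below divides the 16 GB of node memory by the number of processes placed on a node. The figure of 12 cores per node is an assumption inferred from the 1 1/3 GB-per-core number above; the other process counts are illustrative.

```shell
node_mem_mb=$(( 16 * 1024 ))   # 16 GB per node, from the text above

# Memory available to each process for several processes-per-node counts.
# 12 is assumed to be a fully packed node (inferred, not stated in the FAQ).
for per_node in 12 8 4 1; do
  echo "$per_node processes/node -> $(( node_mem_mb / per_node )) MB each"
done

full_node_share_mb=$(( node_mem_mb / 12 ))   # integer division, in MB
```

Running fewer processes per node is therefore the only way to give each process more memory.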

Can I run size=0 jobs?

Yes, size=0 jobs are supported. These jobs are a good way to automate data transfers to HPSS. The hsi command runs on a service node. So, if you use hsi at the conclusion of a production run, all of the compute nodes your job was allocated remain idle. As an alternative, you can submit a production job, and then submit a second ‘data transfer’ job. This second job should be submitted with a dependency on the first job so that it will not start until the first job finishes. Additionally, it should be submitted with a size argument of 0. Since hsi runs on service nodes, it does not require any compute node (thus, size=0).
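The two-job pattern described above might look like the following sketch. The file names (production.pbs, results.tar) and the walltime are placeholders, and the qsub commands are only printed, not executed, so the snippet can be tried anywhere. Your site may also require an account or project option.

```shell
# Generate a hypothetical size=0 transfer script.
transfer_script=$(mktemp)
cat > "$transfer_script" <<'EOF'
#!/bin/bash
#PBS -l size=0
#PBS -l walltime=01:00:00
cd "$PBS_O_WORKDIR"
hsi put results.tar
EOF

# Dry run: show how the transfer job would be chained to the production job.
# "afterok" holds the second job until the first finishes successfully.
echo 'first=$(qsub production.pbs)'
echo "qsub -W depend=afterok:\$first $transfer_script"
```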

Serial programs

Please do not run intensive serial processes using size=0 or on the login nodes. Service nodes have limited resources that are shared among all users, and running these nodes out of memory can cause system problems. In many cases, serial pre/post-processing may be accomplished on the compute nodes:

  • Request one node: #PBS -l size=12
  • Run one process: aprun -n 1 -d 12 ./serial.x

The -d 12 option is required for the process to use all of the memory on the node.

Why am I not getting the basic error messages I expect?

Sometimes some of the basic error messages (such as reading past the EOF) are suppressed because a shell interpreter is not specified in the PBS script. Make sure that the first line of the PBS script contains a shell interpreter: #!/bin/bash, for example.
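A small check like the one below can catch a missing interpreter line before submission. The function name and the demo script contents are hypothetical; only the rule itself (first line starts with #!) comes from the answer above.

```shell
# Succeeds only if the first line of the given script starts with "#!".
has_interpreter_line() {
  # read fails on an empty file; a last line without a newline still counts.
  IFS= read -r first_line < "$1" || [ -n "$first_line" ] || return 1
  case $first_line in
    '#!'*) return 0 ;;
    *)     return 1 ;;
  esac
}

# Demo: a batch script that starts directly with a PBS directive.
demo=$(mktemp)
printf '#PBS -l walltime=00:10:00\naprun -n 1 ./a.out\n' > "$demo"
has_interpreter_line "$demo" || echo "warning: $demo has no #! line"
```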

I get the error message "OOM killer terminated this process". What is OOM?

This error message indicates that the node is running Out Of Memory. This could be the result of a bug in the code, or memory requirements for the given input.

One quick solution might be to run with only four MPI processes per socket so each process gets a larger share of the memory on the node:

aprun -n $(( $PBS_NNODES/4 )) -S 4 ./a.out

Of course, the above solution leaves two cores idle on each socket. The best solution may be to identify the memory requirements in the code and make any necessary changes there, in terms of memory parameters, domain decomposition, etc.

Why does nothing happen when I submit my job?

If your job executes only for an instant, terminates without any error messages, and leaves empty output files, you may have a customized login script that changes your shell interpreter at login time by explicitly executing another shell. For example, users whose default shell is Bash sometimes switch to the C shell by putting the following in their .bashrc file:

# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
        . /etc/bashrc
fi

# User specific aliases and functions

exec csh

A .bashrc like this one triggers the problem. If you do want to change your default shell, please use the ldap_chsh command instead.

Why am I getting "could not find *.so"? Or: can I use dynamic libraries?

These files are dynamic libraries, which are not supported on the compute nodes. Since most interpreted languages use dynamic libraries, they may not run on the compute nodes either. To check whether an executable is dynamically linked, use ldd executable.
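The ldd check can be wrapped in a small helper for use in build scripts. This is a sketch that assumes GNU/Linux ldd behavior, where ldd reports "not a dynamic executable" and exits with a non-zero status for static binaries; the function name is made up for illustration.

```shell
# Succeeds if ldd can list shared-library dependencies for the file.
# Assumes GNU ldd, which fails for static binaries and non-executables.
is_dynamic() {
  ldd "$1" >/dev/null 2>&1
}

target=${1:-/bin/sh}   # file to inspect; /bin/sh is only a default for the demo
if is_dynamic "$target"; then
  echo "$target: dynamically linked (not suitable for the compute nodes)"
else
  echo "$target: static, or not an executable"
fi
```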

What are your guiding principles for configuring the queues on Kraken? Jobs at 32K processors and larger jobs make my jobs wait. I think these large jobs should have the lowest priority.

Jobs with large core counts intentionally get the highest priority on Kraken – without a high priority they would never run. Kraken enables capability jobs that cannot be run on other TeraGrid systems. Jobs with small core counts can be run on other TeraGrid systems, and thus their relative priority is lower on Kraken. NICS does not restrict or discourage jobs with small core counts running on Kraken, but their priority is lower than for large jobs.

Jobs with short wall clock limits sometimes start sooner than jobs with a 24-hour limit. These jobs can be used for back-fill while the system is collecting nodes for a larger job. The scheduler can give those nodes temporarily to short jobs without delaying the start time of the large job.

Lustre File System

What is file striping?

The Lustre file system is made up of an underlying set of file systems called Object Storage Targets (OSTs), which are essentially a set of parallel IO servers. A file is said to be striped when read and write operations access multiple OSTs concurrently. File striping is a way to increase IO performance, since writing to or reading from multiple OSTs simultaneously increases the available IO bandwidth.

Striping will likely have little impact for the following codes:
  • Serial IO where a single processor or node performs all of the IO for an application.
  • Multiple nodes perform IO but access files at different times.
  • Multiple nodes perform IO simultaneously to different files that are small (each < 100 MB).

Lustre allows users to set file striping at the file or directory level. As mentioned above, striping will not improve IO performance for all files. For example, in a parallel application, if each processor writes its own file, then file striping will not provide any benefit. Each file will already be placed on its own OST, and the application will be using OSTs concurrently. File striping, in this case, could lead to a performance decrease due to contention between the processors as they try to write (or read) pieces of their files spread across multiple OSTs.

For MPI applications with parallel IO, multiple processors accessing multiple OSTs can provide large IO bandwidths. Using all the available OSTs on Kraken will provide maximum performance.

There are a few disadvantages to striping. Interactive commands such as ls -l will be slower for striped files. Additionally, striped files are more likely to suffer data loss from a hardware failure, since the file is spread across multiple OSTs.

How is striping set up in Lustre?

The lfs command can be used to determine the Lustre file system setup. Note that each file and directory can have its own striping pattern. This means that users can set striping patterns for their own files and/or directories. The default stripe width as of July 2008 is 4.

The following command reports the striping information for a directory or file:

lfs find -v <directory/file>

If the command reports "has no stripe info", the directory or file is set to not stripe; in other words, the stripe width is 1.

How do I change the striping in Lustre?

A user can change the striping settings for a file or directory in Lustre by using the lfs command. More specifically, one would use lfs setstripe <directory> <options>. Note that if you change the settings for existing files, the file will get the new settings only if it is recreated. To change the settings for an existing directory, you will need to rename the directory, create a new directory with the proper settings, and then copy (not move) the files to the new directory to inherit the new settings.

If your application is the type in which each separate process writes to its own file, then we believe that the best option is to not use striping. This can be set by using this command:

> lfs setstripe <directory> 0 -1 1

Then we see that

> lfs find -v testdirectory
OBDS:
0: ost1_UUID ACTIVE
1: ost2_UUID ACTIVE
2: ost3_UUID ACTIVE
3: ost4_UUID ACTIVE
4: ost5_UUID ACTIVE
5: ost6_UUID ACTIVE
6: ost7_UUID ACTIVE
7: ost8_UUID ACTIVE
8: ost9_UUID ACTIVE
9: ost10_UUID ACTIVE
10: ost11_UUID ACTIVE
11: ost12_UUID ACTIVE
12: ost13_UUID ACTIVE
13: ost14_UUID ACTIVE
14: ost15_UUID ACTIVE
15: ost16_UUID ACTIVE
testdirectory/
default stripe_count: 1 stripe_size: 0 stripe_offset: -1

This shows we have a stripe count of 1 (no striping), the stripe size is set to 0 (which means use the default), and the stripe offset is set to -1 (which means to round-robin the files across the OSTs). You should almost always use -1 for stripe_offset.

The stripe count and stripe size are something you can tweak for performance. If your application writes very large files, then we believe that the best option is to stripe across all or a subset of the OSTs on the file system. Striping across all OSTs can be set by using this command:

> lfs setstripe <directory> 0 -1 -1

Caution: Not striping large files may cause a write error if the file's size is larger than the space on a single OST. Each OST has a finite size, which is smaller than the total Lustre area of all OSTs.
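The two cases above can be captured in a small helper that prints, but does not run, the matching command. It uses the positional form shown in this section (directory, stripe size, stripe offset, stripe count); the function name, pattern labels, and paths are hypothetical, and newer Lustre releases use option flags instead of positional arguments.

```shell
# Print the setstripe command for a directory, given an IO pattern.
stripe_command() {
  dir=$1
  pattern=$2   # "file-per-process" or "shared-large-file"
  case $pattern in
    file-per-process)  count=1  ;;   # no striping
    shared-large-file) count=-1 ;;   # stripe across all OSTs
    *) echo "unknown pattern: $pattern" >&2; return 1 ;;
  esac
  # Stripe size 0 = default; stripe offset -1 = round-robin across OSTs.
  echo "lfs setstripe $dir 0 -1 $count"
}

stripe_command /lustre/scratch/example/run1 file-per-process
stripe_command /lustre/scratch/example/run2 shared-large-file
```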

Why is Lustre taking so long to respond to 'ls'? Is it down?

Some large-core-count MPI applications read or write so many files and terabytes of data at certain moments (for example, at checkpoints) that they swamp the Lustre file system servers with I/O requests. While the servers are busy handling the requests from a single user, all other users have to wait and may perceive the file system as unresponsive for several minutes. We are working with some users to help them adopt I/O patterns better suited to our system. We also suggest that users try the lfs utility described below.

Is there any other faster way to list my files in my Lustre scratch area?

Yes! A basic ls only has to contact the meta-data server (MDS), not the object-storage servers (OSS), where the bottleneck often occurs. In general, ls is aliased to give additional information, which requires the OSS's. You can bypass this by using /bin/ls. When there are many files in the same directory, and you don't need the output to be sorted, /bin/ls -U is even faster.

You can also use the Lustre utility lfs to look for files. For example, the syntax to emulate a regular ls in any directory is

lfs find  -D 0  *

For convenience, you may want to add an alias definition to your login config files. For example Bash users can add to their ~/.bashrc the following line to create an alias called lls.

alias lls="/bin/ls -U"

Runtime Messages/Errors

What does “MPIDI_PORTALSU_REQUEST_FDU_OR_AEP: DROPPED EVENT ON UNEXPECTED RECEIVE QUEUE” mean?

This message indicates that an event was dropped from the MPI unexpected receive queue. A flow control mechanism can be enabled by setting

MPICH_PTL_SEND_CREDITS=-1

See the mpi_intro man page for details.

For best performance, the number of event queue entries for the MPI unexpected receive queue should be set as high as possible.

MPICH_PTL_UNEX_EVENTS=80000

Note that this fix does not address unexpected message buffer exhaustion. Thus, the user may still need to adjust MPICH_MAX_SHORT_MSG_SIZE or MPICH_UNEX_BUFFER_SIZE if this buffering overflows.
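In a Bash batch script, the two settings above would be exported before the aprun line, as in this sketch (csh users would use setenv instead). The values are the ones quoted in this answer and may need tuning for your application.

```shell
# Values taken from the answer above.
export MPICH_PTL_SEND_CREDITS=-1     # enable the flow control mechanism
export MPICH_PTL_UNEX_EVENTS=80000   # enlarge the unexpected receive event queue

# Then launch as usual, e.g.: aprun -n <ranks> ./a.out
echo "MPICH_PTL_SEND_CREDITS=$MPICH_PTL_SEND_CREDITS"
echo "MPICH_PTL_UNEX_EVENTS=$MPICH_PTL_UNEX_EVENTS"
```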

Miscellaneous

What “endian”ness is the XT4 and XT5? Is there any way to affect it?

The Cray XT4 and XT5 are little-endian. There is a compiler switch, -Mbyteswapio, that makes the default Fortran unformatted I/O big-endian (for both reads and writes).

Note that this little-endian-to-big-endian conversion feature is intended for Fortran unformatted I/O operations. It enables the development and processing of files with big-endian data organization. The feature also enables processing of the files developed on processors that generate big-endian data (such as IBM, Cray X1, Sun).
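If you are unsure which byte order the machine that produced your data uses, a quick check like the following can be run there. It is a generic sketch using standard printf, od, and tr, and is not specific to Kraken.

```shell
# Write the two bytes 0x01 0x00 and read them back as one 16-bit integer:
# a little-endian host sees 1, a big-endian host sees 256.
value=$(printf '\001\000' | od -An -tu2 | tr -d ' \n')

case $value in
  1)   byte_order=little  ;;
  256) byte_order=big     ;;
  *)   byte_order=unknown ;;
esac
echo "host byte order: $byte_order-endian"
```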

What profiling tools are available?

At least three profiling tools are available on Kraken.

  1. CrayPat is provided by Cray. More information is available in the CrayPAT overview.
  2. fpmpi is an unsupported product that can provide a very concise profile of MPI routines in an application. To use it, simply load the fpmpi (or fpmpi_papi) module and relink. Then rerun your application. There are a few environment variables to control profiling output:
    • MPI_PROFILE_DISABLE : Disables statistic collection until fpmpi_enable is called (#include fpmpi.h).
    • MPI_PROFILE_SUMMARY : Setting this disables creation of individual MPI process statistics files. Set it when running with thousands of processes.
    • MPI_PROFILE_FILE : Name of process statistic file; default is profile.txt.
    • MPI_HWPC_COUNTERS : List of events or event set number as in libhwpc.
  3. A third tool that is unsupported is TAU. TAU (Tuning and Analysis Utilities) is a portable profiling and tracing toolkit for performance analysis of parallel programs written in Fortran, C, C++, Java, Python. Basic profiling with TAU can be done in the following steps:
    1. Load the tau module: module load tau
    2. Set the environment variable TAU_MAKEFILE. In tcsh: setenv TAU_MAKEFILE ${TAUROOT}/lib/Makefile.tau-mpi-pdt
    3. Compile code with the tau wrappers (which should be in your path), tau_f90.sh, tau_cc.sh, or tau_cxx.sh.
    4. You will get a regular executable. Submit your job as usual.
    5. After execution, there should be a profile.xxx text file.

TAU can also do MPI profiling and collect hardware performance counter data.

How do I get performance counter data for my program?

Use the following process:

  1. Use module load xt-craypat.
  2. Compile code.
    1. If Fortran90 with modules, compile with -Mprof=func.
  3. Run pat_build -u -g mpi a.out.
  4. Run a.out+pat as you would a.out, but make sure PAT_RT_HWPC is set to 1 in the batch script.
    1. If you want just a regular profile, don't set PAT_RT_HWPC.
  5. Run pat_report <dir>/*.xf, where <dir> is automatically generated by instrumented code.

The resulting output will have performance counter results for the entire run AND for each subroutine.

Where can I find documentation on MPI environment variables?

You can find current information on MPI environment variables from the mpi_intro man page.

Can a user log in directly to a compute node?

No, users cannot log in directly to a compute node. However, by submitting an interactive batch job, users can get access to an aprun node, from which they can execute commands as if they were executing them directly on a compute node. For more information on how to run interactive batch jobs, please see Interactive Batch Jobs.

Can a user use the login nodes for pre/post processing?

Login nodes are not meant for pre or post processing operations. Users can use other systems like Verne for such purposes. If they really have to use Kraken for their pre and post processing operations, they will have to use the compute nodes to do so.

Where can I find more information?

If you haven't already, please check out the other Kraken resource pages at Kraken resources on compiling, file systems, batch jobs, open issues, parallel I/O tips, CrayPAT overview, and other reports and presentations related to Kraken.

Another good resource (without Kraken-specific information) is the documentation that Cray provides at CrayDocs.