The PBS script is executed on the aprun node (or on a login node for interactive jobs). Executables called directly (e.g., ./a.out) run serially on that service node. This may be useful for record keeping, staging data, and similar light tasks. Please run any memory- or computationally intensive programs using aprun; otherwise they bog down the service node and may cause system problems. You may also run non-MPI programs on a compute node using aprun; see the section on Single-Processor (Serial) Jobs below.
To launch parallel jobs on one or more compute nodes, use the aprun command. Keep Kraken's layout in mind when running a job with aprun. A Kraken XT5 node consists of two sockets, each with 6 cores, so there are 12 cores/node. The PBS size option requests compute cores. This is not necessarily the number of cores that will be used, but rather the number of cores that will be made unavailable to other users (idle cores on an occupied node are still inaccessible to them). The easiest way to determine this number may be to calculate the number of nodes that will be occupied (even partially) and multiply that number by 12 cores/node.
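The size calculation described above can be sketched as a small shell function. This is a minimal illustration, not a site-provided tool: the function name `pbs_size` and its arguments are hypothetical, and it assumes 12 cores per XT5 node as stated above.

```shell
#!/bin/sh
# Hypothetical helper (not part of PBS or aprun): compute the value for
# "#PBS -l size=" from the number of processes and processes per node.
CORES_PER_NODE=12   # Kraken XT5: 2 sockets x 6 cores

pbs_size() {
    nprocs=$1                         # total processes to launch
    per_node=${2:-$CORES_PER_NODE}    # processes placed on each node
    # Nodes occupied, even partially (integer ceiling division)
    nodes=$(( (nprocs + per_node - 1) / per_node ))
    # Every occupied node counts as 12 cores, used or idle
    echo $(( nodes * CORES_PER_NODE ))
}

pbs_size 15      # 15 processes, 12 per node -> 2 nodes -> prints 24
pbs_size 8 4     # 8 processes, 4 per node   -> 2 nodes -> prints 24
pbs_size 36      # 3 full nodes              -> prints 36
```

The second argument is optional; when it is omitted, the sketch assumes every core on a node runs one process.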
Aprun accepts the following common options:
| Option | Description |
| --- | --- |
| -n | Total number of MPI processes (default: 1) |
| -N | Number of MPI processes per node (XT5: 1–12) |
| -S | Number of MPI processes per socket (XT5: 1–6) |
| -d | Number of cores per MPI process (for use with OpenMP; XT5: 1–12) |
The best way to understand the effects of these options is to try them yourself; please see our tutorial on the subject.
MPI examples
aprun -n $PBS_NNODES ./a.out
This uses all requested cores, with one MPI process on each core. The environment variable PBS_NNODES holds the number of cores requested with the size option at the top of the PBS script. In most cases, it is unnecessary to do anything beyond this.
aprun -n 15 ./a.out
Using a number of cores that is not a multiple of 12 is valid. Round the resource request up to the next multiple of 12; the extra cores will remain idle. This example requires #PBS -l size=24.
aprun -n 8 -N 4 ./a.out
This will cause the XT5 to emulate the 4 cores/node layout of the XT4: there will be four MPI processes per node, all on one socket. This example requires a request of 24 cores (two nodes) on the XT5, because the cores left idle on each node still count toward the request. This layout might be interesting for benchmarking purposes; however, half of the memory on the node is associated with the other socket and may not be initialized unless you use both sockets.
aprun -n 8 -S 2 ./a.out
On the XT5, this is similar to the previous example in that 4 MPI processes run on each node, but now two run on each socket. This ensures that both sockets are used and that L3 cache and memory are evenly distributed between them (a process can access memory on the other socket, but not as quickly as its own memory).
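To make the difference between the -N and -S examples concrete, the following sketch prints where each MPI rank would land under a simplified placement model. It is an illustration only: it does not query aprun or reproduce its actual placement algorithm. The function name `show_placement` is hypothetical, and the model assumes that ranks fill one socket up to the per-socket limit before moving to the next socket, and one node up to the per-node limit before moving to the next node.

```shell
#!/bin/sh
# Simplified placement model for a Kraken XT5 node (assumption, not aprun itself)
SOCKETS_PER_NODE=2
CORES_PER_SOCKET=6

# show_placement NPROCS [PER_NODE] [PER_SOCKET]
#   NPROCS      total MPI processes   (like aprun -n)
#   PER_NODE    processes per node    (like aprun -N, default 12)
#   PER_SOCKET  processes per socket  (like aprun -S, default 6)
show_placement() {
    nprocs=$1
    per_socket=${3:-$CORES_PER_SOCKET}
    max_per_node=$(( per_socket * SOCKETS_PER_NODE ))
    per_node=${2:-$max_per_node}
    # The per-socket limit caps how many processes fit on one node
    if [ "$per_node" -gt "$max_per_node" ]; then
        per_node=$max_per_node
    fi
    rank=0
    while [ "$rank" -lt "$nprocs" ]; do
        node=$(( rank / per_node ))      # node index for this rank
        slot=$(( rank % per_node ))      # position of the rank within its node
        socket=$(( slot / per_socket ))  # socket index within the node
        echo "rank $rank: node $node, socket $socket"
        rank=$(( rank + 1 ))
    done
}

echo "Model of: aprun -n 8 -N 4"
show_placement 8 4
echo "Model of: aprun -n 8 -S 2"
show_placement 8 12 2
```

Under this model, the first call reports socket 0 for every rank, while the second places two ranks on socket 0 and two on socket 1 of each node.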
MPI/OpenMP
Kraken supports threaded programming within a node. The aprun -d flag specifies the number of cores per MPI process, so with OpenMP, "aprun -d $OMP_NUM_THREADS" uses one thread per core. A job with n MPI processes and a depth of d needs at least n*d cores, and the request must still cover every core on each occupied node. The following examples assume that three nodes have been requested: #PBS -l size=36.
export OMP_NUM_THREADS=2
aprun -n12 -N4 -S2 -d2 ./a.out
Here, each MPI process has two OpenMP threads. The 12 processes are spread over three nodes, four per node and two per socket, so 8 of the 12 cores on each node are in use. For some codes, two OpenMP threads per MPI process may be optimal. If the reason for using OpenMP is instead to increase the memory available to each process, you may want to use 6 or even 12 threads per MPI process, though there is some performance penalty for using OpenMP across sockets in Kraken's current configuration (using HyperTransport 1).
export OMP_NUM_THREADS=5
aprun -n6 -N2 -S1 -d5 ./a.out
The -d flag specifies the depth, or number of cores assigned to each MPI process, so that when the MPI process spawns an OpenMP thread, it has a dedicated core to put it on. With -S1, the second process on each node is placed entirely on the second socket, rather than filling out the first socket first.
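The arithmetic behind these hybrid examples can be checked before submitting a job. The sketch below is a hypothetical helper (the name `check_hybrid` is not an aprun or PBS feature) that assumes the XT5 limits given above, 12 cores per node and 6 per socket. It reports the cores in use and the size to request, and it only does the arithmetic; it makes no claim about how aprun itself reacts to a layout that does not fit.

```shell
#!/bin/sh
# Hypothetical pre-submission check for MPI/OpenMP layouts on an XT5 node
CORES_PER_NODE=12
CORES_PER_SOCKET=6

# check_hybrid N_TOTAL N_PER_NODE N_PER_SOCKET DEPTH
#   mirrors the aprun options -n, -N, -S and -d
check_hybrid() {
    n=$1; per_node=$2; per_socket=$3; depth=$4
    # Each process needs "depth" cores, so N*d must fit on one node
    if [ $(( per_node * depth )) -gt "$CORES_PER_NODE" ]; then
        echo "error: -N $per_node with -d $depth does not fit on a $CORES_PER_NODE-core node" >&2
        return 1
    fi
    # Warn (on stderr) when the processes of one socket need more cores than it has
    if [ $(( per_socket * depth )) -gt "$CORES_PER_SOCKET" ]; then
        echo "warning: -S $per_socket with -d $depth does not fit on a $CORES_PER_SOCKET-core socket" >&2
    fi
    nodes=$(( (n + per_node - 1) / per_node ))   # ceiling division
    echo "cores used: $(( n * depth )), nodes: $nodes, size to request: $(( nodes * CORES_PER_NODE ))"
}

check_hybrid 12 4 2 2   # prints: cores used: 24, nodes: 3, size to request: 36
check_hybrid 6 2 1 5    # prints: cores used: 30, nodes: 3, size to request: 36
```

Both layouts occupy three nodes, which matches the size=36 request assumed for these examples.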
Single-Processor (Serial) Jobs
Serial programs that are memory- or computationally intensive should never be run on the service nodes (that is, anywhere outside of aprun). Service nodes have limited resources shared among all users, and when those resources run out, system problems may result. To run serial programs on the compute nodes, the program must be compiled with the compiler wrappers (cc, CC or ftn). You would then request one node (12 cores) with PBS (#PBS -l size=12). Use the following line to run a serial executable on a compute node:
aprun -n 1 ./a.out
If, however, your executable needs all of the available memory or if you are running an OpenMP code, then use the following:
aprun -n1 -d12 ./a.out
Here, the "-d" flag specifies the number of cores (threads) for each process. (Typically, OMP_NUM_THREADS and "-d" will match.)
When using "-d 12", a serial code (with or without threads) can use all the memory on the node (16 Gbytes). Note that there is a slight performance hit to access the second memory bank because communication has to go through the other socket.
Running Multiple Single-Processor Programs on a Compute Node
The following batch script shows how to run multiple copies of a serial program on a compute node:

#!/bin/csh
#PBS -A TG-XXXXXXXXX
#PBS -N run_serial
#PBS -l walltime=00:30:00,size=12
#PBS -j oe
#PBS -V

set echo
cd /lustre/scratch/$USER/serial_job

# Use aprun to start a shell script which runs 12 copies
# of the same executable on a compute node
# Note: all aprun options specified below are required
#   -n 1      # run on a single node
#   -d 12     # allows the script to access all the memory on the node
#   -cc none  # allows each serial process to run on its own core
#   -a xt     # required by aprun to run a script instead of a program
aprun -n 1 -d 12 -cc none -a xt ./run_serial

The run_serial script looks like this:

#!/bin/sh
# This must be /bin/sh (other shells do not work)

# Run 12 copies of serial_code in the background
./serial_code &
./serial_code &
./serial_code &
./serial_code &
./serial_code &
./serial_code &
./serial_code &
./serial_code &
./serial_code &
./serial_code &
./serial_code &
./serial_code &

# Wait until all copies of serial_code have finished
wait
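The twelve repeated lines can also be written as a loop. The following is a minimal sketch of an equivalent body for run_serial, assuming a POSIX-compatible /bin/sh on the compute node; the function name `run_copies` is hypothetical and not part of the original script.

```shell
#!/bin/sh
# run_copies PROGRAM NCOPIES
# Start NCOPIES instances of PROGRAM in the background, then wait for all of them.
run_copies() {
    program=$1
    ncopies=$2
    i=1
    while [ "$i" -le "$ncopies" ]; do
        "$program" &        # each copy runs as its own background process
        i=$(( i + 1 ))
    done
    # Block until every background copy has finished
    wait
}

# In run_serial, the call matching the script above would be:
#   run_copies ./serial_code 12
```

Keeping the copy count as a parameter makes it easier to run fewer than 12 copies when each one needs a larger share of the node's memory.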

