If you need to run many instances of a serial code (as in a typical parameter sweep study for instance), we highly recommend using Eden. Eden is a simple script-based master-worker framework for running multiple serial jobs within a single PBS job.
To find out how to use the performance tools on Darter, enter the following commands on the login node:
module load perftools
man intro_perftools
Python's multiprocessing module is similar to threading, so you should use the following in your Darter batch script to launch the python script on a single node:
module load python
aprun -d 16 python script.py

This will make all 16 cores on the node available to the Python script. Please note: whether or not the cores are fully utilized is up to the programming of the script.
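Putting the pieces together, a minimal PBS job script for this single-node case might look like the following sketch (the job size, walltime, and script name are placeholders):

```shell
#!/bin/bash
#PBS -l size=16             # one Darter node (16 cores)
#PBS -l walltime=00:30:00

cd $PBS_O_WORKDIR           # run from the submission directory
module load python
aprun -d 16 python script.py   # script.py is a placeholder name
```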
If the python script is parallelized using MPI (e.g. with mpi4py which is available on Darter), then it should be run just like any other MPI program using the following syntax in your batch script:
module load python
aprun -n numproc python parallel_script.py

If there is no MPI in the python script, use the following syntax in your batch script:
module load python
aprun python serial.py
If you want to see what would happen while compiling your code, but you don't want any files to be created or overwritten, use the
-dryrun option with the Cray compiler wrappers. This option shows the commands built by the driver but does not actually compile.
For example, "cc -dryrun hello.c -o hello.exe".
While debugging tools are preferable to print statements, sometimes the latter is the only way to find a bug. In that case, the most effective way to isolate the error in your code is the method of bisection, an iterative process for tracing the program manually.
Step 1: In the main routine of your code, comment out the second half of the code (or approximately the second half).
Step 2: Compile and run the code. Did it crash as before? If so, the bug lies in the half that is still active; if not, it lies in the half you commented out. Repeat the process on the offending half until the faulty statement is isolated.
Sometimes a code will work fine in many cases and circumstances but there will be a bug which only rears its head when a certain perfect storm of case and job size occurs. This causes the code to die in a strange spot and it is not obvious exactly why or where. In cases like this, Cray's ATP (Abnormal Termination Processing) can likely help!
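To use ATP, it is typically enough to set one environment variable in the job script before launching the application. A minimal sketch (the executable name is a placeholder, and the atp module is usually loaded by default on Cray systems):

```shell
module load atp          # usually already loaded by default
export ATP_ENABLED=1     # turn on Abnormal Termination Processing
aprun -n 32 ./a.out      # on an abnormal exit, ATP produces a merged stack backtrace
```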
In order to determine memory usage for a given process on a compute node, one would normally issue the command "top" and look at the memory usage of the process in question. However, this cannot be done on a Darter compute node, since the compute nodes are not directly accessible to the user. Also, OOM (Out of Memory) errors can occur even when a problem has been discretized finely enough, for example when a memory leak in the code gradually exhausts available memory and crashes the program.
Unlike Darter's compute nodes, its login nodes have modest hardware specs: a single quad-core processor with 8 gigabytes of memory. However, each of the Darter login nodes may have up to 30 user login sessions active at any given time. As a result, a single user who runs a very processor- or memory-intensive task on a Darter login node can affect the work of several dozen other users. As a result, NICS recommends that concurrent makes ("make -j N") on Darter be done with an N of 2 or less.
For example, a user might specify:
#PBS -l size=192 ### Assuming you want to use 24 MPI tasks
aprun -n 24 -N 2 -S 1
Here's what the above aprun command means. You are asking for 24 MPI tasks, 2 MPI tasks per node, and 1 MPI task per socket.
Each Darter node has 16 cores and 32 Gbytes of memory: about 2 GB per core if all cores are used. Sometimes it is necessary to leave some cores idle to make more memory available per core. For example, if you use 8 cores per node, each core has access to about 4 Gbytes of memory.
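As a sketch, the following request runs 8 tasks per node so that each task can use roughly 4 GB of memory (the executable name and job size are placeholders):

```shell
#PBS -l size=192          # 192 cores = 12 nodes at 16 cores/node

aprun -n 96 -N 8 ./a.out  # 96 tasks total, 8 per node -> ~4 GB per task
```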
Your file transfer has caused a Lustre storage server (OST) to become full, resulting in an error like:
ead_cond_timedwait() return error 22, errno=0 OUT OF SPACE condition detected while writing local file
This usually happens because the stripe count is too small (often 1). To solve this issue, remove the partially transferred file and change the stripe count of the directory before transferring the file. To change the stripe count of the directory, first
cd to that directory. Second, type the following command:
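The usual Lustre command for this is lfs setstripe; the stripe count below is only an example (a count of -1 stripes across all available OSTs):

```shell
cd /path/to/transfer/dir   # placeholder path: the destination directory
lfs setstripe -c 8 .       # new files created here will be striped over 8 OSTs
```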
In order to enable the creation of a coredump file when a program crashes in the compute node of a CRAY system like Darter, the following command should be added to the job script before the aprun call:
Bourne shell:  ulimit -c unlimited
C shell:       limit coredumpsize unlimited
For example, if using a Bourne-like job script, the script will look like:
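A minimal sketch (the job size, walltime, and executable name are placeholders):

```shell
#!/bin/bash
#PBS -l size=32
#PBS -l walltime=01:00:00

cd $PBS_O_WORKDIR
ulimit -c unlimited        # enable core dump creation before launching
aprun -n 32 ./a.out
```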
- Replace all compiler commands (pgf90, etc.) with the corresponding Cray compiler wrappers: ftn for Fortran, cc for C, and CC for C++.
- Remove all references to MPI libraries and environment variables related to third-party libraries within the makefile.
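For instance, a makefile fragment might change as follows (the variable names are illustrative; the Cray wrappers find and link MPI automatically):

```makefile
# Before:
#   FC   = pgf90
#   LIBS = -L$(MPI_DIR)/lib -lmpi
# After: use the Cray wrapper and drop the explicit MPI references
FC   = ftn
LIBS =
```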
The MPI_IN_PLACE option causes a collective operation on an intra-communicator to use a single buffer for both send and receive data, rather than copying between separate buffers. This reduces memory usage and the number of local copy operations; the savings apply within each process, not to the communication between nodes.
In order to use this option with
MPI_Alltoall, you need to disable Cray's optimization for that call:
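On Cray MPICH this is controlled by an environment variable set in the job script before the aprun call; the exact accepted values are documented in "man intro_mpi" on the system, so treat the value below as an assumption to verify:

```shell
# Disable Cray's optimized MPI_Alltoall (value assumed; check "man intro_mpi")
export MPICH_COLL_OPT_OFF=mpi_alltoall
```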