If you need to run many instances of a serial code (as in a typical parameter sweep study for instance), we highly recommend using Eden. Eden is a simple script-based master-worker framework for running multiple serial jobs within a single PBS job.
To find out how to use the performance tools on Darter, enter the following commands on the login node:
module load perftools
man intro_perftools
Python's multiprocessing module is similar to threading, so you should use the following in your Darter batch script to launch the python script on a single node:
module load python
aprun -d 16 python script.py
This will make all 16 cores on the node available to the Python script. Please note: whether or not the cores are fully utilized is up to the programming of the script.
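To illustrate what "up to the programming of the script" means, here is a minimal sketch of what script.py might contain, assuming Python 3. The function work() is a hypothetical stand-in for the real computation, and the worker count of 16 is simply the core count mentioned above; neither comes from an actual Darter example.

```python
import multiprocessing as mp

NPROCS = 16  # assumption: one worker per core on the node reserved with "aprun -d 16"


def work(n):
    """Hypothetical CPU-bound task: sum of squares below n."""
    return sum(i * i for i in range(n))


def run(inputs, nprocs=NPROCS):
    # The pool starts nprocs worker processes and spreads the inputs over them.
    # With fewer workers than cores, the remaining cores stay idle.
    with mp.Pool(processes=nprocs) as pool:
        return pool.map(work, inputs)


if __name__ == "__main__":
    results = run(range(1, 65))
    print(len(results), "tasks finished")
```

If the pool were created with fewer processes, or the work were done in a plain loop, the job would still reserve all 16 cores but leave most of them unused.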
If the Python script is parallelized using MPI (e.g. with mpi4py, which is available on Darter), then it should be run just like any other MPI program using the following syntax in your batch script:
module load python
aprun -n numproc python parallel_script.py
If there is no MPI in the Python script, use the following syntax in your batch script:
module load python
aprun python serial.py
If you want to see what would happen when compiling your code, without creating or overwriting any files, use the -dryrun option with the Cray compiler wrappers. This option shows the commands built by the driver but does not actually compile.
For example, "cc -dryrun hello.C -o hello.exe".
Debugging tools are generally preferable to print statements, but sometimes print statements are the only way to find a bug. In that case, the most effective way to isolate the error in your code is bisection, an iterative process for tracing the program manually.
Step 1: In the main routine of your code, comment out the second half of the code (or approximately the second half).
Step 2: Compile and run the code. Did it crash as before?
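The idea behind these steps is a binary search over the code. The sketch below models it in Python under stated assumptions: the program is treated as a sequence of n_steps steps, and crashes(k) is a hypothetical callback that reports whether the program still crashes when only the first k steps are left enabled. In practice, each call to crashes() corresponds to one manual edit, compile, and run cycle.

```python
def first_bad_step(n_steps, crashes):
    """Return the 0-based index of the first step whose inclusion triggers the crash.

    crashes(k) should return True if running only steps 0..k-1 still crashes.
    Assumes the full program crashes and the empty program does not.
    """
    lo, hi = 0, n_steps  # invariant: crashes(lo) is False, crashes(hi) is True
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if crashes(mid):
            hi = mid  # the bug is somewhere in the first mid steps
        else:
            lo = mid  # the first mid steps are clean; look in the later part
    return hi - 1


if __name__ == "__main__":
    # Simulated program of 8 steps in which step 5 contains the bug.
    print(first_bad_step(8, lambda k: k > 5))
```

Each round halves the suspect region, so a routine with N statements needs roughly log2(N) compile-and-run cycles rather than N.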
Sometimes a code will work fine in many cases and circumstances but there will be a bug which only rears its head when a certain perfect storm of case and job size occurs. This causes the code to die in a strange spot and it is not obvious exactly why or where. In cases like this, Cray's ATP (Abnormal Termination Processing) can likely help!
In order to determine memory usage for a given process on a compute node, one would normally simply issue the command "top" and look at the memory usage of the process in question. However, this cannot be done on Darter, since its compute nodes are not directly accessible to the user. Also, OOM (Out of Memory) errors can occur even when a problem has been discretized finely enough: in the worst case, memory leaks in the code exhaust the available memory and cause the program to crash.
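One way a program can report its own memory use without "top" is to ask the operating system from inside the process. The following is only a sketch for Python scripts, using the standard resource module. It assumes a Linux node, where ru_maxrss is reported in kilobytes, and it is not necessarily the method recommended for Darter. The helper name peak_memory_mb is hypothetical.

```python
import resource


def peak_memory_mb():
    """Peak resident set size of the current process, in megabytes.

    Assumes Linux, where ru_maxrss is given in kilobytes.
    """
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_maxrss / 1024.0


if __name__ == "__main__":
    data = [0.0] * 1000000  # allocate something so the number is not trivial
    print("peak memory: %.1f MB" % peak_memory_mb())
```

Printing this value at a few points in a long run can show whether memory grows steadily, which is the usual sign of a leak.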
Unlike Darter's compute nodes, its login nodes have modest hardware specs: a single quad-core processor with 8 gigabytes of memory. However, each of the Darter login nodes may have up to 30 user login sessions active at any given time. As a result, a single user who runs a very processor- or memory-intensive task on a Darter login node can affect the work of several dozen other users. For this reason, NICS recommends that concurrent makes ("make -j N") on Darter be done with an N of 2 or less.
If a user wants to use:
#PBS -l size=192 ### Assuming you want to use 24 MPI tasks
aprun -n 24 -N 2 -S 1
Here's what the above aprun command means. You are asking for 24 MPI tasks, 2 MPI tasks per node, and 1 MPI task per socket.
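The relationship between the aprun options and the PBS size request can be checked with a little arithmetic. The sketch below assumes 16 cores per node (the figure quoted for a Darter node earlier) and that size counts every core on each requested node. The function name pbs_size is hypothetical.

```python
import math

CORES_PER_NODE = 16  # assumption: cores per Darter compute node


def pbs_size(n_tasks, tasks_per_node, cores_per_node=CORES_PER_NODE):
    """Return (nodes, size) needed for an under-populated aprun layout."""
    # Each node hosts at most tasks_per_node MPI tasks (the -N value).
    nodes = math.ceil(n_tasks / tasks_per_node)
    # The size request covers every core on those nodes, used or not.
    return nodes, nodes * cores_per_node


if __name__ == "__main__":
    nodes, size = pbs_size(24, 2)
    print("nodes =", nodes, "size =", size)
```

For the example above, 24 tasks at 2 per node occupy 12 nodes, and 12 nodes of 16 cores give the size=192 used in the PBS directive.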