Before getting into scripting details, it is important to understand that a little knowledge can be a dangerous thing. This section describes how to semi-automate one’s research. Fully automating a workflow is not advisable, because it creates a divide between the computational scientist and their data. For instance, if a simulation begins to show an energy drift and the researcher is unaware of it for a week, the study and the allocation could be wasted. Hence the term semi-automated is used. In addition, because of system maintenance, updates, and the sharing of the resource with other researchers, one should not chain simulations for longer than a week.
If one needs a longer wall time or would like to save time by semi-automating jobs on a supercomputer, an effective solution is to set up a job chaining script. This method is useful if your simulation is checkpointable and ready for production runs. First, some definitions are needed.
Chaining PBS jobs requires knowing how to create a proper PBS job script and how to use the dependency keywords. One can learn how to construct a PBS submission script by reading the Batch Scripts section of the Running Jobs page. Any PBS job that depends on another PBS job is placed in a hold state. The dependency list is built from the ID number of the job being depended upon and the dependency keywords listed below.
|Keyword|Description|
|---|---|
|after|Execute current job after listed jobs have begun|
|afterok|Execute current job after listed jobs have terminated without error|
|afternotok|Execute current job after listed jobs have terminated with an error|
|afterany|Execute current job after listed jobs have terminated for any reason|
|before|Listed jobs can be run after current job begins execution|
|beforeok|Listed jobs can be run after current job terminates without error|
|beforenotok|Listed jobs can be run after current job terminates with an error|
|beforeany|Listed jobs can be run after current job terminates for any reason|
The most commonly used dependency keyword is afterok. A dependency is specified as "-W depend=dependency_expression".
The dependency expression consists of the dependency keyword followed by one or more job ID numbers in a colon-separated list. For example, on the command line:
qsub -W depend=afterok:1187721 my_script.pbs
Or included in your PBS script:
#PBS -W depend=before:1187723:1187724
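As a minimal sketch of how a dependency expression is assembled, the helper below joins a keyword and one or more job IDs with colons. The name `build_depend` is hypothetical and not part of PBS; the function only builds the string and submits nothing.

```shell
#!/bin/bash
# build_depend is a hypothetical helper, not a PBS command.
# Usage: build_depend KEYWORD JOBID [JOBID ...]
build_depend() {
    local keyword=$1
    shift
    if [ "$#" -eq 0 ]; then
        echo "build_depend: at least one job ID is required" >&2
        return 1
    fi
    local expr=$keyword
    local id
    # Append each job ID, separated by a colon.
    for id in "$@"; do
        expr="$expr:$id"
    done
    echo "depend=$expr"
}

# Example: prints depend=afterok:1187721:1187722
build_depend afterok 1187721 1187722
```

The resulting string could then be handed to qsub, for example `qsub -W "$(build_depend afterok 1187721 1187722)" my_script.pbs`.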
To chain PBS jobs, one creates the necessary PBS submission scripts and a shell script like those shown below. After creating the shell script, give it execute permission with “chmod u+x script.sh”.
A ‘flat’ chain can be used if one would like to submit a sequential series of calculations that consists of various pre/post-production runs. For this, one needs the various PBS submission scripts that handle the separate calculations (named calc1.pbs and calc2.pbs below).
#!/bin/bash
one=$(qsub calc1.pbs)
echo $one
two=$(qsub -W depend=afterok:$one calc2.pbs)
echo $two
One can continue chaining jobs by continuing the numbering convention ($three, $four, …). To run the chaining script and place the jobs in the PBS queue, one issues the command “./script_name”.
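For longer flat chains, the numbered variables can be replaced by a loop over the script names. The following is a sketch under that assumption, not the method prescribed here: `submit_chain` is a hypothetical function name, and each job is made to wait (afterok) on the one submitted just before it.

```shell
#!/bin/bash
# submit_chain is a hypothetical helper: it submits each script in order,
# making every job after the first depend (afterok) on the previous job.
submit_chain() {
    local prev="" jobid script
    for script in "$@"; do
        if [ -z "$prev" ]; then
            # First job: no dependency.
            jobid=$(qsub "$script") || return 1
        else
            # Later jobs: held until the previous job ends without error.
            jobid=$(qsub -W depend=afterok:"$prev" "$script") || return 1
        fi
        echo "$script -> $jobid"
        prev=$jobid
    done
}

# Submit only when script names are given on the command line,
# e.g. ./script_name calc1.pbs calc2.pbs calc3.pbs
if [ "$#" -gt 0 ]; then
    submit_chain "$@"
fi
```

Stopping at the first failed qsub call keeps a later job from being submitted without the dependency it was meant to have.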
A ‘looped’ chain is useful for a single PBS job that can be re-submitted multiple times. One has to be careful here to ensure the integrity of the looped chain. For instance, suppose a researcher’s molecular dynamics simulation can produce a 4 nanosecond trajectory in 24 hours, and they would like a trajectory totaling 12 nanoseconds. The maximum wall time on Kraken is only 24 hours, so submitting the same job 3 times would do the trick. However, running right up to the 24-hour wall time limit poses the risk of the job being killed before it finishes. In addition, did the simulation write the restart file? If not, the chain could be broken. If the simulation was set up to write a restart file every 6 hours but the last one was not written, then the next simulation would start from the 18-hour mark. This is why it is a good idea to run one’s calculation once or twice before semi-automating the process. Benchmarking to understand the simulation’s scaling behavior is another good idea.
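One way to guard against a missing restart file is to have the job script itself fail when the file is absent, so that jobs waiting on afterok are not started from stale data. The sketch below assumes a helper named `require_restart` and a restart file named restart.rst; both names are hypothetical and should be adapted to the simulation code in use.

```shell
#!/bin/bash
# require_restart is a hypothetical helper for the end of a job script.
# It returns non-zero when the restart file is missing or empty.
require_restart() {
    local restart_file=$1
    if [ ! -s "$restart_file" ]; then
        echo "restart file '$restart_file' is missing or empty" >&2
        return 1
    fi
    echo "restart file '$restart_file' found"
}

# At the end of submit.pbs, after the simulation command (assumed file name):
#   require_restart restart.rst || exit 1
```

Because afterok requires the preceding job to terminate without error, exiting with a non-zero status here should keep the rest of the chain from running.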
Below is the ‘looped’ job chain script:
#!/bin/bash
one=$(qsub submit.pbs)
echo $one
for id in $(seq 2 4); do
    two=$(qsub -W depend=afterok:$one submit.pbs)
    one=$two
done
In the above example, the script submits the PBS submission script “submit.pbs” four times. If a different number of submissions is needed, change the 4 in the for-loop to the required value (remember not to submit a lot of jobs; see the A Must Read section). Viewing these jobs in the queue will show the first submitted job’s state (S column) as ‘Q’ for Queued. The succeeding ones will have a job state of ‘H’ for Held, because each depends on the job submitted before it.
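The number of submissions can also be passed as an argument instead of being edited in the loop. This is only a sketch: `loop_chain` is a hypothetical function name, and the count of 3 in the comment comes from the 12 nanosecond example above.

```shell
#!/bin/bash
# loop_chain is a hypothetical helper: submit the same script COUNT times,
# each submission after the first waiting (afterok) on the previous one.
loop_chain() {
    local script=$1 count=$2
    local prev jobid i
    prev=$(qsub "$script") || return 1
    echo "$prev"
    for ((i = 2; i <= count; i++)); do
        jobid=$(qsub -W depend=afterok:"$prev" "$script") || return 1
        echo "$jobid"
        prev=$jobid
    done
}

# Example from the text: 12 ns total at 4 ns per job needs 12 / 4 = 3 jobs.
# Run as: ./script_name submit.pbs 3
if [ "$#" -eq 2 ]; then
    loop_chain "$1" "$2"
fi
```

After submission, a command such as `qstat -u "$USER"` can be used to confirm that the first job is queued and the rest are held.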