The National Institute for Computational Sciences

Darter: How do I use Cray ATP to determine where and why a code died abnormally?

Sometimes a code works fine in most cases, but contains a bug that only appears under a particular combination of input case and job size. The code then dies in an unexpected place, and it is not obvious where or why. In cases like this, Cray's ATP (Abnormal Termination Processing) can likely help!

First, load the ATP module:

module load atp

and recompile your code with debugging symbols and without optimization (add the "-g" flag and, with most compilers, "-O0"; "-g" alone does not necessarily turn optimization off) using any backend compiler (PrgEnv) with the Cray wrappers (ftn, cc, or CC). This both creates the instrumented executable and helps ensure that the error was not caused by compiler optimization.
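As a rough illustration of the recompile step, the sketch below builds a debug executable with the Fortran wrapper. The file names mycode.f90 and mycode.x are hypothetical, and "-O0" is assumed to be accepted by your backend compiler; check its man page if in doubt. Off the Cray system the sketch only prints the command it would run.

```shell
# Hypothetical file names; substitute your own source and executable.
SRC=mycode.f90
EXE=mycode.x

# -g adds debugging symbols; -O0 (assumed to be accepted by your backend
# compiler) turns optimization off.
DEBUG_FLAGS="-g -O0"

if command -v ftn >/dev/null 2>&1; then
    # On the Cray system: build with the Fortran wrapper (use cc or CC for C/C++).
    # Run "module load atp" first so that the executable is instrumented.
    ftn $DEBUG_FLAGS -o "$EXE" "$SRC"
else
    # Elsewhere: just show the command that would be run.
    echo "ftn $DEBUG_FLAGS -o $EXE $SRC"
fi
```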

Now, you are ready to use ATP to generate a backtrace to the line where the code died.

Add the following to your PBS script to make sure that the ATP module is loaded into your aprun environment and that the ATP environment variable is set to collect information:

module load atp
export ATP_ENABLED=1
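
To show where these two lines might sit in a complete job script, here is a minimal sketch that writes one to a file. The script name atp_job.pbs, the job name, the walltime, the core count, and the executable mycode.x are all placeholders; the account and node-request directives your site requires are deliberately left out.

```shell
# Write a minimal job script to atp_job.pbs (hypothetical file name).
cat > atp_job.pbs <<'EOF'
#!/bin/bash
#PBS -N atp_debug
#PBS -l walltime=00:30:00
# Add your site's account and node/core request directives here.

cd "$PBS_O_WORKDIR"

# Make ATP available to aprun and switch it on.
module load atp
export ATP_ENABLED=1

# Hypothetical executable and core count.
aprun -n 16 ./mycode.x
EOF

# On the system, submit it with: qsub atp_job.pbs
```

The two ATP lines come before the aprun line so that the launched executable sees ATP_ENABLED in its environment.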

If a backtrace file appears in your directory when the run terminates, search through it to find the line on which your code died. If instead the unoptimized code completes successfully, the failure was probably triggered by compiler optimization; lower the optimization level in your normal build so that the compiler does not optimize your code into producing incorrect results.
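
One possible way to locate and search the backtrace output from the command line is sketched below. The file name pattern atpMergedBT*.dot is an assumption based on the names ATP typically uses; check what actually appears in your run directory. The helper function names and the source file name mycode.f90 are hypothetical.

```shell
# List ATP backtrace files in a run directory (default: current directory).
# The atpMergedBT*.dot pattern is an assumption; adjust it to match your files.
list_atp_backtraces() {
    find "${1:-.}" -maxdepth 1 -type f -name 'atpMergedBT*.dot' | sort
}

# Print the lines of those files that mention a given pattern,
# prefixed with the file name and line number.
grep_atp_backtraces() {
    local pattern=$1 dir=${2:-.}
    list_atp_backtraces "$dir" | while IFS= read -r f; do
        grep -Hn -- "$pattern" "$f"
    done
}

# Example: look in the current directory for references to mycode.f90.
list_atp_backtraces .
grep_atp_backtraces 'mycode.f90' . || true
```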

You may also go back and add "-traceback", an Intel compiler flag, to the compilation, which may help produce a traceback as well. This only works when "PrgEnv-intel" is loaded; in that case you can give the flag to the Cray wrappers "cc", "CC", or "ftn", and they will pass it on to the backend Intel compiler.

If you are still unable to find the problem, stepping through the code with a debugger such as DDT or TotalView may be helpful.