• National Institute for Computational Sciences is a UT/ORNL Partnership

Grid Services Documentation




Introduction to Grid Services at NICS

NICS Grid Services uses the Globus GRAM (Grid Resource Allocation and Management) service, version 5.0.1. The required client tool is the Globus Toolkit, version 4.0.8 or 5.0.0 (preferred). The client communicates with a single Grid Service node (grid.nics.utk.edu/grid.nics.xsede.org) that interfaces with all of our available resources. The Local Resource Managers (LRMs) are fork and PBS (examples in the next section). Fork jobs (jobmanager-fork) run only on the grid node, while PBS jobs (jobmanager-pbs) are submitted to the batch queue system. Note that fork jobs have access only to your NICS NFS home directory.

A GRAM5 log file is created when these services are used. NICS has trace logging enabled, which gives the user the maximum amount of information. Examining and understanding this log file is key to troubleshooting remotely executed jobs (see the FAQ and the section on how to read a log file). The log file is located at $HOME/.globus/job/grid.nics.utk.edu/gram_YYYYMMDD.log, where YYYYMMDD is the date (year, month, day).
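As a small illustration of the naming scheme described above, the following Python sketch builds the log path for a given date. The function name gram_log_path and the example home directory are hypothetical; only the directory layout and the gram_YYYYMMDD.log pattern come from the text.

```python
import datetime
import posixpath


def gram_log_path(home, day):
    """Return the GRAM5 log path for a given date under a given home directory."""
    filename = "gram_{}.log".format(day.strftime("%Y%m%d"))
    return posixpath.join(home, ".globus", "job", "grid.nics.utk.edu", filename)


# Hypothetical home directory; on the grid node one would pass the value of $HOME.
print(gram_log_path("/home/user", datetime.date(2011, 3, 16)))
```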

For any questions or to get set up, email help@xsede.org. If you experience a problem, please include a full description of the executed commands, their output, and the relevant log file. To request access to NICS grid services and GRAM5, please include your username, allocation, and the purpose of the access.

    Requirements

To understand NICS's level of support and the requirements of Grid Services, please see the policy.

NICS uses commsh for gateways. Commsh allows only certain executables to be run. To check which ones you can run, print (cat) your commsh configuration file (/etc/commsh.d/commsh.user.conf). The commands you can execute are defined as “DirectAccess”. You can request that executables be added to your configuration by submitting an XSEDE ticket to help@xsede.org with the location of the executable. Additional information on commsh can be found at:

For a Scientific Gateway, developers have to provide end user information to NICS. This information can be packed into a SAML attribute attached to a credential. For more information, please see the following.

globusrun Examples

The following describes the Globus RSL specifications in use at NICS by giving definitions and examples of globusrun jobs. Note that in all examples, any italic type is specific to the user. Before any of these examples can be run, one must retrieve Grid Security Infrastructure (GSI) credentials. Most often the credential is obtained from the MyProxy server, via:


   :~> myproxy-logon -l username
   :~> grid-proxy-info      (use this to show the current credential)
For more info see: Using Globus Tools

One can check that authentication is working with a simple globusrun authentication ping request to the server.

   :~> globusrun -a -r grid.nics.utk.edu:2119/jobmanager-fork 


   GRAM Authentication test successful

Argument descriptions:

    -a - authenticate only

    -o - enables stderr/stdout output to be redirected interactively

    -r - specifies the resource manager, host:port number

There are two types of job managers for the Globus GRAM service at NICS: fork and PBS. Fork jobs run directly on the grid node on which GRAM services are deployed. Jobs that require compute nodes have to use the PBS job manager (more on this later). To make globusrun use resources, one supplies an RSL script. The RSL script can either be stored in an RSL file (proper RSL markup is needed) and passed to globusrun with the -f rsl_filename option, or be given directly on the command line. In these examples, the RSL script is placed on the globusrun command line, enclosed in ‘single quotes’. To perform a simple fork example, execute the following:

   :~> globusrun -o -r grid.nics.utk.edu:2119/jobmanager-fork '&(executable=/bin/hostname)(arguments=--fqdn)'

   grid.nics.utk.edu 
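When RSL strings are generated by a program rather than typed by hand, a small helper can reduce quoting mistakes. The following Python sketch is only an illustration: the function build_rsl is hypothetical, and it assumes that every value may be written as a double-quoted RSL literal with embedded double quotes doubled.

```python
def build_rsl(attributes):
    """Render (name, value) pairs as a single-job RSL string.

    A value may be one string or a list of strings. Every string is wrapped
    in double quotes; an embedded double quote is doubled (an assumption
    about RSL quoting, not something taken from the NICS documentation).
    """
    def quote(text):
        return '"' + str(text).replace('"', '""') + '"'

    parts = []
    for name, value in attributes:
        values = value if isinstance(value, (list, tuple)) else [value]
        parts.append("({}={})".format(name, " ".join(quote(v) for v in values)))
    return "&" + "".join(parts)


# The fork example from the text, built programmatically.
print(build_rsl([("executable", "/bin/hostname"), ("arguments", ["--fqdn"])]))
```

The resulting string can be passed to globusrun inside single quotes, or saved to a file and used with the -f option.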

The job manager also accepts PBS batch submissions via RSL markup. Remember, when requesting supercomputing resources one must specify the amount of the resource they need (number of cores and walltime) and the queue classification. These attributes are needed in the RSL script as well.

:~> globusrun -o -r grid.nics.utk.edu:2119/jobmanager-pbs '&(executable=/bin/hostname)(jobtype=single)(count=0)(maxtime=3) (project=my_allocation)'
aprun12

“jobtype=single” indicates that this is a solitary, non-parallel job. The RSL must also state what the calculation needs. The “maxtime” attribute is the wall clock time for the job, in minutes. “count” is translated directly into the PBS attribute for core count (ncpus for Nautilus and size for Kraken); this is the maximum number of CPU cores the job can run on. For advanced users, this is the MPI task count. “count=0” means that no cores or MPI tasks are needed, provided no other RSL attributes specify a core count, node count, or MPI task layout. Jobs that specify “count=0” become special jobs that run only on a service node, not a compute node. If the executable in the above example were aprun, the job would fail, since aprun requires compute nodes and “count=0” requests none. “count=0” can be used for short serial pre- or post-processing calculations (Caution: if the job uses too much memory, it may slow down, hang, or crash the assigned service node, which is a shared resource). The RSL attribute “project” specifies the name of the project allocation to charge the job to. It is a good idea to always specify the project allocation to be charged. If it is not specified, the user’s default project is selected automatically.
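The rule that aprun cannot be combined with “count=0” is easy to check before submitting. The following Python sketch shows one way to do so; the function name count_zero_conflict is hypothetical and the check covers only this single rule.

```python
import posixpath


def count_zero_conflict(executable, count):
    """Return True when a job requests no compute nodes (count=0) but launches aprun."""
    return count == 0 and posixpath.basename(executable) == "aprun"


print(count_zero_conflict("aprun", 0))          # conflict: aprun needs compute nodes
print(count_zero_conflict("/bin/hostname", 0))  # fine: runs on a service node
print(count_zero_conflict("aprun", 24))         # fine: compute nodes are requested
```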

A PBS script can accept many directives, for example to email the user when the job completes or to declare the name of the job. These PBS directives have equivalent RSL attributes. The example below illustrates some of them.

:~> globusrun -o -r grid.nics.utk.edu:2119/jobmanager-pbs '&(executable="/bin/ls")(arguments="-l" "/tmp")(jobtype=single)(count=0)(maxtime=3)(directory='/tmp')(stderr='$HOME/globusjob.ls.err')(stdout='$HOME/globusjob.ls.out')(emailonabort=yes)(emailonexecution=yes)(emailontermination=yes)(email_address=your_email@address)(project=my_allocation)(name=lsjob)(save_job_description=yes)'

Notice that there are “double quotes” around the executable and around each argument. There are ‘single quotes’ around the values of some other attributes. Environment variables such as $HOME and $TG_CLUSTER_SCRATCH can be used to point to the proper directories if they are contained in single quotes. One can also give the full pathname instead, which does not require single quotes. The stdout and stderr attributes (standard output and standard error, respectively) are the usual PBS scripting options that direct the respective output to a particular file. The directory attribute sets the current working directory for the job. A common mistake is an incorrect directory path (if running a job on a compute node, remember that those nodes only have access to /lustre/scratch/$USER). Make sure to set this directory, as it inserts a change-directory command into the PBS script so that the executable knows where to find its input files if they are not specified directly. One can have the job send notifications by email by using the email attributes (emailonabort, emailonexecution, emailontermination, and emailaddress). For notification of a job failure, use emailonabort. For notification when the job changes from the queued to the running state, use emailonexecution. For notification when the job finishes, whether cleanly or with a failure, use emailontermination. To receive these notifications, one must supply an email address (via the emailaddress attribute). The save_job_description attribute can be useful for debugging, as it saves the RSL attributes and information to $HOME/gram_{globusjobid}.pl.

It is very important to specify all required and optional resources for the job. If they are not specified, errors can occur when the PBS script is created. In the example below, the queue and maxtime vary (see chart):

:~> globusrun -o -r grid.nics.utk.edu:2119/jobmanager-pbs '&(executable="/bin/ls")(directory=/lustre/scratch/user)(jobtype=single) (count=12)(queue=XXX)(maxtime=YYY)(project=my_allocation)(name=titleofjob)'
globusrun options        PBS interprets as        qstat output shows
maxtime   queue          walltime  qclass         walltime  qclass
200       n/a            200       n/a            3:20      small
1500      n/a            1500      n/a            25:00     capability
500       small          500       small          8:20      small
500       medium         500       medium         8:20      medium
1500      capability     1500      capability     25:00     capability
n/a       large          n/a       large          1:00      large

Note that maxtime and walltime (for the auto-generated PBS script) are in minutes. The output of qstat shows the walltime converted to hours:minutes format. If the queue is not specified, the job is submitted to either the small or the capability queue, depending on its length. If maxtime is not specified, it defaults to 1 hour. It is good practice to specify both the queue and maxtime/walltime. The PBS queue classes are configured for different job sizes (core counts) and job lengths; if these conditions are not met, the job will fail. One can look this information up with “qstat -q” or see the additional information on NICS PBS queues.
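The conversion between the minutes given to maxtime and the hours:minutes value shown by qstat is simple arithmetic. The following Python sketch reproduces it for the values in the chart; the function name walltime_from_minutes is hypothetical.

```python
def walltime_from_minutes(maxtime):
    """Convert a maxtime value in minutes to the hours:minutes form shown by qstat."""
    hours, minutes = divmod(int(maxtime), 60)
    return "{}:{:02d}".format(hours, minutes)


for maxtime in (200, 500, 1500):
    print(maxtime, "->", walltime_from_minutes(maxtime))
# 200 -> 3:20
# 500 -> 8:20
# 1500 -> 25:00
```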

For a parallel calculation on Kraken using a PBS batch submission, the execution line is “aprun -n #_of_cores executable”. Therefore, the executable should be aprun, and its arguments are “-n”, the core count, and, as the last argument, the MPI executable. For example,

:~> globusrun -o -r grid.nics.utk.edu:2119/jobmanager-pbs '&(executable=aprun)(arguments="-n" "24" "helloworld") (directory=/lustre/scratch/user)(jobtype=single)(count=24)(maxtime=10) (project=my_allocation)'
 

The auto-generated Kraken PBS batch script will contain "aprun -n 24 /lustre/scratch/user/helloworld". It is important that the RSL “count” attribute specifies the same value as the aprun “-n” argument, as in the above example.
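A mismatch between “count” and the aprun “-n” value can be caught before submission. The following Python sketch compares the two; the helper names aprun_tasks and count_matches_aprun are hypothetical, and only the “-n value” form used in the example above is handled.

```python
def aprun_tasks(arguments):
    """Return the integer that follows "-n" in an aprun argument list, or None."""
    for flag, value in zip(arguments, arguments[1:]):
        if flag == "-n":
            return int(value)
    return None


def count_matches_aprun(count, arguments):
    """Return True when the RSL count equals the aprun -n value."""
    return aprun_tasks(arguments) == count


print(count_matches_aprun(24, ["-n", "24", "helloworld"]))  # True
print(count_matches_aprun(12, ["-n", "24", "helloworld"]))  # False
```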

Note: all of these examples have been run on Kraken. Translating them to run similar jobs on Nautilus is straightforward once one understands a few fundamentals. For example, the parallel launcher on Nautilus is mpiexec, which takes "-np" (instead of "-n") as the number-of-cores argument. Before porting to Nautilus, it is recommended to run a few test calculations there to become familiar with the system; please see details here. Note that our GRAM5 service does not support fork jobs on Nautilus; it is configured only to allow remote execution of jobs via qsub and data staging to local Nautilus file systems via GridFTP.

Alternatively, one can use jobtype=mpi, in which case "aprun -n number_of_cores" is placed in the submission script automatically. In this case, one only needs to specify the MPI executable and its arguments.

:~> globusrun -o -r grid.nics.utk.edu:2119/jobmanager-pbs '&(directory=/lustre/scratch/user)(executable=/lustre/scratch/user/helloworld)(jobtype=mpi)(count=12)(project=my_allocation)'

Hello world from process 8 of 12
Hello world from process 10 of 12
Hello world from process 7 of 12
Hello world from process 3 of 12
Hello world from process 11 of 12
Hello world from process 4 of 12
Hello world from process 5 of 12
Hello world from process 1 of 12
Hello world from process 0 of 12
Hello world from process 9 of 12
Hello world from process 2 of 12
Hello world from process 6 of 12

Application 5579880 resources: utime 0, stime 0

Jobtype=multiple is implemented at NICS for parallel applications that do not depend on the Message Passing Interface (MPI). The NICS grid software (GRAM5) is aware of the different resources and, for jobtype=multiple, launches the specified application using the method that makes the most sense on that resource for running the executable as a non-MPI application. Below is a simple example of a serial Fortran program running multiple instances with jobtype=multiple.

:~> cat serial.f 
	implicit none
	write(6,*)"this is a serial program"
	end

:~> globusrun -o -r grid.nics.utk.edu:2119/jobmanager-pbs '&(directory=/lustre/scratch/user)(executable=/lustre/scratch/user/serial.x)(jobtype=multiple)(count=12)(project=my_allocation)'

 this is a serial program
 this is a serial program
 this is a serial program
 this is a serial program
 this is a serial program
 this is a serial program
 this is a serial program
 this is a serial program
 this is a serial program
 this is a serial program
 this is a serial program
 this is a serial program

Application 5691827 resources: utime 0, stime 0

However, on systems such as the Cray XT5, jobtype=multiple can be used (though this is not recommended) for both MPI and non-MPI parallel applications, since the application launcher supports both MPI and non-MPI compiled executables. Jobtype=multiple is implemented by using the native resource application launcher to launch the parallel application specified by the RSL attribute (executable={executable}). Some systems, such as the NICS SGI UV system Nautilus, require an MPI launching program (mpiexec/mpirun) for MPI applications (jobtype=mpi) and have no system application launcher for non-MPI applications (jobtype=multiple). On Nautilus, non-MPI applications are simply executed multiple times in the background. For Nautilus, jobtype=multiple is implemented only for non-MPI applications using the background method, and attempting to run an MPI application with a jobtype=multiple specification will fail. Running MPI applications with jobtype=mpi is the recommended method on Nautilus.
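To make the "background method" concrete, the following Python sketch starts several copies of the same command at once and waits for all of them. It is only an analogy for the behaviour described above, not the NICS implementation; the function run_copies is hypothetical, and a Python one-liner stands in for the serial executable.

```python
import subprocess
import sys


def run_copies(command, count):
    """Start `count` copies of `command` concurrently and return their exit codes."""
    processes = [subprocess.Popen(command) for _ in range(count)]
    return [process.wait() for process in processes]


# Stand-in for a serial executable such as serial.x.
command = [sys.executable, "-c", "print('this is a serial program')"]
exit_codes = run_copies(command, 3)
print(exit_codes)
```

Because all copies share the same standard output, the order of their messages is not defined.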

NICS RSL Attributes Table

project
    Description: The NICS charge code, your allocation
    Default value: User's default NICS project
    Required: No, but highly suggested
    LRM usage: PBS only
    Dependency: None

executable
    Description: Executable to run
    Default value: No default
    Required: Yes
    LRM usage: PBS and fork
    Dependency: NICS' GRAM5 does not check that this exists

arguments
    Description: Arguments to the executable
    Default value: No default
    Required: No
    LRM usage: PBS and fork
    Dependency: For use with executable

directory
    Description: Set the current working directory for the job
    Default value: $HOME
    Required: No
    LRM usage: PBS and fork
    Dependency: None

count
    Description: The number of tasks or MPI tasks to initiate
    Default value: None
    Required: Required for PBS jobs if hostcount is not specified
    LRM usage: PBS only
    Dependency: Can be used in combination with hostcount, mpitasknode, mpitasksocket, threads. Count can be zero, which does not allocate any compute node resources and runs the job on the service node and not on a compute node. If an aprun is specified with count or hostcount=0 it will fail.

hostcount
    Description: The number of nodes
    Default value: None
    Required: Required for PBS jobs if count is not specified
    LRM usage: PBS only
    Dependency: Can be used in combination with count, mpitasknode, mpitasksocket, threads. Sets the PBS size= value. For Kraken, size equals twelve times this value. Hostcount can be zero, which does not allocate any compute node resources and runs the job on the service node and not on a compute node. If an aprun is specified with count or hostcount=0 it will fail.

jobtype
    Description: The job type. Values are single, mpi, or multiple.
    Default value: No default for PBS jobs; single for fork jobs
    Required: Required for PBS jobs
    LRM usage: PBS only
    Dependency: If set to mpi, then GRAM will generate an appropriate aprun command using the specified executable to run on compute nodes.
        Example:
            (executable=/lustre/scratch/user/mpiprogram)
            (jobtype=mpi)
            (project=TG-MYPROJ)
            (count=24)
            (hostcount=2)
        This will generate in the batch script "aprun -n 24 /lustre/scratch/user/mpiprogram".
        If set to single, then the end user is given the most flexibility and can specify the executable, which could be aprun to run a parallel application on the compute nodes, or something else. If set to single and aprun is specified as the executable, then use the appropriate aprun arguments necessary to launch a parallel application, with the binary to launch as the last argument.
        Example:
            (executable=aprun)
            (arguments="-n" "24" "/lustre/scratch/user/mpiprogram")
            (jobtype=single)
            (project=TG-MYPROJ)
            (count=24)
            (hostcount=2)

maxtime
    Description: The maximum walltime
    Default value: No default
    Required: Not required, but strongly encouraged
    LRM usage: PBS and fork
    Dependency: None

maxmemory
    Description: The maximum memory for the job
    Default value: No default
    Required: No, but strongly encouraged
    LRM usage: PBS and fork
    Dependency: None

queue
    Description: The PBS queue
    Default value: Kraken's default queue or the pbsserver's queue
    Required: Not required
    LRM usage: PBS only
    Dependency: Depends on the pbsserver value specified. The queue specified must exist on the system identified by the pbsserver value.

stderr
    Description: The file for standard error
    Default value: $HOME/.globus/job/grid.nics.utk.edu/{globusjobdir}/stderr
    Required: Not required
    LRM usage: PBS and fork
    Dependency: None

stdout
    Description: The file for standard output
    Default value: $HOME/.globus/job/grid.nics.utk.edu/{globusjobdir}/stdout
    Required: Not required
    LRM usage: PBS and fork
    Dependency: None

name
    Description: The job name
    Default value: jobid
    Required: Not required
    LRM usage: PBS and fork
    Dependency: If name is specified, the PBS defaults for stderr and stdout filenames are overruled by GRAM5.

emailonabort
    Description: Send email on job abort
    Default value: No default
    Required: Not required
    LRM usage: PBS only
    Dependency: Email address

emailontermination
    Description: Send email on job termination
    Default value: No default
    Required: Not required
    LRM usage: PBS only
    Dependency: Email address

emailaddress
    Description: Send email to this address
    Default value: User's email address in NICS LDAP
    Required: Not required
    LRM usage: PBS only
    Dependency: emailonabort, emailontermination

savejobdescription
    Description: Save the RSL attributes and info to $HOME/gram_{globusjobid}.pl
    Default value: None
    Required: No
    LRM usage: PBS and fork
    Dependency: None

pbsserver
    Description: The NICS PBS service to submit the job to
    Default value: pbs.kraken.nics.xsede.org
    Required: No
    LRM usage: PBS only
    Dependency: queue, count, hostcount, mpitasknode, mpitasksocket, threads all depend on the target resource

Obtaining the NICS job ID

One can retrieve the NICS job ID of their GRAM5 submitted job via the getnicsjobid script. In order to access this script one should ssh to krakenpf1. Here are the steps.

globusrun -b -r grid.nics.utk.edu:2119/jobmanager-pbs '&(executable=aprun)(arguments="-n" "144" "mpi_test")(directory=/lustre/scratch/mmcken6)(jobtype=single)(count=144)(maxtime=30)(project=UT-SUPPORT)'
globus_gram_client_callback_allow successful
GRAM Job submission successful
https://grid.nics.utk.edu:50386/16145858171099241891/1367557268601583767/
GLOBUS_GRAM_PROTOCOL_JOB_STATE_PENDING
Now look at the port and Globus GRAM job ID in the URL printed above. The getnicsjobid script is located at /sw/xt/globus/4.0.8/binary/bin/getnicsjobid; with the globus module loaded, one only needs to type the script's name. Provide the script with the port and Globus GRAM job ID.
:~> getnicsjobid https://grid.nics.utk.edu:50386/16145858171099241891/1367557268601583767/
1207773.nid00016

Note, the information in the gold database is updated periodically at 5am, 11am, 5pm and 11pm.
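The job contact URL printed by globusrun carries the port and the two numeric parts of the GRAM job ID. The following Python sketch separates them; the function split_job_contact is hypothetical. Joining the two parts with a dot gives the form of the per-job directory name that appears in the GRAM5 log excerpts in this document.

```python
from urllib.parse import urlparse


def split_job_contact(contact):
    """Split a GRAM job contact URL into host, port, and the two numeric ID parts."""
    parsed = urlparse(contact)
    first, second = parsed.path.strip("/").split("/")
    return parsed.hostname, parsed.port, first, second


host, port, first, second = split_job_contact(
    "https://grid.nics.utk.edu:50386/16145858171099241891/1367557268601583767/"
)
print(host, port)
print(first + "." + second)
```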

Frequently Asked Questions (FAQ)

Below are descriptions and fixes of potential errors that may be encountered when trying to use NICS grid services.

To begin troubleshooting, note the location of the GRAM server-side log file: $HOME/.globus/job/grid.nics.utk.edu/gram_YYYYMMDD.log. Each line of this file contains “id=#####”, which is the GRAM ID number. Each Globus submission produces log output with its own unique ID number. Subsequent log messages are appended to that day's GRAM log file. Trace logging is turned on at NICS to give end users the maximum amount of information; however, this puts a lot of extra information in the log file, which can obscure the real reason for an error.

Please contact us (email: help@xsede.org) if you encounter issues not described here.
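Since each log line is a sequence of key=value fields, it can be split into a dictionary for easier filtering by id or event. The following Python sketch uses the standard shlex module for this; the function parse_gram_log_line is hypothetical, and the approach assumes that values containing spaces are always enclosed in double quotes, as in the excerpts shown in this document.

```python
import shlex


def parse_gram_log_line(line):
    """Split one GRAM5 log line into a dict of its key=value fields."""
    fields = {}
    for token in shlex.split(line):
        key, separator, value = token.partition("=")
        if separator:
            fields[key] = value
    return fields


line = (
    "ts=2011-02-17T16:42:42.597173Z id=7092 event=gram.script_read.info "
    "level=TRACE gramid=/16145884300446296816/1367557268601603989/ "
    'response="GRAM_SCRIPT_ERROR:165"'
)
fields = parse_gram_log_line(line)
print(fields["id"], fields["event"], fields["response"])
```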

1. GRAM Job failed because the job manager detected an invalid script status (error code 25)

When you see “error code 25” for a Globus job, it is a fairly generic error message that can have several causes, but it is mostly related to NICS's use of commsh to authorize users and the “executables” that users specify. Check the GRAM log file for a line such as:

ts=2011-02-17T16:42:42.597173Z id=7092 event=gram.script_read.info level=TRACE gramid=/16145884300446296816/1367557268601603989/ response="GRAM_SCRIPT_ERROR:165"

If you see “GRAM_SCRIPT_ERROR:165” in the log file for the job you just submitted this means that either (a) you are not authorized to run GRAM5 jobs at NICS or (b) the executable you specified is not authorized to be run by you at NICS.

(a) If you were not preauthorized to run GRAM5 jobs at NICS, you will need to submit a ticket to help@xsede.org requesting access, along with a justification for why you need remote access to a NICS resource.

(b) If the executable you specified is not authorized, first check its spelling to make sure there is no typo. If there is no spelling error, submit a ticket requesting that the new executable be allowed to run via GRAM5 services under your account at NICS. See the accounts portion of this documentation for details on how to obtain access to GRAM5. To check what you can run, print (cat) your commsh configuration file (/etc/commsh.d/commsh.user.conf). The commands you can execute are defined as “DirectAccess”. Is the executable used in your globusrun command written exactly as it appears in your "DirectAccess" commsh configuration? If not, this is why the job failed. If the executable is not listed, please contact us about any new executable you want to run on NICS resources via GRAM services. In this email, include the location of the executable so that we can review it.

2. GRAM Job submission failed because data transfer to the server failed (error code 10)

When you run a Globus job for the first time using the NICS GRAM5 service you may see the above error message.

This means that the required log directory $HOME/.globus/job/grid.nics.utk.edu/ does not exist, so the log file cannot be written. Use of GRAM5 services at NICS has to be preauthorized. When you are authorized, a NICS administrator creates this GRAM5 log directory for you and enables you in a configuration file to submit jobs via GRAM5 services. You can create this directory on your own, but if you are not authorized in the configuration file, you will stop getting “error code 10” and start getting “error code 25” with “GRAM_SCRIPT_ERROR:165” in the Globus log file. In other words, one reason for this error is that you are not yet authorized (see the “error code 25” FAQ item for other authorization-related causes).

3. GRAM Job submission failed because one of the RSL parameters is not supported (error code 1)

This error corresponds to a misspelled RSL attribute, for example agruments instead of arguments:

:~> globusrun -o -r grid.nics.utk.edu:2119/jobmanager-pbs '&(executable=aprun)(agruments="-n" "12" "./helloworld") (jobType=single)(count=12)(maxtime=5)(project=my_allocation)(name=myjob) '
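Misspelled attribute names can be caught before submission by comparing them with the attributes named in this document. The following Python sketch does this with the standard difflib module; the function suggest_attribute is hypothetical. It assumes that case and underscores in attribute names are ignored, which is inferred from this document writing both email_address and emailaddress, and both jobType and jobtype.

```python
import difflib

# Attribute names that appear in this document.
KNOWN_ATTRIBUTES = [
    "project", "executable", "arguments", "directory", "count", "hostcount",
    "jobtype", "maxtime", "maxmemory", "queue", "stderr", "stdout", "name",
    "emailonabort", "emailonexecution", "emailontermination", "emailaddress",
    "savejobdescription", "pbsserver", "mpitasknode", "mpitasksocket", "threads",
]


def canonical(name):
    """Lower-case a name and drop underscores (assumed to be insignificant)."""
    return name.replace("_", "").lower()


def suggest_attribute(name):
    """Return None for a known attribute, else a list with the closest known name."""
    key = canonical(name)
    if key in KNOWN_ATTRIBUTES:
        return None
    return difflib.get_close_matches(key, KNOWN_ATTRIBUTES, n=1)


print(suggest_attribute("agruments"))  # ['arguments']
print(suggest_attribute("jobType"))    # None
```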

4. GRAM Job failed because the job failed when the job manager attempted to run it (error code 17)

This error message is fairly generic and can have several causes, but NICS has seen it in connection with items that cause a batch job script to fail at qsub time, or with missing RSL items that NICS requires. You will have to do a bit of detective work and look through the log file to determine what happened. Trace logging is turned on at NICS to give end users the maximum amount of information; however, this puts a lot of extra information in the log file, which can obscure the real reason for an error. Search (grep) the log file for the following items, which can help narrow down the search for the real culprit.

gram.read_request.info

gram.new_request.end

failure_code=

For instance, in the below log see the qsub error message just after the “qsub returned” message. In this case count was not set or count was not a multiple of twelve as required for Kraken.

ts=2011-02-08T14:26:56.523287Z id=10570 event=gram.script_read.info level=TRACE gramid=/16145843589154690171/1367557268601571993/ response="GRAM_SCRIPT_LOG:msg=\"submitting job -- /opt/torque/default/bin/qsub < /nics/a/home/$USER/.globus/job/grid.nics.utk.edu/16145843589154690171.1367557268601571993/scheduler_pbs_job_script 2>/nics/a/home/$USER/.globus/job/grid.nics.utk.edu/16145843589154690171.1367557268601571993/scheduler_pbs_submit_stderr\""
ts=2011-02-08T14:26:56.679460Z id=10570 event=gram.script_read.info level=TRACE gramid=/16145843589154690171/1367557268601571993/ response="GRAM_SCRIPT_LOG:msg=\"qsub returned \""
ts=2011-02-08T14:26:56.679562Z id=10570 event=gram.script_read.info level=TRACE gramid=/16145843589154690171/1367557268601571993/ response="GRAM_SCRIPT_LOG:msg=\"qsub stderr \\nNotice: Your job was NOT submitted \\n\\n  Core requests on Kraken must be a multiple of twelve.  You have requested \\n  an invalid number of cores ( 1 ). Please resubmit the \\n  job requesting an appropriate number of cores. \\n\\n  Please contact help@xsede.org if you need assistance. \\n\\nqsub: Your job has been administratively rejected by the queueing system.\\nqsub: There may be a more detailed explanation prior to this notice.\\n\""
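The search described in this item can also be scripted. The following Python sketch keeps only the log lines that contain one of the search items listed above, plus the "qsub stderr" marker seen in the excerpt. The function find_clues is hypothetical, and the sample lines are simplified stand-ins rather than real log output.

```python
CLUES = (
    "gram.read_request.info",
    "gram.new_request.end",
    "failure_code=",
    "qsub stderr",
)


def find_clues(lines):
    """Return the log lines that contain any of the search items."""
    return [line for line in lines if any(clue in line for clue in CLUES)]


# Simplified stand-in lines (not real log output).
sample = [
    "id=10570 event=gram.read_request.start level=TRACE fd=16",
    "id=10570 event=gram.read_request.info level=TRACE request_string=...",
    "id=10570 event=gram.script_read.info level=TRACE response=qsub stderr ...",
    "id=10570 event=gram.make_job_dir.end level=TRACE status=0",
]
for line in find_clues(sample):
    print(line)
```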

5. Error code 73

Error code 73 may occur when specifying stdout and stderr. It occurs because the grid node cannot ‘touch’ the file to create it during GRAM5 job startup. To reproduce this error:

~> globusrun -b -r grid.nics.utk.edu:2119/jobmanager-pbs '&(directory=/gpfs/medusa/user)(executable=/gpfs/medusa/user/test.x)(jobtype=multiple)(count=8)(maxtime=2)(project=my_allocation)(pbsserver=nautilus)(maxmemory=1000)(stdout=/gpfs/medusa/user/the.out)(stderr=/gpfs/medusa/user/the.err)'
globus_gram_client_callback_allow successful
GRAM Job submission failed because the job manager failed to open stdout (error code 73)

This example was taken from the time when Nautilus had GPFS. The error could still occur on Lustre because of file locking.

A workaround is to redirect stdout/stderr in the “arguments” of the executable.

globusrun -b -r grid.nics.utk.edu:2119/jobmanager-pbs '&(directory=/gpfs/medusa/user)(executable=/gpfs/medusa/user/test.x)(arguments=">" "/gpfs/medusa/user/works.out" "2>" "/gpfs/medusa/user/works.err")(jobtype=multiple)(count=8)(maxtime=2)(project=my_allocation)(pbsserver=nautilus)(maxmemory=1000)'
globus_gram_client_callback_allow successful
GRAM Job submission successful
https://grid.nics.utk.edu:50383/16145737154125279391/1367557268601595050/
GLOBUS_GRAM_PROTOCOL_JOB_STATE_PENDING
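When globusrun is driven from a script, the error code can be pulled out of its message and matched to the FAQ items above. The following Python sketch shows one way to do so; the function error_code and the FAQ_HINTS table are hypothetical, and the hints are short paraphrases of the FAQ text, not official messages.

```python
import re

# Short paraphrases of the FAQ items in this document, keyed by error code.
FAQ_HINTS = {
    1: "an RSL attribute name is misspelled",
    10: "the GRAM5 log directory is missing",
    17: "the batch script failed at qsub time or a required RSL item is missing",
    25: "commsh did not authorize the user or the executable",
    73: "the grid node could not create the stdout or stderr file",
}


def error_code(message):
    """Extract the number from '(error code N)' in a globusrun message, or None."""
    match = re.search(r"\(error code (\d+)\)", message)
    return int(match.group(1)) if match else None


message = "GRAM Job submission failed because the job manager failed to open stdout (error code 73)"
code = error_code(message)
print(code, "->", FAQ_HINTS.get(code, "not covered in this FAQ"))
```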

6. Jobtype = Multiple warning

Jobtype=multiple gives an incorrect number of output lines.

Since this jobtype runs multiple independent executables on the node, all of them write to the same standard output and error files. With unrestricted write access to the same file, processes can overwrite each other's output, so some of the output from the executables may be lost. If this is not the desired result, a possible solution is to rewrite the application so that each instance writes its standard output and error to files specified within the executable itself.

7. myproxy-logon authentication fails?

~> myproxy-logon 
Error authenticating: GSS Major Status: Authentication Failed
GSS Minor Status Error Chain:
globus_gss_assist: Error during context initialization
globus_gsi_callback_module: Could not verify credential
globus_gsi_callback_module: Could not verify credential
globus_gsi_callback_module: Invalid CRL: The available CRL has expired

This is due to outdated certificate data: as the last line of the error shows, the available certificate revocation list (CRL) has expired. In your home directory there is a hidden ".globus/certificates" directory. Simply move this directory aside, for example (in your $HOME):

~> mv .globus/certificates .globus/certificates_old
Then reissue your myproxy-logon command.

How to read a GRAM5 log file

Below is an annotated GRAM5 log file. The comments are marked with ##. Many lines have been removed to make the GRAM5 workflow easier to follow.

ts=2011-03-16T18:31:14.082779Z id=32310 event=gram.register_proxy_timeout.end level=TRACE status=0 lifetime=42956 timeout=600 
ts=2011-03-16T18:31:14.082845Z id=32310 event=gram.startup_socket_init.lock.start level=TRACE path="/nics/d/home/mmcken6/.globus/job/grid.nics.utk.edu/pbs.c54e6daf.lock"
## Notice the change of id numbers, below is the new globus job
ts=2011-03-16T19:22:30.283747Z id=21298 event=gram.validation_record.info level=TRACE attribute=threads description="Number of (OpenMP) threads for use with aprun -d option" required_when=0 default_when=0 default_value="" enumerated_values="" 
ts=2011-03-16T19:22:30.285598Z id=21298 event=gram.validation_record.info level=TRACE attribute=mpitasksocket description="mpi tasks per socket value for use with aprun -S option" required_when=0 default_when=0 default_value="" enumerated_values="" 
ts=2011-03-16T19:22:30.285740Z id=21298 event=gram.validation_record.info level=TRACE attribute=stdoutposition description="Specifies where in the file remote output streaming should be restarted from. Must be 0." required_when=0 default_when=0 default_value="" enumerated_values="0" 
## So far all this output is defining all known attributes - set or not set
ts=2011-03-16T19:22:30.285750Z id=21298 event=gram.validation_record.info level=TRACE attribute=restart description="Start a new job manager, but instead of submitting a new job, start managing an existing job. The job manager will search for the job state file created by the original job manager. If it finds the file and successfully reads it, it will become the new manager of the job, sending callbacks on status and streaming stdout/err if appropriate. It will fail if it detects that the old jobmanager is still alive (via a timestamp in the state file). If stdout or stderr was being streamed over the network, new stdout and stderr attributes can be specified in the restart RSL and the jobmanager will stream to the new locations (useful when output is going to a GASS server started by the client that's listening on a dynamic port, and the client was restarted). The new job manager will return a new contact string that should be used to communicate with it. If a jobmanager is restarted multiple times, any of the previous contact strings can be given for the restart attribute." required_when=2 default_when=0 default_value="" enumerated_values="" 
ts=2011-03-16T19:22:30.377773Z id=21298 event=gram.read_request.start level=TRACE fd=16
ts=2011-03-16T19:22:30.377801Z id=21298 event=gram.read_request.info level=TRACE request_string="protocol-version: 2\r\njob-state-mask: 1048575\r\ncallback-url: https://krakenpf6.nics.utk.edu:37491/\r\nrsl: \"&(\\\"rsl_substitution\\\" = (\\\"GLOBUSRUN_GASS_URL\\\" \\\"https://krakenpf6.nics.utk.edu:37608\\\" ) )(\\\"stderr\\\" = $(\\\"GLOBUSRUN_GASS_URL\\\") # \\\"/dev/stderr\\\" )(\\\"stdout\\\" = $(\\\"GLOBUSRUN_GASS_URL\\\") # \\\"/dev/stdout\\\" )(\\\"executable\\\" = \\\"/lustre/scratch/mmcken6/test\\\" )(\\\"directory\\\" = \\\"/lustre/scratch/mmcken6\\\" )(\\\"jobtype\\\" = \\\"mpi\\\" )(\\\"maxtime\\\" = \\\"5\\\" )(\\\"project\\\" = \\\"UT-SUPPORT\\\" )(\\\"queue\\\" = \\\"small\\\" )(\\\"name\\\" = \\\"pjob\\\" )(\\\"count\\\" = \\\"12\\\" )\"\r\n" 
## The above line contains the RSL of the globusrun request as received by the job manager
## For this case, it was globusrun -o -r grid.nics.utk.edu:2119/jobmanager-pbs '&(directory=/lustre/scratch/mmcken6)(executable=/lustre/scratch/mmcken6/test)(jobtype=mpi)(maxtime=5)(count=12)(project=UT-SUPPORT)(queue=small)(name=pjob)'
ts=2011-03-16T19:22:30.377866Z id=21298 event=gram.read_request.end level=TRACE status=0 
## Below is the creation of the temporary job directory in $HOME/.globus/job/grid.nics.utk.edu
ts=2011-03-16T19:22:30.378107Z id=21298 event=gram.make_job_dir.start level=TRACE gramid=/16145688682996727996/1367557268601574807/ 
ts=2011-03-16T19:22:30.383682Z id=21298 event=gram.make_job_dir.end level=TRACE gramid=/16145688682996727996/1367557268601574807/ status=0 path=/nics/d/home/mmcken6/.globus/job/grid.nics.utk.edu/16145688682996727996.1367557268601574807 
ts=2011-03-16T19:22:30.383897Z id=21298 event=gram.gass_cache_init.start level=TRACE gramid=/16145688682996727996/1367557268601574807/ 
ts=2011-03-16T19:22:30.383905Z id=21298 event=gram.gass_cache_init.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ path=/nics/d/home/mmcken6/.globus/.gass_cache
ts=2011-03-16T19:22:30.389082Z id=21298 event=gram.gass_cache_init.end level=TRACE gramid=/16145688682996727996/1367557268601574807/ status=0 path=/nics/d/home/mmcken6/.globus/.gass_cache
ts=2011-03-16T19:22:30.389167Z id=21298 event=gram.validate_rsl.info level=TRACE msg="Inserting default RSL for attribute" attribute=dryrun default="no" 
ts=2011-03-16T19:22:30.520018Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_LOG:msg=\"submitting job -- /opt/torque/default/bin/qsub < /nics/d/home/mmcken6/.globus/job/grid.nics.utk.edu/16145688682996727996.1367557268601574807/scheduler_pbs_job_script 2>/nics/d/home/mmcken6/.globus/job/grid.nics.utk.edu/16145688682996727996.1367557268601574807/scheduler_pbs_submit_stderr\"" 
ts=2011-03-16T19:22:32.466847Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_LOG:msg=\"job submission successful, setting state to PENDING\""
## Successfully submitted the job to the PBS server
ts=2011-03-16T19:22:32.467002Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_JOB_ID:1104367.nid00016" 
ts=2011-03-16T19:22:32.467485Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_JOB_STATE:1" 
ts=2011-03-16T19:22:32.546984Z id=21298 event=gram.remove_reference.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ refcount=1 reason="Job state callbacks" 
ts=2011-03-16T19:22:32.546989Z id=21298 event=gram.remove_reference.end level=TRACE gramid=/16145688682996727996/1367557268601574807/ status=0  
ts=2011-03-16T19:22:32.644843Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_LOG:msg=\"qstat job_state line is:     job_state = Q\"" 
## Checking the PBS job state: the job is in the queue (job_state = Q)
ts=2011-03-16T19:22:32.644891Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_JOB_STATE:1" 
ts=2011-03-16T19:22:32.644898Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="" 
ts=2011-03-16T19:22:42.760278Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_LOG:msg=\"qstat job_state line is:     job_state = Q\""
## Checking the submitted job's status again: it is still queued
ts=2011-03-16T19:22:42.760324Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_JOB_STATE:1" 
ts=2011-03-16T19:22:52.870068Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_LOG:msg=\"qstat job_state line is:     job_state = R\"" 
## The job is now running 
ts=2011-03-16T19:22:52.870117Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_JOB_STATE:2" 
ts=2011-03-16T19:23:02.879038Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_LOG:msg=\"polling job 1104367.nid00016\"" 
ts=2011-03-16T19:23:03.048010Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_LOG:msg=\"qstat job_state line is:     job_state = R\"" 
ts=2011-03-16T19:23:03.048128Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_JOB_STATE:2" 
ts=2011-03-16T19:23:13.054467Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_LOG:msg=\"polling job 1104367.nid00016\"" 
ts=2011-03-16T19:23:13.155589Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_LOG:msg=\"qstat job_state line is:     job_state = R\"" 
## The job is still running
ts=2011-03-16T19:23:13.155709Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_JOB_STATE:2" 
16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_LOG:msg=\"qstat job_state line is:     job_state = R\"" 
ts=2011-03-16T19:23:33.380681Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_JOB_STATE:2" 
ts=2011-03-16T19:23:43.386063Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_LOG:msg=\"polling job 1104367.nid00016\"" 
ts=2011-03-16T19:23:43.486973Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_LOG:msg=\"qstat job_state line is:     job_state = R\"" 
ts=2011-03-16T19:23:43.487093Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_JOB_STATE:2" 
ts=2011-03-16T19:25:54.844361Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_LOG:msg=\"polling job 1104367.nid00016\"" 
ts=2011-03-16T19:25:54.945012Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_LOG:msg=\"qstat job_state line is:     job_state = R\"" 
ts=2011-03-16T19:25:54.945131Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_JOB_STATE:2" 
ts=2011-03-16T19:26:15.056354Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_LOG:msg=\"polling job 1104367.nid00016\"" 
ts=2011-03-16T19:26:15.158406Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_LOG:msg=\"qstat job_state line is:     job_state = R\"" 
## The job is still running
ts=2011-03-16T19:26:15.158521Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_JOB_STATE:2" 
ts=2011-03-16T19:26:35.284984Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_LOG:msg=\"polling job 1104367.nid00016\"" 
ts=2011-03-16T19:26:35.392014Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_LOG:msg=\"qstat job_state line is:     job_state = C\""
## Checking the job's status: it is complete (job_state = C)
ts=2011-03-16T19:26:35.392138Z id=21298 event=gram.script_read.info level=TRACE gramid=/16145688682996727996/1367557268601574807/ response="GRAM_SCRIPT_JOB_STATE:8" 
ts=2011-03-16T19:26:35.392153Z id=21298 event=gram.set_job_status.start level=TRACE gramid=/16145688682996727996/1367557268601574807/ state=8 failure_code=0 
ts=2011-03-16T19:26:35.743850Z id=21298 event=gram.remove_reference.info level=TRACE msg="No jobs remain, setting job manager termination timer" 
ts=2011-03-16T19:26:35.743925Z id=21298 event=gram.remove_reference.end level=TRACE gramid=/16145688682996727996/1367557268601574807/ status=0  
## Removing the temporary job directory in $HOME/.globus/job/grid.nics.utk.edu
ts=2011-03-16T19:27:35.744250Z id=21298 event=gram.grace_period_expired.start level=TRACE 
ts=2011-03-16T19:27:35.744302Z id=21298 event=gram.grace_period_expired.end level=TRACE status=0 terminating=true
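
When a log is long, it can help to reduce it to the points where the job state changes. The following is a minimal Python sketch of that idea, not a NICS-provided tool. It assumes each log entry is a single line of key=value fields as in the example above, that annotation lines start with ##, and that the numeric codes map to GRAM states as 1 = PENDING, 2 = ACTIVE, 4 = FAILED, 8 = DONE. The codes 1, 2, and 8 match what the log above shows next to the Q, R, and C states from qstat; the name for code 4 is an assumption based on the GRAM protocol. The function names and the sample list are hypothetical.

```python
import re

# Assumed mapping of GRAM job state codes to names (see the note above).
JOB_STATES = {1: "PENDING", 2: "ACTIVE", 4: "FAILED", 8: "DONE"}

# A field is key=value, where the value is either a double-quoted string
# (which may contain backslash-escaped quotes) or a run of non-space characters.
FIELD_RE = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S*)')
STATE_RE = re.compile(r'GRAM_SCRIPT_JOB_STATE:(\d+)')


def parse_line(line):
    """Return the key=value fields of one log line as a dict.

    Returns None for blank lines and for '##' annotation lines.
    """
    line = line.strip()
    if not line or line.startswith("##"):
        return None
    fields = {}
    for key, value in FIELD_RE.findall(line):
        # Strip the surrounding quotes from quoted values.
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[key] = value
    return fields


def state_transitions(lines):
    """Return (timestamp, state name) pairs, one per change of job state."""
    transitions = []
    for line in lines:
        fields = parse_line(line)
        if not fields:
            continue
        match = STATE_RE.search(fields.get("response", ""))
        if not match:
            continue
        code = int(match.group(1))
        name = JOB_STATES.get(code, "UNKNOWN(%d)" % code)
        # Record the state only when it differs from the previous one.
        if not transitions or transitions[-1][1] != name:
            transitions.append((fields.get("ts"), name))
    return transitions


# A few lines copied from the log above, used as sample input.
GRAMID = "gramid=/16145688682996727996/1367557268601574807/"
SAMPLE_LINES = [
    'ts=2011-03-16T19:22:32.467485Z id=21298 event=gram.script_read.info level=TRACE '
    + GRAMID + ' response="GRAM_SCRIPT_JOB_STATE:1" ',
    r'ts=2011-03-16T19:22:32.644843Z id=21298 event=gram.script_read.info level=TRACE '
    + GRAMID + r' response="GRAM_SCRIPT_LOG:msg=\"qstat job_state line is:     job_state = Q\"" ',
    '##  Checking the PBS job state, it is in the queue',
    'ts=2011-03-16T19:22:32.644891Z id=21298 event=gram.script_read.info level=TRACE '
    + GRAMID + ' response="GRAM_SCRIPT_JOB_STATE:1" ',
    'ts=2011-03-16T19:22:52.870117Z id=21298 event=gram.script_read.info level=TRACE '
    + GRAMID + ' response="GRAM_SCRIPT_JOB_STATE:2" ',
    'ts=2011-03-16T19:26:35.392138Z id=21298 event=gram.script_read.info level=TRACE '
    + GRAMID + ' response="GRAM_SCRIPT_JOB_STATE:8" ',
]

for timestamp, state in state_transitions(SAMPLE_LINES):
    print(timestamp, state)
```

To run this against a real log, the sample list could be replaced by the lines of the gram_YYYYMMDD.log file, for example by passing an open file object to state_transitions. A log that contains several jobs would also need the entries grouped by their gramid field, which this sketch does not do.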