The National Institute for Computational Sciences

Darter: How do I find out what nodes my batch job is using?

There are a couple of easy ways to find out which nodes are assigned to your batch job. The easiest is the checkjob command, whose output includes a list of allocated nodes like the following:

Allocated Nodes:      

[84:1][85:1][86:1][87:1][88:1][89:1][90:1][91:1]
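If you want to post-process that list, a small parser is enough. The following is only a sketch: the function name parse_allocated_nodes is hypothetical, and it assumes every entry has the form [node:count] exactly as in the sample above.

```c
#include <stdio.h>

/* Parse a checkjob-style allocation string such as "[84:1][85:1]".
 * Stores up to max_nodes node numbers in nodes[] and returns how many
 * entries were stored.
 * Assumption: every entry looks like [node:count], as in the sample. */
int parse_allocated_nodes(const char *s, int *nodes, int max_nodes)
{
    int count = 0;

    while (count < max_nodes) {
        int node, tasks, consumed = 0;

        /* %n records how many characters this entry used, so the
         * scan can resume at the next '[' on the following pass. */
        if (sscanf(s, " [%d:%d]%n", &node, &tasks, &consumed) != 2 ||
            consumed == 0)
            break;

        nodes[count++] = node;
        s += consumed;
    }
    return count;
}
```

Called on the sample line above, it would fill the array with the logical node numbers 84 through 91.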

The checkjob output gives a logical numbering of the nodes. The physical numbering of the nodes, as well as the pid layout, can be obtained by setting the PMI_DEBUG environment variable to 1 before running aprun.

> setenv PMI_DEBUG 1
> aprun -n4 ./a.out
Detected aprun CNOS interface
MPI rank order: Using default aprun rank ordering
rank 0 is on nid00015 pid 76; originally was on nid00015 pid 76
rank 1 is on nid00015 pid 77; originally was on nid00015 pid 77
rank 2 is on nid00016 pid 69; originally was on nid00016 pid 69
rank 3 is on nid00016 pid 70; originally was on nid00016 pid 70
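If you capture this debug output in a file, the rank-to-node mapping can be pulled out line by line. The sketch below assumes the lines keep exactly the layout shown above; the function name parse_rank_line is hypothetical and is not part of any Cray library.

```c
#include <stdio.h>

/* Extract rank, node ID, and pid from one PMI_DEBUG line, e.g.
 * "rank 2 is on nid00016 pid 69; originally was on nid00016 pid 69".
 * Returns 1 on success, 0 if the line does not match that layout.
 * Assumption: the line format is exactly the one shown in the sample. */
int parse_rank_line(const char *line, int *rank, int *nid, int *pid)
{
    /* %d reads "00016" as decimal 16; leading zeros are ignored. */
    return sscanf(line, "rank %d is on nid%d pid %d", rank, nid, pid) == 3;
}
```

Lines that do not describe a rank, such as "Detected aprun CNOS interface", make the function return 0, so they can simply be skipped.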

From within your code, you can call PMI_CNOS_Get_nid to get the physical node number for each process.

#include <stdio.h>
#include "mpi.h"
int main (int argc, char *argv[])
{
  int rank, nproc, nid;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nproc);
  /* Look up the physical node ID (nid) that hosts this rank */
  PMI_CNOS_Get_nid(rank, &nid);
  printf("  Rank: %10d  NID: %10d  Total: %10d \n",rank,nid,nproc);
  MPI_Finalize();
  return 0;
}

Run with four processes, the output would look like the following (the order of the lines can vary from run to run):

aprun -n4 ./hello-mpi.x
  Rank:          1  NID:         15  Total:          4
  Rank:          0  NID:         15  Total:          4
  Rank:          2  NID:         16  Total:          4
  Rank:          3  NID:         16  Total:          4
Application 13390 resources: utime 0, stime 0

aprun can also be used to run Unix commands on the compute nodes that display the node names, as shown below.

> aprun -n4 /bin/hostname
nid00015
nid00015
nid00016
nid00016
>

Or, to print the numeric node IDs:

> aprun -n4 /bin/cat /proc/cray_xt/nid

15
15
16
16
>
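
The same file can be read from inside a program, without going through MPI. This is a minimal sketch: the helper names read_nid and format_nid_name are hypothetical, and the nid plus five digits naming pattern is inferred from the hostname output above.

```c
#include <stdio.h>

/* Read the numeric node ID from a file such as /proc/cray_xt/nid.
 * Returns the ID, or -1 if the file cannot be opened or parsed. */
int read_nid(const char *path)
{
    FILE *fp = fopen(path, "r");
    int nid = -1;

    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%d", &nid) != 1)
        nid = -1;
    fclose(fp);
    return nid;
}

/* Format a node ID in the style printed by /bin/hostname above,
 * e.g. 15 becomes "nid00015".
 * Assumption: names are "nid" followed by five zero-padded digits.
 * Returns the number of characters snprintf wanted to write. */
int format_nid_name(int nid, char *buf, size_t len)
{
    return snprintf(buf, len, "nid%05d", nid);
}
```

On a compute node, read_nid("/proc/cray_xt/nid") would return the same number that the cat command prints, and format_nid_name turns it into the matching node name.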