The National Institute for Computational Sciences

Darter: How to determine memory usage on the compute node

Darter: How do I determine memory usage on a compute node during a running job?

To determine the memory usage of a given process, one would normally run the command "top" on the node and look at the entry for the process in question. This cannot be done on Darter, because its compute nodes are not directly accessible to users. Meanwhile, OOM (Out of Memory) errors can occur even when a problem has been discretized finely enough to fit in memory: a memory leak in the code can, in the worst case, exhaust the node's memory and crash the program.

When a program crashes this way, the user needs to instrument the code to find and fix the memory leaks. The Scientific Computing staff at NICS have created a simple function that can be called from any point in a program where memory usage is suspect. It helps locate potential memory leaks and diagnose situations where memory grows in a manner not commensurate with what the user expected. Tools like valgrind and electric fence exist, but they often slow code execution so much that the memory issue cannot be reached within the prescribed wall time, making the run a waste of SUs and user time.

The following is a C function, "GetMemoryUsage", which can be added to the source tree and compiled along with the rest of the user code. It reports the program's memory usage on the compute node at the point where it is called, as two values in kB read from /proc/self/status: the peak resident set size (VmHWM) and the current resident set size (VmRSS). The idea is to insert "GetMemoryUsage" calls at different places in the source, recompile, and run to observe memory leaks. To test whether a function or subroutine has a memory leak, call GetMemoryUsage at its beginning and end and check for a noticeable difference in memory usage. If there is one, the function has a memory leak, unless it allocates memory that it is meant to keep. In that case the growth should roughly match the amount allocated; any growth beyond that indicates a leak. Either way, the user can see how much memory a given function allocates and judge whether that is commensurate with what they expected. By inserting GetMemoryUsage calls repeatedly, one can narrow down which part of a large code is contributing to the memory leak.
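
The begin/end bracketing described above can be sketched as follows. This is a minimal illustration, not part of the NICS code: "CurrentRSS" is a compact stand-in for GetMemoryUsage so that the fragment compiles on its own, and "LeakyStep" and "MeasureGrowth" are hypothetical names chosen for this example. It assumes a Linux node where /proc/self/status is readable.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for GetMemoryUsage: returns the current resident
   set size (VmRSS) in kB, or -1.0 if /proc/self/status cannot be read. */
double CurrentRSS ( void )
  {
  FILE *fp = fopen ( "/proc/self/status", "r" );
  char line [ 256 ];
  double rss = -1.0;

  if ( fp == NULL )
    return rss;
  while ( fgets ( line, sizeof line, fp ) )
    if ( sscanf ( line, "VmRSS: %lf", &rss ) == 1 )
      break;
  fclose ( fp );
  return rss;
  }

#define LEAK_BYTES ( 8UL * 1024UL * 1024UL )

/* Hypothetical suspect routine: allocates 8 MB, touches it, never frees it. */
void LeakyStep ( void )
  {
  volatile unsigned char *p = malloc ( LEAK_BYTES );
  size_t i;

  if ( p == NULL )
    return;
  /* Write one byte per kB so that every page becomes resident. */
  for ( i = 0; i < LEAK_BYTES; i += 1024 )
    p [ i ] = 1;
  /* The only pointer to the block goes out of scope here: a leak. */
  }

/* Call fn and return the change in RSS (kB) across the call. */
double MeasureGrowth ( void ( *fn ) ( void ) )
  {
  double before = CurrentRSS ( );
  fn ( );
  return CurrentRSS ( ) - before;
  }
```

Calling MeasureGrowth ( LeakyStep ) from a main program and printing the result should show growth on the order of the 8 MB allocated, although the exact figure depends on the allocator and the page size. In a real code, the two CurrentRSS calls would be replaced by GetMemoryUsage calls placed at the top and bottom of the routine under suspicion.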

The sample program "memusage_test.c" shows how the function can be used; running it should help the user become familiar with the function before using it in a larger code base. The sample program leaks memory intentionally, so GetMemoryUsage returns higher and higher memory usage as the program continues. A sample makefile is also provided for convenience.

GetMemoryUsage.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MEMORY_INFO_FILE "/proc/self/status"
#define BUFFER_SIZE 1024

/* Report the peak (VmHWM) and current (VmRSS) resident set size of the
   calling process, in kB, as read from /proc/self/status. If the file
   cannot be opened or a field is missing, the value is left at -1.0. */
void GetMemoryUsage ( double *HWM, double *RSS )
  {
  FILE *fp;
  char buffer [ BUFFER_SIZE ];

  *HWM = -1.0;
  *RSS = -1.0;

  fp = fopen ( MEMORY_INFO_FILE, "r" );
  if ( fp == NULL )
    return;

  while ( fgets ( buffer, BUFFER_SIZE, fp ) )
    {
    /* Lines look like "VmHWM:     1234 kB". strtod skips the leading
       whitespace and stops at the first non-numeric character. */
    if ( strncmp ( buffer, "VmHWM:", 6 ) == 0 )
      *HWM = strtod ( &buffer [ 6 ], NULL );
    if ( strncmp ( buffer, "VmRSS:", 6 ) == 0 )
      *RSS = strtod ( &buffer [ 6 ], NULL );
    }

  /* Close the file so that repeated calls do not leak file handles. */
  fclose ( fp );
  }

memusage_test.c
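#include <stdio.h>
#include <stdlib.h>

/* Defined in GetMemoryUsage.c */
void GetMemoryUsage ( double *HWM, double *RSS );

int main ( int argc, char **argv )
  {
  int i, j;
  double HWM, RSS;
  double *Array;

  GetMemoryUsage ( &HWM, &RSS );
  printf ( "Initial Usage: \nHWM : %f kB \nRSS : %f kB\n\n", HWM, RSS );

  /* Create leaky code: each pass allocates 100000 doubles, touches them,
     and then discards the only pointer without calling free. */
  for ( j = 1; j < 100; j++ )
    {
    Array = malloc ( sizeof ( double ) * 100000 );
    if ( Array == NULL )
      {
      fprintf ( stderr, "malloc failed at j = %d\n", j );
      return 1;
      }
    for ( i = 0; i < 100000; i++ )
      Array [ i ] = 0.0;
    Array = NULL;

    GetMemoryUsage ( &HWM, &RSS );
    printf ( "Usage at j = %d \nHWM : %f kB \nRSS : %f kB\n\n", j, HWM, RSS );
    }
  return 0;
  }
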
Makefile
all:
	cc -c GetMemoryUsage.c
	cc -o memusage_test.exe memusage_test.c GetMemoryUsage.o

clean:
	rm -f *.o *.exe
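
As a rough cross-check of the sample program's output, the growth expected from one allocation can be computed from its size and compared with the change in RSS that GetMemoryUsage reports. The helpers below are a sketch written for this purpose; the names "ExpectedGrowthKB" and "GrowthMatches" are hypothetical, and the tolerance is an assumption the user must choose, since RSS is counted in whole pages and the allocator adds some overhead of its own.

```c
#include <stddef.h>

/* Expected growth in kB from allocating count elements of elem_size bytes
   each (1 kB = 1024 bytes, the unit used in /proc/self/status). */
double ExpectedGrowthKB ( size_t count, size_t elem_size )
  {
  return ( double ) count * ( double ) elem_size / 1024.0;
  }

/* Nonzero if the observed growth is within tolerance_kb of the expected
   growth; zero otherwise. */
int GrowthMatches ( double observed_kb, double expected_kb,
                    double tolerance_kb )
  {
  double diff = observed_kb - expected_kb;

  if ( diff < 0.0 )
    diff = -diff;
  return diff <= tolerance_kb;
  }
```

For the sample program, ExpectedGrowthKB ( 100000, sizeof ( double ) ) gives 781.25 kB with 8-byte doubles, so each pass through the leaky loop should raise RSS by roughly that amount. Growth well beyond the expected value in a routine that is supposed to free its memory is a sign that the routine deserves a closer look.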