Beacon

MIC Programming Models

Native Mode

  • All code runs directly on the coprocessors
  • Any libraries used will need to be recompiled for native mode usage
  • Use the compiler flag '-mmic' to compile for native mode
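
  As a minimal sketch of a native-mode build (assuming the Intel compiler
  icc and its -openmp flag; the file and binary names are illustrative):

    /* hello_native.c
       Compile for the coprocessor with:  icc -mmic -openmp hello_native.c -o hello_native.mic
       The resulting binary runs only on a MIC, not on the host. */
    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        #pragma omp parallel
        {
            #pragma omp single
            printf("running natively with %d OpenMP threads\n",
                   omp_get_num_threads());
        }
        return 0;
    }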

Offload Mode

  • Code starts running on CPU host
  • Parallel regions of code can be manually specified to run on the coprocessors using pragmas/directives (see the sketch after this list)
  • Data is copied to the coprocessors either explicitly or implicitly (the implicit model handles complex data types involving pointers and is only available in C/C++)
  • Automatic Offload (AO) is available for certain Intel Math Kernel Library (MKL) functions: ?GEMM, ?TRSM, ?POTRF, ?GEQRF, and ?GETRF
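
  A minimal sketch of an explicitly offloaded region with explicit data
  movement (Intel offload pragma syntax, compiled with the Intel compiler
  and -openmp; the array name, size, and card number are illustrative):

    /* offload_copy.c -- 'a' is copied to the card, the loop runs there,
       and 'b' is copied back to the host. */
    #include <stdio.h>

    #define N 1024

    int main(void)
    {
        static float a[N], b[N];
        for (int i = 0; i < N; i++) a[i] = (float)i;

        #pragma offload target(mic:0) in(a) out(b)
        {
            #pragma omp parallel for
            for (int i = 0; i < N; i++)
                b[i] = 2.0f * a[i];
        }

        printf("b[N-1] = %f\n", b[N - 1]);
        return 0;
    }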

Current Intel-recommended approach for file I/O from within an offload section

  • Files can be read from the NFS mount of /global, but only in read-only mode
  • Files can be directly read from and written to the MIC's internal memory
  • To transfer a file directly to a MIC's internal memory, use micscp file beaconXXX-micX:/User (or /tmp)
  • Be sure to reset the permissions on the file such that 'other' has read/write permissions (all I/O in an offload region is executed as 'micuser')
  • Use the absolute path as the argument to fopen(), as in the sketch after this list
  • Remember to copy any output files off of the MICs before exiting a job
  • Files can also be read from and written to /lustre/medusa/$USER; be sure to reset the file permissions (o+rw) as above and the directory permissions (o+x)
  • $HOME is not mounted on the MICs
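
  A sketch of writing an output file from inside an offload region,
  following the guidance above (absolute path, /tmp on the card; the file
  name is illustrative):

    /* offload_io.c -- the offloaded block runs as 'micuser' on the card,
       so the target directory must be writable by 'other'. */
    #include <stdio.h>

    int main(void)
    {
        int ok = 0;

        #pragma offload target(mic:0) out(ok)
        {
            FILE *fp = fopen("/tmp/offload_output.dat", "w");  /* absolute path */
            if (fp) {
                fprintf(fp, "written from the coprocessor\n");
                fclose(fp);
                ok = 1;
            }
        }

        printf("write %s\n", ok ? "succeeded" : "failed");
        return 0;
    }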

Heterogeneous MPI Tasks on Hosts With Offload To Cards (Could Include OpenMP)

  • Launch MPI tasks on the hosts just as you would for a host-only run; the offload regions send the parallel work to the cards
  • Set MIC_ENV_PREFIX=MIC_ and MIC_OMP_NUM_THREADS=N to have every card that is offloaded to use the same number of threads
  • To use a different number of threads for each offload, call omp_set_num_threads in the code and tie the value logically to the device type or device number
  • This is also an important time to set MIC_KMP_AFFINITY, especially when offloading to multiple cards from the same host; see http://software.intel.com/en-us/articles/openmp-thread-affinity-control. The snippet below will assist with offloading to multiple devices from one host, and a second sketch after it shows per-device thread counts:

  • #pragma omp parallel
    #pragma omp single
    {
    #pragma omp task
        {
        /* offload this task to a card (Intel offload syntax shown;
           the OpenMP 4.x equivalent is "#pragma omp target") */
        #pragma offload target(mic)
            {
            /* various serial code */
            #pragma omp parallel for
            for (int i = 0; i < limit; i++)
                {
                /* parallel loop body */
                }
            }
        }
    #pragma omp task
        {
        /* host code or another offload */
        }
    }
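
  Building on the snippet above, the following sketch offloads one task to
  each of two cards from the same host and ties the thread count to the
  device number (the counts 240 and 120, the loop, and the assumption of
  exactly two cards are illustrative only):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        double sum[2] = {0.0, 0.0};   /* one partial result per card */

        #pragma omp parallel
        #pragma omp single
        {
            for (int dev = 0; dev < 2; dev++) {
                #pragma omp task firstprivate(dev) shared(sum)
                {
                    double s = 0.0;
                    /* run this task's work on card 'dev'; the scalar 's'
                       is copied back automatically */
                    #pragma offload target(mic:dev)
                    {
                        omp_set_num_threads(dev == 0 ? 240 : 120);
                        #pragma omp parallel for reduction(+:s)
                        for (int i = 0; i < 1000000; i++)
                            s += (double)i * (dev + 1);
                    }
                    sum[dev] = s;
                }
            }
        }

        printf("partial sums: %f %f\n", sum[0], sum[1]);
        return 0;
    }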

Heterogeneous MPI Tasks on Hosts or Cards With OpenMP (No Offload)

  • A user is running MPI ranks on the hosts and/or the MICs, with OpenMP pragmas invoked directly from each MPI task (without offload pragmas)
  • Use omp_set_num_threads in the code and tie it logically to the device type or number (see the sketch after this list) --OR--
  • micmpiexec -n 1 -host beaconXXX -env OMP_NUM_THREADS=N1 ./test : -n 1 -host beaconXXX-mic0 -wdir $DIR -env OMP_NUM_THREADS=N2 ./test.MIC : etc.
  • This passes the appropriate number of OpenMP threads to each MIC/host through its own local environment variable (-env, as opposed to the global -genv)
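
  A sketch of tying the thread count to the device type at compile time:
  __MIC__ is defined by the Intel compiler when building with -mmic, so the
  same source can be built once for the host and once for the card; the
  thread counts 240 and 16 are illustrative only:

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #ifdef __MIC__
        omp_set_num_threads(240);   /* rank running natively on a coprocessor */
    #else
        omp_set_num_threads(16);    /* rank running on a Xeon host */
    #endif

        #pragma omp parallel
        {
            #pragma omp single
            printf("rank %d is using %d OpenMP threads\n",
                   rank, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }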

Heterogeneous MPI Tasks on Hosts and Cards With Machine File

  • A user is running MPI ranks on the hosts and the MICs with executables named test (host build) and test.MIC (MIC build)
  • Set environment variable I_MPI_MIC=1 and I_MPI_MIC_POSTFIX=.MIC
  • Use command: generate-mic-hostlist hybrid X Y > machines, where X is the number of Xeon ranks ON EACH NODE REQUESTED and Y is the number of Xeon Phi ranks ON EACH PHI ON EACH NODE REQUESTED.
  • micmpiexec -n NNODES*X+4*NNODES*Y -machinefile machines ./test
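
  As an illustrative sketch only (hostnames and the exact format produced by
  generate-mic-hostlist may differ), a machine file for 2 nodes with X=2
  host ranks per node and Y=4 ranks per Phi might look like the listing
  below; with these numbers the launch command would be
  micmpiexec -n 36 -machinefile machines ./test, since 2*2 + 4*2*4 = 36:

    beacon001:2
    beacon001-mic0:4
    beacon001-mic1:4
    beacon001-mic2:4
    beacon001-mic3:4
    beacon002:2
    beacon002-mic0:4
    beacon002-mic1:4
    beacon002-mic2:4
    beacon002-mic3:4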

Back to Contents