Intel Hyper-Threading Technology allows each physical core to be seen as multiple logical cores by the operating system. The logical cores share the resources of the physical core and can execute independent processes or threads in parallel. This may increase the throughput and utilization of the processor since idle cycles on a logical core can be used by another logical core. However, since physical resources are shared, Hyper-Threading may adversely affect performance if this sharing isn’t effective for the particular application.
Hyper-Threading is available on Darter processors but is not enabled by default. To run with Hyper-Threading on a compute node, pass the appropriate -j option to aprun:
-j 1 to use 1 logical core on each physical core (default, no Hyper-Threading)
-j 2 to use all logical cores on each physical core (Hyper-Threading on)
On Darter, the default value is -j 1, which effectively turns off Hyper-Threading.
Note the following cautions for applications running with Hyper-Threading. Because multiple logical cores share processor resources, computationally intensive applications will likely see a degradation in absolute performance. The memory available per MPI task or OpenMP thread will also be lower (up to 50% less) with Hyper-Threading enabled, since more MPI tasks or OpenMP threads can be packed on a node. Whether Hyper-Threading helps is highly application dependent, and users must test it with their own application.
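The memory caution can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, a node with 32768 MB of memory; that figure and the helper name mem_per_task_mb are assumptions, not values taken from this page, so substitute the real numbers for the machine. The 16 physical cores per node match the size=16 examples used on this page.

```shell
#!/bin/bash
# Memory available to each task when a node is fully packed with tasks.
# NODE_MEM_MB is an assumed, illustrative value; replace it with the real figure.
NODE_MEM_MB=32768
PHYS_CORES=16   # physical cores per node, as in the size=16 examples

# mem_per_task_mb J
# Prints the integer number of MB per task when J logical cores
# are used on each physical core and every logical core runs one task.
mem_per_task_mb() {
    local j=$1
    echo $(( NODE_MEM_MB / (PHYS_CORES * j) ))
}

echo "-j 1: $(mem_per_task_mb 1) MB per task"
echo "-j 2: $(mem_per_task_mb 2) MB per task"
```

Under these assumptions the per-task share halves when going from -j 1 to -j 2, which is the "up to 50% less" case described above.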
The following examples show how to use Hyper-Threading on Darter with MPI tasks and OpenMP threads. Process placement on the CPU under Hyper-Threading is also illustrated (this part still needs development).
1. Running 32 MPI tasks with Hyper-Threading on a single node:
#PBS -lsize=16,walltime=1:00:00
aprun -n 32 -j 2 ./xthi
Without the -j 2 option, the above aprun command will fail because the 16 cores requested are not enough to place 32 MPI tasks.
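The reason for the failure can be checked by hand before submitting a job. The following is a minimal sketch of that check, assuming a simplified model in which the number of usable CPUs is the requested size multiplied by the -j value; the function name fits is hypothetical, and this is not aprun's actual placement logic.

```shell
#!/bin/bash
# fits SIZE J NTASKS [DEPTH]
# Returns 0 (success) when NTASKS tasks, each needing DEPTH CPUs, fit on
# SIZE physical cores with J logical cores used per physical core.
# Simplified model for illustration only.
fits() {
    local size=$1 j=$2 ntasks=$3 depth=${4:-1}
    [ $(( ntasks * depth )) -le $(( size * j )) ]
}

if fits 16 2 32; then
    echo "32 tasks fit on size=16 with -j 2"
fi
if ! fits 16 1 32; then
    echo "32 tasks do not fit on size=16 with -j 1"
fi
```

With size=16, the check passes for -j 2 (32 logical cores) and fails for -j 1 (16 cores), matching the behavior described above.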
2. Running 16 MPI tasks with 2 threads each in Hyper-Threading mode:
#PBS -lsize=16,walltime=1:00:00
aprun -n 16 -d 2 -j 2 ./xthi
The depth argument (-d 2) specifies the number of cores for each task. In this case, 2 logical cores are available to each xthi task, on which threads can be spawned. In effect, the two threads of each task run on the same physical core because of Hyper-Threading. If OpenMP threads are used, OMP_NUM_THREADS=2 must also be set.
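For hybrid MPI/OpenMP runs, the depth value and OMP_NUM_THREADS should agree. The sketch below is one way to keep them in sync: a hypothetical helper, hybrid_cmd, prints the two lines to place in a job script rather than launching anything. It only uses the aprun options already shown on this page.

```shell
#!/bin/bash
# hybrid_cmd NTASKS NTHREADS EXE
# Prints the job-script lines for a hybrid run in Hyper-Threading mode,
# using the same value for OMP_NUM_THREADS and for the aprun depth (-d).
# The helper name is illustrative; it does not run aprun itself.
hybrid_cmd() {
    local ntasks=$1 nthreads=$2 exe=$3
    echo "export OMP_NUM_THREADS=$nthreads"
    echo "aprun -n $ntasks -d $nthreads -j 2 $exe"
}

# Example 2 from this page: 16 tasks with 2 threads each.
hybrid_cmd 16 2 ./xthi
```

Generating both lines from one thread count avoids the case where -d and OMP_NUM_THREADS drift apart after an edit.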
3. The following runs two different programs on a single node, optimizing for NUMA-node placement, with each program restricted to one NUMA node:
#PBS -lsize=16,walltime=1:00:00
aprun -n 4 -j 2 -sl 0 ./program1 &
aprun -n 16 -j 2 -sl 1 ./program2 &
wait
The first aprun command runs program1 with 4 processes and constrains it to NUMA node 0 (with the -sl 0 option). The second aprun command runs program2 with 16 processes and constrains it to NUMA node 1 (with the -sl 1 option). Without the -sl option, program2 would be placed across both NUMA nodes.