- Infos im HLRS Wiki sind nicht rechtsverbindlich und ohne Gewähr -

Difference between revisions of "NEC Cluster NUMA Tuning"

From HLRS Platforms
Jump to navigationJump to search
 
 
(4 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
'''NUMA Tuning'''
 
'''NUMA Tuning'''
  
The Nehalem architecture is the first intel platform following a NUMA (=non uniform memory access) design
+
The Nehalem architecture was the first intel platform following a NUMA (=non uniform memory access) design
pattern.
+
pattern, and idea that is still used today.
 
The memory controller is located within the CPU chip, and therefor, a system with two CPU sockets as
 
The memory controller is located within the CPU chip, and therefor, a system with two CPU sockets as
 
the HLRS/NEC installation has two distinct memory controllers, each responsible for 1/2 of the memory.
 
the HLRS/NEC installation has two distinct memory controllers, each responsible for 1/2 of the memory.
Line 10: Line 10:
 
or ''NUMA nodes'', in a pattern that matches the accesses from the processes or thread.
 
or ''NUMA nodes'', in a pattern that matches the accesses from the processes or thread.
  
The linux kernel us NUMA aware, and does it's best to place data at a good (=near) memory location.
+
The linux kernel is NUMA aware, and does it's best to place data at a good (=near) memory location.
 
This is achieved by following a ''first touch policy''. A memory page is allocated - if possible -
 
This is achieved by following a ''first touch policy''. A memory page is allocated - if possible -
 
nearby the process touching it first.
 
nearby the process touching it first.
  
'''Important:''' First touch, not allocate!
+
'''Important:''' First touch means: location is fixed when touching, not allocating!
  
 
MPI programs with one process per core tend to do things right, data is accessed from
 
MPI programs with one process per core tend to do things right, data is accessed from
Line 100: Line 100:
 
OMP_NUM_THREADS=8 KMP_AFFINITY=scatter ./a.out
 
OMP_NUM_THREADS=8 KMP_AFFINITY=scatter ./a.out
 
}}
 
}}
one gets a runtime of about 0.443941 seconds.
+
one gets a runtime of about 0.443941 seconds (on an old nehalem node)
 
(we use 8 threads and use intel compilers openmp runtime option
 
(we use 8 threads and use intel compilers openmp runtime option
 
to pin threads to cores for most consistent performance and
 
to pin threads to cores for most consistent performance and
Line 113: Line 113:
 
The initialization of the data within a parallel loop makes sure the data is for each thread on the local
 
The initialization of the data within a parallel loop makes sure the data is for each thread on the local
 
node, and therefor nearly doubles the performance.
 
node, and therefor nearly doubles the performance.
As the data is small enough to fit into one node, if no care it taken, all data resides
+
As the data is small enough to fit into one node, if no care is taken, all data resides
 
behind one of the two memory controllers, and one half of the available memory bandwidth is wasted.
 
behind one of the two memory controllers, and one half of the available memory bandwidth is wasted.
  
The command ''numastat'' can be used to check the numa accesses.
 
 
<pre>
 
$numastat
 
                          node0          node1
 
numa_hit                30649213        13343389
 
numa_miss                      0          977606
 
numa_foreign              977606              0
 
interleave_hit            24121          23621
 
local_node              30645880        13304097
 
other_node                  3333        1016898
 
</pre>
 
 
see the line ''numa_miss''.
 
 
If started once before and after the above test program,
 
the numa_miss counter difference is 0 in the good case,
 
and 108687 in the bad case (difference of the value
 
between the two invocations).
 
  
 
If a hybrid MPI-OpenMP approach is used, it can make perfect sense
 
If a hybrid MPI-OpenMP approach is used, it can make perfect sense

Latest revision as of 14:20, 2 October 2018

NUMA Tuning

The Nehalem architecture was the first intel platform following a NUMA (=non uniform memory access) design pattern, and idea that is still used today. The memory controller is located within the CPU chip, and therefor, a system with two CPU sockets as the HLRS/NEC installation has two distinct memory controllers, each responsible for 1/2 of the memory.

To achieve maximum performance, this has to be taken into consideration, as data is best accessed by processes nearby the data - or in other words, for best performance, the data has to be distributed over the two memory controllers, or NUMA nodes, in a pattern that matches the accesses from the processes or thread.

The linux kernel is NUMA aware, and does it's best to place data at a good (=near) memory location. This is achieved by following a first touch policy. A memory page is allocated - if possible - nearby the process touching it first.

Important: First touch means: location is fixed when touching, not allocating!

MPI programs with one process per core tend to do things right, data is accessed from the process allocating and first touching the data. Only extreme inbalance, like one rank using so much memory on one NUMA node that other ranks memory does not fit into the local node. In such a case, performance could suffer as memory accesses have to be done to remote memory on the further node.

For OpenMP programs, care has to be taken that the nodes are used in a balanced way, and that processes touch the memory multithreaded, and in the same way as the data is accessed later, if possible.

Here is simple example to demonstrate the impact of wrong initialization of data.

File: numa.c
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

#define GB (1024*1024*1024)
#define N (GB/2)

double second()
{
        struct timeval tp;
        struct timezone tzp;
        int i;
        i = gettimeofday(&tp,&tzp);
        return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );
}


main()
{
        int i;
        double t1,t2;
        float *a,*b,*c;

        /* allocate memory */
        a=malloc(N*sizeof(float));
        if(!a) {
                fprintf(stderr, "allocation error\n");
                exit(1);
        }
        b=malloc(N*sizeof(float));
        if(!b) {
                fprintf(stderr, "allocation error\n");
                exit(1);
        }
        c=malloc(N*sizeof(float));
        if(!c) {
                fprintf(stderr, "allocation error\n");
                exit(1);
        }

        /* initialize the data */
#ifdef PARALLELINIT
#pragma omp parallel for private(i) shared(a,b,c) 
#endif
        for(i=0; i<N; i++) {
                a[i]=0.0f;
                b[i]=0.0f;
                c[i]=1.0f;
        }

        /* do something with data */
        t1=second();
#pragma omp parallel for private(i) shared(a,b,c)
        for(i=0; i<N; i++) {
                a[i]+=b[i]*c[i];
        }
        t2=second();

        printf("time: %lf sec\n",t2-t1);
}


If this code is compiled using

icc -openmp -O3 -xSSE4.1 numa.c

and run using

OMP_NUM_THREADS=8 KMP_AFFINITY=scatter ./a.out

one gets a runtime of about 0.443941 seconds (on an old nehalem node) (we use 8 threads and use intel compilers openmp runtime option to pin threads to cores for most consistent performance and NUMA placement)

If compiled with

icc -DPARALLELINIT -openmp -O3 -xSSE4.1 numa.c

the runtime is 0.230432 seconds, which is x1.9 better than the initial result.

The initialization of the data within a parallel loop makes sure the data is for each thread on the local node, and therefor nearly doubles the performance. As the data is small enough to fit into one node, if no care is taken, all data resides behind one of the two memory controllers, and one half of the available memory bandwidth is wasted.


If a hybrid MPI-OpenMP approach is used, it can make perfect sense to run one MPI process per numa node, so on the Nehalem cluster, 2 per node, and use 4 threads per MPI process.

See the MPI section for a wrapper script allowing numa placement of the MPI processes and thread pinning of the OpenMP threads for best results.