- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

Hsw


Haswell nodes

The cluster was upgraded to contain 88 nodes with dual-socket Intel Xeon E5-2660 v3, 2.6 GHz "Haswell":

  • 20 cores per node in 2 sockets, 40 threads, AVX2 support
  • 128/256 GB of DDR4 2133 MHz memory
  • 4 memory channels per CPU, total of >110 GB/s memory bandwidth
  • Mellanox QDR ConnectX-3 InfiniBand HCA, connected via PCIe Gen3, 2:1 overcommitted in the switch fabric

and in addition

196 nodes with dual-socket Intel Xeon E5-2680 v3, 2.5 GHz "Haswell":

  • 24 cores per node in 2 sockets, 48 threads, AVX2 support
  • 128/256 GB of DDR4 2133 MHz memory
  • 4 memory channels per CPU, total of >110 GB/s memory bandwidth
  • Mellanox QDR ConnectX-3 InfiniBand HCA, connected via PCIe Gen3, 2:1 overcommitted in the switch fabric

Main hardware benefits for users compared to the Sandy Bridge nodes:

  • more cores per node
  • improved memory bandwidth
  • 128/256 GB of memory per node


Remarks

  • use the -march=core-avx2 switch to generate the best code; AVX2 is supported by the Intel, GCC, and Portland compilers, see the compiler manuals for details (a compile sketch follows after this list)
  • the frontends have Nehalem-type CPUs; if no architecture switch is specified and the compiler autodetects the host CPU, you will get non-optimal code!
  • Scientific Linux 6.2 (based on Red Hat 6.2), which is used on the new nodes, shows performance degradation if SMT (hyper-threading) is enabled - which is the case - and MPI is used without using the additional threads. To get the best performance, use CPU pinning:
    • for Open MPI use mpirun -bind-to-core
    • for HP-MPI/Platform MPI use mpirun -cpu_bind=rank
    • for Intel MPI pinning is on by default
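
A minimal compile sketch for the architecture switch mentioned above (the compiler drivers gcc and icc and the file name prog.c are only placeholders for illustration; check the module environment and the compiler manuals for the exact invocation):

  # GCC: request Haswell/AVX2 code explicitly instead of relying on host autodetection on the Nehalem frontends
  gcc -O3 -march=core-avx2 prog.c -o prog

  # Intel compiler: accepts the same switch (the Intel-specific -xCORE-AVX2 is an alternative)
  icc -O3 -march=core-avx2 prog.c -o prog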

An example of the effect of pinning is the IMB Allreduce benchmark within one node: with 16 processes and 4 MB messages it takes ~13000 µs without pinning and ~7500 µs with pinning. A launch sketch follows below.
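
A sketch of how such a pinned run could be launched with Open MPI, using the -bind-to-core option listed above (the benchmark binary name IMB-MPI1 and its location are assumptions; with Intel MPI pinning is on by default):

  # 16 processes within one node, pinned to cores
  mpirun -np 16 -bind-to-core ./IMB-MPI1 Allreduce

  # the same run without pinning, for comparison
  mpirun -np 16 ./IMB-MPI1 Allreduce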