- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -
Hsw: Difference between revisions
Latest revision as of 09:41, 23 August 2024
This information is deprecated. The hardware was removed from the clusters in 2024.
Haswell nodes
The cluster was upgraded to contain 76 nodes with dual-socket Intel Xeon E5-2660 v3, 2.6 GHz "Haswell" CPUs:
- 20 cores per node in 2 sockets, 40 threads, AVX2 support
- 128 GB of DDR4 2133 MHz memory
- 4 memory channels per CPU, total of > 110 GB/s memory bandwidth
- Mellanox QDR ConnectX-3 InfiniBand HCA, connected via a PCIe gen3 bus, 2:1 overcommitted in the switch fabric
In addition, there are 196 nodes with dual-socket Intel Xeon E5-2680 v3, 2.5 GHz "Haswell" CPUs:
- 24 cores per node in 2 sockets, 48 threads, AVX2 support
- 128 GB or 256 GB of DDR4 2133 MHz memory
- 4 memory channels per CPU, total of > 110 GB/s memory bandwidth
- Mellanox QDR ConnectX-3 InfiniBand HCA, connected via a PCIe gen3 bus, 2:1 overcommitted in the switch fabric
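To put the figures above in context, the following minimal sketch adds up the node and core counts of both node types and estimates the theoretical peak memory bandwidth of one node. The node and core counts come from the lists above; the transfer rate (2133 million transfers per second) and the 8-byte channel width are assumptions about DDR4-2133 that are not stated on this page. The > 110 GB/s quoted above is plausibly a sustained figure, which would sit below this theoretical peak.

```python
# Node counts and cores per node are taken from the two lists above.
PARTITIONS = [
    {"cpu": "E5-2660 v3", "nodes": 76, "cores_per_node": 20},
    {"cpu": "E5-2680 v3", "nodes": 196, "cores_per_node": 24},
]

# Assumptions (not stated on this page): DDR4-2133 performs 2133 million
# transfers per second and each memory channel is 64 bits (8 bytes) wide.
TRANSFERS_PER_SECOND = 2133e6
BYTES_PER_TRANSFER = 8
CHANNELS_PER_CPU = 4   # from the lists above
CPUS_PER_NODE = 2      # dual-socket nodes


def total_nodes(partitions):
    """Number of Haswell nodes over all node types."""
    return sum(p["nodes"] for p in partitions)


def total_cores(partitions):
    """Number of physical cores over all node types."""
    return sum(p["nodes"] * p["cores_per_node"] for p in partitions)


def peak_bandwidth_gb_per_s():
    """Theoretical peak memory bandwidth of one node in GB/s."""
    bytes_per_second = (TRANSFERS_PER_SECOND * BYTES_PER_TRANSFER
                        * CHANNELS_PER_CPU * CPUS_PER_NODE)
    return bytes_per_second / 1e9


if __name__ == "__main__":
    print("total nodes:", total_nodes(PARTITIONS))
    print("total cores:", total_cores(PARTITIONS))
    print("theoretical peak bandwidth per node: %.1f GB/s"
          % peak_bandwidth_gb_per_s())
```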
Main user benefits of the hardware compared to the Sandy Bridge nodes:
- more cores per node
- improved memory bandwidth
- 128GB/256GB memory per node
Remarks
- Use the -march=core-avx2 switch to generate the best code with the compilers. AVX is supported by the Intel, GCC, and Portland compilers; see the compiler manuals for details.
- The frontends have Nehalem-type CPUs. If no architecture switch is specified and the compiler autodetects the architecture, you will get non-optimal code!
- The Red Hat 6.2 based Scientific Linux 6.2 used on the new nodes shows a performance degradation if SMT (hyperthreading) is enabled, which is the case, and MPI is used without using the additional threads. To get the best performance, use CPU pinning:
- for Open MPI, use mpirun -bind-to-core
- for HP-MPI/Platform MPI, use mpirun -cpu_bind=rank
- for Intel MPI, pinning is on by default
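Both remarks (the AVX2 target and CPU pinning) can be checked on a node with a small script. The following is a minimal sketch, not an official HLRS tool: the function names `has_avx2` and `affinity_summary` are made up for this illustration, and the script assumes a Linux node where /proc/cpuinfo and Python's `os.sched_getaffinity` are available.

```python
import os


def has_avx2(cpuinfo_text):
    """Return True if a 'flags' line in /proc/cpuinfo-style text lists avx2."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "avx2" in value.split():
            return True
    return False


def affinity_summary(cpus):
    """Describe a set of allowed CPU ids, as returned by os.sched_getaffinity."""
    cpus = sorted(cpus)
    if len(cpus) == 1:
        return "pinned to CPU %d" % cpus[0]
    return "may run on %d CPUs: %s" % (len(cpus), cpus)


if __name__ == "__main__":
    # Both sources of information are Linux-specific, so check before use.
    if os.path.exists("/proc/cpuinfo"):
        with open("/proc/cpuinfo") as f:
            print("AVX2 available:", has_avx2(f.read()))
    if hasattr(os, "sched_getaffinity"):
        print("this process", affinity_summary(os.sched_getaffinity(0)))
```

Run on a frontend, the AVX2 check should report False if the CPUs are Nehalem-type as described above, which is why autodetection there targets the wrong architecture. Launched through the same mpirun line as the application, each rank prints its own affinity. Depending on the MPI library, a rank bound to a core may be allowed on both hardware threads of that core, so two CPU ids per rank can also indicate that pinning is active.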
An example of the effect of pinning is the IMB Allreduce benchmark within one node: for 4 MB messages with 16 processes, it takes ~13000 us without pinning and 7500 us with pinning.
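As a quick worked calculation, the two timings quoted above correspond to a speedup of about 1.7x, or roughly 42% less time per Allreduce. The sketch below only restates that arithmetic; the timings are the approximate values from the text, not new measurements.

```python
# Approximate IMB Allreduce timings from the text (4 MB messages, 16 processes).
T_UNPINNED_US = 13000.0
T_PINNED_US = 7500.0


def speedup(t_before, t_after):
    """How many times faster the run is after the change."""
    return t_before / t_after


def time_saved_fraction(t_before, t_after):
    """Fraction of the original run time that is saved."""
    return 1.0 - t_after / t_before


if __name__ == "__main__":
    print("speedup: %.2fx" % speedup(T_UNPINNED_US, T_PINNED_US))
    print("time saved: %.0f%%"
          % (100 * time_saved_fraction(T_UNPINNED_US, T_PINNED_US)))
```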