- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

Communication on Cray XC40 Aries network


Communication in the Cray XC40 using the Aries network chip and the impact of huge pages

  UNDER CONSTRUCTION 

Introduction / Motivation

Since the upgrade from Hornet to Hazel Hen, performance variations have increasingly been reported. Some of these variations are due to I/O, but not all. HLRS and Cray have analyzed this issue. We can explain it and also offer a workaround that reduces the problem. Both are presented below, after a brief explanation of how communication works on the system.

XC40 design

Understanding the problem requires basic knowledge of the communication mechanism of the Cray XC. Communication on the Cray XC runs over the Cray interconnect, which is implemented by the Aries network chip.

  • 4 nodes (not necessarily compute nodes) share a single Aries network chip.
  • 16 Aries chips are mounted in a chassis and are all-to-all connected over the so-called backplane.
  • 3 of these chassis are mounted in a cabinet (rack), and 2 cabinets (6 chassis) form a cabinet group. Connections within a group are made with copper cables.
  • The cabinet groups are interconnected via optical cables (see picture).
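
The node counts implied by this hierarchy can be worked out with a few lines of Python. This is only an illustrative sketch: the constants come from the list above, but the linear node numbering used by the hypothetical helpers `locate` and `share_aries` is an assumption made for this example and is not the node ID scheme of the real system.

```python
# Node counts implied by the hierarchy described above.
NODES_PER_ARIES = 4
ARIES_PER_CHASSIS = 16
CHASSIS_PER_CABINET = 3
CABINETS_PER_GROUP = 2

NODES_PER_CHASSIS = NODES_PER_ARIES * ARIES_PER_CHASSIS      # 64
NODES_PER_CABINET = NODES_PER_CHASSIS * CHASSIS_PER_CABINET  # 192
NODES_PER_GROUP = NODES_PER_CABINET * CABINETS_PER_GROUP     # 384


def locate(node_index):
    """Map a linear node index to (group, chassis, aries, slot).

    chassis counts 0..5 within the group, aries 0..15 within the chassis,
    slot 0..3 on the Aries chip. The linear numbering is a hypothetical
    convention for this sketch only.
    """
    group, rest = divmod(node_index, NODES_PER_GROUP)
    chassis, rest = divmod(rest, NODES_PER_CHASSIS)
    aries, slot = divmod(rest, NODES_PER_ARIES)
    return group, chassis, aries, slot


def share_aries(a, b):
    """True if two nodes are attached to the same Aries chip."""
    return locate(a)[:3] == locate(b)[:3]


print(NODES_PER_CHASSIS, NODES_PER_CABINET, NODES_PER_GROUP)  # 64 192 384
print(locate(389))             # (1, 0, 1, 1)
print(share_aries(388, 391))   # True
print(share_aries(391, 392))   # False
```

Nodes that share an Aries chip in this sense are the ones that can slow each other down when one of them generates a heavy translation load, as described further below.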

Aries Connections.png

Cray XC communication mechanism

  • Communication is done by transferring data cache-coherently from the main memory of the source node into the main memory of the destination node.
  • The Aries chip has to translate the logical address into a physical memory address. Memory is managed by the system in units called memory pages (pages).
  • To avoid repeating many of these translations, some results are stored in an internal table within the Aries chip, similar to the TLB of a processor. If a translation is not present in this table, it must be recomputed, which costs performance.
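
The effect of such a translation table can be illustrated with a toy model. The following sketch is not a description of the Aries hardware: the table size (512 entries), the LRU replacement policy, the 64 MiB buffer and the 8 MiB page size used for comparison are all assumptions chosen for illustration. It only shows how a cache of page translations behaves when the touched address range spans more pages than the table can hold.

```python
from collections import OrderedDict


class TranslationCache:
    """Toy LRU cache of page translations (stand-in for an on-chip table)."""

    def __init__(self, entries, page_size):
        self.entries = entries
        self.page_size = page_size
        self.table = OrderedDict()
        self.hits = 0
        self.misses = 0

    def translate(self, address):
        page = address // self.page_size
        if page in self.table:
            self.table.move_to_end(page)        # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1                    # translation must be recomputed
            self.table[page] = True
            if len(self.table) > self.entries:
                self.table.popitem(last=False)  # evict least recently used


def miss_rate(page_size, entries=512, buffer_bytes=64 * 2**20,
              stride=4096, sweeps=4):
    """Sweep a buffer several times and return the fraction of misses."""
    cache = TranslationCache(entries, page_size)
    for _ in range(sweeps):
        for address in range(0, buffer_bytes, stride):
            cache.translate(address)
    return cache.misses / (cache.hits + cache.misses)


print(miss_rate(4 * 2**10))  # 4 KiB pages: 16384 pages do not fit, every access misses
print(miss_rate(8 * 2**20))  # 8 MiB pages: 8 pages fit, only the first touch of each misses
```

In this model the small page size makes every lookup a miss, while the larger page size leaves only the initial misses. Real access patterns and the real table are more complex, so the numbers are qualitative only.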

Timing variation problem

To reduce this problem, configurable page sizes were introduced in Linux a few years ago. With larger pages (the default is 4096 bytes), more memory can be addressed without an additional address translation, which reduces the number of translations required. On the Cray XC40 we found that the Aries network chip can be overloaded by the required address translations when a very high communication volume coincides with the default 4 KB pages. These overloads occur especially with all-to-all and all-to-one communication patterns. The Aries network chip then becomes so busy that a single communicating process consumes almost all of its resources, i.e. all other nodes that use this chip for communication slow down as well.
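
As a rough worked example of why larger pages reduce the translation load, the number of distinct pages (and therefore distinct translations) needed to cover a communication buffer can be computed directly. The 1 GiB buffer size is an arbitrary assumption for illustration; the 8 MiB page size corresponds to the `craype-hugepages8M` module mentioned below.

```python
KIB = 2**10
MIB = 2**20
GIB = 2**30


def pages_needed(buffer_bytes, page_size):
    """Number of pages required to cover a buffer (rounded up)."""
    return -(-buffer_bytes // page_size)  # integer ceiling division


buffer_bytes = 1 * GIB  # assumed buffer size, for illustration only

for label, page_size in (("4 KiB", 4 * KIB), ("8 MiB", 8 * MIB)):
    print(label, pages_needed(buffer_bytes, page_size))
# 4 KiB 262144
# 8 MiB 128
```

Under these assumptions, switching from 4 KiB to 8 MiB pages reduces the number of distinct translations for the same buffer by a factor of 2048.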

  • It has now been shown that a severe performance degradation occurs when an Aries chip is shared by different applications and one of them generates such a load.
  • The load increases with the number of MPI processes participating in the global communication. With Hazel Hen we have almost doubled the number of nodes and given priority to large jobs for academic users. Since then, many jobs run in the range of more than 1000 nodes.
  • Since the system upgrade there have been many user requests or complaints that the runtime behavior of an application is no longer predictable. It is now certain that this can often be avoided by using huge pages. The runtime variation is in some cases more than 100%.
  • Huge pages benefit both programs that themselves make intensive use of global communication routines and programs that do not use such routines but share a network chip over which global communication runs. Using huge pages by default ensures that the Aries network chip is not kept busy solely with address translation for global communication.
  • In addition, Cray has optimized the MPI_Alltoall routines, and recompiling is recommended if these routines are used directly.

With the system upgrade planned for the new year, we will introduce huge pages by default on the Cray nodes. This change is essential because the effects of overloading the Aries network chips are so drastic. Cray has been working very hard for months to prepare for this change and to identify and eliminate possible sources of problems.

To try huge pages now, you can simply relink your application with huge pages. To do so, load one of the huge pages modules before linking, for example "module load craype-hugepages8M" (the size does not matter at this point). In addition, you must load a huge pages module in your batch script. You can try the different sizes there without relinking. For more information, see XXX. We welcome feedback on your experience with huge pages, both regarding building your program and regarding performance. For further reading, see: https://en.wikipedia.org/wiki/Page_(computer_memory)