- Information in the HLRS Wiki is not legally binding and is provided without warranty -

Communication on Cray XC40 Aries network

From HLRS Platforms
Revision as of 19:03, 9 December 2015 by Hpctobei (talk | contribs)

Communication in the Cray XC40 using the Aries network chip and the impact of huge pages


Since the upgrade from Hornet to Hazel Hen, performance variations have been reported increasingly often. Some of these variations are due to I/O, but not all. HLRS and Cray have investigated the issue; we can explain the cause and offer a workaround that reduces the problem. Both are presented after a brief description of the communication mechanism of the Cray XC, which is necessary to understand the problem. Communication on the Cray XC runs over the Cray interconnect, which is implemented by the Aries network chip. Four nodes (not necessarily compute nodes) share a single Aries chip. Sixteen Aries chips are connected via the backplane within a chassis. Three chassis are mounted in a cabinet (rack), and two cabinets are connected via copper cables to form a cabinet group. The cabinet groups are interconnected via optical cables (see picture).

Aries Chip.jpg
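
The hierarchy described above can be turned into a small calculation. The following Python sketch uses only the counts given in the text (4 nodes per Aries, 16 Aries per chassis, 3 chassis per cabinet, 2 cabinets per cabinet group). The function names and the consecutive node numbering used in `locate` and `share_aries` are assumptions made for illustration and do not reflect the real node IDs on the system.

```python
# Counts taken from the description above.
NODES_PER_ARIES = 4
ARIES_PER_CHASSIS = 16
CHASSIS_PER_CABINET = 3
CABINETS_PER_GROUP = 2


def nodes_per_level():
    """Return the number of nodes in one chassis, cabinet and cabinet group."""
    chassis = NODES_PER_ARIES * ARIES_PER_CHASSIS
    cabinet = chassis * CHASSIS_PER_CABINET
    group = cabinet * CABINETS_PER_GROUP
    return {"chassis": chassis, "cabinet": cabinet, "group": group}


def locate(node):
    """Map a node index to (group, cabinet, chassis, aries, slot).

    Assumption: nodes are numbered consecutively, slot by slot.
    This numbering is hypothetical and only serves the illustration.
    """
    aries_global, slot = divmod(node, NODES_PER_ARIES)
    chassis_global, aries = divmod(aries_global, ARIES_PER_CHASSIS)
    cabinet_global, chassis = divmod(chassis_global, CHASSIS_PER_CABINET)
    group, cabinet = divmod(cabinet_global, CABINETS_PER_GROUP)
    return group, cabinet, chassis, aries, slot


def share_aries(node_a, node_b):
    """True if both nodes use the same Aries chip (same numbering assumption)."""
    return node_a // NODES_PER_ARIES == node_b // NODES_PER_ARIES


if __name__ == "__main__":
    print(nodes_per_level())
    print(locate(383), share_aries(0, 3), share_aries(3, 4))
```

With these counts a chassis holds 64 nodes, a cabinet 192 and a cabinet group 384.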

  • Communication is done by transferring data cache-coherently from the main memory of the source node into the main memory of the destination node.
  • The Aries chip has to translate the logical address into a physical memory address. Memory is managed by the system in memory pages.
  • To avoid repeating this translation, some translations are stored in an internal table within the Aries chip, similar to the TLB of a processor. If a translation is not present in this table, it must be recalculated, which costs performance.
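
The effect of such a table can be illustrated with a toy model. The sketch below is not a description of the Aries hardware: the table size, the least-recently-used replacement policy and the function names are assumptions chosen for illustration. It counts how often a translation has to be recalculated when a set of 4096-byte pages is accessed twice in a row.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # bytes, the Linux default page size


def count_misses(addresses, table_entries, page_size=PAGE_SIZE):
    """Count translations that are not found in a small LRU table.

    Assumption: the table behaves like a least-recently-used cache.
    The real replacement policy of the Aries chip is not modelled here.
    """
    table = OrderedDict()
    misses = 0
    for address in addresses:
        page = address // page_size
        if page in table:
            table.move_to_end(page)        # hit: mark as most recently used
        else:
            misses += 1                    # miss: translation is recalculated
            table[page] = None
            if len(table) > table_entries:
                table.popitem(last=False)  # evict the least recently used page
    return misses


def two_passes(num_pages, page_size=PAGE_SIZE):
    """Addresses that touch num_pages distinct pages, twice in the same order."""
    one_pass = [i * page_size for i in range(num_pages)]
    return one_pass + one_pass


if __name__ == "__main__":
    entries = 512  # hypothetical table size
    print(count_misses(two_passes(256), entries))   # working set fits in the table
    print(count_misses(two_passes(1024), entries))  # working set exceeds the table
```

In this model, a working set that fits into the table only misses on the first pass, while a working set larger than the table misses on every access.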

To reduce this problem, configurable page sizes were introduced in Linux a few years ago. With larger pages (the default is 4096 bytes), the memory range that can be addressed without a new address translation grows, which reduces the number of translations required. On the Cray XC40 we found that the Aries network chip can be overloaded by the address translations required for a very high communication volume in combination with the default 4 KB pages. These overloads occur especially with all-to-all and all-to-one communication patterns. The Aries chip is then so busy that this one communication process consumes almost all of its resources, i.e. all other nodes that also use this chip for communication slow down.

  • It is now proven that a severe performance degradation occurs when an Aries chip is shared by different applications and one of them generates such a load.
  • The load increases with the number of MPI processes taking part in the global communication. With Hazel Hen we have almost doubled the number of nodes, and for academic users priority has been given to large jobs. Since then, many jobs run on more than 1000 nodes.
  • Since the system upgrade there have been many user requests and complaints that the runtime behavior of an application is no longer predictable. It is now certain that this can often be avoided by using huge pages. The runtime variation is in some cases more than 100%.
  • Huge pages benefit both programs that themselves make heavy use of global communication routines and programs that do not use such routines but share an Aries chip over which such global communication runs. Using huge pages by default ensures that the Aries chip is not occupied solely with address translation for global communication.
  • In addition, Cray has optimized the MPI_Alltoall routines; recompiling is recommended if these routines are called directly.

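
The arithmetic behind both effects can be sketched in a few lines of Python. `pages_needed` counts how many pages, and therefore how many distinct address translations, cover a buffer of a given size. `alltoall_messages` counts the messages of a naive all-to-all in which every process sends one message to every other process. The 1 GiB buffer, the 2 MiB page size, the process counts and the naive exchange pattern are assumptions for illustration; the MPI library may use a different algorithm internally.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def pages_needed(buffer_bytes, page_size):
    """Number of pages required to cover buffer_bytes (rounded up)."""
    return -(-buffer_bytes // page_size)


def alltoall_messages(processes):
    """Messages in a naive all-to-all: each process sends to every other one."""
    return processes * (processes - 1)


if __name__ == "__main__":
    # Illustrative buffer size; 4 KiB is the default page size named above.
    for page_size in (4 * KIB, 2 * MIB, 8 * MIB):
        print(page_size, pages_needed(1 * GIB, page_size))
    # Illustrative process counts.
    for processes in (1000, 2000):
        print(processes, alltoall_messages(processes))
```

For a 1 GiB buffer this gives 262144 pages of 4 KiB but only 128 pages of 8 MiB. In the naive pattern, doubling the number of processes roughly quadruples the number of messages.
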
With the system upgrade planned for the new year, we will make huge pages the default on the Cray nodes. This change is essential because the effects of overloading the Aries network chips are so drastic. Cray has been working hard for months to prepare this change and to identify and eliminate possible sources of problems. To try huge pages now, you can simply relink your application with huge pages. To do so, load one of the huge pages modules before linking, for example "module load craype-hugepages8M" (the size does not matter). In addition, you must load a huge pages module in your batch script. You can try different sizes there without relinking. For more information, see XXX. We welcome feedback on your experience with huge pages, both on building your program and on its performance. For further reading, see: https://en.wikipedia.org/wiki/Page_(computer_memory)
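
A job can check that a huge pages module is actually loaded before the application starts. The following Python sketch assumes that the module system exports the colon-separated `LOADEDMODULES` environment variable, as Environment Modules does, and that the relevant module names begin with `craype-hugepages` as in the example above. The function name is made up for illustration.

```python
import os


def loaded_hugepage_modules(loadedmodules):
    """Return the huge pages modules found in a LOADEDMODULES-style string.

    Assumption: entries are separated by ':' and may carry a '/version' suffix.
    """
    names = [entry.split("/")[0] for entry in loadedmodules.split(":") if entry]
    return [name for name in names if name.startswith("craype-hugepages")]


if __name__ == "__main__":
    found = loaded_hugepage_modules(os.environ.get("LOADEDMODULES", ""))
    if found:
        print("huge pages module loaded:", ", ".join(found))
    else:
        print("no craype-hugepages module loaded")
```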