- Infos im HLRS Wiki sind nicht rechtsverbindlich und ohne Gewähr -
- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

Communication on Cray XC40 Aries network: Difference between revisions

From HLRS Platforms
Jump to navigationJump to search
No edit summary
 
(15 intermediate revisions by 3 users not shown)
Line 1: Line 1:
== Communication in the Cray XC40 using the Aries network chip and the impact of huge pages ==
  UNDER CONSTRUCTION


  UNDER CONSTRUCTION
Communication in the Cray XC40 using the Aries network chip and the impact of HugePages


== Introduction / Motivation ==
Since upgrading Hornet to Hazelhen, performance variations have been reported increasingly. Some of these variations are due to IO, but not all.
Since upgrading Hornet to Hazelhen, performance variations have been reported increasingly. Some of these variations are due to IO, but not all.
HLRS and Cray have investigated and we can explain and also offer a work around that reduces the problem. These will be presented after a brief explanation of the communication problem.
HLRS and Cray have analyzed this issue. We can explain it and also offer a workaround that reduces the problem. These will be presented after a brief explanation of the communication problem.
To understand the problem, a basic knowledge of the communication mechanism of the Cray XC is neccessary. The communication on the Cray XC runs over the Cray Interconnect, which is implemented via the Aries network chip. 4 Nodes (not neccessary computen nodes ) share a single Aries network chip.  16 Aries are in a, chassis' over the backplane connected together. Three of these chassis are mounted in a cabinet (Rack) and 2 cabinets are connected via copper cables to build a ''Cabinet Group''. The cabinet groups are interconnected via optical cables (see picture)


[[File:Aries_Chip.jpg]]
== XC40 design ==
To understand the problem, a basic knowledge of the communication mechanism of the Cray XC is necessary. The communication on the Cray XC runs over the Cray Interconnect, which is implemented via the Aries network chip.
* 4 Nodes (compute nodes) share one single Aries network chip. Following pictures a compute blade is shown in structure and logically.
* 16 Blades,each with a single Aries chip, are mounted in a ''chassis'' and all-to-all connected over the so called ''backplane''.
* 3 of these chassis are mounted in a cabinet (rack) and 2 cabinets (6 chassis) build a ''Cabinet Group''. Connections within these group is realized using copper cables.
* The cabinet groups are interconnected via optical cables (see picture).


The communication is carried out, by transferring data cache coherent from the source to the memory of the destination node.
<div><ul>
• It has to the logical address in the physical memory address converted from Aries be chip. The logical memory is managed in memory pages (Pages).
<li style="display: inline-block;"> [[File:Hazelhen-blade_structure.png|frame|compute blade (structure)]] </li>
To work around this calculation, some values ​​stored in an internal table Aries chip, similar to a TLB at a processor. If the value is not present in this table, it must be recalculated, what performance cost.
<li style="display: inline-block;"> [[Image:Aries_Connections.png|frame|compute blade (logical view)]] </li>
To reduce this problem, configurable Page sizes were introduced to Linux a few years ago. Through greater Pages (default is 4096 bytes) is the addressable memory space that is available without address translation to disposal increased, thereby reducing the number of necessary conversions.
</ul></div>
On the Cray XC 40 we found that can be overloaded by the required address conversion by very high volume of communications in conjunction with the default 4k Pages of Aries network chip. This communication overloads occur especially at all2all and All2one communication scheme. The Aries network chip is so busy that this is a communication process consumes almost all of the resources, ie all other nodes, which also use this chip for communication, slow down.
 
is now proven that a strong performance intrusion occurs when an Aries chip is used by different applications and generates one of them such a load.
== Cray XC communication mechanism ==
The load increases with the number of participating in the global communications MPI processes. With Hazel Hen we have almost doubled the number of nodes, and the priority of akadem. User placed on large jobs. Since then, many jobs are running in the range> 1000 nodes.
* The communication is done, by transferring data cache coherent from the main memory of the source node into the main memory of the destination node.
• There are since the system upgrade many users requests or complaints that the runtime behavior of an application is no longer predictable. It is now certain that this can often be avoided by the use of huge pages. The term variation is partly within the range of more than 100%.
* The Aries chip has to translate the logical address to a physical memory address. Memory is managed by the System in memory pages (Pages).
• Benefit from huge pages both programs, which use even intense global communication routines, as well as programs that do not use such routines, but using a network chip, run over the global communications. By using huge pages as default ensures that the Aries network chip is busy not only with the address conversion for global communication.
* To avoid lots of these calculations, some values ​​are stored in an internal table within the Aries chip, similar to a TLB of a processor. If the value is not present in this table, it must be recalculated, resulting in a reduced performance.
In addition, Cray has optimized the MPI_Alltoall routines and it is recommended to re-translated provided these routines are used directly.
 
With the planned for the new year system upgrade, we will introduce on the Cray nodes huge pages by default. This change is essential because the effects of the overloading of the Aries network chips are so drastically. It is worked by Cray for months very hard to prepare for this change and to identify possible sources of problems and eliminate them.
== Timing variation problem ==
To try Huge Pages now, they can easily create their application with huge pages once again left. For this purpose they invite before the left one of the modules, for example, "Module load craype-hugepages8M" (size does not matter). In addition, you must load a Huge Pages module in your batch script. The different sizes, you can try again without to left.
To reduce this problem, configurable Page sizes have been introduced to Linux systems a few years ago. By the use of greater Pages (default is 4096 bytes), the addressable memory space, which is available without address translation to disposal, increased. Thereby the number of necessary conversions is reduced.
For more information, see XXX
On the Cray XC40 we discovered that the Aries network chip could be overload by very high volume of communications in combination with the default 4k Pages. This communication overload is caused especially by All2all and All2one communication schemes. Although, one communication process can keep the Aries network chip busy with recalculating the addresses, consuming almost all of its resources. As a result, communications from other nodes, which also use this chip for communication, slow down.
We welcome feedback regarding. Their experience with huge pages, both by creating its program as well as about the performance.
* It is now proven that a strong performance intrusion occurs when an Aries chip is used by different applications and one of these generates such a load.
Who wants to read more about it here is a link:
* The load increases with the global number of participating MPI communication processes. With Hazel Hen the numbers of nodes is almost doubled, and the priority is set on large jobs for academic user. As a result, many jobs are running with > 1000 nodes.
https://en.wikipedia.org/wiki/Page_(computer_memory)
* Starting with the system upgrade, many user requests or user complaints arise dealing with unpredictable run-time behavior of their applications, in ranges more than 100%. It is verified that this can be avoided using HugePages.  
* Two kinds of applications will benefit from HugePages. On one hand, if they use intense global communication routines. And on the other hand, if they do not use such routines, but using a network chip which is effected by global communications of other applications. Setting HugePages as default ensures that the Aries network chip can handle more tasks than converting addresses for global communication.
* In addition, Cray has optimized the MPI_Alltoall routines and it is recommended to re-compile the application if these routines are used directly.
 
== Module environment change and testing ==
Starting with the system upgrade 4./5. January 2016, HugePages are be set as default on all Cray nodes. This change is required since the effects of the overloading of the Aries network chips was that drastically. Cray prepared this change intensively to identify and eliminate possible sources of problems.  
 
You can test different HugePages by
* re-link you application (with a loaded HugePage module),  
* switching to a certain HugePages module (module switch craype-hugepages16MB craype-hugepagesXXX, where XXX is a size greater equal 1M) in your batch submission script (not necessarily the size of the one used for linking)
 
For more information see [[ HugePages#How to use HugePages? | How to use HugePages]].
Please report issues with compiling, running and performance to the [http://www.hlrs.de/trouble-ticket-submission-form/ ticket system].
 
== Feedback ==
We welcome your feedback related to experience with compiling and performance using HugePages.
More about pages and HugePages is described at:
[https://en.wikipedia.org/wiki/Page_(computer_memory) | Pages@Wiki]

Latest revision as of 10:31, 22 February 2016

 UNDER CONSTRUCTION 

Communication in the Cray XC40 using the Aries network chip and the impact of HugePages

Introduction / Motivation

Since upgrading Hornet to Hazelhen, performance variations have been reported increasingly. Some of these variations are due to IO, but not all. HLRS and Cray have analyzed this issue. We can explain it and also offer a workaround that reduces the problem. These will be presented after a brief explanation of the communication problem.

XC40 design

To understand the problem, a basic knowledge of the communication mechanism of the Cray XC is necessary. The communication on the Cray XC runs over the Cray Interconnect, which is implemented via the Aries network chip.

  • 4 Nodes (compute nodes) share one single Aries network chip. Following pictures a compute blade is shown in structure and logically.
  • 16 Blades,each with a single Aries chip, are mounted in a chassis and all-to-all connected over the so called backplane.
  • 3 of these chassis are mounted in a cabinet (rack) and 2 cabinets (6 chassis) build a Cabinet Group. Connections within these group is realized using copper cables.
  • The cabinet groups are interconnected via optical cables (see picture).
  • compute blade (structure)
  • compute blade (logical view)

Cray XC communication mechanism

  • The communication is done, by transferring data cache coherent from the main memory of the source node into the main memory of the destination node.
  • The Aries chip has to translate the logical address to a physical memory address. Memory is managed by the System in memory pages (Pages).
  • To avoid lots of these calculations, some values ​​are stored in an internal table within the Aries chip, similar to a TLB of a processor. If the value is not present in this table, it must be recalculated, resulting in a reduced performance.

Timing variation problem

To reduce this problem, configurable Page sizes have been introduced to Linux systems a few years ago. By the use of greater Pages (default is 4096 bytes), the addressable memory space, which is available without address translation to disposal, increased. Thereby the number of necessary conversions is reduced. On the Cray XC40 we discovered that the Aries network chip could be overload by very high volume of communications in combination with the default 4k Pages. This communication overload is caused especially by All2all and All2one communication schemes. Although, one communication process can keep the Aries network chip busy with recalculating the addresses, consuming almost all of its resources. As a result, communications from other nodes, which also use this chip for communication, slow down.

  • It is now proven that a strong performance intrusion occurs when an Aries chip is used by different applications and one of these generates such a load.
  • The load increases with the global number of participating MPI communication processes. With Hazel Hen the numbers of nodes is almost doubled, and the priority is set on large jobs for academic user. As a result, many jobs are running with > 1000 nodes.
  • Starting with the system upgrade, many user requests or user complaints arise dealing with unpredictable run-time behavior of their applications, in ranges more than 100%. It is verified that this can be avoided using HugePages.
  • Two kinds of applications will benefit from HugePages. On one hand, if they use intense global communication routines. And on the other hand, if they do not use such routines, but using a network chip which is effected by global communications of other applications. Setting HugePages as default ensures that the Aries network chip can handle more tasks than converting addresses for global communication.
  • In addition, Cray has optimized the MPI_Alltoall routines and it is recommended to re-compile the application if these routines are used directly.

Module environment change and testing

Starting with the system upgrade 4./5. January 2016, HugePages are be set as default on all Cray nodes. This change is required since the effects of the overloading of the Aries network chips was that drastically. Cray prepared this change intensively to identify and eliminate possible sources of problems.

You can test different HugePages by

  • re-link you application (with a loaded HugePage module),
  • switching to a certain HugePages module (module switch craype-hugepages16MB craype-hugepagesXXX, where XXX is a size greater equal 1M) in your batch submission script (not necessarily the size of the one used for linking)

For more information see How to use HugePages. Please report issues with compiling, running and performance to the ticket system.

Feedback

We welcome your feedback related to experience with compiling and performance using HugePages. More about pages and HugePages is described at: | Pages@Wiki