- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -
NEC Aurora HW
Latest revision as of 16:28, 8 December 2022
Hardware
- 64 NEC Aurora TSUBASA cards, each with a single vector processor (8 A300-8 systems with 8 VEs of type 10B each)
  - 8 cores per processor/VE, 2.1 TFlops peak
  - 1.4 GHz frequency
  - 32 vector pipes per core, each doing 3 FMA per clock, resulting in 192 flops/cycle per core
  - 64 vector registers
  - Little-endian data format
  - Out-of-order execution
  - 256 KB L2 cache per core
  - 16 MB shared LLC
  - 48 GB HBM2 memory per VE
  - 6 HBM channels, for 1.2 TB/s memory bandwidth
  - 2x EDR InfiniBand links (100 Gbit/s each) per VH (vector host)
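The 2.1 TFlops peak follows from the per-core figures above. This assumes double precision and counts one FMA as 2 flops. The bytes-per-flop machine balance on the last line is a derived rough guide and is not stated in the original. Loops that need more memory traffic per flop than that will be bandwidth-bound.

```latex
\begin{align*}
32\ \text{pipes} \times 3\ \tfrac{\text{FMA}}{\text{clock}} \times 2\ \tfrac{\text{flops}}{\text{FMA}}
  &= 192\ \text{flops/cycle per core} \\
192\ \tfrac{\text{flops}}{\text{cycle}} \times 1.4\ \text{GHz} \times 8\ \text{cores}
  &= 2150.4\ \text{GFlops} \approx 2.15\ \text{TFlops per VE} \\
\frac{1.2\ \text{TB/s}}{2.15\ \text{TFlops}}
  &\approx 0.56\ \text{bytes/flop}
\end{align*}
```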
Software Architecture
The VE card does not run an operating system. It only executes the user application while that application is running. Before the application starts and after it ends, the VE is idle.
All system calls are forwarded to the VH (vector host), so filesystem accesses are transparent to the application.
The execution model supports both processes and threads, but it is not designed for oversubscription. Running more than one process or thread per core is bad practice and yields poor performance.
Scheduling uses time-sharing, so several processes (even from different users) can share a core. This rarely makes sense, however, because time slices are long and context switches are expensive.
In contrast to the previous SX architecture, the data format is little endian, and the compiler mimics the gcc memory layout as closely as possible. Binary files should therefore be interchangeable with files written by gcc-compiled programs on x86-64.
All caches, including the LLC, are coherent. The LLC is write-back, and all memory transfers pass through it. The LLC also supports two priorities; look for the keyword retain in the compiler manuals.