Optimizing for SX-9
Usage of hardware information
To get basic performance information about your application, set the environment variable F_PROGINF for Fortran programs or C_PROGINF for C/C++ programs. Possible values are YES and DETAIL. If the variable is set at runtime, the application prints timing information when it ends. Example F_PROGINF=YES:
****** Program Information ******
Real Time (sec)    : 26.804331
User Time (sec)    : 26.275639
Sys Time (sec)     : 0.308973
Vector Time (sec)  : 24.559803
Inst. Count        : 2334758597.
V. Inst. Count     : 1305590123.
V. Element Count   : 271274447129.
FLOP Count         : 129865319762.
MOPS               : 10363.348892
MFLOPS             : 4942.422871
VLEN               : 207.779181
V. Op. Ratio (%)   : 99.622051
Memory Size (MB)   : 48.031250
Start Time (date)  : 2004/04/13 14:54:42
End Time (date)    : 2004/04/13 14:55:09
Example F_PROGINF=DETAIL:
****** Program Information ******
Real Time (sec)    : 26.768830
User Time (sec)    : 26.289378
Sys Time (sec)     : 0.309090
Vector Time (sec)  : 24.571109
Inst. Count        : 2334758675.
V. Inst. Count     : 1305590123.
V. Element Count   : 271274447129.
FLOP Count         : 129865319762.
MOPS               : 10357.932794
MFLOPS             : 4939.839858
VLEN               : 207.779181
V. Op. Ratio (%)   : 99.622051
Memory Size (MB)   : 48.031250
MIPS               : 88.809961
I-Cache (sec)      : 0.041723
O-Cache (sec)      : 0.059745
Bank (sec)         : 0.000690
Start Time (date)  : 2004/04/13 14:56:24
End Time (date)    : 2004/04/13 14:56:51
Using this variable causes only a constant overhead: the counters are read at program start and at program end, and the values are then computed and printed.
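The derived figures in the output (MOPS, MFLOPS, VLEN, V. Op. Ratio) appear to follow directly from the raw counters. The sketch below recomputes them from the F_PROGINF=YES sample. The formulas are inferred from the sample numbers, not taken from vendor documentation, so treat them as an assumption. The function name `derived_metrics` is made up for this illustration.

```python
# Raw counters copied from the F_PROGINF=YES sample output.
user_time = 26.275639            # User Time (sec)
inst_count = 2334758597          # Inst. Count (scalar + vector instructions)
v_inst_count = 1305590123        # V. Inst. Count
v_element_count = 271274447129   # V. Element Count
flop_count = 129865319762        # FLOP Count


def derived_metrics(user_time, inst_count, v_inst_count, v_element_count, flop_count):
    """Recompute derived PROGINF figures (formulas are an assumption, see text)."""
    scalar_inst = inst_count - v_inst_count
    # Assumed operation count: one per scalar instruction, one per vector element.
    operations = scalar_inst + v_element_count
    return {
        "MOPS": operations / user_time / 1e6,
        "MFLOPS": flop_count / user_time / 1e6,
        "VLEN": v_element_count / v_inst_count,
        "V. Op. Ratio (%)": 100.0 * v_element_count / operations,
    }


metrics = derived_metrics(user_time, inst_count, v_inst_count,
                          v_element_count, flop_count)
for name, value in metrics.items():
    print(f"{name:18s}: {value:.6f}")
```

Under these assumed formulas the results closely match the printed sample values. In particular, VLEN is the number of vector elements divided by the number of vector instructions, and V. Op. Ratio is the share of all operations that were executed as vector elements.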
The general strategy for tuning is:
- Vector time should be as close as possible to user time. This means V. Op. Ratio will be close to 100%.
- MFLOPS should be as high as possible (as long as the application performs floating point operations). A value between 2000 and 4500 is respectable. If it exceeds 9000, celebrate a miracle.
To achieve good performance, O-Cache (time lost to operand cache misses, in seconds) should be close to 0. Vectorized code cannot cause O-Cache misses! If V. Op. Ratio is 99% but the MFLOPS rate is still poor, there are several possible causes:
- no floating point operations in the code
- short vector length VLEN
- high bank times
VLEN is the average vector length processed by the vector pipes. Loops are processed in chunks of at most 256 elements, so VLEN cannot exceed 256, no matter how long the loops are. It should be close to 256. The longer the loops, the more efficiently the CPU can work. Try to achieve loop lengths on the order of thousands.
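To see why long loops push VLEN towards 256, here is a minimal sketch of a simple strip-mining model: a loop of a given length is split into full chunks of 256 elements plus one remainder chunk. This model and the name `average_vlen` are assumptions for illustration only. A real compiler may split loops differently.

```python
import math

MAX_VECTOR_LENGTH = 256  # upper bound on VLEN stated in the text


def average_vlen(loop_length, max_vl=MAX_VECTOR_LENGTH):
    """Average elements per chunk when a loop is split into chunks of at most max_vl."""
    chunks = math.ceil(loop_length / max_vl)
    return loop_length / chunks


for n in (100, 256, 257, 1000, 10000):
    print(f"loop length {n:6d} -> average vector length {average_vlen(n):.1f}")
```

In this model a loop of length 257 is the worst case just above the limit: it needs two chunks and averages only 128.5 elements per chunk, while loops of length 1000 or 10000 both average 250.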
High bank times indicate a high number of bank conflicts. A bank conflict occurs when a memory bank is accessed before the bank busy time from the previous access has elapsed. If bank time is high, look for leading dimensions in the code that are powers of two. Avoid distances between memory accesses that are multiples of a large power of two. Prefer odd or prime distances; stride 1 (in units of words) is best. With lookup tables, high bank conflict times can also arise if the lookup always hits the same value (which might be a consequence of cache optimizations). Try making copies of the tables and iterating over the copies.
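The effect of the stride can be illustrated with a toy model of interleaved memory. It assumes that the word at address a lives in bank a modulo the number of banks, and it uses a bank count of 1024 that is chosen purely for illustration and is not the SX-9 figure. The name `banks_touched` is likewise hypothetical.

```python
from math import gcd

NUM_BANKS = 1024  # hypothetical bank count, for illustration only


def banks_touched(stride, num_banks=NUM_BANKS):
    """Distinct banks hit by a constant-stride access stream.

    Assumes word address a maps to bank a % num_banks (toy model).
    """
    return num_banks // gcd(stride, num_banks)


for stride in (1, 2, 512, 1024, 1025):
    print(f"stride {stride:5d} -> {banks_touched(stride):5d} distinct banks")
```

In this model a stride of 1024 words, for example stepping through the second index of a Fortran array with leading dimension 1024, hits the same bank on every access, while padding the leading dimension to 1025 spreads the accesses over all banks again.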