- Information in the HLRS Wiki is not legally binding and is provided without warranty -

Optimizing for SX-9

Revision as of 11:38, 28 November 2008 by Hwwnec5 (talk | contribs)

Usage of hardware information

To get basic performance information about your application, set the environment variable F_PROGINF for Fortran programs or C_PROGINF for C/C++ programs. Possible values are YES and DETAIL. If the variable is set at runtime, the application prints timing and hardware counter information when the program ends. Example F_PROGINF=YES:

    ******  Program Information  ******
 Real Time (sec)       :         26.804331
 User Time (sec)       :         26.275639
 Sys  Time (sec)       :          0.308973
 Vector Time (sec)     :         24.559803
 Inst. Count           :        2334758597.
 V. Inst. Count        :        1305590123.
 V. Element Count      :      271274447129.
 FLOP Count            :      129865319762.
 MOPS                  :      10363.348892
 MFLOPS                :       4942.422871
 VLEN                  :        207.779181
 V. Op. Ratio (%)      :         99.622051
 Memory Size (MB)      :         48.031250
 Start Time (date)  :  2004/04/13 14:54:42
 End   Time (date)  :  2004/04/13 14:55:09

Example F_PROGINF=DETAIL:


    ******  Program Information  ******
 Real Time (sec)       :         26.768830
 User Time (sec)       :         26.289378
 Sys  Time (sec)       :          0.309090
 Vector Time (sec)     :         24.571109
 Inst. Count           :        2334758675.
 V. Inst. Count        :        1305590123.
 V. Element Count      :      271274447129.
 FLOP Count            :      129865319762.
 MOPS                  :      10357.932794
 MFLOPS                :       4939.839858
 VLEN                  :        207.779181
 V. Op. Ratio (%)      :         99.622051
 Memory Size (MB)      :         48.031250
 MIPS                  :         88.809961
 I-Cache (sec)         :          0.041723
 O-Cache (sec)         :          0.059745
 Bank (sec)            :          0.000690
 Start Time (date)  :  2004/04/13 14:56:24
 End   Time (date)  :  2004/04/13 14:56:51

Setting this variable causes only a small, constant overhead: the counters are read once at program start and once at program end, and the values are then computed and printed.
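When comparing several runs, it can be handy to read the printed block back into a script. The following is a minimal sketch, assuming the output has been captured as text in the "name : value" layout shown above. The function name parse_proginf is made up for this illustration and is not part of any NEC tool.

```python
import re

def parse_proginf(text):
    """Parse the 'name : value' lines of a PROGINF block into a dict.

    Numeric values become floats; everything else (e.g. dates) stays a string.
    """
    info = {}
    for line in text.splitlines():
        # Split at the first colon that has whitespace on both sides, so the
        # colons inside a time stamp such as 14:54:42 stay part of the value.
        match = re.match(r"^\s*(.+?)\s+:\s+(.+?)\s*$", line)
        if not match:
            continue  # header line or blank line
        key, value = match.group(1), match.group(2)
        try:
            info[key] = float(value)
        except ValueError:
            info[key] = value
    return info

# A few lines copied from the F_PROGINF=YES example above.
sample = """
    ******  Program Information  ******
 Real Time (sec)       :         26.804331
 User Time (sec)       :         26.275639
 Vector Time (sec)     :         24.559803
 Inst. Count           :        2334758597.
 V. Op. Ratio (%)      :         99.622051
 Start Time (date)  :  2004/04/13 14:54:42
"""

info = parse_proginf(sample)
for name, value in info.items():
    print(f"{name:20s} {value}")
```

The keys are kept exactly as printed (including the unit in parentheses), so a script can look up, for example, info["User Time (sec)"] without any renaming step.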

The general tuning strategy is:

Vector Time should be as close as possible to User Time, which means that V. Op. Ratio will be close to 100. MFLOPS should be as high as possible (as long as the application is doing floating point operations). A value between 2000 and 4500 is respectable. If it exceeds 9000, celebrate a miracle.
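To see how the printed figures relate to each other, the sketch below recomputes the derived values from the raw counters of the F_PROGINF=YES example. The formulas are not taken from NEC documentation; they are assumptions inferred from the sample numbers, which they reproduce to the printed precision. The function name derived_metrics is hypothetical.

```python
# Raw counters copied from the F_PROGINF=YES example above.
USER_TIME = 26.275639
VECTOR_TIME = 24.559803
INST_COUNT = 2334758597
V_INST_COUNT = 1305590123
V_ELEMENT_COUNT = 271274447129
FLOP_COUNT = 129865319762

def derived_metrics(user_time, vector_time, inst_count,
                    v_inst_count, v_element_count, flop_count):
    """Recompute the derived PROGINF values (formulas inferred, not official)."""
    scalar_inst = inst_count - v_inst_count
    # Assumption: each scalar instruction and each vector element is one operation.
    total_ops = scalar_inst + v_element_count
    return {
        "vector_time_fraction": vector_time / user_time,
        "mflops": flop_count / user_time / 1e6,
        "mops": total_ops / user_time / 1e6,
        "vlen": v_element_count / v_inst_count,
        "v_op_ratio": 100.0 * v_element_count / total_ops,
    }

metrics = derived_metrics(USER_TIME, VECTOR_TIME, INST_COUNT,
                          V_INST_COUNT, V_ELEMENT_COUNT, FLOP_COUNT)
for name, value in metrics.items():
    print(f"{name:22s} {value:14.6f}")
```

In this run about 93% of the user time is vector time, so the remaining scalar part is where further vectorization could still pay off.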

To achieve good performance, O-Cache (the time lost to operand cache misses, in seconds) should be close to 0. Vectorized code cannot cause operand cache misses. If V. Op. Ratio is 99 but the MFLOPS value is still poor, there are several possible causes:

  • no floating point operations in the code
  • short vector length VLEN
  • high bank times

VLEN is the average vector length processed by the vector pipes. A single vector instruction handles at most 256 elements, so a long loop is processed in chunks of up to 256 iterations, and VLEN can therefore never exceed 256, no matter how long the loops are. It should be close to 256. The longer the loops are, the more efficiently the CPU can work. Try to achieve loop lengths on the order of thousands.
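The effect of loop length on VLEN can be illustrated with a small model. This is a simplified sketch, assuming a loop is split into chunks of at most 256 elements and each chunk counts as one vector instruction; real code mixes many loops and instructions. The function name average_vlen is made up for this example.

```python
MAX_VLEN = 256  # maximum vector length, as stated in the text

def average_vlen(loop_length, max_vlen=MAX_VLEN):
    """Average elements per chunk when a loop is split into chunks of at most max_vlen."""
    if loop_length <= 0:
        raise ValueError("loop_length must be positive")
    chunks = -(-loop_length // max_vlen)  # ceiling division
    return loop_length / chunks

for n in (100, 256, 257, 1000, 100000):
    print(f"loop length {n:6d} -> average vector length {average_vlen(n):7.2f}")
```

A loop of 257 iterations is the worst case near the limit: it needs two chunks and averages only 128.5 elements, whereas loops with thousands of iterations stay close to 256 because the short remainder chunk carries little weight.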

High bank times indicate a large number of bank conflicts. A bank conflict occurs when a memory bank is accessed again before the bank busy time of the previous access has elapsed. If the bank time is high, look for leading dimensions in the code that are a power of two. Avoid distances (strides) between memory accesses that are a multiple of a large power of two. Prefer odd or prime strides; stride 1 (in units of words) is best. With lookup tables, high bank conflict times can also arise if the lookups always hit the same entry (which might be a consequence of cache optimizations). In that case, try making several copies of the table and cycling through the copies.
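Why power-of-two strides are harmful can be seen in a simple model of interleaved memory. This is only a sketch, assuming that consecutive words are mapped to consecutive banks in round-robin fashion. The bank count N_BANKS is a placeholder chosen for illustration, not the actual SX-9 figure, and banks_touched is a hypothetical helper.

```python
from math import gcd

N_BANKS = 4096  # placeholder value for illustration, NOT the real SX-9 bank count

def banks_touched(stride, n_banks=N_BANKS):
    """Number of distinct banks hit by a long strided access (stride in words).

    Assumes word address a lives in bank a % n_banks.
    """
    return n_banks // gcd(stride, n_banks)

for stride in (0, 1, 2, 1024, 1025, 4096):
    print(f"stride {stride:5d} -> {banks_touched(stride):5d} distinct banks")
```

In this model a stride of 1024 words cycles through only 4 banks, while padding the leading dimension to 1025 spreads the accesses over all banks again. Stride 0, which corresponds to a lookup that always hits the same table entry, uses a single bank; that is the case where copies of the table help.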