Optimizing for SX-9

From HLRS Platforms

Revision as of 16:13, 28 November 2008

Identifying optimization potential

Usage of hardware information

To get basic information about your application, set the environment variable F_PROGINF for Fortran or C_PROGINF for C/C++ programs. Possible values are YES and DETAIL. If the variable is set at runtime, the application prints timing information when the program ends. Example F_PROGINF=YES:

    ******  Program Information  ******

 Real Time (sec)       :         26.804331
 User Time (sec)       :         26.275639
 Sys  Time (sec)       :          0.308973
 Vector Time (sec)     :         24.559803
 Inst. Count           :        2334758597.
 V. Inst. Count        :        1305590123.
 V. Element Count      :      271274447129.
 FLOP Count            :      129865319762.
 MOPS                  :      10363.348892
 MFLOPS                :       4942.422871
 VLEN                  :        207.779181
 V. Op. Ratio (%)      :         99.622051
 Memory Size (MB)      :         48.031250

 Start Time (date)  :  2004/04/13 14:54:42
 End   Time (date)  :  2004/04/13 14:55:09

Example F_PROGINF=DETAIL:

    ******  Program Information  ******

 Real Time (sec)       :         26.768830
 User Time (sec)       :         26.289378
 Sys  Time (sec)       :          0.309090
 Vector Time (sec)     :         24.571109
 Inst. Count           :        2334758675.
 V. Inst. Count        :        1305590123.
 V. Element Count      :      271274447129.
 FLOP Count            :      129865319762.
 MOPS                  :      10357.932794
 MFLOPS                :       4939.839858
 VLEN                  :        207.779181
 V. Op. Ratio (%)      :         99.622051
 Memory Size (MB)      :         48.031250
 MIPS                  :         88.809961
 I-Cache (sec)         :          0.041723
 O-Cache (sec)         :          0.059745
 Bank (sec)            :          0.000690

 Start Time (date)  :  2004/04/13 14:56:24
 End   Time (date)  :  2004/04/13 14:56:51

Using this variable causes only a constant overhead (reading the counters at the beginning and at the end, then computing and printing the values).

General strategy for tuning is:

Vector time should be as close as possible to user time. This means V. Op. Ratio will be close to 100. MFLOPS should be as high as possible (as long as the application is doing floating point operations). A value between 2000 and 4500 is respectable. If it exceeds 9000, celebrate a miracle.

To achieve good performance, O-Cache (time spent on operand cache misses, in seconds) should be close to 0. Vectorized code cannot cause O-Cache misses! If V. Op. Ratio is 99 but performance in MFLOPS is still bad, there are several possible reasons:

  • no floating point operations in the code
  • short vector length VLEN
  • high bank times

VLEN is the average vector length processed by the vector pipes. Loops are processed in chunks of at most 256 elements, so VLEN cannot exceed 256, no matter how long the loops are. It should be close to 256. The longer the loops are, the more efficiently the CPU can work. Try to achieve loop lengths on the order of thousands.

High bank times indicate a high number of bank conflicts. A bank conflict occurs when a memory bank is accessed before the bank busy time from the previous access is over. If the bank time is high, search for power-of-two leading dimensions in the code. Distances between memory accesses that are multiples of a large power of two should be avoided. Try to use odd or prime distances; best is stride 1 (in units of words). With lookup tables, high bank conflict times can also arise if the lookup always hits the same value (which might be a consequence of cache optimizations). Try to make copies of the tables and iterate over the copies.

Profiling on subroutine basis

Simple Unix profiling does not cause large overhead: link the code with the -p option, run it, and call prof <executable name> to process the generated mon.out file. This shows how much time was spent in each subroutine.

If you also want to know the number of calls of each subroutine, the subroutines have to be recompiled with -p as well. This causes more overhead.

Hardware information on subroutine basis

To get more detailed information about subroutines, use the ftrace feature of the compilers.

Compile all subroutines you want to examine, but at least the entry and the exit of your application with -ftrace (Fortran and C/C++).

Run the application, and keep the generated ftrace.out files. Use the sxftrace/ftrace tool to get the actual information out of the binary file.

Example output:

*--------------------------*
 FLOW TRACE ANALYSIS LIST
*--------------------------*

Execution : Tue Apr 13 15:33:11 2004
Total CPU : 0:00'29"189


PROG.UNIT  FREQUENCY  EXCLUSIVE       AVER.TIME   MOPS MFLOPS V.OP  AVER.   VECTOR I-CACHE O-CACHE    BANK
                      TIME[sec](  % )    [msec]               RATIO V.LEN    TIME   MISS    MISS      CONF

chempo          1000    13.452( 46.1)    13.452 8832.9 4631.5 99.54 217.1   11.183  0.0057  0.0029  0.0000
kraft           1100    13.410( 45.9)    12.191 11377.9 5013.2 99.81 201.0   13.287  0.0024  0.0342  0.0007
zufall       4324320     2.181(  7.5)     0.001  148.7    5.9  0.00   0.0    0.000  0.0006  0.0007  0.0000
lj2                1     0.059(  0.2)    59.033  328.0  173.3 85.06 237.9    0.003  0.0295  0.0170  0.0000
korekt          1100     0.041(  0.1)     0.037 7792.8 3285.9 99.71 222.8    0.038  0.0015  0.0010  0.0000
voraus          1100     0.035(  0.1)     0.032 8537.4 3976.9 99.76 242.1    0.032  0.0014  0.0010  0.0000
transf          1100     0.005(  0.0)     0.005 11972.9 5976.8 99.84 216.0    0.005  0.0000  0.0000  0.0000
skal            1100     0.004(  0.0)     0.004 3554.4 1170.5 98.09 240.0    0.003  0.0004  0.0004  0.0000
geschw             1     0.002(  0.0)     1.726  161.4   30.3 12.72 230.0    0.000  0.0000  0.0000  0.0000
gitter             1     0.000(  0.0)     0.071  333.2   43.3 21.77 235.6    0.000  0.0000  0.0000  0.0000
init             100     0.000(  0.0)     0.000  249.3    0.1 43.79 235.6    0.000  0.0000  0.0000  0.0000
psi22              4     0.000(  0.0)     0.004  110.8   15.1  0.00   0.0    0.000  0.0000  0.0000  0.0000
phi22              4     0.000(  0.0)     0.002  112.9   15.3  0.00   0.0    0.000  0.0000  0.0000  0.0000
cor22              2     0.000(  0.0)     0.004   86.7    8.6  0.00   0.0    0.000  0.0000  0.0000  0.0000
cutoff             1     0.000(  0.0)     0.002   89.4    5.3  0.00   0.0    0.000  0.0000  0.0000  0.0000
----------------------------------------------------------------------------------------------------------
total        4330934    29.189(100.0)     0.007 9333.6 4449.1 99.57 207.8   24.550  0.0415  0.0570  0.0007


Hardware information on loop/block basis

If you need more detailed information about the contents of a subroutine, regions within subroutines can be defined for use with ftrace.

Enclose the section to be examined by special ftrace function calls:

       CALL FTRACE_REGION_BEGIN("REGION_A")
       DO I=1,10000
         A(I)=I
       ENDDO
       CALL FTRACE_REGION_END("REGION_A")

or in C/C++:

ftrace_region_begin("region_a");
/* region */
ftrace_region_end("region_a");

You can choose the name freely, but it has to match in the begin and end call. It is used as the identifier in the ftrace printout.

I/O

Use the environment variables F_FILEINF/C_FILEINF with values YES/DETAIL to find slow I/O. DETAIL (Fortran only) gives information for every I/O operation; this produces a lot of output, so be careful.


Improving performance

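General

Improving performance means improving vectorization. For a detailed discussion see optimizing C (http://fs.hlrs.de/~hwwnec5/SX-9/g1af28e/chap4.html) and vectorizing C (http://fs.hlrs.de/~hwwnec5/SX-9/g1af28e/chap5.html) for C programmers, and optimizing Fortran (http://fs.hlrs.de/~hwwnec5/SX-9/g1af07e/chap4.html) and vectorizing Fortran (http://fs.hlrs.de/~hwwnec5/SX-9/g1af07e/chap5.html) for Fortran programmers.

Some basic rules: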

  • improve inner loop iteration length, longer loops are better than short loops
  • avoid indirect addressing and pointers
  • avoid power of 2 strides when accessing data in memory
  • use restrict keyword for C pointers
  • avoid function calls
  • enable inlining (-pi auto option of compiler)


I/O

Because the underlying GFS uses striping, I/O should be done in large chunks to make efficient use of the resources.

Try to do I/O in multiples of 4 MB blocks, if this is possible. For Fortran, setting the unit's I/O buffer to 4 MB improves I/O speed a lot. Use export F_SETBUF<unit>=4096 to set the buffer size for Fortran unit <unit>.

To assist in tuning Fortran I/O, it is possible to generate statistics of the I/O (Fortran only). Set export F_FILEINF=YES or export F_FILEINF=DETAIL to get information about I/O sizes, achieved transfer bandwidth and settings of the Fortran units.

In C, use setvbuf to increase the buffer size of the buffered I/O calls fwrite and fread. Call setvbuf after fopen but before the first actual I/O. You can also use the environment variable C_SETBUF.

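In C++, use pubsetbuf as in this example:

#include <fstream>
using namespace std;

// 4 MB buffer; static, so it is not placed on the stack
static char buffer[4 * 1024 * 1024];

int main(int argc, char **argv)
{
       fstream out_file;
       // set the buffer before opening the file;
       // some implementations ignore pubsetbuf on an open file
       out_file.rdbuf()->pubsetbuf(buffer, sizeof buffer);
       out_file.open("huhu", ios::app | ios::out);
       out_file << "Hallo" << endl;
}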