Optimizing for SX-9

Identifying optimization potential

Usage of hardware information

To get basic information about your application, set F_PROGINF for Fortran or C_PROGINF for C/C++ programs. Possible values are YES and DETAIL. If this environment variable is set at runtime, the application prints timing information at program end.

Example F_PROGINF=YES:

    ******  Program Information  ******
 Real Time (sec)               :          8005.250512
 User Time (sec)               :          8004.623898
 Sys  Time (sec)               :             0.430407
 Vector Time (sec)             :          6360.582440
 Inst. Count                   :        3924843780618.
 V. Inst. Count                :        1479787913250.
 V. Element Count              :      370521602681208.
 FLOP Count                    :      184739227878234.
 MOPS                          :         46593.901637
 MFLOPS                        :         23079.064080
 A. V. Length                  :           250.388315
 V. Op. Ratio (%)              :            99.344430
 Memory Size (MB)              :          2560.031250
    
 Start Time (date)    :  Fri Nov 21 11:10:28 MET 2008
 End   Time (date)    :  Fri Nov 21 13:23:53 MET 2008


Example F_PROGINF=DETAIL:

     ******  Program Information  ******
 Real Time (sec)               :          8005.250512
 User Time (sec)               :          8004.623898
 Sys  Time (sec)               :             0.430407
 Vector Time (sec)             :          6360.582440
 Inst. Count                   :        3924843780618.
 V. Inst. Count                :        1479787913250.
 V. Element Count              :      370521602681208.
 FLOP Count                    :      184739227878234.
 MOPS                          :         46593.901637
 MFLOPS                        :         23079.064080
 A. V. Length                  :           250.388315
 V. Op. Ratio (%)              :            99.344430
 Memory Size (MB)              :          2560.031250
 MIPS                          :           490.322073
 I-Cache (sec)                 :             0.064346
 O-Cache (sec)                 :          1511.726309
 Bank Conflict Time
   CPU Port Conf. (sec)        :           165.409913
   Memory Network Conf. (sec)  :          2350.324116
    
 Start Time (date)    :  Fri Nov 21 11:10:28 MET 2008
 End   Time (date)    :  Fri Nov 21 13:23:53 MET 2008

Using this variable only causes constant overhead (reading the counters at the beginning, reading them at the end, and computing and printing the values).

The general tuning strategy is:

Vector time should be as close as possible to user time; this means V. Op. Ratio will be close to 100. MFLOPS should be as high as possible (as long as the application is doing floating point operations).

To achieve good performance, O-Cache (operand cache misses in seconds) should be close to 0. Vectorized code cannot cause O-cache misses! If V. Op. Ratio is 99 but the MFLOPS rate is still poor, there are several possible reasons:

  • no floating point operations in the code
  • short vector length VLEN
  • high bank times

VLEN (reported as A. V. Length above) is the average vector length processed by the vector pipes. Loops are executed in chunks of at most 256 elements, so this value can never exceed 256, no matter how long the loops are. It should be close to 256: the longer the loops, the more efficiently the CPU can work. Try to achieve loop lengths in the order of thousands.
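
As an illustration of the loop-length rule, here is a minimal C sketch (the array names and sizes are invented for this example): when the short dimension sits in the inner loop the vector pipes only ever see very short vectors, while collapsing the contiguous nest into one long loop lets them work in full 256-element chunks.

/* Sketch only: giving the vector pipes long loops.
   Array names and sizes are made up for this illustration. */
#define N 2000
#define M 3

static double a[N][M], b[N][M];

/* Short inner loop: every vector operation has length M = 3. */
void scale_short_inner(double s)
{
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < M; j++)
            a[i][j] = s * b[i][j];
}

/* The arrays are contiguous, so the nest can be collapsed into a single
   loop of length N*M = 6000, processed in chunks of 256 elements. */
void scale_collapsed(double s)
{
    double *ap = &a[0][0], *bp = &b[0][0];
    int k;
    for (k = 0; k < N * M; k++)
        ap[k] = s * bp[k];
}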

High bank times indicate a large number of bank conflicts. A bank conflict occurs when a memory bank is accessed before the bank busy time from the previous access has elapsed. If this value is high, look for power-of-two leading dimensions in the code. Distances between memory accesses that are a multiple of a large power of two should be avoided. Try to use odd or prime distances; best is stride 1 (in units of words). When using lookup tables, high bank conflict times can also arise if the lookup always hits the same value (which might be a consequence of cache optimizations). Try to make several copies of the table and cycle through the copies.
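
The padding idea can be sketched in C as follows (names and sizes are invented; in Fortran the analogous fix is to pad the leading dimension of the array). Walking down a column of a 1024-wide array of doubles means a stride of 1024 words, a large power of two, so successive accesses keep hitting the same banks; padding the row length to an odd value spreads them out.

/* Sketch only: padding an array dimension to avoid power-of-two strides. */
#define N 1024

static double a_conflicting[N][1024];  /* column access: stride 1024 words  */
static double a_padded[N][1025];       /* column access: stride 1025 words, */
                                       /* the extra column stays unused     */

double column_sum(int j)
{
    int i;
    double s = 0.0;
    for (i = 0; i < N; i++)
        s += a_padded[i][j];           /* accesses spread over many banks */
    return s;
}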

Profiling on subroutine basis

Simple Unix profiling, i.e. linking the code with the -p option and calling prof <executable name> to process the generated mon.out file, does not cause large overhead. This gives you basic information about how much time was spent in which subroutine.

If you also want to know how often each subroutine was called, the subroutines have to be recompiled with -p as well. This causes more overhead.

Hardware information on subroutine basis

To get more detailed information about subroutines, use the ftrace feature of the compilers.

Compile all subroutines you want to examine, but at least the entry and exit points of your application, with -ftrace (Fortran and C/C++).

Run the application and keep the generated ftrace.out files. Use the sxftrace/ftrace tool to extract the actual information from the binary file.

Example output:

*--------------------------*
 FLOW TRACE ANALYSIS LIST
*--------------------------*

Execution : Tue Apr 13 15:33:11 2004
Total CPU : 0:00'29"189


PROG.UNIT  FREQUENCY  EXCLUSIVE       AVER.TIME   MOPS MFLOPS V.OP  AVER.   VECTOR I-CACHE O-CACHE    BANK
                      TIME[sec](  % )    [msec]               RATIO V.LEN    TIME   MISS    MISS      CONF

chempo          1000    13.452( 46.1)    13.452 8832.9 4631.5 99.54 217.1   11.183  0.0057  0.0029  0.0000
kraft           1100    13.410( 45.9)    12.191 11377.9 5013.2 99.81 201.0   13.287  0.0024  0.0342  0.0007
zufall       4324320     2.181(  7.5)     0.001  148.7    5.9  0.00   0.0    0.000  0.0006  0.0007  0.0000
lj2                1     0.059(  0.2)    59.033  328.0  173.3 85.06 237.9    0.003  0.0295  0.0170  0.0000
korekt          1100     0.041(  0.1)     0.037 7792.8 3285.9 99.71 222.8    0.038  0.0015  0.0010  0.0000
voraus          1100     0.035(  0.1)     0.032 8537.4 3976.9 99.76 242.1    0.032  0.0014  0.0010  0.0000
transf          1100     0.005(  0.0)     0.005 11972.9 5976.8 99.84 216.0    0.005  0.0000  0.0000  0.0000
skal            1100     0.004(  0.0)     0.004 3554.4 1170.5 98.09 240.0    0.003  0.0004  0.0004  0.0000
geschw             1     0.002(  0.0)     1.726  161.4   30.3 12.72 230.0    0.000  0.0000  0.0000  0.0000
gitter             1     0.000(  0.0)     0.071  333.2   43.3 21.77 235.6    0.000  0.0000  0.0000  0.0000
init             100     0.000(  0.0)     0.000  249.3    0.1 43.79 235.6    0.000  0.0000  0.0000  0.0000
psi22              4     0.000(  0.0)     0.004  110.8   15.1  0.00   0.0    0.000  0.0000  0.0000  0.0000
phi22              4     0.000(  0.0)     0.002  112.9   15.3  0.00   0.0    0.000  0.0000  0.0000  0.0000
cor22              2     0.000(  0.0)     0.004   86.7    8.6  0.00   0.0    0.000  0.0000  0.0000  0.0000
cutoff             1     0.000(  0.0)     0.002   89.4    5.3  0.00   0.0    0.000  0.0000  0.0000  0.0000
----------------------------------------------------------------------------------------------------------
total        4330934    29.189(100.0)     0.007 9333.6 4449.1 99.57 207.8   24.550  0.0415  0.0570  0.0007


Hardware information on loop/block basis

If you need more detailed information about the contents of a subroutine, you can define regions within subroutines to be used with ftrace.

Enclose the section to be examined with special ftrace function calls:

       CALL FTRACE_REGION_BEGIN("REGION_A")
       DO I=1,10000
         A(I)=I
       ENDDO
       CALL FTRACE_REGION_END("REGION_A")

or in C/C++:

ftrace_region_begin("region_a");
/* region */
ftrace_region_end("region_a");

The region name can be chosen freely, but it has to match in the begin and end calls. It is used as the identifier in the ftrace printout.
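
A slightly more complete C sketch of the region calls is shown below. The prototypes are assumed here (string argument, no return value, as suggested by the fragment above); check the compiler's ftrace documentation for the actual declarations.

/* Sketch only: timing one loop with an ftrace region.
   The prototypes below are assumptions, not taken from the documentation. */
void ftrace_region_begin(const char *name);
void ftrace_region_end(const char *name);

#define N 10000
static double a[N];

void fill(void)
{
    int i;
    ftrace_region_begin("region_a");   /* name must match the end call */
    for (i = 0; i < N; i++)
        a[i] = (double)i;
    ftrace_region_end("region_a");     /* "region_a" appears in the ftrace listing */
}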

I/O

Use the environment variables F_FILEINF/C_FILEINF with values YES/DETAIL to find slow I/O. DETAIL (Fortran only) gives information for every I/O operation; this produces a lot of output, so be careful.


Improving performance

General

Improving performance means improving vectorization. For a detailed discussion see optimizing C (http://fs.hlrs.de/~hwwnec5/SX-9/g1af28e/chap4.html) and vectorizing C (http://fs.hlrs.de/~hwwnec5/SX-9/g1af28e/chap5.html) for C programmers, and optimizing Fortran (http://fs.hlrs.de/~hwwnec5/SX-9/g1af07e/chap4.html) and vectorizing Fortran (http://fs.hlrs.de/~hwwnec5/SX-9/g1af07e/chap5.html) for Fortran programmers.

Some basic rules:

  • improve inner loop iteration length; longer loops are better than short loops
  • avoid indirect addressing and pointers
  • avoid power-of-two strides when accessing data in memory
  • use the restrict keyword for C pointers (see the sketch after this list)
  • avoid function calls
  • enable inlining (compiler option -pi auto)
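
As a minimal sketch of the restrict rule (function and array names invented), declaring the pointer arguments restrict tells the compiler that the arrays do not overlap, so the loop can be vectorized without runtime dependence checks:

/* Sketch only: restrict-qualified pointers rule out aliasing that would
   otherwise prevent or slow down vectorization of this loop. */
void axpy(int n, double alpha, const double * restrict x, double * restrict y)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] = y[i] + alpha * x[i];
}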


I/O

Because the underlying GFS uses striping, I/O should be done in large chunks to make efficient use of the resources.

Try to do I/O in multiples of 4 MB blocks, if possible. For Fortran, setting a unit's I/O buffer to 4 MB improves I/O speed a lot. Use export F_SETBUF<unit>=4096 to set the buffer size for Fortran unit <unit>.

To assist in tuning Fortran I/O, it is possible to generate statistics of the I/O (Fortran only). Set export F_FILEINF=YES or export F_FILEINF=DETAIL to get information about I/O sizes, achieved transfer bandwidth and the settings of the Fortran units.

In C, use setvbuf to increase the buffer size used by the buffered I/O calls fwrite and fread. Call setvbuf after fopen but before the first actual I/O. You can also use the environment variable C_SETBUF. Use sxcc -D_USE_SETBUF=1 to enable this feature, and set e.g. C_SETBUF=16M.
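
A minimal sketch of the setvbuf approach described above (the file name and the 4 MB buffer size are chosen for illustration, matching the block size recommended earlier):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    static char buf[4 * 1024 * 1024];      /* 4 MB user-supplied buffer */
    FILE *fp = fopen("output.dat", "wb");
    if (fp == NULL)
        return EXIT_FAILURE;

    /* must be called after fopen but before the first read/write */
    setvbuf(fp, buf, _IOFBF, sizeof(buf));

    /* ... large fwrite/fread calls now go through the 4 MB buffer ... */

    fclose(fp);
    return 0;
}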

In C++, use pubsetbuf as in this example:

#include <fstream>
using namespace std;
int main(int argc, char **argv)
{
       // static: keeps the 16 MB buffer off the stack
       static char buffer[4096*4096];
       fstream out_file;
       // set the stream buffer before opening the file,
       // otherwise pubsetbuf may have no effect
       out_file.rdbuf()->pubsetbuf(buffer,4096*4096);
       out_file.open("huhu",ios::app|ios::out);
       out_file << "Hallo" << endl;
}