
SX ACE optimizing


this article is a stub; the examples have not yet been adapted to SX-ACE output. The SX-ACE output differs slightly because of the ADB, but the general guidelines still hold true

Identifying optimization potential

Usage of hardware information

To get basic information about your application, set the environment variable F_PROGINF for Fortran or C_PROGINF for C/C++ programs. Possible values are YES and DETAIL. If this environment variable is set at runtime, the application prints timing and performance counter information when it terminates.

Example F_PROGINF=YES:

    ******  Program Information  ******
 Real Time (sec)               :          8005.250512
 User Time (sec)               :          8004.623898
 Sys  Time (sec)               :             0.430407
 Vector Time (sec)             :          6360.582440
 Inst. Count                   :        3924843780618.
 V. Inst. Count                :        1479787913250.
 V. Element Count              :      370521602681208.
 FLOP Count                    :      184739227878234.
 MOPS                          :         46593.901637
 MFLOPS                        :         23079.064080
 A. V. Length                  :           250.388315
 V. Op. Ratio (%)              :            99.344430
 Memory Size (MB)              :          2560.031250
    
 Start Time (date)    :  Fri Nov 21 11:10:28 MET 2008
 End   Time (date)    :  Fri Nov 21 13:23:53 MET 2008


Example F_PROGINF=DETAIL:

     ******  Program Information  ******
 Real Time (sec)               :          8005.250512
 User Time (sec)               :          8004.623898
 Sys  Time (sec)               :             0.430407
 Vector Time (sec)             :          6360.582440
 Inst. Count                   :        3924843780618.
 V. Inst. Count                :        1479787913250.
 V. Element Count              :      370521602681208.
 FLOP Count                    :      184739227878234.
 MOPS                          :         46593.901637
 MFLOPS                        :         23079.064080
 A. V. Length                  :           250.388315
 V. Op. Ratio (%)              :            99.344430
 Memory Size (MB)              :          2560.031250
 MIPS                          :           490.322073
 I-Cache (sec)                 :             0.064346
 O-Cache (sec)                 :          1511.726309
 Bank Conflict Time
   CPU Port Conf. (sec)        :           165.409913
   Memory Network Conf. (sec)  :          2350.324116
    
 Start Time (date)    :  Fri Nov 21 11:10:28 MET 2008
 End   Time (date)    :  Fri Nov 21 13:23:53 MET 2008

Using this variable only causes a constant overhead: the hardware counters are read at the beginning and at the end of the run, and the values are computed and printed at the end.

The general tuning strategy is:

The vector time should be as close as possible to the user time; in that case the V. Op. Ratio will be close to 100. MFLOPS should be as high as possible (provided the application actually performs floating point operations).

To achieve good performance, O-Cache (time lost to operand cache misses, in seconds) should be close to 0. Vectorized code cannot cause operand cache misses! If the V. Op. Ratio is 99 but the MFLOPS value is still low, there are several possible reasons:

  • no floating point operations in the code
  • short average vector length (VLEN)
  • high bank conflict times

VLEN is the average vector length processed by the vector pipes. Loops are processed in strips of at most 256 elements (so the reported value is roughly the average strip length: chunks of 256 plus a remainder), and it can therefore not exceed 256, no matter how long the loops are. It should be close to 256: the longer the loops, the more efficiently the CPU can work. Try to achieve loop lengths in the order of thousands.
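As an illustration (a minimal sketch with hypothetical array names), a loop nest whose inner loop is short can often be collapsed into one long loop, which raises the average vector length:

       REAL A(8,100000), B(8,100000), C(8,100000)
       REAL A1(800000),  B1(800000),  C1(800000)
       EQUIVALENCE (A,A1), (B,B1), (C,C1)

! inner loop of length 8: the average vector length is only 8
       DO J=1,100000
         DO I=1,8
           A(I,J) = B(I,J) + C(I,J)
         ENDDO
       ENDDO

! the same computation as one long loop: the vector pipes now work
! on full strips of 256 elements
       DO IJ=1,800000
         A1(IJ) = B1(IJ) + C1(IJ)
       ENDDO

The compiler may perform such loop collapsing automatically when it can prove that the arrays are stored contiguously.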

High bank conflict times indicate a high number of bank conflicts. A bank conflict occurs when a memory bank is accessed before the bank busy time of the previous access to that bank has passed. If this time is high, look for power-of-two leading dimensions in the code. Strides that are a multiple of a large power of two should be avoided; try to use odd or prime strides, and stride 1 (in units of words) is best. When using lookup tables, high bank conflict times can also arise if the lookup always hits the same entry (which can be a consequence of cache-oriented optimizations). In that case, try to make several copies of the table and alternate between the copies.
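A typical remedy, shown here as a sketch with hypothetical array sizes, is to pad a power-of-two leading dimension so that accesses along a row of the array no longer keep hitting the same memory banks:

       PROGRAM BANKPAD
!      REAL A(1024,1024)   ! leading dimension 1024 (power of two):
!                          ! the row access below strides through
!                          ! memory in steps of 1024 words and keeps
!                          ! hitting the same memory banks
       REAL A(1025,1024)   ! padded (odd) leading dimension: the
                           ! accesses are spread over all banks
       REAL S
       A = 1.0
       S = 0.0
       DO J=1,1024
         S = S + A(1,J)    ! stride = leading dimension
       ENDDO
       PRINT *, S
       END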

Profiling on subroutine basis

Simple Unix-style profiling, i.e. linking the code with the -p option and calling sxprof <executable name> on the frontend to process the generated mon.out file, does not cause large overhead. This gives you basic information on how much time was spent in which subroutine.

If you also want to know how often each subroutine was called, the subroutines themselves have to be recompiled with -p as well. This causes additional overhead.
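A minimal sketch of the workflow, assuming a Fortran code and the sxf90 cross compiler (the exact sxprof invocation may differ on your installation, check its man page):

 sxf90 -p -c main.f90 sub.f90      # recompiling with -p also yields call counts
 sxf90 -p main.o sub.o -o a.out    # linking with -p enables the profiling
 # run a.out on the SX-ACE; this produces mon.out
 sxprof a.out                      # process mon.out on the frontend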

Hardware information on subroutine basis

To get more detailed information about individual subroutines, use the ftrace feature of the compilers.

Compile all subroutines you want to examine with the -ftrace option (Fortran and C/C++); at the very least, the routines containing the entry and the exit of your application must be compiled with it.

Run the application and keep the generated ftrace.out file(s). Use the sxftrace/sxftrace++ tool to extract the actual information from this binary file.
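A minimal sketch of the workflow (the details of the sxftrace invocation are an assumption here, check its documentation; it is assumed to pick up ftrace.out from the current directory):

 sxf90 -ftrace -c main.f90 sub.f90   # compile the routines of interest with -ftrace
 sxf90 -ftrace main.o sub.o -o a.out
 # run a.out on the SX-ACE; this produces ftrace.out
 sxftrace                            # print the flow trace analysis list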

Example output:

*--------------------------*
 FLOW TRACE ANALYSIS LIST
*--------------------------*

Execution : Tue Apr 13 15:33:11 2004
Total CPU : 0:00'29"189


PROG.UNIT  FREQUENCY  EXCLUSIVE       AVER.TIME   MOPS MFLOPS V.OP  AVER.   VECTOR I-CACHE O-CACHE    BANK
                      TIME[sec](  % )    [msec]               RATIO V.LEN    TIME   MISS    MISS      CONF

chempo          1000    13.452( 46.1)    13.452 8832.9 4631.5 99.54 217.1   11.183  0.0057  0.0029  0.0000
kraft           1100    13.410( 45.9)    12.191 11377.9 5013.2 99.81 201.0   13.287  0.0024  0.0342  0.0007
zufall       4324320     2.181(  7.5)     0.001  148.7    5.9  0.00   0.0    0.000  0.0006  0.0007  0.0000
lj2                1     0.059(  0.2)    59.033  328.0  173.3 85.06 237.9    0.003  0.0295  0.0170  0.0000
korekt          1100     0.041(  0.1)     0.037 7792.8 3285.9 99.71 222.8    0.038  0.0015  0.0010  0.0000
voraus          1100     0.035(  0.1)     0.032 8537.4 3976.9 99.76 242.1    0.032  0.0014  0.0010  0.0000
transf          1100     0.005(  0.0)     0.005 11972.9 5976.8 99.84 216.0    0.005  0.0000  0.0000  0.0000
skal            1100     0.004(  0.0)     0.004 3554.4 1170.5 98.09 240.0    0.003  0.0004  0.0004  0.0000
geschw             1     0.002(  0.0)     1.726  161.4   30.3 12.72 230.0    0.000  0.0000  0.0000  0.0000
gitter             1     0.000(  0.0)     0.071  333.2   43.3 21.77 235.6    0.000  0.0000  0.0000  0.0000
init             100     0.000(  0.0)     0.000  249.3    0.1 43.79 235.6    0.000  0.0000  0.0000  0.0000
psi22              4     0.000(  0.0)     0.004  110.8   15.1  0.00   0.0    0.000  0.0000  0.0000  0.0000
phi22              4     0.000(  0.0)     0.002  112.9   15.3  0.00   0.0    0.000  0.0000  0.0000  0.0000
cor22              2     0.000(  0.0)     0.004   86.7    8.6  0.00   0.0    0.000  0.0000  0.0000  0.0000
cutoff             1     0.000(  0.0)     0.002   89.4    5.3  0.00   0.0    0.000  0.0000  0.0000  0.0000
----------------------------------------------------------------------------------------------------------
total        4330934    29.189(100.0)     0.007 9333.6 4449.1 99.57 207.8   24.550  0.0415  0.0570  0.0007


Hardware information on loop/block basis

If you need more detailed information about the contents of a subroutine, you can define regions within subroutines that are then reported separately by ftrace.

Enclose the section to be examined by special ftrace function calls:

       CALL FTRACE_REGION_BEGIN("REGION_A")
       DO I=1,10000
         A(I)=I
       ENDDO
       CALL FTRACE_REGION_END("REGION_A")

or in C/C++:

ftrace_region_begin("region_a");
/* region */
ftrace_region_end("region_a");

The region name can be chosen freely, but it has to match in the begin and end calls. It is used as the identifier in the ftrace printout.

I/O

Use the environment variables F_FILEINF (Fortran) or C_FILEINF (C/C++) with the values YES or DETAIL to find slow I/O. DETAIL (Fortran only) gives information for every single I/O operation; this produces a lot of output, so be careful.


Improving performance

General

Improving performance means improving vectorization. For a detailed discussion see optimizing C and vectorising C for C programmers, and optimizing fortran and vectorizing fortran for Fortran programmers.

Some basic rules

  • improve the inner loop iteration length: longer loops are better than short loops
  • avoid indirect addressing and pointers
  • avoid power-of-two strides when accessing data in memory
  • use the restrict keyword for C pointers
  • avoid function calls
  • enable inlining (-pi auto compiler option); see the sketch after this list
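
As an illustration of the last two rules, a minimal sketch with a hypothetical helper function: a function call inside an inner loop normally prevents vectorization of that loop, but with -pi auto the compiler can inline small routines so that the loop vectorizes again.

       REAL FUNCTION DAMP(X)
       REAL X
       DAMP = X/(1.0+X*X)
       END

       SUBROUTINE APPLY(A,N)
       REAL A(N)
! the call of DAMP would normally block vectorization of this loop;
! compiling with -pi auto lets the compiler inline DAMP so that the
! loop can be vectorized again
       DO I=1,N
         A(I) = DAMP(A(I))
       ENDDO
       END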


I/O

Because the underlying ScaTeFS file system uses striping, I/O should be done in large chunks to make efficient use of the resources.

Try to do I/O in multiples of 4 MB blocks if possible. For Fortran, setting a unit's I/O buffer to 4 MB improves I/O speed considerably. Use export F_SETBUF<unit>=4096 to set the buffer size (4096 kB = 4 MB) for Fortran unit <unit>.
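For example, assuming the results are written as unformatted records on Fortran unit 10 (the unit number, array size and file name are just examples), the buffer for that unit would be enlarged before the run with export F_SETBUF10=4096:

! run with:  export F_SETBUF10=4096   (4 MB buffer for unit 10)
       PROGRAM WRITEBUF
       REAL A(1048576)
       A = 0.0
       OPEN(10, FILE='results.dat', FORM='UNFORMATTED')
       DO I=1,100
         WRITE(10) A                 ! large unformatted records
       ENDDO                         ! benefit most from the buffer
       CLOSE(10)
       END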

To assist in tuning Fortran I/O, it is possible to generate I/O statistics (Fortran only). Set export F_FILEINF=YES or export F_FILEINF=DETAIL to get information about I/O sizes, the achieved transfer bandwidth and the settings of the Fortran units.

In C, use setvbuf to increase the buffer size used by the buffered I/O calls fwrite and fread. Call setvbuf after fopen but before the first actual I/O. Alternatively, you can use the environment variable C_SETBUF: compile with sxcc -D_USE_SETBUF=1 to enable this feature, and set e.g. C_SETBUF=16M.

In C++, use pubsetbuf like in this example:

#include <fstream>
using namespace std;

int main(int argc, char **argv)
{
       // 16 MB buffer; static so it does not live on the (limited) stack
       static char buffer[4096*4096];
       fstream out_file("huhu", ios::app|ios::out);
       // let the stream use the large buffer instead of the default one
       out_file.rdbuf()->pubsetbuf(buffer, 4096*4096);
       out_file << "Hallo" << endl;
       return 0;
}