- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -
Optimizing for SX-9
Identifying optimization potential
Usage of hardware information
To get basic information about your application, set F_PROGINF for Fortran or C_PROGINF for C/C++ programs. Possible values are YES and DETAIL. If this environment variable is set at runtime, the application prints timing information when the program ends. Example with F_PROGINF=YES:
****** Program Information ******
Real Time (sec)     : 26.804331
User Time (sec)     : 26.275639
Sys Time (sec)      : 0.308973
Vector Time (sec)   : 24.559803
Inst. Count         : 2334758597.
V. Inst. Count      : 1305590123.
V. Element Count    : 271274447129.
FLOP Count          : 129865319762.
MOPS                : 10363.348892
MFLOPS              : 4942.422871
VLEN                : 207.779181
V. Op. Ratio (%)    : 99.622051
Memory Size (MB)    : 48.031250
Start Time (date)   : 2004/04/13 14:54:42
End Time (date)     : 2004/04/13 14:55:09
Example with F_PROGINF=DETAIL:
****** Program Information ******
Real Time (sec)     : 26.768830
User Time (sec)     : 26.289378
Sys Time (sec)      : 0.309090
Vector Time (sec)   : 24.571109
Inst. Count         : 2334758675.
V. Inst. Count      : 1305590123.
V. Element Count    : 271274447129.
FLOP Count          : 129865319762.
MOPS                : 10357.932794
MFLOPS              : 4939.839858
VLEN                : 207.779181
V. Op. Ratio (%)    : 99.622051
Memory Size (MB)    : 48.031250
MIPS                : 88.809961
I-Cache (sec)       : 0.041723
O-Cache (sec)       : 0.059745
Bank (sec)          : 0.000690
Start Time (date)   : 2004/04/13 14:56:24
End Time (date)     : 2004/04/13 14:56:51
Setting this variable causes only a constant overhead: the counters are read at program start and at program end, and the values are then computed and printed.
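The derived figures in the listing can be cross-checked against the raw counters. The following C sketch uses formulas that are inferred from the sample output above, not taken from NEC documentation; they reproduce the printed VLEN, V. Op. Ratio, MOPS and MFLOPS of the F_PROGINF=YES example to within rounding. The struct and function names are made up for this illustration.

```c
#include <stdio.h>

/* Raw counters as printed by F_PROGINF / C_PROGINF (hypothetical container). */
typedef struct {
    double user_time_sec;  /* User Time (sec)  */
    double inst_count;     /* Inst. Count      */
    double v_inst_count;   /* V. Inst. Count   */
    double v_elem_count;   /* V. Element Count */
    double flop_count;     /* FLOP Count       */
} proginf_counters;

/* Assumed operation count: scalar instructions plus vector elements. */
static double total_ops(const proginf_counters *c)
{
    return (c->inst_count - c->v_inst_count) + c->v_elem_count;
}

/* Average number of elements per vector instruction. */
double proginf_vlen(const proginf_counters *c)
{
    return c->v_elem_count / c->v_inst_count;
}

/* Share of all operations that were vector element operations, in percent. */
double proginf_v_op_ratio(const proginf_counters *c)
{
    return 100.0 * c->v_elem_count / total_ops(c);
}

/* Million operations per second of user time. */
double proginf_mops(const proginf_counters *c)
{
    return total_ops(c) / c->user_time_sec / 1.0e6;
}

/* Million floating point operations per second of user time. */
double proginf_mflops(const proginf_counters *c)
{
    return c->flop_count / c->user_time_sec / 1.0e6;
}

/* Print the derived values in a layout similar to the listing. */
void proginf_print(const proginf_counters *c)
{
    printf("MOPS             : %f\n", proginf_mops(c));
    printf("MFLOPS           : %f\n", proginf_mflops(c));
    printf("VLEN             : %f\n", proginf_vlen(c));
    printf("V. Op. Ratio (%%) : %f\n", proginf_v_op_ratio(c));
}
```

Filling the struct with the counters of the F_PROGINF=YES example gives approximately VLEN 207.78, V. Op. Ratio 99.62, MOPS 10363.3 and MFLOPS 4942.4, which agrees with the printed values. This makes it easier to see which counter has to change to move a given figure.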
General strategy for tuning is:
Vector time should be as close as possible to user time, which means that V. Op. Ratio will be close to 100. MFLOPS should be as high as possible (provided the application does floating point operations).
To achieve good performance, O-Cache (operand cache misses in seconds) should be close to 0. Vectorized code cannot cause O-Cache misses. If V. Op. Ratio is 99 but performance in MFLOPS is still poor, there are several possible causes:
- no floating point operations in the code
- short vector length VLEN
- high bank times
VLEN is the average vector length processed by the vector pipes. A single vector instruction handles at most 256 elements, so longer loops are split into chunks of up to 256 elements and VLEN can therefore never exceed 256, no matter how long the loops are. It should be close to 256. The longer the loops are, the more efficiently the CPU can work. Try to achieve loop lengths in the order of thousands.
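To see why long loops push VLEN towards 256, consider a simplified model in which a loop of n iterations is processed as full chunks of 256 elements plus one shorter remainder chunk. The C sketch below computes the resulting average for a single loop; the model and the function names are assumptions for illustration and ignore everything else the compiler may do.

```c
#define VREG_LEN 256  /* maximum elements per vector instruction, as stated in the text */

/* Number of vector chunks needed for a loop of n iterations (assumed model). */
long chunks_for_loop(long n)
{
    return (n + VREG_LEN - 1) / VREG_LEN;
}

/* Average number of elements per chunk for one loop of n iterations. */
double average_vlen(long n)
{
    if (n <= 0)
        return 0.0;
    return (double)n / (double)chunks_for_loop(n);
}
```

Under this model a loop of 257 iterations averages only 128.5 elements per chunk, a loop of 1000 iterations reaches 250, and a loop of 100000 iterations reaches about 255.8. The remainder chunk matters less and less as the loop gets longer.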
High bank times indicate a large number of bank conflicts. A bank conflict occurs when a memory bank is accessed before the bank busy time of the previous access has elapsed. If bank time is high, look for leading dimensions in the code that are powers of two. Distances between memory accesses that are a multiple of a large power of two should be avoided. Try to use odd or prime distances; stride 1 (in units of words) is best. Lookup tables can also cause high bank conflict times if the lookup always hits the same value (which might be a consequence of cache optimizations). In that case, try to make several copies of the table and cycle through the copies.
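The effect of the stride can be illustrated with a toy model in C. It assumes that consecutive words are distributed round-robin over the banks and uses a bank count of 1024 that is chosen purely for illustration and is not the real SX-9 figure. The helper names are hypothetical.

```c
#include <stddef.h>

#define NBANKS 1024  /* hypothetical bank count, for illustration only */

/* Count the distinct banks touched by `count` accesses with the given
 * stride in words, assuming word w is stored in bank w % NBANKS. */
int distinct_banks(size_t stride, int count)
{
    unsigned char hit[NBANKS] = {0};
    int distinct = 0;

    for (int i = 0; i < count; ++i) {
        size_t bank = ((size_t)i * stride) % NBANKS;
        if (!hit[bank]) {
            hit[bank] = 1;
            ++distinct;
        }
    }
    return distinct;
}

/* Pad an even leading dimension by one element so that stepping
 * through the second index uses an odd stride. */
size_t padded_leading_dim(size_t n)
{
    return (n % 2 == 0) ? n + 1 : n;
}
```

In this model, 256 accesses with stride 1024 all land in the same bank, while stride 1025 (the padded leading dimension) spreads them over 256 different banks, just like stride 1. Declaring an array with the padded leading dimension costs a little memory but removes the power-of-two distance.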
Profiling on subroutine basis
Simple Unix profiling does not cause much overhead: link the code with the -p option, run the application, and call prof <executable name> to process the generated mon.out file. This shows how much time was spent in each subroutine.
If you also want to know how often each subroutine is called, the subroutines have to be recompiled with -p as well. This causes more overhead.
Hardware information on subroutine basis
To get more detailed information about subroutines, use the ftrace feature of the compilers.
Compile all subroutines you want to examine with -ftrace (Fortran and C/C++), including at least the entry and the exit of your application.
Run the application, and keep the generated ftrace.out files. Use the sxftrace/ftrace tool to get the actual information out of the binary file.
Example output:
*--------------------------*
  FLOW TRACE ANALYSIS LIST
*--------------------------*

Execution : Tue Apr 13 15:33:11 2004
Total CPU : 0:00'29"189

PROG.UNIT  FREQUENCY       EXCLUSIVE  AVER.TIME     MOPS  MFLOPS   V.OP  AVER.  VECTOR  I-CACHE  O-CACHE    BANK
                      TIME[sec]( % )     [msec]                   RATIO  V.LEN    TIME     MISS     MISS    CONF

chempo          1000   13.452( 46.1)     13.452   8832.9  4631.5  99.54  217.1  11.183   0.0057   0.0029  0.0000
kraft           1100   13.410( 45.9)     12.191  11377.9  5013.2  99.81  201.0  13.287   0.0024   0.0342  0.0007
zufall       4324320    2.181(  7.5)      0.001    148.7     5.9   0.00    0.0   0.000   0.0006   0.0007  0.0000
lj2                1    0.059(  0.2)     59.033    328.0   173.3  85.06  237.9   0.003   0.0295   0.0170  0.0000
korekt          1100    0.041(  0.1)      0.037   7792.8  3285.9  99.71  222.8   0.038   0.0015   0.0010  0.0000
voraus          1100    0.035(  0.1)      0.032   8537.4  3976.9  99.76  242.1   0.032   0.0014   0.0010  0.0000
transf          1100    0.005(  0.0)      0.005  11972.9  5976.8  99.84  216.0   0.005   0.0000   0.0000  0.0000
skal            1100    0.004(  0.0)      0.004   3554.4  1170.5  98.09  240.0   0.003   0.0004   0.0004  0.0000
geschw             1    0.002(  0.0)      1.726    161.4    30.3  12.72  230.0   0.000   0.0000   0.0000  0.0000
gitter             1    0.000(  0.0)      0.071    333.2    43.3  21.77  235.6   0.000   0.0000   0.0000  0.0000
init             100    0.000(  0.0)      0.000    249.3     0.1  43.79  235.6   0.000   0.0000   0.0000  0.0000
psi22              4    0.000(  0.0)      0.004    110.8    15.1   0.00    0.0   0.000   0.0000   0.0000  0.0000
phi22              4    0.000(  0.0)      0.002    112.9    15.3   0.00    0.0   0.000   0.0000   0.0000  0.0000
cor22              2    0.000(  0.0)      0.004     86.7     8.6   0.00    0.0   0.000   0.0000   0.0000  0.0000
cutoff             1    0.000(  0.0)      0.002     89.4     5.3   0.00    0.0   0.000   0.0000   0.0000  0.0000
----------------------------------------------------------------------------------------------------------------
total        4330934   29.189(100.0)      0.007   9333.6  4449.1  99.57  207.8  24.550   0.0415   0.0570  0.0007
Hardware information on loop/block basis
If you need more detailed information about the contents of a subroutine, regions within subroutines can be defined for use with ftrace.
Enclose the section to be examined by special ftrace function calls:
      CALL FTRACE_REGION_BEGIN("REGION_A")
      DO I=1,10000
         A(I)=I
      ENDDO
      CALL FTRACE_REGION_END("REGION_A")
or in C/C++:
    ftrace_region_begin("region_a");
    /* region */
    ftrace_region_end("region_a");
The region name can be chosen freely, but it has to match in the begin and end calls. It is used as the identifier in the ftrace printout.
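A fuller C sketch of the same idea is shown here, with two regions inside one function. USE_FTRACE is a hypothetical macro introduced for this example so that the file also builds on machines without ftrace; the two prototypes are written by hand as an assumption, so prefer the declarations supplied with the compiler if there are any. The function fill_and_sum is likewise made up.

```c
/* USE_FTRACE is a hypothetical macro: define it (e.g. -DUSE_FTRACE)
 * when compiling with -ftrace on the SX. */
#ifdef USE_FTRACE
/* Hand-written prototypes (assumption); use the compiler's own
 * declarations if it provides them. */
void ftrace_region_begin(const char *name);
void ftrace_region_end(const char *name);
#define REGION_BEGIN(name) ftrace_region_begin(name)
#define REGION_END(name)   ftrace_region_end(name)
#else
/* Without ftrace the markers compile to nothing. */
#define REGION_BEGIN(name) ((void)0)
#define REGION_END(name)   ((void)0)
#endif

/* Fill a[0..n-1] with 1..n and return the sum; each loop is its own region. */
double fill_and_sum(double *a, int n)
{
    double sum = 0.0;

    REGION_BEGIN("fill");
    for (int i = 0; i < n; ++i)
        a[i] = (double)(i + 1);
    REGION_END("fill");

    REGION_BEGIN("sum");
    for (int i = 0; i < n; ++i)
        sum += a[i];
    REGION_END("sum");

    return sum;
}
```

With this layout the ftrace printout should list the regions fill and sum separately, so the two loops can be compared directly. Each begin call is paired with an end call using the same string.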
I/O
Use the environment variables F_FILEINF/C_FILEINF with the values YES/DETAIL to find slow I/O. DETAIL (Fortran only) reports every single I/O operation and therefore produces a lot of output, so use it with care.
Improving performance
General
Improving performance means improving vectorization. For a detailed discussion, see optimizing C and vectorising C for C programmers, and optimizing fortran and vectorizing fortran for Fortran programmers.
Some basic rules
- increase the inner loop iteration count; long loops are better than short loops
- avoid indirect addressing and pointers
- avoid power of 2 strides when accessing data in memory
- use restrict keyword for C pointers
- avoid function calls
- enable inlining (-pi auto option of compiler)
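Two of these rules can be shown in a few lines of C. This is a minimal sketch with made-up function names: the first function uses restrict and unit stride, the second treats a contiguously stored matrix as one long loop instead of a nest with a short inner loop. Whether a given compiler actually vectorizes these loops has to be checked in its listing.

```c
#include <stddef.h>

/* y[i] += a * x[i] for i in [0, n).
 * restrict promises the compiler that x and y do not overlap, which
 * removes an assumed dependency between loop iterations. */
void axpy(size_t n, double a, const double *restrict x, double *restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* Scale every element of a rows x cols matrix stored contiguously.
 * One loop over rows * cols elements has a much longer iteration
 * count than an inner loop over cols alone. */
void scale_all(size_t rows, size_t cols, double s, double *restrict m)
{
    size_t total = rows * cols;

    for (size_t i = 0; i < total; ++i)
        m[i] *= s;
}
```

The caller has to make sure that the arrays passed to axpy really do not overlap; restrict is a promise, not a check.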
I/O
Because the underlying GFS uses striping, I/O should be done in large chunks to make efficient use of the resources.
Try to do I/O in multiples of 4 MB blocks where possible. For Fortran, setting the unit's I/O buffer to 4 MB improves I/O speed considerably. Use export F_SETBUF<unit>=4096 to set the buffer size for Fortran unit <unit>.
To assist in tuning Fortran I/O, it is possible to generate I/O statistics (Fortran only). Set export F_FILEINF=YES or export F_FILEINF=DETAIL to get information about I/O sizes, the achieved transfer bandwidth and the settings of the Fortran units.
In C, use setvbuf to increase the buffer size of the buffered I/O calls fwrite and fread. Call setvbuf after fopen but before the first actual I/O. You can also use the environment variable C_SETBUF: compile with sxcc -D_USE_SETBUF=1 to enable this feature, and set e.g. C_SETBUF=16M.
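The setvbuf call sequence described above could look like the following C sketch. The function name write_buffered and the 4 MB buffer size are choices made for this example (the size follows the 4 MB block recommendation); only standard C library calls are used.

```c
#include <stdio.h>
#include <stdlib.h>

#define IO_BUF_SIZE (4UL * 1024UL * 1024UL)  /* 4 MB, matching the block size advice */

/* Write n doubles to `path` through a 4 MB stdio buffer.
 * Returns 0 on success and -1 on any error. */
int write_buffered(const char *path, const double *data, size_t n)
{
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;

    char *buf = malloc(IO_BUF_SIZE);
    if (buf == NULL) {
        fclose(fp);
        return -1;
    }

    /* Must happen after fopen and before the first read or write. */
    if (setvbuf(fp, buf, _IOFBF, IO_BUF_SIZE) != 0) {
        fclose(fp);
        free(buf);
        return -1;
    }

    size_t written = fwrite(data, sizeof *data, n, fp);
    int rc = (written == n) ? 0 : -1;

    if (fclose(fp) != 0)
        rc = -1;
    free(buf);  /* the buffer may only be released after fclose */
    return rc;
}
```

The buffer is allocated on the heap because it has to stay valid until the stream is closed, and a buffer of several megabytes may be too large for the stack.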
In C++, use pubsetbuf like in this example:
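    #include <fstream>
    using namespace std;

    int main(int argc, char **argv)
    {
        // static storage: a 16 MB array may not fit on the stack
        static char buffer[4096*4096];
        fstream out_file("huhu", ios::app|ios::out);
        // hand the large buffer to the stream before the first output
        out_file.rdbuf()->pubsetbuf(buffer, 4096*4096);
        out_file << "Hallo" << endl;
    }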