- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -
SX ACE optimizing
Revision as of 17:59, 30 January 2015
This article is a stub. The examples have not yet been adapted to SX-ACE output; SX-ACE gives slightly different output due to ADB, but the general guidelines hold true.
Identifying optimization potential
Usage of hardware information
To get basic information about your application, set the environment variable F_PROGINF for Fortran programs or C_PROGINF for C/C++ programs. Possible values are YES and DETAIL. If the variable is set at runtime, the application prints timing information when the program ends.
Example F_PROGINF=YES:
         ******  Program Information  ******
      Real Time (sec)       : 8005.250512
      User Time (sec)       : 8004.623898
      Sys  Time (sec)       : 0.430407
      Vector Time (sec)     : 6360.582440
      Inst. Count           : 3924843780618.
      V. Inst. Count        : 1479787913250.
      V. Element Count      : 370521602681208.
      FLOP Count            : 184739227878234.
      MOPS                  : 46593.901637
      MFLOPS                : 23079.064080
      A. V. Length          : 250.388315
      V. Op. Ratio (%)      : 99.344430
      Memory Size (MB)      : 2560.031250
      Start Time (date)     : Fri Nov 21 11:10:28 MET 2008
      End   Time (date)     : Fri Nov 21 13:23:53 MET 2008
Example F_PROGINF=DETAIL:
         ******  Program Information  ******
      Real Time (sec)              : 8005.250512
      User Time (sec)              : 8004.623898
      Sys  Time (sec)              : 0.430407
      Vector Time (sec)            : 6360.582440
      Inst. Count                  : 3924843780618.
      V. Inst. Count               : 1479787913250.
      V. Element Count             : 370521602681208.
      FLOP Count                   : 184739227878234.
      MOPS                         : 46593.901637
      MFLOPS                       : 23079.064080
      A. V. Length                 : 250.388315
      V. Op. Ratio (%)             : 99.344430
      Memory Size (MB)             : 2560.031250
      MIPS                         : 490.322073
      I-Cache (sec)                : 0.064346
      O-Cache (sec)                : 1511.726309
      Bank Conflict Time
        CPU Port Conf. (sec)       : 165.409913
        Memory Network Conf. (sec) : 2350.324116
      Start Time (date)            : Fri Nov 21 11:10:28 MET 2008
      End   Time (date)            : Fri Nov 21 13:23:53 MET 2008
Using this variable causes only a constant overhead: the counters are read at the beginning and at the end of the run, and the values are then computed and printed.
General strategy for tuning is:
Vector time should be as close as possible to user time. This means that V. Op. Ratio will be close to 100. MFLOPS should be as high as possible (as long as the application is doing floating point operations).
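The derived figures in the F_PROGINF output can be recomputed from the raw counters, which helps when comparing runs. The following C sketch assumes the commonly used definitions: operations are scalar instructions plus vector elements, and rates are taken relative to user time. These definitions are an assumption, although they do reproduce the values printed in the sample output. The struct and function names are hypothetical.

```c
/* Raw counters as printed by F_PROGINF / C_PROGINF. */
struct proginf {
    double user_time;     /* User Time (sec)   */
    double inst_count;    /* Inst. Count       */
    double v_inst_count;  /* V. Inst. Count    */
    double v_elem_count;  /* V. Element Count  */
    double flop_count;    /* FLOP Count        */
};

/* Counter values copied from the sample output of this article. */
const struct proginf sample_run = {
    8004.623898,
    3924843780618.0,
    1479787913250.0,
    370521602681208.0,
    184739227878234.0
};

/* Assumed definition: scalar instructions plus processed vector elements. */
static double total_ops(const struct proginf *p)
{
    return (p->inst_count - p->v_inst_count) + p->v_elem_count;
}

/* V. Op. Ratio (%): share of all operations done as vector elements. */
double v_op_ratio(const struct proginf *p)
{
    return 100.0 * p->v_elem_count / total_ops(p);
}

/* A. V. Length: vector elements per vector instruction. */
double avg_vector_length(const struct proginf *p)
{
    return p->v_elem_count / p->v_inst_count;
}

/* MOPS: million operations per second of user time. */
double mops(const struct proginf *p)
{
    return total_ops(p) / p->user_time / 1.0e6;
}

/* MFLOPS: million floating point operations per second of user time. */
double mflops(const struct proginf *p)
{
    return p->flop_count / p->user_time / 1.0e6;
}
```

Calling these functions with `&sample_run` should give values close to the V. Op. Ratio, A. V. Length, MOPS and MFLOPS lines of the sample output.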
To achieve good performance, O-Cache (time lost to operand cache misses, in seconds) should be close to 0. Vectorized code cannot cause O-Cache misses! If V. Op. Ratio is 99 but the performance in MFLOPS is still bad, there are several possibilities:
- no floating point operations in the code
- short vector length VLEN
- high bank times
VLEN (A. V. Length in the output) is the average vector length processed by the vector pipes. A loop is processed in chunks of at most 256 elements, so the average cannot exceed 256, no matter how long the loops are. It should be close to 256. The longer the loops are, the more efficiently the CPU can work. Try to achieve loop lengths in the order of thousands.
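As a small worked example of how the loop length affects the average vector length, the following C sketch assumes that a loop of n iterations is split into the smallest possible number of vector operations of at most 256 elements each. How the compiler distributes the elements over these operations does not change the average. The function names are hypothetical.

```c
#define MAX_VECTOR_LENGTH 256  /* maximum vector length stated in the text */

/* Number of vector operations needed for a loop of n iterations. */
long vector_chunks(long n)
{
    return (n + MAX_VECTOR_LENGTH - 1) / MAX_VECTOR_LENGTH;
}

/* Average vector length for a single loop of n iterations. */
double loop_average_vlen(long n)
{
    if (n <= 0)
        return 0.0;
    return (double)n / (double)vector_chunks(n);
}
```

Under this model a loop of 257 iterations needs two vector operations and averages only 128.5 elements, while a loop of 10000 iterations needs 40 and averages 250. This is why loop lengths just above a multiple of 256 hurt most when the loop is short.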
High bank times indicate a high number of bank conflicts. A bank conflict occurs when a memory bank is accessed before the bank busy time of the previous access is over. If this value is high, search for power-of-two leading dimensions in the code. Distances between memory accesses that are a multiple of a large power of two should be avoided. Try to have odd or prime distances; best is stride 1 (in units of words). When using lookup tables, high bank conflict times can also arise if the lookup always hits the same value (which might be a consequence of cache optimizations). Try to make copies of the tables and iterate over them.
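To illustrate why power-of-two strides are harmful, the following C sketch uses a deliberately simplified model in which word address w lives in bank w modulo the number of banks. The model, the bank count used in the examples (1024) and the function names are assumptions for illustration only and do not describe the real SX-ACE memory system.

```c
#include <stdlib.h>

/* Count how many distinct banks are touched by `accesses` consecutive
 * accesses with the given non-negative stride (in words).
 * Simplified model: word address w is stored in bank w % nbanks.
 * Returns -1 if the helper array cannot be allocated. */
int distinct_banks(long stride, int accesses, int nbanks)
{
    unsigned char *seen = calloc((size_t)nbanks, 1);
    int count = 0;

    if (seen == NULL)
        return -1;
    for (int i = 0; i < accesses; i++) {
        long bank = (i * stride) % nbanks;
        if (!seen[bank]) {
            seen[bank] = 1;
            count++;
        }
    }
    free(seen);
    return count;
}

/* Pad an even leading dimension to the next odd value. */
long padded_leading_dimension(long ld)
{
    return (ld % 2 == 0) ? ld + 1 : ld;
}
```

With 1024 hypothetical banks, 256 accesses at stride 1024 all hit the same bank and stride 512 alternates between two banks, whereas stride 1 or the padded stride 1025 touches 256 different banks. Padding the leading dimension of an array by one element is the usual way to obtain such an odd stride.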
Profiling on subroutine basis
Simple Unix profiling does not cause large overhead: link the code with the -p option, run it, and process the generated mon.out file on the frontend by calling sxprof <executable name>. This gives simple information about how much time was spent in which subroutine.
If you also want to know the number of calls of each subroutine, the subroutines have to be recompiled with -p as well. This causes more overhead.
Hardware information on subroutine basis
To get more detailed information about subroutines, use the ftrace feature of the compilers.
Compile all subroutines you want to examine, but at least the entry and the exit of your application, with the -ftrace option (Fortran and C/C++).
Run the application and keep the generated ftrace.out files. Use the sxftrace/sxftrace++ tool to extract the actual information from the binary file.
Example output:
    *--------------------------*
      FLOW TRACE ANALYSIS LIST
    *--------------------------*

    Execution : Tue Apr 13 15:33:11 2004
    Total CPU : 0:00'29"189

    PROG.UNIT FREQUENCY      EXCLUSIVE AVER.TIME     MOPS  MFLOPS   V.OP  AVER.  VECTOR I-CACHE O-CACHE    BANK
                        TIME[sec]( % )    [msec]                   RATIO  V.LEN    TIME    MISS    MISS    CONF

    chempo         1000  13.452( 46.1)    13.452   8832.9  4631.5  99.54  217.1  11.183  0.0057  0.0029  0.0000
    kraft          1100  13.410( 45.9)    12.191  11377.9  5013.2  99.81  201.0  13.287  0.0024  0.0342  0.0007
    zufall      4324320   2.181(  7.5)     0.001    148.7     5.9   0.00    0.0   0.000  0.0006  0.0007  0.0000
    lj2               1   0.059(  0.2)    59.033    328.0   173.3  85.06  237.9   0.003  0.0295  0.0170  0.0000
    korekt         1100   0.041(  0.1)     0.037   7792.8  3285.9  99.71  222.8   0.038  0.0015  0.0010  0.0000
    voraus         1100   0.035(  0.1)     0.032   8537.4  3976.9  99.76  242.1   0.032  0.0014  0.0010  0.0000
    transf         1100   0.005(  0.0)     0.005  11972.9  5976.8  99.84  216.0   0.005  0.0000  0.0000  0.0000
    skal           1100   0.004(  0.0)     0.004   3554.4  1170.5  98.09  240.0   0.003  0.0004  0.0004  0.0000
    geschw            1   0.002(  0.0)     1.726    161.4    30.3  12.72  230.0   0.000  0.0000  0.0000  0.0000
    gitter            1   0.000(  0.0)     0.071    333.2    43.3  21.77  235.6   0.000  0.0000  0.0000  0.0000
    init            100   0.000(  0.0)     0.000    249.3     0.1  43.79  235.6   0.000  0.0000  0.0000  0.0000
    psi22             4   0.000(  0.0)     0.004    110.8    15.1   0.00    0.0   0.000  0.0000  0.0000  0.0000
    phi22             4   0.000(  0.0)     0.002    112.9    15.3   0.00    0.0   0.000  0.0000  0.0000  0.0000
    cor22             2   0.000(  0.0)     0.004     86.7     8.6   0.00    0.0   0.000  0.0000  0.0000  0.0000
    cutoff            1   0.000(  0.0)     0.002     89.4     5.3   0.00    0.0   0.000  0.0000  0.0000  0.0000
    -----------------------------------------------------------------------------------------------------------
    total       4330934  29.189(100.0)     0.007   9333.6  4449.1  99.57  207.8  24.550  0.0415  0.0570  0.0007
For SX-ACE there is a new tool to view the ftrace output in a graphical user interface. Call
    /SX/opt/fv/fv
and open your ftrace output file with it. To be able to do this, you have to log in with SSH X forwarding enabled:
    ssh -X -C kabuki.hww.de -l username
Hardware information on loop/block basis
If you need more detailed information about the contents of a subroutine, regions within subroutines can be defined for use with ftrace.
Enclose the section to be examined in special ftrace function calls:
          CALL FTRACE_REGION_BEGIN("REGION_A")
          DO I=1,10000
            A(I)=I
          ENDDO
          CALL FTRACE_REGION_END("REGION_A")
or in C/C++:
    ftrace_region_begin("region_a");
    /* region */
    ftrace_region_end("region_a");
The name can be chosen freely, but it has to match in the begin and end calls. It is used as the identifier in the ftrace printout.
I/O
Use the environment variables F_FILEINF/C_FILEINF with the values YES/DETAIL to find slow I/O. DETAIL (Fortran only) gives information for every I/O operation; this produces a lot of output, so be careful.
Improving performance
General
Improving performance means improving vectorization. For a detailed discussion, see optimizing C and vectorising C for C programmers, and optimizing fortran and vectorizing fortran for Fortran programmers.
Some basic rules
- improve inner loop iteration length, longer loops are better than short loops
- avoid indirect addressing and pointers
- avoid power of 2 strides when accessing data in memory
- use restrict keyword for C pointers
- avoid function calls
- enable inlining (-pi auto option of compiler)
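Two of these rules can be illustrated with a short C sketch: the restrict qualifier tells the compiler that the arrays do not overlap, and a contiguous two-dimensional array can be processed in one long loop instead of nested loops with a short inner loop. The function names are hypothetical, and whether a given compiler actually vectorizes these loops has to be checked in its diagnostics.

```c
#include <stddef.h>

/* y = a*x + y. The restrict qualifiers promise that x and y do not
 * overlap, so the compiler does not have to assume a dependency
 * between the loads from x and the stores to y. */
void axpy(size_t n, double a, const double *restrict x, double *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Scale a contiguous rows x cols matrix. One loop over rows*cols
 * elements gives a long vector loop even if cols itself is small. */
void scale_matrix(size_t rows, size_t cols, double factor, double *restrict m)
{
    size_t total = rows * cols;

    for (size_t i = 0; i < total; i++)
        m[i] *= factor;
}
```

Both loops access memory with stride 1 and contain no function calls or indirect addressing, so they follow the other rules in the list as well.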
I/O
Because the underlying ScaTeFS uses striping, I/O should be done in large chunks to make efficient use of the resources.
Try to do I/O in multiples of 4 MB blocks if possible. For Fortran, setting the unit's I/O buffer to 4 MB improves I/O speed a lot. Use export F_SETBUF<unit>=4096 to set the buffer size of Fortran unit <unit> to 4 MB.
To assist in tuning Fortran I/O, it is possible to generate I/O statistics (Fortran only). Set export F_FILEINF=YES or export F_FILEINF=DETAIL to get information about I/O sizes, achieved transfer bandwidth and settings of the Fortran units.
In C, use setvbuf to increase the buffer size for the buffered I/O calls fwrite and fread. Call setvbuf after fopen but before the first actual I/O. You can also use the environment variable C_SETBUF: compile with sxcc -D_USE_SETBUF=1 to enable this feature, and set e.g. C_SETBUF=16M.
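A minimal C sketch of the setvbuf approach could look as follows. The function name, the buffer size macro and the error handling are illustrative choices and not part of any SX-specific interface; only fopen, setvbuf, fwrite and fclose from the standard C library are used.

```c
#include <stdio.h>
#include <stdlib.h>

#define IO_BUFFER_SIZE (4u * 1024u * 1024u)  /* 4 MB stdio buffer */

/* Write n bytes to path through a stream with a 4 MB buffer.
 * Returns 0 on success and -1 on failure. */
int write_buffered(const char *path, const void *data, size_t n)
{
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;

    /* setvbuf must be called after fopen and before the first I/O. */
    char *buf = malloc(IO_BUFFER_SIZE);
    if (buf == NULL || setvbuf(fp, buf, _IOFBF, IO_BUFFER_SIZE) != 0) {
        fclose(fp);
        free(buf);
        return -1;
    }

    int ok = (fwrite(data, 1, n, fp) == n);
    if (fclose(fp) != 0)  /* fclose flushes the remaining buffer */
        ok = 0;
    free(buf);            /* the buffer must stay valid until fclose */
    return ok ? 0 : -1;
}
```

The important points are the order of the calls and the lifetime of the buffer: it is handed to the stream before any data is written and released only after the stream has been closed.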
In C++, use pubsetbuf like in this example:
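    #include <fstream>
    using namespace std;

    int main(int argc, char **argv)
    {
        // 16 MB buffer; static so that it does not live on the stack
        static char buffer[4096*4096];
        fstream out_file;
        // set the buffer before opening the file: some standard
        // libraries ignore pubsetbuf on a stream that is already open
        out_file.rdbuf()->pubsetbuf(buffer, sizeof(buffer));
        out_file.open("huhu", ios::app|ios::out);
        out_file << "Hallo" << endl;
    }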