- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -
Advisor
Introduction
Intel® Advisor XE is a lightweight threading assistant for C, C++, C# and Fortran, developed by Intel (see the Intel® Advisor XE homepage: http://software.intel.com/en-us/intel-advisor-xe). It guides developers through threading design, automating the analyses required for a fast and correct implementation, and it helps developers to add parallelism to their existing C/C++ or Fortran programs.

Apart from thread parallelism, Advisor also supports analyzing MPI-parallel applications. The overall efficiency of an MPI-parallel loop/function can be measured by manually adding up the individual bandwidths and performances. For example, one runs an application on n MPI ranks, attaching Advisor to each rank. Assume that for a specific loop/function the bandwidths of rank 1 to rank n turn out to be X1, X2, ..., Xn GB/sec and the respective performances are Y1, Y2, ..., Yn GF/sec. Then the total bandwidth is (X1+X2+...+Xn) GB/sec and the total performance (Y1+Y2+...+Yn) GF/sec. Note that if there is a significant deviation among the values of Xi or Yi (i = 1, 2, ..., n), there is a load imbalance among the ranks, and one may add the "-trace-mpi" flag to the survey command and repeat the analysis. If the load balance among the MPI ranks is good, however, running Advisor on a single rank is enough.
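As a purely illustrative example (the numbers are made up): for n = 4 ranks reporting bandwidths of 25, 24, 26 and 25 GB/sec and performances of 5, 5, 6 and 4 GF/sec for a given loop, the total bandwidth is 25+24+26+25 = 100 GB/sec and the total performance is 5+5+6+4 = 20 GF/sec. Since the per-rank values lie close to each other, the load balance in this case would be considered good.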
In brief, one may use the Intel Advisor XE to:
- find the most time-consuming serial code regions in your program.
- analyse the Roofline plot, which highlights hot functions/loops and suggests necessary optimizations. With the help of the Roofline plot, one can confirm whether an application is memory-bound or compute-bound.
- analyse the memory access pattern to see whether memory is accessed with unit or non-unit stride.
- explore loop-carried dependencies hindering efficient vectorization.
- find out if there are data-type conversions hindering vectorization in the code.
- estimate the load imbalance and parallel efficiency in an MPI-parallel application.
- insert Intel Advisor XE annotations to mark possible parallel code regions.
- predict the approximate parallel performance characteristics of the proposed parallel code regions.
- check for data sharing problems that could prevent the application from working correctly when parallelized.
Slides for a general introduction to Advisor can be found at https://kb.hlrs.de/platforms/upload/General_presentation_Advisor_2017.pdf.

Slides on the memory access pattern analysis are available at https://kb.hlrs.de/platforms/upload/MAP_Analysis_Advisor_2017.pdf.

Slides on vectorization and dependencies can be found at https://kb.hlrs.de/platforms/upload/MAP_Analysis_Advisor_2017.pdf.

For the last three points of the list above, starting from "insert Intel Advisor XE annotations...", one may refer to the tutorial at https://kb.hlrs.de/platforms/upload/Tutorial_2013_Advisor.pdf.
Why Intel Advisor?
Before checking the parallel efficiency of an application, it is necessary to understand how the application behaves at the core and node level. For example,
- whether it is memory bound or compute bound
- how good the vectorization is
- what the memory access pattern looks like
- whether there are dependencies hindering vectorization
- where different loops/functions lie on the Roofline plot and whether there is room for improvement, etc.
Intel Advisor not only provides answers to all of the above questions, but also suggests solutions, for example which optimizations one needs to implement in order to improve the performance of the application.
How to use Intel Advisor?
First, compile your application with the debug flag "-g" in addition to the usual optimization flags, for example "-O2" (or "-O3") and "-march=core-avx2" on Hawk. Then, set up the environment for Advisor by loading the corresponding module.
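A minimal sketch of such a compile line, using the Intel compiler (the source file and binary names, myapp.c and myapp, are placeholders):

icc -g -O2 -march=core-avx2 -o myapp myapp.c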
To load the module, for example on Hawk,
module load advisor
On Vulcan
module load performance/advisor
If you have installed Intel oneAPI (https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html?operatingsystem=linux&distributions=webdownload&options=offline) on your laptop, then
source /opt/intel/oneapi/setvars.sh
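To check that the environment is set up correctly, one may, for instance, print the tool version (assuming the common --version switch is supported by the installed release):

advixe-cl --version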
Running Advisor on an OpenMP-parallel application
Set the number of OpenMP threads,
export OMP_NUM_THREADS=num_of_threads
and bind them as,
export OMP_PROC_BIND=spread
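Optionally, to verify that these settings are picked up at run time, one may also set the standard OpenMP environment variable OMP_DISPLAY_ENV, which makes the runtime print its internal control variables at program start:

export OMP_DISPLAY_ENV=true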
Afterwards, collect survey, tripcounts and flops as follows,
advixe-cl -collect survey -project-dir results_advisor ./a.out
advixe-cl -collect tripcounts -flop -project-dir results_advisor ./a.out
Here, survey is an internal tool which locates non-vectorized and poorly vectorized loops/functions and estimates the performance gain achievable with efficient vectorization. The tripcounts analysis introduces counters to measure the time spent in a particular loop/function, and the flag "-flop" enables the FLOP counter. Collecting survey, tripcounts and flops is necessary in order to see the Roofline plot.
Results can be visualized using the Advisor GUI,
advixe-gui results_advisor/e000/e000.advixeexp
Visualizing results on Hawk can be slow if the GUI contents are transferred over a thin DSL line. Therefore, one should try using VNC (https://kb.hlrs.de/platforms/index.php/Hawk_PrePostProcessing) for visualization. Alternatively, one may pack all the results into a read-only snapshot file as follows:
advixe-cl --snapshot --project-dir=results_advisor --pack --cache-sources --cache-binaries --search-dir src:=path_to_source_code --search-dir bin:=path_to_binary
The above command will create a file snapshot000.advixeexpz, which requires much less disk space than the original results_advisor directory and can therefore easily be copied to a local machine. The file can be viewed in the GUI as,
advixe-gui snapshot000.advixeexpz
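To copy the snapshot to a local machine, one might use, for instance (user name, host and remote path are placeholders):

scp username@hawk.hlrs.de:/path/to/run_dir/snapshot000.advixeexpz .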
Running Advisor on an MPI-parallel application
In order to get useful information from the Advisor analysis, it is necessary to pin the MPI ranks according to the machine hardware. In practice it is convenient to export the corresponding command-line options as environment variables. For example, suppose one uses OpenMPI and wants to distribute 8 MPI ranks uniformly over a Hawk node. This can be achieved with the following steps:
Export the number of MPI ranks,
export NUM_MPI=8
Then export the pinning options in a variable,
export MPIRUN_OPTIONS="--bind-to cpu-list:ordered --cpu-list 0,16,32,48,64,80,96,112 -report-bindings"
Export the survey and tripcounts commands as follows,
export ADVISOR_SURVEY="advixe-cl -collect survey -project-dir results_advisor"
export ADVISOR_TRIPCOUNTS="advixe-cl -collect tripcounts -flop -project-dir results_advisor"
Now collect survey and tripcounts as
mpirun -np ${NUM_MPI} ${MPIRUN_OPTIONS} ${ADVISOR_SURVEY} ./a.out
wait
mpirun -np ${NUM_MPI} ${MPIRUN_OPTIONS} ${ADVISOR_TRIPCOUNTS} ./a.out
Note that in the batch script, as shown above, it is recommended to put a "wait" command between two "mpirun" commands. The above commands will create Advisor reports for all ranks. If, in the above example, one would like to run Advisor on a single rank only, do the following:
export MPIRUN_OPTIONS_1="--bind-to cpu-list:ordered --cpu-list 0,16,32,48,64,80,96"
export MPIRUN_OPTIONS_2="--bind-to cpu-list:ordered --cpu-list 112"
Now collect survey and tripcounts as follows,
mpirun -np $((NUM_MPI-1)) ${MPIRUN_OPTIONS_1} ./a.out : -np 1 ${MPIRUN_OPTIONS_2} ${ADVISOR_SURVEY} ./a.out
wait
mpirun -np $((NUM_MPI-1)) ${MPIRUN_OPTIONS_1} ./a.out : -np 1 ${MPIRUN_OPTIONS_2} ${ADVISOR_TRIPCOUNTS} ./a.out
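Putting these steps together, a minimal sketch of a PBS batch script for the all-ranks case on Hawk might look as follows (job name, select statement and walltime are assumptions that have to be adapted to the actual setup):

#!/bin/bash
#PBS -N advisor_survey            # job name (placeholder)
#PBS -l select=1:node_type=rome:mpiprocs=8
#PBS -l walltime=00:30:00

cd "$PBS_O_WORKDIR"               # run from the submission directory

module load advisor               # set up the Advisor environment

export NUM_MPI=8
export MPIRUN_OPTIONS="--bind-to cpu-list:ordered --cpu-list 0,16,32,48,64,80,96,112 -report-bindings"
export ADVISOR_SURVEY="advixe-cl -collect survey -project-dir results_advisor"
export ADVISOR_TRIPCOUNTS="advixe-cl -collect tripcounts -flop -project-dir results_advisor"

# collect the survey first, then the trip counts and flops
mpirun -np ${NUM_MPI} ${MPIRUN_OPTIONS} ${ADVISOR_SURVEY} ./a.out
wait
mpirun -np ${NUM_MPI} ${MPIRUN_OPTIONS} ${ADVISOR_TRIPCOUNTS} ./a.out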
Running Advisor on an MPI+OpenMP-parallel application
The following example employs 32 MPI ranks distributed uniformly over both sockets of a Hawk node, with 2 OpenMP threads per MPI task.
module load mpt
export MPI_SHEPHERD=1
export MPI_DSM_CPULIST=0-127/2:allhosts
export OMP_NUM_THREADS=2
export OMP_PROC_BIND=close
export MPI_OPENMP_INTEROP=1
One can then run Advisor in the same way as described in the previous section.
mpirun -np $((num_of_mpi_tasks-1)) ./a.out : -np 1 advixe-cl -collect survey -project-dir results_advisor ./a.out
mpirun -np $((num_of_mpi_tasks-1)) ./a.out : -np 1 advixe-cl -collect tripcounts -flop -project-dir results_advisor ./a.out
Additional analysis - memory access pattern and dependencies
While visualizing the results, Advisor might suggest performing additional analyses such as memory access patterns (MAP) and dependencies. These can be collected, for example, as follows:
mpirun -np $((num_of_mpi_tasks-1)) ./a.out : -np 1 advixe-cl -collect map -project-dir results_advisor ./a.out
mpirun -np $((num_of_mpi_tasks-1)) ./a.out : -np 1 advixe-cl -collect dependencies -project-dir results_advisor ./a.out
Note that the above analyses are possible only after survey and tripcounts have been collected.
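For a pure OpenMP (non-MPI) run, the same additional analyses can be collected directly, analogous to the survey and tripcounts commands above. Note that Advisor may require the loops of interest to be selected first, e.g. via the "-mark-up-list" option or by ticking them in the GUI:

advixe-cl -collect map -project-dir results_advisor ./a.out
advixe-cl -collect dependencies -project-dir results_advisor ./a.out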