7.3.3. Compiler Profiling

Modern compilers, such as the GNU C++ compiler, can optimize based on statistics from earlier runs of the compiled program. The program is first compiled with options to gather statistics, then run to create those statistics, and finally recompiled using the data collected.

The latest versions of the GNU C++ compiler can use this data to:

Reorganize branches to favor the most commonly taken branch (option -fbranch-probabilities).
Optimize expressions based on knowledge of how they are used (option -fvpt).
Unroll loops where this would be favorable in most cases (option -funroll-loops).
Peel loops (i.e., completely unroll and remove them) where they would always be executed a fixed number of times (option -fpeel-loops).
Perform tail duplication where the resulting enlarged superblock would improve other transformations (option -ftracer).
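The compile, run and recompile cycle described above can be sketched outside the example Makefile as follows. This is only an illustration under stated assumptions: demo.cpp, the demo binary and the pgo_demo function are hypothetical stand-ins for a real model. It also uses the GCC umbrella option -fprofile-use for the final step, which turns on the individual options listed above together with other profile-driven optimizations, rather than naming each option separately.

```shell
#!/bin/sh
# Sketch of a three-step profile-guided build with g++.
# demo.cpp and the demo binary are hypothetical stand-ins for a real model.

pgo_demo() {
    dir=$1

    # A tiny program with a heavily biased branch, so there is something to profile.
    cat > "$dir/demo.cpp" <<'EOF'
#include <cstdio>
int main() {
    unsigned long taken = 0;
    for (unsigned long i = 0; i < 1000000UL; ++i) {
        if (i % 16 != 0) ++taken;   // true 15 times out of 16
    }
    std::printf("%lu\n", taken);
    return 0;
}
EOF

    (
        cd "$dir" || exit 1
        # Step 1: build an instrumented binary that records execution counts.
        g++ -O3 -fprofile-generate -o demo demo.cpp &&
        # Step 2: run a representative workload; this writes the profile data files.
        ./demo > first.txt &&
        # Step 3: rebuild with the same command line, now consuming the profile data.
        g++ -O3 -fprofile-use -o demo demo.cpp &&
        ./demo > second.txt
    )
}

# Run the demo only when g++ is installed.
if command -v g++ >/dev/null 2>&1; then
    workdir=$(mktemp -d)
    pgo_demo "$workdir" && echo "profiled build left in $workdir"
else
    echo "g++ not found; skipping demo" >&2
fi
```

The final build must use the same source and output names as the instrumented build, since the compiler locates the profile data by name.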

Some care is needed when using branch profiling. It can interact badly with other tools (for example ccache). Although it has been part of the GNU C++ compiler for some years, it must still be regarded as somewhat experimental.

Profiling is enabled with the example Makefile by using the verilate-fast target. Statistics are gathered by compiling the model with the -ftest-coverage and -fprofile-generate options and then running it. The options to be used in the subsequent optimizing recompile are passed as a macro, PROF_OPTS, for example:

make verilate-fast COMMAND_FILE=cf-optimized-8.scr NUM_RUNS=1000 \
     OPT="-O3" PROF_OPTS="-fbranch-probabilities"

Table 7.4 shows the impact of the different profiling options on the example design when compiled with the -Os option, the fastest option without profiling. The options are applied incrementally, in the order -fbranch-probabilities, -fvpt, -funroll-loops, -fpeel-loops and -ftracer.

Run Description	Build Time	Run Time	Performance
No profile optimization	26.23 s	12.24 s	96.41 kHz
Add `-fbranch-probabilities`	72.44 s	11.94 s	98.79 kHz
Add `-fvpt`	73.88 s	11.93 s	98.93 kHz
Add `-funroll-loops`	72.63 s	12.00 s	98.30 kHz
Add `-fpeel-loops`	72.65 s	12.02 s	98.17 kHz
Add `-ftracer`	72.65 s	11.99 s	98.42 kHz

Table 7.4. Comparison of model performance using -Os and profiling.
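Because the options are applied incrementally, each profiled row of the table corresponds to a PROF_OPTS value one option longer than the row before. The sketch below prints one make command per row instead of running it. The COMMAND_FILE and NUM_RUNS values are copied from the earlier example and are an assumption here, since the text does not say which values were used for the table, and cumulative_prof_opts is a hypothetical helper name.

```shell
#!/bin/sh
# Print one make command per profiled table row, accumulating profiling options.
# Commands are echoed, not executed. COMMAND_FILE and NUM_RUNS are assumed
# to match the earlier example.

cumulative_prof_opts() {
    opts=""
    for flag in -fbranch-probabilities -fvpt -funroll-loops -fpeel-loops -ftracer; do
        # Append the next option, separated by a space once opts is non-empty.
        opts="${opts:+$opts }$flag"
        printf 'make verilate-fast COMMAND_FILE=cf-optimized-8.scr NUM_RUNS=1000 OPT="-Os" PROF_OPTS="%s"\n' "$opts"
    done
}

cumulative_prof_opts
```

Changing OPT to "-O3" in the printf line gives the corresponding commands for an -O3 build.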

Model build times are all substantially longer because of the need for a statistics-gathering build and run. Performance improves slightly with the first two optimizations (-fbranch-probabilities and -fvpt), but then falls off. This is not surprising. The benefit of -Os is compact code. However, -funroll-loops, -fpeel-loops and -ftracer all tend to increase code size, reducing the caching benefit of using -Os.

The added effort of profile-directed compilation cannot be justified when using -Os.
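One rough way to put a number on this is to ask how many runs are needed before the run time saved repays the extra build time. The sketch below uses the first and last rows of Table 7.4. It assumes each run is the same length as the benchmark run measured in the table, and break_even_runs is a hypothetical helper name.

```shell
#!/bin/sh
# Runs needed before the run-time saving repays the extra build time.
# Arguments: build_base build_prof run_base run_prof (all in seconds).

break_even_runs() {
    awk -v bb="$1" -v bp="$2" -v rb="$3" -v rp="$4" \
        'BEGIN { printf "%.1f\n", (bp - bb) / (rb - rp) }'
}

# Table 7.4: no profiling versus all five options, both built with -Os.
break_even_runs 26.23 72.65 12.24 11.99
```

With these figures the extra 46.42 s of build time is repaid only after roughly 186 runs, each saving about 0.25 s.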

The same exercise is repeated, but this time to see the effect on a compile using option -O3. The results are in Table 7.5.

Run Description	Build Time	Run Time	Performance
No profile optimization	35.35 s	12.39 s	95.25 kHz
Add `-fbranch-probabilities`	83.51 s	9.36 s	126.10 kHz
Add `-fvpt`	83.28 s	9.34 s	126.39 kHz
Add `-funroll-loops`	83.78 s	9.34 s	126.39 kHz
Add `-fpeel-loops`	84.61 s	9.27 s	127.32 kHz
Add `-ftracer`	85.87 s	9.13 s	129.28 kHz

Table 7.5. Comparison of model performance using -O3 and profiling.

The results are dramatic: with all five options enabled, run time falls from 12.39 s to 9.13 s. The -fbranch-probabilities optimization gives the majority of the benefit, but cumulatively the other four options further increase performance. The results are significantly better than those obtained with -Os.
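The size of the improvement can be checked directly from the run times in the two tables. The sketch below divides the unprofiled run time by the run time with all five options applied. The speedup function is a hypothetical helper, not part of the example Makefile.

```shell
#!/bin/sh
# Speedup factor: baseline run time divided by profiled run time (seconds).

speedup() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.2f\n", base / new }'
}

speedup 12.24 11.99   # Table 7.4, -Os: first row versus last row
speedup 12.39 9.13    # Table 7.5, -O3: first row versus last row
```

This gives a factor of about 1.02 for -Os and about 1.36 for -O3, which matches the performance columns (98.42 kHz against 96.41 kHz, and 129.28 kHz against 95.25 kHz).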

The guideline advice is to use -O3 rather than -Os if you have the opportunity to profile your design.