Modern compilers, such as the GNU C++ compiler, can optimize based on statistics from earlier runs of the compiled program. The program is first compiled with options to gather statistics, then run to generate those statistics, and finally recompiled using them.
The latest versions of the GNU C++ compiler can use this data to:
- Reorganize branches to favor the most commonly taken branch (option -fbranch-probabilities).
- Optimize expressions based on knowledge of how they are used (option -fvpt).
- Unroll loops where this would be favorable in most cases (option -funroll-loops).
- Peel loops (i.e. completely unroll and remove them) where they would always be executed a fixed number of times (option -fpeel-loops).
- Perform tail duplication where the resulting enlarged superblock would improve other transformations (option -ftracer).
Some care is needed when using branch profiling. It can interact badly with other systems (for example, ccache). Although it has been part of the GNU C++ compiler for some years, it should still be regarded as somewhat experimental.
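The compile, run, recompile cycle described above can be sketched directly with g++. This is only an illustration under stated assumptions: the file names `model.cpp` and `model` are hypothetical, the `RUN` variable is an illustrative dry-run hook, and this is not the rule set used by the example Makefile, which is not shown in the text.

```shell
# Sketch of one profile-directed build cycle.
# Assumptions: a single hypothetical source file, built and run in the
# current directory. Set RUN=echo to print the commands instead of running them.
RUN=${RUN:-}

pgo_cycle() {
    src=$1          # e.g. model.cpp (hypothetical)
    exe=$2          # e.g. model (hypothetical)
    opt=${3:--O3}   # base optimization level, -O3 unless given

    # 1. Instrumented build: the program records execution counts as it runs.
    $RUN g++ "$opt" -ftest-coverage -fprofile-generate -o "$exe" "$src" || return 1
    # 2. Training run: writes the profile data files.
    $RUN "./$exe" || return 1
    # 3. Optimizing rebuild: reads the profile data written in step 2.
    $RUN g++ "$opt" -fbranch-probabilities -o "$exe" "$src" || return 1
}
```

Calling `RUN=echo; pgo_cycle model.cpp model` prints the three commands without executing them, which is a convenient way to check the sequence before committing to a long build.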
Profiling is enabled with the example Makefile by using the verilate-fast target. Statistics are gathered by compiling the model with the -ftest-coverage and -fprofile-generate options and then running it. The options to be used in the subsequent optimizing recompile are passed as a macro, PROF_OPTS, for example:
make verilate-fast COMMAND_FILE=cf-optimized-8.scr NUM_RUNS=1000 OPT="-O3" PROF_OPTS="-fbranch-probabilities"
Table 7.4 shows the impact of the different profiling options on the example design when compiled with the -Os option, the fastest option without profiling. The options are applied incrementally, in the order -fbranch-probabilities, -fvpt, -funroll-loops, -fpeel-loops and -ftracer.
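The incremental application of options could be scripted along the following lines. This is a hedged sketch: the helper names `cumulative_prof_opts` and `print_profile_runs` are hypothetical, and the make invocation simply reuses the variables from the example command above. The script only prints the commands; it does not run them.

```shell
# Print one cumulative PROF_OPTS string per line, in the order used in the text.
cumulative_prof_opts() {
    acc=""
    for opt in -fbranch-probabilities -fvpt -funroll-loops -fpeel-loops -ftracer; do
        acc="${acc:+$acc }$opt"   # append, with a separating space after the first
        printf '%s\n' "$acc"
    done
}

# Print (not run) the make command for each step; $1 is the base level, e.g. -Os.
print_profile_runs() {
    cumulative_prof_opts | while IFS= read -r opts; do
        printf 'make verilate-fast COMMAND_FILE=cf-optimized-8.scr NUM_RUNS=1000 OPT="%s" PROF_OPTS="%s"\n' "$1" "$opts"
    done
}
```

For example, `print_profile_runs -Os` lists five commands, the last of which passes all five options in PROF_OPTS.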
Run Description | Build Time | Run Time | Performance |
---|---|---|---|
No profile optimization | 26.23 s | 12.24 s | 96.41 kHz |
Add -fbranch-probabilities | 72.44 s | 11.94 s | 98.79 kHz |
Add -fvpt | 73.88 s | 11.93 s | 98.93 kHz |
Add -funroll-loops | 72.63 s | 12.00 s | 98.30 kHz |
Add -fpeel-loops | 72.65 s | 12.02 s | 98.17 kHz |
Add -ftracer | 72.65 s | 11.99 s | 98.42 kHz |
Table 7.4. Comparison of model performance using -Os and profiling.
Model build times are all substantially longer because of the need to do a statistics-gathering build and run. The results improve slightly for the first two optimizations (-fbranch-probabilities and -fvpt), but then fall off. This is not surprising. The benefit of -Os is compact code. However, -funroll-loops, -fpeel-loops and -ftracer all tend to increase code size, reducing the caching benefit of using -Os.
The added effort of profile-directed compilation cannot be justified when using -Os.
The same exercise is repeated, but this time to see the effect on a compile using option -O3. The results are in Table 7.5.
Run Description | Build Time | Run Time | Performance |
---|---|---|---|
No profile optimization | 35.35 s | 12.39 s | 95.25 kHz |
Add -fbranch-probabilities | 83.51 s | 9.36 s | 126.10 kHz |
Add -fvpt | 83.28 s | 9.34 s | 126.39 kHz |
Add -funroll-loops | 83.78 s | 9.34 s | 126.39 kHz |
Add -fpeel-loops | 84.61 s | 9.27 s | 127.32 kHz |
Add -ftracer | 85.87 s | 9.13 s | 129.28 kHz |
Table 7.5. Comparison of model performance using -O3 and profiling.
The results are dramatic: with all five options applied, performance rises from 95.25 kHz to 129.28 kHz. The -fbranch-probabilities optimization gives the majority of the benefit, but cumulatively the other four options further increase performance. The results are significantly better than using -Os.
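To put numbers on the comparison, the relative gains can be computed from the Performance columns of Tables 7.4 and 7.5. The helper `pct_gain` below is a hypothetical convenience function, not part of the example Makefile; it assumes awk is available.

```shell
# Percentage gain of NEW over BASE, printed to one decimal place.
pct_gain() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new / base - 1) * 100 }'
}

# Values taken from Tables 7.4 and 7.5 (kHz):
#   pct_gain 95.25 129.28   # -O3, all five options      -> 35.7
#   pct_gain 95.25 126.10   # -O3, -fbranch-probabilities -> 32.4
#   pct_gain 96.41 98.93    # -Os, best case (-fvpt step) -> 2.6
```

On these figures, profiling is worth roughly a third more performance with -O3, against under 3% with -Os, and -fbranch-probabilities alone accounts for most of the -O3 gain.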
The guideline advice is to use -O3 rather than -Os if you have the opportunity to profile your design.