The final stage is to look at the finished model for any modules which are dominating the compute time. These are candidates for replacement with equivalent modules optimized for cycle accurate modeling.
Common causes of performance bottlenecks are:
Built-in Self Test (BIST) code. Such code can be pervasive and bit-oriented, making it hard to model efficiently in a word-oriented environment like C++. BIST is not usually relevant to cycle accurate modeling. Substituting an equivalent model without BIST code can make a substantial performance improvement.
Behavioral memory models. Many memory models supplied by third parties are designed for behavioral accuracy during hardware verification. They will offer detailed and accurate intra-cycle performance modeling. Ports may well be buffered at the individual bit level.
Because memories are often so central to a design, this can be a serious performance bottleneck. The solution is to replace them with a simple Verilog model that is concerned only with cycle accuracy and omits any buffering.
Associative (content-addressable) memories. These are efficient to implement in hardware, but a nightmare in software. In this case substitution in C/C++ using a hash-table is usually the best approach.
Bit-oriented code. Hardware handles bits as efficiently as words, but the same is not true of word-oriented C/C++. Such code can occur in many scenarios, but a common one is legacy designs for operations such as multiplication. Early synthesis tools did not do a good job of such operations, so designers wrote the logic out in full at the bit level.
Such designs can be huge, but are easily replaced by a single line of Verilog using the high level operation.
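Two of the substitutions listed above can be sketched in C++. This is a minimal illustration, not code from the example design: the names CamModel, multiply_bitwise and multiply_word are hypothetical, the associative memory is assumed to hold unique keys, and connecting such a model to the Verilator-generated code is not shown.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical software stand-in for an associative memory. Hardware compares
// the search key against every entry in parallel; a hash table returns the
// same answer in expected constant time.
// Assumption: no two slots ever hold the same key.
class CamModel {
public:
    // Store `key` in `slot`, replacing any key previously held there.
    void write(std::uint32_t slot, std::uint64_t key) {
        auto old = slot_to_key_.find(slot);
        if (old != slot_to_key_.end()) {
            key_to_slot_.erase(old->second);  // forget the overwritten key
        }
        slot_to_key_[slot] = key;
        key_to_slot_[key] = slot;
    }

    // Return true and set `slot` if `key` is present; return false on a miss.
    bool lookup(std::uint64_t key, std::uint32_t& slot) const {
        auto it = key_to_slot_.find(key);
        if (it == key_to_slot_.end()) return false;
        slot = it->second;
        return true;
    }

private:
    std::unordered_map<std::uint32_t, std::uint64_t> slot_to_key_;
    std::unordered_map<std::uint64_t, std::uint32_t> key_to_slot_;
};

// Bit-oriented style: shift-and-add, one partial product per bit, in the
// manner of a legacy multiplier written out at the bit level.
std::uint32_t multiply_bitwise(std::uint16_t a, std::uint16_t b) {
    std::uint32_t product = 0;
    for (int bit = 0; bit < 16; ++bit) {
        if ((b >> bit) & 1u) {
            product += static_cast<std::uint32_t>(a) << bit;
        }
    }
    return product;
}

// Word-oriented style: the single high-level operation.
std::uint32_t multiply_word(std::uint16_t a, std::uint16_t b) {
    return static_cast<std::uint32_t>(a) * static_cast<std::uint32_t>(b);
}
```

Both multiply functions return the same result for every pair of 16-bit inputs, but the second does so in one word-wide operation. That is the same gain the single-line Verilog replacement gives the generated model.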
Verilator provides the -profile-cfuncs flag, which adds additional information to the compiled code, identifying the module to which it belongs. Compiling the model using the GNU C++ compiler's -g and -pg flags will instrument the compiled code for profiling. A subsequent run will generate a gmon.out file, which can be analyzed using the standard gprof command.
Verilator provides a utility, verilator_profcfunc, for post-processing the output of gprof. This breaks out the processing time by Verilog module name, rather than by the underlying C++ function.
When profiling, no optimization should be used. Although the GNU C++ compiler allows optimized profiling, it can be a source of confusion when parts of the code are optimized away. Unoptimized models are just as effective in highlighting any performance bottlenecks. With the example design, the following sequence of commands is appropriate:
    make verilate COMMAND_FILE=cf-optimized-8.scr \
        VFLAGS="-profile-cfuncs" NUM_RUNS=1000 OPT="-g -pg"
    gprof Vorpsoc_fpga_top > gprof.out
    verilator_profcfunc gprof.out vprof.out
The first part of the output file, vprof.out, identifies where the execution time went:
    Overall summary by type:
      % time  type
        4.62  C++
       17.45  Common code under Vorpsoc_fpga_top
       72.74  Verilog Blocks under Vorpsoc_fpga_top
        5.19  Unaccounted for/rounding error
The C++ code is code outside the Verilator model. In the example used here, that is the SystemC test bench. The common code under Vorpsoc_fpga_top is the common infrastructure code. The Verilog blocks are the C++ code directly derived from the Verilog. Finally, there is time that was spent outside profiled code. In this example, that will be largely due to the SystemC kernel, but since gprof is based on statistical sampling it also includes a small amount of time which cannot be accounted for.
There is nothing significant in this example. A warning sign to watch for is either the C++ or the unaccounted figure being very high. That could indicate a problem with a SystemC test bench, perhaps one with very wide ports.
The next section is a summary of the same information, grouping the common code and Verilog blocks:
    Overall summary by design:
      % time  design
        4.62  C++
       90.19  Vorpsoc_fpga_top
        5.19  Unaccounted for/rounding error
In both these cases, instantiation of multiple models would make for more entries.
The third section is the most important. It shows how the execution time was broken down by originating Verilog module:
    Overall summary by module:
      % time  module
        4.62  C++
       17.45  Vorpsoc_fpga_top common code
        0.11  dbg_crc8_d1
        0.00  dbg_register
        0.17  dbg_registers
        0.76  dbg_sync_clk1_clk2
        ...
This is provided in alphabetical order, but it is useful to cut out this section and sort it (using the command sort -n -r):
    17.45  Vorpsoc_fpga_top common code
     7.69  eth_wishbone_4
     5.17  or1200_du
     5.05  uart_regs_2
     4.62  C++
     3.77  tc_top
     3.41  eth_registers
     3.38  eth_crc
     3.07  dbg_top_3
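Where the ranking needs to be produced inside a C++ analysis tool rather than at the shell, the same reordering can be sketched as follows. This is an illustrative helper only: ModuleTime and sort_by_time are hypothetical names, and the input is assumed to be lines holding a percentage followed by a module name, as in the listing above.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

struct ModuleTime {
    double percent;      // value from the "% time" column
    std::string module;  // remainder of the line
};

// Parse lines of the form "<percent> <module name>" and return them ordered
// by descending percentage, similar in effect to `sort -n -r`.
std::vector<ModuleTime> sort_by_time(const std::string& section) {
    std::vector<ModuleTime> rows;
    std::istringstream lines(section);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        ModuleTime row;
        if (!(fields >> row.percent)) continue;  // skip headers, "..." and blanks
        std::getline(fields >> std::ws, row.module);
        rows.push_back(row);
    }
    std::stable_sort(rows.begin(), rows.end(),
                     [](const ModuleTime& a, const ModuleTime& b) {
                         return a.percent > b.percent;
                     });
    return rows;
}
```

Lines that do not start with a number, such as the column header, are skipped rather than treated as errors, so the module section can be passed in unedited.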
The common code can be ignored, since it is beyond the user's control. Look for any small modules that are using a lot of processing time.
Note | |
---|---|
The names used are those of the originating file, not the module
name, with any hyphen ("-") mapped to underscore ("_"). Thus the
first example here is the module |
There are no real big CPU hogs in this example. The largest user, eth_wishbone-4.v, uses over 7% of the execution time, but it is a large block (more than 2,500 lines of Verilog), so this is not unreasonable. The other modules at the top of the list are also all big blocks of code.
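The relationship between a source file name such as eth_wishbone-4.v and the profile entry eth_wishbone_4 can be sketched as a small helper. This is not the code used by verilator_profcfunc: profile_name is a hypothetical function, and dropping the file extension is an assumption inferred from the names in the listing.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: derive the name shown in the profile from a Verilog
// source file name. Assumes the extension is dropped and every '-' is
// replaced by '_'.
std::string profile_name(const std::string& file_name) {
    std::string name = file_name;
    const std::string::size_type dot = name.rfind('.');
    if (dot != std::string::npos) {
        name.erase(dot);  // remove ".v" (or any other extension)
    }
    std::replace(name.begin(), name.end(), '-', '_');
    return name;
}
```

Under these assumptions, profile_name("eth_wishbone-4.v") yields eth_wishbone_4, which makes it straightforward to map a line of the profile back to the file that needs attention.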
It is worth observing that in the current model, the Ethernet is tied off and unused. If there is no intention to develop the model to use the Ethernet, the instantiation could be removed altogether, perhaps improving performance by 20% or so. The same observation applies, to a lesser extent, to the other peripherals that are currently unused.