7.4. Profiling the Completed Model

The final stage is to look at the finished model for any modules which are dominating the compute time. These are candidates for replacement with equivalent modules optimized for cycle accurate modeling.

Common causes of performance bottlenecks are:

Built-in Self Test (BIST) code. Such code can be pervasive and bit-oriented, making it hard to model efficiently in a word-oriented environment like C++. BIST is not usually relevant to cycle accurate modeling. Substituting an equivalent model without BIST code can make a substantial performance improvement.
Behavioral memory models. Many memory models supplied by third parties are designed for behavioral accuracy during hardware verification. They will offer detailed and accurate intra-cycle performance modeling. Ports may well be buffered at the individual bit level.
Because memories are often so central to a design this can be a serious performance bottleneck. The solution is to replace them by a simple Verilog model which is concerned only with cycle accuracy and omits any buffering.
Associative (content-addressable) memories. These are efficient to implement in hardware, but very inefficient to model directly in software. In this case substitution in C/C++ using a hash table is usually the best approach.
Bit-oriented code. Hardware handles bits as efficiently as words, but the same is not true of word-oriented C/C++. Such code can occur in many scenarios, but a common one is legacy designs for operations such as multiplication. Early synthesis tools did not do a good job of such operations, so designs would spell out the functionality explicitly.
Such designs can be huge, but are easily replaced by a single line of Verilog using the high level operation.

Verilator provides the -profile-cfuncs flag, which adds additional information to the compiled code, identifying the module to which it belongs. Compiling the model using the GNU C++ compiler's -g and -pg flags will instrument the compiled code for profiling. A subsequent run will generate a gmon.out file, which can be analyzed using the standard gprof command.

Verilator provides a utility, verilator_profcfunc, for post-processing the output of gprof. This breaks out the processing time by Verilog module name, rather than by the underlying C++ function.

When profiling, no optimization should be used. Although the GNU C++ compiler allows optimized profiling, it can be a source of confusion when parts of the code are optimized away. Unoptimized models are just as effective in highlighting any performance bottlenecks. With the example design, the following sequence of commands is appropriate:

make verilate COMMAND_FILE=cf-optimized-8.scr \
     VFLAGS="-profile-cfuncs" NUM_RUNS=1000 OPT="-g -pg"
gprof Vorpsoc_fpga_top > gprof.out
verilator_profcfunc gprof.out > vprof.out

The first part of the output file, vprof.out, identifies where the execution time went:

Overall summary by type:
  % time  type
    4.62  C++
   17.45  Common code under Vorpsoc_fpga_top
   72.74  Verilog Blocks under Vorpsoc_fpga_top
    5.19  Unaccounted for/rounding error

The C++ code is code outside the Verilator model. In the example used here, that is the SystemC test bench. The common code under Vorpsoc_fpga_top is the common infrastructure code. The Verilog blocks are the C++ code directly derived from the Verilog. Finally, there is time that was spent outside profiled code. In this example, that will be largely due to the SystemC kernel, but since gprof is based on statistical sampling, it also includes a small amount of time which cannot be accounted for.

There is nothing significant in this example. A warning sign to watch for is either the C++ or the unaccounted figure being very high. That could indicate a problem with a SystemC test bench, perhaps one with very wide ports.
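The check for a high C++ or unaccounted figure can be scripted against the first summary in vprof.out. The following is a minimal sketch, not part of Verilator: the function name check_overhead and the default limit of 25% are hypothetical choices for illustration, and the script assumes that each summary ends at the first blank line.

```shell
# Warn when time outside the Verilog-derived code looks suspiciously high.
# Usage: check_overhead vprof.out [limit-percent]
# The default limit of 25 is an arbitrary illustrative value.
check_overhead() {
    awk -v limit="${2:-25}" '
        # Start reading at the first summary heading.
        /^Overall summary by type:/ { in_section = 1; next }
        # Assumption: a blank line ends the summary.
        in_section && /^[[:space:]]*$/ { exit }
        # Data lines look like "    4.62  C++": field 1 is the percentage.
        in_section && ($2 == "C++" || $2 == "Unaccounted") && $1 + 0 > limit + 0 {
            pct = $1
            # Strip the leading percentage, leaving only the category name.
            sub(/^[[:space:]]*[0-9.]+[[:space:]]+/, "")
            printf "warning: %s is %s%% of run time\n", $0, pct
        }
    ' "$1"
}
```

Calling check_overhead vprof.out prints nothing when both figures are at or below the limit, and one warning line for each category that exceeds it.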

The next section is a summary of the same information, grouping the common code and Verilog blocks:

Overall summary by design:
  % time  design
    4.62  C++
   90.19  Vorpsoc_fpga_top
    5.19  Unaccounted for/rounding error

In both of these summaries, instantiating multiple models would produce more entries.

The third section is the most important. It shows how the execution time was broken down by originating Verilog module:

Overall summary by module:
  % time  module
    4.62  C++
   17.45  Vorpsoc_fpga_top common code
    0.11  dbg_crc8_d1
    0.00  dbg_register
    0.17  dbg_registers
    0.76  dbg_sync_clk1_clk2
    ...

This is listed in alphabetical order, but it is more useful to cut out this section and sort it by time, largest first (using the command sort -n -r):

   17.45  Vorpsoc_fpga_top common code
    7.69  eth_wishbone_4
    5.17  or1200_du
    5.05  uart_regs_2
    4.62  C++
    3.77  tc_top
    3.41  eth_registers
    3.38  eth_crc
    3.07  dbg_top_3
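The cut-and-sort step can also be scripted. The sketch below is one possible approach rather than a Verilator facility: the function name sort_module_summary is hypothetical, and the script assumes that the module summary ends at the first blank line of vprof.out.

```shell
# Print the "by module" summary sorted by percentage, largest first.
# Usage: sort_module_summary vprof.out
sort_module_summary() {
    awk '
        # Start reading at the module summary heading.
        /^Overall summary by module:/ { in_section = 1; next }
        # Assumption: a blank line ends the summary.
        in_section && /^[[:space:]]*$/ { exit }
        # Skip the "% time  module" column header.
        in_section && /^[[:space:]]*% time/ { next }
        in_section { print }
    ' "$1" | sort -n -r
}
```

Piping the result through head, for example sort_module_summary vprof.out | head -n 10, limits the listing to the heaviest modules.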

The common code can be ignored, since it is beyond the modeler's control. Look for any small modules that are using a disproportionate share of the processing time.

	Note
	The names used are those of the originating file, not the module name, with any hyphen ("-") mapped to underscore ("_"). Thus the first example here is the module `eth_wishbone`, but in the file `eth_wishbone-4.v`.

There are no really big CPU hogs in this example. The largest user, eth_wishbone-4.v, uses over 7% of the execution time, but it is a large block (more than 2,500 lines of Verilog), so this is not unreasonable. The other modules at the top of the list are also all big blocks of code.

It is worth observing that in the current model, the Ethernet is tied off and unused. If there is no intention to develop the model to use the Ethernet, the instantiation could be removed altogether, perhaps improving performance by 20% or so. The same observation applies, to a lesser extent, to the other currently unused peripherals.
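An estimate of this kind can be checked by adding up the entries for one peripheral in the module summary. The three Ethernet entries visible in the sorted list above (eth_wishbone_4, eth_registers and eth_crc) already total 14.48%, and any Ethernet modules further down the list add to that. The following sketch sums every entry whose name starts with a given prefix. The function name prefix_share is hypothetical, and the script assumes that the module summary ends at the first blank line of vprof.out.

```shell
# Sum the percentages of all modules whose name starts with a prefix.
# Usage: prefix_share vprof.out eth_
prefix_share() {
    awk -v prefix="$2" '
        # Start reading at the module summary heading.
        /^Overall summary by module:/ { in_section = 1; next }
        # Assumption: a blank line ends the summary.
        in_section && /^[[:space:]]*$/ { exit }
        # Field 1 is the percentage, field 2 the file-derived module name.
        in_section && index($2, prefix) == 1 { total += $1 }
        END { printf "%.2f\n", total }
    ' "$1"
}
```

The result is only an upper bound on the saving, since removing a block may also change the amount of common code executed.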