Brute Force Superoptimization
The concept of superoptimization is very simple. To find the smallest translation of your source code program, start with all sequences of one machine instruction and see if any behave identically to your original program. If not, try all sequences of two instructions, then all sequences of three instructions, and so on until you find a sequence which behaves identically to your original program. Because sequences are tried in order of increasing length, the first match found is guaranteed to be a smallest translation of your original program.
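To make the idea concrete, here is a minimal sketch of the search loop in Python for a toy register machine. The tiny ISA, the reference function and the test inputs are all invented for the illustration; a real superoptimizer enumerates the target's actual instruction set and verifies matches far more carefully, as described below.

from itertools import product

# A toy register machine: its state is (r0, r1), r0 holds the argument on
# entry and r1 must hold the result on exit.  The mini ISA, the reference
# function and the test inputs are all invented purely for illustration.
ISA = {
    "neg r0":      lambda r0, r1: (-r0, r1),      # r0 = -r0
    "move r0, r1": lambda r0, r1: (r0, r0),       # r1 = r0
    "add r0, r1":  lambda r0, r1: (r0, r1 + r0),  # r1 = r1 + r0
    "clr r1":      lambda r0, r1: (r0, 0),        # r1 = 0
}

def reference(n):                # the program we want to regenerate
    return -n

TESTS = [-7, -1, 0, 1, 42]

def run(seq, n):
    r0, r1 = n, 0
    for insn in seq:
        r0, r1 = ISA[insn](r0, r1)
    return r1

def superoptimize(max_len=4):
    for length in range(1, max_len + 1):             # shortest sequences first
        for seq in product(ISA, repeat=length):      # every sequence of this length
            if all(run(seq, n) == reference(n) for n in TESTS):
                return seq                           # first hit is a shortest match
    return None

print(superoptimize())           # finds ('neg r0', 'move r0, r1')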
An example from the original superoptimization paper is the Motorola 68000 compilation of the following C function:
int sign (int n)
{
  if (n > 0)
    return 1;
  else if (n < 0)
    return -1;
  else
    return 0;
}
A conventional compiler will generate 8 instructions, including a comparison and four branches (two of them conditional).
cmp.l #0, d0
ble L1
move.l #1, d1
bra L3
L1:
bge L2
move.l #-1, d1
bra L3
L2:
move.l #0, d1
L3:
However, superoptimization yields the following four-instruction sequence.
add.l d0, d0    ; double n: the sign bit moves into the carry and extend (X) flags
subx.l d1, d1   ; d1 = d1 - d1 - X, so d1 becomes -1 if n was negative, 0 otherwise
negx.l d0       ; d0 = 0 - d0 - X, which sets X unless the result is zero
addx.l d1, d1   ; d1 = d1 + d1 + X, giving 1, -1 or 0 as required
Use of the carry and extend (X) flags to capture the sign information means the result can be computed without any comparison or branch.
The big problem is that superoptimization suffers from combinatorial explosion. Even the simplest architectures have tens of instructions, and most have hundreds. Those instructions may well use any one of a large number of registers, or may include immediate constants or references to memory. Heuristics are used to prune this search space. Examples include:
- not using all the registers (they are unlikely to be needed in short sequences);
- recognizing commutativity in instructions: for example adding a to b is the same as adding b to a;
- not considering variants which are name changes (for example changing r1 to r2 throughout);
- not considering redundant instructions, such as adding zero or multiplying by one;
- only considering a restricted range of constant values; and
- not changing memory address references.
Care is needed when defining these simplifications to take account of architecture-specific issues, such as which flags are set: adding zero may still be worthwhile, for example, if it sets flags that a later instruction uses. Some of these heuristics mean that an optimal sequence could be missed, although in practice with good heuristics this is very unlikely.
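As a flavour of what such pruning looks like in code, the sketch below (continuing in Python, with an invented tuple encoding of instructions) canonicalizes each candidate so that commutative operand swaps and wholesale register renamings are only ever tested once.

COMMUTATIVE = {"add", "mul", "and", "or", "xor"}

def canonicalize(seq):
    # Instructions are (op, dst, src1, src2) tuples -- a made-up encoding.
    # 1. Fix the operand order of commutative instructions.
    fixed = []
    for op, dst, a, b in seq:
        if op in COMMUTATIVE and b < a:
            a, b = b, a
        fixed.append((op, dst, a, b))
    # 2. Rename registers in order of first appearance, so that "r1 changed
    #    to r2 throughout" collapses onto the original sequence.  Registers
    #    fixed by the calling convention would be excluded from this in practice.
    names = {}
    def rename(r):
        return names.setdefault(r, "v%d" % len(names))
    return tuple((op, rename(dst), rename(a), rename(b)) for op, dst, a, b in fixed)

seen = set()
def worth_testing(seq):
    # Skip any sequence whose canonical form has already been enumerated.
    key = canonicalize(seq)
    if key in seen:
        return False
    seen.add(key)
    return True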
Once the search space is defined, the reach of superoptimization depends on how fast each sequence can be checked. The important thing is to eliminate the vast majority of useless sequences very quickly, typically by fast simulation with a small set of input values. Once a candidate sequence is found (and there will be very few), significant resources can be deployed to verify that it really is a correct sequence, either by exhaustive simulation or by formal proof of equivalence.
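In outline, a plausible two-stage check looks like the sketch below; execute and reference stand for a simulator and the original program, and the exhaustive sweep over 16-bit inputs is merely a stand-in for whatever full verification is used.

QUICK_TESTS = [0, 1, -1, 7, -7, 2**31 - 1, -2**31]   # a few cheap corner cases

def quick_reject(candidate, execute, reference):
    # Fast simulation on a handful of inputs: one wrong answer is enough
    # to discard the candidate, which is the fate of almost all of them.
    return any(execute(candidate, n) != reference(n) for n in QUICK_TESTS)

def verify(candidate, execute, reference):
    # Only the rare survivors reach this point, so it can afford to be slow:
    # an exhaustive sweep of 16-bit inputs stands in here for exhaustive
    # simulation or a formal proof of equivalence.
    return all(execute(candidate, n) == reference(n)
               for n in range(-2**15, 2**15))

def check(candidate, execute, reference):
    return (not quick_reject(candidate, execute, reference)
            and verify(candidate, execute, reference))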
The limit to brute force superoptimization is the length of the instruction sequence generated, not the size of the original program. Even with these heuristics, modern computers struggle to generate sequences of more than 8 instructions.
One way round this is to guide the search. In the technique of stochastic superoptimization, a simple heuristic is used to bias the enumeration of the sequences being considered. There is a considerably higher probability of missing optimal sequences, but much longer sequences can be explored. An example from Stanford University used the number of correct bits in the answer computed so far as the heuristic guiding the enumeration, and applied this to the Montgomery modular multiplication kernel from OpenSSL. Compiled for x86-64 with GCC -O3, this kernel takes 27 instructions, including one conditional branch. The superoptimized sequence is just 11 instructions with no branch, and runs 60% faster.
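The flavour of the approach is captured by its cost function. The sketch below scores a candidate by how many output bits it gets wrong on a set of test inputs, and accepts or rejects random mutations in a Metropolis-style loop; it is a toy rendering of the idea under our own assumptions, not the Stanford tool itself.

import math, random

def wrong_bits(candidate_out, reference_out, width=32):
    # Hamming distance between the candidate's answer and the right answer.
    mask = (1 << width) - 1
    return bin((candidate_out ^ reference_out) & mask).count("1")

def cost(candidate, execute, reference, tests):
    # Total number of incorrect output bits over all test inputs;
    # zero means the candidate agrees with the reference on every test.
    return sum(wrong_bits(execute(candidate, n), reference(n)) for n in tests)

def stochastic_search(seed, mutate, execute, reference, tests,
                      steps=100_000, beta=1.0):
    current, current_cost = seed, cost(seed, execute, reference, tests)
    best, best_cost = current, current_cost
    for _ in range(steps):
        proposal = mutate(current)                  # a random local change
        c = cost(proposal, execute, reference, tests)
        # Metropolis acceptance: always keep improvements, occasionally keep
        # regressions so the search can escape local minima.
        if c <= current_cost or random.random() < math.exp(-beta * (c - current_cost)):
            current, current_cost = proposal, c
            if c < best_cost:
                best, best_cost = proposal, c
    return best, best_cost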
There is added complexity when optimizing for speed or energy efficiency, because instruction sequences must then be enumerated in ascending order of cost rather than length. For many modern architectures, pipelining means there is no exact model of the performance of individual instructions, while energy modeling of individual instructions is still very inaccurate. We can enumerate using the best models to hand, but even then we usually need to generate multiple candidate sequences, not just the cheapest one, and measure their speed or energy efficiency directly. This is still a relatively small set of sequences to test, so not an onerous task.
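For cost-directed search, the enumeration itself changes shape: instead of stepping up one instruction at a time, candidates come off a priority queue ordered by modelled cost, and the cheapest few that pass the correctness check are kept for measurement. The per-instruction costs in this sketch are invented purely for illustration.

import heapq

MODELLED_COST = {"add": 1.0, "shift": 1.0, "mul": 3.0, "load": 4.0}

def enumerate_by_cost(isa_costs, max_cost):
    # Uniform-cost search: sequences come out in ascending order of their
    # modelled cost, not of their length.
    heap = [(0.0, ())]
    while heap:
        total, seq = heapq.heappop(heap)
        if seq:
            yield total, seq
        for insn, c in isa_costs.items():
            if total + c <= max_cost:
                heapq.heappush(heap, (total + c, seq + (insn,)))

def shortlist(is_correct, k=5, max_cost=10.0):
    # Keep the k cheapest sequences that pass the correctness check, then
    # measure their real speed or energy use rather than trusting the model.
    found = []
    for total, seq in enumerate_by_cost(MODELLED_COST, max_cost):
        if is_correct(seq):
            found.append((total, seq))
            if len(found) == k:
                break
    return found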
There are still some constructs which remain difficult for superoptimizers of all types.
- Superoptimizing floating point is difficult, particularly where strict IEEE 754 compliance is required.
- Code which accesses memory or which uses constant values is a challenge because of the potentially very large search space involved.
- Generating sequences with loops is almost impossible because of the challenge of proving termination under all circumstances.
Embecosm continues to investigate these areas, to find practical solutions.
SMT Solver Based Superoptimization
Satisfiability Modulo Theories (SMT) solvers are a class of constraint solvers which can be used for superoptimization, and which may be more efficient than both brute force and stochastic superoptimization. They are most simply used for binary-to-binary optimization.
A formal description of the instruction set is provided to the solver. The code of interest is compiled to binary using a standard compiler. The SMT solver is then used to prove or disprove statements of the form “There is no sequence of n or fewer instructions equivalent to this sequence of compiled instructions”, where n is less than the length of the compiled code. If the statement is proved, then we know any superoptimal sequence must be longer than n instructions. If it is disproved, the solver provides a counterexample as evidence, which is itself a shorter code sequence. We can repeat this, adjusting n, to find the shortest sequence.
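Scaled right down, the shape of the query can be seen with the Z3 solver's Python bindings. The snippet asks whether a single shift by an unknown constant can replace the two-instruction sequence "t = x + x; t = t + t" for every 32-bit input; the framing is our own illustration rather than the formulation used by any particular superoptimizer.

from z3 import BitVec, ForAll, Solver, sat

x = BitVec("x", 32)       # universally quantified: every possible input
c = BitVec("c", 32)       # the unknown constant in the candidate instruction

spec = (x + x) + (x + x)  # result of the compiled two-instruction sequence
candidate = x << c        # a single shift instruction with an unknown amount

s = Solver()
# "Does there exist a c such that, for all x, the candidate matches the spec?"
s.add(ForAll([x], candidate == spec))

if s.check() == sat:
    print("shorter equivalent found: shift left by", s.model()[c])
else:
    print("no single shift is equivalent")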
The advantage of constraint-based approaches is that they have the potential to explore loops, floating point and memory accesses effectively, something that is beyond current brute force superoptimizers.
Applicability
With support from InnovateUK, Embecosm carried out an analysis of the state of superoptimization research and its commercial applicability. While general superoptimization of all code is infeasible, we identified two use cases where superoptimization is practical in a commercial context.
- Superoptimization of heavily used code blocks. This could be the critical central loop of a component, or more generically heavily used library routines, such as those in the libgcc or CompilerRT emulation libraries or in the standard C library.
- Target specific peephole optimization. Superoptimization has long been used to identify generic peephole sequences. However it is feasible to apply this to specific architectures, particularly since the brute force algorithms are amenable to searching for many peephole sequences simultaneously.
Embecosm has the experience and skill set to apply superoptimization to your tool chain and code base.
GSO 2.0
The most widely used superoptimizer currently available is the GNU superoptimizer (GSO) developed by Torbjörn Granlund, which was used to identify the generic peephole transformations used in GCC. However this work is now over 25 years old. It is still a blisteringly fast tool for enumerating the shortest translations of fixed code, and we use it regularly for initial explorations of superoptimization. However it is single threaded (despite superoptimization being an embarrassingly parallel problem), and lacks the flexibility to be used for other types of superoptimization, including stochastic superoptimization and superoptimization for speed or energy efficiency.
As part of the TSERO project (see case study below), Embecosm implemented a generic framework for superoptimization, GSO 2.0. This allows a wide range of enumeration and evaluation strategies to be employed and is designed for parallel execution across very large numbers of nodes. GSO 2.0 is still under development, and allows us to offer a much wider range of superoptimization services.