Brute Force Superoptimization
The concept of superoptimization is very simple. To find the smallest translation of your source code program, start with all sequences of one machine instruction and see if any behave identically to your original program. If not, try all sequences of two instructions, then all sequences of three instructions, and so on until you find a sequence which behaves identically to your original program. Because sequences are tried in order of increasing length, the first match found is guaranteed to be a smallest translation of your original program.
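To make the idea concrete, here is a minimal sketch of the search loop in Python for a toy register machine. The tiny ISA, the reference function and the test inputs are all invented for the illustration; a real superoptimizer enumerates the target's actual instruction set and verifies matches far more carefully, as described below.

from itertools import product

# A toy register machine: its state is (r0, r1), r0 holds the argument on
# entry and r1 must hold the result on exit.  The mini ISA, the reference
# function and the test inputs are all invented purely for illustration.
ISA = {
    "neg r0":      lambda r0, r1: (-r0, r1),      # r0 = -r0
    "move r0, r1": lambda r0, r1: (r0, r0),       # r1 = r0
    "add r0, r1":  lambda r0, r1: (r0, r1 + r0),  # r1 = r1 + r0
    "clr r1":      lambda r0, r1: (r0, 0),        # r1 = 0
}

def reference(n):                # the program we want to regenerate
    return -n

TESTS = [-7, -1, 0, 1, 42]

def run(seq, n):
    r0, r1 = n, 0
    for insn in seq:
        r0, r1 = ISA[insn](r0, r1)
    return r1

def superoptimize(max_len=4):
    for length in range(1, max_len + 1):             # shortest sequences first
        for seq in product(ISA, repeat=length):      # every sequence of this length
            if all(run(seq, n) == reference(n) for n in TESTS):
                return seq                           # first hit is a shortest match
    return None

print(superoptimize())           # finds ('neg r0', 'move r0, r1')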
An example from the original superoptimization paper is the Motorola 68000 compilation of the following C function:
int sign (int n)
{
  if (n > 0)
    return 1;
  else if (n < 0)
    return -1;
  else
    return 0;
}
A conventional compiler will generate 8 instructions, including a comparison and four branches (two of them conditional).
cmp.l #0, d0
ble L1
move.l #1, d1
bra L3
L1:
bge L2
move.l #-1, d1
bra L3
L2:
move.l #0, d1
L3:
However, superoptimization yields the following four-instruction sequence.
add.l d0, d0    ; double n: the sign bit moves into the carry and extend (X) flags
subx.l d1, d1   ; d1 = d1 - d1 - X, so d1 becomes -1 if n was negative, 0 otherwise
negx.l d0       ; d0 = 0 - d0 - X, which sets X unless the result is zero
addx.l d1, d1   ; d1 = d1 + d1 + X, giving 1, -1 or 0 as required
Use of the carry and extend (X) flags to capture the sign information means the result can be computed without any comparison or branch.
The big problem is that superoptimization suffers from combinatorial explosion. Even the simplest architectures have tens of instructions, and most have hundreds. Those instructions may well use any one of a large number of registers, or may include immediate constants or references to memory. Heuristics are used to prune this search space. Examples include:
- not using all the registers (they are unlikely to be needed in short sequences);
- recognizing commutativity in instructions: for example adding a to b is the same as adding b to a;
- not considering variants which are name changes (for example changing r1 to r2 throughout);
- not considering redundant instructions, such as adding zero or multiplying by one;
- only considering a restricted range of constant values; and
- not changing memory address references.
Care is needed when defining these simplifications to take account of architecture-specific issues, such as which flags are set: adding zero may still be worthwhile, for example, if it sets flags that a later instruction uses. Some of these heuristics mean that an optimal sequence could be missed, although in practice with good heuristics this is very unlikely.
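As a flavour of what such pruning looks like in code, the sketch below (continuing in Python, with an invented tuple encoding of instructions) canonicalizes each candidate so that commutative operand swaps and wholesale register renamings are only ever tested once.

COMMUTATIVE = {"add", "mul", "and", "or", "xor"}

def canonicalize(seq):
    # Instructions are (op, dst, src1, src2) tuples -- a made-up encoding.
    # 1. Fix the operand order of commutative instructions.
    fixed = []
    for op, dst, a, b in seq:
        if op in COMMUTATIVE and b < a:
            a, b = b, a
        fixed.append((op, dst, a, b))
    # 2. Rename registers in order of first appearance, so that "r1 changed
    #    to r2 throughout" collapses onto the original sequence.  Registers
    #    fixed by the calling convention would be excluded from this in practice.
    names = {}
    def rename(r):
        return names.setdefault(r, "v%d" % len(names))
    return tuple((op, rename(dst), rename(a), rename(b)) for op, dst, a, b in fixed)

seen = set()
def worth_testing(seq):
    # Skip any sequence whose canonical form has already been enumerated.
    key = canonicalize(seq)
    if key in seen:
        return False
    seen.add(key)
    return True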
Once the search space is defined, the reach of superoptimization depends on how fast each sequence can be checked. The important thing is to eliminate the vast majority of useless sequences very quickly, typically by fast simulation with a small set of input values. Once a candidate sequence is found (and there will be very few), significant resources can be deployed to verify that it really is a correct sequence, either by exhaustive simulation or by formal proof of equivalence.
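In outline, a plausible two-stage check looks like the sketch below; execute and reference stand for a simulator and the original program, and the exhaustive sweep over 16-bit inputs is merely a stand-in for whatever full verification is used.

QUICK_TESTS = [0, 1, -1, 7, -7, 2**31 - 1, -2**31]   # a few cheap corner cases

def quick_reject(candidate, execute, reference):
    # Fast simulation on a handful of inputs: one wrong answer is enough
    # to discard the candidate, which is the fate of almost all of them.
    return any(execute(candidate, n) != reference(n) for n in QUICK_TESTS)

def verify(candidate, execute, reference):
    # Only the rare survivors reach this point, so it can afford to be slow:
    # an exhaustive sweep of 16-bit inputs stands in here for exhaustive
    # simulation or a formal proof of equivalence.
    return all(execute(candidate, n) == reference(n)
               for n in range(-2**15, 2**15))

def check(candidate, execute, reference):
    return (not quick_reject(candidate, execute, reference)
            and verify(candidate, execute, reference))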
The limit to brute force superoptimization is the length of the instruction sequence generated, not the size of the original program. Even with these heuristics, modern computers struggle to generate sequences of more than 8 instructions.
One way round this is to guide the search. In the technique of stochastic superoptimization, a simple heuristic is used to bias the enumeration of the sequences being considered. There is a considerably higher probability of missing optimal sequences, but much longer sequences can be explored. An example from Stanford University used the number of correct bits in the answer computed so far as the heuristic guiding the enumeration, and applied this to the Montgomery modular multiplication kernel from OpenSSL. Compiled for x86-64 with GCC -O3, this kernel takes 27 instructions, including one conditional branch. The superoptimized sequence is just 11 instructions with no branch, and runs 60% faster.
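The flavour of the approach is captured by its cost function. The sketch below scores a candidate by how many output bits it gets wrong on a set of test inputs, and accepts or rejects random mutations in a Metropolis-style loop; it is a toy rendering of the idea under our own assumptions, not the Stanford tool itself.

import math, random

def wrong_bits(candidate_out, reference_out, width=32):
    # Hamming distance between the candidate's answer and the right answer.
    mask = (1 << width) - 1
    return bin((candidate_out ^ reference_out) & mask).count("1")

def cost(candidate, execute, reference, tests):
    # Total number of incorrect output bits over all test inputs;
    # zero means the candidate agrees with the reference on every test.
    return sum(wrong_bits(execute(candidate, n), reference(n)) for n in tests)

def stochastic_search(seed, mutate, execute, reference, tests,
                      steps=100_000, beta=1.0):
    current, current_cost = seed, cost(seed, execute, reference, tests)
    best, best_cost = current, current_cost
    for _ in range(steps):
        proposal = mutate(current)                  # a random local change
        c = cost(proposal, execute, reference, tests)
        # Metropolis acceptance: always keep improvements, occasionally keep
        # regressions so the search can escape local minima.
        if c <= current_cost or random.random() < math.exp(-beta * (c - current_cost)):
            current, current_cost = proposal, c
            if c < best_cost:
                best, best_cost = proposal, c
    return best, best_cost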
There is added complexity when optimizing for speed or energy efficiency, because instruction sequences must then be enumerated in ascending order of cost rather than length. For many modern architectures, pipelining means there is no exact model of the performance of individual instructions, while energy modeling of individual instructions is still very inaccurate. We can enumerate using the best models to hand, but even then we usually need to generate multiple candidate sequences, not just the cheapest one, and measure their speed or energy efficiency directly. This is still a relatively small set of sequences to test, so not an onerous task.
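For cost-directed search, the enumeration itself changes shape: instead of stepping up one instruction at a time, candidates come off a priority queue ordered by modelled cost, and the cheapest few that pass the correctness check are kept for measurement. The per-instruction costs in this sketch are invented purely for illustration.

import heapq

MODELLED_COST = {"add": 1.0, "shift": 1.0, "mul": 3.0, "load": 4.0}

def enumerate_by_cost(isa_costs, max_cost):
    # Uniform-cost search: sequences come out in ascending order of their
    # modelled cost, not of their length.
    heap = [(0.0, ())]
    while heap:
        total, seq = heapq.heappop(heap)
        if seq:
            yield total, seq
        for insn, c in isa_costs.items():
            if total + c <= max_cost:
                heapq.heappush(heap, (total + c, seq + (insn,)))

def shortlist(is_correct, k=5, max_cost=10.0):
    # Keep the k cheapest sequences that pass the correctness check, then
    # measure their real speed or energy use rather than trusting the model.
    found = []
    for total, seq in enumerate_by_cost(MODELLED_COST, max_cost):
        if is_correct(seq):
            found.append((total, seq))
            if len(found) == k:
                break
    return found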
There are still some constructs which remain difficult for superoptimizers of all types.
- Superoptimizing floating point is difficult, particularly where strict IEEE 754 compliance is required.
- Code which accesses memory or which uses constant values is a challenge because of the potentially very large search space involved.
- Generating sequences with loops is almost impossible because of the challenge of proving termination under all circumstances.
Embecosm continues to investigate these areas, to find practical solutions.
SMT Solver Based Superoptimization
Satisfiability Modulo Theories (SMT) solvers are a class of constraint solvers which can be used for superoptimization, and which may be more efficient than both brute force and stochastic superoptimization. They are most simply used for binary-to-binary optimization.
A formal description of the instruction set is provided to the solver. The code of interest is compiled to binary using a standard compiler. The SMT solver is then used to prove or disprove statements of the form “There is no sequence of n or fewer instructions equivalent to this sequence of compiled instructions”, where n is less than the length of the compiled code. If the statement is proved, then we know any superoptimal sequence must be longer than n instructions. If it is disproved, the solver provides a counterexample as evidence, which is itself a shorter code sequence. We can repeat this, adjusting n, to find the shortest sequence.
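Scaled right down, the shape of the query can be seen with the Z3 solver's Python bindings. The snippet asks whether a single shift by an unknown constant can replace the two-instruction sequence "t = x + x; t = t + t" for every 32-bit input; the framing is our own illustration rather than the formulation used by any particular superoptimizer.

from z3 import BitVec, ForAll, Solver, sat

x = BitVec("x", 32)       # universally quantified: every possible input
c = BitVec("c", 32)       # the unknown constant in the candidate instruction

spec = (x + x) + (x + x)  # result of the compiled two-instruction sequence
candidate = x << c        # a single shift instruction with an unknown amount

s = Solver()
# "Does there exist a c such that, for all x, the candidate matches the spec?"
s.add(ForAll([x], candidate == spec))

if s.check() == sat:
    print("shorter equivalent found: shift left by", s.model()[c])
else:
    print("no single shift is equivalent")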
The advantage of constraint-based approaches is that they have the potential to explore loops, floating point and memory accesses effectively, something that is beyond current brute force superoptimizers.
Applicability
With support from InnovateUK, Embecosm carried out an analysis of the state of superoptimization research and its commercial applicability. While general superoptimization of all code is infeasible, we identified two use cases where superoptimization is practical in a commercial context.
- Superoptimization of heavily used code blocks. This could be the critical central loop of a component, or more generically heavily used library routines, such as those in the libgcc or CompilerRT emulation libraries or in the standard C library.
- Target specific peephole optimization. Superoptimization has long been used to identify generic peephole sequences. However it is feasible to apply this to specific architectures, particularly since the brute force algorithms are amenable to searching for many peephole sequences simultaneously.
Embecosm has the experience and skill set to apply superoptimization to your tool chain and code base.
GSO 2.0
The most widely used superoptimizer currently available is the GNU superoptimizer (GSO) developed by Torbjörn Granlund, which was used to identify the generic peephole transformations used in GCC. However this work is now over 25 years old. It is still a blisteringly fast tool for enumerating the shortest translations of fixed code, and we use it regularly for initial explorations of superoptimization. However it is single threaded (despite superoptimization being an embarrassingly parallel problem), and lacks the flexibility to be used for other types of superoptimization, including stochastic superoptimization and superoptimization for speed or energy efficiency.
As part of the TSERO project (see case study below), Embecosm implemented a generic framework for superoptimization, GSO 2.0. This allows a wide range of enumeration and evaluation strategies to be employed and is designed for parallel execution across very large numbers of nodes. GSO 2.0 is still under development, and allows us to offer a much wider range of superoptimization services.