Modern compilers such as GCC and LLVM implement huge numbers of optimizations, over 200 in the case of GCC, many with their own tuning parameters.  Not all optimizations work with all code, and choosing the wrong ones can reduce the performance of your code.  Choosing the right combination of optimizations and tuning parameters for any particular program is a near-impossible task.  Instead we rely on standard combinations, which have proved to be generally good across many programs and architectures: -O2 or -O3 to generate fast code, and -Os or -Oz to generate small code.

The problem is that for any particular program and architecture these flags will not generate the very best code.  It is possible to explore individual flags and parameters using iterative compilation techniques such as combined elimination.  This can double the performance of compiled code, but it takes a great deal of time and is impractical for all but the most valuable code.

Supported by Innovate UK, and working with the University of Bristol and the Science & Technology Facilities Council Hartree Center, Embecosm has developed MAGEEC, a commercially robust machine learning infrastructure for compiler tool chains.  Trained on representative bodies of code, this approach can predict specific optimization settings which will outperform standard -O3 or -Oz, without the huge effort required for iterative compilation.  Embecosm’s MAGEEC system provides:

  • automatic selection of the best optimization flags for your code;
  • optimization for code speed, size or energy-efficiency; and
  • implementations for LLVM and GCC, extensible to other compilers.

MAGEEC ensures you get the best performance out of your existing compiler tool chain, every time.

Technical Details

Preparation

MAGEEC interfaces to the compiler through its command line, allowing individual optimizations and their associated parameters to be set (a short sketch of this interface appears at the end of this section).  The system is first trained on a large body of code representing the broad spread of software with which the compiler is to be used.  It is important that the programs cover the full range of language features, and that the spread is balanced.  So we need programs which between them have a good mix of:

  • small and large functions;
  • small, large and nested loops;
  • no loops;
  • integer arithmetic of all sizes;
  • floating point arithmetic of all sizes; and
  • scalars, small structs/vectors and large structs/vectors.

Individual programs will typically have only one or two of these features, providing plenty of differentiation from which the machine learning can work.  A good training set should have at least 50 programs.  For deeply embedded systems, the BEEBS test suite, developed by Embecosm and the University of Bristol, provides an appropriate mix of programs.
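
As noted above, MAGEEC drives the compiler entirely through its command line.  The sketch below illustrates that mechanism in Python using real GCC flags; the particular selection and the file name example.c are arbitrary, chosen only for illustration, since in practice the flag choices come from the trained model.

    # Sketch of setting individual optimizations and parameters on the
    # compiler command line, as MAGEEC does.  The flag selection here is
    # arbitrary and purely illustrative.
    import subprocess

    flags = [
        "-O2",
        "-fno-tree-vectorize",                   # disable one optimization
        "-finline-functions",                    # enable another explicitly
        "--param", "max-inline-insns-auto=60",   # tune an associated parameter
    ]

    subprocess.run(["gcc", *flags, "-c", "example.c", "-o", "example.o"],
                   check=True)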

Preparing the Training Set of Data

Combined elimination is used to find the best flag combination for each program in the test suite.  For this technique, the program is first compiled with all optimizations enabled and its performance measured.  The criterion for “performance” can be execution speed, code size, or energy consumed when the program runs.  It is important that the figure is accurate and reproducible, so many runs may be needed to eliminate statistical variation, particularly on machines with caches.

The program is then recompiled with each optimization disabled in turn, and the performance measured again.  It is likely that some optimizations will be counter-productive for the program under test, so removing them will increase performance.  The optimization whose removal most improves performance is then removed from the set of optimizations to be applied.

The process is then repeated to remove another optimization from the set, and continues until removing any remaining optimization no longer improves performance.  At this point we have the best set of optimizations to use with this particular program.
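
The elimination loop can be summarized in a short sketch, shown here in Python.  The measure() helper is hypothetical: it stands for compiling and running the program with a given flag set and returning a performance score (higher is better), whether that is speed, size or energy.

    # Minimal sketch of combined elimination over on/off optimization flags.
    # measure(program, flags) is a hypothetical helper: it compiles and runs
    # the program with the given flags and returns a performance score.

    def combined_elimination(program, all_flags, measure):
        active = set(all_flags)
        best_score = measure(program, active)

        while True:
            best_candidate = None
            # Try removing each remaining optimization in turn.
            for flag in sorted(active):
                score = measure(program, active - {flag})
                if score > best_score:
                    best_score = score
                    best_candidate = flag
            if best_candidate is None:
                break                        # no removal helps: we are done
            active.remove(best_candidate)    # drop the most harmful optimization

        return active, best_score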

Almost all optimizations are simple on/off flags.  For optimizations which are parameterized, however, the elimination process must also explore different parameter values.  The same approach can be used to explore the tuning parameters associated with individual optimizations.

Once this process is complete, we have our training data: a set of programs, each with its own optimal set of optimizations.

Training

We now need to learn how the characteristics of the training programs predict which optimizations will be best.  For each program we use a set of abstract features, for example the number of variables in a function, the number of dominators of each basic block, and so forth.  We can use the compiler to generate this information, since it is part of the standard front-end analysis carried out when compiling.

Given the relatively small set of data, training is not usually too complex.  We do not need deep learning techniques such as neural networks.  We have worked with a number of techniques, and one of the best has been to use a C4.5 or C5 classifier to create a decision tree based on the features of the source program.
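
As an illustration of this training step, the sketch below fits a decision tree to per-program feature vectors.  scikit-learn's CART-based DecisionTreeClassifier stands in for the C4.5/C5 classifiers mentioned above, and the feature names, values and labels are invented for the example.

    # Illustrative training step: fit a decision tree on program features.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # One row per training program, e.g. [num_variables, num_basic_blocks,
    # num_loops, max_loop_depth] -- whichever abstract features the compiler
    # front end reports.
    X = np.array([
        [12,  8, 1, 1],
        [40, 55, 6, 3],
        [ 5,  3, 0, 0],
    ])

    # One label per program: an identifier for the best optimization set
    # found by combined elimination.
    y = ["flag_set_a", "flag_set_b", "flag_set_c"]

    model = DecisionTreeClassifier(max_depth=5)
    model.fit(X, y)

    # Predict an optimization set for a new program's feature vector.
    print(model.predict([[10, 7, 1, 1]]))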

Using the Trained Compiler

When given a new program to compile, MAGEEC determines the features of the source program and then applies those features to the decision tree built during training to choose the best optimization set.  The relatively small size of the training set means there may not be a good match, so one option is to use a nearest neighbour approach: find the most similar training program and use its optimization settings.  This is fast, with no significant impact on compile time.

The nearest neighbour approach can be extended to suggest several nearest neighbours; the program is then compiled and its performance measured with each of their settings to determine which is best.  This is not as simple as a single compilation, but it is much faster and simpler than combined elimination.  The data from this exercise can also be used to extend the training database.
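
The sketch below outlines this k-nearest-neighbour variant under the same assumptions as the earlier sketches: X and y hold the training programs' feature vectors and their optimization settings, and measure() is the hypothetical compile-and-measure helper used above.

    # Sketch of the nearest-neighbour approach: find the k training programs
    # closest in feature space, trial-compile with each of their optimization
    # sets, and keep whichever measures best.
    from sklearn.neighbors import NearestNeighbors

    def choose_flags(new_features, X, y, program, measure, k=3):
        nn = NearestNeighbors(n_neighbors=k).fit(X)
        _, indices = nn.kneighbors([new_features])

        candidates = [y[i] for i in indices[0]]
        # Compile and measure the new program with each candidate flag set.
        return max(candidates, key=lambda flags: measure(program, flags))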

Implementing MAGEEC for a New Architecture

Some optimization passes are specific to a particular architecture and can only be used when training on that architecture, and even generic optimizations behave differently on different architectures.  The training runs must therefore be repeated and the training database rebuilt for each new architecture.

Future Research & Development by Embecosm

MAGEEC is still an active research program.  There are a number of areas for future development.

  • We wish to tune optimization flags for individual source files and eventually for individual source functions.  This will give us a much larger training set for the machine learning phase, improving the quality of generated code.
  • The choice of abstract features in the source code is critical to good machine learning.  The current set was created for the earlier MILEPOST EU research project, but was not chosen with sufficient rigour.  Principal component analysis (PCA) will allow us to identify a better set, but it requires much larger training sets.  This should be possible once we are able to train on individual source files and functions.
  • We have to date used simple machine learning techniques.  Preliminary work with more sophisticated techniques, such as genetic algorithms, suggests these can yield much better results, which we wish to explore further.

Case Study

“Thanks to Embecosm’s experience, unique skills and novel techniques we were able to complete ambitious research that required searching an enormous computational space.”
Dr Simon Hollis, University of Bristol

The Machine Guided Energy Efficient Compiler (MAGEEC) project was an Innovate UK supported research program between the University of Bristol and Embecosm, with the aim of making machine learning feasible in commercial compilers, specifically for generating energy efficient code.  The project ran from mid-2012 to the end of 2013, with the objective of achieving a 20% reduction in typical code energy usage in deeply embedded systems.

Under this project we implemented the first version of the MAGEEC framework and integrated it with an energy measurement board (the MAGEEC Wand) designed by Dr Simon Hollis, then at the University of Bristol.  This allowed us to sample energy usage on small computers up to 2 million times per second to an accuracy of 1%, which was essential for obtaining accurate training data.  This first version of MAGEEC used the plugin interface to control the GCC pass manager directly.  We were able to demonstrate machine learning that could reduce energy consumption in deeply embedded computers, using Atmel AVR processors as the evaluation system.

Total Software Energy Reduction and Optimization (TSERO) was an Innovate UK supported follow-on project between Embecosm, Allinea (now part of ARM), Concertim and the STFC Daresbury Hartree Center, which aimed to apply the same techniques to high performance computing systems and data centers.  Running between 2015 and 2017, this project added superoptimization as a technique for energy optimization.  The University of Bristol continued to be involved as technical advisor to the project.

Fitting custom energy loggers to valuable HPC systems was not an option, so the data for machine learning was taken from the Allinea MAP tool, which collates energy information from all nodes in an HPC system.  This required the MAGEEC system to use a much larger statistical sample to get useful data.

The MAGEEC system was rewritten for this project, based on the previous experience.  Interfacing through the GCC plugin interface was too compiler specific, and it proved very easy to select invalid optimization pass combinations which would crash the compiler.  We therefore switched to controlling the compiler through the command line.  This also made it easier to work with LLVM, which has no standard plugin interface.

By late 2017 we were able to demonstrate that MAGEEC could improve execution speed compared to standard -O3 for some common HPC kernels.  We did not see an improvement in all cases, but where we did, the improvement averaged 8%.  This is not as much as the target 20%, but it still represents an important gain, particularly if we can achieve the same for energy efficiency.  Google and Amazon between them spend US$1 billion each year on energy for their data centers, and an 8% reduction is a significant saving.

Superoptimization

Embecosm is the first company to offer superoptimization for commercial applications. This is a practical technology that can deliver a step change in performance and code size for your key algorithms and libraries.

Optimization for Energy Efficiency

Embecosm offers the first compilers and compiler optimizations that can optimize for energy. The technology combines Embecosm's MAGEEC machine learning framework for GCC and LLVM with optimizations specifically aimed at improving energy efficiency.