Running AI models on a microcontroller

A few weeks ago, I was at FOSDEM and then the AI Plumbers conference in Brussels with my colleagues Pietra Ferreira, Shane Slattery and William Jones. While many of the talks focused on LLMs, the Embecosm team attracted a great deal of interest for our work bringing up PyTorch/ExecuTorch on a bare metal microcontroller.

AI’s big problem today is power consumption. Training AI models of any complexity usually needs a lot of memory and compute. Running inference for some trained AI models, such as LLMs, can also need a lot of memory and compute, due to the size of the models.

But when running inference, many AI models do not need so much memory or compute power to be useful. Examples include audio analysis and image recognition. Often this is not the full processing, but pre-processing as input to other software. The advantage is that these AI models have such low power needs that they can easily run on edge AI devices, even battery-powered ones.

For the past 8 months we’ve been working with Mosaic SoC to bring up ExecuTorch on a small bare metal microcontroller. ExecuTorch, first released in 2023, is a derivative project of PyTorch targeting the embedded market, and has been widely used under operating systems such as Android. The processor we are working on has a small number of low power 32-bit RISC-V cores, a custom AI accelerator and megabytes, not gigabytes, of memory. It does not run an operating system, just a small kernel that provides primitives to control the different cores and access memory.

Let’s take a look at the ExecuTorch architecture.

The silicon architecture will vary from device to device, but AI hardware is generally assumed to be partitioned into a host core (or cores) and an accelerator (typically with many cores). AI-specific devices are likely to have specialist accelerators that support common operations such as matrix multiplication.

For microcontrollers, there will not be an MMU, but there is very likely to be some form of closely-coupled fast memory, very often with DMA units to facilitate transfers between different areas of memory.

Ahead-of-time processing

We have an ahead-of-time phase, which does the following.

It lowers the PyTorch model (which can use any of the thousands of ATen operators) to use just the ExecuTorch Edge subset of operators (around 150).
It performs graph manipulation, for example to fuse operators that commonly occur together, such as Conv2d followed by ReLU.
It determines which operators will need to be delegated to custom code for processing. This may be to take advantage of the custom accelerator, to be distributed over all the cores, or to use a smaller datatype (known as quantization). Any operator that is not delegated will be executed (as a single thread) using a standard C++ implementation on the host CPU.
It determines where in memory all the tensors, which form the operands to the operators, will be placed. Small processors invariably have a small amount of fast memory, supported by DMA, so tensors can be tiled, allowing the computation to be done in smaller chunks, but always in fast memory.

The resulting model is saved as a .pte file, the model using just Edge operators, with the graph transformed, delegations identified, and memory locations specified. This is the input to the ExecuTorch runtime.
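To make the graph manipulation step a little more concrete, here is a toy sketch in Python of fusing Conv2d followed by ReLU. The list-of-pairs graph representation and the names fuse_conv_relu and conv2d_relu are hypothetical, invented for this illustration. The real ahead-of-time passes work on an exported PyTorch graph, not on a structure like this.

```python
# Toy operator-fusion pass. A "graph" here is just a linear list of
# (operator_name, node_name) pairs. This representation is an assumption
# made for illustration; it is not the ExecuTorch graph format.

def fuse_conv_relu(graph):
    """Return a new graph in which each conv2d immediately followed by
    relu is replaced by a single fused conv2d_relu node."""
    fused = []
    i = 0
    while i < len(graph):
        op, name = graph[i]
        next_op = graph[i + 1][0] if i + 1 < len(graph) else None
        if op == "conv2d" and next_op == "relu":
            # Keep the conv node's name and skip over the relu node.
            fused.append(("conv2d_relu", name))
            i += 2
        else:
            fused.append((op, name))
            i += 1
    return fused


model = [
    ("conv2d", "conv1"),
    ("relu", "act1"),
    ("max_pool2d", "pool1"),
    ("conv2d", "conv2"),
    ("relu", "act2"),
    ("linear", "fc"),
]
fused_model = fuse_conv_relu(model)
```

A linear list sidesteps a check a real pass would need: fusion is only safe when the ReLU is the sole consumer of the convolution's output.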

Implementing the ExecuTorch runtime

While some work is needed on the ahead-of-time code, to provide support specific to the target platform, the bulk of the work is in the runtime. We have two tasks: i) modify the runtime to work without an operating system; and ii) optimize the runtime, with delegated code to take advantage of our processor.

Removing OS dependencies

The first of these is easy. The ExecuTorch runtime has minimal interaction with an operating system, so there were only three areas we needed to work on.

Writing debug and error messages to standard output when we have no I/O. The initial approach is easy enough: we can just disable output. Later, we can add to the kernel the ability to write to a UART, or use semihosting to redirect output to an external host under debug.
Loading the model and transferring data to and from the runtime, when we have no filesystem. The model is not a problem: we can just convert it to a C/C++ binary array using xxd and then #include it in the runtime (we could have used #embed to include the binary file directly, but this is still quite new, and only available in the most recent C/C++ compilers). For an embedded system, transferring the data is very chip specific, with data coming in from sensors, and data going out, most likely to actuators or to another system. The kernel is extended to support these devices. The key activities for the ExecuTorch runtime are then to convert raw input data into the tensors expected as input by the model and to convert output tensors into whatever is needed for device actuation or communication to another system.
Memory management. The ExecuTorch runtime does not use malloc; it provides its own memory management API. For an embedded system, we can just allocate the memory statically, using the linker to place it in the correct location in memory.
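For readers who have not used xxd in this way, the model-embedding step above amounts to something like the following Python sketch, which produces output similar in spirit to xxd -i. The function name bytes_to_c_array, the variable name model_pte and the stand-in bytes are all hypothetical. A real build would read the actual .pte file and may also need to add const and an alignment attribute, which are not shown here.

```python
# Sketch of turning a binary blob into a C array definition that can be
# #included in the runtime. All names here are illustrative assumptions.

def bytes_to_c_array(data, var_name, per_line=12):
    """Format `data` as a C unsigned char array plus a length variable."""
    lines = []
    for start in range(0, len(data), per_line):
        chunk = data[start:start + per_line]
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk))
    body = ",\n".join(lines)
    return (
        f"unsigned char {var_name}[] = {{\n{body}\n}};\n"
        f"unsigned int {var_name}_len = {len(data)};\n"
    )


# Stand-in bytes. A real flow would read them from the model file, e.g.
#   with open("model.pte", "rb") as f: data = f.read()
example = bytes([0x1C, 0x00, 0x00, 0x00, 0x45, 0x54])
header_text = bytes_to_c_array(example, "model_pte")
```

Writing header_text to a header file gives the runtime a plain array it can treat as the model image, with no filesystem involved.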

Runtime optimization

The major task is then optimization of the model. This is achieved by delegating operations to custom code, tuned to our target processor. We have three key general purpose optimizations, which we apply first: tiling, multi-threading and quantization.

Tiling

The primary reason for tiling is to allow us to carry out operations in the small amount of fast memory attached to the processor.

This optimization is closely tied to the memory model and the amount of fast memory. Typically we will run double-buffered algorithms, where, while DMA is loading the next buffer, the processor is working on the current buffer. The following diagram illustrates the process.
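The control flow of double buffering can also be sketched in a few lines of Python. This is a simulation only: slow memory is a Python list, dma_load is a hypothetical stand-in for starting a DMA transfer, and ReLU stands in for whatever operator is being tiled. On real hardware the load of the next tile would run concurrently with the compute on the current one, whereas here the two simply happen one after the other.

```python
# Simulated double-buffered (ping-pong) tiling. The names dma_load and
# relu_tiled are illustrative assumptions, not part of any real API.

def dma_load(slow, start, tile_size):
    """Copy one tile out of slow memory (stand-in for a DMA transfer)."""
    return slow[start:start + tile_size]


def relu_tiled(slow_in, tile_size):
    """Apply ReLU to slow_in one tile at a time using two fast buffers."""
    out = []
    buffers = [None, None]
    n_tiles = (len(slow_in) + tile_size - 1) // tile_size
    if n_tiles == 0:
        return out
    # Prime the pipeline: load tile 0 into buffer 0.
    buffers[0] = dma_load(slow_in, 0, tile_size)
    for t in range(n_tiles):
        current = t % 2
        nxt = 1 - current
        # Start loading the next tile into the other buffer. On hardware
        # this transfer would overlap with the compute below.
        if t + 1 < n_tiles:
            buffers[nxt] = dma_load(slow_in, (t + 1) * tile_size, tile_size)
        # Compute on the buffer that is already resident in fast memory.
        out.extend(max(0.0, x) for x in buffers[current])
    return out


data = [-2.0, 1.0, -0.5, 3.0, 4.0, -1.0, 2.5]
result = relu_tiled(data, tile_size=3)
```

Only two tile-sized buffers are ever live, which is what lets the working set fit in a small fast memory regardless of the size of the full tensor.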

Multi-threading

When we have a multi-core processor, operations can be shared across all the cores. Many of the operations used in AI are highly parallel in nature, making this an effective optimization. We lack an operating system, so won’t have a full standard threading API, but the kernel will typically provide some key functionality to allow threads to synchronize.

We are now breaking up the tiles into smaller “sub-tiles”, one for each core. We still use the double buffering approach described above, but it is usually more efficient to drive DMA to transfer data for all cores at once, rather than having each core drive this.
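As a rough sketch of the sub-tile split, the Python below divides one tile into contiguous ranges, one per core, and then processes them in parallel. Python threads are used here purely as a stand-in for cores, and the names split_tile and relu_on_cores are hypothetical. On the bare metal target the kernel's own synchronization primitives would take the place of the thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

# Splitting one tile into contiguous sub-tiles, one per core.
# All names are illustrative assumptions.

def split_tile(tile_len, n_cores):
    """Return (start, end) index ranges, one per core, covering the tile.
    Sub-tile sizes differ by at most one element."""
    base, extra = divmod(tile_len, n_cores)
    ranges = []
    start = 0
    for core in range(n_cores):
        size = base + (1 if core < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def relu_on_cores(tile, n_cores):
    """Apply ReLU to a tile, with each simulated core handling one sub-tile."""
    out = [0.0] * len(tile)

    def work(bounds):
        start, end = bounds
        for i in range(start, end):
            out[i] = max(0.0, tile[i])

    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        # Consuming the map waits for every sub-tile, acting as a barrier.
        list(pool.map(work, split_tile(len(tile), n_cores)))
    return out


sub_tiles = split_tile(tile_len=10, n_cores=4)
processed = relu_on_cores([-1.0, 2.0, -3.0, 4.0, 5.0], n_cores=2)
```

Each core writes to a disjoint range of the output, so no locking is needed inside the loop. Only the barrier at the end of the tile requires synchronization.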

Quantization

When training models, the weights and biases associated with operations are typically computed as 32-bit or 64-bit floating point values. However, once these values have been determined, it is often sufficient to use much less precision, for example 8-bit integers. This reduces the size of the tensors used, and moving tensors around memory is usually the limit on performance of an AI model. This process is known as quantization.

Quantization is a complex topic. It is not always possible—some models are just too sensitive to the values. It also comes with a computational cost. While some quantization may be possible ahead-of-time, much of it has to be done at runtime. Typically there is a quantization step at the start of evaluating a model and then sometimes dequantization at the end. But overall there is a large benefit in reducing tensor sizes.
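The sketch below shows one common scheme, affine quantization to 8-bit integers, in plain Python. It is only an illustration of the general idea under simple assumptions (a single scale and zero point for the whole tensor, round to nearest). The function names are hypothetical, and it is not a description of the scheme used in our implementation.

```python
# Affine int8 quantization: real_value ~= (q - zero_point) * scale.
# Function names and the per-tensor scheme are illustrative assumptions.

def choose_qparams(values, qmin=-128, qmax=127):
    """Pick a scale and zero point so that the value range, extended to
    include 0.0, maps onto [qmin, qmax]."""
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # all values are zero; any scale works
    zero_point = round(qmin - lo / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to integers in [qmin, qmax]."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(quantized, scale, zero_point):
    """Map integers back to approximate floats."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-0.5, -0.1, 0.0, 0.3, 1.5]
scale, zero_point = choose_qparams(weights)
q_weights = quantize(weights, scale, zero_point)
restored = dequantize(q_weights, scale, zero_point)
```

Each value now occupies one byte rather than four or eight. As long as no value is clamped, the round trip is off by at most half a quantization step (scale / 2), and that error is what a sensitive model may not tolerate.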

Other runtime optimizations

The optimizations above are generic: they apply to any AI model on any platform. However, there is a whole class of other optimizations which are target specific. Where we have custom accelerators, such as GEMM units, we provide delegated implementations of operations that can take advantage of these accelerators. In some cases we provide fused operators, for example fusing ReLU with a preceding tensor transformation.
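To illustrate what a fused operator saves, here is a small pure-Python sketch comparing a matrix multiplication followed by a separate ReLU pass with a fused version that applies ReLU as each output element is produced. The function names are hypothetical and nested lists stand in for tensors. A real delegated implementation would hand the multiplication to the accelerator and not loop in software.

```python
# Unfused versus fused matmul + ReLU. Names are illustrative assumptions.

def matmul(a, b):
    """Plain matrix multiply; returns a full intermediate result."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def relu(m):
    """Second pass over the intermediate tensor."""
    return [[max(0.0, x) for x in row] for row in m]


def matmul_relu_fused(a, b):
    """Apply ReLU while each output element is still in the accumulator,
    so no intermediate tensor is written out and read back."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = []
    for i in range(rows):
        row = []
        for j in range(cols):
            acc = 0.0
            for k in range(inner):
                acc += a[i][k] * b[k][j]
            row.append(max(0.0, acc))
        out.append(row)
    return out


a = [[1.0, -2.0], [0.5, 3.0]]
b = [[2.0, 0.0], [1.0, -1.0]]
unfused = relu(matmul(a, b))
fused = matmul_relu_fused(a, b)
```

Both versions compute the same values. The fused one avoids a second trip through memory, which matters when moving tensors is the bottleneck.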

We also use traditional optimization techniques, such as profiling to identify hot spots in code. Many of the operations used have an inner loop which dominates compute time. While compilers will do a good job in general, in specific cases these hot loops can be optimized. This can be by hand, but it is also where techniques such as superoptimization are beneficial.

Getting PyTorch models running on your microcontroller

If you would like to hear more detail, the FOSDEM talk by Pietra Ferreira, Shane Slattery and William Jones is available here.

Our work to date has shown it is perfectly feasible to run PyTorch models on bare metal microcontrollers using ExecuTorch. It is not something that works out of the box, but with some careful engineering, such systems are eminently practical. The work with Mosaic SoC is ongoing. We have an ExecuTorch implementation which is now starting to generate some very efficient code for key operators, using the techniques described above and custom acceleration.

If you would like to know more about how we can help you with your AI implementation, please contact Embecosm at info@embecosm.com.