The transcendental function implementation in Newlib’s libm, like a number of other such software implementations, assumes that the target already has a fast hardware implementation of basic floating point arithmetic in the same or a higher precision than that of the implemented function. Moreover, the evaluation typically goes to some length to raise the correct floating point exceptions, either by executing the computation in a particular manner, or by detecting exception conditions and evaluating floating point arithmetic crafted to raise the appropriate exceptions.
Using such an implementation for a target that uses a software implementation of the basic floating point arithmetic means an unnecessary use of processor cycles — time and energy. Therefore, I intend to implement some transcendental functions with an integer-based algorithm.
Targeting such an optimization at a processor that you would rarely want to use for floating point processing would be a waste of effort, so I will focus on a class of targets that have, or could have, a software floating point implementation with reasonable performance. Namely, I will assume that the following operations are reasonably fast:
- 32*32->64 bit widening integer multiplication
- 32 bit dynamic shift
- count leading sign/zero bits
Moreover, I will use GNU C to express these operations.
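For illustration, these primitives might be expressed in GNU C roughly as follows; the helper names are mine, and GCC's __builtin_clz / __builtin_clrsb built-ins stand in for the count-leading-zero/sign-bit operations:

```c
#include <stdint.h>

/* 32*32->64 bit widening multiply; GCC maps this pattern onto the
   target's widening multiply, or a mul/mulhi pair.  */
static inline uint64_t
umul32x32 (uint32_t a, uint32_t b)
{
  return (uint64_t) a * b;
}

/* The same operation split into separate low-part and high-part results.  */
static inline uint32_t
umul32x32_lo (uint32_t a, uint32_t b)
{
  return a * b;
}

static inline uint32_t
umul32x32_hi (uint32_t a, uint32_t b)
{
  return (uint64_t) a * b >> 32;
}

/* 32 bit dynamic shift.  */
static inline uint32_t
lshift_dyn (uint32_t x, unsigned n)
{
  return x << n;
}

/* Count leading zero bits (undefined for 0) and leading redundant sign
   bits, via the GCC built-ins.  */
static inline int
clz32 (uint32_t x)
{
  return __builtin_clz (x);
}

static inline int
clrsb32 (int32_t x)
{
  return __builtin_clrsb (x);
}
```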
For the start of this project, I will focus on sinf and cosf (IEEE single precision sin / cos), without any signal handling or error raising apart from returning inf / NaN where appropriate. Raising signals is uncommon in performance-oriented software floating point implementations, and would also require some deeper binding into the target-specific implementation to avoid disproportionate size/speed cost.
Setting errno to EDOM in addition to generating NaN would be relatively straightforward, but would introduce threading dependencies. For error conditions where inf / zero is generated, additional code and calculations in the critical path might be needed to catch all the exception conditions. And finally, the inexact condition is rather arduous and generally pointless to compute for transcendentals.
Although exact round-to-nearest/even is possible (and even relatively cheap compared to what it would cost with algorithms based on basic floating point arithmetic), it would still take a considerable amount of code size and some extra computation for very little gain. The aim is to use a few guard bits to get close to the theoretical 0.5 ulp worst case error, but not to throw undue amounts of extra code/table size or computing resources at diminishing returns.
To further define the algorithmic search space, I will assume that the 32*32->64 multiply is performed as separate operations for the 32 bit low and high parts, and that these operations can also be performed for smaller bit sizes. There are interesting targets that can do the widening multiply in a single instruction and/or can do a smaller width more quickly, and for these, differently tailored algorithms (no pun intended) may be more appropriate. But you have to start somewhere.
A common technique to approximate a continuous and suitably differentiable function is to evaluate a polynomial. Sine and cosine are well behaved and could be evaluated with their Taylor series over their entire domain; however, that would require an excessively long evaluation to reach the desired precision. In general, the smaller the domain, the higher the precision obtainable with a given number of terms. For sine/cosine, we can use the special properties of these functions to partition the domain into smaller intervals that are equivalent to the 0..pi/4 interval of sine or cosine. We can further partition this interval by using a lookup table of polynomials. Taylor polynomials developed around the midpoint of each interval give a reasonable starting point; however, by optimizing the polynomials for the lowest maximum error, we can gain extra precision for a fixed polynomial degree: roughly as many bits as the degree of the polynomial.
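To make the table-of-polynomials idea concrete, here is a minimal sketch; the table size, degree, and fixed-point formats are placeholders of mine rather than the values used in the actual implementation. The reduced argument selects an interval, and that interval's polynomial is evaluated in Horner form using high-part multiplies:

```c
#include <stdint.h>

#define DEGREE 3
#define NINTERVALS 64

/* Placeholder coefficient table: one degree-3 polynomial per interval.
   The real table would be generated offline, in a fixed-point format
   chosen during design space exploration.  */
static const int32_t poly_tab[NINTERVALS][DEGREE + 1] = { { 0 } };

/* Evaluate the polynomial for one interval in Horner form.  'frac' is
   the offset from the interval midpoint with 32 fractional bits; each
   product is truncated to its high 32 bits.  Computes
   ((c3*f + c2)*f + c1)*f + c0.  */
static int32_t
eval_poly (unsigned interval, int32_t frac)
{
  const int32_t *c = poly_tab[interval];
  int32_t acc = c[DEGREE];
  int i;

  for (i = DEGREE - 1; i >= 0; i--)
    acc = (int32_t) (((int64_t) acc * frac) >> 32) + c[i];
  return acc;
}
```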
The error we want to bound is measured in ulp. For a small interval, chances are that the result exponent will be the same over the entire interval, so it is sufficient to consider the absolute error. Also, if the interval spans just two adjacent result exponent values, using the absolute error is not too far off the optimal result. Furthermore, to minimize the maximum absolute error, Chebyshev polynomials are convenient: they let us calculate the lookup table quickly and thus explore the design space. Things are different in the intervals where the graph touches zero; there we have to minimize the relative error.
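To make that concrete: the ulp is constant within a binade and doubles at each result exponent step, which is why a bound on the absolute error translates directly into a ulp bound as long as the result exponent stays fixed. A small helper (my own illustration, not code from the branch) that returns the ulp of a positive finite float:

```c
#include <math.h>

/* Size of one ulp at a positive, finite single-precision x: the gap to
   the next representable value.  Constant across a binade, doubling at
   each exponent step.  */
static float
ulpf (float x)
{
  return nextafterf (x, INFINITY) - x;
}
```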
Design space exploration matters, because we want, as far as possible, to avoid shifting values around to make them fit the integer calculations, while also having constants of similar size that can be stored and indexed efficiently.
On a higher level, there is also the matter of table size vs. computation time vs. precision. Different architectures, which also might have different typical application spaces, will have different priorities. For example, a highly parallel architecture with closely coupled memory will likely prefer higher-degree polynomials in order to fit the table into the available memory.
Because the intervals are small, the power of a term is the dominating factor in the precision required for that term, and the required precision decreases roughly linearly with the power. E.g. with x in the interval 0..1/64, a polynomial a*x^3+b*x^2+c*x+d, and a, b, c, d in the same ballpark, we need about 18 bits less precision for the x^3 term than for the x^0 term. This means that for higher order terms, constants can be truncated, and calculations can be done in lower precision and can ride roughshod over the lower order bits without materially affecting the result. For instance, for two terms that need 16 bit precision, a 32 bit constant value might be loaded, the lower part of which is used in a 16*16->32 bit widening multiply, while the entire word, albeit with only the 16 higher bits holding a valid value, is used in a high part multiply. Or, if no 16*16 multiply is readily available and/or an uneven precision distribution is wanted, the low bits are extracted with a mask and then used in a non-widening multiply, while the high bits are again used, together with the then numerically insignificant low bits, in a widening multiplication.
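As an illustration of this packed-coefficient trick, the two variants might look roughly like the following; the packing layout, names, and fixed-point scalings are assumptions of mine, not the branch's actual conventions. Here xa and xb stand for the fixed-point powers of the reduced argument that the two coefficients multiply:

```c
#include <stdint.h>

/* Two 16-bit-precision coefficients packed into one 32-bit word: one in
   the low half, the other in the high half.  */

/* Variant 1: the low half feeds an exact 16*16->32 widening multiply;
   the whole word goes into a 32*32 high-part multiply, where the junk
   in its low 16 bits only disturbs bits far below the result's
   precision.  */
static uint32_t
two_terms_16x16 (uint32_t packed, uint32_t xa, uint32_t xb)
{
  uint32_t ta = (uint32_t) (uint16_t) packed * (uint32_t) (uint16_t) (xa >> 16);
  uint32_t tb = (uint64_t) packed * xb >> 32;
  return ta + tb;
}

/* Variant 2: no 16*16 multiply available (or an uneven precision split
   is wanted): mask out the low bits for a plain non-widening multiply
   instead, and again use the full word in the high-part multiply.  */
static uint32_t
two_terms_masked (uint32_t packed, uint32_t xa, uint32_t xb)
{
  uint32_t ta = (packed & 0xffff) * (xa >> 16);
  uint32_t tb = (uint64_t) packed * xb >> 32;
  return ta + tb;
}
```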
Something else to consider is that systematic errors introduced in the higher order terms can be taken into account when computing the lower order terms, just as we take the complete elision of still higher order terms into account when computing our optimized polynomials. Thus the effective number of guard bits for the polynomial can be equal to or higher than the number used for the highest order terms.
The high-level interval reduction of sinf / cosf to 0..pi/4 mentioned above requires some sort of remainder computation. The existing floating point-centric implementation in Newlib literally does a remainder operation: multiply by 4/pi, lop off the integral part, then multiply by pi/4. This is rather complex to do right, i.e. without massive loss of precision when the result is relatively small, and thus this operation ends up costing not just time but also a lot of code size. A more efficient way to compute the function in our case is to omit the multiplication of the truncated value by pi/4, and use appropriately adjusted polynomials. This does come at a cost: a different polynomial lookup table is needed than the one that can be used directly with input values in the 0..pi/4 interval. You could thus have two lookup tables and consider the speed/size tradeoff for each separately. I used a different approach here: I take the interval-reduction code path, multiplying by 4/pi, for most values up to pi/4 (a small interval close to the origin is treated differently). Going with just a single lookup table allows us to minimize the polynomial order for a given table size, thus benefiting all computations and also keeping the worst case computation time down.
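A rough fixed-point sketch of that reduction step follows; the constant width, formats, and names are illustrative, and the conversion from float as well as the handling of large arguments (which needs many more bits of 4/pi) are skipped. The idea is to multiply by 4/pi, take the integral part as the octant, and keep the top fractional bits, which then feed the table index and the in-interval offset without ever multiplying back by pi/4:

```c
#include <stdint.h>

/* 4/pi rounded to a Q1.31 fixed-point constant (illustrative precision
   only; accurate reduction of large arguments needs far more bits).  */
#define FOUR_OVER_PI_Q31  0xA2F9836Fu

/* Reduce a non-negative argument given in Q2.30 fixed point (|x| < 4,
   enough for this sketch) to an octant number and the top 32 bits of
   the fraction within that octant.  */
static void
reduce (uint32_t x_q30, unsigned *octant, uint32_t *frac_q32)
{
  /* Q2.30 * Q1.31 = Q3.61 */
  uint64_t prod = (uint64_t) x_q30 * FOUR_OVER_PI_Q31;

  *octant = (unsigned) (prod >> 61);    /* integral multiples of pi/4 */
  *frac_q32 = (uint32_t) (prod >> 29);  /* top 32 bits of the fraction */
}
```

The high bits of frac_q32 would then index the lookup table, the remaining bits give the offset within the interval, and the octant selects between the sine and cosine polynomials and the result sign.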
Initial evaluation of the sinf implementation shows that for some values we get errors in excess of 1 ulp. This is much larger than expected from the error estimates obtained during table generation, where we use the errors at the edges of the interval. That would be perfect if the polynomial were optimal for absolute error to start with, but because we only approximate using a Chebyshev polynomial, the error value thus obtained is more a rough guide than a proper numerical analysis.
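The kind of harness that exposes this simply compares against the double-precision sin rounded to float and counts the distance in representable values, which is a slightly loose notion of ulp error but good enough to flag the cases above 1 ulp. This test is my own illustration, not the project's test code, and the scanned range is arbitrary:

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Distance in representable single-precision values between two finite
   positive floats, via their (monotonic) bit patterns.  */
static int32_t
float_dist (float a, float b)
{
  int32_t ia, ib;
  memcpy (&ia, &a, sizeof ia);
  memcpy (&ib, &b, sizeof ib);
  return ia > ib ? ia - ib : ib - ia;
}

int
main (void)
{
  float worst_x = 0.0f;
  int32_t worst = 0;

  /* Scan an illustrative slice of the 0..pi/4 range, one float at a time.  */
  for (float x = 0x1p-6f; x < 0.785398f; x = nextafterf (x, 2.0f))
    {
      float got = sinf (x);
      float ref = (float) sin ((double) x);  /* reference, rounded to float */
      int32_t d = float_dist (got, ref);
      if (d > worst)
        {
          worst = d;
          worst_x = x;
        }
    }
  printf ("worst error: %d ulp at x = %a\n", (int) worst, worst_x);
  return 0;
}
```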
So, first we should properly optimize the polynomial. The Remez algorithm probably needs some adjustments to cope with the error being specified in ulp. Also, if we succeed in minimizing the maximum relative error in ulp, then for any interval with more than one result exponent value the polynomial will be lopsided, so to find the maximum error we actually have to find the maxima of the error function and test there, as well as at the interval bounds and at each result exponent change site.
The code can be found in the joern-sf_sin_int branch of the development tree for newlib for the Synopsys DesignWare ARC processor family.