MicroFloatingPoints

The MicroFloatingPoints package makes it possible to manipulate small IEEE 754-compliant floating-point numbers, also known as minifloats, whose size is smaller than or equal to that of the Float32 format mandated by the standard.

The library may serve to exemplify the behavior of IEEE 754 floating-point numbers in a systematic way through the use of very small formats.

At its core, the package defines a new type Floatmu parameterized by two integers:

• szE, the number of bits used to represent the exponent;
• szf, the number of bits used to represent the fractional part (excluding the so-called hidden bit).

As the figure below shows, the total length of an object of the type Floatmu{szE,szf} is $1+\text{szE}+\text{szf}$ bits[1].
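
The layout described above can be sketched with plain bit operations. The snippet below is only an illustration in Base Julia: `split_fields` and `total_bits` are hypothetical helpers, not functions of MicroFloatingPoints, and the field order (sign, then exponent, then fraction, from the most significant bit down) is the usual IEEE 754 one.

```julia
# Hypothetical helpers (not part of the package API): split the low
# 1 + szE + szf bits of an unsigned integer into the three IEEE 754 fields.
total_bits(szE, szf) = 1 + szE + szf

function split_fields(bits::UInt32, szE::Int, szf::Int)
    fraction = bits & ((UInt32(1) << szf) - UInt32(1))           # low szf bits
    exponent = (bits >> szf) & ((UInt32(1) << szE) - UInt32(1))  # next szE bits
    sign     = (bits >> (szE + szf)) & UInt32(1)                 # top bit
    return (sign = sign, exponent = exponent, fraction = fraction)
end

# With szE = 8 and szf = 23 the layout coincides with Float32, so a real
# Float32 bit pattern can be used as test input.
fields = split_fields(reinterpret(UInt32, -1.5f0), 8, 23)
println(fields)
println(total_bits(5, 10), " bits for the (5, 10) layout")
```

For `-1.5f0` the sign field is 1, the biased exponent is 127 and the fraction field holds only the bit of weight one half.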

Floatmu{szE,szf} objects are stored in 32-bit unsigned integers, which limits the maximum values of szE and szf. All computations are performed internally in double precision, with Float64 floats. To ensure that no double rounding occurs, that is, that a computation performed in double precision and then rounded to a Floatmu{szE,szf} gives the same result as the one we would have obtained by performing it entirely at the precision of the Floatmu{szE,szf} type, we limit the size of a Floatmu{szE,szf} to that of a Float32 [Rump2016].
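
The absence of double rounding can be checked empirically for the widest allowed format. The following is a minimal sketch in Base Julia, not the package's implementation: it uses the native Float32 type as a stand-in for Floatmu{8,23} and compares each hardware single-precision result with the same operation carried out in Float64 and then rounded. `matches_native` is a hypothetical helper. A classical sufficient condition for this agreement on the four basic operations is that the wide format has at least $2k+2$ significand bits when the narrow one has $k$; here $53 \geqslant 2\times 24+2$.

```julia
using Random

# Hypothetical helper: does "compute in Float64, then round to Float32"
# agree bit for bit with the native Float32 operation?
function matches_native(op, a::Float32, b::Float32)
    via_double = Float32(op(Float64(a), Float64(b)))  # wide computation, then one rounding
    native     = op(a, b)                             # single-precision computation
    return via_double === native
end

rng = MersenneTwister(1)
all_match = all(1:100_000) do i
    # Operands in [0.5, 1.5) keep the example away from zero divisors.
    a = rand(rng, Float32) + 0.5f0
    b = rand(rng, Float32) + 0.5f0
    all(op -> matches_native(op, a, b), (+, -, *, /))
end
println(all_match)
```

A random test of this kind proves nothing by itself; it only illustrates the property that the cited result establishes.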

The limits on the integers szE and szf are therefore:

$$\left\{\begin{array}{l} 2\leqslant\text{szE}\leqslant8\\ 2\leqslant\text{szf}\leqslant23 \end{array}\right.$$

Under these constraints, one can manipulate and compute with very small floats (e.g. 2 bits for the exponent and 2 bits for the fractional part) that comply with the IEEE 754 standard. It is also possible to emulate more established formats such as:
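
To give an idea of how small such a format is, the sketch below lists every non-negative finite value of a format with 2 exponent bits and 2 fraction bits, using only the standard IEEE 754 decoding rules (bias $2^{\text{szE}-1}-1$, subnormals for a zero exponent field, all-ones exponent field reserved for infinities and NaNs). It is written in Base Julia and does not call the package; `field_value` is a hypothetical helper.

```julia
# Hypothetical helper: value encoded by exponent field e and fraction field f
# under the IEEE 754 rules, for a format with szE exponent bits and szf fraction bits.
function field_value(e::Int, f::Int, szE::Int, szf::Int)
    bias = 2^(szE - 1) - 1
    if e == 0
        return ldexp(f / 2^szf, 1 - bias)      # subnormal: no hidden bit
    else
        return ldexp(1 + f / 2^szf, e - bias)  # normal: hidden bit set to 1
    end
end

szE, szf = 2, 2
# Exponent fields 0 .. 2^szE - 2 are finite; the all-ones field is Inf/NaN.
finite_values = [field_value(e, f, szE, szf)
                 for e in 0:(2^szE - 2) for f in 0:(2^szf - 1)]
println(finite_values)
# → [0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 3.5]
```

The twelve values show both the even spacing of the subnormals and first binade, and the doubling of the gap in the last binade.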

• Float16, the IEEE 754 half-precision format: Floatmu{5,10}
• Float32, the IEEE 754 single precision format: Floatmu{8,23}
• bfloat16, the Brain Floating Point by Google: Floatmu{8,7}
• TensorFloat-32, the format by NVIDIA: Floatmu{8,10}
• AMD's fp24: Floatmu{7,16}
• Pixar's PXR24: Floatmu{8,15}
• and many more…
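
The parameters listed above can be compared with a few lines of Base Julia. This is a sketch under the standard IEEE 754 formulas, not a use of the package API: `width` and `largest_finite` are hypothetical helpers, and the `formats` table simply restates the (szE, szf) pairs given in the list.

```julia
# (name, szE, szf) triples taken from the list above.
formats = [
    ("Float16",        5, 10),
    ("Float32",        8, 23),
    ("bfloat16",       8,  7),
    ("TensorFloat-32", 8, 10),
    ("AMD fp24",       7, 16),
    ("Pixar PXR24",    8, 15),
]

# Hypothetical helpers: total width in bits, and largest finite value
# (2 - 2^-szf) * 2^bias with bias = 2^(szE - 1) - 1.
width(szE, szf) = 1 + szE + szf
largest_finite(szE, szf) = (2.0 - 2.0^(-szf)) * 2.0^(2^(szE - 1) - 1)

for (name, szE, szf) in formats
    println(rpad(name, 16), lpad(width(szE, szf), 2), " bits, largest finite value ",
            largest_finite(szE, szf))
end
```

Formats that share szE = 8 have nearly the same range as Float32 and differ only in precision, which the printed maxima make visible.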

[1] The size of the object representing a Floatmu may be much larger, however, as it currently corresponds to two 32-bit unsigned integers per Floatmu.

[Rump2016] IEEE754 Precision-$k$ Base-$\beta$ Arithmetic Inherited by Precision-$m$ Base-$\beta$ Arithmetic for $k<m$. Siegfried M. Rump, ACM Transactions on Mathematical Software, Vol. 43, No. 3, December 2016.