A guided tour

The MicroFloatingPoints package is organized into four modules:

  • MicroFloatingPoints: the main module containing the definition of the parameterized type Floatmu and the associated methods;
  • MicroFloatingPoints.MFPUtils: a module providing miscellaneous utility functions for the Floatmu type;
  • MicroFloatingPoints.MFPPlot: a module offering various graphical ways to display Floatmu floating-point numbers;
  • MicroFloatingPoints.MFPRandom: the module overloading Random.rand to produce Floatmu random values.

After installing the package (see Installation), we start our tour by loading the MicroFloatingPoints module:

julia> using MicroFloatingPoints

We can now define a new floating-point type MuFP with 2 bits for the exponent (the first parameter) and 2 bits for the fractional part (the second parameter):

julia> MuFP = Floatmu{2,2}
Floatmu{2, 2}

Such a type is very limited, and a call to floatmax will give us the largest finite float representable:

julia> floatmax(MuFP)
3.5

Conversely, we can obtain the smallest positive float in the MuFP format with the μ method:

julia> μ(MuFP)
0.25

Note that this value is a subnormal number. It is smaller than the smallest positive normal float, which is obtained by calling floatmin:

julia> floatmin(MuFP)
1.0
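
The three values above follow from the usual IEEE 754-style layout. As a rough sketch in plain Julia, assuming an exponent bias of 2^(E-1) - 1 and the largest exponent code reserved for infinities and NaNs, they can be recomputed from the two type parameters. The helper `format_limits` is hypothetical and is not part of MicroFloatingPoints:

```julia
# Hypothetical helper (not part of MicroFloatingPoints): limits of an
# IEEE 754-like format with E exponent bits and f fraction bits.
function format_limits(E::Integer, f::Integer)
    bias = 2^(E - 1) - 1          # assumed exponent bias
    emin = 1 - bias               # exponent of the smallest normal float
    emax = bias                   # exponent of the largest finite float
    return (largest            = 2.0^emax * (2 - 2.0^(-f)),
            smallest_normal    = 2.0^emin,
            smallest_subnormal = 2.0^(emin - f))
end

println(format_limits(2, 2))
```

For E = 2 and f = 2 this gives 3.5, 1.0 and 0.25, matching floatmax(MuFP), floatmin(MuFP) and μ(MuFP) respectively.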

Graphics with MicroFloatingPoints.MFPPlot

To better assess what we can do with such a small type, let us display all finite representable values on the real line. The MFPPlot module has just the right method:

julia> using MicroFloatingPoints.MFPPlot
julia> real_line(-floatmax(MuFP),floatmax(MuFP));

[Figure: Floatmu{2,2} representable finite values]

Since any two distinct finite MuFP values differ by at least μ(MuFP), which is itself representable, it becomes apparent why the introduction of subnormal numbers (in purple in the picture above) ensures the property:

\[\forall (a,b)\in\text{MuFP}^2\colon |b-a| = 0 \iff a=b\]

Exhaustive search for rounded additions

The type MuFP is so small that we can easily perform exhaustive searches with it. For example, we can display graphically whether the sum of any two finite MuFP floats needs to be rounded or not, using the inexact() and reset_inexact() methods to, respectively, test whether the preceding computation needed rounding and to reset the global inexact flag:

using PyPlot  # provides plt and imshow

plt.figure()
plt.title("Exhaustive search for rounded sums in Floatmu{2,2}")
TotalIterator = FloatmuIterator(-floatmax(MuFP),floatmax(MuFP))
N = length(TotalIterator)
Z = zeros(Int,N,N)
let i = 1
    for v1 in TotalIterator
        j = 1
        for v2 in TotalIterator
            reset_inexact()
            v1+v2
            Z[i,j] = Int(inexact())
            j += 1
        end
        i += 1
    end
end
V = collect(TotalIterator)
imshow(Z,origin="lower", cmap="summer")
plt.yticks(0:(length(V)-1),[string(V[i]) for i in 1:length(V)])
plt.xticks(0:(length(V)-1),[string(V[i]) for i in 1:length(V)],rotation=90);

Note the use of a FloatmuIterator to enumerate all floating-point numbers in a range.

We obtain the following matrix, where a green cell means that the sum of the values in row and column needs no rounding, while a yellow cell means that the result needs rounding to be represented by a Floatmu{2,2}.

[Figure: Exhaustive search for sums of Floatmu{2,2} needing rounding]
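
To see why a given cell is yellow, the same search can be imitated in plain Julia without the inexact flag. The sketch below is only an illustration: it lists the finite Floatmu{2,2} values by hand and calls a sum inexact when the exact result is not in that list. It assumes that a sum overflowing past 3.5 counts as needing rounding, which may or may not match how the package sets its flag. The names `values22` and `needs_rounding` are hypothetical.

```julia
# Finite values of a (2,2) format, written out by hand for illustration:
# subnormals and the first binade step by 0.25, the second binade by 0.5.
positives = vcat(0.25:0.25:1.75, 2.0:0.5:3.5)
values22 = sort(vcat(-positives, 0.0, positives))

# A sum is exact iff it lands on a representable finite value
# (assumption: overflow past 3.5 is counted as needing rounding).
# The Float64 addition is exact for such small operands.
needs_rounding(a, b) = !((a + b) in values22)

inexact_pairs = count(needs_rounding(a, b) for a in values22, b in values22)
println(inexact_pairs, " of ", length(values22)^2, " sums need rounding")
```

For instance, 1.25 + 2.0 = 3.25 falls between the neighbours 3.0 and 3.5 and must be rounded, whereas 1.0 + 1.0 = 2.0 is representable.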

Random floats with MicroFloatingPoints.MFPRandom

Let us now draw some BFloat16 floats uniformly at random in $[0,1)$. We will use the MicroFloatingPoints.MFPRandom module to overload the rand method for the type Floatmu.

using DataStructures
using PyPlot
using MicroFloatingPoints
using MicroFloatingPoints.MFPRandom

BFloat16 = Floatmu{8,7}

ndraws=1000000
plt.figure()
plt.title("Drawing $ndraws values at random in BFloat16[0,1)")
T = [rand(BFloat16) for i in 1:ndraws]
Tc = counter(T)

We can now display the number of times each float was drawn:

for (k,v) in Tc
    plot([k,k],[0,v],marker=".",color="blue",alpha=0.5)
end
(low,high) = extrema(collect(values(Tc)))
plt.ylim(ymin=0.99*low,ymax=1.01*high)

[Figure: Drawing values at random in BFloat16]

Arithmetic with various precisions

The BFloat16 and Float16 formats both represent floating-point numbers with 16 bits. BFloat16 trades precision for a larger range: it uses 8 exponent bits and 7 fraction bits, where Float16 uses 5 and 10. Let us compare the results obtained when summing the values of a vector with both types:

BFloat16 = Floatmu{8,7}
MuFloat16 = Floatmu{5,10}
T64 = [rand() for i in 1:1000]
bfT16 = [BFloat16(x) for x in T64]
FT16 = [MuFloat16(x) for x in T64]
println(sum(T64))
println(sum(bfT16))
println(sum(FT16))

502.4140523517177
256.0
503.0
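
The BFloat16 result of 256.0 is what a left-to-right accumulation would produce: with 8 significant bits, consecutive floats from 256 upwards are 2 apart, so adding a value smaller than 1 rounds back to the same partial sum. The following sketch imitates this in plain Float64 with Base.round. It is an emulation under stated assumptions (sequential summation, exponent range ignored, possible double rounding), not how MicroFloatingPoints computes; `round_sig` and `lowprec_sum` are hypothetical names.

```julia
# Round x to p significant bits (round to nearest, ties to even).
# Emulation with Base.round; exponent range limits are ignored.
round_sig(x, p) = round(x, sigdigits = p, base = 2)

# Left-to-right sum where every partial sum is rounded to p bits.
function lowprec_sum(xs, p)
    s = 0.0
    for x in xs
        s = round_sig(s + round_sig(x, p), p)
    end
    return s
end

xs = fill(0.75, 1000)
println(lowprec_sum(xs, 8))    # 8 significant bits, as in Floatmu{8,7}
println(lowprec_sum(xs, 11))   # 11 significant bits, as in Floatmu{5,10}
```

With 8 significant bits the running sum gets stuck at 256.0, while with 11 bits it keeps growing, because the spacing between floats only reaches 1 at 1024.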

For small values in $[0,1)$, the effect of the shorter BFloat16 significand is drastic: the sum stalls at 256.0. On the other hand, the small range of the Float16 format makes it unusable for computations with medium to large numbers:

using Distributions  # provides Uniform

T64 = [rand(Uniform(min(floatmin(BFloat16),floatmin(MuFloat16)),
            max(floatmax(BFloat16),floatmax(MuFloat16))/100)) for i in 1:100]
bfT16 = [BFloat16(x) for x in T64]
FT16 = [MuFloat16(x) for x in T64]
println(sum(T64))
println(sum(bfT16))
println(sum(FT16))

1.791178754416783e38
1.767873234393938e38
Inf