Converter fields: Sign, Exponent, Mantissa, Decimal Input, Value Stored, Bin Representation, Hex Representation
NB: e5m2 / e4m3 saturation modes only apply when providing a decimal value.
Lovingly based on https://www.h-schmidt.net/FloatConverter and flop.evanau.dev

Source code
Minifloat (C++ with a C interface): minifloat
Wasm Binding (C): bootstrap.c
Wasm Binding (js): minifloat_interface.js


Minifloat Formats

16bit formats

bfloat16 (1.8.7)

Bit layout:
sign: bit 15
exponent (8 bit): bits 14 to 7
mantissa (7 bit): bits 6 to 0

Shortened IEEE 754 single-precision floating-point format. (Has the same exponent as an f32, but fewer mantissa bits).

Can be converted into an f32 by shifting the bits left by 16.

float f = asfloat((uint32_t)bfloat16_bits << 16);
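Outside a shading language, the same widening can be sketched in plain C. This is a minimal illustration rather than the minifloat library's own code: the function name bf16_to_f32 is hypothetical, and memcpy stands in for the bit reinterpretation.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: widen bfloat16 bits to an f32.
   The 16 bfloat16 bits become the top 16 bits of the f32;
   the low 16 mantissa bits are filled with zeros. */
static float bf16_to_f32(uint16_t bfloat16_bits) {
    uint32_t f32_bits = (uint32_t)bfloat16_bits << 16;
    float f;
    memcpy(&f, &f32_bits, sizeof f); /* reinterpret the bits, no numeric conversion */
    return f;
}
```

For example, 0x3F80 widens to 0x3F800000, which is 1.0f.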

f16 (half, 1.5.10)

Bit layout:
sign: bit 15
exponent (5 bit): bits 14 to 10
mantissa (10 bit): bits 9 to 0

Probably the third most widely used format globally, after f32 and f64.

Commonly used for storing pre-tonemapped and HDR textures.

On GPUs you would expect to see dedicated f32tof16 / f16tof32 instructions.
Additionally, many GPUs can perform arithmetic directly on f16 values (usually packing two in a 32-bit register).
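Where no such instruction is available, the f16 to f32 direction can be sketched in software. The following C function is an illustrative assumption about how such a conversion might be written, not a description of any particular GPU or of the minifloat library; the name f16_to_f32 is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical software f16 -> f32 widening. */
static float f16_to_f32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;

    if (exp == 0x1Fu) {
        /* Inf or NaN: all-ones exponent, keep the mantissa payload. */
        bits = sign | 0x7F800000u | (man << 13);
    } else if (exp != 0) {
        /* Normal: rebias the exponent from 15 to 127 (add 112). */
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    } else if (man == 0) {
        /* Signed zero. */
        bits = sign;
    } else {
        /* Subnormal: shift until the implicit leading 1 appears. */
        uint32_t e = 113u;
        while ((man & 0x400u) == 0) {
            man <<= 1;
            e--;
        }
        bits = sign | (e << 23) | ((man & 0x3FFu) << 13);
    }

    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Every f16 value is exactly representable as an f32, so this direction needs no rounding; the narrowing direction does, and is not shown here.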

8bit formats

fp8 (1.5.2)

Bit layout:
sign: bit 7
exponent (5 bit): bits 6 to 2
mantissa (2 bit): bits 1 to 0

Shortened IEEE 754 half-precision binary floating-point format. (Has the same exponent as an f16, but fewer mantissa bits).

Can be converted into an f16 by shifting the bits left by 8.

half h = ashalf((uint16_t)fp8_bits << 8);
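C has no portable half type, so a C sketch of the same step can only produce the f16 bit pattern. The function name fp8_to_f16_bits below is hypothetical and the snippet is illustrative only.

```c
#include <stdint.h>

/* Hypothetical helper: widen fp8 (1.5.2) bits to an f16 bit pattern.
   Sign and exponent line up with f16 after the shift; the 8 extra
   mantissa bits are zero. */
static uint16_t fp8_to_f16_bits(uint8_t fp8_bits) {
    return (uint16_t)((uint16_t)fp8_bits << 8);
}
```

For example, fp8 0x3C becomes f16 0x3C00 (1.0), and fp8 0x7C becomes f16 0x7C00 (+Inf).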

e5m2 (OCP-E5M2, 1.5.2)

The same as fp8, but with OFP8 saturation applied.

e4m3 (OCP-E4M3, 1.4.3)

Bit layout:
sign: bit 7
exponent (4 bit): bits 6 to 3
mantissa (3 bit): bits 2 to 0

Has no dedicated infinity, with NaN being 0xff / 0x7f.
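A decoder makes the consequence concrete: because only the all-ones pattern is NaN, the other codes with an all-ones exponent are ordinary finite values. The C sketch below assumes an exponent bias of 7, as in the OCP OFP8 definition of E4M3; the function name e4m3_to_f32 is hypothetical and this is not the minifloat library's implementation.

```c
#include <math.h>   /* NAN macro only */
#include <stdint.h>

/* Hypothetical e4m3 decoder, assuming exponent bias 7. */
static float e4m3_to_f32(uint8_t bits) {
    int sign = bits >> 7;
    int exp  = (bits >> 3) & 0xF;
    int man  = bits & 0x7;

    /* 0x7f and 0xff are the only NaN encodings; there is no infinity. */
    if (exp == 0xF && man == 0x7) {
        return NAN;
    }

    float mag;
    int shift;
    if (exp == 0) {
        mag = (float)man / 8.0f;        /* subnormal: no implicit leading 1 */
        shift = -6;
    } else {
        mag = 1.0f + (float)man / 8.0f; /* normal: implicit leading 1 */
        shift = exp - 7;
    }

    /* Scale by 2^shift using exact multiplications. */
    for (; shift > 0; --shift) mag *= 2.0f;
    for (; shift < 0; ++shift) mag *= 0.5f;

    return sign ? -mag : mag;
}
```

Under these assumptions the largest finite value is 0x7e, which decodes to 448.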

Like e5m2, e4m3 has saturating modes.

4bit formats

e2m1

       s000   s001   s010   s011   s100   s101   s110   s111
s=0    0      0.5    1      1.5    2      3      4      6
s=1    −0     −0.5   −1     −1.5   −2     −3     −4     −6
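The table is consistent with a 1.2.1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit) and an exponent bias of 1. That layout is inferred from the values above rather than stated in the source, and the function name e2m1_to_f32 is hypothetical. A minimal C sketch that reproduces the table:

```c
#include <stdint.h>

/* Hypothetical e2m1 decoder for the low nibble of `bits`,
   assuming 1 sign, 2 exponent, 1 mantissa bit and bias 1. */
static float e2m1_to_f32(uint8_t bits) {
    int sign = (bits >> 3) & 1;
    int exp  = (bits >> 1) & 3;
    int man  = bits & 1;
    float mag;

    if (exp == 0) {
        mag = 0.5f * (float)man;        /* subnormal: 0 or 0.5 */
    } else {
        mag = 1.0f + 0.5f * (float)man; /* 1.0 or 1.5 */
        for (int i = 1; i < exp; ++i) {
            mag *= 2.0f;                /* scale by 2^(exp - 1) */
        }
    }
    return sign ? -mag : mag;
}
```

With only 16 codes, a lookup table indexed by the nibble would work equally well.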

binary4p2sf

The same as e2m1, but -0 is replaced with NaN.

       s000   s001   s010   s011   s100   s101   s110   s111
s=0    0      0.5    1      1.5    2      3      4      6
s=1    NaN    −0.5   −1     −1.5   −2     −3     −4     −6

binary4p2se

The same as binary4p2sf, but ±6 is replaced with ±Inf.

       s000   s001   s010   s011   s100   s101   s110   s111
s=0    0      0.5    1      1.5    2      3      4      ∞
s=1    NaN    −0.5   −1     −1.5   −2     −3     −4     −∞

Other

R11G11B10 (0.5.6 / 0.5.5)

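Bit layout of the 11-bit channels (0.5.6, no sign bit):
exponent (5 bit): bits 10 to 6
mantissa (6 bit): bits 5 to 0

Bit layout of the 10-bit channel (0.5.5, no sign bit):
exponent (5 bit): bits 9 to 5
mantissa (5 bit): bits 4 to 0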

Often used in place of RGBA16f for render targets.

Shortened unsigned IEEE 754 half-precision binary floating-point format. (Has the same exponent as an f16, but fewer mantissa bits).

Can be converted into an f16 by shifting the bits left: by 5 for the 10-bit channel and by 4 for the 11-bit channels.

half h10 = ashalf((uint16_t)u10_bits << 5);
half h11 = ashalf((uint16_t)u11_bits << 4);
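Unpacking a whole texel follows the same pattern. The C sketch below assumes the packing used by formats such as DXGI_FORMAT_R11G11B10_FLOAT, with R in bits 0 to 10, G in bits 11 to 21 and B in bits 22 to 31. The struct and function names are hypothetical, and the outputs are f16 bit patterns rather than a native half type.

```c
#include <stdint.h>

/* Hypothetical container for three f16 bit patterns. */
typedef struct {
    uint16_t r, g, b;
} rgb_f16_bits;

/* Hypothetical unpack, assuming R in bits 0-10, G in bits 11-21,
   B in bits 22-31 of the packed 32-bit texel. */
static rgb_f16_bits unpack_r11g11b10(uint32_t packed) {
    rgb_f16_bits out;
    out.r = (uint16_t)((packed & 0x7FFu) << 4);         /* 11 bit: shift by 4 */
    out.g = (uint16_t)(((packed >> 11) & 0x7FFu) << 4); /* 11 bit: shift by 4 */
    out.b = (uint16_t)(((packed >> 22) & 0x3FFu) << 5); /* 10 bit: shift by 5 */
    return out;
}
```

For example, 0x781E03C0 holds 1.0 in all three channels and unpacks to the f16 pattern 0x3C00 three times. The sign bit of each result is always zero, since the packed channels are unsigned.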