Minifloat Format Converter


16bit formats

bfloat16 (1.8.7)

Bit layout: sign (bit 15) | exponent (8 bit, bits 14-7) | mantissa (7 bit, bits 6-0)

A truncated IEEE 754 single-precision floating-point format: it keeps the sign bit and 8-bit exponent of an f32 but has fewer mantissa bits (7 instead of 23).

Converts to an f32 by shifting the 16 bits into the upper half of a 32bit word.

float f = asfloat((uint32_t)bfloat16_bits << 16);
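
As a rough illustration of the shift, here is a minimal Python sketch using only the standard `struct` module. The function names are made up for this example, and the narrowing direction simply truncates, which is an assumption rather than something this page specifies.

```python
import struct

def bfloat16_bits_to_float(bits: int) -> float:
    """Widen a bfloat16 bit pattern by moving it into the top 16 bits of an f32."""
    word = (bits & 0xFFFF) << 16
    return struct.unpack("<f", struct.pack("<I", word))[0]

def float_to_bfloat16_bits(value: float) -> int:
    """Narrow to bfloat16 by dropping the low 16 bits of the f32 pattern.

    Assumption: plain truncation; real converters usually round instead.
    """
    word = struct.unpack("<I", struct.pack("<f", value))[0]
    return word >> 16

print(hex(float_to_bfloat16_bits(1.0)))   # 0x3f80
print(bfloat16_bits_to_float(0x4049))     # 3.140625
```

Widening is exact because every bfloat16 value is also an f32 value; only the narrowing direction loses information.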

f16 (half, 1.5.10)

Bit layout: sign (bit 15) | exponent (5 bit, bits 14-10) | mantissa (10 bit, bits 9-0)

Probably the third most widely used floating-point format, after f32 and f64.

Commonly used for storing pre-tonemapped and HDR textures.

On GPUs you would expect to see dedicated f32tof16 / f16tof32 instructions.
Additionally, many GPUs can perform arithmetic directly on f16 values (usually packing two in a 32bit register).
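
To experiment with these conversions off the GPU, Python's `struct` module has a half-precision format code (`"e"`). The sketch below borrows the instruction names purely as illustrative Python function names; `pack_half2` and `unpack_half2` are hypothetical helpers showing two f16 values sharing one 32bit word, with the first value assumed to sit in the low half.

```python
import struct

def f32tof16(value: float) -> int:
    """Return the f16 bit pattern nearest to value.

    struct raises OverflowError if value is too large for f16.
    """
    return struct.unpack("<H", struct.pack("<e", value))[0]

def f16tof32(bits: int) -> float:
    """Return the value encoded by a 16-bit f16 pattern."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]

def pack_half2(lo: float, hi: float) -> int:
    """Pack two f16 values into one 32-bit word (lo in bits 15-0)."""
    return f32tof16(lo) | (f32tof16(hi) << 16)

def unpack_half2(word: int):
    """Split a 32-bit word back into its two f16 values."""
    return f16tof32(word & 0xFFFF), f16tof32((word >> 16) & 0xFFFF)

print(hex(f32tof16(1.0)))          # 0x3c00
print(f16tof32(0x7BFF))            # 65504.0, the largest finite f16
print(hex(pack_half2(1.0, -2.0)))  # 0xc0003c00
```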

8bit formats

fp8 (1.5.2)

Bit layout: sign (bit 7) | exponent (5 bit, bits 6-2) | mantissa (2 bit, bits 1-0)

A truncated IEEE 754 half-precision binary floating-point format: it keeps the sign bit and 5-bit exponent of an f16 but has fewer mantissa bits (2 instead of 10).

Converts to an f16 by shifting the 8 bits into the upper half of a 16bit word.

half h = ashalf((uint16_t)fp8_bits << 8);
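
The same shift can be tried in Python, again leaning on the `struct` half-precision format code. This is only a sketch; `fp8_bits_to_float` is a name invented for the example.

```python
import struct

def fp8_bits_to_float(bits: int) -> float:
    """Decode a 1.5.2 fp8 pattern by widening it to f16 with a left shift."""
    half_bits = (bits & 0xFF) << 8
    return struct.unpack("<e", struct.pack("<H", half_bits))[0]

print(fp8_bits_to_float(0x3C))  # 1.0
print(fp8_bits_to_float(0x7B))  # 57344.0, the largest finite value
print(fp8_bits_to_float(0x7C))  # inf
```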

e5m2 (OCP-E5M2, 1.5.2)

The same bit layout as fp8, but conversions into it follow the OCP OFP8 overflow rules, which offer two modes:

Saturate: Clamp to the max magnitude.
NonSaturate: Greater than max magnitude becomes infinity.
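
The two modes only differ in what an encoder does once a rounded value no longer fits. The following Python sketch shows one way that could look. It assumes round-to-nearest-even, and it assumes that an infinite input also clamps in the saturating mode; `encode_e5m2` is a hypothetical name, not an API from any library.

```python
import math

E5M2_MAX = 57344.0  # largest finite magnitude, bit pattern 0x7B

def encode_e5m2(value: float, saturate: bool = True) -> int:
    """Encode a Python float as an e5m2 bit pattern (sketch, see assumptions)."""
    if math.isnan(value):
        return 0x7F
    sign = 0x80 if math.copysign(1.0, value) < 0 else 0x00
    mag = abs(value)
    overflow = math.isinf(mag)
    if not overflow:
        # Work in units of the mantissa step at this exponent (2 mantissa bits).
        exp = max(math.frexp(mag)[1] - 1, -14)  # clamp into the subnormal range
        steps = round(mag / 2.0 ** (exp - 2))   # ties go to even, result is 0..8
        overflow = steps * 2.0 ** (exp - 2) > E5M2_MAX
    if overflow:
        # Saturate: max magnitude. NonSaturate: infinity.
        return sign | (0x7B if saturate else 0x7C)
    if steps == 8:  # rounding carried into the next exponent
        exp, steps = exp + 1, 4
    if steps < 4:   # zero or subnormal: exponent field stays 0
        return sign | steps
    return sign | ((exp + 15) << 2) | (steps - 4)

print(hex(encode_e5m2(1e6, saturate=True)))   # 0x7b
print(hex(encode_e5m2(1e6, saturate=False)))  # 0x7c
```

Values only slightly above the maximum, such as 60000, still round down to 0x7B in both modes; the modes diverge only when the rounded result exceeds the maximum.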

e4m3 (OCP-E4M3, 1.4.3)

Bit layout: sign (bit 7) | exponent (4 bit, bits 6-3) | mantissa (3 bit, bits 2-0)

Has no infinity encoding; the only NaN patterns are 0x7f and 0xff.

Like e5m2, it has two overflow modes:

Saturate: Clamp to the max magnitude.
NonSaturate: Greater than max magnitude becomes NaN.
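
Because e4m3 is not a truncated f16, it cannot be decoded with a plain shift. The Python sketch below decodes and encodes it by hand. It assumes an exponent bias of 7 and round-to-nearest-even, and it assumes an infinite input clamps in the saturating mode; both function names are invented for this example.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude, bit pattern 0x7E

def decode_e4m3(bits: int) -> float:
    """Decode an e4m3 bit pattern (assumed bias 7, no infinities)."""
    sign = -1.0 if bits & 0x80 else 1.0
    exp = (bits >> 3) & 0xF
    man = bits & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan                  # 0x7F / 0xFF are the only NaNs
    if exp == 0:
        return sign * man * 2.0 ** -9    # zero and subnormals
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

def encode_e4m3(value: float, saturate: bool = True) -> int:
    """Encode a Python float as an e4m3 bit pattern (sketch, see assumptions)."""
    if math.isnan(value):
        return 0x7F
    sign = 0x80 if math.copysign(1.0, value) < 0 else 0x00
    mag = abs(value)
    overflow = math.isinf(mag)
    if not overflow:
        exp = max(math.frexp(mag)[1] - 1, -6)   # clamp into the subnormal range
        steps = round(mag / 2.0 ** (exp - 3))   # ties go to even, result is 0..16
        overflow = steps * 2.0 ** (exp - 3) > E4M3_MAX
    if overflow:
        # Saturate: max magnitude. NonSaturate: NaN.
        return sign | (0x7E if saturate else 0x7F)
    if steps == 16:  # rounding carried into the next exponent
        exp, steps = exp + 1, 8
    if steps < 8:    # zero or subnormal: exponent field stays 0
        return sign | steps
    return sign | ((exp + 7) << 3) | (steps - 8)

print(decode_e4m3(0x7E))                         # 448.0
print(hex(encode_e4m3(1000.0, saturate=True)))   # 0x7e
print(hex(encode_e4m3(1000.0, saturate=False)))  # 0x7f
```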

4bit formats

e2m1

	s000	s001	s010	s011	s100	s101	s110	s111
s=0	0	0.5	1	1.5	2	3	4	6
s=1	−0	−0.5	−1	−1.5	−2	−3	−4	−6
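
The table can also be generated from the bit fields. The sketch below assumes the layout implied by the name and the values above: 1 sign bit, 2 exponent bits with a bias of 1, and 1 mantissa bit. `decode_e2m1` is a name chosen for this example.

```python
def decode_e2m1(bits: int) -> float:
    """Decode a 4-bit e2m1 pattern (assumed layout: sign, 2-bit exponent, 1-bit mantissa)."""
    sign = -1.0 if bits & 0x8 else 1.0
    exp = (bits >> 1) & 0x3
    man = bits & 0x1
    if exp == 0:
        return sign * man * 0.5               # zero and the single subnormal
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)

print([decode_e2m1(b) for b in range(8)])
# [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```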

binary4p2sf

The same as e2m1, but −0 is replaced with NaN.

	s000	s001	s010	s011	s100	s101	s110	s111
s=0	0	0.5	1	1.5	2	3	4	6
s=1	NaN	−0.5	−1	−1.5	−2	−3	−4	−6

binary4p2se

The same as binary4p2sf, but 6 is replaced with Inf.

	s000	s001	s010	s011	s100	s101	s110	s111
s=0	0	0.5	1	1.5	2	3	4	∞
s=1	NaN	−0.5	−1	−1.5	−2	−3	−4	−∞
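
Since the three 4-bit variants differ only in a couple of code points, a lookup table with two special cases is enough to decode all of them. This is a sketch built directly from the tables above; `decode_4bit` and its `variant` parameter are hypothetical.

```python
import math

# Magnitudes shared by all three variants, indexed by the low 3 bits.
MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
VARIANTS = ("e2m1", "binary4p2sf", "binary4p2se")

def decode_4bit(bits: int, variant: str = "e2m1") -> float:
    """Decode a 4-bit pattern for one of the three variants listed above."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant}")
    code = bits & 0xF
    if variant != "e2m1" and code == 0x8:
        return math.nan                       # the -0 slot becomes NaN
    magnitude = MAGNITUDES[code & 0x7]
    if variant == "binary4p2se" and (code & 0x7) == 0x7:
        magnitude = math.inf                  # 6 becomes infinity
    return -magnitude if code & 0x8 else magnitude

print(decode_4bit(0x7, "binary4p2sf"))  # 6.0
print(decode_4bit(0xF, "binary4p2se"))  # -inf
```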

Other

R11G11B10 (0.5.6 / 0.5.5)

Bit layout, 11-bit channels (R, G): exponent (5 bit, bits 10-6) | mantissa (6 bit, bits 5-0)

Bit layout, 10-bit channel (B): exponent (5 bit, bits 9-5) | mantissa (5 bit, bits 4-0)

Often used in place of RGBA16f for render targets.

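Each channel is an unsigned, truncated IEEE 754 half-precision binary floating-point value: it has no sign bit and keeps the 5-bit exponent of an f16 but has fewer mantissa bits (6 or 5 instead of 10).

Each channel converts to an f16 by shifting it left until its exponent lines up with the f16 exponent field.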

half h10 = ashalf((uint16_t)u10_bits << 5);
half h11 = ashalf((uint16_t)u11_bits << 4);
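
Putting the two shifts together, a whole packed texel could be unpacked as in the Python sketch below. The channel order is an assumption (red in bits 10-0, green in bits 21-11, blue in bits 31-22), so check it against the API in use; the function names are invented for this example.

```python
import struct

def _half_bits_to_float(bits: int) -> float:
    """Reinterpret a 16-bit pattern as an f16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def unpack_r11g11b10(word: int):
    """Unpack a 32-bit R11G11B10 texel into three floats.

    Assumed channel order: R in bits 10-0, G in bits 21-11, B in bits 31-22.
    """
    r = _half_bits_to_float((word & 0x7FF) << 4)          # 11 bits: shift by 4
    g = _half_bits_to_float(((word >> 11) & 0x7FF) << 4)  # 11 bits: shift by 4
    b = _half_bits_to_float(((word >> 22) & 0x3FF) << 5)  # 10 bits: shift by 5
    return r, g, b

# 0.5, 2.0 and 4.0 as 11/11/10-bit patterns.
texel = 0x380 | (0x400 << 11) | (0x220 << 22)
print(unpack_r11g11b10(texel))  # (0.5, 2.0, 4.0)
```

The shifted patterns never reach bit 15, so the f16 sign bit is always zero and every decoded channel is non-negative.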