Minifloat Format Converter

Introduction

Minifloats are a bunch of different floating point formats, like your standard IEEE-754 float32, but smaller!

Each of them makes its own trade-off between bandwidth, compute cost and precision.
No one format can claim to dominate the others in every situation.

They can roughly be categorised into two use cases: rendering (VFX and games) and machine learning. However, with things like D3D12 LinAlg, it probably won't be long until games start using neural networks (with these formats) alongside traditional rendering techniques.

The (s.e.m) notation used on this page, e.g. (1.8.7), means (sign.exponent.mantissa) in terms of bits.
So (1.8.7) means (1 sign bit, 8 exponent bits, 7 mantissa bits); it acts as shorthand for the shape of a floating point format.
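To make the notation concrete, here is a minimal sketch of a decoder driven purely by the (s.e.m) shape. The function name `minifloat_decode` is made up for this example, and it assumes the plain IEEE-style rules (bias of 2^(e-1) - 1, all-ones exponent reserved for Inf/NaN), which formats such as e4m3 and the FNUZ variants do not follow.

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical helper: decode an IEEE-style minifloat from its (s.e.m) shape.
   Assumes bias = 2^(e-1) - 1 and that an all-ones exponent means Inf/NaN. */
static double minifloat_decode(uint32_t bits, int s, int e, int m) {
    uint32_t man  = bits & ((1u << m) - 1u);
    uint32_t exp  = (bits >> m) & ((1u << e) - 1u);
    uint32_t sign = s ? (bits >> (e + m)) & 1u : 0u;
    int bias = (1 << (e - 1)) - 1;
    double value;
    if (exp == (1u << e) - 1u) {
        value = man ? NAN : INFINITY;               /* special values */
    } else if (exp == 0) {
        value = ldexp((double)man, 1 - bias - m);   /* denormal: no implicit leading 1 */
    } else {
        value = ldexp((double)(man | (1u << m)), (int)exp - bias - m);
    }
    return sign ? -value : value;
}
```

For example, `minifloat_decode(0x3f81, 1, 8, 7)` treats the bits as bfloat16, while `minifloat_decode(0x7bff, 1, 5, 10)` treats them as f16.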

16bit formats

bfloat16 (1.8.7)

Bit layout (MSB first): sign (1 bit, bit 15), exponent (8 bits, bits 14 to 7), mantissa (7 bits, bits 6 to 0).

	Hex	Value
Smallest value (Denormal)	0x0001	9.183549615799121e-41
Largest value (Denormal)	0x007f	1.1663108012064884e-38
Smallest value (Normal)	0x0080	1.1754943508222875e-38
Largest value (Normal)	0x7f7f	3.3895313892515355e+38
Smallest value > 1	0x3f81	1.0078125
Largest value < 1	0x3f7f	0.99609375
Closest value to π	0x4049	3.140625 (Δ ≈ 9.677x10⁻⁴)
Largest sequential integer	0x4380	256

Shortened IEEE 754 single-precision floating-point format. (Has the same exponent as an f32, but fewer mantissa bits).

A de facto standard for LLM training, where it can be used for gradients, activations and weights.

Can be converted into an f32 by simply shifting.

float f = asfloat((uint32_t)bfloat16_bits << 16);
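In plain C there is no `asfloat`, so the same shift can be sketched with `memcpy` doing the bit reinterpretation. The function names are invented for this example, and the reverse direction shown here simply truncates; real converters typically round to nearest even instead.

```c
#include <stdint.h>
#include <string.h>

/* bfloat16 bits -> f32: the 16 bits become the top half of the f32. */
static float bf16_to_f32(uint16_t b) {
    uint32_t u = (uint32_t)b << 16;
    float f;
    memcpy(&f, &u, sizeof f);   /* reinterpret the bits, like asfloat */
    return f;
}

/* f32 -> bfloat16 bits by dropping the low 16 mantissa bits.
   Assumption: truncation (round toward zero), chosen only for simplicity. */
static uint16_t f32_to_bf16_truncate(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return (uint16_t)(u >> 16);
}
```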

f16 (half, 1.5.10)

Bit layout (MSB first): sign (1 bit, bit 15), exponent (5 bits, bits 14 to 10), mantissa (10 bits, bits 9 to 0).

	Hex	Value
Smallest value (Denormal)	0x0001	5.960464477539063e-8
Largest value (Denormal)	0x03ff	0.00006097555160522461
Smallest value (Normal)	0x0400	0.00006103515625
Largest value (Normal)	0x7bff	65504
Smallest value > 1	0x3c01	1.0009765625
Largest value < 1	0x3bff	0.99951171875
Closest value to π	0x4248	3.140625 (Δ ≈ 9.677x10⁻⁴)
Largest sequential integer	0x6800	2048

Probably the third most widely used format globally, after f32 and f64.

In games, it's commonly used for storing the pre-tonemapped render target alongside HDR textures.

In VFX, you'd output render passes for compositing in this format.

On GPUs, you would expect to see dedicated f32tof16 / f16tof32 instructions.
Additionally, many GPUs can perform arithmetic directly on f16 values (usually packing two into a 32-bit register).
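Where no such instruction is available, the f16tof32 direction can be done in software. The following is a sketch rather than a reference implementation; `f16_bits_to_f32` is a name chosen for this example, and it rebuilds the f32 bit pattern by rebiasing the exponent from 15 to 127.

```c
#include <stdint.h>
#include <string.h>

/* Software sketch of an f16 -> f32 conversion working on raw bits. */
static float f16_bits_to_f32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1fu;
    uint32_t man  = h & 0x3ffu;
    uint32_t bits;
    if (exp == 0x1fu) {                 /* Inf / NaN: keep the payload */
        bits = sign | 0x7f800000u | (man << 13);
    } else if (exp != 0) {              /* normal: rebias 15 -> 127 */
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    } else if (man == 0) {              /* signed zero */
        bits = sign;
    } else {                            /* denormal: normalise the mantissa */
        uint32_t shift = 0;
        while ((man & 0x400u) == 0) { man <<= 1; shift++; }
        man &= 0x3ffu;                  /* drop the now-explicit leading 1 */
        bits = sign | ((113u - shift) << 23) | (man << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```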

8bit formats

fp8 (1.5.2)

Bit layout (MSB first): sign (1 bit, bit 7), exponent (5 bits, bits 6 to 2), mantissa (2 bits, bits 1 to 0).

	Hex	Value
Smallest value (Denormal)	0x01	0.0000152587890625
Largest value (Denormal)	0x03	0.0000457763671875
Smallest value (Normal)	0x04	0.00006103515625
Largest value (Normal)	0x7b	57344
Smallest value > 1	0x3d	1.25
Largest value < 1	0x3b	0.875
Closest value to π	0x42	3 (Δ ≈ 1.416x10⁻¹)
Largest sequential integer	0x48	8

Shortened IEEE 754 half-precision binary floating-point format. (Has the same exponent as an f16, but fewer mantissa bits).

Can be converted into an f16 by simply shifting.

half h = ashalf((uint16_t)fp8_bits << 8);
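Standard C has no `half` type, so a sketch of the same idea has to stay in bit patterns: shift the fp8 byte into the top of a 16-bit word, then decode that word as an f16. Both function names below are made up for this example.

```c
#include <math.h>
#include <stdint.h>

/* fp8 (1.5.2) -> f16 bits: the byte becomes the top 8 bits of the half. */
static uint16_t fp8_to_f16_bits(uint8_t b) {
    return (uint16_t)((uint16_t)b << 8);
}

/* Minimal f16 decoder so the result can be inspected as a float. */
static float f16_bits_to_float(uint16_t h) {
    int exp = (h >> 10) & 0x1f;
    int man = h & 0x3ff;
    float v;
    if (exp == 0x1f)   v = man ? NAN : INFINITY;
    else if (exp == 0) v = ldexpf((float)man, -24);               /* denormal */
    else               v = ldexpf((float)(man | 0x400), exp - 25); /* normal */
    return (h & 0x8000) ? -v : v;
}
```

Decoding `fp8_to_f16_bits(0x7b)` this way should give the largest normal value from the table above.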

e5m2 (OCP-E5M2, 1.5.2)

Bit layout (MSB first): sign (1 bit, bit 7), exponent (5 bits, bits 6 to 2), mantissa (2 bits, bits 1 to 0).

	Hex	Value
Smallest value (Denormal)	0x01	0.0000152587890625
Largest value (Denormal)	0x03	0.0000457763671875
Smallest value (Normal)	0x04	0.00006103515625
Largest value (Normal)	0x7b	57344
Smallest value > 1	0x3d	1.25
Largest value < 1	0x3b	0.875
Closest value to π	0x42	3 (Δ ≈ 1.416x10⁻¹)
Largest sequential integer	0x48	8

Part of the OFP8 specification, targeting machine learning.

Has a larger range than e4m3, making it useful for gradients.

Stored the same as fp8, but with one of two OFP8 overflow modes applied when converting from another format:

Saturate: Values beyond the max magnitude are clamped to the max magnitude.
NonSaturate: Values beyond the max magnitude become infinity.
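The two modes can be sketched as a small pre-pass on the value being converted. This deliberately leaves out the rounding step of a real conversion and only shows the overflow decision; `e5m2_apply_overflow` and `E5M2_MAX` are names invented for this example.

```c
#include <math.h>
#include <stdbool.h>

#define E5M2_MAX 57344.0f   /* largest normal value, 0x7b */

/* Overflow handling only: rounding to the e5m2 grid is not shown.
   saturate = true  -> clamp to +/-E5M2_MAX
   saturate = false -> overflow to +/-infinity */
static float e5m2_apply_overflow(float x, bool saturate) {
    if (isnan(x) || fabsf(x) <= E5M2_MAX) {
        return x;   /* in range (or NaN): nothing to do */
    }
    return copysignf(saturate ? E5M2_MAX : INFINITY, x);
}
```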

e5m2fnuz (1.5.2)

Bit layout (MSB first): sign (1 bit, bit 7), exponent (5 bits, bits 6 to 2), mantissa (2 bits, bits 1 to 0).

	Hex	Value
Smallest value (Denormal)	0x01	0.00000762939453125
Largest value (Denormal)	0x03	0.00002288818359375
Smallest value (Normal)	0x04	0.000030517578125
Largest value (Normal)	0x7f	57344
Smallest value > 1	0x41	1.25
Largest value < 1	0x3f	0.875
Closest value to π	0x46	3 (Δ ≈ 1.416x10⁻¹)
Largest sequential integer	0x4c	8

FNUZ (Float NaN Unsigned Zero) variant of e5m2.

Supported primarily today by AMD hardware (MI300).

Uses a bias of 16 (1 more than e5m2) for the exponent.

Has no dedicated infinity, with NaN being 0x80 (the bit pattern that would otherwise be -0). Compared with e5m2, this frees the +/- infinity, -0 and 6 NaN bit patterns in favour of a single NaN, giving it 8 more unique values.

As there is no infinity to overflow to, values beyond the maximum magnitude always saturate.
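The rules above are enough to sketch a decoder. `fnuz8_decode` is a hypothetical name; it takes the exponent and mantissa widths as parameters and assumes the FNUZ bias is 2^(e-1), which gives the bias of 16 described here.

```c
#include <math.h>
#include <stdint.h>

/* Decode an 8-bit FNUZ value with e exponent bits and m mantissa bits (e + m == 7).
   Assumes bias = 2^(e-1), i.e. 16 for e5m2fnuz. */
static float fnuz8_decode(uint8_t bits, int e, int m) {
    if (bits == 0x80) {
        return NAN;                     /* the single NaN lives where -0 would be */
    }
    int man  = bits & ((1 << m) - 1);
    int exp  = (bits >> m) & ((1 << e) - 1);
    int bias = 1 << (e - 1);
    float v = (exp == 0)
        ? ldexpf((float)man, 1 - bias - m)                /* denormal */
        : ldexpf((float)(man | (1 << m)), exp - bias - m); /* normal, no Inf/NaN exponent */
    return (bits & 0x80) ? -v : v;
}
```

Note that the all-ones exponent is an ordinary exponent here, which is why 0x7f decodes to a finite value.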

e4m3 (OCP-E4M3, 1.4.3)

Bit layout (MSB first): sign (1 bit, bit 7), exponent (4 bits, bits 6 to 3), mantissa (3 bits, bits 2 to 0).

	Hex	Value
Smallest value (Denormal)	0x01	0.001953125
Largest value (Denormal)	0x07	0.013671875
Smallest value (Normal)	0x08	0.015625
Largest value (Normal)	0x7e	448
Smallest value > 1	0x39	1.125
Largest value < 1	0x37	0.9375
Closest value to π	0x45	3.25 (Δ ≈ 1.084x10⁻¹)
Largest sequential integer	0x58	16

Part of the OFP8 specification, targeting machine learning.

Has higher precision than e5m2, making it useful for activations and weights.

Has no dedicated infinity, with NaN being 0xff / 0x7f.

Like e5m2, it has two overflow modes:

Saturate: Clamp to the max magnitude.
NonSaturate: Greater than max magnitude becomes NaN.
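A rough sketch of how these rules look in code: a decoder that treats only S.1111.111 as NaN, plus a helper returning the bit pattern an out-of-range value would be given in each mode. Both function names are made up for this example, and rounding of in-range values is not shown.

```c
#include <math.h>
#include <stdint.h>

/* Decode e4m3 (bias 7). Only 0x7f and 0xff are NaN; there is no infinity. */
static float e4m3_decode(uint8_t bits) {
    int exp = (bits >> 3) & 0xf;
    int man = bits & 0x7;
    if (exp == 0xf && man == 0x7) {
        return NAN;
    }
    float v = (exp == 0)
        ? ldexpf((float)man, -9)                   /* denormal: man * 2^(1 - 7 - 3) */
        : ldexpf((float)(man | 0x8), exp - 10);    /* normal: (8 + man) * 2^(exp - 7 - 3) */
    return (bits & 0x80) ? -v : v;
}

/* Bit pattern produced for a value whose magnitude exceeds 448. */
static uint8_t e4m3_overflow_code(int negative, int saturate) {
    uint8_t sign = negative ? 0x80 : 0x00;
    return (uint8_t)(sign | (saturate ? 0x7e : 0x7f));  /* max magnitude or NaN */
}
```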

e4m3fnuz (1.4.3)

Bit layout (MSB first): sign (1 bit, bit 7), exponent (4 bits, bits 6 to 3), mantissa (3 bits, bits 2 to 0).

	Hex	Value
Smallest value (Denormal)	0x01	0.0009765625
Largest value (Denormal)	0x07	0.0068359375
Smallest value (Normal)	0x08	0.0078125
Largest value (Normal)	0x7f	240
Smallest value > 1	0x41	1.125
Largest value < 1	0x3f	0.9375
Closest value to π	0x4d	3.25 (Δ ≈ 1.084x10⁻¹)
Largest sequential integer	0x60	16

FNUZ (Float NaN Unsigned Zero) variant of e4m3.

Supported primarily today by AMD hardware (MI300).

Uses a bias of 8 (1 more than e4m3) for the exponent.

Has no dedicated infinity, with NaN being 0x80 (-0). By reducing the 2 NaN values to 1 and removing -0, it has 2 more unique values.

As there is no infinity to overflow to, values beyond the maximum magnitude always saturate.
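The "2 more unique values" figure can be checked by brute force over all 256 bit patterns. This is only a counting sketch with invented function names; it treats +0 and -0 as the same value and encodes the NaN rules for each format as described on this page.

```c
#include <stdbool.h>

static bool e4m3_is_nan(int code)     { return (code & 0x7f) == 0x7f; } /* 0x7f, 0xff */
static bool e4m3fnuz_is_nan(int code) { return code == 0x80; }

/* Distinct non-NaN values in e4m3; -0 (0x80) duplicates +0 so it is skipped. */
static int e4m3_distinct_values(void) {
    int n = 0;
    for (int code = 0; code < 256; code++) {
        if (!e4m3_is_nan(code) && code != 0x80) n++;
    }
    return n;
}

/* Distinct non-NaN values in e4m3fnuz; there is no -0 to skip. */
static int e4m3fnuz_distinct_values(void) {
    int n = 0;
    for (int code = 0; code < 256; code++) {
        if (!e4m3fnuz_is_nan(code)) n++;
    }
    return n;
}
```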

e8m0 (OCP-MX-E8M0, 0.8.0)

Bit layout: exponent (8 bits, bits 7 to 0). There are no sign or mantissa bits.

	Hex	Value
Smallest value	0x00	5.877471754111438e-39
Largest value	0xfe	1.7014118346046923e+38
Smallest value > 1	0x80	2
Largest value < 1	0x7e	0.5
Closest value to π	0x81	4 (Δ ≈ 8.584x10⁻¹)

Part of the OCP-MX specification, targeting machine learning and designed to be used as part of a block with a shared scaling factor.

Only represents positive values that are powers of 2, scaling from 2⁻¹²⁷ (0x00) to 2¹²⁷ (0xfe), with NaN being specifically at 0xff.

Has no dedicated Infinity or 0.
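Since the format is nothing but a biased exponent, decoding it is a one-liner. A minimal sketch, with `e8m0_decode` being a name chosen for this example:

```c
#include <math.h>
#include <stdint.h>

/* e8m0: value = 2^(code - 127), with 0xff reserved for NaN. */
static double e8m0_decode(uint8_t code) {
    if (code == 0xff) {
        return NAN;
    }
    return ldexp(1.0, (int)code - 127);
}
```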

6bit formats

e3m2 (OCP-MX-E3M2, 1.3.2)

Bit layout (MSB first): sign (1 bit, bit 5), exponent (3 bits, bits 4 to 2), mantissa (2 bits, bits 1 to 0).

	Hex	Value
Smallest value (Denormal)	0x01	0.0625
Largest value (Denormal)	0x03	0.1875
Smallest value (Normal)	0x04	0.25
Largest value (Normal)	0x1f	28
Smallest value > 1	0x0d	1.25
Largest value < 1	0x0b	0.875
Closest value to π	0x12	3 (Δ ≈ 1.416x10⁻¹)
Largest sequential integer	0x18	8

Part of the OCP-MX specification, targeting machine learning and designed to be used as part of a block with a shared scaling factor.

Much like e5m2 relative to e4m3, e3m2 trades precision for a larger range relative to e2m3.

Has no dedicated Infinity or NaN.

e2m3 (OCP-MX-E2M3, 1.2.3)

Bit layout (MSB first): sign (1 bit, bit 5), exponent (2 bits, bits 4 to 3), mantissa (3 bits, bits 2 to 0).

	Hex	Value
Smallest value (Denormal)	0x01	0.125
Largest value (Denormal)	0x07	0.875
Smallest value (Normal)	0x08	1
Largest value (Normal)	0x1f	7.5
Smallest value > 1	0x09	1.125
Largest value < 1	0x07	0.875
Closest value to π	0x15	3.25 (Δ ≈ 1.084x10⁻¹)
Largest sequential integer	0x1e	7

Part of the OCP-MX specification, targeting machine learning and designed to be used as part of a block with a shared scaling factor.

Has more precision than e3m2, but has less range.

Has no dedicated Infinity or NaN.
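The range versus precision trade-off between the two 6-bit formats is easy to see with a shared decoder. This is a sketch under the assumption that both use a bias of 2^(e-1) - 1 (3 for e3m2, 1 for e2m3), which matches the tables above; `mxfp6_decode` is a hypothetical name.

```c
#include <math.h>
#include <stdint.h>

/* Decode a 6-bit MX element with e exponent bits and m mantissa bits (e + m == 5).
   Every bit pattern is a finite number: there is no Inf or NaN to special-case. */
static float mxfp6_decode(uint8_t bits, int e, int m) {
    int man  = bits & ((1 << m) - 1);
    int exp  = (bits >> m) & ((1 << e) - 1);
    int bias = (1 << (e - 1)) - 1;
    float v = (exp == 0)
        ? ldexpf((float)man, 1 - bias - m)                 /* denormal */
        : ldexpf((float)(man | (1 << m)), exp - bias - m); /* normal */
    return (bits & 0x20) ? -v : v;                          /* bit 5 is the sign */
}
```

Decoding 0x1f with each shape gives the largest magnitude of that format, and decoding 0x01 gives its smallest step.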

4bit formats

e2m1 (OCP-MX-E2M1)

	s000	s001	s010	s011	s100	s101	s110	s111
s=0	0	0.5	1	1.5	2	3	4	6
s=1	−0	−0.5	−1	−1.5	−2	−3	−4	−6

Part of the OCP-MX specification, targeting machine learning and designed to be used as part of a block with a shared scaling factor.

Despite its small size, its accuracy scores are impressive compared to the e4m3 formats when it is used as part of a microscaling block.

Has no dedicated Infinity or NaN.
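With only 16 codes, decoding is a table lookup, and the shared scaling factor is applied on top. The sketch below assumes an e8m0 scale of 2^(scale - 127) and packs two codes per byte with the low nibble first; that packing order, and the function names, are choices made for this example only. A NaN scale (0xff) is not handled.

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Magnitudes for the 3 low bits, straight from the table above. */
static const float E2M1_MAGNITUDE[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

static float e2m1_decode(uint8_t code) {
    float v = E2M1_MAGNITUDE[code & 0x7];
    return (code & 0x8) ? -v : v;       /* bit 3 is the sign */
}

/* Dequantise `count` elements that share one e8m0 scale.
   Assumption: two codes per byte, low nibble first. */
static void mxfp4_block_decode(const uint8_t *packed, size_t count,
                               uint8_t scale, float *out) {
    for (size_t i = 0; i < count; i++) {
        uint8_t byte = packed[i / 2];
        uint8_t code = (i & 1) ? (uint8_t)(byte >> 4) : (uint8_t)(byte & 0xf);
        out[i] = ldexpf(e2m1_decode(code), (int)scale - 127);
    }
}
```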

binary4p2sf

The same as e2m1, but -0 is replaced with NaN.

	s000	s001	s010	s011	s100	s101	s110	s111
s=0	0	0.5	1	1.5	2	3	4	6
s=1	NaN	−0.5	−1	−1.5	−2	−3	−4	−6

binary4p2se

The same as binary4p2sf, but ±6 is replaced with ±Inf.

	s000	s001	s010	s011	s100	s101	s110	s111
s=0	0	0.5	1	1.5	2	3	4	∞
s=1	NaN	−0.5	−1	−1.5	−2	−3	−4	−∞
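The three 4-bit tables differ only in a couple of entries, so one lookup with two substitutions covers them all. A minimal sketch; the enum and function names are invented for this example.

```c
#include <math.h>
#include <stdint.h>

enum fp4_variant { FP4_E2M1, FP4_BINARY4P2SF, FP4_BINARY4P2SE };

/* Decode a 4-bit code according to the tables for each variant. */
static float fp4_decode(uint8_t code, enum fp4_variant variant) {
    static const float magnitude[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};
    code &= 0xf;
    if (variant != FP4_E2M1 && code == 0x8) {
        return NAN;                         /* -0 is replaced with NaN */
    }
    float v = magnitude[code & 0x7];
    if (variant == FP4_BINARY4P2SE && (code & 0x7) == 0x7) {
        v = INFINITY;                       /* +/-6 is replaced with +/-Inf */
    }
    return (code & 0x8) ? -v : v;
}
```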

Other

TensorFloat-32 (1.8.10)

Bit layout (MSB first): sign (1 bit, bit 18), exponent (8 bits, bits 17 to 10), mantissa (10 bits, bits 9 to 0).

	Hex	Value
Smallest value (Denormal)	0x00001	1.1479437019748901e-41
Largest value (Denormal)	0x003ff	1.1743464071203126e-38
Smallest value (Normal)	0x00400	1.1754943508222875e-38
Largest value (Normal)	0x3fbff	3.4011621342146535e+38
Smallest value > 1	0x1fc01	1.0009765625
Largest value < 1	0x1fbff	0.99951171875
Closest value to π	0x20248	3.140625 (Δ ≈ 9.677x10⁻⁴)
Largest sequential integer	0x22800	2048

Shortened IEEE 754 single-precision floating-point format. (Has the same exponent as an f32, but fewer mantissa bits).

Widely used for training on NVIDIA Ampere/Hopper, where it was designed specifically for tensor cores.
Despite having 32 in the name, it is actually 19 bits wide.

Can be converted into an f32 by simply shifting.

float f = asfloat(tf32_bits << 13);
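A plain C sketch of the same shift, treating the TF32 value as a 19-bit integer in the low bits of a `uint32_t`, as the hex column above does. The function names are made up here, and the reverse direction just truncates the low 13 mantissa bits rather than rounding.

```c
#include <stdint.h>
#include <string.h>

/* 19-bit TF32 pattern -> f32: the bits become the top 19 bits of the f32. */
static float tf32_to_f32(uint32_t t) {
    uint32_t u = t << 13;
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* f32 -> 19-bit TF32 pattern. Assumption: truncation, chosen for simplicity. */
static uint32_t f32_to_tf32_truncate(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u >> 13;
}
```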

E5B9G9R9 (0.5.9)

Bit layout (MSB first): exponent (5 bits, bits 13 to 9), mantissa (9 bits, bits 8 to 0).

	Hex	Value
Smallest value (Denormal)	0x0001	1.1920928955078125e-7
Largest value (Denormal)	0x01ff	0.00006091594696044922
Smallest value (Normal)	0x0200	0.00006103515625
Largest value (Normal)	0x3dff	65472
Smallest value > 1	0x1e01	1.001953125
Largest value < 1	0x1dff	0.9990234375
Closest value to π	0x2124	3.140625 (Δ ≈ 9.677x10⁻⁴)
Largest sequential integer	0x3200	1024

Shortened unsigned IEEE 754 half-precision binary floating-point format. (Has the same exponent as an f16, but fewer mantissa bits).

Colour values are stored as mantissa bits with a shared exponent.

Often used as a HDR texture format (although largely replaced by BC6).

The reasoning behind this format is that, perceptually, the human eye is a lot better at picking out differences in luminance than changes in hue.
The same theory applies to formats like YCbCr, where it's not uncommon to transmit Y (luma) at a higher resolution than the chroma components (CbCr).
This idea can also be seen when storing baked lighting as L1 spherical harmonics. Rather than storing 4 coefficients for each channel, it's pretty common to construct a weighted average and normalise against L0, with additional coefficients for R, G and B (12 floats become 6).

Can be converted into an f16 by simply shifting.

half h = ashalf((uint16_t)e5m9_bits << 1);

R11G11B10 (0.5.6 / 0.5.5)

Bit layout for the 11-bit channels (R and G, 0.5.6), MSB first: exponent (5 bits, bits 10 to 6), mantissa (6 bits, bits 5 to 0).

	Hex	Value
Smallest value (Denormal)	0x001	9.5367431640625e-7
Largest value (Denormal)	0x03f	0.00006008148193359375
Smallest value (Normal)	0x040	0.00006103515625
Largest value (Normal)	0x7bf	65024
Smallest value > 1	0x3c1	1.015625
Largest value < 1	0x3bf	0.9921875
Closest value to π	0x425	3.15625 (Δ ≈ 1.466x10⁻²)
Largest sequential integer	0x580	128

Bit layout for the 10-bit channel (B, 0.5.5), MSB first: exponent (5 bits, bits 9 to 5), mantissa (5 bits, bits 4 to 0).

	Hex	Value
Smallest value (Denormal)	0x001	0.0000019073486328125
Largest value (Denormal)	0x01f	0.0000591278076171875
Smallest value (Normal)	0x020	0.00006103515625
Largest value (Normal)	0x3df	64512
Smallest value > 1	0x1e1	1.03125
Largest value < 1	0x1df	0.984375
Closest value to π	0x212	3.125 (Δ ≈ 1.659x10⁻²)
Largest sequential integer	0x2a0	64

Shortened unsigned IEEE 754 half-precision binary floating-point format. (Has the same exponent as an f16, but fewer mantissa bits).

Often used in place of RGBA16f for render targets.

Can be converted into an f16 by simply shifting.

half h10 = ashalf((uint16_t)u10_bits << 5);
half h11 = ashalf((uint16_t)u11_bits << 4);
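Putting the two shifts together, a whole packed texel can be split into three f16 bit patterns. This sketch assumes the bit order used by DXGI_FORMAT_R11G11B10_FLOAT (R in bits 0 to 10, G in bits 11 to 21, B in bits 22 to 31); the function name is invented for this example.

```c
#include <stdint.h>

/* Unpack a 32-bit R11G11B10 texel into three f16 bit patterns (R, G, B).
   Assumed layout: R = bits 0-10, G = bits 11-21, B = bits 22-31. */
static void r11g11b10_to_f16_bits(uint32_t packed, uint16_t rgb[3]) {
    uint16_t r11 = (uint16_t)(packed & 0x7ffu);
    uint16_t g11 = (uint16_t)((packed >> 11) & 0x7ffu);
    uint16_t b10 = (uint16_t)((packed >> 22) & 0x3ffu);
    rgb[0] = (uint16_t)(r11 << 4);   /* 6 mantissa bits -> 10: shift by 4 */
    rgb[1] = (uint16_t)(g11 << 4);
    rgb[2] = (uint16_t)(b10 << 5);   /* 5 mantissa bits -> 10: shift by 5 */
}
```

The sign bit of each resulting half is always 0, since the packed channels are unsigned.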
