Minifloat Format Converter
Introduction
Minifloats are a family of floating-point formats that work like the standard IEEE 754 float32, only smaller!
Each of them makes its own trade-off between bandwidth, compute cost and precision.
No one format can claim to dominate the others in every situation.
Their uses fall roughly into two categories: rendering (VFX and games) and machine learning. However, with things like D3D12 LinAlg, it probably won't be long until games start using neural networks (with these formats) alongside traditional rendering techniques.
The (s.e.m) notation used throughout, e.g. (1.8.7), means (sign.exponent.mantissa) in terms of bits.
So (1.8.7) means (1 sign bit, 8 exponent bits, 7 mantissa bits) and really acts as a shorthand way of identifying the shape of a floating point format.
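To make the notation concrete, here is a minimal sketch of a decoder driven purely by the (s.e.m) shape. The function name `decode_minifloat` and its parameters are made up for illustration. It assumes IEEE 754 conventions (bias of 2^(e-1) - 1 unless overridden, denormals when the exponent field is zero) and deliberately ignores Inf/NaN encodings, which differ between the formats below.

```python
def decode_minifloat(bits, sign_bits, exp_bits, man_bits, bias=None):
    """Decode a (s.e.m) bit pattern into a float, ignoring Inf/NaN encodings."""
    if bias is None:
        bias = (1 << (exp_bits - 1)) - 1  # IEEE 754 convention, e.g. 127 for 8 bits
    mantissa = bits & ((1 << man_bits) - 1)
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    negative = sign_bits == 1 and ((bits >> (man_bits + exp_bits)) & 1) == 1
    fraction = mantissa / (1 << man_bits)
    if exponent == 0:
        # Denormal: no implicit leading 1, exponent pinned to 1 - bias.
        magnitude = fraction * 2.0 ** (1 - bias)
    else:
        # Normal: implicit leading 1.
        magnitude = (1.0 + fraction) * 2.0 ** (exponent - bias)
    return -magnitude if negative else magnitude


print(decode_minifloat(0x4049, 1, 8, 7))   # bfloat16 (1.8.7)  -> 3.140625
print(decode_minifloat(0x7bff, 1, 5, 10))  # f16 (1.5.10)      -> 65504.0
```

Passing an explicit `bias` covers variants that deviate from the IEEE default, such as the FNUZ formats further down.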
16bit formats
bfloat16 (1.8.7)
Bit layout: sign (bit 15), exponent (8 bits, bits 14-7), mantissa (7 bits, bits 6-0).

| | Hex | Value |
|---|---|---|
| Smallest value (Denormal) | 0x0001 | 9.183549615799121e-41 |
| Largest value (Denormal) | 0x007f | 1.1663108012064884e-38 |
| Smallest value (Normal) | 0x0080 | 1.1754943508222875e-38 |
| Largest value (Normal) | 0x7f7f | 3.3895313892515355e+38 |
| Smallest value > 1 | 0x3f81 | 1.0078125 |
| Largest value < 1 | 0x3f7f | 0.99609375 |
| Closest value to π | 0x4049 | 3.140625 (Δ ≈ 9.677x10⁻⁴) |
| Largest sequential integer | 0x4380 | 256 |
Shortened IEEE 754 single-precision floating-point format. (Has the same exponent as an f32, but fewer mantissa bits).
A de facto standard for LLM training, where it can be used for gradients, activations and weights.
Can be converted into an f32 by simply shifting.
float f = asfloat((uint32_t)bfloat16_bits << 16);
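The same shift can be tried outside a shader. This is a small sketch using Python's `struct` module to reinterpret bits; the helper names are hypothetical. The narrowing direction shown here simply drops the low 16 bits (round toward zero), which is an assumption of the sketch rather than what any particular hardware does.

```python
import struct


def bf16_bits_to_float(bits):
    """Widen bfloat16 bits to float32 by placing them in the top 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def float_to_bf16_bits_truncate(x):
    """Narrow a float32 to bfloat16 by dropping the low 16 mantissa bits."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16


print(hex(float_to_bf16_bits_truncate(3.141592653589793)))  # 0x4049
print(bf16_bits_to_float(0x4049))                           # 3.140625
```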
f16 (half, 1.5.10)
Bit layout: sign (bit 15), exponent (5 bits, bits 14-10), mantissa (10 bits, bits 9-0).

| | Hex | Value |
|---|---|---|
| Smallest value (Denormal) | 0x0001 | 5.960464477539063e-8 |
| Largest value (Denormal) | 0x03ff | 0.00006097555160522461 |
| Smallest value (Normal) | 0x0400 | 0.00006103515625 |
| Largest value (Normal) | 0x7bff | 65504 |
| Smallest value > 1 | 0x3c01 | 1.0009765625 |
| Largest value < 1 | 0x3bff | 0.99951171875 |
| Closest value to π | 0x4248 | 3.140625 (Δ ≈ 9.677x10⁻⁴) |
| Largest sequential integer | 0x6800 | 2048 |
Probably the third most widely used format globally, after f32 and f64.
In games, it's commonly used for storing the pre-tonemapped render target alongside HDR textures.
In VFX, you'd output render passes for compositing in this format.
On GPUs you would expect to see dedicated f32tof16 / f16tof32 instructions.
Additionally many GPUs can perform arithmetic directly on f16 values (usually packing two in a 32bit register).
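As a rough CPU-side illustration of the packing idea, the sketch below uses the `e` (half-precision) format of Python's `struct` module to stand in for f32tof16 / f16tof32. The helper names and the choice to put the first value in the low 16 bits are assumptions made for the example.

```python
import struct


def f32_to_f16_bits(x):
    """Round a float to the nearest f16 and return its 16 raw bits."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def f16_bits_to_f32(bits):
    """Reinterpret 16 raw bits as an f16 and widen it to a float."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]


def pack_half2(lo, hi):
    """Pack two f16 values into one 32-bit word (lo in bits 0-15, hi in bits 16-31)."""
    return f32_to_f16_bits(lo) | (f32_to_f16_bits(hi) << 16)


def unpack_half2(word):
    """Split a 32-bit word back into its two f16 values."""
    return f16_bits_to_f32(word & 0xFFFF), f16_bits_to_f32(word >> 16)


print(hex(pack_half2(1.0, 2.0)))   # 0x40003c00
print(unpack_half2(0x40003c00))    # (1.0, 2.0)
```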
8bit formats
fp8 (1.5.2)
Bit layout: sign (bit 7), exponent (5 bits, bits 6-2), mantissa (2 bits, bits 1-0).

| | Hex | Value |
|---|---|---|
| Smallest value (Denormal) | 0x01 | 0.0000152587890625 |
| Largest value (Denormal) | 0x03 | 0.0000457763671875 |
| Smallest value (Normal) | 0x04 | 0.00006103515625 |
| Largest value (Normal) | 0x7b | 57344 |
| Smallest value > 1 | 0x3d | 1.25 |
| Largest value < 1 | 0x3b | 0.875 |
| Closest value to π | 0x42 | 3 (Δ ≈ 1.416x10⁻¹) |
| Largest sequential integer | 0x48 | 8 |
Shortened IEEE 754 half-precision binary floating-point format. (Has the same exponent as an f16, but fewer mantissa bits).
Can be converted into an f16 by simply shifting.
half h = ashalf((uint16_t)fp8_bits << 8);
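A quick way to check the shift against the table above is to reinterpret the shifted bits as an f16. This is only a sketch, with a hypothetical helper name, using the half-precision `e` format of Python's `struct` module.

```python
import struct


def fp8_bits_to_float(bits):
    """Widen (1.5.2) bits to f16 by shifting them into the top byte, then read as a float."""
    return struct.unpack("<e", struct.pack("<H", (bits & 0xFF) << 8))[0]


print(fp8_bits_to_float(0x3d))  # 1.25
print(fp8_bits_to_float(0x7b))  # 57344.0
print(fp8_bits_to_float(0x7c))  # inf, the exponent-all-ones pattern carries over from f16
```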
e5m2 (OCP-E5M2, 1.5.2)
Bit layout: sign (bit 7), exponent (5 bits, bits 6-2), mantissa (2 bits, bits 1-0).

| | Hex | Value |
|---|---|---|
| Smallest value (Denormal) | 0x01 | 0.0000152587890625 |
| Largest value (Denormal) | 0x03 | 0.0000457763671875 |
| Smallest value (Normal) | 0x04 | 0.00006103515625 |
| Largest value (Normal) | 0x7b | 57344 |
| Smallest value > 1 | 0x3d | 1.25 |
| Largest value < 1 | 0x3b | 0.875 |
| Closest value to π | 0x42 | 3 (Δ ≈ 1.416x10⁻¹) |
| Largest sequential integer | 0x48 | 8 |
Part of the OFP8 specification, targeting machine learning.
Has a larger range than e4m3, making it useful for gradients.
Stored the same as fp8, but with OFP8 saturation applied when converting from another format.
- Saturate: Clamp to the max magnitude.
- NonSaturate: Greater than max magnitude becomes infinity.
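The two modes only differ in what happens on overflow, which the sketch below tries to isolate. Everything here is hypothetical illustration: the function name is made up, overflow is decided on the unrounded input, infinite inputs are treated like any other out-of-range value, and in-range values are pushed through f16 and then truncated to the top byte, which is not correct rounding.

```python
import math
import struct

E5M2_MAX = 57344.0  # 0x7b, the largest finite e5m2 value


def encode_e5m2(x, saturate=True):
    """Encode a float as e5m2 bits, illustrating the overflow modes only."""
    if math.isnan(x):
        return 0x7f
    sign = 0x80 if math.copysign(1.0, x) < 0 else 0x00
    if abs(x) > E5M2_MAX:
        # Saturate clamps to the max magnitude (0x7b); NonSaturate gives infinity (0x7c).
        return sign | (0x7b if saturate else 0x7c)
    # Simplification: round to f16, then keep only the top 8 bits.
    half_bits = struct.unpack("<H", struct.pack("<e", abs(x)))[0]
    return sign | (half_bits >> 8)


print(hex(encode_e5m2(1e6, saturate=True)))    # 0x7b
print(hex(encode_e5m2(1e6, saturate=False)))   # 0x7c
print(hex(encode_e5m2(-1e6, saturate=False)))  # 0xfc
```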
e5m2fnuz (1.5.2)
Bit layout: sign (bit 7), exponent (5 bits, bits 6-2), mantissa (2 bits, bits 1-0).

| | Hex | Value |
|---|---|---|
| Smallest value (Denormal) | 0x01 | 0.00000762939453125 |
| Largest value (Denormal) | 0x03 | 0.00002288818359375 |
| Smallest value (Normal) | 0x04 | 0.000030517578125 |
| Largest value (Normal) | 0x7f | 57344 |
| Smallest value > 1 | 0x41 | 1.25 |
| Largest value < 1 | 0x3f | 0.875 |
| Closest value to π | 0x46 | 3 (Δ ≈ 1.416x10⁻¹) |
| Largest sequential integer | 0x4c | 8 |
FNUZ (Float NaN Unsigned Zero) variant of e5m2.
Supported primarily today by AMD hardware (MI300).
Uses a bias of 16 (1 more than e5m2) for the exponent.
Has no dedicated infinity, and NaN is 0x80 (the bit pattern that would otherwise be -0). By dropping +/- infinity and 6 NaN bit patterns in favour of a single NaN, it gains 8 more unique values.
As there is no infinity to overflow to, values beyond the maximum magnitude always saturate.
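Putting the bias of 16 and the single NaN together, a decoder for this format might look like the sketch below (the function name is hypothetical). Note that the all-ones exponent is treated as an ordinary exponent, which is why 0x7f decodes to a finite value.

```python
import math


def decode_e5m2fnuz(bits):
    """Decode e5m2fnuz bits: bias 16, NaN at 0x80, no infinities, no -0."""
    if bits == 0x80:
        return math.nan  # the would-be -0 pattern is the only NaN
    sign = -1.0 if bits & 0x80 else 1.0
    exponent = (bits >> 2) & 0x1f
    mantissa = bits & 0x3
    if exponent == 0:
        return sign * (mantissa / 4) * 2.0 ** (1 - 16)       # denormal
    return sign * (1 + mantissa / 4) * 2.0 ** (exponent - 16)  # normal


print(decode_e5m2fnuz(0x7f))  # 57344.0
print(decode_e5m2fnuz(0x41))  # 1.25
```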
e4m3 (OCP-E4M3, 1.4.3)
Bit layout: sign (bit 7), exponent (4 bits, bits 6-3), mantissa (3 bits, bits 2-0).

| | Hex | Value |
|---|---|---|
| Smallest value (Denormal) | 0x01 | 0.001953125 |
| Largest value (Denormal) | 0x07 | 0.013671875 |
| Smallest value (Normal) | 0x08 | 0.015625 |
| Largest value (Normal) | 0x7e | 448 |
| Smallest value > 1 | 0x39 | 1.125 |
| Largest value < 1 | 0x37 | 0.9375 |
| Closest value to π | 0x45 | 3.25 (Δ ≈ 1.084x10⁻¹) |
| Largest sequential integer | 0x58 | 16 |
Part of the OFP8 specification, targeting machine learning.
Has higher precision than e5m2, making it useful for activations and weights.
Has no dedicated infinity, with NaN being 0xff / 0x7f.
Like e5m2, it has two saturation modes:
- Saturate: Clamp to the max magnitude.
- NonSaturate: Greater than max magnitude becomes NaN.
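The NaN placement is what lets e4m3 reach 448: only the all-ones mantissa in the top exponent is reserved, and the rest of that exponent holds ordinary values. A minimal sketch, with hypothetical function names, of decoding plus the overflow pattern each mode would produce:

```python
import math


def decode_e4m3(bits):
    """Decode OCP e4m3 bits: bias 7, NaN at 0x7f / 0xff, no infinities."""
    if bits & 0x7f == 0x7f:
        return math.nan
    sign = -1.0 if bits & 0x80 else 1.0
    exponent = (bits >> 3) & 0xf
    mantissa = bits & 0x7
    if exponent == 0:
        return sign * (mantissa / 8) * 2.0 ** (1 - 7)        # denormal
    return sign * (1 + mantissa / 8) * 2.0 ** (exponent - 7)   # normal


def e4m3_overflow_bits(negative, saturate):
    """Bit pattern used for an out-of-range value in each mode."""
    sign = 0x80 if negative else 0x00
    # Saturate clamps to the max magnitude (0x7e); NonSaturate gives NaN (0x7f).
    return sign | (0x7e if saturate else 0x7f)


print(decode_e4m3(0x78))  # 256.0, an ordinary value despite the all-ones exponent
print(decode_e4m3(0x7e))  # 448.0
```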
e4m3fnuz (1.4.3)
Bit layout: sign (bit 7), exponent (4 bits, bits 6-3), mantissa (3 bits, bits 2-0).

| | Hex | Value |
|---|---|---|
| Smallest value (Denormal) | 0x01 | 0.0009765625 |
| Largest value (Denormal) | 0x07 | 0.0068359375 |
| Smallest value (Normal) | 0x08 | 0.0078125 |
| Largest value (Normal) | 0x7f | 240 |
| Smallest value > 1 | 0x41 | 1.125 |
| Largest value < 1 | 0x3f | 0.9375 |
| Closest value to π | 0x4d | 3.25 (Δ ≈ 1.084x10⁻¹) |
| Largest sequential integer | 0x60 | 16 |
FNUZ (Float NaN Unsigned Zero) variant of e4m3.
Supported primarily today by AMD hardware (MI300).
Uses a bias of 8 (1 more than e4m3) for the exponent.
Has no dedicated infinity, with NaN being 0x80 (-0). By reducing the 2 NaN values to 1 and removing -0, it has 2 more unique values.
As there is no infinity to overflow to, values beyond the maximum magnitude always saturate.
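The "more unique values" figures quoted for both FNUZ variants can be sanity-checked by brute force. The sketch below enumerates all 256 bit patterns of each 8-bit format and counts distinct finite values, assuming +0 and -0 count as one value. The helper names are hypothetical.

```python
def decode(bits, exp_bits, man_bits, bias):
    """Decode an 8-bit sign/exponent/mantissa pattern, treating it as finite."""
    sign = -1.0 if bits & 0x80 else 1.0
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    fraction = (bits & ((1 << man_bits) - 1)) / (1 << man_bits)
    if exponent == 0:
        return sign * fraction * 2.0 ** (1 - bias)
    return sign * (1.0 + fraction) * 2.0 ** (exponent - bias)


def distinct_finite(exp_bits, man_bits, bias, is_special):
    """Set of finite values; a set merges +0.0 and -0.0 because they compare equal."""
    return {decode(b, exp_bits, man_bits, bias)
            for b in range(256) if not is_special(b)}


e5m2 = distinct_finite(5, 2, 15, lambda b: (b & 0x7c) == 0x7c)  # skip Inf and NaN
e5m2fnuz = distinct_finite(5, 2, 16, lambda b: b == 0x80)       # skip the single NaN
e4m3 = distinct_finite(4, 3, 7, lambda b: (b & 0x7f) == 0x7f)   # skip the two NaNs
e4m3fnuz = distinct_finite(4, 3, 8, lambda b: b == 0x80)        # skip the single NaN

print(len(e5m2fnuz) - len(e5m2))  # 8
print(len(e4m3fnuz) - len(e4m3))  # 2
```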
e8m0 (OCP-MX-E8M0, 0.8.0)
Bit layout: exponent (8 bits, bits 7-0); no sign or mantissa bits.

| | Hex | Value |
|---|---|---|
| Smallest value | 0x00 | 5.877471754111438e-39 |
| Largest value | 0xfe | 1.7014118346046923e+38 |
| Smallest value > 1 | 0x80 | 2 |
| Largest value < 1 | 0x7e | 0.5 |
| Closest value to π | 0x81 | 4 (Δ ≈ 8.584x10⁻¹) |
Part of the OCP-MX specification, targeting machine learning and designed to be used as part of a block with a shared scaling factor.
Represents only positive values that are powers of 2, ranging from 2⁻¹²⁷ (0x00) to 2¹²⁷ (0xfe), with NaN at 0xff.
Has no dedicated Infinity or 0.
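Since the format is just a biased exponent, decoding is a single power of two. The sketch below also shows one possible way to pick a scale, by rounding down to the nearest power of two. That choice, and both function names, are assumptions for illustration rather than something the specification mandates.

```python
import math


def decode_e8m0(bits):
    """Decode e8m0 bits: 2**(bits - 127), with 0xff reserved for NaN."""
    if bits == 0xff:
        return math.nan
    return 2.0 ** (bits - 127)


def encode_e8m0_floor(x):
    """Encode the largest power of two <= x, clamped to the representable range."""
    if not (x > 0.0) or math.isinf(x):
        raise ValueError("e8m0 can only represent positive finite scales")
    _, e = math.frexp(x)  # x = m * 2**e with 0.5 <= m < 1, so floor(log2(x)) = e - 1
    return max(0x00, min(0xfe, (e - 1) + 127))


print(decode_e8m0(0x81))             # 4.0
print(hex(encode_e8m0_floor(3.14)))  # 0x80, i.e. 2.0
```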
6bit formats
e3m2 (OCP-MX-E3M2, 1.3.2)
Bit layout: sign (bit 5), exponent (3 bits, bits 4-2), mantissa (2 bits, bits 1-0).

| | Hex | Value |
|---|---|---|
| Smallest value (Denormal) | 0x01 | 0.0625 |
| Largest value (Denormal) | 0x03 | 0.1875 |
| Smallest value (Normal) | 0x04 | 0.25 |
| Largest value (Normal) | 0x1f | 28 |
| Smallest value > 1 | 0x0d | 1.25 |
| Largest value < 1 | 0x0b | 0.875 |
| Closest value to π | 0x12 | 3 (Δ ≈ 1.416x10⁻¹) |
| Largest sequential integer | 0x18 | 8 |
Part of the OCP-MX specification, targeting machine learning and designed to be used as part of a block with a shared scaling factor.
Much like how e5m2 relates to e4m3, e3m2 relates to e2m3 (has a larger range).
Has no dedicated Infinity or NaN.
e2m3 (OCP-MX-E2M3, 1.2.3)
Bit layout: sign (bit 5), exponent (2 bits, bits 4-3), mantissa (3 bits, bits 2-0).

| | Hex | Value |
|---|---|---|
| Smallest value (Denormal) | 0x01 | 0.125 |
| Largest value (Denormal) | 0x07 | 0.875 |
| Smallest value (Normal) | 0x08 | 1 |
| Largest value (Normal) | 0x1f | 7.5 |
| Smallest value > 1 | 0x09 | 1.125 |
| Largest value < 1 | 0x07 | 0.875 |
| Closest value to π | 0x15 | 3.25 (Δ ≈ 1.084x10⁻¹) |
| Largest sequential integer | 0x1e | 7 |
Part of the OCP-MX specification, targeting machine learning and designed to be used as part of a block with a shared scaling factor.
Has more precision than e3m2, but has less range.
Has no dedicated Infinity or NaN.
4bit formats
e2m1 (OCP-MX-E2M1)
| | s000 | s001 | s010 | s011 | s100 | s101 | s110 | s111 |
|---|---|---|---|---|---|---|---|---|
| s=0 | 0 | 0.5 | 1 | 1.5 | 2 | 3 | 4 | 6 |
| s=1 | −0 | −0.5 | −1 | −1.5 | −2 | −3 | −4 | −6 |
Part of the OCP-MX specification, targeting machine learning and designed to be used as part of a block with a shared scaling factor.
Despite its small size, its accuracy scores relative to the e4m3 formats are impressive when it is used as part of a microscaling block.
Has no dedicated Infinity or NaN.
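To show how the shared scaling factor fits in, here is a minimal sketch of decoding a block: one e8m0 scale multiplied into every e2m1 element, using the value table above. The function names are hypothetical, the block length is not enforced, and elements are passed as one nibble per list entry rather than in any particular packed layout.

```python
import math

# Magnitudes for the low three bits, taken from the e2m1 table (s000 .. s111).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(nibble):
    """Decode one 4-bit e2m1 element (bit 3 is the sign)."""
    magnitude = E2M1_MAGNITUDES[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def decode_mx_block(scale_bits, nibbles):
    """Apply a shared e8m0 scale to a list of e2m1 elements."""
    if scale_bits == 0xff:
        # A NaN scale makes the whole block NaN.
        return [math.nan] * len(nibbles)
    scale = 2.0 ** (scale_bits - 127)
    return [scale * decode_e2m1(n) for n in nibbles]


print(decode_mx_block(0x80, [0x1, 0x7, 0xf]))  # [1.0, 12.0, -12.0]
```

The scale is what recovers dynamic range: on their own the elements top out at ±6, but with the scale a block can sit anywhere between 2⁻¹²⁷ and 2¹²⁷.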
binary4p2sf
The same as e2m1, but -0 is replaced with NaN.
| | s000 | s001 | s010 | s011 | s100 | s101 | s110 | s111 |
|---|---|---|---|---|---|---|---|---|
| s=0 | 0 | 0.5 | 1 | 1.5 | 2 | 3 | 4 | 6 |
| s=1 | NaN | −0.5 | −1 | −1.5 | −2 | −3 | −4 | −6 |
binary4p2se
The same as binary4p2sf, but ±6 is replaced with ±Inf.
| | s000 | s001 | s010 | s011 | s100 | s101 | s110 | s111 |
|---|---|---|---|---|---|---|---|---|
| s=0 | 0 | 0.5 | 1 | 1.5 | 2 | 3 | 4 | ∞ |
| s=1 | NaN | −0.5 | −1 | −1.5 | −2 | −3 | −4 | −∞ |
Other
TensorFloat-32 (1.8.10)
Bit layout: sign (bit 18), exponent (8 bits, bits 17-10), mantissa (10 bits, bits 9-0).

| | Hex | Value |
|---|---|---|
| Smallest value (Denormal) | 0x00001 | 1.1479437019748901e-41 |
| Largest value (Denormal) | 0x003ff | 1.1743464071203126e-38 |
| Smallest value (Normal) | 0x00400 | 1.1754943508222875e-38 |
| Largest value (Normal) | 0x3fbff | 3.4011621342146535e+38 |
| Smallest value > 1 | 0x1fc01 | 1.0009765625 |
| Largest value < 1 | 0x1fbff | 0.99951171875 |
| Closest value to π | 0x20248 | 3.140625 (Δ ≈ 9.677x10⁻⁴) |
| Largest sequential integer | 0x22800 | 2048 |
Shortened IEEE 754 single-precision floating-point format. (Has the same exponent as an f32, but fewer mantissa bits).
Widely used in NVIDIA Ampere/Hopper training, designed specifically for tensor cores.
Despite the 32 in its name, it is actually 19 bits wide.
Can be converted into an f32 by simply shifting.
float f = asfloat(tf32_bits << 13);
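The 13-bit shift works in both directions if you are happy with truncation. A brief sketch with hypothetical helper names; dropping the low 13 bits rounds toward zero, which is an assumption here and not necessarily what tensor cores do internally.

```python
import struct


def float_to_tf32_bits_truncate(x):
    """Keep the top 19 bits of a float32 (sign, 8 exponent bits, 10 mantissa bits)."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 13


def tf32_bits_to_float(bits):
    """Widen 19 TF32 bits back to float32 by shifting them into place."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0x7FFFF) << 13))[0]


print(hex(float_to_tf32_bits_truncate(3.141592653589793)))  # 0x20248
print(tf32_bits_to_float(0x20248))                          # 3.140625
```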
E5B9G9R9 (0.5.9)
Bit layout (as a 0.5.9 value): exponent (5 bits, bits 13-9), mantissa (9 bits, bits 8-0); no sign bit.

| | Hex | Value |
|---|---|---|
| Smallest value (Denormal) | 0x0001 | 1.1920928955078125e-7 |
| Largest value (Denormal) | 0x01ff | 0.00006091594696044922 |
| Smallest value (Normal) | 0x0200 | 0.00006103515625 |
| Largest value (Normal) | 0x3dff | 65472 |
| Smallest value > 1 | 0x1e01 | 1.001953125 |
| Largest value < 1 | 0x1dff | 0.9990234375 |
| Closest value to π | 0x2124 | 3.140625 (Δ ≈ 9.677x10⁻⁴) |
| Largest sequential integer | 0x3200 | 1024 |
Shortened unsigned IEEE 754 half-precision binary floating-point format. (Has the same exponent as an f16, but fewer mantissa bits).
Colour values are stored as mantissa bits with a shared exponent.
Often used as a HDR texture format (although largely replaced by BC6).
The reasoning behind this format is that, perceptually, the human eye is a lot better at picking out differences in luminance than changes in hue.
The same theory applies to formats like YCbCr, where it's not uncommon to transmit Y (luma) at a higher resolution than the chroma components (CbCr).
This idea can also be seen when storing baked lighting as L1 spherical harmonics. Rather than storing each channel as 4 coefficients, it's pretty common to construct a weighted average and normalise against L0, with additional coefficients for R, G and B (12 floats becomes 6).
Can be converted into an f16 by simply shifting.
half h = ashalf((uint16_t)e5m9_bits << 1);
R11G11B10 (0.5.6 / 0.5.5)
11-bit layout (0.5.6, R and G channels): exponent (5 bits, bits 10-6), mantissa (6 bits, bits 5-0); no sign bit.

| | Hex | Value |
|---|---|---|
| Smallest value (Denormal) | 0x001 | 9.5367431640625e-7 |
| Largest value (Denormal) | 0x03f | 0.00006008148193359375 |
| Smallest value (Normal) | 0x040 | 0.00006103515625 |
| Largest value (Normal) | 0x7bf | 65024 |
| Smallest value > 1 | 0x3c1 | 1.015625 |
| Largest value < 1 | 0x3bf | 0.9921875 |
| Closest value to π | 0x425 | 3.15625 (Δ ≈ 1.466x10⁻²) |
| Largest sequential integer | 0x580 | 128 |
10-bit layout (0.5.5, B channel): exponent (5 bits, bits 9-5), mantissa (5 bits, bits 4-0); no sign bit.

| | Hex | Value |
|---|---|---|
| Smallest value (Denormal) | 0x001 | 0.0000019073486328125 |
| Largest value (Denormal) | 0x01f | 0.0000591278076171875 |
| Smallest value (Normal) | 0x020 | 0.00006103515625 |
| Largest value (Normal) | 0x3df | 64512 |
| Smallest value > 1 | 0x1e1 | 1.03125 |
| Largest value < 1 | 0x1df | 0.984375 |
| Closest value to π | 0x212 | 3.125 (Δ ≈ 1.659x10⁻²) |
| Largest sequential integer | 0x2a0 | 64 |
Shortened unsigned IEEE 754 half-precision binary floating-point format. (Has the same exponent as an f16, but fewer mantissa bits).
Often used in place of RGBA16f for render targets.
Can be converted into an f16 by simply shifting.
half h10 = ashalf((uint16_t)u10_bits << 5);
half h11 = ashalf((uint16_t)u11_bits << 4);
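Unpacking a whole texel is then three mask-and-shift steps followed by the shifts above. The sketch below assumes R sits in bits 0-10, G in bits 11-21 and B in bits 22-31 of the 32-bit word. Treat that ordering as an assumption and check it against your graphics API; the function names are hypothetical.

```python
import struct


def half_bits_to_float(bits):
    """Reinterpret 16 raw bits as an f16 and widen it to a float."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]


def unpack_r11g11b10(word):
    """Split a packed texel into (r, g, b) floats via the f16 shifts."""
    r_bits = word & 0x7ff          # 11 bits: 5 exponent, 6 mantissa
    g_bits = (word >> 11) & 0x7ff  # 11 bits: 5 exponent, 6 mantissa
    b_bits = (word >> 22) & 0x3ff  # 10 bits: 5 exponent, 5 mantissa
    return (half_bits_to_float(r_bits << 4),
            half_bits_to_float(g_bits << 4),
            half_bits_to_float(b_bits << 5))


texel = 0x3c0 | (0x425 << 11) | (0x1e1 << 22)
print(unpack_r11g11b10(texel))  # (1.0, 3.15625, 1.03125)
```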
Links
- To FP8 and Back Again: Quantifying the Effects of Reducing Precision on LLM Training Stability
- NVIDIA: Floating-Point 8: An Introduction to Efficient, Lower-Precision AI Training
- NVIDIA: Introducing NVFP4 for Efficient and Accurate Low-Precision Inference
- OCP 8-bit Floating Point Specification (OFP8) - Revision 1.0
- OCP Microscaling Formats (MX) Specification - Version 1.0