Minifloat Formats
16bit formats
bfloat16 (1.8.7)
Bit layout: sign (bit 15), exponent (8 bits, bits 14-7), mantissa (7 bits, bits 6-0).
A shortened IEEE 754 single-precision format: the same 8-bit exponent as f32, but only 7 mantissa bits instead of 23.
Can be converted into an f32 by shifting the bits into the upper 16 bits of a 32-bit word.
float f = asfloat((uint32_t)bfloat16_bits << 16);
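The shift-based conversion can be sketched in plain C, where `memcpy` stands in for a bit-cast intrinsic. The function names `bf16_to_f32` and `f32_to_bf16_truncate` are hypothetical, and the narrowing direction shown here truncates rather than rounds, which is an assumption made for brevity.

```c
#include <stdint.h>
#include <string.h>

/* Widen bfloat16 bits to f32: the 16 bits become the upper half of the word. */
float bf16_to_f32(uint16_t bits)
{
    uint32_t u = (uint32_t)bits << 16;
    float f;
    memcpy(&f, &u, sizeof f);   /* reinterpret the bits as a float */
    return f;
}

/* Narrow f32 to bfloat16 by dropping the low 16 mantissa bits.
 * This truncates (rounds toward zero); a NaN whose payload lives only in the
 * dropped bits would turn into an infinity, so real code may want to round
 * and special-case NaN. */
uint16_t f32_to_bf16_truncate(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return (uint16_t)(u >> 16);
}
```

For example, the bit pattern 0x3F80 widens to 1.0f, and 0x4049 widens to 3.140625f, the closest bfloat16 gets to pi when truncating.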
f16 (half, 1.5.10)
Bit layout: sign (bit 15), exponent (5 bits, bits 14-10), mantissa (10 bits, bits 9-0).
Probably the third most widely used format globally, after f32 and f64.
Commonly used for storing pre-tonemapped and HDR textures.
On GPUs you would expect to see dedicated f32tof16 / f16tof32 instructions.
Additionally, many GPUs can perform arithmetic directly on f16 values (usually packing two into a 32-bit register).
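Where no dedicated conversion instruction is available, an f16 can be decoded in software. The following is a minimal sketch of what an f16tof32-style operation computes; the name `f16_to_f32` is hypothetical and the code favours readability over speed.

```c
#include <stdint.h>
#include <string.h>

/* Decode IEEE 754 half-precision bits (1.5.10) into a float. */
float f16_to_f32(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t u;
    float f;

    if (exp == 0x1Fu) {
        /* Infinity or NaN: all-ones exponent, keep the mantissa payload. */
        u = sign | 0x7F800000u | (man << 13);
    } else if (exp != 0) {
        /* Normal number: rebias the exponent from 15 to 127 (+112). */
        u = sign | ((exp + 112u) << 23) | (man << 13);
    } else {
        /* Zero or subnormal: value is man * 2^-24, exact in f32. */
        f = (float)man * (1.0f / 16777216.0f);
        return (h & 0x8000u) ? -f : f;
    }
    memcpy(&f, &u, sizeof f);
    return f;
}
```

With this sketch, 0x3C00 decodes to 1.0f and 0x7BFF decodes to 65504.0f, the largest finite f16 value.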
8bit formats
fp8 (1.5.2)
Bit layout: sign (bit 7), exponent (5 bits, bits 6-2), mantissa (2 bits, bits 1-0).
A shortened IEEE 754 half-precision format: the same 5-bit exponent as f16, but only 2 mantissa bits instead of 10.
Can be converted into an f16 by shifting the bits into the upper 8 bits of a 16-bit word.
half h = ashalf((uint16_t)fp8_bits << 8);
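Standard C has no portable half type, so the same idea can be sketched on raw bit patterns instead. Both function names below are hypothetical, and the narrowing direction truncates rather than rounds, which is an assumption made to keep the sketch short.

```c
#include <stdint.h>

/* Widen fp8 (1.5.2) to f16 bits: the byte becomes the upper half of the f16. */
uint16_t fp8_to_f16_bits(uint8_t fp8_bits)
{
    return (uint16_t)((uint16_t)fp8_bits << 8);
}

/* Narrow f16 bits to fp8 by dropping the low 8 mantissa bits (truncation). */
uint8_t f16_bits_to_fp8_truncate(uint16_t f16_bits)
{
    return (uint8_t)(f16_bits >> 8);
}
```

For example, the fp8 pattern 0x3C (1.0) widens to the f16 pattern 0x3C00, and 0x7B widens to 0x7B00, which is 57344 in f16.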
e5m2 (OCP-E5M2, 1.5.2)
The same encoding as fp8, but with one of the OFP8 overflow modes applied when converting into it:
- Saturate: Clamp to the max magnitude.
- NonSaturate: Greater than max magnitude becomes infinity.
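The two modes only differ in what happens when a value does not fit. A minimal sketch, narrowing f16 bits to e5m2 with round-to-nearest-even, is shown below. The function name `f16_bits_to_e5m2` is hypothetical, and treating an infinite input like an overflow in Saturate mode is an assumption about the intended behaviour, not something stated above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Narrow f16 bits to e5m2 (1.5.2), rounding to nearest even. */
uint8_t f16_bits_to_e5m2(uint16_t h, bool saturate)
{
    uint16_t sign = h & 0x8000u;
    uint16_t mag  = h & 0x7FFFu;

    /* NaN stays NaN in both modes. */
    if (mag > 0x7C00u)
        return (uint8_t)((sign >> 8) | 0x7Fu);

    /* Round to nearest even on the 8 bits being dropped. */
    uint32_t rounded = ((uint32_t)mag + 0x7Fu + ((mag >> 8) & 1u)) >> 8;

    /* Overflow (assumed to include an infinite input):
     * Saturate clamps to the max finite value 0x7B (57344),
     * NonSaturate produces infinity 0x7C. */
    if (rounded >= 0x7Cu)
        rounded = saturate ? 0x7Bu : 0x7Cu;

    return (uint8_t)((sign >> 8) | rounded);
}
```

For example, the f16 value 65504 (0x7BFF) rounds past the largest e5m2 value, so it becomes 0x7B when saturating and 0x7C (infinity) otherwise.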
e4m3 (OCP-E4M3, 1.4.3)
Bit layout: sign (bit 7), exponent (4 bits, bits 6-3), mantissa (3 bits, bits 2-0).
Has no dedicated infinity; the only NaN encodings are 0x7f and 0xff.
Like e5m2, it has saturating modes:
- Saturate: Clamp to the max magnitude.
- NonSaturate: Greater than max magnitude becomes NaN.
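Because e4m3 cannot be widened to f16 with a plain shift, a decoder has to rebias the exponent. The following sketch decodes e4m3 to f32 and shows which code each overflow mode produces; the names `e4m3_to_f32` and `e4m3_overflow` are hypothetical, and the exponent bias of 7 and max magnitude of 448 are taken from the OCP definition of the format rather than from the text above.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Decode e4m3 (1.4.3, bias 7) into a float. */
float e4m3_to_f32(uint8_t bits)
{
    uint32_t sign = (uint32_t)(bits & 0x80u) << 24;
    uint32_t exp  = (bits >> 3) & 0xFu;
    uint32_t man  = bits & 0x7u;
    uint32_t u;
    float f;

    if ((bits & 0x7Fu) == 0x7Fu) {
        u = sign | 0x7FC00000u;                         /* 0x7f / 0xff: NaN */
    } else if (exp != 0) {
        u = sign | ((exp + 120u) << 23) | (man << 20);  /* rebias 7 -> 127 */
    } else {
        f = (float)man * (1.0f / 512.0f);               /* zero or subnormal: man * 2^-9 */
        return (bits & 0x80u) ? -f : f;
    }
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Code produced for an out-of-range magnitude; sign_bit is 0x00 or 0x80. */
uint8_t e4m3_overflow(uint8_t sign_bit, bool saturate)
{
    return (uint8_t)(sign_bit | (saturate ? 0x7Eu : 0x7Fu));  /* max (448) or NaN */
}
```

Note that the all-ones exponent is an ordinary exponent here: 0x7E decodes to 448, and only the single mantissa pattern above it is reserved for NaN.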
4bit formats
e2m1
| | s000 | s001 | s010 | s011 | s100 | s101 | s110 | s111 |
|---|---|---|---|---|---|---|---|---|
| s=0 | 0 | 0.5 | 1 | 1.5 | 2 | 3 | 4 | 6 |
| s=1 | −0 | −0.5 | −1 | −1.5 | −2 | −3 | −4 | −6 |
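With only sixteen codes, the simplest decoder is a lookup on the low three bits plus a sign flip. A minimal sketch, with the hypothetical name `e2m1_to_f32`, assuming the value sits in the low nibble of a byte:

```c
#include <stdint.h>

/* Decode an e2m1 value stored in the low 4 bits of a byte. */
float e2m1_to_f32(uint8_t nibble)
{
    /* Magnitudes for codes s000 .. s111, as in the table above. */
    static const float magnitude[8] = {
        0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f
    };
    float m = magnitude[nibble & 0x7u];
    return (nibble & 0x8u) ? -m : m;   /* bit 3 is the sign */
}
```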
binary4p2sf
The same as e2m1, but -0 is replaced with NaN.
| | s000 | s001 | s010 | s011 | s100 | s101 | s110 | s111 |
|---|---|---|---|---|---|---|---|---|
| s=0 | 0 | 0.5 | 1 | 1.5 | 2 | 3 | 4 | 6 |
| s=1 | NaN | −0.5 | −1 | −1.5 | −2 | −3 | −4 | −6 |
binary4p2se
The same as binary4p2sf, but 6 is replaced with Inf.
| | s000 | s001 | s010 | s011 | s100 | s101 | s110 | s111 |
|---|---|---|---|---|---|---|---|---|
| s=0 | 0 | 0.5 | 1 | 1.5 | 2 | 3 | 4 | ∞ |
| s=1 | NaN | −0.5 | −1 | −1.5 | −2 | −3 | −4 | −∞ |
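Both variants can share one table-driven decoder that special-cases the NaN slot and, optionally, the infinity slot. This is a sketch only; the function name `binary4p2_to_f32` and the `has_inf` flag (false for binary4p2sf, true for binary4p2se) are hypothetical.

```c
#include <math.h>
#include <stdbool.h>
#include <stdint.h>

/* Decode a binary4p2sf (has_inf = false) or binary4p2se (has_inf = true)
 * value stored in the low 4 bits of a byte. */
float binary4p2_to_f32(uint8_t nibble, bool has_inf)
{
    static const float magnitude[8] = {
        0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f
    };
    nibble &= 0xFu;

    if (nibble == 0x8u)
        return NAN;                      /* the -0 slot holds NaN */

    float m = magnitude[nibble & 0x7u];
    if (has_inf && (nibble & 0x7u) == 0x7u)
        m = INFINITY;                    /* the +-6 slot holds +-Inf */

    return (nibble & 0x8u) ? -m : m;
}
```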
Other
R11G11B10 (0.5.6 / 0.5.5)
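Bit layout (11-bit R and G channels): exponent (5 bits, bits 10-6), mantissa (6 bits, bits 5-0).
Bit layout (10-bit B channel): exponent (5 bits, bits 9-5), mantissa (5 bits, bits 4-0).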
Often used in place of RGBA16f for render targets.
A shortened, unsigned IEEE 754 half-precision format: the same 5-bit exponent as f16, but no sign bit and fewer mantissa bits (6 for the 11-bit channels, 5 for the 10-bit channel).
Can be converted into an f16 by shifting so that the exponent lands in bits 14-10.
half h10 = ashalf((uint16_t)u10_bits << 5);
half h11 = ashalf((uint16_t)u11_bits << 4);
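Unpacking a whole texel follows the same pattern: mask out each channel, then shift it into f16 position. The sketch below works on raw bit patterns and assumes the common packing with R in bits 0-10, G in bits 11-21 and B in bits 22-31; that layout and the function name `unpack_r11g11b10_to_f16_bits` are assumptions, not something stated above.

```c
#include <stdint.h>

/* Split a packed R11G11B10 texel into three f16 bit patterns (R, G, B).
 * Assumed layout: R = bits 0-10, G = bits 11-21, B = bits 22-31. */
void unpack_r11g11b10_to_f16_bits(uint32_t packed, uint16_t out[3])
{
    uint16_t r11 = (uint16_t)(packed & 0x7FFu);
    uint16_t g11 = (uint16_t)((packed >> 11) & 0x7FFu);
    uint16_t b10 = (uint16_t)((packed >> 22) & 0x3FFu);

    out[0] = (uint16_t)(r11 << 4);   /* 6 mantissa bits  -> 10: shift by 4 */
    out[1] = (uint16_t)(g11 << 4);
    out[2] = (uint16_t)(b10 << 5);   /* 5 mantissa bits  -> 10: shift by 5 */
}
```

Since the source channels have no sign bit, the resulting f16 values always have bit 15 clear.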