NB: e5m2 / e4m3 saturation modes only apply when providing a decimal value.
All input conversions use rounding-to-nearest-even.
Lovingly based on https://www.h-schmidt.net/FloatConverter and flop.evanau.dev

Source code
Minifloat (C++ with a C interface): minifloat
Wasm Binding (C): bootstrap.c
Wasm Binding (js): minifloat_interface.js


Minifloat Format Converter

Introduction

Minifloats are a bunch of different floating point formats, like your standard IEEE-754 float32, but smaller!

Each of them makes a different trade-off between bandwidth, compute cost and precision.
No one format can claim to dominate the others in every situation.

They can roughly be categorised into two use cases: rendering (VFX and games) and machine learning. However, with things like D3D12 LinAlg, it probably won't be long until games start using neural networks (with these formats) alongside traditional rendering techniques.

The (s.e.m) notation you'll see throughout, e.g. (1.8.7), stands for (sign.exponent.mantissa) and gives the number of bits in each field.
So (1.8.7) means 1 sign bit, 8 exponent bits and 7 mantissa bits; it acts as a shorthand for the shape of a floating-point format.
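
To make the notation concrete, here's a minimal sketch of a decoder that turns a bit pattern plus an (s.e.m) shape into a value. `decode_minifloat` is a hypothetical helper (it is not taken from the minifloat library), and it assumes plain IEEE-754-style rules: a bias of 2^(e-1) - 1, denormals at exponent 0, and the all-ones exponent reserved for infinity / NaN. Formats such as e4m3, the FNUZ variants and e8m0 bend those rules, so this sketch won't decode them correctly.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: decode an IEEE-754-style minifloat held in the low bits of `bits`.
// Assumes bias = 2^(exp_bits - 1) - 1 and an all-ones exponent meaning Inf / NaN.
double decode_minifloat(uint32_t bits, int sign_bits, int exp_bits, int man_bits) {
    assert(sign_bits == 0 || sign_bits == 1);
    assert(exp_bits >= 1 && man_bits >= 0);
    assert(sign_bits + exp_bits + man_bits <= 31);

    const uint32_t man_mask = (1u << man_bits) - 1u;
    const uint32_t exp_mask = (1u << exp_bits) - 1u;
    const uint32_t man = bits & man_mask;
    const uint32_t exp = (bits >> man_bits) & exp_mask;
    const bool negative = sign_bits && ((bits >> (man_bits + exp_bits)) & 1u);
    const int bias = (1 << (exp_bits - 1)) - 1;

    double magnitude;
    if (exp == exp_mask) {
        // All-ones exponent: infinity when the mantissa is zero, NaN otherwise.
        magnitude = man ? NAN : INFINITY;
    } else if (exp == 0) {
        // Denormal: no implicit leading 1, exponent fixed at (1 - bias).
        magnitude = std::ldexp((double)man, 1 - bias - man_bits);
    } else {
        // Normal: implicit leading 1 in front of the mantissa bits.
        magnitude = std::ldexp((double)(man | (1u << man_bits)), (int)exp - bias - man_bits);
    }
    return negative ? -magnitude : magnitude;
}
```

Under those assumptions, `decode_minifloat(0x3f81, 1, 8, 7)` gives 1.0078125 (bfloat16's smallest value > 1) and `decode_minifloat(0x7bff, 1, 5, 10)` gives 65504 (f16's largest normal value).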

16bit formats

bfloat16 (1.8.7)

Bit layout: sign (bit 15), exponent (8 bit, bits 14 to 7), mantissa (7 bit, bits 6 to 0)

Hex Value
Smallest value (Denormal) 0x0001 9.183549615799121e-41
Largest value (Denormal) 0x007f 1.1663108012064884e-38
Smallest value (Normal) 0x0080 1.1754943508222875e-38
Largest value (Normal) 0x7f7f 3.3895313892515355e+38
Smallest value > 1 0x3f81 1.0078125
Largest value < 1 0x3f7f 0.99609375
Closest value to π 0x4049 3.140625 (Δ ≈ 9.677x10⁻⁴)
Largest sequential integer 0x4380 256

A shortened IEEE 754 single-precision floating-point format: it has the same exponent width as an f32, but fewer mantissa bits.

A de facto standard for LLM training, where it can be used for gradients, activations and weights.

Can be converted into an f32 by simply shifting the 16 bits into the top half of a 32-bit word.

float f = asfloat((uint32_t)bfloat16_bits << 16);

f16 (half, 1.5.10)

Bit layout: sign (bit 15), exponent (5 bit, bits 14 to 10), mantissa (10 bit, bits 9 to 0)

Hex Value
Smallest value (Denormal) 0x0001 5.960464477539063e-8
Largest value (Denormal) 0x03ff 0.00006097555160522461
Smallest value (Normal) 0x0400 0.00006103515625
Largest value (Normal) 0x7bff 65504
Smallest value > 1 0x3c01 1.0009765625
Largest value < 1 0x3bff 0.99951171875
Closest value to π 0x4248 3.140625 (Δ ≈ 9.677x10⁻⁴)
Largest sequential integer 0x6800 2048

Probably the third most widely used format globally, after f32 and f64.

In games, it's commonly used for the pre-tonemapped render target, as well as for HDR textures.

In VFX, you'd output render passes for compositing in this format.

On GPUs, you would expect to see dedicated f32tof16 / f16tof32 instructions.
Additionally, many GPUs can perform arithmetic directly on f16 values (usually packing two in a 32-bit register).

8bit formats

fp8 (1.5.2)

Bit layout: sign (bit 7), exponent (5 bit, bits 6 to 2), mantissa (2 bit, bits 1 to 0)

Hex Value
Smallest value (Denormal) 0x01 0.0000152587890625
Largest value (Denormal) 0x03 0.0000457763671875
Smallest value (Normal) 0x04 0.00006103515625
Largest value (Normal) 0x7b 57344
Smallest value > 1 0x3d 1.25
Largest value < 1 0x3b 0.875
Closest value to π 0x42 3 (Δ ≈ 1.416x10⁻¹)
Largest sequential integer 0x48 8

A shortened IEEE 754 half-precision binary floating-point format: it has the same exponent width as an f16, but fewer mantissa bits.

Can be converted into an f16 by simply shifting.

half h = ashalf((uint16_t)fp8_bits << 8);

e5m2 (OCP-E5M2, 1.5.2)

Bit layout: sign (bit 7), exponent (5 bit, bits 6 to 2), mantissa (2 bit, bits 1 to 0)

Hex Value
Smallest value (Denormal) 0x01 0.0000152587890625
Largest value (Denormal) 0x03 0.0000457763671875
Smallest value (Normal) 0x04 0.00006103515625
Largest value (Normal) 0x7b 57344
Smallest value > 1 0x3d 1.25
Largest value < 1 0x3b 0.875
Closest value to π 0x42 3 (Δ ≈ 1.416x10⁻¹)
Largest sequential integer 0x48 8

Part of the OFP8 specification, targeting machine learning.

Has a larger range than e4m3, making it useful for gradients.

Stored the same as fp8, but with OFP8 saturation applied when converting from another format.
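
As a rough illustration of what that saturation changes, here's a sketch that narrows an f16 bit pattern to e5m2 with round-to-nearest-even. `f16_bits_to_e5m2` is a hypothetical helper, and the saturating behaviour is an assumption based on my reading of the OFP8 specification: anything that would overflow (including an infinite input) clamps to ±57344 (0x7b / 0xfb), while NaN stays NaN. Check the specification before relying on it.

```cpp
#include <cstdint>

// Hypothetical helper: f16 bits -> e5m2 bits, rounding to nearest, ties to even.
// saturate == false: overflow becomes +/-infinity (0x7c / 0xfc), like plain fp8.
// saturate == true:  overflow and infinities clamp to +/-57344 (assumed OFP8 behaviour).
uint8_t f16_bits_to_e5m2(uint16_t h, bool saturate) {
    const uint16_t mag = h & 0x7fffu;
    if (mag > 0x7c00u) {
        // NaN: keep the sign and exponent, force a mantissa bit so it stays a NaN.
        return (uint8_t)((h >> 8) | 0x02u);
    }
    // Drop the low 8 mantissa bits, rounding halfway cases to the even result.
    const uint16_t rounded = (uint16_t)(mag + 0x7fu + ((mag >> 8) & 1u));
    uint8_t out = (uint8_t)(rounded >> 8);
    if (saturate && out >= 0x7cu) {
        out = 0x7bu;  // largest normal value, 57344
    }
    return (uint8_t)(out | ((h & 0x8000u) >> 8));
}
```

For example, the largest f16 value (0x7bff, 65504) rounds up past 57344, so this sketch returns 0x7c (infinity) without saturation and 0x7b (57344) with it.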

e5m2fnuz (1.5.2)

Bit layout: sign (bit 7), exponent (5 bit, bits 6 to 2), mantissa (2 bit, bits 1 to 0)

Hex Value
Smallest value (Denormal) 0x01 0.00000762939453125
Largest value (Denormal) 0x03 0.00002288818359375
Smallest value (Normal) 0x04 0.000030517578125
Largest value (Normal) 0x7f 57344
Smallest value > 1 0x41 1.25
Largest value < 1 0x3f 0.875
Closest value to π 0x46 3 (Δ ≈ 1.416x10⁻¹)
Largest sequential integer 0x4c 8

FNUZ (Float NaN Unsigned Zero) variant of e5m2.

Supported primarily today by AMD hardware (MI300).

Uses a bias of 16 (1 more than e5m2) for the exponent.

Has no dedicated infinity, and the single NaN is encoded as 0x80 (the bit pattern that would otherwise be -0). Compared to e5m2, dropping ±infinity, -0 and the 6 NaN bit patterns in favour of that single NaN gives it 8 more unique values.

As there is no infinity to overflow to, values beyond the maximum magnitude always saturate.
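
The bias and NaN rules above can be sketched as a small decoder. `decode_fnuz` is a hypothetical helper; it takes the field widths as parameters so the same code also covers e4m3fnuz (bias 8), and it assumes the only special bit pattern is 0x80.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: decode an 8-bit FNUZ value (e5m2fnuz: 5, 2; e4m3fnuz: 4, 3).
double decode_fnuz(uint8_t bits, int exp_bits, int man_bits) {
    assert(exp_bits >= 1 && man_bits >= 1 && exp_bits + man_bits == 7);
    if (bits == 0x80) {
        return NAN;  // the would-be -0 is the single NaN
    }
    const int man = bits & ((1 << man_bits) - 1);
    const int exp = (bits >> man_bits) & ((1 << exp_bits) - 1);
    const int bias = 1 << (exp_bits - 1);  // 16 for e5m2fnuz, 8 for e4m3fnuz

    // Every exponent value, including all-ones, encodes a finite number.
    const double magnitude =
        (exp == 0) ? std::ldexp((double)man, 1 - bias - man_bits)
                   : std::ldexp((double)(man + (1 << man_bits)), exp - bias - man_bits);
    return (bits & 0x80) ? -magnitude : magnitude;
}
```

With this sketch, `decode_fnuz(0x7f, 5, 2)` gives 57344 and `decode_fnuz(0x7f, 4, 3)` gives 240, the largest normal values listed for the two FNUZ formats.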

e4m3 (OCP-E4M3, 1.4.3)

Bit layout: sign (bit 7), exponent (4 bit, bits 6 to 3), mantissa (3 bit, bits 2 to 0)

Hex Value
Smallest value (Denormal) 0x01 0.001953125
Largest value (Denormal) 0x07 0.013671875
Smallest value (Normal) 0x08 0.015625
Largest value (Normal) 0x7e 448
Smallest value > 1 0x39 1.125
Largest value < 1 0x37 0.9375
Closest value to π 0x45 3.25 (Δ ≈ 1.084x10⁻¹)
Largest sequential integer 0x58 16

Part of the OFP8 specification, targeting machine learning.

Has higher precision than e5m2, making it useful for activations and weights.

Has no dedicated infinity, with NaN being 0xff / 0x7f.
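
Because only one mantissa pattern in the all-ones exponent is reserved, e4m3 decodes slightly differently from an IEEE-style format. A minimal sketch, with `decode_e4m3` being a hypothetical helper and the bias of 7 taken from the table values above:

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: decode an OCP e4m3 byte (bias 7, no infinity).
double decode_e4m3(uint8_t bits) {
    const int man = bits & 0x07;
    const int exp = (bits >> 3) & 0x0f;
    if (exp == 0x0f && man == 0x07) {
        return NAN;  // 0x7f and 0xff are the only NaN encodings
    }
    // Exponent 15 with mantissa 0..6 is an ordinary normal value (up to 448).
    const double magnitude =
        (exp == 0) ? std::ldexp((double)man, -9)             // man / 8 * 2^(1 - 7)
                   : std::ldexp((double)(8 + man), exp - 10);  // (1 + man / 8) * 2^(exp - 7)
    return (bits & 0x80) ? -magnitude : magnitude;
}
```

Here `decode_e4m3(0x7e)` gives 448 and `decode_e4m3(0x45)` gives 3.25, matching the table.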

Like e5m2, it has saturating modes.

e4m3fnuz (1.4.3)

Bit layout: sign (bit 7), exponent (4 bit, bits 6 to 3), mantissa (3 bit, bits 2 to 0)

Hex Value
Smallest value (Denormal) 0x01 0.0009765625
Largest value (Denormal) 0x07 0.0068359375
Smallest value (Normal) 0x08 0.0078125
Largest value (Normal) 0x7f 240
Smallest value > 1 0x41 1.125
Largest value < 1 0x3f 0.9375
Closest value to π 0x4d 3.25 (Δ ≈ 1.084x10⁻¹)
Largest sequential integer 0x60 16

FNUZ (Float NaN Unsigned Zero) variant of e4m3.

Supported primarily today by AMD hardware (MI300).

Uses a bias of 8 (1 more than e4m3) for the exponent.

Has no dedicated infinity, and the single NaN is encoded as 0x80 (the bit pattern that would otherwise be -0). Compared to e4m3, reducing the 2 NaN values to 1 and removing -0 gives it 2 more unique values.

As there is no infinity to overflow to, values beyond the maximum magnitude always saturate.

e8m0 (OCP-MX-E8M0, 0.8.0)

Bit layout: exponent (8 bit, bits 7 to 0)

Hex Value
Smallest value 0x00 5.877471754111438e-39
Largest value 0xfe 1.7014118346046923e+38
Smallest value > 1 0x80 2
Largest value < 1 0x7e 0.5
Closest value to π 0x81 4 (Δ ≈ 8.584x10⁻¹)

Part of the OCP-MX specification, targeting machine learning and designed to be used as part of a block with a shared scaling factor.

Only represents positive values that are exact powers of 2, ranging from 2⁻¹²⁷ (0x00) to 2¹²⁷ (0xfe), with NaN being specifically at 0xff.

Has no dedicated Infinity or 0.
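
Decoding is about as simple as it gets. A minimal sketch (`decode_e8m0` is a hypothetical name):

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: e8m0 is a bare biased exponent, value = 2^(bits - 127).
double decode_e8m0(uint8_t bits) {
    if (bits == 0xff) {
        return NAN;  // the only special encoding
    }
    return std::ldexp(1.0, (int)bits - 127);
}
```

So 0x7f decodes to exactly 1, 0x80 to 2 and 0x7e to 0.5.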

6bit formats

e3m2 (OCP-MX-E3M2, 1.3.2)

Bit layout: sign (bit 5), exponent (3 bit, bits 4 to 2), mantissa (2 bit, bits 1 to 0)

Hex Value
Smallest value (Denormal) 0x01 0.0625
Largest value (Denormal) 0x03 0.1875
Smallest value (Normal) 0x04 0.25
Largest value (Normal) 0x1f 28
Smallest value > 1 0x0d 1.25
Largest value < 1 0x0b 0.875
Closest value to π 0x12 3 (Δ ≈ 1.416x10⁻¹)
Largest sequential integer 0x18 8

Part of the OCP-MX specification, targeting machine learning and designed to be used as part of a block with a shared scaling factor.

e3m2 relates to e2m3 much like e5m2 relates to e4m3: it gives up precision for a larger range.

Has no dedicated Infinity or NaN.

e2m3 (OCP-MX-E2M3, 1.2.3)

Bit layout: sign (bit 5), exponent (2 bit, bits 4 to 3), mantissa (3 bit, bits 2 to 0)

Hex Value
Smallest value (Denormal) 0x01 0.125
Largest value (Denormal) 0x07 0.875
Smallest value (Normal) 0x08 1
Largest value (Normal) 0x1f 7.5
Smallest value > 1 0x09 1.125
Largest value < 1 0x07 0.875
Closest value to π 0x15 3.25 (Δ ≈ 1.084x10⁻¹)
Largest sequential integer 0x1e 7

Part of the OCP-MX specification, targeting machine learning and designed to be used as part of a block with a shared scaling factor.

Has more precision than e3m2, but has less range.

Has no dedicated Infinity or NaN.

4bit formats

e2m1 (OCP-MX-E2M1, 1.2.1)

s000 s001 s010 s011 s100 s101 s110 s111
s=0 0 0.5 1 1.5 2 3 4 6
s=1 −0 −0.5 −1 −1.5 −2 −3 −4 −6

Part of the OCP-MX specification, targeting machine learning and designed to be used as part of a block with a shared scaling factor.

Despite its small size, it achieves impressive accuracy scores relative to e4m3 formats when used as part of a microscaling block.

Has no dedicated Infinity or NaN.
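
To show what "a block with a shared scaling factor" means in practice, here's a minimal sketch that decodes one block of e2m1 values scaled by a single e8m0 exponent. The block size of 32 follows the OCP-MX specification; the packing (two values per byte, low nibble first) and all of the names are assumptions made for this example.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int kBlockSize = 32;  // elements sharing one e8m0 scale

// Decode a single e2m1 nibble using the value table above.
float decode_e2m1(uint8_t nibble) {
    assert(nibble < 16);
    static const float kMagnitude[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};
    const float magnitude = kMagnitude[nibble & 0x7];
    return (nibble & 0x8) ? -magnitude : magnitude;
}

// Hypothetical helper: decode a block of 32 e2m1 values with a shared e8m0 scale.
// Assumed packing: element 2i in the low nibble of byte i, element 2i+1 in the high nibble.
std::array<float, kBlockSize> decode_mx_e2m1_block(
        uint8_t scale_e8m0, const std::array<uint8_t, kBlockSize / 2>& packed) {
    // e8m0: 2^(bits - 127), with 0xff meaning NaN (which poisons the whole block).
    const float scale = (scale_e8m0 == 0xff) ? NAN : std::ldexp(1.0f, (int)scale_e8m0 - 127);
    std::array<float, kBlockSize> out{};
    for (int i = 0; i < kBlockSize; ++i) {
        const uint8_t byte = packed[i / 2];
        const uint8_t nibble = (i & 1) ? (uint8_t)(byte >> 4) : (uint8_t)(byte & 0x0f);
        out[i] = decode_e2m1(nibble) * scale;
    }
    return out;
}
```

With a scale of 0x80 (2) and a first byte of 0x7f, the first two decoded elements are -12 and 12: the nibbles 0xf and 0x7 are -6 and 6 before scaling.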

binary4p2sf

The same as e2m1, but -0 is replaced with NaN.

s000 s001 s010 s011 s100 s101 s110 s111
s=0 0 0.5 1 1.5 2 3 4 6
s=1 NaN −0.5 −1 −1.5 −2 −3 −4 −6

binary4p2se

The same as binary4p2sf, but ±6 is replaced with ±Inf.

s000 s001 s010 s011 s100 s101 s110 s111
s=0 0 0.5 1 1.5 2 3 4 +∞
s=1 NaN −0.5 −1 −1.5 −2 −3 −4 −∞

Other

TensorFloat-32 (1.8.10)

Bit layout: sign (bit 18), exponent (8 bit, bits 17 to 10), mantissa (10 bit, bits 9 to 0)

Hex Value
Smallest value (Denormal) 0x00001 1.1479437019748901e-41
Largest value (Denormal) 0x003ff 1.1743464071203126e-38
Smallest value (Normal) 0x00400 1.1754943508222875e-38
Largest value (Normal) 0x3fbff 3.4011621342146535e+38
Smallest value > 1 0x1fc01 1.0009765625
Largest value < 1 0x1fbff 0.99951171875
Closest value to π 0x20248 3.140625 (Δ ≈ 9.677x10⁻⁴)
Largest sequential integer 0x22800 2048

A shortened IEEE 754 single-precision floating-point format: it has the same exponent width as an f32, but fewer mantissa bits.

Widely used in NVIDIA Ampere/Hopper training, designed specifically for tensor cores.
Despite having 32 in the name, it is actually 19 bits wide.

Can be converted into an f32 by simply shifting.

float f = asfloat((uint32_t)tf32_bits << 13);

E5B9G9R9 (0.5.9)

Bit layout: exponent (5 bit, bits 13 to 9), mantissa (9 bit, bits 8 to 0)

Hex Value
Smallest value (Denormal) 0x0001 1.1920928955078125e-7
Largest value (Denormal) 0x01ff 0.00006091594696044922
Smallest value (Normal) 0x0200 0.00006103515625
Largest value (Normal) 0x3dff 65472
Smallest value > 1 0x1e01 1.001953125
Largest value < 1 0x1dff 0.9990234375
Closest value to π 0x2124 3.140625 (Δ ≈ 9.677x10⁻⁴)
Largest sequential integer 0x3200 1024

A shortened, unsigned IEEE 754 half-precision binary floating-point format: it has the same exponent width as an f16, but fewer mantissa bits.

Colour values are stored as mantissa bits with a shared exponent.

Often used as an HDR texture format (although largely replaced by BC6).

The reasoning behind this format is that perceptually the human eye is a lot better at picking out differences in luminance than it is hue changes.
The same theory applies to formats like YCbCr, where it's not uncommon to transmit Y (luma) at a higher resolution than the chroma components (CbCr).
This idea can also be seen when storing baked lighting as L1 spherical harmonics. Rather than storing each channel as 4 coefficients, it's pretty common to construct a weighted average and normalise against L0, with additional coefficients for R, G and B (12 floats becomes 6).

Can be converted into an f16 by simply shifting.

half h = ashalf((uint16_t)e5m9_bits << 1);

R11G11B10 (0.5.6 / 0.5.5)

11-bit channels (0.5.6) bit layout: exponent (5 bit, bits 10 to 6), mantissa (6 bit, bits 5 to 0)

Hex Value
Smallest value (Denormal) 0x001 9.5367431640625e-7
Largest value (Denormal) 0x03f 0.00006008148193359375
Smallest value (Normal) 0x040 0.00006103515625
Largest value (Normal) 0x7bf 65024
Smallest value > 1 0x3c1 1.015625
Largest value < 1 0x3bf 0.9921875
Closest value to π 0x425 3.15625 (Δ ≈ 1.466x10⁻²)
Largest sequential integer 0x580 128

10-bit channel (0.5.5) bit layout: exponent (5 bit, bits 9 to 5), mantissa (5 bit, bits 4 to 0)

Hex Value
Smallest value (Denormal) 0x001 0.0000019073486328125
Largest value (Denormal) 0x01f 0.0000591278076171875
Smallest value (Normal) 0x020 0.00006103515625
Largest value (Normal) 0x3df 64512
Smallest value > 1 0x1e1 1.03125
Largest value < 1 0x1df 0.984375
Closest value to π 0x212 3.125 (Δ ≈ 1.659x10⁻²)
Largest sequential integer 0x2a0 64

A shortened, unsigned IEEE 754 half-precision binary floating-point format: it has the same exponent width as an f16, but fewer mantissa bits.

Often used in place of RGBA16f for render targets.

Can be converted into an f16 by simply shifting.

half h10 = ashalf((uint16_t)u10_bits << 5);
half h11 = ashalf((uint16_t)u11_bits << 4);
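
Putting those two shifts together, here's a minimal sketch that unpacks a whole 32-bit texel into three floats. The names are hypothetical, and the channel order (R in bits 0 to 10, G in bits 11 to 21, B in bits 22 to 31) is an assumption based on how DXGI_FORMAT_R11G11B10_FLOAT lays out its channels; other APIs may differ. The f16 decode is done in software so the example doesn't depend on a native half type.

```cpp
#include <cmath>
#include <cstdint>

struct Rgb { float r, g, b; };

// Decode an f16 bit pattern (1.5.10, bias 15) to a float.
float f16_bits_to_float(uint16_t h) {
    const int man = h & 0x3ff;
    const int exp = (h >> 10) & 0x1f;
    float magnitude;
    if (exp == 0x1f) {
        magnitude = man ? NAN : INFINITY;
    } else if (exp == 0) {
        magnitude = std::ldexp((float)man, -24);             // denormal
    } else {
        magnitude = std::ldexp((float)(man | 0x400), exp - 25);  // implicit leading 1
    }
    return (h & 0x8000) ? -magnitude : magnitude;
}

// Hypothetical helper: unpack R11G11B10 (assumed layout: R low, B high).
Rgb unpack_r11g11b10(uint32_t packed) {
    const uint16_t r11 = (uint16_t)(packed & 0x7ffu);
    const uint16_t g11 = (uint16_t)((packed >> 11) & 0x7ffu);
    const uint16_t b10 = (uint16_t)((packed >> 22) & 0x3ffu);
    // Shift each channel up so its exponent lines up with f16's; the sign bit stays 0.
    return Rgb{
        f16_bits_to_float((uint16_t)(r11 << 4)),
        f16_bits_to_float((uint16_t)(g11 << 4)),
        f16_bits_to_float((uint16_t)(b10 << 5)),
    };
}
```

For instance, packing R = 0x3c1, G = 0x7bf and B = 0x1e1 and unpacking with this sketch gives 1.015625, 65024 and 1.03125, matching the table entries above.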