Preserving Small Values

Background

It's not uncommon to want a higher level of precision for smaller values. A good example of this is luminance: our eyes can detect subtle changes in low light much better than in a bright room. There is a reason modern computers use IEEE 754 floats rather than fixed point: floats do this (very cleverly) by design, allowing us to describe both very small numbers and very big numbers with a relative amount of accuracy.

In the world of GPU rendering, memory constraints are a very real thing. While it would be nice to use float32 (even float16 would be nice) for every texture or buffer, it isn't practical given bandwidth and the VRAM actually available.

Usually you'll end up storing values between 0 and 1 in an 8-bit UNORM texture (and if it's not a runtime texture, you'll have block compression applied on top). This is always stored linearly, which is often fine, but what about when it's not?

You could try using log2 and exp2:

float4 encode(float4 x)
{
    return log2(x + 1);     // 1 x full rate + 1 x half rate
}

float4 decode(float4 x)
{
    return exp2(x) - 1;     // 1 x full rate + 1 x quarter rate
}

A better and cheaper solution is to sqrt the value when encoding and square it upon read:

float4 encode(float4 x)
{
    return sqrt(x);         // 1 x quarter rate
}

float4 decode(float4 x)
{
    return x * x;           // 1 x full rate
}

If you want to push precision further for smaller values, you may attempt to use pow, which starts to get a bit expensive:

float4 encode(float4 x)
{
    return pow(x, 1.0 / 2.5);   // 1 x full rate + 1 x half rate + 1 x quarter rate
}

float4 decode(float4 x)
{
    return pow(x, 2.5);         // 1 x full rate + 1 x half rate + 1 x quarter rate
}

So what other less travelled alternatives are there?

Approximating Roots

If using pow is a bit on the expensive side, what if we approximated it? We don't necessarily care if the approximation is fully accurate, as long as it gets us in the right ball park and we can reverse it.

Most (graphics) people have, at one point or another, come across the legendary fast inverse square root:

float q_rsqrt(float number)
{
  long i;
  float x2, y;
  const float threehalfs = 1.5F;

  x2 = number * 0.5F;
  y  = number;
  i  = * ( long * ) &y;                       // evil floating point bit level hacking
  i  = 0x5f3759df - ( i >> 1 );               // what the fuck?
  y  = * ( float * ) &i;
  y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
  // y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

  return y;
}

The normal response is somewhere between horror, confusion and admiration. As it turns out, Hacker's Delight has a section (17.4) explaining how this works.

While we aren't concerned with solving negative exponents, we can use the same methods to solve positive ones.

(musl's cbrt implementation uses the same technique, but never got the appreciation it deserves.)

We aren't going to be doing any Newton or Halley iterations; they aren't really relevant to us, and they make everything messy if we want to use fractional powers.


We first derive the exponent via uint(127 - 127 / power) << 23, leaving us to solve the 23-bit mantissa.
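
As a quick sanity check, plugging the cube root into that formula lands exactly on the top bits of the magic number we end up solving later:

// power = 3 (cube root)
uint exponentPart = uint(127.0 - 127.0 / 3.0) << 23;
// uint(84.666...) << 23 == 84u << 23 == 0x2a000000, which matches
// the upper bits of the solved magic 0x2a50a29d below.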

This can be done by probing the maximum ulp error for different mantissa values and keeping whichever one has the least error.

It might sound like this requires an enormous amount of computation, but it doesn't.

When determining the max ulp error, you don't need to consider every possible 32-bit float; evaluating every mantissa value within a single binade is enough (so 0.5 -> 1.0 or 1.0 -> 2.0).
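
As a sketch of that probe (the helper name is mine, and I'm using pow as the reference):

// Max ulp error of a candidate magic number, measured over a single
// binade [1.0, 2.0). Every mantissa value gets visited exactly once.
uint maxUlpError(uint magic)
{
    uint worst = 0u;
    for (uint m = 0u; m < (1u << 23); ++m)
    {
        float x = asfloat(0x3f800000u | m);   // 1.0 <= x < 2.0
        // The same bit-hack encode we arrive at later
        float approx = asfloat(uint(float(asuint(x)) / 3.0) + magic);
        uint a = asuint(approx);
        uint e = asuint(pow(x, 1.0 / 3.0));   // reference cbrt
        worst = max(worst, max(a, e) - min(a, e));
    }
    return worst;
}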

Additionally, you don't need to solve the max ulp error for every possible value; you can do something similar to a binary search. For reasons I can't fully explain, every power has a value it clearly converges towards (so much so that it may be possible to derive it analytically).

The actual way I do this is through a series of passes.

Explanation of passes
Each pass picks up where the last one left off,
solving (nsearchbits - 1) bits each time, except for
the last pass, which solves all nsearchbits.
A bit confusing, but hopefully the example below will
make this a bit clearer:


If the first pass searches 8 bits, it'll test out
    00000000000000000000000 => 11111111000000000000000

Let's say the value with the minimum error was:
    00111000000000000000000

We will now limit our search +/- 00000001000000000000000
    00110111000000000000000 => 00111001000000000000000

Which leaves us with 2^16 values, and 16 bits to solve,
abstractly telling us we have solved 7 bits.

---

If the second pass searches 6 bits, it will test
    00110111000000000000000 + [0000000000000000 => 1111110000000000]

Let's say the value with the minimum error was:
    00110111101010000000000

We will now limit our search +/- 0000010000000000
    00110111101000000000000 => 00110111101100000000000

Which leaves us with 2^11 values, and 11 bits to solve,
telling us we have solved a further 5 bits.

---

This carries on until we reach the point where we're
comparing the least significant bits, at which point
our magic number has been solved!
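
Put together, the whole search looks roughly like this (a sketch with a fixed search width, reusing the maxUlpError probe from above):

uint solveMagic()
{
    uint lo = uint(127.0 - 127.0 / 3.0) << 23; // exponent part, 0x2a000000
    uint range = 1u << 23;                     // 23 mantissa bits to solve
    uint searchBits = 8u;                      // bits tested per pass

    while (true)
    {
        uint step = max(range >> searchBits, 1u);
        uint best = lo;
        uint bestErr = 0xffffffffu;
        for (uint c = lo; c < lo + range; c += step)
        {
            uint err = maxUlpError(c);
            if (err < bestErr) { bestErr = err; best = c; }
        }
        if (step == 1u)
            return best;  // final pass visited every remaining value
        lo = best - step; // re-centre the window +/- one step
        range = step * 2u;
    }
}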

Here's how the passes play out when solving the cube root 4 bits at a time; each pass plots the max ulp error (lower is better) across its set of normalised offsets. The ranges searched were:

Pass 1: 0x2a000000 - 0x2a780000
Pass 2: 0x2a480000 - 0x2a570000
Pass 3: 0x2a500000 - 0x2a51e000
Pass 4: 0x2a508000 - 0x2a50bc00
Pass 5: 0x2a50a000 - 0x2a50a780
Pass 6: 0x2a50a200 - 0x2a50a2f0
Pass 7: 0x2a50a290 - 0x2a50a2ae
Pass 8: 0x2a50a29a - 0x2a50a29d

Here is an example of the solved cube root:

float encode(float x)
{
    // 4 x full rate
    float y = float(asuint(x));
    y = y * 0.3333333333333333f;
    uint z = uint(y);
    return asfloat(z + 0x2a50a29du);
}
float decode(float x)
{
    // 4 x full rate
    uint z = asuint(x) - 0x2a50a29du;
    float y = float(z);
    y = y * 3.0f;
    return asfloat(uint(y));
}

We can actually reduce the instruction count to 3 if we're willing to evaluate the magic number as a float, for a very, very small amount of precision error (709927552.0f is the magic 0x2a50a29d = 709927581 rounded to the nearest representable float32, and 2129782656.0f is exactly 3x it):

float encode(float x)
{
    // 3 x full rate
    float y = float(asuint(x));
    y = y * 0.3333333333333333f
          + 709927552.0f;
    return asfloat(uint(y));
}
float decode(float x)
{
    // 3 x full rate
    float y = float(asuint(x));
    y = y * 3.0f
          - 2129782656.0f;
    return asfloat(uint(y));
}

The precision error can be seen as the passes begin to break down at later stages:

Pass 5: 0x2a50a000 - 0x2a50a780
Pass 6: 0x2a50a200 - 0x2a50a2f0
Pass 7: 0x2a50a270 - 0x2a50a28e
Pass 8: 0x2a50a27e - 0x2a50a281


Making Tiny Floats

Floats already do a pretty good job of preserving small values, so can we use this to our advantage?

As it turns out, yes! If we logically split an 8-bit number to have a few bits dedicated to the exponent and the rest to the mantissa, we should be able to arrive at the same result.

For example, if we stored a 3-bit exponent and 5-bit mantissa, we would be able to represent: 2^[-9, -1] * (1 + [0, 31/32])

Now manually extracting bits, handling rounding and representing 1.0 doesn't seem particularly simple, but we can use the power of floats to do all of this for us!

Let's take our example of wanting a 3-bit exponent: we just multiply our [0, 1] number by the biggest float32 value whose exponent field fits in 3 bits. (This happens to be 1.50463267937e-36, or 67108863 [0x3ffffff] when treated as an integer.)

The resulting number, viewed as an integer, then scales from 0 -> 67108863 with all the floating point rules applied. From here, simply cast back to float and divide by 67108863, giving us a normalised number.

(A float within a float if you will).
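
As a sketch (my arrangement of the steps just described, which should reproduce the outputs below):

float encode(float x)
{
    // 3 x full rate: scale into the tiny float's range, read the bits
    // back as an integer, then normalise by the largest integer value.
    float y = x * asfloat(0x3ffffffu);  // 1.50463267937e-36
    return float(asuint(y)) * (1.0 / 67108863.0);
}

float decode(float x)
{
    // 3 x full rate: the exact reverse, scale back up to an integer,
    // then reinterpret the bits as a float and remove the tiny scale.
    uint y = uint(x * 67108863.0);
    return asfloat(y) * (1.0 / asfloat(0x3ffffffu));
}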

Here are some examples of the output:

    0    => 0
    0.05 => 0.4499999947845935
    0.1  => 0.5749999966472387
    0.15 => 0.650000000745058
    0.2  => 0.6999999985098838
    0.25 => 0.7499999962747096
    0.3  => 0.7750000026077033
    0.35 => 0.7999999940395355
    0.4  => 0.8250000003725291
    0.45 => 0.8500000067055227
    0.5  => 0.8749999981373549
    0.55 => 0.8875000013038516
    0.6  => 0.9000000044703484
    0.65 => 0.9125000076368452
    0.7  => 0.9249999959021806
    0.75 => 0.9374999990686774
    0.8  => 0.9500000022351742
    0.85 => 0.962500005401671
    0.9  => 0.9750000085681678
    0.95 => 0.9874999968335032
    1    => 1

Both encoding and decoding can be done with 3 full rate instructions, and if we want to handle [-1, 1] this can be done at the cost of two extra instructions.
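
The signed case isn't spelled out here, but one plausible way to spend those two extra instructions is to carry the sign bit across unchanged:

float encodeSigned(float x)
{
    // Peel the sign off, run the [0, 1] encode on the absolute value,
    // then OR the sign back in (the two extra ALU instructions).
    uint signBit = asuint(x) & 0x80000000u;
    float y = abs(x) * asfloat(0x3ffffffu);
    uint bits = asuint(float(asuint(y)) * (1.0 / 67108863.0));
    return asfloat(bits | signBit);
}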

What's sort of neat is that we can introduce the concept of partial exponent bits. If we take 2.5 as the number of exponent bits, simply lerp between (1 << (23 + 2)) - 1 and (1 << (23 + 3)) - 1.
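
My reading of that, as a sketch (the constants should fold at compile time):

float encode25(float x)
{
    // 2.5 exponent bits: halfway between the 2 bit and 3 bit
    // "all bits set" constants.
    float maxInt = lerp(float((1u << (23u + 2u)) - 1u),
                        float((1u << (23u + 3u)) - 1u), 0.5);
    float y = x * asfloat(uint(maxInt));
    return float(asuint(y)) * (1.0 / maxInt);
}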



Finally Actually Packing Bits

Once you've chosen whichever method seems most appropriate, you may wish to actually pack your data inside a single uint.

For example, maybe you want to pack RGB into a uint32, with the following layout:

struct PackedRgb
{
    uint R:11;
    uint G:12;
    uint B:9;
};

The most common way you'd see this done is something like:

uint pack(float3 rgb)
{
    uint R = uint(rgb.x * 2047.0 + 0.5); // (1 << 11) - 1
    uint G = uint(rgb.y * 4095.0 + 0.5); // (1 << 12) - 1
    uint B = uint(rgb.z * 511.0 + 0.5);  // (1 << 9) - 1
    return R | (G << 11) | (B << 23);
}

float3 unpack(uint packed)
{
    float R = float(packed         & 2047u) * 1.0/2047.0;
    float G = float((packed >> 11) & 4095u) * 1.0/4095.0;
    float B = float((packed >> 23) & 511u)  * 1.0/511.0;
    return float3(R, G, B);
}

Each component requires 2 full rate instructions to be converted into its integer form:

    // 1 x mad
    // 1 x cvt f32 to u32
    uint R = uint(rgb.x * 2047.0 + 0.5);

Optimization For Encoding / Decoding using Denormals


⚠ Denormal support (at least with WebGL) is spotty at best ⚠

Massive thanks to the lovely people at shadertoy for doing a little survey: Does Your GPU Support Denormals?


Encoding

Assuming the target architecture handles denormals correctly, we can reduce the number of instructions from 2 to 1:

    // 1 x mul
    uint R = asuint(rgb.x * asfloat(2047u));

This works by once again exploiting the fact that the mantissa behaves linearly, and it works for up to 23 bits.

uint pack(float3 rgb)
{
    uint R = asuint(rgb.x * asfloat(2047u)); // (1 << 11) - 1
    uint G = asuint(rgb.y * asfloat(4095u)); // (1 << 12) - 1
    uint B = asuint(rgb.z * asfloat(511u));  // (1 << 9) - 1
    return R | (G << 11) | (B << 23);
}

Theoretical Optimization For Decoding

Unfortunately, we can't simply do the inverse of the above, since by definition of being a denormal, it has no representable inverse.

We can however shift our number such that the most significant bit lies on the least significant exponent bit.

e.g:

8 bits = 0b11111111

s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm
0 00000000 00000000000000011111111

value << (24 - 8)

s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm
0 00000001 11111110000000000000000

This can be inverted, and since this is the first non-denormalised range, the value is still totally linear.

The reason this is only theoretically an optimization is that if you were unpacking bits from a single uint, you'd already be doing a shift and mask; you're now just changing that shift to land at a new bit position.

At the very least, it is never more instructions than the baseline, with the main optimization coming from removing a cast to float:

float unpackNorm(uint src, uint shift, uint numbits)
{
    // These should be evaluated statically at compile-time
    uint maxValue = (1u << numbits) - 1u;
    uint floatShift = 24u - numbits;
    uint mask = maxValue << floatShift;
    float norm = 1.0 / asfloat(mask);

    // This branch should fold away, leaving a
    // single constant shift.
    if(floatShift < shift)
    {
        src >>= (shift - floatShift);
    }
    else
    {
        src <<= (floatShift - shift);
    }

    src &= mask;
    return asfloat(src) * norm;
}


float3 unpack(uint packed)
{
    float R = unpackNorm(packed, 0u, 11u);
    float G = unpackNorm(packed, 11u, 12u);
    float B = unpackNorm(packed, 23u, 9u);
    return float3(R, G, B);
}