Simplifying Drawing Numbers On The GPU

Background

A little while ago, I wanted a way to easily display a bunch of debugging numbers while rendering (FPS, instances culled, all that sort of good stuff).

You can read the values back to the CPU and just print them to the console, but that gets messy quickly and becomes rather burdensome to sift through.

You could also probably use whatever font rendering shaders you have at hand, but that quickly becomes tedious. When all you really want is to know the value of something, suddenly you need to worry about how much buffer space to pre-allocate and the maximum number of triangles (or compute work) you may end up needing for all the digits.

Neither of these options is particularly great when all you wanted was to simply draw numbers on the GPU.

Encoding For A Simpler Pipeline

Ideally, we could encode something usable in a single uint32 and dispatch a single quad (be it two triangles or a compute range).

We know we'll need to store the digits 0 through 9, so we need at least 4 bits per digit, leaving us with an 8-character budget. 8 characters seems like more than enough for most use cases, and if we only cared about integers in [-9999999, 99999999], we could call it a day.

It would be nice to represent floats too, and since we have 16 possible values per digit but only use 10, that leaves 6 spare values we can put to good use!

I settled on the following charset:

Nibble   Character
0        0
1        1
2        2
3        3
4        4
5        5
6        6
7        7
8        8
9        9
A        e
B        .
C        +
D        -
E        #
F        (space)
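As a rough sketch of what an encoder for this format can look like, here is a minimal integer-only version in C (the function name and overflow handling are my own; the real encoders also handle floats and rounding). The lowest nibble holds the leftmost character, and unused trailing slots are padded with 0xF (space):

```c
#include <stdint.h>
#include <stdio.h>

// Nibble values for the non-digit characters in the charset above.
enum { NIB_E = 0xA, NIB_DOT = 0xB, NIB_PLUS = 0xC,
       NIB_MINUS = 0xD, NIB_HASH = 0xE, NIB_SPACE = 0xF };

// Encode an integer into the 8-nibble format. The lowest nibble holds
// the leftmost character; trailing slots are filled with 0xF (space).
// Hypothetical sketch: on overflow we reuse the "#.#" encoding (0xfffffebe).
uint32_t encode_int(int32_t value)
{
    uint32_t out = 0xFFFFFFFFu; // all spaces
    char buf[12];
    int len = snprintf(buf, sizeof buf, "%d", value);
    if (len > 8) return 0xFFFFFEBEu; // doesn't fit: "#.#"
    for (int i = 0; i < len; ++i) {
        uint32_t nib = (buf[i] == '-') ? NIB_MINUS : (uint32_t)(buf[i] - '0');
        out = (out & ~(0xFu << (4 * i))) | (nib << (4 * i));
    }
    return out;
}
```

For example, encode_int(-123) packs '-', '1', '2', '3' into the four lowest nibbles, producing 0xffff321d as in the table below.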

And went about writing encoders in a few different languages:

And here are a few examples:

Input        Encoded      Decoded
0            0xfffff0b0   0.0
1.5          0xfffff5b1   1.5
-123         0xffff321d   -123
132412210    0x8ca423b1   1.324e+8
0.012456     0xf654210b   .012456
-0.00001421  0x5da24b1d   -1.42e-5
59.84512     0x21548b95   59.84512
9.99999999   0xffffff01   10
∞            0x9999ca9c   +9e+9999
-∞           0x9999ca9d   -9e+9999
NaN          0xfffffebe   #.#
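To make the packing concrete, a small C decoder (my own sketch, not one of the encoders above) that reads the lowest nibble as the leftmost character reproduces the "Decoded" column:

```c
#include <stdint.h>

// Charset indexed by nibble value; 0xF maps to a space.
static const char kCharset[17] = "0123456789e.+-# ";

// Decode the lowest-nibble-first packing into an 8-character string.
// (The nibble at bits [0..3] is the leftmost character on screen.)
void decode_number(uint32_t encoded, char out[9])
{
    for (int i = 0; i < 8; ++i)
        out[i] = kCharset[(encoded >> (4 * i)) & 0xFu];
    out[8] = '\0';
}
```

Running it on 0x8ca423b1 yields the nibbles 1, B, 3, 2, 4, A, C, 8 from low to high, i.e. "1.324e+8".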

And here's a little live demo using the C/WASM version + WebGL2:



Additionally, this solves the problem of dispatching: since it's a fixed uint32, we can effectively sample it using UVs. Here is an example using HLSL:

//// From number_encoding.hlsli

// GL = Y starts at the bottom
// DX = Y starts at the top
#ifndef Y_STARTS_AT_BOTTOM
#define Y_STARTS_AT_BOTTOM 0
#endif


// .###. ..#.. .###. ##### #...# ##### .#### ##### .###. .###.
// #..## .##.. #...# ....# #...# #.... #.... ....# #...# #...#
// #.#.# ..#.. ...#. ..##. #...# ####. ####. ...#. .###. #...#
// ##..# ..#.. ..#.. ....# .#### ....# #...# ..#.. #...# .####
// #...# ..#.. .#... #...# ....# ....# #...# ..#.. #...# ....#
// .###. .###. ##### .###. ....# ####. .###. ..#.. .###. .###.
//
// ..... ..... ..... ..... ..... .....
// .###. ..... ..... ..... .#.#. .....
// #...# ..... ..#.. ..... ##### .....
// ##### ..... .###. .###. .#.#. .....
// #.... .##.. ..#.. ..... ##### .....
// .###. .##.. ..... ..... .#.#. .....

const static uint numberPixels[16] = {
#if !Y_STARTS_AT_BOTTOM
    0x1d19d72eu, 0x1c4210c4u, 0x3e22222eu, 0x1d18321fu,
    0x210f4631u, 0x1f083c3fu, 0x1d18bc3eu, 0x0842221fu,
    0x1d18ba2eu, 0x1d0f462eu, 0x1c1fc5c0u, 0x0c600000u,
    0x00471000u, 0x00070000u, 0x15f57d40u, 0x00000000u
#else
    0x1d9ace2eu, 0x0862108eu, 0x1d14105fu, 0x3f06422eu,
    0x2318fa10u, 0x3e17c20fu, 0x3c17c62eu, 0x3f041084u,
    0x1d17462eu, 0x1d18fa0eu, 0x00e8fc2eu, 0x000000c6u,
    0x00023880u, 0x00003800u, 0x00afabeau, 0x00000000u
#endif
};


uint sampleEncodedDigit(uint encodedDigit, float2 uv)
{
    if(uv.x < 0. || uv.y < 0. || uv.x >= 1. || uv.y >= 1.) return 0u;
    uint2 coord = uint2(uv * float2(5., 6.));
    return (numberPixels[encodedDigit] >> (coord.y * 5u + coord.x)) & 1u;
}


// 8 character variant
uint sampleEncodedNumber(uint encodedNumber, float2 uv)
{
    // Extract the digit by scaling uv.x by 8 and taking the relevant
    // 4 bits.
    uv.x *= 8.0;
    uint encodedDigit = (encodedNumber >> (uint(uv.x) * 4u)) & 0xfu;

    // Map U into the [0, 1.2] range; the extra 0.2 adds a logical
    // 1px padding. (6/5, where 5 is the number of pixels on the x axis)
    uv.x = frac(uv.x) * 1.2;

    return sampleEncodedDigit(encodedDigit, uv);
}


//// Actual shader


struct VSToPS
{
    float2 uv : ATTR0;
    uint   encoded : ATTR1;
};

float4 bgCol;
float4 fgCol;


float4 drawNumberPS(VSToPS input) : SV_TARGET
{
    uint bit = sampleEncodedNumber(input.encoded, input.uv);
    return lerp(bgCol, fgCol, float(bit));
}
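The bitmap font is easy to sanity-check offline. Here is a small C port of sampleEncodedDigit (my own sketch, using the Y-starts-at-top half of the table verbatim) that tests individual pixels and can dump a glyph as ASCII art:

```c
#include <stdint.h>
#include <stdio.h>

// The !Y_STARTS_AT_BOTTOM half of the HLSL table above, copied verbatim.
static const uint32_t kNumberPixels[16] = {
    0x1d19d72eu, 0x1c4210c4u, 0x3e22222eu, 0x1d18321fu,
    0x210f4631u, 0x1f083c3fu, 0x1d18bc3eu, 0x0842221fu,
    0x1d18ba2eu, 0x1d0f462eu, 0x1c1fc5c0u, 0x0c600000u,
    0x00471000u, 0x00070000u, 0x15f57d40u, 0x00000000u
};

// CPU port of sampleEncodedDigit: bit (y*5 + x) of the glyph word,
// with y = 0 at the top row.
int glyph_pixel(uint32_t digit, int x, int y)
{
    return (int)((kNumberPixels[digit & 0xFu] >> (y * 5 + x)) & 1u);
}

// Dump a 5x6 glyph in the same dot/hash style as the comments above.
void print_glyph(uint32_t digit)
{
    for (int y = 0; y < 6; ++y) {
        for (int x = 0; x < 5; ++x)
            putchar(glyph_pixel(digit, x, y) ? '#' : '.');
        putchar('\n');
    }
}
```

For instance, the top row of the '1' glyph is "..#..", so only the pixel at x = 2 is set.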

And using AMD's assembly output as a loose reference:

shader main
  asic(GFX10)
  type(PS)
  sgpr_count(14)
  vgpr_count(8)
  wave_size(64)

  s_inst_prefetch  0x0003
  s_mov_b32     m0, s12
  v_interp_p1_f32  v2, v0, attr0.x
  v_interp_p1_f32  v0, v0, attr0.y
  v_interp_p2_f32  v2, v1, attr0.x
  v_interp_p2_f32  v0, v1, attr0.y
  v_mul_f32     v1, lit(0x41000000), v2
  v_cmp_lt_f32  s[0:1], v0, 0
  v_fract_f32   v2, v1
  v_cmp_le_f32  vcc, lit(0x3f555555), v2
  s_or_b64      s[0:1], s[0:1], vcc
  v_cmp_le_f32  vcc, 1.0, v0
  s_or_b64      vcc, s[0:1], vcc
  s_mov_b64     s[0:1], exec
  s_andn2_b64   exec, s[0:1], vcc
  v_cvt_u32_f32  v1, v1
  s_cbranch_execz  label_0098
  v_lshlrev_b32  v1, 2, v1
  v_interp_mov_f32  v3, p0, attr1.x
  v_lshrrev_b32  v1, v1, v3
  v_and_b32     v1, 15, v1
  tbuffer_load_format_x  v1, v1, s[8:11], 0 idxen format:[BUF_FMT_32_FLOAT]
  v_mul_f32     v0, lit(0x40c00000), v0
  v_mul_f32     v2, lit(0x40c00000), v2
  v_cvt_u32_f32  v3, v0
  v_cvt_u32_f32  v2, v2
  v_lshl_add_u32  v0, v3, 2, v3
  v_add_nc_u32  v0, v2, v0
  s_waitcnt     vmcnt(0)
  v_lshrrev_b32  v0, v0, v1
  v_and_b32     v0, 1, v0
label_0098:
  s_andn2_b64   exec, s[0:1], exec
  v_mov_b32     v0, 0
  s_mov_b64     exec, s[0:1]
  s_buffer_load_dwordx8  s[0:7], s[4:7], null
  v_cvt_f32_u32  v0, v0
  s_waitcnt     lgkmcnt(0)
  v_subrev_f32  v1, s0, s4
  v_subrev_f32  v2, s1, s5
  v_subrev_f32  v3, s2, s6
  v_subrev_f32  v4, s3, s7
  v_mad_f32     v1, v0, v1, s0
  v_mad_f32     v2, v0, v2, s1
  v_mad_f32     v3, v0, v3, s2
  v_mad_f32     v0, v0, v4, s3
  v_cvt_pkrtz_f16_f32  v1, v1, v2
  v_cvt_pkrtz_f16_f32  v0, v3, v0
  s_nop         0x0000
  s_nop         0x0000
  exp           mrt0, v1, v1, v0, v0 done compr vm

Which looks pretty reasonable.

All in all, this seems like a pretty good approach.
The full JavaScript/WebGL2 source can be viewed here: gpunumbers_webgl2.js.

Also for funsies here is a shadertoy port: https://www.shadertoy.com/view/dtjXWK