Simplifying Drawing Numbers On The GPU
Background
A little while ago, I wanted a way to easily display a bunch of debugging numbers while I was rendering (FPS, instances culled, all that sort of good stuff).
You can read the values back to the CPU and just print them to the console, but that gets messy quickly and becomes rather burdensome to sift through.
You could also use whatever font rendering shaders you have at hand, but that quickly becomes tedious: when all you really wanted was to know the value of something, you suddenly need to worry about how much buffer space to pre-allocate and the maximum number of triangles/compute threads you might end up needing for all the digits.
Neither of these options is particularly great when all you wanted to do was simply draw numbers on the GPU.
Encoding For A Simpler Pipeline
Ideally, a solution would satisfy:
- Not be a performance hog.
- Be simple to drop in.
- Use fixed buffer and dispatch sizes.
- Represent a reasonable gamut of numbers.
To this end, it would be great if we could encode something usable in a single uint32 and dispatch a single quad (be it two triangles or a compute range).
We know we'll need to store the digits 0 through 9, so we would have to use at least 4 bits per digit, leaving us with an 8-character budget. 8 characters seems like more than enough for most use cases, and if we're only talking integers in [-9999999, 99999999], we might call it a day.
It would be nice to be able to represent floats too, and since we have 16 possible values per character but are only using 10, that leaves us with 6 spare codes that we can put to good use!
I settled on the following charset: the digits 0-9, followed by 'e', '.', '+', '-', '#', and a blank (16 glyphs, one per 4-bit code).
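To make that concrete, here's a minimal sketch of what an integer encoder could look like in HLSL. This isn't the encoder from the files listed below; it assumes the charset order above (so code 13 is '-' and code 15 is the blank), packs the leftmost character into the lowest nibble (matching the sampling code further down), and left-aligns the result:
// Hedged sketch: pack an unsigned integer into the 8-character format.
// Assumes lowest nibble = leftmost character and code 15 = blank.
uint encodeUint(uint value)
{
    // Pull out the decimal digits, least significant first.
    uint digits[8];
    uint digitCount = 0u;
    do
    {
        digits[digitCount] = value % 10u;
        digitCount++;
        value /= 10u;
    } while (value != 0u && digitCount < 8u);

    // Start with every character blank (code 15 in all 8 nibbles).
    uint encoded = 0xffffffffu;
    for (uint i = 0u; i < digitCount; ++i)
    {
        // Most significant digit goes in the lowest nibble, i.e. leftmost on screen.
        uint digit = digits[digitCount - 1u - i];
        encoded &= ~(0xfu << (i * 4u));
        encoded |= digit << (i * 4u);
    }
    return encoded;
}
Signed and float values work the same way, just spending the spare codes on the sign, decimal point, and exponent.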
And went about writing encoders in a few different languages:
- C++ (original version, which is a bit more involved)
- HLSL (also includes sampling logic)
- GLSL (also includes sampling logic)
- C (primarily targeting WASM for JS)
And here are a few examples:
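As an illustration, using the left-aligned packing and character codes assumed in the sketch above (so not necessarily the exact output of the encoders listed above):
// 42     -> 0xffffff24u   (draws "42      ")
// 1234   -> 0xffff4321u   (draws "1234    ")
// "3.14" -> 0xffff41b3u   (draws "3.14    ", '.' being code 11)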
And here's a little live demo using the C/WASM version + WebGL2:
Additionally, this solves the problem of dispatching: since the encoding is a fixed uint32, we can effectively sample it using UVs. Here is an example using HLSL:
//// From number_encoding.hlsli
// GL = Y starts at the bottom
// DX = Y starts at the top
#ifndef Y_STARTS_AT_BOTTOM
#define Y_STARTS_AT_BOTTOM 0
#endif
// .###. ..#.. .###. ##### #...# ##### .#### ##### .###. .###.
// #..## .##.. #...# ....# #...# #.... #.... ....# #...# #...#
// #.#.# ..#.. ...#. ..##. #...# ####. ####. ...#. .###. #...#
// ##..# ..#.. ..#.. ....# .#### ....# #...# ..#.. #...# .####
// #...# ..#.. .#... #...# ....# ....# #...# ..#.. #...# ....#
// .###. .###. ##### .###. ....# ####. .###. ..#.. .###. .###.
//
// ..... ..... ..... ..... ..... .....
// .###. ..... ..... ..... .#.#. .....
// #...# ..... ..#.. ..... ##### .....
// ##### ..... .###. .###. .#.#. .....
// #.... .##.. ..#.. ..... ##### .....
// .###. .##.. ..... ..... .#.#. .....
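// ^ the six spare codes: 'e', '.', '+', '-', '#', and blank.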
const static uint numberPixels[16] = {
#if !Y_STARTS_AT_BOTTOM
0x1d19d72eu, 0x1c4210c4u, 0x3e22222eu, 0x1d18321fu,
0x210f4631u, 0x1f083c3fu, 0x1d18bc3eu, 0x0842221fu,
0x1d18ba2eu, 0x1d0f462eu, 0x1c1fc5c0u, 0x0c600000u,
0x00471000u, 0x00070000u, 0x15f57d40u, 0x00000000u
#else
0x1d9ace2eu, 0x0862108eu, 0x1d14105fu, 0x3f06422eu,
0x2318fa10u, 0x3e17c20fu, 0x3c17c62eu, 0x3f041084u,
0x1d17462eu, 0x1d18fa0eu, 0x00e8fc2eu, 0x000000c6u,
0x00023880u, 0x00003800u, 0x00afabeau, 0x00000000u
#endif
};
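// Each glyph is 5x6 pixels packed into the low 30 bits of its entry,
// one bit per pixel at index y * 5 + x.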
uint sampleEncodedDigit(uint encodedDigit, float2 uv)
{
    if (uv.x < 0. || uv.y < 0. || uv.x >= 1. || uv.y >= 1.) return 0u;
    uint2 coord = uint2(uv * float2(5., 6.));
    return (numberPixels[encodedDigit] >> (coord.y * 5u + coord.x)) & 1u;
}
// 8 character variant
uint sampleEncodedNumber(uint encodedNumber, float2 uv)
{
    // Extract the character code by scaling the uv.x value by 8 and masking
    // off the relevant 4 bits.
    uv.x *= 8.0;
    uint encodedDigit = (encodedNumber >> (uint(uv.x) * 4u)) & 0xfu;
    // Remap U into the [0, 1.2] range; the extra 0.2 adds a logical
    // 1px of padding.
    // (6/5, where 5 is the number of pixels on the x axis)
    uv.x = frac(uv.x) * 1.2;
    return sampleEncodedDigit(encodedDigit, uv);
}
//// Actual shader
struct VSToPS
{
    float2 uv : ATTR0;
    uint encoded : ATTR1;
};
float4 bgCol;
float4 fgCol;
float4 drawNumberPS(VSToPS input) : SV_TARGET
{
    uint pixel = sampleEncodedNumber(input.encoded, input.uv);
    return lerp(bgCol, fgCol, float(pixel));
}
And using AMD's ISA output (GFX10) for the pixel shader as a loose reference:
shader main
asic(GFX10)
type(PS)
sgpr_count(14)
vgpr_count(8)
wave_size(64)
s_inst_prefetch 0x0003
s_mov_b32 m0, s12
v_interp_p1_f32 v2, v0, attr0.x
v_interp_p1_f32 v0, v0, attr0.y
v_interp_p2_f32 v2, v1, attr0.x
v_interp_p2_f32 v0, v1, attr0.y
v_mul_f32 v1, lit(0x41000000), v2
v_cmp_lt_f32 s[0:1], v0, 0
v_fract_f32 v2, v1
v_cmp_le_f32 vcc, lit(0x3f555555), v2
s_or_b64 s[0:1], s[0:1], vcc
v_cmp_le_f32 vcc, 1.0, v0
s_or_b64 vcc, s[0:1], vcc
s_mov_b64 s[0:1], exec
s_andn2_b64 exec, s[0:1], vcc
v_cvt_u32_f32 v1, v1
s_cbranch_execz label_0098
v_lshlrev_b32 v1, 2, v1
v_interp_mov_f32 v3, p0, attr1.x
v_lshrrev_b32 v1, v1, v3
v_and_b32 v1, 15, v1
tbuffer_load_format_x v1, v1, s[8:11], 0 idxen format:[BUF_FMT_32_FLOAT]
v_mul_f32 v0, lit(0x40c00000), v0
v_mul_f32 v2, lit(0x40c00000), v2
v_cvt_u32_f32 v3, v0
v_cvt_u32_f32 v2, v2
v_lshl_add_u32 v0, v3, 2, v3
v_add_nc_u32 v0, v2, v0
s_waitcnt vmcnt(0)
v_lshrrev_b32 v0, v0, v1
v_and_b32 v0, 1, v0
label_0098:
s_andn2_b64 exec, s[0:1], exec
v_mov_b32 v0, 0
s_mov_b64 exec, s[0:1]
s_buffer_load_dwordx8 s[0:7], s[4:7], null
v_cvt_f32_u32 v0, v0
s_waitcnt lgkmcnt(0)
v_subrev_f32 v1, s0, s4
v_subrev_f32 v2, s1, s5
v_subrev_f32 v3, s2, s6
v_subrev_f32 v4, s3, s7
v_mad_f32 v1, v0, v1, s0
v_mad_f32 v2, v0, v2, s1
v_mad_f32 v3, v0, v3, s2
v_mad_f32 v0, v0, v4, s3
v_cvt_pkrtz_f16_f32 v1, v1, v2
v_cvt_pkrtz_f16_f32 v0, v3, v0
s_nop 0x0000
s_nop 0x0000
exp mrt0, v1, v1, v0, v0 done compr vm
Which looks pretty reasonable.
All in all, this seems like a pretty good approach.
The full JavaScript/WebGL2 stuff can be viewed here: gpunumbers_webgl2.js.
Also for funsies here is a shadertoy port: https://www.shadertoy.com/view/dtjXWK