2D Circular Shadow Mapping

Background

Lighting a scene efficiently is difficult in 3D, and 2D is no exception. The de-facto standard for shadows has been shadow mapping (and derivative techniques) for many years now, and that doesn't look like it will change any time soon.

For 3D, you'd typically render the scene from the perspective of a light (or if it's a directional light, a cascade of orthographic projections that cover what the player can see), storing only the depth and sampling it when it comes time to apply lighting.

We can do a similar thing for pointlights in 2D by mapping the surface of a circle to a 1D texture.

// u is the 1D texture coordinate (0 -> 1)
// forward mapping: texture coordinate -> point on the unit circle
vec2 circlePosition = vec2(cos(u * TWOPI), sin(u * TWOPI));
// inverse mapping: point on the unit circle -> texture coordinate
float u = atan2(circlePosition.y, circlePosition.x) / TWOPI;
line <=> circle

Storing information in this way has an added bonus: we can store the shadow information of multiple lights in a single 2D texture, where each light is its own row. So if we have 100 lights and choose a resolution of 512, we allocate a single 512x100 texture. This simplifies binding and opens the door to evaluating all pointlights in a single pass.
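
For example, sampling the row belonging to a given light could look like the following sketch (lightId and lightCount are hypothetical names):

// The 0.5 texel offset samples the centre of the light's row.
vec2 shadowUV = vec2(u, (float(lightId) + 0.5) / float(lightCount));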

If you have truly static lights, baking them all down, either onto the background image or as a separate overlay, is always going to be faster at runtime.

Faster atan2

You may know that atan2 is expensive, especially if it's something we plan on doing potentially multiple times per pixel. As such, we are going to want to use approximations.

This is a pretty decent one I've found, courtesy of njuffa on math.stackexchange.com.

fastAtan2
// https://math.stackexchange.com/a/1105038
float fastAtan2(float y, float x)
{
    float a = min(abs(x), abs(y)) / max(abs(x), abs(y));
    float s = a * a;
    float r = ((-0.0464964749 * s + 0.15931422) * s - 0.327622764) * s * a + a;
    if(abs(y) > abs(x)) { r = HALFPI - r; }
    if(x < 0) { r = PI - r; }
    r = multiplySign(r, y); // if(y < 0) { r = -r; }
    return r;
}
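
multiplySign isn't shown here; a minimal sketch of what it could look like, assuming it simply flips the sign of the first argument when the second is negative (which matches the inline comment above and the v_xor_b32 in the assembly further down):

// Branchless equivalent of: if(b < 0.0) { a = -a; }
// (note that -0.0 also counts as negative here).
float multiplySign(float a, float b)
{
    return uintBitsToFloat(floatBitsToUint(a) ^ (floatBitsToUint(b) & 0x80000000u));
}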

But we can do a bit better: since we also need to divide the result by 2pi, we can fold that division straight into the polynomial coefficients.

fastAtan2_div2pi
// Similar to fastAtan2, except the result is pre divided by 2pi, for the
// purpose of sampling 1d circular maps (circular shadows etc).
// https://www.geogebra.org/calculator/dms6kp8w
//
// max error ~ 0.00024531353567275316
float fastAtan2_div2pi(float y, float x)
{
    float a = min(abs(x), abs(y)) / max(abs(x), abs(y));
    float s = a * a;
    float r = ((0.013506162438972577 * s + -0.04684240210645093) * s + -0.8414151531876038) * a + a;
    if(abs(y) > abs(x)) { r = 0.25 - r; }
    if(x < 0.0) { r = 0.5 - r; }
    r = multiplySign(r, y); // if(y < 0) { r = -r; }
    return r;
}

And if we wanted to be extra accurate (we don't):

fastAtan2_div2pi_accurate
// One extra mad over fastAtan2_div2pi, although that is typically accurate enough.
//
// max error ~ 3.6957397769599165e-05
float fastAtan2_div2pi_accurate(float y, float x)
{
    float a = min(abs(x), abs(y)) / max(abs(x), abs(y));
    float s = a * a;
    float r = (((-0.0066609612639593 * s + 0.023972538405749075) * s + -0.05140823187987065) * s + -0.8409411267732256) * a + a;
    if(abs(y) > abs(x)) { r = 0.25 - r; }
    if(x < 0.0) { r = 0.5 - r; }
    r = multiplySign(r, y); // if(y < 0) { r = -r; }
    return r;
}

I wrote the following python script using scipy.optimize.curve_fit to derive the coefficients. It's not particularly well written, but this is very much something you run once and never touch again; still, it's probably nice to have as a reference:

atan01.py
from numpy import *
from scipy.optimize import curve_fit

ACCURATE = False

if ACCURATE:
    def atan01_func(X, *betas):
        a = X # y/x
        s = a * a
        return (((betas[0] * s + betas[1]) * s + betas[2]) * s + betas[3]) * a + a

else:
    def atan01_func(X, *betas):
        a = X # y/x
        s = a * a
        return ((betas[0] * s + betas[1]) * s + betas[2]) * a + a


if __name__ == "__main__":
    B = linspace(0.0, 1.0, 20000)
    P = stack((cos(B * 2 * pi), sin(B * 2 * pi)))
    Px = P[0]
    Py = P[1]

    # Lazily isolate one octant of a circle
    keep = (Px > 0) & (Py > 0) & (Px > Py)
    Px = Px[keep]
    Py = Py[keep]
    
    # Fitting ratio of segment
    fit_x = Py / Px
    fit_y = arctan2(Py, Px) / (2 * pi)

    best = float("inf")
    best_coefs = None

    for _ in range(10):
        if ACCURATE:
            initial_seed = random.random(4) * 20 - 10
        else:
            initial_seed = random.random(3) * 20 - 10
        try:
            coefs = curve_fit(atan01_func, fit_x, fit_y, initial_seed)[0]
        except RuntimeError:
            # Ignore when convergence couldn't be resolved
            continue
        yhat = atan01_func(fit_x, *coefs)
        diff = abs(fit_y - yhat).max()
        if diff < best:
            best = diff
            best_coefs = coefs

    print("ACC {0}".format(best))
    print("\nCoefs:")
    for f in best_coefs:
        print(f)

Before and after AMD assembly for reference. Before (a full atan2 followed by the division by 2pi, 28 instructions):

v_max_f32     v1, abs(v2), abs(v0)
v_rcp_f32     v1, v1
v_min_f32     v3, abs(v2), abs(v0)
v_mul_f32     v1, v1, v3
v_mul_f32     v3, v1, v1
v_mul_f32     v4, 0x3caaae5f, v3
v_add_f32     v4, 0xbdae5a36, v4
v_madak_f32   v4, v3, v4, 0x3e3876e2
v_madak_f32   v4, v3, v4, 0xbea91d04
v_madak_f32   v3, v3, v4, 0x3f7ff738
v_mul_f32     v4, v1, v3
v_madak_f32   v4, -2.0, v4, 0x3fc90fdb
v_cmp_gt_f32  vcc, abs(v0), abs(v2)
v_cndmask_b32  v4, 0, v4, vcc
v_min_f32     v5, v2, v0
v_max_f32     v0, v2, v0
v_cmp_gt_f32  vcc, -v2, v2
v_mov_b32     v2, 0xc0490fdb
v_cmp_gt_f32  s[0:1], -v5, v5
v_cmp_ge_f32  s[2:3], v0, -v0
v_mac_f32     v4, v1, v3
v_cndmask_b32  v0, 0, v2, vcc
s_and_b64     vcc, s[0:1], s[2:3]
v_mov_b32     v1, 0x80000000
v_add_f32     v0, v4, v0
v_cndmask_b32  v1, 0, v1, vcc
v_xor_b32     v0, v0, v1
v_mul_f32     v0, 0.15915494, v0

After (fastAtan2_div2pi, 17 instructions):
v_min_f32     v1, abs(v2), abs(v0)
v_max_f32     v3, abs(v2), abs(v0)
v_rcp_f32     v3, v3
v_mul_f32     v1, v1, v3
v_mul_f32     v3, v1, v1
v_mul_f32     v4, 0x3c5d48f3, v3
v_add_f32     v4, 0xbd3fddd2, v4
v_madak_f32   v3, v4, v3, 0xbf5766fc
v_mac_f32     v1, v3, v1
v_sub_f32     v3, 0x3e800000, v1
v_cmp_gt_f32  vcc, abs(v2), abs(v0)
v_cndmask_b32  v1, v1, v3, vcc
v_sub_f32     v3, 0.5, v1
v_cmp_gt_f32  vcc, 0, v0
v_cndmask_b32  v0, v1, v3, vcc
v_and_b32     v1, 0x80000000, v2
v_xor_b32     v0, v0, v1

Generating A Shadowmap

If you've got a BVH of the scene (Raytracing In 2D), you can very quickly generate your maps.

// Assuming you're just doing a fullscreen pass with a vertex shader that
// outputs the screen uv coordinate, and that you have a uniform buffer of
// pointlight coordinates.

layout(location=0) in vec2 uv;
layout(location=0) out float outDepth; // Or write to gl_FragDepth, both would work.

void main()
{

    int lightId = int(gl_FragCoord.y);
    vec2 ro = getLightPosition(lightId);
    vec2 rd = vec2(cos(uv.x * TWOPI), sin(uv.x * TWOPI));

    LineBvhV1Result hit = traceLineBvhV1(ro, rd, 1.0, false);

    outDepth = sqrt(hit.hitDistSq);

    // If writing to gl_FragDepth instead, the value needs to be
    // normalised into the 0->1 range to work, e.g:
    // gl_FragDepth = 1.0 - 1.0 / (1.0 + sqrt(hit.hitDistSq));
}

Now, if you aren't interested in generating a BVH, you can also use the lines directly, but it is a bit more involved. The basic idea is to draw each line over each pointlight's row, manually overriding gl_FragDepth and letting the GPU's depth hardware keep the closest hit.

For this example, I'm using glDrawArraysInstanced, with the vertex id encoding the line and the instance id encoding the light. (We also need to draw everything twice, via the instance count, since projected lines are liable to wrap around.)

vertex shader
layout(location = 0) uniform float invTextureHeight;
layout(location = 1) uniform uint startingPointLightID; // first light in this batch

flat layout(location = 0) out vec4 lineData;
layout(location = 1) out float direction;


void main()
{
    // Lines need to be duplicated so we can effectively wrap around
    // when projected lines aren't simply in the 0->1 range and are in the
    // -1->0 range.
    const float polarOffset = float(gl_InstanceID & 1);
    const uint pointLightID = startingPointLightID + (gl_InstanceID >> 1);

    const uint lineId = gl_VertexID >> 1;
    const uint lineSide = gl_VertexID & 1;

    const vec2 light = getLightPosition(pointLightID);
    vec4 line = getLine(lineId) - light.xyxy; // Make sure the line is always relative to the light

    // x = A, y = B
    vec2 polarPositions = vec2(fastAtan2_div2pi(line.y, line.x),
                               fastAtan2_div2pi(line.w, line.z));


    // Make sure the points are the minimum circular distance
    #if 0 // reference
        if(polarPositions.y < polarPositions.x) {
            polarPositions.x -= sign(polarPositions.x) * float(abs(polarPositions.y - polarPositions.x) > 0.5);
        }
        else {
            polarPositions.y -= sign(polarPositions.y) * float(abs(polarPositions.x - polarPositions.y) > 0.5);
        }
    #else // branchless
        polarPositions -= (
            vec2(lessThan(polarPositions.yx, polarPositions.xy))
            * sign(polarPositions)
            * vec2(greaterThan(abs(polarPositions - polarPositions.yx), vec2(0.5)))
        );
    #endif

    // In order to make everything airtight (so no holes where lines meet)
    // the lines need to have a consistent winding order, to get around
    // this, we can instead opt to make sure our line coords are sorted.
    if(polarPositions.y < polarPositions.x)
    {
        float tmp = polarPositions.x;
        polarPositions.x = polarPositions.y;
        polarPositions.y = tmp;

        vec2 tmpl = line.xy;
        line.xy = line.zw;
        line.zw = tmpl;
    }

    // Apply wrap-around offset
    polarPositions += polarOffset;

    // Swap between A and B points
    const float X = ((lineSide == 0) ? polarPositions.x : polarPositions.y);

    lineData = ((lineSide == 0) ? line.xyzw : line.zwxy);
    direction = TWOPI * (X - 0.5);

    gl_Position = vec4(2 * X - 1,
                       2 * ((0.5 + float(pointLightID)) * invTextureHeight) - 1,
                       0.0,
                       1.0);
}

fragment shader
flat layout(location = 0) in vec4 lineData;
layout(location = 1) in float direction;

void main()
{
    // Calculate the distance from the origin when facing in a direction
    // dictated by the polar coordinates to the target line.
    // https://www.geogebra.org/m/pabfs2c9
    vec2 directionVec = vec2(cos(direction), sin(direction));

    vec2 lineDiff = lineData.zw - lineData.xy;
    float u = (
        (directionVec.x * lineData.y - directionVec.y * lineData.x)
        / (lineDiff.x * directionVec.y - lineDiff.y * directionVec.x)
    );

    vec2 I = lineData.xy + lineDiff * u;
    gl_FragDepth = 1.0 - 1.0 / (1.0 + length(I));
}

Overall, using the BVH version is probably a better option.

Now we can start attempting to shadow things. Normally for shadow maps you'd use something like PCF (percentage-closer filtering) when sampling, which we can get the hardware to do directly when using a depth texture (and it's not hard to do manually if you're rendering to some other format).

texture parameters
glTextureParameteri(tex, GL_TEXTURE_WRAP_S, GL_REPEAT);
glTextureParameteri(tex, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);

if(useHardwarePCF)
{
    glTextureParameteri(tex, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTextureParameteri(tex, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    glTextureParameteri(tex, GL_TEXTURE_COMPARE_FUNC, GL_LEQUAL);
    glTextureParameteri(tex, GL_TEXTURE_COMPARE_MODE, GL_COMPARE_REF_TO_TEXTURE);
}

// manual PCF
else
{
    glTextureParameteri(tex, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTextureParameteri(tex, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
}
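
For reference, a minimal sketch of what the manual route could look like for a non-depth (e.g. R32F) distance map with the same layout; manualPCF is a hypothetical helper, and the texture is assumed to use GL_REPEAT on S with GL_NEAREST filtering:

// Two taps with the depth comparison done before the lerp, which is
// effectively what GL_COMPARE_REF_TO_TEXTURE + GL_LINEAR gives us for free.
float manualPCF(sampler2D shadowMap, float X, float Y, float Z)
{
    float w = float(textureSize(shadowMap, 0).x);
    float xTexels = X * w - 0.5;
    float t = fract(xTexels);
    float x0 = (floor(xTexels) + 0.5) / w; // centre of the left texel

    // GL_REPEAT handles the wrap-around at the seam for us.
    float vis0 = (Z <= texture(shadowMap, vec2(x0, Y)).r) ? 1.0 : 0.0;
    float vis1 = (Z <= texture(shadowMap, vec2(x0 + 1.0 / w, Y)).r) ? 1.0 : 0.0;
    return mix(vis0, vis1, t);
}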

For the sake of keeping this simple, I'll assume we're using a depth texture with hardware PCF enabled (the first branch in the parameters above).

Here is one way you might go about applying all the lights in a single fullscreen pass:

lighting pass
layout(location = 0) uniform uint pointLightCount;
layout(binding = 0) uniform sampler2DShadow pointLightShadowMap;

layout(location = 0) in vec2 uv;
layout(location = 0) out vec4 outRgba;

void main()
{

    const vec2 invTextureSize = 1.0 / vec2(textureSize(pointLightShadowMap, 0));

    vec3 totalAccum = vec3(0.0);

    for(uint pointLightID=0; pointLightID < pointLightCount; ++pointLightID)
    {
        
        const vec2 pointLightPos = getLightPosition(pointLightID);
        const vec2 localUv = (2.0 * uv - 1.0) - pointLightPos.xy;   // assuming lights are in NDC space

        const float X = fastAtan2_div2pi(localUv.y, localUv.x);
        const float Y = (0.5 + float(pointLightID)) * invTextureSize.y;
        const float Z = 1.0 - 1.0 / (1.0 + length(localUv));
        float pcf = texture(pointLightShadowMap, vec3(X, Y, Z));
        totalAccum += evaluateLight(pointLightID, localUv) * pcf;
    }

    outRgba = vec4(totalAccum, 1.0);
}

example1
Single light source

This may all look fine at first, but all is not as it seems...
If we move our light source and zoom in a bit, the true horror of shadow acne reveals itself.

acne0
Reference
acne1
Zoomed in #1
acne2
Zoomed in #2

This does make sense: each texel more-or-less stores the visible radius for a single direction, and directions that fall in between two texels don't really correspond to either in a linear way.

You can of course just up the resolution, but this isn't a very satisfying solution.

You might attempt to create lines between depth samples and then do line-line intersection tests, or even just render a line directly. This does work to some extent, but looks pretty bad when you get to the lines' end points.

A nicer solution I've found is to use Hesse Normal Form (basically planes).

Planemaps

The idea is to store the equation of the line rather than its distance. Pipeline-wise, the switch isn't massively different, although it is a bit more involved.

Using R10G10B10A2 is a natural fit, with N being stored in RG and dist being stored in B (not currently using A, but it could easily be used to store a more precise dist).

We could also opt for RG16F, storing a polar angle in R and the dist in G.
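
A minimal sketch of that RG16F option, with hypothetical encode/decode helpers (reusing fastAtan2_div2pi and TWOPI from earlier):

// Encode: polar angle of the normal in R (-0.5 -> 0.5), distance in G.
vec2 encodePlaneRG16F(vec2 N, float w)
{
    return vec2(fastAtan2_div2pi(N.y, N.x), w);
}

// Decode back into the (N.x, N.y, dist) form used when sampling.
vec3 decodePlaneRG16F(vec2 encoded)
{
    float theta = encoded.x * TWOPI;
    return vec3(cos(theta), sin(theta), encoded.y);
}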

As a reminder, no GPU actually supports the "regular" three-component texture formats (RGB8, RGB16, RGB16F, RGB32F); they are simply padded, and the hidden A channel is basically dead space.

layout(location=0) in vec2 uv;
layout(location=0) out vec4 outPlane;

void main()
{
    int lightId = int(gl_FragCoord.y);
    vec2 ro = getLightPosition(lightId);
    vec2 rd = vec2(cos(uv.x * TWOPI), sin(uv.x * TWOPI));
    LineBvhV1Result hit = traceLineBvhV1(ro, rd, 1.0, false);

    vec2 N = rd;
    float w = 1.0;

    if(hit.hitLineId != 0xffffffffu)
    {
        N = normalize(vec2(hit.line.w, -hit.line.z));
        w = dot(N, hit.line.xy - ro);
        
        // Ensure a consistent clockwise orientation plus
        // keep the distance coef positive.
        N.x = multiplySign(N.x, w);
        N.y = multiplySign(N.y, w);
        w = multiplySign(w, w);
    }

    // R10G10B10A2 encoding, A2 is currently unused...
    outPlane = vec4(N * 0.5 + 0.5, w, 1.0);
}

When it comes to sampling, there are two flavours that I've found work well.

Binary sampling

This is by far the simplest method and works rather well, though it can feel a bit jittery and break down a tad when there is a sharp change in orientation.

float getBinaryPlaneVisibility(vec2 uv, vec3 sampledNormalAndDistance, float bias /* = 0.5 / lightPlaneMapSize.x */)
{
    return 1.0 - smoothstep(sampledNormalAndDistance.z - bias,
                            sampledNormalAndDistance.z + bias,
                            dot(uv, sampledNormalAndDistance.xy));
}
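
Dropping it into the earlier lighting loop could look something like this (a sketch; lightPlaneMap replaces the depth texture, and invPlaneMapWidth is a hypothetical 1/width uniform):

// Fetch the plane for this direction, undo the 0->1 encoding,
// then test which side of the plane the shaded point falls on.
vec3 planeAndDistance = texture(lightPlaneMap, vec2(X, Y)).xyz;
planeAndDistance.xy = planeAndDistance.xy * 2.0 - 1.0;
float visibility = getBinaryPlaneVisibility(localUv, planeAndDistance, 0.5 * invPlaneMapWidth);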

PCF

It can look better when the linemap resolution is very low, but overall binary sampling should probably be preferred.

// Two taps and two binary plane tests, texture width is assumed to be a power of 2,
// more expensive, but generally just looks better.
float getPCFPlaneVisibility(sampler2D lightPlaneMap,
                            vec2 localUv, // uv local to the lights center
                            vec2 planeMapUV,
                            ivec2 lightPlaneMapSize,
                            float bias /* = 0.5 / lightPlaneMapSize.x */)
{
    float sampleXBase = (planeMapUV.x + 1.0) * lightPlaneMapSize.x - 0.5;
    int sampleY = int(planeMapUV.y * lightPlaneMapSize.y);
    
    int sampleX0 = int(sampleXBase);
    int sampleX1 = sampleX0 + 1;
    sampleX0 &= (lightPlaneMapSize.x - 1);
    sampleX1 &= (lightPlaneMapSize.x - 1);

    vec3 planeAndDistance0 = texelFetch(lightPlaneMap, ivec2(sampleX0, sampleY), 0).xyz;
    vec3 planeAndDistance1 = texelFetch(lightPlaneMap, ivec2(sampleX1, sampleY), 0).xyz;
    planeAndDistance0.xy = planeAndDistance0.xy * 2.0 - 1.0;
    planeAndDistance1.xy = planeAndDistance1.xy * 2.0 - 1.0;

    float visibility0 = getBinaryPlaneVisibility(localUv, planeAndDistance0, bias);
    float visibility1 = getBinaryPlaneVisibility(localUv, planeAndDistance1, bias);

    float lerpWeight = fract(sampleXBase);
          lerpWeight = smoothstep(0.0, 1.0, lerpWeight);
    return mix(visibility0, visibility1, lerpWeight);
}

Filtering Lights To Evaluate

Up until this point we've been evaluating every light for every pixel; this is not going to scale well beyond a handful of lights.

The first obvious optimisation would be to only evaluate pixels the player sees (Blocking 2D Player Visibility), although this might not always be possible.

We can generate bounds for each light using the planemaps we've already generated, then draw a series of quads which add each light's colour to the scene.

(If we weren't targeting WebGL and had compute shaders / UAV buffers, using a linked light list would be a sensible option).

The simplest option is to generate axis-aligned bounding boxes (BBOXs):

gen_bbox.frag
vec2 projectRay(vec3 planeAndDistance, float theta)
{
    vec2 rd = vec2(cos(theta), sin(theta));
    float denom = dot(planeAndDistance.xy, rd);

    // Prevent divisions by zero and excessive expansion at grazing angles
    float onedeg = 0.01745240643728351;
    if(abs(denom) <= onedeg)
    {
        return vec2(0);
    }

    float distToPlane = planeAndDistance.z / denom;
    return rd * distToPlane;
}


layout(location = 0) out vec4 outBBox;

void main()
{
    int lineIndex = int(gl_FragCoord.x);
    vec2 ro = getLightPosition(lightingData, lineIndex);
    vec2 bboxMax = vec2(-65504.0);
    vec2 bboxMin = vec2(65504.0);

    // Assumed linemap resolution to be a power of 2
    int lightPlaneMapWidth = textureSize(lightPlaneMap, 0).x;
    float invTextureSizeTwoPi = rcpForPowersOf2(float(lightPlaneMapWidth)) * TWOPI;

    for(int x=0; x < lightPlaneMapWidth; ++x)
    {

        vec3 planeAndDistance = texelFetch(lightPlaneMap, ivec2(x, lineIndex), 0).xyz;
             planeAndDistance.xy = planeAndDistance.xy * 2 - 1;

        // Project against the left and right rotational values
        // to prevent underprojecting and not fully containing
        // the light.

        const float bias = 1.0;

        float thetaLeft = (float(x) - bias) * invTextureSizeTwoPi;
        float thetaRight = (float(x) + bias) * invTextureSizeTwoPi;

        vec2 projectionLeft = projectRay(planeAndDistance, thetaLeft);
        vec2 projectionRight = projectRay(planeAndDistance, thetaRight);

        bboxMin = min(min(projectionLeft, projectionRight), bboxMin);
        bboxMax = max(max(projectionLeft, projectionRight), bboxMax);
    }

    outBBox = vec4(bboxMin + ro, bboxMax + ro);
}

This works reasonably well, but we can reduce overdraw by fitting a tighter quad:

gen_obbox.frag
vec2 projectRay(vec3 planeAndDistance, float theta)
{
    vec2 rd = vec2(cos(theta), sin(theta));
    float denom = dot(planeAndDistance.xy, rd);

    // Prevent divisions by zero and excessive expansion at grazing angles
    float onedeg = 0.01745240643728351;
    if(abs(denom) <= onedeg)
    {
        return vec2(0);
    }

    float distToPlane = planeAndDistance.z / denom;
    return rd * distToPlane;
}


layout(location = 0) out uvec4 outOOBox;

void main()
{
    int lineIndex = int(gl_FragCoord.x);
    vec2 ro = getLightPosition(lightingData, lineIndex);

    // Assumed linemap resolution to be a power of 2
    int lightPlaneMapWidth = textureSize(lightPlaneMap, 0).x;
    float invTextureSizeTwoPi = rcpForPowersOf2(float(lightPlaneMapWidth)) * TWOPI;

    const float bias = 1.0;

    // Find initial furthest vector
    vec2 furthest = vec2(0);
    float furthestDistSq = -65504.0;

    for(int x=0; x < lightPlaneMapWidth; ++x)
    {
        vec3 planeAndDistance = texelFetch(lightPlaneMap, ivec2(x, lineIndex), 0).xyz;
             planeAndDistance.xy = planeAndDistance.xy * 2 - 1;

        float thetaLeft = (float(x) - bias) * invTextureSizeTwoPi;
        vec2 projectionLeft = projectRay(planeAndDistance, thetaLeft);
        float leftDistSq = dot(projectionLeft, projectionLeft);
        float thetaRight = (float(x) + bias) * invTextureSizeTwoPi;
        vec2 projectionRight = projectRay(planeAndDistance, thetaRight);
        float rightDistSq = dot(projectionRight, projectionRight);

        if(leftDistSq > furthestDistSq)
        {
            furthest = projectionLeft;
            furthestDistSq = leftDistSq;
        }

        if(rightDistSq > furthestDistSq)
        {
            furthest = projectionRight;
            furthestDistSq = rightDistSq;
        }
    }

    // Next find the furthest point from our first point
    vec2 furthestTangent = vec2(0);
    float furthestTangentDistSq = -65504.0;

    for(int x=0; x < lightPlaneMapWidth; ++x)
    {
        vec3 planeAndDistance = texelFetch(lightPlaneMap, ivec2(x, lineIndex), 0).xyz;
             planeAndDistance.xy = planeAndDistance.xy * 2 - 1;

        float thetaLeft = (float(x) - bias) * invTextureSizeTwoPi;
        vec2 projectionLeft = projectRay(planeAndDistance, thetaLeft);
        float leftDistSq = dot(projectionLeft - furthest, projectionLeft - furthest);
        float thetaRight = (float(x) + bias) * invTextureSizeTwoPi;
        vec2 projectionRight = projectRay(planeAndDistance, thetaRight);
        float rightDistSq = dot(projectionRight - furthest, projectionRight - furthest);

        if(leftDistSq > furthestTangentDistSq)
        {
            furthestTangent = projectionLeft;
            furthestTangentDistSq = leftDistSq;
        }

        if(rightDistSq > furthestTangentDistSq)
        {
            furthestTangent = projectionRight;
            furthestTangentDistSq = rightDistSq;
        }
    }

    vec2 L0 = furthest;
    vec2 L1 = furthestTangent;

    // All remaining points should lie either left or right of our computed line
    float CDist = 0.0;
    float DDist = 0.0;

    vec2 Ld = normalize(L1 - L0);
    vec2 Ln = vec2(Ld.y, -Ld.x);
    float Lw = dot(Ln, L0);

    for(int x=0; x < lightPlaneMapWidth; ++x)
    {
        vec3 planeAndDistance = texelFetch(lightPlaneMap, ivec2(x, lineIndex), 0).xyz;
             planeAndDistance.xy = planeAndDistance.xy * 2 - 1;

        float thetaLeft = (float(x) - bias) * invTextureSizeTwoPi;
        vec2 projectionLeft = projectRay(planeAndDistance, thetaLeft);
        float leftDist = dot(projectionLeft, Ln) - Lw;
        
        float thetaRight = (float(x) + bias) * invTextureSizeTwoPi;
        vec2 projectionRight = projectRay(planeAndDistance, thetaRight);
        float rightDist = dot(projectionRight, Ln) - Lw;

        if(leftDist < 0)
        {
            if(abs(leftDist) > abs(CDist))
            {
                CDist = leftDist;
            }
        }
        else
        {
            if(leftDist > DDist)
            {
                DDist = leftDist;
            }
        }

        if(rightDist < 0)
        {
            if(abs(rightDist) > abs(CDist))
            {
                CDist = rightDist;
            }
        }
        else
        {
            if(rightDist > DDist)
            {
                DDist = rightDist;
            }
        }
    }

    vec2 A = furthest + Ln * CDist;
    vec2 B = furthest + Ln * DDist;
    vec2 C = furthestTangent + Ln * DDist;
    vec2 D = furthestTangent + Ln * CDist;

    outOOBox = uvec4(packHalf2x16(A + ro),
                     packHalf2x16(B + ro),
                     packHalf2x16(C + ro),
                     packHalf2x16(D + ro));
}

The above stores the four corner points, but packs each one as a pair of float16s (packHalf2x16) so that all four fit in a single uvec4 texel, making them easier to fetch.
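
As a sketch of how those packed bounds might be consumed, a hypothetical vertex shader that draws one quad per light as a 4-vertex triangle strip (lightBoundsMap, and lights living in NDC space, are assumptions):

layout(binding = 0) uniform usampler2D lightBoundsMap;

flat layout(location = 0) out int pointLightID;

void main()
{
    pointLightID = gl_InstanceID;
    uvec4 packedBox = texelFetch(lightBoundsMap, ivec2(pointLightID, 0), 0);

    // Corners were packed in perimeter order A, B, C, D;
    // remap to strip order A, B, D, C so the quad isn't degenerate.
    int corner = (gl_VertexID == 2) ? 3 : ((gl_VertexID == 3) ? 2 : gl_VertexID);
    vec2 P = unpackHalf2x16(packedBox[corner]);

    gl_Position = vec4(P, 0.0, 1.0);
}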


Despite the bbox code iterating over every texel in a row (and the obbox version doing so three times), this is remarkably far from the most expensive operation here. Of course, ideally you'd run the whole thing once upfront and only update individual lights as their data changes.

If we have stationary lights which change intensity / colour but don't move, generating shadow polygons offline and not using shadow mapping at all would be the fastest solution (although that's not exactly simple to do, and well beyond the scope of what we're attempting here).

Regardless, one of the main bottlenecks to expect here is the number of lights that overlap the same pixel, as this increases the number of implicit dependencies the GPU needs to resolve when blending (a complex scene may then actually be cheaper to evaluate).

Anyway, this post has gone on for long enough. Here is a mini demo (drag to move lights around):


(Interactive demo: controls for resolution, linemap resolution, number of lights, drawing lines/bounds, scene, bounds type, filtering, light selection, and per-light colour, intensity and decay rate.)