Accumulating min/max bounds with transform feedback?

Hi.

I’m doing GPU skinning with transform feedback. Specifically, I’m deforming a mesh given as object space coordinates, resulting in a mesh also specified in object space coordinates.

I’d like to accumulate min/max bounds across all of the vertices somehow. That is, given:


vec3 minimum = vec3(+FLOAT_MAX, +FLOAT_MAX, +FLOAT_MAX);
vec3 maximum = vec3(-FLOAT_MAX, -FLOAT_MAX, -FLOAT_MAX);

Then, for each vertex v:


minimum = vec3(
  min (v.x, minimum.x),
  min (v.y, minimum.y),
  min (v.z, minimum.z)
);

maximum = vec3(
  max (v.x, maximum.x),
  max (v.y, maximum.y),
  max (v.z, maximum.z)
);

I could obviously map the resulting buffer and do the accumulation myself on the CPU side, but it seems like there should be a way to do this on the GPU. I’d like to keep things roughly GL 3.0 compatible; I’m not sure if that restricts things too much.

Er, the intention is to accumulate the upper and lower bounds of the vertex positions, obviously. I realize I never actually said that.

One option would be to use buffer variables with the atomicMin() and atomicMax() functions.
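
For what it’s worth, GLSL’s atomicMin()/atomicMax() only operate on int/uint buffer variables, not floats, so you’d normally remap each component to an order-preserving unsigned integer first. A minimal, untested sketch (GL 4.3 for the SSBO; the buffer layout and names here are mine, not anything standard):


layout(std430, binding = 0) buffer Bounds {
    uint min_bits[3];   // initialise to 0xFFFFFFFFu on the CPU before drawing
    uint max_bits[3];   // initialise to 0x00000000u on the CPU before drawing
};

// Map a float to a uint whose unsigned ordering matches the float ordering,
// so integer atomics give float min/max. Invert the mapping on the CPU.
uint orderedBits(float f) {
    uint u = floatBitsToUint(f);
    return (u & 0x80000000u) != 0u ? ~u : (u | 0x80000000u);
}

// Call this from the skinning vertex shader with the deformed object-space position.
void accumulateBounds(vec3 p) {
    for (int i = 0; i < 3; ++i) {
        uint k = orderedBits(p[i]);
        atomicMin(min_bits[i], k);
        atomicMax(max_bits[i], k);
    }
}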

Another option would be to use a compute shader (or the poor-man’s compute shader: a fragment shader with render-to-texture) for a divide-and-conquer algorithm.

I don’t know which would be fastest on current hardware.

Apologies for the delay in getting back to you!

Unfortunately, I can’t use this one, as it’d require GL >= 4.3.

Could you elaborate a little on this one? Specifically the fragment shader variant…

I’m not too bothered about extreme performance, but I’d obviously like to do significantly better than the CPU can manage.

Vertex transform is inherently one-vert-in, one-vert-out: there is no communication between vertices (but the potential for parallel execution is high).

Min/Max requires comparing all N vertices, so you need to configure the pipeline to do that. A common GPGPU method is “reduction” (or divide-and-conquer). Consider this approach:

a) existing vertex shader reads attributes, transforms, outputs position [can be captured via transform feedback.]
b) modify vertex shader to also read neighbor attributes. Draw N/2 vertices with appropriate attribute strides. Shader transforms two vertices, outputs the min/max [can capture min/max via transform feedback.]
c) one pass has now reduced N vertices to N/2. Repeat for log2(N) passes to reduce N to 1.

Limitations here include the maximum number of attributes you can feed the vertex shader. And you need to decide how to capture both the real transformed positions and the min/max, which on GL3 hardware with a single transform feedback stream means log2(N) passes in addition to the regular transform.
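
As a rough sketch of what (b) might look like for the first reduction pass (the attribute/varying names are mine; later passes would read the previously-captured min/max pairs rather than raw positions):


#version 130
// Each invocation reads a pair of already-deformed positions, bound as two
// attributes with a suitable stride/offset into the captured buffer, and
// writes their component-wise min and max. The outputs are captured with
// transform feedback while GL_RASTERIZER_DISCARD is enabled.
in vec3 pos_a;      // element 2*i of the previous pass
in vec3 pos_b;      // element 2*i + 1 of the previous pass

out vec3 out_min;   // captured varying
out vec3 out_max;   // captured varying

void main() {
    out_min = min(pos_a, pos_b);
    out_max = max(pos_a, pos_b);
}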

If you go way back to even older hardware (GL1 with float textures, or, say, ES2 on an iPhone), another approach is to use the fragment stage to compare multiple vertices:

b) modify the vertex shader to always output a (0,0) position, and output the transformed position as a generic varying. Set up a simple pass-through fragment shader.
c) set up a 1x1 floating-point renderbuffer with either MIN or MAX blending. Initially clear it to a huge positive value for MIN (or a huge negative value for MAX) so the clear colour can’t win the comparison.
d) draw N vertices as 1-px POINTS. Each point is blended onto the same framebuffer pixel, taking the min (or max).
e) ReadPixels (etc.) to get the final result.
f) repeat a second pass with the other blend equation (on GL4 hardware, you could use ARB_draw_buffers_blend to set up both blend equations in a single MRT pass).

Limitations here include the maximum number of components you can manipulate in the fragment stage (i.e. RGBA = 4 floats. More if your hardware supports ARB_draw_buffers.)
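
To make (b) concrete, here’s a rough sketch of the two shaders in old-style GLSL (the names are mine, and the pass-through stands in for the real skinning code); the actual min/max happens in the blender via glBlendEquation(GL_MIN) or GL_MAX (EXT_blend_minmax on older targets), not in the shader:


// Vertex shader: collapse every vertex onto the single pixel of the 1x1
// target, passing the deformed object-space position along as a varying.
attribute vec3 a_position;          // assumed attribute name
varying vec3 v_deformed;            // value reduced by MIN/MAX blending

void main() {
    // In the real shader this would be the skinned, object-space position;
    // the pass-through is just to keep the sketch self-contained.
    v_deformed = a_position;
    gl_Position = vec4(0.0, 0.0, 0.0, 1.0);
    gl_PointSize = 1.0;
}

// Fragment shader: just write the value; MIN/MAX blending does the reduction.
varying vec3 v_deformed;

void main() {
    gl_FragColor = vec4(v_deformed, 0.0);
}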

As mentioned, this is trivial with GL4.2+ hardware, using any of the atomic, image load/store, or compute shader solutions.
OpenCL is another option, and it is available on GL3 hardware.

[QUOTE=raga34;1259574]Could you elaborate a little on this one? Specifically the fragment shader variant…[/QUOTE]
Poor-man’s compute shader:

  1. Create two textures of the same size and format, large enough to hold all of your data.
  2. Upload your data into one texture, and bind it to a texture unit (with no filtering).
  3. Attach the other texture as the colour buffer of a framebuffer object, bind it, and set the viewport accordingly.
  4. Render a screen-sized quad.

Each invocation of the fragment shader will calculate one “pixel” to be stored in the output texture. It can use data from the input texture in the calculation.

Because fragment shader invocations are performed in parallel, you need to use a divide-and-conquer approach, where each step halves the amount of data. After each step, you’d swap the textures, so the output from the last step becomes the input to the next step. You need log2(N) steps to process N elements. You would halve the size of the viewport at each step (as you’re only calculating half as many values).

On each step, you’d calculate


out[i] = min(in[2*i], in[2*i+1]);

So if you start with e.g. 1024 pixels, you’d group these into 512 pairs, and the output would be the minimum of the 2 values for each pair, giving 512 values. On the next step, you’d group these into 256 pairs, and so on. On the tenth step, you’d have 2 inputs and one output, which would be the minimum of the 1024 initial values.

One minor complication is that the textures have to be 2D, and there’s a limit to the width. So if you had 65536 values, you might use 256x256 textures, in which case the calculation above would look something like:


uniform int level;
uniform sampler2D in_tex;

void main() {
    ivec2 d = ivec2(gl_FragCoord.xy);    // destination index
    // reduce along x on odd-numbered levels, along y on even-numbered ones
    ivec2 s0 = (level & 1) != 0 ? ivec2(2*d.x,   d.y) : ivec2(d.x, 2*d.y);    // source index 0
    ivec2 s1 = (level & 1) != 0 ? ivec2(2*d.x+1, d.y) : ivec2(d.x, 2*d.y+1);  // source index 1
    vec4 v0 = texelFetch(in_tex, s0, 0);
    vec4 v1 = texelFetch(in_tex, s1, 0);
    gl_FragColor = min(v0, v1);
}

Essentially, whereas a sequential “reduce” (aka “fold”) operation accumulates values in series:


sum(seq) = ((seq[0] + seq[1]) + seq[2]) + seq[3]

a divide-and-conquer approach uses the associativity of the operator to structure the calculation as a binary tree:


sum(seq) = (seq[0] + seq[1]) + (seq[2] + seq[3])

This allows the two operands to be evaluated concurrently, which is more favourable to a GPU that relies upon massive parallelism for performance.

Very nice, thank you! Would never have come up with this by myself…