[QUOTE=raga34;1259574]Could you elaborate a little on this one? Specifically the fragment shader variant…[/QUOTE].
Poor-man’s compute shader:
- Create two textures of the same size and format, large enough to hold all of your data.
- Upload your data into one texture, and bind it to a texture unit (with no filtering).
- Attach the other texture as the colour buffer of a framebuffer object, bind it, and set the viewport accordingly.
- Render a screen-sized quad.
Each invocation of the fragment shader will calculate one “pixel” to be stored in the output texture. It can use data from the input texture in the calculation.
Because fragment shader invocations are performed in parallel, you need to use a divide-and-conquer approach, where each step halves the amount of data. After each step, you’d swap the textures, so the output from the last step becomes the input to the next step. You need log[SUB]2/SUB steps to process N elements. You would halve the size of the viewport at each step (as you’re only calculating half as many values).
On each step, you’d calculate
out[i] = min(in[2*i], in[2*i+1]);
So if you start with e.g. 1024 pixels, you’d group these into 512 pairs, and the output would be the minimum of the 2 values for each pair, giving 512 values. On the next step, you’d group these into 256 pairs, and so on. On the tenth step, you’d have 2 inputs and one output, which would be the minimum of the 1024 initial values.
One minor complication is that the textures have to be 2D, and there’s a limit to the width. So if you had 65536 values, you might use 256x256 textures. So the calculation above would look something like:
uniform int level;
uniform sampler2D in_tex;
void main() {
ivec2 d = ivec2(gl_FragCoord.xy); // destination index
ivec2 s0 = level & 1 ? ivec2(d.x/2+0,d.y) : ivec2(d.x,d.y/2+0); // source index 0
ivec2 s1 = level & 1 ? ivec2(d.x/2+1,d.y) : ivec2(d.x,d.y/2+1); // source index 1
vec4 v0 = texelFetch(in_tex, i0, 0);
vec4 v1 = texelFetch(in_tex, i1, 0);
gl_FragColor = min(v0, v1);
Essentially, whereas a sequential “reduce” (aka “fold”) operation accumulates values in series:
sum(seq) = ((seq[0] + seq[1]) + seq[2]) + seq[3]
a divide-and-conquer approach uses the associativity of the operator to structure the calculation as a binary tree:
sum(seq) = (seq[0] + seq[1]) + (seq[2] + seq[3])
This allows the two operands to be evaluated concurrently, which is more favourable to a GPU which relies upon massive parallelism for performance.