Using FBO for simulation (reading from many texels, writing to a current texel)
I would like to implement a simulation that takes advantage of high GPU fill rates and works on a prepared/precalculated texture.
In general, for the texel at X,Y in textureA, I would like to do, for example, the following:
- read some static R/G values from textureB (X,Y)
- fetch texel value from textureA (at R/G)
- do some calculation on this fetched value
- write the result to textureA (X,Y).
Then repeat this for each texel of textureA, going line by line (so the results from previous 'lines' are used in the calculation of the current line). In other words, the result for texel XY shall depend on the values of texels LM calculated microseconds earlier (as part of the same pass), where for example L<X.
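To pin down the intended semantics before moving anything to the GPU, the pass can be written as a CPU reference. This is only a minimal sketch in Python: it assumes textureB stores integer source coordinates rather than normalised R/G values, and `simulate_pass` and `step` are hypothetical names standing in for the "calculation", not part of any API.

```python
def simulate_pass(tex_a, tex_b, step):
    """Run one in-place pass over tex_a in row-major order.

    tex_a: list of rows of numbers (the state, updated in place).
    tex_b: list of rows of (src_x, src_y) pairs (static lookup coordinates).
    step:  hypothetical per-texel calculation applied to the fetched value.
    """
    height = len(tex_a)
    width = len(tex_a[0])
    for y in range(height):                  # line by line
        for x in range(width):
            src_x, src_y = tex_b[y][x]       # static R/G read from textureB
            fetched = tex_a[src_y][src_x]    # may already hold this pass's result
            tex_a[y][x] = step(fetched)      # write back to textureA (X,Y)
    return tex_a


# Rows 1 and 2 read the texel directly above; row 0 reads itself.
tex_a = [[1, 2], [0, 0], [0, 0]]
tex_b = [[(0, 0), (1, 0)], [(0, 0), (1, 0)], [(0, 1), (1, 1)]]
result = simulate_pass(tex_a, tex_b, lambda v: v + 1)
print(result)
```

Because each row reads values written earlier in the same pass, the update order matters. That ordering is what any GPU version has to preserve.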
Is such an approach possible at all? I've read about NV_texture_barrier, so I can imagine splitting the whole space into smaller sections and interleaving them with texture barrier calls, so that the values to be fetched would already be present in the texture.
I'm just wondering whether it makes any sense at all. What I want to achieve is the highest possible performance (number of simulation cycles per second) when doing some small/trivial computation on a 'levelized' net of elements (texels). I would hope to get rates of hundreds or thousands of whole-texture passes per second. Does it make any sense?
Any suggestions appreciated.
Assuming the texture is 1080 px high, this is doable with NV_texture_barrier, using 1080 draw calls and flushes, but it is not the optimal solution.
You can do it with 1080/N draw calls using transform feedback, where N is the number of vertically adjacent pixels you handle per vertex. Ping-pong between two buffers, each of size TEX_WIDTH*N*texel_size. Your vertex data input and output sizes are both N*texel_size; the larger you can make this, the better. Before the ping-pong loop, a separate transform-feedback shader is used to preload data from the texture into a buffer. Alternatively, and probably best, keep everything in a VBO from the beginning, with every N vertically adjacent pixels packed together.
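The sizes involved follow directly from the formulas above. Here is a small Python sketch of the arithmetic; the width of 1920, N = 8 and a 4-byte texel are illustrative assumptions, `pingpong_plan` is a hypothetical helper, and the draw-call count is rounded up in case N does not divide the height.

```python
import math


def pingpong_plan(tex_width, tex_height, rows_per_vertex, texel_size):
    """Return (draw_calls, bytes_per_vertex, bytes_per_buffer)."""
    draw_calls = math.ceil(tex_height / rows_per_vertex)   # 1080/N
    vertex_bytes = rows_per_vertex * texel_size            # N*texel_size
    buffer_bytes = tex_width * vertex_bytes                # TEX_WIDTH*N*texel_size
    return draw_calls, vertex_bytes, buffer_bytes


# Assumed example values: 1920x1080 texture, N = 8, 4-byte texels.
plan = pingpong_plan(1920, 1080, 8, 4)
print(plan)
```

Larger N means fewer draw calls but a bigger per-vertex payload. The achievable N is limited by how much data the implementation allows per vertex in transform feedback.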
You can also use ARB_compute_shader. The most flexible, but not scalable, configuration is a workgroup of size (1,1,1) and a dispatch of (TEX_WIDTH,1,1). In the shader, loop TEX_HEIGHT times: load texels, calculate with the previous values, and imageStore() the result. If your TEX_HEIGHT is 1024, this will be the optimal solution for GPUs with <= 1024 scalar/vector cores, which is fortunately the case for all current high-end AMD GPUs and most NVIDIA ones.
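The compute-shader configuration above can be emulated on the CPU to check the data flow: one "invocation" per column, each looping over the rows. This is a minimal Python sketch under an important assumption: each texel depends only on the texel above it in the same column. Reads across columns would need synchronisation between workgroups, which this sketch does not model. `run_column`, `dispatch` and `step` are hypothetical names.

```python
def run_column(tex, x, step):
    """Emulate one compute invocation: walk column x from top to bottom."""
    for y in range(1, len(tex)):
        # Previous row, same column; the result is stored in place.
        tex[y][x] = step(tex[y - 1][x], tex[y][x])


def dispatch(tex, step, column_order):
    """Emulate the dispatch; column_order stands in for GPU scheduling."""
    for x in column_order:
        run_column(tex, x, step)
    return tex


def step(above, current):
    """Hypothetical per-texel calculation."""
    return above + current


forward = dispatch([[1, 2, 3], [1, 1, 1], [1, 1, 1]], step, [0, 1, 2])
shuffled = dispatch([[1, 2, 3], [1, 1, 1], [1, 1, 1]], step, [2, 0, 1])
print(forward == shuffled)
```

With dependencies confined to a column, the order in which the columns run does not affect the result. That independence is what lets the columns execute in parallel.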
You may also want to search for hardware-generated SATs (summed-area tables), as they solve a similar problem, where the function is simply the sum of the texels above and to the left. Maybe that will give you ideas.
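For reference, a summed-area table has the same kind of dependency on already-computed neighbours. A minimal sequential sketch in Python (a CPU version for illustration, not the GPU algorithm; `summed_area_table` is a hypothetical helper name):

```python
def summed_area_table(img):
    """sat[y][x] = sum of img over all texels with row <= y and column <= x."""
    height, width = len(img), len(img[0])
    sat = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            above = sat[y - 1][x] if y > 0 else 0
            left = sat[y][x - 1] if x > 0 else 0
            diag = sat[y - 1][x - 1] if y > 0 and x > 0 else 0
            # Add both neighbours, subtract the overlap counted twice.
            sat[y][x] = img[y][x] + above + left - diag
    return sat


sat = summed_area_table([[1, 2], [3, 4]])
print(sat)
```

GPU implementations usually restructure this into parallel passes over rows and then columns rather than walking texels one at a time. How far that carries over depends on how regular your own dependency pattern is.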