Different types of MRTs + parallel shader programs

Two suggestions (similar theme):

First, having the ability (if supported in hardware) to attach framebuffer color targets of different formats would be extremely useful. For example, say you want to write both an RGBA INT8 render target and an RGBA FP16 render target from one fragment shader. Obvious wins for G-Buffers.

Also extremely useful if you want to merge two independent “full frame GPGPU kernel” style fragment shaders which output to the same size texture but require different output types. Often this is the case where you have one shader which is texture bound and another which is ALU bound, each requiring a different MRT format (say one is LUMINANCE FP32 and the other is RGBA FP16). Combine the two shaders into a single shader and you get better overall performance.
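To make the first suggestion concrete, here is a minimal sketch of what mixed-format MRT would look like with today’s entry points (assuming EXT_framebuffer_object, ARB_texture_float, and the GL 2.0 glDrawBuffers call; width and height are placeholders). Under the current spec an implementation is free to reject this attachment combination, which is exactly the restriction I’d like lifted:

    /* Two color targets of different formats on one FBO (sketch only). */
    GLuint fbo, colorRGBA8, colorRGBA16F;

    glGenTextures(1, &colorRGBA8);
    glBindTexture(GL_TEXTURE_2D, colorRGBA8);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, NULL);

    glGenTextures(1, &colorRGBA16F);
    glBindTexture(GL_TEXTURE_2D, colorRGBA16F);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA16F_ARB, width, height, 0,
                 GL_RGBA, GL_FLOAT, NULL);

    glGenFramebuffersEXT(1, &fbo);
    glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
    glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                              GL_TEXTURE_2D, colorRGBA8, 0);
    glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT1_EXT,
                              GL_TEXTURE_2D, colorRGBA16F, 0);

    /* The fragment shader would write gl_FragData[0] and gl_FragData[1]. */
    const GLenum bufs[] = { GL_COLOR_ATTACHMENT0_EXT, GL_COLOR_ATTACHMENT1_EXT };
    glDrawBuffers(2, bufs);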

Second suggestion,

This is also related to the issue of shaders often being either ALU, TEX, or ROP bound.

It would be good if GL in the future has a clear design path to allow for multiple shading paths (groups of vertex/(optional)geometry/fragment shaders) to run simultaneously on the GPU with clear sync points. This would allow programmers to pair shaders which are limited in one performance aspect (again ALU/TEX/ROP) together to better use the full GPU resources.

Anyone care to chime in on whether this is possible on current hardware? Perhaps this already happens?

From the limited knowledge I have gathered about NVidia’s 8 series drivers, it seems as if the fragment and vertex shaders of a drawing invocation definitely overlap. From the NV CUDA forum boards, currently CUDA programs are serialized and do not run simultaneously. So I wonder if different drawing invocations of different GL shaders are serialized as well, or can they run in parallel beyond the small overlap when one ends and the next begins?

Any info on AMD/ATi HD or NVidia 8 series stuff would be great…

That isn’t an OpenGL API issue. It would be possible to implement more flexible render target combinations without changing the extensions.

Actually this is purely an OpenGL API / extension issue.

Even at the time of the last update of the Framebuffer Object Extension in April 2006, the spec clearly states,

“(49) When this extension is used in conjunction with MRT (multiple render targets), it would naively be possible to create a framebuffer that had different color bit depths/formats for various color attachment points.”

So we know that the hardware supported this almost two years ago. However,

“Should this be allowed? RESOLUTION: resolved, no, not in this extension. A soon to follow extension may add this feature.”

This “soon to follow extension” never materialized, probably because any updates to GL via extensions are on hold until all GL3 issues are resolved.

BTW, isn’t the framebuffer object extension being promoted into the core of GL3?

Personally I think this multi-format MRT issue (and of course the other big FBO issue of multi-sample buffers) should be a “must do” for GL3.

Actually this is purely an OpenGL API / extension issue.

What he meant is that one does not need to change the API. You don’t need a special kind of FBO or buffer object or texture or something.

Now, you’re right in that you do need an extension. But the extension itself would basically just say, “This works now.” Much like the NPOT extension just goes over the spec and says, “textures don’t have to be power-of-two in size”. It wouldn’t make any API changes, but it would need to exist.

GL 3.0, depending on how much gets changed from what we knew previously, will probably just let you do it. It has a failsafe mechanism that allows it to test for conditions like that. When you create an FBO, you have to give it a list of format objects that will be used for the attachments. If the format objects can’t be used in the same FBO, it will simply fail to create itself. So GL 3.0 will probably just guarantee that implementations accept equivalent color format objects, but if you use different ones, the implementation can choose not to accept them.
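For what it’s worth, the shipping EXT_framebuffer_object API already has a rough analogue of that failsafe in the completeness check. A minimal sketch (assuming an FBO with mixed-format color attachments is currently bound) of how an application would find out that a driver declined the combination:

    GLenum status = glCheckFramebufferStatusEXT(GL_FRAMEBUFFER_EXT);
    switch (status) {
    case GL_FRAMEBUFFER_COMPLETE_EXT:
        /* The implementation accepted the mixed-format MRT setup. */
        break;
    case GL_FRAMEBUFFER_INCOMPLETE_FORMATS_EXT:
    case GL_FRAMEBUFFER_UNSUPPORTED_EXT:
    default:
        /* Combination rejected; fall back to separate single-format passes. */
        break;
    }

So “fail at creation/validation time”, as described for GL 3.0 format objects, is not that different from what an FBO-aware application already has to handle today.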

It would be good if GL in the future has a clear design path to allow for multiple shading paths (groups of vertex/(optional)geometry/fragment shaders) to run simultaneously on the GPU with clear sync points.

I’m not sure that this “feature” makes sense.

As a rendering API, the system runs shaders when something gets rendered. So you’re going to be running the same shader 99% of the time, because each draw command will use a specific shader. I don’t know where you can expect to get much parallelism from.

Here is an example: say your G-Buffer generation is ROP bound, your later G-Buffer shading is TEX bound, and after that your GPGPU physics engine is ALU bound.

If you serialize the shaders of these three pipeline steps:

1.) G-Buffer creation will only go as fast as you can output to the ROP, while lots of free ALU and TEX cycles sit there that could be used for something else.

2.) G-Buffer shading will only go as fast as you can fetch from the texture units, with free ROP and ALU resources.

3.) The GPGPU physics code will only go as fast as the ALUs allow, with free TEX and ROP resources.

Now if the shaders of each pipeline step could run together,

1.) You could be finishing the lighting on the previous G-Buffer while creating the next G-Buffer (for the sake of argument, say you had enough memory for 2 G-Buffers), and doing the next GPGPU physics pass. In doing so you get a better balance of ALU/TEX/ROP usage.

Now assuming that actual hardware supports this, you might see a good performance increase.

I’m thinking that this is something which is going to need some API hooks so that the drivers can figure out what to overlap given input from the program.
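To be clear about what I mean by “API hooks”, here is a purely hypothetical sketch. None of the *HYPOTHETICAL entry points or the draw*Pass helpers exist in GL or any proposed spec; they are invented names just to show the kind of grouping and sync-point information the application could hand the driver:

    /* Hypothetical: mark a group of draw calls the driver may overlap. */
    glBeginOverlapGroupHYPOTHETICAL();

    glUseProgram(gbufferLightingProgram);  /* TEX bound, reads previous frame's G-Buffer */
    drawLightingPass();                    /* app-defined helper (hypothetical) */

    glUseProgram(physicsProgram);          /* ALU bound GPGPU pass */
    drawPhysicsPass();                     /* app-defined helper (hypothetical) */

    glUseProgram(gbufferFillProgram);      /* ROP bound, fills next frame's G-Buffer */
    drawGeometryPass();                    /* app-defined helper (hypothetical) */

    /* Hypothetical sync point: work above may be scheduled concurrently,
       work issued after this call sees all of its results. */
    glEndOverlapGroupHYPOTHETICAL();

Within such a group the driver would know it is free to co-schedule the ROP bound, TEX bound, and ALU bound work instead of running the three programs back to back.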

Perhaps I’m all wet, and this is something which is easy to do driver side automatically?

I am not sure it would be that great for a graphics API.
For a GPGPU API maybe, and I guess it is probably already the case.

Anyway, it can already happen under the hood; for example, taken from the official specs of an ATI Radeon HD 3800:

Unified Superscalar Shader Architecture

* 320 stream processing units
      o Dynamic load balancing and resource allocation for vertex, geometry, and pixel shaders
      o Common instruction set and texture unit access supported for all types of shaders

And as you point out, your original proposal needs to store temporary results, and that is not trivial.

If you take a drawing call given a combination of {vert/geo/pix} shaders, I think we know the drivers load balance between the 3 parts of the pipeline during this call. I don’t know enough about the geometry stage to comment on it, but at least between vertex and fragment stages, I’m guessing that the driver chooses a start time for the fragment stage once it can guarantee that the fragment stage will not stall on vertex data. So in this way the stages are overlapped.

Also it would only make sense to start running the vertex shader of the next drawing call once the pixel shader of the previous call starts to have open thread slots (speaking from a unified shader perspective). Perhaps the next vertex shader starts even earlier, but I would guess that no two different vertex shaders and no two different pixel shaders ever overlap execution.

What I am suggesting has to do with overlapping drawing call execution of different shaders of the same type.

Of course the hardware simply might not be able to do this right now (I’m guessing this is the case). But I’d bet that overlapped execution of multiple different programs is something which will be a given at least on Larrabee in 2010… and since GL3 is trying for “future proof”, this might be something to think about.

My G-Buffer example is just a dummy example, but the concept could apply to many other situations which don’t need double buffering.

since GL3 is trying for “future proof”, this might be something to think about.

GL 3 is still a rendering API. It doesn’t specify performance. All it cares about is render order. So long as the fragments of triangle 1 do not get written (or appear to get written) after those of triangle 2, GL 3 will not care how they get to the framebuffer.