Can gl_FragColor be read?

Normally one uses gl_FragColor to write the output of the shader, setting the pixel color. When using multipass rendering, it can be a useful optimization (to avoid ping-ponging) to read the current value of the target pixel in the texture, perform the operations on that data, and then write the new value back to that location. Is this strictly legal in OpenGL? Also, whether legal or not, does this work on most modern nVidia and ATI hardware?

Thanks,

-Raystonn

No.

Thanks for saving me some time experimenting on this optimization!

-Raystonn

Though everybody wishes that were the case
(if it were as fast as a regular texture fetch).

The variable gl_FragColor is readable; however, it doesn’t read from the framebuffer. Reading it just gives you the value you have already assigned to it in the shader. In other words, you can do something like this:

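// Write the sampled color to the output variable.
gl_FragColor = texture2D(Base, texCoord);
// Read back the value assigned above (not the framebuffer) to fill in alpha.
gl_FragColor.a = dot(gl_FragColor.rgb, gl_FragColor.rgb);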

I have done some experiments and have found that I can read from a texture and later write to it during a single-pass fragment shader run. This works on two nVidia GPUs: the GeForce Go 6800 and the GeForce 8800 GTX.

I got this working with the following technique: Attach the texture to an FBO. Enable the texture for reading the way you’d normally bind a texture. Enable the texture for writing via gl_FragData[1] using glDrawBuffers.

I have only tried this technique when using multiple textures. The target texture is not the first texture for either reading or writing. I have not tried reading after writing, as I can see the result of such an operation being undefined.

If you try this on other hardware, specifically non-nVidia hardware, please let me know of your success. Avoiding a ping-pong can save quite a bit of texture memory.
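To make the memory argument concrete, here is a minimal CPU-side sketch in C. It is only a model, not GPU code, and says nothing about what any driver guarantees. The function names and the shade() operation are hypothetical. It compares a ping-pong pass with an in-place pass where each pixel reads only its own old value once and then overwrites it, and adds a neighbour-reading variant to show where the in-place version stops being order-independent.

```c
#include <stddef.h>

/* Hypothetical per-pixel operation standing in for the fragment shader. */
static float shade(float texel) { return texel * 0.5f + 1.0f; }

/* Ping-pong: read from src, write to a separate dst buffer (two buffers). */
void pass_ping_pong(const float *src, float *dst, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = shade(src[i]);
}

/* In-place: each pixel reads its own old value, then overwrites it.
   One buffer, and the result does not depend on processing order. */
void pass_in_place(float *buf, size_t n) {
    for (size_t i = 0; i < n; ++i)
        buf[i] = shade(buf[i]);
}

/* In-place with a neighbour read: pixel i also samples pixel i-1, which
   this same pass has already overwritten in this loop order. The result
   now depends on the order in which pixels are processed. */
void pass_in_place_neighbour(float *buf, size_t n) {
    for (size_t i = 1; i < n; ++i)
        buf[i] = buf[i] + buf[i - 1];
}
```

With the same input, pass_ping_pong and pass_in_place produce identical values, which is the case described in the post above. The neighbour variant accumulates already-updated values, which a real GPU processing fragments in parallel would not reproduce reliably.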

Thanks,

-Raystonn

Originally posted by Raystonn:

I have only tried this technique when using multiple textures. The target texture is not the first texture for either reading or writing. I have not tried reading after writing, as I can see the result of such an operation being undefined.

The FBO specification explicitly states (section 4.4.3) that the value of a fragment is undefined if it might sample from a texture level that is simultaneously attached to the bound FBO. So even if it works on some hardware with current drivers, it might break at any time.

My opinion is that this section of the spec should be changed. A read before a write should fetch the current value as if it was not possible to write. A read after a write should be left undefined, as today’s hardware does not synchronize reads and writes. You may get the written value, and you may get the old value, depending on how the hardware’s cache system works.
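The read-before-write versus read-after-write distinction can be illustrated with a toy model in C. This is an assumption-laden sketch, not a description of any real GPU: it supposes that texture reads go straight to memory while framebuffer writes sit in a pending buffer until some flush happens. The Surface type and all function names are hypothetical.

```c
#include <stddef.h>

#define PIXELS 4

/* Toy model (an assumption): reads see memory[], writes land in pending[]
   and only reach memory[] when flush_writes() runs. */
typedef struct {
    float memory[PIXELS];   /* what a texture fetch sees */
    float pending[PIXELS];  /* writes not yet committed to memory */
    int   dirty[PIXELS];    /* 1 if pending[i] holds an uncommitted write */
} Surface;

void surface_init(Surface *s, float value) {
    for (size_t i = 0; i < PIXELS; ++i) {
        s->memory[i] = value;
        s->pending[i] = 0.0f;
        s->dirty[i] = 0;
    }
}

/* Texture path: does not look at pending writes. */
float texture_read(const Surface *s, size_t i) { return s->memory[i]; }

/* Framebuffer path: the write is buffered, not committed. */
void framebuffer_write(Surface *s, size_t i, float v) {
    s->pending[i] = v;
    s->dirty[i] = 1;
}

/* Commit all buffered writes; when this happens is up to the "hardware". */
void flush_writes(Surface *s) {
    for (size_t i = 0; i < PIXELS; ++i) {
        if (s->dirty[i]) {
            s->memory[i] = s->pending[i];
            s->dirty[i] = 0;
        }
    }
}
```

In this model a read before any write always returns the old value, while a read after a write returns the old or the new value depending on whether a flush happened in between. Since the application does not control the flush, that second case is the one the proposal leaves undefined.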

Would this change make compliance impossible on any existing hardware?

-Raystonn

Would this change make compliance impossible on any existing hardware?
Let’s readjust that question:

Would this change make compliance possible on any existing hardware?

Answer: No. Not G80, nor R600. And God knows what we’ve got now can’t handle it.

Actually, it works fine on the G80 and on the NV4x. (Please read above where I give my test results.) In fact, it looks like most hardware already handles this correctly. We would just need to update the spec to ensure it stays that way.

-Raystonn

Originally posted by Raystonn:
Actually, it works fine on the G80 and on the NV4x. (Please read above where I give my test results.) In fact, it looks like most hardware already handles this correctly.

It is only a coincidence that the way you use it interacts with the hardware implementation in the way you expect. The hardware does nothing to ensure that it will not break in some situations (e.g. for small textures or for overlapping meshes). If the specification required that this always behave correctly, the extension would not be supported.

I have no problem with this not working for cases where a pixel is drawn to more than once per pass, such as for overlapping meshes. The specification can clearly state that reading from a pixel will result in an undefined value after that pixel has been written at any point during the pass.

-Raystonn

OK, how about this:

The extension specification was written by people who know a lot more about how their hardware works than you do. They clearly considered this issue and decided to reject it.

End of discussion.

How about this:

Don’t assume every possible scenario has already been contemplated by the hardware engineers. Think outside the box.

If you want to end discussion, then don’t partake in discussion forums.

-Raystonn

Well, I guess Korval is right… But I still would like to know the reasons behind it. This hack works on all nVidia hardware from NV4x onward (as far as I know) under each driver that supports FBO. From what Humus said, I understand that it also works on ATI. Can someone who was working on the spec or has a connection to it (I know for sure that some of them read this forum) say something on this issue?

Don’t assume every possible scenario has already been contemplated by the hardware engineers.
They did consider it: see below. It was considered and rejected.

But I still would like to know the reasons behind it.
The FBO spec, issue #44.

Also, OpenGL is not the ATi/nVidia cross platform API. It needs to remain an implementable specification, so things that are vagaries of the specific way hardware goes about implementing something need to not be too specific. To define it would be to force very specific implementation details on implementers, much like specifically defining what a “float” is in glslang is too specific to allow for potentially creative implementations.

Well, the whole problem with the assertion that this is OK within a single pass is that a ‘pass’ is not defined anywhere in OpenGL. Is it every time you call begin? Establishing such a semantic would require that implementations flush their caches at that point, which could have a negative impact on performance. Also, there have been lots of graphics cards in the past that would require a surface transfer between rendering and texturing. These would break under that scenario. An SLI config where each card rendered half of the FBO would be another example where this is not something that could be fully guaranteed without sacrificing performance.

-Evan

I quote from the FBO spec, issue #44:

As background, the reason this is an issue in the first place is that simultaneously reading from, and writing to, the same texture image is likely to be problematic on multiple vendors’ hardware without paying performance penalties for excessive synchronization and/or data copying.
As you can see here, the hardware engineers did not contemplate the scenario I am describing. They indicate that simultaneous reading and writing should not be allowed, and I fully agree. I am proposing allowing a specific subset of this functionality. Specifically, reading from a pixel prior to any write that could affect that pixel should return the value of that pixel prior to that rendering pass beginning. Any other mix of reading and writing of the same location should result in undefined values being read.

-Raystonn

Well, the whole problem with the assertion that this is OK within a single pass is that a ‘pass’ is not defined anywhere in OpenGL. Is it every time you call begin? Establishing such a semantic would require that implementations flush their caches at that point, which could have a negative impact on performance.
For multipass with early-z branching, a pass is implicitly defined as anything occurring between a glBeginQuery(GL_SAMPLES_PASSED, …) and the accompanying glEndQuery(GL_SAMPLES_PASSED). Such techniques typically require you to block in glGetQueryObjectuiv(…, GL_QUERY_RESULT, …) waiting to be told how many pixels were drawn so you can halt the passes when no further pixels need to be updated. If you are doing multipass without early-z branching for some reason, then nailing down the moment the write caches must be flushed out to memory becomes more problematic.
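The control flow of such a query-driven multipass loop can be sketched on the CPU in C. This is only a model: run_pass() is a hypothetical stand-in for drawing a pass between glBeginQuery(GL_SAMPLES_PASSED, …) and glEndQuery(GL_SAMPLES_PASSED), and its return value plays the role of the count read back with glGetQueryObjectuiv. The halving operation and the limit parameter are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical pass: every pixel still above `limit` is halved.
   The return value stands in for the samples-passed query result. */
unsigned run_pass(float *buf, size_t n, float limit) {
    unsigned samples_passed = 0;
    for (size_t i = 0; i < n; ++i) {
        if (buf[i] > limit) {
            buf[i] *= 0.5f;
            ++samples_passed;
        }
    }
    return samples_passed;
}

/* Issue passes until one of them updates no pixels.
   Returns the number of passes that actually wrote something. */
unsigned run_until_converged(float *buf, size_t n, float limit) {
    unsigned passes = 0;
    while (run_pass(buf, n, limit) > 0)
        ++passes;
    return passes;
}
```

On real hardware the loop condition is the blocking point: the CPU cannot decide whether to issue the next pass until the previous pass has finished and its query result is available.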

Also, there have been lots of graphics cards in the past that would require a surface transfer between rendering and texturing. These would break under that scenario. An SLI config where each card rendered half of the FBO would be another example where this is not something that could be fully guaranteed without sacrificing performance.
I don’t see how SLI would affect this. This is not scatter writing. As for the surface transfer between rendering and texturing, we already have that problem with early-z branching multipass techniques. We are basically forced to wait for the pass to finish rendering before beginning the next, as we need to know when the number of samples written reaches 0.

-Raystonn

Actually, that brings up another idea. Why force the CPU to wait for the rendering pass to complete, check a single value, and issue another? The drivers or hardware should be capable of stringing together passes until 0 pixels are written.

-Raystonn