Can gl_FragColor be read?



Raystonn
11-18-2006, 02:25 PM
Normally one uses gl_FragColor to write the output of the shader, setting the pixel color. When using multipass rendering, it can be a useful optimization (to avoid ping-ponging) to read the current value of the target pixel in the texture, perform the operations on that data, and then write the new value back to that location. Is this strictly legal in OpenGL? Also, whether legal or not, does this work on most modern nVidia and ATI hardware?

Thanks,

-Raystonn

Korval
11-18-2006, 02:27 PM
No.

Raystonn
11-18-2006, 02:30 PM
Thanks for saving me some time experimenting on this optimization!

-Raystonn

zeoverlord
11-18-2006, 03:33 PM
Though everybody wished that would be the case.
(if it was as fast as a regular texture fetch)

Humus
11-18-2006, 06:14 PM
The variable gl_FragColor is readable; however, it doesn't read from the framebuffer. Reading it just gives you the value you already assigned to it in the shader. In other words, you can do something like this:


gl_FragColor = texture2D(Base, texCoord);
gl_FragColor.a = dot(gl_FragColor.rgb, gl_FragColor.rgb);

Raystonn
12-05-2006, 11:39 AM
I have done some experiments and have found that I can read from a texture and later write to it during a single-pass fragment shader run. This works on two nVidia GPUs: the GeForce Go 6800 and the GeForce 8800 GTX.

I got this working with the following technique: Add the texture to an FBO. Enable the texture for reading the way you'd normally bind a texture. Enable the texture for writing via gl_FragData[1] using glDrawBuffers.

I have only tried this technique when using multiple textures. The target texture is not the first texture for either reading or writing. I have not tried reading after writing, as I can see the result of such an operation being undefined.

If you try this on other hardware, specifically non-nVidia hardware, please let me know of your success. Avoiding a ping-pong can save quite a bit of texture memory.

Thanks,

-Raystonn
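
A CPU-side sketch may help show what the ping-pong costs and what the single-texture variant saves. It is not OpenGL code: the arrays stand in for textures, `shade` is a hypothetical per-pixel operation, and every function name here is made up for illustration. It only covers the case described above, where each pixel reads its own location once and then writes it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-pixel operation standing in for a fragment shader. */
static unsigned shade(unsigned previous)
{
    return previous / 2 + 1;
}

/* Ping-pong pass: read from src, write to dst. The caller swaps the two
 * buffers before the next pass, so two full-size buffers are needed. */
void pass_ping_pong(const unsigned *src, unsigned *dst, size_t n)
{
    assert(src != dst); /* the two buffers must be distinct */
    for (size_t i = 0; i < n; ++i)
        dst[i] = shade(src[i]);
}

/* In-place pass: each pixel reads its own old value, then overwrites it.
 * Only one buffer is needed. */
void pass_in_place(unsigned *buf, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        buf[i] = shade(buf[i]);
}
```

On the CPU both functions produce identical pixels for the same input, and the in-place version needs one buffer instead of two. Whether a GPU gives the same guarantee is exactly what the rest of the thread argues about.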

Komat
12-05-2006, 12:18 PM
Originally posted by Raystonn:

I have only tried this technique when using multiple textures. The target texture is not the first texture for either reading or writing. I have not tried reading after writing, as I can see the result of such an operation being undefined.
The FBO specification explicitly states (section 4.4.3) that the value of a fragment is undefined if it is possible for it to sample from a texture level that is simultaneously attached to the bound FBO. So even if it works on some hw with current drivers, it might break at any time.

Raystonn
12-05-2006, 12:26 PM
My opinion is that this section of the spec should be changed. A read before a write should fetch the current value as if it was not possible to write. A read after a write should be left undefined, as today's hardware does not synchronize reads and writes. You may get the written value, and you may get the old value, depending on how the hardware's cache system works.

Would this change make compliance impossible on any existing hardware?

-Raystonn

Korval
12-05-2006, 12:52 PM
Originally posted by Raystonn:
Would this change make compliance impossible on any existing hardware?
Let's readjust that question:

Would this change make compliance possible on any existing hardware?

Answer: No. Not G80, nor R600. And God knows what we've got now can't handle it.

Raystonn
12-05-2006, 01:15 PM
Actually, it works fine on the G80 and on the G40. (Please read above where I give my test results.) In fact, it looks like most hardware already handles this correctly. We would just need to update the spec to ensure it stays that way.

-Raystonn

Komat
12-05-2006, 01:21 PM
Originally posted by Raystonn:
Actually, it works fine on the G80 and on the G40. (Please read above where I give my test results.) In fact, it looks like most hardware already handles this correctly.
It is only a coincidence that the way you use it interacts with the hw implementation in the way you are expecting. The hw does nothing to ensure that it will not break in some situations (e.g. for small textures or for overlapping meshes). If the specification required this to always behave correctly, the extension would not be supported.

Raystonn
12-05-2006, 01:30 PM
I have no problem with this not working for cases where a pixel is drawn to more than once per pass, such as for overlapping meshes. The specification can clearly state that reading from a pixel will result in an undefined value after that pixel has been written at any point during the pass.

-Raystonn

Korval
12-05-2006, 01:49 PM
OK, how about this:

The extension specification was written by people who know a lot more about how their hardware works than you do. They clearly considered this issue and decided to reject it.

End of discussion.

Raystonn
12-05-2006, 02:01 PM
How about this:

Don't assume every possible scenario has already been contemplated by the hardware engineers. Think outside the box.

If you want to end discussion, then don't partake in discussion forums.

-Raystonn

Zengar
12-05-2006, 02:02 PM
Well, I guess Korval is right... But I still would like to know the reasons behind it. This hack works on all Nvidia hardware from Nv4x (as far as I know) under each driver that supports FBO. From words of Humus I also understand that it also works for Ati. Can someone who was working on the spec or has a connection to it (I know for sure that some of them read this forum) say something on this issue?

Korval
12-05-2006, 02:19 PM
Originally posted by Raystonn:
Don't assume every possible scenario has already been contemplated by the hardware engineers.
They did consider it: see below. It was considered and rejected.


Originally posted by Zengar:
But I still would like to know the reasons behind it.
The FBO spec, issue #44.

Also, OpenGL is not the ATi/nVidia cross platform API. It needs to remain an implementable specification, so things that are vagaries of the specific way hardware goes about implementing something need to not be too specific. To define it would be to force very specific implementation details on implementers, much like specifically defining what a "float" is in glslang is too specific to allow for potentially creative implementations.

ehart
12-05-2006, 02:27 PM
Well, the whole problem with the assertion that this is OK within a single pass is that a 'pass' is not defined anywhere in OpenGL. Is it every time you call begin? Establishing such a semantic would require that implementations flush their caches at that point, which could have a negative impact on performance. Also, there have been lots of graphics cards in the past that would require a surface transfer between rendering and texturing. These would break under that scenario. An SLI config where each card rendered 1/2 of the FBO would be another example where this is not something that could be fully guaranteed without sacrificing performance.

-Evan

Raystonn
12-05-2006, 02:30 PM
I quote from the FBO spec, issue #44:

"As background, the reason this is an issue in the first place is that simultaneously reading from, and writing to, the same texture image is likely to be problematic on multiple vendors' hardware without paying performance penalties for excessive synchronization and/or data copying."

As you can see here, the hardware engineers did not contemplate the scenario I am describing. They indicate that simultaneous reading and writing should not be allowed, and I fully agree. I am proposing allowing a specific subset of this functionality. Specifically, reading from a pixel prior to any write that could affect that pixel should return the value that pixel had before the rendering pass began. Any other mix of reading and writing of the same location should result in undefined values being read.

-Raystonn

Raystonn
12-05-2006, 02:43 PM
Originally posted by ehart:
Well, the whole problem with the assertion that this is OK within a single pass is that a 'pass' is not defined anywhere in OpenGL. Is it every time you call begin? Establishing such a semantic would require that implementations flush their caches at that point, which could have a negative impact on performance.
For multipass with early-z branching, a pass is implicitly defined as anything occurring between a glBeginQuery(GL_SAMPLES_PASSED, ...) and the accompanying glEndQuery(GL_SAMPLES_PASSED). Such techniques typically require you to block in glGetQueryObjectuiv(..., GL_QUERY_RESULT, ...) waiting to be told how many pixels were drawn so you can halt the passes when no further pixels need to be updated. If you are doing multipass without early-z branching for some reason, then nailing down the moment the write caches must be flushed out to memory becomes more problematic.


Originally posted by ehart:
Also there have been lots of graphics cards in the past that would require a surface transfer between rendering and texturing. These would break under that scenario. An SLI config where each card rendered 1/2 of the FBO would be another example where this is not something that could be fully guaranteed without sacrificing performance.
I don't see how SLI would affect this. This is not scatter writing. As for the surface transfer between rendering and texturing, we already have that problem with early-z branching multipass techniques. We are basically forced to wait for the pass to finish rendering before beginning the next, as we need to know when the number of samples written reaches 0.

-Raystonn
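
The pass loop described above could be sketched on the CPU roughly as follows. This is a stand-in under stated assumptions, not GL code: `run_pass` plays the role of one draw wrapped in glBeginQuery(GL_SAMPLES_PASSED, ...) and glEndQuery(GL_SAMPLES_PASSED), its return value plays the role of the count fetched with glGetQueryObjectuiv(..., GL_QUERY_RESULT, ...), and the halving update, the `floor_value` cut-off, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One pass over the image. A pixel is only updated while it is above
 * floor_value (a stand-in for the early-z rejection of finished pixels).
 * Returns how many pixels were written, like a samples-passed count. */
size_t run_pass(unsigned *pixels, size_t n, unsigned floor_value)
{
    size_t written = 0;
    for (size_t i = 0; i < n; ++i) {
        if (pixels[i] > floor_value) {
            pixels[i] /= 2;
            ++written;
        }
    }
    return written;
}

/* Issue passes until one of them writes nothing. The caller has to wait
 * for each count before deciding whether to issue the next pass, which is
 * the CPU/GPU round trip discussed in this thread. Returns the number of
 * passes issued, including the final empty one. */
size_t run_until_converged(unsigned *pixels, size_t n,
                           unsigned floor_value, size_t max_passes)
{
    assert(max_passes > 0); /* at least one pass must be allowed */
    size_t passes = 0;
    while (passes < max_passes) {
        size_t written = run_pass(pixels, n, floor_value);
        ++passes;
        if (written == 0)
            break;
    }
    return passes;
}
```

The last pass always writes nothing: it exists only to learn that the work is finished, which is part of why blocking on the query result each time is costly.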

Raystonn
12-05-2006, 02:46 PM
Actually, that brings up another idea. Why force the CPU to wait for the rendering pass to complete, check a single value, and issue another? The drivers or hardware should be capable of stringing together passes until 0 pixels are written.

-Raystonn

Zengar
12-05-2006, 02:48 PM
Korval, I still can see that this actually works on current hardware without any synchronization problems. From the FBO spec it is not clear that they even considered a positive answer (like, yes, we could allow it). Why wouldn't they, if the hardware actually supports it? They could at least introduce it as an extra extension.

IMHO, it is difficult to control such behaviour. It seems that reading from a texel that is going to be written to is OK, but what if one were to read a texel that was written to in the last shader clock (a texel from a different quad, for example)? This is where the real issues come in, synchronisation-wise.

I think, they should introduce the gl_FramebufferColor (or however it is supposed to be called) at last, as a variable that holds the current framebuffer value.

Well, they will have their own reasons...

P.S. As far as I know, FBOs are not accelerated by SLI; it only applies to rendering to a window. This is why they made the GPU_affinity extension in the first place (the point of which I don't really get).
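
The hazard of reading a texel that another fragment has just written can be illustrated with a small CPU sketch. It is only a model under stated assumptions: the arrays stand in for textures, `filter` is a hypothetical shader that averages a texel with its left neighbour, and the two loop directions stand in for the fact that a GPU makes no promise about the order in which fragments are processed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "shader": average of a texel and its left neighbour,
 * clamping at the left edge. */
static unsigned filter(const unsigned *tex, size_t i)
{
    size_t left = (i == 0) ? 0 : i - 1;
    return (tex[left] + tex[i]) / 2;
}

/* Two-buffer pass: every read sees the old image, so the result does not
 * depend on processing order. */
void filter_ping_pong(const unsigned *src, unsigned *dst, size_t n)
{
    assert(src != dst); /* the two buffers must be distinct */
    for (size_t i = 0; i < n; ++i)
        dst[i] = filter(src, i);
}

/* In-place pass, left to right: each neighbour read sees a value that was
 * already overwritten during this same pass. */
void filter_in_place_ascending(unsigned *buf, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        buf[i] = filter(buf, i);
}

/* In-place pass, right to left: each neighbour read still sees the old
 * value, so the result happens to match the two-buffer pass. */
void filter_in_place_descending(unsigned *buf, size_t n)
{
    for (size_t i = n; i-- > 0;)
        buf[i] = filter(buf, i);
}
```

For the input {0, 8, 16, 32}, the two-buffer pass and the right-to-left pass both give {0, 4, 12, 24}, while the left-to-right pass gives {0, 4, 10, 21}. The in-place result therefore depends on an ordering that the application does not control on real hardware.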

ZbuffeR
12-05-2006, 02:52 PM
Raystonn, what you mention looks like the (future) "Sample Shader" in this document :
http://www.gamedev.net/columns/events/gdc2006/article.asp?id=233

Raystonn
12-05-2006, 02:54 PM
ZbuffeR, that sounds sweet. It will be a great relief to be able to completely decouple the CPU from the GPU on branch-heavy shader code.

-Raystonn

Komat
12-05-2006, 02:57 PM
Originally posted by Raystonn:
For multipass with early-z branching, a pass is implicitly defined as anything occurring between a glBeginQuery(GL_SAMPLES_PASSED, ...) and the accompanying glEndQuery(GL_SAMPLES_PASSED).
Actually that is only what you consider a pass. Other people use the queries for different things and do not wish to relate that with cache flushes.
The proper way to mark something like a pass would be to add an additional function for that purpose.

Raystonn
12-05-2006, 02:57 PM
ZbuffeR, I'm a bit skeptical on the details, though. If they continue to formally disallow any and all mixtures of reading and writing the same pixel, then we will need to swap the read and write textures each pass. Without explicit support for this being done for you, I don't see the sample shader being very useful.

-Raystonn

Raystonn
12-05-2006, 03:05 PM
Komat, that is certainly a possibility. If we use two buffers and swap them (the current official way to do this), then caches are flushed every time they are swapped. If we use a single texture and read the current value prior to writing the new one, then we need a way to tell the GL to flush the cache so it can be guaranteed to be read. A new function is one way to do this.

Either way, when you multipass ping-pong you need the cache flushed for the targets you wrote to during the last pass.

-Raystonn

Komat
12-05-2006, 03:06 PM
Originally posted by Zengar:

I think, they should introduce the gl_FramebufferColor (or however it is supposed to be called) at last, as a variable that holds the current framebuffer value.
That was considered in the GLSL 1.10 specification in two issues (#7 and #23). It was rejected because of speed concerns.

Raystonn
12-05-2006, 03:07 PM
If such functions were to be introduced, they should target individual write targets. This would allow you to flush a single texture to memory without overly impacting performance by forcing a flush on every write target. You ideally want to flush only the textures that both a) were written to last pass, and b) will be read during the next pass.

-Raystonn

Zengar
12-05-2006, 03:37 PM
@Komat: yup, I know, but it is still somehow sad, seeing as it is supported by current cards (at least partially :-)

Korval
12-05-2006, 04:51 PM
Originally posted by Raystonn:
For multipass with early-z branching, a pass is implicitly defined as anything occurring between a glBeginQuery(GL_SAMPLES_PASSED, ...) and the accompanying glEndQuery(GL_SAMPLES_PASSED).
Wait, let me get this straight.

Now you want to bind FBO behavior to that of occlusion queries? How about they just put out an "EXT_write_Raystonns_code" extension while they're at it?

No, this is way too special-case for it to be a reasonable request.


Originally posted by Raystonn:
I don't see how SLI would affect this.
*cough*

"The extension specification was written by people who know a lot more about how their hardware works than you do."

The fact that you don't see it is completely irrelevant to whether or not it is there. They're saying that it is there, and that's what matters.


Originally posted by Raystonn:
We are basically forced to wait for the pass to finish rendering before beginning the next, as we need to know when the number of samples written reaches 0.
Hey, you choose the algorithm, not me and not the FBO authors. You should have made a better choice. And if no better choices make themselves available, then you bite down and accept what you've got.


Originally posted by Zengar:
Korval, I still can see that this actually works on current hardware without any synchronization problems.
Yes, and I'm sure that there's other unspecified behavior that "just works" too. Feel free to use it, but don't complain about the spec if in a later driver revision it ceases to work for you.


Originally posted by Zengar:
@Komat: yup, I know, but it is still somehow sad, seeing as it is supported by current cards (at least partially :-)
Maybe. Kinda. In a limited number of tested cases. And with completely unspecified restrictions.

Hardly the foundation for a good extension.

Humus
12-08-2006, 09:27 PM
Originally posted by Zengar:
From words of Humus I also understand that it also works for Ati.
Not sure if I ever said it did. I actually don't know if it does. It may work, or work in some cases, or may not work. I really don't know. I'm not familiar with all the hardware details involved.

Humus
12-08-2006, 09:36 PM
Originally posted by Zengar:
I think, they should introduce the gl_FramebufferColor (or however it is supposed to be called) at last, as a variable that holds the current framebuffer value.
As much as I love the idea as a software guy, there are hardware reasons why this would be a problem in practice. The hardware really likes having a straight top-to-bottom pipeline. Data enters at the top, gets processed, and is written out at the bottom, with no loopbacks. As soon as you loop data back, you're introducing a helluva lot of coherency checks for the hardware to perform, which would be a problem for parallelism, and it makes communication paths necessary between units that otherwise don't have to communicate. If you're rendering to and texturing from the same surface within the same draw call, you'd need communication between the backend and the texture units.