S3 DeltaChrome - bidirectional depth & color buffers ??

Finally, DeltaChrome’s pixel shaders feature bi-directional Z and color buffers. Standard DirectX 9 pixel shaders utilize unidirectional color and Z buffers. This allows the color and Z buffers to send or receive data from the pixel shaders.

(from FiringSquad)

This sounds like they want to implement the ability to read and modify a fragment's underlying framebuffer contents in the fragment program - just as it is defined in the GL2 shading language (readable gl_FBColor, gl_FBDepth, …). This would make blending, the depth test, and polygon offset (or even the stencil test/op) programmable.
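
To make that concrete, here is a rough CPU-side model (the names blend_fixed and blend_programmable are made up for illustration; nothing here is actual S3 or GL2 API): with today's unidirectional buffers the read-modify-write is locked to a fixed menu of blend equations, while a readable gl_FBColor would let the fragment program compute any function of the incoming and stored colors.

```c
#include <stdio.h>

typedef struct { float r, g, b, a; } Color;

/* Fixed-function blending: the hardware offers only a fixed menu of
 * equations, e.g. classic source-alpha blending. */
Color blend_fixed(Color src, Color dst)
{
    Color out = {
        src.r * src.a + dst.r * (1.0f - src.a),
        src.g * src.a + dst.g * (1.0f - src.a),
        src.b * src.a + dst.b * (1.0f - src.a),
        src.a
    };
    return out;
}

/* "Bidirectional" buffers: the fragment program sees dst (gl_FBColor in
 * the GL2 proposal) and may combine it with src however it likes --
 * here an arbitrary example the fixed menu cannot express. */
Color blend_programmable(Color src, Color dst)
{
    Color out = {
        src.r > dst.r ? src.r : dst.r * dst.r,  /* any function at all */
        src.g * dst.g,
        src.b + dst.b * 0.5f,
        1.0f
    };
    return out;
}

int main(void)
{
    Color src = { 0.8f, 0.2f, 0.1f, 0.5f };
    Color dst = { 0.1f, 0.4f, 0.9f, 1.0f };
    Color a = blend_fixed(src, dst);
    Color b = blend_programmable(src, dst);
    printf("fixed: %.2f %.2f %.2f  programmable: %.2f %.2f %.2f\n",
           a.r, a.g, a.b, b.r, b.g, b.b);
    return 0;
}
```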

What do you think about possible applications of the feature?


It makes it a lot easier to create arbitrarily complex multipass shaders and to break large shaders into simpler multipass ones, IMHO. I mentioned this in a shading language thread a long time ago. I think the name 'bidirectional' is unfortunate.

Yeah, this sounds like the thing for the future. It should solve the blending-on-float-buffers problem, and of course it adds loads of flexibility.

That's cool! Will the reading back be fast enough? The 3Dlabs GL2 whitepaper said that reading back might be very slow with gl_FB*; I didn't expect it to be implemented so soon.

It sounds like it would make things run really slow, to me. Perhaps if you tiled your buffer, put it in insanely expensive, close SRAM, and got good locality, you could make it fast enough to be worth it.

jwatte, why? It is actually just another texture access… on the first access to the framebuffer, read it into a register (a tex instruction, if you could use the target buffer as a texture), and at the end, just write the resulting color out.

It costs some performance, but it shouldn't cost more than an ordinary texture access does.

Does anyone have a reason why it should?

GPUs are pipelines, so there are almost certainly more fragments in flight between the one you're shading and the frame buffer. If any of those other fragments will update the framebuffer location you want to read, you have a hazard.

That’s the basic problem. It’s not as simple as just treating the framebuffer as a texture.

Thanks -
Cass

I thought a texture unit has a texture cache, so wouldn't framebuffer reads be slower than texel fetches unless the framebuffer is cached as well? Is that right?
At most how many fragments do you think will be in the pipeline below the fragment that's being shaded? What I want to know is: will it be fine most of the time? How often will two fragments belonging to the same screen position end up right after each other in the fragment FIFO? If the worst case rarely occurs, the average performance should be good enough.

Originally posted by cass:
[b]
GPUs are pipelines, so there are almost certainly more fragments in flight between the one you're shading and the frame buffer. If any of those other fragments will update the framebuffer location you want to read, you have a hazard.

That’s the basic problem. It’s not as simple as just treating the framebuffer as a texture.

Thanks -
Cass[/b]

Hm… actually a pixel only has access to its own older value at the same position in the frame buffer… parallelism or not, I don't think you draw several triangles at the same time at the same place; you process a bunch of pixels at different places at the same time… at least, that's how it makes sense (say 8 fragment pipelines => pixel[x,y] … pixel[x+7,y] processed at the same time).

Pipelined or not, the thing works in parallel, so there is NO problem then with reading and writing at the same time. It's a simple texture access… caching is an issue, yes, but on the other hand you have a streamlined readout, which is quite easy to make fast too, I think.

Or don't you work with scanlines?

Please explain in more detail what the problem is…

I think he's talking about the fragments that are in the same pipeline but closer to the frame buffer than the fragment that's being processed. Say a fragment is undergoing blending (or some other operation that's done after pixel shading); then, when the fragment that's being shaded tries to access the frame buffer, the result will be wrong, because some fragments are still on their way to the frame buffer.

Will future fragment processors grow to include (and replace) everything (like blending and fog) that's done after the fragment shader stage? Then there shouldn't be the problem that cass mentions.
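
Here is a toy model of what cass is describing (pure illustration - the FIFO depth and all the names are invented, and real pipelines are far deeper and wider): a fragment reads the framebuffer as it enters the shader, but the write of an earlier fragment at the same coordinates is still queued behind it, so the read sees a stale value.

```c
/* Toy model of the read-after-write hazard in a deep fragment pipeline.
 * Hypothetical throughout -- real GPUs don't expose their pipelines
 * like this; it only illustrates the stale read. */
#include <stdio.h>

#define PIPE_DEPTH 4                  /* assumed in-flight fragment count */

typedef struct { int x, color; } Fragment;

static int framebuffer[8];            /* 1-D "screen" for simplicity */
static Fragment pipe[PIPE_DEPTH];     /* fragments between shader and FB */
static int pipe_count = 0;

/* A fragment enters the shader stage: it reads the framebuffer NOW,
 * but its own write only lands after it drains through the pipeline. */
static void shade(int x, int new_color)
{
    int seen = framebuffer[x];   /* stale if an in-flight fragment hits x */
    printf("fragment at x=%d read %d\n", x, seen);

    if (pipe_count == PIPE_DEPTH) {          /* retire the oldest write */
        framebuffer[pipe[0].x] = pipe[0].color;
        for (int i = 1; i < PIPE_DEPTH; i++) pipe[i - 1] = pipe[i];
        pipe_count--;
    }
    pipe[pipe_count++] = (Fragment){ x, new_color };
}

int main(void)
{
    shade(3, 100);   /* writes 100 to x=3 ... eventually */
    shade(3, 200);   /* reads x=3 and sees 0, not 100: the hazard */

    /* drain the pipeline so the queued writes finally land */
    while (pipe_count > 0) {
        framebuffer[pipe[0].x] = pipe[0].color;
        for (int i = 1; i < pipe_count; i++) pipe[i - 1] = pipe[i];
        pipe_count--;
    }
    printf("final value at x=3: %d\n", framebuffer[3]);
    return 0;
}
```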

Hm, okay, I see the idea… but well… I want to see some numbers:
how often, nvidia, does a typical game (say ut2003) feed the same pixel coordinates twice, with different input values? How many fragments sit behind each other in a pipeline? Is there really a big chance of two fragments that have to be drawn at the same place arriving so close after each other that they can't be processed in parallel…

Seriously, I can't see that situation happening often => in normal situations parallelism is guaranteed, say, 99.9% of the time… but I'm interested in numbers…

Davepermen,

If it's like a texture access, it's not guaranteed to read just the pixel you're currently writing. You may want to read pixels out of the frame buffer other than the one you're going to write. Thus, you have a generalized read-after-write hazard.

If you allow reading of the frame buffer pixel that your fragment contributes to, you still have the problem of multisampling.

If you only allow reading the corresponding framebuffer color value for exactly the fragment you're processing (I don't know if that's possible in all antialiasing applications, but let's pretend it is), then you still have inefficiency, because framebuffers typically live in DRAM-like memories, which do not like being turned around between reading and writing, and may add many cycles of latency each time you change direction.

Which leads me back to my original speculation that you would have to render in some tiled fashion, keeping the entire tile (be it horizontal, vertical, or square) in some close, fast SRAM that has better access characteristics. And even so, reads from other parts of the framebuffer would still be less efficient (thus my comment about locality). Although, because you're tiling, you know which parts can be cached and which parts need hazard resolution, and you'd probably serialize switching between destination tiles to make that a non-problem.

Make sense?

Originally posted by jwatte:
[b]Davepermen,

If it's like a texture access, it's not guaranteed to read just the pixel you're currently writing. You may want to read pixels out of the frame buffer other than the one you're going to write. Thus, you have a generalized read-after-write hazard.

If you allow reading of the frame buffer pixel that your fragment contributes to, you still have the problem of multisampling.

If you only allow reading the corresponding framebuffer color value for exactly the fragment you're processing (I don't know if that's possible in all antialiasing applications, but let's pretend it is), then you still have inefficiency, because framebuffers typically live in DRAM-like memories, which do not like being turned around between reading and writing, and may add many cycles of latency each time you change direction.

Which leads me back to my original speculation that you would have to render in some tiled fashion, keeping the entire tile (be it horizontal, vertical, or square) in some close, fast SRAM that has better access characteristics. And even so, reads from other parts of the framebuffer would still be less efficient (thus my comment about locality). Although, because you're tiling, you know which parts can be cached and which parts need hazard resolution, and you'd probably serialize switching between destination tiles to make that a non-problem.

Make sense?[/b]

I never talked about general texture accessing. I only meant the fragment right under the new fragment, as in blending.

And yes, there is the read-write work to do - just like in blending. You get the idea? It doesn't have to do much more than blending does, so where's the real issue? Yes, blending IS slower; it does hurt performance, but it doesn't "kill" it, and the features blending gave us are worth that little speed drop. I don't see why reading, writing and all that should cost any more. The only difference is that you have to read earlier => if you have several fragments at the same place at the same time, you get a stall. But somehow I just don't think that has to happen very often, with a little bit of clever rendering.

At least, I know the Radeon renders tile-based, in 8x8 or 16x16 chunks or so. The first drivers had some bugs in the fragment programs which made such tiles visible. And I think if you render such a tile at once - well, a line of the tile at once in all 8 pipelines, with the pixels behind each other due to the pipelining - you should not get stalls. Tile-based is the future anyway, we know that by now. Hairdryer-style cooling isn't cool; it just shows a lack of better-designed hw…

Now I feel OGL 2.0-capable hardware (the 3Dlabs proposal) will be out in much less than 5 years. I just hope API-chasing-the-hardware won't hold true again in 2-3 years.

Someone up the thread was talking about accessing the frame buffer as a general texture. And, as far as I know, current blend hardware is very specialized to do the read-modify-write cycle very fast. They can't even (currently) fit a floating-point unit in that path, so I can only imagine that putting an entire fragment program in the way would make it that much worse.

If all you have accessible is the frame buffer fragment corresponding to the current fragment, then I agree things would become a little easier, although probably not as "easy" as the current blend hardware. But I'd have to ask an ASIC designer to get any kind of certainty about that :)

The "blend shader" approach - that is, feeding the current render target to the pixel shader as a texture - tends to work under D3D. Actually, I haven't heard of it not working, although I haven't (yet) needed to use it myself. It's generally discouraged by IHVs, but it works :-). As a side note, a friend of mine successfully used a render target simultaneously as a texture to avoid render-target switching in a "ping-pong" blur effect - so even things less restrictive than one-to-one mapping may work.
I think custom blending should be allowed - it's getting really important with floating-point/10:10:10:2 buffers, as they generally don't allow any blending. If the DeltaChrome chip explicitly allows for that (and if that chip is for real :) ), then I hope it'll become a standard feature exposed as a GL extension/D3D cap bit.

It works?? Perhaps it's a side effect of the texture cache, so if you sampled a texel far enough from the current fragment's location, the effect would disappear…
Well, if it works under certain conditions in DX, then there is a possibility it might work in GL as well - the spec of WGL_ARB_render_texture says that:

(after calling wglBindTexImageARB…)
any draw operation that is done to the pbuffer prior to wglReleaseTexImageARB being called, produces indeterminant results.
The "indeterminant result" might happen to be the one you desire.

I've just tested it with WGL_ARB_render_texture, and it really works. Exactly as I said above: if you sample a texel distant enough from the current fragment, you get garbage on screen, but a simple one-to-one read-modify-write is OK. On my GF3, at least.
Sampling within (about) a 1-texel neighbourhood also seems to work, but it's probably far more unsafe.
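
For reference, the skeleton of that test looks roughly like this (a sketch only - pbuffer and context creation are omitted, the hPbuffer/function-pointer setup is assumed to be done elsewhere, and the whole thing leans on behavior the spec explicitly leaves undefined):

```c
/* Bind a pbuffer's own color buffer as a texture while still rendering
 * into it, and draw a full-screen quad so each fragment reads the texel
 * directly beneath it (the "one-to-one" case). Undefined per the spec;
 * it merely happened to work on the hardware tested in this thread. */
#include <windows.h>
#include <GL/gl.h>
#include "wglext.h"   /* WGL_ARB_render_texture tokens and typedefs */

/* assumed to be fetched via wglGetProcAddress during setup */
extern PFNWGLBINDTEXIMAGEARBPROC    wglBindTexImageARB;
extern PFNWGLRELEASETEXIMAGEARBPROC wglReleaseTexImageARB;
extern HPBUFFERARB hPbuffer;  /* created with WGL_TEXTURE_FORMAT_ARB etc. */

void read_modify_write_pass(void)
{
    /* Undefined by the spec: pbuffer is render target AND texture. */
    wglBindTexImageARB(hPbuffer, WGL_FRONT_LEFT_ARB);
    glEnable(GL_TEXTURE_2D);

    /* Unit ortho projection: texcoords 0..1 over the full viewport map
     * each fragment onto its own texel (modulo half-texel offsets). */
    glMatrixMode(GL_PROJECTION); glLoadIdentity();
    glOrtho(0, 1, 0, 1, -1, 1);
    glMatrixMode(GL_MODELVIEW);  glLoadIdentity();

    glBegin(GL_QUADS);
    glTexCoord2f(0, 0); glVertex2f(0, 0);
    glTexCoord2f(1, 0); glVertex2f(1, 0);
    glTexCoord2f(1, 1); glVertex2f(1, 1);
    glTexCoord2f(0, 1); glVertex2f(0, 1);
    glEnd();

    glDisable(GL_TEXTURE_2D);
    wglReleaseTexImageARB(hPbuffer, WGL_FRONT_LEFT_ARB);
}
```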

Relying on behavior that is explicitly specified to be undefined is as unsafe as you can get. It's exactly like reading data through a pointer that has a random value, or using stack memory without initializing it. You cannot even say that you will get the same behavior the next time you run your program as you got the last time you ran it, even if you got the behavior you wanted 100 times in a row. You cannot say that your computer won't lock up and reboot when you do it with a red pixel right after a blue pixel, even though it works with every other combination of colors. It may be that it works except when the frame starts on a clock timer value that is a multiple of 256. You cannot be confident that your program will continue to work on the next driver update, much less on the next video card you try the program on.

You cannot even use the results of this test to say that it is possible… a lot of things become fast and easy when correctness and stability requirements are ignored.

I do realize what the consequences of relying on undefined results are. However, in past threads both Matt from NVIDIA and Evan from ATI recommended a particular usage scheme of WGL RTT which is explicitly specified as undefined. Both assured us it does work on their HW. It was about RTT into a double-buffered pbuffer. In practice, that technique is very closely related to the one described in this thread. I think the unsafety of relying on an undefined result is not as big a problem in this particular case as you see it.