S3 DeltaChrome - bidirectional depth & color buffers ??



MZ
01-15-2003, 08:06 AM
"Finally, DeltaChrome’s pixel shaders feature bi-directional Z and color buffers. Standard DirectX 9 pixel shaders utilize unidirectional color and Z buffers. This allows the color and Z buffers to send or receive data from the pixel shaders."
(from FiringSquad: http://firingsquad.gamers.com/hardware/s3_deltachrome/page4.asp)

This sounds like they want to implement the ability to read and modify a fragment's underlying framebuffer contents in the fragment program, just as it is defined in the GL2 shading language (readable gl_FBColor, gl_FBDepth, ...). This would make programmable blending, depth test and polygon offset (or even stencil test/op) possible.
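As a rough illustration of what "programmable blending, depth test and polygon offset" could look like if the framebuffer values were readable, here is a minimal CPU-side sketch in Python. It is not GL2 shading language and not anything S3 has published; `Pixel`, `shade` and `depth_bias` are made-up names, and `fb` just stands in for what gl_FBColor / gl_FBDepth would provide.

```python
from typing import NamedTuple, Tuple

Color = Tuple[float, float, float, float]  # RGBA, hypothetical layout


class Pixel(NamedTuple):
    color: Color
    depth: float


def shade(src_color: Color, src_depth: float, fb: Pixel,
          depth_bias: float = 0.0) -> Pixel:
    """Hypothetical fragment program that can read the framebuffer pixel
    it is about to overwrite (fb plays the role of gl_FBColor/gl_FBDepth)."""
    # Programmable polygon offset: bias the incoming depth ourselves.
    depth = src_depth + depth_bias
    # Programmable depth test: a plain "less than" test written in the shader.
    if depth >= fb.depth:
        return fb  # fragment rejected, framebuffer left unchanged
    # Programmable blending: classic src-alpha / one-minus-src-alpha,
    # but any other formula could be written here instead.
    a = src_color[3]
    blended = tuple(s * a + d * (1.0 - a)
                    for s, d in zip(src_color, fb.color))
    return Pixel(blended, depth)
```

For example, shading a half-transparent white fragment at depth 0.25 over an opaque black pixel at depth 0.5 passes the test and blends to mid-grey, while a fragment at depth 0.75 leaves the pixel untouched.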

What do you think about possible applications of this feature?



dorbie
01-15-2003, 11:56 AM
It makes it a lot easier to create arbitrarily complex multipass shaders and to break large shaders into simpler multipass shaders, IMHO. I mentioned this in a shader language thread a long time ago. I think the name 'bidirectional' is unfortunate.

Humus
01-15-2003, 03:49 PM
Yeah, this sounds like the thing for the future. It should solve the problem of blending on float buffers, and of course it adds loads of flexibility.

tarantula
01-16-2003, 02:16 PM
That's cool! Will the reading back be fast enough? The 3dlabs GL2 whitepaper said that reading back might be very slow with gl_FB*. I didn't expect it to be implemented so soon :D

jwatte
01-16-2003, 06:23 PM
It sounds like it would make things run really slow, to me. Perhaps if you tiled your buffer, put it in insanely expensive, close SRAM, and got good locality, you could make it fast enough to be worth it.

davepermen
01-16-2003, 07:10 PM
jwatte, why? it is actually just another texture access.. at the first access to the framebuffer, read it into a register (a tex instruction, if you could use the target buffer as tex), and in the end, just write the resulting color out.

it _is_ a performance drop, but it should not be more of a performance drop than accessing a texture.

does anyone have a reason why it should?

cass
01-16-2003, 08:28 PM
GPUs are pipelines, so there are almost certainly more fragments between the one you're shading and the frame buffer. If any of those other fragments will update the framebuffer location you want to read, you have a hazard.

That's the basic problem. It's not as simple as just treating the framebuffer as a texture.
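The hazard can be shown with a toy pipeline model in Python. This is only a sketch under made-up assumptions: `render` and `latency` are hypothetical names, the "shader" does additive blending by reading the framebuffer when the fragment is shaded, and its write only lands after `latency` later fragments have been shaded. Real hardware is of course far more involved.

```python
from collections import deque


def render(fragments, latency):
    """Shade (pixel, value) fragments in order with additive blending.

    The framebuffer is read at shade time, but each result is written
    back only once `latency` newer fragments have entered the pipeline.
    """
    fb = {}
    in_flight = deque()  # results shaded but not yet written
    for pixel, value in fragments:
        # Read the framebuffer now; it may be stale if an older fragment
        # for the same pixel is still in flight.
        result = fb.get(pixel, 0) + value
        in_flight.append((pixel, result))
        if len(in_flight) > latency:
            p, r = in_flight.popleft()
            fb[p] = r
    # Drain the pipeline at the end of the frame.
    while in_flight:
        p, r = in_flight.popleft()
        fb[p] = r
    return fb


frags = [(0, 1), (1, 1), (0, 1)]  # pixel 0 is hit twice, two fragments apart
immediate = render(frags, latency=0)  # writes land at once: pixel 0 sums to 2
deep = render(frags, latency=2)       # second read of pixel 0 is stale
```

With `latency=2` the second fragment for pixel 0 is shaded before the first one has been written, so one update is lost. Whether this matters in practice depends on how often two fragments for the same pixel arrive closer together than the pipeline depth, which is the open question in this thread.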

Thanks -
Cass

tarantula
01-16-2003, 09:30 PM
I thought a texture unit has some texture cache, so frame buffer reads would be slower than a texel fetch unless some caching is used for the frame buffer. Is this right?
At most, how many fragments do you think will be in the pipeline below a fragment that's being shaded? What I want to know is whether it will be fine most of the time. After all, how often will two fragments belonging to the same position on the screen be placed one after the other in the fragment FIFO queue? If the worst case rarely occurs, the average performance should be good enough.

davepermen
01-16-2003, 10:03 PM
Originally posted by cass:

GPUs are pipelines, so there are almost certainly more fragments between the one you're shading and the frame buffer. If any of those other fragments will update the framebuffer location you want to read, you have a hazard.

That's the basic problem. It's not as simple as just treating the framebuffer as a texture.

Thanks -
Cass

hm.. actually a pixel only has access to its own older value at the same position in the frame buffer.. parallelism or not, i don't think you draw several triangles at the same place at the same time, but rather process a bunch of pixels at different places at the same time.. at least, that's how it makes sense (say 8 fragment pipelines => pixel[x,y] .. pixel[x+7,y] processed at the same time)

pipelined or not, that thing works in parallel, so there is _NO_ problem with the chance of reading and writing at the same time. it is a simple texture access.. caching is still an issue, but on the other hand, you have streamlined readout, which is quite easy to do fast, too, i think.

or don't you work with scanlines?

please explain in more detail what is the problem..

tarantula
01-16-2003, 11:23 PM
I think he's talking about the fragments that are in the same pipeline but closer to the frame buffer than the fragment that's being processed. Say a fragment is undergoing blending (or some other operation that's done after pixel shading). When the fragment that's being shaded tries to access the frame buffer, the result will be wrong, because some fragments are still on their way to the frame buffer.

Will the next fragment processors grow to include (and replace) all the things (like blending and fog) that are done after the fragment shader stage? Then the problem that cass mentions shouldn't exist.

davepermen
01-17-2003, 02:18 AM
hm okay i see the idea.. but well.. i want to see some numbers:
how often, nvidia, does a typical game (say ut2003) feed the same pixel coords twice, with different input values? how many fragments are behind each other in a pipeline? is there really a big chance of having 2 fragments that have to be drawn at the same place so soon after each other that they can't be processed in parallel..

seriously, i can't think of that situation happening often => in normal situations parallelism is guaranteed, say, 99.9% of the time.. but i'm interested in numbers..

jwatte
01-17-2003, 08:08 AM
Davepermen,

If it's like a texture access, it's not guaranteed to just read the pixel you're currently writing. You may want to read other pixels out of the frame buffer than the one you're going to write. Thus, you have a generalized read-after-write hazard.

If you allow reading of the frame buffer pixel that your fragment contributes to, you still have the problem of multisampling.

If you only allow reading the corresponding framebuffer color value for exactly the fragment you're processing (I don't know if that's possible in all antialiasing applications, but let's pretend it is) then you still have inefficiency, because framebuffers typically live in DRAM-like memories, which do not like being turned around between reading and writing, and may add many cycles of latency each time you change the direction.

Which leads me up to my original speculation that you would have to render in some tiled fashion, keeping the entire tile (be it horizontal, vertical, or square :-) in some close, fast SRAM that has better access characteristics. And, even so, reads from other parts of the framebuffer would still be less efficient (thus, my comment about locality). Although, because you're tiling, you know which parts can be cached, and which parts need hazard resolution, and you'd probably serialize switching between destination tiles to make that not a problem.

Make sense?
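To put the turnaround argument in rough numbers, here is a toy cost model in Python. Everything in it is an assumption for illustration: one cycle per access, an arbitrary penalty of 4 cycles whenever the bus switches between reading and writing, and the hypothetical helpers `bus_cycles`, `per_pixel_ops` and `tiled_ops`. It does not describe any real memory controller.

```python
def bus_cycles(ops, turnaround=4):
    """Count cycles for a sequence of 'R'/'W' accesses.

    Each access costs 1 cycle; every change of direction costs an
    extra `turnaround` cycles (an arbitrary, assumed penalty).
    """
    cycles = 0
    prev = None
    for op in ops:
        if prev is not None and op != prev:
            cycles += turnaround
        cycles += 1
        prev = op
    return cycles


def per_pixel_ops(n):
    """Read-modify-write one pixel at a time: R W R W ..."""
    return "RW" * n


def tiled_ops(n, tile):
    """Read a whole tile into fast local memory, then write it back."""
    ops = []
    for start in range(0, n, tile):
        k = min(tile, n - start)
        ops.append("R" * k + "W" * k)
    return "".join(ops)


naive = bus_cycles(per_pixel_ops(8))    # 16 accesses, 15 direction changes
batched = bus_cycles(tiled_ops(8, 4))   # 16 accesses, 3 direction changes
```

Under these made-up numbers the per-pixel pattern costs 76 cycles and the tiled pattern 28, which is the intuition behind keeping a tile in close SRAM and batching the direction changes.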

davepermen
01-17-2003, 08:23 AM
Originally posted by jwatte:
Davepermen,

If it's like a texture access, it's not guaranteed to just read the pixel you're currently writing. You may want to read other pixels out of the frame buffer than the one you're going to write. Thus, you have a generalized read-after-write hazard.

If you allow reading of the frame buffer pixel that your fragment contributes to, you still have the problem of multisampling.

If you only allow reading the corresponding framebuffer color value for exactly the fragment you're processing (I don't know if that's possible in all antialiasing applications, but let's pretend it is) then you still have inefficiency, because framebuffers typically live in DRAM-like memories, which do not like being turned around between reading and writing, and may add many cycles of latency each time you change the direction.

Which leads me up to my original speculation that you would have to render in some tiled fashion, keeping the entire tile (be it horizontal, vertical, or square :-) in some close, fast SRAM that has better access characteristics. And, even so, reads from other parts of the framebuffer would still be less efficient (thus, my comment about locality). Although, because you're tiling, you know which parts can be cached, and which parts need hazard resolution, and you'd probably serialize switching between destination tiles to make that not a problem.

Make sense?

never talked about general texture accessing. i only meant the fragment right under the new fragment, like in blending.

and yes, there is the read-write task to do. just like in blending. you get the idea? it does not have to do much more than blending does, so where's the real problem? yes, blending _IS_ slower, but it does not "kill" performance. it does hurt performance, but the additional features blending gave us are worth that little speed drop. i don't see why reading, writing and all that stuff should give any more of a speed drop. the only difference is that you have to read earlier => if you have several fragments at the same place at the same time, you get a stall. but somehow i just don't think that has to happen very often, with a little bit of clever rendering. at least, i know the radeon does tile-based rendering, in 8x8 or 16x16 chunks or so. the first drivers had some bugs in the fragment programs, which made such tiles visible. and i think, if you render such a tile at once, well, a line of the tile at once in all 8 pipelines, and all 8 pixels behind each other due to the pipelining, you should not get stalls. tile-based is the future anyway, we know that already. hairdryer style cooling isn't cool. it just shows a lack of better designed hw..

tarantula
01-18-2003, 01:46 AM
Now I feel OGL 2.0 (3dlabs proposal) capable hardware will be out much sooner than 5 years. I just hope API-chasing-the-hardware won't hold true again in 2-3 years.

jwatte
01-18-2003, 03:27 PM
Someone up the thread was talking about accessing the frame buffer as a general texture. And, as far as I know, current blend hardware is very specialized to do the read-modify-write cycle very fast. They can't even (currently) fit a floating point unit in that path, so I could only imagine that putting an entire fragment program in the way would make it that much worse.

If all you have accessible is the frame buffer fragment corresponding to the current fragment, then I agree things would become a little easier, although probably not as "easy" as the current blend hardware. But I'd have to ask an ASIC designer to get any kind of certainty about that :-)

frost_add
01-19-2003, 01:08 AM
Something like a "blend shader" approach, that is, feeding the current render target to the pixel shader as a texture, tends to work under D3D. Actually I haven't heard of it _not_ working, although I haven't (yet) needed to use it myself. This is generally discouraged by IHVs, but it works :-). As a side note, a friend of mine (successfully) used a render target simultaneously as a texture to avoid render target switching in a "ping-pong" blur effect, so even some less restrictive things than a "one to one mapping" _may_ work.
I think custom blending should be allowed. This is getting really important with floating point/10:10:10:2 buffers, as they generally don't allow _any_ blending. If the DeltaChrome chip explicitly allows for that (and if that chip is for real :-) ) then I hope it'll become a standard feature exposed as a GL extension/D3D cap bit.

MZ
01-19-2003, 05:48 AM
:eek: :eek: :eek: It works?? Perhaps it is a side effect of using the texture cache, so if you sampled a texel far enough from the current fragment's location, the effect would disappear...
Well, if it works under certain conditions in DX, then there is a possibility it might work in GL as well. The spec of WGL_ARB_render_texture says that (after calling wglBindTexImageARB...):
"any draw operation that is done to the pbuffer prior to wglReleaseTexImageARB being called, produces indeterminant results."
The "indeterminant result" might happen to be the one you desire.

MZ
01-23-2003, 06:39 PM
I've just tested it with WGL_ARB_render_texture, and it really works. Exactly as I said above, if you sample a texel distant enough from the current fragment you get garbage on screen. But simple 1-to-1 read-modify-write is OK. On my GF3 at least.
Sampling within (about) a 1-texel neighbourhood also seems to work, but it's probably far more unsafe.
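One way to see why the 1-to-1 case is the least fragile is a small in-place update model in Python. This is only a sketch with hypothetical names (`in_place`, `offset`, `order`) and says nothing about how a GF3 actually caches or schedules fragments: a buffer is updated in place, each output reads the texel at `offset` from itself, and `order` is the (unknown) order in which the hardware happens to process pixels.

```python
def in_place(buf, offset, order):
    """Update buf in place: each pixel i becomes source[i + offset] + 1,
    where the source is the very buffer being written to."""
    buf = list(buf)  # work on a copy so callers keep their original
    n = len(buf)
    for i in order:
        # With offset == 0 this only ever reads the pixel being replaced.
        # With offset != 0 it may read a neighbour that was already
        # overwritten, depending on the processing order.
        buf[i] = buf[(i + offset) % n] + 1
    return buf


src = [0, 10, 20, 30]
forward = [0, 1, 2, 3]
backward = [3, 2, 1, 0]

same_pixel_fwd = in_place(src, 0, forward)
same_pixel_bwd = in_place(src, 0, backward)
neighbour_fwd = in_place(src, 1, forward)
neighbour_bwd = in_place(src, 1, backward)
```

With `offset=0` both processing orders give the same result, because each pixel reads only its own old value. With `offset=1` the two orders disagree, which is the kind of order dependence that a separate source and destination buffer (ping-pong) avoids.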

Coriolis
01-23-2003, 07:06 PM
Relying on behavior that is explicitly specified to be undefined is as unsafe as you can get. It's exactly like reading data from a pointer that has a random value, or using stack memory without initializing it. You cannot even say that you will get the same behavior the next time you run your program as you got the last time you ran it, even if you got the behavior you wanted 100 times in a row. You cannot say that your computer won't lock up and reboot when you do it with a red pixel right after a blue pixel, even though it works with every other combination of colors. It may be that it works when the frame starts on everything but clock timers that are a multiple of 256. You cannot be confident that your program will continue to work on the next driver update, much less the next video card you try the program on.

You cannot even use the results of this test to say that it is possible... a lot of things become fast and easy when correctness and stability requirements are ignored.

MZ
01-23-2003, 07:27 PM
I do realize what the consequences of relying on undefined results are. However, in past threads both Matt from Nvidia and Evan from ATI _recommended_ a particular usage scheme of WGL RTT which is explicitly specified as undefined. Both guys assured us it does work on their HW. It was about RTT into a double-buffered pbuffer. In practice, this technique is very closely related to the technique described in this thread. I think the danger of relying on an undefined result in this particular case is not as big a problem as you see it.

Coriolis
01-23-2003, 07:43 PM
My products need to run on a wide variety of hardware and drivers, both current and future. Consequently, I always run away as fast as I can from any behavior that is explicitly undefined -- even if nVidia and ATI both assure me that the behavior works with their current cards.

As you may have noticed, confidence in the correctness of my code is very, very important to me :D

cass
01-23-2003, 11:22 PM
Originally posted by MZ:
I do realize what the consequences of relying on undefined results are. However, in past threads both Matt from Nvidia and Evan from ATI _recommended_ a particular usage scheme of WGL RTT which is explicitly specified as undefined. Both guys assured us it does work on their HW. It was about RTT into a double-buffered pbuffer. In practice, this technique is very closely related to the technique described in this thread. I think the danger of relying on an undefined result in this particular case is not as big a problem as you see it.


MZ,

You're getting into deeper waters with this kind of thing though. The way our hardware works (both NV and ATI apparently) we can promise defined results for what the spec leaves undefined in the case of double-buffering.

Texturing from the active render target is another thing entirely. Still, if you can live with "at your own risk", you may be able to do some nifty stuff. (Not that I'm advocating it. :) )

Thanks -
Cass



Humus
01-26-2003, 12:13 AM
Originally posted by jwatte:
Someone up the thread was talking about accessing the frame buffer as a general texture. And, as far as I know, current blend hardware is very specialized to do the read-modify-write cycle very fast. They can't even (currently) fit a floating point unit in that path, so I could only imagine that putting an entire fragment program in the way would make it that much worse.

If all you have accessible is the frame buffer fragment corresponding to the current fragment, then I agree things would become a little easier, although probably not as "easy" as the current blend hardware. But I'd have to ask an ASIC designer to get any kind of certainty about that :-)

The reason there's no blending on floating point render targets is just that blending assumes certain operations are cheap, such as 1 - x, which is true for integers but not for floats. For the very same reason you no longer have pre-scale/pre-bias/post-scale etc. in ARB_fragment_program as you had in ATI_fragment_shader and NV_register_combiners. So implementing blending with normal glBlendFunc() calls would be expensive in hardware. But if you can access the backbuffer in the shader, you can use the same hardware for blending. The only real difference is the time at which we access the framebuffer, and if we only access it as the last thing we do in the shader, there shouldn't be a whole lot of difference compared to a normal blend.
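A small Python sketch of the "1 - x is cheap for integers" point, under the assumption of 8-bit normalized channels (0..255 standing for 0.0..1.0). The function names are made up for illustration. For such values the complement is a plain bit inversion, whereas a float target needs a real floating point subtract, which is what the shader ALUs already provide.

```python
def one_minus_unorm8(x: int) -> int:
    """1 - x for an 8-bit normalized value: 255 - x, i.e. invert the bits."""
    return x ^ 0xFF


def blend_unorm8(src: int, dst: int, alpha: int) -> int:
    """Fixed-function style blend on 8-bit channels:
    src*alpha + dst*(1 - alpha), rounded back to 0..255."""
    return (src * alpha + dst * one_minus_unorm8(alpha) + 127) // 255


def blend_float(src: float, dst: float, alpha: float) -> float:
    """The same blend written as ordinary float math, as it could be done
    as the last step of a shader that can read the framebuffer value."""
    return src * alpha + dst * (1.0 - alpha)


# The bit trick matches 255 - x for every possible 8-bit value.
complement_ok = all(one_minus_unorm8(x) == 255 - x for x in range(256))
```

`blend_float` also works unchanged for values outside 0..1, such as HDR colors, which is the case the fixed-function integer path cannot cover.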