Originally posted by Korval:
Well, it has to be synchronous. As you point out, in order to use the query, you have to wait for it to complete.
Granted, but the first thing you told me was “async” was the only way to go!
So NV_Occlusion really seems to force you to take a few steps back in performance (no state sorting, synchronous behavior, etc…) in order to gain unknown performance.
Okay, here’s my final pitch organized a little more coherently:
NV_OcclusionQuery and FB_OcclusionQuery (for lack of a better term) can be made almost identical in the following ways:
- The HW can fill in a host-side data structure asynchronously, which the user can poll for completion.
- Both give visible-pixel counts per test, where the “test zone” is defined by the user either issuing occlusion tokens or assigning unique color tags to objects in advance.
- The number of bytes transmitted over the bus is virtually identical, dependent entirely on the number of visibility tests requested (though there is likely some additional overhead per transmission).
The techniques differ in that:
1a. NV_Occlusion requires objects to be drawn in depth-sorted order for accurate results.
1b. FB_Occlusion gives frame-accurate results without any depth-sorting requirement.
2a. NV_Occlusion creates N separate test result packets.
2b. FB_Occlusion uses a single array of N elements (which can be zero coded during transmission if it helps).
3a. NV_Occlusion doesn’t require any FB reads; however, since it uses a “visible pixel counter,” it also must wait for objects within a single test zone to finish rasterizing. It can theoretically return test results while later objects are being drawn, but there is no guarantee of return times.
3b. FB_Occlusion requires initial rasterization to finish plus a full-framebuffer iteration to get results. Since this data is vital to the next stage, the app can more reasonably “wait” for the data to arrive via the bus.
4a. NV_Occlusion is theoretically free, except for the tokens and API (and, of course, the large amount of CPU overhead in a tightly coupled render-test-render loop)
4b. FB_Occlusion blocks further rasterization to a FB while it is summed (at, say, 1 clock per pixel and a 400 MHz clock, a 1k x 1k FB would take 1M clocks, or about 1/400th of a second). However, the results are much better, they’re returned all at once, and the API overhead is fixed and low.
5a. NV_Occlusion doesn’t require any special rendering state, though you probably do want to render a simpler stand-in for tests (ideally, not textured, not shaded, not blended, no write, etc…) and then later render the real object, if it is visible. However, the number of large state changes goes up by 2*N, where N is the number of vis tests.
5b. FB_Occlusion requires a first pass using color or other tag mechanism and some special state (no AA, no blending, no texture, no lighting for starters). You’d also render a simpler stand-in for objects before rendering the full object in a second pass. However, if pass-1 renders all stand-ins at once using very simple state and NO internal state changes, it is arguably faster to complete the vis tests (more than making up for that 1/400th second FB sum).
How’s that?
One way to test the idea, I imagine, is with a fragment shader: render a whole-frame poly with a special shader that increments some target memory (table[pixel-color]++), and then have the host do a slower read-back of that target memory.
However, I’m stuck on step 2, incrementing a target table based on a read of the FB dest color. Any pointers are appreciated.
Avi
[This message has been edited by Cyranose (edited 07-23-2003).]