Histo-visibility

glHistogram is in the spec, but it doesn’t seem well supported on the consumer-level cards. I’m not sure why that is, but here’s why I think I need it:

To do whole-frame occlusion testing, in the past, I’ve rendered objects using colors as tags (where color bits = object ID). Then, I read back the FB and run the following algorithm:

for each pixel[i] in FB, table[ pixel[i] ]++;

Assuming the table was initialized to zeros, it gives a count of how many pixels each object is occupying, which is zero for invisible objects. This is the basic packed histogram. Nothing special.
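For concreteness, here’s a minimal host-side sketch of that readback and summation, assuming the object IDs are packed into the 24 RGB bits; MAX_OBJECTS and the width/height variables are the app’s own (hypothetical) constants:

  #include <stdlib.h>
  #include <GL/gl.h>

  #define MAX_OBJECTS 1024   /* assumed app limit on unique color tags */

  unsigned char *buf = (unsigned char *) malloc(width * height * 4);
  unsigned int table[MAX_OBJECTS] = { 0 };

  glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, buf);

  for (int i = 0; i < width * height; ++i) {
      unsigned int id = (buf[i*4+0] << 16) | (buf[i*4+1] << 8) | buf[i*4+2];
      if (id < MAX_OBJECTS)
          table[id]++;        /* objects whose count stays 0 are invisible */
  }
  free(buf);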

The hardware for the above code can’t possibly be that complex. And assuming the number of unique colors used is low, it would save a huge amount of bandwidth on the readback if the hardware did the summation. Better yet if it could be zero-coded on the fly.

Why render the whole frame first, just to figure out what’s visible? You can use coarser LODs (esp. for alpha objects), no texture, no shaders, no blending or AA (a requirement for that pass, naturally) and it’s much faster. Plus, if you can re-use the z values for the next pass, you can save some additional z writes.
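As a rough sketch (not a prescription), the tag pass might be set up something like this; draw_coarse_lod() is a hypothetical stand-in for whatever cheap geometry you use:

  /* state for the color-tag pre-pass */
  glDisable(GL_LIGHTING);
  glDisable(GL_TEXTURE_2D);
  glDisable(GL_BLEND);
  glDisable(GL_DITHER);      /* dithering would corrupt the ID colors */
  glShadeModel(GL_FLAT);

  for (unsigned int i = 0; i < numObjects; ++i) {
      /* pack the object ID into the RGB color */
      glColor3ub((GLubyte)((i >> 16) & 0xFF), (GLubyte)((i >> 8) & 0xFF), (GLubyte)(i & 0xFF));
      draw_coarse_lod(&objects[i]);   /* hypothetical helper: coarse LOD, no shaders */
  }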

Why not the Occlusion extensions? Too async, too coarse, and it still requires depth sorting for best results, unfortunately (results are true at time of object render, not at glFinish). Per pixel visibility gives a better class of information, IMO.

So is it that histograms are expensive to implement or is that no one’s been asking for it, apart from the image processing folks?

Avi

Why not the Occlusion extensions? Too async, too coarse, and it still requires depth sorting for best results, unfortunately (results are true at time of object render, not at glFinish). Per pixel visibility gives a better class of information, IMO.

At the expense of performance. Of course, since the entire point of doing visibility checks is to improve performance, that makes your method less useful overall.

As for occlusion culling being “too async”, you can never be too asynchronous, especially when performance on an async device (like a GPU) is involved. Occlusion culling is, for all intents and purposes, the right way to handle visibility tests. The asynchronicity allows the user to keep doing other things (rather than waiting for the occlusion query to complete).

Indeed, the only real drawback to the technique is that it “requires” front-to-back rendering. Of course, it only requires it to the extent that you want accurate occlusion culling.

Besides, the hardware would have to pass back to you a table of a rather large size. That means that you’re talking about reading back a significant amount of data; therefore, it will be slow. The occlusion query method requires reading back a 24-bit integer value. That’s something that can be lived with, by comparison.

glHistogram is in the spec, but it doesn’t seem well supported on the consumer-level cards.

glHistogram is part of the ARB_imaging option to the GL 1.2 specs. AFAIK, no (consumer) video card manufacturer supports this, apart from NVidia. Even then, it’s all done via the software rasterizer.

You’re much better off using occlusion queries.

Korval,

First note: I’ve been doing various forms of occlusion culling for 10 years (including using the NV extension more recently), so this is not an uninformed opinion. I do know what I need, even if I may be the only one who thinks about the problem this way.

Second, your point about reading only 24 bits back doesn’t make a whole lot of sense. That’s 24 (or 32?) bits per occlusion query. In the case of using tags, you set the number of queries by picking the number of unique colors you apply (i.e., one unique color per equivalent NV_Occlusion test).

In either case, the amount of data is nearly identical, except in the FB-color-tag approach, it’s aggregated into one contiguous chunk, which as far as I know is more efficient to read back since there’s always some overhead with AGP read back (more than 24 bits worth).

So if I want to test 1,000 objects per frame, that’s 24,000 bits packed for the FB approach and 24 * 1,000 bits scattered for the NV extension. The FB approach wins on throughput and it gives you true results without requiring depth sorting. The z-buffer, after all, is there to do the depth sorting in HW.

On synchronicity vs. asynchronicity, I’m revising my comments to note that my use of the word “synchronous” probably implies something to most of you I didn’t intend. A better word is “synchronized,” where I know I can get frame-accurate results in a timely fashion.

Granted, I don’t generally want to be blocking on GL calls. It would be great to have this vis-table filled in and sent back to the host asynchronously. But my objection to the so-called “asynchronous” model is that it is unbounded and generally difficult to even roughly “synchronize” the app stages with the drawing, even if they’re independent. Basically, I need to know I can get vis results this frame by a certain time, even if I have the app go off and do some other work while it waits. The amount of time is not critical (as long as it’s counted in low milliseconds), but having it be predictable or at least fine-grained pollable is important to synchronizing. But that’s another discussion.

Here’s the scenario that currently happens with the NV_occlusion approach: you render an object speculatively, it later comes back as “invisible,” you wait X frames and try again. But with that wait, you may wait too long and risk having the object pop in. If the wait is too short, you haven’t saved much since you’re always testing everything. Of course, there are more sophisticated approaches using invisible bounding boxes, etc… but it gets very complicated very quickly.

Aside: one thing that would vastly improve the NV_occlusion approach, IMO, is the notion of a dependent draw. If I could render the solid bounding box of an object with Z and color writes off and automatically (HW-side) tie that visibility result to the subsequent render of the actual object that frame, then we’re talking true asynchronicity. The API wouldn’t even be hard to imagine, piggybacking on the NV_occlusion or FB_occlusion approaches.
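Purely as an illustration of what I mean (none of the conditional entry points below exist in any shipping API; the draw helpers are placeholders, and only the NV query calls are real):

  GLuint vis;
  glGenOcclusionQueriesNV(1, &vis);

  /* render the stand-in with color and depth writes off */
  glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
  glDepthMask(GL_FALSE);
  glBeginOcclusionQueryNV(vis);
  draw_bounding_box(obj);               /* hypothetical helper */
  glEndOcclusionQueryNV();
  glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
  glDepthMask(GL_TRUE);

  /* the hypothetical part: the HW skips everything up to the matching
     end token if the query recorded zero visible pixels */
  glBeginConditionalDrawHYPOTHETICAL(vis);
  draw_full_object(obj);                /* hypothetical helper */
  glEndConditionalDrawHYPOTHETICAL();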

But short of something like that, the NV_Occlusion test isn’t the final or correct answer, IMO.

Avi

[This message has been edited by Cyranose (edited 07-22-2003).]

Here’s the scenario that currently happens with the NV_occlusion approach

That’s certainly one approach. However, it is not the approach. And, considering that it is flawed (as you point out), it is not a good approach.

A better approach is this.

You sort your objects front-to-back. You render the front-most object. Then, you render an occlusion query for the 3rd object (which, btw, is not the geometry for the 3rd object), and draw the 2nd. If the query comes back as visible, you send a query for the 4th and render the 3rd. Otherwise, you send a query for the 5th and render the 4th. And so on.
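In rough code, the interleaving might look something like this (sorted[] is assumed to hold the objects front-to-back, draw_bbox() and draw_mesh() are placeholder helpers, and the exact advance rule above is simplified):

  GLuint q[2];
  glGenOcclusionQueriesNV(2, q);

  int cur = 0;
  int drawCurrent = 1;                     /* the front-most object is always drawn */

  for (int i = 0; i < numSorted; ++i) {
      if (i + 1 < numSorted) {
          /* issue the cheap bounding-box query for the next object... */
          glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
          glDepthMask(GL_FALSE);
          glBeginOcclusionQueryNV(q[cur]);
          draw_bbox(sorted[i + 1]);
          glEndOcclusionQueryNV();
          glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
          glDepthMask(GL_TRUE);
      }

      /* ...and draw the current object for real while that query is in flight */
      if (drawCurrent)
          draw_mesh(sorted[i]);

      if (i + 1 < numSorted) {
          GLuint pixels;
          glGetOcclusionQueryuivNV(q[cur], GL_PIXEL_COUNT_NV, &pixels);
          drawCurrent = (pixels > 0);
          cur ^= 1;
      }
  }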

Of course, it doesn’t guarantee that you aren’t rendering something that’s not being seen. However, it does prevent the rendering of some geometry that is not seen, so it does save time. And, it doesn’t have the popping problem that the approach you described has.

There are even better algorithms that can take advantage of the notion that an object that was visible last frame is more likely to be visible this frame than one that wasn’t visible last frame. The point of pairing each query with a mesh rendering is so that the GPU never stops doing something, even if it’s drawing something behind something else. The mesh rendering could be from the list of objects more likely to be visible, and queries are sent for objects less likely to be visible.

Because rendering an occlusion query is probably not more expensive than not (it probably just resets the pixel-drawn count register before rendering), you could render each separate object as a query, so that you can get a pixel count. An object with a low pixel count (relative to its distance from the screen) can be considered less visible, and is therefore more likely to be not visible next frame.

The algorithm doesn’t have to be perfect. As long as it mitigates some of the work and provides a performance boost, it’s fine.

one thing that would vastly improve the NV_occlusion approach, IMO, is the notion of a dependent draw.

In order to make that really useful, the “draw” operation has to be relatively flexible. That is, you may be talking about a “draw” that involves numerous state changes (a multi-shader mesh) and glDraw* calls. You may even need to bind different VBOs and so forth.

The only way to do that is to bind it to the Display List API.

But, in any case, why do you need it to be automatic? Can’t you just test the query yourself, and if it succeeded, then you draw the appropriate geometry?

Also, I’m not sure that this is a good idea, from a driver implementation standpoint. In order to make it work like normal GL calls, you would have to make it so that the “draw” operation happens immediately after the occlusion query operation. Since most hardware operates out of a command list FIFO, and most drawing commands just drop command tokens into this FIFO for the hardware to read, the hardware itself would have to know that, if the occlusion query result, when passed through some arbitrary comparison function, is true, then it should immediately perform some other kind of operation, rather than going on to what is next in the FIFO. Of course, because it’s pipelined (and very heavily), the next tokens in the FIFO have already been read and partially processed. So, it has to kill the entire GPU pipeline and start again. Basically, the GPU just mis-predicted a branch instruction. This induces a stall in the pipeline, assuming the hardware can stop itself at all.

[This message has been edited by Korval (edited 07-22-2003).]

Korval, the approach you describe sounds fairly synchronous. How do you guarantee the vis result comes back before you iterate to the next object? Plus, it loses out on state sorting and requires the CPU to be paying constant attention to the draw result instead of blasting VBO and state at full speed.

I’d love to hear how much of a gain it really is; perhaps it only pays off for very complex geometry (not that the NV_occlusion approach I mentioned was any better, of course).

On the dependent rendering, I was imagining something more along the lines of a “begin-if” token that evaluated a previously requested visibility result and skipped along to an “end-if” token if the visibility failed. Granted, it’s not ideal from a FIFO point of view, but my point is that it is truly asynchronous, as opposed to what we have now. But you’re totally right about the branch mis-prediction cost. I’d rather not push the commands and bind state if I can ever avoid it.

Which brings me back to the original question of having whole-frame visibility info (ala FB packed histo-visibility) before I go rendering with various expensive states turned on. I can render in optimal state order, knowing which objects to skip for that frame only. If deferred shading is a good idea, then this is even better, IMO, since it also avoids unnecessary geometric detail.

I believe it’s worth the small amount of time it would take to do the packed FB read back, if I can avoid doing fine-grained culling, depth sorting, and tightly-coupled draw, test, draw visibility loops like you describe.

In the past, I’ve simulated the results by using the technique for pre-visibility and then replaying the scene using static visibility lists (e.g., per frame, or per position). The only thing I couldn’t do was do the FB readback fast enough for interactive use. With histogram packing, it would then be possible.

Avi

Korval, the approach you describe sounds fairly synchronous.

Well, it has to be synchronous. As you point out, in order to use the query, you have to wait for it to complete. If you separate your rendering into a different thread, then you can transparently use that extra time. Or, if you don’t want the complexity, you can spend that time preparing for the rendering of the next object (running your animation system, building shader state-parameters, etc).

Originally posted by Korval:
Well, it has to be synchronous. As you point out, in order to use the query, you have to wait for it to complete.

Granted, but the first thing you told me was “async” was the only way to go!

So NV_Occlusion really seems to force you to take a few steps back in performance (no state sorting, synchronous behavior, etc…) in order to gain unknown performance.

Okay, here’s my final pitch organized a little more coherently:

NV_OcclusionQuery and FB_OcclusionQuery (for lack of a better term) can be made almost identical in the following ways:

  1. The HW can fill in a host-side datastructure asynchronously, which the user can poll for completion.

  2. Both give visible pixel counts per test, where the “test zone” is defined by the user either issuing occlusion tokens or assigning unique color tags to objects in advance.

  3. The number of bytes transmitted over the bus is virtually identical, dependent entirely on the number of visibility tests requested. (however, there is likely some additional overhead per transmission).

The techniques differ in that:

1a. NV_Occlusion requires objects to be drawn in depth-sorted order for accurate results.

1b. FB_Occlusion gives frame-accurate results without any depth-sorting requirement.

2a. NV_Occlusion creates N separate test result packets.

2b. FB_Occlusion uses a single array of N elements (which can be zero coded during transmission if it helps).

3a. NV_Occlusion doesn’t require any FB reads; however, since it uses a “visible pixel counter,” it also must wait for objects within a single test zone to finish rasterizing. It can theoretically return test results while later objects are being drawn, but there is no guarantee of return times.

3b. FB_Occlusion requires initial rasterization to finish plus a full-framebuffer iteration to get results. Since this data is vital to the next stage, the app can more reasonably “wait” for the data to arrive via the bus.

4a. NV_Occlusion is theoretically free, except for the tokens and API (and, of course, the large amount of CPU overhead in a tightly coupled render-test-render loop)

4b. FB_Occlusion blocks further rasterization to a FB while it is summed (at, say, 1 clock per pixel and 400 MHz, a 1k x 1k FB would take 1M clocks, or 1/400th of a second). However, the results are much better, they’re returned all at once, and the API overhead is fixed and low.

5a. NV_Occlusion doesn’t require any special rendering state, though you probably do want to render a simpler stand-in for tests (ideally, not textured, not shaded, not blended, no write, etc…) and then later render the real object, if it is visible. However, the number of large state changes goes up by 2*N, where N is the number of vis tests.

5b. FB_Occlusion requires a first pass using color or other tag mechanism and some special state (no AA, no blending, no texture, no lighting for starters). You’d also render a simpler stand-in for objects before rendering the full object in a second pass. However, if pass-1 renders all stand-ins at once using very simple state and NO internal state changes, it is arguably faster to complete the vis tests (more than making up for that 1/400th second FB sum).

How’s that?

One way to test the idea, I imagine, is with a fragment shader, where you render a whole-frame poly with this special shader. The shader would need to increment some target memory table[pixel-color]++ and then the host could do a slower read-back of that target memory.

However, I’m stuck on step 2, incrementing a target table based on a read of the FB dest color. Any pointers are appreciated.

Avi

[This message has been edited by Cyranose (edited 07-23-2003).]

Cyranose, you forget one thing. With the NV approach you can use just bounding boxes for visibility testing, but with the FB approach you have to get at least the silhouettes right for the algorithm to work. Depending on your geometry (trees…) this may make a huge difference.

However, I do agree that the FB approach would, if supported, be a good way to test visibility, especially when you use complex shaders.

-Ilkka

Agreed. Testing a quad (or 6 triangles/1 fan for a 3D box) without requiring writes to the FB is useful for testing lots of small objects at the lowest possible LOD. I suppose you could create somewhat similar functionality by using the stencil instead of the colors for such objects – turn off Z and color writes too – but it still wouldn’t be as nice as getting an async result for each object like NV_occlusion does in this case.

It’s worth mentioning that BBoxes are not as useful if the objects are large (screen-wise), since you still pay most of the initial fill cost. In fact, for large objects, you may want to use silhouettes or geometry instead of BBoxes to save fill and test correctly for Z intersections.

The z-intersection problem is always a risk (both false positive and false negative) with stand-ins unless you depth-sort more finely and special case z-overlaps.

By the way, for anyone who was interested or thought the shader emulation idea was totally crazy, I came up with an approach that might at least work, though it’s still much more expensive than a simple full-frame readback… The idea is to use either a texture shader or fragment shader to post-filter the colorized FB so that, in each iteration, one reference color is selected: any output texel of that color is set to white while every other texel is set to black. Using auto-mipmap generation (may require a 2nd pass and a pbuffer), the 1x1 texel version will represent the percent coverage (red: 0-255 = 0-100%) for that one reference color.

It’s not an exact pixel count, but the rough percentage coverage is good enough for me. The big problem, of course, is it would need N passes, where N=number of colors/vis tests. Probably 2*N if done in a fragment shader to accumulate the results. But I’d hope reading back the 1x1 mip levels would work well enough to avoid writing the coverage counts to a packed FB for readback.

And yes, you could probably do three colors (maybe four) at a time, one in each channel of the filtered images.
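To sketch just the readback end of this (assuming the filtered black/white image ends up in a 1024x1024 texture with automatic mipmap generation via SGIS_generate_mipmap; maskTex is a placeholder name):

  glBindTexture(GL_TEXTURE_2D, maskTex);
  glTexParameteri(GL_TEXTURE_2D, GL_GENERATE_MIPMAP_SGIS, GL_TRUE);

  /* ... run the filter pass and copy it into the texture, e.g. with
     glCopyTexSubImage2D, so the mip chain gets regenerated ... */

  GLubyte top[4];
  glGetTexImage(GL_TEXTURE_2D, 10, GL_RGBA, GL_UNSIGNED_BYTE, top);  /* level 10 is 1x1 for 1024x1024 */
  float coverage = top[0] / 255.0f;   /* red channel ~ fraction of the frame covered by this tag */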

Anyway, it’s not practical as far as I can tell. But it’s interesting to think about, IMO.

Avi

[This message has been edited by Cyranose (edited 07-30-2003).]

[This message has been edited by Cyranose (edited 08-04-2003).]