OT: Separate render & cull threads

Just a quick question.
Performer uses two separate ‘threads’ for rendering and culling, from which it gains a great deal of parallelism. However, I’m a little puzzled as to how to implement this myself in my own renderer (currently it’s a synchronous cull->render loop).
A render pass relies on up-to-date cull information, otherwise it may skip something that should be visible… whereas if the cull runs separately, that information may be stale: the culler might not have visited a particular object with a particular frustum before the renderer reaches it.
The only solution I can think of is to make the culler’s frustum slightly bigger than the actual frustum to minimise artefacts, but this seems like a hack, will result in more geometry being sent down the pipe than is necessary - and still won’t guarantee that false negatives never happen.

Maybe a good choice would be to implement 2 threads, where the render thread draws whatever has been found visible this frame, and if there’s nothing more to render it just waits for the culling thread to get on with its job…

Pipelining! Don’t let both threads go at your data simultaneously, but arrange them in a pipelined fashion by giving each thread an input queue where you can store pending “jobs” that the thread needs to perform.

For your particular example, the cull thread goes through your data and gives every visible object to the draw thread. The draw thread will process it when it’s ready to do so.
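
Off the top of my head, the queue itself could be as simple as this (an untested sketch; “Renderable” is a made-up placeholder for whatever your scene graph hands from cull to draw):

[code]
#include <condition_variable>
#include <mutex>
#include <queue>

struct Renderable; // placeholder for your per-object draw data

class JobQueue
{
public:
    // Called by the cull thread for every visible object it finds.
    void push(Renderable* job)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_jobs.push(job);
        m_cond.notify_one();
    }

    // Called by the draw thread; blocks until a job is available.
    Renderable* pop()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cond.wait(lock, [this] { return !m_jobs.empty(); });
        Renderable* job = m_jobs.front();
        m_jobs.pop();
        return job;
    }

private:
    std::queue<Renderable*>  m_jobs;
    std::mutex               m_mutex;
    std::condition_variable  m_cond;
};
[/code]

You’d also want some kind of end-of-frame marker in the queue (a null job, say) so the draw thread knows when to swap buffers.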

That said, you should really ask yourself if this is worth it. If frustum culling is that much of a CPU hog that you would consider parallelizing it, maybe you’re not doing it right? Hierarchical frustum culling is not very expensive if you’re a little clever about how you do it.
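
By “clever” I mostly mean early outs: test each node’s bounding volume once, skip the whole subtree if it’s fully outside, and stop testing against the frustum altogether once a node is fully inside. A rough sketch, with made-up node/plane types (plane normals pointing into the frustum):

[code]
#include <vector>

struct Plane  { float nx, ny, nz, d; };  // normal points inside
struct Sphere { float x, y, z, radius; };

struct Node
{
    Sphere             bound;
    std::vector<Node*> children;
    // ... geometry, etc.
};

enum Result { OUTSIDE, INTERSECTS, INSIDE };

// Signed distance from the sphere centre to the plane.
inline float distance(const Plane& p, const Sphere& s)
{
    return p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
}

Result testSphere(const Plane frustum[6], const Sphere& s)
{
    Result result = INSIDE;
    for (int i = 0; i < 6; ++i)
    {
        float dist = distance(frustum[i], s);
        if (dist < -s.radius) return OUTSIDE;       // fully behind one plane
        if (dist <  s.radius) result = INTERSECTS;  // straddles this plane
    }
    return result;
}

void cull(const Plane frustum[6], Node* node, bool testNode = true)
{
    Result r = testNode ? testSphere(frustum, node->bound) : INSIDE;
    if (r == OUTSIDE)
        return;                      // early out: the whole subtree is gone
    // ...hand the node's geometry to the draw side here...
    for (Node* child : node->children)
        cull(frustum, child, r == INTERSECTS); // INSIDE: no more tests needed
}
[/code]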

– Tom

Right, pipelining - although I’m not sure if Performer uses job queues… maybe it does. I suppose it must do.
Cheers, Tom.
Oh, the culling example is used because that’s how Performer divides its work - nothing to do with the efficiency of culling algorithms, just the fact that it’s a job that can be run in parallel with the render thread, I suppose - which it is, so long as the sequencing is handled correctly.

Originally posted by knackered:
Oh, the culling example is used because that’s how Performer divides its work - nothing to do with the efficiency of culling algorithms, just the fact that it’s a job that can be run in parallel with the render thread, I suppose - which it is, so long as the sequencing is handled correctly.

Performer also has a number of MP models, including overlapped and fully pipelined CULL/DRAW. All are fully MP-safe and have extensive buffering/queueing under the hood.

However, the key thing to look at is whether you have more than one CPU, whether you’re waiting on the GPU, and whether you want to run the cull and draw at the same frequency.

If you have one CPU, you’re most likely not saving much by going parallel and may in fact be losing on the context switches and the overhead of MP queueing.

However, if you have big vertex buffers, complex shaders, and so on, and your app spends most of its time waiting on one buffer to finish rendering before you bind the next, parallelism might help recover those CPU cycles. Of course, manual interleaving has worked for years too and the Windows scheduler does indeed suck, so caveat emptor.

[Last time I looked, Performer on Windows only ran in single-process mode. However, Intrinsic’s Alchemy (same people, different company) did a better job of handling MP, IMO.]

If you want to try and lock your draw func to, say, 60 Hz and let the cull run slower, then that might also be a benefit, but it’s very tricky and yes, it requires bloating the frustum and updating the final modelview in the draw func. If cull and draw run on the same data and at the same framerate, no frustum bloating should be needed. The Aladdin VR Ride did that, but we had many CPUs.

[edit: modify the above to state that if you don’t mind the 1-frame latency you get with fully parallel CULL/DRAW, then no bloating is needed. The Aladdin ride, with an HMD, did need to bloat, both because of the different cull/draw rates and because of the latency]
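
For what it’s worth, the bloating itself is trivial if your plane normals point into the frustum - you just push each plane outward before handing the frustum to the cull. The hard part is picking the margin, which has to cover how far the eye can move or turn during the cull-to-draw latency. A sketch, with the margin left as a made-up placeholder:

[code]
struct Plane { float nx, ny, nz, d; }; // normal points into the frustum

// Push each plane of the cull frustum outward by 'margin' world units,
// so objects that enter the real frustum before the draw catches up
// are still treated as visible. Deriving 'margin' from eye speed, turn
// rate and the cull->draw latency is left to the reader.
void bloatFrustum(Plane frustum[6], float margin)
{
    for (int i = 0; i < 6; ++i)
        frustum[i].d += margin; // +d moves an inward-facing plane outward
}
[/code]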

And if you have 2 or more CPUs, then the benefits of parallelism should be pretty clear.

Avi

Originally posted by Tom Nuydens:
Pipelining! Don’t let both threads go at your data simultaneously, but arrange them in a pipelined fashion by giving each thread an input queue where you can store pending “jobs” that the thread needs to perform.

For your particular example, the cull thread goes through your data and gives every visible object to the draw thread. The draw thread will process it when it’s ready to do so.

That said, you should really ask yourself if this is worth it. If frustum culling is that much of a CPU hog that you would consider parallelizing it, maybe you’re not doing it right? Hierarchical frustum culling is not very expensive if you’re a little clever about how you do it.

– Tom

Queues are usually the right way to go for multi-threading coordination, but not always. In this case, there are two choices: the cull can queue up its results and the draw then starts consuming the queue, which generally means the draw will be a frame later than the cull. Or the two can be overlapped (another Performer mode), where the draw consumes the cull queue immediately, with less latency but other problems.

The main disadvantage of the overlapped mode, as I recall, is that immediate consumption makes state sorting hard. Cull output order is not the same as ideal draw order in most cases.
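
For the fully pipelined (one-frame-latency) flavor, the buffering can be as simple as two cull-result lists that the two stages ping-pong between - and since the draw then sees a whole frame’s worth of cull output up front, it’s free to state-sort it before issuing anything. Roughly (a sketch only; the frame-boundary handshake that keeps each thread out of the other’s buffer is omitted):

[code]
#include <utility>
#include <vector>

struct Renderable; // whatever cull hands to draw

// While draw consumes frame N's list, cull fills frame N+1's.
std::vector<Renderable*> cullResults[2];
int cullIndex = 0; // buffer the cull thread writes into
int drawIndex = 1; // buffer the draw thread reads from

// Called once per frame, after BOTH threads have reached the frame
// boundary (i.e. behind a barrier of some sort).
void swapCullBuffers()
{
    std::swap(cullIndex, drawIndex);
    cullResults[cullIndex].clear(); // ready for the next cull pass
}
[/code]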

On the benefits of asynchronicity, one other thing that should be mentioned is render time. In the serial case, you may take 2 ms to cull and 8 ms to draw (assume a 16 ms total frame plus some other CPU work: AI, sound, etc…), but then drawing only gets half of the frame time. In fully parallel mode, assuming the CPU is often waiting on the GPU to finish the next batch, the other processes can be interleaved and you may get the full 16 ms of wall time for drawing, which can make a difference. In the most common case, a sync’d swapbuffers call will spend some time just waiting, which is time well spent on other (parallel) tasks, assuming the driver is smart enough not to do a tight-loop CPU spin-lock on vsync…

[Edit: Bearing this in mind, one trick to speed up the non-parallel case is to reverse things a bit: draw frame N, cull frame N+1, and then wait on swapbuffers.]
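
Concretely, that reordering is just this (function names made up for illustration), so the next frame’s cull overlaps the GPU finishing the draw calls you just issued, instead of the CPU idling in swapbuffers:

[code]
void cullFrame(int frame);  // hypothetical: CPU-side cull for frame N
void drawFrame(int frame);  // hypothetical: submit frame N's visible set
void swapBuffers();         // blocks on vsync / GPU completion

void frameLoop()
{
    cullFrame(0); // prime the pipe
    for (int frame = 0; ; ++frame)
    {
        drawFrame(frame);     // submit frame N (culled last iteration)
        cullFrame(frame + 1); // cull frame N+1 while the GPU works
        swapBuffers();        // only now block on the swap
    }
}
[/code]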

Avi

There’s an article by John Rohlf and James Helman called ‘IRIS Performer: A High Performance Multiprocessing Toolkit for Real-Time 3D Graphics’ which describes exactly how Performer is designed and how it handles several threads.
I think most of your questions are answered in that article (it’s available from the ACM).

Anders