GPGPU, SwapBuffers and multithreaded rendering

Hi!

I have 2 threads: the first (off-screen rendering thread, OSRT) renders some data to a texture, and the other (main rendering thread, MRT) renders something using this texture to the default framebuffer. The contexts are shared using wglShareLists (OSRT’s RC is created with the same DC as MRT’s RC, I hope that’s not a problem). OSRT never touches the default framebuffer; it renders to a texture with an FBO.

I want MRT to run at realtime framerates. It’s not a problem if OSRT sometimes updates the texture much more slowly. To prevent rendering to and reading from the same texture at the same time, and to prevent any stalling of MRT, I use 3 textures, similar to triple buffering. OSRT renders to t3 and, when it finishes, swaps t2 and t3. MRT uses t1 for its rendering and, when it finishes, swaps t1 and t2. In my case swapping t1 and t2 simply means that the t1 and t2 GL names are swapped. The swaps are protected with a critical section.
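Roughly, the swap part looks like this (a simplified sketch of the idea, not my exact code; the names are illustrative):

#include <windows.h>
#include <GL/gl.h>

GLuint t1, t2, t3;          // t1: read by MRT, t2: latest finished result, t3: OSRT render target
CRITICAL_SECTION cs;        // initialized once with InitializeCriticalSection(&cs)

// OSRT calls this after it has finished rendering into t3.
void PublishOffscreenResult()
{
    EnterCriticalSection(&cs);
    GLuint tmp = t2; t2 = t3; t3 = tmp;   // t2 now holds the newest texture
    LeaveCriticalSection(&cs);
}

// MRT calls this after it has finished drawing the current frame with t1.
void AcquireLatestTexture()
{
    EnterCriticalSection(&cs);
    GLuint tmp = t1; t1 = t2; t2 = tmp;   // pick up the newest texture for the next frame
    LeaveCriticalSection(&cs);
}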

I just can’t make the whole thing work (fast MRT and possibly long calculations in OSRT). Now MRT is only fast when the calculation in OSRT is relatively simple.

I have some questions:

  1. Is glFinish applied only to the commands of the calling thread’s context?

  2. Does SwapBuffers do an explicit finish or something like that? It seems that it does, but I found that on my machine glFinish busy-waits (burning 100% of a single core), while SwapBuffers uses only 1-2% of the processor time.

  3. How can I effectively determine whether rendering to / reading from a texture is finished? Right now, for OSRT I use glFinish and swap the textures after that; for MRT I swap the textures after SwapBuffers. I want to support NV and ATI cards too, so I don’t want to use fences.

In my case, it seems that SwapBuffers waits for OSRT’s pending GL calls too, because it hangs for a long time. I made some experiments, setting OSRT’s priority to lowest and MRT’s priority to time critical (vsync is on, so it’s not dangerous), but it didn’t help at all.

Thanks for any help
kend

Have you tried the new 185 drivers yet? They claim to have better multithreading support. I can confirm that SwapBuffers can cause long stalls. As far as I know, NVIDIA is working on improving their multithreaded rendering support in future versions of the driver.

I would not hold my breath on the availability of good parallel rendering, especially for both ATI and NV.

  1. no. the spec says “does not return until the effects of all previously called GL commands are complete”. Maybe a clever implementation might do something better…
  2. it can, but should not. It typically does a flush, however. As you have vsync on, the driver will idle waiting for the retrace as soon as the command buffer is full. glFinish is more precise though, and is needed if you want to keep a super steady framerate (at the price of high CPU load); see the sketch after this list. Even with low GPU load, ‘passive vsync’ will sometimes stutter a bit, like once every few seconds.
  3. does it work if OSRT does the swap without glFinish?
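For point 2, the steady-framerate pattern amounts to something like this (only a sketch; drawFrame and hdc are placeholders):

drawFrame();        // issue all GL calls for this frame
SwapBuffers(hdc);   // usually just queues/flushes the swap
glFinish();         // busy-wait until everything issued so far, including the swap, has completed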

It may not be easy, but cutting the OSRT rendering into multiple small batches will allow better parallelism, like doing your own “GPU task switching”. Tedious, but the only way I know.
Like this:

OSRT1/10
MRT
OSRT2/10
MRT
OSRT3/10
MRT
…
OSRT10/10

new texture available

MRT

If the OSRT GPU load is hard to predict, of course it will be hard to find a balanced way to split it…
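On a single thread, the loop could look roughly like this (just a sketch; renderOffscreenChunk, renderMainFrame, publishTexture, running and hdc are made-up placeholders):

const int chunkCount = 10;
int chunk = 0;

while (running) {
    renderOffscreenChunk(chunk);   // issue one slice of the off-screen job (FBO bound)
    glFlush();                     // let the driver start on it without blocking the CPU

    if (++chunk == chunkCount) {
        chunk = 0;
        publishTexture();          // the whole off-screen texture is done, swap it in
    }

    renderMainFrame();             // draw the visible frame with the last completed texture
    SwapBuffers(hdc);              // present (Windows, as in the original post)
}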

I made some measurements, for now only on a single-core machine with Windows XP, a GeForce 7600 GT and the latest driver. For the test I rendered 1 or 20 simple full-screen quads with OSRT. For all frames I measured average times (with QueryPerformanceCounter).

case 1: OSRT without glFinish, 1 quad, vsync off
OSRT’s frame: 1.06 ms
MRT’s frame: 2.55 ms (SwapBuffers 2.43 ms)

case 2: OSRT without glFinish, 1 quad, vsync on
OSRT’s frame: 0.99 ms
MRT’s frame: 13.8 ms (SwapBuffers 13.42 ms)

case 3: OSRT without glFinish, 20 quads, vsync off
OSRT’s frame: 0.72 ms
MRT’s frame: 21.93 ms (SwapBuffers 21.25 ms)

case 4: OSRT without glFinish, 20 quads, vsync on
OSRT’s frame: 1.08 ms
MRT’s frame: 43.17 ms (SwapBuffers 41.44 ms)

case 5: OSRT with glFinish, 20 quads, vsync off
OSRT’s frame: 25.92 ms
MRT’s frame: 9 ms (SwapBuffers 8.67 ms)

case 6: OSRT with glFinish, 20 quads, vsync on
OSRT’s frame: 24.44 ms
MRT’s frame: 22.81 ms (SwapBuffers 21.98 ms)

case 7: OSRT with glFlush, 20 quads, vsync off
OSRT’s frame: 1.03 ms
MRT’s frame: 22.41 ms (SwapBuffers 21.49 ms)

case 8: OSRT with glFlush, 20 quads, vsync on
OSRT’s frame: 1.13 ms
MRT’s frame: 43.69 ms (SwapBuffers 41.56 ms)

According to the measurements, I think SwapBuffers doesn’t just call flush and wait for a retrace, because glFlush doesn’t hang.

I don’t really know. Maybe rendering to and reading from a texture is happening at the same time. Maybe simply swapping the GL names was not a good idea. Maybe I have to copy from one texture to the other; with a PBO it could be fast.
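Something like this is what I have in mind for the PBO copy (a rough, untested sketch; srcTex, dstTex, width and height are placeholders):

GLuint pbo;
glGenBuffers(1, &pbo);

// Read the source texture into the PBO (stays on the GPU, no CPU readback).
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, NULL, GL_STREAM_COPY);
glBindTexture(GL_TEXTURE_2D, srcTex);
glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA, GL_UNSIGNED_BYTE, 0);   // 0 = offset into the PBO
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

// Write the PBO contents into the destination texture.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBindTexture(GL_TEXTURE_2D, dstTex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);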

But how could this “GPU task switching” be achieved? I can’t tell GL that these commands belong to this thread and those commands to that thread. Can we conclude that all GL commands are gathered in one list, and whenever SwapBuffers or glFinish is called, it waits until all commands are complete? If this is the case, then even fences would not help. So could this “GPU task switching” work with only one thread? And what happens if there are other OpenGL applications running? Do glFinish and SwapBuffers wait for GL commands issued by other processes too?

I never thought that this problem could be so complex. Now I don’t see any good solution. :frowning:

Multithreading doesn’t really seem to be part of the core GL spec language. It’s sort of implicit in the async part of the buffer upload stuff and in the platform specifics of render contexts in general, but by and large the meat of it is left to the implementation (perhaps as it should be). Probably better to leave the details of MT to the black box that it is, that or expose it in a way that’s amenable to expansion (whatever that means).

DX11’s approach - to build batches for the main render thread from a collection of threaded command buffers - seems like a good solution, but if and how that plays out in the GL way of doing things going forward remains to be seen. IMHO this would fit nicely into the DL framework, with a modification or two here and there.

However it unfolds, if you’re really going to expose the details of MT to the developer, you’d probably want to go “all the way” or run the risk of pleasing half the folks half the time and alienating the rest (not like that’s happened before). Long term though, I’d guess the black box approach wins. Short term, vendor-specific extensions aplenty to explore the possibilities. That seems to be the cycle of things: new technology, mad scramble to make sense of it, tidy wrappers emerge, complacency sets in; new technology, … I’d say we’re knee deep in the mad scramble phase.

I think it’s as simple as this: while GPUs today have many, many ALUs, they have but one command input, the FIFO coming from user mode. Whether the commands come from a separate thread or a separate process is likely irrelevant; one “batch” in the FIFO can’t be processed until the previous batch has finished.

If this is the case (as seems likely), I’d recommend kend split the heavy work in OSRT into smaller batches so as not to tie up the whole GPU for so long that it breaks MRT’s real-time requirements. While the total throughput will suffer a little, at least it could prevent the MRT breakage.

++luck;

Ok, now I’ll start thinking about how I could redesign the system.
Thank you all for the help.

Have you got any idea what happens when multiple OpenGL and Direct3D applications run at the same time? How do they reach and share the GPU? What does this GPU FIFO command buffer look like? How is this whole thing handled by the OS and the drivers? It’s very interesting.

Have you got any idea what happens when multiple OpenGL and Direct3D applications run at the same time?

Yes. It’s however well outside the scope of (even the advanced forum for) OpenGL.
Not that many of us here likely wouldn’t like this kind of pornography :-), it’s just waaaay off-topic.

Quite right! You’re go for porn! :slight_smile:

it’s just waaaay off-topic.

Naaah. Right on topic. And what is the Advanced forum for if not guts-level goodies way over the head of Beginners. :stuck_out_tongue:

try triple buffering
gotta enable it in the driver though

One comment here. I would use one thread for passing commands to GL, because the overhead of sharing objects and synchronizing is likely to surpass the overhead of a simple command buffer implementation for queuing the commands onto the single GL-enabled thread. You can still do most of the CPU work on the other thread, but let a single thread do the GL work.
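Something along these lines, for example (a rough, untested sketch; the class and function names are made up):

#include <functional>
#include <mutex>
#include <queue>

// Other threads push closures; only the thread that owns the GL context drains them.
class GLCommandQueue {
public:
    void push(std::function<void()> cmd) {
        std::lock_guard<std::mutex> lock(mutex_);
        commands_.push(std::move(cmd));
    }

    // Called once per frame from the GL thread, before rendering.
    void drain() {
        std::queue<std::function<void()>> local;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            std::swap(local, commands_);
        }
        while (!local.empty()) {
            local.front()();   // runs on the GL thread, so GL calls are safe here
            local.pop();
        }
    }

private:
    std::mutex mutex_;
    std::queue<std::function<void()>> commands_;
};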

I’ve done a lot of work on multi-threaded OpenGL recently, so here are some observations from my side in this wonderful world of race conditions and driver segfaults :slight_smile:

  • Use the 185 series nvidia driver (I use the one that comes with CUDA 2.2 right now; the ogl 3.1 180 driver crashes). Some things simply don’t work on the 180 series and result in segmentation faults and/or lockups.

  • It appears that rendering in small batches now has much less effect on the quality of “time slicing” the GPU. I see no difference between rendering a single 20M-polygon glDrawElements call and cutting it into chunks. We used to see a difference, but not anymore with the new driver.

  • In response to question 3, I test for GPU completion using an asynchronous query. At the moment a GPU timer query, but an occlusion query should work too. So first wait for the GPU token to appear, and only then release your thread locks to signal buffer completion.

  • Realize that OpenGL commands from different contexts/threads can and will be processed out of order. Thread-level locking cannot be used alone for synchronization.

  • Make absolutely sure your buffers remain on the GPU. I classify most buffers as ARRAY_BUFFER with STATIC_DRAW to do that. Also look at the flags for glMapBufferRange that you can’t explicitly set with only glMapBuffer.
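For the last point, the kind of mapping I mean looks roughly like this (a sketch, not my production code; vbo and size are placeholders):

glBindBuffer(GL_ARRAY_BUFFER, vbo);
void* ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
                             GL_MAP_WRITE_BIT |
                             GL_MAP_INVALIDATE_BUFFER_BIT |   // let the driver orphan the old storage
                             GL_MAP_UNSYNCHRONIZED_BIT);      // don't stall waiting for the GPU
// ... write the new contents through ptr ...
glUnmapBuffer(GL_ARRAY_BUFFER);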

Hello Ferdi, I am very interested in your comments, as we fight many of the same battles. We use NVIDIA fences for management, but of course that is unavailable on ATI, which causes big problems on that side of the fence.

Can I ask for a touch more detail on the solution you use, and does it seem to work across ATI cards?

We have also noticed a great improvement from pre-178 to 178 and then to 185; 180 was a horrible no-go zone for us with many problems, and we are assured 190 will fix even more issues.

One of the killer 180 issues was performance in multi-screen modes, which was just broken, 185.85 came to the rescue just in time.

We are currently working through some interesting performance deltas between pre-compiled display lists and VBOs; as our models change very rarely, display lists have been good for us, but I am starting to think their days are numbered.

I also deeply wish there were a way to mark textures as GPU-only, as on today’s 1 GB+ cards it is starting to be a problem keeping main-memory shadows of all resources, especially when we require our own additional buffering as well…

Multi-threaded, async rendering in OpenGL has got to be the future, but many don’t seem to realise it… are they not looking at where core counts are headed?

I’m sorry for the late reply, qzm; I haven’t read these forums for some time.

I tried to use NV_fence, but it appears not to be supported on my system (Linux-x86_64 185.18.08, NVidia GTX 260?)

Timer query (EXT_timer_query) appears to work well, although I don’t know if it should or is portable. Since it’s in EXT it should work on ATI, I guess. I don’t actually have any ATI hardware available anywhere to try…

Basically I do this:


// one-time init
GLuint timer;
glGenQueries(1, &timer);

// start filling a buffer in thread 0
// Acquire CPU thread lock on buffer
glBeginQuery(GL_TIME_ELAPSED_EXT, timer);
// ... fill buffer ...
glEndQuery(GL_TIME_ELAPSED_EXT);

// sync with the GPU: poll until the query result is available, which means
// the GPU has executed everything up to and including glEndQuery
GLint available = 0;
while (!available) {
   glGetQueryObjectiv(timer, GL_QUERY_RESULT_AVAILABLE, &available);
}
GLuint64EXT timeElapsed;
glGetQueryObjectui64vEXT(timer, GL_QUERY_RESULT, &timeElapsed);

// Release CPU thread lock on buffer

I’m trying to write OGL 3.0+ code only nowadays, so display lists are out. I also noticed that buffers and textures don’t always stay on the GPU as you would expect, even if you never read them explicitly on the CPU.

I agree about the multi-threaded async rendering. We are using this for an image-warping-based system that produces 60Hz display updates, which is periodically updated by a slow (e.g. 6Hz) rendering process. Doing work with different priorities in parallel is very interesting, imo. One thing that comes to mind is doing very slow ambient occlusion calculations in the background and updating that data periodically.

Really?! Here on Linux x86_64 185.18.14 on a GeForce 8, it is supported:

> glxinfo | grep NV_fence
GL_NV_explicit_multisample, GL_NV_fence, GL_NV_float_buffer,