Accessing Win32 window on two threads = massive slowdown

I currently have my game engine structured with one main thread that creates a Win32 window and accesses events, and another thread that initializes and executes all OpenGL code.

I have found that the rendering thread is suffering from massive pauses on the following commands:
[ul]
[li]wglMakeCurrent[/li]
[li]glViewport[/li]
[li]wglSwapBuffers[/li]
[/ul]

I’m seeing inexplicable delays of around 20 milliseconds on these commands. For some reason, heavy activity on the main thread seems to trigger the big delays, so I did not notice the problem until I started pushing a lot of game logic onto the main thread.

I’m 90% certain at this point that I need to move all window creation and event code into the rendering thread, but I was wondering if this is a known issue and whether anyone has additional information on what I am experiencing? Thank you in advance.

It looks like this thread is describing similar behavior:
https://www.gamedev.net/forums/topic/539058-slow-multithreaded-rendering-in-windows/
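
For reference, here is roughly how my threads are structured (simplified sketch; names like RenderThread and g_running are illustrative, and error handling is omitted):

[code]
// Simplified sketch of my setup (names are illustrative).
// Main thread: creates the window and pumps messages.
// Render thread: creates/owns the GL context and makes all GL calls.
#include <windows.h>
#include <GL/gl.h>
#include <atomic>

std::atomic<bool> g_running{ true };

DWORD WINAPI RenderThread(LPVOID param)
{
    HWND hwnd = static_cast<HWND>(param);
    HDC  dc   = GetDC(hwnd);
    // ... ChoosePixelFormat/SetPixelFormat on dc first ...
    HGLRC rc = wglCreateContext(dc);
    wglMakeCurrent(dc, rc);            // context is current on THIS thread only

    while (g_running.load())
    {
        glClear(GL_COLOR_BUFFER_BIT);
        // ... draw ...
        SwapBuffers(dc);
    }

    wglMakeCurrent(nullptr, nullptr);
    wglDeleteContext(rc);
    ReleaseDC(hwnd, dc);
    return 0;
}

// Main thread, after CreateWindow():
//   HANDLE rt = CreateThread(nullptr, 0, RenderThread, hwnd, 0, nullptr);
//   MSG msg;
//   while (GetMessage(&msg, nullptr, 0, 0)) { TranslateMessage(&msg); DispatchMessage(&msg); }
//   g_running = false;
//   WaitForSingleObject(rt, INFINITE);
[/code]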

Tell us more about what you’re doing.

[ul]
[li]What GPU and driver?[/li]
[li]There’s only one window and one context involved here, right?[/li]
[li]Which thread are you creating the GL context on?[/li]
[li]If not the main thread: are you unbinding the context from the main thread before letting the render thread bind it? (See the sketch after this list.)[/li]
[li]Are you checking the return value of wglMakeCurrent?[/li]
[li]Are you running with VSync disabled?[/li]
[li]Have you double-checked the driver control panel to make sure you’re not forcing VSync on?[/li]
[li]Are you calling glFinish after wglSwapBuffers?[/li]
[li]How are you timing?[/li]
[li]Which Windows is this? (7? 8.1? 10?)[/li]
[li]Is this with a fullscreen exclusive window?[/li]
[/ul]
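On the unbind/rebind point, the handoff should look roughly like this (a minimal sketch; dc and rc are illustrative names for the window DC and GL context):

[code]
// A GL context can be current on at most one thread at a time. If the
// context was created and bound on the main thread, release it there
// before the render thread binds it -- and check the return value.

// Main thread, after creating the context:
wglMakeCurrent(nullptr, nullptr);      // unbind from this thread

// Render thread, before any GL calls (dc/rc handed over from main):
if (!wglMakeCurrent(dc, rc))
{
    DWORD err = GetLastError();
    // log err -- every GL call made from here on would be a no-op
}
[/code]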
In addition to trying different permutations from the above (including moving the GL context creation to the render thread, if it’s not already there), you might do a Very Sleepy profiling run and see if it gives you any additional clues about what’s bottlenecking you.

Also, this post sounds like your test setup, and might relate to your problem.

It’s a 1280x720 window, not fullscreen. Pixel fill rate is not an issue. There is one hidden window for context sharing and one visible window. wglMakeCurrent only gets called when the context switches, which is never, so it is not being called as the app loops.

There are no calls to glFlush or glFinish.

Windows 10.

My timing code uses win32::timeGetTime().

VSync is disabled.

Framerate is only about 160 when it should be 1000+ for this scene.

NVIDIA GeForce GTX 1080.

The main thread creates the window and polls events. Second thread creates the OpenGL context and handles all OpenGL calls.

I spent all day moving all Windows API calls into the second thread, but I still get the same result!

Actual rendering takes 1-2 milliseconds, but the call to SwapBuffers() is taking 4-9 milliseconds.
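
(For anyone measuring the same thing: a sketch of how the draw cost can be separated from the swap cost. NowMs is an illustrative QueryPerformanceCounter helper, and the glFinish is for diagnosis only, since it stalls the pipeline.)

[code]
#include <windows.h>
#include <GL/gl.h>

// Illustrative QueryPerformanceCounter helper: current time in ms.
static double NowMs()
{
    LARGE_INTEGER f, t;
    QueryPerformanceFrequency(&f);
    QueryPerformanceCounter(&t);
    return 1000.0 * double(t.QuadPart) / double(f.QuadPart);
}

// Inside the render loop (dc is the window's HDC):
double t0 = NowMs();
// ... issue all draw calls ...
glFinish();                 // drain queued GPU work -- diagnosis only!
double t1 = NowMs();
SwapBuffers(dc);
double t2 = NowMs();
// t1 - t0 = true draw cost; t2 - t1 = actual swap cost.
// Without the glFinish, work queued earlier in the frame gets billed
// to whichever later call happens to block (often SwapBuffers).
[/code]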

Okay, I think the problem lies elsewhere in my use of texture buffers. I commented out some code there and the framerate shot up to 1000, so I think SwapBuffers was just slow because something in the command buffer queue from the previous frame was slow.

Good to hear that you’ve got a line on your problem.

I’m curious what you find out with your texture buffer slowdown. It might be due to internal synchronization caused by the way that the app updates the contents of its buffer object, but that’s a guess.

My timing code uses win32::timeGetTime().

Should be fine if you don’t need resolution < 1ms. If you do, I would use QueryPerformanceCounter(). It’s easy to use.
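
Something like this (a minimal sketch; timeGetTime’s default precision can be several milliseconds, and 1 ms at best with timeBeginPeriod, which is too coarse for frames in the 1-2 ms range):

[code]
#include <windows.h>

// Returns milliseconds elapsed since the previous call.
double FrameMilliseconds()
{
    static LARGE_INTEGER freq = {};
    static LARGE_INTEGER prev = {};
    if (freq.QuadPart == 0)
    {
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&prev);
    }
    LARGE_INTEGER now;
    QueryPerformanceCounter(&now);
    double ms = 1000.0 * double(now.QuadPart - prev.QuadPart) / double(freq.QuadPart);
    prev = now;
    return ms;
}
[/code]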

[QUOTE=Dark Photon;1292913]Good to hear that you’ve got a line on your problem.

I’m curious what you find out with your texture buffer slowdown. It might be due to internal synchronization caused by the way that the app updates the contents of its buffer object, but that’s a guess.

Should be fine if you don’t need resolution < 1ms. If you do, I would use QueryPerformanceCounter(). It’s easy to use.[/QUOTE]
I have never run this renderer in single-threaded mode. I have only built old single-threaded systems.

First off, I was erroneously calling glBufferData when glBufferSubData would have sufficed. Removing that error sped things up; I am now getting around 600 FPS, but that is still slow. Our old single-threaded deferred renderer runs at 1000 FPS with the same scene, and the new multithreaded forward renderer should be as fast or faster.

I suspect that texture buffers are slower than uniform buffers, but more testing is needed.

I can confirm I tried the window code both ways, and my original design (window creation and event handling on the main thread, OpenGL initialization and rendering on the second thread) is slightly faster, probably because it is a little older and more optimized; in any case, there is no disadvantage to it.

[QUOTE=glnoob;1292914]First off, I was erroneously calling glBufferData when glBufferSubData would have sufficed.
Removing that error sped things up; I am now getting around 600 FPS, but that is still slow.
Our old single-threaded deferred renderer runs at 1000 FPS with the same scene, and the new multithreaded forward renderer should be as fast or faster.[/QUOTE]

Well, you do what you have to do to make it fast. But whether that is an error depends on your buffer usage and the driver. Are you calling glBufferData with a non-NULL pointer repeatedly and/or with a different size? If so, yes, you should probably avoid that.

On NVidia drivers, if you’re re-specifying the same amount of data each update, orphaning the buffer (see this page for details) can be very fast, and it avoids the internal driver synchronization that would otherwise be required when other references to that same buffer object are still “in-flight” in the pipeline.
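
In code, the orphaning pattern looks roughly like this (buffer handle, size, and target are illustrative):

[code]
// Orphaning: re-specify the full store with a NULL pointer and the
// SAME size and usage each frame, then fill it. The driver can hand
// back fresh storage immediately instead of synchronizing with
// in-flight frames that still reference the old contents.
glBindBuffer(GL_TEXTURE_BUFFER, tbo);                            // illustrative handle
glBufferData(GL_TEXTURE_BUFFER, size, nullptr, GL_STREAM_DRAW);  // orphan
glBufferSubData(GL_TEXTURE_BUFFER, 0, size, newData);            // fill
[/code]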

First, I’d recommend you benchmark in milliseconds/frame rather than frames/sec … for many reasons, not the least of which is that it lets you talk meaningfully about how much your texture buffer object (TBO) update method actually costs you, so you can optimize it. Please do read this for details.
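
To make that concrete: 600 frames/sec is 1000/600 ≈ 1.67 ms/frame, while 1000 frames/sec is 1.00 ms/frame, so the “missing” 400 FPS is really only about 0.67 ms of extra work per frame to find and eliminate.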

Next, I would disable your per-frame TBO updates and time frames without them. How many msec/frame? Then add those TBO updates back in and re-time your frames. What’s the difference? With this crucial data, you know exactly how much time is tied up in the update specifically, and whether it’s the “big fish”. If it is, you can focus on optimizing its time consumption specifically.
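
A sketch of that A/B measurement (UpdateTextureBuffer and RenderFrame are stand-ins for your own code, and NowMs is an illustrative QueryPerformanceCounter helper like the one sketched earlier in this thread):

[code]
// A/B measurement: run once with kDoTboUpdates = true and once with
// false, then compare the averages. The difference between the two
// runs is the true per-frame cost of the update path.
const bool kDoTboUpdates = true;   // flip between runs
const int  kFrames       = 1000;

double totalMs = 0.0;
for (int i = 0; i < kFrames; ++i)
{
    double t0 = NowMs();
    if (kDoTboUpdates)
        UpdateTextureBuffer();     // your per-frame TBO upload
    RenderFrame();                 // draw + SwapBuffers
    glFinish();                    // fully cost the frame on the GPU side
    totalMs += NowMs() - t0;
}
double avgMs = totalMs / kFrames;
[/code]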

As to methods to optimize your buffer object updates, see this page: Buffer Object Streaming. That said, TBOs are weird beasts that make it a little bit hard to apply some of these techniques. It’s easier when you’re just binding buffer objects directly to the shader. Especially with NVidia bindless extensions where you can pipe the GPU address for the buffer object(s) directly into the shader, bypassing all the binding mess.
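
As one example from that page, here’s roughly what unsynchronized mapping with a moving write cursor looks like when applied to a TBO (all names illustrative; glTexBufferRange requires GL 4.3 / ARB_texture_buffer_range, and the offset must respect GL_TEXTURE_BUFFER_OFFSET_ALIGNMENT):

[code]
// One streaming recipe: write through an unsynchronized map at a
// moving cursor, orphaning only when the buffer wraps.
// Needs <cstring> for memcpy. cursor persists across frames.
static GLintptr  cursor = 0;
const  GLsizeiptr bytes = updateSize;   // illustrative: bytes written this frame

glBindBuffer(GL_TEXTURE_BUFFER, tbo);
if (cursor + bytes > bufferSize)
{
    // Wrap: orphan so we don't stomp data the GPU may still be reading.
    glBufferData(GL_TEXTURE_BUFFER, bufferSize, nullptr, GL_STREAM_DRAW);
    cursor = 0;
}
void* dst = glMapBufferRange(GL_TEXTURE_BUFFER, cursor, bytes,
                             GL_MAP_WRITE_BIT |
                             GL_MAP_INVALIDATE_RANGE_BIT |
                             GL_MAP_UNSYNCHRONIZED_BIT);
memcpy(dst, srcData, bytes);
glUnmapBuffer(GL_TEXTURE_BUFFER);
// Expose just this region to the shader (GL 4.3 / ARB_texture_buffer_range):
glTexBufferRange(GL_TEXTURE_BUFFER, GL_RGBA32F, tbo, cursor, bytes);
cursor += bytes;
[/code]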

I suspect that texture buffers are slower than uniform buffers, but more testing is needed.

From what I’ve read, your guess is probably correct. Expect “ordinary uniforms” to be very fast, with uniform buffer objects next. On NVidia, these can be cached in the fast shared memory local to the GPU multiprocessors. Next up is probably ordinary textures, with access sped up through texture tiling and texture caches. Trailing those are things which (due to their maximum sizes) virtually have to live in slower global GPU memory, such as TBOs and SSBOs. IIRC, the driver does use part of the GPU shared memory as a global memory access cache, which should help somewhat with accessing these latter types, if there’s repetition in or locality of access to global memory.
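
For concreteness, here is roughly what the two binding paths being compared look like on the app side (binding indices and handles are illustrative):

[code]
// Uniform buffer object: indexed binding point, size-limited
// (GL_MAX_UNIFORM_BLOCK_SIZE), but eligible for fast on-chip caching.
glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);    // matches "binding = 0" block

// Texture buffer object: a buffer backing a buffer texture; can be far
// larger, but is fetched with texelFetch() out of global memory.
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_BUFFER, bufTex);
glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, tbo);
[/code]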