Bad performance (possibly FBO related) on NVIDIA



cadaver
02-27-2012, 01:24 AM
Hi,
I'm developing a 3D engine which is both OpenGL & Direct3D capable.

I'm seeing quite bad OpenGL performance (possible pipeline stalls) when running Linux & the NVIDIA driver (up to 50ms/frame). When the same machine is booted into Windows, performance is as expected (below 10ms/frame).

I'm seeing this both on a laptop with a Geforce GT540M and a desktop machine with GTX580.

On Mac OS X, the same OpenGL rendering code also works without performance issues on NVIDIA hardware. Also, Linux + AMD hardware seems to work fine.

The performance issue seems to be proportional to the number of times I change the surfaces bound to the FBO (I use a single FBO object). Therefore forward rendering without shadows works fine, but anything like adding shadows or postprocessing, or doing deferred rendering, starts to bog down the performance.
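
To illustrate, the rebinding pattern is roughly the following (a simplified sketch rather than the actual engine code; core FBO entry points are used and a current GL context is assumed):

// Simplified sketch of the single-FBO pattern: one FBO is created up front, and
// every render target change re-attaches different textures to that same object.
#include <GL/gl.h> // plus whatever extension loader provides the FBO entry points

static GLuint fbo = 0;

void SetRenderTarget(GLuint colorTex, GLuint depthTex)
{
    if (!fbo)
        glGenFramebuffers(1, &fbo);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, colorTex, 0);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT, GL_TEXTURE_2D, depthTex, 0);
    // ...draw calls for this render target follow...
}

void SetBackbuffer()
{
    glBindFramebuffer(GL_FRAMEBUFFER, 0); // back to the default framebuffer
}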

Anyone else seen something like this?

(btw. the engine code is public at Google Code: http://urho3d.googlecode.com)

Ludde
02-27-2012, 04:57 AM
Have you tested with multiple FBOs, one for each surface?

cadaver
02-27-2012, 05:22 AM
Not yet, I plan to.

Actually I've narrowed things down a bit: it is not the number of surface changes after all, but rather the number of draw calls that go to the FBO instead of the backbuffer.

For example, a complex forward-rendered scene without a bloom post-effect has no problems, as it goes directly to the backbuffer. But the same scene with bloom on must be rendered to the FBO first so that it can be operated on, and that causes a > 20ms performance hit.

Dark Photon
02-27-2012, 06:24 AM
I'm seeing quite bad OpenGL performance ... when running Linux & the NVIDIA driver (up to 50ms/frame) ... with GTX580 ... it is ... the number of draw calls that go to the FBO instead of the backbuffer. ... Anyone else seen something like this?
Only in one specific scenario. And I use a number of FBOs to render frames, with NVidia, on Linux (for many years), on GTX580, GTX480s, GTX285s (and others) just like you are.

The only time I've seen anything like this is when you're hitting up against (or flat blowing past) GPU memory capacity. When you do, that means the driver can/will start tossing textures and such off the board to try to make room so it can keep everything it needs for rendering batches on there, and that can result in massive frame time hits as it tries frantically to play musical chairs with CPU and GPU memory to render your frame. This includes your shadow textures, which may be swapped off the board to make room for other things when you're not rendering to them.

So check how much memory you're using. Use NVX_gpu_memory_info (http://developer.download.nvidia.com/opengl/specs/GL_NVX_gpu_memory_info.txt). It is trivial and well worth your while. In my experience, you should never see the "evicted" number > 0 (on Linux). If you do, you're blowing past GPU memory. Shut down/restart X via logout/login or Ctrl-Alt-Bkspc (or just reboot) to reset the count to 0.
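
For instance (a minimal sketch; the enum values are taken from the extension spec linked above, all results are in kilobytes, and a current GL context on an NVIDIA driver is assumed):

// Minimal GL_NVX_gpu_memory_info query; all values are reported in kilobytes.
#include <GL/gl.h>
#include <cstdio>

#ifndef GL_GPU_MEMORY_INFO_TOTAL_AVAILABLE_MEMORY_NVX
#define GL_GPU_MEMORY_INFO_TOTAL_AVAILABLE_MEMORY_NVX   0x9048
#define GL_GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX 0x9049
#define GL_GPU_MEMORY_INFO_EVICTION_COUNT_NVX           0x904A
#define GL_GPU_MEMORY_INFO_EVICTED_MEMORY_NVX           0x904B
#endif

void PrintGpuMemoryInfo()
{
    GLint totalKb = 0, freeKb = 0, evictions = 0, evictedKb = 0;
    glGetIntegerv(GL_GPU_MEMORY_INFO_TOTAL_AVAILABLE_MEMORY_NVX, &totalKb);
    glGetIntegerv(GL_GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX, &freeKb);
    glGetIntegerv(GL_GPU_MEMORY_INFO_EVICTION_COUNT_NVX, &evictions);
    glGetIntegerv(GL_GPU_MEMORY_INFO_EVICTED_MEMORY_NVX, &evictedKb);
    printf("GPU memory: %d KB total, %d KB free, %d evictions (%d KB evicted)\n",
           totalKb, freeKb, evictions, evictedKb);
}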

Also, if you've got one of those GPU memory and performance wasting desktop compositors enabled, disable it (for KDE, use the kcontrol GUI to disable effects/compositing, or just press Shift-Alt-F12).

As far as controlling which textures get kicked off first: glPrioritizeTextures is generally said to be a no-op. And while NVidia hasn't updated their GPU programming guide in a good while (3 years), the advice in it (quoted below) gives some clue as to how to influence texture/render target GPU residency priority. But the best advice is to just never fill up GPU memory, and then you don't have to worry about this.


"In order to minimize the chance of your application thrashing video memory, the best way to allocate shaders and render targets is:

1. Allocate render targets first
   a. Sort the order of allocation by pitch (width * bpp).
   b. Sort the different pitch groups based on frequency of use. The surfaces that are rendered to most frequently should be allocated first.
2. Create vertex and pixel shaders
3. Load remaining textures"

cadaver
02-27-2012, 04:08 PM
The problem indeed seems to be using a single FBO. As a test, I switched to using another FBO for shadow map rendering (switching between shadowmaps and the main view is the most frequent rendertarget change for me), and most of the "unexpected" performance hit went away. The rendering as a whole is still some constant factor slower than on Windows & OpenGL, but it's much more consistent now.

Now just to implement the multiple-FBO mechanism properly and transparently to the caller :)
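
For reference, the test amounted to something like this (a simplified sketch, not the engine's actual code): shadow maps get a dedicated FBO, while the main view keeps using the original one.

// Sketch of the test: shadow map rendering gets its own, dedicated FBO, so neither
// FBO is reconfigured when switching between shadow and view passes.
static GLuint shadowFbo = 0;

void BindShadowMapTarget(GLuint shadowDepthTex)
{
    if (!shadowFbo)
        glGenFramebuffers(1, &shadowFbo);
    glBindFramebuffer(GL_FRAMEBUFFER, shadowFbo);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT, GL_TEXTURE_2D, shadowDepthTex, 0);
    glDrawBuffer(GL_NONE); // shadow maps are depth-only
    glReadBuffer(GL_NONE);
}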

Thanks to all who replied!

Dark Photon
02-27-2012, 06:47 PM
The problem indeed seems to be using a single FBO. As a test, I switched to using another FBO for shadow map rendering (switching between shadowmaps and the main view is the most frequent rendertarget change for me), and most of the "unexpected" performance hit went away. The rendering as a whole is still some constant factor slower than on Windows & OpenGL, but it's much more consistent now.

Over the course of a frame, in rebinding different render targets to the FBO, do you ever change the resolution and/or pixel format of the FBO?

I don't know if it still is, but it used to be that this was a slow path in the NVidia driver (circa GeForce 7 days). And yeah, the solution was to avoid doing that -- use multiple FBOs.

cadaver
02-28-2012, 01:14 AM
Yes, the shadow maps (or possibly post-processing buffers) are a different size and format.

Could it be that the Linux driver is still using older code?

Dark Photon
02-28-2012, 05:47 AM
Yes, the shadow maps (or possibly post-processing buffers) are a different size and format.

Could it be that the Linux driver is still using older code?
The implication of your statement is that there's a newer, improved version. However, IIRC from the NVidia post, it's not that this is a path that was written inefficiently, but just that it's a slow path. It said that reconfiguring the resolution or internal format of an FBO was expensive, and to avoid doing that a lot.

cadaver
02-28-2012, 10:23 AM
I can confirm further improvement (on Linux) by implementing a map of FBOs, where the resolution and format form the search key.
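
Roughly like this (simplified for illustration; the key struct and helper below are just a sketch, not the engine's actual code):

// Sketch of the FBO cache: one FBO per (width, height, format) combination, so an
// individual FBO never has its attachments' resolution or format reconfigured.
#include <GL/gl.h>
#include <map>

struct FrameBufferKey
{
    int width, height;
    unsigned format; // internal format of the color target (or 0 for depth-only)

    bool operator <(const FrameBufferKey& rhs) const
    {
        if (width != rhs.width) return width < rhs.width;
        if (height != rhs.height) return height < rhs.height;
        return format < rhs.format;
    }
};

static std::map<FrameBufferKey, GLuint> frameBuffers;

GLuint GetFrameBuffer(int width, int height, unsigned format)
{
    FrameBufferKey key;
    key.width = width;
    key.height = height;
    key.format = format;

    std::map<FrameBufferKey, GLuint>::iterator i = frameBuffers.find(key);
    if (i != frameBuffers.end())
        return i->second;

    GLuint newFbo = 0;
    glGenFramebuffers(1, &newFbo);
    frameBuffers[key] = newFbo;
    return newFbo;
}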

However, what is curious is that on Windows performance was always fine with the same hardware, and it did not improve over the initial code, which was just binding all surfaces to the same FBO.

Dark Photon
02-28-2012, 06:13 PM
However, what is curious is that on Windows performance was always fine with the same hardware, and it did not improve over the initial code, which was just binding all surfaces to the same FBO.
That is interesting. Wonder if the Windows driver is doing the FBO virtualization thing under the covers that we're both doing in the app.

cadaver
02-29-2012, 12:54 AM
Perhaps the Windows driver has to have a similar mechanism anyway to support Direct3D's SetRenderTarget / SetDepthStencilSurface API efficiently, and it just reuses it for OpenGL.

That would mean the FBO in fact exposes functionality closer to the hardware, while the Direct3D rendertarget setup is a further abstraction.