NVIDIA perf issue with compressed textures

I’m running an application on two different systems: one is a laptop with an NVIDIA GF 9800M (Win7 32-bit, Core 2 Duo, 4GB RAM), the other is a desktop with an NVIDIA GF 470 (Win7 64-bit, i7, 6GB RAM). You’d think the app would run just as well, or even faster, on the more powerful desktop system. Interestingly, the application runs significantly slower on the desktop: about 6 FPS on the laptop, but less than 0.5 FPS on the desktop.

(The problem was first reported to me by a client using a Quadro card. I’m waiting to get system specs from them, and will follow up here when I have that info.)

The app is OpenSceneGraph-based, so it opens an OpenGL 2.1 context. The dataset consists of 6 million vertices, all texture mapped, either triangles or tri strips. There is over a GB of texture data in DXT1 format and it’s a mix of RGB textures and RGBA textures. The textures are in .dds files.

When the app comes up in a default view, the framerate is OK on both systems. But as I move the eyepoint closer to the model, the framerate suddenly drops significantly on the desktop. Note that OpenSceneGraph does some small feature culling, so it’s likely that, as the eye moves closer, parts of the dataset begin to render that previously didn’t render, probably requiring more texture data to be resident in GPU RAM. At least, that was my theory. But the fact that the problem is completely absent on the 9800M seems to contradict this.

The problem goes away if I disable texture mapping entirely. The performance is also acceptable if I render to a smaller window, but this doesn’t necessarily imply a fill-limitation, as OSG’s small feature culling renders less data in this case.

My laptop is running 260.99 drivers. On my desktop, both 260.89 and 275.33 exhibit the issue. The current release 280.26 seems to make the issue worse: the performance is poor right from the first frame, while the application is still in the initial view position.

Does anyone have any suggestions for how I can move forward to work around this issue? I’m considering converting all the textures to non-compressed to see if that changes the behavior, or limiting their mipmap base level size.
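For the mipmap-limiting idea, here’s a minimal C sketch of what I have in mind (a runtime alternative to re-authoring the files). GL_TEXTURE_BASE_LEVEL is core since OpenGL 1.2; glTexParameteri is passed in as a function pointer here just to keep the snippet self-contained — in a real app you’d call it directly on each bound texture:

```c
/* GL enum values (core since OpenGL 1.2). */
#define GL_TEXTURE_2D         0x0DE1
#define GL_TEXTURE_BASE_LEVEL 0x813C

/* Signature of glTexParameteri; pass the real entry point. */
typedef void (*TexParameteriFn)(unsigned int target, unsigned int pname, int param);

/* Tell GL to ignore the largest `levelsToSkip` mip levels of the
 * currently bound 2D texture. Assumes a full mip chain was uploaded
 * (which .dds files with mipmaps provide). */
void skip_top_mip_levels(TexParameteriFn texParameteri, int levelsToSkip)
{
    texParameteri(GL_TEXTURE_2D, GL_TEXTURE_BASE_LEVEL, levelsToSkip);
}

/* Each skipped level quarters the base-level memory: divisor = 4^levels. */
unsigned int base_level_memory_divisor(int levelsToSkip)
{
    return 1u << (2 * levelsToSkip);
}
```

Skipping one level cuts the largest mip’s memory by 4x, two levels by 16x — which is why “quartering” the textures offline and raising the base level at runtime should have roughly the same effect on GPU memory use.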

But, ultimately, I’d really like to know why my older, slower GF 9800M performs better with this dataset than my newer more expensive GF470…

Thanks for any help.

Client-provided info on two other systems that experience the same performance cliff:

  • OS: Windows XP x64 / CPU: dual xeon quad core 2.67 / Host RAM: 6GB / GPU: Quadro FX 5800 / driver: 275.65
  • OS: Mac OS 10.6 64-bit / CPU: i7 2.66GHz / Host RAM: 8GB 1067 MHz DDR3 / GPU: GF GT 330M / driver: latest available from Apple at date of posting

Well, since you’ve established it’s texture related, first thing that comes to mind is texture thrashing. That is, on the desktop/GF470, is it possible that your working set of GPU textures is larger than will fit within the remaining GPU-card memory (after deducting space for your system FB, FBOs, VBOs, PBOs, renderbuffers, display lists, etc.)?

First, if you’re running a compositing window manager (Aero with MSWin, etc.), turn it off. It just eats GPU memory and performance.

Use the NVX_gpu_memory_info extension to get a feel for your GPU memory usage. Generally speaking, if evicted > 0, you’re overrunning GPU memory and need to reduce your working set (i.e. the driver couldn’t keep everything it wants to on the GPU, and some is being kicked off to CPU memory). Shut down your display manager and restart (or just reboot) to reset the evicted count to 0.
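For reference, here’s a minimal C sketch of querying those counters. The enum values are from the NVX_gpu_memory_info extension spec (all memory values are reported in KiB); glGetIntegerv is passed in as a function pointer just to keep the snippet self-contained — in a real app you’d call it directly after confirming the extension is in the extension string:

```c
#include <stdio.h>

/* Enum values from the NVX_gpu_memory_info extension spec (KiB). */
#define GL_GPU_MEMORY_INFO_DEDICATED_VIDMEM_NVX          0x9047
#define GL_GPU_MEMORY_INFO_TOTAL_AVAILABLE_MEMORY_NVX    0x9048
#define GL_GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX  0x9049
#define GL_GPU_MEMORY_INFO_EVICTION_COUNT_NVX            0x904A
#define GL_GPU_MEMORY_INFO_EVICTED_MEMORY_NVX            0x904B

/* Signature of glGetIntegerv; pass the real entry point. */
typedef void (*GetIntegervFn)(unsigned int pname, int *params);

/* Returns 1 if the driver has evicted anything from GPU memory
 * (i.e. the working set no longer fits), 0 otherwise. */
int gpu_memory_overcommitted(GetIntegervFn getIntegerv)
{
    int dedicated = 0, available = 0, evictions = 0, evictedKiB = 0;
    getIntegerv(GL_GPU_MEMORY_INFO_DEDICATED_VIDMEM_NVX, &dedicated);
    getIntegerv(GL_GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX, &available);
    getIntegerv(GL_GPU_MEMORY_INFO_EVICTION_COUNT_NVX, &evictions);
    getIntegerv(GL_GPU_MEMORY_INFO_EVICTED_MEMORY_NVX, &evictedKiB);
    printf("dedicated %d KiB, available %d KiB, %d evictions (%d KiB)\n",
           dedicated, available, evictions, evictedKiB);
    return evictions > 0;
}
```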

If evicted is 0, then you can look at current available vs. dedicated mem to see how much of your GPU memory is free.

As a cross-check, you can also easily add up in your app how much texture memory you’re uploading to the GPU. DXT1 = 0.5 bytes/texel. DXT3/DXT5 are 1.0 bytes/texel. When computing size for base map, round texel counts up to next multiple of 4 (as DXT is 4x4 block-based), then multiply by bytes/texel. To factor in MIPmaps too, take size for base map * 4 / 3.
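That bookkeeping is easy to mechanize. A small C sketch that sums every mip level exactly (rather than the 4/3 rule of thumb, which it closely matches) — DXT is 4x4 block-based, 8 bytes/block for DXT1 and 16 bytes/block for DXT3/DXT5:

```c
#include <stddef.h>

/* Compressed size in bytes of one DXT mip level. Texel dimensions
 * round up to the next multiple of 4 (DXT stores 4x4 blocks):
 * 8 bytes/block for DXT1, 16 bytes/block for DXT3/DXT5. */
size_t dxt_level_bytes(size_t width, size_t height, int isDxt1)
{
    size_t blocksW = (width  + 3) / 4;
    size_t blocksH = (height + 3) / 4;
    return blocksW * blocksH * (isDxt1 ? 8 : 16);
}

/* Total size including a full mip chain, summing each level down
 * to 1x1. For large textures this comes out close to base * 4/3. */
size_t dxt_total_bytes(size_t width, size_t height, int isDxt1)
{
    size_t total = 0;
    while (width > 1 || height > 1) {
        total += dxt_level_bytes(width, height, isDxt1);
        if (width  > 1) width  /= 2;
        if (height > 1) height /= 2;
    }
    return total + dxt_level_bytes(1, 1, isDxt1);
}
```

For example, a 1024x1024 DXT1 base map is 524288 bytes (0.5 bytes/texel), and with the full mip chain about 699 KB — right around the 4/3 estimate.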

If this is what’s going on, wild guess at why the laptop is faster is its GPU is possibly using unified memory (that is, CPU mem = GPU mem), so there is no swapping/thrashing to keep textures in GPU mem, and it’s got 4GB of mem. I don’t know anything about laptop GPUs though…

Just a small correction. The eviction count is never 0 on Windows (at least, I have never encountered that). What you should watch is whether the eviction count stays constant. If the count changes, you have a problem with an insufficient amount of memory.

Thanks for the correction, Aleksandar. That’s interesting. Here on Linux, with all desktop compositing disabled, it’s 0 after boot-up.

Sorry for the typo! It was written early in the morning after just a few hours of sleep. :slight_smile:

By the way, I have to correct my previous post: I haven’t seen an eviction count equal to zero on newer versions of Windows (until now). I’ve just tried it on XP 32-bit, and there it is 0; since I’m not using XP, I had forgotten to test it. On Win7 64-bit, I got an initial eviction count = 3 and evicted = 19584 kB (res. 1920x1080). On Vista 32-bit, count = 2 and evicted = 8320 kB (res. 1280x800). If the resolution is changed on startup, Vista reports count = 6 and evicted = 20608 kB. So it depends somehow on the screen resolution, and probably on frame buffer allocation. But who knows…

Nevertheless, the point stays the same: if the eviction count stays constant, everything is OK. :wink:

P.S. I’ve just remembered (and it is clearly stated) that on XP/Linux the eviction information is process-specific, while on Vista/Win7 it is system-wide (determined by the OS).

Good call on the compositing window system. I always run with Aero disabled on my laptop (with GF9800M, where I don’t see the issue), but Aero is enabled on my desktop (with GF470, where I do see the issue). And disabling Aero on the desktop system makes the issue go away. Sweet.

This still doesn’t explain why my client sees this on a WinXP system with a GPU (Quadro 5800) that features a beefy 4GB RAM, but I’m going back to him to double-check.

Ultimately, it sounds like my “fix” is going to involve an offline tool that quarters the textures (or at least limits the base level size of the really big ones), then writes the dataset back out to disk for production use.

(Sorry I posted this to the drivers group; thanks for the assistance even though my initial diagnosis was wrong.)

I am not an expert on this at all, but is the “WinXP system with a GPU (Quadro 5800) that features a beefy 4GB RAM” using WinXP 32 bit?

I believe that, if that is the case, you may be running out of address space with such a large dataset. (And Win7 handles this memory paging better?)

Just a random google search:
http://www.tested.com/forums/pc-and-mac/5/does-4gb-ram-limit-in-32-bit-include-video-ram/7024/
