Resource leak in 2xx nvidia drivers

Hi,

we are experiencing a strange out-of-memory condition with nvidia drivers on Vista 64. It happened very rarely with the 1xx driver series but has increased drastically with the 2xx drivers (several times a day).

The problem:
After working with our OpenGL app for some time, all kinds of OpenGL calls suddenly fail. Most of the time binding an FBO fails with GL_FRAMEBUFFER_UNSUPPORTED, but other calls also fail with GL_OUT_OF_MEMORY. I managed to get GLexpert running and it turned out that the FBO error is also caused by an out-of-memory condition. In some cases the driver seems to delay the allocation of render buffers and textures until they are first bound to an FBO, and the GLexpert messages indicate that there is not enough memory for this, leading to GL_FRAMEBUFFER_UNSUPPORTED. The driver also falls back to software emulation from time to time.

These problems occur sooner and sooner until even starting the application doesn’t work anymore, eventually resulting in crashes in nvogl.dll. Killing dwm.exe helps a little bit and seems to free up at least some memory, but at some point only rebooting helps. We have reports from Windows XP users that even worse things happen there, resulting in bluescreens and corruption of other windows, but we are not 100% sure this is related.

“Repro case”:
It seems to happen more often in scenarios where the rendering loop is paused for a long time. For instance, our editor app displays various preview windows, and when cooking a project it launches a player app to check that the cooked project actually runs. During this time the editor render loop is paused. The same thing happens in one of our tests: a lot of resources are created and released but almost no rendering occurs. This is complete guesswork though and might be purely coincidental.

What I tried:
I have been trying to deal with this for months now. Even if our app does something wrong, this is definitely a driver issue, isn’t it? Even if we leak some resources, the driver should clean up when the application exits, right? I checked for leaking OpenGL resources in our app using gDEBugger but it didn’t find anything. The memory usage of the app itself is not increasing that much either.

Do you have any suggestions what I can do to find the source of the problem? I guess our application must be doing something different, or otherwise a lot more people would have complained about this. Is there any way to get more information from the driver to find out what’s actually leaking? Even if I were able to make a repro case, it seems it is no longer possible to submit a bug to nvidia. Some years ago I was a registered developer, but now my account seems to have been deleted and all my attempts to register again seem to be ignored.

I hope this is not the wrong forum, but I don’t know where else to post. This is really becoming a big problem for us now and I’m getting desperate. Thanks for your time.

Are you calling any glGen* functions per-frame? Yes, the driver will clean up “leaked” resources on shutdown, but if you are allocating new resources each frame you may run out of memory well before the driver gets the chance to do this.

Thank you for your reply. No, I don’t usually use glGen* calls per frame. In some places textures are generated on the fly when they are first accessed, but this does not happen very often, and the error also occurs when this lazy texture generation is disabled and everything is precached. I also don’t use very many FBOs. Actually there is only one FBO for rendering to textures. It is reused all the time because we have very many textures we render to (hundreds).

Maybe I didn’t explain it clearly. The driver leaks memory across several starts of the application. At some point when I restart our app, I cannot even create a single render target texture without getting an FBO error caused by an out-of-memory condition in the driver, and only a reboot fixes this. One of the very first things that happens in our renderer is the creation of a 32x32 RGBA dummy texture. We render a checkerboard pattern to it and use it as a dummy whenever loading a texture fails. Even this fails from time to time.

I tried to use this extension:
http://developer.download.nvidia.com/opengl/specs/GL_NVX_gpu_memory_info.txt

When the error occurs none of the numbers indicate any memory shortage.

GL_GPU_MEMORY_INFO_DEDICATED_VIDMEM_NVX=786432
GL_GPU_MEMORY_INFO_TOTAL_AVAILABLE_MEMORY_NVX=786432
GL_GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX=634308
GL_GPU_MEMORY_INFO_EVICTION_COUNT_NVX=26
GL_GPU_MEMORY_INFO_EVICTED_MEMORY_NVX=38060

This is what GLexpert prints when the error occurs:
“OGLE: Category: 0x00004000, MessageID: 0x008E0000
Basic framebuffer object information: The COLOR_ATTACHMENT0 attachment is unsupported, because it is not allocated (out of memory).”

Sometimes also this, when calling glGenFramebuffers:
OGLE: Category: 0x00000002, MessageID: 0x00810008
Software rendering has been enabled because the current framebuffer related state is not supported with the current hardware configuration: The framebuffer is not a hardware accelerated resource.

While I’m no NVidia driver guru, I use NVX_gpu_memory_info, and in my experience the EVICT numbers are always 0 unless you’ve banged your head against the limits of GPU memory at some point. I’d restart your system, verify the EVICT numbers are 0, start your app, and watch the numbers.

Interesting. Even after rebooting these numbers are never zero for me. After a lot of testing yesterday, it seems like these numbers grow after every start of our editor. I don’t even have to load any resources; it is enough to start the editor and close it. It also happens with one of our test executables, which in one test creates a second window with a second render context (the editor does not do this though). In contrast, I can run our player app all day long without the eviction count changing at all.

I’m now running a second little GLUT app that does nothing more than continuously print out these numbers, to see exactly when these counts increase. In the editor, a window handle from .NET WinForms is passed to native code, a context is created, and then these counts suddenly increase while something is going on inside .NET. But I guess it’s not easy to debug it this way because the driver probably does things asynchronously in another thread, so the moment I observe the change might be after the command that is responsible for it. Maybe it’s easier to track this in the other app where everything is more or less under my control.

Edit:
Are you running Vista/Win7 or WinXP? I read that these numbers are global for all applications on Vista/Win7 and local to the application on WinXP. I’m running Vista64, so maybe it is normal that there is always something evicted in the driver globally. If you are running WinXP, that might be the reason the eviction counter is always zero for you, as it is local to the GL context.

I’ve discovered that resizing a hidden GL window causes GL_GPU_MEMORY_INFO_EVICTION_COUNT_NVX and GL_GPU_MEMORY_INFO_EVICTED_MEMORY_NVX to increase. This happens when starting the editor because it creates the render window as a child of the main window while the main window is still hidden. I think we can work around the issue. I wrote a simple test case using GLUT that shows the problem. I’m still checking whether this is really the cause though.

Here is the test case source in case someone is interested. It continuously resizes a hidden window and prints out the evicted count, evicted memory, and the currently available video memory whenever the evicted count changes.

My spec:
Operating System: Windows Vista ™ Ultimate, 64-bit (Service Pack 2)
GPU processor: GeForce 8800 GTX
Driver version: 260.99


#pragma comment(lib, "glut32.lib")
#include "glut.h"
#include <stdio.h>

#define GL_GPU_MEMORY_INFO_DEDICATED_VIDMEM_NVX 0x9047
#define GL_GPU_MEMORY_INFO_TOTAL_AVAILABLE_MEMORY_NVX 0x9048
#define GL_GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX 0x9049
#define GL_GPU_MEMORY_INFO_EVICTION_COUNT_NVX 0x904A
#define GL_GPU_MEMORY_INFO_EVICTED_MEMORY_NVX 0x904B

void PrintVideoMemory()
{	
	static GLint evicted=0;
	GLint vidmem=0, mem_available=0, vidmem_available=0, evicted_count=0, evicted_size=0; 
	glGetIntegerv(GL_GPU_MEMORY_INFO_DEDICATED_VIDMEM_NVX, &vidmem);
	glGetIntegerv(GL_GPU_MEMORY_INFO_TOTAL_AVAILABLE_MEMORY_NVX, &mem_available);
	glGetIntegerv(GL_GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX, &vidmem_available);
	glGetIntegerv(GL_GPU_MEMORY_INFO_EVICTION_COUNT_NVX, &evicted_count);
	glGetIntegerv(GL_GPU_MEMORY_INFO_EVICTED_MEMORY_NVX, &evicted_size);

	if (evicted!=evicted_count)
	{
		printf("evicted_count=%d, evicted_size=%dkb, vidmem_available=%dkb\n",
			evicted_count, evicted_size, vidmem_available);

		evicted=evicted_count;
	}
}

void renderScene(void) 
{
	static bool flip=true;
	flip=!flip;

	glClear(GL_COLOR_BUFFER_BIT);
	glBegin(GL_TRIANGLES);	

	glVertex3f(-0.5,-0.5,0.0);
	glVertex3f(0.5,0.0,0.0);
	glVertex3f(0.0,0.5,0.0);
	glEnd();
	glutSwapBuffers();
	PrintVideoMemory();

	if (flip)
		glutReshapeWindow(300,200);	
	else
		glutReshapeWindow(300,300);	
}



int main(int argc, char* argv[])
{
	glutInit(&argc, argv);
	glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE );
	glutInitWindowPosition(100,100);
	glutInitWindowSize(100,100);

	glutCreateWindow("resource leak provoker");
	glutHideWindow();	// comment this out to make the leak disappear
	
	glutDisplayFunc(&renderScene);
	glutIdleFunc(&renderScene);
	glutMainLoop();

	return 0;
}

Hmmm - maybe check that your program actually is exiting properly and fully? Check in Task Manager that there are no instances of it still running, check your code to ensure that your context is being destroyed, and so on.

Ah! That’s probably it. Vista/Win7 has that perf-wasting Aero compositor, which uses the GPU for rendering/compositing the desktop, whereas XP doesn’t. It could very well be the OS that’s overflowing your GPU at some point with normal desktop rendering ops, giving you an eviction count > 0.

I’m running Linux with the GPU compositor turned off (so this should be closer to the XP case). My app is the only GPU-intensive thing that runs, and my eviction counts are 0 unless I get close to the limit of GPU memory.

You could try disabling Aero and rebooting to see if your results differ:

http://www.howtogeek.com/howto/windows-vista/disable-aero-on-windows-vista/

Disabling Aero seems to help. The test case does not cause the evicted count to increase anymore. But I still have to check if this also solves our original problem with various OpenGL calls failing with out of memory errors.

By the way, this is what the extension spec says about the evicted count in Vista, WinXP and Linux:
“Implementing the eviction information is OS dependent.
For XP and Linux the eviction information is specific to the current process/state since eviction is determined in the individual client. For Vista it is system wide since eviction is determined by the OS.”

OK. With Aero disabled and after rebooting, the evicted count stays at zero all the time, but our app still gets these strange random out-of-memory errors. So the evicted count doesn’t seem to have much to do with our problem. The only information NVX_gpu_memory_info can give me is that it is obviously not the video memory that runs out, as there are over 400 megs free when the error occurs. Sigh.

Google GL_OUT_OF_MEMORY for some ideas. Here are a few of the matches:

  • Not calling FBO setup rtns in the right order (link)
  • Overheating GPU due to dead fan (link)
  • Driver memory leak (link) - make sure you’re running the latest
  • Out of CPU memory (link)

Because resources such as textures that are allocated on the GPU are backed by CPU memory, I’d verify you’re not running out of CPU memory.

Also, corrupting the heap can cause the appearance of out-of-memory errors, so I wouldn’t hesitate to run a memory checker such as valgrind to make sure you don’t have some rogue writes going on in your code.

Another thought would be that you might be out of some shared bus resource such as AGP memory or something, but that was a problem years ago.

You’re running Vista 64, so it’s unlikely you’re hitting the process virtual memory address limits that plagued 32-bit apps on 32-bit OSs, but I don’t know what Windows might be doing. I’m assuming your app is 64-bit.

The driver seems to delay the allocation of render buffers and textures to the point they are first bound to an FBO in some cases

Yeah, one way you can force its hand is to render with each texture right after you create it. That’ll force the out-of-memory condition to occur much closer to the allocation that breaks the camel’s back (assuming this isn’t a memory corruption or driver leak issue).

I experience this with a 512 MB 9500 GT running under W7-32. After about six hours of stop-start, my deferred renderer fails to start up, claiming framebuffer incomplete. Playing hardware-accelerated video exacerbates the problem. When this happens, texture/VBO uploads randomly fail silently. I either have to crash the driver or reboot the machine to fix it.

Been around since about 2xx, I agree.

32-bit, eh. I’d run “top” (or whatever the Windows process-monitor equivalent is), where you can monitor both total process virtual memory and physical memory.

Back in 32-bit-land, it’s quite easy to overrun total available process “virtual” memory well before you run out of “physical” memory. Check your OS docs for what that limit is. While total VM is 4GB, your process may only get 2GB. Limits vary based on OS and OS config. For Windows, this pops up on google:

However, if your process runs out of mem, I’d expect it to be killed outright by the OS, not give GL the opportunity to return OUT_OF_MEMORY errors. So I’d suspect other things, though this is an easy thing to check for.

I do not think this is the problem. If the only fix is restarting the PC, then it must be something else. It really indicates a memory leak or memory fragmentation, probably in VRAM.

Not in Windows. The allocation functions just fail and return NULL or any other error code indicating out-of-mem error. The process is never killed because of this condition.

Thank you again for your suggestions:

  • Not calling FBO setup rtns in the right order (link)

I think the setup code for the FBO is not the problem. It has been running for some time now and basically works. But I’ll try the tool mentioned in the link, as it can monitor GPU memory.

  • Overheating GPU due to dead fan (link)

We see this on multiple PCs. It also happens on our customers’ PCs. I don’t think they all have hardware problems.

  • Driver memory leak (link) - make sure you’re running the latest

Yes, latest driver 260.99. It seems to work much better with 197.45, but even there the problem shows up from time to time.

  • Out of CPU memory (link)

Because resources such as textures that are allocated on the GPU are backed by CPU memory, I’d verify you’re not running out of CPU memory.

This is definitely not the case. When this happens there is plenty of memory left and the process is hardly using any. It sometimes even happens when the very first 32x32 render target is created. I checked with Task Manager, which also displays the peak memory usage of a process. I also checked if the parameters

Also, corrupting the heap can cause the appearance of out-of-memory errors, so I wouldn’t hesitate to run a memory checker such as valgrind to make sure you don’t have some rogue writes going on in your code.

Well, that might of course be the case. But how could a process create a resource leak in the driver that persists over multiple restarts of the process? Since I don’t have anything else left to try, I’ll check if valgrind turns up something.

Another thought would be that you might be out of some shared bus resource such as AGP memory or something, but that was a problem years ago.

Yeah, it’s not an AGP card. But I was also thinking about something like this. I remember we had a problem under FreeBSD with the kernel running out of shared memory when using too many FBOs (we used to create one FBO per rtt in the past). But even if this is the case, it would still mean the driver leaks this limited resource.

You’re running Vista 64, so it’s unlikely you’re hitting the process virtual memory address limits that plagued 32-bit apps on 32-bit OSs, but I don’t know what Windows might be doing. I’m assuming your app is 64-bit.

Nope, our app is 32-bit. But we are far from allocating that much memory. The app hardly allocates more than 512 MB, and as I said, the problem sometimes occurs right after starting the app when the memory consumption is even lower.

[quote]The driver seems to delay the allocation of render buffers and textures to the point they are first bound to an FBO in some cases

Yeah, one way you can force its hand is to render with each texture right after you create it. That’ll force the out-of-memory condition to occur much closer to the allocation that breaks the camel’s back (assuming this isn’t a memory corruption or driver leak issue). [/QUOTE]
Actually, in the case of FBOs the driver does the allocation at the time glCheckFramebufferStatus is called, which always happens when a new rtt is created. In this case there is not even a GL_OUT_OF_MEMORY raised directly; instead GL_FRAMEBUFFER_UNSUPPORTED is returned. Only after installing GLexpert and enabling driver instrumentation does the driver print some extra information about the real problem: “The COLOR_ATTACHMENT0 attachment is unsupported, because it is not allocated (out of memory).”

I googled like crazy and found others with similar problems:
http://archive.gamedev.net/community/forums/topic.asp?topic_id=589249 (I posted to this as mhenschel)
http://svn.opentk.net/node/1869?page=1

Thanks for posting this. Feels so much better to not be the only one.

On a side note: if I ignore the framebuffer incomplete, it works but in software fallback mode.

Yes, same here. I did some more testing and made all my “normal” textures 4x4 and all render targets 1x1 texels. The problem still persists.

Over the last few days I tried almost everything to make this reproducible, and finally I discovered that the error occurs every time Windows has no more memory listed as free or unused. After a fresh reboot most of the memory is free. When working with the system for some time, a large part of it is used as file cache, so the free memory decreases. Normally, when an application requests memory and there is not enough free memory left, Windows removes some memory from the file cache to fulfill the request. For some reason this does not seem to work for the nvidia driver. When the listed free memory drops to zero, the FBO error occurs. But I can open other applications without problems.

Any thoughts on that?

Edit:
It seems to be more complicated than that. Three times in a row the error occurred exactly when free memory reached 0 and 3 gigs were in the Windows file cache. But then the error started to appear completely randomly again. Maybe I should just kidnap an nvidia driver programmer and force him to debug the issue on my PC. As an alternative, I could found a website called driver-source-leaks.com so that I can at least get a clue what this “out of memory” error message actually means.