Resource leak in 2xx nvidia drivers



muhkuh
01-04-2011, 11:46 AM
Hi,

we are experiencing a strange out of memory condition with nvidia drivers on Vista 64. It happened very rarely with the 1xx driver series but drastically increased with the 2xx drivers (Several times a day).

The problem:
After working with our OpenGL app for some time suddenly all kinds of OpenGL calls fail. Most of the time binding an FBO fails with GL_FRAMEBUFFER_UNSUPPORTED but also other calls fail with GL_OUT_OF_MEMORY. I managed to get GLexpert running and it turned out that the FBO error is also caused by an out of memory condition. The driver seems to delay the allocation of render buffers and textures to the point they are first bound to an FBO in some cases and the GLexpert messages indicate that there is not enough memory for this leading to GL_FRAMEBUFFER_UNSUPPORTED. What also happens is that the driver falls back to software emulation from time to time. These problems occur sooner and sooner until even starting the application doesn't work anymore and it eventually results in crashes in nvogl.dll. It helps a little bit to kill dwm.exe. This seems to free up at least some memory but at some point only rebooting helps. We have reports from Windows XP users that even worse things happen there resulting in bluescreens and corruption of other windows but it is not 100% sure this is related.
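When logging failures like the ones described above, printing symbolic names instead of raw hex values makes the logs easier to read. The following is a minimal sketch and not code from the application discussed here: the helper names `FramebufferStatusName` and `GlErrorName` are hypothetical, and the numeric values are the standard OpenGL enum values, written out as literals so that the snippet compiles without any GL headers.

```cpp
#include <cassert>   // assert() for quick checks by the caller
#include <string>

// Hypothetical helper: symbolic name for a glCheckFramebufferStatus() result.
// The literals are the standard values of the corresponding GL enums.
inline std::string FramebufferStatusName(unsigned status) {
    switch (status) {
        case 0x8CD5: return "GL_FRAMEBUFFER_COMPLETE";
        case 0x8CD6: return "GL_FRAMEBUFFER_INCOMPLETE_ATTACHMENT";
        case 0x8CD7: return "GL_FRAMEBUFFER_INCOMPLETE_MISSING_ATTACHMENT";
        case 0x8CDB: return "GL_FRAMEBUFFER_INCOMPLETE_DRAW_BUFFER";
        case 0x8CDC: return "GL_FRAMEBUFFER_INCOMPLETE_READ_BUFFER";
        case 0x8CDD: return "GL_FRAMEBUFFER_UNSUPPORTED";
        default:     return "unknown framebuffer status";
    }
}

// Hypothetical helper: symbolic name for a glGetError() result.
inline std::string GlErrorName(unsigned error) {
    switch (error) {
        case 0x0000: return "GL_NO_ERROR";
        case 0x0500: return "GL_INVALID_ENUM";
        case 0x0501: return "GL_INVALID_VALUE";
        case 0x0502: return "GL_INVALID_OPERATION";
        case 0x0503: return "GL_STACK_OVERFLOW";
        case 0x0504: return "GL_STACK_UNDERFLOW";
        case 0x0505: return "GL_OUT_OF_MEMORY";
        case 0x0506: return "GL_INVALID_FRAMEBUFFER_OPERATION";
        default:     return "unknown GL error";
    }
}
```

In a real renderer the arguments would come from `glCheckFramebufferStatus(GL_FRAMEBUFFER)` and `glGetError()`, called right after each FBO bind.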


"Repro case":
It seems to happen more often in certain scenarios where the rendering loop is paused for a long time. For instance, our editor app displays various preview windows, and when cooking a project it launches a player app to check whether the cooked project actually runs. During this time the editor render loop is paused. The same thing happens in one of our tests: a lot of resources are created and released but almost no rendering occurs. This is complete guesswork though and might be entirely coincidental.


What I tried:
I have been trying to deal with this for months now. Even if our app does something wrong this is definitely a driver issue, isn't it? Even if we leak some resources the driver should do cleanup when the application exits, right? I checked for leaking OpenGL resources in our app using gDebugger but it didn't find anything. The memory of the app itself is not increasing that much, either.


Do you have any suggestions what I can do to find the source of the problem? I guess our application must do something different, or otherwise a lot more people would have complained about this. Is there any way to get more information from the driver to know what's actually leaking? Even if I was able to make a repro case it seems like it is not possible to submit a bug to nvidia anymore. Some years ago I was a registered developer but now my account seems to be deleted and all my attempts to register again seem to be ignored.

I hope this is not the wrong forum but I don't know where else to post. This is really becoming a big problem for us now and I'm getting desperate. Thanks for your time.

mhagain
01-05-2011, 02:13 AM
Are you calling any glGen* functions per-frame? Yes, the driver will clean up "leaked" resources on shutdown, but if you are allocating new resources each frame you may run out of memory well before the driver gets the chance to do this.

muhkuh
01-05-2011, 03:00 AM
Thank you for your reply. No. I don't usually use glGen* calls per frame. At some places textures are generated on the fly when they are accessed first but this is not happening very often and the error also happens when this lazy texture generation is disabled and everything is precached. I also do not use very many FBOs. Actually there is only one FBO for rendering to textures. It is reused all the time because we have very many textures we render to (hundreds).

Maybe I didn't explain it clearly. The driver leaks memory over several starts of the application. At some point when I restart our app I cannot even create a single render target texture without getting an FBO error caused by an out of memory condition in the driver, and only a reboot fixes this. One of the very first things that happens in our renderer is the creation of a 32x32 RGBA dummy texture. We render a checkerboard pattern to it and use it as a dummy texture whenever loading a texture fails. Even this doesn't work from time to time.

muhkuh
01-05-2011, 05:14 AM
I tried to use this extension:
http://developer.download.nvidia.com/opengl/specs/GL_NVX_gpu_memory_info.txt

When the error occurs none of the numbers indicate any memory shortage.

GL_GPU_MEMORY_INFO_DEDICATED_VIDMEM_NVX=786432
GL_GPU_MEMORY_INFO_TOTAL_AVAILABLE_MEMORY_NVX=786432
GL_GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX=634308
GL_GPU_MEMORY_INFO_EVICTION_COUNT_NVX=26
GL_GPU_MEMORY_INFO_EVICTED_MEMORY_NVX=38060
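
The extension reports its memory sizes in KB, so the values above correspond to 768 MB of dedicated video memory with roughly 619 MB still available. As a rough sketch of how such a snapshot could be interpreted (the struct `GpuMemoryInfo` and the helper functions are hypothetical names, not part of the extension or of the app discussed here):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for one GL_NVX_gpu_memory_info snapshot.
// All *_kb fields are in KB, as reported by the extension.
struct GpuMemoryInfo {
    std::int64_t dedicated_kb;          // ..._DEDICATED_VIDMEM_NVX
    std::int64_t total_available_kb;    // ..._TOTAL_AVAILABLE_MEMORY_NVX
    std::int64_t current_available_kb;  // ..._CURRENT_AVAILABLE_VIDMEM_NVX
    std::int64_t eviction_count;        // ..._EVICTION_COUNT_NVX
    std::int64_t evicted_kb;            // ..._EVICTED_MEMORY_NVX
};

// Converts a KB value to MB.
inline double KbToMb(std::int64_t kb) { return kb / 1024.0; }

// Memory currently in use, in KB.
inline std::int64_t UsedKb(const GpuMemoryInfo& m) {
    return m.total_available_kb - m.current_available_kb;
}

// Fraction of the total that is still free (0.0 to 1.0).
inline double FreeFraction(const GpuMemoryInfo& m) {
    assert(m.total_available_kb > 0);
    return static_cast<double>(m.current_available_kb) /
           static_cast<double>(m.total_available_kb);
}
```

With the snapshot above, about 80% of the reported memory is free, which matches the observation that the counters show no shortage at the moment the error occurs.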

This is what GLExpert prints when the error occurs:
"OGLE: Category: 0x00004000, MessageID: 0x008E0000
Basic framebuffer object information: The COLOR_ATTACHMENT0 attachment is unsupported, because it is not allocated (out of memory)."

Sometimes I also get this when calling glGenFramebuffers:
OGLE: Category: 0x00000002, MessageID: 0x00810008
Software rendering has been enabled because the current framebuffer related state is not supported with the current hardware configuration: The framebuffer is not a hardware accelerated resource.

Dark Photon
01-05-2011, 08:00 PM
I tried to use this extension:
http://developer.download.nvidia.com/opengl/specs/GL_NVX_gpu_memory_info.txt

When the error occurs none of the numbers indicate any memory shortage.

GL_GPU_MEMORY_INFO_DEDICATED_VIDMEM_NVX=786432
GL_GPU_MEMORY_INFO_TOTAL_AVAILABLE_MEMORY_NVX=786432
GL_GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX=634308
GL_GPU_MEMORY_INFO_EVICTION_COUNT_NVX=26
GL_GPU_MEMORY_INFO_EVICTED_MEMORY_NVX=38060

While I'm no NVidia driver guru, I use NVX_gpu_memory_info, and in my experience, the EVICT numbers are always 0, unless you've banged your head up against the limits of GPU memory at some point. I'd restart your system, verify the EVICT numbers are 0, start your app, and watch the numbers.

muhkuh
01-06-2011, 03:15 AM
Interesting. Even after rebooting these numbers are not zero for me. Never. After testing a lot yesterday it seems like these numbers grow after every start of our editor. I don't even have to load any resources. It is enough to start the editor and close it. It also happens with one of our test executables which in one test creates a second window with a second render context (the editor does not do this though). In contrast to that I can run our player app all day long without changing eviction count at all.

I'm now running a second little glut app that does no more than continuously print out these numbers to see when exactly these counts increase. In the editor a window handle from .NET Winforms is passed to native code, a context is created and then these counts suddenly increase while something is going on inside .NET. But I guess it's not easily possible to debug it this way because the driver will probably do things asynchronously in another thread, so the time I observe the change might be after the command that is responsible for it. Maybe it's easier to track this in the other app where everything is more or less under my control.

Edit:
Are you running Vista/Win7 or WinXP? I read that these numbers are global for all applications on Vista/Win7 and local to the application on WinXP. I'm running Vista64, so maybe it is normal that there is always something evicted in the driver globally. If you are running WinXP this might be the reason the eviction counter is always zero for you, as it is local to the GL context.

muhkuh
01-06-2011, 08:47 AM
I've discovered that resizing a hidden gl window causes GL_GPU_MEMORY_INFO_EVICTION_COUNT_NVX and GL_GPU_MEMORY_INFO_EVICTED_MEMORY_NVX to increase. This happens when starting the editor because it creates the render window as a child of the main window while the main window is still hidden. We can work around the issue I think. I wrote a simple test case using glut that shows the issue. I'm still checking if this is really the problem though.

Here is the test case source in case someone is interested. It continuously resizes a hidden window and prints out the evicted count, the evicted memory and the currently available video memory whenever the evicted count changes.


My spec:
Operating System: Windows Vista (TM) Ultimate, 64-bit (Service Pack 2)
GPU processor: GeForce 8800 GTX
Driver version: 260.99



#pragma comment(lib, "glut32.lib")
#include "glut.h"
#include <stdio.h>

// Tokens from GL_NVX_gpu_memory_info (memory sizes are reported in KB)
#define GL_GPU_MEMORY_INFO_DEDICATED_VIDMEM_NVX 0x9047
#define GL_GPU_MEMORY_INFO_TOTAL_AVAILABLE_MEMORY_NVX 0x9048
#define GL_GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX 0x9049
#define GL_GPU_MEMORY_INFO_EVICTION_COUNT_NVX 0x904A
#define GL_GPU_MEMORY_INFO_EVICTED_MEMORY_NVX 0x904B

// Queries the memory counters and prints them whenever the eviction count changes
void PrintVideoMemory()
{
    static GLint evicted=0; // eviction count seen at the last print
    GLint vidmem=0, mem_available=0, vidmem_available=0, evicted_count=0, evicted_size=0;
    glGetIntegerv(GL_GPU_MEMORY_INFO_DEDICATED_VIDMEM_NVX, &vidmem);
    glGetIntegerv(GL_GPU_MEMORY_INFO_TOTAL_AVAILABLE_MEMORY_NVX, &mem_available);
    glGetIntegerv(GL_GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX, &vidmem_available);
    glGetIntegerv(GL_GPU_MEMORY_INFO_EVICTION_COUNT_NVX, &evicted_count);
    glGetIntegerv(GL_GPU_MEMORY_INFO_EVICTED_MEMORY_NVX, &evicted_size);

    if (evicted!=evicted_count)
    {
        printf("evicted_count=%d, evicted_size=%dkb, vidmem_available=%dkb\n",
            evicted_count, evicted_size, vidmem_available);

        evicted=evicted_count;
    }
}

// Draws one triangle, then toggles the window size to force a resize every frame
void renderScene(void)
{
    static bool flip=true;
    flip=!flip;

    glClear(GL_COLOR_BUFFER_BIT);
    glBegin(GL_TRIANGLES);
    glVertex3f(-0.5,-0.5,0.0);
    glVertex3f(0.5,0.0,0.0);
    glVertex3f(0.0,0.5,0.0);
    glEnd();
    glutSwapBuffers();
    PrintVideoMemory();

    if (flip)
        glutReshapeWindow(300,200);
    else
        glutReshapeWindow(300,300);
}

int main(int argc, char* argv[])
{
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE);
    glutInitWindowPosition(100,100);
    glutInitWindowSize(100,100);

    glutCreateWindow("resource leak provoker");
    glutHideWindow(); // comment this out to make the leak disappear

    glutDisplayFunc(&renderScene);
    glutIdleFunc(&renderScene); // render continuously, even without window events
    glutMainLoop();

    return 0;
}

mhagain
01-06-2011, 11:16 AM
Hmmm - maybe check that your program actually is exiting properly and fully? Check in Task Manager that there are no instances of it still running, check your code to ensure that your context is being destroyed, and so on.

Dark Photon
01-06-2011, 02:58 PM
Edit:
Are you running Vista/Win7 or WinXP? I read that these numbers are global for all applications on Vista/Win7 and local to the application on WinXP. I'm running Vista64, so maybe it is normal that there is always something evicted in the driver globally. If you are running WinXP this might be the reason the eviction counter is always zero for you, as it is local to the GL context.
Ah! That's probably it. Vista/Win7 has that perf-wasting Aero compositor which uses the GPU for rendering/compositing the desktop. Whereas XP doesn't. Very well could be that it's the OS that's overflowing your GPU at some point with normal desktop rendering ops, giving you an eviction count > 0.

I'm running Linux, and with the perf-eating GPU compositor turned off (so this should be more like the XP case). So my app is the only thing GPU intensive that runs, and my eviction counts are 0 unless I get up close to the limit of GPU memory.

You could try after disabling Aero and rebooting to see if your results differ:

http://www.howtogeek.com/howto/windows-vista/disable-aero-on-windows-vista/

muhkuh
01-07-2011, 02:59 AM
Disabling Aero seems to help. The test case does not cause the evicted count to increase anymore. But I still have to check if this also solves our original problem with various OpenGL calls failing with out of memory errors.

By the way, this is what the extension spec says about the evicted count in Vista, WinXP and Linux:
"Implementing the eviction information is OS dependent.
For XP and Linux the eviction information is specific to the current process/state since eviction is determined in the individual client. For Vista it is system wide since eviction is determined by the OS."
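
Because the counters are system wide on Vista, they can already be nonzero when an application starts, so only changes relative to a baseline say anything about a particular run. A minimal sketch of such bookkeeping follows; `EvictionSample` and `EvictionTracker` are hypothetical names, and the samples are assumed to come from glGetIntegerv queries of the two eviction tokens.

```cpp
#include <cassert>

// One reading of the two eviction counters (evicted_kb is in KB).
struct EvictionSample {
    long long count;
    long long evicted_kb;
};

// Hypothetical helper that tracks eviction activity relative to the
// first sample, so a nonzero system-wide starting value is ignored.
class EvictionTracker {
public:
    explicit EvictionTracker(EvictionSample baseline)
        : baseline_(baseline), last_(baseline) {}

    // Records a new sample; returns true if the eviction count changed
    // since the previous sample.
    bool Update(EvictionSample s) {
        const bool changed = (s.count != last_.count);
        last_ = s;
        return changed;
    }

    // Evictions observed since the baseline was taken.
    long long CountSinceBaseline() const { return last_.count - baseline_.count; }

    // KB evicted since the baseline was taken.
    long long KbSinceBaseline() const { return last_.evicted_kb - baseline_.evicted_kb; }

private:
    EvictionSample baseline_;
    EvictionSample last_;
};
```

Note that on Vista a change may still be caused by any other process, so a delta only narrows down when something happened, not who caused it.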

muhkuh
01-07-2011, 04:27 AM
Ok. With Aero disabled and after rebooting the evicted count stays at zero all the time, but our app still gets these strange random out of memory errors. So the evicted count doesn't seem to have much to do with our problem. The only information NVX_gpu_memory_info can give me is that it is obviously not the video memory that runs out, as there are over 400 Megs free when the error occurs. *Sigh*

Dark Photon
01-07-2011, 07:03 PM
Google GL_OUT_OF_MEMORY for some ideas. Here are a few of the matches:

* Not calling FBO setup rtns in the right order (link (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=260232))
* Overheating GPU due to dead fan (link (http://www.nvnews.net/vbulletin/showthread.php?t=49406))
* Driver memory leak (link (http://www.gamedev.net/community/forums/topic.asp?topic_id=588249)) - make sure you're running the latest
* Out of CPU memory (link (http://mailman.coin3d.org/pipermail/coin-discuss/2002-June/010255.html))
* ...

Because resources such as textures that are allocated on the GPU are backed by CPU memory, I'd verify you're not running out of CPU memory.

Also, corrupting the heap can cause the appearance of out-of-memory errors, so I wouldn't hesitate to run a memory checker such as valgrind to make sure you don't have some rogue writes going on in your code.

Another thought would be that you might be out of some shared bus resource such as AGP memory or something, but that was a problem years ago.

You're running Vista 64, so it's unlikely you're hitting the process virtual memory address limits that plagued 32-bit apps on 32-bit OSs, but I don't know what Windows might be doing. I'm assuming your app is 64-bit.


The driver seems to delay the allocation of render buffers and textures to the point they are first bound to an FBO in some cases
Yeah, one way you can force its hand is to render with each texture right after you create it. That'll force the out-of-memory condition to occur much closer to the allocation that breaks the camel's back (assuming this isn't a memory corruption or driver leak issue).

NeXEkho
01-08-2011, 05:11 PM
I experience this with a 512mb 9500GT running under W7-32. After about six hours of stop-start, my deferred renderer fails to start up claiming framebuffer incomplete. Playing hardware accelerated video exacerbates the problem. When this happens, texture/VBO uploads randomly fail silently. I either have to crash the driver or reboot the machine to fix it.

Been around since about 2xx, I agree.

Dark Photon
01-09-2011, 11:32 AM
I experience this with a 512mb 9500GT running under W7-32.
32-bit, eh. I'd run "top" (or whatever the windows process monitor equivalent is) where you can monitor both total process virtual memory and physical memory.

Back in 32-bit-land, it's quite easy to overrun total available process "virtual" memory well before you run out of "physical" memory. Check your OS docs for what that limit is. While total VM is 4GB, your process may only get 2GB. Limits vary based on OS and OS config. For Windows, this pops up on google:

http://msdn.microsoft.com/en-us/library/bb613473%28v=vs.85%29.aspx

However, if your process runs out of mem, I'd expect it to be killed outright by the OS, not give GL the opportunity to return OUT_OF_MEMORY errors. So I'd suspect other things, though this is an easy thing to check for.
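
As a rough illustration of the limits described on that page, the user-mode virtual address space of a 32-bit process on Windows can be summarized as below. This is a sketch only: the enum and function names are hypothetical, and the figures are the documented defaults (2 GB, 3 GB with 4-gigabyte tuning and a large-address-aware executable, 4 GB for a large-address-aware 32-bit process on 64-bit Windows).

```cpp
#include <cassert>
#include <cstdint>

// Kind of Windows installation the 32-bit process runs on (hypothetical enum).
enum class HostOs {
    Windows32,        // 32-bit Windows, default boot options
    Windows32With4GT, // 32-bit Windows booted with 4-gigabyte tuning (/3GB)
    Windows64         // 64-bit Windows, process runs under WOW64
};

const std::uint64_t kGiB = 1ull << 30;

// User-mode virtual address space available to a 32-bit process, in bytes.
// large_address_aware corresponds to the IMAGE_FILE_LARGE_ADDRESS_AWARE flag
// of the executable (linker option /LARGEADDRESSAWARE).
inline std::uint64_t UserAddressSpaceBytes32(HostOs os, bool large_address_aware) {
    if (!large_address_aware) {
        return 2 * kGiB;  // default limit on every host
    }
    switch (os) {
        case HostOs::Windows32With4GT: return 3 * kGiB;
        case HostOs::Windows64:        return 4 * kGiB;
        default:                       return 2 * kGiB;  // flag alone has no effect
    }
}
```

These are address space limits, not allocation guarantees: fragmentation and mapped DLLs reduce what a process can actually obtain in one piece.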

mfort
01-09-2011, 12:55 PM
32-bit, eh. I'd run "top" (or whatever the windows process monitor equivalent is) where you can monitor both total process virtual memory and physical memory.

I do not think this is the problem. If the only thing that helps is restarting the PC then it must be something else. It really indicates that the problem is a memory leak or memory fragmentation, probably in VRAM.



However, if your process runs out of mem, I'd expect it to be killed outright by the OS, not give GL the opportunity to return OUT_OF_MEMORY errors.

Not in Windows. The allocation functions just fail and return NULL or any other error code indicating out-of-mem error. The process is never killed because of this condition.

muhkuh
01-09-2011, 05:36 PM
Thank you again for your suggestions:



* Not calling FBO setup rtns in the right order (link (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=260232))

I think the setup code for the FBO is not the problem. It has been in use for some time now and basically works. But I'll try the tool mentioned in the link as it can monitor GPU memory.



* Overheating GPU due to dead fan (link (http://www.nvnews.net/vbulletin/showthread.php?t=49406))

We have this on multiple PCs. It also happens on PCs of our customers. I don't think they all have hardware problems.



* Driver memory leak (link (http://www.gamedev.net/community/forums/topic.asp?topic_id=588249)) - make sure you're running the latest

Yes. Latest driver 260.99. It seems to work much better with 197.45 but even there the problem shows up from time to time.




* Out of CPU memory (link (http://mailman.coin3d.org/pipermail/coin-discuss/2002-June/010255.html))


Because resources such as textures that are allocated on the GPU are backed by CPU memory, I'd verify you're not running out of CPU memory.

This is definitely not the case. When this happens there is plenty of memory left and the process is hardly using any. It sometimes even happens when the very first 32x32 render target is created. I checked with task manager which also displays the max memory usage of a process. I also checked if the parameters



Also, corrupting the heap can cause the appearance of out-of-memory errors, so I wouldn't hesitate to run a memory checker such as valgrind to make sure you don't have some rogue writes going on in your code.


Well, that might of course be the case. But how can a process be able to create a resource leak in the driver this way that persists over multiple restarts of the process? But as I don't have any other things left to try I'll check if valgrind turns up something.



Another thought would be that you might be out of some shared bus resource such as AGP memory or something, but that was a problem years ago.

Yeah. It's not an AGP card. But I was also thinking about something like this. I remember that we had a problem with the kernel under FreeBSD running out of shared memory when using too many FBOs (we used to create one FBO per rtt in the past). But even when this is the case it would still mean the driver leaks this limited resource.



You're running Vista 64, so it's unlikely you're hitting the process virtual memory address limits that plagued 32-bit apps on 32-bit OSs, but I don't know what Windows might be doing. I'm assuming your app is 64-bit.

Nope. Our app is 32bit. But we are far from allocating so much memory. The app hardly allocates more than 512 MB and as I said the problem sometimes occurs right after starting the app when the memory consumption is even lower.





The driver seems to delay the allocation of render buffers and textures to the point they are first bound to an FBO in some cases
Yeah, one way you can force its hand is to render with each texture right after you create it. That'll force the out-of-memory condition to occur much closer to the allocation that breaks the camel's back (assuming this isn't a memory corruption or driver leak issue).
Actually in the case of FBOs the driver does the allocation at the time glCheckFramebufferStatus is called, which always happens when a new rtt is created. In this case there is not even a GL_OUT_OF_MEMORY happening directly. Instead GL_FRAMEBUFFER_UNSUPPORTED is returned. Only when GLExpert is installed and driver instrumentation is enabled does the driver print some extra information about the real problem: "The COLOR_ATTACHMENT0 attachment is unsupported, because it is not allocated (out of memory)."

I googled like crazy and found others with similar problems:
http://archive.gamedev.net/community/forums/topic.asp?topic_id=589249 (I posted to this as mhenschel)
http://svn.opentk.net/node/1869?page=1

muhkuh
01-09-2011, 05:45 PM
I experience this with a 512mb 9500GT running under W7-32. After about six hours of stop-start, my deferred renderer fails to start up claiming framebuffer incomplete. Playing hardware accelerated video exacerbates the problem. When this happens, texture/VBO uploads randomly fail silently. I either have to crash the driver or reboot the machine to fix it.

Been around since about 2xx, I agree.
Thanks for posting this. Feels so much better to not be the only one.

NeXEkho
01-10-2011, 05:15 AM
On a side note: if I ignore the framebuffer incomplete, it works but in software fallback mode.

muhkuh
01-10-2011, 05:17 AM
On a side note: if I ignore the framebuffer incomplete, it works but in software fallback mode.

Yes. The same here. I did some more testing and made all my "normal" textures 4x4 and all render targets 1x1 texels wide. The problem still persists.

muhkuh
01-13-2011, 10:06 AM
Over the last few days I tried almost everything to make this reproducible, and finally I discovered that the error occurs every time Windows has no more memory listed as being free or unused. After a fresh reboot most of the memory is free. When working with the system for some time a large part of it is used as file cache, so the free memory decreases. Normally when an application requests memory and there is not enough free memory left, Windows removes some memory from the file cache to fulfill the request. It seems like this is not working for the nvidia driver for some reason. When the listed free memory drops to zero the FBO error occurs. But I can open other applications without problems.

Any thoughts on that?

Edit:
It seems to be more complicated than that. 3 times in a row the error occurred exactly when free memory reached 0 and 3 Gigs were in the Windows file cache. But then the error started to appear completely at random again. Maybe I should just try to kidnap an nvidia driver programmer and force him to debug the issue on my PC. As an alternative I could found a website called driver-source-leaks.com so that I can at least get a clue what this "out of memory" error message actually means.

k.sinitsyn
02-24-2011, 07:33 AM
Hi muhkuh. I have the same problem in a visualization system that we are developing. It has been going on for the last half year. My colleagues and I have done many tests, but we still do not know much.

I can state the following facts:
1. the problem occurs in Vista/Win7 x86/x64
2. it occurs with all WHQL drivers
3. it occurs when almost all memory has been "cached" by the OS
4. even a test program with little memory consumption shows the same behavior

It's strange: when you run the test program, the first few copies start without problems and each takes the same amount of system memory (for example 22 MB). Beginning with some launch, a copy takes more memory (32 MB), and after that each further copy takes even more system memory (37-40 MB) and runs very slowly.

All this looks as if the NVIDIA driver and the new Windows memory manager, which takes all free system memory for caching program modules, do not get along with each other.

Have you made any progress with this problem?

muhkuh
02-28-2011, 02:10 PM
I have sent a bug report to NVidia but didn't receive a reply yet.

NeXEkho
03-24-2011, 07:45 AM
I seem to be having no problems with this since moving to 7x64 and upgrading to 4Gb of DDR3.

prideout
04-01-2011, 08:00 AM
We had a very reproducible test case for this issue and we worked with NVIDIA engineers to resolve it. It is fixed in the 270.51 beta driver that was released this week.

Jenus
04-28-2011, 09:33 AM
I still got problems with the 270.61 WHQL driver on my Geforce GTX 590.
I have also tried the older 270.51 beta but it has the same problem.
Every OpenGL application stutters like crazy and one of my applications (Eyeon Fusion) crashes when I quit the program with this message: "Error in nvoglv64.DLL at 0x7eccf4".
Any idea what I can do to fix it?

horus711
05-02-2011, 08:22 PM
I still got problems with the 270.61 WHQL driver on my Geforce GTX 590.
I have also tried the older 270.51 beta but it has the same problem.
Every OpenGL application stutters like crazy and one of my applications (Eyeon Fusion) crashes when I quit the program with this message: "Error in nvoglv64.DLL at 0x7eccf4".
Any idea what I can do to fix it?

I had the same issue and just fixed it. The solution is described here: http://www.vfxpedia.com/index.php?title=Crash/Fusion/6.1.4.760/nvoglv64.dll/8.17.12.7061/0x7eccf4

muhkuh
10-10-2011, 02:18 AM
I still have the very same issues with newer drivers. Sporadic FBO errors for no obvious reason that can be cured only by a reboot.