
VBOs have 2x the memory footprint of DLs



tranders
02-07-2008, 07:39 PM
So I started looking into using VBOs and discovered that they have twice the memory footprint of equivalent display lists. When I say "memory footprint", I'm talking about an explicit reduction in the total available virtual address space for the process. My test was very simple, but realistic nonetheless:

1. Observe virtual address usage ~40MB
2. glGenLists(32768)
3. fill each list with 1024 3D GL_POINTS
4. Observe virtual address usage ~450MB
5. glGenBuffers(32768, aVBO)
6. bind and fill each buffer with 1024 x 3 GLfloats
7. Observe virtual address usage ~1.2GB

The DL usage makes sense. The VBO usage is ~2x what it should be. Has anyone else seen this problem?

Quadro FX 3450 256MB, current driver, Vista OS
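
For reference, a minimal sketch of the kind of test described above (illustrative names and constants only; assumes a current GL context with the buffer-object entry points available, error checking omitted):

#define NUM_OBJECTS 32768
#define NUM_POINTS  1024

static GLfloat points[NUM_POINTS * 3] = {0};   // one shared, static vertex array

void buildDisplayLists()
{
    GLuint base = glGenLists( NUM_OBJECTS );
    for (int i = 0; i < NUM_OBJECTS; i++)
    {
        glNewList( base + i, GL_COMPILE );
        glBegin( GL_POINTS );
        for (int v = 0; v < NUM_POINTS; v++)
            glVertex3fv( &points[v * 3] );
        glEnd();
        glEndList();
    }
    // observe virtual address usage here (step 4)
}

void buildVBOs()
{
    static GLuint vbos[NUM_OBJECTS];
    glGenBuffers( NUM_OBJECTS, vbos );
    for (int i = 0; i < NUM_OBJECTS; i++)
    {
        glBindBuffer( GL_ARRAY_BUFFER_ARB, vbos[i] );
        glBufferData( GL_ARRAY_BUFFER_ARB, NUM_POINTS * 3 * sizeof(GLfloat),
                      points, GL_STATIC_DRAW_ARB );
    }
    // observe virtual address usage here (step 7)
}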

Korval
02-07-2008, 08:38 PM
Did you actually delete the display lists?

zeoverlord
02-08-2008, 04:57 AM
Well, if you have 32768 buffers I can understand why this is; however, VBOs are not used this way. It could be that VBOs are more top-heavy when it comes to low polycounts, though it doesn't matter since you often put several objects in a single VBO.
Secondly, I seem to remember that the driver keeps a backup copy in RAM of everything you put in VRAM (textures and VBOs), so it could be, as Korval said, that you just didn't delete something.

k_szczech
02-08-2008, 05:44 AM
I think this is what happens:

1. myVertices = new float[3 * vertexcount]; //memory from address A to B is being allocated by application
2. send vertex data to driver //memory from address B to C is being allocated by driver
3. delete[] myVertices; //memory from address A to B is released

Outcome: the upper bound of the virtual address space equals C, but only the second half of that address space is actually in use now.

tranders
02-08-2008, 06:32 AM
Did you actually delete the display lists?
It doesn't matter. If I skip creating the DLs, the VBOs still take up twice as much memory.

tranders
02-08-2008, 06:39 AM
I think this is what happens:

1. myVertices = new float[3 * vertexcount]; //memory from address A to B is being allocated by application
2. send vertex data to driver //memory from address B to C is being allocated by driver
3. delete[] myVertices; //memory from address A to B is released

Outcome: the upper bound of the virtual address space equals C, but only the second half of that address space is actually in use now.

There is no allocation of vertices in my test application. I'm simply copying a single static buffer into each VBO. The increase in memory is directly a result of that copy.

knackered
02-08-2008, 06:45 AM
is the usage parameter set to static?

tranders
02-08-2008, 06:48 AM
Well, if you have 32768 buffers I can understand why this is; however, VBOs are not used this way. It could be that VBOs are more top-heavy when it comes to low polycounts, though it doesn't matter since you often put several objects in a single VBO.
Secondly, I seem to remember that the driver keeps a backup copy in RAM of everything you put in VRAM (textures and VBOs), so it could be, as Korval said, that you just didn't delete something.
My application has thousands of complex components, and yes, I am indeed looking at per-VBO overhead vs. per-DL overhead because that was mentioned in another thread. Regardless, I would not expect a single VBO to have 12K of overhead (i.e., overhead equal to the data itself: 1024 x 3 x sizeof(GLfloat)).

While there is a user space backup of both VBO data and DL data, I'm seeing what would amount to two copies of the VBO data.

Hampel
02-08-2008, 07:45 AM
Did you try to swap steps (2,3) with (5,6) and then measure again?

knackered
02-08-2008, 09:37 AM
Right, I've just tested this myself by switching my renderer to dlist mode. I can confirm that the VBOs are now taking double the memory of dlists. This didn't use to be the case - I used to get less memory usage with static VBOs.
ForceWare version: 162.65
I'm not happy. This is going to kill us if it's not fixed pronto.
(Edit) BTW, the scene consists of 287 batches, 50,000 triangles in total, spread over a single 2MB STATIC_DRAW VBO and a single 4MB STATIC_DRAW IBO.

tranders
02-08-2008, 10:05 AM
I started looking at larger buffer sizes and decided to skip display lists and focus only on VBOs. After running through several iterations of various buffer sizes, I found that if you create VBOs that are less than 65536 bytes in size (i.e., 65535 bytes or smaller), your process will take a 2x hit in committed memory relative to the actual size of the data. If your buffers are greater than or equal to 65536 bytes, the committed memory matches expectations. For 3-float vertices (12 bytes each), the cutoff works out to 5461 vertices: if your buffers have 5462 vertices or more you won't get penalized, anything less and you are. For example:

Given:


#define MAX_BUFFERS 4096

char p[65536] = {0};


This block of code:


GLuint ab[MAX_BUFFERS];
glGenBuffers( MAX_BUFFERS, ab );
for (int i = 0; i < MAX_BUFFERS; i++)
{
    glBindBuffer( GL_ARRAY_BUFFER_ARB, ab[i] );
    glBufferData( GL_ARRAY_BUFFER_ARB, 65535, p, GL_STATIC_DRAW_ARB );   // just under 64K
}

results in twice as much committed memory as:


GLuint bb[MAX_BUFFERS];
glGenBuffers( MAX_BUFFERS, bb );
for (int i = 0; i < MAX_BUFFERS; i++)
{
    glBindBuffer( GL_ARRAY_BUFFER_ARB, bb[i] );
    glBufferData( GL_ARRAY_BUFFER_ARB, 65536, p, GL_STATIC_DRAW_ARB );   // exactly 64K
}

The only difference is the size of the buffer being created (65535 vs. 65536). I measured the memory consumption inline with the execution of the code, and it is precise and repeatable. Also note that it's not simply that each VBO is rounded up to 64K: creating the same number of VBOs with a smaller size still results in approximately twice as much committed memory as the combined size of the buffers. FWIW, the per-VBO overhead appears to be approximately 700 bytes.

I haven't tested this on ATI systems yet.
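
For what it's worth, one way to take that kind of inline measurement on Windows (a sketch using the PSAPI commit-charge counter; not necessarily the method used above):

#include <windows.h>
#include <psapi.h>    // link with psapi.lib
#include <stdio.h>

// Print the process commit charge, i.e. the "committed memory" referred to above.
void printCommittedMB( const char *label )
{
    PROCESS_MEMORY_COUNTERS pmc = {0};
    pmc.cb = sizeof(pmc);
    if (GetProcessMemoryInfo( GetCurrentProcess(), &pmc, sizeof(pmc) ))
        printf( "%s: %.1f MB committed\n", label, pmc.PagefileUsage / (1024.0 * 1024.0) );
}

Calling this immediately before and after each glBufferData loop gives the per-step deltas.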

tranders
02-08-2008, 10:10 AM
Right, I've just tested this myself by switching my renderer to dlist mode. I can confirm that the VBOs are now taking double the memory of dlists. This didn't use to be the case - I used to get less memory usage with static VBOs.
Thanks for confirming this.

tranders
02-08-2008, 10:24 AM
ForceWare version: 162.65
I'm not happy. This is going to kill us if it's not fixed pronto.
(Edit) BTW, the scene consists of 287 batches, 50,000 triangles in total, spread over a single 2MB STATIC_DRAW VBO and a single 4MB STATIC_DRAW IBO.
There is a definite jump when VBOs are less than 64K and I'm also seeing a rise in memory consumption when the VBOs are greater than 512K and less than 3MB. Some of this can be expected over time but not for a clean initialization. I haven't run any numbers to find the exact breakpoints or look for any subsequent anomalies but your 2MB buffer would probably be affected.

-NiCo-
02-08-2008, 10:35 AM
I found that if you create VBOs that are less than 65536 bytes in size (i.e., 65535 bytes or smaller), your process will take a 2x hit in committed memory relative to the actual size of the data.

This reminds me of the performance gap between 16-bit addressable and 32-bit addressable VBOs on NVIDIA cards, but I guess I'm just stating the obvious here :)

N.

-NiCo-
02-08-2008, 11:07 AM
Maybe this (http://developer.nvidia.com/attach/6427) helps...



glBufferDataARB()

This function is an abstraction layer between the memory and the application. But behind each buffer object is a complex memory management system. Basically, the function does the following:

-Checks whether the size or usage type of the data store has changed.

-If the size is zero, frees the memory attached to this buffer object.

-If the size and storage type didn’t change and if this buffer isn’t used by the GPU, we’ll use it. Everything is already set up for use.

-On the other hand, if the GPU is using it or is about to use it, or if the storage type changed, we’ll have to request another chunk of memory for this buffer to be ready.

-If the data pointer isn’t NULL, we’ll copy the data into this new memory area.

We can see that the memory we had before a second call to BufferDataARB isn’t necessarily the same exact memory we had afterward. However, it’s still the same from the application’s point of view (same buffer object). But on the driver’s side, we’re optimizing and allowing the application to not wait for the GPU.

Internally, we’ve allocated a large pool of memory that we suballocate from. When we call BufferDataARB, we reserve a chunk of it for the current buffer object. Then we fill it with data and draw with it, and we mark that memory as being used (similar to the glFence function).

If we call BufferDataARB again before the GPU is done, we can simply assign the buffer object a new chunk of the large pool. This is possible because BufferDataARB says we’re going to re-specify all the data in the buffer (as opposed to BufferSubDataARB).


N.
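
In code, the re-specification behaviour the paper describes looks roughly like this (an illustrative sketch, not taken from the paper; vbo, size, frameData, nextFrameData and count are placeholders):

glBindBuffer( GL_ARRAY_BUFFER_ARB, vbo );
glBufferData( GL_ARRAY_BUFFER_ARB, size, frameData, GL_STREAM_DRAW_ARB );
glDrawArrays( GL_POINTS, 0, count );     // the GPU may still be reading this chunk...

// ...so re-specifying the whole store lets the driver hand the buffer object a
// fresh chunk from its pool instead of stalling; the old chunk is recycled later.
glBufferData( GL_ARRAY_BUFFER_ARB, size, nextFrameData, GL_STREAM_DRAW_ARB );
glDrawArrays( GL_POINTS, 0, count );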

knackered
02-08-2008, 11:15 AM
Don't think that's the problem, as I allocate using BufferData with a NULL pointer, then upload the static data using SubData.
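
That is, roughly this pattern (a sketch of what's described; vbo, totalSize, modelSize and modelData are placeholders):

glBindBuffer( GL_ARRAY_BUFFER_ARB, vbo );
glBufferData( GL_ARRAY_BUFFER_ARB, totalSize, NULL, GL_STATIC_DRAW_ARB );   // allocate storage only
glBufferSubData( GL_ARRAY_BUFFER_ARB, 0, modelSize, modelData );            // upload afterwards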

tamlin
02-08-2008, 11:16 AM
Could this simply be attributed to the fact that NT has 64KB allocation granularity (not "page"-sized) for many things? Try, for example, VirtualAlloc and you'll likely get a 64KB-aligned pointer back.

I suspect the same behaviour applies at the HAL level if you try to allocate physical memory or map a section - you get 64KB alignment.

To verify this, it could be interesting to actually map all those created (relatively small) VBOs at the same time and check their base addresses. Perhaps it's simply granularity overhead - much like a 1000-byte file on a filesystem with 4KB clusters still occupies 4KB?
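
A rough sketch of what that check could look like (reusing the ab[] buffers and MAX_BUFFERS from the earlier snippet; all buffers stay mapped at once so the addresses can be compared):

void *mapped[MAX_BUFFERS];
for (int i = 0; i < MAX_BUFFERS; i++)
{
    glBindBuffer( GL_ARRAY_BUFFER_ARB, ab[i] );
    mapped[i] = glMapBuffer( GL_ARRAY_BUFFER_ARB, GL_READ_ONLY_ARB );
    printf( "VBO %d mapped at %p\n", i, mapped[i] );   // compare base addresses / spacing
}
for (int i = 0; i < MAX_BUFFERS; i++)
{
    glBindBuffer( GL_ARRAY_BUFFER_ARB, ab[i] );
    glUnmapBuffer( GL_ARRAY_BUFFER_ARB );
}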

knackered
02-08-2008, 12:12 PM
then why doesn't it affect the dlist memory footprint?

tranders
02-08-2008, 12:32 PM
Could this simply be attributed to the fact that NT has 64KB allocation granularity (not "page"-sized) for many things? Try, for example, VirtualAlloc and you'll likely get a 64KB-aligned pointer back.

I suspect the same behaviour applies at the HAL level if you try to allocate physical memory or map a section - you get 64KB alignment.

To verify this, it could be interesting to actually map all those created (relatively small) VBOs at the same time and check their base addresses. Perhaps it's simply granularity overhead - much like a 1000-byte file on a filesystem with 4KB clusters still occupies 4KB?
The individual data buffers are allocated out of a pool of much larger buffers (in this case a pool of what appear to be 4MB buffers), so the allocation granularity only applies to these larger buffers. The addresses returned by glMapBuffer(GL_ARRAY_BUFFER_ARB, GL_READ_ONLY_ARB) are aligned on 16-byte boundaries (which makes sense) and are not spaced out at twice the size of the buffer. There is padding between the mapped addresses, but it tends to run in the neighborhood of 304 to 320 bytes. Regardless, you shouldn't see a DECREASE in committed memory when you go from 65535 to 65536 no matter what the alignment.

I think the memory allocator is seriously flawed for smaller size buffers.

tamlin
02-08-2008, 04:27 PM
"Regardless, you shouldn't see a DECREASE in committed memory when you go from 65535 to 65536 no matter what the alignment."

Actually, that may not be correct.

Imagine they use different allocators (quite likely) and use (the equivalent of) VirtualAlloc for 64KB+, but some other algo for less. Now imagine they assume the majority of allocations are 64KB+. (we all know what assumption is the mother of).

I'm not saying this is the reason, but it is a plausible theory.

tranders
02-08-2008, 06:33 PM
"Regardless, you shouldn't see a DECREASE in committed memory when you go from 65535 to 65536 no matter what the alignment."

Actually, that may not be correct.

Imagine they use different allocators (quite likely) and use (the equivalent of) VirtualAlloc for 64KB+, but some other algo for less. Now imagine they assume the majority of allocations are 64KB+. (we all know what assumption is the mother of).

I'm not saying this is the reason, but it is a plausible theory.
OK - from a technical point of view the overall footprint for a single-byte difference could require slightly more memory, but a 2x difference for all allocations below 64K is a clear indication of a serious flaw.

Considering that the virtual page table contains equivalent 4MB blocks of memory for both sizes, the large block allocator appears to be the same. The problem lies in the partitioning of these large blocks for individual buffer requests. It's pretty obvious there are different algorithms for the different sizes - which in itself is not a bad thing. However, the key to a good allocator is that the algorithms for the various sizes are "optimized" for that size (or size range). If the VBO allocator is only optimized for 64K+ buffer sizes, then the allocator needs to be fixed - period. Otherwise GL3 will have a hard time convincing developers to switch from a DL-based system to a VBO-based system.

Korval
02-08-2008, 07:10 PM
but a 2x difference for all allocations below 64K is a clear indication of a serious flaw.

Not necessarily a flaw. It could be an optimization of some sort.


If the VBO allocator is only optimized for 64K+ buffer sizes, then the allocator needs to be fixed - period.

Why? IHVs have always strongly suggested that small buffer objects are a bad idea. How is it their fault (and even more so, how is it the OpenGL specification's fault) if you disregard their advice and get poor performance?

knackered
02-09-2008, 06:09 AM
Advice?!
How would you feel if malloc committed double the memory you asked for?
The spec says the parameter is in bytes; what's the point of this parameter if it's nowhere near the truth in practice?
If IHVs are going to do this, then the spec needs to be changed to blocks rather than bytes.

tranders
02-09-2008, 07:21 AM
Not necessarily a flaw. It could be an optimization of some sort.

Even an allocator with no optimization at all for smaller blocks could simply use an entire 64K block and waste the single extra byte for a 64K-1 byte request, resulting in the same footprint as a 64K block. This is most definitely a flaw:

4096 x 65536 --> ~256MB increase in committed process memory
4096 x 65535 --> ~523MB increase in committed process memory


Why? IHVs have always strongly suggested that small buffer objects are a bad idea. How is it their fault (and even more so, how is it the OpenGL specification's fault) if you disregard their advice and get poor performance?
For the record, this thread has nothing to do with performance; it is specifically pointing out a problem with the memory usage of small-block VBOs.

I didn't say this was a problem with the specification - I said that GL3 will have a difficult time convincing DL-based applications to switch if this problem isn't fixed. It's much easier to refactor with a one-to-one relationship between DLs and VBOs, and small-block VBOs enter into that process. The problem is with the IHV's implementation of the specification, but GL3 will get a black eye by way of association.

Per-VBO overhead may be a necessary evil, but that should be the only issue developers have to worry about. IMO it would be a very bad thing to change a specification to hide an IHV's inability to implement a decent allocator. Also note that we are talking about (committed, non-mapped) process memory and not the memory that actually resides on the graphics card.

Jan
02-09-2008, 07:31 AM
As far as I know, when asking Windows for memory, that memory is not immediately allocated, i.e. the memory is neither in physical RAM nor cached on disk, as long as you don't "map/commit/whatever it's called" it.

I don't know what you actually check, but it is possible that the driver DOES allocate the virtual address space in 64K chunks as an optimization. That does not necessarily mean that that amount of memory is actually used/wasted in physical RAM, on disk, or even on the GPU.

Just an idea. I'd like to know more, too, about what grand things NVIDIA messed up there ;-)

Jan.

Korval
02-09-2008, 01:23 PM
the spec says the parameter is in bytes, what's the point in this parameter if it's nowhere near the truth in practice?

It's the IHV's right to allocate whatever memory they see fit in order to implement the specification. The spec doesn't and cannot guarantee that more memory will not be allocated.

That's not to say that this may not be a bug. But when you do something that an IHV says not to, and you get weird behavior, it's really your fault, not theirs.


this thread has nothing to do with performance

Memory performance (how much memory an application takes up) is a type of performance.


The problem is with the IHV's implementation of the specification but GL3 will get a black eye by way of association.

Alternatively, nobody will notice, because they will actually follow the recommendations of the IHV and not allocate tiny buffers. Or, at least, nobody important will notice.

Even more alternatively, GL 3 will allow (as promised) the ability on a per-object basis to say whether a VBO has a backing store in main memory. So allocating a VBO that has no backing store will take up negligible main memory.


IMO it would be a very bad thing to change a specification to hide an IHV's inability to implement a decent allocator.

You say that as though this is incorrect behavior according to the spec. It isn't. The spec allows them to do whatever they want as far as memory allocation goes.

tranders
02-09-2008, 07:58 PM
As far as I know, when asking Windows for memory, that memory is not immediately allocated, i.e. the memory is neither in physical RAM nor cached on disk, as long as you don't "map/commit/whatever it's called" it.

I don't know what you actually check, but it is possible that the driver DOES allocate the virtual address space in 64K chunks as an optimization. That does not necessarily mean that that amount of memory is actually used/wasted in physical RAM, on disk, or even on the GPU.
Jan.
A program can "reserve" virtual addresses, and while in that state the memory does not occupy any physical resource -- the addresses are simply removed from the pool of free address space. However, when you convert a reserved block of memory to a committed state, there must be some resource (e.g., RAM or pagefile) to back the virtual addresses and hold the data. Committed memory is essentially "in use" memory. Mapped memory is distinct from committed memory and is typically used to support write-through or read-through access to memory that is not normally in system RAM or the pagefile (e.g., video RAM, memory-mapped files, etc.).
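
In Win32 terms, the distinction looks like this (a minimal snippet for illustration, unrelated to the driver itself):

// Reserving only takes addresses out of the free address space; nothing backs them yet.
void *p = VirtualAlloc( NULL, 64*1024*1024, MEM_RESERVE, PAGE_NOACCESS );

// Committing the range requires backing store (RAM or pagefile) and is what
// shows up as "committed" / in-use memory for the process.
VirtualAlloc( p, 64*1024*1024, MEM_COMMIT, PAGE_READWRITE );

VirtualFree( p, 0, MEM_RELEASE );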

When creating the first VBO of either 65535 or 65536 bytes (or any size, for that matter), the NVIDIA driver first commits (not reserves) a large 4MB block of memory from which the actual VBO is subsequently allocated. When the 4MB block is full, or a VBO allocation request cannot be satisfied, another 4MB block is committed, and so on. This is typical, and most allocators have different algorithms to deal with small and large objects -- in NVIDIA's VBO implementation, there is a flaw in the small-object algorithm.
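
In other words, the observed behaviour is roughly equivalent to this kind of suballocator (a hypothetical sketch for illustration, not NVIDIA's actual code; ignores requests larger than the pool):

#define POOL_SIZE (4*1024*1024)    /* observed pool block size */

static char  *poolBase = 0;        /* current 4MB block */
static size_t poolUsed = POOL_SIZE;

void *vboAlloc( size_t bytes )     /* suballocate one VBO's system-memory store */
{
    if (poolUsed + bytes > POOL_SIZE)
    {
        /* commit (not just reserve) a fresh 4MB block when the current one is full */
        poolBase = (char*)VirtualAlloc( NULL, POOL_SIZE, MEM_COMMIT, PAGE_READWRITE );
        poolUsed = 0;
    }
    void *out = poolBase + poolUsed;
    poolUsed += bytes;
    return out;
}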

tranders
02-09-2008, 08:36 PM
You say that as though this is incorrect behavior according to the spec. It isn't. The spec allows them to do whatever they want as far as memory allocation goes.
I responded that the suggestion to change the VBO spec to a block-oriented spec instead of a byte-oriented spec was a bad idea. I have never said this allocation bug violated any spec. What I have said is that this behavior is seriously flawed and needs to be fixed. Regardless of what an IHV might recommend, there is no way to justify an algorithm that doubles the memory footprint when a buffer request changes from 65536 bytes to 65535 bytes.

That's enough for this thread - the problem was confirmed by knackered - and I'll take my complaint directly to the IHV.

tamlin
02-10-2008, 01:20 PM
4096 x 65536 --> ~256MB increase in committed process memory
4096 x 65535 --> ~523MB increase in committed process memory


I have a rather plausible theory for this, I'd say. If true, it indeed points to a flaw in the memory allocator used.

I think what's going on is that the allocator has two lists - large and small pages. Requests for 64KB or more it satisfies by returning large pages (meaning an allocation of 65537 bytes would also consume two pages - this however requires verification). Such pages need no bookkeeping in the pages themselves, as a simple bitmap keeps track of them.

Requests for less memory are handled by a small-block heap manager that carves sub-blocks out of 64KB pages and interleaves bookkeeping data with the requested memory. This works just fine as long as bookkeeping + memory < 64KB.

The flaw (bug) is then likely that it only checks how much memory is requested to decide whether it's to be considered a small or large allocation, not what requested + overhead comes to. It then requests requested + overhead from the lower-level 64KB page carver, which satisfies the request by allocating and returning two 64KB pages.

The latter would explain why requesting n x 65535 bytes consumes 2n x 65536 bytes. As for the larger sizes (64KB or more), whether it actually uses 64KB granularity or simply defers to the operating system's page size is left for further research.

Either way, it indeed seems to be the allocator creating the problem for the border condition where req < 64KB but req + overhead > 64KB.
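
To make the theory concrete, the suspected check would look something like this (hypothetical code, purely illustrative; largePageAlloc/smallBlockAlloc are stand-ins for the driver's real allocators, not actual driver code):

#include <stdlib.h>

#define CARVE_PAGE 65536                     /* 64KB carving granularity */
#define OVERHEAD   64                        /* assumed per-block bookkeeping */

/* stand-ins: both round up to whole 64KB pages */
static void *largePageAlloc( size_t bytes )  { return malloc( ((bytes + CARVE_PAGE - 1) / CARVE_PAGE) * CARVE_PAGE ); }
static void *smallBlockAlloc( size_t bytes ) { return largePageAlloc( bytes ); }

void *driverAlloc( size_t requested )
{
    if (requested >= CARVE_PAGE)
        return largePageAlloc( requested );      /* classified "large" by the request alone */

    /* The small-block path adds its bookkeeping first, so a 65535-byte request becomes
       65535 + OVERHEAD > 64KB and the page carver hands back two 64KB pages,
       doubling the footprint of every just-under-64KB buffer. */
    return smallBlockAlloc( requested + OVERHEAD );
}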

knackered
02-10-2008, 04:58 PM
My main gripe is that they've changed the behaviour, with dramatic effect. When I originally implemented my VBO path I was getting much better memory usage than dlists on NVIDIA.
The 2MB VBO size is fastest in my tests because it allows me to use 16-bit indices while storing multiple models in the same VBO, without having to call glVertexPointer more than once per VBO (20% faster than any other configuration I came up with), but now it uses double the memory. So I followed NVIDIA's 'advice' and look where it's got me. I either accept the insane increase in memory footprint and keep the speed, or make my VBOs bigger and use 32-bit indices, which will add 20% to my render time.
For God's sake, I have customers on 32-bit Windows loading models that wouldn't fit using dlists but do (did) fit with VBOs. Now they're going to have problems if they update their drivers.
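
For context, the layout being described is roughly this (an assumed sketch, not knackered's actual code): several models packed into one shared STATIC_DRAW VBO so that 16-bit indices can address all of them, with glVertexPointer set once per VBO rather than once per model.

struct Batch { GLsizei indexCount; GLuint firstIndex; };    /* assumed per-model bookkeeping */

void drawBatches( GLuint sharedVBO, GLuint sharedIBO, const Batch *batches, int numBatches )
{
    glBindBuffer( GL_ARRAY_BUFFER_ARB, sharedVBO );         /* ~2MB of packed vertex data */
    glBindBuffer( GL_ELEMENT_ARRAY_BUFFER_ARB, sharedIBO ); /* 16-bit indices for all models */
    glEnableClientState( GL_VERTEX_ARRAY );
    glVertexPointer( 3, GL_FLOAT, 0, (void*)0 );            /* called once per VBO */

    for (int i = 0; i < numBatches; i++)
        glDrawElements( GL_TRIANGLES, batches[i].indexCount, GL_UNSIGNED_SHORT,
                        (void*)(batches[i].firstIndex * sizeof(GLushort)) );
}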

tamlin
02-10-2008, 06:01 PM
knackered, I feel your pain. This is crap we shouldn't have to experience, and if my theory is correct, it's a beginner's mistake in memory management. It's a mistake a toddler in the field would make.

I'm sorry to bring this up here, but this is an excellent case to make the point "isn't it about bloody time to open up the source? It's not like <competitor> would have any use of it, is it?" (especially in this case :) ).

Had this memory allocator change been peer reviewed, chances are this error would never have been introduced at all.

Korval
02-11-2008, 02:11 AM
The 2MB VBO size is fastest in my tests because it allows me to use 16-bit indices while storing multiple models in the same VBO, without having to call glVertexPointer more than once per VBO (20% faster than any other configuration I came up with), but now it uses double the memory.

Wait, what? The thread suggests that the double-memory problem happens when allocating less than 64K in a single vertex buffer.


I'm sorry to bring this up here, but this is an excellent case to make the point "isn't it about bloody time to open up the source? It's not like <competitor> would have any use of it, is it?" (especially in this case :) ).

LoL.

NVIDIA is as protective of their IP as it gets. Which makes sense, considering that their IP is all NVIDIA is. Exposing the source code to their drivers would essentially give everyone a really good idea of how NVIDIA's hardware works.

Plus, I believe that doing so on Windows machines would not be possible, due to agreements made with Microsoft to write drivers in the first place. Oh, they might be able to open source the client-side code, but certainly, nothing that runs in Ring0 would be possible.


Had this memory allocator change been peer reviewed, chances are this error would never have been introduced at all.

Right. Open-source drivers don't have bugs crop up in them randomly.

knackered
02-11-2008, 03:25 AM
Wait, what? The thread suggests that the double-memory problem happens when allocating less than 64K in a single vertex buffer.


There is a definite jump when VBOs are less than 64K and I'm also seeing a rise in memory consumption when the VBOs are greater than 512K and less than 3MB.