Virtual Memory and VBO

I recently started using VBOs for an application. I have a brand new Quadro FX 5800 with 4GB of VRAM but can’t use it for lack of virtual memory.

Here are some details:

Multiple objects are loaded, each with approx 5 000 000 vertices (10 000 000 triangles). I use a separate VBO for vertices (3f), normals (3f) and map coords (2f). What I can see is that with each VBO, the VRAM occupancy goes up, but so does the virtual memory usage of the application. I always run out of VM (XP32 -> 3.5GB) before getting anywhere in VRAM usage. I tried all the flags (dynamic/static, read/draw/copy) to no avail. Here is a snippet:

// create buffer object
pglGenBuffersARB(1, &m_vbo);
pglBindBufferARB(GL_ARRAY_BUFFER_ARB, m_vbo);

// initialize buffer object
aLocalVec.resize( iNbElem );
pglBufferDataARB(GL_ARRAY_BUFFER_ARB, iNbElem * sizeof(float), &aLocalVec.at(0), GL_DYNAMIC_DRAW );

// unbind buffer
pglBindBufferARB(GL_ARRAY_BUFFER_ARB, 0);

Since I’m willing to wait a bit when modifying the models, I would prefer they stay in VRAM only and then get mapped to VM only on demand. Is this even feasible?

thanks!

I don’t think it’s possible.

The same is true for other OpenGL objects like textures. If you load an x MB texture into VRAM, you’ll see an additional x MB RAM footprint. That’s right: if you need to load a texture that takes x amount of memory, you actually need 2*x of RAM to be able to send it to the card.

My rule of thumb now is: “if I buy a graphics card with x GB of VRAM, I also buy at least 2*x GB of RAM.”

Multiple objects are loaded, each with approx 5 000 000 vertices (10 000 000 triangles). I use a separate VBO for vertices (3f), normals (3f) and map coords (2f).

Just so you know, that’s ~160MB of data. That’s a lot.

I would prefer they stay in VRAM only and then get mapped to VM only on demand. Is this even feasible?

No.

It is your implementation’s prerogative to mirror information between VRAM and main memory. Usually this is done out of necessity: if the mirroring didn’t exist, WinXP’s bad VRAM management could cause the VRAM to be cleared (when you switch to another application). The implementation must therefore maintain a copy of this data in main memory.

Hmm I see, thanks for the replies.

I have been reading up on the Vertex Array Range stuff in the meantime, especially the wglAllocateMemoryNV bit. With calls like this:

pVAR = (BYTE*)wglAllocateMemoryNV( iNbElem * sizeof(float), 0, 0, 1 );

I can see that the VM usage doesn’t budge (yay!) but I don’t see any VRAM usage (I am using RivaTuner for now). Guess I will have to put data in there and try to render it.

Are there any avenues using textures as vertex buffers? Or even allocating CUDA buffers?

The saddest part is that the PC has 6GB of RAM but stupid XP32 won’t see it. I guess an upgrade to Windows 7 will be required sooner than I thought!

Are there any avenues using textures as vertex buffers? Or even allocating CUDA buffers?

Huh? How could you use a texture as a buffer object? It’d have to be a 1D texture for that to work.

More importantly, textures are not exempt from the rule about implementations keeping a system memory copy of data.

You are running out of address space, not virtual memory. Try enabling PAE (Physical Address Extension). Alternatively, change to a 64-bit operating system.

In short, you’ll never be able to use 4GB VRAM on a 32-bit address space.

Edit: I just realized that WinXP limits the address space to 32 bits for “compatibility and licensing reasons”. PAE won’t help you then…

Ok, I spent some time away from the forum; here is what I did:

I now have a Windows7 machine (x64) with 6GB visible RAM. I still have the same Quadro FX 5800 with 4GB VRAM.

What I’m trying to do is load multiple objects, each one similar to the following:

Vertex mesh: 8983x499 = 4482517 vertices
number of polygons(tris): 8946072

VBO for vertex data:  4482517 x 3 floats = 53790204 bytes
VBO for normal data:  4482517 x 3 floats = 53790204 bytes
VBO for map coords:  4482517 x 2 floats = 35860136 bytes

total: 143440544 bytes (140MB)

I then load a few EBOs (GL_ELEMENT_ARRAY_BUFFER) for LOD stuff:

EBO[Original Resolution]: 8948062 * 1 int = 35792248 bytes
EBO[2x2 SubSampling]: 2228030 * 1 int = 8912120 bytes
EBO[4x4 SubSampling]: 552514 * 1 int = 2210056 bytes
EBO[8x8 SubSampling]: 137004 * 1 int = 548016 bytes
EBO[16x16 SubSampling]: 33718 * 1 int = 134872 bytes

total: 47597312 bytes (45MB)
       
for a grand total of approx 185MB per object.

I just tried to naively load more data than I could under XP (32 bits) and I still get memory allocation errors.

I can presently load a max of 5 objects but the fifth one doesn't display. Even at 4 I have problems as soon as I transform the data. Here is a chart of the Working set of the application depending on the number of objects loaded:

1	384988 KB
2	756528 KB
3	1115288 KB
4	1469752 KB
5	1743768 KB

Guess I will have to recompile everything for x64?
Could it help if I broke the large VBOs into lots of smaller ones (would this reduce some contiguous memory requirements)?

Does anyone have info/papers/links to windows code actually using 3-4 GB of VRAM ?

Also, I previously used RivaTuner to look at the amount of VRAM in use. It seems that the memory virtualizer in Vista and Win7 doesn’t allow that tool to do so anymore. Can anyone recommend a tool that does the same?

Thanks!

VBO for vertex data: 4482517 x 3 floats = 53790204 bytes
VBO for normal data: 4482517 x 3 floats = 53790204 bytes
VBO for map coords: 4482517 x 2 floats = 35860136 bytes

Why are you using such large vertex data structures?

Normals can be passed as 2 coordinates, where you generate the third automatically (since they’re normalized, you know what the third has to be just from the first two). And signed shorts, or even signed bytes, provide enough resolution for normals. So you can go from 12 bytes down to 2 bytes, or 4 bytes for two shorts.
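
To make that concrete, here is a minimal sketch of the two-component idea (the attribute name and location are made up, and the shader line only recovers the magnitude of Z, so you need some convention for its sign):

// Store only X/Y of each unit normal as normalized signed shorts.
GLshort packedNormal[2];
packedNormal[0] = (GLshort)(nx * 32767.0f);
packedNormal[1] = (GLshort)(ny * 32767.0f);

// Feed them as a normalized generic attribute (glNormalPointer always expects 3 components).
glVertexAttribPointer(attribNormalXY, 2, GL_SHORT, GL_TRUE, 0, 0);

// Vertex shader side (GLSL), roughly:
//   vec2 nxy = a_normalXY;                           // already back in [-1, 1]
//   float nz = sqrt(max(0.0, 1.0 - dot(nxy, nxy)));  // |z| only
//   vec3 n   = vec3(nxy, nz);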

Something similar goes for texture coordinates. Unsigned shorts should give you plenty of resolution. Unsigned bytes are probably too small to be worthwhile.

And it’s even possible to get away with signed shorts for your position data, depending on how much shader effort you are willing to put in to make it work. Basically, you have to generate (for each object) the actual extent of the model, then scale the model down to a normalized [-1, 1] range, then convert to signed shorts. In the shader, you have to get the extent of the model and scale the positions back up based on that extent. Then you can do your transforms.
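
A rough sketch of the CPU side (all names here are invented; the shader undoes the mapping with the same center/extent passed as uniforms):

// Remap positions into [-1, 1] using the object's bounding box, then quantize.
float center[3]     = { (maxX + minX) * 0.5f, (maxY + minY) * 0.5f, (maxZ + minZ) * 0.5f };
float halfExtent[3] = { (maxX - minX) * 0.5f, (maxY - minY) * 0.5f, (maxZ - minZ) * 0.5f };
for (size_t i = 0; i < nbVertices; ++i) {
    for (int c = 0; c < 3; ++c) {
        float norm = (positions[3*i + c] - center[c]) / halfExtent[c];  // now in [-1, 1]
        quantized[3*i + c] = (GLshort)(norm * 32767.0f);
    }
}
// Upload 'quantized' to the VBO and bind it as a normalized GL_SHORT attribute
// (glVertexAttribPointer with normalized = GL_TRUE); in the vertex shader:
//   vec3 p = u_center + a_position.xyz * u_halfExtent;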

Best case, you can reduce things from your current 32 bytes per vertex down to 12 per vertex. A significant improvement :wink: Even in a conservative case (shorts for tex coords and normals, floats for positions), you can go down to 20 bytes per vertex.

EBO[Original Resolution]: 8948062 * 1 int

Yeah, you should really be using shorts for that, even if it takes multiple draw calls. The Base Vertex extension is great for this, as you won’t have to re-bind your vertex data.
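
For instance, something along these lines (assuming GL_ARB_draw_elements_base_vertex is available; the chunk bookkeeping is invented):

// Split the mesh into chunks of at most 65536 vertices so 16-bit indices are enough,
// then draw each chunk with a base-vertex offset instead of re-binding vertex buffers.
for (size_t c = 0; c < chunks.size(); ++c) {
    glDrawElementsBaseVertex(GL_TRIANGLES,
                             chunks[c].indexCount,
                             GL_UNSIGNED_SHORT,
                             (const GLvoid*)chunks[c].indexOffsetBytes,
                             chunks[c].baseVertex);  // added to every index before fetching
}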

Even at 4 I have problems as soon as I transform the data.

What? That only makes sense if you’re generating vertex data on the GPU.

What exactly are you doing when you “transform the data?”

Why are you using such large vertex data structures?

I usually work in doubles across the board, and since this is my first OpenGL app I guess I stuck to the earlier examples :p. I just recently migrated from immediate mode to display lists to VBOs, after all!

Normals can be passed as 2 coordinates, where you generate the third automatically (since they’re normalized, you know what the third has to be just from the first two).

Would I need a vertex program for this and calculate the dot product on the fly, or am I missing another great GL feature?

Something similar goes for texture coordinates. Unsigned shorts should give you plenty of resolution. Unsigned bytes are probably too small to be worthwhile.

This sounds really good, especially since my textures are originally in the range of 500x10000 pixels; I rescale them in advance in any case.

And it’s even possible to get away with signed shorts for your position data, depending on how much shader effort you are willing to put in to make it work. Basically, you have to generate (for each object) the actual extent of the model, then scale the model down to a normalized [-1, 1] range, then convert to signed shorts. In the shader, you have to get the extent of the model and scale the positions back up based on that extent. Then you can do your transforms.

I would be afraid of losing some fine details that way. The 3D models I load are actual real-world objects scanned at the micron level, and we use the application to view the minute details. I would maybe try this as a last resort.

Yeah, you should really be using shorts for that, even if it takes multiple draw calls. The Base Vertex extension is great for this, as you won’t have to re-bind your vertex data.

Base Vertex extension as in the parameter to the glDrawElements function? I read elsewhere that the fewer calls, the better. Guess I will have to time it and see.

What? That only makes sense if you’re generating vertex data on the GPU.

What exactly are you doing when you “transform the data?”

In general, the models are pretty static; the user just plays with the light and camera and looks everywhere. We have a few algorithms to enhance the structure details, called manually by the user, that effectively modify every single vertex in the object.

Presently all this computation is done client-side with RAM data, and the vertex buffers get mapped and modified. I read a small bit about vertex programs and how you can pack additional parameters for each vertex and then have the vertex program do its stuff “automagically”. There are a few parameters that could be set per vertex, and a few “global” ones that could be passed to the program, but in my case I always need access to an external (RAM) buffer in the computation.

Also, if I start playing with a vertex program, do I lose all the lighting stuff that I currently have for free? I currently set up some material properties and light properties and get something really nice looking. Would I need to re-implement a Phong shader of some sort?

Thanks a lot for all your input. It’s like having a private GL tutor! :smiley:

Coming back to your original example

// create buffer object
pglGenBuffersARB(1, &m_vbo);
pglBindBufferARB(GL_ARRAY_BUFFER_ARB, m_vbo);

// initialize buffer object
aLocalVec.resize( iNbElem );
pglBufferDataARB(GL_ARRAY_BUFFER_ARB, iNbElem * sizeof(float), &aLocalVec.at(0), GL_DYNAMIC_DRAW );

// unbind buffer
pglBindBufferARB(GL_ARRAY_BUFFER_ARB, 0);

I don’t see you freeing aLocalVec after the glBufferData call. Are you freeing that memory and just didn’t show it, or, in case you don’t, is there a reason to keep that extra copy in your app? OpenGL itself doesn’t need it…

Also, about playing with a vertex program (ahem, shader), here’s a little quote from NeHe’s OpenGL Article 21:

A vertex shader operates on every vertex. So if you call glVertex* (or glDrawArrays, …) the vertex shader is executed for each vertex. If you use a vertex shader you have nearly full control over what is happening with each vertex. But if you use a vertex shader ALL Per-Vertex operations of the fixed function OpenGL pipeline are replaced (see Figure 1):
Vertex Transformation
Normal Transformation, Normalization and Rescaling
Lighting
Texture Coordinate Generation and Transformation

So yes, you will lose all that material/light stuff you currently get for free. But if you’re serious about doing this (and judging by your video card, you’re dead serious), this is a good road to take (you do have to think about what hardware you intend to run your programs on, but, again, judging by your mesh sizes, I’d say you need something quite potent). As for the actual lighting algorithms, there are a few shaders floating around the internet that implement the standard OpenGL lighting; take your inspiration from those, and you will definitely feel the need to modify them for your own purposes.
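
Just to give you an idea of the scale of it, a bare-bones vertex shader that fakes a single directional light looks roughly like this, compiled the usual way (error checking omitted; this is nowhere near the full fixed-function lighting model):

// Minimal GLSL vertex shader: one directional light, ambient + diffuse only.
const char* vsSrc =
    "void main() {\n"
    "    vec3 n = normalize(gl_NormalMatrix * gl_Normal);\n"
    "    vec3 l = normalize(gl_LightSource[0].position.xyz);\n"
    "    float diff = max(dot(n, l), 0.0);\n"
    "    gl_FrontColor = gl_FrontMaterial.ambient * gl_LightSource[0].ambient\n"
    "                  + gl_FrontMaterial.diffuse * gl_LightSource[0].diffuse * diff;\n"
    "    gl_Position = ftransform();\n"
    "}\n";

GLuint vs = glCreateShader(GL_VERTEX_SHADER);
glShaderSource(vs, 1, &vsSrc, 0);
glCompileShader(vs);
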
Plus, any good graphics book out there (the ShaderX series comes to mind) has lots of interesting lighting implementations, so you will see, the sky is the limit. Just start slow and go wherever you want from there.

Cheers,
Iulian

Also, really take a look at the vertex buffer object extension; there’s glMapBuffer/glUnmapBuffer, which might be exactly what you want/need to implement your stuff… Take the time to read through the entire specification (the latest version of the Red Book/Orange Book would be a nice investment).
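
The usual pattern is roughly this (buffer and variable names here are just placeholders):

// Map, write only while mapped, unmap before the buffer is used for drawing again.
glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
float* p = (float*)glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB);
if (p) {
    for (size_t i = 0; i < nbFloats; ++i)
        p[i] = newValues[i];
    glUnmapBufferARB(GL_ARRAY_BUFFER_ARB);  // the pointer is invalid after this call
}
glBindBufferARB(GL_ARRAY_BUFFER_ARB, 0);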

Thanks for your input,

I don’t see you freeing aLocalVec after the glBufferData call. Are you freeing that memory and just didn’t show it, or, in case you don’t, is there a reason to keep that extra copy in your app? OpenGL itself doesn’t need it…

That snippet was from a local function, and aLocalVec is just that: local, and it gets cleared on exit.

I scanned all the allocations in my program and removed all I could (I had a polygon list of ~105MB for calculating the normals, which I removed in favor of a “hardcoded” loop since my meshes are always laid out in the same fashion). I am now down to two vertex lists of approx 50MB that I keep in RAM per object, and that’s it. The rest is all VBOs now (about 180MB per object), and I will try to slim it down further with Alfonse Reinheart’s insights.

I think I will also invest a little time in the shader stuff. Right now I need to recalculate the normals after each vertex transformation. I cut that time to approx 250ms per object, but if I could offload some of the work to the shader it would be a lot smoother!

thanks!

Would I need a vertex program for this and calculate the dot product on the fly

Yes.

Base Vertex extension as in the parameter to the glDrawElements function? I read elsewhere that the fewer calls, the better.

Better for what? All the performance in the world doesn’t mean a thing if your application fails due to running out of memory :wink:

About the last statement: yes, usually the fewer calls, the better. Or, as someone else put it, the most optimised lines of code are the ones that don’t get executed.

The reason for this, especially in the video world, is the same as anywhere else actually: the less long-distance communication, the better. When you issue another command to the graphics card, your CPU has to take that command, pass it through the video driver (God knows what’s in there), update some internal state, and after that the driver (still the CPU) will communicate with the GPU through PCI-e/AGP/smoke_signals, which is, by its very definition, of limited bandwidth and takes some time to complete. Time which the video card could very well be spending crunching polygons/vertices/pixels/texels/fragments from the local (and faster accessible) video memory, instead of chatting with the CPU.

When you issue another command to the graphics card, your CPU has to take that command, pass it through the video driver (God knows what’s in there), update some internal state, and after that the driver (still the CPU) will communicate with the GPU through PCI-e/AGP/smoke_signals, which is, by its very definition, of limited bandwidth and takes some time to complete.

And if you were talking about Direct3D, you would be correct. However, you’re talking about OpenGL, so not so much :wink:

OpenGL implementations get to do their own marshalling of calls. So having lots of rendering calls, while not necessarily a good thing performance-wise, is still reasonably efficient. Plus, there’s always glMultiDrawElements if you feel that function call overhead is killing your performance.
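
For example (the counts and offsets below are made up):

// One call submits several index ranges from the currently bound element array buffer.
GLsizei counts[3] = { 30000, 30000, 25536 };
const GLvoid* offsets[3] = {
    (GLvoid*)0,
    (GLvoid*)(30000 * sizeof(GLushort)),
    (GLvoid*)(60000 * sizeof(GLushort))
};
glMultiDrawElements(GL_TRIANGLES, counts, GL_UNSIGNED_SHORT, offsets, 3);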

More importantly, if your program is running out of memory, your performance is 0 frames per second. I’ll take 15 FPS over 0 any day :wink:

Alfonse, I wasn’t talking about whether it’s better to run slow or not at all, that’s a no-brainer; I was talking about the other part of the conversation.

I’d think about that a bit more if I were you… OpenGL/D3D work on basically the same hardware, which is API-agnostic. And geometry batching/instancing is always a good idea. And no matter what API, it holds internal state which is handled by the CPU, which, on most API calls, will need to talk to the GPU through some limited-bandwidth interface, and during that time the GPU will have to listen.

And no matter what API, it holds internal state which is handled by the CPU, which, on most API calls, will need to talk to the GPU through some limited-bandwidth interface, and during that time the GPU will have to listen.

No, that’s not how it works at all. Not in OpenGL.

When you make a draw call in OpenGL, the GPU does not have to be informed of it immediately. Oftentimes, what happens is that the CPU gathers up a bunch of draw calls so that they can be submitted to the GPU all in one go. This is called “marshalling”. When these calls are actually submitted depends on GPU interrupts and other internal implementation details. But they do not have to be submitted immediately when a call is made.

Furthermore, the “limited bandwidth interface” is not particularly limited. It’s simply a queue the GPU reads, one that wraps around. Commands don’t tend to be particularly large. They’re either DMA commands, register setting, or render calls.

Instancing and batching are about state-change overhead, not making multiple draw calls in sequence. Changing programs/VAOs/textures/etc is far more painful than rendering multiple times with the exact same state.

Marshaling is the “submitting” part (a fancy way of saying serialization), not the gathering part…

Trust me, I’ve seen some OGL implementations (yeah, the actual implementations), including, but not limited to, the one in the PS3. I’ve been part of an effort to optimize a commercially available game (weirdly enough, some reviewers are using it to test the newest graphics cards, imagine that), so I have a rough idea of what I’m talking about…

When I said something about the “limited bandwidth” stuff, I wasn’t talking about the command queue (an implementation detail) but rather about the fact that the CPU talks to the GPU over a bus (I even specified PCI-e/AGP/smoke_signals), so try to read everything before you go around bashing people.

for the rest, I feel it’s useless to try to explain again…

Okay, after playing with the stuff a bit I made some progress, but I still have a few questions:

I managed to not use any client-side memory at all for data structures (my guess was that it restricted the contiguous memory available) and to use VBOs instead. To take an arbitrary object as an example, I get the following VBOs:

Vertex 5663304 x 3 floats = 68MB
Normal 5663304 x 3 floats = 68MB
texcoord 5663304 x 2 shorts = 23MB

data1 5663304 x 3 floats = 68MB
data2 5663304 x 3 floats = 68MB

EBO 5663304 x 1 int = 23MB
EBO 353956 x 1 int = 1.4MB

I switched the texcoords from floats to shorts using the following:

glMatrixMode(GL_TEXTURE);
glLoadIdentity();
glScalef(1.0f/static_cast<float>(m_iWidth), 1.0f/static_cast<float>(m_iHeight), 1.0f);
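
The pointer setup that goes with it is just the standard one, roughly (variable name aside), with the shorts holding raw pixel coordinates that the texture matrix above rescales into [0, 1]:

pglBindBufferARB(GL_ARRAY_BUFFER_ARB, m_vboTexCoord);
glTexCoordPointer(2, GL_SHORT, 0, 0);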

I will also look into using glTexGen.

On the whole, I can presently load 10 such objects on screen without problems, so it would be Mission Accomplished if I didn’t have other problems. The problems start appearing when I modify the vertex data. I can do it as many times as I wish for one of the objects without any problems, but as soon as I try it on another object afterward, I get GL_OUT_OF_MEMORY from my glMapBufferARB calls and stuff starts disappearing.
But when I modify the data, I carefully release the mapped memory, so it shouldn’t happen, right? My map/unmap calls seem to be well balanced, and I tried a few different combinations of STATIC/DYNAMIC/STREAM + DRAW/COPY/READ to no avail:

pglBindBufferARB(GL_ARRAY_BUFFER_ARB, m_vboVertex);
pglBufferDataARB(GL_ARRAY_BUFFER_ARB, m_iNbVertices*sizeof(vertex_t), 0, GL_STATIC_DRAW); //flag as dirty
float* ptr = (float*)pglMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB);
pglUnmapBufferARB(GL_ARRAY_BUFFER_ARB); // release pointer to mapping buffer

pglBindBufferARB(GL_ARRAY_BUFFER_ARB, m_vboData1);
float* ptrData1 = (float*)pglMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_READ_ONLY_ARB);
pglUnmapBufferARB(GL_ARRAY_BUFFER_ARB); // release pointer to mapping buffer

pglBindBufferARB(GL_ARRAY_BUFFER_ARB, m_vboData2);
float* ptrData2 = (float*)pglMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_READ_ONLY_ARB);
pglUnmapBufferARB(GL_ARRAY_BUFFER_ARB); // release pointer to mapping buffer

//Modify data in the ptr buffer

pglBindBufferARB(GL_ARRAY_BUFFER_ARB, 0);

CalcNormals();

I also tried the following form with no differences:

pglBindBufferARB(GL_ARRAY_BUFFER_ARB, m_vboVertex);
pglBufferDataARB(GL_ARRAY_BUFFER_ARB, m_iNbVertices*sizeof(vertex_t), 0, GL_STATIC_DRAW); //flag as dirty
float* ptr = (float*)pglMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB);

pglBindBufferARB(GL_ARRAY_BUFFER_ARB, m_vboData1);
float* ptrData1 = (float*)pglMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_READ_ONLY_ARB);

pglBindBufferARB(GL_ARRAY_BUFFER_ARB, m_vboData2);
float* ptrData2 = (float*)pglMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_READ_ONLY_ARB);

//Modify data in the ptr buffer

pglBindBufferARB(GL_ARRAY_BUFFER_ARB, m_vboVertex);
pglUnmapBufferARB(GL_ARRAY_BUFFER_ARB); // release pointer to mapping buffer

pglBindBufferARB(GL_ARRAY_BUFFER_ARB, m_vboData1);
pglUnmapBufferARB(GL_ARRAY_BUFFER_ARB); // release pointer to mapping buffer

pglBindBufferARB(GL_ARRAY_BUFFER_ARB, m_vboData2);
pglUnmapBufferARB(GL_ARRAY_BUFFER_ARB); // release pointer to mapping buffer

pglBindBufferARB(GL_ARRAY_BUFFER_ARB, 0);

CalcNormals();

I tried using smaller datatypes (GL_HALF_FLOAT, GL_BYTE, GL_SHORT) for the normals, but visually it was not good. I’m still not using a shader, so that may be the cause of the problem, no?

float* ptrData1 = (float*)pglMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_READ_ONLY_ARB);

Under what circumstances do you need to read from the buffer? You set the data there, so you clearly know what it is. There’s no reason for you to read back from it. So stop doing it.

If you really need direct access to the buffer, just store a copy yourself in main memory.
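
In other words, something along these lines (just a sketch; the function and buffer names are placeholders, and it assumes you keep that system-memory copy around):

// Modify the system-RAM copy, then push the whole thing (or a sub-range) up;
// no GL_READ_ONLY mapping needed at all.
ModifyVertices(cpuVertices, m_iNbVertices);  // your CPU-side transform
pglBindBufferARB(GL_ARRAY_BUFFER_ARB, m_vboVertex);
pglBufferSubDataARB(GL_ARRAY_BUFFER_ARB, 0, m_iNbVertices * sizeof(vertex_t), cpuVertices);
pglBindBufferARB(GL_ARRAY_BUFFER_ARB, 0);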