VBO Performance

Hi people,

I recently rewrote my terrain renderer to use VBOs instead of regular vertex arrays. I expected to see significant improvements in performance, since the terrain model is quite big, with a lot of vertex arrays.

I also did some experiments earlier with the NVIDIA VAR/fence extension and saw quite good performance improvements.

However, this time with VBOs the performance actually got worse…

I tried it on two Linux computers: one with an ASUS GeForce 6800 GT AGP card and the 7174 drivers, and a Dell Inspiron XPS 2 with a GeForce 6800 Ultra and the 7667 drivers.

On the ASUS card I got more or less the same performance, but that computer is CPU-bound anyway, and on the benchmark I got about 15-20 fps.

On the XPS I got 60-80 fps without VBOs, and 40-50 with VBOs…

I haven’t tried it on Windows yet, but generally GL performance in this application is the same on Windows and Linux.

I have seen some threads on this topic before, but that was when this extension was brand new. I’d think that things should be different now.

So is this to be expected or am I doing something seriously wrong here?

You are probably doing something wrong. However, we need more information to help you, such as:

  • How many VBOs are you using?

  • What size are they?

  • What is your vertex format? (no ints/shorts etc., I hope)

  • What is your index buffer format (if you use static index buffers)?

  • How do you allocate the VBOs? (code examples, please, especially the usage flags)

  • How are you using them? (single upload per VBO, or multiple uploads reusing the same VBO buffer?)

  • Is the terrain geometry static, or is it some dynamic scheme?

The terrain is divided into tiles, and there can be up to a couple thousand tiles visible in extreme cases, but typically a few hundred are visible.

Each tile consists of one GL_FLOAT vertex array, one GL_FLOAT texcoord array, and a set of triangle strips using UNSIGNED_SHORT index arrays. Each vertex array typically has 50-500 vertices (varying level of detail).

Here are some relevant code bits:

Upload vertex array

void VertexArray3f::apply() const
{
 if(use_vbo_ && GL::hasVBO())
    {
    if(buf_ == 0)
       {
//       cerr << "VBO (upload)\n";
       buf_size_ = getSize()*12;
       glGenBuffersARB(1, &buf_);
       glBindBufferARB(GL_ARRAY_BUFFER_ARB, buf_);
       glBufferDataARB(GL_ARRAY_BUFFER_ARB, buf_size_,
                       getArrayPtr(), GL_STATIC_DRAW_ARB);
       //vtx_.clear();
       }

//    cerr << "VBO (apply)\n";
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, buf_);
    glVertexPointer(3, GL_FLOAT, 0, 0);
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, 0);
    }
 else
    glVertexPointer(3, GL_FLOAT, 0, getArrayPtr());
} 

Draw arrays

  

inline void IndexArray::drawElements() const
{
 if(GL::hasVBO())
    {
    if(buf_ == 0)
       {
//       cerr << "VBO (upload)\n";
       buf_size_ = getSize()*2;
       glGenBuffersARB(1, &buf_);
       glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, buf_);
       glBufferDataARB(GL_ELEMENT_ARRAY_BUFFER_ARB, buf_size_,
                       getArrayPtr(), GL_STATIC_DRAW_ARB);
       }

//    cerr << "VBO (apply)\n";
    glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, buf_);
    glDrawRangeElements(mode_, min_, max_, getSize(), GL_UNSIGNED_SHORT, 0);
    glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, 0);
    }
 else
    glDrawRangeElements(mode_, min_, max_, getSize(), GL_UNSIGNED_SHORT, getArrayPtr());
// glDrawElements(mode_, getSize(), GL_UNSIGNED_SHORT, getArrayPtr());
}

The vertex arrays are static (they will change when changing detail levels, but that doesn’t happen too often).

The “apply” function is called once for each vertex array each frame, and the “drawElements” function is called once for each index array each frame.

Just some experiments to try:

  • Have you tried using a system-memory index buffer with the static VBO vertex buffer? (Have you also tried the other way around: a static VBO index buffer with a system-memory vertex array?)

Perhaps it is only one of the buffer types that is causing the slowdown…

  • Have you tried interleaving your vertex data? (You said there is a texture/position pair; looking at the code it seems to be a separate stream for each.) It may be hard to change the code properly, but just hack in a test case to see — a rough sketch follows below.
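For reference, interleaving could look roughly like this (a minimal sketch, assuming a plain position + texcoord vertex like yours; the TileVertex struct and the names below are made up, not your code):

#include <cstddef>   // offsetof
#include <vector>

// Hypothetical interleaved vertex: position and texcoord share one buffer.
struct TileVertex
{
    GLfloat pos[3];   // x, y, z
    GLfloat tex[2];   // s, t
};                    // 20 bytes per vertex, tightly packed

// One-time upload (static data); 'verts' is a std::vector<TileVertex>.
GLuint vbo;
glGenBuffersARB(1, &vbo);
glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
glBufferDataARB(GL_ARRAY_BUFFER_ARB, verts.size() * sizeof(TileVertex),
                &verts[0], GL_STATIC_DRAW_ARB);

// At draw time both pointers come from the same buffer, via stride + offset.
glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
glVertexPointer(3, GL_FLOAT, sizeof(TileVertex),
                (const GLvoid*)offsetof(TileVertex, pos));
glTexCoordPointer(2, GL_FLOAT, sizeof(TileVertex),
                  (const GLvoid*)offsetof(TileVertex, tex));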

If it is actually slower with VBOs, then most likely what is happening is that the data is being transferred BACK from GPU memory to CPU memory, altered by the driver, and then sent to the GPU again. This obviously has the effect of doubling bus usage instead of eliminating it.

The reasons the driver would need to pull the data back to modify it vary, but could include it being in the wrong format, or specific rendering features being enabled such as two-sided lighting. NVIDIA have a document on their site explaining what circumstances cause VBOs to fail like this on their cards; I can’t remember the name, but try searching for ‘vbo’ on their site.

I can’t really comment on the decrease in performance, but one thing is for sure: your batch submission is pretty poor. 50-500 vertices, several hundred times per frame, is very bad. That many draw calls are likely to leave you CPU-bound, with the CPU busy submitting batches all the time. There is a very good NVIDIA document over here that gives a good explanation of proper VBO usage. I think you should also study some docs on good batching on both the ATI and NVIDIA websites. I remember the performance decrease I got when I shifted my terrain engine to VBOs, and these docs helped a lot. Here is a link to the discussion I had on this forum regarding that. Maybe this will help.

In the following statements in your apply function:

    glBindBufferARB(GL_ARRAY_BUFFER_ARB, buf_);
    glVertexPointer(3, GL_FLOAT, 0, 0);
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, 0);

You bind and unbind the buffer without drawing any data — or is the code snippet incomplete?
One more thing: since you are using 50-500 vertices per tile, maybe you can pack them into a GL_SHORT vertex buffer rather than GL_FLOAT, giving much better memory usage. [edit] That is, if your vertices lie on integral boundaries. [/edit]
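Just to illustrate what I mean (a rough sketch, not your actual code; the per-tile origin/scale trick below is only one common way to make terrain coordinates fit in shorts, and all the names are made up):

// Hypothetical: positions stored as shorts relative to the tile origin,
// with the world transform restored per tile via the modelview matrix.
struct PackedVertex
{
    GLshort pos[3];
    GLshort pad;      // pad to 8 bytes so each vertex stays 4-byte aligned
};

glVertexPointer(3, GL_SHORT, sizeof(PackedVertex), 0);

// Per tile, before drawing:
glPushMatrix();
glTranslatef(tileOriginX, tileOriginY, tileOriginZ);
glScalef(tileScale, tileScale, tileScale);
// ... draw the tile's strips ...
glPopMatrix();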

Originally posted by eldritch:
The terrain is divided into tiles, and there can be up to a couple thousand tiles visible in extreme cases, but typically a few hundred are visible.

Each vertex array typically has 50-500 vertices (varying level of detail).

Hello,

It seems to me that you are in the situation where you are CPU-limited because you have too many calls binding different small VBOs. Try to gather more tiles into one VBO (best would be 64k).

Just some thoughts :)
Have a nice day,

Originally posted by Zulfiqar Malik:
One more thing: since you are using 50-500 vertices per tile, maybe you can pack them into a GL_SHORT vertex buffer rather than GL_FLOAT, giving much better memory usage. [edit] That is, if your vertices lie on integral boundaries. [/edit]
This assumes the server is able to natively handle the SHORT vertex format. A dangerous assumption, as I noticed some time ago.

What little memory is saved (for 50-500 vertices we’re talking about such silly amounts as half of 3×4×50 – 3×4×500, i.e. half of 600 – 6000 bytes) will, on servers not natively able to use the SHORT vertex format, induce a quite noticeable performance penalty. Besides, with so little data, the majority of the overhead isn’t from transferring the data, I think, but more likely from the transaction itself.

Instead, just go with the advice so far, and, unless it’s already been mentioned, try to batch n×n of your tiles into a single VBO (a rough sketch follows below). Currently, I’d expect the sheer number of calls to have quite an impact on performance. I’m almost sure you’d get higher performance even if you had to draw 24 out of 25 tiles (for a 5x5-tile VBO) at a high LOD using a single call, compared to making the otherwise required VBO switch(es).

When (if) you then put the indices (for the, I assume, static terrain) into a VBO too, you’ll see an even greater speedup.
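A rough sketch of that n×n batching, assuming each tile’s unsigned-short indices get rebased as the tiles are appended into one shared buffer (the TileData struct and helper below are hypothetical, not your engine’s code):

#include <vector>

// Hypothetical per-tile source data.
struct TileData
{
    std::vector<GLfloat>  verts;     // x,y,z triples
    std::vector<GLushort> indices;
};

// Append one tile into a shared group buffer, rebasing its indices so a
// single VBO can cover several tiles.
// (Keep the group under 65536 vertices so unsigned short indices still fit.)
void appendTile(const TileData& tile,
                std::vector<GLfloat>&  groupVerts,
                std::vector<GLushort>& groupIndices)
{
    GLushort base = (GLushort)(groupVerts.size() / 3);
    groupVerts.insert(groupVerts.end(), tile.verts.begin(), tile.verts.end());
    for (size_t i = 0; i < tile.indices.size(); ++i)
        groupIndices.push_back((GLushort)(base + tile.indices[i]));
}

// After appending a group of tiles, upload groupVerts and groupIndices once
// into one GL_ARRAY_BUFFER_ARB / GL_ELEMENT_ARRAY_BUFFER_ARB pair and draw
// the whole group with very few calls (as a triangle list, or as strips
// stitched together, as discussed later in the thread).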

Originally posted by tamlin:

This assumes the server is able to natively handle the SHORT vertex format. A dangerous assumption, as I noticed some time ago.

Not an assumption, my friend :) . I have tested SHORT on the 9700 Pro, 9800 Pro, X800 XL and on NV3x, NV4x and G7x (7800 GT), and they give just as good performance as FLOATs for vertex data. I have noticed erratic performance figures for other arrays, particularly normals. In fact, it’s better to use FLOATs for normal data, since I have not encountered a single piece of hardware that handles normals as efficiently with other types.

Originally posted by tamlin:

What little memory is saved (for 50-500 vertices we’re talking about such silly amounts as half of 3×4×50 – 3×4×500, i.e. half of 600 – 6000 bytes) will, on servers not natively able to use the SHORT vertex format, induce a quite noticeable performance penalty. Besides, with so little data, the majority of the overhead isn’t from transferring the data, I think, but more likely from the transaction itself.

You assume that the vertex data is just uploaded at startup. However, for terrain algorithms it is very common to upload new vertex data as the LOD changes. When you are uploading vertex data in the middle of rendering, you definitely want to upload as little data as possible.
Furthermore, using SHORTs WILL consume half the memory, and that can be quite a bit if you are talking millions and millions of vertices. And a few million vertices are quite common in modern terrain renderers (I do not know whether eldritch’s renderer processes such huge amounts of data). E.g. consider that x, y, z floats take up 12 bytes and shorts take up 8 bytes (aligned on a 4-byte boundary): you save 4MB for just 1 million vertices, and 6MB if they take up just 6 bytes. To emphasize how small a 1M-vertex tile can be: it’s just a 1000x1000 height map at the highest geometric resolution!

Hi folks,

Thanks for lots of useful input here.

First of all, SHORT is not an option for vertex data here, due to the coordinate systems.

    glBindBufferARB(GL_ARRAY_BUFFER_ARB, buf_);
    glVertexPointer(3, GL_FLOAT, 0, 0);
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, 0);

You bind and unbind the buffer without drawing any data, or is the code snippet incomplete?

This is to avoid trouble when using other vertex arrays without VBOs. (I have also tried without doing this, to no effect.) If I use an ELEMENT_ARRAY later I have to bind the new VBO before doing any rendering anyway, so this shouldn’t matter, right?

It seems most likely that the large number of small arrays is the culprit here.

When I originally wrote this terrain engine, I kept the tile sizes low due to limitations in texture size and other factors that were important at that time.

Now it obviously comes back to bite me, and I have a lot of huge terrain models that need to be supported by the next version of the terrain engine :(

If I want to increase the tile size while keeping the textures at the same size as today, I need to introduce extra vertices at the texture coordinate borders, since each tile would have more than one texture.

What texture sizes should I aim for when each tile has a unique texture? Today I use everything from 128x128 up to 1024x1024 depending on the resolution of the input data. All textures are DXT1 compressed. The texture DB can be several gigs so I have a texture manager that loads and caches textures on demand.

Another issue is the triangle strips. These strips are typically not very long, maybe 5-15 triangles per strip. Currently I have one index array per strip. I guess it would be better to upload all the indices in one ELEMENT_ARRAY and then call glDrawRangeElements with different offsets. However, using VBOs for the index arrays doesn’t seem to have any effect one way or the other…
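I guess the offsets would then be byte offsets into the bound index buffer, something like this (just a sketch of what I have in mind; the Strip struct, the strips vector and sharedIndexBuf are made up):

// Hypothetical: one index VBO holds all strips of a tile back to back.
struct Strip
{
    GLuint  first;       // offset into the shared index array, in indices
    GLsizei count;       // number of indices in this strip
    GLuint  minVertex;   // range hints for glDrawRangeElements
    GLuint  maxVertex;
};

glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, sharedIndexBuf);
for (size_t s = 0; s < strips.size(); ++s)
{
    glDrawRangeElements(GL_TRIANGLE_STRIP,
                        strips[s].minVertex, strips[s].maxVertex,
                        strips[s].count, GL_UNSIGNED_SHORT,
                        (const GLvoid*)(strips[s].first * sizeof(GLushort)));
}
glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, 0);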

Would it be just as efficient to have just one big array of GL_TRIANGLES indices?

Originally posted by eldritch:

Would it be just as efficient to have just one big array of GL_TRIANGLES indexes?

Yes, triangle lists can be just as efficient. In fact, a good list implementation is much better than a bad strip implementation, so I would suggest using triangle lists for each patch.
As for not using shorts, I must point out that if there is a conversion scheme from a regular Cartesian coordinate system to yours, then you can store the vertices as shorts in a Cartesian system and unpack them into your coordinate system in a vertex shader.

Originally posted by Zulfiqar Malik:
Not an assumption, my friend :) . I have tested SHORT on the 9700 Pro, 9800 Pro, X800 XL and on NV3x, NV4x and G7x (7800 GT), and they give just as good performance as FLOATs for vertex data. I have noticed erratic performance figures for other arrays, particularly normals. In fact, it’s better to use FLOATs for normal data, since I have not encountered a single piece of hardware that handles normals as efficiently with other types.

I don’t understand. Why would normals be slower? In pure OpenGL 2.0 (if you forget about all the legacy stuff) there’s no such thing as vertex or normal arrays; there are only vertex attributes. Position can be one attribute, normal can be another, but the driver cannot tell the difference. How I combine these attributes in the vertex shader to come up with the resulting position and normal is up to me. Heck, I could use the position coordinates as normals and the normals as positions… So how could any of them be slower? Well, since you made some tests, there must be some difference; I just don’t understand what/why/how… any clues?

Originally posted by eldritch:
Another issue is the triangle strips. these strips are typically not very long, maybe 5-15 triangles per strip. Currently I have one index array per strip.
So does this mean that even if you have 500 triangles in one VBO, you still render them in 5-15 triangle batches? That will give terrible performance. You can glue triangle strips together with degenerate triangles, or you could just use indexed triangles, although if you already have the strips it would probably be better to just connect them. Then you should be able to render the entire chunk with one call.
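The stitching can be done entirely on the index arrays, roughly like this (a minimal sketch; the container names are made up):

#include <vector>

// Concatenate strip 'next' onto 'combined' by repeating the last index of
// 'combined' and the first index of 'next', creating degenerate triangles.
void appendStrip(std::vector<GLushort>& combined,
                 const std::vector<GLushort>& next)
{
    if (!combined.empty())
    {
        combined.push_back(combined.back());   // degenerate
        combined.push_back(next.front());      // degenerate
        // If the index count is odd at this point, push next.front() once
        // more so the winding of the appended strip is preserved.
        if (combined.size() & 1)
            combined.push_back(next.front());
    }
    combined.insert(combined.end(), next.begin(), next.end());
}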

@andras: Well, I don’t know whether the driver does any specific optimizations, but I did a lot of testing when I was developing a terrain rendering engine. Memory optimization was one of my priorities, and since I already had normals packed into unsigned bytes I decided to use that, and it gave me terribly poor performance. At first I thought it was a problem with the algorithm, but it later turned out to be a problem with the ubyte normals, since changing them to floats immediately gave back the performance I was expecting. On top of that, even shorts did not give me good performance, although I was using them for positional data. But one reason for that could be that my positional data was dword-aligned (just x and y, with z being provided in the vertex shader), whereas my normal data was not (6 bytes), although I did not benchmark those as extensively as I did the ubytes.
But I can assume that since shorts were working fine for positional data, they could do just as well for normals if given a good try. Ubytes, however, are definitely not an option as far as my testing goes.

So does this mean that even if you have 500 triangles in one VBO, you still render them in 5-15 triangle batches? That will give terrible performance. You can glue triangle strips together with degenerate triangles, or you could just use indexed triangles, although if you already have the strips it would probably be better to just connect them. Then you should be able to render the entire chunk with one call.

Originally I did the stitching with degenerate triangles, but then I read an NVIDIA document stating that it was better to use smaller strips (optimally around 16 vertices per strip, I think).

But this may of course have changed since then, and I may also have misunderstood it…

I’ll experiment a bit with longer strips or triangle lists and see how that works out.

Originally posted by Zulfiqar Malik:

But one reason for that could be that my positional data was dword-aligned (just x and y, with z being provided in the vertex shader), whereas my normal data was not (6 bytes), although I did not benchmark those as extensively as I did the ubytes.
But I can assume that since shorts were working fine for positional data, they could do just as well for normals if given a good try. Ubytes, however, are definitely not an option as far as my testing goes.

Older ATI Radeon cards (e.g. the 9800) do not natively handle data that is not dword-aligned. Storing normals or colors in 3 ubytes/shorts is slow on them. It is much better to use 4 ubytes/shorts, even if the last ubyte/short is unused. I do not know if this is still the case with the latest ATI cards.
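In struct form that padding could look roughly like this (just a sketch; note that glNormalPointer only accepts signed types, so the example uses GL_BYTE normals rather than unsigned bytes):

// Hypothetical vertex with the normal padded out to a 4-byte boundary.
struct PaddedVertex
{
    GLfloat pos[3];      // 12 bytes
    GLbyte  normal[3];   //  3 bytes
    GLbyte  pad;         //  1 byte of padding -> 16 bytes per vertex
};

// Offsets below assume the array of PaddedVertex lives in a bound VBO.
glVertexPointer(3, GL_FLOAT, sizeof(PaddedVertex), (const GLvoid*)0);
glNormalPointer(GL_BYTE, sizeof(PaddedVertex), (const GLvoid*)12);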