Batching and VBOs

Ok, conventional wisdom suggests that we batch as many vertices as possible and send them to the video card in one go. But suppose that instead of one batch of 4096 vertices, you have 4 batches of 1024 vertices, each in a separate VBO in video memory. My question is: how big a performance hit would you take (from a batching perspective) by binding the 4x1024-vertex VBOs one by one instead of binding the one big 4096-vertex VBO? I know there is usually some setup involved in binding a VBO, but I assume it is really small (given that no other state changes, like texture or shader switches, are taking place). So if the driver keeps all the VBOs in video memory (assuming the memory is large enough), can I assume that the performance of n small batches would be almost equivalent to the performance of one big batch for the same number of vertices? Can someone please explain the low-level details (someone from nVidia?)?

I also want to know the impact of draw calls on performance. Consider the buffers described above. Now suppose I draw the entire 4096-triangle buffer using one glDrawRangeElements(…) call, versus drawing the four 1024-triangle buffers using four glDrawRangeElements(…) calls. How much of a performance hit would I take? Would it be noticeable? Can someone explain the low-level details or point me to an article? Assume the index buffers are entirely in video memory.
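
For concreteness, here is roughly what I am comparing (buffer IDs and counts are made up; assume the index VBOs are bound and resident as stated above):

    /* Case A: one VBO holding all 4096 vertices, one draw call. */
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, bigVBO);
    glVertexPointer(3, GL_FLOAT, 0, NULL);
    glDrawRangeElements(GL_TRIANGLES, 0, 4095, numIndices, GL_UNSIGNED_INT, NULL);

    /* Case B: four VBOs of 1024 vertices each, bound and drawn one by one. */
    for (int i = 0; i < 4; ++i)
    {
        glBindBufferARB(GL_ARRAY_BUFFER_ARB, smallVBO[i]);
        glVertexPointer(3, GL_FLOAT, 0, NULL); /* pointer must be re-specified after each bind */
        glDrawRangeElements(GL_TRIANGLES, 0, 1023, numIndices / 4, GL_UNSIGNED_INT, NULL);
    }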

I don’t know whether this has been discussed before, but if it has, please point me to the thread.

Thanks in advance.

I don’t think there will be much difference at all. When you put data into graphics memory through a VBO, it goes over AGP/PCI Express, i.e. with very high bandwidth (several GB/s). So that is what you have to deal with; and considering that no one uploads hundreds of MB every frame, the transfer is done very quickly.
Of course there will be some small performance hit if you use several VBOs instead of a single one. But it shouldn’t be noticeable at all. What is the impact of a few extra fast function calls? About zero compared to everything else.

For drawing it’s almost the same, even though in practice few people draw one big array; there are considerations other than pure drawing performance behind that.
Also, if I’m not wrong, small arrays benefit from some cache (the pre-transform vertex cache), whereas big arrays will generally defeat that cache.

Hope that could help.

How could that help, jide?
I don’t think it would be possible to be more vague.
You’re actually wrong to underestimate the cost of switching VBOs: setting up the vertex offsets is expensive.
For more information refer to the vendors’ documents; don’t listen to gossip and hearsay.
Examples:
http://www.ati.com/developer/gdc/PerformanceTuning.pdf
http://developer.nvidia.com/object/using_VBOs.html

I think it really depends on many factors, but given what he said, how would you have answered (since you didn’t…)? And I don’t just believe what I hear; I do tests of my own. So.

I’ve done several tests and I personally didn’t see any difference between one big VBO and several (up to many) smaller ones, whether I used a single call to BufferData or several calls to BufferSubData (tested with approximately 10 MB of data). I must admit I didn’t try mapping, even though it seems to be the best option.
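
Roughly, the two upload paths I compared look like this (sizes and pointers are placeholders):

    /* Variant 1: one big upload with BufferData. */
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
    glBufferDataARB(GL_ARRAY_BUFFER_ARB, totalSize, allData, GL_STATIC_DRAW_ARB);

    /* Variant 2: allocate once, then fill in several chunks with BufferSubData. */
    glBufferDataARB(GL_ARRAY_BUFFER_ARB, totalSize, NULL, GL_STATIC_DRAW_ARB);
    for (int i = 0; i < numChunks; ++i)
        glBufferSubDataARB(GL_ARRAY_BUFFER_ARB, i * chunkSize, chunkSize, chunk[i]);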

What’s more, for real-time rendering it is really difficult to use only a single VBO, because of many things: static geometry, dynamic geometry, frustum culling, drawing in an ordered manner (such as front to back).

PS: You said I’m wrong to underestimate the cost of switching VBOs. Maybe. But here is what the links you gave say about that:

The binding operation is cheap compared to the setup of various pointers.

So, if I understand what you said correctly, you’re not quite right…

You bind a VBO when you need to switch to a new VBO, don’t you? So VBO binding does not seem to be a real bottleneck at all.

About vertex offsets, I actually didn’t see anything supporting your claim. Calling VertexPointer more than once per VBO seems to be a bad idea, but again, nothing about the cost of setting the offsets.

PS2: Now, show us what you know, since you’re so good! :slight_smile:

If you have two options for implementing an algorithm, you can:

(1) start a thread in which the two options are discussed, and watch the thread for 2 or 3 days

or

(2) implement both options, then measure and compare the performance, if possible on different systems.

Personally, I would choose answer (2).

I don’t want to offend you for posting this question, but I really think it’s not too difficult to program both algorithms plus a frames-per-second counter.

No offense taken, RigidBody. However, I posted this question after doing quite a bit of testing on different machines, and here’s a summary of what I have experienced so far:

  1. The cost of binding a VBO is negligible on both ATI and nVidia hardware.
  2. Multiple draw calls don’t affect performance that much.
  3. Creating a VBO is definitely expensive, but far less so than updating a read/write VBO. Hence it’s usually a good idea to create write-only VBOs and re-create them when the vertex data needs to be modified (see the sketch after this list).
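
For point 3, a minimal sketch of what I mean (names are hypothetical):

    /* Instead of mapping a read/write VBO and editing it in place,
       re-specify the whole store whenever the vertex data changes. */
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
    glBufferDataARB(GL_ARRAY_BUFFER_ARB, size, newVertexData, GL_DYNAMIC_DRAW_ARB);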

Another thing I noticed on ATi 9700 Pro hardware (using the latest drivers) was that the performance difference between regular vertex arrays and VBOs was negligible. At first I thought it might be a driver bug, but the inconsistency didn’t go away even with newer drivers. The inconsistency between VBOs on nVidia and ATi hardware got me confused, and since I don’t have a lot of machines to test on (I actually have only 2 :slight_smile: ), I wanted to hear the opinions of people who have done solid research on this subject.

Sorry for not posting this earlier, but the original post was long enough already.

And please, guys, stop behaving like people on noobish gamedev-type forums, where posts end up with people flaming each other rather than discussing.

Originally posted by Zulfiqar Malik:
Another thing I noticed on ATi 9700 Pro hardware (using the latest drivers) was that the performance difference between regular vertex arrays and VBOs was negligible.
I believe this performance difference is negligible only for your test case (or your application’s data configuration). In my own experience, as soon as you have big chunks of geometry around, CVA can’t match VBO anymore.

You can have a look at a simple tool that shows the performance difference here:
http://www.fl-tw.com/opengl/GeomBench/

Here, on my ATI 9600 machine, I got something like this:

              Static VAO (H 0 F L S): 89.67 FPS, 68.87 MTS, 768k Tris (T177/288)
              Static VAO (H 0 I L S): 89.00 FPS, 68.36 MTS, 768k Tris (T181/288)
  Compiled vertex arrays (H 0 F L S): 8.77 FPS, 6.74 MTS, 768k Tris (T81/288)
  Compiled vertex arrays (H 0 I L S): 8.73 FPS, 6.71 MTS, 768k Tris (T85/288)

FWIW, I think you should keep in mind that benchmarking has to be done with real datasets to be relevant for your application.

Cheers,
Nicolas.

Originally posted by Zulfiqar Malik:
[b]

  1. The cost of binding a VBO is negligible on both ATI and nVidia hardware.
  2. Multiple draw calls don’t affect performance that much.
  3. Creating a VBO is definitely expensive, but far less so than updating a read/write VBO. Hence it’s usually a good idea to create write-only VBOs and re-create them when the vertex data needs to be modified.[/b]
  1. This looks true.

  2. It can depend. But generally, one call drawing a big array performs essentially the same as several calls drawing small arrays, assuming they belong to the same vertex array (i.e. using indices and start offsets); see the sketch after this list. Using MultiDrawArrays didn’t help much either (in my tests).

  3. I’m not really sure about that. VBOs were designed to be optimized for data transfers, not to be dynamic entities (a bit like the C++ new construct). Doing it the way you stipulated necessarily implies extra CPU/GPU work for the creation and deletion of VBOs; keeping VBOs static avoids that overhead.
    After all, whether you keep a VBO or create a new one each time, you always have to fill it with data. And if you keep the VBO, a call to BufferData with a NULL pointer frees the old storage anyway; deleting a VBO has to free the storage too.
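
Here is point 2 as code (a sketch; the counts are placeholders):

    /* Several small draws sourced from one vertex array... */
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
    glVertexPointer(3, GL_FLOAT, 0, NULL);
    for (int i = 0; i < numPieces; ++i)
        glDrawArrays(GL_TRIANGLES, firstVertex[i], vertexCount[i]);

    /* ...versus one call covering the same range. */
    glDrawArrays(GL_TRIANGLES, 0, totalVertexCount);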


And please, guys, stop behaving like people on noobish gamedev-type forums, where posts end up with people flaming each other rather than discussing.

I’m sorry about that; I’m not used to fighting. It’s not in my habits.

Finally, as Nicolas Lelong said, try to test with a relevant amount of data. Using 1 or 4 KB per VBO is not something that occurs much in real situations. What’s more, there might be some optimizations (related to memory pages or the like, I don’t remember well) that could skew your results.

Hope that helps.

Originally posted by Nicolas Lelong:

I believe this performance difference is negligible only for your test case (or your application’s data configuration). In my own experience, as soon as you have big chunks of geometry around, CVA can’t match VBO anymore.

Hmmm … interesting, because in my BSP renderer VBOs don’t give much of a performance boost on ATi hardware, and I am pretty sure it’s not a CPU bottleneck or anything else related to the rest of the system (excluding the GPU), because a similar PC (exact same configuration) with a GFX 5700 Ultra got a major performance improvement once VBOs were turned on. Anyway, I shall look into that.

Originally posted by jide:

  3. I’m not really sure about that. VBOs were designed to be optimized for data transfers, not to be dynamic entities (a bit like the C++ new construct). Doing it the way you stipulated necessarily implies extra CPU/GPU work for the creation and deletion of VBOs; keeping VBOs static avoids that overhead.
    After all, whether you keep a VBO or create a new one each time, you always have to fill it with data. And if you keep the VBO, a call to BufferData with a NULL pointer frees the old storage anyway; deleting a VBO has to free the storage too.

The nVidia paper on Using VBOs explicitly states that, so I guess it’s true, at least on nVidia hardware. Secondly, I have also read this somewhere in the DirectX documentation on hardware-accelerated buffers. Sometimes you do need to modify existing data or upload new data to a VBO.
Read page 12 of the nVidia document, under the heading “Caveats in VBO”.

Cheers.

Originally posted by Zulfiqar Malik:
[b]The nVidia paper on Using VBOs explicitly states that, so I guess it’s true, at least on nVidia hardware. Secondly, I have also read this somewhere in the DirectX documentation on hardware-accelerated buffers. Sometimes you do need to modify existing data or upload new data to a VBO.
Read page 12 of the nVidia document, under the heading “Caveats in VBO”.

Cheers.[/b]
Can you elaborate on what you mean? I’m afraid I misunderstood you.

What I meant was that sometimes you are required to use a VBO for dynamic data, especially in cases where the data changes very infrequently. In such a case you have to modify the VBO data, but it is usually better to resend the whole data set to the VBO manager rather than locking the VBO, getting a pointer, modifying the data, and unlocking the buffer (see the sketch below).
The references I mentioned in the previous posts suggest just that :slight_smile: .
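
To illustrate, the lock/modify/unlock path I am advising against looks roughly like this (update() is a hypothetical function):

    glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
    void* ptr = glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB);
    if (ptr)
    {
        update(ptr);                           /* modify the data in place */
        glUnmapBufferARB(GL_ARRAY_BUFFER_ARB);
    }

    /* ...whereas resending everything is a single call: */
    glBufferDataARB(GL_ARRAY_BUFFER_ARB, size, newData, GL_DYNAMIC_DRAW_ARB);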

Originally posted by jide:
About vertex offsets, I actually didn’t see anything supporting your claim. Calling VertexPointer more than once per VBO seems to be a bad idea, but again, nothing about the cost of setting the offsets.
jide, you have to set up the pointers every time you switch VBOs - it’s in the very nature of the extension. Each vertex attrib pointer can source its data from a different buffer, so you have to respecify the offsets for each attribute whenever you change the buffer you want to draw from.
So, obviously, the fewer buffers you use, the better the performance is going to be, because you end up specifying fewer offsets.
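
As a sketch (buffer IDs hypothetical; an interleaved position + texcoord layout with a 20-byte stride):

    glBindBufferARB(GL_ARRAY_BUFFER_ARB, vboA);
    glVertexPointer(3, GL_FLOAT, 20, (const GLvoid*)0);    /* offsets are relative to vboA */
    glTexCoordPointer(2, GL_FLOAT, 20, (const GLvoid*)12);
    /* ... draw from vboA ... */

    glBindBufferARB(GL_ARRAY_BUFFER_ARB, vboB);
    glVertexPointer(3, GL_FLOAT, 20, (const GLvoid*)0);    /* every pointer must be set again */
    glTexCoordPointer(2, GL_FLOAT, 20, (const GLvoid*)12);
    /* ... draw from vboB ... */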

Ok guys, I did some testing on VBO performance in a different direction. Here’s the scenario and the results.

Approximately 1 million indexed vertices, and approximately the same number of triangles rendered. Since the vertices were on integral boundaries, I used integer vectors to represent them, i.e. the vertex format is

struct vertex
{
    int x, y, z;
};

whereas the indices were also specified as integers. Hardware-accelerated VBOs were used for both the vertex and the index data; they were built in the initialization phase and never touched again. Here are the results:

On an Athlon XP 2500+ with a GeForce FX 5900 XT with 128 MB of RAM (2.2 ns memory) and 512 MB of system RAM, I am getting a measly 40 million triangles per second, consistently (I varied the batch size, e.g. 0.5 million). I believe that’s quite pathetic, to say the least. I am sure the card can do better than that, but I can’t figure out the reason for such performance. Can someone give me useful pointers?
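
In case it matters, the init-time setup is roughly this (a sketch; my actual variable names differ):

    /* Static vertex VBO, built once and never touched again. */
    glGenBuffersARB(1, &vbo);
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
    glBufferDataARB(GL_ARRAY_BUFFER_ARB, numVerts * sizeof(struct vertex),
        vertices, GL_STATIC_DRAW_ARB);

    /* Static index VBO. */
    glGenBuffersARB(1, &ibo);
    glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, ibo);
    glBufferDataARB(GL_ELEMENT_ARRAY_BUFFER_ARB, numIndices * sizeof(GLuint),
        indices, GL_STATIC_DRAW_ARB);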

Originally posted by knackered:
[quote]Originally posted by jide:
About vertex offsets, I actually didn’t see anything supporting your claim. Calling VertexPointer more than once per VBO seems to be a bad idea, but again, nothing about the cost of setting the offsets.
jide, you have to set up the pointers every time you switch VBOs - it’s in the very nature of the extension. Each vertex attrib pointer can source its data from a different buffer, so you have to respecify the offsets for each attribute whenever you change the buffer you want to draw from.
So, obviously, the fewer buffers you use, the better the performance is going to be, because you end up specifying fewer offsets.
[/QUOTE]Okay, I didn’t see it like that.

[b]the vertex format is

struct vertex
{
    int x, y, z;
};


I am getting a measly 40 million triangles per second, consistently[/b]
Could it be related to this thread? You might want to test using float instead of int.

Yeah, I have read that thread before, and I ran the benchmark in three configurations: short, integer (for 4-byte alignment), and float. Both short and integer give around 40 MTris/s, whereas float gives a slight increase to around 44 MTris/s at the cost of double the memory (w.r.t. shorts, assuming the driver doesn’t expand the shorts to floats internally, which shouldn’t be necessary imho). Well, all I can say is … quite pathetic, and there is absolutely nothing else going on. I build the static vertex and index buffers at the initialization stage. Then my render loop is reduced to:

	glBindBufferARB(GL_ARRAY_BUFFER_ARB, m_root.bufferID);                     // static vertex VBO
	glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, m_indexBuffers[g_currLevel]); // static index VBO
	glEnableClientState(GL_VERTEX_ARRAY);

	// Positions only: 3 floats per vertex, tightly packed, offset 0 into the bound VBO.
	glVertexPointer(3, GL_FLOAT, 0, (const GLvoid*)NULL);
	//glDrawElements(GL_TRIANGLES, 3 * m_root.numQuads * 2, g_indexFormat, (const GLvoid*)NULL);
	glDrawRangeElements(GL_TRIANGLES, 0,
		m_indexOffsets[g_currLevel + 1] - m_indexOffsets[g_currLevel], // range of indices referenced
		3 * m_root.numQuads * 2,                                       // index count: 2 triangles per quad
		GL_UNSIGNED_INT, NULL);

	glDisableClientState(GL_VERTEX_ARRAY);
	glBindBufferARB(GL_ARRAY_BUFFER_ARB, 0);
	glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, 0);

Note that all buffers were created with the GL_STATIC_DRAW_ARB flag. I have tried calling glVertexPointer(…) just once at the start rather than every frame, but that doesn’t give any speedup at all!

Can someone (an nVidia engineer?) provide some information? I can post the entire code if need be.

With such small differences (10%), I must ask - is your CPU pegged when you do these tests?

I have managed to (I think) get peak performance out of my rather old ATI card (a 9250) on a sub-500 MHz computer, using VBOs for all data (including indices), while using less than 5% CPU (obviously, I then sleep after the last flush, for just a little less than the rendering time measured for the previous frame).

Busy-waiting (i.e. calling glFinish without flush+sleep) only added ~5% more speed, at the expense of the CPU being 100% pegged all the time (for a 500 MHz CPU that might not be “much”, but think of a 3.6 GHz monster busy-waiting; I wouldn’t want to pay that electricity bill).
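
In code, the pacing is roughly this (a sketch; sleepSeconds() and prevFrameTime are placeholders for your own sleep call and timer):

    glFlush();                          /* hand the frame to the driver, no busy-wait */
    sleepSeconds(prevFrameTime * 0.9);  /* sleep a little less than the last frame took */
    /* ... then swap buffers and start the next frame ... */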

Are you sure you’re not fill-rate limited? Try rendering to a small window, or better still:

    glEnable(GL_CULL_FACE);
    glCullFace(GL_FRONT_AND_BACK);

With GL_FRONT_AND_BACK every triangle is culled before rasterization, so vertices are still processed but nothing is filled.

It’s also possible the drivers are placing the data in AGP memory instead of video memory. On some systems that can be a major bottleneck. What happens to your performance if you render much less than half a million vertices?

40 million tris/sec is definitely too low for that system. I know you should be able to get at least 100 MTris/sec.

Originally posted by Adrian:

Are you sure you’re not fill-rate limited? Try rendering to a small window, or better still:

    glEnable(GL_CULL_FACE);
    glCullFace(GL_FRONT_AND_BACK);

It’s also possible the drivers are placing the data in AGP memory instead of video memory. On some systems that can be a major bottleneck. What happens to your performance if you render much less than half a million vertices?

40 million tris/sec is definitely too low for that system. I know you should be able to get at least 100 MTris/sec.

I shouldn’t be fill-rate limited, since the card’s theoretical throughput is way more than what I am getting; I don’t remember the exact specs, but it must be above 100 MTris/s (if you mean triangle throughput). Back-face culling was enabled before, and I also tried glCullFace(GL_FRONT_AND_BACK), without any boost.

I wonder why the drivers would be using AGP memory, because there is no other 3D application running and the card has virtually the entire 128 MB at its disposal; the driver should be smart enough to choose the best memory in this particular case. Nonetheless, that’s an important point, since I am using the latest nVidia drivers and I have noticed a few problems with them before. I might as well roll back the drivers and check the results with an earlier, more stable version.

I tried a couple of earlier versions of nVidia’s display drivers, but to no effect. I am getting almost the same results :frowning: .

This scenario is driving me nuts; I am going to post it on the nvdeveloper forums.