
View Full Version : Batching and VBO's



Zulfiqar Malik
07-27-2005, 10:42 PM
Ok, conventional wisdom suggests batching as many vertices as possible and sending them to the video card in one go. But suppose that instead of one batch of 4096 vertices, you have 4 batches of 1024 vertices, each in a separate VBO in video memory. My question is: how much of a performance hit (from a batching perspective) do you take if you bind the 4x1024 batches one by one instead of binding the one big 4096-vertex batch? I know there is usually some setup involved in binding a VBO, but I assume it is really small (considering no other state changes like texturing or shaders are taking place). So if the driver keeps all the VBOs in video memory (assuming the memory is large enough), can I assume the performance of n small batches is almost equivalent to the performance of one BIG batch for the same number of vertices? Can someone please explain the low-level details (someone from nVidia?)?

I also want to know the impact of draw calls on performance. Consider the buffers as stated above: I draw the entire 4096-triangle buffer using one glDrawRangeElements(...) call, and draw the 4 1024-triangle buffers using 4 glDrawRangeElements(...) calls. How much of a performance hit do I get? Would it be noticeable? Can someone explain the low-level details, or refer me to an article? Assume the index buffers are completely in video memory.

I don't know whether this has been discussed before; if it has, please point me to the thread.

Thanks in advance.

jide
07-28-2005, 12:32 AM
I don't think there will be much difference at all. When you put data in graphics memory through VBOs, it goes through AGP/PCI Express, i.e. with very high bandwidth (several GB/s). So you'll have to deal with that; but considering that no one uploads hundreds of MB every frame, the upload will be done very quickly.
Of course there will be some small performance hit if you use several VBOs instead of a single one, but it shouldn't be noticeable at all. What is the impact of a few extra fast function calls? About zero compared to everything else.

For drawing it's almost the same, even if in practice few people draw one big array; there are considerations other than pure drawing performance behind that.
Also, if I'm not wrong, small arrays benefit from the vertex cache (pre-fetch), whereas big arrays will generally defeat that cache.

Hope that helps.

knackered
07-28-2005, 01:00 AM
How could that help, jide?
I don't think it would be possible to be more vague.
You're actually wrong to underestimate the cost of switching VBOs: setting up the vertex offsets is expensive.
For more information refer to the vendors' documents; don't listen to gossip and hearsay.
Examples:
http://www.ati.com/developer/gdc/PerformanceTuning.pdf
http://developer.nvidia.com/object/using_VBOs.html

jide
07-28-2005, 01:46 AM
I think it really depends on many factors, but given what he asked, how would you have answered (since you didn't...)? And I don't just believe what I hear; I run tests of my own.

I've done several tests, and I personally didn't see any difference between one big VBO and several (up to many) smaller ones - whether I used a single call to BufferData or several calls to BufferSubData (tested with approximately 10 MB of data). I must admit I didn't use mapping, even if it may be better.

What's more, for real-time rendering it is really difficult to keep everything in a single VBO, because of many things: static geometry versus dynamic geometry, frustum culling, drawing in a sorted order (like front to back)...

PS: You said I'm wrong to underestimate the cost of switching VBOs. Maybe. But here is what the links you gave say about it:



The binding operation is cheap compared to the setup of various pointers.
So, if I understand you correctly, you're not on very solid ground...

You bind a VBO when you need to switch to a new VBO, don't you? So VBO binding does not seem to be a real bottleneck at all.

About vertex offsets, I didn't see anything supporting your point. Calling VertexPointer more than once per VBO does not seem to be a good idea. But again, nothing about the cost of setting the offsets.

PS2: Now show us what you know, since you're so good! :)

RigidBody
07-28-2005, 01:51 AM
if you have 2 options for implementing an algorithm, you can:

(1) start a thread in which the 2 options are discussed and watch the thread for 2 or 3 days

or

(2) implement both options, then measure and compare the performance, if possible on different systems.

personally, i would choose answer (2)

i don't mean to offend you for posting this question, but i really think it's not too difficult to program both algorithms plus a frames/s counter.

Zulfiqar Malik
07-28-2005, 02:36 AM
No offense taken, RigidBody. However, I posted this question after doing quite a bit of testing on different machines, and here's a summary of what I have experienced so far:

1. The cost of binding a VBO is negligible on both ATI and nVidia hardware.
2. Multiple draw calls don't affect performance that much.
3. Creating a VBO is definitely expensive, but far less so than updating a read/write VBO. Hence it's usually a good idea to create write-only VBOs and re-create them when the vertex data needs to be modified.

Another thing I noticed on ATi 9700 Pro hardware (using the latest drivers) was that the performance difference between regular vertex arrays and VBOs was negligible. At first I thought it might be a driver bug, but the inconsistency didn't go away with newer drivers. The inconsistency between VBOs on nVidia and ATi hardware got me all confused, and since I don't have many machines to test on (I actually have only 2 :) ), I wanted the opinions of people who have done good research on this subject.

Sorry for not posting this earlier, but the original post was long enough already.

And please, guys, stop behaving like people on noobish gamedev-type forums, where threads end with people flaming each other rather than discussing.

Nicolas Lelong
07-28-2005, 03:29 AM
Originally posted by Zulfiqar Malik:
Another thing I noticed on ATi 9700 Pro hardware (using the latest drivers) was that the performance difference between regular vertex arrays and VBOs was negligible.
I believe this performance difference is negligible for your test case (or your application's data configuration). In my own experience, as soon as you have big chunks of geometry around, CVA can't match VBO anymore.

You can have a look at a simple tool that shows the performance difference here:
http://www.fl-tw.com/opengl/GeomBench/

Here, on my ATI 9600 machine, I got something like this:

Static VAO (H 0 F L S): 89.67 FPS, 68.87 MTS, 768k Tris (T177/288)
Static VAO (H 0 I L S): 89.00 FPS, 68.36 MTS, 768k Tris (T181/288)
Compiled vertex arrays (H 0 F L S): 8.77 FPS, 6.74 MTS, 768k Tris (T81/288)
Compiled vertex arrays (H 0 I L S): 8.73 FPS, 6.71 MTS, 768k Tris (T85/288)

FWIW, I think you should keep in mind that benchmarking has to be done with real datasets to be relevant for your application.

Cheers,
Nicolas.

jide
07-28-2005, 03:57 AM
Originally posted by Zulfiqar Malik:

1. The cost of binding a VBO is negligible on both ATI and nVidia hardware.
2. Multiple draw calls don't affect performance that much.
3. Creating a VBO is definitely expensive, but far less so than updating a read/write VBO. Hence it's usually a good idea to create write-only VBOs and re-create them when the vertex data needs to be modified.
1. This looks true.

2. It can depend. But generally, one call drawing a big array performs essentially the same as several calls drawing small arrays (assuming they belong to the same vertex array, i.e. using indices, a start index...). Using MultiDrawArrays didn't help much either (in my tests).

3. I'm not really sure about that. VBOs were designed to optimize data transfers, not to be dynamic entities (a bit like the C++ new construct). Doing it the way you described necessarily implies extra CPU/GPU work for the creation and deletion of VBOs; keeping VBOs static avoids that sort of overhead.
After all, whether you keep a VBO or create a new one each time, you always have to fill it with data. And if you keep the VBO, a call to BufferData with a NULL pointer frees the old storage; deleting a VBO has to free the storage anyway.



And please, guys, stop behaving like people on noobish gamedev-type forums, where threads end with people flaming each other rather than discussing.
I'm sorry for that; I'm not used to fighting, it's not in my habits.

Finally, as Nicolas Lelong said, try to test with a relevant amount of data. Using only 1 or 4 KB per VBO won't happen much in real situations. Also, there may be optimizations (related to memory pages, if I remember well) that could skew your results.

Hope that helps.

Zulfiqar Malik
07-28-2005, 04:20 AM
Originally posted by Nicolas Lelong

I believe this performance difference is negligible for your test case (or your application's data configuration). In my own experience, as soon as you have big chunks of geometry around, CVA can't match VBO anymore.
hmmmm ... interesting, because my BSP renderer doesn't get much of a performance boost from VBOs on ATi hardware, and I am pretty sure it's not a CPU bottleneck or anything else outside the GPU, because a similar PC (exact same configuration) with a GFX 5700 Ultra got a major performance improvement once VBOs were turned on. Anyway, I shall probe into that.



Originally posted by jide:

3. I'm not really sure about that. VBOs were designed to optimize data transfers, not to be dynamic entities (a bit like the C++ new construct). Doing it the way you described necessarily implies extra CPU/GPU work for the creation and deletion of VBOs; keeping VBOs static avoids that sort of overhead.
After all, whether you keep a VBO or create a new one each time, you always have to fill it with data. And if you keep the VBO, a call to BufferData with a NULL pointer frees the old storage; deleting a VBO has to free the storage anyway.

nVidia's paper on using VBOs explicitly states that, so I guess it's true, at least on nVidia hardware. Secondly, I have also read this somewhere in the DirectX documentation on hardware-accelerated buffers. Sometimes you do need to modify existing data or upload new data to a VBO.
Read page 12 of the nVidia document, under the title "Caveats in VBO".

Cheers.

jide
07-28-2005, 04:47 AM
Originally posted by Zulfiqar Malik:
The nVidia's paper on Using VBOs explicitly states that, so i guess its true, at least on nVidia's hardware. Secondly i have also read this somewhere in DirectX documentation on hardware accelerated buffers. Sometimes you do need to modify existing data or upload a new data to the VBO.
Read page 12 in the nVidia's document under title "Caveats in VBO".

Cheers.
Can you elaborate on what you mean? I'm afraid I've misunderstood you.

Zulfiqar Malik
07-28-2005, 05:40 AM
What I meant was that sometimes you need to use a VBO for dynamic data, especially when the data changes very infrequently. In such a case you have to modify the VBO data, but it is usually better to resend the whole data set rather than locking the VBO, getting a pointer, modifying the data, and unlocking the buffer.
The references I mentioned in the previous posts suggest just that :) .

knackered
07-30-2005, 01:38 AM
Originally posted by jide:
About vertex offsets, I actually didn't see anything supporting your point. Calling VertexPointer more than once per VBO seems not a good idea. But again, nothing about setting the offsets.
jide, you *have* to set up the pointers every time you switch VBOs - it's in the very nature of the extension. Each vertex attrib pointer can source its data from a different buffer, so you have to respecify the offsets for each attribute when you change the buffer you want to draw from.
So, obviously, the fewer buffers you use, the better the performance is going to be, because the fewer offsets you're going to be specifying.

Zulfiqar Malik
07-30-2005, 01:59 AM
Ok guys, I did some testing on VBO performance in a different direction. Here's the scenario, and the results.

Approx. 1 million indexed vertices, and approximately the same number of triangles rendered. Since the vertices were on integral boundaries, I used integer vectors to represent them, i.e. the vertex format is

struct vertex
{
    int x, y, z;
};

Indices were also specified as integers. Hardware-accelerated VBOs were used for both vertex and index data; they were compiled in the initialization phase and never touched again. Here are the results:

On an Athlon XP 2500+ with a GeForce FX 5900 XT with 128 MB RAM (2.2 ns memory) and 512 MB system RAM, I am getting a measly 40 million triangles per second, consistently (I varied the batch size, e.g. 0.5 million). I believe that's quite pathetic, to say the least. I am sure the card can do better than that, but I can't figure out the reason for such performance. Can someone give me useful pointers?

jide
07-30-2005, 02:14 AM
Originally posted by knackered:

Originally posted by jide:
About vertex offsets, I actually didn't see anything supporting your point. Calling VertexPointer more than once per VBO seems not a good idea. But again, nothing about setting the offsets.
jide, you *have* to set up the pointers every time you switch VBOs - it's in the very nature of the extension. Each vertex attrib pointer can source its data from a different buffer, so you have to respecify the offsets for each attribute when you change the buffer you want to draw from.
So, obviously, the fewer buffers you use, the better the performance is going to be, because the fewer offsets you're going to be specifying.
Okay, I didn't see it like that.

tamlin
07-30-2005, 08:40 AM
the vertex format is

struct vertex
{
    int x, y, z;
};

...
i am getting a measly 40 Million Triangles per second consistently
Could it be related to this thread (http://www.opengl.org/discussion_boards/cgi_directory/ultimatebb.cgi?ubb=get_topic;f=3;t=013552)? You might want to test using float instead of int.

Zulfiqar Malik
07-30-2005, 12:27 PM
Yeah, I have read that thread before, and I ran the benchmark in three configurations: short, integer (for 4-byte alignment), and float. Both shorts and integers give around 40 MTris/s, whereas floats give a slight increase to around 44 MTris/s at the cost of double the memory (w.r.t. shorts, assuming the driver doesn't expand the shorts to floats internally, which shouldn't be necessary imho). All I can say is... quite pathetic, and there is absolutely nothing else going on. I compile the static vertex and index buffers at the initialization stage. My render loop is then reduced to:




glBindBufferARB(GL_ARRAY_BUFFER_ARB, m_root.bufferID);
glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, m_indexBuffers[g_currLevel]);
glEnableClientState(GL_VERTEX_ARRAY);

glVertexPointer(3, GL_FLOAT, 0, (const GLvoid*)NULL);
//glDrawElements(GL_TRIANGLES, 3 * m_root.numQuads * 2, g_indexFormat, (const GLvoid*)NULL);
glDrawRangeElements(GL_TRIANGLES, 0, m_indexOffsets[g_currLevel + 1] - m_indexOffsets[g_currLevel],
                    3 * m_root.numQuads * 2, GL_UNSIGNED_INT, NULL);

glDisableClientState(GL_VERTEX_ARRAY);
glBindBufferARB(GL_ARRAY_BUFFER_ARB, 0);
glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, 0);

Note that all buffers were created with the GL_STATIC_DRAW_ARB flag. I have tried calling glVertexPointer(...) just once at startup instead of every frame, but that doesn't give any speedup at all!

Can someone (nVidia engineer?) give some information? I can post the entire code if need be.

tamlin
07-30-2005, 06:21 PM
With such small differences (10%), I must ask: is your CPU pegged when you do these tests?

I have managed to (I think) get peak performance out of my rather old ATI card (9250) in a sub-500 MHz computer, using VBOs for all data (including indices), with less than 5% CPU use (obviously, I then sleep after the last flush, for a little less than the previously measured frame rendering time).

Busy-waiting (i.e. calling glFinish without flush+sleep) only added ~5% more speed, at the expense of the CPU being 100% pegged all the time (for a 500 MHz CPU that might not be "much", but think of a 3.6 GHz monster busy-waiting - I wouldn't want to pay that electricity bill).

Adrian
07-30-2005, 06:36 PM
Are you sure you're not fill-rate limited? Try rendering to a small window, or better still:
glEnable(GL_CULL_FACE);
glCullFace(GL_FRONT_AND_BACK);

It's also possible the drivers are using AGP memory instead of video memory. On some systems that can be a major bottleneck. What happens to your performance if you render much less than half a million vertices?

40 million tris/sec is definitely too low for that system. I know you should be able to get at least 100 MTris/sec.

Zulfiqar Malik
07-30-2005, 09:06 PM
Originally posted by: Adrian

Are you sure you're not fillrate limited? Try rendering to a small window or better still
glEnable(GL_CULL_FACE);
glCullFace(GL_FRONT_AND_BACK);

It's also possible the drivers are using agp memory instead of video memory. On some systems that can be a major bottleneck. What happens to your performance if you render much less than half a million vertices?

40 Million tris/sec is definitely too low for that system. I know you should be able to get at least 100 Mtris/sec.

I shouldn't be fill-rate limited, since the theoretical fill rate is way more than what I am getting; I don't remember the exact specs, but it must be above 100 MTris/s (if you are talking polygon throughput). Back-face culling was enabled before, and I tried glCullFace(GL_FRONT_AND_BACK) without any boost.

I wonder why the drivers would be using AGP memory, because there is no other 3D application running and the driver has virtually the entire 128 MB at its disposal; it should be smart enough to choose the best memory in this particular case. Nonetheless, that's an important point, since I am using the latest nVidia drivers and I have noticed a few problems with them before. I might as well roll back the drivers and check the results with an earlier, more stable version.

Zulfiqar Malik
07-30-2005, 10:35 PM
Tried a couple of earlier versions of nVidia's display drivers, but to no effect - I'm getting almost the same results :( .

This scenario is driving me nuts; I'm gonna post it on the nvdeveloper forums.

jide
07-31-2005, 07:55 AM
Originally posted by Zulfiqar Malik:
I have tried doing glVertexPointer(...) just once in the start and not doing it each frame, but that doesn't give any speedup at all!
This won't change anything, since you use a single VBO with a single array. You really can use several VBOs, each with several arrays, without noticing any performance issues (though this can depend on several factors).

knackered
07-31-2005, 12:45 PM
Originally posted by Zulfiqar Malik:
Tried with a couple of earlier versions of nVidia's display drivers, but to no effect. Getting almost the same results :( .

This scenario is driving me nuts, i am gonna post it on nvdeveloper forums.
Exactly what is the size of the VBO you're using?
When I did my benchmarking, I found there was a limit on the size of a VBO before performance started to suffer. Split your 1 million verts into several buffers and run the test again.

Zulfiqar Malik
07-31-2005, 08:15 PM
Finally got it fixed!
Luckily I found a very small note about VBO performance in the GPU Programming Guide on nVidia's website: performance with VBOs on nVidia hardware is optimal when batches of 64K are used, and bigger batches can hurt performance. I did just that, and throughput increased from 40 MTris/s to 110 MTris/s at 640x480 - still a little less than what I would have liked, but good enough :) . I wonder if there is a similar limit (a performance limit, that is) on ATi hardware, because I couldn't find any references on ATi's website (Humus?). Nonetheless, the problem has been fixed.
These graphics companies keep saying bigger batches are better, and never do they mention the upper limit for optimal performance, except for this very tiny reference in the GPU Programming Guide. I think they should be more vocal about this!

knackered
07-31-2005, 11:52 PM
Yep, that's exactly what I found when doing a terrain engine. I'm pretty sure nVidia mentions it in their various PDFs.
You'll find the same with texture sizes: there's a sweet spot in texture size when uploading texture data dynamically.

Adrian
08-01-2005, 02:45 AM
If you gradually increase the batch size beyond 64K, does performance drop suddenly, or is it gradual?

I wonder if VAR had this issue. I suspect not.

Out of curiosity what is the AGP speed of your motherboard? 4x or 8x?

Hampel
08-01-2005, 02:45 AM
@Zulfiqar: did you try to align your vertex data on 32-byte boundaries? E.g.:

struct Vertex {
    float x, y, z;
    float w; // always == 1.0f
};

@knackered: what is your sweet spot w.r.t. texture sizes and texture formats (e.g. precompressed S3TC)? :)

jide
08-01-2005, 04:23 AM
Originally posted by Adrian:
If you gradually increase the batch size from 64k does performance suddenly drop or is it a gradual thing.

I wonder if VAR had this issue. I suspect not.

Out of curiosity what is the AGP speed of your motherboard? 4x or 8x?
Since (if I remember well) he uses static arrays, this won't change anything except the upload time, which won't be noticeable anyway.

zeckensack
08-01-2005, 04:58 AM
Originally posted by jide:
As (if I remember well) he uses static arrays, this won't have any change at all but during uploading time which won't be noticeable anyway.
And yet his last post (which is four posts above the one I'm quoting, btw) proves the opposite :rolleyes:

Edit for clarification:
This is the proven-wrong part: "won't have any change at all but during uploading time".

Zulfiqar Malik
08-01-2005, 05:20 AM
@Adrian: Yes, Adrian, performance dropped suddenly. In one instance I was using slightly less than 64K batches and getting close to 100 MTris/s; in the next, I increased the batch size to around 66K and performance dropped to around 40 MTris/s.

@Hampel: Yes, I did align the data on 32-byte boundaries, with no performance improvement. Secondly, I use 3 short vectors (a total of 18 bytes per vertex), and wasting 14 bytes just to get the alignment right wasn't much of an option for me :) . Memory usage in my case is critical! I don't know whether the driver will internally align it to 32 bytes or not, but I think it should not!

@zeckensack: jide is right, zen - I specified the batch size at compile time, and my vertex buffers were always static ;) .

Adrian
08-01-2005, 06:53 AM
Originally posted by jide:
As (if I remember well) he uses static arrays, this won't have any change at all but during uploading time which won't be noticeable anyway.
What if the static arrays were being stored in AGP memory and not video memory? Then AGP speed does matter, since the vertices are uploaded every time they are drawn. What I think is happening is that batches over 64K are put in AGP memory, and smaller ones in video memory. Zulfiqar, is your motherboard AGP 4x or 8x? If it is 4x, then I think my hunch is right; if it's 8x, it probably isn't.

jide
08-01-2005, 08:16 AM
What you say is pertinent, Adrian. However, I can't see why the data would be stored in AGP memory, especially as he is not using any texturing, so the graphics card is using very little memory.

Generally (from what I know), the AGP aperture is only used when there isn't enough memory left on the graphics card. Maybe trying to limit (or better, disable) AGP memory would help find out...

Why would memory be allocated in AGP instead of on the graphics card if batches are more than 64KB? Does anyone have information about all this?

Zulfiqar Malik
08-01-2005, 07:58 PM
@Adrian: I have tried these scenarios on two machines: a GFX 5700 Ultra (AGP 4x) on an AGP 4x motherboard, and a GFX 5900 XT (AGP 8x) on an AGP 8x motherboard. On both machines I get exactly the same throughput (in terms of raw triangle count) for a given batch size.

I was just wondering whether someone could test this on an ATi card, or whether there is anything documented online about the best batch size for ATi cards. I do have a 9700 Pro machine in the office, and I will test on it, but perhaps in a couple of days.

I can't help but agree with jide: with video memory almost free, the driver should use it whether the batch size is optimal or not, and fall back to AGP memory only when video memory is exhausted - or at least when the maximum amount of memory dedicated to vertex data is exhausted (if there is such a thing as a maximum for vertex buffers!).

Maybe an engineer from nVidia's driver team can answer this question, but I haven't gotten any response so far on the nvdeveloper forum.

Adrian
08-01-2005, 10:35 PM
That information suggests it's not to do with AGP vs. video memory, since I would have expected a performance difference between the AGP 8x and AGP 4x motherboards.

knackered
08-01-2005, 11:50 PM
the days of nvidia driver writers frequenting these forums are long since gone. We used to have Cass Everitt and Matt Craighead at our beck and call, but not anymore... those were the days, my friend, when register combiners ruled.
The most you can hope for nowadays is that Humus (who now works for ATI, in what capacity I don't know) might pop upstairs to the driver writers and ask them what they think.
Or you can all just carry on guessing why there'd be a 64K limit on a VBO batch on nvidia hardware.

tamlin
08-01-2005, 11:58 PM
I think it's as simple as this: you can't address more than 64K vertices using ushort indices, and the hardware is only optimized for that case.

Zulfiqar Malik
08-02-2005, 07:02 AM
@tamlin: I tried all sorts of variations, my friend: ints and shorts for the indices, and floats, integers, shorts, etc. for the vertex data. I tried quite a few permutations, and each time I got the same results. I would still be lost had I not found that GPU Programming Guide excerpt. Do get it from nVidia's site and search for VBOs; there are just a few lines on them, and one mentions 64K as being the optimal (NOT a hard limit!) batch size for nVidia hardware.

knackered
08-02-2005, 09:17 AM
Originally posted by Zulfiqar Malik:
I would have been lost still had i not found that GPU Programming Guide excerpt.
Err, well, no, you wouldn't have been lost, because I told you the problem a couple of posts ago. It's actually a fairly well-known fact, and a quick search of these forums would have given you the answer in minutes.

Zulfiqar Malik
08-02-2005, 08:08 PM
hahaha, sorry knackered, totally forgot about you. Thanks for reminding me.

tamlin
08-03-2005, 12:02 AM
Originally posted by Zulfiqar Malik:
I tried with all sorts of variations my friend
Seems I wasn't clear enough: I was referring to the 64K-vertex threshold. I can easily imagine the hardware having hard-coded fast paths for popular vertex sizes with unsigned short indices, but for more unusual vertex sizes or - what I was thinking of - more than 64K vertices, it would take a not-so-well-optimized path.

Unless it's already in the pipe(s), I think this is something hardware vendors should seriously consider, especially now that they ship 1/4 -- 1/2 GB of RAM on board, and more seems likely to come. For some cases today, and likely many more in the future, 64K vertices per VBO will not be enough. It therefore makes sense (to me) to raise that 16-bit index limit to at least the next level - 24 bits (*that* should last a while, no? :) ) - even if the indices must then be 32 bits (2^n is a bitch at times).