TRIANGLE_STRIP + VBO

I asked this in the beginners forum, but got no response. I found that VBOs provide about a 3x performance improvement with an interleaved array (vertices + normals) and glDrawArrays.

Now I want to add my triangle strips. However, with triangle strips it seems to perform much, much slower.

Basically, I generate n buffers (VBOs), one for each triangle strip, and then in my draw routine I do the following:

//Not my exact calls, but there is no mistake
//I checked
for( int i = 0; i < numStrips.size(); i++ )
{
    gExt.myVBO.glBindBufferArray( vboStripIDs[i] );
    glInterleavedArrays( GL_N3F_V3F, sizeof(VBOVertex), NULL );
    glDrawArrays( GL_TRIANGLE_STRIP, 0, numStrips[i] );
}

Anybody know a better way to use TRIANGLE_STRIPS?

Originally posted by maximian:
Anybody know a better way to use TRIANGLE_STRIPS?

Try putting all of your triangle strips in one buffer and binding the VBO only once. You can iterate using the ‘start’ and ‘count’ fields to point to your individual triangle strips or, even better, you can use the MultiDraw API to do this automatically.
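For reference, a minimal sketch of the MultiDraw path, assuming all strips have been packed into one VBO; 'vboAllStripsID', 'firsts', 'counts' and 'stripCount' are hypothetical names, and glMultiDrawArrays is core in GL 1.4 (also exposed as EXT_multi_draw_arrays):

// One bind and one pointer setup, then every strip in a single call.
// firsts[i] = first vertex of strip i, counts[i] = vertex count of strip i.
glBindBufferARB( GL_ARRAY_BUFFER_ARB, vboAllStripsID );
glInterleavedArrays( GL_N3F_V3F, sizeof(VBOVertex), NULL );
glMultiDrawArrays( GL_TRIANGLE_STRIP, firsts, counts, stripCount );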

Also, I’ve heard that indexed triangles (with proper cache reuse) can be faster than strips on recent cards (for nvidia, perhaps ATI) and may be easier to build into single VBOs.
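For reference, a minimal sketch of the indexed path with both arrays in buffer objects; 'vboVertexID', 'vboIndexID' and 'numTris' are hypothetical names:

// Vertex data and indices each live in their own buffer object.
glBindBufferARB( GL_ARRAY_BUFFER_ARB, vboVertexID );
glInterleavedArrays( GL_N3F_V3F, sizeof(VBOVertex), NULL );
glBindBufferARB( GL_ELEMENT_ARRAY_BUFFER_ARB, vboIndexID );
glDrawElements( GL_TRIANGLES, 3 * numTris, GL_UNSIGNED_INT, NULL );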

Interleaving is good, but not essential as long as the number of separate arrays is low. Try to keep the size of each vertex slot in each bound array (if there is more than one) to a multiple of 16 bytes (16, 32, etc.) for better memory-fetch speed.

Avi

Hi, thanks.

I tried indexed triangles, but they performed no better than regular immediate mode.

I.e., glDrawElements, with a VBO for the indices into the triangles.

I made your modification, and now the performance is ~30 fps for 300K polygons, for both regular triangles and triangle strips. I am confused! Triangle strips have fewer vertices!

32 bit aligned also!

I use the actc library.
FX 5600 Ultra.

Maybe you are fill limited.

Not at all, I just tested with a much smaller window size!

Actually, lowering my refresh rate from 70 to 60 Hz increased my fps from ~23 to ~30.

I am so disappointed with this video card and VBOs in general. I guess consumer cards are great at accelerating various effects and vertex shaders, but become useless when you throw enough geometry at them.

[This message has been edited by maximian (edited 11-30-2003).]

I wouldn't say that. Current cards have high GPU/memory speeds, so their transform/fill rates are incomparable with previous generations, except for the 5200 (Ultra/non-Ultra), 5600 (non-Ultra), 9000, and 9200.

“Actually, lowering my refresh rate from 70 to 60 Hz increased my fps from ~23 to ~30.”

70/3 = 23.33
60/2 = 30
Looks like you're measuring with wait-for-vertical-blank (vsync) enabled. Go to the display control panel's OpenGL page and set it to "always off".
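For what it's worth, a minimal sketch of turning vsync off from the application on Windows, assuming the driver exposes the WGL_EXT_swap_control extension:

#include <windows.h>

// Check the WGL extension string before relying on this in real code.
typedef BOOL (WINAPI *PFNWGLSWAPINTERVALEXTPROC)(int interval);

PFNWGLSWAPINTERVALEXTPROC wglSwapIntervalEXT =
    (PFNWGLSWAPINTERVALEXTPROC)wglGetProcAddress( "wglSwapIntervalEXT" );
if( wglSwapIntervalEXT )
    wglSwapIntervalEXT( 0 );   // 0 = do not wait for the vertical blank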
Anyway I also wouldn’t recommend one strip per buffer.

I guess consumer cards are great at accelerating various effects and vertex shaders, but become useless when you throw enough geometry at them.

Or maybe the simple answer is that you're not using them correctly?

You must be making a fundamental mistake somewhere. If your bottleneck is not the geometry transformation, don't complain about the geometry transform speed…

Y.

I am not incorrect about this. 320K polygons should easily fit in my video memory, so multiple buffers are not helpful. Actually, I tried multiple buffers and it was slower because of the repeated rebinding.

As for not doing it correctly, there is not much to it. I tried a VBO with an index array, and a VBO with interleaved arrays + glDrawElements.

There is not much to VBOs, just a few function calls, so I hardly think there is a mistake. Besides, I am getting some acceleration, just not the amount I would expect.

Maybe you should try a single VBO (one for vertices and one for indices) by concatenating the strips with degenerate triangles. I think most cards recognize them now. With indexing this should be pretty quick. It's something I would like to test soon, so if you try it before me I'd like some feedback. A priori you reduce the bandwidth for indices compared to indexed triangles.
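For reference, a minimal sketch of the stitching idea (the helper name and containers are hypothetical): repeating the last index of one strip and the first index of the next produces zero-area triangles that the hardware can reject cheaply.

#include <vector>

// Concatenate several triangle strips into one index list by inserting
// degenerate triangles between them (sketch only).
std::vector<unsigned int>
stitchStrips( const std::vector< std::vector<unsigned int> >& strips )
{
    std::vector<unsigned int> out;
    for( size_t s = 0; s < strips.size(); ++s )
    {
        const std::vector<unsigned int>& strip = strips[s];
        if( strip.empty() ) continue;
        if( !out.empty() )
        {
            out.push_back( out.back() );      // repeat last index of the previous strip
            out.push_back( strip.front() );   // repeat first index of the next strip
        }
        out.insert( out.end(), strip.begin(), strip.end() );
    }
    // Draw the result with glDrawElements( GL_TRIANGLE_STRIP, out.size(), ... ).
    return out;
}

Note that when a strip has an odd vertex count you may need one extra repeated index to keep the winding consistent.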

Thank you.

I tried that; it was the first thing I did. I allocated two VBO buffers, one for elements, the other for vertices. Then I used glDrawElements. This produced performance equivalent to immediate mode. In fact, performance was the same between a GeForce 2 and a GeForce 5600 Ultra!

330K polygons
~10 fps.

I assume when you wrote "32 bit aligned also" you meant byte, not bit. FWIW, N3F_V3F is (3+3)*4 = 24 bytes, which is not a multiple of 16 bytes. However, the difference between 24 and 32 byte vertices shouldn't be quite that large.

Anyway, yes, the single VBO for vertices is simply meant to reduce the API overhead. If you're truly transform limited, it shouldn't matter much, but it could make a small difference if you're transfer limited (i.e., your buffers ended up in system RAM).

If you were fill limited, you shouldn’t be seeing much difference with any of this unless you’re unwittingly changing the amount or position of your geometry. That’s easy enough to test.

However, at this point, I'd suggest posting the code you use to create your VBOs (mostly to see which usage hints you use: system RAM vs. AGP/video memory) and your exact rendering code for these objects. There's a chance you're either getting lousy vertex-cache reuse, over-rendering parts of your model, or getting the wrong memory type for your buffers.

Also make sure you have the latest drivers if you haven’t already.

Avi

My bad, 32 byte aligned: 24 bytes + an 8 byte pad.

struct VBOVertex
{
    float normal[3];   // 3 floats, 12 bytes
    float vertex[3];   // 3 floats, 12 bytes
    float padd[2];     // 8 bytes of padding so the struct is 32 bytes
};

Let me post some code so you can take a look at it:

Rendering func
>>>>>>>>>>>>>>>
//Bind array buffer and draw the triangles
gExt.myVBO.glBindBufferArray( vboTriArrayID );
glInterleavedArrays( GL_N3F_V3F, SIZE_OF_VBOVERTEX, NULL );
glDrawArrays( GL_TRIANGLES, 0, 3*numTris );

//Bind array buffer and draw the quads
gExt.myVBO.glBindBufferArray( vboQuaArrayID );
glInterleavedArrays( GL_N3F_V3F, SIZE_OF_VBOVERTEX, NULL );
glDrawArrays( GL_QUADS, 0, 4*numQuads );
>>>>>>>>>>>>>>>

The generation code is a bit convoluted, but it breaks down to:

>>(Abbreviated)
glGenBuffersARB
glBindBufferARB( GL_ARRAY_BUFFER_ARB, id )
glBufferDataARB( GL_ARRAY_BUFFER_ARB, size, NULL, GL_STATIC_DRAW_ARB )
buffer = glMapBufferARB( GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB )

for all polygons
    for all indices in polygon
        buffer.normal = normal[index of polygon]
        buffer.vertex = vertex[index of polygon]

glUnmapBufferARB( GL_ARRAY_BUFFER_ARB )
>>
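Fleshed out, a minimal sketch of that generation path might look like this (assuming the ARB entry points are loaded, e.g. via wglGetProcAddress or an extension loader, and a hypothetical Triangle struct holding three indices):

#include <string.h>
#include <vector>

struct Triangle { int index[3]; };   // hypothetical: three vertex indices per polygon

// Create a static VBO and fill it with de-indexed normal + vertex data,
// one VBOVertex per triangle corner (sketch only).
GLuint buildTriangleVBO( const std::vector<Triangle>& tris,
                         const float (*normals)[3], const float (*verts)[3] )
{
    GLuint id = 0;
    glGenBuffersARB( 1, &id );
    glBindBufferARB( GL_ARRAY_BUFFER_ARB, id );
    glBufferDataARB( GL_ARRAY_BUFFER_ARB,
                     tris.size() * 3 * sizeof(VBOVertex),
                     NULL, GL_STATIC_DRAW_ARB );

    VBOVertex* dst = (VBOVertex*)glMapBufferARB( GL_ARRAY_BUFFER_ARB,
                                                 GL_WRITE_ONLY_ARB );
    for( size_t t = 0; t < tris.size(); ++t )
        for( int c = 0; c < 3; ++c, ++dst )
        {
            int i = tris[t].index[c];
            memcpy( dst->normal, normals[i], sizeof(dst->normal) );
            memcpy( dst->vertex, verts[i],   sizeof(dst->vertex) );
        }
    glUnmapBufferARB( GL_ARRAY_BUFFER_ARB );
    return id;
}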

That is pretty much it. I do that separately for quads and tris. I tried the same approach with strips instead of quads + tris, but got no performance improvement. Strips helped in immediate mode!

Thanks

The Nvidia VBO whitepaper is pretty clear that you should NOT call glVertexPointer or related functions (e.g., glInterleavedArrays) any more often than you absolutely have to. The general rule is to minimise the number of state changes you do in time-critical code.
http://developer.nvidia.com/object/using_VBOs.html

Two things that spring to mind:

  1. I tend to use separate bindings per vertex, normal, color, etc., instead of glInterleavedArrays (see the sketch after this list). In theory, there should be no difference if the array is truly interleaved.

  2. You didn't post your TRISTRIP binding or rendering code, but the thing I'd look for is whether you're making the very common mistake of drawing too many verts compared to the actual strip length. Even expert programmers make basic counting errors from time to time.
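A minimal sketch of point 1, using separate pointers into the same interleaved VBO (the offsets assume the N3F_V3F / VBOVertex layout used above: normal first, then position):

// Equivalent to glInterleavedArrays( GL_N3F_V3F, sizeof(VBOVertex), NULL ),
// but with explicit per-array bindings.
glBindBufferARB( GL_ARRAY_BUFFER_ARB, vboTriArrayID );

glEnableClientState( GL_NORMAL_ARRAY );
glNormalPointer( GL_FLOAT, sizeof(VBOVertex), (const GLvoid*)0 );     // normal at byte offset 0
glEnableClientState( GL_VERTEX_ARRAY );
glVertexPointer( 3, GL_FLOAT, sizeof(VBOVertex), (const GLvoid*)12 ); // vertex at byte offset 12

glDrawArrays( GL_TRIANGLES, 0, 3 * numTris );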

BTW, I may be misinterpreting your statements, but if you're seeing the same performance with strips and optimized indexed triangles, that should not be surprising given the size of the vertex cache. I thought you were concerned that you got worse performance with strips. Is that correct?

Avi

[This message has been edited by Cyranose (edited 12-01-2003).]

Actually you were right; originally strips seemed to do worse, but that was a mistake of mine. Now they perform the same. Can you explain what you mean by "due to the cache"?

How can I manipulate my data so that it fits in the cache? Unfortunately, this data is derived from a surface scan, and there is little if any repetition.

Anyways, thanks for all your suggestions, to everyone.

EDIT:
In reference to drawing the triangle strip with too many vertices, I am not sure what you mean. In immediate mode, using triangle strips speeds up drawing anywhere from 60% to 100%.
In VBO mode it does not improve performance!

[This message has been edited by maximian (edited 12-01-2003).]

Disabling vsync seems to help also. Now the fps for the large model is up to 50+! This is weird; I thought vertical sync would only limit you if your fps > refresh rate.

In light of this, to avoid future problems on platforms that have vsync enabled, how do I disable it in my program? Is it a Windows-specific setting, or an OpenGL parameter? Thanks.

As an additional note, neither glBindBufferARB nor glInterleavedArrays affects performance by any measurable amount, at least on my test setup. So state changes, in my case per model, have no effect. I expected this, since both the bind and the interleave call only set up pointers, but some people felt otherwise.

Originally posted by maximian:
[b]Can you explain what you mean by "due to the cache"?

How can I manipulate my data so that it fits in the cache? Unfortunately, this data is derived from a surface scan, and there is little if any repetition.

In reference to drawing the triangle strip with too many vertices, I am not sure what you mean. In immediate mode, using triangle strips speeds up drawing anywhere from 60% to 100%.
In VBO mode it does not improve performance!
[/b]

The vertex cache remembers the post-transform results of the last N (N = 16, 24, 48, etc.) vertices, saving memory-fetch, transform, and lighting time if one of those remembered vertices is repeated.

Optimizing means trying to sort the triangles so that you have the fewest transitions in and out of that cache. There's little hope of fitting everything into such a small cache, but the sort can help a lot with typical meshes, where most vertices are shared by 3 to 6 triangles. There's a free mesh optimizer from Nvidia that does this work for you, btw, even on pre-existing meshes.
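To make the cache idea concrete, here is a small sketch (hypothetical helper) that estimates how well an index list uses a FIFO post-transform cache of a given size:

#include <algorithm>
#include <deque>
#include <vector>

// Count cache misses for an indexed triangle list against a FIFO
// post-transform vertex cache of 'cacheSize' entries (sketch only).
size_t countCacheMisses( const std::vector<unsigned int>& indices, size_t cacheSize )
{
    std::deque<unsigned int> cache;
    size_t misses = 0;
    for( size_t i = 0; i < indices.size(); ++i )
    {
        unsigned int v = indices[i];
        if( std::find( cache.begin(), cache.end(), v ) == cache.end() )
        {
            ++misses;                     // vertex has to be fetched and transformed
            cache.push_back( v );
            if( cache.size() > cacheSize )
                cache.pop_front();        // oldest entry falls out of the FIFO
        }
    }
    return misses;   // misses / (indices.size() / 3) approximates misses per triangle
}

Sorting the triangles to minimise that count is what the mesh optimizers do.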

I don’t know whether the vertex cache does anything for non-indexed data these days, but it’s possible. Either way, the triangle strip is a primitive that’s designed to implicitly reuse vertices so caching wouldn’t add much unless your strips share other vertices too (as in the case of a mesh).

Anyway, if your data is in system memory or is sent in immediate mode without any nice AGP-memory buffering by the driver, then strips would be much faster: fewer glVertex calls and less data to transmit. But the difference between optimized indexed triangles and strips might be small once the data transfer or API calls are no longer the bottlenecks. Then it might come down to caching behavior or the time to fetch the indices.

What I meant by "drawing too many vertices" is a common problem with triangle strips. The 'count' parameter is the number of vertices, which starts at 3 for the first triangle and adds 1 for each additional triangle (basically, v = numTris + 2). Some people try to use v = numTris * 3 or * 2 or some such, meaning they're rendering extra vertices that don't always show up as garbage on screen but do take time to transform.
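A quick sketch of the correct count for one strip ('stripStart' and 'numTrisInStrip' are hypothetical names):

// A strip of numTrisInStrip triangles uses numTrisInStrip + 2 vertices.
glDrawArrays( GL_TRIANGLE_STRIP, stripStart, numTrisInStrip + 2 );
// Passing numTrisInStrip * 3 here would transform extra, unneeded vertices.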

Avi