Vertex Buffer Objects performance issue

I recently dropped using display lists because I was getting horrible performance compared to immediate mode. DLs had other annoying properties so I dropped them without really understanding why they were so slow.

Instead I turned to Vertex Buffer Objects. And guess what. They’re also consistently slower than immediate mode (dropped framerate by 3x).

I am using LWJGL (a GL binding for Java), so this may not be an OGL problem. If anyone has seen a similar problem in C or C++, I’d at least know where to look :slight_smile: .

A boiled down version of my immediate code looks like this:

private void renderMesh() 
{ 
  for (int f = 0; f < m_aFaces.length; f++) 
  { 
    Face face = m_aFaces[f]; 
    GL11.glBegin(GL11.GL_TRIANGLES); 
    GL11.glNormal3f(face.nx, face.ny, face.nz); 
    GL11.glBegin(GL11.GL_TRIANGLES); 
    GL11.glVertex3f(face.v0x, face.v0y, face.v0z); 
    GL11.glVertex3f(face.v1x, face.v1y, face.v1z); 
    GL11.glVertex3f(face.v2x, face.v2y, face.v2z); 
    GL11.glEnd(); 
  } 
} 

My VBO code (again in a trimmed down version) looks like this

GL11.glEnableClientState(GL11.GL_VERTEX_ARRAY); 
GL11.glEnableClientState(GL11.GL_NORMAL_ARRAY); 
ARBVertexBufferObject.glBindBufferARB( ARBVertexBufferObject.GL_ARRAY_BUFFER_ARB, m_iVertVBO); 
GL11.glVertexPointer(3, GL11.GL_FLOAT, 0, 0); 
ARBVertexBufferObject.glBindBufferARB( ARBVertexBufferObject.GL_ARRAY_BUFFER_ARB, m_iNormVBO); 
GL11.glNormalPointer(GL11.GL_FLOAT, 0, 0); 

for (int f = 0; f < m_aFaces.length; f++) 
{ 
  Face face = m_aFaces[f]; 
  GL11.glBegin(GL11.GL_TRIANGLES); 
  GL11.glArrayElement(face.v0); 
  GL11.glArrayElement(face.v1); 
  GL11.glArrayElement(face.v2); 
  GL11.glEnd(); 
} 

GL11.glDisableClientState(GL11.GL_VERTEX_ARRAY); 
GL11.glDisableClientState(GL11.GL_NORMAL_ARRAY); 

The VBO is created like this:

FloatBuffer vertbuffer = BufferUtils.createFloatBuffer(3*verts.length); 

for(int j=0,i=0;i<verts.length;i++) 
{ 
vertbuffer.put(j++,verts[i].x); 
vertbuffer.put(j++,verts[i].y); 
vertbuffer.put(j++,verts[i].z); 
} 

IntBuffer temp = BufferUtils.createIntBuffer(1); 
ARBVertexBufferObject.glGenBuffersARB(temp); 

int iVBO = temp.get(0); 

ARBVertexBufferObject.glBindBufferARB( ARBVertexBufferObject.GL_ARRAY_BUFFER_ARB, iVBO); 
ARBVertexBufferObject.glBufferDataARB( ARBVertexBufferObject.GL_ARRAY_BUFFER_ARB, verts.length*3*4, vertbuffer, ARBVertexBufferObject.GL_STATIC_READ_ARB); 

I’m aware that glArrayElement is not the fastest way to use buffers, but I’d hate to expand meshes to 4, 5 or 6 times as many vertices as I really need (I need multiple texcoords per vertex).

Additionally, why would GL_STATIC_DRAW_ARB be 3x slower still than GL_STATIC_READ_ARB?

I am not reading data from GL, as you can see from my code. DRAW should be the correct hint, and there is no reason why READ would be faster, esp. 3x faster. But it is. (In summary DRAW is 9x slower than immediate, READ is 3x slower).

I’m on a 2.5GHz Intel using a GF4 Ti4400 card, newest drivers (2 days ago).

Any ideas?

Bugger, two errors in my post:

  1. No, I’m not calling GL11.glBegin(GL11.GL_TRIANGLES); twice in immediate mode - it’s a cut’n’paste error

  2. I use per-vertex normals in both immediate and VBO mode, again, copied the wrong code…

For the record:

Tried interleaving data in one VBO and it made ~5% difference (from 9.8 to 10.3 fps).

Tried padding to get 32 bytes per vertex and it made no difference whatsoever.
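
For readers following along, here is a minimal sketch of what an interleaved 32-byte vertex layout could look like. It assumes eight floats per vertex (x,y,z, nx,ny,nz, tu,tv), which is a guess based on the attributes mentioned later in this thread rather than the actual layout that was tested. The class and method names are hypothetical and only java.nio is used.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// Hypothetical helper (not part of LWJGL): packs separate attribute arrays
// into one interleaved buffer with a 32-byte stride.
final class InterleavedLayout {
    static final int FLOATS_PER_VERTEX = 8;                 // x,y,z, nx,ny,nz, tu,tv (assumed)
    static final int STRIDE_BYTES = FLOATS_PER_VERTEX * 4;  // 32 bytes
    static final int POSITION_OFFSET_BYTES = 0;
    static final int NORMAL_OFFSET_BYTES = 3 * 4;           // 12
    static final int TEXCOORD_OFFSET_BYTES = 6 * 4;         // 24

    static FloatBuffer interleave(float[] pos, float[] nrm, float[] uv) {
        int count = pos.length / 3;
        if (pos.length % 3 != 0 || nrm.length != count * 3 || uv.length != count * 2) {
            throw new IllegalArgumentException("attribute arrays do not describe the same vertex count");
        }
        // Direct, native-order storage is what GL bindings generally expect.
        FloatBuffer out = ByteBuffer.allocateDirect(count * STRIDE_BYTES)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
        for (int i = 0; i < count; i++) {
            out.put(pos, 3 * i, 3);   // position
            out.put(nrm, 3 * i, 3);   // normal
            out.put(uv, 2 * i, 2);    // texture coordinate
        }
        out.flip();                   // make the buffer readable from index 0
        return out;
    }

    public static void main(String[] args) {
        FloatBuffer b = interleave(new float[] {0, 0, 0}, new float[] {0, 0, 1}, new float[] {0, 0});
        System.out.println("floats in buffer: " + b.remaining());
    }
}
```

With such a buffer uploaded, the pointer calls would pass STRIDE_BYTES as the stride and the three offsets above in place of the 0, 0 used in the original code.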

Tried various combinations of STATIC/DYNAMIC/STREAM and READ/COPY/DRAW. Apparently STATIC/DYNAMIC/STREAM makes no difference, while READ is faster than COPY which is faster than DRAW to the tune of 10/5/3 fps respectively. Go figure.

The fast combination should be STATIC_DRAW for the vertex buffers.
Do not put the glBegin-glEnd of an independent primitive like GL_TRIANGLES inside the for-loop, but keep them outside.
Do not use glArrayElement at all. Build a list of indices to send and draw all triangles with a single glDrawElements call or, better, a single glDrawRangeElements call.
Use unsigned short indices if you can.
Never say you are using the newest driver, post a version number.
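
As an illustration of the unsigned short advice, here is a small sketch that narrows an int index list to 16-bit indices and refuses anything that does not fit. The class and method names are hypothetical. Java has no unsigned short type, so the values are stored bit-for-bit in a ShortBuffer and read back with a 0xFFFF mask.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.ShortBuffer;

// Hypothetical helper (not part of LWJGL): converts int indices to 16-bit indices.
final class ShortIndices {
    static final int MAX_USHORT = 0xFFFF; // largest index an unsigned short can hold

    static ShortBuffer toUnsignedShorts(int[] indices) {
        ShortBuffer out = ByteBuffer.allocateDirect(indices.length * 2)
                .order(ByteOrder.nativeOrder())
                .asShortBuffer();
        for (int index : indices) {
            if (index < 0 || index > MAX_USHORT) {
                throw new IllegalArgumentException("index " + index + " does not fit in 16 bits");
            }
            out.put((short) index); // keeps the low 16 bits; GL reads them as unsigned
        }
        out.flip();
        return out;
    }

    public static void main(String[] args) {
        ShortBuffer s = toUnsignedShorts(new int[] {0, 1, 2});
        System.out.println("indices stored: " + s.remaining());
    }
}
```

A mesh with more than 65536 distinct vertices would need int indices or would have to be split into several draw calls.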

The fast combination should be STATIC_DRAW for the vertex buffers.
I agree, but it’s not what I am seeing.

Do not put the glBegin-glEnd of an independent primitive like GL_TRIANGLES inside the for-loop, but keep them outside.
I know, but it doesn’t change anything. VBOs are still much slower than immediate mode.

Do not use glArrayElement at all.
I know. I am redoing things to use glDrawElements, but it has some annoying side effects (such as ~6x the number of vertices). Besides, why does glArrayElement even exist if it’s 3 times slower than immediate mode?

Never say you are using the newest driver, post a version number.
Sorry, good point: 61.77

Ok,

glArrayElement IS simply horribly inefficient. So could the official OpenGL Programming Guide please stop recommending it? Thanks.

Accepting the massive vertex increase in unpacking the mesh and using glDrawRangeElements has improved VBO performance to 2x immediate mode.

Case closed.

The programming guide recommends it because it does its job; however, its job isn’t high-speed graphics. You are still hand-feeding the GPU, and because it’s hardly used it isn’t optimised much in the drivers (most driver optimisation targets Quake 3 style triangle pushing).

The ATI/NV performance tuning PDF has some tips on how to use VBOs and points out that you shouldn’t use glArrayElement, which IIRC is expanded upon in the video (from the ATI dev pages).

Originally posted by Niels Jørgensen:
Accepting the massive vertex increase in unpacking the mesh and using glDrawRangeElements <…>
?
glDrawElements takes an array of indices. You don’t need to “unpack” your mesh at all.

Just allocate an array of unsigned ints. Put your indices into that array like this:

for (int f = 0; f < m_aFaces.length; f++)
{ 
  Face face = m_aFaces[f];
  your_uint_array[3*f]=face.v0;
  your_uint_array[3*f+1]=face.v1;
  your_uint_array[3*f+2]=face.v2;
}

Then you need to pass this array to glDrawElements. I’m not sure how you’d do that in Java but it shouldn’t be too hard.
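
On the Java side this could look roughly like the sketch below. The Face class here is a stand-in that assumes only the three fields v0, v1 and v2 from the original code. The rest uses plain java.nio, which should be equivalent to the BufferUtils.createIntBuffer call shown earlier in the thread.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;

// Hypothetical sketch: collects the three vertex indices of every face
// into one direct buffer that can be handed to a single draw call.
final class IndexBufferSketch {
    /** Stand-in for the poster's Face class; only the vertex indices are assumed. */
    static final class Face {
        final int v0, v1, v2;
        Face(int v0, int v1, int v2) { this.v0 = v0; this.v1 = v1; this.v2 = v2; }
    }

    static IntBuffer buildIndices(Face[] faces) {
        IntBuffer indices = ByteBuffer.allocateDirect(faces.length * 3 * 4)
                .order(ByteOrder.nativeOrder())
                .asIntBuffer();
        for (Face face : faces) {
            indices.put(face.v0).put(face.v1).put(face.v2);
        }
        indices.flip(); // rewind so the consumer reads from the first index
        return indices;
    }

    public static void main(String[] args) {
        IntBuffer idx = buildIndices(new Face[] { new Face(0, 1, 2), new Face(2, 1, 3) });
        System.out.println("index count: " + idx.remaining());
    }
}
```

The resulting buffer would replace the whole glBegin/glArrayElement loop with one glDrawElements call using GL_TRIANGLES. The exact overload that accepts an IntBuffer should be checked against the LWJGL version in use.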

The point is that you don’t need to touch your vertex data at all.

As I said, I need unique texture coords for each face, and since each face is defined by just three indices, each of these must completely define x,y,z,nx,ny,nz,tu and tv.

So, I need to expand the mesh, or accept that I cannot have unique tu,tv pairs for each face on each vertex. (I am aware that there is a middle ground if I only need different tu,tv for different textures, in which case I could change the tu,tv mapping when I switch texture, but that’s not what I’m doing right now.)

As for the programming guide, maybe it’s me, but I read the description of glArrayElement as a recommendation for better performance over immediate mode - which it apparently isn’t. In fact it is 3 times slower.

I have no beef with the guide in terms of correctness, it’s their performance tips that IMO are a little out of whack.

My guess is that glArrayElement might be faster when you use client-side vertex arrays rather than VBOs, which are more optimized for calls like MultiDrawArrays or DrawRangeElements. Incidentally, variations on these calls are the only way to really push geometry around at what are nowadays considered reasonable rates; using immediate mode very quickly leaves you completely CPU bound just from function call overhead (which might be even higher through Java). Oh, and supposedly the [v] versions of the immediate mode calls (e.g. glVertex3fv instead of glVertex3f) are slightly faster.

Anyway, if you really want per-face attributes (consider abandoning them), your best bet might actually be to expand them into a VBO and just use DrawArrays. You then have to send 3 vertices per triangle instead of ~0.5, but again you are almost certainly CPU bound and not bus bound.

How big are these meshes?

-Won

Originally posted by Niels Jørgensen:
As I said, I need unique texture coords for each face, and since each face is defined by just three indices, each of these must completely define x,y,z,nx,ny,nz,tu and tv.

You could just expand the tu/tv’s into separate arrays with a set of indices for each array and then make multiple calls for the assorted sets of tu/tv’s.

There’s no need to make EVERY vertex unique, you can use a single glDrawElements style vertex if the tex coords for the vertex are identical for all triangles that share it. There’s some simple code for how to do this here by jwatte of this forum.
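
To make the idea concrete, here is a rough sketch (not jwatte’s code) of one way to collapse identical corners into shared vertices. It assumes each triangle corner has already been expanded to eight floats (x,y,z, nx,ny,nz, tu,tv) and that corners must match exactly to be merged. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: merges bit-identical corners and emits an index per corner.
final class VertexWelder {
    static final int FLOATS = 8; // x,y,z, nx,ny,nz, tu,tv (assumed layout)

    /** Unique vertices plus one index per input corner. */
    static final class Result {
        final float[] vertices;
        final int[] indices;
        Result(float[] vertices, int[] indices) { this.vertices = vertices; this.indices = indices; }
    }

    /** Map key that compares a corner's eight floats by value. */
    private static final class Key {
        final float[] v;
        Key(float[] v) { this.v = v; }
        @Override public boolean equals(Object o) {
            return o instanceof Key && Arrays.equals(v, ((Key) o).v);
        }
        @Override public int hashCode() { return Arrays.hashCode(v); }
    }

    static Result weld(float[] corners) {
        if (corners.length % FLOATS != 0) {
            throw new IllegalArgumentException("corner data must be a multiple of " + FLOATS + " floats");
        }
        int cornerCount = corners.length / FLOATS;
        Map<Key, Integer> seen = new HashMap<>();
        float[] unique = new float[corners.length];
        int[] indices = new int[cornerCount];
        int uniqueCount = 0;
        for (int c = 0; c < cornerCount; c++) {
            float[] v = Arrays.copyOfRange(corners, c * FLOATS, (c + 1) * FLOATS);
            Key key = new Key(v);
            Integer index = seen.get(key);
            if (index == null) {           // first time this exact corner appears
                index = uniqueCount++;
                System.arraycopy(v, 0, unique, index * FLOATS, FLOATS);
                seen.put(key, index);
            }
            indices[c] = index;            // reuse the shared vertex
        }
        return new Result(Arrays.copyOf(unique, uniqueCount * FLOATS), indices);
    }

    public static void main(String[] args) {
        Result r = weld(new float[2 * FLOATS]); // two identical all-zero corners
        System.out.println("unique vertices: " + r.vertices.length / FLOATS);
    }
}
```

Only corners whose position, normal and texture coordinates all agree are merged, so a vertex that needs different tu,tv on different faces still ends up duplicated, but only as often as necessary.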

Thanks for all your input :slight_smile:

Won: I think you’re probably right about glArrayElement, the red book does not cover VBOs so I guess they’re excused :wink: .

The number of vertices is not currently a problem, but I have maybe 20% of the geometry I’ll likely end up with and will probably need many passes (expanded, it currently amounts to about 40K vertices; the original format is about 1/6 of that).

harsman/rgpc: You both basically say the same thing, and it’s what I will try doing next. My first test indicated that no reuse was possible, but thinking about it, that doesn’t make sense. The difference in tu,tv is most likely just floating point precision issues. I’ll re-test with a more appropriate comparison than == :wink: .
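
A tolerance-based comparison of the kind mentioned above might look like the following sketch. The class name, method names and the epsilon value in the example are arbitrary assumptions that would need tuning for real data.

```java
// Hypothetical sketch: compare texture coordinates with a tolerance instead of ==.
final class TexCoordCompare {
    /** True when a and b differ by at most eps. */
    static boolean nearlyEqual(float a, float b, float eps) {
        return Math.abs(a - b) <= eps;
    }

    /**
     * Snaps a value to a grid of cell size eps so it can be used in a hash key.
     * Caveat: two values closer than eps can still land in neighbouring cells.
     */
    static long quantize(float value, float eps) {
        return Math.round((double) value / eps);
    }

    public static void main(String[] args) {
        float eps = 1e-4f; // arbitrary tolerance for illustration
        System.out.println(nearlyEqual(0.25f, 0.250001f, eps));
        System.out.println(quantize(0.25f, eps) == quantize(0.250001f, eps));
    }
}
```

nearlyEqual suits pairwise checks, while quantize is meant for cases where vertices are looked up in a hash map and an exact key is required.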