Dynamic VBO Performance Problems

First a little introduction. I know and use both OpenGL and D3D, but most of my recent experience has been with D3D. I’ve successfully implemented dynamic VBOs in D3D (D3DUSAGE_DYNAMIC, D3DLOCK_DISCARD) and have received a very significant performance boost as a result.

Now onto the problem. I’m trying to do the same with OpenGL and it’s just not working out. All drawing goes through a single glDrawElements code path (using 16-bit indexes), state changes are buffered up and then replayed, everything is batched as aggressively as possible, and there are a total of about 11 glDrawElements calls happening. One index buffer, one vertex buffer (but see below).
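For reference, the draw path looks roughly like this (a simplified sketch only; the vbo/ibo handles, the batch array, and the applyBufferedStateChanges helper are placeholder names, not my actual code):

    /* Assumes <GL/gl.h>; vbo, ibo, batches[] and numBatches are defined elsewhere. */
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
    /* ... glVertexPointer / glVertexAttribPointer setup goes here ... */

    for (int i = 0; i < numBatches; ++i) {
        applyBufferedStateChanges(&batches[i]);      /* replay queued state changes */
        glDrawElements(GL_TRIANGLES,
                       batches[i].indexCount,
                       GL_UNSIGNED_SHORT,            /* 16-bit indexes */
                       (const void *)(size_t)(batches[i].firstIndex * sizeof(GLushort)));
    }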

Tactics I’ve tried so far include both glMapBuffer and glBufferSubData, with the latter run both many times per frame (per polygon, basically) and once only just before drawing. I’ve also tried variations of GL_STATIC_DRAW, GL_DYNAMIC_DRAW and GL_STREAM_DRAW, and glMapBuffer both with and without a preceding glBufferData (… NULL …) call, benchmarked on light scenes, heavy scenes, light CPU load, heavy CPU load, single buffering, double buffering, quadruple buffering; you name it, I’ve tried it.
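The orphan-before-map variant I’ve been testing looks roughly like this (a minimal sketch; vbo, bufferSize, vertexData and vertexBytes are placeholder names):

    /* Re-specify the store with NULL data first so the driver can orphan the
       old storage instead of stalling on the copy the GPU may still be reading,
       then map and fill the fresh allocation. */
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, bufferSize, NULL, GL_STREAM_DRAW);

    void *ptr = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    if (ptr) {
        memcpy(ptr, vertexData, vertexBytes);   /* needs <string.h> */
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }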

There is zero performance difference between dynamic VBOs and regular vertex arrays in everything I’ve tried.

Before I write dynamic VBOs in OpenGL off as a “non-feature”: is there any such animal as a definitive “this is how you do dynamic VBOs in OpenGL” guide?

Thanks.

You haven’t answered the question, “How dynamic do you mean?” Is this data going to be changing every frame, or is it less frequent than that? How frequent is it?

You also haven’t stated how you’re determining that “There is zero performance difference between dynamic VBOs and regular vertex arrays in everything I’ve tried.” Because that seems highly unlikely.

The data needs to change every frame, yes. The worst case could be over 1,000,000 indexes per frame and about 250,000 vertexes per frame, but that is quite extreme. Average loads are substantially lower. In both extreme and average loads, no difference was observed.

Performance was measured using both frames per second and milliseconds per frame in FRAPS over an approximately 30-second run. Comparisons were also made with the D3D version of the same code.

Would highly recommend you read this thread starting here:

paying special attention to Rob’s post. I’ve tried this, and it works very well. You might also find this thread interesting/amusing:

where I’m tripping over semantics a bit to figure out the constraints of buffer orphaning. This thread has a code snippet you can copy and tweak to taste. Hardly perfect (2 maps/copies instead of 1, and this version doesn’t use DSA/bindless/etc.), but it’s a place to start.
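For anyone landing here later, the streaming-with-orphaning pattern being discussed looks roughly like the following (my own rough sketch, not the snippet from the thread; streamVbo, STREAM_BUFFER_SIZE, streamOffset, vertices and the Vertex type are placeholders): append new data into one large buffer with an unsynchronized range map, and orphan the buffer only when it fills up.

    /* Requires GL 3.0 or ARB_map_buffer_range for glMapBufferRange. */
    GLsizeiptr writeBytes = (GLsizeiptr)(vertexCount * sizeof(Vertex));

    glBindBuffer(GL_ARRAY_BUFFER, streamVbo);
    if (streamOffset + writeBytes > STREAM_BUFFER_SIZE) {
        /* Buffer full: orphan it and start appending from the top again. */
        glBufferData(GL_ARRAY_BUFFER, STREAM_BUFFER_SIZE, NULL, GL_STREAM_DRAW);
        streamOffset = 0;
    }

    /* Map only the range we are about to write, unsynchronized, so the driver
       never waits on draws that are still reading earlier ranges. */
    void *dst = glMapBufferRange(GL_ARRAY_BUFFER, streamOffset, writeBytes,
                                 GL_MAP_WRITE_BIT |
                                 GL_MAP_INVALIDATE_RANGE_BIT |
                                 GL_MAP_UNSYNCHRONIZED_BIT);
    memcpy(dst, vertices, (size_t)writeBytes);
    glUnmapBuffer(GL_ARRAY_BUFFER);

    /* Draw from the range just written, then advance the append cursor. */
    GLint firstVertex = (GLint)(streamOffset / sizeof(Vertex));
    streamOffset += writeBytes;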