Not seeing expected performance with VBOs

Hello.

We are currently developing a GPU-friendly method for adaptive subdivision of triangular meshes.

In one step of the algorithm, we wish to render pretessellated, (dyadically-)refined triangles, and calculate the surface in the vertex shader. This approach works well, and we get a nice rendered surface. During prototyping we just used immediate mode, passing triangle strips with integer coordinates like this:

  
// One triangle strip per row of the refined triangle. The three integer
// components sum to level (scaled barycentric grid coordinates).
for(size_t j=0; j<level; j++) {
        glBegin(GL_TRIANGLE_STRIP);
        for(size_t i=0; i<level-j; i++) {
                glVertex3i(i, j,   level-i-j);
                glVertex3i(i, j+1, level-i-j-1);
        }
        glVertex3i(level-j, j, 0);   // last vertex of the row
        glEnd();
}

We are now finishing the method and planning to migrate to VBOs for a little speed bump. My approach has been to store each level of refinement in its own VBO, using one degenerate triangle strip and an element array, and to draw using glDrawElements.

This approach also works, but I notice a speed decrease of about 20% compared to immediate mode. My initial thought was that there was too much overhead involved in binding buffers etc. for the lower levels of refinement, so I have resorted to immediate mode for the lowest levels. However, even when there are hundreds of indices in the VBOs it seems to be faster to use immediate mode.

Is there a lower bound on the number of vertices/indices at which VBOs become efficient? Would anyone like to comment on this observation?

We are seeing this behavior on GF6600, GF6800 and GF7800 series of GPUs, all on Linux (also when using the 81.63 series of drivers).

Don’t expect an answer if you don’t show what exactly you’re doing in the VBO case and how that data is going to be used in the app. Post some code.

Following Relic's comment, I will now post some more code to demonstrate my approach.

To generate the VBOs:

glGenBuffers( 1, &vbo );
glGenBuffers( 1, &ebo );

// Fill data and idx
glBindBuffer( GL_ARRAY_BUFFER, vbo );
glBufferData( GL_ARRAY_BUFFER, data.size()*sizeof( Vec4i ), &data[0], GL_STATIC_DRAW );

glBindBuffer( GL_ELEMENT_ARRAY_BUFFER, ebo );
glBufferData( GL_ELEMENT_ARRAY_BUFFER, idx.size()*sizeof( unsigned short ), &idx[0], GL_STATIC_DRAW );
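
The "// Fill data and idx" part is omitted above; roughly, it builds the grid vertices and one long, degenerate-joined strip per level. A simplified sketch, not the exact code (Vec4i is our four-GLint vector type):

// Sketch of genVBO( level ): grid vertices plus one degenerate-joined strip.
void GPUCurvedFaceRenderer::genVBO( size_t level ) {

        const int l = 1 << level;

        vector<Vec4i>          data;
        vector<unsigned short> idx;

        // First vertex index of each grid row j (row-major by j).
        vector<unsigned short> row( l+2 );
        for ( int j = 0; j <= l+1; j++ )
                row[j] = (unsigned short)( j*(l+1) - j*(j-1)/2 );

        // One vertex per grid point (i,j) with i+j <= l.
        for ( int j = 0; j <= l; j++ )
                for ( int i = 0; i <= l-j; i++ )
                        data.push_back( Vec4i( i, j, l-i-j, 0 ) ); // fourth component: extra per-vertex tag

        // The row strips of the immediate mode code, joined into one strip by
        // repeating two indices between rows (degenerate triangles).
        // Total index count: l*l + 4*l - 2, the number passed to glDrawElements.
        // (Winding of the joins may need attention if back-face culling is on.)
        for ( int j = 0; j < l; j++ ) {
                if ( j > 0 ) {
                        idx.push_back( idx.back() );
                        idx.push_back( row[j] );
                }
                for ( int i = 0; i < l-j; i++ ) {
                        idx.push_back( row[j]   + i );
                        idx.push_back( row[j+1] + i );
                }
                idx.push_back( row[j] + (l-j) );
        }

        // ...then generate and fill vbo/ebo as shown above, and store them
        // in m_vbo[level] and m_ebo[level].
}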

To render them:

void GPUCurvedFaceRenderer::renderVBO( size_t level ) {

        static GLuint current = 0;

        // Immediate mode is still used for the lowest refinement levels.
        if ( level < 3 ) {
                renderRefinedTriangle( level );
                return;
        }

        // Lazily create the buffers for this level the first time it is needed.
        map<size_t, GLuint>::const_iterator i = m_vbo.find( level );

        if ( i == m_vbo.end() )
                genVBO( level );

        GLuint vbo = m_vbo[level];
        GLuint ebo = m_ebo[level];

        // Only rebind when the level (and thus the VBO) changes.
        if ( current != vbo ) {
                glEnableClientState( GL_VERTEX_ARRAY );
                glBindBuffer( GL_ARRAY_BUFFER, vbo );
                glVertexPointer( 4, GL_INT, 0, BUFFER_OFFSET( 0 ) );
                glBindBuffer( GL_ELEMENT_ARRAY_BUFFER, ebo );

                current = vbo;
        }

        // One degenerate-joined triangle strip with l*l + 4*l - 2 indices.
        int l = 1<<level;
        glDrawElements( GL_TRIANGLE_STRIP, l*l + 4*l-2, GL_UNSIGNED_SHORT, NULL );
}

The level parameter is always between 0 and 8.
(So for level == 8, with l = 256, the strip has l*l + 4*l - 2 = 66558 indices, on the order of 2^16, which I would believe is large enough for VBOs to help.)

Profiling shows I do not spend any significant time looking up m_vbo (it is an STL map with maximum size 9).

It should also be mentioned that our vertex shader is very computationally expensive, so my main bottleneck is probably there.

However, I do find it strange that I see a performance decrease when using VBOs.

Try replacing the GL_INTs for vertex positions/vertex data with GL_FLOATs; GL_INT may hit a less optimized driver path…

I’ve never used this:
glVertexPointer( 4, GL_INT, 0, BUFFER_OFFSET( 0 ) );
Try GL_FLOAT data.
You used 3 components in the immediate mode code. Why do you use 4 in the array?

Put more brain into static VBO data! :wink:
I would optimize this the following way (a rough sketch follows at the end of this post):

  • Use one VBO for each object, do not split it into one per LOD!
  • For all LODs of an object find the unique vertices. Only put those into the vertex array object.
  • Build the vertex array element objects by remapping the indices to the new unique vertices.
  • Remember the start offsets and counts of the indices.
  • Use those array element offsets to switch between LODs.
    This will remove the need to switch VBOs as long as you reuse the same object. Sort by model?

Using GL_UNSIGNED_SHORT indices is good for performance.
Try glDrawRangeElements too.
If your vertex shader is as complex as you say, reusing vertices is beneficial.
Even more so if you can render the thing in patches of vertices where the indices are adjacent (e.g. meshes less than 16 vertices wide). This will keep the post-transform caches in GeForce chips happy. Search for vertex reuse on gpgpu.org.
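
Something along these lines (untested sketch; LodRange, m_lod, m_vbo_all, m_ebo_all and renderLOD are made-up names, not from your code):

// One shared vertex VBO with the unique vertices of all LODs, one shared
// element VBO with all LOD index ranges; per LOD only the range changes.
struct LodRange {
        GLuint  start, end;   // lowest/highest vertex index referenced by this LOD
        GLsizei count;        // number of indices for this LOD
        size_t  offset;       // offset (in indices) into the shared element buffer
};

void GPUCurvedFaceRenderer::renderLOD( size_t level ) {

        const LodRange &r = m_lod[level];

        glEnableClientState( GL_VERTEX_ARRAY );
        glBindBuffer( GL_ARRAY_BUFFER, m_vbo_all );
        glVertexPointer( 4, GL_FLOAT, 0, BUFFER_OFFSET( 0 ) );
        glBindBuffer( GL_ELEMENT_ARRAY_BUFFER, m_ebo_all );

        // glDrawRangeElements tells the driver which vertex range is touched.
        glDrawRangeElements( GL_TRIANGLE_STRIP, r.start, r.end, r.count,
                             GL_UNSIGNED_SHORT,
                             BUFFER_OFFSET( r.offset * sizeof( unsigned short ) ) );
}

The binding can of course be hoisted out of the per-face loop so it happens only once per object.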

Originally posted by Relic:
I’ve never used this:
glVertexPointer( 4, GL_INT, 0, BUFFER_OFFSET( 0 ) );
Try GL_FLOAT data.
You used 3 components in the immediate mode code. Why do you use 4 in the array?

Put more brain into static VBO data! :wink:
I would optimize this the following way:
…snip…

Using GL_UNSIGNED_SHORT indices is good for performance.
Try glDrawRangeElements too.


Thanks. I will try using GL_FLOATs (even though the data are all integers; they could even be unsigned shorts). I will also try the single-VBO approach. I will report back over the weekend with my results.

As for the 3 vs. 4 components: the fourth is used as a cheap way to indicate which ring of tessellation a vertex belongs to (we use it to avoid passing a separate vertex attribute). My IM code did not actually call glVertex3i, but a small wrapper which called glVertex4i; I just edited it away for clarity (sigh, instead I made it more obscure).

I can now confirm that the problem seems to have been the use of integers as the datatype for the vertex data. On a GF6800 Ultra AGP (1.0-7667 Linux drivers), changing from ints to floats resulted in a speedup of more than 35%!
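
For reference, the change boils down to converting the integer grid coordinates to GLfloat at upload time and switching the pointer type. A minimal sketch (Vec4f as a four-GLfloat counterpart of Vec4i, and the x/y/z/w members, are assumptions):

// Upload the grid coordinates as GLfloat instead of GLint.
vector<Vec4f> fdata( data.size() );
for ( size_t k = 0; k < data.size(); k++ )
        fdata[k] = Vec4f( (GLfloat)data[k].x, (GLfloat)data[k].y,
                          (GLfloat)data[k].z, (GLfloat)data[k].w );

glBindBuffer( GL_ARRAY_BUFFER, vbo );
glBufferData( GL_ARRAY_BUFFER, fdata.size()*sizeof( Vec4f ), &fdata[0], GL_STATIC_DRAW );

// ...and in renderVBO():
glVertexPointer( 4, GL_FLOAT, 0, BUFFER_OFFSET( 0 ) );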

The trick of using just one VBO and indexing into it gave a few extra FPS on top of that.

Thanks to Relic for helping me track down this one!

A couple of months ago there was a very long thread regarding exactly the same problem. I have come to the conclusion that there is almost no performance difference between using GL_SHORTs and GL_FLOATs, so you can use shorts instead of floats. Integer and byte arrays give a big performance hit. I claim this after having tested my terrain rendering algorithm (which can really choke the GPU) on the following hardware:

GFX 5700 Ultra,
ATI Radeon 9700 Pro,
GFX 6800 GT.