Advice required for render pipeline optimizations

I have a medium-sized scene that drains performance far more than it should:

Lighting and shadows are disabled; all it’s doing is rendering the meshes, nothing more.
Here’s some debug info for that particular scene:


Triangles: 162123
Vertices: 213934
Shader Changes: 5
Material Changes: 229
Meshes: 12281 (99.58% static) (static = everything without unusual vertex attributes, e.g. vertex weights for animated meshes)
Average number of triangles per mesh: 13
Render Duration (GPU): 134.107ms (~7.46 frames per second)

My rendering pipeline currently looks like this (pseudocode):


startGPUTimer()
glBindVertexArray(vao)
foreach shader
	glUseProgram(shader)
	foreach material
		glBindTexture(material)
		foreach mesh
			glDrawElementsBaseVertex(GL_TRIANGLES, numElements, GL_UNSIGNED_INT, offsetToIndices, vertexOffset)
		end
	end
end
endGPUTimer()

There are two vertex data buffers: buffer #1 contains the vertex, UV and normal data for all meshes; buffer #2 contains the indices. Both buffers are bound via the vertex array object (VAO), only ONCE every frame.
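For completeness, the one-time setup looks roughly like this (simplified; the stride and offset names are placeholders):

// Executed once at load time, not per frame
glGenVertexArrays(1, &vao);
glBindVertexArray(vao);
glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer); // buffer #1: position, uv, normal
glEnableVertexAttribArray(0);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, stride, (void*)positionOffset);
glEnableVertexAttribArray(1);
glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, stride, (void*)uvOffset);
glEnableVertexAttribArray(2);
glVertexAttribPointer(2, 3, GL_FLOAT, GL_FALSE, stride, (void*)normalOffset);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, indexBuffer); // buffer #2: indices; this binding is stored in the VAO
glBindVertexArray(0);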
The biggest impact stems from glDrawElementsBaseVertex, which is called once per mesh (12281 times in this case). Here are the profiling results for the scene from gDEBugger:

My problem is that I simply don’t know how to optimize this further. I’m already doing frustum culling. Occlusion queries wouldn’t help, since almost nothing is occluded and most meshes are very small.
I also can’t use instancing, because very few meshes are exactly identical (I’m already using LODs for the trees, but the triangle count isn’t the issue here).

I know that 12281 draw calls are a lot, but it seems to me that that alone shouldn’t cause such a dent in performance. I know for a fact that the shader and the actual rendering are not at fault.
I’d like to keep everything as dynamic as possible, so no texture atlases, binary space partitioning, etc.

What are my options?

Using OpenGL 3.1 and higher, you can try a few small things, like culling faces that point away from the camera and mipmapping.

If you are loading textured models, you can generate mipmaps so that models farther away sample lower-detail versions of their texture:

glGenerateMipmap(GL_TEXTURE_2D);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_LOD_BIAS, -0.4f);

And to cull back faces (which lets you render WAY more models, since only the sides facing the camera are rasterized), try this:

glEnable(GL_CULL_FACE);
glCullFace(GL_BACK);

While face culling works on OpenGL as low as 2.1 (maybe lower), the mipmapping code only works on PCs with 3.0 and higher.

[QUOTE=kaboom;1279578]Using opengl 3.1 and higher you can try a few small things like cull faces away from camera and mipmapping? …[/QUOTE]
Mipmapping and backface culling are already active.
I’m pretty sure the actual rendering is not the problem: if I disable the fragment shader (by discarding all fragments) or change the resolution, the performance stays the same.

[QUOTE]The biggest impact stems from glDrawElementsBaseVertex[/QUOTE]

No, it’s not. The cost of state changes is assessed when you issue a rendering command. This makes rendering commands appear expensive, but they really aren’t. What’s expensive are the state changes preceding them, but their cost is only charged when you actually render with the new state.

[QUOTE]Here’s the result of a profiling of the scene using gDEBugger:[/QUOTE]

If you’re going to post screenshots of profiling results, rather than actual text, please try to make them big enough to be easily legible.

Also, your profiling results suggest that you’re doing things that you said you weren’t doing.

You said “Both buffers are bound via the vertex array object(vao), only ONCE every frame.”, and yet I see that the call count for glBindBuffer(GL_ARRAY_BUFFER) is only slightly less than half the count of your draw calls. Indeed, you seem to be calling glVertexAttribPointer almost as often as you are binding buffers.

So you are not rendering with the same buffers “only ONCE every frame”. Unless you’re binding the same buffer constantly, which would be silly.

The whole point of BaseVertex rendering is to avoid having to use glVertexAttribPointer to change which part of the mesh you’re using. It’s to avoid binding buffers and changing vertex formats and such. From the profiling data, I see no evidence that you’re gaining any of those advantages.
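To get those advantages, the inner loop should contain no buffer binds or attribute pointer calls at all; in the pseudocode style of the original post, the loop would look something like this (a sketch; the mesh fields are illustrative names):

glBindVertexArray(vao)   // once per frame; the element buffer binding is part of the VAO state
foreach shader
	glUseProgram(shader)
	foreach material
		glBindTexture(material)
		foreach mesh
			// no glBindBuffer/glVertexAttribPointer here: the index offset and
			// base vertex select this mesh’s region of the two shared buffers
			glDrawElementsBaseVertex(GL_TRIANGLES, mesh.numElements, GL_UNSIGNED_INT,
				(void*)(mesh.firstIndex * sizeof(GLuint)), mesh.baseVertex)
		end
	end
end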

[QUOTE]I know that 12281 draw calls are a lot[/QUOTE]

No, that really isn’t.

Face culling is OpenGL 1.0. LOD bias was made core in OpenGL 1.4.

I’ve only worked with OpenGL 2.1 and higher, so I was only stating what I know.