Uniform Buffer Objects performance issues

First off, I read the thread "Uniform Buffer Objects (slow) - help needed" (OpenGL: Advanced Coding, Khronos Forums), but since the last post there is from 2.5 years ago I thought I could add something here. Basically, I am experiencing the same problems as the author of the aforementioned thread. I have a GF 240 GT with some of the latest drivers.

I render 625 meshes and, obviously, need a world transform matrix for each. There is also a view-projection matrix passed to the shader (it is set only once, as it is constant for all objects, so only the world transform needs to be updated per mesh). Using the traditional uniform variables approach I manage to render everything in less than 2ms, which is a little over 500 FPS (I call glFinish before recording the time).

Now I switched to a constant buffer. When I update the buffer’s data with glMapBufferRange, performance suffers immensely: it takes around 120ms to render a frame. On the other hand, when I update the buffer’s data with glBufferSubData, the CPU time needed to execute the API calls is less than 1ms (!), but that is before calling glFinish. After calling glFinish the measured time is around 9ms, which gives 120 FPS or so.

The thing that bothers me most is the difference in timing taken before and after calling glFinish. If rendering all objects takes less than 1ms and calling glFinish is so expensive I guess OGL is simply buffering all commands. If so then I think it’s quite a lot of data to buffer.

Has anyone ever decided to abandon good old uniform variables and switch completely to uniform buffers?

What do you mean by “Now I switched to a constant buffer”? Do you have a single buffer with 625 matrices?

I’ve done some benchmarks myself on UBOs and I found them slower than glUniform* for this kind of situation. I will post the results when I get home.

Sorry for not being specific. By “switched to a constant buffer” I mean I have a constant buffer which only holds two matrices, world and viewProj. This constant buffer is updated before each draw call. I’m doing it this way to be consistent with DX10/11.
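
For reference, a block holding just these two matrices would occupy 128 bytes if it is declared with the std140 layout. The following is a minimal sketch of a matching CPU-side struct. The names PerDrawConstants and PerDraw, and the use of std140, are assumptions made for illustration; they are not taken from the actual code in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU-side mirror of a GLSL block such as:
 *   layout(std140) uniform PerDraw { mat4 world; mat4 viewProj; };
 * Under std140 a mat4 is stored as four vec4 columns (64 bytes). */
typedef struct PerDrawConstants {
    float world[16];    /* column-major, byte offset 0  */
    float viewProj[16]; /* column-major, byte offset 64 */
} PerDrawConstants;

/* Compile-time checks that the C layout matches the std140 offsets. */
static_assert(sizeof(float) == 4, "float must be 32-bit");
static_assert(offsetof(PerDrawConstants, viewProj) == 64, "viewProj at byte 64");
static_assert(sizeof(PerDrawConstants) == 128, "two mat4 = 128 bytes");
```

With this layout, a per-draw update only needs to rewrite the first 64 bytes (world), since viewProj stays constant across all meshes.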

Has anyone ever decided to abandon good old uniform variables and switch completely to uniform buffers?

No but I will be most interested in your results. From what I have read uniform buffers have to be copied to registers (uniforms) prior to use so they do have an overhead.
That was from older articles and may be out of date.

I have been caught out by command buffering when benchmarking. My assumption is that OpenGL does not actually buffer that much but sends commands to the GPU, where they get stuck in queues. Certain OpenGL commands require a response from the GPU, and that is where the driver suspends, waiting for the GPU to execute that command. (Certainly that is how channel control programs worked for mainframe front-end processes when I used to write that code many eons ago ;))

One last question that actually matters. Do you use a single UBO for all the meshes or one per mesh?

One UBO for all meshes looks like this I guess:

for mesh in meshes do
    update UBO 0
    draw mesh
endfor

Yes, I have only one constant buffer. Moreover, it is set only once so the code should not suffer any redundant API overhead. The only extra function I call for each iteration is glBufferSubData to update data in the constant buffer under slot 0. Basically GL Intercept logs this for each mesh:


glBufferSubData( ??? )
glDrawElements( ??? ) GLSL=4  Textures[ (0,4) (7,2) ] 

I may have an idea on what is wrong.

Draw calls are not executed at the time you issue them. In most implementations they are queued in a command buffer and the driver decides when to submit them for execution. When you update the buffer, the previous draw call still depends on it, so the driver cannot modify the buffer in place without affecting that draw call. What the driver can do is either wait for the dependency to be resolved (the previous draw call finishes) or create a copy (copy-on-write) that the new draw call will use. Both solutions are somewhat expensive.

What you can easily do to test this theory is to use one UBO per draw call. I bet that you will see an improvement.

As I said, I’m already using one constant buffer…

And what I am trying to say is that by using one buffer you are not using OpenGL in an optimal way. The sequence you describe has read/write dependency problems because every update to the buffer depends on the previous draw call.

// Iteration 0
write UBO 0
draw mesh 0
// Iteration 1
write UBO 0 -> wait for “draw mesh 0” to be done reading the UBO 0
draw mesh 1
// Iteration 2
write UBO 0 -> wait for “draw mesh 0” and “draw mesh 1” to be done reading the UBO 0
draw mesh 2

I am not trying to convince you to use something else. But the root of your problem is most likely what I described, and if you want to solve it you need a different approach.
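
One possible variation on "one UBO per draw call" is a single larger buffer in which each draw call gets its own aligned slice, so that no draw reads memory a later update overwrites. The sketch below shows only the offset arithmetic in plain C. The helper names (align_up, slice_offset, buffer_size) are hypothetical, and the alignment value is an assumption: in real code it would be queried with GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT. Whether this actually removes the stall depends on the driver.

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to the next multiple of alignment (alignment > 0). */
static size_t align_up(size_t size, size_t alignment)
{
    return (size + alignment - 1) / alignment * alignment;
}

/* Byte offset of the slice used by draw call draw_index. */
static size_t slice_offset(size_t draw_index, size_t block_size,
                           size_t alignment)
{
    return draw_index * align_up(block_size, alignment);
}

/* Total buffer size needed to hold draw_count slices. */
static size_t buffer_size(size_t draw_count, size_t block_size,
                          size_t alignment)
{
    return draw_count * align_up(block_size, alignment);
}

/* Intended use with a current GL context (not compiled here):
 *   glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &alignment);
 *   write all slices once per frame, then for each draw i:
 *     glBindBufferRange(GL_UNIFORM_BUFFER, 0, ubo,
 *                       slice_offset(i, block_size, alignment), block_size);
 *     issue the draw call
 */
```

For example, with 625 meshes, a 128-byte block and an assumed 256-byte offset alignment, buffer_size(625, 128, 256) is 160000 bytes, while each bound range is still only 128 bytes.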

Ouch, sorry, I misread your post. I thought you wanted me to use one UBO, which is what I’m already doing. Your idea makes sense, I will give it a try once I’m done with my current work :).

I’ve just checked it. Unfortunately performance stays exactly the same.

I’ve experienced the same, though not with UBOs but with normal VBOs. I had to stop using buffer mapping for that reason and rely only on glBufferSubData.
What I think is happening is that the Nvidia drivers are trying to be clever (I’m guessing you’re using Nvidia like me). In my experience, Nvidia doesn’t honor the buffer usage hint and changes the buffer type as it sees fit.

My advice would be to steer clear of mapping GL buffers and handle it yourself.

Buffers are not designed to be updated per draw call but per frame… In your case a texture buffer would be more suitable because you won’t be limited by the 64K size and the 256-byte alignment.

// update matrices in TBO / UBO
glBindBuffer(GL_TEXTURE_BUFFER, tbo);
float4x4 *mats = (float4x4 *)glMapBufferRange(GL_TEXTURE_BUFFER, 0, size, GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
float4x4 *e = mats + num_instances;
while(mats != e) {
    *mats++ = compute_instance_matrix(); // placeholder: your own per-instance matrix
}
glUnmapBuffer(GL_TEXTURE_BUFFER);

// render meshes
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_BUFFER, tb); // bind texture buffer created by glTexBuffer
glUniform1i(your_shader_sampler_uni, 0); // set texture channel for shader sampler

for(…) {
    glUniform1i(your_mesh_id_uni, id); // or use glVertexAttribI1i(15, id)
    glDrawRangeElements(…);
}

// in shader
mat4 tm = mat4(
texelFetch(tb, your_mesh_id * 4),
texelFetch(tb, your_mesh_id * 4 + 1),
texelFetch(tb, your_mesh_id * 4 + 2),
texelFetch(tb, your_mesh_id * 4 + 3));
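
The indexing used in the shader above can be checked on the CPU. The sketch below is a plain C stand-in, assuming the texture buffer uses a four-float format such as GL_RGBA32F so that one texel holds one matrix column. The names Texel, store_matrix and fetch_column are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One four-float texel, i.e. one column of a column-major matrix. */
typedef struct Texel { float v[4]; } Texel;

/* Copy a column-major 4x4 matrix into the four texels owned by mesh_id. */
static void store_matrix(Texel *texels, size_t mesh_id, const float m[16])
{
    memcpy(&texels[mesh_id * 4], m, 16 * sizeof(float));
}

/* CPU stand-in for texelFetch(tb, mesh_id * 4 + column) in the shader. */
static Texel fetch_column(const Texel *texels, size_t mesh_id, size_t column)
{
    return texels[mesh_id * 4 + column];
}
```

Each mesh therefore consumes four consecutive texels, which is why the shader multiplies the mesh id by 4 before adding the column index.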

You can find more about buffers in the free chapter of OpenGL Insights:
http://www.seas.upenn.edu/~pcozzi/Op…rTransfers.pdf