Uniform buffer objects

I am working on a VR platform on a mobile device, and I am trying to determine whether UBOs perform better than glUniform* calls. Here are my observations:
I have a single UBO for all the objects in a scene. For simplicity, my shaders use only the MVP matrix. My application has 125 objects, and all of their MVP matrices are stored in a single buffer, so I do the update only once per frame. If I compare the fps against the version that doesn't use a UBO and uses glUniform* instead, I get a 2-3 fps drop with the UBO approach. I expected this approach to work better than having a UBO for each object and updating 125 buffers every frame.

I thought of keeping only the view and projection matrices in the UBO and passing the model matrix for each object using glUniform*, but in that case the shader has to compute the combined matrices itself, which won’t work well with lighting shaders.

What would be the best approach for passing uniforms?

What GPU vendor and GPU?

There are some subtleties to using buffer objects effectively on mobile GPUs.

I get 2-3 fps drop with UBO approach.

By itself, this doesn’t mean anything. It could be a very small or a very large amount of additional time, because “frames per second” is non-linear in time. See Performance (Humus).

What is the FPS with glUniform* and the FPS with UBO? From those numbers, you can compute frame time (in msec) for each. And subtracting those will tell you how much wall-clock time we’re talking about being different here.

It is Adreno 530 - G9350.
Sorry for the wrong data; I did not compare the two properly.
My application has 125 scene objects and uses only the MVP matrix in the shader, and now the fps is the same in both cases, i.e. with and without the UBO.

One more question: what is the recommended size for a UBO? I want to know when I should create a new UBO.

For recommendations on buffer object usage in general and UBOs specifically, see the Qualcomm Adreno OpenGL ES Developer Guide. In particular, pay close attention to how to update them. You don’t want to cause the driver to synchronize or ghost (make a copy of) the buffer object when you upload new contents, and it sounds like under different circumstances it may do either. Both are bad for performance.

Generally speaking, avoid modifying buffer objects that you have modified in the last 2-3 frames, unless you are doing UNSYNCHRONIZED maps.

Thanks for your inputs.
My application runs at 44-52 fps (the fps varies as the light rotates) with glUniform* calls. I am running this experiment only with transform UBOs; for the light I am not using UBOs.
Here are a few cases I tried:

  1. Use the same UBO every frame, with and without unsynchronized maps, updating the UBO for the first few frames only. Even then, my application runs at 42-47 fps. I have no idea why the fps is lower even though I am not uploading UBO data every frame.
  2. Use 2 UBOs, with and without unsynchronized maps, updating the UBOs every frame. The fps is 42-47.

I am orphaning the buffer in both cases.

What is puzzling is that the only difference is the use of transform UBOs, and even when I don’t update them I get a lower fps. Is it due to data access in the shader? Cache misses? Or just the binding of the UBO?

Let’s convert that to something linear so we can compare (i.e. frame time). 44-52fps = 19.2-22.7 ms/frame. 42-47fps = 21.3-23.8 ms/frame. So a loss of ~1-2ms.
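The conversion above is easy to reproduce. Here is a minimal sketch of the arithmetic in C; the helper names `fps_to_ms` and `frame_time_delta_ms` are made up for illustration and are not part of any GL API.

```c
/* Convert a frame rate (frames per second) to a frame time in milliseconds. */
double fps_to_ms(double fps)
{
    return 1000.0 / fps;
}

/* Extra frame time (ms) of the run at fps_slow compared to the run at fps_fast. */
double frame_time_delta_ms(double fps_fast, double fps_slow)
{
    return fps_to_ms(fps_slow) - fps_to_ms(fps_fast);
}
```

Comparing the low ends (44 vs 42 fps) gives roughly 1.1 ms, and comparing the high ends (52 vs 47 fps) gives roughly 2.0 ms, which is where the ~1-2 ms figure comes from.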

If this delta is due to flipping between glUniform*() calls and UBOs, this does suggest your use of UBOs doesn’t pipeline as well with your GL driver as your glUniform* calls (not too surprising as client arrays and standard uniform sets pipeline well on mobile; getting dynamic buffer objects to pipeline well is more difficult). Besides the UBO updates, you do have to bind UBOs whereas before you didn’t. Could be that’s it. Try taking the binds out.

How many binds are you doing? Are you doing lazy binds? Are you referencing UBO data by index? How are you setting the indices? Are you avoiding changing the UBO contents for 2-3 frames? Some drivers block when you orphan buffers, so you might try getting rid of that. Just allocate enough space for 2-3 frames of data so you don’t need to synchronize (as sync can be tricky without causing a render target flush mid-draw, which can generate nasty artifacts).
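To make the "allocate enough space for 2-3 frames" idea concrete, here is a rough sketch of the offset arithmetic only, with no GL calls. The function names are hypothetical, and the alignment value is an assumption: on a real device you would query it with glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, ...) and pass the resulting offset to glBindBufferRange and glMapBufferRange.

```c
#include <stddef.h>

/* Round size up to the next multiple of alignment (alignment must be > 0). */
size_t align_up(size_t size, size_t alignment)
{
    return (size + alignment - 1) / alignment * alignment;
}

/* Byte offset of the region to write for a given frame, when one buffer
   holds num_regions copies of the per-frame data and frames cycle through
   them. Each region is padded so that its start satisfies the alignment. */
size_t frame_region_offset(unsigned long frame, unsigned num_regions,
                           size_t region_size, size_t alignment)
{
    size_t stride = align_up(region_size, alignment);
    return (size_t)(frame % num_regions) * stride;
}
```

For example, with 125 MVP matrices (8000 bytes), 3 regions, and an assumed alignment of 256 bytes, frames 0, 1, 2 write at offsets 0, 8192, 16384, and frame 3 wraps back to 0. The buffer would then be allocated once at 3 * 8192 bytes and never orphaned.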

If you post some code, you’ll probably gather more ideas from folks about how you might optimize things. You might also try and gather more info on how to efficiently use buffer objects on the Adreno GLES driver (I’d suggest posting to the Adreno dev forums, but I know from experience your chances of getting help there aren’t good). That or try alternative approaches to updating and passing UBOs into your driver.

All that said, your frame times are 19.2-22.7 ms/frame even in the better of these two cases. Given that your goal is to slide in easily under 16.6 ms/frame, you need to cut at least 2.6-6.1 ms/frame off your best frame times … which appears to be more than you’ve got on the table here with uniform updates. So you might want to look elsewhere for bigger savings. Again if you post some of your draw code I suspect folks will be able to give you more ideas.

Have you run a frame profiler to see if/how your vertex and fragment work is pipelining on the GPU? In the ideal case (when you’re in tile-based mode), you want to see your fragment work for frame N-1 executing completely in parallel with the vertex work for frame N.

I tried taking the binds out, but it didn’t help. It looks like the bottleneck lies in accessing the data in the shader. I am passing a “uniform int index” for each mesh in the scene to offset into the matrix array.
I size the UBO for maximum capacity, so even if I don’t use the whole buffer I still allocate that much memory. Even if I allocate the minimal amount of memory (12800 bytes) for the buffer, it doesn’t improve. I was wondering whether I was hitting a register size limit.
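For a sense of scale, the sizes involved can be worked out directly. This is only a sketch of the arithmetic, assuming 64 bytes per mat4 (16 floats of 4 bytes, as in the std140 layout); the helper names are hypothetical. The actual per-block limit on a device has to be queried with glGetIntegerv(GL_MAX_UNIFORM_BLOCK_SIZE, ...); OpenGL ES 3.0 only guarantees a minimum of 16384 bytes.

```c
#include <stddef.h>

/* Size of one mat4 in bytes, assuming 32-bit floats and std140 layout. */
#define MAT4_BYTES 64u

/* Bytes occupied by an array of count mat4 elements. */
size_t mat4_array_bytes(size_t count)
{
    return count * MAT4_BYTES;
}

/* Number of whole mat4 elements that fit in a block of max_block_bytes. */
size_t max_mat4_in_block(size_t max_block_bytes)
{
    return max_block_bytes / MAT4_BYTES;
}
```

Under these assumptions, 12800 bytes holds 200 matrices, 125 MVP matrices need 8000 bytes, and a block limited to the guaranteed 16384 bytes holds 256 matrices.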

When I removed one of the models in the scene, the one that has a light with it, the UBO approach with a chain of UBOs works better than glUniform*. The interesting thing to note is that in this case my UBO size is 12800 bytes, the same as above.

My implementation looks like this:
Let’s say we have 5 meshes in a scene. For each mesh I upload a fixed number of matrices, let’s say 4, even if it doesn’t use all of them. Now all the 5 matrices are lying next to each other in the array. I pass the starting index of the matrices for each mesh to the shader: for the first object the index will be 0, for the 2nd it will be 5, for the 3rd 10 …
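The indexing scheme described here boils down to a fixed stride per mesh. A minimal sketch in C, assuming a stride of 5 matrices per mesh to match the 0, 5, 10 indices quoted above; the function names and the slot_offset parameter are hypothetical and only illustrate the idea.

```c
/* First matrix slot for a mesh when every mesh reserves the same number
   of consecutive slots in one shared array. */
int mesh_start_index(int mesh, int matrices_per_mesh)
{
    return mesh * matrices_per_mesh;
}

/* Slot of one particular matrix of a mesh; slot_offset selects which of
   the mesh's reserved matrices is wanted (0 for the first one). */
int mesh_matrix_index(int mesh, int matrices_per_mesh, int slot_offset)
{
    return mesh_start_index(mesh, matrices_per_mesh) + slot_offset;
}
```

The value returned by mesh_start_index is what would be uploaded as the per-mesh index uniform; the shader then adds its own fixed offset to reach a specific matrix.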

I have disabled the updates to the UBO after a certain number of frames, just to eliminate that as a bottleneck.
I could try a profiler, but what information is it going to give me? I am talking about the comparison with the glUniform* case. Improving performance when using glUniform* is a different matter, for which a profiler would help me.

Posting the code is not possible, as it has too many components.

GPU profiling, particularly on mobile, makes it relatively easy to determine if your GL-ES application code is causing the driver to block (synchronize) when you perform some operation. You’ll see a gap in your timeline. That’ll make it easier to go after the causes of your primary bottleneck(s), which may or may not be related to how you bind your uniforms or in general set up your shaders.

[QUOTE=debonair;1286843]
My implementation looks like this:
Let’s say we have 5 meshes in a scene. For each mesh I upload a fixed number of matrices, let’s say 4, even if it doesn’t use all of them. Now all the 5 matrices are lying next to each other in the array. I pass the starting index of the matrices for each mesh to the shader: for the first object the index will be 0, for the 2nd it will be 5, for the 3rd 10 …[/QUOTE]

you could post the shader code, for example
binding UBO to certain binding points should be done once, when you initialize your app, the content of the UBO can be updated (orphaning, mapping, … however) without binding it again to the binding point

you could remove the UBO and stream the MVP matrices as “instanced attributes”. that requires updating a GL_ARRAY_BUFFER each frame with ALL matrices, and the “offsets” can be set explicitly if you call:

GLuint instancecount = 10; /* draw 10x with 10 different MVPs */
GLuint baseinstance = 4;   /* skips the first 4 MVPs in GL_ARRAY_BUFFER */
glDrawArraysInstancedBaseInstance(GL_TRIANGLES, 0, 3, instancecount, baseinstance);

https://www.khronos.org/opengl/wiki/Vertex_Rendering#Instancing

that way, you reduce the number of drawcommands to the number of different models, for each model change “baseinstance” to where the MVPs for these models start
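As a rough illustration of where each model's “baseinstance” would come from, here is a sketch that derives them from per-model instance counts with a running sum. The function name is hypothetical and the code contains no GL calls. One caveat to verify on the target device: glDrawArraysInstancedBaseInstance is core in desktop OpenGL 4.2, while on OpenGL ES it is, as far as I know, only available through the GL_EXT_base_instance extension (with an EXT suffix on the function name).

```c
#include <stddef.h>

/* Fill base_instance[i] with the index of the first per-instance matrix
   belonging to model i, given how many instances each model draws.
   Returns the total number of instances, i.e. the number of matrices
   the instanced attribute buffer has to hold. */
unsigned compute_base_instances(const unsigned *instance_count,
                                unsigned *base_instance, size_t num_models)
{
    unsigned running = 0;
    for (size_t i = 0; i < num_models; ++i) {
        base_instance[i] = running;   /* matrices for model i start here */
        running += instance_count[i]; /* next model starts after them */
    }
    return running;
}
```

With instance counts of 4, 10 and 2, the second model gets a base instance of 4 and an instance count of 10, which matches the values used in the call above.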

[QUOTE=john_connor;1286845]you could post the shader code, for example
binding UBO to certain binding points should be done once, when you initialize your app, the content of the UBO can be updated (orphaning, mapping, … however) without binding it again to the binding point
[/QUOTE]

I tried removing the binding, but it didn’t help. Here is my shader code:


#define MVP_OFFSET 0
#define VIEW_OFFSET 1
uniform TransformUBO {
    mat4 matrices[1000];
};
uniform int index; // starting slot of this mesh's matrices

mat4 view_matrix = matrices[index + VIEW_OFFSET];