UBO poor performance [GL 3.1]



Executor
10-16-2009, 12:05 AM
I am trying to use a UBO, but I get poor performance with it.

Code w/o UBO:


mat4 matLocal = ...;
mat4 matMVP = ...;
vec2 uvBase = ...;
vec2 perlinMovement = ...;
vec3 localEye = ...;
glUniformMatrix4fv(uniform_matLocal, 1, false, matLocal);
glUniformMatrix4fv(uniform_matMVP, 1, false, matMVP);
glUniform2fv(uniform_uvBase, 1, uvBase);
glUniform2fv(uniform_perlinMovement, 1, perlinMovement);
glUniform3fv(uniform_localEye, 1, localEye);

Code w/ UBO:


struct BlockPerBatch
{
mat4 matLocal;
mat4 matMVP;
vec2 uvBase;
vec2 perlinMovement;
vec3 localEye;
};

BlockPerBatch blockPerBatch;

glBindBuffer(GL_UNIFORM_BUFFER, ubo_BlockPerBatch); // once for all batches

...

blockPerBatch.matLocal = ...;
blockPerBatch.matMVP = ...;
blockPerBatch.uvBase = ...;
blockPerBatch.perlinMovement = ...;
blockPerBatch.localEye = ...;
glBufferData(GL_UNIFORM_BUFFER, sizeof(blockPerBatch), &blockPerBatch, GL_DYNAMIC_DRAW);

Shader:


#version 140

...

uniform BlockPerBatch
{
mat4 matLocal;
mat4 matMVP;
vec2 uvBase;
vec2 perlinMovement;
vec3 localEye;
};

...

w/o UBO - ~250 FPS
w/ UBO - ~225 FPS

GeForce 9600GT
Win7 Driver 190.89
OpenGL 3.1

What am I doing wrong?

Groovounet
10-16-2009, 03:19 AM
Do you reallocate your buffer at each frame?

UBOs work well with MapBufferRange or MapBuffer, and apparently even glBufferSubData would be faster.

Have a look at the MapBufferRange API; that's THE way to go!

Brolingstanz
10-16-2009, 04:34 AM
... Also be sure to group by frequency of update. E.g. Per-frame, per-sector, per-object, per-culator, per-fume, ....

Executor
10-16-2009, 04:53 AM
Do you reallocate your buffer at each frame?

UBOs work well with MapBufferRange or MapBuffer, and apparently even glBufferSubData would be faster.

Have a look at the MapBufferRange API; that's THE way to go!

The example from the spec uses glBufferData:


void render()
{
glClearColor(0.0, 0.0, 0.0, 0.0);
glClear(GL_DEPTH_BUFFER_BIT|GL_COLOR_BUFFER_BIT);

glUseProgram(prog_id);

glEnable(GL_DEPTH_TEST);
glMatrixMode(GL_MODELVIEW);
glLoadIdentity();
glTranslatef(0, 0, -4);
glColor3f(1.0, 1.0, 1.0);
glBindBuffer(GL_UNIFORM_BUFFER, buffer_id);
//We can use BufferData to upload our data to the shader,
//since we know it's in the std140 layout
glBufferData(GL_UNIFORM_BUFFER, 80, colors, GL_DYNAMIC_DRAW);
//With a non-standard layout, we'd use BufferSubData for each uniform.
glBufferSubData(GL_UNIFORM_BUFFER_EXT, offset, singleSize, &colors[8]);
//the teapot winds backwards
glFrontFace(GL_CW);
glutSolidTeapot(1.33);
glFrontFace(GL_CCW);
glutSwapBuffers();
}

There, glBufferSubData is only used to update a single uniform in the block.
I tried glBufferSubData for the whole block; the FPS is the same as with glBufferData.


... Also be sure to group by frequency of update. E.g. Per-frame, per-sector, per-object, per-culator, per-fume, ....

I'm sure I do...

Groovounet
10-16-2009, 06:51 AM
Really, glBufferSubData and glBufferData are not good solutions.
This sample works, but it is not something to use in real applications. Calling glBufferSubData for a single data update would be worse than using glUniform*, which is still available for uniforms kept outside a uniform block.
Calling glBufferData is like a C++ "new" in OpenGL; you don't want to do that just to upload your data!

Create and allocate the buffer once with glBufferData, but update it with the MapBufferRange API: parallel, asynchronous, and with fine-grained control.

You can actually use a single buffer to pack all your "block per" kinds of data, as long as you keep the uniforms grouped together.

Example of a single uniform buffer:
128 bytes Per-frame
64 bytes Per-object
16 bytes Per-batch

Don't forget that GPUs have a memory burst size, usually with a minimum of 64 bytes. There is a balance to find to reach a good granularity, and that's why I like the single grouped uniform buffer approach.

Then you can pick just the right number of bytes to update with MapBufferRange, even in parallel, as long as the ranges don't overlap.

Even if you have a huge number of uniforms, you could use several CPU threads to update the buffer data per block and send the data in parallel as you go.
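
The single-buffer layout described above can be sketched as plain offset arithmetic. This is only an illustration under assumptions: it supposes each group is later bound to its own uniform block with glBindBufferRange, whose offset must be a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT. The value 256 is an assumed placeholder (the real one has to be queried with glGetIntegerv), and the names Range, align_up and overlaps are made up for this sketch.

```cpp
#include <cstddef>

// Round `value` up to the next multiple of `alignment`.
constexpr std::size_t align_up(std::size_t value, std::size_t alignment) {
    return (value + alignment - 1) / alignment * alignment;
}

// A byte range inside the single uniform buffer (hypothetical helper type).
struct Range {
    std::size_t offset;
    std::size_t size;
};

// ASSUMPTION: stand-in for the value of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT.
constexpr std::size_t kOffsetAlignment = 256;

// Group sizes taken from the example above: 128, 64 and 16 bytes.
constexpr Range kPerFrame = {0, 128};
constexpr Range kPerObject = {
    align_up(kPerFrame.offset + kPerFrame.size, kOffsetAlignment), 64};
constexpr Range kPerBatch = {
    align_up(kPerObject.offset + kPerObject.size, kOffsetAlignment), 16};

// Size to allocate once for the whole buffer.
constexpr std::size_t kTotalSize = kPerBatch.offset + kPerBatch.size;

// Two ranges may be updated independently only if they do not overlap.
constexpr bool overlaps(Range a, Range b) {
    return a.offset < b.offset + b.size && b.offset < a.offset + a.size;
}
```

With the assumed alignment the three ranges start at bytes 0, 256 and 512, so each one can be mapped and rewritten without touching the others.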

Executor
10-16-2009, 07:27 AM
I'll try MapBufferRange later, thanks...

I have updated the drivers to 191.07 WHQL:

w/o UBO - ~250 FPS
w/ UBO - ~240 FPS

The result is better...

Groovounet
10-16-2009, 07:41 AM
Hmm

What's your result with glUniform?

skynet
10-16-2009, 08:01 AM
Groovounet:
Do you have data to back up your claim that MapBufferRange is faster than glBufferData for _small_ buffers?

I'm using two UBOs to store per-view and per-object matrices. These two buffers are no bigger than 240 bytes each. Whenever I need to change one of them, I upload the whole contents via glBufferData. This gives the driver a hint that the old data is no longer needed, and if the old contents are still in use, it might use a double-buffer scheme internally so as not to stall the pipeline.

Stephen A
10-16-2009, 08:08 AM
You can achieve the same effect by calling glBufferData(null) and glMapBuffer.

I've tested on a few different drivers, and there's no clear winner between glBufferData and glMapBuffer. The only significant difference occurs when streaming data, where glMapBuffer pulls ahead (i.e. it allows you to write directly to the mapped region and avoid allocating a temporary client-side buffer).

Groovounet
10-16-2009, 08:31 AM
I never actually considered that glBufferData might not stall. How does glBufferSubData affect your performance?

I have seen MapBufferRange used with quite large buffers; that's why I pack everything into a single buffer, to keep it large enough.
I assume that the MapBufferRange "access" parameter gives the hints to the driver.

As for small buffers ... when you put all your uniforms in a single uniform buffer, it's not that small ...

(PS: I'm going to dig a bit more into this topic; I'll let you know with numbers!)

skynet
10-16-2009, 09:26 AM
As for small buffers ... when you put all your uniforms in a single uniform buffer, it's not that small ...


I did not state it clearly enough. I do not have one buffer per object. Instead, I have one buffer that stores the modelviewprojection matrix. This UBO has to be changed per object (or better: per draw call that uses a different matrix). The same UBO is shared by all shaders, though. This is my way of compensating for the loss of the gl_ModelViewProjectionMatrix and ftransform() built-ins when I switched over to GL 3.0+.

Executor
10-16-2009, 09:39 AM
Hmm

What's your result with glUniform?

w/o UBO (using glUniform*) - ~250 FPS

marshats
10-16-2009, 10:58 AM
First thing, why recompute sizeof(blockPerBatch) with each glBufferData(..sizeof(blockPerBatch)...). I would make a single call GLuint sizeof_blockPerBatch = sizeof(blockPerBatch) then glBufferData(..sizeof_blockPerBatch...).

Second, do you get a speed/FPS improvement if you use layout(std140) in your shader like


layout(std140) uniform BlockPerBatch
{
mat4 matLocal;
mat4 matMVP;
vec2 uvBase;
vec2 perlinMovement;
vec3 localEye;
};


Code w/ UBO:


GLuint uniformBlock_blockPerBatch_id;
GLfloat blockPerBatch[] =
{ //layout(std140) uniform BlockPerBatch
1.0,0.0,0.0,0.0, //mat4 matLocal (byte offset 0)
0.0,1.0,0.0,0.0,
0.0,0.0,1.0,0.0,
0.0,0.0,0.0,1.0,
1.0,0.0,0.0,0.0, //mat4 matMVP (byte offset 64)
0.0,1.0,0.0,0.0,
0.0,0.0,1.0,0.0,
0.0,0.0,0.0,1.0,
0.0,0.0, //vec2 uvBase (byte offset 128; std140 aligns a vec2 to 8 bytes, so no filler)
0.0,0.0, //vec2 perlinMovement (byte offset 136)
0.0,0.0,0.0, 1.0, //vec3 localEye (byte offset 144; last 1.0 is filler)
};
GLuint sizeof_blockPerBatch = sizeof(blockPerBatch);

//convenience map into blockPerBatch (indices are in floats, i.e. byte offset / 4)
mat4 &matLocal = (mat4&)blockPerBatch[0];
mat4 &matMVP = (mat4&)blockPerBatch[16];
vec2 &uvBase = (vec2&)blockPerBatch[32];
vec2 &perlinMovement = (vec2&)blockPerBatch[34];
vec3 &localEye = (vec3&)blockPerBatch[36];

defineUniformBlockObject(0,"BlockPerBatch",uniformBlock_blockPerBatch_id); // once for all batches

matLocal = ...;
matMVP = ...;
uvBase = ...;
perlinMovement = ...;
localEye = ...;

glBindBuffer(GL_UNIFORM_BUFFER, uniformBlock_blockPerBatch_id);

glBufferData(GL_UNIFORM_BUFFER, sizeof_blockPerBatch, &blockPerBatch, GL_DYNAMIC_DRAW); // don't recompute sizeof() every call!


where the helper defineUniformBlockObject function is


void defineUniformBlockObject(GLuint binding_point, const char *GLSL_block_string, GLuint &uniformBlock_id)
{
glGenBuffers(1, &uniformBlock_id);

//"layout(std140) uniform GLSL_block_string"
GLuint uniformBlockIndex = glGetUniformBlockIndex(shader_id, GLSL_block_string);

//And associate the uniform block to binding point
glUniformBlockBinding(shader_id, uniformBlockIndex, binding_point);

//Now we attach the buffer to UBO binding_point...
glBindBufferBase(GL_UNIFORM_BUFFER, binding_point, uniformBlock_id);

//We need to get the uniform block's size in order to back it with the
//appropriate buffer
GLsizei uniformBlockSize;
glGetActiveUniformBlockiv(shader_id, uniformBlockIndex,
GL_UNIFORM_BLOCK_DATA_SIZE,
&uniformBlockSize);

//Create UBO.
glBindBuffer(GL_UNIFORM_BUFFER, uniformBlock_id);
glBufferData(GL_UNIFORM_BUFFER, uniformBlockSize, NULL, GL_DYNAMIC_DRAW);
}


I see a speed improvement using this over a bunch of separate glUniform* calls, but I haven't tested it extensively. I would be curious whether using "layout(std140) uniform" has any effect in your case.

kRogue
10-16-2009, 12:33 PM
w/o UBO - ~250 FPS
w/ UBO - ~225 FPS


I know lots of you all love FPS as a speed measurement, but you really should look at how much time it takes to render, rather than how many renders per second:

w/o UBO 0.004 seconds
w UBO 0.0044 seconds

so the difference in time to render is, ahem, 0.4 ms, um is that even really a difference?

Additionally, the data you have tied to the UBO is not that much: 2 mat4's, 2 vec2's and 1 vec3 --> 39 floats, not exactly a lot of data.

How big are the meshes being rendered? How many? As the number of draw calls and the number of different shaders go up, you will find that UBOs beat glUniform calls, but right now the difference in time is 0.4 ms, which, once you get into the realm of say 60fps/120fps, is not even noticeable in the FPS rating.
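
The FPS-to-frame-time conversion used above can be written as a tiny helper. This is a sketch only; the function names are made up for illustration and do not come from the thread.

```cpp
// Convert a frames-per-second reading into the time spent on one frame,
// in milliseconds. (Hypothetical helper name.)
constexpr double frame_time_ms(double fps) {
    return 1000.0 / fps;
}

// Extra per-frame cost of the slower configuration, in milliseconds.
constexpr double frame_time_delta_ms(double fps_fast, double fps_slow) {
    return frame_time_ms(fps_slow) - frame_time_ms(fps_fast);
}
```

With the numbers from the first post, frame_time_ms(250.0) is 4 ms and frame_time_ms(225.0) is about 4.44 ms, so the gap is a bit under half a millisecond per frame.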

marshats
10-16-2009, 01:13 PM
Good comment on the small block size and the difference of 0.4 ms! That difference is probably timer precision error.

Note I use FPS as a measure based on the post on performance measurements (http://www.opengl.org/wiki/Performance).

kRogue
10-16-2009, 01:42 PM
My overly simple way to measure FPS is to just track the time after each buffer swap (i.e. glXSwapBuffers, or whatever). It might not be perfect for a given frame, but it gives a good overall picture of how much time a typical frame takes.

Edit: after reading that link:
Looking at the link, that is exactly what they say to do. Silly me.

Executor
10-19-2009, 09:58 AM
First thing, why recompute sizeof(blockPerBatch) with each glBufferData(..sizeof(blockPerBatch)...). I would make a single call GLuint sizeof_blockPerBatch = sizeof(blockPerBatch) then glBufferData(..sizeof_blockPerBatch...).

Bad advice...
sizeof() is evaluated at compile time...
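
A small sketch of that point: sizeof produces a constant expression, so it can feed a constexpr variable and a static_assert, and caching it in a run-time variable saves nothing. The struct below is a stand-in that uses plain float arrays instead of the mat4/vec2/vec3 types from the first post (an assumption made so the snippet compiles on its own), and kBlockSize is a made-up name.

```cpp
#include <cstddef>

// Stand-in for BlockPerBatch from the first post; float arrays replace
// the math types so no external headers are needed.
struct BlockPerBatch {
    float matLocal[16];
    float matMVP[16];
    float uvBase[2];
    float perlinMovement[2];
    float localEye[3];
};

// sizeof is evaluated by the compiler: it can initialise a constexpr
// variable, and the check below is done entirely at build time.
constexpr std::size_t kBlockSize = sizeof(BlockPerBatch);
static_assert(kBlockSize == 39 * sizeof(float),
              "2 mat4 + 2 vec2 + 1 vec3 = 39 tightly packed floats");
```

Note that this is the C++ size of the stand-in struct; the size GL reports for the uniform block may be larger because of layout padding.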


Second, do you get a speed/FPS improvement if you use layout(std140) in your shader like

No difference...


How big are the meshes being rendered? How many?

~30 batches per frame
max ~130 batches per frame

dv
10-19-2009, 01:37 PM
By the way, I am a bit confused about the std140 layout. It seems MUCH more practical than the regular one, since it doesn't require tons of queries etc. But is it safe to use std140 all the time for UBOs with constantly changing content? Or are there any disadvantages of std140 I should know about?

Rob Barris
10-19-2009, 01:44 PM
Tradeoffs with using std140 layout:

+ your code is simpler since layout is known at build time
+ layouts can be more readily known/shared amongst multiple programs

- the runtime cannot optimize/pack/relocate "dead" uniform slots in cases where the program does not reference all of them. This can happen when "uber shaders" with lots of conditionally activated paths are in play.

- there may be some data packing opportunities that std140 precludes the runtime from using - which could be vendor or processor specific.

So "it depends".

Look at issues 47/48 in the extension spec for UBO:

http://www.opengl.org/registry/specs/ARB/uniform_buffer_object.txt

dv
10-19-2009, 02:10 PM
I see. So, for instance when I want to stream instancing data from somewhere, std140 would be beneficial since I could simply copy the data into the UBO instead of many small copies to the respective offsets. But for other scenarios where I for example simply update a model matrix now and then inside the UBO, a packed layout would make more sense. Did I understand this correctly?

Brolingstanz
10-19-2009, 05:21 PM
I think the general idea is to keep things as tightly packed and sequential for access as possible.

See the vendor perf docs for details, but I don't think small, partial, amidships updates are going to buy you much in any case.

Rob Barris
10-19-2009, 05:35 PM
I see. So, for instance when I want to stream instancing data from somewhere, std140 would be beneficial since I could simply copy the data into the UBO instead of many small copies to the respective offsets. But for other scenarios where I for example simply update a model matrix now and then inside the UBO, a packed layout would make more sense. Did I understand this correctly?

If the layout of your original data follows that of a C struct, then yes, being able to mirror that data with a simple code style would be a win for std140.
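
As a rough illustration of mirroring a std140 block with a C-style struct, the sketch below uses alignas to reproduce the std140 alignment of the block discussed in this thread. It assumes 4-byte floats and uses plain float arrays in place of mat4/vec2/vec3; the struct name is made up for this example.

```cpp
#include <cstddef>

// Hypothetical CPU-side mirror of:
//   layout(std140) uniform BlockPerBatch { mat4; mat4; vec2; vec2; vec3; }
struct BlockPerBatchStd140 {
    alignas(16) float matLocal[16];       // mat4: four vec4 columns
    alignas(16) float matMVP[16];         // mat4
    alignas(8)  float uvBase[2];          // vec2: 8-byte alignment
    alignas(8)  float perlinMovement[2];  // vec2
    alignas(16) float localEye[3];        // vec3: aligned like a vec4
};

// Build-time checks that the mirror has the intended member offsets.
static_assert(offsetof(BlockPerBatchStd140, matLocal) == 0, "matLocal");
static_assert(offsetof(BlockPerBatchStd140, matMVP) == 64, "matMVP");
static_assert(offsetof(BlockPerBatchStd140, uvBase) == 128, "uvBase");
static_assert(offsetof(BlockPerBatchStd140, perlinMovement) == 136, "perlinMovement");
static_assert(offsetof(BlockPerBatchStd140, localEye) == 144, "localEye");
```

An instance of such a struct could then be uploaded in one copy, which is the "simple code style" benefit mentioned above.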

The packed form is going to be more useful if there are named uniforms of a significant quantity or size, that will not get 'touched' by a given linked program. It really depends on your code.

Where this becomes particularly noticeable is if you are running on hardware with a limited number of register slots for program parameters, and your total set of uniform space exceeds it - going to 'packed' could well get you back under the wire.

Executor
10-26-2009, 07:30 AM
layout(std140):


Name: matLocal
Index: 2
Offset: 0
Size: 1

Name: matMVP
Index: 3
Offset: 64 // mat4 offset
Size: 1

Name: uvBase
Index: 18
Offset: 128 // mat4 offset
Size: 1

Name: perlinMovement
Index: 5
Offset: 136 // vec2 offset
Size: 1

Name: localEye
Index: 1
Offset: 144 // vec2 offset
Size: 1

All ok...
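
The std140 offsets listed above follow from the layout's alignment rules, which can be reproduced with a little arithmetic. The sketch below assumes 4-byte floats; the constant and function names are made up for illustration.

```cpp
#include <cstddef>

// Round `offset` up to the next multiple of `alignment`.
constexpr std::size_t align_up(std::size_t offset, std::size_t alignment) {
    return (offset + alignment - 1) / alignment * alignment;
}

// std140 base alignment and size, in bytes, for the types in the block.
// A vec3 is aligned like a vec4 but occupies only 12 bytes; a mat4 is
// stored as four vec4 columns.
constexpr std::size_t kVec2Align = 8;
constexpr std::size_t kVec2Size = 8;
constexpr std::size_t kVec3Align = 16;
constexpr std::size_t kVec3Size = 12;
constexpr std::size_t kMat4Align = 16;
constexpr std::size_t kMat4Size = 64;

// Walk the members in declaration order, aligning each one in turn.
constexpr std::size_t kOffMatLocal = align_up(0, kMat4Align);
constexpr std::size_t kOffMatMVP = align_up(kOffMatLocal + kMat4Size, kMat4Align);
constexpr std::size_t kOffUvBase = align_up(kOffMatMVP + kMat4Size, kVec2Align);
constexpr std::size_t kOffPerlin = align_up(kOffUvBase + kVec2Size, kVec2Align);
constexpr std::size_t kOffLocalEye = align_up(kOffPerlin + kVec2Size, kVec3Align);

// First byte past the last member; the size reported by the driver through
// GL_UNIFORM_BLOCK_DATA_SIZE may be larger than this.
constexpr std::size_t kEndOfMembers = kOffLocalEye + kVec3Size;
```

This yields 0, 64, 128, 136 and 144, matching the queried offsets.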

layout(packed):


Name: matLocal
Index: 2
Offset: 0
Size: 1

Name: matMVP
Index: 3
Offset: 64 // mat4 offset
Size: 1

Name: uvBase
Index: 18
Offset: 128 // mat4 offset
Size: 1

Name: perlinMovement
Index: 5
Offset: 128 // wtf?
Size: 1

Name: localEye
Index: 1
Offset: 128 // wtf?
Size: 1

Why do three uniforms have the same offset?

Executor
10-27-2009, 10:48 PM
Up...