Uniform Buffer Objects (slow) - help needed

I’ve already seen the “VBOs strangely slow?” thread. I’ve read through it twice, and it all makes sense. However, my problem is a bit different in that it’s dealing with UBOs and not VBOs. Although, behind the scenes, are they essentially the same thing?

glMapBufferRange makes good sense to me, as it seems to mimic (for the most part) what DirectX has always had in terms of buffer object locking/unlocking.

The simple case that I currently have working is a shader program with a constant block that’s updated via a UBO. The constant block is structured as follows:


uniform DF_GLOBALS
{
    mat4    WorldView;
    mat4    WorldViewProj;

    vec4    BackBufferInfo;
    vec3    CameraInfo;
    vec2    ViewportInfo;
};

As you can see, that’s not a lot of data: two mat4s plus a few small vectors, under 200 bytes in total.
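For reference, the buffer setup looks roughly like this (a minimal sketch; “prog”, the mirrored CPU-side struct DF_Globals, and binding point 0 are placeholders rather than my exact code):

GLuint ubo;
glGenBuffers(1, &ubo);
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
// Allocate storage once; the contents get replaced on every update.
glBufferData(GL_UNIFORM_BUFFER, sizeof(DF_Globals), NULL, GL_DYNAMIC_DRAW);

// Route the block in the program to binding point 0 and attach the buffer there.
GLuint blockIndex = glGetUniformBlockIndex(prog, "DF_GLOBALS");
glUniformBlockBinding(prog, blockIndex, 0);
glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);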

However, the problem is largely with glMapBufferRange and to a smaller degree, glUnmapBuffer.

I’m currently mapping the entire buffer and using the following flags to request that the driver discard the possibly in use buffer memory, and hand me back a pointer to new memory, if necessary:

GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT

That call to glMapBufferRange alone is taking just under 1 millisecond. Maybe (hopefully) I’m doing something wrong?
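The per-update path I’m timing is essentially this (a sketch; “globals” is a hypothetical CPU-side copy of the block, and error checking is omitted):

glBindBuffer(GL_UNIFORM_BUFFER, ubo);
// Ask for write access and let the driver orphan the old storage if it's still in use.
void* ptr = glMapBufferRange(GL_UNIFORM_BUFFER, 0, sizeof(DF_Globals),
                             GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
memcpy(ptr, &globals, sizeof(DF_Globals));
glUnmapBuffer(GL_UNIFORM_BUFFER);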

Using either GL_DYNAMIC_DRAW or GL_STREAM_DRAW at buffer creation time makes no difference.

The old school method of glBufferData( NULL ) to discard in conjunction with glMapBuffer does help quite a bit (relatively speaking). But even then, it’s still around the 0.1 to 0.2 ms range, per update.
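For comparison, the old-school variant looks roughly like this (sketch, same placeholder names as above):

glBindBuffer(GL_UNIFORM_BUFFER, ubo);
// Re-specify the storage with NULL to orphan whatever the GPU might still be reading...
glBufferData(GL_UNIFORM_BUFFER, sizeof(DF_Globals), NULL, GL_DYNAMIC_DRAW);
// ...then map the fresh allocation and fill it.
void* ptr = glMapBuffer(GL_UNIFORM_BUFFER, GL_WRITE_ONLY);
memcpy(ptr, &globals, sizeof(DF_Globals));
glUnmapBuffer(GL_UNIFORM_BUFFER);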

This becomes unbearably slow very quickly if I try to draw many objects, each with its WorldView and/or WorldViewProj matrix updated. In that case I’m making many calls to glMapBufferRange per frame. Is there a better way I should be doing this?

Hardware is ATI HD4850 with latest drivers.

Any ideas? Thanks.

Some quick ideas:

  • try a different UBO storage layout (std140, packed)
  • try glMapBuffer(GL_WRITE_ONLY), and use GL_DYNAMIC_DRAW and the GL_UNIFORM_BUFFER target when creating the buffer

I’m already using std140. I don’t see how that could/should affect mapping performance though.

I’m already using GL_UNIFORM_BUFFER as target at creation time. What else can you use there without OpenGL complaining? I’ve tried both GL_DYNAMIC_DRAW and GL_STREAM_DRAW, neither seems to make a difference.

Just so I’m clear, what’s the intended use of UBOs? Is it so you can quickly swap out different values for the same set of constants (i.e. init a couple of buffers once and then don’t touch them again - just swap amongst them)? Don’t see how that’s useful though.

Or are UBOs intended to be used like most other buffer-backed operations, which benefit from async updates? I assume this is the intended usage pattern, so you can quickly and repeatedly update certain (any/all) constants multiple times per frame via OpenGL’s buffer object mechanism.

The default UBO storage layout is shared, so the code you posted uses shared. It shouldn’t affect mapping performance, but there might be a driver bug, so it’s worth checking all of the layouts.

GL buffers are just raw containers of bytes. You can use any target for any purpose, but there may be a performance penalty.

I recommend this reading http://www.opengl.org/registry/specs/ARB/uniform_buffer_object.txt.

Storing uniform blocks in buffer objects enables several key use
cases:

 - sharing of uniform data storage between program objects and
   between program stages

 - rapid swapping of sets of previously defined uniforms by storing
   sets of uniform data on the GL server

 - rapid updates of uniform data from both the client and the server

Yes, I know. The code I posted was only a snippet of a larger piece, where I specify:

layout(std140) uniform;

I’m certain that all of my constants are using std140. Functionally, everything is working. Objects are rendering where they should be with proper orientations etc.

Performance-wise, though, not so much. The performance just isn’t there.

I guess I could try using shared, but I’ll have to query for and store offsets, which the code isn’t currently set up to do.

Just so I’m clear, what’s the intended use of UBOs?

There are many intended uses for them. High-frequency updates, however, are not one of them. If you’re trying to update a UBO more than once per frame, you’re using it wrong.

You can use UBOs to share particular data among several programs. For example, if you have a camera matrix and a projection matrix, these are constant for all entities in the scene. As long as all of these shaders use the same UBO, you can update them just by changing one UBO.
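For instance, something like this (a sketch; the block name “PerFrame” and binding point 0 are made up):

// Route the shared block in each program to the same binding point (done once)...
glUniformBlockBinding(progA, glGetUniformBlockIndex(progA, "PerFrame"), 0);
glUniformBlockBinding(progB, glGetUniformBlockIndex(progB, "PerFrame"), 0);
// ...and attach the one camera/projection UBO there. Updating that single
// buffer now updates every program that uses the block.
glBindBufferBase(GL_UNIFORM_BUFFER, 0, cameraUBO);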

You can also use UBOs to store per-instance data. But this doesn’t mean you change the buffer’s data every time you draw another instance.

For example, if you have the model-to-world (MTW) matrix as per-instance data, you can allocate a “large” buffer object. From that buffer, you can allocate slices that are the size of the MTW matrix, one allocation for each instance. When you are preparing to draw your instances, you map the entire buffer once and update every instance’s per-instance data. Note that this happens only once per frame.
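A minimal sketch of that approach, assuming one std140 mat4 of per-instance data and hypothetical names throughout:

// Startup: one large UBO holding a slice per instance, each slice padded
// up to GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT so it can later be bound by offset.
GLint align = 0;
glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &align);
GLsizeiptr slice = ((sizeof(float) * 16 + align - 1) / align) * align;

GLuint instanceUBO;
glGenBuffers(1, &instanceUBO);
glBindBuffer(GL_UNIFORM_BUFFER, instanceUBO);
glBufferData(GL_UNIFORM_BUFFER, slice * numInstances, NULL, GL_STREAM_DRAW);

// Once per frame: a single map covers every instance's matrix.
char* ptr = (char*)glMapBufferRange(GL_UNIFORM_BUFFER, 0, slice * numInstances,
                                    GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
for (int i = 0; i < numInstances; ++i)
    memcpy(ptr + i * slice, instances[i].modelToWorld, sizeof(float) * 16);
glUnmapBuffer(GL_UNIFORM_BUFFER);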

This is how I do it, and it works fine on ATI.

First of all, I use UBOs exclusively, since the rendering API must be compatible with DX10/DX11. For all dynamic uniforms (those that are updated every frame, and there are a lot of them) I create one large shared UBO (actually two, since it’s double buffered for performance reasons) and then use glBufferSubData to update a specific range within it.

Everything is std140, again for compatibility and to reduce dependencies between code and shaders.


glBufferSubData(GL_UNIFORM_BUFFER, MapOffset, MapLength, 0x0);    // null data first for the range
glBufferSubData(GL_UNIFORM_BUFFER, MapOffset, MapLength, pData);  // then upload the new data
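Across a frame that ends up looking roughly like this (a sketch; the ping-pong index, range variables, and binding point are placeholder names):

// Alternate between the two UBOs so a range the GPU may still be reading
// from last frame never has to be touched.
GLuint ubo = dynamicUBO[frameIndex & 1];
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
glBufferSubData(GL_UNIFORM_BUFFER, rangeOffset, rangeSize, 0x0);    // same null-then-data pattern as above
glBufferSubData(GL_UNIFORM_BUFFER, rangeOffset, rangeSize, pData);
// Shaders source the range through an indexed binding point.
glBindBufferRange(GL_UNIFORM_BUFFER, bindingPoint, ubo, rangeOffset, rangeSize);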

Makes sense. However, for each instance rendered, it’s going to require some kind of index into the global array of MTW matrices, right? How is that index (per-object) supposed to be set? Somehow, somewhere, there’s going to need to be a per-object update of constant data - whether that be an index into a global array of MTW matrices or its actual MTW matrix.

And Sunray, like you, I’m trying to maintain some feature parity, because this is an abstracted renderer built on top of DX10 and OpenGL. The DX10 HLSLWithoutFX10 example clearly implies that it’s okay to perform multiple constant buffer updates per frame. Maybe the example isn’t that great and it breaks down in real-world usage scenarios?

I just wanted to follow up here. Instead of mapping/unmapping, I’m now calling glBufferData with a pointer to the constant data instead of NULL. This is apparently extremely fast; it handles several hundred objects with ease, in ~0.2 ms.
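Concretely, each update is now just this (sketch, with the same placeholder names as before):

glBindBuffer(GL_UNIFORM_BUFFER, ubo);
// Re-specify the whole (small) buffer with the new contents in one call.
glBufferData(GL_UNIFORM_BUFFER, sizeof(DF_Globals), &globals, GL_STREAM_DRAW);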

It seems to me that glMapBufferRange with GL_MAP_INVALIDATE_BUFFER_BIT followed by a memcpy into the mapped memory should perform the same. However, it doesn’t. Furthermore, the way the docs read, it seems as though glMapBufferRange is intended to be used the same way DirectX’s buffers are used with either the discard or no-overwrite flags.

How is that index (per-object) supposed to be set?

That’s up to you. You define the concept of an “Object”, so within that concept there must be a reference to where this object’s per-instance data is stored in the large buffer.

This offset is not part of the uniform data; it is part of how you bind the uniform data. It’s one of the parameters to glBindBufferRange.
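For example (a sketch; the per-object offset and the binding point are hypothetical names, and the offset has to respect GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT):

// Before drawing this object, bind just its slice of the large buffer
// to the block's binding point.
glBindBufferRange(GL_UNIFORM_BUFFER, bindingPoint, largeUBO,
                  object.instanceOffset, sizeof(float) * 16);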

It seems to me that glMapBufferRange with GL_MAP_INVALIDATE_BUFFER_BIT followed by a memcpy into the mapped memory should perform the same. However, it doesn’t. Furthermore, the way the docs read, it seems as though glMapBufferRange is intended to be used the same way DirectX’s buffers are used with either the discard or no-overwrite flags.

Or, it’s just ATI :wink:

glMapBufferRange is still “new”, despite being available for 1.5 years. I imagine that, with the relative scarcity of new OpenGL games on Windows, ATI hasn’t had much need to optimize glMapBufferRange.

There is no reason for glBufferData(NULL) to be any slower than glMapBufferRange with invalidate. And on NVIDIA drivers, it isn’t. I guess ATI hasn’t gotten around to optimizing this call.