Poor OpenGL performance when mapping uniform buffers

I’m porting my engine to different platforms and working on GL performance, which is currently far behind my D3D implementation.

Digging deeper, I found that the main performance cost is in uniform buffer map/unmap; see the PerfStudio capture in the screenshot below.
The same D3D app (with a simple scene) is about 50% faster than GL, and as scene complexity rises (and with it the number of uniform buffer maps), GL falls further and further behind.

My uniform buffer map/unmap code looks like this (very much like the D3D calls):

glBindBuffer(GL_UNIFORM_BUFFER, buff);
void* data = glMapBufferRange(GL_UNIFORM_BUFFER, 0, size,
                              GL_MAP_INVALIDATE_BUFFER_BIT | GL_MAP_WRITE_BIT);
memcpy(data, src, size); // src = this object's uniform data
glUnmapBuffer(GL_UNIFORM_BUFFER);

I have tried different mapping calls and flags but couldn’t get better results.
Could you give me a hint about this issue? Or is this normal for current drivers?

BTW, I’m testing this with ATI drivers (v4.2.12002) on a 5750.

I’ve found the same; the only reasonable way I’ve been able to update UBOs is to create a single large UBO at startup (big enough for my maximum number of scene objects), do one big glBufferSubData update at the start of each frame, then use glBindBufferRange per object. Even then it was still slower than the common D3D method (outlined below), but only slightly (on the order of 5% to 10%), so I decided to just live with it.
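
A rough sketch of that pattern (PerObjectData, kMaxObjects, numObjects and sceneData are placeholder names, and this assumes sizeof(PerObjectData) is already a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT):

// One big UBO created at startup, sized for the maximum object count.
GLuint sceneUBO;
glGenBuffers(1, &sceneUBO);
glBindBuffer(GL_UNIFORM_BUFFER, sceneUBO);
glBufferData(GL_UNIFORM_BUFFER, kMaxObjects * sizeof(PerObjectData), NULL, GL_DYNAMIC_DRAW);

// Once per frame: one big upload covering every object in the scene.
glBindBuffer(GL_UNIFORM_BUFFER, sceneUBO);
glBufferSubData(GL_UNIFORM_BUFFER, 0, numObjects * sizeof(PerObjectData), sceneData);

// Per object: bind just that object's slice to binding point 0 and draw.
for (int i = 0; i < numObjects; ++i)
{
    glBindBufferRange(GL_UNIFORM_BUFFER, 0, sceneUBO,
                      (GLintptr)i * sizeof(PerObjectData), sizeof(PerObjectData));
    // ... draw object i ...
}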

Obviously this doesn’t map well to the D3D code you’re likely using (a single small constant buffer which you can map with discard per-object) and means that you’d need to start having divergent code paths which sucks somewhat. Maybe others can chime in with more info.

You’re orphaning every single update?

If there are a lot of these every frame, especially if they’re of varying sizes, I wouldn’t expect that to be blazingly fast.

Try allocating a single large buffer, streaming results into it with (GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_RANGE_BIT), and then only when it fills up orphan it with (GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT).
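
In code, that might look something like this (streamUBO, streamSize and streamOffset are hypothetical state for a large pre-allocated buffer; real code should also round offsets up to GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT):

// Append 'size' bytes of uniform data, orphaning only when the buffer is full.
GLintptr StreamUniformData(const void* src, GLsizeiptr size)
{
    GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_RANGE_BIT;

    if (streamOffset + size > streamSize)
    {
        // Out of room: orphan the whole buffer and start writing from the top.
        flags = GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT;
        streamOffset = 0;
    }

    glBindBuffer(GL_UNIFORM_BUFFER, streamUBO);
    void* dst = glMapBufferRange(GL_UNIFORM_BUFFER, streamOffset, size, flags);
    memcpy(dst, src, size);
    glUnmapBuffer(GL_UNIFORM_BUFFER);

    GLintptr offset = streamOffset;
    streamOffset += size;
    return offset; // use with glBindBufferRange for this draw call
}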

Yes, I’m orphaning on every update; this is exactly what I’m doing in the D3D version.
But the D3D version is orders of magnitude faster than GL, using the same method.

Is this because of inferior drivers?

It should make sense to map the buffers whenever we want without worrying about these issues; the drivers are there to optimize this stuff under the hood, aren’t they?
Unfortunately, making one large buffer for the whole scene, as mhagain said, sucks, because I’d have to implement two different code paths for D3D and GL.

thanks

Unfortunately the GL buffer object API has always been kinda crap like this. Usage hints instead of explicit behaviours, drivers doing what they want anyway, shuffling buffers around between different storage based on heuristics, inconsistent behaviour for different buffer object types, and synchronization all over the place despite being told to not do so.

It’s not really the drivers at fault; it’s that buffer objects were specified in a fairly loose and woolly way to begin with, and the drivers are just trying to make the best of a bad specification. With D3D buffers you’re used to being able to say “do this” and the driver does what you want (or gives you a nice big error if it can’t); with GL buffers things are regrettably not so straightforward.

For UBOs, and as I indicated, the best performance I’ve personally had was from glBufferSubData (roughly equivalent to UpdateSubresource) rather than glMapBufferRange. Yes, that means having to do an extra memory copy, but at least the driver can copy off your data and do the update in a more orderly manner, managing resource contention itself.

Using glBufferSubData on AMD Catalyst 13.9 causes severe spikes in the frame (and a higher frame time overall).
glMapBufferRange, on the other hand, doesn’t produce any spikes, so it was the best solution for me.

As for the API problem: the API is definitely less consistent than D3D’s, but how is this particular mapping scheme different from D3D?
I didn’t get what you mean. D3D has more or less the same map/unmap API, and the driver could do whatever it does for D3D buffers when it detects that we’re using the same kind of calls, usage hints and flags.

Another question: do you work with NVIDIA or AMD drivers?
I’m eager to test this on NVIDIA drivers too, which I don’t have.

My best guess is to use two large UBOs and, as mhagain mentioned, update everything with one glBufferSubData before you render anything with that UBO. Two UBOs instead of one, for manual double-buffering; otherwise you may end up waiting for the previous one to no longer be in use by the FIFO, or orphaning the buffer.

Orphaning is easiest to do with glBufferData or glBufferSubData(…,0,fullsize), but with orphaning you end up wasting time allocating and CPU-mapping those large buffers again and again. If you have two or more buffers (swapping between them on every intermediate render), chances are the current buffer is no longer in use by the FIFO, and the driver will detect that it doesn’t need to orphan it, so it will probably reuse the allocation and the CPU mapping. To make the best use of such cases, I would use glMapBuffer(…, GL_WRITE_ONLY) at the start of the render, fill in all the data needed for the render, call glUnmapBuffer, and then call glBindBufferRange/glBindBufferBase multiple times (once per draw call). This should have the same effect as glBufferSubData(…,0,fullsize), except that you avoid an unnecessary memcpy().
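
A minimal sketch of that double-buffered arrangement, assuming a hypothetical ubo[2] array allocated at startup and a frameIndex counter:

// Alternate between the two buffers on every render.
GLuint current = ubo[frameIndex & 1];
frameIndex++;

// Fill the whole buffer once at the start of the render.
glBindBuffer(GL_UNIFORM_BUFFER, current);
void* dst = glMapBuffer(GL_UNIFORM_BUFFER, GL_WRITE_ONLY);
// ... write all the per-object data for this render into dst ...
glUnmapBuffer(GL_UNIFORM_BUFFER);

// Then, once per draw call, bind the relevant slice.
glBindBufferRange(GL_UNIFORM_BUFFER, 0, current, objectOffset, objectSize);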

In your case, you are uploading 192 bytes per draw call. For such small updates, there’s a perfectly suited circular buffer that’s guaranteed full speed and is persistently mapped to the CPU: the non-UBO uniforms (the glUniform*fv family). The trouble with them is that, for maximum throughput, you should have only one uniform, of the likes of “uniform mat4 vvv[10];” or “uniform vec4 fff[10];”, and a bunch of unwieldy “#define uni_myAmbientColor fff[3].xyz” macros littering your shader.
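
To illustrate with my own placeholder names: the shader packs everything into one array (uniform vec4 fff[10];) with #define aliases into it, and the C side then needs just one call per draw to update the whole block:

// One glUniform4fv call uploads all 10 vec4s (the whole per-object block).
GLint loc = glGetUniformLocation(program, "fff");
glUniform4fv(loc, 10, (const GLfloat*)perObjectData); // perObjectData = 10 packed vec4s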

When I coded this about a year ago I actually found no performance gain from manual double-buffering; my feeling is that the driver detects the usage pattern and does its own internal double-buffering for you, but this is something that would need to be confirmed by someone in the know.

My general findings were that (as you observed) AMD spikes like crazy with small per-object updates, but, as I said, a single large update for all objects at the start of the frame, coupled with glBindBufferRange calls per object, worked fine and smoothly. NVIDIA gave more even performance in both cases, but I ended up going with the pattern that suited both NVIDIA and AMD best. I can’t recall how Intel behaved, or if I even bothered testing.

I said above that it sucks, but in truth it needn’t be wildly divergent code. Assuming that you make two passes over your scene objects, the first to determine what should be drawn (and sort them into buckets) and the second to actually draw them, it’s straightforward enough to do. Each object can store the start of its buffer range as a member variable; during the first pass you update its properties into a system-memory buffer, then just before the second pass you make a single glBufferSubData call. The only parts that made me feel dirty were the extra memory copy and the extra memory used because of UBO alignment requirements, but I considered both fair tradeoffs for levelling the performance. Of course, in an ideal world I would have preferred to do neither, and maybe I’ll go back and re-evaluate with the GL 4.4 buffer storage functionality, which may allow an update pattern closer to D3D’s.
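
A sketch of that two-pass arrangement, with PerObjectData, staging, visible[] and sceneUBO as placeholder names; the padded stride is where the alignment waste mentioned above comes from:

// Query once at startup; each object's slice must start on this alignment.
GLint uboAlign = 0;
glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &uboAlign);
GLsizeiptr stride = ((sizeof(PerObjectData) + uboAlign - 1) / uboAlign) * uboAlign;

// Pass 1: cull/sort, hand each visible object its offset and copy its
// properties into a system-memory staging buffer.
GLsizeiptr used = 0;
for (int i = 0; i < numVisible; ++i)
{
    visible[i]->uboOffset = used;
    memcpy(staging + used, &visible[i]->uniforms, sizeof(PerObjectData));
    used += stride;
}

// One upload just before the draw pass.
glBindBuffer(GL_UNIFORM_BUFFER, sceneUBO);
glBufferSubData(GL_UNIFORM_BUFFER, 0, used, staging);

// Pass 2: per object, bind its slice and draw.
for (int i = 0; i < numVisible; ++i)
{
    glBindBufferRange(GL_UNIFORM_BUFFER, 0, sceneUBO, visible[i]->uboOffset, sizeof(PerObjectData));
    // ... draw visible[i] ...
}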

Standalone uniforms and a big glUniform4fv call were an option I considered but discounted, since in GL standalone uniforms belong to the program object and I needed to switch shaders a little too often to feel comfortable with that pattern. If the same circumstances don’t apply to you, it’s certainly worth trying.

Thanks guys.
I have managed to modify the renderer so that there’s one shared uniform buffer for all objects, orphaning a big buffer once per frame and submitting the per-object data in unsynchronized mode.
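
Roughly, per frame it now looks like this (sharedUBO, sharedSize, frameOffset and the object data names are just placeholders, with offsets rounded up to GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT):

// Start of frame: orphan the shared buffer once and reset the write cursor.
glBindBuffer(GL_UNIFORM_BUFFER, sharedUBO);
glBufferData(GL_UNIFORM_BUFFER, sharedSize, NULL, GL_DYNAMIC_DRAW);
frameOffset = 0;

// Per object: unsynchronized write into the next free slice, then bind that range.
glBindBuffer(GL_UNIFORM_BUFFER, sharedUBO);
void* dst = glMapBufferRange(GL_UNIFORM_BUFFER, frameOffset, objSize,
                             GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_RANGE_BIT);
memcpy(dst, objData, objSize);
glUnmapBuffer(GL_UNIFORM_BUFFER);
glBindBufferRange(GL_UNIFORM_BUFFER, 0, sharedUBO, frameOffset, objSize);
frameOffset += alignedObjSize;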

Now the performance of my GL renderer is roughly the same as D3D. Still, the D3D frame times are more stable than GL’s, which shows more spikes, especially in windowed mode.

This is a bit off-topic for a GL forum, but I’ve found that D3D added equivalent functionality (to glBindBufferRange/glMapBufferRange) in the D3D11.1 API, though I haven’t managed to use it on my Win7 box with the ATI 5750: the device reports all of the D3D11.1 features (including shared constant buffers) as FALSE. I’m still looking for answers, because this should be just an API/driver issue, given that GL has had this since v3.0.