Uniform Buffer Object Performance

Guoshima · July 13, 2012, 6:23am

Hi,

I am currently porting a part of the engine to use Uniform Buffer objects, but I have some performance questions.

For the current version I just placed all my uniforms into a single big uniform buffer per object and simply update the whole buffer each frame. This is just the first step to get things running.

I was first using glMapBuffer and glUnmapBuffer to copy the new uniform buffer data to the OpenGL buffer object. I tried both GL_DYNAMIC_DRAW and GL_STREAM_DRAW, but ran into rather severe CPU costs while mapping the buffer. Near the bottom of the thread "Uniform Buffer Objects (slow) - help needed" (OpenGL: Advanced Coding, Khronos Forums) I read the suggestion to call glBufferData with the actual buffer data instead of NULL, and with that change it runs a lot faster. I still have to properly compare my CPU timings between the old version using plain uniforms and the new version using uniform buffer objects, but I expect uniform buffer objects to be faster once I have properly split my uniform buffers into good logical sets.

But when I now compare the result of rendering about 1024 spheres into my G-Buffer, the overall GPU time has gone up a bit, from around 13-14 msec to 17-18 msec. I am using GPU queries to measure the timings. Is it normal that rendering with uniform buffer objects is slower than using uniforms directly? I read somewhere that uniform buffers are stored in device global memory and then copied into device local memory when they are bound, so perhaps this could explain the slowdown on the GPU side itself. Otherwise I don’t see any real reason why rendering with uniform buffers should be slower on the GPU side.

I tried not updating the uniform buffers anymore and then my overall time with uniform buffers goes down to 15-16 msec, which is still not as fast as not using uniform buffers. So I guess splitting the uniform buffers into more logical units and less frequent updates also won’t help then on the GPU side.

So the question is, are these results normal or am I doing something wrong somewhere?

Kind Regards,
Kenzo

Guoshima · July 16, 2012, 2:41am

Another small update.

I ‘hacked’/changed my current version so that all my drawcalls have a unique uniform buffer object to make sure that it’s not related to updating the same buffer twice in a single frame. This didn’t really help.
Since all buffers are unique now, I can fully disable the updating of the buffers for a static frame, but then I still have the same results. The general timings of drawing around 1024 spheres with normal sized shaders and uniform buffer size is still around 15-16 msec.

My average constant buffer size is around 400 bytes, which is perhaps rather large.

The only explanation I can still see is the location of the uniform buffer data itself: perhaps it gets copied from system memory or something similar on every draw call. I tried all possible usage hints without any luck. Almost all my CPU time is spent in the DrawElements functions of OpenGL.

Any tips or feedback would be nice; otherwise I will have to fall back to using regular uniforms for the OpenGL side of the engine.

Cheers,
Kenzo

mhagain · July 16, 2012, 3:55am

I’ve had similar experiences with UBO updating in the past, but unfortunately never resolved them (just went back to old-fashioned uniforms) - but then again I didn’t try too hard to do so, and standalone uniforms were working fine so there was no reason to.

You might have some luck with glMapBufferRange using GL_MAP_INVALIDATE_BUFFER_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_WRITE_BIT - I suspect that the root cause of the performance issues you’ve observed is GPU/CPU synchronization, so this should at least help clue the driver in on what you’re doing.
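As a rough sketch of what such an upload could look like: the buffer bound to GL_UNIFORM_BUFFER is mapped with the three flags named above, filled with memcpy, and unmapped. To keep the sketch compilable without a GL context, the two entry points are passed in through a hypothetical UboApi table (in a real program these would simply be glMapBufferRange and glUnmapBuffer), and fake_map / fake_unmap are stand-ins backed by a plain array. All of these names are assumptions made for illustration; only the enum values and the call signatures come from OpenGL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Enum values as defined in the OpenGL headers, repeated here so the
   sketch compiles without them. */
#define GL_UNIFORM_BUFFER            0x8A11
#define GL_MAP_WRITE_BIT             0x0002
#define GL_MAP_INVALIDATE_BUFFER_BIT 0x0008
#define GL_MAP_UNSYNCHRONIZED_BIT    0x0020

/* Hypothetical table of entry points with the same shapes as
   glMapBufferRange and glUnmapBuffer. */
typedef struct {
    void *(*map_buffer_range)(unsigned target, ptrdiff_t offset,
                              ptrdiff_t length, unsigned access);
    unsigned char (*unmap_buffer)(unsigned target);
} UboApi;

/* Copies `size` bytes to the start of the buffer currently bound to
   GL_UNIFORM_BUFFER. Returns 1 on success, 0 if map or unmap failed. */
static int ubo_upload_mapped(const UboApi *gl, const void *data, ptrdiff_t size)
{
    const unsigned access = GL_MAP_WRITE_BIT
                          | GL_MAP_INVALIDATE_BUFFER_BIT
                          | GL_MAP_UNSYNCHRONIZED_BIT;
    void *dst;

    assert(gl != NULL && data != NULL && size > 0);
    dst = gl->map_buffer_range(GL_UNIFORM_BUFFER, 0, size, access);
    if (dst == NULL)
        return 0;                      /* mapping failed, nothing written */
    memcpy(dst, data, (size_t)size);
    return gl->unmap_buffer(GL_UNIFORM_BUFFER) != 0;
}

/* Stand-ins for the driver so the sketch can run without a context. */
static unsigned char fake_store[256];
static unsigned      fake_last_access;

static void *fake_map(unsigned target, ptrdiff_t offset,
                      ptrdiff_t length, unsigned access)
{
    (void)target;
    if (offset < 0 || length <= 0 ||
        offset + length > (ptrdiff_t)sizeof(fake_store))
        return NULL;                   /* range does not fit the store */
    fake_last_access = access;
    return fake_store + offset;
}

static unsigned char fake_unmap(unsigned target)
{
    (void)target;
    return 1;                          /* GL_TRUE */
}
```

With a real context, the table would be filled with the actual GL function pointers, and the buffer would have to be bound and sized with glBufferData beforehand. Note that GL_MAP_UNSYNCHRONIZED_BIT leaves it to the application to make sure the GPU is no longer reading the range being written.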

Guoshima · July 16, 2012, 5:12am

I just tried glMapBufferRange, but I still get the same performance.

Here is my system information:
nVidia GTX 580
Driver: 301.42 (updated end of last week and has issues also with 28x.xx)

An overview of what I have tried so far (these results are all from my simple test scene; I will try again on a real level just to be sure):

Map Buffer with GL_WRITE_ONLY and memcpy data into the buffer : huge influence when not all buffers are unique (reusing the same buffer for multiple draw calls in the same frame causes a major slowdown)
Map Buffer with GL_WRITE_ONLY but first orphaning the data by calling BufferData with NULL data : same as above
Map Buffer Range with the flags GL_MAP_INVALIDATE_BUFFER_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_WRITE_BIT : a bit faster than Map Buffer and not influenced by whether buffers are unique or not
BufferData and BufferSubData with all logical usage flags (GL_DYNAMIC_DRAW, GL_STATIC_DRAW and GL_STREAM_DRAW) : fastest method and no influence from whether buffers are unique
Make sure all uniform buffers are unique per object : huge difference in some cases
Try not updating the constant buffer data at all : makes all the above tests faster, but never reaches the performance of regular uniforms
Tried double buffering all uniform buffers : made everything a little bit slower
By default I am using std140 packing, but when I try packed data my average block size goes from 400 bytes to 277 bytes, and it then seems to run a little bit faster, so this could mean that it has something to do with memory bandwidth and uploading of the data
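To illustrate why an std140 block tends to be larger than a tightly packed one, here is a minimal sketch that applies the std140 base alignments for a handful of types and compares the result with the plain sum of the member sizes. The Std140Type enum, the function names and the example member list are all hypothetical; the actual blocks from the engine are not shown in this thread. The GLSL "packed" layout is implementation-defined, so the tight sum should only be read as a lower bound, not as what a driver would produce.

```c
#include <assert.h>
#include <stddef.h>

/* A small subset of GLSL types (names are made up for this sketch). */
typedef enum { T_FLOAT, T_VEC2, T_VEC3, T_VEC4, T_MAT4 } Std140Type;

/* Base alignment in bytes under std140. */
static size_t std140_align(Std140Type t)
{
    switch (t) {
    case T_FLOAT: return 4;
    case T_VEC2:  return 8;
    case T_VEC3:  return 16;  /* vec3 aligns like vec4 */
    case T_VEC4:  return 16;
    case T_MAT4:  return 16;  /* stored as four vec4 columns */
    }
    return 0;
}

/* Bytes actually occupied by the member. */
static size_t type_size(Std140Type t)
{
    switch (t) {
    case T_FLOAT: return 4;
    case T_VEC2:  return 8;
    case T_VEC3:  return 12;
    case T_VEC4:  return 16;
    case T_MAT4:  return 64;
    }
    return 0;
}

static size_t round_up(size_t value, size_t align)
{
    assert(align > 0);
    return (value + align - 1) / align * align;
}

/* Writes each member's std140 offset into offsets[] and returns the
   offset just past the last member. */
static size_t std140_layout(const Std140Type *members, size_t count,
                            size_t *offsets)
{
    size_t cursor = 0;
    size_t i;
    for (i = 0; i < count; ++i) {
        cursor = round_up(cursor, std140_align(members[i]));
        offsets[i] = cursor;
        cursor += type_size(members[i]);
    }
    return cursor;
}

/* Sum of member sizes with no padding at all. */
static size_t tight_size(const Std140Type *members, size_t count)
{
    size_t total = 0;
    size_t i;
    for (i = 0; i < count; ++i)
        total += type_size(members[i]);
    return total;
}
```

For a made-up block declared as float, vec3, float, vec2, mat4, std140_layout places the members at offsets 0, 16, 28, 32 and 48 and ends at 112 bytes, while tight_size gives 92 bytes. Reordering members so that small scalars fill the gaps is one way to shrink an std140 block without switching layouts.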

Are there any OpenGL profilers which could give me more information on things like bandwidth and cache misses on the GPU? I tried gDEBugger, but it didn’t really help me a lot and it freezes 95% of the time when I try to open my test level.

mhagain · July 18, 2012, 8:18am

I haven’t had a chance to look any more at this yet but I’m planning to put together a test app sometime in order to investigate further myself, as this is a topic that interests me too.

One thought that does occur to me is that under D3D10/11 (I know, but bear with me) there is a requirement for the equivalent (constant buffers) to be 16-byte aligned and to have a size that is a multiple of 16 bytes, and in general each constant in the buffer shouldn’t straddle two groups of 4 floats. Now, irrespective of whether OpenGL specifies any of this for UBOs, it is possible that doing the same may push drivers onto a faster path. It definitely seems at least worth trying.
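A quick way to check a CPU-side mirror of a block against these rules is sketched below. The PerObjectBlock struct and the helper name are hypothetical and only serve as an example; the check itself is just the arithmetic implied by "16-byte groups" and assumes that members larger than 16 bytes (such as matrices) should start on a group boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if a member at `offset` with `size` bytes straddles two
   16-byte groups (four floats each), 0 otherwise. Members larger than
   one group are expected to start exactly on a group boundary. */
static int crosses_16_byte_group(size_t offset, size_t size)
{
    if (size == 0)
        return 0;
    if (size > 16)
        return offset % 16 != 0;
    return offset / 16 != (offset + size - 1) / 16;
}

/* Hypothetical CPU-side mirror of a per-object block. Every member
   starts on a 16-byte boundary and the total size is padded to a
   multiple of 16 bytes. */
typedef struct {
    float world[16];   /* 4x4 matrix, 64 bytes, offset 0 */
    float color[4];    /* offset 64 */
    float roughness;   /* offset 80 */
    float pad[3];      /* explicit padding up to 96 bytes */
} PerObjectBlock;
```

For example, crosses_16_byte_group(8, 12) reports a straddle (a three-float vector starting at byte 8 spills into the next group), while the same vector at offset 16 does not. Checking offsetof and sizeof of the real structs this way makes it easy to confirm that a block follows the D3D-style packing before measuring whether it changes anything in the driver.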

Guoshima · July 18, 2012, 8:47am

Thank you for the info.

I have currently reverted back to the old method of setting the uniforms because of time constraints.

All my constant buffers are already 16-byte aligned, and every member takes up at least 4 floats to be compatible with other SDKs, as you mentioned.

mhagain · July 18, 2012, 3:13pm

Just did some very quick, informal and simple testing; this time I copy/pasted the code from the arcsynthesis site to reduce the chance that the problem is me screwing up.

A quick cross-check with using glMapBufferRange instead of glBufferSubData and I can only reconfirm what both of us have already observed - UBO performance sucks.

It’s worth noting that on AMD at least I had to call glBindBufferRange at runtime rather than at load time, and after my glUseProgram call, otherwise the UBO wouldn’t be active. This is in line with the old issue where you had to call glUseProgram before calling glGetUniformLocation and was expected behaviour, if a little annoying.

Now, maybe this isn’t the most optimal use case; maybe it’s the case that direct replacement of standalone uniforms with UBOs is not the way to go and UBOs are more tuned for loading much larger blocks of data, but it seems reasonable to expect UBOs to at worst give comparable performance to standalone uniforms (thereby being a convenience feature rather than a performance one, which I’d personally be OK with), and it is a bummer as I have cases where having a shared uniform block would be very handy indeed. I’m almost tempted to long for the days of glProgramEnvParameter4fvARB - not quite, but almost. As it is I don’t feel too motivated to continue testing various permutations - I’ll just write my own C wrapper around standalone uniforms instead.

Again with the D3D10/11 comparison, but cbuffers do not behave like this.

Alfonse_Reinheart · July 18, 2012, 3:41pm

mhagain wrote: "It’s worth noting that on AMD at least I had to call glBindBufferRange at runtime rather than at load time, and after my glUseProgram call, otherwise the UBO wouldn’t be active."

I have never had this happen. Indeed, Tutorial 17’s Double Projection code does the glBindBufferRange for the projection matrix before loading any shaders. It works just fine on my AMD HD 3300, using the 12.1 drivers. I bind it once and never again.

Indeed, the very first tutorial in which I use UBOs binds the buffer object after loading the shaders, but before using any of them.

Are you sure your drivers are up-to-date?

mhagain · July 18, 2012, 4:05pm

12.6, yes. Without a runtime glBindBufferRange nothing gets drawn, with one stuff does get drawn. I can cross-check on another AMD tomorrow just to be certain.

aqnuep · July 18, 2012, 4:51pm

I don’t have this problem either with my samples. Uniform buffers are like any other buffer. You can bind them upfront, after creation, you can update them anytime and you can bind/unbind any program meanwhile, the uniform buffer will be used appropriately.
I suspect there might be some issue with your implementation.