GL_EXT_separate_shader_objects performance?

Hi,

I have an OpenGL ES 2 engine running on Windows and iOS, and I also have it running on DirectX 11.

As soon as I learned about GL_EXT_separate_shader_objects I imagined I would get huge performance benefits, because with it OpenGL would look and feel more like DirectX 9+, and my cross-API code could remain similar while maintaining high performance.

To my surprise, I implemented GL_EXT_separate_shader_objects only to find that my performance was halved and my GPU usage dropped from ~95% to 45%. So basically, a monolithic standard program is twice as fast as separate programs. This is on an AMD HD7850 under Windows 8 with OpenGL 4.2 Core.

I originally imagined that this extension was created to boost performance by separating constant buffers and shader stages, but it seems it might have been created for people who want to port DirectX shaders more directly, with disregard for any performance hit.

So my question is: if you have implemented this feature in a reasonable scene, what performance difference do you see compared to monolithic shader programs?

What part of “performance” are we talking about? How did you “implement” them? How are you using program pipeline objects?

Some benchmarking code would be good.

While GLSL’s monolithic approach has some advantages for optimizing shaders

The spec does note that there may be a performance cost. If you are dynamically stitching the shader program together inside the render loop, I would expect a hit there as well.

glUseProgram call is (on average) 4 times faster than glBindProgramPipeline

I would expect this, since the driver cannot use the program until the links between shader parts are made. I am surprised about the render times for glDrawElements, though.

Those “links between shader parts” can be made off-line. That’s the whole point of the program pipeline object: it encapsulates a sequence of programs, so you can do all that verification work up-front rather than at bind-time.
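As a rough illustration of doing that work up front, each vertex/fragment combination can be built once and looked up afterwards. This is only a sketch in C: `PipelineCache`, `pipeline_cache_get` and the `create` callback are hypothetical names, and in a real renderer the callback would wrap glGenProgramPipelines and glUseProgramStages. Here a stand-in creator is used so the lookup logic can be shown without a GL context.

```c
#include <assert.h>
#include <stddef.h>

#define PIPELINE_CACHE_CAPACITY 256 /* assumed upper bound for this sketch */

/* Hypothetical callback: builds a pipeline for (vs, fs) and returns its name.
   With GL it would wrap glGenProgramPipelines + glUseProgramStages. */
typedef unsigned int (*PipelineCreateFn)(unsigned int vs, unsigned int fs, void *user);

typedef struct {
    unsigned int vs, fs, pipeline;
} PipelineEntry;

typedef struct {
    PipelineEntry entries[PIPELINE_CACHE_CAPACITY];
    size_t count;
} PipelineCache;

void pipeline_cache_init(PipelineCache *cache)
{
    cache->count = 0;
}

/* Returns the pipeline for (vs, fs); the callback runs only on first use. */
unsigned int pipeline_cache_get(PipelineCache *cache, unsigned int vs,
                                unsigned int fs, PipelineCreateFn create,
                                void *user)
{
    for (size_t i = 0; i < cache->count; ++i) {
        if (cache->entries[i].vs == vs && cache->entries[i].fs == fs)
            return cache->entries[i].pipeline;
    }
    assert(cache->count < PIPELINE_CACHE_CAPACITY);
    PipelineEntry *e = &cache->entries[cache->count++];
    e->vs = vs;
    e->fs = fs;
    e->pipeline = create(vs, fs, user);
    return e->pipeline;
}

/* Stand-in creator for demonstration: hands out 1, 2, 3, ... and counts calls. */
unsigned int counting_create(unsigned int vs, unsigned int fs, void *user)
{
    unsigned int *calls = (unsigned int *)user;
    (void)vs;
    (void)fs;
    return ++(*calls);
}
```

With something like this, the render loop only calls glBindProgramPipeline on the returned name; whether the driver then avoids re-validating at bind time is up to the implementation.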

Sounds more like someone’s slacking off at their job of implementing this correctly.

Hm, so you’re saying that behind the scenes, the driver is taking my calls to glUseProgramStages and creating another monolithic program on the fly? That would be a huge overhead…

He’s talking about the establishment of a valid interface between the various separate programs. With separate programs, this must be done post-linking. Theoretically, an implementation could do it every time the pipeline is bound.

But as I point out, this would be stupid on their part. Then again, it is AMD…

I agree, but that would only explain the hit on binding, not the hit on the draw call.

I’m adding some new GL3 features, so I thought I’d post here instead of starting a new thread.

Next on my list was uniform blocks. I had previously worked with DX11 constant buffers, so I expected this to be an improvement.

Out of roughly 10 constants, I update 2 (the view and projection matrices) in every shader every frame and then draw.

I put all the constants into a uniform block, so instead of calling glUniformMatrix4fv twice per program, I now do a glBufferSubData once for each program’s (single) uniform block (I know I could share them, I’m just not there yet). What seems odd is that I’m making fewer GL calls, yet to my surprise performance is again slashed nearly in half (more like 60% of the glUniform performance). Also, what really struck me is that samplers can’t be put into uniform blocks, so you end up setting samplers with glUniform*, and you basically have to use part of the GL2 pipeline and part of GL3. Is this normal? Why would glBufferSubData be twice as slow as 2 glUniform* calls?
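For reference, the CPU-side mirror of such a block can be sketched as a plain C struct. This assumes the block is declared with `layout(std140)` and holds just the view and projection matrices; the names `PerFrameBlock`, `view` and `projection` are made up for illustration. Under std140 a mat4 is stored as four vec4 columns, so each matrix takes 64 bytes and no padding is needed between the two.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed GLSL side (hypothetical names):
     layout(std140) uniform PerFrame { mat4 view; mat4 projection; };  */
typedef struct {
    float view[16];       /* offset 0,  64 bytes */
    float projection[16]; /* offset 64, 64 bytes */
} PerFrameBlock;

/* Fills the block from two column-major 4x4 matrices. */
void per_frame_block_set(PerFrameBlock *block, const float view[16],
                         const float projection[16])
{
    assert(block != NULL && view != NULL && projection != NULL);
    memcpy(block->view, view, sizeof block->view);
    memcpy(block->projection, projection, sizeof block->projection);
}
```

The whole struct can then be uploaded in a single call, with `sizeof(PerFrameBlock)` as the size.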

See http://hacksoflife.blogspot.co.uk/2010/02/one-more-on-vbos-glbuffersubdata.html for why glBufferSubData might not perform as well as you expect when streaming data.

You will probably get better performance by using the techniques mentioned in http://www.opengl.org/wiki/Buffer_Object_Streaming
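A common streaming pattern is to treat one larger buffer as a ring and only restart (and orphan or synchronize) when it fills up. The sketch below shows just the offset bookkeeping in C; `StreamRing` and `stream_ring_alloc` are hypothetical names, and the actual GL calls are left to the caller.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t capacity; /* total size of the buffer object in bytes */
    size_t head;     /* next free byte */
} StreamRing;

typedef struct {
    size_t offset;   /* where this update should be written */
    int wrapped;     /* non-zero: ring restarted at 0, orphan or sync first */
} StreamAlloc;

void stream_ring_init(StreamRing *ring, size_t capacity)
{
    assert(capacity > 0);
    ring->capacity = capacity;
    ring->head = 0;
}

/* Reserves size bytes at an aligned offset; alignment must be a power of two. */
StreamAlloc stream_ring_alloc(StreamRing *ring, size_t size, size_t alignment)
{
    assert(size > 0 && size <= ring->capacity);
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    StreamAlloc result;
    size_t start = (ring->head + alignment - 1) & ~(alignment - 1);
    result.wrapped = 0;
    if (start + size > ring->capacity) {
        /* Not enough room left: start over at the beginning of the buffer. */
        start = 0;
        result.wrapped = 1;
    }
    result.offset = start;
    ring->head = start + size;
    return result;
}
```

When `wrapped` is set, the caller would typically orphan the buffer (glBufferData with a NULL pointer) or map it with GL_MAP_INVALIDATE_BUFFER_BIT; otherwise the returned range can be mapped with GL_MAP_UNSYNCHRONIZED_BIT, since nothing already submitted reads from it.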

Also, what really struck me is that samplers can’t be put into uniform blocks, so you end up setting samplers with glUniform*, and you basically have to use part of the GL2 pipeline and part of GL3. Is this normal?

It’s because they’re not really uniforms. At least, not in the same way that the non-opaque types are. They’re not pieces of memory that store the value of the texture unit.

Thanks for the tip. Apparently, changing

glBufferSubData( GL_UNIFORM_BUFFER, Offset, Size, Data );

to

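// Map the range for writing; the invalidate and unsynchronized flags tell the driver not to preserve old contents or wait on pending GPU work.
void *Pointer = glMapBufferRange( GL_UNIFORM_BUFFER, Offset, Size, GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT | GL_MAP_UNSYNCHRONIZED_BIT );
if ( Pointer != NULL )
    memcpy( Pointer, Data, Size );
// glUnmapBuffer returns GL_FALSE if the buffer contents became corrupt while mapped.
GLboolean Success = glUnmapBuffer( GL_UNIFORM_BUFFER );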

has increased my performance a lot. But this is probably mainly because, under the hood, there’s some double buffering going on now. The bottom line is still that using uniform blocks is ~10% slower than plain old glUniform calls. I’m now thinking they might only be useful for large arrays or large sets of data.

EDIT: I now just do a single buffer update once per frame and, to my surprise, the performance is still lower than using glUniform, but the difference is now only 3-5% or so.

the performance is still lower than using glUniform

How many different shaders do you use per frame?

Right now, 30 vertex or pixel shaders that I mix and match. I haven’t counted them, but I usually have 1 program per material (1 vertex and 1 pixel shader from that set), I know I have ~100 materials (sometimes the same program gets reused), and there are also ~100 draw calls. So it was one buffer update versus 100 * 2 (constants) glUniform calls.

That’s most interesting. I haven’t benchmarked the difference between uniforms and a uniform buffer. Although, if it is only 3-5% with 2 uniforms, there may be a crossover point as the number of unique items in the buffer increases. I have about 16, so I would need 16 uniform calls.

He also mentioned that, “I now do a glBufferSubData once for each program’s (single) uniform block”, which is probably not the most efficient way to work with uniform blocks. If you’ve got a large series of objects that use the same uniform blocks, it’s probably more efficient to create an array of uniform blocks within a buffer and change them all in one call (either mapping and writing all the blocks or doing just one glBufferSubData call).
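To make the "array of uniform blocks within a buffer" idea concrete: the offset passed to glBindBufferRange has to be a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, so the blocks need to be spaced at an aligned stride. A minimal sketch of that packing in C follows; `ubo_block_stride` and `ubo_pack_blocks` are hypothetical helper names, and the alignment value is assumed to have been queried from the driver beforehand.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Rounds block_size up to the next multiple of alignment
   (the value queried via GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT). */
size_t ubo_block_stride(size_t block_size, size_t alignment)
{
    assert(block_size > 0 && alignment > 0);
    return (block_size + alignment - 1) / alignment * alignment;
}

/* Copies count tightly packed blocks from src into staging, one per stride.
   Returns the number of staging bytes to upload. */
size_t ubo_pack_blocks(unsigned char *staging, size_t staging_size,
                       const void *src, size_t block_size, size_t count,
                       size_t alignment)
{
    size_t stride = ubo_block_stride(block_size, alignment);
    assert(count * stride <= staging_size);
    for (size_t i = 0; i < count; ++i) {
        memcpy(staging + i * stride,
               (const unsigned char *)src + i * block_size,
               block_size);
    }
    return count * stride;
}
```

The staging memory would then go to the GPU in one upload (one glBufferSubData, or one mapped write), and each draw would bind its own slice with glBindBufferRange at offset `i * stride`.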

I later added that “I now just do a single buffer update once per frame”, and that’s with glMapBufferRange, which is about twice as fast as glBufferSubData for me.

I suppose there is a crossover point, but I have no idea really. I’m anxious to test this on GLES3 (mobile) GPUs, which I assume will have drivers written specifically for it, whereas on desktop the vendors originally had the DX11 driver and (probably?) patched in GL3+ features once they were approved.

I also came across this post recently (“OpenGL ES iOS drawing performance a lot slower with VBOs than without”, on Stack Overflow), which is quite interesting and might be worth investigating.

I just finished implementing sampler objects (the ones you use with glBindSampler) and, to my surprise, these are also slower than setting the sampler state (just once) with glTexParameter. I assume the speed difference might come from switching states, since I have textures with and without mipmaps. But I also thought that if the hardware is designed for DX11 (I use an HD7850), it might already keep a sampler object per individual texture, so in hardware it would switch between a lot of objects, and sampler objects would reduce that. Then again, I have no idea how the hardware does it, so I’m left to speculate.
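If state switching really is the cause, one cheap thing to try is skipping binds that would not change anything. The following is only a sketch in C with hypothetical names (`SamplerTracker`, `sampler_tracker_bind`), and it assumes every sampler bind in the engine goes through it; whether it recovers any of the lost performance is untested.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TEXTURE_UNITS 32 /* assumed upper bound for this sketch */

typedef struct {
    unsigned int bound[MAX_TEXTURE_UNITS]; /* sampler per unit, 0 = none */
    unsigned int binds_issued;
    unsigned int binds_skipped;
} SamplerTracker;

void sampler_tracker_init(SamplerTracker *t)
{
    for (size_t i = 0; i < MAX_TEXTURE_UNITS; ++i)
        t->bound[i] = 0;
    t->binds_issued = 0;
    t->binds_skipped = 0;
}

/* Returns 1 if the caller should call glBindSampler(unit, sampler),
   or 0 if that sampler is already bound to the unit. */
int sampler_tracker_bind(SamplerTracker *t, unsigned int unit,
                         unsigned int sampler)
{
    assert(unit < MAX_TEXTURE_UNITS);
    if (t->bound[unit] == sampler) {
        t->binds_skipped++;
        return 0;
    }
    t->bound[unit] = sampler;
    t->binds_issued++;
    return 1;
}
```

Comparing `binds_issued` with `binds_skipped` over a frame would at least show how often the sampler state actually changes between draws.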