GL_EXT_separate_shader_objects Performance?



cippyboy
08-07-2013, 03:31 PM
Hi,

I have an OpenGL ES 2 engine running on Windows and iOS, and I also have it running on DirectX 11.

As soon as I learned about GL_EXT_separate_shader_objects I imagined huge performance benefits, because with it OpenGL would look and feel more like DirectX 9+, and my cross-API code could stay similar while maintaining high performance.

To my surprise, I implemented GL_EXT_separate_shader_objects only to find that my performance was halved and my GPU usage dropped from ~95% to 45%. So basically, a standard monolithic program is twice as fast as separate programs. This is on an AMD HD 7850 under Windows 8 with an OpenGL 4.2 Core context.

I originally imagined that this extension was created to boost performance by separating constant buffers and shader stages, but it seems it might have been created for people wanting to port DirectX shaders more directly, regardless of any performance hit.

So my question is: if you have implemented this feature in a reasonable scene, what performance difference did you see compared to monolithic shader programs?

Alfonse Reinheart
08-07-2013, 03:44 PM
What part of "performance" are we talking about? How did you "implement" them? How are you using program pipeline objects?

Some benchmarking code would be good.

tonyo_au
08-07-2013, 07:42 PM
While GLSL's monolithic approach has some advantages for optimizing shaders

The spec does note that there may be a performance cost. If you are dynamically stitching the shader program together inside the render loop, I would expect a hit there as well.

cippyboy
08-08-2013, 04:14 AM
What part of "performance" are we talking about?

The part where I have ~100 vertex/pixel shader pairs in my scene. Previously I had one monolithic program composed of the sub-shaders (which I kept reusing for each monolithic program); now I compile each shader into its own separable program and create a program pipeline object for each combination, while continuing to reuse the vertex and fragment programs. I do not swap the shader bindings inside a pipeline object at run-time (as I heard that may carry an even bigger performance hit). I also have a couple of objects (UI mainly) that don't use pipeline objects, so I mix monolithic draw calls with pipeline-object draw calls; could that be a performance hit? Performance for me generally means framerate: same scene, same shader code, and this technique is a lot slower.
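For reference, the setup is roughly like this (a simplified sketch; variable names are placeholders and error checking is omitted):

// Built once per unique shader and reused across materials
GLuint VertexProgram = glCreateShaderProgramv( GL_VERTEX_SHADER, 1, &VertexSource );
GLuint FragmentProgram = glCreateShaderProgramv( GL_FRAGMENT_SHADER, 1, &FragmentSource );

// Built once per vertex/fragment combination
GLuint Pipeline;
glGenProgramPipelines( 1, &Pipeline );
glUseProgramStages( Pipeline, GL_VERTEX_SHADER_BIT, VertexProgram );
glUseProgramStages( Pipeline, GL_FRAGMENT_SHADER_BIT, FragmentProgram );

// At draw time, instead of glUseProgram( MonolithicProgram )
glBindProgramPipeline( Pipeline );
glDrawElements( GL_TRIANGLES, IndexCount, GL_UNSIGNED_SHORT, 0 );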

Yeah, I know the spec says it removes some optimizations, but it could also enable some optimizations (in my opinion), like not fetching pixel-shader constants in the vertex shader and vice versa. If this extension was invented for convenience or to ease DirectX porting, then OK, I get it and I'll move on.

EDIT: Some profiling with GPUPerfStudio 2 reveals that a glUseProgram call is (on average) 4 times faster than glBindProgramPipeline:
glUseProgram: 0.6 to 2 microseconds
glBindProgramPipeline: 2.42 to 12 microseconds

Furthermore, I looked at big glDrawElements calls (50-90K triangles) and they take around 20-30% more time to finish. There's definitely something weird going on in the driver... I now wonder if NVIDIA GPUs behave the same.

EDIT2: Just in case you're wondering, I talked about the EXT extension (when it's in fact the ARB version on desktop GL) because I was also planning to use it on iOS (if it had performance benefits).
On iOS it crashes if I create more than 4 pipeline objects, on both the simulator (OS X 10.8.2) and iOS 6.x. I just tried OS X 10.9 DP1 with the iOS 7.0 simulator and it works flawlessly. I'd have to reflash my device to iOS 7 to actually test performance (I run OS X in a VM and both run the same, ~2 FPS, due to software rendering).

tonyo_au
08-08-2013, 11:43 PM
glUseProgram call is (on average) 4 times faster than glBindProgramPipeline
I would expect this, since the driver cannot use the program until the links between the shader parts are made. I am surprised about the render times for the glDrawElements calls, though.

Alfonse Reinheart
08-09-2013, 12:26 AM
Those "links between shader parts" can be made off-line. That's the whole point of the program pipeline object: it encapsulates a sequence of programs, so you can do all that verification work up-front rather than at bind-time.

Sounds more like someone's slacking off at their job of implementing this correctly.

cippyboy
08-09-2013, 09:57 AM
Hm, so you're saying that behind the scenes, he's taking my glUseProgramStages binds and creating another "on the fly" monolithic program? That would be a huge overhead...

Alfonse Reinheart
08-09-2013, 11:07 AM
He's talking about the establishment of a valid interface between the various separate programs (https://www.opengl.org/wiki/Shader_Compilation#Interface_matching). With separate programs, this must be done post-linking. Theoretically, an implementation could do it every time the pipeline is bound.

But as I point out, this would be stupid on their part. Then again, it is AMD...

tonyo_au
08-09-2013, 08:21 PM
But as I point out, this would be stupid on their part. Then again, it is AMD..
I agree, but that would explain the hit on binding, not the hit on the draw call.

cippyboy
08-11-2013, 03:51 AM
I'm adding some new GL3 features, so I'm posting here instead of starting a new thread.

Next on my list was uniform blocks. I had previously worked with DX11 constant buffers, so I expected this to be an improvement.

I have around 2 out of 10 constants (the view and projection matrices) that I update in every shader every frame before drawing.

To my surprise, when I put all constants into a uniform block, so that instead of calling glUniformMatrix4fv twice per program I now do a glBufferSubData once for each program's (single) uniform block (I know I could share them, I'm just not there yet), performance is again slashed in half (more like 60% compared to glUniform*), even though I'm actually making fewer GL calls. Also, what really struck me is that samplers can't be put into uniform blocks, so you end up setting samplers with glUniform* anyway, which means you basically have to use part of the GL2 pipeline and part of GL3. Is this normal? Why would glBufferSubData be twice as slow as 2 glUniform* calls?
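For reference, the per-program setup is roughly this (a sketch; "PerObject" is just a placeholder block name and the std140 offsets are handled elsewhere):

// One-time setup: tie the program's block to binding point 0
GLuint BlockIndex = glGetUniformBlockIndex( Program, "PerObject" );
glUniformBlockBinding( Program, BlockIndex, 0 );

GLuint UBO;
glGenBuffers( 1, &UBO );
glBindBuffer( GL_UNIFORM_BUFFER, UBO );
glBufferData( GL_UNIFORM_BUFFER, Size, NULL, GL_DYNAMIC_DRAW );
glBindBufferBase( GL_UNIFORM_BUFFER, 0, UBO );

// Per frame, per program: re-upload the constants (view/projection among them)
glBindBuffer( GL_UNIFORM_BUFFER, UBO );
glBufferSubData( GL_UNIFORM_BUFFER, 0, Size, Data );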

Dan Bartlett
08-11-2013, 04:07 AM
See http://hacksoflife.blogspot.co.uk/2010/02/one-more-on-vbos-glbuffersubdata.html for why glBufferSubData might not perform as well as you expect when streaming data.

You will probably get better performance by using the techniques mentioned in http://www.opengl.org/wiki/Buffer_Object_Streaming
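For example, one of the techniques described there is buffer "orphaning" (re-specification), which in this case would look roughly like this (same UBO/Size/Data placeholders as above):

glBindBuffer( GL_UNIFORM_BUFFER, UBO );
// Orphan the old storage so the driver doesn't stall waiting for the GPU to finish reading it
glBufferData( GL_UNIFORM_BUFFER, Size, NULL, GL_DYNAMIC_DRAW );
// Then fill the freshly allocated storage
glBufferSubData( GL_UNIFORM_BUFFER, 0, Size, Data );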

Alfonse Reinheart
08-11-2013, 04:30 AM
Also, what really struck me is that samplers can't be put into uniform blocks, so you end up setting samplers with glUniform* anyway, which means you basically have to use part of the GL2 pipeline and part of GL3. Is this normal?

It's because they're not really uniforms. At least, not in the same way that the non-opaque types (https://www.opengl.org/wiki/GLSL_Type#Opaque_types) are. They're not pieces of memory that store the value of the texture unit.
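The "value" you set through glUniform1i is just a texture unit index, and the texture itself lives in context state; something like this (the "DiffuseMap" name is just an example):

glUseProgram( Program );
// The sampler uniform only selects which texture unit to read from
glUniform1i( glGetUniformLocation( Program, "DiffuseMap" ), 0 );
// The actual texture is whatever is bound to that unit at draw time
glActiveTexture( GL_TEXTURE0 );
glBindTexture( GL_TEXTURE_2D, Texture );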

cippyboy
08-11-2013, 04:52 AM
Thanks for the tip. Apparently, changing

glBufferSubData( GL_UNIFORM_BUFFER, Offset, Size, Data );

to

void *Pointer = glMapBufferRange( GL_UNIFORM_BUFFER, Offset, Size, GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT | GL_MAP_UNSYNCHRONIZED_BIT );
memcpy( Pointer, Data, Size );
GLboolean Success = glUnmapBuffer( GL_UNIFORM_BUFFER );

has increased my performance a lot, probably mainly because under the hood there's some double buffering going on now. The bottom line is still that using uniform blocks is ~10% slower than plain old glUniform* calls. I'm now thinking they might only be useful for large arrays or large sets of data.

EDIT: I now do a single buffer update once per frame, and to my surprise the performance is still lower than using glUniform, but the difference is now more like 3-5%.

tonyo_au
08-11-2013, 07:12 PM
the performance is still lower than using glUniform
How many different shaders do you use per frame?

cippyboy
08-12-2013, 04:45 AM
Right now, 30 vertex and pixel shaders that I mix and match. I haven't counted exactly, but I usually have 1 program per material (1 vertex and 1 pixel shader from that array), and I know I have ~100 materials (sometimes the same program gets reused) and ~100 draw calls. So it was one buffer update versus 100 * 2 (constants) glUniform* calls.

tonyo_au
08-12-2013, 09:44 PM
That's most interesting. I haven't benchmarked the difference between uniforms and a uniform buffer. If it is only 3-5% with 2 uniforms, there may be a crossover point as the number of unique items in the buffer increases. I have about 16, so I would need 16 uniform calls.

Alfonse Reinheart
08-13-2013, 01:36 AM
He also mentioned that, "I now do a glBufferSubData once for each program's (single) uniform block", which is probably not the most efficient way to go about updating uniform blocks. If you've got a large series of objects that use the same uniform block layout, it's probably more efficient to create an array of uniform blocks within one buffer and change them all in one call (either mapping and writing all the blocks at once or doing a single glBufferSubData call).
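Something along these lines (a rough sketch; AlignedBlockSize has to be padded up to GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, and all names are placeholders):

// One big buffer holding one block per object
glBindBuffer( GL_UNIFORM_BUFFER, BigUBO );
glBufferData( GL_UNIFORM_BUFFER, ObjectCount * AlignedBlockSize, NULL, GL_DYNAMIC_DRAW );

// Once per frame: write every object's block in a single map
char *Ptr = (char *)glMapBufferRange( GL_UNIFORM_BUFFER, 0, ObjectCount * AlignedBlockSize, GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT );
for( int i = 0; i < ObjectCount; ++i )
    memcpy( Ptr + i * AlignedBlockSize, &ObjectData[i], BlockSize );
glUnmapBuffer( GL_UNIFORM_BUFFER );

// Per draw: point the binding at that object's slice instead of re-uploading
glBindBufferRange( GL_UNIFORM_BUFFER, 0, BigUBO, ObjectIndex * AlignedBlockSize, BlockSize );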

cippyboy
08-13-2013, 02:28 PM
He also mentioned that, "I now do a glBufferSubData once for each program's (single) uniform block", which is probably not the most efficient way to go about updating uniform blocks

I later added that "I now just do a single buffer update once per frame", and that's with glMapBufferRange, which is about twice as fast as glBufferSubData for me.

I suppose there is a crossover point, but I have no real idea where. I'm anxious to test this on GLES3 (mobile) GPUs, which I assume will have drivers written specifically for it, unlike desktop where they originally had the DX11 driver and (probably?) patched in GL3+ features once they were approved.

I also came about this post lately ( http://stackoverflow.com/questions/15297773/opengl-es-ios-drawing-performance-a-lot-slower-with-vbos-than-without ) which is quite interesting and might be even worth investigating.

cippyboy
08-15-2013, 05:42 PM
I just finished implementing sampler objects (the ones you use with glBindSampler), and to my surprise these are also slower than setting the sampler state just once with glTexParameter*. I assume the speed difference might come from switching states (since I have textures with and without mipmaps). I also thought that if the hardware is designed for DX11 (I use an HD 7850), it might already keep a sampler object per individual texture, so in hardware it might be switching between a lot of objects, and with sampler objects I would minimize that. Then again, I have no idea how the hardware actually does it, so I'm left to speculation.
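For context, the change was roughly from per-texture state to one shared sampler object (a simplified sketch; names are placeholders):

// Before: filtering state stored on the texture object itself, set once
glBindTexture( GL_TEXTURE_2D, Texture );
glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR );
glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR );

// After: filtering state in a sampler object that overrides the texture's parameters on its unit
GLuint Sampler;
glGenSamplers( 1, &Sampler );
glSamplerParameteri( Sampler, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR );
glSamplerParameteri( Sampler, GL_TEXTURE_MAG_FILTER, GL_LINEAR );
glBindSampler( 0, Sampler );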