Thread: GL_EXT_separate_shader_objects Performance ?

  1. #1
    Junior Member Newbie cippyboy's Avatar
    Join Date
    Mar 2008
    Posts
    15

    GL_EXT_separate_shader_objects Performance ?

    Hi,

    I have an OpenGL ES 2 engine running on Windows and iOS, and I also have a DirectX 11 backend for it.

    As soon as I learned about GL_EXT_separate_shader_objects, I expected huge performance benefits: with it, OpenGL would look and feel more like DirectX 9+, and my cross-API code could stay similar while maintaining high performance.

    To my surprise, after implementing GL_EXT_separate_shader_objects I found that my performance was halved, and my GPU usage dropped from ~95% to ~45%. So basically, a single monolithic program is twice as fast as separate programs. This is on an AMD HD 7850 under Windows 8 with an OpenGL 4.2 core context.

    I originally imagined this extension was created to boost performance by separating constant buffers and shader stages, but it seems it may have been created for people who want to port DirectX shaders more directly, regardless of any performance hit.

    So my question: if you have implemented this feature in a reasonable scene, what performance difference do you see compared to monolithic shader programs?
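
    For reference, here is a minimal sketch of the two paths I'm comparing (desktop GL with ARB_separate_shader_objects; vs, fs, vsSource, fsSource and the other names are just illustrative):

        // Monolithic path: one fully linked program per VS/FS combination.
        GLuint prog = glCreateProgram();
        glAttachShader(prog, vs);            // vs/fs: compiled shader objects
        glAttachShader(prog, fs);
        glLinkProgram(prog);
        glUseProgram(prog);                  // per draw

        // Separate path: one separable program per stage, combined through a
        // program pipeline object.
        GLuint vsProg = glCreateShaderProgramv(GL_VERTEX_SHADER,   1, &vsSource);  // vsSource: GLSL string
        GLuint fsProg = glCreateShaderProgramv(GL_FRAGMENT_SHADER, 1, &fsSource);  // fsSource: GLSL string
        GLuint pipe;
        glGenProgramPipelines(1, &pipe);
        glUseProgramStages(pipe, GL_VERTEX_SHADER_BIT,   vsProg);
        glUseProgramStages(pipe, GL_FRAGMENT_SHADER_BIT, fsProg);
        glBindProgramPipeline(pipe);         // per draw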

  2. #2
    Senior Member OpenGL Guru
    Join Date
    May 2009
    Posts
    4,948
    What part of "performance" are we talking about? How did you "implement" them? How are you using program pipeline objects?

    Some benchmarking code would be good.

  3. #3
    Senior Member OpenGL Pro
    Join Date
    Jan 2012
    Location
    Australia
    Posts
    1,117
    While GLSL's monolithic approach has some advantages for optimizing shaders

    The spec does note that there may be a performance cost. If you are dynamically stitching the shader stages together inside the render loop, I would expect an additional hit there as well.
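
    Just to illustrate what I mean by stitching inside the render loop (a rough sketch, names are illustrative):

        // Stitching done once at load time: the driver can sort out the stage
        // interface up-front.
        glUseProgramStages(pipe, GL_VERTEX_SHADER_BIT,   vsProg);
        glUseProgramStages(pipe, GL_FRAGMENT_SHADER_BIT, fsProg);

        // The render loop then only binds the pre-built pipeline.
        glBindProgramPipeline(pipe);

        // Versus re-stitching inside the render loop, which may force the driver
        // to re-match the stage interfaces on every draw.
        glUseProgramStages(pipe, GL_FRAGMENT_SHADER_BIT, otherFsProg);
        glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0);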

  4. #4
    Junior Member Newbie cippyboy's Avatar
    Join Date
    Mar 2008
    Posts
    15
    Quote Originally Posted by Alfonse Reinheart
    What part of "performance" are we talking about?
    The part where I have ~100 vertex/pixel shader pairs in my scene. Previously I built one monolithic program per combination from the sub-shaders (reusing the compiled shaders across programs). Now I compile each shader into its own separable program and create a program pipeline object for each combination, still reusing the vertex and fragment programs, and I never swap the stage bindings inside a pipeline object at run time (I've heard that can hurt performance even more); roughly as in the sketch below. I also have a couple of objects (mainly UI) that don't use pipeline objects, so I mix monolithic draw calls with pipeline-object draw calls; could that be a performance hit? And by performance I generally mean framerate: same scene, same shader code, and this technique is a lot slower.
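
    A sketch of my setup (names are illustrative):

        // One pipeline object per VS/FS combination; the stage programs are
        // shared between pipelines and the stages are attached once, at load.
        GLuint pipe;
        glGenProgramPipelines(1, &pipe);
        glUseProgramStages(pipe, GL_VERTEX_SHADER_BIT,   sharedVsProg);
        glUseProgramStages(pipe, GL_FRAGMENT_SHADER_BIT, sharedFsProg);

        // Render loop: pipeline-based draws (the current program must be 0,
        // otherwise it overrides the bound pipeline)...
        glUseProgram(0);
        glBindProgramPipeline(pipe);
        glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0);

        // ...mixed with a few monolithic draws for the UI.
        glUseProgram(uiProgram);
        glDrawElements(GL_TRIANGLES, uiIndexCount, GL_UNSIGNED_INT, 0);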

    Yeah, I know the spec says it removes some optimizations, but in my opinion it could also enable some, like not fetching fragment-stage constants in the vertex shader and vice versa. If this extension was invented purely for convenience or to ease DirectX porting, then OK, I get it and I'll move on.

    EDIT: Some profiling with GPUPerfStudio 2 reveals that a glUseProgram call is (on average) about 4 times faster than glBindProgramPipeline:
    glUseProgram: 0.6 to 2 microseconds
    glBindProgramPipeline: 2.42 to 12 microseconds

    Furthermore, I looked at big glDrawElements calls (50-90K triangles) and they take around 20-30% longer to finish. There's definitely something weird going on in the driver... I now wonder if NVIDIA GPUs behave the same.

    EDIT 2: Just in case you're wondering, I referred to the EXT extension (when it's in fact the ARB version on desktop GL) because I was also planning to use it on iOS (if it had performance benefits).
    On iOS it crashes if I create more than 4 pipeline objects, on both the simulator (OS X 10.8.2) and iOS 6.x. I just tried the OS X 10.9 DP1 with the iOS 7.0 simulator and it works flawlessly. I'd have to reflash my device to iOS 7 to actually test performance (I run OS X in a VM, and both run the same, ~2 FPS, due to software rendering).
    Last edited by cippyboy; 08-08-2013 at 03:19 PM.

  5. #5
    Senior Member OpenGL Pro
    Join Date
    Jan 2012
    Location
    Australia
    Posts
    1,117
    glUseProgram call is (on average) 4 times faster than glBindProgramPipeline

    I would expect this, since the driver cannot use the program until the links between the shader stages are made. I am surprised about the render times for glDrawElements, though.

  6. #6
    Senior Member OpenGL Guru
    Join Date
    May 2009
    Posts
    4,948
    Those "links between shader parts" can be made off-line. That's the whole point of the program pipeline object: it encapsulates a sequence of programs, so you can do all that verification work up-front rather than at bind-time.

    Sounds more like someone's slacking off at their job of implementing this correctly.
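
    If the driver really is deferring that work to bind time, one thing the application can at least do is trigger validation once, up-front, after building each pipeline (a sketch; whether the driver actually caches the result is up to it):

        glValidateProgramPipeline(pipe);
        GLint ok = GL_FALSE;
        glGetProgramPipelineiv(pipe, GL_VALIDATE_STATUS, &ok);
        if (ok != GL_TRUE)
        {
            // Report the mismatch between stages at load time instead of at draw time.
            GLchar log[1024];
            glGetProgramPipelineInfoLog(pipe, sizeof(log), NULL, log);
            fprintf(stderr, "pipeline validation failed: %s\n", log);
        }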

  7. #7
    Junior Member Newbie cippyboy's Avatar
    Join Date
    Mar 2008
    Posts
    15
    Hm, so you're saying that behind the scenes the driver takes my glUseProgramStages bindings and builds another monolithic program on the fly? That would be a huge overhead...

  8. #8
    Senior Member OpenGL Guru
    Join Date
    May 2009
    Posts
    4,948
    He's talking about the establishment of a valid interface between the various separate programs. With separate programs, this must be done post-linking. Theoretically, an implementation could do it every time the pipeline is bound.

    But as I point out, this would be stupid on their part. Then again, it is AMD...

  9. #9
    Senior Member OpenGL Pro
    Join Date
    Jan 2012
    Location
    Australia
    Posts
    1,117
    But as I point out, this would be stupid on their part. Then again, it is AMD..

    I agree, but that would explain the hit on binding, not the hit on the draw call.

  10. #10
    Junior Member Newbie cippyboy's Avatar
    Join Date
    Mar 2008
    Posts
    15
    I'm adding some new GL3 features to the engine, and I figured I'd post here instead of starting a new thread.

    Next on my list was uniform blocks. I had previously worked with DX11 constant buffers, so I expected this to be an improvement.

    Around 2 of my ~10 constants (the view and projection matrices) are updated in every shader every frame before drawing.

    To my surprise, when I put all constants into a uniform block, so that instead of calling glUniformMatrix4fv twice per program I now do one glBufferSubData into each program's (single) uniform block (I know I could share the block between programs, I'm just not there yet), performance is again nearly halved (more like 60% of the glUniform* performance), even though I'm actually making fewer GL calls. What also struck me is that samplers can't be put into uniform blocks, so you still end up setting samplers with glUniform*; you basically have to use part of the GL2 path alongside GL3. Is this normal? Why would one glBufferSubData be twice as slow as two glUniform* calls? (A sketch of the two paths I'm comparing is below.)
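
    Roughly the two paths (names are illustrative; assuming an std140 block called PerFrame):

        // Old path: two glUniform* calls per program, per frame.
        glUseProgram(prog);
        glUniformMatrix4fv(viewLoc, 1, GL_FALSE, viewMatrix);
        glUniformMatrix4fv(projLoc, 1, GL_FALSE, projMatrix);

        // New path, setup (once per program): a uniform buffer bound to the
        // program's (single) uniform block.
        GLuint ubo;
        glGenBuffers(1, &ubo);
        glBindBuffer(GL_UNIFORM_BUFFER, ubo);
        glBufferData(GL_UNIFORM_BUFFER, blockSize, NULL, GL_DYNAMIC_DRAW);
        GLuint blockIndex = glGetUniformBlockIndex(prog, "PerFrame");
        glUniformBlockBinding(prog, blockIndex, 0);
        glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);

        // New path, per frame and per program: one glBufferSubData.
        glBindBuffer(GL_UNIFORM_BUFFER, ubo);
        glBufferSubData(GL_UNIFORM_BUFFER, 0, blockSize, blockData);

        // Samplers can't live in the block, so they still go through glUniform*.
        glUniform1i(diffuseSamplerLoc, 0);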
