Part of the Khronos Group
OpenGL.org

Thread: Compute Shader Performance Issues

  1. #1
    Junior Member Newbie
    Join Date
    Oct 2017
    Posts
    4

    I'm implementing a rigid body physics system with compute shaders. The problem is that any time I use structs for my data, the frame rate drops to under 1 fps. For example, I have a radix sort for my broadphase collision detection. When I just sort vec4s, as most compute shader particle demos do, I have no problem sorting 8192 bodies in under a millisecond, but as soon as I move the body structs around, performance drops below real time.

    It's possible to sort with indices and a comparison value, so that the body SSBO only needs to be put in order at the end, but that isn't much better, and there are still physics operations where I can't really do this.

    If I cut the size of the body struct down, performance improves slightly, but I don't think I can cut it down any further. Conversely, if I make the struct artificially larger, performance gets even worse. This makes it feel like a cache issue.

    Here is the struct I am trying to work with.

    Code :
    struct ConvexHull{
      vec3  position;
      uint enabled;
      vec3  half_ex;
      uint hash;
      vec4  verts_0[8];
      vec4  planes_n[6];
      vec4  planes_d[6];
    };

    Has anybody had this problem or have any ideas for dealing with it?
    Last edited by krackaan; 10-25-2017 at 10:58 AM. Reason: Additional information
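
For reference, the size of this struct can be worked out by hand. Below is a minimal Python sketch, assuming the struct is stored in an std430 buffer (as the shader later in the thread declares) and that the standard std430 alignment rules apply. The `FIELDS` table and the `align_up` and `std430_layout` helpers are hypothetical names introduced for illustration; they are not part of any OpenGL API.

```python
# Each entry: (name, base alignment in bytes, size in bytes, array length or None).
# Assumed std430 rules: vec3 is 12 bytes with 16-byte alignment, uint is 4/4,
# vec4 is 16/16, and an array of vec4 has a 16-byte stride.
FIELDS = [
    ("position", 16, 12, None),  # vec3
    ("enabled",   4,  4, None),  # uint, fits in the 4 bytes left after the vec3
    ("half_ex",  16, 12, None),  # vec3
    ("hash",      4,  4, None),  # uint
    ("verts_0",  16, 16, 8),     # vec4[8]
    ("planes_n", 16, 16, 6),     # vec4[6]
    ("planes_d", 16, 16, 6),     # vec4[6]
]


def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def std430_layout(fields):
    """Return (offsets by member name, total struct size in bytes)."""
    offsets = {}
    offset = 0
    struct_alignment = 1
    for name, alignment, size, count in fields:
        offset = align_up(offset, alignment)
        offsets[name] = offset
        if count is None:
            offset += size
        else:
            # Array stride: element size rounded up to the element alignment.
            offset += align_up(size, alignment) * count
        struct_alignment = max(struct_alignment, alignment)
    # The struct size is padded to the largest member alignment.
    return offsets, align_up(offset, struct_alignment)


offsets, size = std430_layout(FIELDS)
print(offsets)
print("bytes per hull:", size)
print("MiB for 8192 hulls:", size * 8192 / (1024 * 1024))
```

Under these assumptions each ConvexHull is 352 bytes (22 vec4s), so 8192 hulls occupy 2.75 MiB per buffer.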

  2. #2
    Junior Member Newbie
    Join Date
    Oct 2017
    Posts
    4
    I've actually timed this now. The sort, which takes 48 dispatchCompute calls, needs only 0.1 ms in total when working on uvec4 data. After the sort I reorder the bodies based on the sorted indices, and that takes 1.7 ms for a single call. Given that this needs to be called at least 3 times per frame, that's a lot of my frame budget gone with only 8192 objects.

    That makes the dispatch call that works on the ConvexHull struct more than 800 times slower than the calls that just work on vec4 buffers (1.7 ms against roughly 0.002 ms per sort dispatch). Here is the code.

    Code :
    #version 430
     
    precision mediump float;
     
    struct ConvexHull{
      vec3  position;
      uint enabled;
      vec3  half_ex;
      uint hash;
      vec4  verts_0[8];
      vec4  planes_n[6];
      vec4  planes_d[6];
    };
     
    layout(local_size_x = 128) in;
     
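    layout(binding = 0, std430) readonly buffer In {
      ConvexHull hulls_in[];   // renamed: "in" is a reserved word in GLSL
    };
     
    layout(binding = 1, std430) writeonly buffer Out {
      ConvexHull hulls_out[];  // renamed: "out" is a reserved word in GLSL
    };
     
    layout(binding = 2, std430) readonly buffer SortData {
      uvec4 sort_buf[];
    };
     
    void main() {
      uint index = gl_GlobalInvocationID.x;
      // Gather: copy the hull named by the sorted index into its sorted slot.
      hulls_out[index] = hulls_in[sort_buf[index].y];
    }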
    Last edited by krackaan; 10-29-2017 at 03:13 PM.
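
To put the reported timings in perspective, the reorder pass can be turned into an effective bandwidth figure. This is a rough Python sketch, not a measurement: it assumes a ConvexHull is 352 bytes under std430, that every hull is read once and written once, and it ignores the uvec4 index reads. The function names are hypothetical.

```python
HULL_BYTES = 352       # assumed std430 size: 2 * 16 + (8 + 6 + 6) * 16
BODY_COUNT = 8192      # from the post
REORDER_MS = 1.7       # reported time for the single reorder dispatch
SORT_TOTAL_MS = 0.1    # reported total for the sort
SORT_DISPATCHES = 48   # reported number of sort dispatches


def reorder_bytes_moved(body_count, hull_bytes):
    """Bytes read plus bytes written when every hull is copied once."""
    return 2 * body_count * hull_bytes


def effective_bandwidth_gb_s(bytes_moved, milliseconds):
    """Effective throughput in GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / (milliseconds * 1e-3) / 1e9


bytes_moved = reorder_bytes_moved(BODY_COUNT, HULL_BYTES)
bandwidth = effective_bandwidth_gb_s(bytes_moved, REORDER_MS)
per_sort_dispatch_ms = SORT_TOTAL_MS / SORT_DISPATCHES
slowdown = REORDER_MS / per_sort_dispatch_ms

print(f"bytes moved per reorder: {bytes_moved}")
print(f"effective bandwidth: {bandwidth:.2f} GB/s")
print(f"reorder vs. one sort dispatch: {slowdown:.0f}x")
```

With these inputs the reorder moves about 5.8 MB at roughly 3.4 GB/s, and is about 816 times slower than an average sort dispatch, which matches the "more than 800 times" figure. The GPU is not named in the thread, but that rate is far below the memory bandwidth discrete GPUs usually offer, which suggests the pass is not limited by raw bandwidth alone.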

  3. #3
    Member Regular Contributor
    Join Date
    May 2016
    Posts
    435
    Are you measuring the GPU time with a query object?
    https://www.khronos.org/opengl/wiki/...#Timer_queries

  4. #4
    Junior Member Newbie
    Join Date
    Oct 2017
    Posts
    4
    Quote Originally Posted by john_connor View Post
    Are you measuring the GPU time with a query object?
    https://www.khronos.org/opengl/wiki/...#Timer_queries
    Yes, I am using timer queries.

  5. #5
    Junior Member Newbie
    Join Date
    Oct 2017
    Posts
    4
    The more I look at this, the more I think there could be a driver issue. I've eliminated a lot of possible cache-related causes. If I don't use double buffering I get a speedup of 100 times, and for every vec4 I stop writing to the struct, the speed doubles. The size of the buffer or struct makes no difference, only the number of bytes written each time.

    I'd switch to Vulkan to see if that helps, but it seems few people are implementing anything like this, and it's hard to tell whether it's worth it or whether the issues will just be the same. I will probably post a query in the Vulkan forums too.

  6. #6
    Junior Member Newbie
    Join Date
    Apr 2017
    Posts
    3
    Did you try looking at the compiled shader in a text editor? Look for something like TMP LMEM[].
    Your structures seem to be huge... a register pressure problem?
    Did you try using a GPU performance analyser to find the bottleneck?


    What a pain in the ass this new human verification...
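
The register pressure suspicion can be given a rough upper bound. The Python sketch below assumes a 352-byte ConvexHull (std430), 32-bit registers, and the worst case in which the compiler stages the whole struct copy through per-invocation temporaries. Whether the driver's compiler actually does that is not known from the thread, and the helper names are hypothetical.

```python
HULL_BYTES = 352      # assumed std430 size of ConvexHull
REGISTER_BYTES = 4    # assumed 32-bit registers
LOCAL_SIZE_X = 128    # work group size from the posted shader


def registers_per_invocation(struct_bytes, register_bytes=REGISTER_BYTES):
    """32-bit values needed to hold one whole struct (rounded up)."""
    return -(-struct_bytes // register_bytes)


def bytes_per_work_group(struct_bytes, local_size):
    """Temporary storage if every invocation holds one struct at once."""
    return struct_bytes * local_size


regs = registers_per_invocation(HULL_BYTES)
group_bytes = bytes_per_work_group(HULL_BYTES, LOCAL_SIZE_X)

print(f"32-bit values per invocation: {regs}")
print(f"bytes per work group of {LOCAL_SIZE_X}: {group_bytes}")
```

In that worst case each invocation would need 88 32-bit values, or 44 KiB across a 128-invocation work group, which is the kind of footprint that could push a compiler to spill to local memory.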
