Compute Shader Performance Issues

I’m implementing a rigid body physics system with compute shaders. The problem I’m having is that any time I use structs for my data, I experience drops to under 1fps. For example, I have a radix sort for my broadphase collision. When I just sort vec4s, as most compute shader demos do for particles, I have no problem sorting 8192 bodies in under a millisecond, but as soon as I want to move the body structs around, performance becomes sub-realtime.

It’s possible to sort with indices and a comparison value, so the body SSBO only needs putting in order at the end, but things aren’t a lot better and there are still physics operations where I can’t really do this.

If I cut the size of the body struct down, performance improves slightly, but I don’t think I can cut it down any more. Conversely, if I make the struct artificially larger, performance gets even worse. This makes it feel like a cache issue.

Here is the struct I am trying to work with.


struct ConvexHull{
  vec3  position;
  uint enabled;
  vec3  half_ex;
  uint hash;
  vec4  verts_0[8];
  vec4  planes_n[6];
  vec4  planes_d[6];
};
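For reference, the memory footprint of this struct can be checked on the host side. The following is a minimal sketch, assuming std430 layout rules (each vec3 shares a 16-byte slot with the uint that follows it, and vec4 arrays have a 16-byte stride). ConvexHullHost and hull_buffer_bytes are hypothetical names, not part of the original code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side mirror of the GLSL ConvexHull struct, assuming
   std430 rules: each vec3 plus the uint after it fills one 16-byte slot,
   and every vec4 array element has a 16-byte stride. */
typedef struct {
    float    position[3];    /* offset 0   */
    uint32_t enabled;        /* offset 12  */
    float    half_ex[3];     /* offset 16  */
    uint32_t hash;           /* offset 28  */
    float    verts_0[8][4];  /* offset 32  */
    float    planes_n[6][4]; /* offset 160 */
    float    planes_d[6][4]; /* offset 256 */
} ConvexHullHost;

/* Bytes occupied by `count` bodies in one buffer. */
size_t hull_buffer_bytes(size_t count) {
    return count * sizeof(ConvexHullHost);
}
```

Under these assumptions one body takes 352 bytes (22 vec4-sized slots), so 8192 bodies take 2,883,584 bytes, about 2.75 MiB per buffer.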

Has anybody had this problem or have any ideas for dealing with it?

I’ve actually timed this now. For the sort, which has 48 dispatchCompute calls, the total time is only 0.1 milliseconds working on uvec4 data. After the sort I re-order based on the sorted indices, and the time is 1.7ms for a single call. Given that this needs to be called at least 3 times per frame, that’s a lot of my frame update window gone with only 8192 objects.
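The figures above can be put side by side with a little arithmetic. This is only a sketch: the function names are hypothetical, and the 60fps frame budget used in the note below is an assumption, not a number from the measurements.

```c
/* Average time of one dispatch when `count` dispatches take `total_ms`. */
double per_dispatch_ms(double total_ms, int count) {
    return total_ms / count;
}

/* How many times slower `slow_ms` is than `fast_ms`. */
double slowdown(double slow_ms, double fast_ms) {
    return slow_ms / fast_ms;
}

/* Fraction of a frame consumed by `calls` calls of `call_ms` each.
   frame_ms is an assumption, e.g. 1000.0 / 60.0 for a 60fps target. */
double frame_fraction(double call_ms, int calls, double frame_ms) {
    return call_ms * calls / frame_ms;
}
```

With the measured values, slowdown(1.7, per_dispatch_ms(0.1, 48)) comes to about 816, and frame_fraction(1.7, 3, 1000.0 / 60.0) comes to roughly 0.31, i.e. close to a third of an assumed 60fps frame.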

That makes the single dispatch call that works on the ConvexHull struct more than 800 times slower than each of the sort dispatches, which just work on vec4 buffers. Here is the code.


#version 430

precision mediump float;

struct ConvexHull{
  vec3  position;
  uint enabled;
  vec3  half_ex;
  uint hash;
  vec4  verts_0[8];
  vec4  planes_n[6];
  vec4  planes_d[6];
};

layout(local_size_x = 128) in;

layout(binding = 0, std430) readonly buffer In {
  ConvexHull hulls_in[];
};

layout(binding = 1, std430) writeonly buffer Out {
  ConvexHull hulls_out[];
};

layout(binding = 2, std430) readonly buffer SortData {
  uvec4 sort_buf[];  // .y holds the source index of each sorted body
};

void main() {
  uint index = gl_GlobalInvocationID.x;
  // Gather: copy the whole struct from its pre-sort slot into sorted order.
  hulls_out[index] = hulls_in[sort_buf[index].y];
}


Are you measuring the GPU time with a query object?
https://www.khronos.org/opengl/wiki/Query_Object#Timer_queries

[QUOTE=john_connor;1289123]Are you measuring the GPU time with a query object?
https://www.khronos.org/opengl/wiki/Query_Object#Timer_queries[/QUOTE]

Yes, I am using timer queries.

Interestingly, if I copy the ConvexHull member by member, the problem goes away. This version executes in 0.1ms and might solve my problem.


void main() {
  uint index = gl_GlobalInvocationID.x;
  uint src = sort_buf[index].y;  // source slot of this body
  hulls_out[index].position = hulls_in[src].position;
  hulls_out[index].enabled = hulls_in[src].enabled;
  hulls_out[index].half_ex = hulls_in[src].half_ex;
  hulls_out[index].hash = hulls_in[src].hash;
  hulls_out[index].verts_0 = hulls_in[src].verts_0;
  hulls_out[index].planes_n = hulls_in[src].planes_n;
  hulls_out[index].planes_d = hulls_in[src].planes_d;
}
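The member-by-member copy could be taken one step further by moving one vec4 per invocation instead of one body per invocation. The sketch below only emulates that index arithmetic on the CPU so it can be checked. It is not the code from this thread, it assumes a 352-byte std430 struct (22 vec4 slots), and whether it would actually be faster on a given driver has not been measured here. All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define VEC4S_PER_HULL 22  /* 352 bytes / 16 bytes, assuming std430 layout */

typedef struct { float v[4]; } Vec4;

/* CPU emulation of a hypothetical shader variant in which each invocation
   moves one vec4 instead of one whole ConvexHull. `gid` plays the role of
   gl_GlobalInvocationID.x; `sorted_index` holds the .y values of sort_buf. */
void copy_one_vec4(Vec4 *dst, const Vec4 *src,
                   const uint32_t *sorted_index, size_t gid) {
    size_t hull = gid / VEC4S_PER_HULL;  /* destination body */
    size_t lane = gid % VEC4S_PER_HULL;  /* vec4 slot inside the body */
    dst[gid] = src[(size_t)sorted_index[hull] * VEC4S_PER_HULL + lane];
}

/* Runs every "invocation" in turn, covering count * 22 work items. */
void reorder_hulls(Vec4 *dst, const Vec4 *src,
                   const uint32_t *sorted_index, size_t count) {
    for (size_t gid = 0; gid < count * VEC4S_PER_HULL; ++gid)
        copy_one_vec4(dst, src, sorted_index, gid);
}

/* Small self-check: tags every vec4 with its body and slot, reorders three
   bodies, and returns 1 if each slot came from the expected source body. */
int reorder_self_check(void) {
    enum { COUNT = 3 };
    static Vec4 src[COUNT * VEC4S_PER_HULL], dst[COUNT * VEC4S_PER_HULL];
    const uint32_t sorted_index[COUNT] = {2, 0, 1};
    for (size_t h = 0; h < COUNT; ++h)
        for (size_t l = 0; l < VEC4S_PER_HULL; ++l)
            src[h * VEC4S_PER_HULL + l].v[0] = (float)(h * 100 + l);
    reorder_hulls(dst, src, sorted_index, COUNT);
    for (size_t h = 0; h < COUNT; ++h)
        for (size_t l = 0; l < VEC4S_PER_HULL; ++l)
            if (dst[h * VEC4S_PER_HULL + l].v[0] !=
                (float)(sorted_index[h] * 100 + l))
                return 0;
    return 1;
}
```

In a shader, the same mapping would mean declaring both buffers as vec4 arrays and dispatching 22 times as many invocations, so that neighbouring invocations touch neighbouring 16-byte slots.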

The more I look at this, the more I think there could be a driver issue; I’ve eliminated a lot of possible cache-related causes. If I don’t use double buffering I get a speed-up of 100 times, and then for every vec4 I eliminate from the write to the struct, the speed doubles. The size of the buffer or struct makes no difference, only the number of bytes written each time.

I’d switch to Vulkan to see if that helps, but it seems few people are implementing anything like this, and it’s hard to get an idea of whether it’s worth it or whether the issues will just be the same. I will probably post a query in the Vulkan forums too.

Did you try looking at the compiled shader in a text editor? Look for something like TMP LMEM[].
Your structures seem to be huge… register pressure problem?
Did you try using a GPU performance analyser to find the bottleneck?
