glMultiDispatchComputeIndirect

I would like to see glMultiDispatchComputeIndirect added. This matters to me because my compute shader needs to be globally synced often. Currently the only way to do that is to call glDispatchComputeIndirect multiple times in a loop. The new procedure would perform those repeated dispatches in a single call, equivalent to:


void glMultiDispatchComputeIndirect(GLintptr indirect, GLsizei computeCount) {
  for(GLsizei GLComputeInvocation = 0; GLComputeInvocation < computeCount; GLComputeInvocation++) {
    glUniform1i(locationof_GLComputeInvocation, GLComputeInvocation);
    glDispatchComputeIndirect(indirect);
  }
}

The same thing could be done for the non-indirect version. Some people might also like a three-dimensional compute count:


void glMultiDispatchComputeIndirect(GLintptr indirect, GLsizei computeCountX, GLsizei computeCountY, GLsizei computeCountZ) {
  /* Each loop is bounded by the count for its own axis. */
  for(GLsizei GLComputeInvocationX = 0; GLComputeInvocationX < computeCountX; GLComputeInvocationX++) {
    for(GLsizei GLComputeInvocationY = 0; GLComputeInvocationY < computeCountY; GLComputeInvocationY++) {
      for(GLsizei GLComputeInvocationZ = 0; GLComputeInvocationZ < computeCountZ; GLComputeInvocationZ++) {
        glUniform3i(locationof_GLComputeInvocation, GLComputeInvocationX, GLComputeInvocationY, GLComputeInvocationZ);
        glDispatchComputeIndirect(indirect);
      }
    }
  }
}

The order of the dispatches may matter to some people, but they could also be allowed to run in an undefined order, with each dispatch given an invocation counter, or people could use atomic counters in their shaders.
This procedure should allow some optimization in the driver/hardware, because it would know in advance that the same dispatch will be issued multiple times.
Alternatively, a global sync could be added inside the compute shader itself, but I know that would be very difficult because it would require all workgroups to stay alive and communicate with each other.
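
To make the invocation-counter idea a bit more concrete, here is a minimal sketch of how a single linear counter could be mapped to a 3D invocation index and back, using the same nesting as the loops above (X outermost, Z innermost). The InvocationIndex type and both function names are made up for illustration and are not part of OpenGL; this is just one possible mapping, not how a driver would have to do it.

```c
#include <assert.h>

/* Hypothetical 3D invocation index, mirroring the three values the
   proposed call would pass to the shader. Not part of OpenGL. */
typedef struct {
    int x, y, z;
} InvocationIndex;

/* Map a linear counter to a 3D index, with z varying fastest
   (same order as the nested loops: X outermost, Z innermost).
   Assumes all counts are positive and the counter is in range. */
static InvocationIndex unflatten_invocation(int counter, int countX, int countY, int countZ)
{
    InvocationIndex idx;
    assert(countX > 0 && countY > 0 && countZ > 0);
    assert(counter >= 0 && counter < countX * countY * countZ);
    idx.z = counter % countZ;
    idx.y = (counter / countZ) % countY;
    idx.x = counter / (countY * countZ);
    return idx;
}

/* Inverse mapping: 3D index back to the linear counter. */
static int flatten_invocation(InvocationIndex idx, int countY, int countZ)
{
    return (idx.x * countY + idx.y) * countZ + idx.z;
}
```

With counts of 2 x 3 x 4, for example, counter 5 maps to (0, 1, 1) and counter 23 maps to (1, 2, 3). A shader that only receives the linear counter could do the same arithmetic itself if it needs the 3D form.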

I think even just having the loop in the driver instead of in user code should save some kernel-user round trips.