Why is the work group size specified inside the shader (local_size_x)?
Why?! Especially given that in OpenCL it is declared at application run time (when calling clEnqueueNDRangeKernel()). In OpenGL, one can specify only the number of work groups at run time (with glDispatchCompute()). We lose flexibility by fixing it at compute shader compile time. Is any evolution expected on this matter?
Another question: is there any restriction on the values specified by local_size_x (the layout qualifier) and num_groups_x (the input to glDispatchCompute())? For example, in OpenCL global_work_size must be evenly divisible by local_work_size.
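For illustration, here's the pattern in question (a minimal sketch; the literal sizes are placeholders I picked, not anything mandated by the spec):

```glsl
#version 430
// Local size fixed at shader compile time:
layout (local_size_x = 64) in;

void main() {
    // The host only chooses the group count, e.g. glDispatchCompute(100, 1, 1),
    // for a total of 64 * 100 invocations along x.
}
```

Note that since the global size in GL is always local_size_x * num_groups_x by construction, the OpenCL-style divisibility question doesn't even arise: the total invocation count is a multiple of the local size by definition.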
Last edited by laobrasuca; 08-13-2012 at 09:24 AM.
or, is it legal to do the following:
uniform int local_size_x_from_app;
layout (local_size_x = local_size_x_from_app) in;
The grammar clearly defines it as "local_size_x = integer-constant". Uniforms aren't constants, so no.
Yes, true, I had read that but hadn't realized the implication.
Concerning my first post, any thoughts?
Here's a total guess. When you're writing optimized OpenCL kernels, sometimes you want these to be constants defined in the shader source. For instance, sometimes you unroll some or all of the iterations, and (for instance) stop doing barrier/synchronization when the problem size is reduced to <= 1 thread (1 warp). You can encase these unrolls inside "if" checks based on the local size (items/workgroup, threads/block, etc.). If the local size is constant, then the compiler can remove these if checks altogether and only pull in the iterations (and memory barriers) that are absolutely needed with no run-time conditional evaluation on the GPU.
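To make that concrete, here's a sketch of the kind of code that benefits (my own illustration, not from any post here; LOCAL_SIZE and the buffer layout are placeholders): a shared-memory tree reduction where, because the local size is a compile-time constant, the compiler can fully unroll the loop and keep only the barriers that are actually needed:

```glsl
#version 430
#define LOCAL_SIZE 128
layout (local_size_x = LOCAL_SIZE) in;

layout (std430, binding = 0) buffer In  { float data[]; };
layout (std430, binding = 1) buffer Out { float result[]; };

shared float partial[LOCAL_SIZE];

void main() {
    uint tid = gl_LocalInvocationID.x;
    partial[tid] = data[gl_GlobalInvocationID.x];
    barrier();

    // With LOCAL_SIZE a constant, every iteration count and 'if' bound below
    // is known at compile time, so the loop can be unrolled and dead
    // iterations (and their barriers) removed outright.
    for (uint s = LOCAL_SIZE / 2u; s > 0u; s >>= 1u) {
        if (tid < s) partial[tid] += partial[tid + s];
        barrier();
    }
    if (tid == 0u) result[gl_WorkGroupID.x] = partial[0];
}
```

If the local size were a run-time value instead, each of those checks would have to be evaluated on the GPU at every dispatch.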
That said, I haven't digested the compute shader extension yet, and am definitely no OpenCL/CUDA expert.
Last edited by Dark Photon; 08-13-2012 at 05:29 PM.
Thanks for that. Optimized code must indeed be one of the great motivations. It is maybe somehow linked to the fact that they drop precision (in floating point computation) in favor of performance. All things considered, it is still OpenGL, so graphics oriented, and it makes sense to consider that visual effects will not be too sensitive to some loss of generality or precision.
It is a good thing that the ARB finally decided to have a shader dedicated to generic computing. Sadly, it arrived a year too late for me; I've done all my stuff on CL/GL already xD It will take me some time to get my hands on it and change all my libraries.
The reason for that is really that the local work group size in fact affects the shader code. Think about it: the local work group size affects the thread scheduling scheme and the shared memory usage pattern. The driver could hide this and allow the developer to supply the size at dispatch time, but in practice that would probably still require a shader recompile, so it would not be deterministic, from a performance point of view, how expensive a compute dispatch is, even if the driver caches the compiled shaders.
However, you can easily manage multiple local work group sizes yourself by simply creating multiple shaders with the local work group sizes of your choice and selecting the appropriate one when needed. This way there are no hidden costs and you can expect optimal dispatch speed.
Regarding GL compute shaders versus CL-GL interop: you should definitely be able to gain some performance by using GL compute shaders, because no matter how nice CL-GL interop is, developers often complain about its performance hit due to synchronization between the contexts. GL compute shaders are not affected by such cross-context synchronization issues.
Disclaimer: This is my personal profile. Whatever I write here is my personal opinion and none of my statements or speculations are anyhow related to my employer and as such should not be treated as accurate or valid and in no case should those be considered to represent the opinions of my employer.
Technical Blog: http://www.rastergrid.com/blog/
Re "no matter how nice CL-GL interop is": au contraire! It's not nice at all. GL_ARB_cl_event / cl_khr_gl_event have never shipped in the drivers I care about. That leaves you with the glFinish()/clFinish() sledgehammer approach and massive pipeline bubbles to show for it.
Forget even running CL/GL tasks in parallel on the GPU at once. You can't even run them in series on the GPU at once.
I, for one, heartily welcome our new Compute Shader overlords!
Last edited by Dark Photon; 08-15-2012 at 06:38 PM.