Compute Shaders in OpenGL

The functionality of CUDA/OpenCL is exposed to Direct3D developers through “Compute Shaders”, aka DirectCompute.

While these APIs seem equivalent at first sight, there are differences that result in both advantages and disadvantages.

From a graphics programmer’s perspective, DirectCompute has one important advantage right now: interoperability works a lot better and is faster, because there is no context switch between the compute context and the graphics context. There is only one context, if I am informed correctly.

Is anyone aware of attempts to have compute shaders in GLSL?
Or even better: is there a way of running OpenCL in the same context as OpenGL?

What would it mean to have them running “in the same context”? An OpenGL context isn’t the same thing as an OpenCL context. They’re two different concepts.

The issue with interop is that you basically have to finish all your GL work (with glFinish) before you can start your CL work, and vice versa. That has nothing to do with contexts and everything to do with how the specifications were designed: this hard synchronization exists because neither spec really says how the two are supposed to interact.

There are extensions to GL and CL to handle the interop, but sadly they are currently not supported (at least by Nvidia):

CL event to OpenGL sync
http://www.opengl.org/registry/specs/ARB/cl_event.txt

OpenGL sync to CL event
http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/gl_event.html

Some of the things one might want to do with OpenCL can be done efficiently with transform feedback.
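For reference, the basic transform feedback pattern looks roughly like this (just a sketch; prog, inputVAO, tfBuffer and numElements are placeholder names and error checking is omitted):

    // Capture the vertex shader output "outValue" into a buffer object.
    const GLchar* varyings[] = { "outValue" };
    glTransformFeedbackVaryings(prog, 1, varyings, GL_INTERLEAVED_ATTRIBS);
    glLinkProgram(prog);                      // varyings only take effect after re-linking

    glUseProgram(prog);
    glEnable(GL_RASTERIZER_DISCARD);          // we only want the captured data, no rasterization
    glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, tfBuffer);

    glBeginTransformFeedback(GL_POINTS);
    glBindVertexArray(inputVAO);
    glDrawArrays(GL_POINTS, 0, numElements);  // one "kernel invocation" per vertex
    glEndTransformFeedback();

    glDisable(GL_RASTERIZER_DISCARD);
    // tfBuffer now holds the per-vertex results and can be used as a VBO
    // in the next draw call without ever leaving the GL context.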

Are you saying clEnqueueAcquireGLObjects/clEnqueueReleaseGLObjects are inefficient? I have seen no problems on NVIDIA hardware with interop between OpenGL and OpenCL. This is a snippet from GLInterop.cpp in Chapter 10 of the OpenCL book; I ran it on my Linux setup with a GeForce GTX 465, OpenGL 4.1.0 and OpenCL 1.1.


    // Note, we should ensure GL is completed with any commands that might affect this VBO
    // before we issue OpenCL commands
    glFinish();

    // acquire the OpenGL buffer object for writing from OpenCL
    clEnqueueAcquireGLObjects(cqCommandQue, 1, &vbo_cl, 0, NULL, NULL);

    // set work size and execute the kernel
    clEnqueueNDRangeKernel(cqCommandQue, ckKernel, 2, NULL, szGlobalWorkSize, NULL, 0, NULL, NULL);

    // release the OpenGL buffer object back to GL
    clEnqueueReleaseGLObjects(cqCommandQue, 1, &vbo_cl, 0, NULL, NULL);

    // Note, we should ensure OpenCL is finished with any commands that might affect the VBO
    clFinish(cqCommandQue);

Also, my NVIDIA card reports that “cl_khr_gl_sharing” is included in its CL_DEVICE_EXTENSIONS string (see link).
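For anyone who wants to check their own device, a query along these lines should do it (device is assumed to be a valid cl_device_id obtained from clGetDeviceIDs; needs <stdio.h>, <string.h> and <CL/cl.h>):

    // Query the device extension string and look for cl_khr_gl_sharing.
    char extensions[4096];
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, sizeof(extensions), extensions, NULL);
    printf("cl_khr_gl_sharing %s\n",
           strstr(extensions, "cl_khr_gl_sharing") ? "supported" : "NOT supported");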

You are right. But I never got quite the bandwidth that I got with CUDA :frowning:

Exactly. I am aware of that. It seems to me that in D3D11 there is just one context, i.e. one concept, instead of two. That is why I asked in the first place.

My problem is that before every draw call I run a CUDA kernel that takes about 0.1 ms, and I pay a cost of about 0.15 ms when switching between CUDA and OpenGL. So I fear my algorithm is context-switch bound.

Sweet :slight_smile: Looks like the way to go…

Only one little thing: could you please be so kind as to report how fast you can switch between the GL and CL worlds on your card? That would be really nice to know. Thanks in advance!
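In case it helps us compare the same thing, this is roughly how I would measure one GL -> CL -> GL round trip with no kernel in between (untested sketch; cqCommandQue and vbo_cl as in your snippet, Linux timer via clock_gettime from <time.h>):

    // Rough timing of an empty GL -> CL -> GL switch.
    struct timespec t0, t1;
    glFinish();                                   // make sure GL work is done first
    clock_gettime(CLOCK_MONOTONIC, &t0);

    clEnqueueAcquireGLObjects(cqCommandQue, 1, &vbo_cl, 0, NULL, NULL);
    clEnqueueReleaseGLObjects(cqCommandQue, 1, &vbo_cl, 0, NULL, NULL);
    clFinish(cqCommandQue);                       // wait until CL has handed the buffer back

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
                (t1.tv_nsec - t0.tv_nsec) / 1.0e6;
    printf("GL<->CL switch (no kernel): %f ms\n", ms);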

On further reading (the OpenCL book by Munshi et al., Chapter 10, pp. 348-350, “Synchronization between OpenGL and OpenCL”), there is a code snippet to remove the glFinish/clFinish calls. It requires the GL_ARB_sync and cl_khr_gl_event extensions. My pseudo-code changes to the above post are as follows (note: untested):


    cl_event gl_event, release_event;

    // fence in the GL command stream, exported as a CL event (needs cl_khr_gl_event)
    GLsync sync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    gl_event = clCreateEventFromGLsyncKHR(context, sync, NULL);

    // acquire the OpenGL buffer object for writing from OpenCL,
    // waiting on the GL fence instead of calling glFinish()
    clEnqueueAcquireGLObjects(cqCommandQue, 1, &vbo_cl, 1, &gl_event, NULL);

    // set work size and execute the kernel
    clEnqueueNDRangeKernel(cqCommandQue, ckKernel, 2, NULL, szGlobalWorkSize, NULL, 0, NULL, NULL);

    // release the OpenGL buffer object and get back an event for the release
    clEnqueueReleaseGLObjects(cqCommandQue, 1, &vbo_cl, 0, NULL, &release_event);

    // turn the CL release event into a GL sync object (needs ARB_cl_event)
    // and wait on it in the GL command stream instead of calling clFinish()
    GLsync cl_sync = glCreateSyncFromCLeventARB(context, release_event, 0);
    glWaitSync(cl_sync, 0, GL_TIMEOUT_IGNORED);

Sadly, my setup has only one of the extensions required to try this: GL_ARB_sync is available, but cl_khr_gl_event is not.

You are missing more than one extension: you need the cl_khr_gl_event extension on the OpenCL side and the ARB_cl_event extension on the OpenGL side (see my post at the top).
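A quick way to check the GL side in a core (3.0+) context is something like this (just a sketch; needs <stdio.h> and <string.h>):

    // Walk the extension list of the current GL context looking for GL_ARB_cl_event.
    GLint n = 0, found = 0;
    glGetIntegerv(GL_NUM_EXTENSIONS, &n);
    for (GLint i = 0; i < n; ++i)
        if (strcmp((const char*)glGetStringi(GL_EXTENSIONS, i), "GL_ARB_cl_event") == 0)
            found = 1;
    printf("GL_ARB_cl_event is %s\n", found ? "available" : "not available");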

Finally, OpenGL 4.3 is out, and it has Compute Shaders.
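For anyone who hasn’t looked at it yet, the basic usage is very compact. A minimal sketch that doubles every float in a shader storage buffer could look like this (compileComputeProgram is a hypothetical helper for the usual compile/link boilerplate; ssbo and numElements are assumed to be created elsewhere):

    // GLSL 4.30 compute shader: one invocation per array element.
    const char* csSrc =
        "#version 430\n"
        "layout(local_size_x = 64) in;\n"
        "layout(std430, binding = 0) buffer Data { float v[]; };\n"
        "void main() { v[gl_GlobalInvocationID.x] *= 2.0; }\n";

    GLuint prog = compileComputeProgram(csSrc);              // hypothetical helper

    glUseProgram(prog);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo);
    glDispatchCompute(numElements / 64, 1, 1);               // assumes numElements % 64 == 0
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);          // before anything reads the results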

Now the only thing missing is the rasterizer stage exposed in OpenCL :wink:

I don’t want to disappoint you guys, but Compute Shaders are far from both CUDA and OpenCL.
In case you’ve missed the overview section of the spec, please read the following:

Another difference is that OpenCL is more full featured and includes features such as multiple devices, asynchronous queues and strict IEEE semantics for floating point operations. This extension follows the semantics of OpenGL - implicitly synchronous, in-order operation with single-device, single queue logical architecture and somewhat more relaxed numerical precision requirements.

That last remark is reason enough for me to stop delving further into the CS spec, since GLSL precision is too “relaxed” for serious computation.

From what I’ve read of Nvidia’s compute and graphics contexts, this is the price you pay for avoiding the context switch between compute and graphics. A compute shader seems to be adequate for effects like depth of field, which the extension appears to be targeting. In these cases, it’s quick and convenient, and offers good performance.

For shaders where precision is an issue, you’re pretty much stuck with the context switch – at least until hardware reaches the point where compute and graphics can be run simultaneously on different processors on the same GPU (if ever). Then it might hide some of the latency cost of context switching.

AFAIK Kepler can do that.

May I ask for the reference? The precision depends on the way GLSL is implemented, not on the contexts.

Once again, the precision problem is in the GLSL implementation. It uses hardware-accelerated functions whose precision is much worse than that of their CPU counterparts, but the speed is tremendous. For example, trigonometric functions execute in a single clock; in fact, you can get both sine and cosine in a single clock. By the way, NVIDIA has much better precision than AMD, but this is not the place or time for a discussion about the implementation.

Do you mean using Hyper-Q? I’m not quite sure whether or not it can be used for mixing graphics and computation. Do you have any reference that claims such a possibility?

Once again, the precision problem is in the GLSL implementation.

The point he’s making is that the OpenCL specification requires a certain level of precision that the GLSL specification does not. Therefore, you cannot rely on getting OpenCL-level precision from GLSL code.

It is a matter of specification, not merely implementation, because the specification defines what the implementation is allowed to do. The reason AMD gets away with lower-precision sin/cos is precisely that the specification lets them. Currently, the spec says that the precision of the trig functions is undefined. If it had specific precision limits… well, odds are good that AMD would veto any such proposal, but failing that, they would have to improve the precision of their sin/cos functions (in theory, of course; in practice there’s no conformance test, so there’s no way to know for certain whether they’re meeting the precision guarantees).

A loose specification leads to a lot of variation and lower precision. A tight specification does not. OpenCL is tighter than GLSL with regard to precision.

[QUOTE=Aleksandar;1241384]Do you mean using Hyper-Q? I’m not quite sure whether or not it can be used for mixing graphics and computation. Do you have any reference that claims such a possibility?[/QUOTE]

Yes, as I understand it, you can run multiple contexts from different applications concurrently. I would expect that mixing graphics and GPGPU is possible as well, but I have no hard proof of that.
I don’t see why running two graphics contexts from different programs should be doable, but a graphics context and a GPGPU context from the same program shouldn’t be.

May I ask for the reference? The precision depends on the way GLSL is implemented, not on the contexts.

Certainly:

http://techreport.com/articles.x/17670/2
“Better scheduling, faster switching”, near the end of that section

http://www.nvidia.com/content/PDF/fermi_white_papers/T.Halfhill_Looking_Beyond_Graphics.pdf (pdf warning)
Figure 5 caption (page 9)
New Concurrency for Global Kernels (pages 14-15)

These are for Fermi; I’m not sure what Kepler brings to the table. Fermi must flush its caches when switching between compute and graphics tasks, and it cannot run them at the same time. It’s stated that it can do a switch in as little as 25 microseconds, but that doesn’t take into account the fact that you’ll get tons of cache misses right after the switch, and probably some idling processors near the end of a compute/graphics task, right before the switch.

And yes, the precision issue itself is a matter of OpenGL vs. OpenCL, so it’s possible we could see a GL_ARB_shader_high_precision extension at some point. Then you wouldn’t have to trade performance (in the form of a context-switch penalty) for precision like you do now, and the compute shader could do more heavy lifting without needing to jump to OpenCL/CUDA as often. Alternatively, future hardware could make this penalty so small that it becomes a non-issue, in which case a compute shader simply becomes a programming convenience.

My point is that there are applications where precision isn’t an issue and that would benefit from avoiding the context switch, and this is what compute shaders currently appear to be designed for.

Thanks for the articles, malexander! :cool:

I didn’t say CS is not useful, and it certainly could be faster because it executes in the same context as the graphics API. I just said that CS cannot be a substitute for CUDA/CL in general.
I’ve been using VS/TF for computation for a long time. I’ve solved the precision problems by implementing my own functions for critical operations and by passing more parameters to the shaders (a sketch of the general idea is below).
When I heard about CS in GLSL I thought it could save some gymnastics in my code, but… who knows, maybe it’s better to stay as it is… :slight_smile:
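Just to illustrate what I mean by “my own functions”: one classic example of such a hand-rolled routine is compensated (Kahan) summation, sketched here in GLSL. This is only the general idea, not necessarily what my actual code looks like:

    // Compensated (Kahan) summation: carries a running error term so that
    // long accumulations lose less precision than a naive loop.
    // Note: an aggressive compiler may re-associate this away; the "precise"
    // qualifier (ARB_gpu_shader5) can be used to prevent that.
    float kahanSum(in float data[64])
    {
        float sum = 0.0;
        float c   = 0.0;             // compensation for lost low-order bits
        for (int i = 0; i < 64; ++i)
        {
            float y = data[i] - c;
            float t = sum + y;
            c   = (t - sum) - y;     // what was lost when adding y to sum
            sum = t;
        }
        return sum;
    }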