Transform Feedback vs. GPGPU APIs

Hello,

I want to modify a geometry object without involving the CPU, so I tried to do all of the important work on the GPU with transform feedback. I implemented an example from the OpenGL Bible and it works fine, but now I have a question about this OpenGL feature: why should I use transform feedback instead of CUDA or OpenCL? What are the pros and cons of the two approaches?

It seems to me that the only advantage of transform feedback is that I don’t have to know anything about CUDA or OpenCL.

I hope someone has some additional thoughts on this thread.

Thanks

Martin

Note that CUDA is only available on NVidia hardware.
Transform feedback can be nicer because you can share code between your “transform feedback shaders” and your other “rendering shaders”, and as a graphics programmer you may find it more intuitive to modify geometry with shaders. But you are right that you should be able to do the same things in OpenCL, and a full GPGPU API probably gives you more flexibility. Use whichever seems more intuitive to you.
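
For reference, a minimal transform feedback pass might look roughly like the sketch below. This is only an illustration; `updateProgram`, `feedbackVBO`, `vertexCount`, and the `outPosition` varying are placeholder names, not from any particular example.

```cpp
// Hypothetical sketch: capture the vertex shader's output into a buffer
// instead of rasterizing it, so the geometry is updated entirely on the GPU.

// 1. Before linking, tell GL which shader outputs to capture.
const char* varyings[] = { "outPosition" };          // assumed output name
glTransformFeedbackVaryings(updateProgram, 1, varyings, GL_INTERLEAVED_ATTRIBS);
glLinkProgram(updateProgram);

// 2. Each update pass: bind the destination buffer and run the "compute" draw.
glEnable(GL_RASTERIZER_DISCARD);                     // we only want the captured data
glUseProgram(updateProgram);
glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, feedbackVBO);

glBeginTransformFeedback(GL_POINTS);
glDrawArrays(GL_POINTS, 0, vertexCount);             // vertex shader does the work
glEndTransformFeedback();

glDisable(GL_RASTERIZER_DISCARD);

// 3. feedbackVBO now holds the transformed vertices and can be used as the
//    source VBO for the normal rendering pass (ping-pong between two buffers).
```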

Using transform feedback will very likely be much simpler.

The general advice I’ve seen, heard (and experienced, having written some OpenCL and CUDA code) is that you should write GPU code in your friendly graphics API’s shading language (e.g. GLSL) unless:

1. you can greatly speed up your algorithm using the shared memory on the streaming multiprocessors (NVidia lingo) or SIMD engines (ATI/AMD lingo), and
2. you need that speed-up!
(FWIW, note that this shared memory is called “local memory” in OpenCL lingo, and it is shared between all “work items” [threads] running in a “workgroup”.)
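
To make point 1 concrete, here is a rough sketch (my own illustration, not from any particular codebase) of an OpenCL kernel that co-operates through local/shared memory to build per-work-group partial sums; this is the kind of inter-thread co-operation a plain GLSL vertex or fragment shader cannot express:

```cpp
// Hypothetical OpenCL C kernel source, embedded as a C++ string literal.
// Assumes the local size is a power of two and the global size covers the data.
const char* kBlockSumSrc = R"CLC(
__kernel void block_sum(__global const float* in,
                        __global float*       out,
                        __local  float*       scratch)
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    scratch[lid] = in[gid];                // each work item stages one element
    barrier(CLK_LOCAL_MEM_FENCE);          // wait for the whole work-group

    // Tree reduction entirely in fast on-chip local memory.
    for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        out[get_group_id(0)] = scratch[0]; // one partial sum per work-group
}
)CLC";
```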

Why do I/they say this? With your graphics API’s shading language (if you avoid the new GL 4.2 shader side-effects features), you do not have to know/care how the driver is parallelizing your shader into a bazillion threads and mapping it somehow to the GPU cores. In other words, you do not have to do parallel programming. You just get it – for free – without any tedium (and potential headache) on your part.

With OpenCL or CUDA, you “do” have to manage the parallelism in your kernel (GPU code), ensuring that the threads are spawned on the GPU to make optimal use of that particular GPU, that they don’t stomp on each other’s data, that they are synchronized and co-operate properly, that they make the optimal use of the GPU’s memory and compute resources, and other issues.
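
As a small illustration of that bookkeeping (the sizes, buffer names, and kernel below are invented), even launching a kernel from the host forces you to pick the work split yourself:

```cpp
// Hypothetical OpenCL host-side launch: you, not the driver, decide how the
// work is split into work-groups and how much local memory each one gets.
size_t localSize  = 256;                       // must suit the device
size_t globalSize = 1 << 20;                   // must cover your data set

clSetKernelArg(kernel, 0, sizeof(cl_mem), &inBuf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &outBuf);
clSetKernelArg(kernel, 2, localSize * sizeof(float), NULL);   // local scratch

cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                    &globalSize, &localSize, 0, NULL, NULL);
```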

With stock GLSL (carefully avoiding the new GL 4.2 side-effects features such as image store), you just don’t care. The driver handles all that, simplifying your job.

So I’d suggest coding it up first in GLSL (it shouldn’t take long). If you find you absolutely need more speed, and you think you can get it from threads co-operating through shared memory, consider OpenCL or CUDA (keeping in mind that OpenCL is cross-vendor and even cross-device [GPU and CPU], while CUDA is NVidia-only). …but even then, take a look at the new side-effects and synchronization features added in GL 4.2 first! It might save you a bunch of work.

Thanks for your help, that is good to know. I want to try transform feedback with OpenSceneGraph, but it is a little bit tricky and has given me quite a headache so far. :slight_smile: I thought OpenCL or CUDA might be simpler to integrate into OpenSceneGraph, but it seems it is not that easy either.

There are also other API interoperability issues not mentioned in the previous posts.

The rule of thumb is: if you are satisfied with a GLSL implementation of your calculation (in both speed and accuracy), and you have to use the results in OpenGL, don’t use another API! It won’t be faster or easier to implement.

CUDA and OpenCL give you more flexibility and higher accuracy in calculation. With GLSL, on the other hand, you have little to no control over how shaders execute, but it works pretty well on its own. I ran into trouble two years ago using CUDA for “repacking” VBOs, since OpenGL and CUDA have to be synchronized: one API has to finish all preceding commands before handing control to the other, and that pipeline stall is expensive! The most time-consuming operations are resource registration and buffer mapping/unmapping. I was doomed because I had to register buffers dynamically, and I had a lot of them; registration time is directly proportional to the number and size of the buffers. If you have a small or fixed number of buffers, registration can be done once at application startup. That prolongs startup, but at least the impact is localized.
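
For anyone who hasn’t used the interop API, the pattern I mean looks roughly like this; it is only a sketch using the cudaGraphics* interop calls, the variable names (`vbo`, `res`, `dptr`) are mine, and the actual repacking kernel is omitted:

```cpp
#include <cuda_gl_interop.h>

// One-time (ideally at startup): register the GL buffer with CUDA.
// This is the step whose cost grows with the number and size of buffers.
cudaGraphicsResource* res = NULL;
cudaGraphicsGLRegisterBuffer(&res, vbo, cudaGraphicsRegisterFlagsNone);

// Per use: map the buffer, get a device pointer, run the kernel, unmap.
// Map/unmap is where the OpenGL/CUDA synchronization cost shows up.
float* dptr   = NULL;
size_t nbytes = 0;
cudaGraphicsMapResources(1, &res, 0);
cudaGraphicsResourceGetMappedPointer((void**)&dptr, &nbytes, res);

// ... launch the CUDA kernel that repacks/updates the data in dptr ...

cudaGraphicsUnmapResources(1, &res, 0);   // GL may use the buffer again after this

// At shutdown:
cudaGraphicsUnregisterResource(res);
```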

A more disturbing delay, one that cannot be hidden, was caused by resource mapping/unmapping. While mapping had an impact on both CPU and GPU time, unmapping severely impacted GPU time only. That was two years ago; maybe newer drivers have better OpenGL-CUDA interoperability, so it would be interesting to hear other programmers’ experience.

IIRC, with OpenCL 1.0 specifically that’s right (had to do this over a year ago). glFinish() before flipping to OpenCL, and clFinish() before flipping back to OpenGL. If you only do it once, OK. But if you want to do this a bunch, it can be a killer.

However, IIRC OpenCL 1.1 supports a “sync” type mechanism so you don’t have to do a full pipeline flush to flip back and forth. Haven’t used it but for those that want to look it up, check out ARB_cl_event / cl_khr_gl_event.
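
In case it helps someone searching later, the coarse OpenCL 1.0-style hand-off described above looks roughly like this (a sketch only; the queue, the shared `cl_mem` object `clBuf`, the kernel, and `globalSize` are assumed to be set up already):

```cpp
// Coarse-grained GL <-> CL hand-off (OpenCL 1.0 style): full flushes both ways.
glFinish();                                        // GL must be done with the buffer

clEnqueueAcquireGLObjects(queue, 1, &clBuf, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
clEnqueueReleaseGLObjects(queue, 1, &clBuf, 0, NULL, NULL);

clFinish(queue);                                   // CL must be done before GL touches it

// With cl_khr_gl_event / ARB_cl_event, the glFinish()/clFinish() pair can be
// replaced by sync objects shared between the two APIs, avoiding a full
// pipeline flush on every hand-off.
```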

I haven’t tried OpenCL yet. What I experienced was a delay imposed by internal synchronization between OpenGL and CUDA (or, to be more precise, I think that was the problem). The kernels executed quickly, but buffer unmapping took a long time before OpenGL actually got the data back.