
View Full Version : Regarding CUDA+PBO slow down



mobeen
11-28-2011, 06:29 AM
Hi all,
I have an application that I built using FBOs and GLSL, and it runs at ~110 fps on average. Now I am trying to port it to CUDA. I am using PBOs for fast transfer (I ping-pong between 2 PBOs) and then render the resulting image using glDrawPixels; however, it performs poorly (~25 fps on average). I know that this slowdown is due to the interop between OpenGL and CUDA. I want to ask other, more experienced users: what is the best method to render the output generated from CUDA to the framebuffer?
From my research these are the choices:
1) PBO + glDrawPixels
2) PBO + glTexSubImage2D (render to a texture and then put this texture on a screen-aligned quad)
3) Anything else you can suggest?

I have only tried 1). Should I try 2), or does anyone have a 3rd or 4th option?
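For clarity, here is roughly the interop path I mean by option 2); a minimal sketch assuming a uchar4 RGBA output, an already-created GL texture and PBO, and a hypothetical kernel fillImage (error checking omitted):

```cuda
// Sketch of option 2: CUDA writes into a registered PBO, then
// glTexSubImage2D copies PBO -> texture entirely on the GPU.
// Assumes: pbo, tex, width, height already exist, and a
// hypothetical kernel fillImage() producing uchar4 RGBA pixels.
#include <cuda_gl_interop.h>

cudaGraphicsResource* pboRes = 0;

// once, at init time:
cudaGraphicsGLRegisterBuffer(&pboRes, pbo, cudaGraphicsMapFlagsWriteDiscard);

// every frame:
uchar4* dptr = 0;
size_t  numBytes = 0;
cudaGraphicsMapResources(1, &pboRes, 0);
cudaGraphicsResourceGetMappedPointer((void**)&dptr, &numBytes, pboRes);
fillImage<<<grid, block>>>(dptr, width, height);   // hypothetical kernel
cudaGraphicsUnmapResources(1, &pboRes, 0);         // must unmap before GL touches it

glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBindTexture(GL_TEXTURE_2D, tex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                GL_RGBA, GL_UNSIGNED_BYTE, 0);     // offset 0 into the bound PBO
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
// ...then draw a screen-aligned textured quad with tex
```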

Adding another insight into this: I did a time comparison between the two methods for just the calculation part, not taking the transfers into account, and the timings are:
GLSL: ~0.001msec (using hires timer).
CUDA kernel: ~25 msecs (using cudaEvent API).
So does it mean that my CUDA kernel needs to be optimized further?
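For reference, the cudaEvent timing pattern I used looks like this (the kernel name and launch configuration are placeholders):

```cuda
// Timing a kernel with the cudaEvent API; elapsed time is in milliseconds.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<<grid, block>>>(/* placeholder arguments */);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);      // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```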

Aleksandar
11-28-2011, 10:05 AM
I have an application that i did using FBOs and GLSL and it runs at ~110 fps on avg. Now I am trying to port it to CUDA.
What is the reason to port the app to CUDA if it already works fast with GLSL?


GLSL: ~0.001msec (using hires timer).
This is probably not a GPU time but a CPU time (since you mentioned a hires timer). There is only one way to measure GPU time in OpenGL: timer_query.
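A minimal sketch of timer_query usage (assuming ARB_timer_query is available; real code should also check the extension and reuse the query object):

```c
/* Sketch of GPU timing with timer_query (ARB_timer_query).
   The result is reported in nanoseconds. */
GLuint query;
GLuint64 elapsedNs = 0;

glGenQueries(1, &query);

glBeginQuery(GL_TIME_ELAPSED, query);
/* ... issue the draw calls / shader passes to be measured ... */
glEndQuery(GL_TIME_ELAPSED);

/* This blocks until the result is available; to avoid stalling
   the pipeline, poll GL_QUERY_RESULT_AVAILABLE first. */
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsedNs);
```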

mobeen
11-28-2011, 09:48 PM
What is the reason to port app to CUDA if it already works fast with GLSL?
Good question, Aleksandar. This is for comparison's sake only, since nowadays people in academia ask: how does it compare against CUDA, and why do you want to do it in GLSL when CUDA is a better option?



This is probably not a GPU time, but a CPU time (since you have mentioned hires timer). There is only one way to measure GPU time in OpenGL - timer_query.
Thanks for this, Aleksandar. With a timer query, the reported time for GLSL is ~8.5 msecs. Even then, GLSL is ~3 times faster.

mobeen
11-28-2011, 10:23 PM
My CUDA code is an exact copy of the GLSL code, yet there is a considerable difference in performance between them. I know that I might get some more juice from my hardware using CUDA (shared memory and other tricks), but am I correct to say that GLSL is doing a lot of background optimizations which are transparent to us? These only show up when you compare against some other API like CUDA. Am I correct to say this?

Chris Lux
11-29-2011, 12:01 AM
CUDA uses more precise floating-point operations, or rather, GLSL uses more relaxed math. Using OpenCL I saw the same, but you can tell the OpenCL compiler that it may use faster (less precise) math, which gave me a good boost (but not the same performance as GLSL). Take a look at whether you can do the same for CUDA...

mobeen
11-29-2011, 01:00 AM
Take a look at whether you can do the same for CUDA...
WOW. Thanks, Chris Lux. CUDA's nvcc has two flags for this (-use_fast_math, and /Ox for full host-compiler optimization). I just passed those in and voilà (thanks, ZbuffeR), my performance is much better now:
CUDA: ~7.25 msecs
GLSL: ~8.5 msecs

Some more questions:
Is there a way to control the precision of floating point in GLSL? I have already specified
precision highp float; in all my shaders.
Do you think adding shared memory will further improve performance for CUDA?
Should I go ahead and replace glDrawPixels with a shader to render the output image?
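For the last question, the shader-based display path would just be a screen-aligned quad with a trivial pair of shaders; something like the following sketch (assuming GLSL 3.30; the attribute, varying, and uniform names here are mine):

```glsl
// --- vertex shader: pass through a fullscreen quad ---
#version 330 core
layout(location = 0) in vec2 pos;   // quad corners in NDC, e.g. (-1,-1)..(1,1)
out vec2 uv;
void main() {
    uv = pos * 0.5 + 0.5;           // map NDC to [0,1] texture coordinates
    gl_Position = vec4(pos, 0.0, 1.0);
}

// --- fragment shader: sample the texture updated from the PBO ---
#version 330 core
uniform sampler2D image;
in vec2 uv;
out vec4 fragColor;
void main() {
    fragColor = texture(image, uv);
}
```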

ZbuffeR
11-29-2011, 01:59 AM
Interesting to see that CUDA and GLSL can have similar performance when tweaking the precision.

VIOLA
Totally OT, but I wanted to say: PLEASE, never write it this way! I know it is a frequent mistake made by English-speaking people, and it sounds very bad; it means "was raping" in French...

The correct spelling is "voila" (pronounced somewhat like "wala"). The even more correct spelling is "voilà", but the accent is really not that important.
Thanks a lot :)

mobeen
11-29-2011, 02:53 AM
Hi ZbuffeR,
Thanks for the "voilà" hint; I have corrected the original post. Just updating the stats here.
Now my CUDA code uses PBO + glTexSubImage2D to update the texture and then renders that texture rather than using the glDrawPixels call. The new performance results are:
(All of these stats are generated on my NVIDIA Quadro FX 5800 GPU for an output resolution of 1024x1024 with continuously updated rendering.)

GLSL (using FBO and multiple textures): ~108-109 FPS (MaxFPS: ~1175)
CUDA (using glTexSubImage2D to display): ~89-90 FPS (MaxFPS: ~1597)
CUDA (using glDrawPixels to display): ~58-59 FPS (MaxFPS: ~248)

Since the algorithm has two parts, update and render, I break up the timing; here are the results.

GLSL: Update(6.920 msecs/frame), Render (0.315 msecs/frame)
CUDA: Update(7.814 msecs/frame), Render (0.316 msecs/frame)

Now the next step is to optimize the CUDA code further to reach the performance of GLSL.

malexander
11-29-2011, 10:25 AM
Nvidia mentioned in their plug for their Maximus technology that there is a context-switch delay when going from compute to graphics mode. Perhaps that's factoring into the timing?

From an article on AnandTech (http://www.anandtech.com/show/5094/nvidias-maximus-technology-quadro-tesla-launching-today):

This actually creates some complexities for both NVIDIA and users. On a technical level, Fermi's context switching is relatively fast for a GPU, but on an absolute level it's still slow. CPUs can context switch in a fraction of the time, giving the impression of a concurrent thread execution even when we know that's not the case. Furthermore for some reason context switching between rendering and compute on Fermi is particularly expensive, which means the number of context switches needs to be minimized in order to keep from wasting too much time just on context switching.

mobeen
11-29-2011, 11:47 PM
Hi malexander,
I just read that article, but Maximus is for a dual-GPU setup: one Tesla and one Quadro. In my case I have a single card.

malexander
11-30-2011, 06:15 AM
What they were saying was that a single Quadro graphics card experienced choppiness because of context switches between compute and graphics, whereas the Tesla/Quadro solution did not because it didn't need to context switch as each card was dedicated to a context. So my thought was that your single card may be experiencing the same context switch delay. I suppose if you can optimize the CUDA solution so that it's much faster than the GLSL one, the added context switch time would be acceptable.

mobeen
11-30-2011, 07:53 AM
Oh, that makes a lot of sense, and indeed that is the direction I am working in now. Thanks for the link and this insight.