
Thread: Nvidia Dual Copy Engines

  1. #11
    Intern Contributor
    Join Date
    Jul 2010
    Posts
    74

    Re: Nvidia Dual Copy Engines

    Wow, just wow! That is an amazing trick/hack using GL_STATIC_COPY and glCopyBufferSubData.

    It made a huge difference to the pixel readback performance of my program.

    I think someone should put this trick into the VBO wiki.

    I guess that by using a GPU and CPU buffer with glCopyBufferSubData you are emulating the standard CUDA memcpy situation which would explain why it is fast.

    I guess if you have a Quadro or whatever then the driver will do this pinned memcpy for you, but this seems to work nicely on GeForce.

    One more thing to test, l_hrabcak: what is the performance of glReadPixels into a PBO, followed by CUDA GL sharing to read it back to the CPU using cudaMemcpy?

  2. #12
    Member Regular Contributor
    Join Date
    Nov 2003
    Location
    Germany
    Posts
    293

    Re: Nvidia Dual Copy Engines

    The CUDA readback was already tested in this thread [1], reading back the framebuffer content directly using CUDA. It gave very good results. It would be very interesting to see whether a CUDA memcpy on a separate CUDA stream would allow drawing and copying to overlap on GeForces...

    [1] http://www.opengl.org/discussion_boa...855#Post291855

  3. #13
    Junior Member Newbie
    Join Date
    Feb 2002
    Location
    Bratislava, Slovakia
    Posts
    19

    Re: Nvidia Dual Copy Engines

    Quote Originally Posted by Leith Bade
    I guess if you have a Quadro or whatever then the driver will do this pinned memcpy for you, but this seems to work nicely on GeForce.
    Almost every buffer with DYNAMIC or STREAM usage lives in page-locked (pinned) memory. There are limits on how much memory can be pinned; it all depends on current system resources. The slow glReadPixels transfer is not caused by non-pinned (pageable) memory. For example, on PCIe 1.1, transfers to/from pinned memory run at 2.5/1.7 GB/s and to/from pageable memory at 1.3/1.0 GB/s, all of which are faster than the 0.4 GB/s I get with glReadPixels. On Quadro or with CUDA you can use a glReadPixels equivalent at full speed, which is even faster than this copy trick/hack.

    One more interesting thing: GPU-side transfers such as glCopyBufferSubData (both buffers with usage GL_STATIC_COPY) are slower too. On a GeForce GTX 460 the transfer rate is only 10 GB/s, versus 87 GB/s in CUDA. A texture update via glTexSubImage2D from a buffer in GPU memory runs at 6-10 GB/s; the speed depends on the pixel format, with RGBA8 and BGRA8 being the fastest.
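
To put these rates in per-frame terms, here is a small C sketch that converts each quoted throughput into the time needed to move one frame. Only the rates come from the post; the 1920x1080 RGBA8 frame size, the function names, and the convention 1 GB = 10^9 bytes are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Bytes in one tightly packed frame (no row padding). */
size_t frame_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Milliseconds needed to move `bytes` at `gb_per_s`, taking 1 GB = 1e9 bytes. */
double transfer_ms(size_t bytes, double gb_per_s)
{
    assert(gb_per_s > 0.0);
    return (double)bytes / (gb_per_s * 1e9) * 1e3;
}

/* Prints the per-frame cost for every rate quoted in the post. */
void print_transfer_table(void)
{
    /* Assumed frame: 1920x1080, 4 bytes per pixel (RGBA8). */
    const size_t bytes = frame_bytes(1920, 1080, 4);
    const struct { const char *label; double gb_per_s; } rates[] = {
        { "glReadPixels (GeForce)",         0.4 },
        { "pageable memory, 1.0 GB/s",      1.0 },
        { "pageable memory, 1.3 GB/s",      1.3 },
        { "pinned memory, 1.7 GB/s",        1.7 },
        { "pinned memory, 2.5 GB/s",        2.5 },
        { "glCopyBufferSubData GPU->GPU",  10.0 },
        { "CUDA GPU->GPU",                 87.0 },
    };

    for (size_t i = 0; i < sizeof rates / sizeof rates[0]; ++i)
        printf("%-30s %7.3f ms\n", rates[i].label,
               transfer_ms(bytes, rates[i].gb_per_s));
}
```

Calling print_transfer_table() from main prints one line per rate. For the assumed frame size, 0.4 GB/s works out to roughly 20.7 ms per frame, while 1.7 GB/s is roughly 4.9 ms.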

    I recommend reading the CUDA documentation; many things in it apply to PC and GPU architecture in general and are common to OpenGL and CUDA, like this memory stuff.
    http://developer.download.nvidia.com...ming_Guide.pdf
    http://developer.download.nvidia.com...ices_Guide.pdf

    Quote Originally Posted by Leith Bade
    One more thing to test, l_hrabcak: what is the performance of glReadPixels into a PBO, followed by CUDA GL sharing to read it back to the CPU using cudaMemcpy?
    I haven't tested OpenGL and CUDA cooperation yet, but in this case the performance should be the same, because an OpenGL buffer copy between CPU and GPU memory is as fast as in CUDA. A direct copy from a framebuffer object to CPU memory with CUDA should be faster, though.

    A more interesting test would be to use CUDA for asynchronous GPU transfers to speed up OpenGL applications, uploading textures and geometry in parallel with scene rendering. This should really help engines that need to stream data to and from the GPU during gameplay, because in the current GeForce OpenGL implementation the GPU is wasting power and doing nothing while glTexImage2D or glTexSubImage2D runs.

  4. #14
    Junior Member Newbie
    Join Date
    Feb 2002
    Location
    Bratislava, Slovakia
    Posts
    19

    Re: Nvidia Dual Copy Engines

    Quote Originally Posted by Chris Lux
    It would be very interesting to see whether a CUDA memcpy on a separate CUDA stream would allow drawing and copying to overlap on GeForces...
    Any volunteers for this test?

  5. #15
    Intern Contributor
    Join Date
    Jul 2010
    Posts
    74

    Re: Nvidia Dual Copy Engines

    I did some CUDA tests:

    Reading from a PBO in GPU memory via CUDA to CPU is the same speed as the glCopyBufferSubData trick.

    Reading directly from the renderbuffer via a CUDA array to the CPU was significantly faster than the glCopyBufferSubData trick.

    I now have a 3x speedup over the older GL 3-only PBO code :-). Did anyone find out how RAGE uses glReadPixels (using gDEBugger or Parallel Nsight)? I would not be surprised if they use CUDA too, since the GPU transcoding stuff was written by NVIDIA engineers for RAGE using CUDA (rather than OpenCL, which pissed off ATI gamers).

    l_hrabcak & Chris Lux:
    I was thinking yesterday of experimenting with asynchronous transfers but am still trying to figure out how to do it.

    The problem is that with CUDA you have to use cuGraphicsMapResources, whose documentation states: "This function provides the synchronization guarantee that any graphics calls issued before cuGraphicsMapResources() will complete before any subsequent CUDA work issued in stream begins."

    This indicates that this function will stall the GL context. It also says using GL commands between map and unmap will produce undefined results... which means you can't async overlap GL and CUDA officially.

    Now what I want to know is what happens when you create a second GL context on another thread for CUDA transfers and keep the rendering commands on the first thread. Will cuGraphicsMapResources stall both contexts?

  6. #16
    Junior Member Newbie
    Join Date
    Feb 2002
    Location
    Bratislava, Slovakia
    Posts
    19

    Re: Nvidia Dual Copy Engines

    Quote Originally Posted by Leith Bade
    Reading directly from the renderbuffer via a CUDA array to the CPU was significantly faster than the glCopyBufferSubData trick.
    It is really good to know that it works, thanks for testing.

    Quote Originally Posted by Leith Bade
    I now have a 3x speedup over the older GL 3-only PBO code :-). Did anyone find out how RAGE uses glReadPixels (using gDEBugger or Parallel Nsight)? I would not be surprised if they use CUDA too, since the GPU transcoding stuff was written by NVIDIA engineers for RAGE using CUDA (rather than OpenCL, which pissed off ATI gamers).
    It is very hard to report a bug to NVIDIA and actually get an answer, so we cannot expect help here; we are not id Software. It is sad, because AMD, which has a lot of problems with OpenGL, is much more cooperative, and thanks to this cooperation with developers, even small ones, they finally managed to release usable OpenGL drivers. In NVIDIA's case it is more about marketing strategy, and the only way to change that is to help AMD create better drivers.

    Quote Originally Posted by Leith Bade
    Now what I want to know is what happens when you create a second GL context on another thread for CUDA transfers and keep the rendering commands on the first thread. Will cuGraphicsMapResources stall both contexts?
    I ran a few tests on a modified CUDA simpleGL example (more points and a lower frame rate, around 150 FPS, to keep the GPU busy all the time), and it looks like the sync is not the issue here; it doesn't cause a real sync. According to NVIDIA Nsight, I have a lag of two frames, which should not be possible with a sync. http://outerra.com/images/simpleGL_mod_nsight.png

    But there is an overhead due to context switching (0.4 ms on my Intel Q6600; it should be less than half that on an i5), so it should be better to "offload" this into another thread. The second-thread idea is probably the only way to work efficiently with CUDA and GL, but we need to test a few different approaches to be sure.

  7. #17
    Junior Member Newbie
    Join Date
    Feb 2002
    Location
    Bratislava, Slovakia
    Posts
    19

    Re: Nvidia Dual Copy Engines

    Quote Originally Posted by Leith Bade
    I now have a 3x speedup over the older GL 3-only PBO code :-). Did anyone find out how RAGE uses glReadPixels (using gDEBugger or Parallel Nsight)?
    Btw if you are referring to the "page resolver" here, you can resolve page IDs on GPU side and read IDs only. In this case the download should be a few kilobytes.
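
To make the size argument concrete, here is a CPU-side C sketch of the kind of compaction meant by "resolving" page IDs. It is an illustration only: the post does not describe the actual scheme, a real implementation would run an equivalent pass on the GPU before the download, and the name resolve_page_ids, the MAX_PAGES limit and the 16-bit ID width are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PAGES 65536u  /* assumed: every page ID fits in 16 bits */

/* Collapses a per-pixel feedback buffer into a list of unique page IDs,
   in order of first appearance. Writes at most `out_cap` IDs to `out`
   (further unique IDs are dropped) and returns the number written. */
size_t resolve_page_ids(const uint16_t *feedback, size_t pixel_count,
                        uint16_t *out, size_t out_cap)
{
    uint8_t seen[MAX_PAGES / 8] = { 0 };  /* one bit per possible page ID */
    size_t n = 0;

    assert(feedback != NULL && out != NULL);
    for (size_t i = 0; i < pixel_count; ++i) {
        const uint16_t id = feedback[i];
        const uint8_t mask = (uint8_t)(1u << (id & 7u));

        if (seen[id >> 3] & mask)
            continue;                     /* this ID was already reported */
        seen[id >> 3] |= mask;
        if (n < out_cap)
            out[n++] = id;
    }
    return n;
}
```

As a purely hypothetical sizing: a 1280x720 feedback buffer of 16-bit IDs is 1,843,200 bytes if read back raw, whereas a list of 1,000 unique IDs is 2,000 bytes, which is the "few kilobytes" range mentioned above.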

  8. #18
    Super Moderator OpenGL Lord
    Join Date
    Dec 2003
    Location
    Grenoble - France
    Posts
    5,580

    Re: Nvidia Dual Copy Engines

    Quote Originally Posted by l_hrabcak
    Quote Originally Posted by Leith Bade
    Did anyone find out how RAGE uses glReadPixels (using gDEBugger or Parallel Nsight)?
    Btw if you are referring to the "page resolver" here, you can resolve page IDs on GPU side and read IDs only. In this case the download should be a few kilobytes.
    Indeed, I believe the CUDA part is not for readback but for directly uploading (custom) compressed texture data from the CPU and decompressing it with CUDA into texture data suitable for the GPU (S3TC, etc.).

  9. #19
    Intern Contributor
    Join Date
    May 2008
    Location
    USA
    Posts
    99

    Re: Nvidia Dual Copy Engines

    Quote Originally Posted by l_hrabcak
    The OpenGL buffer download seems to be working at full speed on GeForce. The fastest way to download a texture to CPU memory is to call glReadPixels into a buffer allocated in video memory (usage GL_STATIC_COPY) and then call glCopyBufferSubData into a buffer in pinned CPU memory (usage GL_STREAM_READ).
    Sorry, trying to work this out. I'm using the 'traditional' PBO method now - I render to an FBO and glReadPixels to a PBO set up as GL_READ_ONLY. It sounds like you are saying that a two step process is better? Do you have a snippet showing the order of calls please? Are you using two PBOs? One 'on the GPU' and one with CPU access?

    Bruce

  10. #20
    Intern Contributor
    Join Date
    Jul 2010
    Posts
    74

    Re: Nvidia Dual Copy Engines

    Yes two PBOs:
    Code :
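            // GPU-side pack buffer: per this thread, GL_STATIC_COPY keeps it in video memory.
            glBindBuffer(GL_PIXEL_PACK_BUFFER, mOutputBuffer);
            glBufferData(GL_PIXEL_PACK_BUFFER, mOutputBufferSize, NULL, GL_STATIC_COPY);

            // CPU-side buffer: per this thread, GL_STREAM_READ places it in pinned system memory.
            glBindBuffer(GL_COPY_WRITE_BUFFER, mCopyBuffer);
            glBufferData(GL_COPY_WRITE_BUFFER, mOutputBufferSize, NULL, GL_STREAM_READ);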

    I have this code to do the copy:
    Code :
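        // Step 1: read the framebuffer into the GPU-side PBO.
        // With a pack buffer bound, the last argument is a byte offset into it.
        glBindFramebuffer(GL_FRAMEBUFFER, mOutputFramebuffer);
        glBindBuffer(GL_PIXEL_PACK_BUFFER, mOutputBuffer);
        glReadPixels(0, 0, mWidth, mHeight, GL_RGBA, GL_UNSIGNED_INT_8_8_8_8_REV, 0);

        // Step 2: copy from the GPU-side buffer to the CPU-side buffer,
        // then fetch the result into client memory (mOutput).
        glBindBuffer(GL_COPY_WRITE_BUFFER, mCopyBuffer);
        glCopyBufferSubData(GL_PIXEL_PACK_BUFFER, GL_COPY_WRITE_BUFFER, 0, 0, mOutputBufferSize);
        glGetBufferSubData(GL_COPY_WRITE_BUFFER, 0, mOutputBufferSize, mOutput);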
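
The snippets above never show how mOutputBufferSize is computed. A minimal C sketch of one way to size it follows, assuming GL_PACK_ROW_LENGTH, GL_PACK_SKIP_ROWS and GL_PACK_SKIP_PIXELS are left at 0; the helper name readback_buffer_size is hypothetical and not part of the posted code. GL_RGBA with GL_UNSIGNED_INT_8_8_8_8_REV is 4 bytes per pixel, so with the default GL_PACK_ALIGNMENT of 4 this reduces to width * height * 4.

```c
#include <assert.h>
#include <stddef.h>

/* Size in bytes of a buffer large enough for glReadPixels on a
   width x height rectangle. Each row is padded up to a multiple of
   GL_PACK_ALIGNMENT, which must be 1, 2, 4 or 8. */
size_t readback_buffer_size(size_t width, size_t height,
                            size_t bytes_per_pixel, size_t pack_alignment)
{
    assert(pack_alignment == 1 || pack_alignment == 2 ||
           pack_alignment == 4 || pack_alignment == 8);

    const size_t row_bytes = width * bytes_per_pixel;
    /* Round the row size up to the next multiple of the alignment. */
    const size_t stride =
        (row_bytes + pack_alignment - 1) / pack_alignment * pack_alignment;

    return stride * height;
}
```

For 4-byte pixels the padding never kicks in at alignments up to 4, but it matters for formats such as 3-byte RGB, where a 3-pixel row of 9 bytes is padded to 12 at the default alignment.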
