OpenGL.org discussion boards (part of the Khronos Group)

Thread: slow transfer speed on fermi cards, part II

  1. #1
     jnygaard (Newbie, joined Feb 2013)

    slow transfer speed on fermi cards, part II

    The old thread of the same name was closed (maybe because the last activity was almost two years ago?!) but I hope it's ok if I "revive" it.

    I am trying to get mfort's CUDA workaround described in post #47 in the thread ("slow-transfer-speed-on-fermi-cards", the bbs-system won't let me post with URLs... *sigh*) to work, but I am banging my head into a wall at the moment.

    When I try to execute
    Code :
    cErr = cudaMemcpyFromArray( cuda_mem, cArray, 0, 0, 32, cudaMemcpyDeviceToHost );
    I get a SIGFPE (arithmetic exception) before the call even returns an error. (Note that the count of 32 is just for testing; I have successfully allocated much more memory than that.) If I use DeviceToDevice etc., I get the expected cudaErrorInvalidMemcpyDirection, and if I try to copy 0 bytes, it does not fail. All commands up to this point (both CUDA and OpenGL) have completed successfully.
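    For context, the interop sequence leading up to that call should, as far as I understand the CUDA runtime API, look roughly like this (a sketch only; error checks omitted, variable names illustrative):
    Code :
    // Register the GL renderbuffer with CUDA once, after GL setup.
    cudaGraphicsResource* resource = 0;
    cudaGraphicsGLRegisterImage(&resource, renderBufferId, GL_RENDERBUFFER,
                                cudaGraphicsRegisterFlagsReadOnly);
    // Per readback: map the resource and fetch the backing cudaArray.
    cudaGraphicsMapResources(1, &resource, 0);
    cudaArray* cArray = 0;
    cudaGraphicsSubResourceGetMappedArray(&cArray, resource, 0, 0);
    // The device-to-host copy from the array:
    cErr = cudaMemcpyFromArray(cuda_mem, cArray, 0, 0, 32, cudaMemcpyDeviceToHost);
    cudaGraphicsUnmapResources(1, &resource, 0);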

    I have done what mfort outlined, but the instructions do not say how
    the renderBufferId object is set up. I tried with this
    Code :
                glGenRenderbuffers(1, &renderBufferId);
                glBindRenderbuffer(GL_RENDERBUFFER, renderBufferId);
    //            glFramebufferRenderbuffer( GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER,
    //                                                         GL_RENDERBUFFER, renderBufferId );
    and also combinations with the commented-out call enabled, none of which worked. I thought glFramebufferRenderbuffer might be needed to connect the framebuffer (or whatever the object OpenGL renders into is properly called) to the renderbuffer that the CUDA copy will actually read from.
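    For completeness, my guess at a full setup would be something like the following (the glRenderbufferStorage call and the fboId/width/height names are my assumptions; note also that glFramebufferRenderbuffer takes four arguments, i.e. target, attachment, renderbuffertarget, renderbuffer):
    Code :
    GLuint fboId = 0;
    glGenRenderbuffers(1, &renderBufferId);
    glBindRenderbuffer(GL_RENDERBUFFER, renderBufferId);
    // Without this, the renderbuffer has zero size -- could that explain
    // an arithmetic exception in the copy?
    glRenderbufferStorage(GL_RENDERBUFFER, GL_RGBA8, width, height);
    glGenFramebuffers(1, &fboId);
    glBindFramebuffer(GL_FRAMEBUFFER, fboId);
    glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                              GL_RENDERBUFFER, renderBufferId);
    assert(glCheckFramebufferStatus(GL_FRAMEBUFFER) == GL_FRAMEBUFFER_COMPLETE);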

    Anybody got any ideas that I could try out?

    J.

  2. #2
     Dark Photon (Senior Member, OpenGL Guru, joined Oct 2004)
    Quote Originally Posted by jnygaard
    the bbs-system won't let me post with URLs... *sigh*
    Yeah, new accounts are restricted from posting URLs for a while. These boards were just getting tons of spam postings from new users with links to various junk sites, and restricting links for new users cuts virtually all of it out. Just paste the URL as text in the post and maybe mutilate http or something so the boards will take it. Here's the link I think you were talking about: slow transfer speed on fermi cards

    ..but to your problem, you want to read this carefully. Gives you what you want w/o CUDA:

    * Re: Nvidia Dual Copy Engines
    Last edited by Dark Photon; 02-25-2013 at 06:32 PM.

  3. #3
     jnygaard (Newbie)
    Quote Originally Posted by Dark Photon
    ..but to your problem, you want to read this carefully.
    Thanks for the link! But I have already done what l_hrabcak describes in the post.

    Stage 1: I generate two buffers, buf[0] and buf[1], bind buf[0] to GL_PIXEL_PACK_BUFFER, reserve storage with glBufferData and GL_STATIC_COPY (once only), do glReadPixels, then unbind buf[0].

    Then the copy stage: bind buf[0] to GL_COPY_READ_BUFFER, buf[1] to GL_COPY_WRITE_BUFFER (and do glBufferData w/GL_STREAM_READ the first time), and finally glCopyBufferSubData.

    Lastly, I use glMapBuffer(GL_COPY_WRITE_BUFFER, GL_READ_ONLY) to get the data to the host, most time spent ends up in this call, as expected.
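    In code, the three stages look roughly like this (a sketch of what I described; w, h, and bufSize are placeholders):
    Code :
    // Stage 1: read pixels into buf[0] via the pack buffer binding.
    glBindBuffer(GL_PIXEL_PACK_BUFFER, buf[0]);
    glBufferData(GL_PIXEL_PACK_BUFFER, bufSize, NULL, GL_STATIC_COPY); // first time only
    glReadPixels(0, 0, w, h, GL_BGRA, GL_UNSIGNED_BYTE, 0);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

    // Stage 2: GPU-side copy buf[0] -> buf[1].
    glBindBuffer(GL_COPY_READ_BUFFER, buf[0]);
    glBindBuffer(GL_COPY_WRITE_BUFFER, buf[1]);
    glBufferData(GL_COPY_WRITE_BUFFER, bufSize, NULL, GL_STREAM_READ); // first time only
    glCopyBufferSubData(GL_COPY_READ_BUFFER, GL_COPY_WRITE_BUFFER, 0, 0, bufSize);

    // Stage 3: map buf[1] and read on the host; this is where the time goes.
    void* ptr = glMapBuffer(GL_COPY_WRITE_BUFFER, GL_READ_ONLY);
    // ... use ptr ...
    glUnmapBuffer(GL_COPY_WRITE_BUFFER);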

    Doing this, I get approximately 1.2 GB/s, which is comparable to what l_hrabcak gets; 3.6 MiB/2.82 ms = 1.28 GB/s approx. He uses PCIe 1.1, but shouldn't the speed for that be about twice as much, i.e., around 3 GB/s? And I use PCIe 2.0 (nvidia-settings says "x16 Gen2"), so I expected one more doubling, to about 6 GB/s. This is also what 'bandwidthTest --memory=pinned' indicates; it reports 6.35 GB/s for device-to-host transfer.

    Note 1: At this point I am not concerned with async vs. sync'ed transfers etc. I just want to achieve the best transfer speed first.

    Note 2: If I use just one buffer, glReadPixels into GL_PIXEL_PACK_BUFFER, and download with glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY), I actually get slightly better results, 1.5 GB/s. I don't understand this either. In the dual-buffer setup the final glMapBuffer call takes 2.9 ms (1.24 GB/s), and in the single-buffer setup the mapping takes approx. 2.2 ms (1.64 GB/s); all other operations (binding, unbinding, etc.) take negligible time relative to this.

    Having tried this, and a lot of tweaks, I thus wanted to investigate the CUDA path... :-)


    J.

  4. #4
     Dark Photon (Senior Member, OpenGL Guru)
    Quote Originally Posted by jnygaard
    Doing this, I get approximately 1.2 GB/s, which is comparable to what l_hrabcak gets... He uses PCIe 1.1, but should not the speed for that be about twice as much, i.e., around 3 GB/s? And I use PCIe 2.0 (Nvidia-settings says "x16 Gen2") so I expected one more doubling, to about 6 GB/s. This is also what 'bandwidthTest --memory=pinned' indicates, it reports 6.35 GB/s for dev to host transfer.
    When I did this, my readback rate jumped from 0.8 GB/sec to 3.0 GB/sec (a 3.7X speed-up).

    Theoretical max on PCIe2 x16 in one direction is ~8 GB/sec, but in practice you shouldn't expect more than ~6.4 GB/sec. And that is when timing only a saturated transfer, with no overhead; you probably can't expect to reach it with driver overhead and other processing inside the timing loop.
