Part of the Khronos Group
OpenGL.org


Thread: Nvidia Dual Copy Engines

  1. #1
    Member Regular Contributor
    Join Date
    Nov 2003
    Location
    Germany
    Posts
    293

    Nvidia Dual Copy Engines

    Very interesting read:
    http://www.nvidia.com/docs/IO/40049/...py_engines.pdf

Good to finally get high-performance asynchronous data streaming to and from device memory.

    But, the thing that annoys me is the note on page 14: "Having two separate threads running on a Quadro graphics card with the consumer NVIDIA® Fermi architecture or running on older generations of graphics cards the data transfers will be serialized resulting in a drop in performance."

Why in hell not enable it for consumer products (if, and I may be wrong here, the hardware feature is present on all high-end Fermi chips)? Texture streaming is extremely important there too. I am working in a scientific visualization context, and while we do have access to Quadro boards, we cannot afford that many cards for every workstation where we develop and demonstrate large volume and image rendering software. The Fermi Quadro boards are currently extremely expensive, so access to them is almost impossible for us.

The data transport to the GPU is almost always the main bottleneck for us, so the decision to cut this feature (next to quad-buffer stereo) is very sad. And I can imagine that D3D, at least for some games, will make use of the extra copy engines... So D3D gets stereo rendering (OK, I know, no QBS, but still) and the other cool GPU features.

Sorry, but I get mad at such decisions.
    -chris

  2. #2
    Member Regular Contributor
    Join Date
    Nov 2003
    Location
    Czech Republic
    Posts
    317

    Re: Nvidia Dual Copy Engines

Thanks for sharing the link. I had been waiting for this paper for some time.

Actually, I am quite disappointed by their approach. I'd like to be able to use one OpenGL context/thread and still use both copy engines. I would rather see a solution that lets us issue memory transfers and graphics commands from the same context and still have them run in parallel. That would help everyone without requiring any code changes.

Creating an OpenGL context just for memory transfers is strange; it looks like a workaround to me, especially when PBOs are designed to be asynchronous. I do understand the issue with one thread: it would break in-order pipeline execution. Maybe something like Direct3D command buffers could make this better.
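For reference, the single-context PBO upload path being discussed here looks roughly like this (a minimal sketch; `tex`, `pixels`, `size`, `w`, `h` are placeholders, error handling omitted). Because glTexSubImage2D sources from the bound PBO rather than client memory, it can return immediately and let the driver schedule the DMA, which is why one would expect overlap even without a second context:

```cpp
// Single-context PBO upload sketch (ARB_pixel_buffer_object).
GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);

// Orphan the buffer each frame so mapping never stalls on the previous upload.
glBufferData(GL_PIXEL_UNPACK_BUFFER, size, nullptr, GL_STREAM_DRAW);
void* dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
memcpy(dst, pixels, size);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

// With a PBO bound, the last argument is an offset into the PBO, not a pointer.
glBindTexture(GL_TEXTURE_2D, tex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h, GL_BGRA, GL_UNSIGNED_BYTE, nullptr);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
```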

  3. #3
    Intern Newbie
    Join Date
    Oct 2007
    Posts
    47

    Re: Nvidia Dual Copy Engines

Also, since this is an Nvidia-specific optimization, wouldn't it be better to use CUDA/OpenGL interop with two streams doing async memcpys from/to pinned memory, so we could use both copy engines with only one OGL context?
I have not thought this through thoroughly, but it should work...
Even better, this would also work on Teslas, which by the way are more economical...
This of course would not work on AMD cards, although the code from the whitepaper, a complex optimization for a simple problem, does work on AMD too...
But wait: OCL supports OGL interop, and I even think the dual DMA copy is usable from the OpenCL world, though I can't be sure since I don't own a Quadro/Tesla to test...
So the best solution seems to be OCL/OGL interop, which compares like this:
better:
* one OGL context.
* also works on the Tesla line.
equal:
* also works on AMD.
worse:
* have to manage an OCL context.

Perhaps there could be some problems due to "hard" synchronization in OCL/OGL interop, but I think the advanced OGL/OCL interop of the not-yet-implemented OpenCL 1.1 and OpenGL 4.1 should fix all possible issues...
    What do you think?
    Can someone at Nvidia speak about my reasoning?
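The two-stream idea above might be sketched as follows (assuming `bytes` is the transfer size; the GL side of the interop, e.g. registering a PBO with cudaGraphicsGLRegisterBuffer, is omitted, and whether both copy engines are actually engaged on a given board is exactly the open question):

```cpp
#include <cuda_runtime.h>

// Two streams with async copies from/to pinned host memory.
// With two copy engines, the upload in s0 could overlap the download in s1.
cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

void *h_up, *h_down;                 // pinned (page-locked) host buffers,
cudaMallocHost(&h_up, bytes);        // required for truly async copies
cudaMallocHost(&h_down, bytes);

void *d_in, *d_out;                  // device buffers
cudaMalloc(&d_in, bytes);
cudaMalloc(&d_out, bytes);

cudaMemcpyAsync(d_in, h_up, bytes, cudaMemcpyHostToDevice, s0);    // engine 1
cudaMemcpyAsync(h_down, d_out, bytes, cudaMemcpyDeviceToHost, s1); // engine 2

cudaStreamSynchronize(s0);
cudaStreamSynchronize(s1);
```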

  4. #4
    Senior Member OpenGL Guru Dark Photon's Avatar
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    3,126

    Re: Nvidia Dual Copy Engines

    Quote Originally Posted by Chris Lux
    But, the thing that annoys me is the note on page 14: "Having two separate threads running on a Quadro graphics card with the consumer NVIDIA® Fermi architecture or running on older generations of graphics cards the data transfers will be serialized resulting in a drop in performance."

    Why in hell not enable it for consumer products (if, and i may be wrong here, the hardware feature is present on all high end Fermi chips)? Texture streaming is extremely important there to. I am working in a scientific visualization context and we do have access to Quadro Boards, but we can not afford these a lot cards for every workstation where we develop and demonstrate large volume and image rendering software. The Fermi Quadro boards currently are extremely expensive, so access to them is almost impossible to us.
    Seconded, and for exactly the same reason.

Beyond this, a high-end consumer Fermi (GTX 480) being out-benched by 2.6x on data transfers by a last-gen card (GTX 285) is embarrassing (Re: slow transfer speed on fermi cards). At least make it as good as the last-gen boards.

  5. #5
    Intern Contributor
    Join Date
    May 2008
    Location
    USA
    Posts
    99

    Re: Nvidia Dual Copy Engines

    Quadro card with consumer Fermi? That's an odd modifier - so there's Quadro cards and *Quadro* cards?

    And we have to guess which it is?

    Bruce

  6. #6
    Senior Member OpenGL Guru Dark Photon's Avatar
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    3,126

    Re: Nvidia Dual Copy Engines

    Quote Originally Posted by Bruce Wheaton
    Quadro card with consumer Fermi? That's an odd modifier - so there's Quadro cards and *Quadro* cards?

    And we have to guess which it is?

    Bruce
By "high-end consumer Fermi" I meant a high-end "consumer GPU" (i.e. GeForce, as opposed to their "professional GPU" line, Quadro) with a chipset based on the "Fermi" chip line.

They do have professional-line (Quadro) Fermi-based GPUs, but I wasn't referring to those.

As for guessing: the NVidia pages do tell you what is "Fermi"-based, but for more detail search the web (reviews, Wikipedia, etc.). The GFxxx chipset codenames are Fermi.

  7. #7
    Member Regular Contributor
    Join Date
    Nov 2003
    Location
    Germany
    Posts
    293

    Re: Nvidia Dual Copy Engines

He was referring to the original Nvidia statement in my initial post, where they differentiate between Fermi Quadros and "consumer Fermi" Quadros.

  8. #8
    Senior Member OpenGL Guru Dark Photon's Avatar
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    3,126

    Re: Nvidia Dual Copy Engines

    Ah, yeah. That is confusing.

  9. #9
    Member Regular Contributor
    Join Date
    Nov 2003
    Location
    Germany
    Posts
    293

    Re: Nvidia Dual Copy Engines

Okay,
I have been digging deeper into the DMA-engine stuff from Nvidia. The following points confuse me:

- The white paper states that, in a single-threaded application, using PBOs to transfer data _to_ the GPU (upload) does not overlap the transfer with rendering, due to an internal context switch. Is this right? I assumed that with PBOs I could overlap not only CPU work with transfers, but also GPU rendering work with transfers.

- The dual copy engines are only available on Quadro: does this mean I have a single copy engine on my GeForce for one-way overlapped transfers?

- According to the white paper, I _need_ to use a separate thread and GL context to use the copy engine, because the copy engines are separate internal entities running GL contexts in parallel?

Maybe someone has already worked with the copy engines on GeForce and Quadro hardware and can give me some insight into these issues (or someone at Nvidia can clarify some points).
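As I understand the whitepaper's scheme, the upload lives in a second GL context that shares textures with the render context and runs in its own thread; a rough sketch under that assumption (context creation/sharing, the ring buffer of textures, and thread-safe handover are all omitted, and `pbo`, `tex`, `pixels` etc. are placeholders):

```cpp
// --- upload thread, its own shared GL context current ---
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, size, pixels, GL_STREAM_DRAW);
glBindTexture(GL_TEXTURE_2D, tex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h, GL_BGRA, GL_UNSIGNED_BYTE, nullptr);

// Fence marks the moment the upload has finished on the GPU (ARB_sync).
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glFlush();  // ensure the commands (and the fence) actually reach the GPU
// ... publish {tex, fence} to the render thread (queue/mutex not shown) ...

// --- render thread, main GL context current ---
glWaitSync(fence, 0, GL_TIMEOUT_IGNORED);  // GPU-side wait, no CPU stall
glDeleteSync(fence);
glBindTexture(GL_TEXTURE_2D, tex);
// ... draw using tex ...
```

On Quadro the driver can supposedly schedule the upload context onto a copy engine so it runs in parallel with the render context; on GeForce the same code should still be correct, just serialized.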

    Regards
    -chris

  10. #10
    Junior Member Newbie
    Join Date
    Feb 2002
    Location
    Bratislava, Slovakia
    Posts
    19

    Re: Nvidia Dual Copy Engines

    Hi guys,

    I have spent some time on this problem too, here are my findings.

GeForce family cards are not able to transfer and draw at the same time in OpenGL!

Here is a picture from NVIDIA Nsight: http://outerra.com/images/sc3_tex_upload.png. The green box is the glTexSubImage2D call, which is made every fifth frame. As you can see, frame 978 is longer, and the main part of the transfer is hidden in the draw call time. If the transfer were parallel, the frame time should stay the same. The texture is not used in any draw call, so there is no implicit synchronization issue.

In CUDA, parallel transfer and kernel execution is possible: http://outerra.com/images/cuda_transfers.png (red is the kernel, green/grey the download). The transfer can be an upload or a download; it doesn't matter.

The OpenGL data upload (textures and buffers) works at full speed on the GeForce family, which means ~5 GB/s on PCIe 2.0 and ~2.5 GB/s on PCIe 1.1. It seems to be the same speed as CUDA, according to bandwidthTest.exe --memory=pinned.

The texture download (glReadPixels) is limited on the GeForce family to an almost unusable ~0.9 GB/s on PCIe 2.0, and ~0.4 GB/s on an older system with PCIe 1.1. This is very sad, especially because it is NOT a hardware limitation: in CUDA I get a download speed of 3 GB/s on PCIe 2.0 and 1.7 GB/s on the old PCIe 1.1. The problem is that the download is not GPU-side async, so it can really slow down application performance.

The OpenGL buffer download, on the other hand, seems to work at full speed on GeForce. The fastest way to download a texture to CPU memory is to call glReadPixels into a buffer allocated in VIDEO memory (usage GL_STATIC_COPY), and then call glCopyBufferSubData into a buffer in CPU pinned memory (usage GL_STREAM_READ).

A few values for a 1280x720 RGBA8 (3.6 MiB) download over PCIe 1.1:
fast path (glReadPixels into vidmem buffer + glCopyBufferSubData): 0.7 + 2.12 = 2.82 ms
direct glReadPixels only: 8.85 ms
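The two-step download described above might look like this (a sketch following that description; buffer names are placeholders, `w`, `h`, `size` assumed defined, and the usage hints are what steer the driver's placement of each buffer):

```cpp
GLuint vidPbo, readPbo;
glGenBuffers(1, &vidPbo);
glGenBuffers(1, &readPbo);

// Step 1: glReadPixels into a buffer the driver keeps in video memory.
glBindBuffer(GL_PIXEL_PACK_BUFFER, vidPbo);
glBufferData(GL_PIXEL_PACK_BUFFER, size, nullptr, GL_STATIC_COPY);
glReadPixels(0, 0, w, h, GL_BGRA, GL_UNSIGNED_BYTE, nullptr);  // offset 0 into PBO

// Step 2: GPU-side copy into a buffer placed in CPU-readable (pinned) memory.
glBindBuffer(GL_COPY_WRITE_BUFFER, readPbo);
glBufferData(GL_COPY_WRITE_BUFFER, size, nullptr, GL_STREAM_READ);
glCopyBufferSubData(GL_PIXEL_PACK_BUFFER, GL_COPY_WRITE_BUFFER, 0, 0, size);

// Map (ideally a frame or two later, to avoid stalling) and read the pixels.
void* data = glMapBuffer(GL_COPY_WRITE_BUFFER, GL_READ_ONLY);
// ... use data ...
glUnmapBuffer(GL_COPY_WRITE_BUFFER);
```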


All of this was tested on an NVIDIA GTX 460 1 GB (driver 285). It would be nice to find someone willing to run the same tests for Direct3D.
