Re: Nvidia Dual Copy Engines
Thanks for sharing the link. I had been waiting for this paper for some time.
Actually, I am quite disappointed by their approach. I'd like to be able to use one OpenGL context/thread and still use both copy engines. I would rather see a solution that lets us issue memory transfers and graphics commands from the same context and still have them run in parallel. Such a solution would help everyone without requiring any code changes.
Creating an OpenGL context just for memory transfers is strange; it looks like a workaround to me, especially since PBOs are designed to be asynchronous. I understand the issue with a single thread: it would break in-order pipeline execution. Maybe something like Direct3D command buffers could improve this.
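For comparison, this is the single-context PBO path I have in mind (a minimal sketch; tex, pixels, w, h and size are placeholders):
Code:
    // Minimal single-context PBO upload sketch. glTexSubImage2D sources
    // from the bound PBO, so the call returns without waiting for the
    // pixel data; the question is whether the actual DMA then overlaps
    // rendering.
    GLuint pbo;
    glGenBuffers(1, &pbo);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    glBufferData(GL_PIXEL_UNPACK_BUFFER, size, NULL, GL_STREAM_DRAW);

    void* ptr = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
    memcpy(ptr, pixels, size);                 // fill with the new texels
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                    GL_BGRA, GL_UNSIGNED_BYTE, 0); // offset 0 into the PBO
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);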
Re: Nvidia Dual Copy Engines
Also, since this is an Nvidia-specific optimization, wouldn't it be better to use CUDA/OpenGL interop with two streams issuing async memcpys from/to pinned memory? That way we could use both copy engines with only one OpenGL context.
I haven't thought it through thoroughly, but it should work.
Even better, this would also work on Teslas, which, by the way, are more economical.
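Something along these lines (a rough, untested sketch; res_up and res_down are assumed to come from cudaGraphicsGLRegisterBuffer() on two GL buffers, and size is a placeholder):
Code:
    // Two CUDA streams, one async upload and one async download between
    // pinned host memory and GL buffers registered with CUDA.
    cudaStream_t up, down;
    cudaStreamCreate(&up);
    cudaStreamCreate(&down);

    void *h_src, *h_dst;                    // pinned host staging memory
    cudaMallocHost(&h_src, size);
    cudaMallocHost(&h_dst, size);

    cudaGraphicsMapResources(1, &res_up, up);
    cudaGraphicsMapResources(1, &res_down, down);
    void *d_up, *d_down; size_t n;
    cudaGraphicsResourceGetMappedPointer(&d_up, &n, res_up);
    cudaGraphicsResourceGetMappedPointer(&d_down, &n, res_down);

    // with dual copy engines (one per direction) these two transfers
    // could run concurrently
    cudaMemcpyAsync(d_up, h_src, size, cudaMemcpyHostToDevice, up);
    cudaMemcpyAsync(h_dst, d_down, size, cudaMemcpyDeviceToHost, down);

    cudaGraphicsUnmapResources(1, &res_up, up);
    cudaGraphicsUnmapResources(1, &res_down, down);
    cudaStreamSynchronize(up);
    cudaStreamSynchronize(down);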
This of course would not work on AMD cards, whereas the code from the white paper, although a complex optimization for a simple problem, works on AMD as well.
But wait: OpenCL supports OpenGL interop, and I think the dual DMA copy engines may even be usable from the OpenCL world, though I can't be sure since I don't own a Quadro/Tesla to test on.
So the best solution seems to be OpenCL/OpenGL interop, which should provide two benefits:
better:
* one OpenGL context
* works on the Tesla line as well
equal:
* works on AMD as well
worse:
* have to manage an OpenCL context
Perhaps there could be some problems due to the "hard" synchronization that OpenCL/OpenGL interop requires, but I think the not-yet-implemented OpenCL 1.1 and OpenGL 4.1 advanced interop should fix all possible issues.
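Roughly what I have in mind (a sketch only, untested, error checks omitted; platform, device, gl_pbo, host_ptr and size are placeholders):
Code:
    // Create a CL context that shares the current GL context (WGL shown).
    cl_context_properties props[] = {
        CL_GL_CONTEXT_KHR,   (cl_context_properties)wglGetCurrentContext(),
        CL_WGL_HDC_KHR,      (cl_context_properties)wglGetCurrentDC(),
        CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
        0
    };
    cl_int err;
    cl_context ctx = clCreateContext(props, 1, &device, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

    // Wrap an existing GL buffer object and fill it asynchronously.
    cl_mem buf = clCreateFromGLBuffer(ctx, CL_MEM_WRITE_ONLY, gl_pbo, &err);

    glFinish();                             // the "hard" sync before acquire
    clEnqueueAcquireGLObjects(q, 1, &buf, 0, NULL, NULL);
    clEnqueueWriteBuffer(q, buf, CL_FALSE, 0, size, host_ptr, 0, NULL, NULL);
    clEnqueueReleaseGLObjects(q, 1, &buf, 0, NULL, NULL);
    clFinish(q);                            // the "hard" sync before GL reuse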
What do you think?
Can someone at Nvidia comment on my reasoning?
Re: Nvidia Dual Copy Engines
Quote:
Originally Posted by Chris Lux
But, the thing that annoys me is the note on page 14: "Having two separate threads running on a Quadro graphics card with the consumer NVIDIA® Fermi architecture or running on older generations of graphics cards the data transfers will be serialized resulting in a drop in performance."
Why in hell not enable it for consumer products (if, and I may be wrong here, the hardware feature is present on all high-end Fermi chips)? Texture streaming is extremely important there too. I am working in a scientific visualization context and we do have access to Quadro boards, but we cannot afford a lot of these cards for every workstation where we develop and demonstrate large-volume and image rendering software. The Fermi Quadro boards are currently extremely expensive, so access to them is almost impossible for us.
Seconded, and for exactly the same reason.
Beyond this, a high-end consumer Fermi (GTX 480) being out-benched by 2.6x on data transfers by a last-gen card (GTX 285) is embarrassing (Re: slow transfer speed on Fermi cards). At least make it as good as the last-gen boards.
Re: Nvidia Dual Copy Engines
Quadro card with consumer Fermi? That's an odd modifier - so there are Quadro cards and *Quadro* cards?
And we have to guess which it is?
Bruce
Re: Nvidia Dual Copy Engines
Quote:
Originally Posted by Bruce Wheaton
Quadro card with consumer Fermi? That's an odd modifier - so there are Quadro cards and *Quadro* cards?
And we have to guess which it is?
Bruce
By "high-end consumer Fermi" I meant high-end "consumer GPU" (i.e. GeForce, as opposed to their "professional GPU" line: Quadro) with a chipset based on the "Fermi" chip line.
They do have professional-line (Quadro) Fermi-based GPUs, but I wasn't referring to those.
And as for guessing: the Nvidia pages do tell you which products are Fermi-based; for more detail, search the web (reviews, Wikipedia, etc.). The GFxxx chip codenames are Fermi.
Re: Nvidia Dual Copy Engines
He was referring to the original Nvidia statement in my initial post, where they differentiate between Fermi Quadros and Quadros with the consumer Fermi architecture.
Re: Nvidia Dual Copy Engines
Ah, yeah. That is confusing.
Re: Nvidia Dual Copy Engines
Okay,
I have been digging deeper into Nvidia's DMA-engine story. What confuses me are the following points:
- The white paper states that in a single-threaded application, using PBOs to transfer data _to_ the GPU (upload) does not overlap the transfer with rendering, due to an internal context switch. Is this right? I assumed that with PBOs I could overlap not only CPU work with transfers, but also GPU rendering work with transfers.
- The dual copy engines are only available on Quadro: does this mean I have a single copy engine on my GeForce for one-way overlapped transfers?
- According to the white paper I _need_ to use a separate thread and GL context to use the copy engine, because the copy engines are separate internal entities running GL contexts in parallel?
Maybe someone has already worked with the copy engines on GeForce and Quadro hardware and can give me some insight into these issues (or someone from Nvidia can clarify some points).
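For reference, my reading of the pattern the white paper seems to require is roughly this (a sketch only, WGL-style; wait_for_upload_request and signal_texture_ready are hypothetical app-level helpers):
Code:
    // A second GL context on its own thread, sharing objects with the
    // render context and doing nothing but uploads, so the driver can
    // schedule them on a copy engine.
    HGLRC render_ctx = wglCreateContext(hdc);
    HGLRC upload_ctx = wglCreateContext(hdc);
    wglShareLists(render_ctx, upload_ctx);   // share textures/buffers

    // upload thread:
    wglMakeCurrent(hdc, upload_ctx);
    for (;;) {
        wait_for_upload_request();           // hypothetical app-level queue
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                        GL_BGRA, GL_UNSIGNED_BYTE, 0);
        GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
        signal_texture_ready(fence);         // render thread waits on it
    }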
Regards
-chris
Re: Nvidia Dual Copy Engines
Hi guys,
I have spent some time on this problem too; here are my findings.
GeForce family cards are not able to transfer and draw at the same time in OpenGL!
Here is a picture from NVIDIA Nsight: http://outerra.com/images/sc3_tex_upload.png. The green box is the glTexSubImage2D call, issued every fifth frame. As you can see, frame 978 is longer, and the main part of the transfer is hidden inside the draw-call time. If the transfer were parallel, the frame time would stay the same. The texture is not used in any draw call, so there is no implicit synchronization issue.
In CUDA, parallel transfer and kernel execution is possible: http://outerra.com/images/cuda_transfers.png (red is the kernel, green/grey the download). Whether the transfer is an upload or a download doesn't matter.
OpenGL data upload (textures and buffers) runs at full speed on the GeForce family, which means ~5 GB/s on PCIe 2.0 and ~2.5 GB/s on PCIe 1.1. That seems to be the same speed CUDA achieves, according to bandwidthTest.exe --memory=pinned.
Texture download (glReadPixels) on the GeForce family is limited to an almost unusable ~0.9 GB/s on PCIe 2.0, and ~0.4 GB/s on an older PCIe 1.1 system. This is very sad, especially because it is NOT a hardware limitation: in CUDA I get a download speed of 3 GB/s on PCIe 2.0 and 1.7 GB/s on old PCIe 1.1. The problem is that the download is not GPU-side async, so it can really slow down application performance.
OpenGL buffer download, however, seems to work at full speed on GeForce. The fastest way to download a texture to CPU memory is to call glReadPixels into a buffer allocated in VIDEO memory (usage GL_STATIC_COPY) and then call glCopyBufferSubData into a buffer in CPU pinned memory (usage GL_STREAM_READ); see the sketch after the numbers below.
A few values for a 1280x720 RGBA8 download (~3.5 MiB, PCIe 1.1):
* fast Nvidia download with glReadPixels + copy: 0.7 + 2.12 = 2.82 ms
* direct glReadPixels only: 8.85 ms
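In code, the two-step download looks roughly like this (a sketch; w, h and size are placeholders):
Code:
    // glReadPixels into a buffer living in video memory, then
    // glCopyBufferSubData into a CPU-side (pinned) buffer.
    GLuint vid_pbo, cpu_pbo;
    glGenBuffers(1, &vid_pbo);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, vid_pbo);
    glBufferData(GL_PIXEL_PACK_BUFFER, size, NULL, GL_STATIC_COPY); // video mem
    glGenBuffers(1, &cpu_pbo);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, cpu_pbo);
    glBufferData(GL_PIXEL_PACK_BUFFER, size, NULL, GL_STREAM_READ); // CPU pinned

    // step 1: fast on-card copy of the framebuffer into the video buffer
    glBindBuffer(GL_PIXEL_PACK_BUFFER, vid_pbo);
    glReadPixels(0, 0, w, h, GL_BGRA, GL_UNSIGNED_BYTE, 0);

    // step 2: DMA from video memory into the CPU-side buffer
    glBindBuffer(GL_COPY_READ_BUFFER, vid_pbo);
    glBindBuffer(GL_COPY_WRITE_BUFFER, cpu_pbo);
    glCopyBufferSubData(GL_COPY_READ_BUFFER, GL_COPY_WRITE_BUFFER, 0, 0, size);

    // map the CPU-side buffer to read the pixels
    glBindBuffer(GL_PIXEL_PACK_BUFFER, cpu_pbo);
    void* pixels = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);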
Everything was tested on an NVIDIA GTX 460 1GB (driver 285). It would be nice if someone could run the same tests for Direct3D.