PCIe data transfers (Textures/PBO/anything else)

When I look around in forums I see many results of people benchmarking PBO/TexImage/TexSubImage/CUDA, with transfer rates ranging from 1 GB/s up to 3 GB/s for CUDA… My question is: are you sure that the data is transferred where you think it is?

I don’t know about ATI, but I tested the functionality (not the performance) on an NVIDIA G92 and found some interesting things.

Because there is no real way in OpenGL to know the amount of free video memory (nothing that I am aware of, at least), I created a DirectDraw7 context in my OpenGL application so I could query the free video memory after every single GL call. Windows offers an API to query the free system memory, so with those two values I can follow what is really happening in the driver.
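The bookkeeping looks roughly like this (just a sketch of the idea, not my exact code; it assumes Windows with ddraw.lib/dxguid.lib linked in, and InitMemoryQuery/PrintMemoryState are simply names I use for this post):

#include <windows.h>
#include <ddraw.h>
#include <cstdio>

static LPDIRECTDRAW7 g_dd = NULL;

// Create the DirectDraw7 context once, alongside the GL context.
void InitMemoryQuery()
{
    DirectDrawCreateEx(NULL, (void**)&g_dd, IID_IDirectDraw7, NULL);
    g_dd->SetCooperativeLevel(NULL, DDSCL_NORMAL);
}

// Dump free local video memory (DirectDraw) and free physical RAM (Win32).
void PrintMemoryState(const char* label)
{
    DDSCAPS2 caps = {};
    caps.dwCaps = DDSCAPS_VIDEOMEMORY | DDSCAPS_LOCALVIDMEM;
    DWORD totalVid = 0, freeVid = 0;
    g_dd->GetAvailableVidMem(&caps, &totalVid, &freeVid);

    MEMORYSTATUSEX ms = {};
    ms.dwLength = sizeof(ms);
    GlobalMemoryStatusEx(&ms);

    printf("%s: free VRAM = %lu MB, free RAM = %llu MB\n", label,
           (unsigned long)(freeVid >> 20),
           (unsigned long long)(ms.ullAvailPhys >> 20));
}

Calling PrintMemoryState() before and after each GL call is enough to see which pool actually shrinks.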

Let’s take a simple test:
glFinish()
1- glTexImage2D(NULL)
glFinish()
2- glTexSubImage2D(data)
glFinish()

Reading the spec, we might think that #1 will allocate video memory and #2 will transfer the data, but this is NOT what happens. #1 allocates system memory and #2 copies the data into that system memory. Everything stays on the CPU; not a single byte of video memory has been allocated up to this point. Both the allocation AND the transfer are done during the draw call. I wonder, then, what exactly people are benchmarking!
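To make it concrete, here is roughly the instrumented sequence I am describing (a sketch only: PrintMemoryState is the helper from the snippet above, drawTexturedQuad and the data pointer are placeholders, and the 2048x2048 size is just an example):

GLuint tex;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);

glFinish();
PrintMemoryState("start");

// #1 - you would expect a VRAM allocation here...
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, 2048, 2048, 0,
             GL_BGRA, GL_UNSIGNED_BYTE, NULL);
glFinish();
PrintMemoryState("after TexImage2D");     // ...but only free system RAM drops

// #2 - you would expect the PCIe transfer here...
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 2048, 2048,
                GL_BGRA, GL_UNSIGNED_BYTE, data);
glFinish();
PrintMemoryState("after TexSubImage2D");  // ...but it is a CPU copy into that RAM

drawTexturedQuad(tex);                    // placeholder for any draw using the texture
glFinish();
PrintMemoryState("after draw");           // only here does free VRAM actually drop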

Similar observation with PBOs. With a PBO using GL_STREAM_DRAW, it depends on whether you are “transferring” the data through Map/Unmap or through a BufferData() call. With Map/Unmap, the PBO is kept in system memory and glTexSubImage2D(NULL) simply does nothing. Everything is done at the draw call: the PBO and the texture are both allocated and the data is transferred at that point! The behavior depends heavily on the usage flag (STREAM/STATIC/DYNAMIC, DRAW/READ/…) and other factors.
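For reference, the two upload paths I compared look roughly like this (a sketch, assuming a GL 2.1 context with PBO support and the usual headers; pbo, tex, w, h and data are placeholders):

// Path A: fill the PBO through Map/Unmap.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, w * h * 4, NULL, GL_STREAM_DRAW); // allocate only
void* ptr = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
memcpy(ptr, data, w * h * 4);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

// Path B: hand the data to the driver in a single call instead.
// glBufferData(GL_PIXEL_UNPACK_BUFFER, w * h * 4, data, GL_STREAM_DRAW);

// Either way, this call takes its source from the bound PBO (offset 0).
// With the Map/Unmap path I see no memory counter move here at all.
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                GL_BGRA, GL_UNSIGNED_BYTE, (void*)0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);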

In any case, who can be sure about the transfer rate they measure, since the driver really transfers the surface whenever it wants?

Draw a small quad with this PBO as the texture before timing?
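Something along these lines, so the real upload is forced inside the timed region (only a sketch: getTimeSeconds and drawTinyQuad are placeholders, and the PBO is assumed to be bound as GL_PIXEL_UNPACK_BUFFER with the data already in it):

double t0 = getTimeSeconds();

glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                GL_BGRA, GL_UNSIGNED_BYTE, (void*)0);   // source = bound PBO

drawTinyQuad();                                         // e.g. a 1x1 quad sampling the texture
glFinish();

// Reading one pixel back makes the result observable, so the driver
// cannot defer the upload past this point.
unsigned char pixel[4];
glReadPixels(0, 0, 1, 1, GL_RGBA, GL_UNSIGNED_BYTE, pixel);

double t1 = getTimeSeconds();
double mbPerSec = ((double)w * h * 4) / (1024.0 * 1024.0) / (t1 - t0);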

Yeah, more control over storage would be really nice…

From what I have seen, NVIDIA will not always wait for the GPU to finish on glFinish calls. The driver can detect cases where you will not be able to tell whether the GPU has finished, and simply flush and return.

As for what you are noticing in the GL behavior:

TexImage2D - driver simply allocates a texture backing store.
glFinish - does nothing.
TexSubImage2D - driver does one of three things:

    1. Driver uses the CPU to copy the data into the backing store directly.
    2. Driver copies the data to a temporary buffer (if not using a PBO, or if the data is not properly formatted) and places a blit on the command stream (see the snippet after this list).
    3. Driver flushes all commands; calls down to the kernel to lock down a piece of VRAM for the texture, pages in any data that is more up to date in the backing store, sets up an aperture, writes the data into the aperture with the CPU, tears down the aperture, and unlocks the VRAM.

glFinish - driver probably flushes and ignores the wait step if it can.
Draw Call - driver queues up whatever you are drawing, plus pages in any items required for rendering, such as your texture if the backing store is more up to date.
glFinish - driver again probably flushes and ignores the wait step if it can.
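If you want to hit case 2 rather than case 1 when benchmarking the bus, the idea would be something like this (my assumption, not a guarantee: a PBO source plus a pixel format that already matches the texture’s internal layout, e.g. GL_BGRA / GL_UNSIGNED_INT_8_8_8_8_REV for an RGBA8 texture, so the driver has no reason to touch the data with the CPU; pbo, tex, w, h are placeholders):

glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);    // data already lives in the PBO
glBindTexture(GL_TEXTURE_2D, tex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV,
                (void*)0);                    // offset into the PBO, not a client pointer
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);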

As for Map/Unmap not writing to VRAM, it depends on the cost of paging the buffer out of VRAM while it is mapped, and the likelihood of that happening. If VRAM pressure forces a page-off of a mapped buffer in VRAM, the kernel has to:

  1. Redirect the page tables to cause a page fault if the task tries to write to the data while it is being relocated.
  2. Allocate a backing store if the buffer does not already have one.
  3. Copy the data to the backing store.
  4. Redirect the page table entries to the new backing store.
  5. Resume the thread if it was halted.

Then the kernel can get on with what it wanted to do in the first place, which is probably to allocate VRAM for something else. These resources are virtualized on a per-object basis, so having VRAM locked down is not always preferable, because it can fragment VRAM and thus increase VRAM pressure. Your thread can also be preempted at any time, so having VRAM locked down by a thread that is not active is not ideal either.