Are TBOs just as fast as PBOs?

fred_em · January 18, 2011, 8:42am

Hi,

I am currently sending texture contents on a frame basis to the GPU. I am using PBOs to do this.

Because I need more programming flexibility in my GLSL fragment shader, I need to switch to Texture Buffer Objects in order to make use of the texelFetch() function.

My question is the following:

TBOs seem pretty much like PBOs. glBufferData, glMapBuffer etc.
Will the contents also be transferred asynchronously, like with PBOs? Will the upload performance be similar?

Thanks,
Fred

Alfonse_Reinheart · January 18, 2011, 9:22am

Because I need more programming flexibility in my GLSL fragment shader, I need to switch to Texture Buffer Objects in order to make use of the texelFetch() function.

The texelFetch function can be used with almost any type of texture (except shadow and cubemap textures).

TBOs seem pretty much like PBOs.

That’s because they are buffer objects. All buffer objects are equal; the only thing that matters is what you use them for.

There is no such thing as a PBO or a TBO; these are simply different uses for buffer objects.

Will the contents also be transferred asynchronously, like with PBOs? Will the upload performance be similar?

What do you mean by this?

Calls to glTexImage or other pixel transfer operations while a buffer object is bound to GL_PIXEL_UNPACK_BUFFER will cause these functions to get their data from the bound buffer object instead of from client memory. This is what allows asynchronous uploads.

If you’re using a buffer object as the storage for a buffer texture, you will not be calling glTexImage or any other pixel transfer operations on the texture. There won’t be asynchronous uploads because there won’t be uploads.

If you’re not using filtering, and you don’t need multidimensional texture access, then you probably also don’t need the pixel transfers either. So just use a buffer texture.

fred_em · January 19, 2011, 4:30am

Alfonse Reinheart wrote:

If you’re not using filtering, and you don’t need multidimensional texture access, then you probably also don’t need the pixel transfers either. So just use a buffer texture.

I understand what you are saying, but you are playing with words a bit.

My ultimate goal, obviously, is to transfer data from main memory to video memory. Regardless of how we represent this, a RAM -> video RAM data transfer will occur. So there is an upload process in all cases.

I would like to know if the transfer in the following 2 scenarios will have the same performance:

In the code below ‘client’ means RAM. Not client/server OpenGL API. Just a central RAM pointer.

A) Texture-1D scenario, with upload using a PBO:

void *clientBufferContents;

glBindTexture(GL_TEXTURE_1D, …);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, …);

void *clientPixelBufferPointer = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, …);
memcpy(clientPixelBufferPointer, clientBufferContents, …);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

// Asynchronous upload. The following call returns immediately.
glTexImage1D(GL_TEXTURE_1D, …, 0);

B) Texture Buffer Object scenario:

void *clientBufferContents;

glBindTexture(GL_TEXTURE_BUFFER, textureID);
glBindBuffer(GL_TEXTURE_BUFFER, textureBufferID);

void *clientPixelBufferPointer = glMapBuffer(GL_TEXTURE_BUFFER, …);
memcpy(clientPixelBufferPointer, clientBufferContents, …);
glUnmapBuffer(GL_TEXTURE_BUFFER); // does THIS call, here, trigger an asynchronous transfer of the buffer to GPU RAM and return immediately?

Alfonse_Reinheart · January 19, 2011, 9:24am

I would like to know if the transfer in the following 2 scenarios will have the same performance:

There are two possibilities for how things work. Either the buffer object’s storage is in GPU memory or it is in CPU memory. And this could happen to the buffer object in the buffer texture case or the pixel buffer case.

However, since the buffer texture’s buffer object needs to be in GPU memory in order for the shader to access it, let’s assume that this buffer object will remain there. Thus, we only need to examine two cases: is the buffer object for the pixel transfer in CPU memory or in GPU memory?

In short:

Case 1: PBO in CPU memory:

PBO: does pixel transfer from CPU memory to GPU memory.
TBO: does memory copy from CPU memory to GPU memory.

A pixel transfer is, at best, no faster than a memory copy. And it could be a good deal slower.

Case 2: PBO in GPU memory:

PBO: does memory copy from CPU memory to GPU memory, then does pixel transfer from GPU memory to GPU memory.
TBO: does memory copy from CPU memory to GPU memory.

The PBO case does more work, and is therefore necessarily slower.

There is no reason to expect the PBO case to be faster in either case. This is because you’re doing a pixel transfer in the PBO case. In conclusion:

mem-copy, from ‘clientBufferContents’ to ‘*clientPixelBufferPointer’

If you’re just doing a memcpy, just call glBufferSubData. The purpose of mapping is so you don’t have to allocate and fill ‘clientBufferContents’ yourself. That is, you map the pointer, fill that pointer directly from wherever you’re getting the data for ‘clientBufferContents’ then unmap it.
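A minimal sketch of the difference between the two approaches, written so it compiles without a GL context. The producer `produce_texels` and the two `upload_*` helpers are hypothetical names for illustration; in real code `mapped` would be the pointer returned by glMapBuffer, and glUnmapBuffer would follow each call.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical producer: writes `count` texels straight into `dst`.
   Stands in for whatever generates the data (file reader, decoder, ...). */
static void produce_texels(float *dst, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        dst[i] = (float)i * 0.5f;
}

/* Variant 1: fill a separate staging array, then memcpy it into the
   mapped pointer. The data is written twice on the CPU side. */
static void upload_via_staging(float *mapped, float *staging, size_t count)
{
    produce_texels(staging, count);
    memcpy(mapped, staging, count * sizeof *mapped);
}

/* Variant 2: produce directly into the mapped pointer.
   No staging array and no extra copy. */
static void upload_in_place(float *mapped, size_t count)
{
    produce_texels(mapped, count);
}
```

Both variants leave the same bytes in `mapped`; the second simply skips one pass over the data. If the data already exists in an array you do not control, variant 2 is not available, and glBufferSubData does the same job as variant 1 with less code.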

fred_em · January 24, 2011, 2:47am

Thanks a lot for your reply. My comments inline.

For things to work with the GPU, things have to reside in GPU memory. For regular operation, data shall reside in GPU memory.
Data comes from my hard drive, or central memory. For clarity here, let’s assume it comes from memory.

A data transfer has to occur from central memory to GPU memory. Regardless of how this will be carried out, the transfer will occur.

What I want to know, Alfonse, is whether this transfer will be done asynchronously:

A) for the CPU. Asynchronously means the CPU will just issue the transfer to the DMA engine, and will then happily resume its work.

B) for the GPU. The DMA engine is copying the data over to GPU memory and, while the copy is being performed, the GPU continues doing its rendering, until of course the availability of this data becomes a requirement - in which case there are two ways for the GPU to deal with this situation:
If rendering reaches the said point where texture data MUST be available, and the said texture data is the target of the DMA transfer, the GPU might choose either to:

B.1) wait for the DMA engine to complete its work transferring data into the texture. The GPU stalls in this case (again, only because it reached a point where it must deal with the said texture) and resumes work when data is ready.

B.2) use current texture data. I.e. the GPU chooses not to wait for the DMA engine. The GPU decides it’ll just use newer texture data later but for now it uses older texture data to avoid a stall.

My original question really is centered on this. Will the asynchronous mechanism described above be guaranteed with both a PBO transferring new data to a texture, and with a TBO populated with new data? Will asynchronousness on both the CPU and GPU side be guaranteed whether I choose a TBO or a PBO?

Just to clarify: from here on, we are not talking about asynchronousness. We are talking about how fast the transfer (be it synchronous or asynchronous, whatever) is going to be done end-to-end. I’m fine with that too, but let’s clearly split the message into two distinct parts.

You write: “PBO: does pixel transfer from CPU memory to GPU memory”. For me, it just does a data transfer, e.g. a raw pixel transfer. It really is raw: the PBO doesn’t process or re-encode the data in my case. It just sends raw data to the GPU. The GPU will be happy with the data as-is.
You are saying the PBO pixel transfer is more work. There is no pixel format conversion in my case. What kind of pixel-related work are you talking about?

Data will be transferred over the PCI bus with the DMA engine in both cases. This is not a memory copy (memory copy = from central RAM, to central RAM). This is a transfer.

So we have:

TBO: raw data transfer
PBO: raw data transfer

My PBO always resides in central RAM, because I need to populate it with data.

Right, you keep going back to the pixel transfer, as opposed to raw data transfer.
Basically, you are saying that PBO = pixel transfer, TBO = data transfer.
Pixel transfer is more costly than raw data transfer.
Is this what you mean?
TBOs and PBOs both contain pixels. They both are “buffer objects”, and they both contain pixels (a TBO does not contain arbitrary data more than a PBO does, conceptually speaking, in effect they really are equal. A TBO contains texels, a PBO contains pixels). The nature of the transfer is the same for me. Is there anything I am missing?

mem-copy, from ‘clientBufferContents’ to ‘*clientPixelBufferPointer’

If you’re just doing a memcpy, just call glBufferSubData. The purpose of mapping is so you don’t have to allocate and fill ‘clientBufferContents’ yourself. That is, you map the pointer, fill that pointer directly from wherever you’re getting the data for ‘clientBufferContents’ then unmap it.
I’m doing glMapBuffer(…, GL_WRITE_ONLY), whose performance is similar to that of glBufferData. But, agreed: glBufferData is sufficient and I don’t have to bother with glMapBuffer, because glMapBuffer gives me a new pointer. glBufferData allows me to work with my pointer right away without doing a memcpy from pointer to pointer. You are totally right here.

Could you tell me what you think regarding my first questions, higher up in this message? (asynchronousness, DMA engine, A), B) B.1) B.2)…)

Your insight would be much appreciated.

Regards
Fred

Alfonse_Reinheart · January 24, 2011, 10:02am

A) for the CPU. Asynchronously means the CPU will just issue the transfer to the DMA engine, and will then happily resume its work.

And this was the point I was trying to make when I talked about buffer object placement.

If a buffer object’s storage resides in GPU memory, then calling glBufferSubData will initiate a memory transfer operation. However, this function is not allowed to fetch asynchronously from the pointer you give it. So the implementation must either copy that pointer to internal memory that it will use to DMA the data, or it must stall the CPU and DMA it right from your memory.

Granted, for any sizable data, the latter is unlikely, if for no other reason than that DMAs generally have memory alignment needs that your allocated buffer likely doesn’t meet. But my point is that there’s no way to know.
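To illustrate the alignment point, here is a minimal sketch of the kind of check that decides whether a pointer could be handed to a DMA engine as-is. The name `is_aligned` is hypothetical, and any specific alignment value is an assumption; drivers do not expose their actual requirements.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if `ptr` is a multiple of `alignment` bytes, 0 otherwise.
   `alignment` must be a nonzero power of two. */
static int is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr & (uintptr_t)(alignment - 1)) == 0;
}
```

A pointer from malloc is often aligned to only 8 or 16 bytes, so a check such as `is_aligned(p, 4096)` (assuming a page-sized requirement) would frequently fail for it. That is one reason an implementation is likely to copy your data into its own staging memory before starting the DMA.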

If you map a buffer and write to it, one of two things could be happening. The pointer you get from glMapBufferRange could be a CPU memory pointer or a direct GPU memory pointer. If it’s a CPU pointer, then the transfer from CPU memory to GPU memory will happen asynchronously after the buffer is unmapped.

However, if mapping gives you a GPU memory pointer (only possible if the buffer’s storage is in GPU memory), then there is no DMA at all. You are writing directly to GPU memory across the PCIe bus. There can be some write combining and so forth to help out, but the memory transfer is entirely synchronous with the CPU.

Again, there’s no way to be certain.

B.2) use current texture data. I.e. the GPU chooses not to wait for the DMA engine. The GPU decides it’ll just use newer texture data later but for now it uses older texture data to avoid a stall.

That can’t happen; the OpenGL spec doesn’t allow it. OpenGL gives implementations a lot of leeway to do things asynchronously and even out-of-order in cases where it doesn’t affect the output. But it is very clear on this: every command issued after another command must execute as if that previous command completed. This is true of every OpenGL operation.

Will the asynchronous mechanism described above be guaranteed with both a PBO transferring new data to a texture, and with a TBO populated with new data? Will asynchronousness on both the CPU and GPU side be guaranteed whether I choose a TBO or a PBO?

Nothing is guaranteed to be asynchronous at all.

You write: “PBO: does pixel transfer from CPU memory to GPU memory”. For me, it just does a data transfer, e.g. a raw pixel transfer. It really is raw: the PBO doesn’t process or re-encode the data in my case. It just sends raw data to the GPU. The GPU will be happy with the data as-is.

My point is this: even if you’re correct and the CPU doesn’t have to intervene in the pixel transfer process at all (and there’s no way to be 100% certain of this for all hardware and implementations), the fact remains that the TBO case will be at least as fast. It will always be at least as fast, because it doesn’t have a pixel transfer operation happening. It’s just a memory copy. And it’s the pixel transfer that can cause the PBO to slow down.

So the best possible case for PBOs is to be equal to the TBO case. And the worst possible case for PBOs is much, much slower. Since there is no case for PBOs to be faster than TBOs, and there are cases where they are much slower, the choice is clear.
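The two cases can be put as a rough cost model. The symbols are shorthand assumed for illustration, not anything from the OpenGL spec: $T_{\text{copy}}$ is the time for a raw CPU-to-GPU memory copy, and $T_{\text{pixel}}$ is the time for a pixel transfer along the path given in its superscript.

```latex
\begin{aligned}
T_{\text{TBO}} &= T_{\text{copy}} \\
\text{Case 1 (PBO in CPU memory):}\quad
T_{\text{PBO}} &= T_{\text{pixel}}^{\text{CPU}\to\text{GPU}} \;\ge\; T_{\text{copy}} \\
\text{Case 2 (PBO in GPU memory):}\quad
T_{\text{PBO}} &= T_{\text{copy}} + T_{\text{pixel}}^{\text{GPU}\to\text{GPU}} \;\ge\; T_{\text{copy}}
\end{aligned}
```

In both cases $T_{\text{PBO}} \ge T_{\text{TBO}}$, with equality only when the pixel transfer adds no cost beyond the raw copy.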

fred_em · January 25, 2011, 6:00am

I immensely appreciate your feedback, but let’s try to answer my original question.

The question appears below, labeled Question 1. It is followed by a second question, labeled Question 2.

In the examples below, ‘textureData’ is a pointer given to me by a third-party library function. I have no control over this pointer; it is just given to me. I cannot supply my own pointer for the third-party library function to fill, and there is no way I can change that.

A) Normal texture scenario with upload using a PBO:
void *textureData;

glBindTexture(GL_TEXTURE_1D, …);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, …);

glBufferData(GL_PIXEL_UNPACK_BUFFER, size, NULL, GL_STREAM_DRAW); // reserve space only

void *cpuAccessibleBufferPointer = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY); // note: write only
memcpy(cpuAccessibleBufferPointer, textureData, size);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

// Asynchronous upload to GPU memory. The following call returns immediately. This is a de facto standard for nvidia and AMD hardware. The ARB_pixel_buffer_object extension guarantees asynchronousness for glReadPixels and, in my opinion, implies it for glMapBuffer/glUnmapBuffer followed by glTexImage2D. Modern games rely on this behavior anyway. Let’s not be purists with the spec, let’s just assume this is always true for the sake of making the discussion simple.

glTexImage1D(GL_TEXTURE_1D, …, 0);

B.A) Texture Buffer Object scenario:
void *textureData;

glBindBuffer(GL_TEXTURE_BUFFER, textureBufferID);

glBufferData(GL_TEXTURE_BUFFER, size, NULL, GL_STREAM_DRAW); // reserve space only

void *cpuAccessibleBufferPointer = glMapBuffer(GL_TEXTURE_BUFFER, GL_WRITE_ONLY); // note: write only
memcpy(cpuAccessibleBufferPointer, textureData, size);
glUnmapBuffer(GL_TEXTURE_BUFFER); // Question 1: on the latest nvidia or AMD platforms (whichever you prefer), does this call trigger an asynchronous transfer and return immediately?

Now, I have a second question related to asynchronousness / synchronousness, below.

Here, you are saying that the following code will not trigger an asynchronous transfer (which I agree with), but a synchronous transfer. Just to confirm, let me show the code:

B.B) Texture Buffer Object scenario with glBufferData:
void *textureData;

glBindBuffer(GL_TEXTURE_BUFFER, textureBufferID);

glBufferData(GL_TEXTURE_BUFFER, size, textureData, GL_STREAM_DRAW); // Transmit texture data SYNCHRONOUSLY. glBufferData here transfers all textureData to GPU memory and THEN, after the transfer is complete, returns. Question 2: Correct?

Remember that textureData is given to me and I have no control over this pointer - I have to do a memcpy.

If you replied Yes to Question 2, this means (and I agree) that glBufferData triggers a synchronous transfer in the code shown. Meaning it is a big no-no if I want an asynchronous transfer to take place.

If you replied Yes to Question 1, do you agree, then, that Scenario B.A) is the best for me, given that asynchronousness is what I want? (asynchronousness: Yes, fast raw-buffer transfer)

If you replied No to Question 1, can you say that Scenario A) is the best for me, again, because I want asynchronousness? (asynchronousness: Yes, and a pixel transfer a little bit slower than a raw buffer transfer, which I don’t mind)

Best regards
Fred

fred_em · January 25, 2011, 6:02am