Performance of texture upload with PBO



James W. Walker
07-15-2010, 01:09 PM
In the ARB_pixel_buffer_object specification, Example 2 shows a way to upload texture data using a PBO. The basic outline is that you create a PBO of the right size, map the PBO into memory, copy pixel data into the PBO with memcpy, unmap the PBO, and then upload from the PBO to a texture with glTexSubImage2D. When I tried using this method to upload an image of size 1858 x 1045 into a nonrectangular texture, the memcpy took (on average) around 13 ms, and the glTexSubImage2D took around 7 ms. On the other hand, if I don't use a PBO and just use glTexSubImage2D directly, it takes around 7 ms.

What am I missing? What's the advantage of the PBO method?

yooyo
07-16-2010, 08:47 AM
Texture streaming.
Without a PBO, the CPU is blocked and waits during the glTexSubImage2D call. With a PBO, the glTexSubImage2D call returns immediately.
Don't use plain memcpy. Use faster memory-copy code that uses MMX/SSE instructions. Google for it.
In video streaming, use two PBOs: decode into PBO1 while uploading to the texture from PBO2, then swap the PBOs.

ZbuffeR
07-16-2010, 08:58 AM
Without a PBO, the CPU is blocked and waits during the glTexSubImage2D call.
From hearsay, this is not completely true. An optimized GL driver only has to copy data on its side, then return, and then send data asynchronously to the GPU.

James W. Walker
07-16-2010, 10:40 AM
With a PBO, the glTexSubImage2D call returns immediately.

That's what I've read, but as stated above, that's not what I measured. So even if I could do the memcpy in zero time, I still wouldn't have a win.

James W. Walker
07-16-2010, 11:23 AM
Now I'm more confused... I did the timing again, with a somewhat bigger image size (1920 x 1080) and got very different results: about 13.5 ms for the memcpy, but only 0.1 ms for glTexSubImage. Maybe I did something wrong before.

As for the memory copy speed, this is on Windows, and I tried using the CopyMemory function provided by the OS, with basically the same results. I'd think that Microsoft would have optimized the hell out of CopyMemory, or am I being naïve?

Arkh89
07-18-2010, 08:07 AM
In my situation, I'm using a single PBO to upload data to the GPU. For 1920*1080 RGB / BYTE, it takes about 2 ms to send all the data (not far from the bandwidth limit of 3 GB/s); the time fluctuates between a minimum near 1.7 ms and a maximum around 2.3 ms.

Platforms : WIN 7 x64 (both)
drivers : 257.XX (for my laptop only, I don't know for the desktop)
CPUs : I7 920 (desktop) I7 950 (laptop)
GPUs : FX 3800 (desktop) GTX 280M (laptop)

The code I'm using:

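// bind the PBO as the pixel unpack buffer and map it for writing
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, bufferID);
void* ptr = glMapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, GL_WRITE_ONLY_ARB);
memcpy(ptr, data, size);
// then unmap PBO
glUnmapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB);
glBindTexture(GL_TEXTURE_2D, textureID);
// with a PBO bound, the last argument is a byte offset into the buffer, not a pointer
glTexSubImage2D(GL_TEXTURE_2D, 0, offsetX, offsetY, w, h, mode, depth, 0);
// then unbind PBO
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, 0);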

Are you sure about the size you are copying?
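The question about sizes can be checked with simple arithmetic. The following is a minimal sketch, not taken from any post in this thread; the helper names `image_bytes` and `throughput_gb_per_s` are made up for illustration, and it assumes tightly packed rows and 1 GB = 10^9 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes in a tightly packed image (no row padding assumed). */
static size_t image_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Hypothetical helper: throughput in GB/s (1 GB = 1e9 bytes) for a copy
   of `bytes` bytes that took `ms` milliseconds. */
static double throughput_gb_per_s(size_t bytes, double ms)
{
    assert(ms > 0.0);
    return (double)bytes / (ms * 1.0e-3) / 1.0e9;
}
```

For 1920 x 1080 RGB bytes, `image_bytes(1920, 1080, 3)` is 6,220,800 bytes, and copying that in 2 ms works out to roughly 3.1 GB/s, which agrees with the figure quoted above. If the 13.5 ms copy reported earlier moved the same number of bytes (an assumption, since the pixel format was not stated), it would correspond to under 0.5 GB/s.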

yooyo
07-19-2010, 07:28 AM
memcpy and CopyMemory are slow. Try with:
- http://www.gamedev.net/community/forums/topic.asp?topic_id=78382
- http://www.cs.virginia.edu/stream/FTP/Contrib/AMD/memcpy_amd.cpp

James W. Walker
07-19-2010, 10:50 AM
Are you sure about the size you are copying?

I did leave out some relevant details. My image data actually contains 2 side by side images, and I'm uploading each half to separate textures. I realize now that I can be a little smarter and upload the data to a PBO just once, and use that PBO for both glTexSubImage calls. However, even taking into account the factor of 2, the time I'm getting seems a lot higher than what you report.

OS: Windows Vista Home Premium 64 bit
CPU: Intel Core 2 Duo
GPU: ATI Radeon HD 4670

James W. Walker
07-19-2010, 12:26 PM
memcpy and CopyMemory are slow. Try with:
- http://www.gamedev.net/community/forums/topic.asp?topic_id=78382
- http://www.cs.virginia.edu/stream/FTP/Contrib/AMD/memcpy_amd.cpp


Thanks, but the thread at your first link refers to code at a dead link, and an Intel library that was then available for free but is no longer. The second link is designed for AMD processors, and it's not clear to me exactly what assumptions it makes.

James W. Walker
07-19-2010, 12:38 PM
As for the memory copy speed, this is on Windows, and I tried using the CopyMemory function provided by the OS, with basically the same results.

D'oh! I now see that CopyMemory is a macro defined to be RtlCopyMemory, and RtlCopyMemory is a macro defined to be memcpy! So maybe Microsoft doesn't have any memory-copy function in a DLL, just a memcpy in a C runtime library. And now another confession, I've been using CodeWarrior rather than Visual Studio, hence a really old C runtime library. That may explain the slowness.

James W. Walker
07-19-2010, 12:54 PM
I tried the AMD assembly code, and it may be a tad faster, say 12.9 ms rather than 13.5.

mhagain
07-20-2010, 07:42 AM
I'm also hitting a similar problem, where the supposed perf increase from using a PBO is just not happening. The code goes like:

- BindBuffer
- MapBuffer
- Update regions of the mapped area that need modifying, building a rect that describes the size of the total updated area.
- UnmapBuffer
- TexSubImage
- Unbind (BindBuffer, 0)

This is in a performance critical path and I need to be able to do 30-40 of these per frame. Textures are 64x512. The entire texture rect is not, however, being updated; only a subrect is, so BindBuffer with a NULL data pointer is not an option.

The annoying thing is that I know the hardware (Intel 4 Series) is capable; I have equivalent D3D code that handles it smoothly and almost for free (and gives you 80,000 verts per frame in addition), but OpenGL just stutters and stalls.

Is accessing the PBO serially more efficient than hopping around in it? Would there be benefit to keeping a copy of the texture data in system memory, hopping around in that to update, then copy to PBO and TexSubImage it?

Maybe a driver problem (I did say Intel) but I want to ensure that I'm using the correct optimal path before bashing at that.
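The step of building a rect that describes the total updated area can be sketched as follows. This is only an illustration, not the code from the post: `DirtyRect` and its helpers are hypothetical names, and the byte-offset helper assumes the PBO holds the full texture image with tightly packed rows, so that GL_UNPACK_ROW_LENGTH would be set to the texture width and the offset passed as the last argument of glTexSubImage2D.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bounding box of all updates, half-open: [x0,x1) x [y0,y1). */
typedef struct { int x0, y0, x1, y1; } DirtyRect;

static DirtyRect dirty_empty(void)
{
    DirtyRect r = { 0, 0, 0, 0 };
    return r;
}

static int dirty_is_empty(const DirtyRect *r)
{
    return r->x1 <= r->x0 || r->y1 <= r->y0;
}

/* Grow the box so that it also covers the w x h region at (x, y). */
static void dirty_add(DirtyRect *r, int x, int y, int w, int h)
{
    assert(w > 0 && h > 0);
    if (dirty_is_empty(r)) {
        r->x0 = x; r->y0 = y; r->x1 = x + w; r->y1 = y + h;
        return;
    }
    if (x < r->x0) r->x0 = x;
    if (y < r->y0) r->y0 = y;
    if (x + w > r->x1) r->x1 = x + w;
    if (y + h > r->y1) r->y1 = y + h;
}

/* Byte offset of the box's first pixel inside a buffer that stores the
   whole texture, tex_width pixels per row (assumption of this sketch). */
static size_t dirty_byte_offset(const DirtyRect *r, int tex_width, int bytes_per_pixel)
{
    return ((size_t)r->y0 * (size_t)tex_width + (size_t)r->x0) * (size_t)bytes_per_pixel;
}
```

The width and height to upload are then `x1 - x0` and `y1 - y0`. Whether uploading one bounding box is cheaper than several small uploads depends on the driver, so this is something to measure rather than assume.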

mfort
07-20-2010, 08:26 AM
@mhagain

Updating only part of the buffer is a very bad thing.
The driver copies the whole buffer.
Use smaller PBOs.

If you want better performance then do the memcpy in another thread.

Dark Photon
07-20-2010, 09:07 AM
memcpy and CopyMemory are slow. Try with:
- http://www.gamedev.net/community/forums/topic.asp?topic_id=78382
- http://www.cs.virginia.edu/stream/FTP/Contrib/AMD/memcpy_amd.cpp


Thanks, but the thread at your first link refers to code at a dead link, and an Intel library that was then available for free but is no longer. The second link is designed for AMD processors, and it's not clear to me exactly what assumptions it makes.
The first link works fine for me here. Suspect net filtering on your end.

And the second link (both actually) appear at first glance to be generic MMX. This formulation is just a little inconvenient to integrate in a C/C++ app since it's raw asm. Also, seems this is doing MMX 64-bit moves. Whereas with SSE2 (supported by all 64-bit CPUs and many 32-bit) you can do 128-bit moves.

So instead...

Here is C <emmintrin.h> code for that same concept -- that is, a non-temporal (non-cache-polluting) memcpy, which AMD terms "Streaming Store", but which uses SSE2:

* gamedev.net source code (http://www.gamedev.net/community/forums/topic.asp?topic_id=502313)

The concept behind all of these (but especially the previous link) is explained more fully (in English) here:

* Performance Optimization of Windows Applications on AMD Processors, Part II (http://developer.amd.com/documentation/articles/pages/PerformanceOptimizationofWindowsApplicationsonAMDProcessors2.aspx) (read the whole page, or search down for Streaming Store)

Works fine on Linux/GCC too though. If you're compiling under GCC, this is how you can test whether this compilation supports SSE2:

#if !( defined(__GNUC__) && defined(__SSE2__) )

if not, you can fall back to system memcpy.

Maybe you could add that SSE2 gamedev source code to your bake-off and post some comparative times. Be sure not to operate on the same data in the same prog run without a cache flush, and flip the order of your tests a few times to ensure your timings are in-fact independent. Separate test prog runs with diff algs each time is probably safest.
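For readers who cannot reach the linked source, here is a minimal sketch of the general idea of a non-temporal copy with SSE2 intrinsics. It is not the gamedev.net code; `stream_copy` and `HAVE_SSE2_COPY` are hypothetical names, and the `_M_X64` check is an assumption added so that 64-bit MSVC builds take the SSE2 path. Without SSE2 it falls back to the system memcpy, as suggested above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2_COPY 1
#else
#define HAVE_SSE2_COPY 0
#endif

/* Hypothetical helper: copy n bytes, using non-temporal (streaming)
   stores for the bulk of the data when SSE2 is available. */
static void stream_copy(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
#if HAVE_SSE2_COPY
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    /* Streaming stores need a 16-byte aligned destination,
       so copy any unaligned head bytes the ordinary way. */
    size_t head = (16 - ((uintptr_t)d & 15)) & 15;
    if (head > n) head = n;
    memcpy(d, s, head);
    d += head; s += head; n -= head;

    /* 16 bytes per iteration: unaligned load, non-temporal store. */
    size_t blocks = n / 16;
    for (size_t i = 0; i < blocks; ++i) {
        __m128i v = _mm_loadu_si128((const __m128i *)(s + 16 * i));
        _mm_stream_si128((__m128i *)(d + 16 * i), v);
    }
    _mm_sfence();  /* order the streamed stores before returning */

    /* Copy the remaining tail bytes (fewer than 16). */
    memcpy(d + 16 * blocks, s + 16 * blocks, n - 16 * blocks);
#else
    memcpy(dst, src, n);  /* no SSE2: fall back to the system memcpy */
#endif
}
```

Whether this beats the library memcpy depends on the compiler, the C runtime, and the destination memory, so it only makes sense as one more candidate in a timing comparison.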

James W. Walker
07-20-2010, 10:23 AM
The first link works fine for me here. Suspect net filtering on your end.

I guess I wasn't clear... when I said "the thread at your first link refers to code at a dead link", I didn't mean that your link was itself a dead link, I meant that the thread is talking about code from http://www.joryanick.com/memcpy.htm, which is a dead link.


Here is C <emmintrin.h> code for that same concept -- that is, a non-temporal (non-cache-polluting) memcpy, which AMD terms "Streaming Store", but which uses SSE2:

* gamedev.net source code (http://www.gamedev.net/community/forums/topic.asp?topic_id=502313)

What's emmintrin.h?

Anyway, I'll give that code a go.

mhagain
07-20-2010, 10:27 AM
@mhagain

Updating only part of the buffer is a very bad thing.
The driver copies the whole buffer.

Sadly I don't know in advance how large or small the update region is going to be, so I can't do that. Secondly, I've now established that the culprit is definitely the call to glTexSubImage2D; comment out that call, even with the map/update/unmap left in, and it's smooth as silk, up to about 5-6 times the performance. The PBO is not what's causing the bottleneck here; it's definitely glTexSubImage2D. To rule out a usual suspect, I have also checked for BGRA.

Use smaller PBOs.
Been there, done that, wear the t-shirt down the pub every friday night. Doesn't help. :(

If you want better performance then do the memcpy in another thread.
Won't help, the copy of data to PBO is not the bottleneck, it's the copy from PBO to texture object.

mfort
07-20-2010, 11:01 AM
Sadly I don't know in advance how large or small the update region is going to be, so I can't do that.


check this out
http://www.opengl.org/registry/specs/ARB/copy_buffer.txt

In my experience glTexSubImage2D with PBO always takes zero time even with uploading BGRA 1920x540

James W. Walker
07-20-2010, 11:21 AM
Here is C <emmintrin.h> code for that same concept -- that is, a non-temporal (non-cache-polluting) memcpy, which AMD terms "Streaming Store", but which uses SSE2:

* gamedev.net source code (http://www.gamedev.net/community/forums/topic.asp?topic_id=502313)

Looks like my old CodeWarrior compiler can't cope with this. It has never heard of __sse2_available, and gave a bunch of "register spilled" warnings that I didn't know how to deal with. I commented out the __sse2_available part and tried it anyway, but it was slower than my older code. I guess I'll have to bite the bullet and learn to use Visual Studio.

mhagain
07-20-2010, 11:25 AM
Sadly I don't know in advance how large or small the update region is going to be, so I can't do that.


check this out
http://www.opengl.org/registry/specs/ARB/copy_buffer.txt

In my experience glTexSubImage2D with PBO always takes zero time even with uploading BGRA 1920x540

It's a nice extension but it's not available on my hardware, and won't be available on 75-90% of users' hardware either. It's annoying because it's not a problem in the D3D version of the app. :(

I think at this stage I need to make a standalone app that I can beat on with this.

Dark Photon
07-20-2010, 12:10 PM
What's emmintrin.h?
It's a cross-platform (apparently) x86/x86_64 header file that provides compiler symbols ("intrinsics") that compile to SIMD instructions such as MMX, SSE, SSE2, etc. For instance, see these links:

* http://msdn.microsoft.com/en-us/library/ba08y07y.aspx (Microsoft on SSE2 integer ops)
* http://www.codeproject.com/KB/recipes/mmxintro.aspx (ancient MMX docs)

I say apparently cross-platform because the SSE2 memcpy from gamedev allegedly compiles on MSWin with MSVS, and it compiles/runs fine for me on Linux with GCC. All I had to do was change this thing: "!__sse2_available" to:

#if !( defined(__GNUC__) && defined(__SSE2__) )

which has nothing to do with this header file.

Dark Photon
07-20-2010, 12:15 PM
Looks like my old CodeWarrior compiler can't cope with this. It has never heard of __sse2_available
Yeah, use whatever your compiler sets to tell you SSE2 is available for this compile.

Or just for testing, replace this with "true" if you know your dev box supports SSE2. See this link:

* http://en.wikipedia.org/wiki/SSE2#CPUs_supporting_SSE2

All 64-bit boxes have it.


I guess I'll have to bite the bullet and learn to use Visual Studio.
Or just use Linux/GCC. It's free.

That said, I find that the built-in memcpy on GCC 4.4.1 is even slightly faster than the gamedev SSE2 non-temporal memcpy on our app's test data (batches streamed to VBOs) on Core i7 920, at least under -O2 (optimization level 2). They're pretty close though.

James W. Walker
07-20-2010, 01:45 PM
Hmm, Visual Studio is also giving me an error on __sse2_available. I suppose I need to include some header, but what? My Google-fu has failed me.

James W. Walker
07-20-2010, 03:54 PM
I managed to get Visual Studio to build me a DLL using the SSE2 memory copy function. I made sure it was compiled with optimizations and intrinsics, and verified that it was taking the SSE2 code path. But it still didn't get significantly below 13 ms on the copy.

yooyo
07-21-2010, 04:15 AM
The whole point behind PBOs is to allow the CPU and GPU to run without waiting on each other. If you need to stream video to the GPU, use this:
- Create a PBO pool; each PBO should be able to fit a whole frame. Map them all and mark them as mapped.
- From the decoder thread, when a frame is decompressed, ask the PBO pool for one unused and mapped PBO pointer. Copy the frame to it and mark it as filled with data.
Depending on the decoder, you can even pass the PBO pointer directly to the decoder so that it decodes the frame straight into the PBO. This avoids one memcpy call. Be careful: if the decoder tries to read data back from this buffer, it can slow down.
- From the rendering thread, once per frame, check the PBO pool status.
* If some PBO is marked as uploading (I assume that uploading from a PBO to a texture will be done in one frame), map its pointer and set its status to mapped.
* If some PBO has data (status = filled with data), unmap that PBO and call glTexSubImage2D. Mark the PBO as uploading. Do not use that texture in the current frame, because glTexSubImage2D may not be finished yet, so the GPU will wait until the texture object is ready to use.

Depending on the number of streams you want to play, use 4 or more PBOs in the pool.
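The bookkeeping for such a pool can be sketched as a small state machine. This is only an illustration of the steps above with the OpenGL calls reduced to comments; `PboPool`, `POOL_SIZE`, and the function names are hypothetical, the extra "filling" state is an addition of this sketch so that a claimed slot is not handed out twice, and real code shared between a decoder thread and a rendering thread would need locking, which is omitted here.

```c
#include <assert.h>

enum { POOL_SIZE = 4 };

typedef enum {
    SLOT_MAPPED,    /* mapped and free for the decoder to claim */
    SLOT_FILLING,   /* claimed by the decoder (state added for this sketch) */
    SLOT_FILLED,    /* holds a complete frame, waiting for the render thread */
    SLOT_UPLOADING  /* unmapped; the texture upload has been issued */
} SlotState;

typedef struct { SlotState state[POOL_SIZE]; } PboPool;

static void pool_init(PboPool *p)
{
    for (int i = 0; i < POOL_SIZE; ++i)
        p->state[i] = SLOT_MAPPED;          /* real code: map each PBO here */
}

/* Decoder side: claim a free mapped slot; returns its index, or -1 if none. */
static int pool_claim(PboPool *p)
{
    for (int i = 0; i < POOL_SIZE; ++i) {
        if (p->state[i] == SLOT_MAPPED) {
            p->state[i] = SLOT_FILLING;
            return i;
        }
    }
    return -1;
}

/* Decoder side: the frame has been copied or decoded into slot i. */
static void pool_mark_filled(PboPool *p, int i)
{
    assert(i >= 0 && i < POOL_SIZE && p->state[i] == SLOT_FILLING);
    p->state[i] = SLOT_FILLED;
}

/* Render side, once per frame; returns the number of uploads started. */
static int pool_render_tick(PboPool *p)
{
    int started = 0;
    for (int i = 0; i < POOL_SIZE; ++i) {
        if (p->state[i] == SLOT_UPLOADING) {
            p->state[i] = SLOT_MAPPED;      /* real code: map the PBO again */
        } else if (p->state[i] == SLOT_FILLED) {
            p->state[i] = SLOT_UPLOADING;   /* real code: unmap, then glTexSubImage2D */
            ++started;
        }
    }
    return started;
}
```

A slot that starts uploading in one tick is only remapped on the next tick, which mirrors the assumption above that the upload completes within one frame.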

To read back data you need two PBOs. Issue glReadPixels into PBO1, map PBO2 and copy its data to system memory or an output video card, then unmap PBO2 and swap the PBO buffer names.

James W. Walker
07-21-2010, 01:51 PM
yooyo, thanks, but there are a couple of things that still confuse me.

First, when I started this topic, I referred to an example in the PBO specification, and that example did not use threads. Was it a poor example?

Second, if you're going to use threads, I'm not sure I see why PBOs are needed. Couldn't you just have one thread that does texture uploads directly with glTexSubImage2D, and another thread that renders with the textures?

James W. Walker
07-21-2010, 02:05 PM
OK, maybe I can answer my own question about why use PBO if you're going to use threads. I guess the simpler approach would not work well if you have only one processor, because while glTexSubImage2D was uploading synchronously, nothing else would be getting done. Right?

yooyo
07-21-2010, 05:12 PM
This is just the easiest possible example. It is not designed for real-world usage.

retro009
10-07-2010, 12:01 PM
* If some PBO has data (status = filled with data), unmap that PBO and call glTexSubImage2D. Mark the PBO as uploading. Do not use that texture in the current frame, because glTexSubImage2D may not be finished yet, so the GPU will wait until the texture object is ready to use.

Hello,
How do I know when the upload (glTexSubImage2D) is finished?

Pierre Boudier
10-07-2010, 12:07 PM
You can insert a fence with a sync object after glTexSubImage2D, and query its status when you want to reuse your PBO.