Texture subload

Hi,

I was wondering if there is any way to spread the upload of a texture to GPU memory over many frames. For instance, if I have a budget of 0.5 ms per frame for texture uploading, I would upload a 512x512 RGBA texture over 8 frames to be sure not to exceed my budget…

It looks like I have two options: with PBO and without PBO.

In the case without PBO, the problem is that the driver only uploads the texture when we draw with it. So even if I do many glTexSubImage calls (the data just gets copied into the driver’s CPU-managed memory until it is used), when I eventually draw, for example, a quad with the texture, the surface allocation is done on the GPU and the transfer of the whole texture is executed on the frame that draws that quad…

In the case with PBO, I haven’t found a usage mode that does what I want, and it is even slower since there is one more allocation done in GPU memory…

Does anyone know a way to force subloading of a texture onto the GPU?

Thanks,

J-P

There is no way, as you can’t know what the driver is doing behind the scenes. Still, PBO should be the answer, as it enables async transfers. Your argument that PBO is slower because it requires another allocation is a bit pointless, as you can pre-allocate the PBO memory in advance.
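For illustration, a minimal sketch of that pre-allocation, done once at startup (the buffer size and usage hint here are just examples, not from the original posts):

GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
// Allocate space for one 512x512 RGBA texture up front; no data yet.
glBufferData(GL_PIXEL_UNPACK_BUFFER_ARB, 512 * 512 * 4, NULL, GL_STREAM_DRAW);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);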

Like Zengar said, there’s no way to be sure. I also came across this problem some time ago and at that moment I was able to force it by inserting some dummy code:

  • send texture data
  • attach texture to FBO
  • activate FBO
  • deactivate FBO

but I can’t guarantee this will work on all GPUs/drivers.
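In rough code, that dummy sequence could look like this (a sketch assuming EXT_framebuffer_object; tex, w, h and pixels are placeholders, and there is no guarantee any particular driver reacts to it):

GLuint fbo;
glGenFramebuffersEXT(1, &fbo);

// 1. send texture data
glBindTexture(GL_TEXTURE_2D, tex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, pixels);

// 2./3. attach the texture to the FBO and activate it
glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT, GL_TEXTURE_2D, tex, 0);

// 4. deactivate the FBO again
glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, 0);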

Here is how I use the PBO:

First frame:
Generate the texture/PBO
Bind the texture/PBO
glBufferData( GL_PIXEL_UNPACK_BUFFER_ARB, nDataSize, NULL, (GLenum)m_eDataStoreType );
glTexImage2D(GL_TEXTURE_2D, 0, 4, unInitTextureSize, unInitTextureSize, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
glBufferSubData( GL_PIXEL_UNPACK_BUFFER_ARB, unCurrentOffset*unChunkSize, unChunkSize, data );

Other frames:
Bind the texture/PBO
glBufferSubData( GL_PIXEL_UNPACK_BUFFER_ARB, unCurrentOffset*unChunkSize, unChunkSize, data );

Last frame:
Bind the texture/PBO
glBufferSubData( GL_PIXEL_UNPACK_BUFFER_ARB, unCurrentOffset*unChunkSize, unChunkSize, data );
glTexSubImage2D( GL_TEXTURE_2D, 0, 0, 0, unInitTextureSize, unInitTextureSize, GL_RGBA, GL_UNSIGNED_BYTE, NULL );
Draw a Quad with the texture

But I get a glitch (memory allocation on the GPU plus the upload) on the glTexSubImage call; it looks like the driver doesn’t keep the PBO in GPU memory… I use GL_STATIC_DRAW, but I’ve tried other usage hints and get pretty much the same results. I’ve also tried map/unmap, still with the same results…

glBufferSubData is slow. Use glMapBuffer + some fast memcpy + glUnmapBuffer. A fast memcpy should be written to be cache friendly, use registers, and other CPU tricks. You can get a ~50% gain compared to a regular memcpy call.

Create several PBOs and use them as a circular buffer, or call glBufferData with NULL before the glMapBuffer call (to tell the driver to discard the previous PBO contents).
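A rough sketch of that per-frame update (NUM_PBOS, chunkSize, srcChunk and fast_memcpy are placeholder names, not from this thread):

// Ring of pre-created PBOs; move to a different one each frame.
static GLuint pbo[NUM_PBOS];
static int    cur = 0;

glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo[cur]);

// Orphan the old contents so the driver doesn't have to wait on them.
glBufferData(GL_PIXEL_UNPACK_BUFFER_ARB, chunkSize, NULL, GL_STREAM_DRAW);

// Map, copy this frame's chunk with the fast memcpy, unmap.
void* dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, GL_WRITE_ONLY);
if (dst)
{
    fast_memcpy(dst, srcChunk, chunkSize);
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER_ARB);
}

cur = (cur + 1) % NUM_PBOS;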

Anyway… keep in mind that there are no guarantees this task will finish within 0.5 ms. A lot of other things in the OS can stall your app (for example, a simple window minimize can stop even time-critical threads for 250 ms).

It looks like the driver always wants to have a copy of the texture in its managed memory.

I have tried to use a dummy PBO to avoid the glitch (whole-texture upload).

When the application starts:
Create and force the upload of a dummy PBO sized for a 1024x1024 RGBA texture

First frame:
Generate the texture
Bind the texture
glTexImage2D(GL_TEXTURE_2D, 0, 4, unInitTextureSize, unInitTextureSize, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
Bind dummy PBO
glTexSubImage2D( GL_TEXTURE_2D, 0, 0, 0, unInitTextureSize, unInitTextureSize, GL_RGBA, GL_UNSIGNED_BYTE, NULL ); // To force data update on the GPU, to avoid texture upload
Unbind dummy PBO

Other frames:
Bind the texture
glTexSubImage2D( GL_TEXTURE_2D, 0, xoffset, yoffset, xupdate, yupdate, GL_RGBA, GL_UNSIGNED_BYTE, data );

Last frame:
Bind the texture
glTexSubImage2D( GL_TEXTURE_2D, 0, xoffset, yoffset, xupdate, yupdate, GL_RGBA, GL_UNSIGNED_BYTE, data );
Draw a Quad with the texture


But it looks like on the first frame the driver makes a copy of the texture in its managed memory (I see a CPU memory allocation on the glTexImage call). It is strange… even if the data is already on the GPU (in the PBO), the driver wants it on the CPU…

What makes you think PBOs are on the GPU after you hand GL the data? The only thing you can guarantee is that the driver has the data.

I’d guess that the CPU copy is partly geared toward being able to virtualize the amount of GPU memory available. E.g. if you’re using 300MB of memory for FB+textures+etc. and you have a 256MB card but sufficient CPU memory, your app will still run. Depending on your texture working-set size it may run great or it may suck, but it’ll run. The driver kicks some old textures/PBOs/etc. off the card as needed.

It could also be used for some software (CPU) rendering paths that get kicked in when you ask for things that the hardware doesn’t natively support.

I query the CPU mem and GPU mem after each call, so I can see the allocations on both sides. I also measure the time, so I have a pretty good idea of what’s going on.

I face two problems; solving either one would be great…

1- When using glTexSubImage, the whole texture is updated on the first call even though I only specify a small region to update. It seems the driver’s “dirty region” management is not very good.

2- If I use a dummy PBO to feed the first glTexSubImage, the CPU-side memory manager still wants a copy of the texture and will download the whole texture from the GPU…

Sorry.
What does “old texture” mean?
Is it connected with glPrioritizeTextures?

The driver might use an LRU cache; an old texture is one that hasn’t been used in a while.
I’m not sure how glPrioritizeTextures affects this behavior; is it only considered a hint?


I don’t know. Probably the driver checks the texture priority? It would have to be tested.

What does “old texture” mean?

Well, NVidia does seem to use some sort of scheme (I’d guess LRU) for deciding which textures to kick off the card if you run out of GPU memory. LRU would make sense. It’s possibly related to creation order as well. I remember years back reading something from NVidia or ATI recommending creating GPU-memory-consuming objects in a certain order.

On that subject, NVidia also does “not” seem to push textures to the GPU when you upload the texture data to the driver (e.g. glTexImage2D) – at least not always. This means when you flip a “texture rich” area into view without doing anything special, you end up with frame breakage while all that CPU->GPU upload happens, initiated by the GL driver. Thus the need to prerender with all your textures on startup to make the GL driver actually upload the textures to the GPU right then.
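For illustration, such a warm-up pass at startup could look roughly like this (a sketch in old-style immediate mode; the texture array and quad placement are placeholders):

// Draw a tiny quad with every texture once at startup, then glFinish, so the
// driver has a reason to push the data to the GPU before the first real frame.
for (int i = 0; i < numTextures; ++i)
{
    glBindTexture(GL_TEXTURE_2D, textures[i]);
    glBegin(GL_QUADS);
        glTexCoord2f(0.0f, 0.0f); glVertex2f(0.0f, 0.0f);
        glTexCoord2f(1.0f, 0.0f); glVertex2f(1.0f, 0.0f);
        glTexCoord2f(1.0f, 1.0f); glVertex2f(1.0f, 1.0f);
        glTexCoord2f(0.0f, 1.0f); glVertex2f(0.0f, 1.0f);
    glEnd();
}
glFinish();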

Is it connected with glPrioritizeTextures?

Probably. Whether or how is up to each vendor.

On that subject, NVidia also does “not” seem to push textures to the GPU when you upload the texture data to the driver (e.g. glTexImage2D) – at least not always. This means when you flip a “texture rich” area into view without doing anything special, you end up with frame breakage while all that CPU->GPU upload happens, initiated by the GL driver. Thus the need to prerender with all your textures on startup to make the GL driver actually upload the textures to the GPU right then.

Pre-rendering all textures is good when you already know all the textures you will need. When you don’t, you need to do some texture paging.

But I’m wondering if there is a way to build a good texture-paging scheme that can work under a given budget and guarantee no glitches.

Only page what you need for display plus some slop. Worded differently, (ignoring aniso) don’t page more than 2 texels per pixel plus some slop. Actually less due to brilinear cheats. That addresses the real needs of display.

Now to your “under a given budget”, you can use LOD bias to “fuzz out” your texture (i.e. use lower-res MIP levels). Feed this fuzz factor into your LOD calculations to dial down the texture memory requirement and meet your memory budget.

For more insight: read this.
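A minimal sketch of the LOD-bias idea (fuzzFactor and finestResidentLevel are placeholder names; assumes GL 1.4+ for the per-unit bias and GL 1.2+ for the base-level clamp):

// Bias sampling toward coarser MIP levels for the current texture unit.
glTexEnvf(GL_TEXTURE_FILTER_CONTROL, GL_TEXTURE_LOD_BIAS, fuzzFactor);

// Or, per texture, clamp the finest level that can be sampled so levels you
// haven't paged in yet are never touched.
glBindTexture(GL_TEXTURE_2D, tex);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_BASE_LEVEL, finestResidentLevel);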

Only page what you need for display plus some slop. Worded differently, (ignoring aniso) don’t page more than 2 texels per pixel plus some slop. Actually less due to brilinear cheats. That addresses the real needs of display.

Now to your “under a given budget”, you can use LOD bias to “fuzz out” your texture (i.e. use lower-res MIP levels). Feed this fuzz factor into your LOD calculations to dial down the texture memory requirement and meet your memory budget.

For more insight: read this.

That was one of the approaches considered: don’t use mip-map levels larger than 1024x1024 DXT5. But it would be nice to find a way to “subload” a mip-map larger than 1024x1024 DXT5 without getting a glitch.

You can do that too.

Re-reading your original post, you’re right. For best performance you don’t want to use the non-PBO route because that will demand an immediate copy from your buffer by the driver.

For the PBO route, you want to make sure you hand the driver the data as far in advance as you can so it can shuffle the data over at its leisure. A ring buffer of PBOs has been advocated here for that.

You also want to be careful to tell the driver that you don’t care about the previous PBO contents, so call glBufferData with a NULL pointer before you call glBufferSubData. That’ll let the driver keep multiple PBOs in flight (IIRC).

All this is geared toward the goal that when you finally call glTexSubImage2D, you maximize the chance that the texture subload block is either already on the GPU or as close to being there as possible. Sitting in a DMA-aligned buffer inside the OpenGL driver is a lot better than sitting in application memory, plus with proper PBO usage the upload can be pipelined.

…but yeah, unfortunately you can’t control all this. Best you can do is maximize the GL driver’s chances of optimizing the transfer to your benefit.

I have tried that, but it looks like on the glTexImage call with the PBO bound (and fully uploaded to GPU memory), it transfers the data from video memory back to CPU memory. I assume that because the CPU allocation is the size of the texture and the call takes a lot of time.

Here are the stats I’ve got:

Generate the texture: 0.030679 ms, mem GPU 0 bytes, mem CPU 0 bytes
Bind the texture: 0.034535 ms, mem GPU 0 bytes, mem CPU 0 bytes
glTexImage: 2.658257 ms, mem GPU 2359296 bytes, mem CPU 1048576 bytes

That was a bit confusing so I’m not quite sure what you’re doing here.

First, you wouldn’t use glTexImage with PBO upload, as you imply. That wouldn’t make any sense, because of course you’d very likely get a performance hiccup due to the CPU and GPU dynamic memory allocation. You’d instead allocate the GL texture memory up-front with glTexImage, then at render-time you’d upload from PBO using glTex"Sub"Image into your “existing” GL texture (whose memory has already been preallocated). Looking up to your previous message, you say this is only your first-frame algorithm. In which case timing the glTexImage call doesn’t make any sense. Time your glTex"Sub"Image calls. If you want to get rid of the glTexImage allocation overhead at render-time, just pull this thing out to startup and provide a null pointer to preallocate your texture(s) then.

Second, how much time is there from when you update the PBO and when you do the glTex"Sub"Image? The longer you wait, the more chance the driver will have gotten your PBO data onto the GPU so that glTex"Sub"Image time is trivial.

Also, you’re not trying to push whole big MIP levels at the GPU all at once are you? Tune your PBO upload path for max upload speed first. Then based on its timings, decide how big your subload chunks can be based on how much frame time you want to spend.
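To make the budgeting concrete, a back-of-the-envelope sketch (benchmarkPboUpload and the numbers are placeholders, not measurements from this thread):

// Measure the real PBO upload rate once, then size the per-frame chunk so it
// fits the frame-time budget.
double bytesPerMs = benchmarkPboUpload();   // hypothetical timing pass over your upload path
double budgetMs   = 0.5;                    // per-frame budget from the original post
size_t chunkBytes = (size_t)(bytesPerMs * budgetMs);

// Example: a 512x512 RGBA texture is 512*512*4 = 1048576 bytes, so at
// chunkBytes per frame the upload spreads over ceil(1048576.0 / chunkBytes) frames.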