Carving up GPU memory dynamically

I’m very tired of the old-school “prealloc all your textures on startup” model for avoiding run-time performance hiccups from allocating and deallocating textures. I typically have “way” more textures than will fit on the GPU, and I don’t know a priori how many textures of which internal formats (iformats) and resolutions I will need during any given execution. I also can’t resort to the standard “loading screen” cheat to hide this. And I’d like to continue to use hardware-accelerated texture sampling/filtering (including clamping, auto-LOD, and anisotropic filtering), not reimplement all of this to the detriment of performance.

For those who watch the extensions more closely than I do: do we yet have the capability to dynamically carve up GPU memory on the app side and dole it out as needed to textures of dynamically-discovered iformats and resolutions, without introducing alloc/free performance hiccups?

When no longer needed, these textures should be individually freeable, and the app should be able to coalesce the resulting memory holes for future reallocation to textures of different iformats and resolutions, again without introducing performance hiccups.

Core/ARB functionality is preferable, but if that’s not available and I can get this with NV extensions, that’s acceptable. I’m looking for the ability to treat GPU memory as a big untyped memory arena (e.g. gpumalloc( HUGE_SIZE )) and then lay out textures on it as I see fit.

Thanks for any insight!

P.S. Additionally, do we yet have a way to create 2D texture array “views” where the slices of the array might be individual 2D textures scattered across GPU memory?

So you want to allocate and free memory… without allocating and freeing memory?

No, there are no extensions that let you do that.

Sparse textures are somewhat close to this, but you still need to state up front how big they are. And there’s no guarantee that giving them memory will not have allocation costs.
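
For concreteness, here is roughly what sparse commitment looks like with ARB_sparse_texture; a minimal sketch only, with the page-size alignment queries omitted. The dimensions and mip count are still fixed at glTexStorage time:

GLuint tex;
glGenTextures( 1, &tex );
glBindTexture( GL_TEXTURE_2D, tex );
glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_SPARSE_ARB, GL_TRUE );
glTexStorage2D( GL_TEXTURE_2D, levels, GL_RGBA8, 4096, 4096 );   // size/levels fixed up front

// Commit physical pages for a region of level 0 only when it's actually needed...
glTexPageCommitmentARB( GL_TEXTURE_2D, 0, 0, 0, 0, 1024, 1024, 1, GL_TRUE );
// ...and decommit later. Whether either call is cheap is entirely up to the driver.
glTexPageCommitmentARB( GL_TEXTURE_2D, 0, 0, 0, 0, 1024, 1024, 1, GL_FALSE );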

Considering the virtualization of GPU memory on most graphics hardware, it’s highly unlikely that this is ever going to happen. GPU memory belongs to the OS first and foremost; you’re not going to be allowed to “treat GPU memory as a big untyped memory arena”. That’s basically destroying the abstraction that OpenGL exists to create.

If you want nothing between you and the details of the GPU, then you’re going to have to use a console.

As for the array texture thing, no. An array texture is really just a specialized 3D texture; it relies on being contiguously allocated. You can use bindless texturing to achieve a similar effect, though.
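
A minimal sketch of the bindless route (ARB_bindless_texture), assuming tex[] already holds ordinary 2D texture objects; SLICE_COUNT and handleBuf are illustrative names, not anything standard:

GLuint64 handles[SLICE_COUNT];
for ( int i = 0; i < SLICE_COUNT; i++ )
{
  handles[i] = glGetTextureHandleARB( tex[i] );      // 64-bit handle per texture
  glMakeTextureHandleResidentARB( handles[i] );      // must be resident before any shader use
}

GLuint handleBuf;
glGenBuffers( 1, &handleBuf );
glBindBuffer( GL_SHADER_STORAGE_BUFFER, handleBuf );
glBufferData( GL_SHADER_STORAGE_BUFFER, sizeof( handles ), handles, GL_STATIC_DRAW );
glBindBufferBase( GL_SHADER_STORAGE_BUFFER, 0, handleBuf );

// GLSL side (with GL_ARB_bindless_texture enabled):
//   layout(std430, binding = 0) buffer Handles { sampler2D slices[]; };
//   ... texture( slices[sliceIndex], uv ) ...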

Thanks for the reply. To the above, no. More like: allocate a large memory arena from GL (for instance, in a buffer object), and have the app manage it without GL being involved in the decision, …or at least be minimally involved. For instance, designate X bytes starting at offset Y in buffer object B for texture T1 with res R and iformat I, etc. – rinse, repeat. The goal being to allow application-specific memory management of the block, so you can just “re-type” memory as needed, which ideally should be cheap.

Versus the current crystal-ball approach of “hmmm, I think/hope/pray I’m not going to need more than 42 1024x1024 DXT5 texture slices, 162 512x512 RGTC2 texture slices, etc., so I’ll preallocate all of those because runtime GPU texture allocation/free is so prohibitive”, and then later at render time, while run-time paging, discovering that those estimates don’t cut it, so you can’t display some model at the desired resolution …or at all! In this case you’d really like to commandeer some known-unused memory from another pool (or better yet, just have one “big” pool to scavenge from), but that doesn’t work because you can’t just re-type the memory (i.e. designate it as being for a specific iformat, res, and number of levels). You have to totally free and realloc it, leading to lots of GPU memory trauma (and frame breakage to prove it), so that’s out. And hopefully we all know the horrible effects that can hit your frame rate when GPU memory gets near full, so blissfully overcommitting GPU memory and letting the driver sort it out is a non-starter here as well.

Ignore the specific names and nuances used here; this is just to convey the gist of what I’m suggesting:


glGenBuffers( 1, &buffer );
glNamedBufferDataEXT( buffer, size, NULL, GL_DYNAMIC_DRAW );   // one big untyped GPU arena

offset = 0;
glGenTextures( 2, tex );
for ( int i = 0; i < 2; i++ )
{
  glTextureStorage2D( tex[i], levels, iformat, width, height );
  glTextureDeviceStorage( tex[i], buffer, offset );   // <------ hypothetical: back this texture with the arena at 'offset'
  offset += glGetDeviceSizeOfTexture( tex[i] );       // hypothetical: how much of the arena this texture consumed
}

When you want to move textures around, you just do it (potentially without “any” texture calls; just do the GPU memcpy), and re-specify the appropriate texture handle’s buffer and offset when complete. And when you’re done with the data for some texture, you just stop using it, push the texture handle on a free list, and later repurpose it as a front to some other texture data.
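
For example, a sketch of that “fill a hole” step: glCopyBufferSubData() is real (GL 3.1), while glTextureDeviceStorage() is the imaginary call from the example above, and oldOffset/holeOffset/texSizeBytes are just illustrative variables. Note that glCopyBufferSubData() requires the source and destination ranges not to overlap, so a sliding compaction may need a staging hop:

glBindBuffer( GL_COPY_READ_BUFFER,  buffer );
glBindBuffer( GL_COPY_WRITE_BUFFER, buffer );
glCopyBufferSubData( GL_COPY_READ_BUFFER, GL_COPY_WRITE_BUFFER,
                     oldOffset, holeOffset, texSizeBytes );   // on-GPU memcpy within the arena
glTextureDeviceStorage( tex[i], buffer, holeOffset );         // hypothetical: re-point the texture at its new offset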

Thanks for the extension pointers BTW. Have read one of those already but will read up on the other.

Looking at your code example raises a question: how are you going to manage this storage afterwards?
I guess you want to write code that manages all the offsets, storing information about size and availability. But when you no longer need one of the textures resident in your pool, how are you planning to reuse the space it occupied? You’ll need to find a texture (or textures) of smaller or equal size to fit in there, and that way your pool may (and most likely will) suffer from heavy fragmentation. And you can’t make assumptions about its practical capacity when designing levels and such. What are you going to do when your texture pool can’t provide the requested storage? Switch to basic allocation or use another pool? That is going to be a pain in the ass to handle in your application.

And correct me if I’m wrong, but I think feeding the actual texture data to the GPU is much more of a performance problem in this area (at least on dedicated GPUs), and it may nullify all your effort designing a texture pool.

And correct me if I’m wrong, but I think feeding the actual texture data to the GPU is much more of a performance problem in this area (at least on dedicated GPUs), and it may nullify all your effort designing a texture pool.

I’ve been wondering the same thing: isn’t the texture data upload the dominating factor here? I haven’t done any measuring, but my (naive?) expectation is that the glTexStorage*D() calls are fast and the glTexSubImage*D() calls are the ones consuming lots of time. My understanding of how to tame the latter is that folks often start by uploading only the highest (i.e. smallest) mipmap levels and fill in the lower (higher resolution) ones over successive frames. If the former really are the bottleneck for your application, Dark Photon, can you perhaps limit the number of texture allocations you do per frame and skip rendering objects that don’t have all their resources loaded yet?
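
Something like the following is the coarse-first streaming pattern I mean; a sketch only, where levelWidth(), levelHeight(), and levelPixels() are made-up helpers for the application’s own data:

// Allocate the full mip chain once, then fill one level per frame, starting
// from the smallest mip, and use GL_TEXTURE_BASE_LEVEL to hide unloaded levels.
glTexStorage2D( GL_TEXTURE_2D, levels, GL_RGBA8, width, height );
glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_BASE_LEVEL, levels - 1 );

// Call once per frame per streaming texture (nextLevel starts at levels - 1):
void streamNextLevel( int *nextLevel )
{
    if ( *nextLevel < 0 )
        return;                                    // fully resident
    glTexSubImage2D( GL_TEXTURE_2D, *nextLevel, 0, 0,
                     levelWidth( *nextLevel ), levelHeight( *nextLevel ),
                     GL_RGBA, GL_UNSIGNED_BYTE, levelPixels( *nextLevel ) );
    glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_BASE_LEVEL, *nextLevel );
    (*nextLevel)--;
}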

I think the topic is more about non-type-specific, coherent, and reusable storage. You either end up with a potentially heavily-fragmented buffer, or you expect driver magic to pack textures coherently in that buffer. You lose control over offsets and such, but you still have an isolated, pre-allocated storage. But I’m not sure it’s possible to implement this while preserving decent performance.

I don’t see why we couldn’t have something like VirtualAlloc and VirtualFree; we would still need to type “views”, but that’s about it.

I wrote a broad idea of what I consider a good, streamlined API here: http://forum.beyond3d.com/showthread.php?t=63565
I don’t pretend it’s anywhere close to completion, but those are the principles I’d like to see applied.

If texture “allocs” and “frees” were ordered totally arbitrarily (which they aren’t in my case, but let’s go with that for a second), you could end up having to deal with fragmentation. Dealing with it, however, becomes fairly simple and cheap. To get rid of a hole, just do a fast on-GPU memcpy within the buffer object to shift a texture down to fill it (think cudaMemcpy( …, DeviceToDevice )), and then update the offset in the texture handle (glTextureDeviceStorage) to reflect the new location. Rinse and repeat until you have a space big enough.

However, in my use case, like I said, the allocs and frees aren’t totally arbitrary. The allocs come in a group, and the frees come in a group. So while I’d support the above “garbage collection” hole-filling strategy for generality (because it’s easy), it largely boils down to clearing out the buffer at one point and then populating it with an unspecified number of textures of varying iformats and resolutions (all with MIPmaps).

The big wins with this “one buffer” approach are that 1) I only need one big buffer for this, not 32 or more different big ones each containing textures of a predetermined iformat and resolution (e.g. 128x128 RGTC1, 256x256 RGTC1, …, 4096x4096 RGTC1, 128x128 DXT5, 256x256 DXT5, …, 1024x1024 BPTC_SIGNED_FLOAT, etc., etc.), each requiring “a ton” of empty space (lots more than with the single-buffer approach) to provide for the worst-case allocation scenario (in general, I can’t even provide for the worst case with this pre-typed approach without draconian limitations). Also, 2) it’s cheap to repurpose how chunks of this buffer are used dynamically, based on what texture resolutions and iformats are discovered at render time.
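
To make (2) concrete, here’s a minimal sketch of the kind of CPU-side bookkeeping the one-big-buffer approach implies: a first-fit free list over byte ranges of the buffer. Freeing and coalescing of holes aren’t shown, and all names here are illustrative, not an existing API:

typedef struct Block { GLintptr offset; GLsizeiptr size; struct Block *next; } Block;
static Block freeList = { 0, HUGE_SIZE, NULL };     // starts as one buffer-sized hole

GLintptr arenaAlloc( GLsizeiptr size )
{
    for ( Block *b = &freeList; b; b = b->next )
        if ( b->size >= size )                      // first fit
        {
            GLintptr off = b->offset;
            b->offset += size;
            b->size   -= size;
            return off;                             // hand this offset to the texture
        }
    return -1;                                      // no hole big enough: compact or evict
}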

What are you going to do when your texture pool will not be able to provide requested storage? Switch to basic allocation or use another pool?

That’s a lot less likely to happen with the “one big buffer” approach than it is with the current “a lot of big buffers” approach.

But the idea would be to allocate one big buffer, manage it efficiently, and (on a GPU with a decent amount of memory, which I can assume) never in practice run out of memory.

And correct me if I’m wrong, but I think feeding the actual texture data to the GPU is much more of a performance problem in this area

No, it’s actually not. If the textures have been precreated and prerendered with (to force their allocation on the GPU), you can trickle-page texture content onto the GPU without causing big frame-rate hiccups, much less frame breakage. Texture allocation and deallocation is where you start taking big performance hits. And that’s the point of my post: we need a way to get around the crazy expense of that and make efficient, dynamic use of GPU memory for textures of varying formats.

No, not in my experience. Dynamic allocation and deallocation of memory on the GPU can cause reorganization of memory and performance hiccups you can’t control (other than by removing the allocs/frees), and in some circumstances it can stabilize in a situation where your steady-state frame rate is reduced to 1/2 to 1/3 of normal.

I haven’t done any measuring, but my (naive?) expectation is that the glTexStorage*D() calls are fast and the glTexSubImage*D() calls are the ones consuming lots of time.

Keep in mind that ordinary texture allocation (a la glTexImage*D( NULL ), and presumably glTexStorage*D as well) doesn’t actually do anything on the GPU, as far as I have seen. Texture uploads don’t either. It’s when you actually kick the GPU and tell it you want to render with one right now that the actual GPU-side allocation and uploads often happen, and that can be the expensive part.

If the former really are the bottleneck for your application, Dark Photon, can you perhaps limit the number of texture allocations you do per frame and skip rendering objects that don’t have all their resources loaded yet?

Hopefully my comment above addressed this, but I try to avoid doing any texture allocations per frame (“especially” big ones) to avoid serious performance problems.

No, it’s actually not. If the textures have been precreated and prerendered with (to force their allocation on the GPU), you can trickle-page texture content onto the GPU without causing big frame-rate hiccups, much less frame breakage. Texture allocation and deallocation is where you start taking big performance hits. And that’s the point of my post: we need a way to get around the crazy expense of that and make efficient, dynamic use of GPU memory for textures of varying formats.

This part I don’t quite get. How do you “force allocation on the GPU” without allocating and uploading the texture to VRAM, and where can you apply such a technique? I think most people would use the functionality you’re talking about to aid arbitrary texture streaming: to save video memory and use it more efficiently by trying to keep resident only the textures/data that are potentially on screen, which is where a typeless, coherent data pool for textures would be quite useful. But that requires uploading textures to VRAM during rendering, really fast and unpredictably.

I wasn’t saying that you could. What I’m saying is that when you create and initially populate textures via the GL API, you’re often just doing so with a copy of the texture in CPU memory. If you then flip a bunch of them into view and try to draw with them, you’ll get a huge frame-time spike because none of them are actually on the GPU yet. Hence the need to prerender with them, to force the driver to actually allocate them in GPU memory.
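
A rough sketch of that warm-up, under the assumption of a trivial textured-quad shader and a 1x1 offscreen target set up elsewhere (warmupFBO, warmupProgram, unitQuadVAO, newTex, and newTexCount are all illustrative names):

glBindFramebuffer( GL_FRAMEBUFFER, warmupFBO );
glViewport( 0, 0, 1, 1 );
glUseProgram( warmupProgram );
glBindVertexArray( unitQuadVAO );
for ( int i = 0; i < newTexCount; i++ )
{
    glBindTexture( GL_TEXTURE_2D, newTex[i] );      // sample each new texture once...
    glDrawArrays( GL_TRIANGLE_STRIP, 0, 4 );        // ...so the driver pages it onto the GPU now
}
glFinish();   // heavy-handed, but it ensures the work really happens here, not mid-frame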

I just realized something. You can use view-texture aliases to achieve at least some of what you want. It doesn’t give you everything you want, since you can only alias formats between textures of the same pixel size. So you can’t pretend that a DXT5 texture is a DXT1 texture of twice the width or something.

But, through the use of 2D array textures, you can pretty much dole out anything, from cubemaps to just 2D textures, to whatever. With full mipmap pyramids and everything.

Obviously, there are limitations for this approach. Specifically, that it wastes a lot of memory if all of the textures are not the same size. And the format limitations I mentioned earlier. But it could work for some of the more egregious cases.
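
For reference, a rough sketch of that aliasing approach with ARB_texture_view (GL 4.3): one big immutable 2D array texture acts as the pool, and individual layers are exposed as standalone 2D textures. Every view shares the pool’s dimensions and must use a compatible format:

GLuint pool;
glGenTextures( 1, &pool );
glBindTexture( GL_TEXTURE_2D_ARRAY, pool );
glTexStorage3D( GL_TEXTURE_2D_ARRAY, levels, GL_RGBA8, 1024, 1024, 256 );   // 256-layer pool

GLuint view;
glGenTextures( 1, &view );   // must be a fresh, never-bound name
// Alias layer 17 of the pool as an ordinary 2D texture with its full mip chain:
glTextureView( view, GL_TEXTURE_2D, pool, GL_RGBA8, 0, levels, 17, 1 );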

Thanks! I’ll definitely read up on that.

Unfortunately, in our use cases there is typically only one format in use at a given average size per texel. Of format and res, res is of course the memory killer of those two axes (e.g. you can’t make everything 4096^2 maps).