glFinish stall when using a lot of texture memory (caused by driver freeing up mem?)

Here’s a quick problem description:

I am developing an application which uses a lot of GPU memory for texture data.
I have a set of texture arrays (say 20) that I update each frame with data located in a single big PBO (around 1.5GB in some cases, up to 6GB in others).
I’m binding these textures to a moderately complex shader for rendering.
The texture data are typically RGTC2 and S3TC_DXT1.
I’m using an NVIDIA GTX 970.

Now here’s the problem:

If I use a dataset larger than some size (e.g. 1.5GB), then occasionally I get a stall in the glFinish() part of SDL_GL_SwapWindow (I’ve verified that by adding a glFinish() call before the SwapWindow call, and the CPU stall happens there).

[ul]
[li]If I disable PBO->texture updates, the stall still happens[/li]
[li]If I disable setting the RGTC2 texture as a uniform in the shader, the stall goes away[/li]
[li]If I disable setting the S3TC_DXT1 texture as a uniform in the shader, the stall still happens[/li]
[/ul]

It does not matter how much I use the texture in the shader – a single fetch suffices to cause a stall.

And the funny bit:
I’ve added, out of curiosity, some memory reporting code that I found for NVidia, using GL_GPU_MEM_INFO_CURRENT_AVAILABLE_MEM_NVX.
Immediately after the stall, the reported available GPU memory is 300MB more (using one of the datasets), so the stall looks like it’s caused by the driver [b]freeing up memory[/b].

Does anybody have any idea why and how that could happen, or even better, how to prevent such a stall?

Thanks.

That’s pretty interesting. I’m not sure, but it might be related to this bottleneck I saw subloading to large, compressed texture arrays on NVidia recently (GTX980 Ti):

With this one though, the primary bottleneck I saw was in the subloading calls themselves (glCompressedTexSubImage?D).

At the very least, it might be worth trying the workarounds I found to get a few more data points on your problem.

[QUOTE=babis;1284556]

[ul]
[li]If I disable PBO->texture updates, the stall still happens[/li]
[li]If I disable setting the RGTC2 texture as a uniform in the shader, the stall goes away[/li]
[li]If I disable setting the S3TC_DXT1 texture as a uniform in the shader, the stall still happens[/li]
[/ul]

It does not matter how much I use the texture in the shader – a single fetch suffices to cause a stall

And the funny bit:
I’ve added out of curiosity some memory reporting code that I found for NVidia, using GL_GPU_MEM_INFO_CURRENT_AVAILABLE_MEM_NVX
Immediately after the stall, the reported available GPU memory is 300MB more (using one of the datasets), so the stall looks like it’s caused by the driver freeing up memory[/QUOTE]

Hmm… No solutions for you I’m afraid, but some educated guesses.

After a subload, and before the GPU can sample from the texture, the driver has to swizzle the new texel data. That is, it has to reorder the texels for better cache coherence when doing lookups into the texture. This takes time. It also takes memory.

Until it does the swizzle (aka tiling), it has to keep track of the “unswizzled” texels you uploaded. After the swizzle, it doesn’t need those anymore because the data is in the actual texture memory that’ll be used for texture sampling. So at some point, the driver is likely to free up the unswizzled texture memory. When and how it decides to do this I have no idea. I also don’t know whether it pools these unswizzled buffers and reuses them across multiple subloads to the same texture and/or across multiple textures.

Another thing. I don’t know how intelligent (or not) the driver is with swizzling subloads into large texture arrays. For instance, if you have a texture array with 1,000 slices (layers) and you re-upload MIPs for one slice, does it “re-swizzle” the entire texture or just that slice? What if you re-upload 10 slices scattered across that 1,000 slice texture array – what gets re-swizzled? This has big impacts on how much scratch memory and how much time is required to perform this re-swizzle. What little evidence I have seems to suggest that the driver is adaptive in what it does under-the-hood here based on your usage pattern, but that’s a guess.

Finally, based on past experience, I’m pretty sure the NVidia driver doesn’t actually provoke GPU texture uploads (and thus this reswizzling) until you actually render with the texture. This may partially explain your results where you removed visibility of some of your textures (RGTC2) from the driver/GPU in the draw pass. That is, it didn’t see that they were needed for the draw pass, so the driver didn’t kick-start the process to actually upload/swizzle the pending texture content to get them draw-ready.

As to why the same didn’t happen for the DXT1 textures, I’m not sure, but (again) I have guesses. First, did you have any other shaders in your draw pass referencing those same DXT1 texture(s)? If not, I’ll note that while I was tracking down my texture array subloading slowdown (link above), I noticed that some texture arrays generated serious slowdowns while others were seemingly unaffected. This may correlate with the sheer size of the texture arrays (in number of slices or bytes). So you might collect these metrics for the texture arrays you’re currently operating with and see whether the slowdown correlates with particular texture array(s).

Well damn, you actually did solve it for me!
The problem was related to the texture array size.
I guess it didn’t happen with the DXT1 arrays because they occupy half the size (I presume it would have if they were larger).
My texture arrays had a lot of slices, say 1024x1024x240 cubemaps, while I was changing and accessing only a few of them each frame. I presume the driver was messing around with larger portions of the texture rather than just those slices.
Now I keep my huge PBO with all the data, but I rearranged the array access and update so that I use a smaller array (1024x1024x16) and there are no stalls anymore…

Thanks a lot!

Great! Glad you’ve got it solved. Yeah, there’s definitely some strange mojo going on with large texture arrays on NVidia. I’d love to know exactly what’s going on under-the-hood that can make them so slow.