[QUOTE=babis;1284556]
[ul]
[li]If I disable PBO->texture updates, the stall still happens[/li]
[li]If I disable setting the RGTC2 texture as a uniform in the shader, the stall goes away[/li]
[li]If I disable setting the S3TC_DXT1 texture as a uniform in the shader, the stall still happens[/li][/ul]
It does not matter how much I use the texture in the shader – a single fetch suffices to cause a stall
And the funny bit:
I’ve added out of curiosity some memory reporting code that I found for NVidia, using GL_GPU_MEM_INFO_CURRENT_AVAILABLE_MEM_NVX
Immediately after the stall, the reported available GPU memory is 300MB more (using one of the datasets), so the stall looks like it’s caused by the driver freeing up memory[/QUOTE]
Hmm… No solutions for you I’m afraid, but some educated guesses.
After you subload texel data, and before the GPU can sample from the texture, the driver has to swizzle the texel data. That is, it has to reorder the texels for better cache locality when doing lookups into the texture. This takes time. It also takes memory.
Until it does the swizzle (aka tiling), it has to keep track of the “unswizzled” texels you uploaded. After the swizzle, it doesn’t need those anymore because the data is in the actual texture memory that’ll be used for texture sampling. So at some point, the driver is likely to free up the unswizzled texture memory. When and how it decides to do this I have no idea. I also don’t know whether it pools these unswizzled buffers and reuses them across multiple subloads to the same texture and/or across multiple textures.
Another thing. I don’t know how intelligent (or not) the driver is with swizzling subloads into large texture arrays. For instance, if you have a texture array with 1,000 slices (layers) and you re-upload MIPs for one slice, does it “re-swizzle” the entire texture or just that slice? What if you re-upload 10 slices scattered across that 1,000-slice texture array: what gets re-swizzled? This has a big impact on how much scratch memory and how much time the re-swizzle requires. What little evidence I have suggests that the driver adapts what it does under the hood based on your usage pattern, but that’s a guess.
Finally, based on past experience, I’m pretty sure the NVIDIA driver doesn’t actually trigger GPU texture uploads (and thus this re-swizzling) until you render with the texture. This may partially explain your results when you hid some of your textures (RGTC2) from the driver/GPU in the draw pass: it didn’t see that they were needed for the draw pass, so the driver never kicked off the process of uploading/swizzling the pending texture content to get them draw-ready.
As to why the same didn’t happen for the DXT1 textures, I’m not sure, but (again) I have guesses. First, did you have any other shaders in your draw pass that were referencing those same DXT1 texture(s)? If not, then I can tell you that when I was tracking down my texture array subloading slowdown (link above), I noticed that some texture arrays generated serious slowdowns while others were seemingly unaffected. This may correlate with the sheer size of the texture arrays (in number of slices or bytes). So you might collect these metrics for the texture arrays you’re currently working with and see whether they correlate with which texture array(s) are causing the slowdown.
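If it helps with collecting those numbers, here is a rough sketch (Python, purely for the arithmetic) of how I’d tabulate slice count and compressed size per texture array. The block sizes are the standard ones: 8 bytes per 4x4 block for DXT1 and 16 bytes per 4x4 block for RGTC2. The array names, dimensions and slice counts at the bottom are made up for illustration, and the sketch assumes a full MIP chain down to 1x1, so plug in your own values. Note that this is only the nominal texel payload, not whatever the driver actually allocates for swizzled or scratch copies.

```python
# Bytes per 4x4 block for the two compressed formats discussed in this thread.
BLOCK_BYTES = {
    "DXT1": 8,    # S3TC DXT1: 64 bits per 4x4 block
    "RGTC2": 16,  # RGTC2 (two channels): 128 bits per 4x4 block
}


def level_bytes(width, height, fmt):
    """Compressed size of one MIP level of one slice, in bytes."""
    blocks_x = (width + 3) // 4   # round up to whole 4x4 blocks
    blocks_y = (height + 3) // 4
    return blocks_x * blocks_y * BLOCK_BYTES[fmt]


def slice_bytes(width, height, fmt, levels=None):
    """Compressed size of one slice, summed over its MIP chain.

    By default the chain runs down to 1x1 (an assumption); pass `levels`
    to stop after that many levels instead.
    """
    total = 0
    level = 0
    w, h = width, height
    while True:
        total += level_bytes(w, h, fmt)
        level += 1
        if (levels is not None and level >= levels) or (w == 1 and h == 1):
            break
        w, h = max(1, w // 2), max(1, h // 2)
    return total


def array_metrics(name, width, height, slices, fmt):
    """Slice count and nominal byte size for one texture array."""
    per_slice = slice_bytes(width, height, fmt)
    return {
        "name": name,
        "format": fmt,
        "slices": slices,
        "slice_bytes": per_slice,
        "total_bytes": per_slice * slices,
    }


if __name__ == "__main__":
    # Hypothetical arrays, for illustration only.
    arrays = [
        array_metrics("normals", 1024, 1024, 300, "RGTC2"),
        array_metrics("albedo", 1024, 1024, 300, "DXT1"),
    ]
    for m in sorted(arrays, key=lambda m: m["total_bytes"], reverse=True):
        print(f'{m["name"]:>8} {m["format"]:>6} '
              f'slices={m["slices"]:5d} '
              f'total={m["total_bytes"] / 2**20:8.1f} MiB')
```

One thing this makes obvious: at equal dimensions and slice count, an RGTC2 array is twice the size of a DXT1 one. If the stall really does scale with bytes, that alone might be part of why only your RGTC2 texture triggers it, but that is a guess on my part too.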