Thoughts on using varyings vs. not using them
I have asked this question before on this forum, but it was not well presented, buried inside a thread on another topic.
I'm trying to find out whether it is faster to:
a) load some data from a TBO in the vertex shader (in my scenario, a tessellation control shader) and pass these values to the fragment shader as 'flat out' variables, i.e. flat varyings, or
b) do nothing special in the vertex (or tess control) shader, and just load the values for every fragment directly in the fragment shader.
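For concreteness, here is a minimal GLSL sketch of the two options. The sampler name u_dataTBO, the indexing by gl_VertexID / gl_PrimitiveID, and the vec4 payload are my assumptions, not the actual setup:

```glsl
#version 150

// Option a) -- vertex shader: fetch once per vertex, pass as a flat varying.
// On Evergreen, the fragment-shader read of v_data becomes INTERP_LOAD_P0.
uniform samplerBuffer u_dataTBO;   // TBO holding the data (name assumed)
flat out vec4 v_data;              // 'flat': no interpolation across the primitive

void main()
{
    v_data = texelFetch(u_dataTBO, gl_VertexID);  // indexing is a placeholder
    gl_Position = vec4(0.0);                      // real transform omitted
}
```

```glsl
#version 150

// Option b) -- fragment shader: fetch the value per fragment instead.
// On Evergreen, this texelFetch() becomes a VFETCH through a texture unit.
uniform samplerBuffer u_dataTBO;
out vec4 fragColor;

void main()
{
    // gl_PrimitiveID gives a per-primitive index in the FS (GL 3.2+)
    vec4 data = texelFetch(u_dataTBO, gl_PrimitiveID);
    fragColor = data;  // placeholder use of the fetched value
}
```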
According to the AMD Evergreen GPU reference doc (*), varyings use the LDS (Local Data Share) memory space of the GPU. LDS is said to be twice as fast as the L1 cache (http://devgurus.amd.com/thread/158895).
For a), I checked the GPU assembly generated by ShaderAnalyzer: varyings use the INTERP_LOAD_P0 instruction, which reads the varying value into a GPU register, so one register is used. INTERP_LOAD appears to be just an LDS load instruction, with no hardware interpolation.
For b), VFETCH is used (this is what texelFetch() compiles to), and one register is likewise used as the destination of the read.
As you can see, what I am worried about is the number of GPU registers being used, since high register pressure can dramatically reduce performance.
I can't think of any reason why using varyings would use more registers - can you?
Of course, I can profile too but sometimes it's good to get some technical insight.
The equivalent of the LDS in NVIDIA terminology seems to be Shared Memory, doesn't it? And the reason Shared Memory is so effective compared to the general L1 cache is that it is optimized for concurrent thread access (i.e. each thread can access its shared-memory storage simultaneously)?
Feel free to tell me where I might be wrong here.
From the GLSL/ARB-assembly compiler's point of view, the result of a texelFetch is a costly, input-dependent value, so the compiler will try to keep it in a register until it is no longer needed. Varyings, by contrast, can be re-loaded into registers as often as you want, which can save registers.
If your tessellation level is very high, though, I guess that with varyings you'll waste many memory copies on potentially non-visible fragments.
Thanks a lot for your insight Ilian.
Varyings do have their own cost, and manually fetching this data in the fragment shader might be beneficial even if deferring the fetches from the VS to the FS makes their number orders of magnitude higher, since those fetches will almost always hit the cache. However, it is difficult to give generic advice. I'd try both, but I believe that with a large number of such attributes the texel fetch in the FS is the better approach (and maybe the only one for a very large number of attributes, as you might run out of interpolants).
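To illustrate the "running out of interpolants" point: with the fetch-in-FS approach, the fragment shader can pull an arbitrarily wide per-primitive record from the TBO without consuming any interpolants at all. This is only a sketch under an assumed layout (four vec4 texels per primitive; all names are mine):

```glsl
#version 150

// Fetch-in-FS taken further: N values per primitive, zero interpolants used.
uniform samplerBuffer u_dataTBO;   // assumed layout: 4 texels per primitive
out vec4 fragColor;

void main()
{
    int base = gl_PrimitiveID * 4;          // start of this primitive's record
    vec4 a = texelFetch(u_dataTBO, base + 0);
    vec4 b = texelFetch(u_dataTBO, base + 1);
    vec4 c = texelFetch(u_dataTBO, base + 2);
    vec4 d = texelFetch(u_dataTBO, base + 3);
    fragColor = a + b + c + d;              // placeholder use of the record
}
```

The equivalent flat-varying version would need four interpolant slots, which is where the hardware limit eventually bites.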
Let's not forget the availability of texture units, though:
On a Radeon HD 7950 (1792 shader units : 112 texture units : 32 ROPs), if VFETCH goes through a texture unit, then when 1792 shader units compete for 112 texture units they'll get at most 1/16 of the throughput (1792 / 112 = 16) of an INTERP_LOAD_P0 version (assuming the latter instruction doesn't go through a texture unit), regardless of cache.
(Though this 1/16 penalty can effectively be masked, provided you have 16 times more threads waiting to be switched to, running other kernels or the same kernel at a different program counter, that won't be using the texture units when they run. But that's rare.)