Post T&L cache questions

I've read everywhere that the post-T&L cache is “FIFO” but I'm not sure how to interpret this. Does it mean that for a list like 4,5,6,7,8 the GPU can only access the element “4”, or does it mean that all elements in the cache can be accessed, but if a new element is added (like 9), then the element “4” will be the one pushed out of the cache?

And what's the size of that cache today on an average card (PC)? 10? 20?
I want to draw lots of NxN grids from triangles, …using drawElements, triangle strips, static draw. I guess within a grid, after every row, I should change the direction of the draw (so after a left-to-right row I do a right-to-left row) so I get more cache hits. …and I would like to know what the “ideal” N is for this?
Thanks!

The latter, “if” the cache is truly FIFO (which it may or may not be) and if the cache is at most 5 elements long (which of course it isn't).
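To illustrate that second interpretation, here's a minimal sketch (plain Python, purely a model of the idea, nothing like how the hardware actually does it) of a 5-entry FIFO cache: every element currently in the cache can be hit, and adding a new element evicts the oldest one.

```python
from collections import deque

def fifo_add(cache, max_size, element):
    """Add an element to a FIFO cache, evicting the oldest entry when full.

    Returns True on a cache hit, False on a miss.
    """
    if element in cache:
        return True          # hit: any cached element is accessible, not just the oldest
    if len(cache) == max_size:
        cache.popleft()      # evict the oldest entry ("4" in the example)
    cache.append(element)
    return False             # miss

cache = deque([4, 5, 6, 7, 8])
fifo_add(cache, 5, 6)   # hit: 6 is accessible even though it isn't the oldest entry
fifo_add(cache, 5, 9)   # miss: 9 is added and 4 is pushed out
# cache is now deque([5, 6, 7, 8, 9])
```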

And what's the size of that cache today on an average card (PC)? 10? 20?

It depends (more in a sec). In general though, with a decent triangle optimizer like Tom Forsyth's, the cache is so big now that its exact size doesn't really matter anymore AFAIK, as long as you optimize for the cache properly.

A bit of history. “Back in the olden days,” when there was fixed-function hardware for T&L, and then dedicated hardware for vertex shaders (i.e. pre-SM4.0), there was explicit memory allocated for the post-T&L vertex cache IIRC. The size (in verts) varied per GPU but ended up around 45 verts just before SM4.0, which is plenty.

Then came SM4.0 and unified stream processors, where each bank of stream processors has its own block of shared memory (~16KB). I'm not sure I ever read definitively whether the vertex cache lives in shared memory or global memory, but I think it's in shared. I know I've read in some GPU programming guide or other that post-SM4.0, the number of vertices that fit in the vertex cache is dynamic, based on the amount of varying data output by the vertex shader per vertex. So it's probably some fixed resource (like shared memory) that's dynamically split up to make best use of that space given your output vertex footprint (in varyings, aka interpolators).
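If that's how it works, the back-of-envelope math is trivial. The sketch below is entirely hypothetical (the pool size, the function, and the 32-bit-float varying assumption are all mine, not from any vendor spec), just to show the “fixed pool divided by per-vertex output footprint” idea:

```python
def cache_capacity(pool_bytes, varying_floats_per_vertex):
    """Hypothetical model: vertices that fit if a fixed on-chip pool is
    split by the vertex shader's output footprint (32-bit float varyings)."""
    bytes_per_vertex = varying_floats_per_vertex * 4
    return pool_bytes // bytes_per_vertex

# e.g. a 16KB pool with 16 output floats per vertex (position + a few
# varyings) would hold 256 post-transform vertices under this model
print(cache_capacity(16 * 1024, 16))  # -> 256
```

The point being: the fatter your vertex shader output, the fewer vertices such a cache could hold.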

So basically, the size of today’s vertex caches is dynamic based on how much data you output per vertex, but should in practice be plenty big enough for decently optimized indexed triangles.

I want to draw lots of NxN grids from triangles. …using drawElements, triangle strips, static draw.

You said triangle strips, which concerns me. Read this first:

I want to draw lots of NxN grids from triangles, …using drawElements, triangle strips, static draw. I guess within a grid, after every row, I should change the direction of the draw (so after a left-to-right row I do a right-to-left row) so I get more cache hits. …and I would like to know what the “ideal” N is for this?

See the bit by Forsyth and Castano on optimizing for regular meshes. If you happened to know exactly how big the cache is, then yes, you could choose your strip width if you're just going to walk back and forth at constant width. But you don't know this, unless you do iterative microbenchmarking to look for performance falloffs (or code your own vertex shader cache in a GLSL compute shader/OpenCL/CUDA). So you want to either pick a conservative size to optimize for, or just use an algorithm that performs well across a range of vertex cache sizes, like Forsyth's.

The post-VS cache in modern GPUs is not a strict FIFO, because vertices are processed in parallel. I think AMD GPUs process up to 64 vertices at the same time and NVIDIA 32. However, the FIFO model is still used for optimizing triangle order, because it's close enough to the way GPUs work, and the way real hardware behaves is too complex to predict caching exactly.

Thanks for the detailed answer, Photon!
“you do iterative microbenching to look for performance falloffs (or code your own vertex shader cache in a GLSL compute shader/OpenCL/CUDA)”
…no, I don't want to do something like that. :) I just want OK performance; it doesn't have to be close to “optimal”.

“* Strippers (Forsyth)”
I'm not sure I get this. Is he saying that strips are bad, that indexing is bad, or that the algorithm is not that important?

“* Optimal Grid Rendering (Castano)”
He prefetches some vertices by using degenerate triangles. Not that I would want to do this, but aren't modern GPUs throwing away the vertices of degenerate triangles (when using drawElements) before they enter the vertex shader?

[QUOTE=Aliii;1255395]“* Strippers (Forsyth)”
I'm not sure I get this. Is he saying that strips are bad, that indexing is bad, or that the algorithm is not that important?[/QUOTE]
He's saying that strips, i.e. triangle stripping as a primary optimization strategy, are dead (especially à la DrawArrays with TRIANGLE_STRIP), and that indexed triangles (DrawElements with TRIANGLES) in an optimized triangle order are where it's at. (Keep in mind this is from 2006, so it has been true for a long time.) That said, he addresses your question at the bottom of his vertex cache optimization write-up here:
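For reference, the indexed-TRIANGLES index buffer he's advocating is straightforward to build for a grid. A sketch (my own code; it assumes an n×n grid of cells with (n+1)×(n+1) vertices stored row-major, which is just one common layout):

```python
def grid_triangle_indices(n):
    """Indices for DrawElements(TRIANGLES) over an n x n cell grid
    with (n+1) x (n+1) vertices stored row-major."""
    stride = n + 1
    indices = []
    for row in range(n):
        for col in range(n):
            v0 = row * stride + col   # top-left corner of the cell
            v1 = v0 + 1               # top-right
            v2 = v0 + stride          # bottom-left
            v3 = v2 + 1               # bottom-right
            indices += [v0, v2, v1,   # two triangles per cell,
                        v1, v2, v3]   # consistent winding
    return indices

# A 1x1 grid (one quad) yields two triangles over vertices 0..3:
print(grid_triangle_indices(1))  # -> [0, 2, 1, 1, 2, 3]
```

An order like this (row by row) already reuses each vertex soon after it's first transformed; a run through Forsyth's optimizer would reorder the triangles further for a cache-size-independent improvement.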

aren't modern GPUs throwing away the vertices of degenerate triangles (when using drawElements) before they enter the vertex shader?

Don’t know for sure. Could be, since it’s easy to detect.

Thanks! I will try drawElements with strips, a primitive restart index, and drawing rows back and forth, …and see what happens. It can't be that bad.
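For anyone wanting to try the same thing, here's a sketch of that index-buffer layout (my own illustration, same row-major vertex layout assumption as above): one triangle strip per row of cells, separated by a primitive restart index, with every other row emitted right-to-left so the row boundary reuses freshly cached vertices. Note that reversing direction flips the winding on alternate rows, so a real renderer would need to account for that (e.g. disable face culling or insert an extra index).

```python
RESTART = 0xFFFF  # restart index, e.g. via glPrimitiveRestartIndex(0xFFFF)

def grid_strip_indices(n):
    """Index buffer for DrawElements(TRIANGLE_STRIP) over an n x n cell
    grid: one strip per row, rows alternating direction, separated by
    a primitive restart index."""
    stride = n + 1
    indices = []
    for row in range(n):
        # even rows left-to-right, odd rows right-to-left
        cols = range(stride) if row % 2 == 0 else reversed(range(stride))
        for col in cols:
            indices += [row * stride + col,        # vertex on this row
                        (row + 1) * stride + col]  # vertex on the next row
        if row != n - 1:
            indices.append(RESTART)                # cut the strip between rows
    return indices
```

For a 2x2 grid this produces `[0, 3, 1, 4, 2, 5, RESTART, 5, 8, 4, 7, 3, 6]`; the second row starts at vertex 5, which was just used at the end of the first row.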