Vertex cache optimization

I read some information about cache optimization before posting here, but I'd still like to know how everybody deals with it. It seems like indexed triangle lists are the way to go on today's graphics cards, but I'm kind of lost on how exactly to optimize for the cache (pre/post). Putting the indices in the correct order is self-explanatory, but does anybody have experience to share on what's most important and how to achieve good results?


We’ve been using Tom Forsyth’s Linear-Speed Vertex Cache Optimisation. Very fast, yields good results, easy to write, and doesn’t care about the topology of what you’re trying to render.

Don’t even bother with NVTriStrip. It’s slow, has some bugs, and doesn’t yield very optimal results by comparison (at least that was the case when I tried it).

If you’re rendering regular grids, check out Ignacio Castaño’s Optimal Grid Rendering. More on that here, and Forsyth’s comments on it here.

And regardless, check out the ACMR (average cache miss ratio) stats on Ignacio’s page. Also see his spiel on why ACMR might not be the absolute best metric to use (though it’s better than most); he suggests average transform to vertex ratio (ATVR) instead.

While we’re on the subject, this page is a humorous but informative must-read.

Also, regarding your “how to exactly optimize for the cache (pre/post)” query, the above links will probably make it obvious. But the top-priority cache to optimize for is the post-vertex shader cache (aka post-T&L cache), because it lets you skip whole vertex shader runs! It’s basically a FIFO cache of vertex shader outputs. The pre-vertex shader cache (aka pre-T&L cache) is nothing more than an LRU memory prefetch cache of vertex attribute data (vertex shader inputs). So whenever you see someone say “optimize for the vertex cache,” they’re typically talking about the post-vertex shader cache.
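
To make the FIFO behaviour concrete, here is a minimal sketch (my own, not from any of the linked pages) that simulates a post-T&L cache and computes ACMR for an index list. The 32-entry size is just an assumed default; real hardware varies, per the figures later in this thread:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Simulate the post-T&L vertex cache as a FIFO and compute ACMR,
// i.e. cache misses (= vertex shader runs) per triangle.
double acmr(const std::vector<unsigned>& indices, std::size_t cacheSize = 32)
{
    std::deque<unsigned> fifo;   // front = oldest entry
    std::size_t misses = 0;

    for (unsigned idx : indices) {
        bool hit = false;
        for (unsigned cached : fifo)
            if (cached == idx) { hit = true; break; }

        if (!hit) {
            ++misses;                    // the vertex shader has to run
            fifo.push_back(idx);
            if (fifo.size() > cacheSize)
                fifo.pop_front();        // evict the oldest entry
        }
        // On a hit nothing is updated. That is exactly what makes this
        // a FIFO rather than an LRU cache.
    }
    return static_cast<double>(misses) / (indices.size() / 3);
}
```

Worst case is 3.0 (every index misses); a well-ordered large closed mesh approaches 0.5, which is why Ignacio’s pages quote numbers in that range.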

Wow, that's quite some info. Thanks, Dark Photon! I'd read many of the links before, but wanted to know more before putting it into practice. All the links help a lot. I guess I'll take another look at Linear-Speed Vertex Cache Optimisation and start coding.

One other point to explore is fragment optimization too.

The idea is to sort the triangles so that the first triangles in the list are the ones most likely to be seen, so that the z-test will then discard more fragments.

I haven’t explored this territory yet, but it’s definitely the next level of vertex cache optimisation. :slight_smile:

@Groovounet: Huh?
Unless I missed something, this has nothing to do with the vertex cache! The z-test cannot help for vertices. It only optimizes away complex fragment shader operations, which is valuable, but different.

Imho, using the strengths of cards with unified shaders and z-buffer facilities will yield much better results than optimizing strips and indices forever.
Do an initial depth-only pass, possibly with helper quads that are simplified versions of the walls in a level (wall indices z-sorted on the CPU), to quickly lay down rough (but compression-friendly) Z values. This limits fragment overdraw to roughly 4x, and triangles can be quickly culled before the GPU attempts to generate any fragments. Then move calculations from the vertex to the fragment shader: GPUs have an inherent limit on the number of triangles they can set up per cycle (one), so you can keep all the unified units busy with fragment work instead.

Yes, it has mostly nothing to do with vertex caching, but when you reorganize the vertices for vertex caching you can also think about optimizing the triangle order to reduce the number of fragments that pass the z-test. It’s kind of a probability thing.

With a sphere, all the triangles have the same probability of being visible. With a character, a car, or any complex mesh, some triangles have a higher probability of being visible than others. You’d want to render those highly-probable-visible triangles first to increase the number of z-tests that fail.
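
Just to make the idea concrete, here is my own crude sketch, not Groovounet's actual method: one cheap proxy for "probability of being visible" is distance from the mesh centroid, so the outermost triangles get drawn first. A real estimate would need visibility sampling from many viewpoints (rasterizing the mesh from sampled directions and counting how often each triangle survives), which this deliberately skips:

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };

// Order triangles so those farthest from the mesh centroid come first,
// as a crude stand-in for "most likely to be visible". Heuristic only.
std::vector<std::array<unsigned, 3>>
sortByVisibilityProxy(const std::vector<Vec3>& verts,
                      std::vector<std::array<unsigned, 3>> tris)
{
    Vec3 c{0, 0, 0};
    for (const Vec3& v : verts) { c.x += v.x; c.y += v.y; c.z += v.z; }
    c.x /= verts.size(); c.y /= verts.size(); c.z /= verts.size();

    // Squared distance of a triangle's centroid from the mesh centroid.
    auto dist2 = [&](const std::array<unsigned, 3>& t) {
        Vec3 m{0, 0, 0};
        for (unsigned i : t) { m.x += verts[i].x; m.y += verts[i].y; m.z += verts[i].z; }
        m.x /= 3; m.y /= 3; m.z /= 3;
        double dx = m.x - c.x, dy = m.y - c.y, dz = m.z - c.z;
        return dx * dx + dy * dy + dz * dz;
    };

    std::stable_sort(tris.begin(), tris.end(),
        [&](const std::array<unsigned, 3>& a, const std::array<unsigned, 3>& b) {
            return dist2(a) > dist2(b);   // outermost triangles first
        });
    return tris;
}
```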

BTW, is there info available somewhere about current and previous hardware generations' post-transform vertex cache sizes (at least for NVidia and ATI)?

GeForce 2 / 256 / 4 MX: 16
GeForce 3 / GeForce 4 Ti: 24
GeForceFX / GeForce 6: 24
GeForce 7: 45
…

Above these counts, size doesn’t matter as much, particularly if you use an algorithm like Forsyth’s that doesn’t depend on the specific count. That’s a good thing: performance degrades gracefully when you run on older hardware (though at this point, we’re talking old, old HW). These days you can expect at least 24.

There used to be a tool out there called VCacheDetector that would run tests and tell you the post-vertex shader cache length for a GPU.

A little remark.
I’m quite sure (because I’ve tested and used it myself) that on GeForceFX/6/7 the cache size equals 24.
On G80 and up, the cache size depends on actual varyings usage: the more varyings used, the smaller the cache. But you can always count on 32 or greater (or something like that).
I used 24 for pre-G80 and 32 for G80 and up as the cache size for Tom’s algorithm.

Didn’t know that varyings take up cache size. Good info.
Seems like 32 varyings = no cache.

_NK47
I didn’t test such a case (with a full varyings load), but the cache size really does decrease as varyings usage increases: not uniformly, but in an uneven manner. Actually, when I was testing for cache size (about a year and a half ago), I didn’t know about 32 attribute uniforms and was only using the standard COLOR0-1 and TEXCOORD0-7 notation.
I can only suspect that using the full varyings bulk would lead us to something like 12-16 vertices.
Also, there’s a big caveat with geometry shaders: using them for vertex generation hardly seems like a good scenario for taking benefit from the post-T&L cache.

I have a really stupid question (at least I feel stupid for asking…).

I went over Tom’s paper and thought I understood what he is doing. However, looking at his “complete” source code, I find myself scratching my head. What am I supposed to do with it? It assigns a score to a vertex, and then what? How do I reorder the face list? Wouldn’t it make more sense to assign the faces a score? Not only that, but he didn’t include the data structures. Should a pointer to a vcache_vertex_data structure be included in each vertex struct, or is it the vertex struct itself? It only has two members, NumActiveTris and CacheTag.

This leaves me with more questions than answers. Please feel free to laugh at me while offering advice. :wink:

Thanks,
Mark

MarkS,
this link should help you more: http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=235320

Another good read is “Optimization of Mesh Locality for Transparent Vertex Caching” by Hugues Hoppe. It has a hefty section on reordering strategies.

Compute tri scores from vertex scores
MAIN LOOP

  • Pick off the highest-score triangle (or approx highest, if using high score cache)
  • Add to “draw” list
  • Adjust num tris for each vertex not drawn
  • Add tri to vertex cache
  • Update changed vertex scores
  • Update changed tri scores

You’re gonna have to read it a few times to figure it out and get the O(n) perf, but you can do it.
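
The loop above can be written out as a compact sketch. This uses the scoring constants from Forsyth's write-up (cache-position decay power 1.5, last-triangle score 0.75, valence boost 2.0 * valence^-0.5, modeled cache size 32), but for clarity it rescans every remaining triangle each step, so it's roughly O(n²); the paper explains the candidate-set trick that gets it to O(n):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

static const int    kCacheSize    = 32;
static const double kDecayPower   = 1.5, kLastTriScore = 0.75;
static const double kBoostScale   = 2.0, kBoostPower   = 0.5;

// Forsyth-style vertex score: recently-used vertices score high, and
// low-valence vertices get a boost so isolated fans get finished off.
static double vertexScore(int cachePos, int activeTris)
{
    if (activeTris == 0) return -1.0;             // no remaining tri uses it
    double score = 0.0;
    if (cachePos >= 0) {
        if (cachePos < 3) score = kLastTriScore;  // vertex of the last tri
        else score = std::pow(1.0 - double(cachePos - 3) / (kCacheSize - 3),
                              kDecayPower);
    }
    return score + kBoostScale * std::pow(double(activeTris), -kBoostPower);
}

// Greedily reorder the triangle list (3 indices per tri, flat array)
// for better post-T&L cache behaviour. Returns the new index list.
std::vector<unsigned> optimizeFaces(const std::vector<unsigned>& indices,
                                    std::size_t vertexCount)
{
    std::size_t triCount = indices.size() / 3;
    std::vector<int> activeTris(vertexCount, 0);  // valence per vertex
    for (unsigned i : indices) ++activeTris[i];

    std::vector<int>      cachePos(vertexCount, -1);  // -1 = not cached
    std::vector<bool>     drawn(triCount, false);
    std::vector<unsigned> cache;                      // front = most recent
    std::vector<unsigned> out;
    out.reserve(indices.size());

    for (std::size_t n = 0; n < triCount; ++n) {
        // Pick the live triangle whose three vertex scores sum highest.
        std::size_t best = 0; double bestScore = -1e9;
        for (std::size_t t = 0; t < triCount; ++t) {
            if (drawn[t]) continue;
            double s = 0.0;
            for (int k = 0; k < 3; ++k) {
                unsigned v = indices[3 * t + k];
                s += vertexScore(cachePos[v], activeTris[v]);
            }
            if (s > bestScore) { bestScore = s; best = t; }
        }

        drawn[best] = true;                       // add tri to the draw list
        for (int k = 0; k < 3; ++k) {
            unsigned v = indices[3 * best + k];
            out.push_back(v);
            --activeTris[v];                      // one fewer tri needs v
            // Push v to the front of the modeled cache.
            cache.erase(std::remove(cache.begin(), cache.end(), v),
                        cache.end());
            cache.insert(cache.begin(), v);
        }
        if (cache.size() > std::size_t(kCacheSize))
            cache.resize(kCacheSize);             // evict past the model size
        // Refresh cache positions for scoring.
        for (std::size_t p = 0; p < vertexCount; ++p) cachePos[p] = -1;
        for (std::size_t p = 0; p < cache.size(); ++p) cachePos[cache[p]] = int(p);
    }
    return out;
}
```

The output is the same set of triangles, each with its original winding, just in a cache-friendlier order, so it can be dropped straight into an index buffer.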

Well, that explains my confusion. I wasn’t looking at this as part of the rendering pipeline, but as a pre/post processing technique.

What I’m doing is exporting my models from Blender. However, Blender exports the faces in an odd manner. At first I thought the vertices and faces were being exported randomly, but on further inspection, they are being exported in 4 - 6 poly clumps. What I was trying to do was write a post-processing utility that would take the model and “correct” it.

Of course, I haven’t the slightest idea what Blender is doing or why. It may, in fact, be exporting the faces in a cache-optimized fashion. I’m waiting for a reply in the Blender forums on this matter.

Regardless, I don’t think this algorithm is what I need at this point, although I can see the need down the road.

I wasn’t looking at this as part of the rendering pipeline, but as a pre/post processing technique.

You can still use it as a post-processing technique. You simply store the triangle’s vertices in a vertex list rather than drawing it. Simple.
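
For the post-processing side, once you have the optimized triangle order, you would typically also rewrite the vertex buffer in first-use order, which helps the pre-T&L fetch cache by making attribute reads mostly sequential. A sketch of that step (the Vertex struct here is a placeholder; substitute your own layout):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vertex { float px, py, pz; };   // placeholder vertex layout

// Rewrite the vertex buffer so vertices appear in the order the
// (already cache-optimized) index list first references them, and
// patch the indices to match. Drops unreferenced vertices.
void remapVertices(std::vector<unsigned>& indices,
                   std::vector<Vertex>& vertices)
{
    const unsigned kUnmapped = ~0u;
    std::vector<unsigned> remap(vertices.size(), kUnmapped);
    std::vector<Vertex> newVerts;
    newVerts.reserve(vertices.size());

    for (unsigned& idx : indices) {
        if (remap[idx] == kUnmapped) {            // first use of this vertex
            remap[idx] = unsigned(newVerts.size());
            newVerts.push_back(vertices[idx]);
        }
        idx = remap[idx];                         // rewrite index in place
    }
    vertices.swap(newVerts);
}
```

Run the triangle reorder first, then this, then write both buffers back out to your model file.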

Of course, I haven’t the slightest idea what Blender is doing or why. It may, in fact, be exporting the faces in a cache-optimized fashion.

It may not. It certainly isn’t going to hurt to just fix it yourself.

No, you were on the right track before. It “is” a pre-processing technique you’d run post-modeling but pre-realtime to order the triangles in your batches for much better post-vertex-shader cache performance (i.e. fewer vertex shader runs on the graphics card).

Of course, you could just as well run this on a background thread if you’re dynamically building batches at runtime.

When I said add to the “draw” list, I’m talking about the list of triangles/verts you’re building for drawing later.

By the way, if you’re comfortable using DirectX, there is an API call which optimizes the order of the vertices in a mesh. You could write a very simple app that reads your model into the DirectX data structure, calls the “Optimize” method (I don’t remember its name), and writes the result back to a new file.

I don’t know if it uses the same algorithm as described above, but it’s already all done and fully debugged.