
View Full Version : Vertex cache optimization



_NK47
05-08-2009, 06:38 AM
I read some information about cache optimization before writing here, but I'd still like to know how everybody is dealing with it. It seems like indexed triangle lists are the way to go on today's gfx cards, but I'm kind of lost on how exactly to optimize for the cache (pre/post). Indices in the correct order is self-explanatory; does anybody have experience with this to share, i.e. what's most important and how to achieve good results?

Dark Photon
05-08-2009, 06:51 AM
cache optimization... would like to know how everybody is dealing with it... anybody have experience with this to share, what's most important and how to achieve good results?
We've been using Tom Forsyth's Linear-Speed Vertex Cache Optimisation (http://home.comcast.net/~tom_forsyth/papers/fast_vert_cache_opt.html). Very fast, yields good results, easy to write, and doesn't care about the topology of what you're trying to render.

Don't even bother with NVTriStrip. It's slow, has some bugs, and doesn't yield very optimal results by comparison (at least that was the case when I tried it).

If you're rendering regular grids, check out Ignacio Castaño's Optimal Grid Rendering (http://castano.ludicon.com/blog/2009/02/02/optimal-grid-rendering/). More on that here (http://castano.ludicon.com/blog/2009/01/11/vertex-cache-optimization/) and Forsyth's comments on that here (http://home.comcast.net/~tom_forsyth/blog.wiki.html#%5b%5bRegular mesh vertex cache ordering%5d%5d).

And regardless, check out the ACMR (average cache miss ratio) stats on Ignacio's page. Also see his spiel (http://castano.ludicon.com/blog/2009/01/29/acmr/) on why ACMR might not be the absolute best metric to use (but is better than most), suggesting average transform to vertex ratio (ATVR) instead.

While we're on the subject, this page (http://home.comcast.net/~tom_forsyth/blog.wiki.html#Strippers) is a humorous but informative must-read.

Also regarding your "how to exactly optimize for the cache (pre/post)" query, the above links will probably make it obvious. But the top-priority cache to optimize for is the post-vertex shader cache (aka post-T&L cache), because it lets you skip whole vertex shader runs! It's basically a FIFO cache of vertex shader outputs. The pre-vertex shader cache (aka pre-T&L cache) is nothing more than just a LRU memory prefetch cache of vertex attribute data (vertex shader inputs). So whenever you see someone state "optimize for the vertex cache" they're typically talking about the post-vertex shader cache.
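
If you want a quick way to see where a given index list stands, the post-T&L cache is easy enough to simulate offline. Here's a minimal sketch (mine, not from any of the links above) that treats it as a plain FIFO of a fixed size and computes ACMR. Real hardware isn't exactly a plain FIFO, but this is the usual model these tools assume:

#include <algorithm>
#include <deque>
#include <vector>

// Simulate a fixed-size FIFO post-transform cache over an indexed triangle
// list and return the ACMR (cache misses per triangle).
double computeACMR(const std::vector<unsigned>& indices, size_t cacheSize)
{
    std::deque<unsigned> cache;   // front = oldest entry
    size_t misses = 0;

    for (size_t i = 0; i < indices.size(); ++i)
    {
        unsigned idx = indices[i];
        if (std::find(cache.begin(), cache.end(), idx) == cache.end())
        {
            ++misses;                   // this vertex has to be re-transformed
            cache.push_back(idx);
            if (cache.size() > cacheSize)
                cache.pop_front();      // evict the oldest vertex
        }
    }
    return double(misses) / double(indices.size() / 3);
}

ATVR is the same miss count divided by the number of unique vertices instead of the number of triangles, which is why it bottoms out at 1.0.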

_NK47
05-08-2009, 07:34 AM
Wow, that's quite some info, thanks Dark Photon! Many of the links I'd read before; I wanted to know more before going into practice. All the links help a lot. I guess I'll take another look at Linear-Speed Vertex Cache Optimisation and start coding.

Groovounet
05-08-2009, 07:42 AM
One other point to explore is fragment optimization too.

The idea is to sort the triangles so that the first triangles in the list are the ones most likely to be visible, so that the z-test will then discard more fragments ...

I haven't explored this territory yet, but it's definitely the next level of vertex cache optimisation. ^_^

ZbuffeR
05-08-2009, 10:29 AM
@Groovounet: Huh ?
Unless I missed something, this has nothing to do with the vertex cache! The z-test cannot help for vertices. It only optimizes away complex fragment shader operations, which is valuable, but different.

Ilian Dinev
05-08-2009, 11:02 AM
Imho, using the strengths of cards with unified shaders and z-buffer facilities will yield much better results than optimizing strips and indexes forever.
Do an initial depth-pass, possibly with helper quads that are simplified versions of the walls in a level (wall indexes z-sorted on the CPU), to quickly lay down rough (but compression-friendly) Z values. This will limit fragment overdraw to 4x. Triangles can be quickly culled before the GPU attempts to generate any fragments. Move calculations from the vertex to the fragment shader. GPUs have an inherent limit on the number of triangles they can set up per cycle (1), so you can then put all the unified units to work on fragments.
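
A rough sketch of that depth-only pre-pass in plain OpenGL (drawOccluders/drawScene are placeholders for your own code, and the occluder geometry has to sit at or behind the real surfaces to stay conservative):

// Pass 1: depth only, using the cheap simplified occluder geometry,
// submitted roughly front-to-back (sorted on the CPU).
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);   // no color writes
glEnable(GL_DEPTH_TEST);
glDepthMask(GL_TRUE);
glDepthFunc(GL_LESS);
drawOccluders();

// Pass 2: normal rendering of the real scene; fragments that end up behind
// the occluder depth fail the z-test before the expensive shading runs.
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
drawScene();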

Groovounet
05-08-2009, 04:33 PM
Yes, mostly nothing to do with vertex caching, but when you reorganize the vertices for vertex caching you can also think about optimizing the triangle order to reduce the number of fragments that pass the z-test. It's kind of a probability thing.

With a sphere, all the triangles have the same probability of being visible. With a character, a car, or any complex mesh, some triangles have a higher probability of being visible than others. You could render those highly-probable-to-be-visible triangles first to increase the number of z-tests that fail for whatever comes after.
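
Just to illustrate the idea (this is a naive heuristic of my own, not a recommendation): push the triangles whose centroids are farthest from the mesh centroid to the front of the index list, on the guess that they are more likely to lie on the visible hull. In practice you'd have to balance this against the cache ordering, e.g. apply it per chunk rather than per triangle.

#include <algorithm>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Naive heuristic: reorder whole triangles so the ones farthest from the mesh
// centroid (more likely to be on the visible hull) are drawn first.
void sortTrianglesByVisibilityGuess(const std::vector<Vec3>& positions,
                                    std::vector<unsigned>& indices)
{
    Vec3 c = { 0.0f, 0.0f, 0.0f };
    for (size_t i = 0; i < positions.size(); ++i)
    { c.x += positions[i].x; c.y += positions[i].y; c.z += positions[i].z; }
    c.x /= positions.size(); c.y /= positions.size(); c.z /= positions.size();

    const size_t triCount = indices.size() / 3;
    std::vector<size_t> order(triCount);
    for (size_t t = 0; t < triCount; ++t) order[t] = t;

    // Squared distance from the mesh centroid to a triangle's centroid.
    auto triDist2 = [&](size_t t) {
        float cx = 0, cy = 0, cz = 0;
        for (int k = 0; k < 3; ++k) {
            const Vec3& p = positions[indices[3 * t + k]];
            cx += p.x; cy += p.y; cz += p.z;
        }
        cx = cx / 3 - c.x; cy = cy / 3 - c.y; cz = cz / 3 - c.z;
        return cx * cx + cy * cy + cz * cz;
    };

    std::sort(order.begin(), order.end(),
              [&](size_t a, size_t b) { return triDist2(a) > triDist2(b); });

    std::vector<unsigned> sorted;
    sorted.reserve(indices.size());
    for (size_t t = 0; t < triCount; ++t)
        for (int k = 0; k < 3; ++k)
            sorted.push_back(indices[3 * order[t] + k]);
    indices.swap(sorted);
}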

Hampel
05-11-2009, 01:31 AM
BTW, is there info available somewhere about current and previous HW generations' (at least for NVidia and ATI) post-transform vertex cache sizes?

Dark Photon
05-11-2009, 07:06 AM
BTW, is there info available somewhere about current and previous HW generations' (at least for NVidia and ATI) post-transform vertex cache sizes?

GeForce 2 / 256 / 4 MX: 16
GeForce 3 / GeForce 4 Ti: 24
GeForceFX / GeForce 6: 24
GeForce 7: 45
...

Once you're up at these counts, the exact size doesn't matter as much, particularly if you use an alg like Forsyth's that doesn't care about the specific count. That's a good thing, as performance then degrades gracefully when you run on older hardware (though at this point, we're talking old, old HW). These days you can expect at least 24.

Used to be a tool out there called VCacheDetector (http://www.clootie.ru/delphi/dxtools.html) that would run tests and tell you the post-vertex shader cache length for a GPU.

Jackis
05-12-2009, 02:30 AM
A little remark.
I'm quite sure (because I've tested and used it myself) that on GeForce FX/6/7 the cache size equals 24.
On G80 and up, the cache size depends on actual varying usage. The more varyings used, the smaller the cache. But you can always count on 32 or greater (or something like that).
I used 24 for pre-G80 and 32 for G80 and up as the cache size for Tom's algo.

_NK47
05-12-2009, 02:51 AM
Didn't know that varyings take up cache size. Good info.
Seems like 32 varyings = no cache.

Jackis
05-12-2009, 04:06 AM
_NK47,
I didn't test such a case (with a full varyings load). But the cache size really does shrink as varying usage increases; not uniformly, but in an uneven manner. Actually, when I was testing for cache size (about a year and a half ago), I didn't know about the 32 attribute uniforms and was only using the standard COLOR0-1 and TEXCOORD0-7 notation.
I can only suspect that using the full bulk of varyings will lead to something like 12-16 vertices.
Also, geometry shaders are a big deal here: heavy use of them for vertex generation doesn't seem to be a good scenario for benefiting from the post-T&L cache.

MarkS
08-13-2009, 12:45 PM
I have a really stupid question (at least I feel stupid for asking...).

I went over Tom's paper and I thought I understood what he is doing. However, looking at his "complete" source code, I find myself scratching my head. What am I supposed to do with that? It assigns a score to a vertex and then what? How do I reorder the face list? Doesn't it make more sense to assign the faces a score? Not only that, but he did not include the data structures. Should a pointer to a vcache_vertex_data structure be included in each vertex struct or is it the vertex struct? It only has two values, NumActiveTris and CacheTag?

This leaves me with more questions than answers. Please feel free to laugh at me while offering advice. ;)

Thanks,
Mark

Jackis
08-14-2009, 05:16 AM
MarkS,
this link should help you more: http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=235320

Brolingstanz
08-14-2009, 08:01 AM
Another good read is "Optimization of Mesh Locality for Transparent Vertex Caching" by Hugues Hoppe. Hefty bit on reordering strategies.

Dark Photon
08-14-2009, 09:38 PM
It assigns a score to a vertex and then what?

Compute tri scores from vertex scores
MAIN LOOP
- Pick off the highest-score triangle (or approx highest, if using a high-score cache)
- Add it to the "draw" list
- Decrement the active (not-yet-drawn) tri count for each of its vertices
- Add the tri's verts to the simulated vertex cache
- Update changed vertex scores
- Update changed tri scores

You're gonna have to read it a few times to figure it out and get the O(n) perf, but you can do it.
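
To make the scoring half concrete, here's a sketch of the vertex score function. The constants are the ones from Forsyth's paper as I remember them, so double-check against the original rather than trusting my numbers; a triangle's score is just the sum of its three vertices' scores.

#include <cmath>

const int   MaxSizeVertexCache = 32;     // simulated FIFO length
const float CacheDecayPower    = 1.5f;
const float LastTriScore       = 0.75f;
const float ValenceBoostScale  = 2.0f;
const float ValenceBoostPower  = 0.5f;

// cachePosition: -1 if not in the simulated cache, else 0..MaxSizeVertexCache-1
// numActiveTris: triangles using this vertex that haven't been emitted yet
float findVertexScore(int cachePosition, int numActiveTris)
{
    if (numActiveTris == 0)
        return -1.0f;                    // nothing left to emit; never pick it

    float score = 0.0f;
    if (cachePosition < 0)
    {
        // Not in the cache: no cache bonus at all.
    }
    else if (cachePosition < 3)
    {
        // The verts of the most recent triangle get a fixed, slightly lower
        // score so the algorithm doesn't just chase one strip forever.
        score = LastTriScore;
    }
    else
    {
        // Linear falloff with cache position, sharpened by a power curve.
        score = 1.0f - float(cachePosition - 3) / float(MaxSizeVertexCache - 3);
        score = std::pow(score, CacheDecayPower);
    }

    // Boost vertices with few remaining triangles so isolated verts get
    // mopped up early instead of causing stray misses at the end.
    score += ValenceBoostScale * std::pow(float(numActiveTris), -ValenceBoostPower);
    return score;
}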

MarkS
08-15-2009, 01:44 AM
Well, that explains my confusion. I wasn't looking at this as part of the rendering pipeline, but as a pre/post processing technique.

What I'm doing is exporting my models from Blender. However, Blender exports the faces in an odd manner. At first I thought the vertices and faces were being exported randomly, but upon further inspection, they are being exported in 4-6 poly clumps. What I was trying to do was write a post-processing utility that would take the model and "correct" it.

Of course, I haven't the slightest idea what Blender is doing or why. It may, in fact, be exporting the faces in a cache-optimized fashion. I'm waiting for a reply in the Blender forums on this matter.

Regardless, I don't think this algorithm is what I need at this point, although I can see the need down the road.

Alfonse Reinheart
08-15-2009, 02:10 AM
I wasn't looking at this as part of the rendering pipeline, but as a pre/post processing technique.

You can still use it as a post-processing technique. You simply store the triangle's vertices in a vertex list rather than drawing it. Simple.


Of course, I haven't the slightest idea what Blender is doing or why. It may, in fact, be exporting the faces in a cache-optimized fashion.

It may not. It certainly isn't going to hurt to just fix it yourself.
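
As a sketch of what that post-process boils down to once the optimizer has handed back a reordered index list (the function name here is made up): rewrite the index buffer and, while you're at it, reorder the vertex buffer into first-use order so the pre-T&L fetch cache benefits too.

#include <vector>

// Remap the vertex buffer into first-use order given an already-optimized
// index list.  The post-T&L cache only cares about the index order; this step
// helps the pre-T&L (attribute fetch) cache.  Unreferenced vertices are dropped.
template <typename Vertex>
void remapVerticesToFirstUse(std::vector<Vertex>& vertices,
                             std::vector<unsigned>& indices)
{
    const unsigned unused = ~0u;
    std::vector<unsigned> remap(vertices.size(), unused);
    std::vector<Vertex> reordered;
    reordered.reserve(vertices.size());

    for (size_t i = 0; i < indices.size(); ++i)
    {
        unsigned old = indices[i];
        if (remap[old] == unused)
        {
            remap[old] = (unsigned)reordered.size();  // first time we see it
            reordered.push_back(vertices[old]);
        }
        indices[i] = remap[old];                      // rewrite the index
    }
    vertices.swap(reordered);
}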

Dark Photon
08-15-2009, 12:27 PM
Well, that explains my confusion. I wasn't looking at this as part of the rendering pipeline, but as a pre/post processing technique.
No, you were on the right track before. It "is" a pre-processing technique you'd run post-modeling but pre-realtime to order the triangles in your batches for much better post-vertex-shader cache performance (i.e. fewer vertex shader runs on the graphics card).

Of course, you could just as well run this on a background thread if you're dynamically building batches in your realtime app.

When I said add to the "draw" list, I'm talking about the list of triangles/verts you're building for drawing later.

bertgp
08-17-2009, 08:15 AM
By the way, if you're comfortable using DirectX, there is an API call which optimizes the order of the vertices in a mesh. You could write a very simple app that reads your model into the DirectX data structure, calls the "Optimize" method (I don't remember its exact name), and writes the result back to a new file.

I don't know if it uses the same algorithm as described above, but it's already all done and fully debugged.
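
If memory serves, it's ID3DXMesh::OptimizeInplace with the D3DXMESHOPT_VERTEXCACHE flag; something roughly like this (unverified, written from memory, so check the D3DX9 docs):

#include <d3dx9.h>
#include <vector>

// Reorder the mesh faces in place for post-transform cache reuse.
HRESULT optimizeForVertexCache(ID3DXMesh* mesh)
{
    std::vector<DWORD> adjacency(mesh->GetNumFaces() * 3);
    HRESULT hr = mesh->GenerateAdjacency(1e-6f, &adjacency[0]);
    if (FAILED(hr))
        return hr;

    return mesh->OptimizeInplace(D3DXMESHOPT_VERTEXCACHE | D3DXMESHOPT_COMPACT,
                                 &adjacency[0], NULL, NULL, NULL);
}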

_NK47
08-18-2009, 03:28 AM
Aren't there some ready-to-use classes for that? It seems NVTriStrip can also output lists instead of strips, or there's AMD Tootle?

Dark Photon
08-18-2009, 09:09 AM
Aren't there some ready-to-use classes for that? It seems NVTriStrip can also output lists instead of strips, or there's AMD Tootle?
Last time I messed with NVTriStrip (3Q06), the alg it used was expensive and used mesh topology to build the triangle list. It blows up in cases like two intersecting planes with a shared axis (std tree technique). It appears (http://developer.nvidia.com/object/nvtristrip_library.html) it was last updated 2004, so that's probably still the case.

Implementing Forsyth's optimizer is well worth the time. I think I've seen other implementations of it elsewhere on the net. Try googling ("forsyth triangle optimization"). Failing that, feel free to ask questions, MarkS, and I'll help you out. It isn't really that hard, and actually kinda fun.

Up at the top of google hits, here's one link, with full source code:

* Triangle Order Optimization (http://gameangst.com/?p=9) (source code (http://gameangst.com/wp-content/uploads/2009/03/forsythtriangleorderoptimizer.cpp))

I haven't verified his implementation, but his times look really suspect.

Here's another one:

* Vertex cache optimization library (http://www.gamedev.net/community/forums/topic.asp?topic_id=542635) (source code (http://code.google.com/p/vcacne/))

Dark Photon
08-18-2009, 09:30 AM
<deleted>

_NK47
08-18-2009, 09:45 AM
Dark Photon thanks for the effort and suggestions.

Jackis
08-19-2009, 06:14 AM
Here are our versions of that algorithm (quite fast for me, but can be done much faster).
http://slil.ru/27912469

_NK47
08-19-2009, 07:24 AM
Thank you Jackis.
While still reading more on this topic, I found some ACMR vs. ATVR comparisons here (http://castano.ludicon.com/blog/2009/02/02/optimal-grid-rendering/).
It seems like Forsyth's algorithm is quite good in many situations, considering that on today's cards (and less current ones) the vertex cache size is > 32.

Itun.itu
10-12-2010, 10:04 AM
Tell me please, what MaxSizeVertexCache constant do you use in Forsyth's algorithm? 32?

Dark Photon
10-13-2010, 09:23 AM
Around that.

Look above for what it was 5-8 generations of GPU ago. Then consider that the incremental improvement once you get into sizes that large isn't that much, and if you pick a number "too big" you thrash the cache. So those numbers should be decent sizes to optimize for.

I think I read someplace that on recent GPUs, the post-vertex-shader cache size effectively varies depending on the amount of "varying" data that needs to be passed between the shaders (that is, it's probably tied to a specific maximum amount of memory in bytes, not in verts). But I don't remember where. Anybody got a ptr, or know differently? If true, there is no official "best" number for today's GPUs. You just pick a "really good" number, like those old numbers, ~32.

Itun.itu
10-13-2010, 06:05 PM
I have implemented Forsyth's algorithm; ACMR changed from 2.0 to 0.67, but the FPS and rendering time for the model didn't change. What can it be?

Dark Photon
10-13-2010, 06:20 PM
I have implemented Forsyth's algorithm; ACMR changed from 2.0 to 0.67, but the FPS and rendering time for the model didn't change. What can it be?
That just suggests that you're probably not vertex-transform bound. Other possibilities: fill, texture fetch, setup, state change, batch submit, etc. -- some are GPU bottlenecks, some are CPU bottlenecks. Lots of ptrs out there for identifying where the bottleneck is. Start at the end of the pipe and work your way back (e.g. start by shrinking the window). But first, flip rendering of your model on and off and make sure it dominates the frame time. If not, you may need to render it a number of times to pump the frame time up so you can focus on it specifically.
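
One crude but easy way to check whether the model's draw call is even on the critical path (a sketch in C++; translate to JOGL as needed, and drawModel here is a placeholder for whatever issues your glDrawElements calls). Timer queries are the nicer tool, but this needs nothing extra:

#include <chrono>
#include <GL/gl.h>

// Wall-clock the submit + execute cost of just this draw, repeated to swamp
// noise.  glFinish() drains the GPU so nothing else gets counted.
double timeDrawMilliseconds(void (*drawModel)(), int repeats)
{
    glFinish();                                        // finish prior work
    std::chrono::high_resolution_clock::time_point t0 =
        std::chrono::high_resolution_clock::now();
    for (int i = 0; i < repeats; ++i)
        drawModel();
    glFinish();                                        // wait for completion
    std::chrono::high_resolution_clock::time_point t1 =
        std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / repeats;
}

If that number doesn't move when ACMR drops from 2.0 to 0.67, the vertex transform genuinely isn't the bottleneck for that mesh.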

BTW, ATVR is probably a better measure of vertex cache efficiency. For more details, see Ignacio's post here (http://castano.ludicon.com/blog/2009/01/29/acmr/). 1.0 being optimal of course.

Itun.itu
10-14-2010, 09:21 AM
I have done all of this.
I have two meshes (one index buffer and one vertex buffer for each mesh). Before rendering I optimize them. Then in each frame I have a loop with two iterations in which there is a glDrawElements call. I don't have materials, textures or anything else. My scene is very simple.
Note: I use JOGL.

Dark Photon
10-14-2010, 11:33 AM
Gonna need more details to help you, but given that you said Java and simple (suggesting small batches probably), I would suspect you might be CPU bound submitting batches or state changes.

If it's simple, convert it to C for testing and see if there's any perf difference. If not, you've got something simple you can post for folks to review/try.