Optimization for vertex cache

I wrote a small app to optimise my indexed meshes… I implemented an index FIFO cache class which should emulate the hardware cache. My optimising algorithm is fairly simple: I start with the first face and then pick the “cheapest” face (the one that needs the fewest new cache entries) as the next one, repeating this until all faces have been processed.
As a result, there is no face that has to write more than one vertex into the cache. Cache writes before the optimisation amounted to about 70% of all faces, after the optimisation only 33%. But my performance dropped by about 7%! Why?
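Roughly, the kind of FIFO simulation I use for counting (just a sketch, assuming a 16-entry FIFO and 16-bit indices):

[code]
#include <vector>
#include <deque>
#include <algorithm>
#include <cstddef>

// Emulates a post-T&L vertex cache as a plain FIFO and counts how many
// indices miss the cache (i.e. force a vertex to be transformed again).
struct VertexCacheFIFO
{
    std::deque<unsigned short> entries;
    std::size_t capacity;
    unsigned misses;

    explicit VertexCacheFIFO(std::size_t size = 16) : capacity(size), misses(0) {}

    // Returns true on a cache hit, false on a miss (the vertex gets written).
    bool lookup(unsigned short index)
    {
        if (std::find(entries.begin(), entries.end(), index) != entries.end())
            return true;                  // already in the cache
        ++misses;
        entries.push_back(index);         // FIFO: the new vertex goes in...
        if (entries.size() > capacity)
            entries.pop_front();          // ...and the oldest one falls out
        return false;
    }
};

// Average cache misses per triangle for an indexed triangle list.
float cacheMissesPerFace(const std::vector<unsigned short>& indices,
                         std::size_t cacheSize = 16)
{
    VertexCacheFIFO cache(cacheSize);
    for (std::size_t i = 0; i < indices.size(); ++i)
        cache.lookup(indices[i]);
    return indices.size() < 3 ? 0.0f
                              : float(cache.misses) / float(indices.size() / 3);
}
[/code]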
I’m using a GeForce FX with VBOs, no pixel writes, no buffer swapping, no clear. So I made sure that vertex processing is the only bottleneck.

What am I doing wrong?

Well, I did some more precise FPS monitoring, took larger meshes and noticed that the performance boost is there, but only by 2-3%. I hoped for more… Maybe there is a better algorithm for cache optimisation?
Anyway, I get about 55 MTris/sec (31 MVert/sec) on my GeForce FX 5600 using indexed triangle lists.

Make sure you’re transform limited and not bandwidth limited. If your VBOs end up in AGP memory it’s fairly easy to become AGP bandwidth limited if your vertices are large.
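As a ballpark example (assuming something like a 32-byte vertex, which is just a guess): at the 31 MVert/sec you quote that is already roughly 1 GB/s of vertex fetch traffic, which is about the limit of AGP 4x, so a fat vertex format can easily swallow any transform-side win.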

What FIFO size are you assuming for the GFFX? It’s likely to be different than the ones documented for GeForce 2 and 3.

Also, if you’re exporting from a regular tool like 3ds max, the plain triangle list that usually comes out of it often has good enough locality that very little improvement can be had.

Nvidia devtech has a library called nvtristrip that does something similar to what you are doing. You might want to test your stuff against that to see if they get a bigger perf boost. Otherwise, you are probably doing it right.

http://discuss.microsoft.com/SCRIPTS/WA-MSD.EXE?A2=ind0009E&L=DIRECTXDEV&P=R247&D=0&I=-3

Check out this link for NVIDIA… I’d really like to see the doc that NVIDIA guy mentions. I have tried to apply to NVIDIA developer relations a few times, but without any response.

I have measured a latency of 5 cycles per processed vertex on my FX 5600, so with the GPU at 325 MHz you get 21.5 MTris/s, 64.5 MVert/s - all uncached (using artificial indices like 0, 1, 2, 3, 4, 5, 6, 7…).
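(Quick sanity check on those numbers: 325 MHz / 5 cycles per vertex ≈ 65 million vertices per second, and at three uncached vertices per triangle that is ≈ 21.7 million triangles per second, which is about what I measured.)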

For ATI, I think RHuddy from ATI dev. rel. mentioned that he wrote some papers on pre/post-T&L caches… here is the link to his post and the thread on the MS DirectX forum:
http://discuss.microsoft.com/SCRIPTS/WA-MSD.EXE?A2=ind0210B&L=DIRECTXDEV&D=0&I=-3&P=14102

To check whether you are GPU core clock bound, try downclocking the core (i.e. use tweaking tools to unlock the enhanced control panel clock frequency options).

I hope you’ll find this useful.

That’s really funny… the first post is from RHuddy@nvidia.com, but in the post from a couple of years later he is RHuddy@ati.com. Maybe that explains why he never finished that whitepaper.

I use a custom plugin to export mesh data from 3ds max. The problem is, I’m not very “native” to C++, so all I could do was slightly modify an already written plugin (it’s called hacking, so far). And it exports non-indexed triangles. I eliminate the duplicate vertices in my app later, thereby converting the mesh into an indexed one. At that moment it already HAS good cache behaviour. But after the optimisation it has much better cache behaviour, and that’s the problem.
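The deduplication itself is nothing fancy, roughly something like this (a sketch with a position-only vertex for simplicity; a real vertex format with normals/UVs would have to be compared the same way):

[code]
#include <vector>
#include <map>
#include <cstddef>

struct Vertex { float x, y, z; };   // a real format also has normal, UV, ...

// Needed so Vertex can be used as a std::map key.
bool operator<(const Vertex& a, const Vertex& b)
{
    if (a.x != b.x) return a.x < b.x;
    if (a.y != b.y) return a.y < b.y;
    return a.z < b.z;
}

// Turns a "triangle soup" (three vertices per face, many of them duplicates)
// into an indexed mesh by merging identical vertices.
void buildIndexedMesh(const std::vector<Vertex>& soup,
                      std::vector<Vertex>& outVertices,
                      std::vector<unsigned short>& outIndices)
{
    std::map<Vertex, unsigned short> remap;
    outVertices.clear();
    outIndices.clear();

    for (std::size_t i = 0; i < soup.size(); ++i)
    {
        std::map<Vertex, unsigned short>::iterator it = remap.find(soup[i]);
        if (it == remap.end())
        {
            unsigned short newIndex = (unsigned short)outVertices.size();
            remap[soup[i]] = newIndex;        // first time we see this vertex
            outVertices.push_back(soup[i]);
            outIndices.push_back(newIndex);
        }
        else
        {
            outIndices.push_back(it->second); // reuse the existing vertex
        }
    }
}
[/code]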
I’m assuming a 16-entry cache. It is most probably wrong, but it’s “better to underestimate the cache than to overestimate it” (NVIDIA).
I’m really sure that I’m transform-bound; I tested it.
The problem is the algorithm, I think. Better cache optimisation should be possible. As I said, every face (except the first one) has two vertices in the cache and one outside. It would be possible to order the vertices in a way that several faces fit entirely into the cache, but I’m already doing crazy loops and am tired of sitting for about 10 minutes to find an error somewhere (slightly hyperbolised).
What comes to my mind: could it be that the indices should form a straight run? Like (0,1,2), (1,2,3), (2,3,4) etc.? After my optimisation I often get very different vertices thrown together, like (3, 55, 4). Maybe there is some sort of addressing penalty… I will try it later.
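Roughly what I mean (a sketch: after the face reordering, re-emit the vertices in first-use order, so that the indices become mostly sequential):

[code]
#include <vector>
#include <cstddef>

// After the faces have been reordered for the cache, remap the vertex buffer so
// that vertices appear in the order they are first referenced. The indices then
// grow almost monotonically: (0,1,2), (1,2,3), ... Vertices that are never
// referenced are dropped.
template <class Vertex>
void remapVerticesToFirstUse(std::vector<Vertex>& vertices,
                             std::vector<unsigned short>& indices)
{
    const unsigned short UNUSED = 0xFFFF;
    std::vector<unsigned short> newIndexOf(vertices.size(), UNUSED);
    std::vector<Vertex> reordered;
    reordered.reserve(vertices.size());

    for (std::size_t i = 0; i < indices.size(); ++i)
    {
        unsigned short old = indices[i];
        if (newIndexOf[old] == UNUSED)            // first use of this vertex
        {
            newIndexOf[old] = (unsigned short)reordered.size();
            reordered.push_back(vertices[old]);
        }
        indices[i] = newIndexOf[old];             // rewrite the index in place
    }
    vertices.swap(reordered);
}
[/code]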
Thanks!

P.S.
nvtristrip… It’s C++ and very messy.
But I’ll look into it.

[This message has been edited by Zengar (edited 02-16-2004).]

Optimizing for a vertex cache of any size

Originally posted by speedy:
[b]I have measured a latency of 5 cycles per processed vertex on my FX 5600, so with the GPU at 325 MHz you get 21.5 MTris/s, 64.5 MVert/s - all uncached.[/b]

How did you measure this? I’m very interested in it!

I think I read this somewhere, but I don’t remember where: if you use non-indexed primitives with glDrawArrays, for example, then caching is disabled, right? I only mention this because I found glDrawArrays to be faster with huge meshes.
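What I mean is the difference between the two draw paths (roughly; the vertex arrays are assumed to be set up already, and numTriangles, numIndices and indices are placeholders):

[code]
// Non-indexed: every vertex of every triangle is pushed through the pipeline,
// so there is no index by which the post-T&L cache could recognise a reuse.
glDrawArrays(GL_TRIANGLES, 0, numTriangles * 3);

// Indexed: repeated indices let the hardware reuse already-transformed vertices.
glDrawElements(GL_TRIANGLES, numIndices, GL_UNSIGNED_SHORT, indices);
[/code]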