Okay. I wish I had time to try some benchmarks with you. This is a big design issue for me too, but I’m mired in other stuff right now.
The thing I’d try next, in case you also want to give it a shot, is to test indexed vs. non-indexed triangles, or at least test good indexing vs. bad.
The reason is that in the non-indexed case, the data streams through the transform stages in memory order. In the indexed case, I’d expect the post-transform cache to hide a lot of the memory hits, i.e., whenever the cache returns a reusable vertex. And when a cache miss does occur, the memory fetch may be less coherent, which may or may not matter, depending on the cache line configuration of that particular HW. So different timings for these two cases may shed some light on how much the memory fetch costs (coherent vs. non-coherent, which would seem to apply to interleaved vs. non-interleaved, depending on the # of cache lines available).
The trick will be in normalizing the results. The non-indexed case may contain redundant transforms. So it should be slower regardless. I’d probably determine the performance ratio for the same geometry (indexed vs. non-indexed) to give me a rough idea of how much benefit the T&L cache is getting. Once you have this ratio, you could apply it to four combinations of indexed/non and interleaved/non to hopefully see some impact.
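To get "the same geometry" both ways, one option is to flatten the indexed mesh into a non-indexed vertex list and feed both versions to the benchmark. Here is a minimal sketch of that step in Python; `deindex` and the tiny quad data are just illustrative names and values I made up, not anything from a real API. The redundancy factor it prints is the extra transform work the non-indexed path does before any cache effects.

```python
def deindex(vertices, indices):
    """Expand an indexed triangle list into a non-indexed vertex list.

    Each index is replaced by a copy of the vertex it refers to, so shared
    vertices get duplicated (and would be transformed more than once).
    """
    if len(indices) % 3 != 0:
        raise ValueError("index count must be a multiple of 3")
    return [vertices[i] for i in indices]


# Hypothetical test data: two triangles sharing an edge.
quad_vertices = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
quad_indices = [0, 1, 2, 0, 2, 3]

flat = deindex(quad_vertices, quad_indices)

# How many vertices the non-indexed path transforms per unique vertex.
redundancy = len(flat) / len(quad_vertices)
print(len(flat), redundancy)
```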
To test “bad” indexing instead, you could probably just design a pattern for your indexed case that really pushes memory fetching outside of normal bounds, both in terms of raw memory and the T&L cache. My first stab at that pattern would be something that jumps randomly around the VBO (3 verts at a time to make good triangles) with no reuse between triangles.
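A rough sketch of that "bad" pattern, assuming the VBO is laid out so that every consecutive group of 3 vertices already forms a valid triangle (that layout is my assumption). It visits the triangles in shuffled order, so each index is used exactly once and the fetches hop all over the buffer. The function name and the fixed seed are only there for illustration and repeatability.

```python
import random


def bad_index_pattern(triangle_count, seed=0):
    """Build an index list that visits triangles in random order.

    Assumes vertices 3*t, 3*t+1, 3*t+2 form triangle t in the VBO.
    No index repeats, so a post-transform cache can never score a hit.
    """
    order = list(range(triangle_count))
    random.Random(seed).shuffle(order)  # fixed seed keeps runs repeatable
    indices = []
    for t in order:
        base = 3 * t
        indices.extend((base, base + 1, base + 2))
    return indices


indices = bad_index_pattern(1000)
print(len(indices), indices[:6])
```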
BTW, you can also gauge the T&L cache’s expected use with a dummy SW cache that just tracks the N most recently used indices and counts how many times an index is found in the cache vs. missed. The misses are what count. And you’ll need very good cache utilization to reach the marketing numbers for your HW.
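Something like the following would do as the dummy SW cache. It is only a sketch: I’m assuming a least-recently-used replacement policy because that matches "N most recently used", but real HW may well use FIFO or something else, and the cache size of 16 is a guess, not a number from any spec.

```python
from collections import OrderedDict


def simulate_vertex_cache(indices, cache_size):
    """Return (hits, misses) for an LRU cache holding `cache_size` indices.

    LRU is an assumption; swap the policy to match your HW if you know it.
    """
    cache = OrderedDict()
    hits = misses = 0
    for idx in indices:
        if idx in cache:
            hits += 1
            cache.move_to_end(idx)  # mark as most recently used
        else:
            misses += 1
            cache[idx] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used
    return hits, misses


# Two triangles sharing an edge, with a guessed cache size of 16.
hits, misses = simulate_vertex_cache([0, 1, 2, 0, 2, 3], cache_size=16)
miss_rate = misses / (hits + misses)
print(hits, misses, miss_rate)
```

Running your real index buffer through this with a few different cache sizes should show roughly where reuse falls off.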
Also just FYI, since cubes have only 8 vertices, the T&L cache reuse might be somewhat atypical, probably about 2 hits for every miss within each cube (e.g., 8 misses / 24 indices = 33% miss rate). Something “meshier” might be better to test with. I think they often use a uniform grid of small indexed triangles or non-indexed, decent-length strips to get the best marketing numbers.
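To make the cube vs. "meshier" comparison concrete, here is a sketch that builds a uniform grid of indexed triangles and computes the best-case miss rate, i.e., unique vertices divided by total indices, which is what you’d get from an ideal cache that never evicts. A real cache will do worse. The cube index list (6 quads, 24 indices) and the 32x32 grid size are just values I picked for illustration.

```python
def grid_indices(cells_x, cells_y):
    """Index list for a uniform grid, two triangles per cell.

    Vertices are assumed to be stored row by row, (cells_x + 1) per row.
    """
    stride = cells_x + 1
    indices = []
    for y in range(cells_y):
        for x in range(cells_x):
            v0 = y * stride + x   # one corner of the cell
            v1 = v0 + 1           # next vertex in the same row
            v2 = v0 + stride      # same column, next row
            v3 = v2 + 1           # diagonal corner
            indices.extend((v0, v1, v2, v1, v3, v2))
    return indices


def best_case_miss_rate(indices):
    """Miss rate with an ideal cache: each unique vertex misses once."""
    return len(set(indices)) / len(indices)


# A cube drawn as 6 quads: 24 indices over 8 unique vertices.
cube_quads = [0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 5, 4,
              2, 3, 7, 6, 0, 3, 7, 4, 1, 2, 6, 5]
grid = grid_indices(32, 32)

print(best_case_miss_rate(cube_quads))  # 8 / 24
print(best_case_miss_rate(grid))        # 1089 / 6144
```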
Does that make sense? I’m also guessing like you are, so if someone with more/any HW design experience wants to chime in, that would be most helpful. This would be so much easier if the HW companies just told us the optimal input patterns for their various HW.