I know that with Direct3D, NVIDIA tested their GeForce2 cards and found that vertex buffers of around 4096 vertices were fastest; fewer but larger vertex buffers were slower, as were smaller but more numerous ones.
I’m wondering whether the size of the vertex array passed to glDrawElements matters. Is it all right to just pass arrays of 65536 vertices, or might it be faster to split the array into a few smaller ones?
Thanks! It said 4096 for both, with my GeForce2. Perfect. I had to use GL_MAX_ELEMENTS_VERTICES_WIN and GL_MAX_ELEMENTS_INDICES_WIN though, as the two you said were not defined. Does WIN mean for the Windows platform, or for windowed mode?
I believe I downloaded them from microsoft.com relatively recently. Anyway it doesn’t matter what they’re called, only what they are: 0x80E8 and 0x80E9. What are the values of the non-_WIN ones?
I have the Red Book, and those constants are under the description of glDrawRangeElements; you’re right, GPSnoopy.
I think the reason they didn’t print it under glDrawElements() is that they assume you’ll use that call for an entire unbroken array of data, without splitting the array into smaller ones beforehand. Given that assumption, maybe it would be better to keep arrays of 65536 vertices and render 4096 at a time with glDrawRangeElements…
Suppose there’s a limit to the size of the currently active scatter/gather table for the card’s DMA engine? Another way to think of it is that the card may implement its own MMU, and there’s a limit to the number of TLB entries.
Perhaps there’s also some limitation on some counter/register somewhere that can’t count higher than 12 bits in one go (2^12 = 4096), so any larger count has to be split in two, meaning the driver has to wait for an interrupt, or at least queue a second command, for the second half of a buffer with 4097 items in it.
I’m sure we could come up with more plausible explanations if we thought a little more about it.
If you’re spooling out dynamic geometry, the main issue is the CPU cache size. (VAR solves this problem by using uncached memory.) It’s pretty much beyond our control how you lay things out and how it collides in the cache.
VAR has its own max # of vertices – you can query the max index, which is 2^16-1 on NV1x and 2^20-1 on NV2x.
Then there’s all the caching of vertices post-T&L, which is of course <<4096 vertices.
And for CVA/DRE, our buffers to copy the vertices into are of limited size, but we can’t meaningfully expose that because different vertex formats use different numbers of bytes! Think of it as a VAR implementation internal to the driver.
For the record, I do want to remind people that if you’re using UNSIGNED_INT indices with VAR, DRE can be a definite win. In short (bad pun), the reason is the 2^20-1 index limit. If your “end” value is <=65535, then we know that your unsigned int indices can really be copied as shorts, which saves memory bandwidth. (If you use UNSIGNED_SHORT, we already know that by default; and UNSIGNED_SHORT, of course, already saves bandwidth by virtue of being smaller.)