glDrawRangeElements & GL_MAX_ELEMENTS_VERTICES / INDICES

An NVIDIA 6800 with the latest drivers reports:

GL_MAX_ELEMENTS_VERTICES = 4096
GL_MAX_ELEMENTS_INDICES = 4096

and I gather that NVIDIA drivers have been reporting 4096 for a long time, even on much older hardware.

My question is, are these limits still reasonable, or are they (as some game developers say) way out of date and “not to be used”?

If ignored, what are folks using nowadays for the max number of indices and/or max number of contiguous touched vertices per glDrawRangeElements call?

P.S. Is 4096 supposedly the size of the on-chip GPU vertex FIFO (the pre-T&L cache)?

P.P.S. I read that GL_MAX_ELEMENTS_VERTICES on ATI cards is ~2 billion and GL_MAX_ELEMENTS_INDICES is ~64K. Any idea why these are so widely different from NVIDIA’s?
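
(For reference, the numbers above come from the standard glGetIntegerv query; a minimal sketch, assuming a current GL context and headers that expose the GL 1.2 enums:)

    #include <stdio.h>
    #include <GL/gl.h>

    /* Query the recommended glDrawRangeElements limits.
       The values are implementation-dependent recommendations. */
    void printDrawRangeLimits(void)
    {
        GLint maxVerts = 0, maxInds = 0;
        glGetIntegerv(GL_MAX_ELEMENTS_VERTICES, &maxVerts);
        glGetIntegerv(GL_MAX_ELEMENTS_INDICES, &maxInds);
        printf("GL_MAX_ELEMENTS_VERTICES = %d\n", maxVerts);
        printf("GL_MAX_ELEMENTS_INDICES  = %d\n", maxInds);
    }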

Those are implementation-dependent, recommended maximums. The spec states that values beyond these may operate at reduced performance.

AFAIK, post-T&L cache sizes are nowhere near that big (around 32 or so, I think, for current hardware).

Yeah, I’ve wondered that as well. For the last few years it’s been:
NVIDIA 4096 + ATI billions.
Case in point:
GF256 = 4096
GF7900 = 4096
Between the GF256 (the first GeForce ever) and the 7900 is a span of 6-7 years. Practically everything else on the card has changed, yet they still report 4096, so what gives?

This must be because of a performance thing related to the driver (batching?).
I think the cache has been 16 on the GF2, 24 on the GF3 and GF4, and 32 on the GF FX, so it’s not the cache size, for sure.

I don’t ignore them. I expect the company to return correct values to yield good performance.

I think he was referring to the pre-T&L cache, not the post-T&L cache.

Something that I noticed from drawing large vertex sets on NVIDIA hardware is that you do run into a major slowdown when drawing indexed primitives beyond a certain point. I don’t really remember what it was, but I think it was something like 65k in size or 65k elements. I don’t remember if specifying a subsection of a large buffer with DrawRangeElements helped, but I do remember that this problem didn’t occur if just using DrawArrays to draw without indices. My solution was to just split up my array.
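
Something like this is what I mean by splitting (a sketch; CHUNK is an arbitrary app-chosen size, a multiple of 3 for GL_TRIANGLES, not a queried limit):

    /* Sketch: issue several smaller indexed draws instead of one huge one. */
    enum { CHUNK = 30000 };

    void drawSplit(GLsizei indexCount, const GLushort *indices)
    {
        for (GLsizei off = 0; off < indexCount; off += CHUNK) {
            GLsizei n = indexCount - off;
            if (n > CHUNK) n = CHUNK;
            glDrawElements(GL_TRIANGLES, n, GL_UNSIGNED_SHORT, indices + off);
        }
    }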

There are ways to query the actual post T&L cache size for your card (the other API). It turns out that my NV40 has a cache size of 24. This information does come in handy when stripping meshes (see NvTriStrip).

As for the batch size, your best bet is to test this, perhaps starting with the recommendation. If you’re serious about fine-tuning performance, you should really leave nothing to chance, as even a recommendation has to make certain assumptions about the behavior of your app. And as always, consult the IHV performance docs for implementation specifics.
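
For example, a crude timing harness might look like this (a sketch; timerSeconds() is a made-up wall-clock helper, and glFinish is only there to keep the measurement honest):

    extern double timerSeconds(void);   /* hypothetical platform timer */

    /* Sketch: time drawing the same index array at a given batch size. */
    double timeBatchSize(GLsizei batch, GLsizei totalIndices,
                         const GLushort *indices)
    {
        glFinish();                     /* drain pending work first */
        double t0 = timerSeconds();
        for (GLsizei off = 0; off < totalIndices; off += batch) {
            GLsizei n = totalIndices - off;
            if (n > batch) n = batch;
            glDrawElements(GL_TRIANGLES, n, GL_UNSIGNED_SHORT, indices + off);
        }
        glFinish();                     /* force completion before stopping */
        return timerSeconds() - t0;
    }

Run it over a range of batch sizes, starting at the queried recommendation, and keep the fastest.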

Originally posted by Minstrel:
There are ways to query the actual post T&L cache size
Which ones?

Originally posted by AlexN:
I think he was referring to the pre-T&L cache, not the post-T&L cache.

So you are saying that it has to do with the size of the pre-T&L cache?

jide, a tool like this (for measuring the post-T&L cache size) can easily be made by anyone.

Imagine a regular grid, N cells by M cells (N is the number you want to find).

Simply make an indexed mesh draw call (a quad strip, for example, for simplicity, using a primitive-restart index, since restart doesn’t invalidate the cache) whose indices look like:

0,N, 1,N+1, 2,N+2, … , N-2,2N-2, N-1,2N-1, RESTART

The number of such strips is equal to M (the value of M doesn’t matter, but it must be big enough to see the performance difference: 100 or more).

The very first line of indices must prefill the cache, so it must draw a degenerate quad strip with indices equal to:

0,0, 1,1, 2,2, … , N-2,N-2, N-1,N-1, RESTART

Make one draw call with all these strips (there should be about M+1 restarts in this index buffer).

What do we want to see? If N is less than or equal to the cache size, then every point will be computed ONLY ONCE. If N is more than the cache size, there will be points that cause the cache to be invalidated immediately, and almost every point will be computed twice (except maybe for the very first, degenerate row).
In the good first case, after the degenerate strip the cache is filled with the first N points (0 through N-1).
Then each strip takes its first vertex from the cache (0, for example) and puts the first vertex not in the cache (N) into it. The same holds for 1, 2, and so on. So after the second strip the cache is filled with the second N points (N through 2N-1).

So, by varying N (from 16 and up) and looking for a sharp, non-smooth performance drop, we can discover the exact cache size. I did this on NVIDIA GeForce FX and GeForce 6 hardware (I didn’t try the 7-series, but I expect the same result).

A hard drop (about 10-15 percent) is seen when moving from 24 to 25, so the answer is evident. :)
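
For jide and anyone else who wants to try this, here is a sketch of building that probe index buffer (N and M as described above; RESTART is assumed to match whatever primitive-restart index you have enabled, e.g. via GL_NV_primitive_restart, and the (M+1)-row by N-column vertex grid is set up separately):

    /* Sketch: build the probe index buffer described above.
       Caller must provide room for (M+1)*(2*N+1) indices.
       0xFFFF as the restart index is an assumption. */
    enum { RESTART = 0xFFFF };

    size_t buildProbeIndices(GLushort *out, int N, int M)
    {
        size_t k = 0;
        /* degenerate prefill strip: 0,0, 1,1, ..., N-1,N-1, RESTART */
        for (int i = 0; i < N; ++i) {
            out[k++] = (GLushort)i;
            out[k++] = (GLushort)i;
        }
        out[k++] = RESTART;
        /* M real strips: grid row r joined to row r+1 */
        for (int r = 0; r < M; ++r) {
            for (int i = 0; i < N; ++i) {
                out[k++] = (GLushort)(r * N + i);
                out[k++] = (GLushort)((r + 1) * N + i);
            }
            out[k++] = RESTART;
        }
        return k;   /* total index count for the single draw call */
    }

Then draw the whole thing with one glDrawElements(GL_QUAD_STRIP, …) call per candidate N and look for the drop between consecutive values of N.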

Thanks! Indeed, I didn’t know it would be so easy.

Originally posted by V-man:
Originally posted by AlexN:
I think he was referring to the pre-T&L cache, not the post-T&L cache.

So you are saying that it has to do with the size of the pre-T&L cache?

No, I don’t think it has anything to do with that. In the original post a question about the pre-T&L cache was asked, but the thread got diverted a little into talking about the post-T&L cache, so I was just trying to clarify.

Originally posted by AlexN:
Something that I noticed from drawing large vertex sets on NVIDIA hardware is that you do run into a major slowdown when drawing indexed primitives beyond a certain point. I don’t really remember what it was, but I think it was something like 65k in size or 65k elements.
GeForce 2 cards (and probably GeForce4 MX) only supported 16-bit indexing in hardware, so if you went over 64Ki vertices it would have to do the indexing in software.
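
So one way to stay on the fast path (a sketch; 65536 is just the 16-bit index ceiling described above) is to choose the index type from the vertex count:

    /* Sketch: prefer the hardware 16-bit index path where possible. */
    void drawMeshIndexed(GLsizei vertexCount, GLsizei indexCount,
                         const GLushort *idx16, const GLuint *idx32)
    {
        if (vertexCount <= 65536) {
            /* fits 16-bit indexing: the fast path on GeForce 2-class HW */
            glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, idx16);
        } else {
            /* 32-bit indices may fall back to software indexing on that
               hardware; splitting the mesh would avoid this entirely */
            glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, idx32);
        }
    }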

Given that the perf people are always telling us to batch as much as possible, these values do seem crazy. They should surely be as high as possible given the hardware limitations.

This was actually on a GeForce 7800, so I think there is something else, too. It shouldn’t have been anything to do with running out of memory, either, because there was plenty to spare.

Here’s what the OpenGL specification says the GL_MAX_ELEMENTS_VERTICES and GL_MAX_ELEMENTS_INDICES limits for glDrawRangeElements (added with OpenGL 1.2) mean:

Implementations denote recommended maximum amounts of vertex and
index data, which may be queried by calling glGet with arguments
GL_MAX_ELEMENTS_VERTICES and GL_MAX_ELEMENTS_INDICES. If end - start +
1 is greater than the value of GL_MAX_ELEMENTS_VERTICES, or if count is
greater than the value of GL_MAX_ELEMENTS_INDICES, then the call may
operate at reduced performance.
This language was incorporated from the EXT_draw_range_elements extension specification. In that specification, these limits applied only to glDrawRangeElements.

Notice that these limits are defined in terms of the start, end, and count parameters to glDrawRangeElements.

The assumption the specification is making is that the driver internally copies the range of vertex data between indices [start, end] to a staging buffer that can be transformed (probably on the CPU) once, and then the transformed vertices can be indexed out based on the count indices from the indices array. Or the driver could be copying the data once to a driver-internal VBO from which the GPU will source the indices, allowing for efficient vertex re-use.

(Recall the glDrawRangeElements command was added long before VBOs were added to OpenGL.)
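
To make that model concrete, here is a purely hypothetical sketch of the driver-side logic the spec language implies; every helper in it (stagingBuffer, transformVertex, emitVertex) is invented for illustration, and no real driver need work this way:

    typedef struct { float x, y, z, w; } Vertex;      /* placeholder type */
    extern Vertex *stagingBuffer(GLsizei minSize);    /* hypothetical */
    extern Vertex  transformVertex(GLuint i);         /* hypothetical */
    extern void    emitVertex(Vertex v);              /* hypothetical */

    /* Hypothetical "staging buffer" model behind the spec language. */
    void drawRangeElements_model(GLuint start, GLuint end,
                                 GLsizei count, const GLushort *indices)
    {
        GLsizei rangeSize = (GLsizei)(end - start + 1);
        Vertex *staged = stagingBuffer(rangeSize);

        /* Transform each vertex in [start, end] exactly once... */
        for (GLsizei i = 0; i < rangeSize; ++i)
            staged[i] = transformVertex(start + (GLuint)i);

        /* ...then assemble primitives by indexing the staged results.
           If rangeSize exceeded the staging buffer, the driver would
           have to fall back to a slower path, hence the limits. */
        for (GLsizei i = 0; i < count; ++i)
            emitVertex(staged[indices[i] - start]);
    }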

So these limits are trying to indicate to the OpenGL application using glDrawRangeElements what size staging buffer the driver uses internally to make rendering efficient.

As other posters mention, there are indeed pre-transform and post-transform caches in modern GPUs and the sizes of these caches can affect the performance of glDrawElements and glDrawArrays commands.

However, the performance effects of these caches have to do with the re-use of indices and the relative locality of vertex array contents. These performance effects aren’t really related to the total number of vertices processed by a particular glDrawRangeElements call, which is what the limits above are specified to describe.

It’s tempting to “re-interpret” these limits to be more “relevant” to how modern GPUs work, but that is not what the limits are specified to mean.

Keep in mind the description of these limits in Table 6.35 of the OpenGL 2.1 specification is “Recommended max. number of DrawRangeElements indices” and “Recommended max. number of DrawRangeElements vertices” respectively.

I just want to warn developers that they should really put little or no dependence (I recommend “no dependence”) on their application’s behavior based on these particular limits. For example, don’t tie the size of a memory allocation to the value returned. If some future OpenGL implementation returns 2^31-1, such a memory allocation is likely to fail and expose bugs in your application.
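
For instance, a defensive pattern might look like this (a sketch; the 65536 cap is an arbitrary application-chosen value, not anything mandated by GL):

    /* Sketch: clamp the queried value to an app-chosen cap before
       using it to size an allocation. */
    size_t safeBatchIndexCount(void)
    {
        GLint reported = 0;
        glGetIntegerv(GL_MAX_ELEMENTS_INDICES, &reported);
        size_t cap = 65536;           /* app-chosen cap (assumption) */
        if (reported > 0 && (size_t)reported < cap)
            return (size_t)reported;
        return cap;
    }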

These limits certainly have no relationship to the hardware’s pre- or post-transform vertex cache sizes nor the maximum hardware-pullable vertex index.

For the record, I don’t think applications actually need to know the size of the pre- or post-transform vertex caches. Unless you are hyper-optimizing your content for one particular platform, trying to size vertex array usage based on these cache sizes is not likely to be worth your time and energy. It’s interesting to know these caches exist, but if you simply arrange your vertex array usage for good memory locality and vertex index re-use, that’s the key goodness.

I hope this helps.

  • Mark Kilgard

Play safe. Make good choices.