Costs associated with compiled vertex arrays

I’m trying to come up with an optimal design for my PC/Mac-oriented game engine on the more common GL implementations, and most sources I have read seem to agree that compiled vertex arrays are the way to go. But I am curious as to the costs involved with each part of the process. Specifically (a rough sketch of the call sequence I have in mind follows these questions):

  • what is the cost of glInterleavedArrays? Does it cause any communication with the card on an average T&L implementation? Do I significantly lose any performance if I call it redundantly?

  • what is the cost of locking? I know that this causes the T&L to occur, but is it common that this is just a simple matter of passing the arguments to the card which already knows the original vertex locations, or is it at this stage that vertices are passed to the card, with a little ‘do these, please’ command?

  • is there any large cost associated with unlocking? Or is it normally just an opportunity to clear up a few variables on the CPU (as opposed to the graphics card) side?
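
For reference, this is roughly the call sequence I mean: glInterleavedArrays to set the pointers, then glLockArraysEXT/glUnlockArraysEXT from EXT_compiled_vertex_array around the draws. The format, counts and data below are just placeholders, and I'm assuming the extension entry points are already resolved:

```c
#define GL_GLEXT_PROTOTYPES 1
#include <GL/gl.h>
#include <GL/glext.h>   /* EXT_compiled_vertex_array prototypes */

#define VERTEX_COUNT 1024   /* placeholder sizes */
#define INDEX_COUNT  3072

/* Interleaved T2F_N3F_V3F data: 2 texcoord + 3 normal + 3 position floats per vertex. */
static GLfloat  vertices[VERTEX_COUNT * 8];
static GLushort indices[INDEX_COUNT];

static void draw_mesh(void)
{
    /* Set the interleaved pointers and enable the matching arrays. */
    glInterleavedArrays(GL_T2F_N3F_V3F, 0, vertices);

    /* Lock the currently enabled arrays over the range [0, VERTEX_COUNT). */
    glLockArraysEXT(0, VERTEX_COUNT);

    /* One or more passes over the same (locked) vertices. */
    glDrawElements(GL_TRIANGLES, INDEX_COUNT, GL_UNSIGNED_SHORT, indices);

    glUnlockArraysEXT();
}
```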

Thanks in advance!

All of these questions are implementation-defined. There are, however, some general rules of thumb to follow with CVAs. One of the prime ones is that you don’t get much benefit from CVAs if you don’t multipass over geometry. So if all of your rendering is single-pass, CVAs probably aren’t going to help you a whole lot.

Oh, and there’s nothing that says that locking a buffer causes T&L to occur. On modern T&L hardware (let alone vertex-program-equipped hardware), CVA locking likely just performs an optimized copy of the vertex data to AGP memory.

One of the nice things about CVAs is that they are easy to benchmark: turning them on and off is trivial.
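
For instance, toggling them around a typical multipass draw is just a matter of guarding the lock/unlock calls. A rough sketch (the textures, counts and the use_cva switch are placeholders, not anything from your code):

```c
#define GL_GLEXT_PROTOTYPES 1
#include <GL/gl.h>
#include <GL/glext.h>

/* Placeholders assumed to be set up elsewhere. */
extern GLuint   base_texture, lightmap_texture;
extern GLushort indices[];
extern GLsizei  vertex_count, index_count;

/* Flip use_cva and compare frame times. Assumes the interleaved arrays
   are already set up and the extension is present. */
static void draw_multipass(GLboolean use_cva)
{
    if (use_cva)
        glLockArraysEXT(0, vertex_count);

    /* Pass 1: base texture. */
    glBindTexture(GL_TEXTURE_2D, base_texture);
    glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_SHORT, indices);

    /* Pass 2: lightmap modulated over the first pass. */
    glEnable(GL_BLEND);
    glBlendFunc(GL_DST_COLOR, GL_ZERO);
    glBindTexture(GL_TEXTURE_2D, lightmap_texture);
    glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_SHORT, indices);
    glDisable(GL_BLEND);

    if (use_cva)
        glUnlockArraysEXT();
}
```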

For most apps, you’re probably better off using DrawRangeElements. The CVA extension is notoriously hard to interpret…
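
For reference, glDrawRangeElements (core in GL 1.2, also EXT_draw_range_elements) is just glDrawElements plus an explicit index range, so the driver knows up front how much vertex data the indices can touch. Something like this (counts and data are placeholders):

```c
#include <GL/gl.h>   /* glDrawRangeElements is core in GL 1.2; older headers may need glext.h */

extern GLushort indices[];                 /* placeholder index data */
extern GLsizei  vertex_count, index_count; /* placeholder sizes */

static void draw_mesh_ranged(void)
{
    /* start/end bound the vertex indices the index list can reference,
       so the driver knows how much of the arrays it has to transfer. */
    glDrawRangeElements(GL_TRIANGLES,
                        0,                 /* lowest index referenced */
                        vertex_count - 1,  /* highest index referenced */
                        index_count,       /* number of indices */
                        GL_UNSIGNED_SHORT,
                        indices);
}
```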

  • Matt

Originally posted by Thomas Harte:
  • what is the cost of locking? I know that this causes the T&L to occur, but is it common that this is just a simple matter of passing the arguments to the card which already knows the original vertex locations, or is it at this stage that vertices are passed to the card, with a little ‘do these, please’ command?

  • is there any large cost associated with unlocking? Or is it normally just an opportunity to clear up a few variables on the CPU (as opposed to the graphics card) side?

When using locked vertex arrays, the driver has several paths it can take:

  1. It can map the user buffer through AGP and have the card consume the data by pulling it over AGP.
  2. It can copy the data to a system-memory AGP pool and have the card consume it through AGP much as in 1 (the difference being that it avoids mapping the user buffer through AGP, which may be an expensive operation if the buffer is small).
  3. It can copy the data to a video memory pool and have the card consume it through local video memory transfers.
  4. It can ignore the lock/unlock hint and just treat it as a normal (non-locked) vertex array.

Note that locking doesn’t “cause the T&L to occur”; it just changes the way vertices will be transferred to the graphics chip at rendering time. The driver cannot simply take your vertex array and transform it at lock time because, even while the buffer is locked, you can still change your modelview matrix (or texture matrix), which will produce different projected vertices.

In each of these scenarios, lock & unlock will behave differently. Which scenario the driver chooses will mainly depend on the size of the vertex array: for small vertex arrays, copying the data to a driver pool may be faster than mapping the user buffer through AGP. Obviously all this gets more complicated once you take into account that you may run out of memory in the driver pools (which will force some kind of synchronisation with the graphics chip), or that some arrays will be locked (for example, geometry) and some others won’t (for example, texture coordinates), so you may need a mixture of scenarios.
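
Purely as an illustration of that choice (not any real driver’s code; the threshold, pool checks and names are made up for the sketch), the decision boils down to something like:

```c
#include <stddef.h>

/* Illustrative only: the threshold and pool-availability flags are invented. */
#define SMALL_ARRAY_THRESHOLD (16 * 1024)   /* bytes; made-up figure */

typedef enum {
    PATH_MAP_AGP,       /* scenario 1: map the user buffer through AGP     */
    PATH_COPY_AGP,      /* scenario 2: copy into a system-memory AGP pool  */
    PATH_COPY_VIDMEM,   /* scenario 3: copy into a video-memory pool       */
    PATH_IGNORE_LOCK    /* scenario 4: treat it as a plain vertex array    */
} VertexPath;

static VertexPath choose_vertex_path(size_t array_bytes,
                                     int vidmem_pool_has_room,
                                     int agp_pool_has_room)
{
    if (array_bytes >= SMALL_ARRAY_THRESHOLD)
        return PATH_MAP_AGP;        /* big enough that mapping pays off */

    /* Small array: copying to a driver pool is cheaper than mapping it. */
    if (vidmem_pool_has_room)
        return PATH_COPY_VIDMEM;
    if (agp_pool_has_room)
        return PATH_COPY_AGP;

    /* Pools exhausted: a real driver might synchronise with the chip and
       reuse pool memory, or just fall back to the normal (non-locked) path. */
    return PATH_IGNORE_LOCK;
}
```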

Locking:
For scenario 1, the lock time is more or less constant regardless of the size of the buffer (it’s just locking down the buffer and mapping it through AGP).
For scenarios 2 and 3, the lock time will depend on the size of the vertex array, as the driver has to copy it to the driver memory pool.

Unlocking:
For scenario 1, it forces the driver to ensure the graphics chip has consumed all the vertex data before returning to the app (sort of a glFinish), as the user may modify the buffer as soon as it is unlocked.

For the rest of the scenarios, there’s no synchronisation needed.

At rendering time, scenario 3 will be the best performer (transfers of vertices from video memory are really fast), but you may not notice the differences between scenarios if your bottleneck is elsewhere in your pipeline (T&L transformation limited, fill-rate or texture-filter-rate limited, etc).

[This message has been edited by evanGLizr (edited 08-29-2002).]