NV_vertex_array_range question...

Having read everything regarding NV_vertex_array_range in this forum, I am still confused.

  1. Is there a hard limit on the amount of AGP memory you can get? On two different systems (one a VIA KT266A with a GeForce3 Ti500, all AGP drivers/fixes installed, the other a BX board with a GeForce2 GTS), I always get 32MB as a maximum, regardless of the amount of video memory or the AGP aperture size.

  2. I am still confused as to what to do if I do not have enough AGP memory to hold all of the scene.

At the moment, I do this:
On startup:

  • allocate 32MB agp memory
  • glVertexArrayRangeNV(size, mFastMemStart);
  • glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);

Then, in the frame loop, I have a number of different meshes consisting of triangle strips. For each mesh, I copy the vertex/normal/texcoord data into the allocated memory and call glMultiDrawElements for the contained strips. I fill the available memory sequentially, and when all the AGP memory is used up, I restart from the beginning of the range. I also set a fence after half of the memory is filled and again after all of it is filled, so that I can be sure no data the GPU is still using gets overwritten. I have tested that the array range is valid, and the fence is never in a wait state.
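In code, my scheme looks roughly like this (a minimal sketch, positions only for brevity; the names and the half-tracking logic are simplified, and the NV_vertex_array_range / NV_fence / EXT_multi_draw_arrays entry points are assumed to be resolved via wglGetProcAddress already):

#define AGP_SIZE (32 * 1024 * 1024)

static unsigned char *mFastMemStart, *mCursor;
static GLuint mFence[2];        /* one fence per buffer half */
static int mCurHalf = 0;

void InitVAR(void)
{
    /* readFreq/writeFreq 0, priority ~0.5 is the usual request for AGP
       (rather than video) memory */
    mFastMemStart = (unsigned char *)
        wglAllocateMemoryNV(AGP_SIZE, 0.0f, 0.0f, 0.5f);
    glVertexArrayRangeNV(AGP_SIZE, mFastMemStart);
    glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);
    glGenFencesNV(2, mFence);
    /* set both fences once so finishing them before the first wrap is safe */
    glSetFenceNV(mFence[0], GL_ALL_COMPLETED_NV);
    glSetFenceNV(mFence[1], GL_ALL_COMPLETED_NV);
    mCursor = mFastMemStart;
}

void DrawMesh(const float *vertexData, size_t bytes, const GLsizei *stripCounts,
              const GLushort **stripIndices, GLsizei numStrips)
{
    int half;

    if (mCursor + bytes > mFastMemStart + AGP_SIZE)
        mCursor = mFastMemStart;                   /* wrap around */

    /* which half are we writing into? (a mesh that straddles the half
       boundary would need extra care in real code) */
    half = ((size_t)(mCursor - mFastMemStart) < AGP_SIZE / 2) ? 0 : 1;
    if (half != mCurHalf) {
        /* fence the half we just finished filling, then wait until the GPU
           is done with the half we are about to overwrite */
        glSetFenceNV(mFence[mCurHalf], GL_ALL_COMPLETED_NV);
        glFinishFenceNV(mFence[half]);
        mCurHalf = half;
    }

    memcpy(mCursor, vertexData, bytes);            /* copy into AGP */
    glVertexPointer(3, GL_FLOAT, 0, mCursor);
    glMultiDrawElementsEXT(GL_TRIANGLE_STRIP, (GLsizei *)stripCounts,
                           GL_UNSIGNED_SHORT, (const GLvoid **)stripIndices,
                           numStrips);
    mCursor += bytes;
}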

However, in almost all scenes I have tried, this is slower than not using VAR. I don’t understand this: the program is doing practically nothing else, I don’t use a local viewer, and there is only one directional light, so it should be at least as fast.

For a smaller scene (<32MB), I have also tried copying the arrays to AGP memory at startup and only calling glMultiDrawElements in the frame loop, in which case I do get a significant speedup in some cases (twice as fast).

I had thought that copying on the fly shouldn’t be slower than not copying at all, because the copy should run in parallel with the GPU rendering the previous arrays - but in my case, copying on the fly results in a significant slowdown.

What could I have done wrong?

Any comments are appreciated,

Michael

Originally posted by wimmer:
[b]Is there a hard limit on the amount of AGP memory you can get? On two different systems (one a VIA KT266A with a GeForce3 Ti500, all AGP drivers/fixes installed, the other a BX board with a GeForce2 GTS), I always get 32MB as a maximum, regardless of the amount of video memory or the AGP aperture size.[/b]

The amount of AGP memory that you can use is determined by your BIOS settings (the AGP aperture size).

BTW, depending on how many textures you are using, I wouldn’t suggest using all of your AGP memory for vertex data.

Originally posted by wimmer:
[b]I had thought that copying on the fly shouldn’t be slower than not copying at all, because the copy should run in parallel with the GPU rendering the previous arrays - but in my case, copying on the fly results in a significant slowdown.[/b]

Is your copy sequential? How well optimized is your copy routine?

I’ve found that when copying to AGP memory, because it’s not cacheable, memcpy() does pretty well. It’s hard (but possible) to beat memcpy() in that circumstance. (I’m talking about MSVC 6.0 libc here)
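For what it’s worth, the kind of hand-rolled routine you might benchmark against memcpy() looks something like this (illustrative only); the property that matters on write-combined AGP memory is purely sequential stores, never reading the destination back:

#include <stddef.h>

/* straight sequential 32-bit stores; assumes dst/src are 4-byte aligned
   and bytes is a multiple of 4 */
void CopyDwords(void *dst, const void *src, size_t bytes)
{
    unsigned int *d = (unsigned int *)dst;
    const unsigned int *s = (const unsigned int *)src;
    size_t n = bytes / 4;
    while (n--)
        *d++ = *s++;
}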

Are you sure you are geometry transfer bound? If not, then geometry optimizations may not perform as expected.

Also, make sure you test the fence for the right half when entering a new half – but since you say the fences are never busy, that shouldn’t be the problem.

As to BIOS settings: I have the AGP aperture set to 128MB in the BIOS (with 256MB main memory) on the BX machine, and a 256MB aperture on the VIA machine (512MB main).
I only use one texture; it is a very simple test where I try to create a “best-case” scenario for VAR. The geometry is a regular-grid terrain, stripped in the obvious fashion, about 88,000 triangles.
I am pretty sure I am transfer bound at some stage because, as I wrote, if I store the geometry in AGP memory beforehand (not retransmitting every frame), I do see a pretty nice speedup (although not as fast as display lists). Only when I transfer every frame does the speed stay the same or even deteriorate.
And yes, I do use memcpy, and I use only one fence, doing a “finish fence/set fence” at half of the buffer and at the end, so this shouldn’t be a problem…
Hmmm… very strange. Any more ideas?

Thanks,

Michael

Actually, I was also staring at a weak 5 MTris/sec (on a GF1) even with everything static in video memory – with some greedy vertex format I could almost reach that without VAR, too. The big speedup came from disabling lighting: it went right up to 10 MTris/sec. At that point it was fillrate bound; switching the desktop to 16bpp: 12 MTris/sec. Reducing to 320x240: 14.2 MTris/sec.

Lighting, Z buffering, blending, multiple combiners – they all cost cycles.

Also, if you can get away with a vertex format made out of shorts (and ubyte for color if you’re using that) then do so, and make it up in the matrices, rather than sending floats across the bus. That helps if you’re vertex transfer bound.
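For example, here’s a quick sketch of that (names like center/halfExtent are made up; it assumes the positions fit a known bounding box):

/* Quantize float positions into GLshorts once at load time, then undo the
   scale/offset in the modelview matrix at draw time. center/halfExtent
   describe the mesh's bounding volume. */
void QuantizePositions(const float *src, GLshort *dst, int numVerts,
                       const float center[3], float halfExtent)
{
    float s = 32767.0f / halfExtent;
    int i, j;
    for (i = 0; i < numVerts; i++)
        for (j = 0; j < 3; j++)
            dst[i * 3 + j] = (GLshort)((src[i * 3 + j] - center[j]) * s);
}

void DrawQuantized(const GLshort *verts, const float center[3], float halfExtent)
{
    glVertexPointer(3, GL_SHORT, 0, verts);
    glPushMatrix();
    glTranslatef(center[0], center[1], center[2]);
    glScalef(halfExtent / 32767.0f, halfExtent / 32767.0f,
             halfExtent / 32767.0f);
    /* ... issue the Draw*Elements calls here ... */
    glPopMatrix();
}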

In the NVIDIA OpenGL performance documentation, I read that the best way to get maximum rendering speed is to use glDrawRangeElements (the EXT_draw_range_elements entry point), with triangle strips or fans.
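For reference, such a call looks like this (first/last/indexCount/indices are placeholders; first and last must bound every index the strip actually uses):

/* draw one triangle strip whose indices all fall within [first, last] */
glDrawRangeElements(GL_TRIANGLE_STRIP, first, last,
                    indexCount, GL_UNSIGNED_SHORT, indices);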

You can find this document on the NVIDIA website, in the OpenGL SDK.

jide

DrawRangeElements helps because it will tell the driver exactly what memory it needs to access, without it having to first walk the index list and calculate min/max itself.

Of course, this is assuming that your index list is fairly tightly packed.

However, this means that DrawRangeElements is orthogonal to whether you’re using VAR or just plain malloc() memory. I.e., for best performance, you should use both.

In addition, if you’re doing multi-pass rendering (shading) then you’re better off using LockArrays() than DrawRangeElements(), although if you have the information for one, you can do both just to make sure no quick exit in LockArrays() makes the driver miss a pass.

Thus:

Init()
{
    agp = wglAllocateMemoryNV(size, 0, 0, 0.5)   // ~0.5 priority requests AGP memory
    glVertexArrayRangeNV(size, agp)
    glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV)
    memcpy(agp, your_model_data_here, size)
}

RenderObject()
{
    glVertexPointer(3, GL_FLOAT, stride, agp)
    glLockArraysEXT(mesh_first, mesh_vertex_count)   // takes first + count, not min/max
    foreach material {
        glBindTexture(GL_TEXTURE_2D, material_texture)   // etc
        glDrawRangeElements(GL_TRIANGLES, material_min, material_max,
                            material_index_count, GL_UNSIGNED_SHORT,
                            material_indices)
    }
    glUnlockArraysEXT()   // unlock once all passes are done
}

DRE doesn’t help when using VAR.

We never actually scan the index list at present. So if you don’t use DRE/VAR/CVA, you won’t get any reuse at all.

  • Matt

Originally posted by mcraighead:
[b]
We never actually scan the index list at present. So if you don’t use DRE/VAR/CVA, you won’t get any reuse at all.

  • Matt[/b]

Matt,

In GDC01_Performance.pdf, I read:
Vertex cache is activated when used in conjunction with:
• Compiled vertex arrays or VAR
• glDrawElements or glDrawRangeElements

So it seems glDrawElements should suffice, and glDrawRangeElements isn’t needed?

Btw, for single-pass rendering, CVA is consistently slower than just about any other rendering method on GeForce2/3 for me…

Michael

Wimmer,

The vertex cache is on the card, and is simply a FIFO snooper that looks at an incoming index, and sees if one of the last N vertices also used that index, and if so, re-uses the output of that index instead of running it through transform again. That all happens on the card, and can only happen if you’re using indices (DrawElements/DrawRangeElements) and not if you’re using DrawArrays or immediate mode. That has nothing to do with what happens to your index list in the driver (before it gets sent to the card) which is what Matt is talking about when saying they don’t scan the index list.
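To make that concrete, here’s a toy model of such a FIFO cache (CACHE_SIZE and everything else here is illustrative, not actual hardware parameters), counting how many vertices an index list would actually send through transform:

/* Toy model of a FIFO post-transform vertex cache: an index hits if it
   matches one of the last CACHE_SIZE misses. Returns the number of
   vertex transforms actually performed for the index list. */
#define CACHE_SIZE 16   /* illustrative; real chips use small fixed sizes */

int CountTransforms(const unsigned short *indices, int numIndices)
{
    int cache[CACHE_SIZE];
    int head = 0, filled = 0, transforms = 0;
    int i, j, hit;

    for (i = 0; i < numIndices; i++) {
        hit = 0;
        for (j = 0; j < filled; j++) {
            if (cache[j] == indices[i]) { hit = 1; break; }
        }
        if (!hit) {
            transforms++;                  /* miss: vertex goes through T&L */
            cache[head] = indices[i];      /* FIFO replacement */
            head = (head + 1) % CACHE_SIZE;
            if (filled < CACHE_SIZE) filled++;
        }
    }
    return transforms;
}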

Supposing you’re not using VAR. Not scanning the index list means that they either don’t copy your data to a temp buffer when you call DrawElements(), which means that they have to synchronously wait for the transfer to happen before returning, or it means that they copy it and expand it at the same time, and thus turn your DrawElements() into DrawArrays(), which leads to no re-use. That certainly would explain why nVIDIA hardware doesn’t seem to perform quite as well as it should in “plain” (non-VAR) DrawElements() mode. I’m sure there’s a trade-off between CPU/memory usage to scan the array, and the performance lost by expanding (or staying synchronous), though.

As for CVA, I can’t imagine that they do anything other than copy your arrays into some pre-allocated locked memory, so you would get re-use in that case. Perhaps the pre-allocated buffer is too small for large meshes, though, as I’m sure the size is tuned for Quake3 usage (scene poly counts < 20,000 per frame).

Sounds like that document is incorrect.

In my own testing, I generally find that DRE/CVA is a significant performance win, both on the CPU and HW sides.

  • Matt

Matt, are you sure? I don’t see why DrawElements shouldn’t enable the on-chip vertex cache…

As for reuse:
I see that DRE or CVA will be useless with VAR, because you copy the data yourself. But without VAR, I see two advantages if the driver copies the data to some fast buffer:

  1. the call can return immediately, with the actual transfer to the GPU happening asynchronously
  2. in multipass rendering, the data need only be transferred once

Am I right here?

So, the question is when do the drivers copy data:

  • using CVA alone (but then how does the driver know which vertices to copy if it doesn’t scan the index list?)
  • using DRE alone (this would presumably only help the single-pass case, and the only advantage would be the asynchronous execution, right?)
  • using CVA together with DRE (the easiest case for asynchronous and multipass rendering…)?

jwatte: I see, I was thinking more about the single-pass case, where the vertex cache is the only way to get any reuse at all…

Michael

wimmer,

Yes, my observation still holds for the single pass case.

The problem with DrawElements() is that you don’t tell the driver how large your buffer is. Thus, if the driver wants to copy your data to a separate buffer for asynchronous use, you have two choices:

  1. First, scan the index list and calculate the min/max index. Then, copy the array range specified by this min/max to your buffer, and copy the index array. Then the hardware will be fed data + indices, which enables vertex cache re-use.

  2. Walk the list, and for each index, linearly copy the vertex specified by the index. Then draw all the vertices in order, using DrawArrays() style transfer. The hardware will only be fed vertex data, thus there will NOT be any vertex cache re-use.

I would assume that model 1) would be the most efficient. However, Matt has said that nVIDIA drivers do not do any index list scanning, AND he has said that regular DrawElements() does not enable the vertex cache, which leads me to believe they’re using method 2).

Note that you can think of method 2b), which would construct a source->destination index mapping table in memory as it’s copying the vertices. However, there’s no way I can see that would make that more efficient than method 1), so I doubt anyone is actually shipping drivers doing that.
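To illustrate the difference, here’s a rough sketch of the two strategies (my guess at what a driver might do, not actual driver code; Vertex and the push() buffer are stand-ins):

#include <string.h>

typedef struct { float pos[3], nrm[3], uv[2]; } Vertex;  /* example layout */

static unsigned char dmaBuf[1 << 20];   /* stand-in for the driver's staging buffer */
static size_t dmaUsed = 0;

static void push(const void *src, size_t n)
{
    memcpy(dmaBuf + dmaUsed, src, n);
    dmaUsed += n;
}

/* Method 1: scan for min/max, copy that vertex range plus the (rebased)
   index list. Indices survive, so the vertex cache can still find re-use. */
static void method1(const Vertex *v, const unsigned short *idx, int n)
{
    unsigned short lo = 0xffff, hi = 0, rebased;
    int i;
    for (i = 0; i < n; i++) {
        if (idx[i] < lo) lo = idx[i];
        if (idx[i] > hi) hi = idx[i];
    }
    push(v + lo, (size_t)(hi - lo + 1) * sizeof(Vertex));
    for (i = 0; i < n; i++) {
        rebased = (unsigned short)(idx[i] - lo);
        push(&rebased, sizeof rebased);
    }
}

/* Method 2: expand vertices in index order (DrawArrays-style). No scan
   needed, but the indices are gone, so there is nothing left for the
   cache to match against. */
static void method2(const Vertex *v, const unsigned short *idx, int n)
{
    int i;
    for (i = 0; i < n; i++)
        push(&v[idx[i]], sizeof(Vertex));
}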

jwatte,

Interesting, I hadn’t thought about it that way. So what this implies is that the vertex cache can only be turned on if the driver can copy the vertex data to a place where the GPU can access it, because then the driver only copies the index array to the GPU (or places it somewhere the GPU can access it?).

This can happen easily with VAR (user copies data), CVA and DRE (user specifies range). I guess DRE is not necessary with CVA, right?

I wonder why I get consistently slower performance with CVA than with any other transfer method for single pass rendering.

Let me speculate: if I have an array that is too large, then the driver can’t copy the array to its buffer, so it won’t enable the vertex cache, because it didn’t place the vertices somewhere the GPU could pull them from itself… right?

Michael
