Large Mesh VAR Performance

I have to render a large mesh composed of several million triangles.
VAR seems to be the best solution, so I allocate 4 buffers with wglAllocateMemoryNV at 0.5 priority and iteratively memcpy the vertex positions and normals into them. I set and test fences for the various buffers to be sure I don’t access a buffer at the wrong moment. The mesh is tristripped, vertex coords and normals are floats,
with a single light, no two-sided material, backface culling, default material, no texture, and no state changes during drawing.
It’s just a loop of:

glFinishFenceNV(…)
memcpy(…)
glVertexPointer(…)
glNormalPointer(…)

foreach(tristrip)
glDrawElements(GL_TRIANGLE_STRIP,…)

glSetFenceNV(…)

In this setup I’m able to draw just 9.5M triangles per second on a GeForce2 Go in my 1 GHz laptop.

Can I get higher numbers, in your opinion?
I expected to reach at least 15~20 Mtri/s.
Isn’t this the best way to draw geometry?

thanks in advance

Correct me if I’m wrong here, but doesn’t using memcpy to copy chunks of a mesh completely defeat the purpose of AGP-buffered vertex data? It has to be copied to AGP memory and then to the card, instead of just going to the card as with normal vertex array usage…
The most I’ve managed doing that was 6 million tris/sec on my GF1 DDR…

Probably I was not clear. I don’t have enough AGP memory for the whole mesh (hundreds of MB), so I use a few AGP buffers into which I memcpy the data before rendering.
I also tried allocating video memory directly (wglAllocateMemoryNV with priority == 1), but the performance did not change.
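
For a rough sense of how much vertex data the memcpy has to move at the reported rate, here is a small C sketch. It assumes about one new vertex per triangle (long strips, no vertex reuse between strips), which is my assumption and not a measured figure, and 6 floats of 4 bytes per vertex (position plus normal). The function name is made up for illustration.

```c
#include <assert.h>

/* Estimated bytes of vertex data consumed per second.
 * vertices_per_triangle is an ASSUMPTION supplied by the caller:
 * about 1.0 for long strips without reuse, lower if indexed
 * vertices are shared between neighbouring strips. */
double vertex_bytes_per_second(double triangles_per_second,
                               double vertices_per_triangle,
                               int floats_per_vertex)
{
    assert(triangles_per_second >= 0.0);
    assert(vertices_per_triangle > 0.0);
    assert(floats_per_vertex > 0);
    /* 4.0 = size in bytes of one 32-bit float */
    return triangles_per_second * vertices_per_triangle
           * (double)floats_per_vertex * 4.0;
}
```

With 9.5M triangles per second, 1.0 vertices per triangle and 6 floats per vertex, this comes to roughly 228 MB/s of vertex data. If many vertices are shared between strips, the real figure would be lower.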

Do you allocate 4 buffers by calling wglAllocateMemoryNV four times, and do you switch between them with glVertexArrayRangeNV each time you need to change buffers?

If so, try allocating a single large buffer, set glVertexArrayRangeNV to this buffer only and subdivide the allocated buffer into four parts. Then use glVertexPointer etc. to switch between the buffers. Calling glVertexArrayRangeNV is said to be very expensive.
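
The subdivision described above might look something like this in C. It is only a sketch of the pointer arithmetic: VarPool, var_pool_init and var_pool_slice are hypothetical names, and the base pointer is whatever the single allocation returned (in the real program that would be the wglAllocateMemoryNV result; any plain memory block works for trying it out).

```c
#include <assert.h>
#include <stddef.h>

#define NUM_SUBBUFFERS 4

/* One large allocation viewed as NUM_SUBBUFFERS equal slices. */
typedef struct {
    unsigned char *base;        /* start of the single large allocation */
    size_t         slice_bytes; /* size of each sub-buffer in bytes */
} VarPool;

/* Split total_bytes starting at base into equal slices, each rounded
 * down to a multiple of align so every slice starts aligned
 * (provided base itself is aligned). */
VarPool var_pool_init(void *base, size_t total_bytes, size_t align)
{
    VarPool pool;
    assert(base != NULL && align > 0);
    pool.base = (unsigned char *)base;
    pool.slice_bytes = (total_bytes / NUM_SUBBUFFERS) / align * align;
    return pool;
}

/* Start address of sub-buffer number index (0 .. NUM_SUBBUFFERS-1). */
void *var_pool_slice(const VarPool *pool, int index)
{
    assert(pool != NULL);
    assert(index >= 0 && index < NUM_SUBBUFFERS);
    return pool->base + (size_t)index * pool->slice_bytes;
}
```

The idea is that glVertexArrayRangeNV is called once on the whole allocation, and after that only the pointer passed to glVertexPointer and glNormalPointer changes, for example the result of var_pool_slice(&pool, i).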

If you memcpy many megabytes at once, consider copying a smaller chunk, rendering it, and looping again. You want to maximize parallelization. The best would be to memcpy and render around 4000 vertices at a time.

Y.
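
A minimal C sketch of the copy-a-small-chunk-then-render loop suggested above. Everything here is illustrative: stream_mesh, DrawChunkFn and count_chunk are made-up names, the staging array is plain memory standing in for the VAR buffer, and the draw step is a callback so the sketch runs without a GL context. It also ignores the index data, so in a real program the chunk boundaries would have to line up with tristrip boundaries.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_VERTICES    4000 /* chunk size suggested in the post above */
#define NUM_SLOTS         4    /* sub-buffers used round-robin */
#define FLOATS_PER_VERTEX 6    /* position xyz + normal xyz */

/* Called once per chunk after the copy; stands in for the
 * glVertexPointer / glNormalPointer / glDrawElements calls. */
typedef void (*DrawChunkFn)(const float *data, size_t vertex_count,
                            int slot, void *user);

/* Copy src into the staging slots chunk by chunk and hand each chunk
 * to draw. staging must hold NUM_SLOTS * CHUNK_VERTICES vertices.
 * Returns the number of chunks processed. */
size_t stream_mesh(const float *src, size_t vertex_count,
                   float *staging, DrawChunkFn draw, void *user)
{
    size_t done = 0, chunks = 0;
    assert(src != NULL && staging != NULL && draw != NULL);
    while (done < vertex_count) {
        size_t n = vertex_count - done;
        int slot = (int)(chunks % NUM_SLOTS);
        float *dst = staging
                   + (size_t)slot * CHUNK_VERTICES * FLOATS_PER_VERTEX;
        if (n > CHUNK_VERTICES)
            n = CHUNK_VERTICES;
        /* Real code would wait here: glFinishFenceNV(fence[slot]) */
        memcpy(dst, src + done * FLOATS_PER_VERTEX,
               n * FLOATS_PER_VERTEX * sizeof(float));
        draw(dst, n, slot, user);
        /* Real code would mark the slot as in flight:
         * glSetFenceNV(fence[slot], GL_ALL_COMPLETED_NV) */
        done += n;
        chunks++;
    }
    return chunks;
}

/* Example callback: just adds up how many vertices were "drawn". */
void count_chunk(const float *data, size_t vertex_count,
                 int slot, void *user)
{
    (void)data;
    (void)slot;
    *(size_t *)user += vertex_count;
}
```

With four slots the CPU can be filling one slot while the GPU is still reading from the others, which is where the overlap is supposed to come from.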

Just to be more precise:
I allocate a single large buffer and use 4 sub-buffers inside it; I call glVertexArrayRangeNV() on the whole buffer just once.

Trying to reduce the size of the buffers in order to improve the parallelism between cpu and gpu does not change the timings.

For the sake of completeness: the rendered size of the triangles obviously matters. My test bench consists of a set of almost flat squares made of triangles that are almost pixel-sized. The depth complexity of the scene is just one (all the triangles are seen from above), so depth-order drawing should not be an issue.

No, you’re doing everything right. But 9.5M triangles per second is all you can reasonably hope for, especially on a GeForce2 Go.