Timings while rendering a single big mesh

Hello, I would like to compare my timings for rendering a single big unstructured mesh (1M triangles or many more) with per-vertex colors and normals, no textures, a single light, and backface culling enabled.

My approach is to spatially subdivide the mesh into small blocks (1k to 7k vertices), quantize and stripify each block, and send the blocks as vertex arrays through AGP.
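To make the quantization step concrete, here is a minimal sketch, assuming each block keeps its own axis-aligned bounding box and every coordinate is mapped onto the full signed 16-bit range (which is what GL_SHORT vertices would need). The names `quantize_coord` and `dequantize_coord` are hypothetical, not taken from the actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Map a float in [lo, hi] (one axis of the block's bounding box,
 * an assumption of this sketch) to the signed 16-bit range. */
static int16_t quantize_coord(float v, float lo, float hi)
{
    assert(hi > lo);
    float t = (v - lo) / (hi - lo);      /* 0..1 inside the block */
    if (t < 0.0f) t = 0.0f;              /* clamp values outside the box */
    if (t > 1.0f) t = 1.0f;
    int32_t i = (int32_t)(t * 65535.0f + 0.5f);  /* round to nearest step */
    return (int16_t)(i - 32768);
}

/* Inverse mapping, useful to check the quantization error on the CPU. */
static float dequantize_coord(int16_t q, float lo, float hi)
{
    float t = ((float)q + 32768.0f) / 65535.0f;
    return lo + t * (hi - lo);
}
```

At render time the inverse mapping would presumably not be done per vertex on the CPU but folded into a per-block scale and translation in the modelview matrix. The worst-case error per axis is half a quantization step, i.e. (hi - lo) / 131070.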

For real data, my result on an NVIDIA GeForce4 Ti 4200 (latest Detonator driver), Athlon 1200 MHz, AGP 4x is just 17M triangles per second.
On the other hand, I made a test set of 1M triangles organized in small blocks, each containing a cylinder (vertices forming a helix, 12 vertices per turn), and reached 40M triangles per second.

I noticed that:

  1. The size of the blocks matters (best is from 1k to 7k vertices).
  2. The length of the strips changes the speed only a little (it becomes important above 30M/sec).
  3. I could degrade performance to 30M/sec by defeating the vertex cache (100 vertices per turn).
  4. If I reorder the vertices in a block, performance drops considerably, depending on how much (and especially how locally) I permute them.
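Observation 3 can be reproduced on the CPU with a small simulation. The sketch below counts how many vertices of an index stream miss a post-transform vertex cache, assuming a plain FIFO replacement policy and a cache size passed in as a parameter; both are assumptions, since the real hardware policy is not documented here. The helix strip is assumed to be indexed as 0, n, 1, n+1, 2, n+2, ... with n vertices per turn, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CACHE 64   /* upper bound for the simulated cache size */

/* Count the misses of a FIFO vertex cache over an index stream. */
static size_t count_cache_misses(const unsigned *indices, size_t count,
                                 size_t cache_size)
{
    unsigned fifo[MAX_CACHE];
    size_t used = 0, head = 0, misses = 0;

    assert(cache_size > 0 && cache_size <= MAX_CACHE);
    for (size_t i = 0; i < count; ++i) {
        int hit = 0;
        for (size_t j = 0; j < used; ++j) {
            if (fifo[j] == indices[i]) { hit = 1; break; }
        }
        if (!hit) {
            ++misses;
            if (used < cache_size) {
                fifo[used++] = indices[i];      /* cache not full yet */
            } else {
                fifo[head] = indices[i];        /* overwrite the oldest entry */
                head = (head + 1) % cache_size;
            }
        }
    }
    return misses;
}

/* Emit the strip indices of a helix cylinder: 0, n, 1, n+1, 2, n+2, ...
 * Returns the number of indices written (2 * steps). */
static size_t build_helix_strip(unsigned *out, size_t steps,
                                unsigned verts_per_turn)
{
    size_t k = 0;
    for (size_t i = 0; i < steps; ++i) {
        out[k++] = (unsigned)i;
        out[k++] = (unsigned)i + verts_per_turn;
    }
    return k;
}
```

With a simulated cache of 24 entries, a strip with 12 vertices per turn transforms roughly 0.5 vertices per triangle, while 100 vertices per turn gives roughly 1.0, because every vertex has left the cache before the next turn reuses it. That is the same direction as the measured drop from 40M to 30M triangles per second, although the model says nothing about absolute speed.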

Questions:

  • Are my numbers decent?
  • Did I forget some trick that could boost my numbers?
  • Is it correct that the locality of the vertices with respect to the strips is SO important?

Note 1: data formats:
glVertexPointer(3, GL_SHORT, 8, Vstart);
glNormalPointer(GL_SHORT, 8, Nstart);
glColorPointer(4, GL_UNSIGNED_BYTE, 0, Cstart);
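
For reference, here is a sketch of C structs that would match the strides above, assuming positions and normals live in separate arrays and each 3-component short vector is padded to 8 bytes. The type names and `block_bytes` are hypothetical; the actual memory layout is not shown in this post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Three shorts plus one short of padding: 8-byte stride, as passed to
 * glVertexPointer and glNormalPointer. */
typedef struct { int16_t x, y, z, pad; } PackedVec3;

/* Four unsigned bytes, tightly packed: a stride of 0 in glColorPointer. */
typedef struct { uint8_t r, g, b, a; } PackedColor;

/* Bytes needed for one block: position + normal + color per vertex. */
static size_t block_bytes(size_t vertex_count)
{
    assert(sizeof(PackedVec3) == 8 && sizeof(PackedColor) == 4);
    return vertex_count * (2 * sizeof(PackedVec3) + sizeof(PackedColor));
}
```

Under these assumptions each vertex costs 20 bytes, so a 7k-vertex block takes about 140 KB.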

Note 2: I do not copy data into AGP memory each frame, since the whole model fits in it. Both of the meshes mentioned above have 1M triangles.

Note 3: I am using the NV_fence extension.

Note 4: I use a simple stripifier of my own, which seems to work faster and better than the NVIDIA one. Probably I did not use the NVIDIA one properly; any suggestion is welcome.

Note 5: rasterization does not seem to be the bottleneck, as performance did not increase when I enabled glCullFace(GL_FRONT_AND_BACK) or resized the window.

thanks,

Sorry for drawing attention to this post again, but I was probably not clear enough, so the previous post may have looked like a newbie question.

We have to render huge static meshes (tens or hundreds of millions of triangles) that usually fit neither in AGP memory nor in main memory. We have our own octree-based culling system, with caching strategies for moving data in and out of AGP memory.

I just wanted to be sure that in the best case (everything already placed in AGP memory) I was getting the best possible rendering performance. So we ran some tests and got the results in the previous post (only 17M triangles per second for real datasets).

Has anyone here got worse, similar, or better results when rendering REAL big meshes?

thanks again.