I can’t push a GeForce FX 5900 Ultra past about 35 Mtris/sec using the VAR extension (NV_vertex_array_range). Is something wrong, or is it really this slow?
I copy a set of vertices to the VAR memory. This is done only once, not every frame. I draw with DrawElements using QUAD_STRIP, and I use the newest drivers.
I have checked that VAR is enabled, but it is only slightly faster than without VAR. A vertex consists only of a position of 3 floats. Here are some speed measurements with different strip sizes (Linux and Windows):
8 quad_strips: 33 Mtris/sec with VAR, 25 Mtris/sec without VAR
32 quad_strips: 35 Mtris/sec with VAR, 29 Mtris/sec without VAR
128 quad_strips: 34 Mtris/sec with VAR, 26 Mtris/sec without VAR
512 quad_strips: 29 Mtris/sec with VAR, 23 Mtris/sec without VAR
Why does it get slower with long strips? I would expect the opposite effect.
The “Learning VAR” demo comes up with comparable results. Maybe it has an AGP bottleneck, since it copies the vertices to the VAR memory all the time. But with the following statement, I should get video memory, right?
AllocateMemoryNV(size, 0.0f, 0.0f, 0.5f);
With the BenMark5 demo, I get about 125 Mtris/sec. That is DirectX; why is it different?
How can I check whether the allocated VAR memory is actually AGP memory or video memory? Perhaps I get AGP memory rather than video memory, and AGP is the bottleneck?
Does anyone have fast sample code with source?
What is the point of advertising 350M transformed tris/sec if it only delivers 10%?
Originally posted by mfugl:
[b]
I copy a set of vertices to the VAR memory. This is done only once, not every frame. I draw with DrawElements using QUAD_STRIP, and I use the newest drivers.
I have checked that VAR is enabled, but it is only slightly faster than without VAR. A vertex consists only of a position of 3 floats. Here are some speed measurements with different strip sizes (Linux and Windows):
[/b]
Are you absolutely, absolutely sure that you have the newest drivers? My 5900 ran like a dog until I installed the 44.96 drivers for Linux.
[b]
What is the point of advertising 350M transformed tris/sec if it only delivers 10%?
[/b]
Those are untextured, zero-area triangles. You’re rendering quads. The driver probably has to triangulate each quad. I’m guessing your quads are not zero-area.
Do you have the correct AGP drivers installed?
BenMark5 probably uses triangles rather than quads.
There isn’t a whole lot of video-mem to be had. Maybe you are allocating too much and ending up with system-mem. Stick with agp-mem.
Are you writing linearly to memory? Are you aligning the memory correctly?
Try VBO instead of VAR. The usage is a little more intuitive and you don’t have to fart around with NV_fence.
Maybe quad strips aren’t as optimized as tristrips? Or maybe you’re just not drawing enough polygons – you won’t get to 100 MTris/sec by drawing 10,000 polygons at 10,000 fps.
I’ve been trying to achieve 338 million vertices/sec, and I can get to 310 million vertices/sec using GL_TRIANGLES with shared vertices, but this means I am actually only drawing 100 million polys/sec. If I use QUAD_STRIPS I can get 160 million polys/sec but only 80 million vertices/sec.
Originally posted by mfugl: The “Learning VAR” demo comes up with comparable results. Maybe it has an AGP bottleneck, since it copies the vertices to the VAR memory all the time. But with the following statement, I should get video memory, right?
AllocateMemoryNV(size, 0.0f, 0.0f, 0.5f);
No, that will give you AGP memory. You want:
AllocateMemoryNV(size, 0.0f, 0.0f, 1.0f);
The VAR demo is not a good example of maximum T&L performance because it moves each vertex every frame.
If you have a 10x10 vertex grid, you have 100 vertices and 9x9 = 81 quads. 1 quad = 2 tris, so 100 vertices give 162 tris, i.e. 1 extra vertex can give up to 2 extra tris.
Is that not right?
[This message has been edited by Adrian (edited 08-20-2003).]
While that’ll be 100 verts and 162 triangles, the effective vertex throughput rate has to be at least one per triangle, right? (plus two to start the strip)
When you’re having trouble drawing many vertices per second, make sure your vertex format is as small (compact) as possible. Turn off all lighting, texturing, etc., and only send position as two shorts; that ought to eliminate most bus bandwidth problems (internal or external).
Just doing the math might be illuminating: if your vertex is 3 floats of position, 3 floats of normal, and 2 floats of texture coordinates, then it’s 8 floats x 4 bytes = 32 bytes. You can send about 33 million of those per second across an AGP 4x bus (1 GB/s). Similarly, if your main RAM is SDR (about 1 GB/s for PC133), that is also your maximum memory throughput.
the effective vertex throughput rate has to be at least one per triangle, right?
Yes, I’ve thought about it some more and I suppose transformed vertices=sent indices.
So for the figures I posted double the vertex transform numbers for the case STRIPS=1.
My vertex transform speed is close to the advertised spec, so I am fairly happy with that, though I am arguably cheating by using shared vertices and independent tris.
I think the maximum poly throughput of the GF5900u is ~160million tris/sec, if anybody thinks any differently I would be interested to hear.
[This message has been edited by Adrian (edited 08-20-2003).]
Thank you for your comments. Unfortunately, your ideas are not ‘it’.
Adrian, your app’s functionality is nice (though the code is ugly). I have compiled and run it with different settings for strip, var, meshx, meshy, and I get results similar to those you posted. Could you extend your nice app with extra command-line options for:
Reuse of vertices in triangle mode. We could then also see the true hardware vertex transformation speed.
Cull mode: all/front/front&back. No test should run only with everything culled: there is no point in rendering invisible triangles, and no real-world 3D app does this. The performance numbers actually drop quite a lot with no culling.
And please make sure all vertices are within the screen.
My own app is still slow. I will look into it more within a couple of days. When I can see all triangles on the screen, it drops to about 20 Mtris/sec. If I rotate the camera away, it ‘renders’ 52 Mtris/sec. I find this strange. Could triangle setup really be the bottleneck?
Still, I wonder why it gets slower with long strips. It should be the opposite effect.
You know what I think? That GeForces render quads natively. Try this to see it:
Set the ortho projection to [-1, +1] in all directions. Then draw two triangles covering the area (-10000, -10000, 0) -> (+10000, +10000, 0).
You will see a gap between the triangles (because precision is lost, I think). Now draw a single quad in place of these two triangles, and everything will be OK.