GeForce FX 5900 Ultra slow?

I can’t push a GeForce FX 5900 Ultra to more than about 35 Mtris/sec using the VAR extension. Is something wrong, or is it just this slow?

I copy a set of vertices to the var-memory. This is done only once and not every frame. I use DrawElements with QUAD_STRIP. I use the newest drivers.
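Roughly, the setup looks like this (a minimal sketch, not the exact code; wglAllocateMemoryNV on Windows, glXAllocateMemoryNV on Linux, and names like numVerts, vertices and indices are placeholders):

/* allocate VAR memory and copy the vertices into it once */
GLsizei size = numVerts * 3 * sizeof(GLfloat);
GLfloat *varMem = (GLfloat *) wglAllocateMemoryNV(size, 0.0f, 0.0f, 0.5f);
memcpy(varMem, vertices, size);

glVertexArrayRangeNV(size, varMem);
glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, varMem);

/* per frame */
glDrawElements(GL_QUAD_STRIP, numIndices, GL_UNSIGNED_SHORT, indices);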

I’ve checked that VAR is enabled, but it is only a little bit faster than without VAR. A vertex consists only of a 3*float position. Here are some speed measurements with different strip sizes (Linux and Windows):

8 quad_strips: with VAR 33 Mtris/sec, without VAR 25 Mtris/sec
32 quad_strips: with VAR 35 Mtris/sec, without VAR 29 Mtris/sec
128 quad_strips: with VAR 34 Mtris/sec, without VAR 26 Mtris/sec
512 quad_strips: with VAR 29 Mtris/sec, without VAR 23 Mtris/sec

Why does it get slower with long strips? It should be the opposite effect.

The “Learning VAR” demo comes up with comparable results. Maybe it has an AGP bottleneck, since it copies the vertices to the var-mem all the time. But with the following call, I should get video-mem, right?

AllocateMemoryNV(size, 0.0f, 0.0f, 0.5f);

With the BenMark5 demo, I get about 125 Mtris/sec. That one is DirectX; why is it so different?

How can I check whether the var-memory allocated is actually AGP memory or video memory? Perhaps I’m getting AGP memory instead of video memory, and AGP is the bottleneck?

Does anyone have some sample source code which is very fast?

What is the point of advertising 350M transformed tris/sec if it only delivers 10%?

/mfugl

Perhaps you’re fill-limited?

Originally posted by mfugl:
[b]
I copy a set of vertices to the var-memory. This is done only once and not every frame. I use DrawElements with QUAD_STRIP. I use the newest drivers.

I’ve checked that VAR is enabled, but it is only a little bit faster than without VAR. A vertex consists only of a 3*float position. Here are some speed measurements with different strip sizes (Linux and Windows):
[/b]

Are you absolutely, absolutely sure that you have the newest drivers? My 5900 ran like a dog until I installed the 44.96 drivers for Linux.

[b]

What is the point of advertising 350M transformed tris/sec if it only delivers 10%?
[/b]

Those are untextured, zero-area triangles. You’re rendering quads. The driver probably has to triangulate each quad. I’m guessing your quads are not zero-area.
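For illustration (hypothetical indices), one independent quad split into the two triangles the driver ends up setting up:

/* a quad v0,v1,v2,v3 still costs two triangle setups */
GLushort quad[4] = { 0, 1, 2, 3 };
GLushort tris[6] = { 0, 1, 2,   /* first triangle  */
                     0, 2, 3 }; /* second triangle */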

Do you have the correct AGP drivers installed?

BenMark5 probably uses triangles rather than quads.

There isn’t a whole lot of video-mem to be had. Maybe you are allocating too much and ending up with system-mem. Stick with agp-mem.

Are you writing linearly to memory? Are you aligning the memory correctly?

Try VBO instead of VAR. The usage is a little more intuitive and you don’t have to fart around with NV_fence.
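For example, a minimal static-geometry sketch with ARB_vertex_buffer_object (the buffer is filled once; numVerts, vertices, numIndices and indices are placeholders):

GLuint vbo;
glGenBuffersARB(1, &vbo);
glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
glBufferDataARB(GL_ARRAY_BUFFER_ARB, numVerts * 3 * sizeof(GLfloat),
                vertices, GL_STATIC_DRAW_ARB);   /* upload once */

glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, (void *) 0);     /* offset into the bound VBO */

/* per frame */
glDrawElements(GL_QUAD_STRIP, numIndices, GL_UNSIGNED_SHORT, indices);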

Maybe quad strips aren’t as optimized as tri strips? Or maybe you’re just not drawing enough polygons per frame – you won’t get to 100 MTris/sec by drawing only 10,000 polygons per frame, because that would need 10,000 fps.

– Tom

I don’t think they advertise 350 million triangles/sec; they advertise 338 million vertices/sec.

Achieving card spec performance has been discussed in depth here. http://www.opengl.org/discussion_boards/ubb/Forum3/HTML/005940.html

I’ve been trying to achieve 338 million vertices/sec, and I can get to 310 million vertices/sec using GL_TRIANGLES with shared vertices, but this means I am actually only drawing 100 million polys/sec. If I use QUAD_STRIPS I can get 160 million polys/sec but only 80 million vertices/sec.

Here’s my source code: http://www.adrian.lark.btinternet.co.uk/SpeedTest.htm

I change the #define STRIP from 1 to 0 depending on whether I want maximum poly performance or maximum vertex performance.

Originally posted by mfugl:
The “Learning VAR” demo comes up with comparable results. Maybe it has an AGP bottleneck, since it copies the vertices to the var-mem all the time. But with the following call, I should get video-mem, right?
AllocateMemoryNV(size, 0.0f, 0.0f, 0.5f);

No, that will give you AGP memory; you want:
AllocateMemoryNV(size, 0.0f, 0.0f, 1.0f);
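A common pattern (just a sketch) is to ask for video memory first and fall back, which also tells you what kind of memory you actually got:

/* video memory first, then AGP, then plain system memory */
void *varMem = wglAllocateMemoryNV(size, 0.0f, 0.0f, 1.0f);
if (!varMem)
    varMem = wglAllocateMemoryNV(size, 0.0f, 0.0f, 0.5f);
if (!varMem)
    varMem = malloc(size);   /* no VAR benefit, but still works */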

The VAR demo is not a good example of maximum T&L performance because it moves each vertex every frame.

We could do with a BenMark equivalent.

Here’s my speed test app. (23kb) http://www.adrian.lark.btinternet.co.uk/SpeedTest.zip
You’ll need glut32.dll.

To run, from the DOS prompt use
SpeedTest <STRIPS> <VAR>

where <STRIPS>=1 or 0
and <VAR>=1 or 0

On my GF5900u XP2000
STRIPS=0 VAR=1
103 Million polys/sec
310 Million vertices/sec

STRIPS=1 VAR=1
155 Million polys/sec
84 Million vertices/sec

STRIPS=0 VAR=0
17 Million polys/sec
50 Million vertices/sec

STRIPS=1 VAR=0
48 Million polys/sec
26 Million vertices/sec

Originally posted by Adrian:
STRIPS=1 VAR=1
155 Million polys/sec
84 Million vertices/sec


How do you generate more than one poly (triangle, I presume) per vertex?

If you have a 10x10 vertex grid, you have 100 vertices and 81 quads; 1 quad = 2 tris => 100 vertices ~ 162 tris => 1 extra vertex can give 2 extra tris.

Is that not right?
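(To generalise the arithmetic: an n x n vertex grid has n*n vertices and (n-1)*(n-1) quads = 2*(n-1)*(n-1) tris, so the ratio tends towards 2 tris per vertex for large grids.)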

[This message has been edited by Adrian (edited 08-20-2003).]

While that’ll be 100 verts and 162 triangles, the effective vertex throughput rate has to be at least one per triangle, right? (plus two to start the strip)

When you’re having trouble drawing many vertices per second, make sure your vertex format is as small (compact) as possible. Turn off all lighting, texturing, etc and only send position as two shorts; that ought to eliminate most bus bandwidth problems (internal or external).
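For example (a sketch; the GLshort positions would have to be rescaled in the modelview matrix or a vertex program, and pos, numIndices and indices are placeholders):

glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(2, GL_SHORT, 0, pos);   /* 4 bytes per vertex, position only */
glDrawElements(GL_TRIANGLE_STRIP, numIndices, GL_UNSIGNED_SHORT, indices);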

Just doing the math might be illuminating: if your vert is 3 position, 3 normal, and 2 texture coordinates, then it’s 32 bytes. You can send about 33 million of those across an AGP 4x bus (1 GB/s) in a second. Similarly, if your main RAM is SDR, that’s your maximum memory throughput.
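The same arithmetic, spelled out (using a nominal 1 GB/s for AGP 4x):

/* 3 floats position + 3 floats normal + 2 floats texcoords = 8 floats */
size_t bytesPerVert   = 8 * sizeof(float);              /* 32 bytes                         */
double busBytesPerSec = 1.0e9;                          /* nominal AGP 4x                   */
double maxVertsPerSec = busBytesPerSec / bytesPerVert;  /* ~31M; ~33M at the full 1.06 GB/s */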

the effective vertex throughput rate has to be at least one per triangle, right?

Yes, I’ve thought about it some more, and I suppose transformed vertices = sent indices.

So for the figures I posted, double the vertex transform numbers for the STRIPS=1 case.

My vertex transform speed is close to the advertised spec, so I am fairly happy with that, though I am arguably cheating by using shared vertices and independent tris.

I think the maximum poly throughput of the GF5900u is ~160million tris/sec, if anybody thinks any differently I would be interested to hear.

[This message has been edited by Adrian (edited 08-20-2003).]

Thank you for your comments. Unfortunately, your ideas are not ‘it’.

Adrian, your app’s functionality is nice (though the code is ugly). I have compiled and run it with different settings for strip, var, meshx, meshy, and I get results similar to those you posted. Could you extend your app with extra command-line options for:

  • Reuse of vertices in triangle mode. We could then also see the true vertex HW transformation speed.

  • Cull mode: all/front/front&back. A test shouldn’t just cull everything: there is no point in rendering invisible triangles - no 3D app does this in the real world. Actually, the performance numbers drop quite a lot with no culling.

And please make sure all vertices are on the screen.

My own app is still slow. I will look into it more within a couple of days. Actually, when I can see all triangles on the screen, it drops down to about 20 Mtris/sec. If I rotate the camera away, it ‘renders’ 52 Mtris/sec. This is strange, I think - could it really be triangle setup that is the bottleneck?

Still, I wonder why it gets slower with long strips? It should be the opposite effect.

/mfugl

Vertex cache! And all the hell of finding the optimal size for the given HW.

Uhm, if you don’t want to cull the tris, it is completely logical that you can’t achieve the numbers from the specs, since you get fill-rate limited…

You know what I think? That GeForces render quads natively. Try this to see it:
Set ortho to [-1, +1] in all directions. Then draw two triangles covering the area (-10000, -10000, 0) -> (+10000, +10000, 0).
You will see a gap between the tris (because precision is lost, I think). Now draw a single quad in place of these two tris, and everything will be OK.
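Something like this sketch, in immediate mode (assuming the [-1, +1] ortho projection is already set up):

glBegin(GL_TRIANGLES);
    glVertex3f(-10000.0f, -10000.0f, 0.0f);
    glVertex3f( 10000.0f, -10000.0f, 0.0f);
    glVertex3f( 10000.0f,  10000.0f, 0.0f);
    glVertex3f(-10000.0f, -10000.0f, 0.0f);
    glVertex3f( 10000.0f,  10000.0f, 0.0f);
    glVertex3f(-10000.0f,  10000.0f, 0.0f);
glEnd();

/* versus the same area as a single quad */
glBegin(GL_QUADS);
    glVertex3f(-10000.0f, -10000.0f, 0.0f);
    glVertex3f( 10000.0f, -10000.0f, 0.0f);
    glVertex3f( 10000.0f,  10000.0f, 0.0f);
    glVertex3f(-10000.0f,  10000.0f, 0.0f);
glEnd();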

Michal Krol