136 M verts/sec on GeForce4 Ti ?

I am trying to reproduce the GeForce4 Ti4600 geometry performance stated by nVidia (http://www.nvidia.com/view.asp?PAGE=geforce4ti) - 136 M verts/sec - without success.
My program does:

  • triangles of ~1x1 pixel area
  • 400-vertex-long tri-strips
  • a compiled display list for the tri-strip, called multiple times
  • no lighting or texture
  • only position data per vertex
  • half the tris are back-facing and get culled
It reaches around 60 M verts/sec.
This same program was used with previous nVidia hardware, achieving their stated performance. But 60 is too far removed from 136…
BTW, 60 is roughly what I expected, with the GF3 reaching around 31 (on the Ti500 series) and the GF4 having dual geometry pipes.

Any ideas?

Here’s what you need to do.

Make a vertex program that does absolutely nothing but pass the vertex position along.

Use glCullFace(GL_FRONT_AND_BACK); (to get rid of that pesky rasterizing).

Use NV_vertex_array_range to send your triangles.

Even then, I doubt you’ll get 136 Mpps, but you might get an improvement over 60Mpps.
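For illustration, such a passthrough program in NV_vertex_program syntax could look something like this (untested sketch; it just copies the object-space position to the output and skips the modelview/projection transform):

    /* Sketch: load and enable a do-nothing vertex program that simply
       passes the incoming position through to the output position. */
    static const GLubyte passthroughVP[] =
        "!!VP1.0\n"
        "MOV o[HPOS], v[OPOS];\n"
        "END";

    GLuint progId;
    glGenProgramsNV(1, &progId);
    glBindProgramNV(GL_VERTEX_PROGRAM_NV, progId);
    glLoadProgramNV(GL_VERTEX_PROGRAM_NV, progId,
                    (GLsizei)(sizeof(passthroughVP) - 1), passthroughVP);
    glEnable(GL_VERTEX_PROGRAM_NV);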

Put the data in video memory and use VAR? Make sure to make optimal use of vertex caching in a mesh where each vertex is reused up to 6 times?

Use CullFace FRONT_AND_BACK to ensure that you aren’t bound by the backend?

If you’re hitting the same pixel over and over again, you might get a RMW bottleneck.

Make sure you are counting vertices per second and not triangles? (A lot of people seem to never be able to get the two straight.)

Note that it is impossible to get good vertex reuse with 400-vertex triangle strips.

  • Matt

You shouldn't need to create some kind of silly passthrough vertex program - then you're not measuring any real computation.

  • Matt

Originally posted by mcraighead:
Put the data in video memory and use VAR? Make sure to make optimal use of vertex caching in a mesh where each vertex is reused up to 6 times?

Isn't a compiled display list better? It doesn't push data through AGP at all.


Use CullFace FRONT_AND_BACK to ensure that you aren’t bound by the backend?

You mean bound by rasterization? The triangles are only 1x1 screen pixels.
But maybe the rasterization setup time is killing me?


If you’re hitting the same pixel over and over again, you might get a RMW bottleneck.

I took care not to hit the same screen pixel with two triangles within the strip.
But subsequent calls of the strip display-list do project to the same place.
Maybe I'll stick a tiny glRotate at the end of the list…


Make sure you are counting vertices per second and not triangles? (A lot of people seem to never be able to get the two straight.)

I hope I am not one of them …
With 400-vertex tri-strips, #tris ≈ #verts (they differ by about 0.5%).


Note that it is impossible to get good vertex reuse with 400-vertex triangle strips.

Are you saying that I am not utilizing the 'already-computed-vertex cache'?

Originally posted by mcraighead:
Put the data in video memory and use VAR? Make sure to make optimal use of vertex caching in a mesh where each vertex is reused up to 6 times?

Do you mean something like a hexagonal grid, where 6 triangles meet at almost every vertex?
Maybe VAR is needed not so much for AGP speed, but for explicitly stating (through equal indices) that vertices are shared - more sharing than a tri-strip implies, since only 3 tris meet at a vertex within a strip?

VAR isn’t just for AGP. You can put vertex data into video memory through VAR, and it will likely be faster than display lists.

> If you’re hitting the same pixel over and over again, you might get a RMW bottleneck.

Stupid question: what is an RMW bottleneck and when does it happen?

Y.

Originally posted by Ysaneya:

Stupid question, what is a RMW bottleneck and when does it happen ?

I think he means Read-Modify-Write - the framebuffer has to be read (depth test, blending) before it can be updated, so hammering the same pixel over and over serializes on those memory accesses.

Originally posted by mcraighead:
Make sure to make optimal use of vertex caching in a mesh where each vertex is reused up to 6 times?

Make sure you are counting vertices per second and not triangles? (A lot of people seem to never be able to get the two straight.)

Note that it is impossible to get good vertex reuse with 400-vertex triangle strips.

So you actually mean that the 136 MVert/s figure refers to indices, not transformed vertices???

Some quick calculations: assume you have a regular grid of 6 by n vertices. Sending this as one large triangle strip (with one vertex repeated after each band of 12 indices to manage the "turnaround") gives optimal vertex cache reuse on a Geforce 3+, so each vertex is transformed exactly once (and each interior vertex is shared by 6 triangles, as Matt implied).

On the other hand, you are sending 2*6*(n-1) = 12(n-1) indices for this mesh (not counting the degenerate indices for the turnaround). For large n, the ratio (indices sent)/(vertices transformed) therefore approaches 2, so on a Geforce 4 you will hit about 120M indices/s.

Taking that another step further, assume you are not sending strips but individual triangles. Then you send 10 triangles * 3 indices per triangle * (n-1) bands = 30(n-1) indices. Here the ratio indices/vertices approaches 5, so for a Geforce 4 you could even quote 300M indices/s!
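To make the counting concrete, a tiny sketch along these lines (an illustration only, not a real benchmark) counts the indices for one long strip over a 6-by-n grid and prints the indices-sent / vertices-transformed ratio:

    /* Illustration: count indices vs. unique vertices for one long strip
       over a 6 x n vertex grid. */
    #include <stdio.h>

    #define COLS 6

    int main(void)
    {
        int n = 200;                      /* number of vertex rows */
        long indices = 0;

        for (int row = 0; row < n - 1; ++row)
            indices += 2 * COLS;          /* 12 indices -> 10 triangles per band */

        long degens = 2L * (n - 2);       /* extra indices for the turnarounds */
        long vertices = (long)COLS * n;   /* each vertex transformed exactly once */

        printf("indices = %ld (+%ld degenerate), vertices = %ld, ratio = %.2f\n",
               indices, degens, vertices, (double)indices / vertices);
        /* the ratio approaches 2 for large n, so 60 MVert/s ~ 120M indices/s */
        return 0;
    }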

Is this what you mean???

Michael

  • I used VAR.
  • I used glDrawElements to make vertex repeats explicit (the path is sketched below).
  • I tried independent tris, a tri-strip per mesh band, and one long tri-strip for the whole mesh.
  • I traversed the bands in alternating directions, to help vertex caching.
  • I disabled rasterization with glCullFace(GL_FRONT_AND_BACK).
  • I tried many mesh grid dimensions.
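In outline, that path looks roughly like this (simplified sketch; fillGridPositions and buildStripIndices are placeholder names, not my actual code):

    /* Simplified sketch of the benchmark path (helper names are placeholders). */
    GLsizei numVerts = 6 * 200;                    /* 6 x n vertex grid */
    GLsizei dataSize = numVerts * 3 * sizeof(GLfloat);

    /* readFreq=0, writeFreq=0, priority=1 as the video-memory hint */
    GLfloat *verts = (GLfloat *) wglAllocateMemoryNV(dataSize, 0.0f, 0.0f, 1.0f);
    fillGridPositions(verts, numVerts);            /* placeholder: write x,y,z */

    glVertexArrayRangeNV(dataSize, verts);
    glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);

    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, verts);

    glEnable(GL_CULL_FACE);
    glCullFace(GL_FRONT_AND_BACK);                 /* nothing reaches rasterization */

    GLsizei numIndices = 0;
    GLushort *indices = buildStripIndices(&numIndices);   /* placeholder */
    glDrawElements(GL_TRIANGLE_STRIP, numIndices, GL_UNSIGNED_SHORT, indices);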

Still, I cannot get above 60 Mverts/sec

Does anybody have a benchmark that goes above that?

Don't know what you guys are doing here, but maybe you have to disable some states too, like depth test and depth writes (glDepthMask)?!?

Is this value measured with tris in view, or are they all clipped away?!?


Did you try to make a strip as in my previous post? Did you count Indices/s?

Michael

Originally posted by wimmer:
Did you try to make a strip as in my previous post? Did you count Indices/s?
Michael

If I understood you correctly, then yes.
I tried several mesh dimensions, not just
6 squares (12 triangles) per row.

To your second question, I counted vertices.
In other words, the second argument to glDrawElements (with GL_TRIANGLE_STRIP mode as 1st arg)
The point is to ‘help’ the board reuse previous (cached) vertex transformation results.

To T2k's question: those states are not enabled, and moreover, glCullFace(GL_FRONT_AND_BACK) means nothing reaches the rasterization stage anyway.

The second parameter to glDrawElements (count) is the number of indices - so you are really counting the transferred indices, and should see the benefits of the vertex cache…

Strange… I don’t see what else you could do…

One last point: how much memory do you allocate with VAR? Sometimes you get back AGP memory even if you asked for Vidmem. You can usually check this by testing the memory speed of the returned block…
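One rough way to do that (just a sketch, and the absolute numbers are system dependent): time how long it takes to read the block back through the pointer AllocateMemoryNV returned. Reading video memory across the bus is even slower than reading uncached AGP memory, and both are far slower than cached system memory.

    /* Sketch: estimate read bandwidth of the block returned by AllocateMemoryNV.
       Uncached AGP/video memory reads back far slower than cached system memory. */
    #include <windows.h>

    double readBandwidthMB(const volatile float *p, size_t numFloats)
    {
        LARGE_INTEGER freq, t0, t1;
        float sink = 0.0f;

        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&t0);
        for (size_t i = 0; i < numFloats; ++i)
            sink += p[i];                              /* force every read */
        QueryPerformanceCounter(&t1);

        double secs = (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
        (void)sink;
        return (numFloats * sizeof(float)) / (secs * 1024.0 * 1024.0);
    }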

Michael

Originally posted by wimmer:
The second parameter to glDrawElements (count) is the number of indices - so you are really counting the transferred indices, and should see the benefits of the vertex cache…
Yes, that was the idea


One last point is how much memory do you allocate with VAR? Sometimes you get back AGP memory even if you asked for Vidmem.

I used the "0,0,1" combination in the AllocateMemoryNV call, meaning readFrequency=0, writeFrequency=0, priority=1.
If that doesn't give me Vidmem, I don't know what will.


You can usually test this by testing the memory speed of the returned memory…

The returned memory seems to be in normal process virtual memory, which I think means that the GPU DMA-pulls it at first use and then keeps it in video memory. I don't flush it.


Ultimately, I wouldn't expect to ever get maximum performance out of a video card. Likely, that performance was measured using the theoretical maximums of the hardware, not with a card connected to a PC. In short, the number was probably calculated, not measured. It doesn't take into account memory bandwidth (from video memory or main memory) or latency.

The point is that on previous nVidia hardware, the stated number did match such benchmark programs. Is that no longer true?

I was also trying to figure out what I missed in the new architecture. The GeForce3 (and also the 2) stated ~32 Mverts/sec at a 250MHz core clock (and the benchmarks verified that). The GeForce4 is stated to achieve 136. Comparing the hardware, I couldn't see how this was done. The core clock increased by 300/250 or 325/250 and the geometry pipeline was doubled, so you would expect roughly 32 x (325/250) x 2 ≈ 83. I am wondering what extra 'trick' is present in the GF4 to make this possible. A larger already-computed-vertex cache? More efficient micro-code? But as it stands now, this is for me an "unreproducible result".

The "priority" value is only a usage hint. The driver is not required to give back any specific kind of memory. As stated, I often find that I get AGP memory instead of Vidmem if I request too much. I always get memory if I request 32MB or less, but only with 16MB or less do I get video memory. With a lot of textures loaded, this limit might even drop to 8 or 6MB.
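A typical fallback pattern looks something like this (sketch only; readFrequency=0, writeFrequency=0 with priority 1.0f is the usual video-memory hint, and a priority around 0.5f is commonly suggested for AGP memory):

    /* Sketch: request video memory first, fall back to AGP, then to plain
       system memory (dataSize = bytes of vertex data needed). */
    void *vertMem = wglAllocateMemoryNV(dataSize, 0.0f, 0.0f, 1.0f);   /* vidmem hint */
    if (!vertMem)
        vertMem = wglAllocateMemoryNV(dataSize, 0.0f, 0.0f, 0.5f);     /* AGP hint */
    if (!vertMem)
        vertMem = malloc(dataSize);    /* plain system memory, no VAR benefit */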

Korval: testing whether you can achieve the maximum stated performance is interesting, because then you know that at least for this path, you are doing everything right. Then, you can easily find out which feature takes how much time and why your application runs as fast as it does. It gives you a baseline. And yes, with Nvidia cards it usually is possible to actually achieve the performance stated, as the original poster pointed out. I, too, have been able to get up to 30MVert/s on a Geforce 3 Ti500 and about 23MVert/s on a Geforce 2 GTS…

Michael

Originally posted by wimmer:
As stated, I often experience that I get AGP memory instead of Vidmem if I request too much. I always get memory if I request 32MB or less, but only with 16MB or less do I get video memory. With a lot of textures loaded, this might even be reduced to 8 or 6MB.

Can you explain how you test this? When the data ends up in vidmem, you still get a user-space pointer, maybe not even mapped to the AGP aperture, from AllocateMemoryNV. So I think testing the memory speed of the pointer you got is irrelevant - the data may be pulled up to the on-board memory (“vidmem”) and kept there. I think that is the point with this extension, worded something like “relaxing coherency constraints”.


I, too, have been able to get up to 30MVert/s on a Geforce 3 Ti500 and about 23MVert/s on a Geforce 2 GTS…

And… what do your tests show on GeForce4 Ti4600? Did you get your hands on one yet?
BTW, you can get the 30+ number on GeForce2 Ultra too…