136 M verts/sec on GeForce4 Ti?



Moshe Nissim
03-19-2002, 12:28 PM
I am trying to reproduce the GeForce4 Ti4600 geometry performance as stated by nVidia (http://www.nvidia.com/view.asp?PAGE=geforce4ti) - 136 M verts/sec, without success.
My program does:
- 1x1 area pixels
- 400 long tri-strips
- compiled display list for the tri-strip called multiple times
- no lighting, or texture
- only position data per vertex
- half the tris are back-facing and get culled
It reaches around 60 M verts/sec.
This same program was used with previous nVidia hardware, achieving their stated performance.
But 60 is too far removed from 136 ...
BTW, 60 is kind of what I expected, with the GF3 reaching around 31
(with the Ti500 series) and the GF4 having dual geometry pipes.

Any ideas?

Korval
03-19-2002, 12:38 PM
Here's what you need to do.

Make a vertex program that does absolutely nothing but pass the vertex position along.

Use glCullFace(GL_FRONT_AND_BACK); (to get rid of that pesky rasterizing).

Use NV_vertex_array_range to send your triangles.

Even then, I doubt you'll get 136 Mpps, but you might get an improvement over 60 Mpps.
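Something along these lines; untested and typed from memory, so treat it as a sketch rather than working code (and it assumes the extension entry points were already fetched via wglGetProcAddress/glXGetProcAddress):

    /* Trivial passthrough program (NV_vertex_program), plus culling
       both faces so nothing ever reaches the rasterizer. */
    static const char vp[] =
        "!!VP1.0\n"
        "MOV o[HPOS], v[OPOS];\n"   /* pass the position through untouched */
        "END\n";
    GLuint prog;
    glGenProgramsNV(1, &prog);
    glLoadProgramNV(GL_VERTEX_PROGRAM_NV, prog,
                    (GLsizei)strlen(vp), (const GLubyte *)vp);
    glBindProgramNV(GL_VERTEX_PROGRAM_NV, prog);
    glEnable(GL_VERTEX_PROGRAM_NV);

    glEnable(GL_CULL_FACE);
    glCullFace(GL_FRONT_AND_BACK);  /* cull everything, front and back */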

mcraighead
03-19-2002, 12:39 PM
Put the data in video memory and use VAR? Make sure to make optimal use of vertex caching in a mesh where each vertex is reused up to 6 times?

Use CullFace FRONT_AND_BACK to ensure that you aren't bound by the backend?

If you're hitting the same pixel over and over again, you might get a RMW bottleneck.

Make sure you are counting vertices per second and not triangles? (A lot of people seem to never be able to get the two straight.)


Note that it is _impossible_ to get good vertex reuse with 400-vertex triangle strips.

- Matt

mcraighead
03-19-2002, 12:40 PM
You shouldn't need to create some kind of silly passthrough vertex program. With a passthrough program you're not measuring any real computation.

- Matt

Moshe Nissim
03-19-2002, 12:49 PM
Originally posted by mcraighead:
Put the data in video memory and use VAR? Make sure to make optimal use of vertex caching in a mesh where each vertex is reused up to 6 times?

isn't a compiled display list better? It doesn't push data through AGP at all.


Use CullFace FRONT_AND_BACK to ensure that you aren't bound by the backend?

you mean by rasterization? The triangles are 1x1 in screen pixels.
But maybe the rasterization setup time is killing me?


If you're hitting the same pixel over and over again, you might get a RMW bottleneck.

I took care to not hit the same screen pixel with two triangles in the strip.
But subsequent calls of the strip display-list do project to the
same place. Maybe I'll stick a tiny glRotate at the end of the list...


Make sure you are counting vertices per second and not triangles? (A lot of people seem to never be able to get the two straight.)

I hope I am not one of them ...
With 400-long tri-strips, #tris =~ #verts (the ~ is 0.5%)


Note that it is _impossible_ to get good vertex reuse with 400-vertex triangle strips.

Are you saying that I am not utilizing the 'already computed vertex cache' ?

Moshe Nissim
03-19-2002, 01:04 PM
Originally posted by mcraighead:
Put the data in video memory and use VAR? Make sure to make optimal use of vertex caching in a mesh where each vertex is reused up to 6 times?
- Matt

Do you mean something like a hexagonal grid? (6 triangles meet at
almost every vertex.)
Maybe VAR is necessary not so much for the AGP speed, but for the
explicit 'statement' of the fact that vertices are repeated (equal indices), more than what is implied by a tri-strip (where only 3 tris meet at a vertex)?

Korval
03-19-2002, 01:22 PM
VAR isn't just for AGP. You can put vertex data into video memory through VAR, and it will likely be faster than display lists.
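The setup looks roughly like this (a sketch from memory; the buffer names and sizes are just illustrative, and on Linux the allocator is glXAllocateMemoryNV instead):

    /* NV_vertex_array_range setup; entry points come from wglGetProcAddress. */
    GLsizei bytes = nVerts * 2 * sizeof(GLfloat);        /* 2F positions */
    /* readFreq=0, writeFreq=0, priority=1 hints "video memory, please" */
    GLfloat *vmem = (GLfloat *)wglAllocateMemoryNV(bytes, 0.0f, 0.0f, 1.0f);
    if (!vmem)                        /* fall back toward AGP memory */
        vmem = (GLfloat *)wglAllocateMemoryNV(bytes, 0.0f, 0.0f, 0.5f);

    memcpy(vmem, vertexData, bytes);  /* fill once, then leave it alone */

    glVertexArrayRangeNV(bytes, vmem);
    glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);
    glVertexPointer(2, GL_FLOAT, 0, vmem);
    glEnableClientState(GL_VERTEX_ARRAY);
    glDrawElements(GL_TRIANGLE_STRIP, nIndices, GL_UNSIGNED_SHORT, indices);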

Ysaneya
03-20-2002, 12:09 AM
> If you're hitting the same pixel over and over again, you might get a RMW bottleneck.

Stupid question: what is an RMW bottleneck, and when does it happen?

Y.

Moshe Nissim
03-20-2002, 12:21 AM
Originally posted by Ysaneya:

Stupid question: what is an RMW bottleneck, and when does it happen?


I think he means Read-Modify-Write: if you keep hitting the same pixel, the old framebuffer value has to be read back, modified and rewritten every time.

wimmer
03-20-2002, 12:39 AM
Originally posted by mcraighead:
Make sure to make optimal use of vertex caching in a mesh where each vertex is reused up to 6 times?

...

Make sure you are counting vertices per second and not triangles? (A lot of people seem to never be able to get the two straight.)

Note that it is _impossible_ to get good vertex reuse with 400-vertex triangle strips.
- Matt

So you actually mean that the 136 MVert/s figure refers to _indices_, not transformed _vertices_???

Some quick calculations: assuming you have a regular grid with 6 by n vertices. Sending this as one large triangle strip (with one vertex repeated for each row of 12 triangles to manage the "turnaround") gives optimal vertex cache reuse on a Geforce 3+, so each vertex is transformed exactly once (and each interior vertex is shared by 6 triangles, as Matt implied).

On the other hand, you are sending 2*6*(n-1) _indices_ for this mesh (not counting degenerate indices for the turnaround). For large n, this means the ratio (indices sent)/(vertices transformed) approaches 2, so on a Geforce 4 you will hit about 120M indices/s.

If we take that another step further, assume you are not sending strips, but individual triangles. Then you have 10 triangles * 3 indices per triangle * (n-1) bands = 5*6*(n-1) indices sent. Here, the ratio indices/vertices approaches 5, so for a Geforce 4, you could even quote 300M indices/s!
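To make the arithmetic easy to check, here it is as a throwaway snippet (n is arbitrary):

    #include <stdio.h>

    int main(void)
    {
        int n = 1000;                      /* rows of vertices in the 6-by-n grid */
        double verts = 6.0 * n;            /* each transformed once, ideal caching */
        double strip = 2.0 * 6 * (n - 1);  /* strip indices, turnarounds ignored */
        double indep = 10.0 * 3 * (n - 1); /* 10 tris per band * 3 indices each */
        printf("strip: %.2f indices per transformed vertex\n", strip / verts);
        printf("indep: %.2f indices per transformed vertex\n", indep / verts);
        return 0;                          /* prints ~2 and ~5 */
    }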

Is this what you mean???

Michael

Moshe Nissim
03-20-2002, 04:01 AM
I used VAR.
I used glDrawElements to make vertex repeats explicit.
I tried independent tris, a tri-strip per mesh band, and one long tri-strip for the whole mesh.
I traversed the bands in alternating directions, to help vertex caching.
I disabled rasterization (glCullFace(GL_FRONT_AND_BACK)).
I tried many mesh grid dimensions.

Still, I cannot get above 60 Mverts/sec
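For reference, the timing loop is essentially this (a simplified sketch, not my exact code; clock() is crude but fine for a rough Mverts/s figure):

    glFinish();                            /* drain anything still queued */
    clock_t t0 = clock();
    for (int i = 0; i < reps; i++)
        glDrawElements(GL_TRIANGLE_STRIP, nIndices, GL_UNSIGNED_SHORT, indices);
    glFinish();                            /* wait until the GPU is really done */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    printf("%.1f M verts/sec (strictly speaking: indices/sec)\n",
           (double)reps * nIndices / secs / 1e6);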

Does anybody have a benchmark that goes above that?

T2k
03-20-2002, 04:39 AM
don't know what you guys are doing here, but maybe you have to disable some states too, like depth test and depth writes (glDepthMask)?!?

is this value measured with the tris in view, or are they all clipped away?!?

[This message has been edited by T2k (edited 03-20-2002).]

wimmer
03-20-2002, 05:53 AM
Did you try to make a strip as in my previous post? Did you count Indices/s?

Michael

Moshe Nissim
03-20-2002, 07:12 AM
Originally posted by wimmer:
Did you try to make a strip as in my previous post? Did you count Indices/s?
Michael

If I understood you correctly, then yes.
I tried several mesh dimensions, not just
6 squares (12 triangles) per row.

To your second question, I counted vertices.
In other words, the second argument to glDrawElements (with GL_TRIANGLE_STRIP mode as 1st arg)
The point is to 'help' the board reuse previous (cached) vertex transformation results.

To T2k's question: no such states are enabled, and moreover, glCullFace(GL_FRONT_AND_BACK) means nothing reaches the rasterization stage.

wimmer
03-20-2002, 09:12 AM
The second parameter to glDrawElements (count) is the number of indices - so you are really counting the transferred indices, and should see the benefits of the vertex cache...

Strange... I don't see what else you could do...

One last point: how much memory do you allocate with VAR? Sometimes you get back AGP memory even if you asked for vidmem. You can usually check this by testing the memory speed of the returned memory...
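E.g. with a crude probe along these lines (a sketch; compare the result against MB/s figures measured beforehand for known AGP and known video memory):

    #include <string.h>
    #include <time.h>

    /* Copy into the pointer returned by the allocator a few times
       and compute the write bandwidth in MB/s. */
    double probe_mb_per_s(void *dst, const void *src, size_t bytes, int reps)
    {
        clock_t t0 = clock();
        for (int i = 0; i < reps; i++)
            memcpy(dst, src, bytes);
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        return (double)bytes * reps / secs / 1e6;
    }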

Michael

Moshe Nissim
03-20-2002, 09:23 AM
Originally posted by wimmer:
The second parameter to glDrawElements (count) is the number of indices - so you are really counting the transferred indices, and should see the benefits of the vertex cache...
Yes, that was the idea


One last point: how much memory do you allocate with VAR? Sometimes you get back AGP memory even if you asked for vidmem.


I used the "0,0,1" combination in the AllocateMemoryNV call, meaning readFrequency=0, writeFrequency=0, priority=1.
If that doesn't give me Vidmem, I don't know what will.


You can usually check this by testing the memory speed of the returned memory...


The returned memory seems to be in normal process virtual memory, which I think means the GPU DMA-pulls it at first use and then keeps it in video memory. I don't flush it.


Michael

Korval
03-20-2002, 12:14 PM
Ultimately, I wouldn't expect to ever get maximum performance out of a video card. Likely, that figure comes from the theoretical maximums of the hardware, not from a card connected to a PC. In short, the number was probably calculated, not measured. It doesn't take memory bandwidth (from video memory or main memory) or latency into account.

Moshe Nissim
03-20-2002, 12:24 PM
The point is that on previous nVidia hardware, the stated number did match such benchmark programs. Is that no longer true?

I was also trying to figure out what I missed in the new architecture. The GeForce3 (and also the 2) stated ~32 Mverts/sec at a 250MHz core clock (and the benchmarks verified that). The GeForce4 is stated to achieve 136. Comparing the hardware, I couldn't see how this was done: the core clock increased by 300/250 or 325/250 and the geometry pipeline was doubled, so you would expect ~83. I am wondering what extra 'trick' in the GF4 makes this possible. A larger already-computed-vertex cache? More efficient micro-code? As it stands now, this is for me an "unreproducible result".

wimmer
03-20-2002, 12:46 PM
The "priority"-value is only a usage hint. The driver is not required to give back any specific kind of memory. As stated, I often experience that I get AGP memory instead of Vidmem if I request too much. I always get memory if I request 32MB or less, but only with 16MB or less do I get video memory. With a lot of textures loaded, this might even be reduced to 8 or 6MB.

Korval: testing whether you can achieve the maximum stated performance is interesting, because then you know that at least for this path, you are doing everything right. Then, you can easily find out which feature takes how much time and why your application runs as fast as it does. It gives you a baseline. And yes, with Nvidia cards it usually is possible to actually achieve the performance stated, as the original poster pointed out. I, too, have been able to get up to 30MVert/s on a Geforce 3 Ti500 and about 23MVert/s on a Geforce 2 GTS...

Michael

Moshe Nissim
03-20-2002, 12:58 PM
Originally posted by wimmer:
As stated, I often experience that I get AGP memory instead of Vidmem if I request too much. I always get memory if I request 32MB or less, but only with 16MB or less do I get video memory. With a lot of textures loaded, this might even be reduced to 8 or 6MB.


Can you explain how you test this? When the data ends up in vidmem, you still get a user-space pointer, maybe not even mapped to the AGP aperture, from AllocateMemoryNV. So I think testing the memory speed of the pointer you got is irrelevant - the data may be pulled up to the on-board memory ("vidmem") and kept there. I think that is the point with this extension, worded something like "relaxing coherency constraints".




I, too, have been able to get up to 30MVert/s on a Geforce 3 Ti500 and about 23MVert/s on a Geforce 2 GTS...


And... what do your tests show on GeForce4 Ti4600? Did you get your hands on one yet?
BTW, you can get the 30+ number on GeForce2 Ultra too...

mcraighead
03-20-2002, 01:19 PM
I suppose independent triangles w/ no vertex reuse must be the way to go, then -- I was a bit confused. This might easily be a setup bottleneck you're measuring, not a transform one.

There is more in the chip that's changed than just the number of pipelines and the clock speed.

Also, even 100 Mvertices/s with 3F vertices is more than AGP 4x allows. Use 2F or 2S vertices. Video memory is also good.
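Spelled out (taking the AGP 4x peak as roughly 1066 MB/s):

    /* Bandwidth needed for positions alone at 100 Mverts/s. */
    double v3f = 100e6 * 3 * 4 / 1e6;   /* 3 floats: 1200 MB/s, over AGP 4x */
    double v2f = 100e6 * 2 * 4 / 1e6;   /* 2 floats:  800 MB/s, fits       */
    double v2s = 100e6 * 2 * 2 / 1e6;   /* 2 shorts:  400 MB/s, easy       */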

- Matt

Moshe Nissim
03-20-2002, 08:58 PM
Originally posted by mcraighead:
I suppose independent triangles w/ no vertex reuse must be the way to go, then



I also tried independent triangles




This might easily be a setup bottleneck you're measuring, not a transform one.



This is exactly my suspicion. I don't think there is a way to separate the two. This is also why I think you cannot spec the first (transform) if the second (setup) is lower.




Also, even 100 Mvertices/s with 3F vertices is more than AGP 4x allows. Use 2F or 2S vertices.


But I thought VAR took care of that. The only thing that is transferred is the index list, which is shorts. Pity I can't pack it into a display list (as spec'ed, a display list pulls the vertex data in at compile time).

Anyway, I did use 2F vertices.


Matt, do you have a benchmark program that demonstrates this performance?




Video memory is also good.


This is why I started out with a display list (of 2F vertex data). Actually, this is what gave me peak performance on previous hardware, and on the GF3 it also gets the fastest time (although only by a small lead over VAR).

wimmer
03-20-2002, 11:11 PM
Can you explain how you test this?

Well, I have two different memcpy routines: one from the AMD processor optimization guide, and one very simple SIMD one from the net. The AMD one gives me about 930MB/s for AGP, and about 560MB/s for vidmem. The simple SIMD one gives me about 730MB/s for AGP, and 750MB/s for vidmem. This is all tested on an idle GPU.

It gets interesting when the GPU is busy rendering (i.e., pulling vertex data out of the same memory). Then, the AMD-AGP memcpy drops from 930MB/s to 510MB/s, which is reasonable, since now the GPU and CPU have to share the 1024MB/s bandwidth you get with AGP 4x. However, if I use vidmem under load, the SIMD-vidmem memcpy only drops from 730MB/s to about 700MB/s, which is reasonable as well, since video memory is >2GB/s, and the simultaneous GPU/CPU access is barely noticed.

This carries through to all other tests where there is a difference between AGP and vidmem, as well. So, when I test the memory I get from wglAllocateMemoryNV, I compare the speeds with the speeds above (which were obtained using small requested memory size and no textures loaded, so I am sure I got the correct memory), and sometimes the characteristics match exactly the AGP case, sometimes the vidmem case.

Whether the data still ends up in vidmem if I request vidmem but the memory has AGP speed characteristics, I can't say, but I doubt it... (btw, you need fastwrites enabled for the vidmem-speeds above)

About the GF4: I haven't got one yet, unfortunately :)

Michael


[This message has been edited by wimmer (edited 03-21-2002).]

wimmer
03-20-2002, 11:33 PM
I suppose independent triangles w/ no vertex reuse must be the way to go, then -- I was a bit confused.

Yes, I think testing indices/s (by using the vertex cache) is quite silly if you want to test transform speed.

Testing independent triangles as Matt suggests will give you the "real" speed of the transform engine, independent of your geometry.

Then the number of indices reflects the actual number of vertices transformed. What speed did you get with independent triangles (i.e., triangles which DON'T share vertices)?


This is exactly my suspicion. I don't think there is a way to separate the two. This is also why I think you cannot spec the first if the second is lower.

If you use independent triangles, you need to transform 3 vertices per triangle, not 1, so the bottleneck should go back to transformation and not setup. Still assuming you want to test vertices/s, not indices/s...


BTW, you can get the 30+ number on GeForce2 Ultra too...

Obviously, since the Ultra is clocked higher than the GF3...

One more thing about vertices vs. indices: If I use a regular grid and render it as individual triangles in strip order on the GF3 Ti 500, I achieve 28.5MVertices/s (that means actually transformed vertices, counted by simulating the vertex cache in software as in the NvTriStrip example), but 86MIndices/s (i.e., Million Indices sent over the bus!). So there doesn't seem to be a setup bottleneck in this case, and you should achieve at least as much on the GF4...


This is why I started out with display list (of 2F vertex data).

I think Cass or Matt once stated that vertices are kept in AGP memory for display lists... But they may include other optimizations as well, e.g. also keeping indices in fast memory...

Michael


[This message has been edited by wimmer (edited 03-21-2002).]

Moshe Nissim
03-21-2002, 06:23 AM
Originally posted by wimmer:

Testing independent triangles as Matt suggests will give you the "real" speed of the transform engine, independent of your geometry.

Then the number of indices reflects the actual number of vertices transformed. What speed did you get with independent triangles (i.e., triangles which DON'T share vertices)?

Indeed, with independent tris I get 167 Mverts/sec. It looks like the setup was the bottleneck.
But then, with this method of counting, the GeForce3 (or 2) achieves 73... higher than the stated 32.

It looks like not only the performance changed from GF2/3 to GF4, but also the measured 'entity' (comparing apples to oranges...).
The ratio when comparing the same things is the expected 2.3 (dual vs. single geometry pipeline, plus the increased core clock).

Moshe Nissim
03-21-2002, 07:08 AM
Correction:
The 178 Mverts/sec was achieved with independent triangles with shared vertices, so I guess I am seeing the effect of the "post-T&L vertex cache" (plus the fact that it is not a tri-strip, so triangle setup happens only once every 3 vertices).

With independent triangles with no shared vertices, I get 134 Mverts/sec.
At last, close to the nVidia stated number...

wimmer
03-21-2002, 09:17 AM
so I guess I am seeing the effect of the "post-T&L vertex cache"

Yes, exactly, that's what I've been trying to say all along! You need to distinguish between vertices/s and indices/s! If vertices are shared, you are counting indices/s and, most likely, some kind of setup overhead. With independent triangles, as in "not-sharing-any-vertex" triangles, you get the real speed of the transform engine.

But wow, if you really achieve 134Mvert/s, then they really improved the vertex engine a lot!

Michael

Moshe Nissim
03-21-2002, 09:57 AM
Originally posted by wimmer:

But wow, if you really achieve 134Mvert/s, then they really improved the vertex engine a lot!
Michael

It's the dual pipeline vs. the single one in the GF2/GF3, plus the core clock increase.
Not more, not less.

If you followed the previously claimed numbers (for the GF2; I think with the GF3 they didn't state anything), you would have been led to believe the increase is more dramatic: from 32 to 136. But the point is that those numbers measure different things.
Back then they measured transform + setup.
Now they measure transform without setup.
If you use 1-vertex triangles, you're fine ;-)

wimmer
03-21-2002, 11:07 PM
so what's the real speedup?

what do you get on a GF2/3 with independent tris with no shared vertices?

Michael

Moshe Nissim
03-21-2002, 11:31 PM
Originally posted by wimmer:
so what's the real speedup?

what do you get on a GF2/3 with independent tris with no shared vertices?

Michael


1. tri strip, 6xM mesh (good caching) - 47
2. indep tris, shared verts, 6xM mesh - 93
3. indep tris, shared verts, 5xM mesh - 114
4. indep tris, non-shared verts - 28

All numbers are million vertices per second.

With the GF4, I think I got 178 with (3), and 134 with (4).
But I will have access to the machine only on Sunday to repeat the test exactly.

I know, this gives very different factors.
I think what accounts for this is a greater improvement in the GF4 in transformation than in setup (and both do "post-T&L caching").

I still don't understand how setup comes into play when there is no rasterization (glCullFace...). Is the facingness computation so intensive? The GL spec talks about computing the projected triangle's area and looking at its sign to determine facingness, but I thought it could be done without the full exact area computation.

GPSnoopy
03-22-2002, 12:36 AM
How come triangle strips are so slow???

I mean, they're the best case for drawing triangles, and yet you manage to get better results with independent triangles?! OK, they are shared vertices, but still, shouldn't tri-strips lead to better performance?

What do you mean by "6xM" and "5xM" mesh?

Moshe Nissim
03-22-2002, 01:16 AM
Originally posted by GPSnoopy:
How come triangle strips are so slow???

I mean, they're the best case for drawing triangles, and yet you manage to get better results with independent triangles?! OK, they are shared vertices, but still, shouldn't tri-strips lead to better performance?



With a triangle strip, you activate the triangle setup engine for every vertex you send.
With independent triangles, you activate it only for every three vertices.

If triangle setup is the bottleneck, indep tris will be faster.
Of course this holds for indep tris sharing vertices, and hardware capable of caching post-transform results.




What do you mean by "6xM" and "5xM" mesh?

6xM means a grid of 6 columns and M rows,
so 6*M squares and 6*M*2 triangles.
Triangles are traversed along the rows, back and forth (not in 'raster order'), to better utilize the vertex cache; roughly like the sketch below.

wimmer
03-22-2002, 05:17 AM
Wait, wait! That's very misleading!

Triangle strips are by no means slower than independent triangles in his case! Also, I don't really see a setup bottleneck in this data.

Consider this:
In 1. (strips), Moshe is pushing 47 million indices/s. Since it is one large strip, this equates to about the same number of triangles, i.e., 47 million triangles/s.

In 2. (tris, 6xM), Moshe is pushing 93 million indices, but each triangle needs 3 indices instead of just 1, so the actual triangle rate is 31 million triangles/s.

In 3. (tris, 5xM), it is 114 million indices at 3 indices/tri, so we get 38 million triangles/s.

So you see that triangle strips actually give you the very best performance for this mesh, and independent triangles are way slower. Obviously, this cannot be explained by a setup bottleneck, because setup is per triangle, and so for strips he is doing 47 million setups/s, and for independent triangles only 31 or 38...

I wonder why the 6xM mesh is slower than 5xM. If you send the triangle indices in the correct order on a Geforce 3 (which has 18 effective cached vertices), you get maximum reuse, i.e., you transform each vertex in the whole mesh exactly once, even in the 6xM mesh (well, actually, there is exactly one vertex you need to transform twice). This should go for both strips and independent triangles.

Moshe: the dual pipeline + clock speed increase lets me go from 30 million vertices/s to 75 million vertices/s, but not to 134! There's something wrong here.

Ok, I think this topic is getting very confusing, especially for other readers.

What everybody has to keep in mind is that we are measuring three different entities here:

1) actual transformed vertices/s (i.e., a vertex taken from the vertex cache does NOT count here)

2) triangles/s (this could show up setup bottlenecks), to keep in mind how much geometry you are actually creating

3) sent indices/s (this basically measures how effectively your geometry is organized and how well you exploit the vertex cache). This can range from a maximum of 3 times the triangle rate (if you use independent triangles) down to exactly the triangle rate (if you use strips).

Now for the meshes discussed here, we have about twice as many triangles as vertices, but counting it exactly, you find that you send 5 indices per vertex if you use independent triangles, and 2 indices per vertex if you use strips. Actually, in the example above, 47 * 5/2 = 118 is not so far from the 114, so you can see where this comes from...

I hope this clarifies things a bit. And please, let's talk about indices/s from now on, not vertices/s, when we talk about the second parameter to glDrawElements...

Michael

Moshe Nissim
03-23-2002, 11:58 PM
I was referring to 'speed' when counting vertices (or indices), not triangles. This is what nVidia seems to be measuring in their spec. So by this 'definition' of speed, independent triangles are faster.
Of course, ultimately triangle strips are more efficient, because they really produce a visible (=useful) triangle for every vertex.

Maybe I can clarify a bit by summarizing the measurements, GF2 against GF3. I will give numbers as both triangles/second (in millions) and vertices/second. I always refer to vertices, but I say when vertices are shared and when they are not. Of course, in a strip vertices are shared by definition.

The GF2 is an Ultra with a 250MHz core clock.
The GF3 has a 300MHz core clock.

X:Y means X million vertices/second and Y million triangles/second.
First is the result for GF2, second is for GF3

MxN means a mesh of M squares by N squares, which has (M+1)*(N+1) points and 2*M*N triangles.

If you like, you can replace "vertices" by "indices".

VAR is always used.

1. indep tris, non-shared verts, 6x32 -> 35:12 134:44
2. indep tris, shared verts, 6x32 -> 90:30 180:60
3. tri strip, 6x32 -> 31:31 60:60
4. indep tris, shared verts, 8x32 -> 70:23 180:60
5. indep tris, shared verts, 9x32 -> 67:22 177:59
6. indep tris, shared verts, 12x32 -> 67:22 150:50


My conclusions:

a. The GF3's transform increased more over the GF2 than its setup did. This is why with indep. tris (1) it gets 134 over the GF2's 35 (3.8x), while with a tri strip (3) it 'only' doubles, from the GF2's 31 to 60.

b. The GF4 has a bigger post-T&L vertex cache. This is why, when going from the 'tight' 6x32 mesh (2) to 8x32 (4), the GF3 maintains performance, while the GF2 goes down from 90 to 70.

c. nVidia's stated "136 M vertices per second" measures performance unhindered by setup, and with the post-T&L vertex cache (with some typical statistics)


BTW, I am in no way trying to diminish the GF4, just to understand what is going on...

GPSnoopy
03-24-2002, 07:42 AM
Could you send me your test application (tfautre@pandora.be), or can we download it somewhere?

I wanna test it 'cause I can't seem to hit 31 MTris/s with my GF2U, no matter what I do with my programs. So I wanted to see if it's because of my config or because of my programs.

I suspect that I'm never able to use video memory and that I'm at best using AGP mem, probably because I use large objects (~60K tris). I get about 10 MTris/s at best using triangle strips. I conclude that my AGP bus is the limiting factor, 'cause the amount of data transferred is about the speed of AGP 2x.

Moshe Nissim
03-24-2002, 12:26 PM
Originally posted by GPSnoopy:
Could you send me your test application (tfautre@pandora.be), or can we download it somewhere?


I am using Linux.
If that is useful to anybody, please let me know and I will post it.

wimmer
03-24-2002, 01:40 PM
I was referring to 'speed' when counting vertices (or indices), not triangles. This is what nVidia seems to be measuring in their spec.

No, Nvidia doesn't state the number of "indices" transformed, but the raw number of vertices their transformation engine can actually compute. I.e., this doesn't take the vertex cache into account at all! You are not actually measuring the spec Nvidia gives. For that, you have to do a vertex cache simulation and find out how many vertices actually have to be transformed (vs. being taken from the cache).


c. nVidia's stated "136 M vertices per second" measures performance unhindered by setup, and with the post-T&L vertex cache

As above, no, it measures performance _without_ the vertex cache (obviously, because you achieve the 134 MVert/s on a mesh where the vertex cache is never active!).

BTW, whenever you say GF3, you actually mean GF4, right?

The Geforce 2 has a vertex cache of 16 entries, with 10 entries actually usable due to pipelining. The GF3 and GF4 have 24 entries, with 18 being usable.

Another thing which hits me is that your mesh is way too small. You are not even pushing 1000 triangles here, so you might be far from peak performance.

Just to give you an indication:

I can achieve >23 million triangles on a Geforce 2 GTS (slower than your Ultra), with one texture applied, and standard 3 float vertices and non-short (i.e., integer) indices, on a small window. The mesh used for that is a simple heightfield of 86400 triangles in 600 strips and is totally vertex-cache unfriendly (almost no cache reuse). I use indexed triangle strips with interleaved arrays and VAR in video memory for that, and I can achieve this figure on a Celeron 433!

If I leave out the texture, I get >24.5 million triangles, but at this stage I start to get CPU limited, so the GF2 might be able to do more. Likewise, I can't test independent triangles, because then the index traffic kills my CPU.

So try a larger mesh and see whether you don't get higher performance - otherwise your results look quite strange to me, because you are never achieving anything like the expected vertex rate with your mesh! Please also try tri strips for the other mesh sizes.

If you have a look at the learning_VAR demo from Nvidia, you will also see that the performance drops a lot if you reduce the number of triangles. In your test, in 3. the geometry engine is actually slower (about 18.5 million transformations/s if you do the vertex cache simulation) than in 6. (22 million transformations/s), so what you might be seeing here is the effect of using a larger mesh in 6. than in 3.

What CPU do you have for those tests?

Michael

wimmer
03-24-2002, 01:46 PM
GPSnoopy:

There is no way to keep a GF2 Ultra busy with AGP2x if you need to transfer geometry every frame.

The only chance you have is
- making really sure you have video memory with VAR (I explained above how you test that)
- putting as much of your geometry into video memory as possible and NOT TOUCHING it afterwards.

If you have more geometry in a frame than fits into VAR memory, then you are out of luck. The best you can try is probably an intelligent caching scheme.

If your geometry fits, then you should be able to achieve about 19 million vertices/s with one infinite light, one texture applied, a very small viewport and a large mesh, provided you use interleaved arrays. If you have long strips, this equates to 19 million triangles/s as well.

Michael

Moshe Nissim
03-24-2002, 02:00 PM
As above, no, it measures performance _without_ the vertex cache (obviously, because you achieve the 134 MVert/s on a mesh where the vertex cache is never active!).

Yes, you are right, sorry. The 180 (like in (2)) is where the vertex cache is evident.


BTW, whenever you say GF3, you actually mean GF4, right?

Wow! What a typo! At least I was consistent...
You are right, of course. Please read GF4 wherever I wrote GF3.



Another thing which hits me is that your mesh is way too small. You are not even pushing 1000 triangles here, so you might be way from peak performance.

Of course I am repeating the drawing between time measurements. But why should it matter?


I can achieve >23 million triangles on a Geforce 2 GTS (slower than your Ultra), with one texture applied, and standard 3 float vertices and non-short (i.e., integer) indices, on a small window.

I achieve this figure (on GF2 GTS) also, with simpler very-long triangle strips, compiled into a display list (instead of interleaved VAR). The numbers agree with the 250/200 core clock difference between GTS and Ultra.


... and I can achieve this figure on a Celeron 433!

What does the CPU have to do with it? With VAR or a display list, it's being 'read' from video memory, not over AGP, when it's drawn.


Likewise, I can't test independent triangles, because then the index traffic kills my CPU.

This is a limitation of VAR. If I read the spec correctly, you can't put the index list in a display list (it 'pulls' the vertex data at display-list compile time), so indices are always sent over AGP, and then the CPU does indeed matter. But my original benchmarks were using display lists, for exactly that reason. Matt's dictum "use VAR" moved me off.


So try a larger mesh and see whether you don't get higher performance

I did, but I will try again with good old display lists.


- otherwise your results look quite strange to me, because you are never achieving anything like the expected vertex rate with your mesh!

What is the expected vertex rate?
134 without tri setup, and 60 with, seems right to me. Do you expect something different?


In your test, in 3. the geometry engine is actually slower (about 18.5 million transformations/s if you do the vertex cache simulation) than in 6. (22 million transformations/s), so what you might be seeing here is the effect of using a larger mesh in 6. than in 3.

How did you do the vertex cache simulation?

Do you see any disadvantage of using a display list instead of VAR?
I mean, if the driver implementation is good (and I think it is), then a display list allows for no data to be sent over AGP at all (except the small glCallList token). I think in this case mesh size shouldn't matter either, unless I get down to a really small number of tris per glCallList call.


What CPU do you have for those tests?


730Mhz P3 for the GF2 Ultra
1700Mhz P4 for the GF4 (four ;-) )

Moshe Nissim
03-24-2002, 02:38 PM
some more results:
(DL = display list)

1. DL indep tris, non shared verts, 64x64 GF2U->35:12 GF4->77:26
2. VAR indep tris, non shared verts, 64x64 GF2U->32:11 GF4->47:16
3. DL tri strip, 64x64 GF2U->31:31 GF4->59:59
4. VAR tri strip, 64x64 GF2U->31:31 GF4->59:59

I guess 1 vs. 2 shows that AGP can be a limiting factor in VAR due to index transfer (BTW, I am using shorts)

The triangles have an area of 1 pixel.

wimmer
03-24-2002, 03:55 PM
I achieve this figure (on GF2 GTS) also, with simpler very-long triangle strips, compiled into a display list (instead of interleaved VAR). The numbers agree with the 250/200 core clock difference between GTS and Ultra.

But my GF2 GTS can do it with VAR. It's still strange that using the same rendering mode (VAR), my GF2 (200 core) renders at >24MT/s, and your GF2U (250 core) only at 22MT/s...

What does the CPU have to do with it? With VAR or a display list, it's being 'read' from video memory, not over AGP, when it's drawn.

Well, the CPU still spends a lot of time in the driver with indexed primitives. I don't know why, but it does. If you use glDrawArrays, this gets much, much less, but then you can't use the vertex cache. But you could try using glDrawArrays for the independent triangles with non-shared verts...

But my original benchmarks were using display lists, for exactly that reason. Matt's dictum "use VAR" moved me off.

There are two sides to this. First, Matt or Cass once said that they are not storing geometry in video memory, but in AGP for display lists. Second, display lists can usually surpass VAR because the driver can do other optimizations as well. But VAR can be faster because you can store data in video memory, and it's more flexible because you can change data during runtime, and with VAR you (almost) always know what you get. If you have a large amount of geometry, this may stay in system memory with display lists, and could be very inefficient.

What is the expected vertex rate?

I don't see why 4., 5. and 6. should be slower than 3. This actually shouldn't have anything to do with the vertex cache, since even with 3., the geometry engine could handle the number of triangles transformed without a vertex cache (at least for strips). So it's something else at work here. And I'm not sure about being setup limited - this doesn't sound logical to me.

How did you do the vertex cache simulation?

There's code for that (it's actually very simple) in the NvTriStrip lib at developer.nvidia.com. You just push the indices through a FIFO, and for every index you send, you count it as an actual transform only if it's not already in the cache.
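In C it boils down to something like this (a sketch along the lines of the NvTriStrip code, not a copy of it):

    /* Count actual transforms by simulating the post-T&L cache as a FIFO
       of the last `cachesize` indices (10 effective on a GF2, 18 on GF3/GF4). */
    int count_transforms(const unsigned short *idx, int n, int cachesize)
    {
        unsigned short fifo[64];             /* assumes cachesize <= 64 */
        int head = 0, used = 0, transforms = 0;

        for (int i = 0; i < n; i++) {
            int hit = 0;
            for (int j = 0; j < used; j++)
                if (fifo[j] == idx[i]) { hit = 1; break; }
            if (!hit) {                      /* cache miss: a real transform */
                transforms++;
                fifo[head] = idx[i];         /* overwrite the oldest entry */
                head = (head + 1) % cachesize;
                if (used < cachesize) used++;
            }
        }
        return transforms;
    }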

Do you see any disadvantage of using a display list instead of VAR?
I mean, if the driver implementation is good (and I think it is), then a display list allows for no data to be sent over AGP at all (except the small glCallList token).

As said above, no dynamic meshes, bad for very large meshes, no control. But if you have a medium number of very small meshes with state changes in between, you can't beat display lists.

730Mhz P3 for the GF2 Ultra

with AGP 2x, I guess... Well, that might explain the slowdown for the larger meshes with independent triangles on the GF2: lots of index traffic!

1. indep tris, non-shared verts, 6x32 -> 35:12 134:44
2. VAR indep tris, non shared verts, 64x64 GF2U->32:11 GF4->47:16

I don't understand - why is the GF4 suddenly so slow for independent tris? Are you sure you really tested non-shared vertices the first time round?

And no, on a PIV I don't think that index traffic can be a problem...

Michael

Moshe Nissim
03-25-2002, 12:00 AM
But my GF2 GTS can do it with VAR. It's still strange that using the same rendering mode (VAR), my GF2 (200 core) renders at >24MT/s, and your GF2U (250 core) only at 22MT/s...

At least it renders at 31 MT/s with a display list...

I wonder what happens if I plug the DrawElements in a display list. Will it 'remember' the "vertex repeats" and utilize the vertex cache when I call the list?


There are two sides to this. First, Matt or Cass once said that they are not storing geometry in video memory, but in AGP for display lists

I don't think this can be true anymore.
I don't think you can push 47M verts/sec over AGP, even if the verts are only 2 floats, and the AGP is x4

But VAR can be faster because you can store data in video memory,

Again, I think display lists are also stored in video memory.

and it's more flexible because you can change data during runtime,

This is true.

If you have a large amount of geometry, this may stay in system memory with display lists, and could be very inefficient.

That's one reason the GF4 has 128MB ;-) It's not only for textures...

And I'm not sure about being setup limited - this doesn't sound logical to me.

Why not?

with AGP 2x, I guess...

No, sorry... ;-)
It reports successfully setting AGP 4x.
It's a VIA Apollo chipset.

1. indep tris, non-shared verts, 6x32 -> 35:12 134:44
2. VAR indep tris, non shared verts, 64x64 GF2U->32:11 GF4->47:16

I don't understand - why is the GF4 suddenly so slow for independent tris? Are you sure you really tested non-shared vertices the first time round?

I don't understand either. The difference is the mesh size.
Yes, I'm sure it is independent tris in both cases. I even repeated the tests.
To add to the mystery, with display list, it is 73 regardless of grid size, while with VAR it jumps to 133 with a 6x32 mesh and decreases as N increases (Nx32 mesh)

wimmer
03-25-2002, 01:04 AM
I wonder what happens if I plug the DrawElements in a display list. Will it 'remember' the "vertex repeats" and utilize the vertex cache when I call the list?

I see no reason the vertex cache shouldn't be active in a display list. The only requirement is that the number of vertices is known and they reside in some memory where the GPU can pull the vertices itself - and that should be the case for a DL...

I don't think you can push 47M verts/sec over AGP, even if the verts are only 2 floats, and the AGP is x4

Why not? 4 bytes per float * 2 floats * 47 million = 376MB/s, easy for AGP 4x (1024MB/s). Even 77MVert/s is only 616MB/s, something the GPU can certainly achieve (I can push about 920MB/s into AGP if the GPU doesn't use it at the same time).

Why not?

If you were setup limited, why would performance decrease with a larger mesh? Setup overhead shouldn't change with mesh size. And the vertex cache shouldn't have anything to do with it, as the GF2U can transform 31 million vertices/s for triangle strips even without the vertex cache... See what I mean?

I don't understand either. The difference is the mesh size. To add to the mystery, with display list, it is 73 regardless of grid size, while with VAR it jumps to 133 with a 6x32 mesh and decreases as N increases (Nx32 mesh)

Two things come to mind here:
- a jump from 47 MVert/s for the larger mesh to 133 MVert/s with the smaller one sounds like the effect of the vertex cache kicking in (although that is not possible, since you are not sharing any vertices)
- on the other hand, going down from 133 MVert/s to 73 when going from VAR to a DL could well be explained by the DL being in AGP and VAR being in video memory...

Michael

wimmer
03-25-2002, 04:33 AM
Ok, I give up for the moment. Didn't see anything wrong with the code at first sight.

Another data point: On a GF3 Ti500, I can do 37 million triangles/s on a 3440 triangle regular square grid, after applying nvtristrip, using strips. 2691 vertices actually get transformed, so that's a consistent 29 million vertices/s. That's no textures, no materials.

Michael