
View Full Version : GF3 Z-occlusion performance? (Sorry, not OpenGL related)



Humus
04-28-2001, 03:53 PM
Sorry, this is not OpenGL related ... but I think this is the best forum to get good technical answers to this question.
http://www.aceshardware.com/Spades/read.php?article_id=25000228

According to this review it seems that GF3's Z-occlusion isn't as efficient as ATi's HyperZ. I thought they were pretty much the same, but in VillageMark the Radeon outperforms the GF3, which I find a little surprising. Anyone have a good explanation for this?
Matt? Cass?

Nutty
04-29-2001, 01:08 AM
VillageMark is distributed and created by Imagination Technologies (PowerVR) and underlines the strong points of their own product, the Kyro II. So, you could say this benchmark is biased.


Says it all really, considering the GeForce 3 comes out on top in all the other comparisons in that review.

Nutty

Humus
04-29-2001, 03:06 AM
Sure, but why does the Radeon beat the GF3? I still find that confusing. Besides, the benchmark is written to take pretty good advantage of the hardware. It even uses T&L even though the Kyro cards don't support it.

paddy
04-29-2001, 08:58 AM
Maybe it simply shows that even the GF3 is not perfect ...
Reading various comparisons, I discovered that the GF3 is not so much faster than the current high end GF2 cards.
But then the GF3 is the only one to offer so many extensions and DX8 features (when they finally manage to make it stable *lol*)


mcraighead
04-29-2001, 04:17 PM
I was rather unimpressed by that article; I saw a lot of obvious errors in it.

I don't think it's even worth looking at that particular benchmark.

- Matt

Humus
04-30-2001, 04:55 AM
Originally posted by mcraighead:
I was rather unimpressed by that article; I saw a lot of obvious errors in it.

I don't think it's even worth looking at that particular benchmark.

- Matt

An example to back up that statement?

Humus
04-30-2001, 05:00 AM
I can add that according to Dave over at beyond3d.com, nVidia culls at the per-pixel level while ATi culls at the per-block level. That would explain why Radeons are much more efficient in this benchmark. Indeed, the block-level approach should be much faster in the majority of apps out there, unless of course the polygons get so small on screen that they usually don't cover a whole block.

mcraighead
04-30-2001, 06:39 AM
Originally posted by Humus:
An example to back up that statement?

All right, I can't talk about many of the errors in the article, but here's one that stuck out as blatantly obvious.

The methodology used for claiming a "boost from using optimized code paths" in Serious Sam is completely broken. The percentage gain numbers given mean absolutely nothing. In fact, it's downright misleading the way they describe the first benchmark as "optimized".

And it sounds like Dave is confused...

- Matt

Humus
04-30-2001, 09:57 AM
Sure, I can agree that those numbers don't mean much, but it's still interesting to see what effect it has when a game isn't optimized for a card, or is optimized only for a certain card or range of cards.
(I also understand that you didn't like that test when it has comments like "I was wondering, what would have happened if Croteam (the designers of Serious Sam) would have resorted to non-optimized settings (read: NVIDIA is the lowest common denominator)." :) )

Anyway, so Dave is wrong/less correct?
So what is the fundamental difference between the ATi way and the nVidia way?
There's got to be a reason why a GF3 with more than twice the fillrate gets beaten by the almost one year old Radeon card when there's heavy overdraw.

Funk_dat
04-30-2001, 12:29 PM
I'm pretty sure the GeForce 3 and the Radeon use the same z-buffer compression, which can theoretically reach 4:1 savings.

The GF3 and Radeon also both have similar 'fast z-clear' functions that zap all of the data in the z-buffer. This supposedly also speeds up performance.

Nvidia actually has stated that these features do not provide any major performance gains on the GF3. Note that this is NOT because the functions were poorly implemented, but rather because of the increased memory bandwidth of the GF3. They are simply not needed as badly. In contrast, the Radeon can benefit more from these features because of its comparatively limited memory bandwidth.

I've always found great reviews on Ace's Hardware. Dissing them cuz the test they used is wack isn't too cool. Of course, we all know Nvidia's feelings on anything dealing with the KyroII. ;)

If you are worried about biased reviews, check out http://www.anandtech.com. It's the best one out there in my opinion. Anand rules. There's a bunch of info on his site about the GF3 and Radeon.
http://www.anandtech.com/showdoc.html?i=1426

Funk.


mcraighead
04-30-2001, 11:19 PM
Well, I've found that the vast majority of video card web sites are very poor. Ace's is good for CPUs, but iffy for 3D. B3D has a good reputation, but I'm unimpressed.

The Ace's Hardware article shows a pretty severe misunderstanding of what the various Serious Sam settings mean. As I already said, the word "optimized" is _extremely_ misleading! "Customized" would be much more accurate, and even then, some of the customizations they use for us are actually bad customizations.

The percentage numbers are _absolutely meaningless_. They neglect such obvious issues as CPU limitations! (This is a common problem with "FSAA performance hit" comparisons. Instead of benchmarking, say, 12x9 vs. 6x4x4, they benchmark 6x4 against 6x4x4. As a result it is virtually impossible to interpret the results in a meaningful way.)

The fact is that there is a huge knowledge gap between the people who run these web sites and those of us who work on these products for our jobs. It took me only a few months inside NVIDIA to realize how frequently these web sites say things that are just outright _wrong_. It's extremely bad journalism.

- Matt

Humus
05-01-2001, 05:14 AM
Yes, I understand and agree fully on that.
But back on topic, I'm very interested in knowing how the HierarchicalZ/Z-occlusion stuff differs technically.
My understanding is that the ATi approach divides the screen into many small 8x8 tiles. For each tile a min & max depth is stored. FastZClear only needs to clear those min/max values. When rendering it renders to a small on-chip buffer, but first it calcs the min & max depth of the incoming poly in that tile. If the new poly is entirely in front of the stored range it renders with the depth test disabled. If it's entirely behind the stored range it goes on to the next tile without needing to render anything. Otherwise it has to render normally. Then add to this the Z compression, which I guess kicks in when an on-chip tile is ready and about to be written to memory.
I can't back this up 100%, but that's how I think it works. (Maybe someone from ATi can correct me if I'm wrong.)
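
To make the idea concrete, here's a tiny C++ sketch of that per-tile test as I picture it. It's purely illustrative: the 8x8 tile size, the stored min/max pair and the three outcomes are my assumptions, not anything ATi has published.

// One entry per 8x8 screen tile (assumed size). FastZClear would only have to
// reset these two values instead of touching the full Z buffer.
struct TileZ { float zMin = 1.0f, zMax = 1.0f; };   // 1.0 = cleared to the far plane

enum TileOutcome { SKIP_TILE, DRAW_WITHOUT_ZTEST, DRAW_NORMAL_ZTEST };

// polyMin/polyMax: the incoming polygon's depth range inside this tile,
// computed from its plane equation.
TileOutcome classifyTile(const TileZ &t, float polyMin, float polyMax)
{
    if (polyMin >= t.zMax) return SKIP_TILE;            // entirely behind: cull the whole tile
    if (polyMax <  t.zMin) return DRAW_WITHOUT_ZTEST;   // entirely in front: no Z reads needed
    return DRAW_NORMAL_ZTEST;                           // overlapping: fall back to per-pixel Z
}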

Now how does the nVidia way differ?
I initially had the impression that it was essentially the same technique as ATi's, but seeing that it doesn't give the same performance boost as ATi's, there's got to be something that differs. Dave's explanation made some sense to me, but if you're saying that he's wrong, then how does it work?

mcraighead
05-01-2001, 08:33 AM
For reasons that should be obvious, I am not about to tell everyone what we do.

However, if you think that all of our fast Z work is for nothing, that's not true. It provides significant real speedups.

- Matt

Humus
05-01-2001, 09:50 AM
<sarcasm>
Hey, you can tell me. It's not like ATi is going to copy your way of doing it when their own is twice as fast. :P
</sarcasm>

Yes, I understand that it's not for nothing. GF3 is 76% faster than GF2 in this benchmark with similar fillrate, so it can't be useless.
But can you at least tell me whether it's true that it culls at the pixel level or whether it's done at the block level?

Nutty
05-01-2001, 10:05 AM
I'm pretty sure it's done on the pixel level.

Nutty

Funk_dat
05-01-2001, 10:32 AM
Traditional depth testing compares the pixel *after* it has already been shaded and textured. The GF3 and Radeon have features that compare the pixel to the z-buffer *before* the rest of the rendering pipeline. This is where the savings come from: all of the fragment operations (texture, fog, stencil, etc.) are saved on the discarded pixel.
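
In software terms the ordering difference looks something like this. It's a toy C++ model of the idea only, nothing like how any real chip is wired, and Fragment/shadeAndTexture/writeColor are made-up stand-ins:

struct Fragment { int x, y; float z; };
struct Color { unsigned char r, g, b, a; };

Color shadeAndTexture(const Fragment &) { return Color{255, 255, 255, 255}; } // the expensive part
void  writeColor(int, int, Color) { /* framebuffer write */ }

// Classic order: shade first, depth test last -- hidden fragments still pay
// the full texturing/fog/stencil cost.
void classicPipeline(const Fragment &f, float *zbuf, int width)
{
    Color c = shadeAndTexture(f);
    if (f.z < zbuf[f.y * width + f.x]) {
        zbuf[f.y * width + f.x] = f.z;
        writeColor(f.x, f.y, c);
    }
}

// Early Z: reject against the Z buffer first, and only shade what survives.
void earlyZPipeline(const Fragment &f, float *zbuf, int width)
{
    if (f.z >= zbuf[f.y * width + f.x])
        return;                                  // discarded before any shading work
    zbuf[f.y * width + f.x] = f.z;
    writeColor(f.x, f.y, shadeAndTexture(f));
}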

The technique Nvidia uses to send the fragments to the new z-hardware I don't know. And since Matt won't talk about it, I assume it's proprietary. Whether it's per pixel or per 'tile', I'm sure it's optimized quite nicely for the hardware.

The GF3 and Radeon also have z-buffer compression and fast-z clear, which also add speed and are self-explanatory.

One last thing: the Radeon will benefit more from these features because of its limited memory bandwidth. This doesn't mean that it'll perform better than the GF3 in the end, just that it'll get a bigger boost in reasonable applications.

Hope I cleared some stuff up for you.

Funk.

Funk_dat
05-01-2001, 10:38 AM
also...

I didn't mean to diss Nvidia hardware or anything when I said that the z-buffer stuff wouldn't give any major performance gains. I'm sure the statement I read was in regard to reasonable applications and systems, where oftentimes limited CPUs and game-style geometry are involved.

The more bandwidth the better!

Funk.

Humus
05-02-2001, 03:32 AM
Originally posted by Funk_dat:
Traditional depth testing compares the pixel *after* it has already been shaded and textured. The GF3 and Radeon have features that compare the pixel to the z-buffer *before* the rest of the rendering pipeline. This is where the savings come from: all of the fragment operations (texture, fog, stencil, etc.) are saved on the discarded pixel.

The technique Nvidia uses to send the fragments to the new z-hardware I don't know. And since Matt won't talk about it, I assume it's proprietary. Whether it's per pixel or per 'tile', I'm sure it's optimized quite nicely for the hardware.

The GF3 and Radeon also have z-buffer compression and fast-z clear, which also add speed and are self-explanatory.

One last thing: the Radeon will benefit more from these features because of its limited memory bandwidth. This doesn't mean that it'll perform better than the GF3 in the end, just that it'll get a bigger boost in reasonable applications.

Hope I cleared some stuff up for you.

Funk.

Thanks, but you sort of summed up what I already knew. What I'm interested in is the difference between the two implementations.
Btw, I don't know if the reasoning that the Radeon should benefit more because it has less bandwidth actually holds true. You must remember that the Radeon only has two pipelines to feed while the GF3 has four. Bandwidth per rendered pixel is higher on the Radeon than on the GF3.
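
Rough numbers to illustrate what I mean. The clock and bus figures below are from memory and only approximate, so treat them as an illustration rather than exact specs:

#include <cstdio>

int main()
{
    // Radeon 64 DDR: ~183 MHz core, 2 pixel pipes, 128-bit DDR memory at ~183 MHz.
    double radeonPixelRate = 183e6 * 2;          // ~0.37 Gpixels/s
    double radeonBandwidth = 183e6 * 2 * 16;     // ~5.9 GB/s (DDR on a 16-byte wide bus)
    // GF3: 200 MHz core, 4 pixel pipes, 128-bit DDR memory at 230 MHz.
    double gf3PixelRate    = 200e6 * 4;          // ~0.80 Gpixels/s
    double gf3Bandwidth    = 230e6 * 2 * 16;     // ~7.4 GB/s
    printf("Radeon: %.1f bytes of bandwidth per rendered pixel\n",
           radeonBandwidth / radeonPixelRate);   // ~16
    printf("GF3:    %.1f bytes of bandwidth per rendered pixel\n",
           gf3Bandwidth / gf3PixelRate);         // ~9
    return 0;
}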

Nutty
05-02-2001, 04:14 AM
What I'm interested in is the difference between the two implementations.


Humus, you ain't gonna get that unless someone breaks their NDA with nvidia. Divulging the inner workings of _how_ _their_ implementation works is a guaranteed way of getting a slap!

Nutty

paddy
05-02-2001, 05:09 AM
Mhhh...
ATI did not give the exact specifications (of course), but at least we have an idea of how it's working.
I don't know how the technology differs, but apparently, as the GF3 has a much higher memory bandwidth, its low scores show the NVidia system is not as good as the ATI one :)

Funk_dat
05-02-2001, 08:08 AM
Originally posted by Humus:
Btw, I don't know if the reasoning that Radeon should benefit more because it has less bandwidth actually holds true. You must remember that the Radeon only has two pipelines to feed while GF3 has four. Bandwidth / rendered pixel is higher on the Radeon than on GF3.

This is true if you are comparing the Radeon to the GF2, but Nvidia wised up for the GF3 and equipped it with 4 memory controllers. This means there is potentially less bandwidth wasted per clock, since you have the memory in more manageable 32-bit chunks. Less wasted bandwidth means that you can get closer to the theoretical limits of the adapter.
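
A quick way to see the idea with made-up numbers (the burst length of 2 is just an assumed example, not the real figure for any particular board):

#include <cstdio>

int main()
{
    const int burstTransfers = 2;                       // assumed minimum DDR burst
    const int monolithic = (128 / 8) * burstTransfers;  // one 128-bit controller: 32-byte granule
    const int perChannel = (32 / 8) * burstTransfers;   // one of four 32-bit controllers: 8-byte granule
    const int needed = 8;                               // e.g. a couple of Z values actually wanted
    printf("128-bit controller: fetch %d bytes, %d wasted\n", monolithic, monolithic - needed);
    printf("32-bit channel:     fetch %d bytes, %d wasted\n", perChannel, perChannel - needed);
    return 0;
}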

Funk.

Humus
05-02-2001, 08:35 AM
Originally posted by Nutty:
Humus, you ain't gonna get that unless someone breaks their NDA with nvidia. Divulging the inner workings of _how_ _their_ implementation works is a guaranteed way of getting a slap!

Nutty

Maybe I should have clarified. I'm not interested in exactly how it's implemented hardware-wise; I don't care how the wires go or how they have packed their transistors. I just want to know the basic concept of GF3's Z-occlusion, whether it's using tiles or doing it per pixel. I mean, it's nothing that can ever be useful for their competitors to know. It's like asking what size its vertex cache is. It's not gonna change the competitors' design decisions, but it may change game developers' design decisions.

Humus
05-02-2001, 08:52 AM
Originally posted by Funk_dat:
This is true if you are comparing the Radeon to the GF2, but Nvidia wised up for the GF3 and equipped it with 4 memory controllers. This means there is potentially less bandwidth wasted per clock, since you have the memory in more manageable 32-bit chunks. Less wasted bandwidth means that you can get closer to the theoretical limits of the adapter.

Funk.

Ok, I can understand how you're thinking. But it only explains why the Radeon has a higher percentage gain, not why it has a higher gain in absolute numbers. And it really doesn't explain why the GF3, with its more than twice as high fillrate, much faster memory and more sophisticated (I guess) memory subsystem, gets beaten by the close to one year old Radeon even though they supposedly have similar Z-buffering optimizations. Thus there's got to be some fundamental difference in the way the two handle it, and that the GF3 would be culling on a per-pixel basis made perfect sense to me and explained it all, as well as why the GF3 doesn't benefit so much from its FastZClear as stated by some nVidia employee somewhere ... until Matt said that "Dave [who told it to me initially] is confused".

Funk_dat
05-02-2001, 09:53 AM
You're right... I'm not sure why the tests came out the way they did. I wouldn't make any final judgements, though, until the hardware has actually been released, the drivers have been finalized, and an appropriate test has been run. The tests Ace's Hardware ran were kinda old (they even had a disclaimer saying the tests might not be accurate). The Q3 Quaver benchmarking demo might be a good one to use.

Funk.

Nutty
05-02-2001, 09:55 AM
I just want to know the basic concept of GF3's Z-occlusion, whether it's using tiles or doing it per pixel


I thought we already established that it's doing it per pixel, given that the GeForce series cards are not tile-renderer architecture boards.

AFAIK it _is_ per pixel. I could tell you what I was told about its functionality, but technically I'm still bound by NDA, and therefore shouldn't. TBH it's not really that relevant to developers. I wouldn't worry about it.

Nutty

mcraighead
05-02-2001, 11:49 AM
I don't even know what "per pixel" _means_ in this context.

Nutty, it sounds like you got some bad info too.

I absolutely hate the fact that the word "tiling" has been completely misinterpreted by all the web sites to mean something completely different than what it really means.

"Tiling" is just a _memory layout_! It has absolutely nothing to do with rendering architecture -- nothing at all.

The proper name for such rendering architectures is "chunkers", i.e., they batch up the scene into _chunks_, and they render one chunk at a time rather than one triangle at a time. (This also clarifies the fact that if a scene is too large, more than one chunk may be required.)


Now, a "tile" is just a name for a rectangular collection of pixels.

Am I going to say which operations we do exactly where? Absolutely not.

If someone outside of NVIDIA claims to know those kinds of details about how our rendering architecture works, they're probably lying. A lot of people _in_ our company don't know.

Virtually everything I've seen in this thread is either (1) misinformation or (2) speculation.

As developers, all you need to know is one simple rule: draw front to back, always.

- Matt

JackM
05-02-2001, 02:24 PM
The tests Ace's Hardware ran were kinda old (they even had a disclaimer saying the tests might not be accurate). The Q3 Quaver benchmarking demo might be a good one to use.

I think this site has already done it...
http://www.digit-life.com/articles/gf3asusagpv8200deluxe/q3-32-quaver.gif
http://www.digit-life.com/articles/gf3asusagpv8200deluxe/index.html

JackM


LordKronos
05-02-2001, 02:45 PM
Originally posted by Humus:
And it really doesn't explain why the GF3, with its more than twice as high fillrate, much faster memory and more sophisticated (I guess) memory subsystem, gets beaten by the close to one year old Radeon even though they supposedly have similar Z-buffering optimizations...

Just taking a QUICK look over the site, it seems to me the GeForce 3 severely dominates the Kyro 2 and does a pretty number on even the Radeon in EVERY test except VillageMark. Interestingly, VillageMark was (as stated) created by Imagination Technologies. Seems to me they have everything to gain by making the GeForce cards look AS BAD AS POSSIBLE. I might suspect that not only did they tune the app to make the best use of their card, but they also might have gone out of their way at every opportunity to use every feature/renderstate/technique that would slow the GeForce down more than the Kyro. If doing something in an untraditional way gave a 5 percent hit to the GeForce but only a 2 percent hit to the Kyro, then do it that way, regardless of whether it is inconsistent with the way 99% of apps do it. If, on the other hand, something else gives a bigger hit to the Kyro, well, they might just conveniently not do that.

As for the Radeon, it might just be coincidence that the Radeon happens to do better at these non-traditional things than the GeForce. You need to remember that 95% or more of games/apps render with pretty much the same styles/methodologies. Also remember that as you tune something to perform well at one task, it generally begins to perform worse at other tasks, sometimes even performing worse at obscure tasks than a completely generalized/unoptimized model. The GeForce 3 could theoretically be more tuned to the way REAL apps do things than the Radeon is, and when you get to something obscure (which VillageMark may be doing) the Radeon may just happen to perform better because it is a more generalized/unoptimized solution.

Again, this is not based on any knowledge of either card or of VillageMark, just on my speculation as to why that one benchmark stands out like an eyesore among the other benchmarks.

JackM
05-02-2001, 04:19 PM
You got a point there, LordKronos ... I just noticed that VillageMark is created by Img Tech.

But... I think chunking architectures have potential. Future games (e.g. Doom 3) will have a huge overdraw factor, and the KyroII is very efficient at handling it.

Jack


paddy
05-02-2001, 06:28 PM
I love this thread ... old wars coming back :)
I think you've gone too far. Humus just saw a weird behavior of the latest high tech graphics card versus some older ones, and wanted a technical explanation for it. That's all.

Humus
05-03-2001, 01:01 AM
Whooha .. lots of replies!
Ok, before I start answering everyone I must say that Paddy is right; it's not like I'm trying to prove there's something wrong with the GF3.

Humus
05-03-2001, 01:07 AM
Originally posted by Nutty:
I thought we already established that it's doing it per pixel, given that the GeForce series cards are not tile-renderer architecture boards.

AFAIK it _is_ per pixel. I could tell you what I was told about its functionality, but technically I'm still bound by NDA, and therefore shouldn't. TBH it's not really that relevant to developers. I wouldn't worry about it.

Nutty

"Established", well, not entirely but sounds likely ATM. It also makes sense since it would require a lot of work to redo the whole architechture that a switch to tiled method requires. And since the S3TC bug still seams to be present in GF3 according to
many sites, I guess they've reused much from GF2.

Humus
05-03-2001, 01:26 AM
Originally posted by mcraighead:
I don't even know what "per pixel" _means_ in this context.

Nutty, it sounds like you got some bad info too.

I absolutely hate the fact that the word "tiling" has been completely misinterpreted by all the web sites to mean something completely different than what it really means.

"Tiling" is just a _memory layout_! It has absolutely nothing to do with rendering architecture -- nothing at all.

The proper name for such rendering architectures is "chunkers", i.e., they batch up the scene into _chunks_, and they render one chunk at a time rather than one triangle at a time. (This also clarifies the fact that if a scene is too large, more than one chunk may be required.)


Now, a "tile" is just a name for a rectangular collection of pixels.

Am I going to say which operations we do exactly where? Absolutely not.

If someone outside of NVIDIA claims to know those kinds of details about how our rendering architecture works, they're probably lying. A lot of people _in_ our company don't know.

Virtually everything I've seen in this thread is either (1) misinformation or (2) speculation.

As developers, all you need to know is one simple rule: draw front to back, always.

- Matt

Basically, with "per pixel" I mean that it takes a pixel, loads its Z buffer value and checks it. If it's in front it renders it, otherwise it goes on to the next.
With "tiled" I mean that it uses some small block, in the Radeon's case 8x8 if I'm correctly informed. So it takes a tile, checks the plane of the polygon that it's going to render to that tile, calcs its min & max depth within the tile, checks against the cached min & max of the tile and decides whether to cull, render normally or render with write-only depth.

As a developer, though, while the front-to-back rule is the most important, the implementation details may also be important when making design decisions, since the two obviously perform very differently under different conditions. The "tiled" version will perform better under normal conditions where most polys cover more than a whole block, while the "per pixel" version would perform better with extremely high polygon counts where every polygon is smaller than the tile.
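
For reference, the front-to-back rule itself is cheap to follow. Something along these lines per frame for the opaque geometry is usually enough (Object and drawObject are just placeholders for whatever an engine uses):

#include <algorithm>
#include <vector>

struct Object { float viewDepth; /* mesh, textures, state ... */ };

void drawObject(const Object &) { /* submit the object's triangles */ }

void drawOpaqueFrontToBack(std::vector<Object> &objects)
{
    // Nearest objects first, so fragments of anything occluded fail the Z test
    // before any expensive texturing is done.
    std::sort(objects.begin(), objects.end(),
              [](const Object &a, const Object &b) { return a.viewDepth < b.viewDepth; });
    for (const Object &o : objects)
        drawObject(o);
}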

Humus
05-03-2001, 01:33 AM
Originally posted by LordKronos:
Just taking a QUICK look over the site, it seems to me the GeForce 3 severely dominates the Kyro 2 and does a pretty number on even the Radeon in EVERY test except VillageMark. Interestingly, VillageMark was (as stated) created by Imagination Technologies. Seems to me they have everything to gain by making the GeForce cards look AS BAD AS POSSIBLE. I might suspect that not only did they tune the app to make the best use of their card, but they also might have gone out of their way at every opportunity to use every feature/renderstate/technique that would slow the GeForce down more than the Kyro. If doing something in an untraditional way gave a 5 percent hit to the GeForce but only a 2 percent hit to the Kyro, then do it that way, regardless of whether it is inconsistent with the way 99% of apps do it. If, on the other hand, something else gives a bigger hit to the Kyro, well, they might just conveniently not do that.

As for the Radeon, it might just be coincidence that the Radeon happens to do better at these non-traditional things than the GeForce. You need to remember that 95% or more of games/apps render with pretty much the same styles/methodologies. Also remember that as you tune something to perform well at one task, it generally begins to perform worse at other tasks, sometimes even performing worse at obscure tasks than a completely generalized/unoptimized model. The GeForce 3 could theoretically be more tuned to the way REAL apps do things than the Radeon is, and when you get to something obscure (which VillageMark may be doing) the Radeon may just happen to perform better because it is a more generalized/unoptimized solution.

Again, this is not based on any knowledge of either card or of VillageMark, just on my speculation as to why that one benchmark stands out like an eyesore among the other benchmarks.



Well, I can say that VillageMark is NOT written in any way to run worse on platforms other than the Kyro. It's written to show how wonderful the deferred rendering of the Kyro cards is, but it's not written in a way to intentionally make it run slower than it could on other cards. In fact, it uses T&L even though none of the Kyro cards supports it. If they wanted to make it look as bad as possible on other cards they should have drawn back to front, which they have confirmed it isn't doing (which is also obvious by comparing {Radeon | GF3} vs. GF2 scores); it draws in more of a random order.
It would of course be better with a third party benchmark, but there are no benchmarks except this one that can show how well a card handles overdraw.

LordKronos
05-03-2001, 02:55 AM
Originally posted by Humus:
Well, I can say that VillageMark is NOT written in any way to run worse on platforms other than the Kyro... In fact, it uses T&L even though none of the Kyro cards supports it....

Yes, and hasn't it been shown several times that in an artificial benchmark, where the app is doing nothing but throwing polys (which I assume VillageMark is... I haven't seen it), a fast CPU can score better than when the GPU does the T&L? Perhaps T&L was included for this reason. The point is, you can't just take something and turn it into a blanket statement saying "well, they even used a feature that they don't have, so they obviously weren't trying to make anyone else look bad".


If they wanted to make it look as bad as possible on other cards they should have drawn back to front, which they have confirmed it isn't doing

Also remember that a card like the Radeon can take advantage of strict back-to-front rendering. If you go strictly back-to-front, there is a high probability that each polygon rendered will be closer than the zMin for the corresponding z-tile, and therefore the card can use a write-always mode (instead of read-compare-write) when updating the z-buffer. So they confirmed it wasn't back to front. Perhaps at the beginning of each frame they just throw in enough close-up, tiny polys to wreak havoc on any card's z-tiling, then go back-to-front. If they did this, they would still be telling the truth.

So why would the Radeon perform better than the GeForce 3? Even IF the GeForce 3 had z-tiling (which I don't know if it does), perhaps it uses a tile size that is better optimized for what most apps do. In a realistic app, it might be better (I'm just speculating) to have a 16x16 tile size rather than the Radeon's 8x8. Then for larger occluded polys it could discard more fragments with fewer tests than you could on an 8x8 tile system. If this were the case (and again... I'm just making it up), a 16x16 pixel z-tile system would be more susceptible to a malicious benchmark (one that throws in a few tiny close-up polys, then renders back to front) than an 8x8 tile system would be, because a few "malicious" polys would "corrupt" a higher percentage of the tiles.
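
To put some made-up numbers on that last point (pure speculation, only to show why a larger tile size is more sensitive to a few "malicious" polys):

#include <cstdio>

int main()
{
    const int tilesAt8  = (1024 / 8)  * (768 / 8);    // 12288 tiles at 8x8
    const int tilesAt16 = (1024 / 16) * (768 / 16);   //  3072 tiles at 16x16
    const int maliciousPolys = 100;                   // tiny close-up polys, assume one dirtied tile each
    printf("8x8:   %.2f%% of tiles corrupted\n", 100.0 * maliciousPolys / tilesAt8);   // ~0.8%
    printf("16x16: %.2f%% of tiles corrupted\n", 100.0 * maliciousPolys / tilesAt16);  // ~3.3%
    return 0;
}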

When any company writes a benchmark to show off their own product, they are counting on you to make these types of broad assumptions: "oh, their test must be valid, because otherwise they would have done ... instead".

LordKronos
05-03-2001, 03:02 AM
And again, I just want to clarify: I'm not making any kind of statement about the benchmark... I really don't know much about it. I'm just trying to play devil's advocate here and point out what the benchmark MIGHT possibly be doing to give the Kyro card the advantage.

Humus
05-03-2001, 08:16 AM
Well, the purpose of the benchmark is to show the advantage of deferred rendering on Kyro cards. The reason they included T&L is probably to show that it doesn't have as much impact or something, but then on the other hand VillageMark doesn't exactly contain a whole lot of polys. But there is really no reason to think that they have intentionally written the application in such a way that it should perform badly on other cards, except for the fact that its overdraw is huge ... but that's also the point of the benchmark. They try to show that as overdraw increases the advantage of deferred rendering is huge, and since overdraw will increase over time, Kyro must be the card of the future ... sort of. So it should be taken with a grain of salt, just as TreeMark and similar benchmarks should, but I don't see any reason to believe that they've intentionally made efforts to make it slow on other cards. In fact, the claim that it's drawn in random order can easily be verified on Radeon cards by comparing the scores with HierarchicalZ on and off. The performance difference between the two is quite large.

mcraighead
05-03-2001, 12:00 PM
I just wanted to comment on TreeMark, since it was mentioned...

TreeMark is basically a DrawElements performance test. I don't know if we ever released its code, but things don't get much more straightforward -- set up the vertex arrays (vertex, normal, texcoord), set the T&L state, and render some geometry using DrawElements.
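
In rough outline it's nothing more than this (not the real TreeMark source, just the shape of the pattern; the vertex/normal/texcoord/index arrays, their sizes and the window handle are assumed to be set up elsewhere):

glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_NORMAL_ARRAY);
glEnableClientState(GL_TEXTURE_COORD_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, vertices);
glNormalPointer(GL_FLOAT, 0, normals);
glTexCoordPointer(2, GL_FLOAT, 0, texcoords);

glEnable(GL_LIGHTING);          // whatever T&L state is being measured
glEnable(GL_LIGHT0);

for (int frame = 0; frame < numFrames; frame++){
    glDrawElements(GL_TRIANGLES, numIndices, GL_UNSIGNED_SHORT, indices);
    SwapBuffers(hdc);
}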

- Matt

Humus
05-05-2001, 04:27 AM
Bringing this topic up again with some interesting results.
I've written a small benchmarking utility (which can be found here: http://hem.passagen.se/emiper/3d.html), and it includes an overdraw test with fixed overdraw factors (3 and 8). I have three drawing modes: strictly back to front, strictly front to back and random order. I posted this on the forums over at beyond3d.com and the results people got were quite interesting. It seems that GF3's Z-occlusion indeed isn't any less efficient than the Radeon's HyperZ. The GF3 performed around 3-3.5x as well front to back as back to front with overdraw factor 8. The Radeon gained 2.5-3x.
Why Radeons perform so well in VillageMark I'm still not sure, but I recall that it uses 3 texture layers, which may be an important factor.
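
The overdraw test itself is conceptually nothing more than N screen-filling layers drawn at different depths in the chosen order (the random mode just shuffles the layer order first). A stripped-down sketch of the idea, not the exact code from the utility:

// Assumes an ortho projection over the unit square, like the fillrate test,
// with smaller z ending up nearer to the viewer.
void drawOverdrawLayers(int overdraw, bool frontToBack)
{
    glEnable(GL_DEPTH_TEST);
    glDepthFunc(GL_LEQUAL);

    for (int i = 0; i < overdraw; i++){
        int layer = frontToBack ? i : (overdraw - 1 - i);   // nearest first, or farthest first
        float z = 0.1f + 0.8f * layer / overdraw;

        glBegin(GL_QUADS);
        glTexCoord2f(0, 0); glVertex3f(0, 0, z);
        glTexCoord2f(1, 0); glVertex3f(1, 0, z);
        glTexCoord2f(1, 1); glVertex3f(1, 1, z);
        glTexCoord2f(0, 1); glVertex3f(0, 1, z);
        glEnd();
    }
}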

Anyway, another interesting result, as shown by people's posts over at beyond3d.com, that you Matt may be interested in: while every other card got pixel & texel fillrate results very close to their theoretical values, the GF3 didn't. It got around 600 Mpixels/s and 1200 Mtexels/s, while GF2 cards get close to 800 Mpixels/s and 1600 Mtexels/s. I guess it may be a driver issue, perhaps not doing a page flip but rather a buffer copy?

paddy
05-05-2001, 05:02 AM
Talking about benchies ...
I made a little fillrate benchmark by overdrawing semi-transparent polygons.
No HyperZ here; it's even almost CPU independent (I score the same on my new Athlon 1GHz as on my old PII-300). The whole compiled display list is executed for 10 seconds.
http://paddy.io-labs.com/rtfog.zip

ET3D
05-05-2001, 05:48 AM
Interesting results, I agree. I initially thought that maybe it's because the Radeon simply has better back to front rendering, but checking the Beyond3D thread, it doesn't seem so.

3 textures would be a very good reason - can you verify this? But this could still be explained by HSR architecture. For example, if your benchmark renders large polys and Z-occlusion uses large blocks, then it would be more efficient, while VillageMark may be using smaller polys that don't cover an entire block, and HyperZ may have an advantage there. It's all just speculation, but you might try adding this to the test.

It was interesting to note that the Voodoo5 gains nothing from front to back, unlike virtually all other chips (even older ones).

Oh, and BTW, how refreshing to see a benchmark that doesn't require hours of downloading. (Small stab at NVIDIA here - why is there a need for many MB of WAV and large TGA textures for tech demos?)

LordKronos
05-05-2001, 09:57 AM
Originally posted by ET3D:
(Small stab at NVIDIA here - why is there a need for many MB of WAV and large TGA textures for tech demos?)

I won't argue about the wave files, but I think the large textures are 100% necessary for showing off a card's quality.

mcraighead
05-05-2001, 10:03 AM
Originally posted by Humus:
Anyway, another interesting result, as shown by people's posts over at beyond3d.com, that you Matt may be interested in: while every other card got pixel & texel fillrate results very close to their theoretical values, the GF3 didn't. It got around 600 Mpixels/s and 1200 Mtexels/s, while GF2 cards get close to 800 Mpixels/s and 1600 Mtexels/s. I guess it may be a driver issue, perhaps not doing a page flip but rather a buffer copy?

Can you detail exactly what the program is doing in each case, and what resolution/color depth/Z depth (remember that color and Z depth are independent on GF3) are being used?

I know that GF3 does quite well on the 3DMark 2001 fillrate tests (definitely beating GF2 by a lot). So you just must be doing something different.

- Matt

Humus
05-05-2001, 11:14 AM
Originally posted by mcraighead:
Can you detail exactly what the program is doing in each case, and what resolution/color depth/Z depth (remember that color and Z depth are independent on GF3) are being used?

I know that GF3 does quite well on the 3DMark 2001 fillrate tests (definitely beating GF2 by a lot). So you just must be doing something different.

- Matt

Assuming they didn't change from the default resolution, it should be running in 1024x768x32 with a 24-bit Z.

This is the actual code for pixel fillrate:



glDisable(GL_DEPTH_TEST);
glDepthMask(GL_FALSE);

fillrateTexture = new FileTexture("Fire.png");
fillrateTexture->loadToAPI(GL_RGB5, GL_LINEAR, GL_REPEAT, GL_REPEAT);
fillrateTexture->setCurrent();

// Draw 1024 screen-covering, single-textured quads; the texture coordinates
// are rotated a little each iteration just so something visible changes.
for (int n = 0; n < 1024; n++){
    f += 0.01f;
    s = 3 * float(cos(f));
    t = 3 * float(sin(f));

    glBegin(GL_QUADS);
    glTexCoord2f(s,t);
    glVertex2f(0,0);
    glTexCoord2f(t,-s);
    glVertex2f(1,0);
    glTexCoord2f(-s,-t);
    glVertex2f(1,1);
    glTexCoord2f(-t,s);
    glVertex2f(0,1);
    glEnd();

    SwapBuffers(hdc);
}


This is the code for texel fillrate:



GLint nTmu;
glGetIntegerv(GL_MAX_TEXTURE_UNITS_ARB, (GLint *) &nTmu);

// Bind the same texture to every texture unit the card exposes.
for (int i = 0; i < nTmu; i++){
    glActiveTexture(GL_TEXTURE0_ARB + i);
    glEnable(GL_TEXTURE_2D);
    fillrateTexture->setCurrent();
}

startCycle = GetCycleNumber();
// Same screen-covering quads as above, but with one set of texture
// coordinates per enabled unit, so every pixel samples nTmu textures.
for (int n = 0; n < 1024; n++){
    f += 0.01f / nTmu;
    for (i = 0; i < nTmu; i++){
        ms[i] = 3 * float(cos(f*(i+1)));
        mt[i] = 3 * float(sin(f*(i+1)));
    }

    glBegin(GL_QUADS);
    for (i = 0; i < nTmu; i++)
        glMultiTexCoord2f(GL_TEXTURE0_ARB + i,ms[i],mt[i]);
    glVertex2f(0,0);

    for (i = 0; i < nTmu; i++)
        glMultiTexCoord2f(GL_TEXTURE0_ARB + i,mt[i],-ms[i]);
    glVertex2f(1,0);

    for (i = 0; i < nTmu; i++)
        glMultiTexCoord2f(GL_TEXTURE0_ARB + i,-ms[i],-mt[i]);
    glVertex2f(1,1);

    for (i = 0; i < nTmu; i++)
        glMultiTexCoord2f(GL_TEXTURE0_ARB + i,-mt[i],ms[i]);
    glVertex2f(0,1);
    glEnd();

    SwapBuffers(hdc);
}

jwatte
05-05-2001, 01:36 PM
I won't argue about the wave files, but I think the large textures are 100% necessary for showing off a card's quality.


How about using JPEG compression at 8:1 or so? That's more or less lossless, and quite good enough to show off the card.

mcraighead
05-05-2001, 04:01 PM
Well, I tried to run it but got really crap scores on my GF3 -- think several times slower than my GF2.

Investigating...

- Matt

mcraighead
05-05-2001, 04:51 PM
My own stupidity -- I had failed to disable FSAA from earlier.

- Matt

mcraighead
05-05-2001, 05:24 PM
Here are my results -- P3-700, BX, latest (unreleased :) ) drivers, on a GF2 GTS (32 MB, standard clocks -- 200/166), and on a GF3 (64 MB, probably standard clocks of 200/230, but I forget).

***** GF2 *****

Results:
--------

Overdraw/HSR:
-------------
Overdraw factor 3, back to front: 125.99 fps
Overdraw factor 3, front to back: 174.65 fps
Overdraw factor 3, random order: 151.14 fps

Overdraw factor 8, back to front: 47.93 fps
Overdraw factor 8, front to back: 76.05 fps
Overdraw factor 8, random order: 66.54 fps

Fillrate:
---------
Pixel fillrate: 654.26 MegaPixels / s
Texel fillrate: 1276.91 MegaTexels / s

T&L/High polygon count static display list:
-------------------------------------------
Pure transform: 11416098 vertices / s
2 point lights: 7334693 vertices / s
8 point lights: 2954103 vertices / s
2 directional lights: 10365105 vertices / s
8 directional lights: 3950746 vertices / s

High memory bandwidth load/texture cache efficiency:
----------------------------------------------------
One 1024x1024x32 texture: 444.92 fps
Four 1024x1024x32 textures: 423.47 fps

***** GF3 *****

Results:
--------

Overdraw/HSR:
-------------
Overdraw factor 3, back to front: 223.05 fps
Overdraw factor 3, front to back: 483.02 fps
Overdraw factor 3, random order: 323.83 fps

Overdraw factor 8, back to front: 85.09 fps
Overdraw factor 8, front to back: 299.52 fps
Overdraw factor 8, random order: 185.96 fps

Fillrate:
---------
Pixel fillrate: 610.07 MegaPixels / s
Texel fillrate: 1258.03 MegaTexels / s

T&L/High polygon count static display list:
-------------------------------------------
Pure transform: 13401025 vertices / s
2 point lights: 7384063 vertices / s
8 point lights: 2998833 vertices / s
2 directional lights: 10725542 vertices / s
8 directional lights: 4029325 vertices / s

High memory bandwidth load/texture cache efficiency:
----------------------------------------------------
One 1024x1024x32 texture: 586.79 fps
Four 1024x1024x32 textures: 556.86 fps

So the GF3 wins all tests except for the fillrate test, which it (very slightly) loses, despite having the same theoretical speed.

I would suggest removing SwapBuffers from the test and doing a glFinish() at the end of each test before getting the time (which is something you should be doing anyhow). This would ensure that latency limiting on our part does not impact the results.

- Matt

JackM
05-05-2001, 05:50 PM
I did get strange results in Win2k compared to XP Beta 2 Pro...

PIII 850, GeForce DDR

Windows 2000

Results:
--------

Overdraw/HSR:
-------------
Overdraw factor 3, back to front: 100.76 fps
Overdraw factor 3, front to back: 121.27 fps
Overdraw factor 3, random order: 111.95 fps

Overdraw factor 8, back to front: 43.26 fps
Overdraw factor 8, front to back: 58.02 fps
Overdraw factor 8, random order: 53.61 fps

Fillrate:
---------
Pixel fillrate: 217.50 MegaPixels / s
Texel fillrate: 316.54 MegaTexels / s

T&L/High polygon count static display list:
-------------------------------------------
Pure transform: 5556127 vertices / s
2 point lights: 3943681 vertices / s
8 point lights: 1751863 vertices / s
2 directional lights: 5194629 vertices / s
8 directional lights: 2282674 vertices / s

High memory bandwidth load/texture cache efficiency:
----------------------------------------------------
One 1024x1024x32 texture: 228.36 fps
Four 1024x1024x32 textures: 103.96 fps


Windows XP Beta 2 Build 2462


Results:
--------

Overdraw/HSR:
-------------
Overdraw factor 3, back to front: 112.80 fps
Overdraw factor 3, front to back: 146.29 fps
Overdraw factor 3, random order: 130.50 fps

Overdraw factor 8, back to front: 42.73 fps
Overdraw factor 8, front to back: 61.48 fps
Overdraw factor 8, random order: 55.65 fps

Fillrate:
---------
Pixel fillrate: 412.65 MegaPixels / s
Texel fillrate: 471.45 MegaTexels / s

T&L/High polygon count static display list:
-------------------------------------------
Pure transform: 6935988 vertices / s
2 point lights: 4507733 vertices / s
8 point lights: 1789605 vertices / s
2 directional lights: 6369106 vertices / s
8 directional lights: 2396262 vertices / s

High memory bandwidth load/texture cache efficiency:
----------------------------------------------------
One 1024x1024x32 texture: 364.67 fps
Four 1024x1024x32 textures: 365.61 fps


One question, Humus - did you use vertex arrays? (GL_NV_vertex_array_range)

It would greatly speed it up in some cases.

JackM





ET3D
05-06-2001, 12:32 AM
LordKronos, my problem is not with including large files but, as jwatte suggested, with their lack of compression. Music and large textures do add to a tech demo, but they can be made MP3 or JPEG. This can make the demos several times smaller, without a noticeable degradation in quality. I won't mind if the music is decompressed to my hard disk, to save CPU at runtime. I just want the demos to be smaller, so that I don't have to spend an hour or more on my 56K modem in order to see a 2 minute demo. That makes the demo less impressive for me ("I spent an hour downloading for that?!").

JackM, I don't have these problems with Windows 2000. Here are the scores for my GeForce SDR (using 6.31 drivers) on a P3-700:

Overdraw/HSR:
-------------
Overdraw factor 3, back to front: 62.66 fps
Overdraw factor 3, front to back: 98.03 fps
Overdraw factor 3, random order: 79.90 fps

Overdraw factor 8, back to front: 23.90 fps
Overdraw factor 8, front to back: 45.83 fps
Overdraw factor 8, random order: 37.57 fps

Fillrate:
---------
Pixel fillrate: 381.19 MegaPixels / s
Texel fillrate: 453.48 MegaTexels / s

T&L/High polygon count static display list:
-------------------------------------------
Pure transform: 6578232 vertices / s
2 point lights: 4298507 vertices / s
8 point lights: 1739456 vertices / s
2 directional lights: 6018996 vertices / s
8 directional lights: 2337718 vertices / s

High memory bandwidth load/texture cache efficiency:
----------------------------------------------------
One 1024x1024x32 texture: 227.13 fps
Four 1024x1024x32 textures: 226.40 fps


Some of them are obviously lower due to the lower SDR bandwidth, but they're still much closer to your Windows XP scores in nature than to your Windows 2000 scores.

paddy
05-06-2001, 05:15 AM
Here are my results with a Radeon 64 DDR on a T-Bird 1020 MHz:

1) Without Hierarchical Z

Results:
--------

Overdraw/HSR:
-------------
Overdraw factor 3, back to front: 146.71 fps
Overdraw factor 3, front to back: 157.66 fps
Overdraw factor 3, random order: 153.03 fps

Overdraw factor 8, back to front: 55.42 fps
Overdraw factor 8, front to back: 60.90 fps
Overdraw factor 8, random order: 59.53 fps

Fillrate:
---------
Pixel fillrate: 394.45 MegaPixels / s
Texel fillrate: 1166.10 MegaTexels / s

T&L/High polygon count static display list:
-------------------------------------------
Pure transform: 9727177 vertices / s
2 point lights: 5127764 vertices / s
8 point lights: 5095699 vertices / s
2 directional lights: 5129071 vertices / s
8 directional lights: 5112222 vertices / s

High memory bandwidth load/texture cache efficiency:
----------------------------------------------------
One 1024x1024x32 texture: 363.95 fps
Four 1024x1024x32 textures: 309.80 fps

2) With Hierarchical Z

Results:
--------

Overdraw/HSR:
-------------
Overdraw factor 3, back to front: 146.80 fps
Overdraw factor 3, front to back: 270.88 fps
Overdraw factor 3, random order: 199.63 fps

Overdraw factor 8, back to front: 55.42 fps
Overdraw factor 8, front to back: 139.87 fps
Overdraw factor 8, random order: 101.97 fps

Fillrate:
---------
Pixel fillrate: 394.21 MegaPixels / s
Texel fillrate: 1165.46 MegaTexels / s

T&L/High polygon count static display list:
-------------------------------------------
Pure transform: 9721489 vertices / s
2 point lights: 5111637 vertices / s
8 point lights: 5091131 vertices / s
2 directional lights: 5135189 vertices / s
8 directional lights: 5132600 vertices / s

High memory bandwidth load/texture cache efficiency:
----------------------------------------------------
One 1024x1024x32 texture: 363.68 fps
Four 1024x1024x32 textures: 310.33 fps

Humus
05-06-2001, 06:16 AM
Originally posted by mcraighead:

I would suggest removing SwapBuffers from the test and doing a glFinish() at the end of each test before getting the time (which is something you should be doing anyhow). This would ensure that latency limiting on our part does not impact the results.


I see your point, but removing SwapBuffers will of course cause nothing to be seen on screen, which some users may have problems with; but if it's to be really correct I guess I should do that. Anyway, I tried removing SwapBuffers and putting in a glFinish at the end, but the difference was minimal to say the least. Like 0.3% or something. It may be different on other cards.
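
For reference, what I tried looks roughly like this (now() and drawFillrateQuad() stand in for the timer and the quad drawing the utility actually uses):

double t0 = now();
for (int n = 0; n < 1024; n++){
    drawFillrateQuad();        // same quad drawing as before, but no SwapBuffers inside the loop
}
glFinish();                    // make sure the GPU has actually finished the work
double seconds = now() - t0;

SwapBuffers(hdc);              // present something once, outside the timed region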

Humus
05-06-2001, 06:19 AM
Originally posted by JackM:

One question, Humus - did you use vertex arrays? (GL_NV_vertex_array_range)

It would greatly speed it up in some cases.


No, I use a static display list. I initially planned to have vertex arrays too, but I lost interest in the project and just released what I had. If I ever update it I guess I'll add that too.

ET3D
05-06-2001, 06:59 AM
Just thought I'd post the results of a Celeron 300A (@ 300MHz) with a Radeon VE.

Overdraw/HSR:
-------------
Overdraw factor 3, back to front: 56.57 fps
Overdraw factor 3, front to back: 55.56 fps
Overdraw factor 3, random order: 52.45 fps

Overdraw factor 8, back to front: 19.59 fps
Overdraw factor 8, front to back: 23.75 fps
Overdraw factor 8, random order: 22.59 fps

Fillrate:
---------
Pixel fillrate: 109.23 MegaPixels / s
Texel fillrate: 325.14 MegaTexels / s

T&L/High polygon count static display list:
-------------------------------------------
Pure transform: 1247759 vertices / s
2 point lights: 219341 vertices / s
8 point lights: 73227 vertices / s
2 directional lights: 528794 vertices / s
8 directional lights: 314615 vertices / s

High memory bandwidth load/texture cache efficiency:
----------------------------------------------------
One 1024x1024x32 texture: 127.25 fps
Four 1024x1024x32 textures: 77.16 fps

jwatte
05-06-2001, 07:53 AM
Note that the Radeon VE does *NOT* have hardware transform & lighting, and thus the Celeron is the bottleneck in that test.

Personally, I think it was a marketing mistake for ATI to dilute the "Radeon" name.

ET3D
05-06-2001, 01:24 PM
I knew it lacked T&L, and that it had only one 3D pipeline, but I didn't know it lacked HyperZ. I bought it mainly for DVD, something it doesn't do well either, for some reason (it works worse than the Rage Pro that was previously in that PC).

Anyway, I don't think that ATI "diluted" the Radeon name any more than NVIDIA did to the GeForce name with the MX 200 (which will likely be the most successful GeForce product). When you buy a VE, you still get most of what the Radeon is about - good 2D and DVD, lots of 3D features, and errors in the 3DMark 2001 lobby scene :) (well, actually the latest "special purpose" driver fixed the lobby scene for me, although it broke DVD completely).

Anyway, none of this really relates to the HSR issue.
