Z sort vs. minimizing texture state changes.

Hello!

Without a Z buffer you can sort your polygons (BSP etc.), but how can you minimize texture state changes? Of course there are independent polygons that you can reorder to minimize texture state changes, but how do you detect these situations?

Sorry for the terrible English.

Good question…
I have thought about this too.

However, I don’t have an answer to it.
I render my polygons in material groups,
i.e. all the polygons with the same texture etc. are rendered at the same time.

It would be nice to know the performance difference, if someone has tried both sorting by material and sorting by depth.

I’d say that with current hardware, it’s getting rare to have the exact same textures, with the exact same states for two different meshes.

Think about it… shaders, multitextures, and all the rest (just look through the 600 extensions).

I would give priority to depth sort, then I would try to optimize the state changes.
One good start is simply not to change a state if the current state is already what you want, i.e. don’t send useless state changes to OpenGL. But don’t query OpenGL to find out the current state; keep the info you need in your app for quicker access.
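A minimal sketch of that kind of app-side state tracking, in Python for brevity. The class name `TextureStateCache` and the `bind_fn` callback are hypothetical; `bind_fn` stands in for whatever real bind call the renderer would make, and only texture binds are tracked here.

```python
class TextureStateCache:
    """Remembers the last texture bound so repeated binds can be skipped."""

    def __init__(self, bind_fn):
        self._bind_fn = bind_fn   # hypothetical stand-in for the real bind call
        self._current = None      # nothing bound yet
        self.binds_issued = 0

    def bind(self, texture_id):
        """Issue a bind only if the requested texture is not already current."""
        if texture_id == self._current:
            return False          # already bound: no call goes to the driver
        self._bind_fn(texture_id)
        self._current = texture_id
        self.binds_issued += 1
        return True


issued = []                       # records the binds that actually went through
cache = TextureStateCache(issued.append)
for tex in [7, 7, 7, 3, 3, 7]:
    cache.bind(tex)
# Six requests, but only three real binds: 7, 3, 7.
```

The cache never asks the API what is bound; it relies entirely on its own record, so every bind in the app has to go through it.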

But I haven’t really worked on that yet so I can’t give you a real answer.

Sorting by depth will usually be a much bigger win. On various chips and modes (both from us and other vendors), you can get increases on the order of 4x, 8x, 16x on your effective fill rate.

On older chips (without early Z tests), this can still be, say, a 50% win.

  • Matt

Another problem with rendering from Z max to Z min:
If you render something far away (with a lot of effects) and later overwrite 50-80% of that object’s pixels, you have lost performance. You render a lot of invisible parts!
With the Z buffer I got 4x the FPS when I minimized the texture state changes.
When I did the same without the Z buffer I got +30% FPS (Voodoo3-2000, Celeron 300). As I see it, minimizing texture state changes is very important, but without a Z sort you cannot render a correct picture.
Z sorting first and only then minimizing texture state changes (TSCM from here on) is the wrong solution. An example:

You have a floor and a ceiling (my dictionary gave me this word):

SIDE VIEW


1 2 3 4 5 6 7 8 9 Ceiling Tex: X


A B C D E F G H I Floor Tex: Y

If you render with strict Z sort then:
Tex change to X
Render 1
Tex change to Y
Render A
Tex change to X
Render 2
Tex change to Y
Render B
Tex change to X
Render 3
Tex change to Y
Render C

That is a lot of texture state changes.

If you do not use strict Z order:
Tex change to X
Render 1-9
Tex change to Y
Render A-I.

You have only TWO base texture changes (of course the light map works without any modification).
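The two draw orders from the floor/ceiling example can be counted with a short sketch. It assumes, as in the listing above, that strict Z order alternates ceiling and floor tiles (1, A, 2, B, ...); the function name is made up for illustration.

```python
def count_texture_changes(draw_order):
    """Count how often the bound texture has to change for a draw order."""
    changes = 0
    current = None
    for _name, texture in draw_order:
        if texture != current:
            changes += 1
            current = texture
    return changes


ceiling = [(str(i), "X") for i in range(1, 10)]   # tiles 1-9, texture X
floor = [(c, "Y") for c in "ABCDEFGHI"]           # tiles A-I, texture Y

# Strict Z order: 1, A, 2, B, ... so the texture flips on every polygon.
strict_z = [poly for pair in zip(ceiling, floor) for poly in pair]
# Grouped order: all ceiling tiles, then all floor tiles.
grouped = ceiling + floor

strict_changes = count_texture_changes(strict_z)    # 18
grouped_changes = count_texture_changes(grouped)    # 2
```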

Well, the floor and the ceiling are independent.

But how do you detect the independent parts?
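One possible answer, offered only as a sketch and not as something the posters proposed: treat two polygons as independent when their screen-space bounding boxes do not overlap, and let a polygon move earlier in the back-to-front list only if it overlaps nothing it jumps over. The `Poly` record, its fields, and the sample data are all hypothetical. The bounding-box test is conservative (it can report overlap where the polygons do not really touch), and the loop is quadratic in the worst case.

```python
from collections import namedtuple

# Hypothetical polygon record: a texture name plus a screen-space bounding box.
Poly = namedtuple("Poly", "name texture x0 y0 x1 y1")


def overlaps(a, b):
    """True if the two screen-space bounding boxes share any area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def regroup(back_to_front):
    """Batch textures in a back-to-front list without reordering overlapping polygons."""
    remaining = list(back_to_front)
    result = []
    while remaining:
        first = remaining.pop(0)
        result.append(first)
        skipped = []
        for poly in remaining:
            # poly may jump ahead only if it overlaps none of the polygons it passes.
            if poly.texture == first.texture and not any(overlaps(poly, s) for s in skipped):
                result.append(poly)
            else:
                skipped.append(poly)
        remaining = skipped
    return result


back_to_front = [
    Poly("1", "X", 0, 0, 10, 2),    # ceiling tile
    Poly("A", "Y", 0, 8, 10, 10),   # floor tile
    Poly("2", "X", 1, 2, 9, 3),     # ceiling tile
    Poly("B", "Y", 1, 7, 9, 8),     # floor tile
    Poly("P", "X", 4, 3, 6, 9),     # pillar in front, covers parts of A and B
]
order = [p.name for p in regroup(back_to_front)]
```

Here the ceiling and floor tiles get batched, while the pillar stays after the floor tiles it covers, so the picture is unchanged and the texture changes drop from five to three.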

You usually get much better performance when you render front to back (minZ to maxZ).
This makes overdraw much cheaper. The hardware still has to check the z buffer, but if the test fails there will be no write to the color buffer and no write to the z buffer. Try benching it again this way and let’s hear the results.

Of course you get much better performance if you use the Z buffer, but in my question I do not want to use Z. I would like to render back to front.
If you do not use a Z buffer you gain at least +30% FPS and more video memory for textures, because your frame buffer is thinner (16 or 24 bits less per pixel).
With Z: overdraw is cheaper, and rendering can be faster in front-to-back order.
Without Z: overdraw is at its maximum, but the system gains at least +30% performance and more video memory.

Which one is better? I do not have time for benchmarking now, but I will try it later.

Using Z will probably increase your performance, I’d say. Don’t forget that manually sorting from back to front (and it has to be a perfect sort) is a real pain and can be a performance drag.

It’s hard to argue with, say, 8x the fillrate when going front to back.

  • Matt

Originally posted by mcraighead:
Using Z will probably increase your performance, I’d say. Don’t forget that manually sorting from back to front (and it has to be a perfect sort) is a real pain and can be a performance drag.

It’s hard to argue with, say, 8x the fillrate when going front to back.

  • Matt

The Z sorting comes from the BSP, so that is not the issue. We are trying to do this without Z.

Generally Z testing is enabled, so it’s best to sort by texture/material to minimize state changes.

In your case (not using a Z buffer) it’s still best to sort by material to eliminate state changes. But this is sort of changing. I don’t find texture changes to be that costly. The state change itself is inevitable; what really matters is whether the texture is already on the video card or still in system memory, because transferring it over the bus is slow. With the new GeForce 4 at 128 MB, and the generally increasing amount of memory on cards, more textures stay resident in video RAM, which should make texture binds cheaper. Then again, I haven’t really tested performance in this area, so this is largely speculative.

how about this?
draw from front to back with your bsp, and draw the cutting planes WITHOUT writing to rgb but with an occlusion test enabled, to count how many pixels are in fact visible…

if any are, draw the other side too, else give up and don’t follow the bsp in this leaf anymore…

that way you get nearly no overdraw…

oh, and… if you FAIL the z-test on gf3, radeon, gf4, nothing is calculated for that pixel, no texture lookup etc… z-buffers are compressed in some format so they can be accessed really fast… so you don’t lose that much from drawing an object BEHIND an already drawn object… and you can count how much gets drawn, to stop drawing any further back as early as possible…

do you want anything more?
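A rough sketch of the traversal described above, with heavy simplifications that are assumptions of this example: the `Node` class is hypothetical, splitting planes are axis-aligned at `x == split` instead of arbitrary planes, and the occlusion test on the cutting plane is replaced by a caller-supplied `far_side_visible` callback rather than a real hardware query.

```python
class Node:
    """Hypothetical BSP node that splits space at x == split."""

    def __init__(self, split, polys, front=None, back=None):
        self.split = split
        self.polys = polys    # polygons lying on the splitting plane
        self.front = front    # subtree on the x > split side
        self.back = back      # subtree on the x < split side


def front_to_back(node, viewer_x, draw, far_side_visible=lambda node: True):
    """Visit polygons nearest-first; skip the far subtree if it is reported hidden."""
    if node is None:
        return
    if viewer_x >= node.split:
        near, far = node.front, node.back
    else:
        near, far = node.back, node.front
    front_to_back(near, viewer_x, draw, far_side_visible)
    for poly in node.polys:
        draw(poly)
    # Stand-in for the occlusion test on the cutting plane suggested in the post.
    if far_side_visible(node):
        front_to_back(far, viewer_x, draw, far_side_visible)


tree = Node(0, ["wall0"], front=Node(5, ["wall5"]), back=Node(-5, ["wall-5"]))

seen_from_right = []
front_to_back(tree, 10, seen_from_right.append)

culled = []
front_to_back(tree, 10, culled.append, far_side_visible=lambda n: n.split != 0)
```

With the viewer at x = 10 the walls come out nearest-first, and when the callback reports that nothing behind the root plane is visible, the far subtree is never visited.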

I agree; draw your BSP front to back and leave depth testing on.

For some people this may be unintuitive, but depth testing really does gain you performance in many cases.

Another important point is that the bandwidth cost of reading/writing Z is often much less than reading/writing color, because of things like Z compression.

The naive calculation would say that a non-zbuffered app would spend 4 bytes per pixel to write color, but the zbuffered app would spend 4 or 12 bytes to read Z and possibly write Z and color. However, if the Z is compressed, this calculation is wrong.
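Written out, the naive calculation looks like the sketch below. It assumes 32-bit color and 32-bit depth (which is what the 4-or-12 figure implies), one Z read per fragment, and no compression; these are exactly the assumptions the post says break down on real hardware, so the numbers illustrate the naive model only.

```python
COLOR_BYTES = 4   # 32-bit color, as in the post
Z_BYTES = 4       # 32-bit depth, implied by the 4-or-12 figure


def naive_bytes_no_z(layers):
    """Every fragment writes color; nothing else is touched."""
    return layers * COLOR_BYTES


def naive_bytes_with_z(layers, passes):
    """Failed fragments cost one Z read; passing ones add a Z write and a color write."""
    fails = layers - passes
    return fails * Z_BYTES + passes * (2 * Z_BYTES + COLOR_BYTES)


overdraw = 8
no_z = naive_bytes_no_z(overdraw)                       # 32 bytes per pixel
z_front_to_back = naive_bytes_with_z(overdraw, 1)       # 40: only the first layer passes
z_back_to_front = naive_bytes_with_z(overdraw, 8)       # 96: every layer passes
```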

Even if your app is not sorted, with high overdraw you will get a very high percentage of pixels killed with one Z read. You’d think at first that front-to-back is 100% killed, back-to-front is 0% killed, and random is 50% killed. In fact, though, if overdraw is high, random gets a lot more than 50% pixels killed.
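The claim about random order can be checked with a one-pixel model: n fragments at distinct depths all cover the same pixel, the depth test is "less than", and a fragment is killed when the test fails. The fragments that pass are the running minima of the sequence, and for a uniformly random order the expected number of those is the harmonic number H_n = 1 + 1/2 + ... + 1/n. The sketch below enumerates every ordering to confirm this; the function names are made up for illustration.

```python
from fractions import Fraction
from itertools import permutations


def color_writes(depths):
    """Count fragments that pass a less-than depth test, in draw order."""
    z_buffer = float("inf")
    writes = 0
    for z in depths:
        if z < z_buffer:       # nearer than anything drawn so far
            z_buffer = z
            writes += 1
    return writes


def killed_fraction_random(n):
    """Exact expected fraction of fragments killed over all n! draw orders."""
    orders = list(permutations(range(n)))
    total_writes = sum(color_writes(order) for order in orders)
    return 1 - Fraction(total_writes, len(orders) * n)


def harmonic(n):
    """H_n as an exact fraction."""
    return sum(Fraction(1, k) for k in range(1, n + 1))


killed_3 = killed_fraction_random(3)   # 7/18, about 39%
killed_8 = killed_fraction_random(8)   # 1479/2240, about 66%
```

In this model random order kills about 39% of fragments at overdraw 3 and about 66% at overdraw 8, and the fraction keeps rising as overdraw grows.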

Also, keep in mind that if you are not using the Z buffer, you will in all likelihood not be bandwidth-limited, but fill-limited. The extra bandwidth may not actually hurt you.

And even then, it may not actually be extra bandwidth (it may be less), and your fill rate will go up too.

When you add in the effects of texturing and fancy shaders, this computation becomes even more one-sided.

  • Matt

I once did a small benchmark app that tests overdraw performance (available here; look for GL_EXT_reme). I just did a quick test; here are the results for a Radeon 8500:

Overdraw factor 3, back to front: 248.97 fps
Overdraw factor 3, front to back: 841.26 fps
Overdraw factor 3, random order: 470.54 fps

Overdraw factor 8, back to front: 106.55 fps
Overdraw factor 8, front to back: 654.97 fps
Overdraw factor 8, random order: 288.45 fps

For the overdraw factor 8 there’s a 6.15x performance improvement by drawing front-to-back instead of back-to-front. Not too bad.
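For reference, the ratios behind that figure can be recomputed from the Radeon 8500 numbers posted above. This is plain arithmetic on the quoted fps values; the dictionary layout and function name are just for illustration.

```python
radeon_8500_fps = {
    3: {"back_to_front": 248.97, "front_to_back": 841.26, "random": 470.54},
    8: {"back_to_front": 106.55, "front_to_back": 654.97, "random": 288.45},
}


def speedup_over_back_to_front(results, order):
    """How many times faster `order` is than back-to-front at the same overdraw."""
    return results[order] / results["back_to_front"]


speedups = {
    overdraw: {
        order: round(speedup_over_back_to_front(results, order), 2)
        for order in ("front_to_back", "random")
    }
    for overdraw, results in radeon_8500_fps.items()
}
# Front-to-back: about 3.38x at overdraw 3 and 6.15x at overdraw 8.
```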

Heh, now that I ran this app I see that ATi’s drivers have improved a lot since the last time I ran it. I only got a 3.5x increase last time, T&L performance is twice as high too, and the fillrate is close to the theoretical maximum, as it should be. I used to get 2.5x on a Radeon classic, by the way, and the GF3 used to get a 3.5x improvement; those could have improved since last time too.

Originally posted by Humus:
and GF3 used to get 3.5x improvement, those could have improved since last time too.

Overdraw factor 3, back to front: 197.99 fps
Overdraw factor 3, front to back: 425.95 fps
Overdraw factor 3, random order: 288.27 fps

Overdraw factor 8, back to front: 75.36 fps
Overdraw factor 8, front to back: 262.79 fps
Overdraw factor 8, random order: 164.41 fps

Note that I’m running a P3 933 + GF3 Ti 200 + pc133 SDRAM and a lot of background apps + 5 IE instances + 2 instances of MSVC with 2 large projects loaded and there was some more HD activity…

But I still like these scores.

Hmm … I would expect better than that from a Ti200.
It still does prove the point that depth sorting can improve performance a lot; 3.5x is not too bad.

Has anyone done a benchmark with state changes as well?

sorted by texture front to back
sorted by texture back to front
sorted by texture random

not sorted by texture front to back
not sorted by texture back to front
not sorted by texture random

For “not sorted by texture” I would do the worst case, which is interleaving the textures so that every object you draw needs a texture switch.

Stop that texture switching nonsense already!! (or even !!!)

As soon as you start doing some serious geometry work, like, say, models with more than ten vertices each, the point is absolutely moot! Yes, you might flush the pipeline. Duh. That’s a stall of a few hundred cycles on the hardware, absolutely worst case. Guess what, the hardware runs at hundreds of MHz.

If you get a significant speed boost from dodging texture binding changes, I’ll bet you

  1. use near zero CPU time except making OpenGL calls
  2. have only static geometry (yah, extends point one…)
  3. draw less than 500 vertices per refresh
  4. get 300+ fps anyway
  5. in summary, never pushed any kind of serious scene through the pipeline

So stop it, stop it, stop it!

Depth test on the other hand is a per fragment operation, that’s where you want to be fast.

To wrap it all up again:
1) enable depth buffering and testing
2) render front to back
3) disregard everything else

Is this true?
Why does “everyone” sort their objects by texture then?
And how about using some heuristic? This may actually be quite fast.
E.g. draw the landscape and other big objects first and then the smaller objects, thus increasing the chance that the smaller ones will be occluded.
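A minimal sketch of such a heuristic, assuming each object carries a rough estimate of its projected size. The `screen_area` field and the sample values are hypothetical, and nothing here guarantees that the big objects are actually in front; it only makes it more likely that small objects are drawn after whatever hides them.

```python
def draw_order_big_first(objects):
    """Sort by estimated screen area, largest first (ties keep their input order)."""
    return sorted(objects, key=lambda obj: obj["screen_area"], reverse=True)


scene = [
    {"name": "crate", "screen_area": 900},
    {"name": "landscape", "screen_area": 500000},
    {"name": "tree", "screen_area": 4000},
    {"name": "building", "screen_area": 60000},
]
order = [obj["name"] for obj in draw_order_big_first(scene)]
```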

Charles

think about this:
you have a strategy game… take starcraft in 3d… with 100 space marines and 100 zerglings fighting each other…

now if you draw it like this:
zerg
sm
zerg
sm
zerg
sm
zerg
sm
zerg
sm
zerg
sm
zerg
sm
zerg
sm
zerg
sm
zerg
sm
zerg
sm
zerg
sm
zerg
sm
zerg
sm
zerg
sm
zerg
sm

that would result in quite a lot of UNNECESSARY switches…

THAT GEOMETRY HAS TO BE SORTED, simply because doing it the other way round is very stupid…

but don’t sort by nitpicking… sort the big chunks… like same units drawn at the same time, same trees at the same time, same particles at the same time…
but if you code in a normal way this happens anyway…
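Sorting "the big chunks" can be as simple as bucketing the draw list by unit type. The sketch below assumes each unit is a small record with a hypothetical `type` field and that one type means one set of textures and states; it keeps the types in first-seen order and does no depth sorting at all.

```python
def group_by_type(draw_list):
    """Batch units of the same type together, keeping first-seen order of types."""
    batches = {}
    for unit in draw_list:
        batches.setdefault(unit["type"], []).append(unit)
    return [unit for batch in batches.values() for unit in batch]


def count_switches(draw_list):
    """Count how often the unit type (and so its textures) changes."""
    switches = 0
    current = None
    for unit in draw_list:
        if unit["type"] != current:
            switches += 1
            current = unit["type"]
    return switches


# zerg, sm, zerg, sm, ... as in the listing above (16 of each).
interleaved = [{"type": t, "id": i} for i in range(16) for t in ("zerg", "sm")]

before = count_switches(interleaved)                  # 32
after = count_switches(group_by_type(interleaved))    # 2
```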

now this is for strategy games, where z-sort is not that big a problem anyway (one landscape, units above…)

for games like quake, you have a different texture on nearly every wall… so what do you wanna do?
AND quake has very much overdraw, meaning you theoretically see the whole level behind a wall… that is where z-buffering/z-sorting/occlusion-culling is really needed… not texture switching…

it’s always app-dependent…

Thread topic is something like ‘Z buffering vs texture switching - who wins?’
  • ye shall start taking advantage of hardware T&L by tesselating your objects nicely
  • ye shall only do coarse culling in software as your polycounts will be high
  • ye shall use fewer and instead bigger textures per distinct object
  • witness that you now draw many polys per object, and you by design never have to change textures during that
  • witness that you are fill limited anyway
  • ye shall forfeit the belief of being limited by state changing
  • make it so that you can fill faster by reducing overdraw or making it cheaper
  • make it so that the hardware can discard more pixels early
  • render front to back
  • thank you