Z-buffer or W-buffer? Not sure which to use. What do you use?

Hello…
…mmmh, I’m very undecided about this topic: should I use the W-buffer or the Z-buffer?
So far my engine uses only the Z-buffer, and neither the terrain nor the viewing range is large enough that I would need a W-buffer.
BUT: in several discussions around the web, most people design their engines so that they can handle both the Z-buffer and the W-buffer. Yes, I know, the Z-buffer is used all the time, and the W-buffer only if I add a new member to my vertex class, called “w” or something like that.
But I’m totally confused, because I don’t know whether I should make my engine use the W-buffer or not. BECAUSE: if I decide to use the W-buffer only sometimes, I have to add the w component to the vertices. In most cases the W-buffer is not necessary, yet the vertices would still carry the additional, unused w component, which is a waste of memory.

AND: when should I use the W-buffer?
Do you use the W-buffer?

Mmh, I would be very thankful if anyone could give me some tips or point me in the right direction on this “W-buffer vs. Z-buffer” topic.

thanx.

Well, I don’t think any of the cards support w-buffers under OpenGL (only under Direct3D). Unless you plan on also supporting Direct3D, I guess that probably makes your decision pretty easy.

Also, about the extra w component per vertex: it’s usually not a big deal memory-wise. The more important issue is memory alignment. If you are only using XYZ + 2 ST tex coords, each vertex ends up at 28 bytes. You would likely want to align this to 32, so you have the extra room for the W component anyway. Likewise, if you are using something like XYZ + ST tex coord + STR tex coord + normal, you are already at 44 bytes, and in most cases you would want to align your vertices to 64 bytes, so you still have plenty of room to add in the W. However, if you happen to be using a vertex format that is already 32 bytes in size, adding this extra W will push you over the alignment boundary and you will have to go to 64-byte alignment, which can decrease performance.

P.S. The above is assuming you are using interleaved arrays.
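To make the size arithmetic above concrete, here is a minimal C sketch. "Interleaved" here means that all attributes of one vertex sit next to each other in one struct, and the array is just a sequence of those structs. The struct names and the `round_up` helper are made up for illustration, and the sizes assume a typical platform with 4-byte floats and no extra padding.

```c
#include <assert.h>
#include <stddef.h>

/* One interleaved vertex: XYZ plus two sets of ST texture coordinates. */
typedef struct {
    float pos[3];   /* 12 bytes */
    float st0[2];   /*  8 bytes, texture unit 0 */
    float st1[2];   /*  8 bytes, texture unit 1 */
} VertexPacked;     /* 28 bytes */

/* Same vertex padded to 32 bytes; the spare slot could hold a W. */
typedef struct {
    float pos[3];
    float st0[2];
    float st1[2];
    float w_or_pad; /*  4 bytes */
} VertexPadded;     /* 32 bytes */

/* Compile-time checks (hold on typical platforms with 4-byte float). */
_Static_assert(sizeof(VertexPacked) == 28, "expected a 28-byte vertex");
_Static_assert(sizeof(VertexPadded) == 32, "expected a 32-byte vertex");

/* Round size up to the next multiple of align (align must be a power of two). */
static size_t round_up(size_t size, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}
```

With this helper, `round_up(28, 32)` gives 32 and `round_up(44, 64)` gives 64, while a 32-byte format plus a 4-byte W is 36 bytes and `round_up(36, 32)` jumps to 64.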

[This message has been edited by LordKronos (edited 08-29-2002).]

@LordKronos:
Mmh, that’s true, it’s not the first time I’ve heard that the W-buffer is only supported by DirectX, and yes, my engine only supports OpenGL. But on the nVidia homepage there is a paper which discusses the W-buffer, and I’m nearly sure (not 100%) that it was GL-related; I may be wrong, I don’t know exactly.
OK, let’s assume that only DirectX is capable of using the W-buffer. What about future GL versions? Will they have a W-buffer? The point is: my engine is designed “statically” in some parts, meaning there is no flexible vertex format like in DirectX, and if I decide now not to use the W-buffer, I have to rewrite everything if a future GL version supports it. (And I’m really not interested in using something like this f****** flexible vertex format or anything similar, because it’s too much and I don’t need it!)

Or would you say that a Z-buffer with a depth of 24 or 32 bits is enough?
(Keep in mind that my engine’s focus is mainly on “short-distance rendering”, with only occasional “long-distance rendering”, e.g. if the user zooms out a long way.)

Another question to your calculation:
How do you get 28 bytes out of xyz + 2 ST tex coords? xyz is float, which means 4 bytes for each value, so 12 bytes for 3 components. The 2 ST tex coords are also floats (in my engine), which means 2 * 4 = 8; adding the 12 bytes of the xyz components gives 20 bytes, + 4 bytes for the index of the texture used makes 24 bytes. How do you get 28 bytes?
Or do you store the index of the texture used by a mesh in the mesh itself rather than in the mesh vertices?

What do you mean by interleaved arrays? Because I’m from Germany I have to use translation software, but its translation gives no helpful hint about what the word “interleaved” means in relation to arrays.

(As you can see… question after question… yes, I agree, it was a big mistake to stick my nose into the DirectX SDK and to work for a German firm that uses DirectX for its VR sims.)

“Interleaved” is explained in the OpenGL manuals.

You’re mixing some things here. The w at vertex level is the homogeneous extension of the 3D xyz coordinates, mainly used to allow matrix transformations to include translation.
The w which is used for the w-buffer is the one after the transformation. That means there’s always a w, because the matrix is 4x4 and glVertex3f will assume w=1 internally. So for using a w-buffer it’s not necessary to specify the w at vertex level. It needs to be interpolated by the rasterizer at pixel level, like depth data.
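As a rough illustration of where that post-transformation w comes from, the following C sketch multiplies a 3-component position, with an implicit w = 1, by a perspective matrix laid out the way glFrustum builds it for a symmetric frustum. The function names are invented for this example, and the modelview matrix is assumed to be the identity so that the input is already in eye space.

```c
#include <assert.h>

/* Multiply a column-major 4x4 matrix (OpenGL memory layout) by (x, y, z, 1). */
static void transform_point(const float m[16], const float p[3], float out[4]) {
    for (int row = 0; row < 4; ++row) {
        out[row] = m[0 + row] * p[0]
                 + m[4 + row] * p[1]
                 + m[8 + row] * p[2]
                 + m[12 + row] * 1.0f;   /* implicit w = 1 */
    }
}

/* Fill m with the matrix glFrustum(-right, right, -top, top, n, f) would produce. */
static void make_frustum(float m[16], float right, float top, float n, float f) {
    assert(right > 0.0f && top > 0.0f && n > 0.0f && f > n);
    for (int i = 0; i < 16; ++i) {
        m[i] = 0.0f;
    }
    m[0]  = n / right;
    m[5]  = n / top;
    m[10] = -(f + n) / (f - n);
    m[11] = -1.0f;                       /* copies -z_eye into clip-space w */
    m[14] = -2.0f * f * n / (f - n);
}
```

For a point 5 units in front of the camera, (0, 0, -5), the resulting `out[3]` is 5: the clip-space w is simply the eye-space distance along the view axis, even though no w was ever specified for the vertex.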
The point here is that a w-buffer can gain you additional precision, because the depth values a z-buffer can represent are distributed non-linearly between zNear and zFar. There’s a nice slide somewhere on this site. Search the forum for z-precision questions.
There are also answers to your questions on large terrain issues. First try to push zNear as far out as possible to get the zFar/zNear ratio as small as possible, and use a depth buffer of at least 24 bits. If that’s not helping, you can split the geometry into multiple distinct z-ranges (search the forum), or maybe it’s possible to generate 1-z depth results instead, which moves the higher-precision range to the zFar side (never tried).
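The effect of zNear on precision can be estimated with a few lines of C. This is only a sketch under stated assumptions: a standard perspective projection, the default depth range of [0, 1], and a fixed-point depth buffer; the function names are hypothetical.

```c
#include <assert.h>

/* Window-space depth in [0, 1] for a point at eye distance d,
 * given near plane n and far plane f. */
static double window_depth(double d, double n, double f) {
    assert(n > 0.0 && f > n && d > 0.0);
    return (f / (f - n)) * (1.0 - n / d);
}

/* Approximate smallest eye-space distance step that a `bits`-bit
 * fixed-point depth buffer can still resolve at eye distance d. */
static double depth_resolution(double d, double n, double f, int bits) {
    assert(n > 0.0 && f > n && d > 0.0);
    assert(bits > 0 && bits < 64);
    double steps = (double)(1ULL << bits);     /* number of depth values */
    double slope = f * n / ((f - n) * d * d);  /* d(window_depth)/dd */
    return 1.0 / (steps * slope);
}
```

With n = 1 and f = 1000, `window_depth(2.0, 1.0, 1000.0)` is already about 0.5, so half of all depth values are spent on the first two units of distance. Moving the near plane from 1 to 10 makes `depth_resolution` at d = 500 roughly ten times finer.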

Be warned, messing around with the depth values might get you in trouble if you want to apply recent shadow algorithms.

Originally posted by DJSnow:
How do you get 28 bytes out of xyz + 2 ST tex coords? xyz is float, which means 4 bytes for each value, so 12 bytes for 3 components. The 2 ST tex coords are also floats (in my engine), which means 2 * 4 = 8; adding the 12 bytes of the xyz components gives 20 bytes, + 4 bytes for the index of the texture used makes 24 bytes. How do you get 28 bytes?

When I said “2 ST tex coords” I meant 2 SETS of coords (ST for tex0, ST for tex1). That makes it 16 bytes, not 8. 12 + 16 = 28.

@Relic:
Ahhh, yes, you opened my eyes! Now it’s much clearer to me what I want. Thanks. So it seems there is no need to specify a w component at vertex level, since GL automatically assumes w = 1. BUT: in which case does a programmer need to specify this “w value” at vertex level? I swear I have source code from people who specify the w value at vertex level.
Your answer helped me a lot, especially the search topics here on the GL forum.
So I’ve decided (with the knowledge your post gave me) not to add any w value to the vertices, and not to “try to support the W-buffer”. Thanks.

@LordKronos:
OK, now I see how you arrived at 28 bytes.
But I’ve calculated the total memory usage in my engine, and I need many more bytes per vertex, because:
3 float values, xyz (pos) = 12 bytes
3 float values, rgb (color) = 12 bytes
2 float values, uv for EACH texture level, which with a maximum of 4 texture levels makes (2 * 4 * 4) = 32 bytes
3 float values, normal vector = 12 bytes
All this together makes a total of 68 bytes. Argh, holy ****, 4 bytes too many to fit a 64-byte alignment. ARRRRGH. Uuuuhhff, I think I have to reduce the maximum number of texture levels to 3 or even fewer (huh, do you understand my bad English?).
Is the speed penalty between 32-byte alignment and 64-byte alignment very big?
Or, assuming I go for a maximum of 8 texture levels, is it acceptable to use a 128-byte alignment? That would leave room to put more information into the vertex.

General question: how are the vertices in your engines aligned, and to how many bytes? Can you tell me?

What is the most common way to store the vertices?

Change your vertex color to use bytes instead of floats for each component. RGBA, 1 byte each = 4 bytes total. Even if you aren’t using the alpha, it would probably be helpful to throw it in (to pad everything else to 4-byte alignment).
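Applied to the 68-byte layout listed earlier in the thread, the byte-color suggestion would look roughly like the C sketch below. The struct and field names are invented for illustration, and the sizes assume 4-byte floats with no compiler-inserted padding, which is what typical platforms give for these layouts.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TEX_LEVELS 4

/* Layout with float colors: 12 + 12 + 32 + 12 = 68 bytes. */
typedef struct {
    float pos[3];                  /* 12 bytes */
    float rgb[3];                  /* 12 bytes */
    float uv[MAX_TEX_LEVELS][2];   /* 32 bytes */
    float normal[3];               /* 12 bytes */
} VertexFloatColor;

/* Same data with RGBA stored as one byte per channel:
 * 12 + 4 + 32 + 12 = 60 bytes, which fits into 64. */
typedef struct {
    float pos[3];                  /* 12 bytes */
    unsigned char rgba[4];         /*  4 bytes, alpha doubles as padding */
    float uv[MAX_TEX_LEVELS][2];   /* 32 bytes */
    float normal[3];               /* 12 bytes */
} VertexByteColor;

_Static_assert(sizeof(VertexFloatColor) == 68, "expected 68 bytes");
_Static_assert(sizeof(VertexByteColor) == 60, "expected 60 bytes");
_Static_assert(offsetof(VertexByteColor, uv) % 4 == 0,
               "floats after the color stay 4-byte aligned");
```

The fourth color byte is what keeps the following floats on a 4-byte boundary, and the remaining 4 bytes up to 64 are free for padding or one more small attribute.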

I don’t know what the “typical” penalty is for going from 32 to 64 bytes. Most processors now have a 64-byte cache line. If you can fit a vertex into 32 bytes, each cache fetch will end up grabbing 2 vertices. Any time you draw a poly using 2 consecutive vertices from your vertex buffer, you will save yourself an extra fetch from memory, as the previous fetch (for the previous vertex) will have pre-loaded the vertex into the cache.

Likewise, I can’t say what the performance difference is between 64 and 128, but I am pretty sure it is much more severe, since EVERY vertex will now need to fetch 2 cache lines.

[This message has been edited by LordKronos (edited 08-29-2002).]

On byte alignment for interleaved arrays:

Why bother? It isn’t like these GPUs don’t have a cache. If you align the accesses to 32 bytes, you’re missing out on actually using information stored in the cache. Take a 24-byte vertex, for example. Five verts require 4 memory fetches (using 32-byte cache lines). Granted, if it’s hopping around in your array, it will, somewhat frequently, have to pull in 2 cache lines, but that’s fine: it’s going to need to pull that information in anyway (unless you’re wasting space).

Look at it this way. 500 verts at 24 bytes per vert pack into 12000 bytes total. If you pad them to 32 bytes, you’re looking at 16000 bytes. That’s almost 4K more in memory, which translates to 125 more fetches (and, therefore, 125 more stalls on memory fetches). Not only that, if the cache is less than 16K, you’ll get better cache performance with the packed verts than with padded ones.
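The fetch counts in this argument are easy to reproduce. The C sketch below assumes the array starts on a cache-line boundary and is streamed through exactly once; the function name is made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Number of cache lines needed to read `count` vertices of `stride`
 * bytes once, front to back, from a line-aligned array. */
static size_t lines_for_array(size_t count, size_t stride, size_t line_size) {
    assert(line_size > 0);
    size_t bytes = count * stride;
    return (bytes + line_size - 1) / line_size;   /* round up */
}
```

With 32-byte lines, `lines_for_array(500, 24, 32)` is 375 and `lines_for_array(500, 32, 32)` is 500, which is the 125-fetch difference quoted above; `lines_for_array(5, 24, 32)` gives the 4 fetches for five packed verts.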

Note that this is not the post-T&L cache I’m talking about. This is a pre-T&L vertex fetch cache.

That only applies if you access your vertex buffers linearly. Most cases benefit from random access using indices.

In the case of random access, if you have a 24-byte vertex and a 64-byte cache line, then 25% of all vertices (2 out of every 8) will require 2 cache line fetches. When you go to draw one of these boundary vertices, you lose performance if neither half is in cache, break even if only one half is in cache, and come out ahead if both halves happen to be in cache. I think in the general case, having both halves in cache is probably the least likely scenario. Overall, you will typically come out at a loss.

If you are on a system where the cache line size is only 32 bytes, then 50% of vertices will span cache lines, making your odds even worse.
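How many vertices actually span two cache lines can be counted directly. The following C sketch assumes a tightly packed array that starts on a cache-line boundary; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Count the vertices in a tightly packed, line-aligned array whose
 * bytes fall into more than one cache line. */
static size_t count_straddlers(size_t count, size_t stride, size_t line_size) {
    assert(stride > 0 && line_size > 0);
    size_t straddlers = 0;
    for (size_t i = 0; i < count; ++i) {
        size_t first_line = (i * stride) / line_size;               /* line of first byte */
        size_t last_line  = (i * stride + stride - 1) / line_size;  /* line of last byte */
        if (last_line != first_line) {
            ++straddlers;
        }
    }
    return straddlers;
}
```

By this count, 24-byte vertices on 64-byte lines straddle a boundary in 2 of every 8 cases, and on 32-byte lines in 2 of every 4, while 32-byte vertices on 64-byte lines never do.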

For hardware T&L systems, I can’t say exactly what the result would be, as I’m not sure what size the cache lines are for the AGP bus or (in the case of vidmem vertex buffers, such as VAR) what kind of memory cache techniques the cards use. However, I suspect you will see similar results.

You seem to have missed the gist of my argument, so allow me to reiterate it for you.

Basic assumptions:

  1. You are going to use all the vertices in the array at one time or another.
  2. The graphics card’s T&L pipe is cached.

Granted the first one, we understand that, for the case of 500 verts, you will need to transfer all of them at least once, correct? As such, you’d like to do that in as few fetches as possible.

Given 32-byte cache lines, you’re looking at 16000 bytes of data. That translates into at least 500 fetches for a 32-byte vert, and this assumes that the verts are never cast out of the cache.

Given a 24-byte vert, you have to transfer only 12000 bytes. That translates to 375 fetches. Taking the same assumption as before (the cache is large enough to hold all of the data so nothing is forced out), that leads to only 375 fetches.

Note that, because the assumption is that every vert will be touched (which is the typical assumption with vertex arrays), order is unimportant. Eventually, everything gets into the cache. Since 375 < 500, less time is wasted on memory fetches in my case.

Also, there is the benefit that the data is smaller. Smaller data == better cache performance. If we remove the assumption of a cache big enough to hold all of the data, the 24-byte version gets better cache performance. Given a cache of 300 elements, it is more likely that one of the 24-byte vertices will be there than in the 32-byte case.

Smaller data == faster bus transfer speed. I don’t think caching specifically comes into it, when we’re talking about AGP memory -> card transfers.

I recommend floats for vertex position, ubyte for colors, and shorts for everything else (well, floats for normals if you generate them dynamically). In the case given, 3*4 bytes + 4*2 bytes == 20 bytes total.
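A 20-byte layout along those lines might look like the C sketch below. The struct name, the `TEXCOORD_SCALE` constant, and the `quantize_texcoord` helper are all assumptions made for this example: since short texture coordinates are plain integers, the sketch scales them by a fixed factor and assumes the scale is undone elsewhere, for instance in the texture matrix.

```c
#include <assert.h>

/* Hypothetical fixed-point scale: 4096 steps per texture unit,
 * which leaves a representable range of roughly -8 to +8. */
#define TEXCOORD_SCALE 4096.0f

/* Float position plus two sets of ST coordinates stored as shorts:
 * 3*4 + 4*2 = 20 bytes. */
typedef struct {
    float pos[3];     /* 12 bytes */
    short st[2][2];   /*  8 bytes: two sets of (s, t) */
} VertexCompact;

_Static_assert(sizeof(VertexCompact) == 20, "expected a 20-byte vertex");

/* Convert a float texture coordinate to the scaled short form,
 * rounding to the nearest step. */
static short quantize_texcoord(float t) {
    float scaled = t * TEXCOORD_SCALE;
    assert(scaled >= -32768.0f && scaled <= 32767.0f);
    return (short)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}
```

Whether the lost precision is acceptable depends on texture size and how far the coordinates tile, so the scale factor would need to be chosen per application.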