2 questions about gfx-cards

First question: how well is z-buffering implemented on current hardware? For example: if I render a triangle which is behind a wall that has already been drawn, the triangle is completely invisible. Are gfx cards able to detect this pretty early (i.e. directly after transforming the vertices), or do they have to detect most of it on a per-pixel level?

Second question: if I have 4 texture units and one 3D texture which I want to use for lighting, is it a good idea to bind this texture to all four texture units and thus do four light sources at once?
AFAIK the four texture units operate at the same time, which means they would all access the texture at the same time. And I was told once that frequently accessing the same part of memory can slow down an app, since memory needs a small amount of time to “regenerate” before it can be read from/written to again. I was told it would be more effective to have several copies of the data, if it has to be accessed (read, in this case) so frequently. Is this true?
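To make it concrete, the setup I have in mind is something like this (just a sketch; baseTex and lightTex are hypothetical texture ids, assuming ARB_multitexture and GL 1.2 3D textures):

    GLuint baseTex, lightTex;   /* hypothetical ids, created elsewhere */
    GLint  i;

    /* unit 0: the 2D base map */
    glActiveTextureARB(GL_TEXTURE0_ARB);
    glEnable(GL_TEXTURE_2D);
    glBindTexture(GL_TEXTURE_2D, baseTex);

    /* units 1-3: the SAME 3D light texture bound three times,
       one light source per unit via different texture coordinates */
    for (i = 1; i <= 3; ++i) {
        glActiveTextureARB(GL_TEXTURE0_ARB + i);
        glEnable(GL_TEXTURE_3D);
        glBindTexture(GL_TEXTURE_3D, lightTex);
    }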

Jan.

Z-buffering can only be done per-pixel. Modern cards do the z-test before any other per-pixel operations, though, so it’s still very fast.

Reading the same area of memory frequently is faster than reading different areas; I have no idea what your source was on… You should only have one copy of the texture.

It’s almost certainly faster to use all four texture units at once if you can than to do four passes.

For the second question, having 4 copies instead of 1 will probably give you no change.

The texture units probably operate independently and I imagine each has its own cache.

It would be interesting to benchmark this one.

>>>For the second question, having 4 copies instead of 1 will probably give you no change.<<<

Actually, thanks to the texture cache, having 4 copies will likely be slower. If the memory is already in the cache, the next 3 texture fetches will be virtually instantaneous (or, at the very least, not a bottleneck).

>>>The texture units probably operate independently and I imagine each has its own cache.<<<

Oh, I seriously doubt that. It doesn’t make any real sense to do it that way. Indeed, it assumes that there are 16 physical texture units in a piece of hardware, rather than one (or two) that are used multiple times per shader.

In all likelihood, there is a large-ish texture cache that services all texture fetch operations. Notice how, in the GeForce 2-4 era, having 2 texture units was considered good. The Radeon 7500 tried to get away with 3 separate texture units. However, the GeForceFX and Radeon 9500+ lines all have, pretty much, one texture unit per pipe. The GeForceFX can sometimes try to create 2 by using the 8 pipes as 4 pipes with dual sampling, but it is pretty much the same as the 9500. Why? Because, in the world of programmable hardware where 1 opcode == 1 cycle, you can’t benefit from having more than one physical unit. When you get hardware that can issue multiple instructions per cycle, or some kind of internal pipelining of the pixel pipes (a la modern CPUs), then maybe you can get somewhere. Of course, if the texture opcodes are pipelined, one pipeline stage would be for accessing textures, so technically, you wouldn’t want multiple texture units in that case either.

Z-buffering is not just performed per pixel and first. Hidden geometry can avoid generating fragments altogether:

Features like coarse z-buffer or hyper z-buffer or super-duper z-buffer Mk VII are able to evaluate regions of the rasterized primitive before any fragments are generated. They can take the farthest z information written to an entire region of the framebuffer, compare it to the depth of the nearest fragments you are about to generate for that region, and test them all at once. In this way they can trivially reject large portions of a screen region prior to doing any per-fragment work.
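In pseudo-C the idea is roughly this (purely illustrative; as noted below the real schemes are trade secrets, so the tile size and layout here are made up):

    /* The chip keeps, per screen tile, the FARTHEST depth stored there. */
    #define TILES_X 80
    #define TILES_Y 60

    typedef struct { float farthest_z; } Tile;
    static Tile coarse_z[TILES_Y][TILES_X];

    /* Called once per tile a primitive covers, with the primitive's
       NEAREST possible depth in that tile (depth func GL_LESS assumed,
       i.e. smaller z = nearer). */
    static int tile_may_be_visible(const Tile *t, float prim_nearest_z)
    {
        if (prim_nearest_z >= t->farthest_z)
            return 0;   /* every fragment would fail: reject the whole
                           tile, no fragments are ever generated */
        return 1;       /* some fragment may pass: rasterize normally */
    }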

In terms of functionality, it looks identical to vanilla OpenGL, just with certain depth testing modes and with the right front to back sort, it automagically draws faster.

AFAIK the exact details of the implementations are trade secrets, so details are scarce. Each hardware generation, the big two seem to claim they do a better job of this in their architecture. Could be tweaking region size, hierarchical region size, something a bit cleverer than simply comparing farthest screen z against nearest fragment z, or just better pipelining. Perhaps some implementations merely compare fragment z with on-chip coarse z and avoid waiting on z fetch, whereas others compare regions; I don’t think there’s any way to tell.

One other thing about this: fragment-level z-buffering needs to be pipelined with any shading, so without coarse z, lots of occluded fragments would keep the z-test hardware busy while the rest of the hardware sat idle; fragment-level z test first is not a panacea. Coarse z reject saves most of this work, it’s pipelined, and it probably helps take the load off the FIFO(s) to the back end of the pipe at the same time.

[This message has been edited by dorbie (edited 06-23-2003).]

Ummm, Korval, since when do any GfFX chips have eight pipelines? It’s plain old 4x2, just like Geforce 4Ti, Geforce 3 and Geforce 2.

Jan2000,
it’s generally a good idea to render front to back (more or less; only do it if it’s easy to maintain this ordering, don’t go overboard and sort/split each and every triangle). Early Z has been with us for some years, and will probably stay. Implementation quality varies, but in concept this is still based on fragments (or rectangular fragment blocks).
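A minimal sketch of the coarse ordering I mean (per object, not per triangle; Object, viewDepth and drawObject are made-up names):

    #include <stdlib.h>

    typedef struct {
        float viewDepth;    /* eye-space distance of the object's center */
        /* ... mesh data ... */
    } Object;

    extern void drawObject(const Object *o);   /* hypothetical draw call */

    static int cmpNearToFar(const void *a, const void *b)
    {
        float da = ((const Object *)a)->viewDepth;
        float db = ((const Object *)b)->viewDepth;
        return (da > db) - (da < db);
    }

    /* once per frame: draw near objects first, so early z can reject
       the occluded fragments of the far ones */
    void drawSorted(Object *objs, size_t n)
    {
        size_t i;
        qsort(objs, n, sizeof(Object), cmpNearToFar);
        for (i = 0; i < n; ++i)
            drawObject(&objs[i]);
    }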

Originally posted by zeckensack:
Ummm, Korval, since when do any GfFX chips have eight pipelines? It’s plain old 4x2, just like Geforce 4Ti, Geforce 3 and Geforce 2.

If I’m correct, the FX5200 and FX5600 have 4 texture stages, the FX5800 and FX5900 have 8.

Korval, I can’t say for certain which case is true, but the reason I think a separate cache is better is that each unit is highly likely to access a different texture.

Having a portion of each texture ready in the cache for each unit would be beneficial (either way), but having separate smaller caches would guarantee no texture gets booted out.

This would make the hardware more expensive.

>>>The Radeon 7500 tried to get away with 3 separate texture units.<<<

I know it had 3, but what’s so special about it? Too bad for them everyone had their eyes on the Gf2s, and 2 units was expected.

>>>Because, in the world of programmable hardware where 1 opcode == 1 cycle, you can’t benifit from having more than one physical unit.<<<

1 opcode = 1 cycle is unlikely. Since it’s mostly floating point, it’s more like each operation takes >=30 cycles.

Well, at least Matrox was nice enough to provide a diagram for the Parhelia. http://www.matrox.com/mga/products/parhelia512/technology/36stage_shader.cfm

4 units sharing a TC?

It just doesn’t seem good.

Don’t Gf3/4 and GfFX have 4 texture units for the fixed pipe? As for the pipes, I think the GfFX uses 8 during certain calculations, then reverts to 4 for multitexturing or something like that. A hybrid, I think.

Edit: the 4 units total is for ARB, while Gf3/4 have 8 reg combiners. How many combiners do GfFX have? 8 also?

[This message has been edited by JD (edited 06-23-2003).]

>>>If I’m correct, the FX5200 and FX5600 have 4 texture stages, the FX5800 and FX5900 have 8.<<<

Nope. When you say texture stages, what do you mean? To me, texture stages means the number of textures it can access in 1 pass. All GF-FX cards can access 16 textures in 1 pass.

5800 and 5900 are a hybrid 4x2/8x1.

The 5200 and 5600 are even more cut down, and are only 4x1 IIRC.

>>>Don’t Gf3/4 and GfFX have 4 texture units for the fixed pipe? As for the pipes, I think the GfFX uses 8 during certain calculations, then reverts to 4 for multitexturing or something like that. A hybrid, I think.
Edit: the 4 units total is for ARB, while Gf3/4 have 8 reg combiners. How many combiners do GfFX have? 8 also?<<<

The Gf3 and Gf4 were 4x2. Hence if you used more than 2 textures in a single pass, your rendering speed was halved.

The GF-FX range also has 8 NV_register_combiners general combiners. But this is separate from the main ARB/NV_fragment_program stuff. IIRC, you can even use NV_register_combiners after your NV/ARB_fragment_program for extra power.

Nutty

>>>1 opcode = 1 cycle is unlikely. Since it’s mostly floating point, it’s more like each operation takes >=30 cycles.<<<

Um, no. We can do single-cycle floating-point math nowadays (as long as you don’t use denormalized floats). Also note that a single cycle for GPUs is about 4-5 times more time than a single cycle for modern CPUs. Even the FX 5900 is clocked slower than the fastest Pentium III.

>>>I know it had 3, but what’s so special about it? Too bad for them everyone had their eyes on the Gf2s, and 2 units was expected.<<<

It’s special because it stands as the card with the largest number of texture units. And, it’s special because it proves that more texture units != faster.

>>>Nope. When you say texture stages, what do you mean? To me, texture stages means the number of textures it can access in 1 pass. All GF-FX cards can access 16 textures in 1 pass.<<<

A texture stage refers to glTexEnv texture stages. GeForceFX hardware only has 4 of these stages. If you want access to more than 4 textures, you must use ARB/NV_fragment_program.
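For example, a fragment program addresses texture image units directly, so something like this (an untested sketch) samples 5 textures in one pass, which the 4 glTexEnv stages can’t express:

    /* base * (light1 + light2 + light3 + light4) in ARB_fragment_program */
    static const char *fp =
        "!!ARBfp1.0\n"
        "TEMP base, acc, l;\n"
        "TEX base, fragment.texcoord[0], texture[0], 2D;\n"
        "TEX acc,  fragment.texcoord[1], texture[1], 3D;\n"
        "TEX l,    fragment.texcoord[2], texture[2], 3D;\n"
        "ADD acc, acc, l;\n"
        "TEX l,    fragment.texcoord[3], texture[3], 3D;\n"
        "ADD acc, acc, l;\n"
        "TEX l,    fragment.texcoord[4], texture[4], 3D;\n"
        "ADD acc, acc, l;\n"
        "MUL result.color, base, acc;\n"
        "END\n";
    /* loaded with glProgramStringARB(GL_FRAGMENT_PROGRAM_ARB,
       GL_PROGRAM_FORMAT_ASCII_ARB, strlen(fp), fp) */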

[This message has been edited by Korval (edited 06-23-2003).]

I just tested it.

I have a Gf Ti 4200 with 4 TUs. The 3D texture was sized 8×8×8 pixels.
TU 1: base texture
TU 2-4: 3D light texture (one texture)

only 1 TU active: 180 FPS
1 + 2 TU active: 120 FPS
1 + 2+3 TU active: 80 FPS
1+2+3+4 TU active: 65 FPS

Then I created 3 3D textures and bound them to different TUs. The speed was exactly the same. No difference at all.

Then I used 3D textures sized 64×64×64 and 256×256×256. Guess what? It didn’t slow down a single FPS! No matter if I used one texture or 3 textures, it was always the same speed. I didn’t expect such good performance from hi-res 3D textures (or maybe I should have expected higher performance from low-res 3D textures?).

So now I draw this conclusion: using the same texture in more than one TU DOES slow down the app, but it DOESN’T slow it down more than using several copies. So accessing the same memory over and over again neither speeds things up nor slows them down.

So now the only problem left is how to achieve the effect that the 3 lights don’t cancel each other out. I seem to be forced to look into register combiners now.

Jan.

Originally posted by Nutty:
The Gf3 and Gf4 were 4x2. Hence if you used more than 2 textures in a single pass, your rendering speed was halved.

5800 and 5900 are a hybrid 4x2/8x1.

cough

If you want to call these things hybrids or whatever, then you could probably say 4x2 or 8x0. But even that would be wrong.

It’s really a lot simpler: you get 8 z/stencil values written out if you don’t write any color. That is the rule that applies, and nothing else.

Now that we have this ‘hybrid’ thing covered, the NV30 is 4x2.
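For what it’s worth, the way you’d actually hit that path is a depth-only first pass, i.e. color writes masked off (plain GL; drawScene is a placeholder):

    void drawFrame(void)
    {
        /* pass 1: lay down depth only -- no color is written, which is
           exactly the case where 8 z/stencil values go out per clock */
        glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
        glDepthMask(GL_TRUE);
        glDepthFunc(GL_LESS);
        drawScene();

        /* pass 2: shade against the finished depth buffer; occluded
           fragments now fail the z test before any texturing is done */
        glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
        glDepthMask(GL_FALSE);
        glDepthFunc(GL_EQUAL);
        drawScene();
    }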

>>>Um, no. We can do single-cycle floating-point math nowadays (as long as you don’t use denormalized floats). Also note that a single cycle for GPUs is about 4-5 times more time than a single cycle for modern CPUs. Even the FX 5900 is clocked slower than the fastest Pentium III.<<<

Some FPU operations can take 1 cycle, but add/sub/mult/div/sin/cos/tan and a few others will take longer.
Here’s an article from Ars Technica that describes the G4 and P4. It’s a nice read:
http://www.arstechnica.com/cpu/01q4/p4andg4e2/p4andg4e2-3.html

I heard back in the Geforce256 days that it had an FPU ~5 times more efficient than the Pentium (2 or 3?).

It’s obvious that GPUs like the Geforce are good at what they do, but it’s not obvious how many clock cycles they spend on single operations.
Plus, it’s obvious that they can’t clock them higher. If they could, the Geforce FX would be running at 3GHz and Radeons would not be selling. You can be sure of that one!

>>>So now the only problem left is how to achieve the effect that the 3 lights don’t cancel each other out. I seem to be forced to look into register combiners now.<<<

You’re going to have to describe the problem better than that if you want input. Why register combiners?

[This message has been edited by V-man (edited 06-23-2003).]

I only just started with dynamic lighting. So the first thing I tried was a single light. Then, to test the performance, I used 2 and 3 lights. At first I thought it didn’t work at all, since I couldn’t see anything anymore.
Then I figured out the problem. All I do is modulate the base texture by 1/2/3 lightmaps. Now if a poly is close enough to the light, the base texture is modulated with white, which means there is no change. If it is far away, it is modulated with black, so it gets dark. But if I have 2 or more lights, this means that the first light blackens everything except for one small area. The second light is certainly some meters away, so it blackens the remaining bright area and WOULD lighten another area, but this doesn’t happen, because that area is already black.
I took a look into register combiners yesterday, and I think it would be possible with them. I would just do AB + CD, where A and C are the basemap, B is one lightmap and D is the other lightmap. And in the final combiner I might even be able to add a third light.
But I have no experience with this at all, so I still might run into some problems :wink:
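From reading the spec, I imagine the setup for one general combiner doing AB + CD looks roughly like this (untested sketch; texture units 0/1/2 carry base/light1/light2):

    glEnable(GL_REGISTER_COMBINERS_NV);
    glCombinerParameteriNV(GL_NUM_GENERAL_COMBINERS_NV, 1);

    /* A = base (unit 0), B = lightmap 1 (unit 1) */
    glCombinerInputNV(GL_COMBINER0_NV, GL_RGB, GL_VARIABLE_A_NV,
                      GL_TEXTURE0_ARB, GL_UNSIGNED_IDENTITY_NV, GL_RGB);
    glCombinerInputNV(GL_COMBINER0_NV, GL_RGB, GL_VARIABLE_B_NV,
                      GL_TEXTURE1_ARB, GL_UNSIGNED_IDENTITY_NV, GL_RGB);
    /* C = base again, D = lightmap 2 (unit 2) */
    glCombinerInputNV(GL_COMBINER0_NV, GL_RGB, GL_VARIABLE_C_NV,
                      GL_TEXTURE0_ARB, GL_UNSIGNED_IDENTITY_NV, GL_RGB);
    glCombinerInputNV(GL_COMBINER0_NV, GL_RGB, GL_VARIABLE_D_NV,
                      GL_TEXTURE2_ARB, GL_UNSIGNED_IDENTITY_NV, GL_RGB);

    /* AB + CD -> spare0 */
    glCombinerOutputNV(GL_COMBINER0_NV, GL_RGB,
                       GL_DISCARD_NV, GL_DISCARD_NV, GL_SPARE0_NV,
                       GL_NONE, GL_NONE, GL_FALSE, GL_FALSE, GL_FALSE);

    /* final combiner computes A*B + (1-A)*C + D; with B = 1 (inverted
       zero) and C = D = 0 it just passes spare0 through */
    glFinalCombinerInputNV(GL_VARIABLE_A_NV, GL_SPARE0_NV,
                           GL_UNSIGNED_IDENTITY_NV, GL_RGB);
    glFinalCombinerInputNV(GL_VARIABLE_B_NV, GL_ZERO,
                           GL_UNSIGNED_INVERT_NV, GL_RGB);
    glFinalCombinerInputNV(GL_VARIABLE_C_NV, GL_ZERO,
                           GL_UNSIGNED_IDENTITY_NV, GL_RGB);
    glFinalCombinerInputNV(GL_VARIABLE_D_NV, GL_ZERO,
                           GL_UNSIGNED_IDENTITY_NV, GL_RGB);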

Jan.

>>>Here’s an article from Ars Technica that describes the G4 and P4. It’s a nice read<<<

However, it is ultimately meaningless, as it is referring to modern CPUs. The architecture of these chips is very different from the architecture of a modern GPU. Indeed, the CPU has to be able to do more work, as it must be able to handle denormalized floats. Also, P4s run at 80-bit precision for regular fp calculations.

Also, note that if a floating-point op took ~30 cycles (or even ~10, or >3), you would never be able to use a fragment shader on any Radeon 9500+ or GeForceFX. Think about it; no one can afford to drop 2/3 of their fillrate, let alone 29/30 of it. Each fragment shader opcode must take on the order of one cycle; otherwise shaders that ran fairly fast on something like an 8500 would inexplicably run unusably slowly on a 9500. This is hardly the kind of thing that ATi or nVidia would embrace.

>>>Plus, it’s obvious that they can’t clock them higher.<<<

Actually, it’s more likely that it’s too cost-prohibitive. And, as AMD has taught Intel time and again, higher clock speed != faster.

[This message has been edited by Korval (edited 06-24-2003).]

Although this thread seems to have changed topic, I’d like to inform you that I got my lighting working with register combiners. I am so proud of myself!
Maybe one of you could tell me whether ATI cards support this extension too, or if I am limited to NVIDIA cards now.
I wouldn’t be surprised if ATI doesn’t support it and wants developers to use their fragment shaders instead.

Jan.

There are the ARB combiner, crossbar and DOT3 extensions on both NVIDIA and ATI.

Whether these are sufficient depends on what you’re doing, but they’re probably enough for you to implement a cross platform version based purely on ARB extensions:
http://oss.sgi.com/projects/ogl-sample/registry/ARB/texture_env_combine.txt
http://oss.sgi.com/projects/ogl-sample/registry/ARB/texture_env_crossbar.txt
http://oss.sgi.com/projects/ogl-sample/registry/ARB/texture_env_dot3.txt
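For your case, plain ARB_texture_env_combine alone can produce (light1 + light2) * base by cascading the units; a sketch (no crossbar needed if you order the units lightmap, lightmap, base):

    /* unit 0: lightmap 1 (REPLACE just passes the texture through) */
    glActiveTextureARB(GL_TEXTURE0_ARB);
    glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_COMBINE_ARB);
    glTexEnvi(GL_TEXTURE_ENV, GL_COMBINE_RGB_ARB, GL_REPLACE);
    glTexEnvi(GL_TEXTURE_ENV, GL_SOURCE0_RGB_ARB, GL_TEXTURE);

    /* unit 1: previous + lightmap 2 (clamped to 1, fine for lights) */
    glActiveTextureARB(GL_TEXTURE1_ARB);
    glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_COMBINE_ARB);
    glTexEnvi(GL_TEXTURE_ENV, GL_COMBINE_RGB_ARB, GL_ADD);
    glTexEnvi(GL_TEXTURE_ENV, GL_SOURCE0_RGB_ARB, GL_PREVIOUS_ARB);
    glTexEnvi(GL_TEXTURE_ENV, GL_SOURCE1_RGB_ARB, GL_TEXTURE);

    /* unit 2: (light1 + light2) * base */
    glActiveTextureARB(GL_TEXTURE2_ARB);
    glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_COMBINE_ARB);
    glTexEnvi(GL_TEXTURE_ENV, GL_COMBINE_RGB_ARB, GL_MODULATE);
    glTexEnvi(GL_TEXTURE_ENV, GL_SOURCE0_RGB_ARB, GL_PREVIOUS_ARB);
    glTexEnvi(GL_TEXTURE_ENV, GL_SOURCE1_RGB_ARB, GL_TEXTURE);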

>>>However, it is ultimately meaningless, as it is referring to modern CPUs. The architecture of these chips is very different from the architecture of a modern GPU. Indeed, the CPU has to be able to do more work, as it must be able to handle denormalized floats. Also, P4s run at 80-bit precision for regular fp calculations.<<<

The architecture is different, but the basic concepts are still there. Anyway, you mentioned modern CPUs, and I gave a link. Give me a link that says someone has an FPU (in a GPU or CPU) doing basic arithmetic plus trig functions in a single cycle at GHz speeds.

The thing about video cards is that, unlike the motherboard/CPU/RAM combination, they are simpler. The memory controller and all of the cache have been in the GPU for years. The memory performance is quite good and the GPU is fed well. Not to mention SIMD instructions. Who knows how many floats a GPU can crunch in parallel.

And since you mention float precision…

Did you know that on NV30 fp, you can specify what precision you want the GPU to use?

From NV_fragment_program.txt

RESOLVED: Applications can optionally specify the precision of
individual instructions by adding a suffix of “R”, “H”, and “X” to
instruction names to select fp32, fp16, and fx12 precision,
respectively.
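So you could write the same multiply at all three precisions; a quick sketch (as a C string, untested, purely to show the suffixes):

    static const char *fp =
        "!!FP1.0\n"
        "TEX H0, f[TEX0], TEX0, 2D;\n"
        "TEX H1, f[TEX1], TEX1, 3D;\n"
        "MULR R0, H0, H1;  # fp32\n"
        "MULH H2, H0, H1;  # fp16\n"
        "MULX H3, H0, H1;  # fx12\n"
        "MOVH o[COLH], H2;\n"
        "END\n";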