
View Full Version : DOOM3 texture unit use.



dorbie
08-05-2004, 05:47 PM
Interesting to note Carmack's comments on the final use of various texture units in the DOOM3 shaders (for the ARBfp path):


!!ARBfp1.0
OPTION ARB_precision_hint_fastest;

# texture 0 is the cube map
# texture 1 is the per-surface bump map
# texture 2 is the light falloff texture
# texture 3 is the light projection texture
# texture 4 is the per-surface diffuse map
# texture 5 is the per-surface specular map
# texture 6 is the specular lookup table

I won't post the actual business end of the code; it's there if you look.

P.S. As a complete aside, is this a possible disadvantage of ASCII-based API interface calls? This could be dumped for glslang C-like source code too; no disassembly or reverse compilation required. I suppose it's not exactly the crown jewels, but it is interesting to the curious.

Korval
08-05-2004, 06:55 PM
It's strange that he picked texture unit 4 for the diffuse map. Most hardware that "cheats" tends to assume that TMU0 is the primary texture.

Maybe this is Carmack's way of getting hardware developers to stop doing stuff like that.

dorbie
08-05-2004, 07:48 PM
Depends how you define primary. AFAIK they're all active when this shader is used, and the images for the various per-surface terms are the same resolution. Remember that this shader gets split into two for other multipass implementations; that may have set the order early at the application level, and he merely ran with it for simplicity. Illumination & attenuation in the first 4 units and material properties in the last 3 would be the right split for a 4-texture-unit multipass. (EDIT: dang, the bumpmap screws up that theory with specular.)

OTOH, given the dependency of other instructions on fetches from cubemaps & bump maps, and the cache-friendly nature of something like a cubemap fetch per fragment (the cubemap is only used to normalize the interpolated light direction vector), it may even make sense to keep them coherent (if indeed there's an implementation preference).

Humus
08-05-2004, 08:01 PM
Hmm, a lot of those textures seem to be purely lookup tables. It would make sense on R200-level hardware to use textures for that due to instruction set limits, but I wouldn't be surprised if things would run faster and look better on R300 hardware if the normalizing cubemap and the specular lookup table were replaced with math. Maybe even the light falloff texture too.
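
Something along these lines is what I mean; a minimal sketch of the two routes for normalizing the interpolated light vector (not the actual DOOM3 code, register names are made up):

!!ARBfp1.0
# sketch only: cubemap normalization vs. math normalization
TEMP lightDir, len;
PARAM scaleBias = { 2.0, -1.0, 0.0, 0.0 };

# cubemap route: unit 0 holds a normalization cubemap storing N * 0.5 + 0.5
TEX lightDir, fragment.texcoord[0], texture[0], CUBE;
MAD lightDir.xyz, lightDir, scaleBias.x, scaleBias.y;

# math route: comment out the two instructions above and use these instead
# DP3 len.x, fragment.texcoord[0], fragment.texcoord[0];
# RSQ len.x, len.x;
# MUL lightDir.xyz, fragment.texcoord[0], len.x;

MOV result.color, lightDir;
END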

dorbie
08-05-2004, 08:11 PM
There are two versions of the shaders: one uses math for some stuff, the other does dependent reads but still has the texture unit comments.

Edit: this is a bit bogus; despite the comments, the code does most of the same fetches (it's clearly been edited a few times, with some code commented out: in one case a math normalization is commented out, in another it is implemented, which makes the shaders similar w.r.t. fetches if not dependent reads). It was captured on NVIDIA 6800 hardware, though; I don't know what happens on other hardware.

sqrt[-1]
08-05-2004, 09:05 PM
FYI: You don't have to use an interception tool to get the shaders; they are in the data files.

Simply rename pak000.pk4 to pak000.zip and open it up with a zip program. The shaders for all code paths are in the directory "glprogs".

plasmonster
08-05-2004, 09:10 PM
For speed, nVidia suggests normalizing wildly changing vectors with math, due to texture cache misses, while using cubemaps for all smoothly varying vectors. I wonder if this is different with ATi hardware.

Anyway, I'm surprised there's no explicit stage for an environment reflection cube, unless that doubles in the specular unit (simple 2D fudge). Don't tell me there's no reflection going on. They need one more unit to make an even 8 anyway!
:-)

Id is most neighborly to share.

dorbie
08-05-2004, 09:17 PM
Thanks for the pointer sqrt.

Q, there are other shaders for environment mapping. You cannot correctly include an environmental reflection term in a shader that is applied when stencil shadow testing is on. Sure, you could ignore the texture-based illumination terms, but that stencil test is going to win in the end. :-)

plasmonster
08-05-2004, 09:21 PM
Whew, that's good news. I'd like to get a look at this stuff myself.

Korval
08-05-2004, 11:11 PM
Depends how you define primary.

The diffuse texture; the one that, regardless of shading system, is virtually required in order to have decent graphics.


but I wouldn't be surprised if things would run faster and look better on R300 hardware if the normalizing cubemap and the specular lookup table were replaced with math.

I would, to some degree.

Remember, you get 1 vector op and 1 texture op per cycle. If you can keep both busy, then you're better off than being math-heavy. It's almost impossible to use more texture accesses than math ops, so it's best, performance wise, to move as much as possible into textures. Cache misses aside, of course.

skynet
08-06-2004, 06:06 AM
Aside from that, I noticed that JC is using program parameters for anything and everything. As far as I knew up to today, updating those parameters is fairly expensive, so I usually (ab)used gl-state variables to upload my data (especially matrices). I did not benchmark it against using program parameters.

So my question is, is using program parameters now a recommended practice? Do I gain anything by using glstate to upload data into the vertex and fragment programs?

Xmas
08-06-2004, 07:04 AM
Originally posted by Korval:
Remember, you get 1 vector op and 1 texture op per cycle. If you can keep both busy, then you're better off than being math-heavy. It's almost impossible to use more texture accesses than math ops, so it's best, performance wise, to move as much as possible into textures. Cache misses aside, of course.

Don't forget filtering. You only get one bilinear filtered sample per cycle.

Korval
08-06-2004, 10:48 AM
So my question is, is using program parameters now a recommended practice? Do I gain anything by using glstate to upload data into the vertex and fragment programs?

I don't think that native glstate was ever any faster than program parameters.


Don't forget filtering. You only get one bilinear filtered sample per cycle.

Good point. But, then again, you're usually not trilinear/anisotropically filtering lookup tables.

Sunray
08-06-2004, 12:38 PM
I noticed that there is no self-shadowing in Doom 3? Why? Annoying popping in cut scenes?

Humus
08-06-2004, 02:33 PM
Originally posted by Korval:
Remember, you get 1 vector op and 1 texture op per cycle. If you can keep both busy, then you're better off than being math-heavy. It's almost impossible to use more texture accesses than math ops, so it's best, performance wise, to move as much as possible into textures. Cache misses aside, of course.

Well, with that many textures active at the same time, and what I assume are fairly short shaders (guessing, haven't bought the game yet, will do that tomorrow), I wouldn't be surprised if the cost of accessing textures is the largest performance-defining factor.

Humus
08-06-2004, 02:41 PM
Originally posted by skynet:
Aside from that, I noticed that JC is using program parameters for anything and everything. As far as I knew up to today, updating those parameters is fairly expensive, so I usually (ab)used gl-state variables to upload my data (especially matrices). I did not benchmark it against using program parameters.

So my question is, is using program parameters now a recommended practice? Do I gain anything by using glstate to upload data into the vertex and fragment programs?

Constant updates can be expensive if you do them excessively, but there's seldom any reason to be paranoid about it. The cost is usually comparatively small and hidden by the cost of actually executing the shader, unless you have really small batches and draw few pixels. In any case, it's certainly not recommended to abuse gl-state. It won't speed anything up; you're just specifying the constant another way. The GL state is there for convenience, not for performance. Abusing it only leads to unreadable code.
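
To be concrete, at the program level both end up as a PARAM binding anyway; here's an illustration of the two binding styles (just a sketch, not anything from the DOOM3 source):

!!ARBvp1.0
# illustration only: two ways to get a modelview-projection matrix into a program
PARAM mvpState[4] = { state.matrix.mvp };  # tracked GL state, no per-draw API call needed
PARAM mvpEnv[4] = { program.env[0..3] };   # explicit constants, set with glProgramEnvParameter4fvARB
TEMP pos;
# transform with the state-tracked matrix; swap in mvpEnv to use the uploaded constants
DP4 pos.x, mvpState[0], vertex.position;
DP4 pos.y, mvpState[1], vertex.position;
DP4 pos.z, mvpState[2], vertex.position;
DP4 pos.w, mvpState[3], vertex.position;
MOV result.position, pos;
MOV result.color, vertex.color;
END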

knackered
08-07-2004, 08:24 AM
Doom3...great to see per-pixel lighting in something more than a spinning cube demo at last.
Although I had the same feeling when I first played Chronicles of Riddick on the xbox months ago.
Dorbie, I'm a bit puzzled as to why you think the texture stage number is important in any way in these shaders...

dorbie
08-07-2004, 06:02 PM
Sunray, there is self-shadowing in Doom3. The player doesn't seem to cast a shadow, but that's not a technical issue; objects and monsters self-shadow in the technical sense, and Carmack's reverse makes player shadow clipping a non-issue. I'll bet you dollars for donuts you can turn player shadow casting on via the console. This is definitely not a technical limitation; it would cost a bit extra to draw, but it is either an artistic or a performance compromise (probably artistic). It's the sort of crap that gets debated inside game companies (heck, I worked for idiots who ripped working stencil shadows out of a game at the last minute to make XBOX look like PS2, so don't be surprised by the outcome of 'artistic decisions').

Knackered, the texture usage is interesting and was the subject of debate a while back, although it was pretty much predicted; the actual numbers were just in the comments from the shader code, I wasn't enumerating. Yup, Riddick looked good; AFAIK it pretty much got the lighting & shadowing right with interesting shaders (a bit shiny though).

SirKnight
08-07-2004, 06:43 PM
I'll bet you dollars for donuts...
Better be Krispy Kreme. :D

Humus
08-07-2004, 07:46 PM
It seems I was right about getting a speed boost by changing the shader to use math instead of lookup tables, but I didn't expect this much:
http://www.beyond3d.com/forum/viewtopic.php?t=14874

In game it's at best up to 40% faster with max AA/AF, just by replacing the dependent texture read with POW, and in the timedemo about 18% faster. :eek:
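
For those wondering what the tweak amounts to, it's essentially swapping the dependent read for a POW; roughly along these lines (register names and the exponent are made up here, this is a sketch rather than the actual interaction.vfp code):

!!ARBfp1.0
# sketch of the substitution: specular table lookup vs. direct POW
TEMP specDot, specular;
PARAM specExp = { 16.0, 0.0, 0.0, 0.0 };  # assumed exponent, not taken from the game data

MOV specDot.x, fragment.texcoord[2].x;    # stand-in for the N.H term the real shader computes

# original style: dependent read into the specular lookup table on unit 6
# TEX specular, specDot, texture[6], 2D;

# replacement: evaluate the power function directly
POW specular.x, specDot.x, specExp.x;

MOV result.color, specular.x;
END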

plasmonster
08-07-2004, 09:52 PM
The speed increase is very interesting, but I wonder how much artistry is bypassed without the lookups; you can do some pretty neat stuff with some fancy ramps. They may have used the lookups for more than speed; indeed, they may have been willing to take one on the chin for the artistic freedom it allows.

Nice boost though. I'd be interested in an explanation for the discrepancy. I get the strange feeling that this may be another case of IHV ankle-biting.
:-)

By the way, when you say a 40% increase, do you mean that brings it on par with the nVidia path?

All things being equal, I'd rather have the artistic freedom of textures, as opposed to relatively inflexible math (with cost in mind here). You can stuff some massively complex math into a texture. Of course, it all depends on what you're doing; normalizations don't need to be fancy.

V-man
08-08-2004, 01:59 AM
Are you guys sure there aren't multiple versions of the same shaders in there?

It was my understanding that NV recommends using lookup textures and normalization cubemaps instead of math, while ATI GPUs prefer the other way around.

Oh well, I'm sure there is a 100MB service pack coming soon :)

SirKnight
08-08-2004, 10:16 AM
Good job discovering that Humus.

I tried the mod and, since I have an FX 5600 card, what I got was expected. It seemed like, as I walked around, the fps was a little lower. This is kind of a no-duh thing, as everyone knows it's better to use texture lookups rather than math on FXs. I also got those little white dot artifacts in certain areas, which is probably due to the compression of spec maps in medium quality mode. I'm not 100% sure why using pow instead shows the compression artifacts like it does; maybe due to the higher precision of the pow instruction? That's my guess anyway.

When I get my 6800 GT I'll try it again and see what happens. I don't expect much of an increase in speed, but you never know; the NV40 fragment processor is pretty good.

SirKnight
08-08-2004, 10:25 AM
Hmm, I thought something else was looking weird. Just as I expected, each surface has its own specular exp map. No wonder certain surfaces' specular part looked odd. Everything after the mod is using the same specular exp, which is not right. :) But as was said on that other board, this can be fixed by JC passing the exp into the shader in a variable.
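
i.e. something like this in the shader, with the app setting the local parameter per surface (hypothetical names, just a fragment to show the idea):

PARAM specExp = program.local[4];        # per-surface exponent the app would set
POW specular.x, specDot.x, specExp.x;    # instead of a hard-wired exponent or a shared table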

plasmonster
08-08-2004, 11:45 AM
It seems strangely absurd to me, the notion that after 5 years of development, they never thought to pick up a phone and ask ATi how their hardware works. I would assume that Id has a red-phone for every IHV in existence. I'm sure they could have the entire ATi driver team over for lunch, with charts, slides, and belly dancers.

Id is in a unique position to shape the industry. I suspect that is what they are doing; they've done it before. I'd be first in line to give them the benefit of the doubt in this case.

Anyway, to my mind, it's clearly the case that (dependent) texture reads scale far better than brute force math. This may not be exactly critical in Doom3 (haven't played with it yet), but it will be. Who was it that said that programming was the art of caching? (always liked that) :-)

dorbie
08-08-2004, 12:09 PM
Q, math is the way of the future. Texture reads to implement math functions are powerful but clearly an ugly hack; ultimately the compiler could implement the LUT if it's really faster, heck, that's how some hardware works anyway. You need a really warped sense of aesthetics to think that shader writers should be loading texture units with ramps and then doing reads & dependent reads instead of calling a simple pow function.

Dunno if id & ATI are on good terms after ATI leaking the E3 DOOM3 demo. Let's remember that the math has artifacts; it may need the LUT for the reasons mentioned.

SirKnight, you sure about the specular exp map? There's a gloss map, but an exponent map? That would be cool, but I dunno that it's there. The fetch is a 2D fetch, but *from recollection* at least one of the shaders calls the other axis in the LUT the divergence factor (or similar), and I don't recall a dependent read in there for that coord (*from memory*, so I could be way off); any exponent map would at least have to be a dependent *texture* read to the texcoord of the fetch. Very cool if he did this; it could be pretty inexpensive with, for example, a 2-component gloss map, but I just don't think that's what is going on. I'll need to take another look.

plasmonster
08-08-2004, 12:25 PM
Texture reads to implement math functions are powerful but clearly an ugly hack; ultimately the compiler could implement the LUT if it's really faster, heck, that's how some hardware works anyway.

It's beautiful to me. Are you going to calculate a BRDF in a shader?


You need a really warped sense of aesthetics to think that shader writers should be loading texture units with ramps and then doing reads & dependent reads instead of calling a simple pow function.

Warped? I call it art. Suppose you want a specular lookup in the form of a rose? Textures give you great artistic freedom. I agree that in the distant future math will prevail, as it does on the CPU today (though lookups are still quite common). And as I stated before, it depends on what you're doing.

Korval
08-08-2004, 03:10 PM
It's beautiful to me. Are you going to calculate a BRDF in a shader?
Sure, if the performance was there.


Warped? I call it art. Suppose you want a specular lookup in the form of a rose? Textures give you great artistic freedom.

That's reasonable for non-photorealistic rendering, but realistic rendering is based on actual mathematical functions. A look-up table in realistic rendering is just a performance optimization (when it actually helps performance) compared to actual math.

plasmonster
08-08-2004, 04:36 PM
Sure, if the performance was there.

That's all I meant to suggest. LUTs give us what they've always given us: the ability to do today what we would otherwise have to wait for tomorrow to experience. And to me, that's a beautiful thing.


A look-up table in realistic rendering is just a performance optimization (when it actually helps performance) compared to actual math.

Absolutely. After all, it was the actual math that created the table entries in the first place.

I'm just a freak when it comes to LUTs. I'm a LUT freak.

dorbie
08-09-2004, 08:11 AM
Q, there's a difference between a LUT where it is required and a LUT for everything. It is laborious to implement a pow function as a LUT when it should be a single operator or call in the program. Specious examples don't make the case for texture LUTs as the optimal path; LUTs do have their legitimate uses. If a LUT is optimal for a basic supported math operator, then ultimately the shader compiler should use a LUT under the covers where resources permit; this has the added benefit of hardware abstraction and running well everywhere. When you toss out libmath and hand-implement cache-resident tables everywhere in your C program code, then I'll believe you're really a LUT nut. That's ultimately the case you're making.

sk
08-09-2004, 09:18 AM
Originally posted by dorbie:
SirKnight, you sure about the specular exp map? There's a gloss map, but an exponent map? That would be cool, but I dunno that it's there. The fetch is a 2D fetch, but *from recollection* at least one of the shaders calls the other axis in the LUT the divergence factor (or similar), and I don't recall a dependent read in there for that coord (*from memory*, so I could be way off); any exponent map would at least have to be a dependent *texture* read to the texcoord of the fetch. Very cool if he did this; it could be pretty inexpensive with, for example, a 2-component gloss map, but I just don't think that's what is going on. I'll need to take another look.

Right, there's no specular exponent map, which would be difficult to make use of in older hardware paths.

Having looked at the nv20 path, the specular power is approximated via register combiner math (it seems to be roughly power 12, but shifted a bit, so it saturates earlier). The specular table in the arb2 path may be attempting to match this quasi-power function.
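
A fixed power is also cheap to build out of a few multiplies, which is roughly the flavour of what the combiners do; for example (a sketch with a round exponent of 16, not the actual combiner setup):

# successive squaring, assuming R0.x already holds the clamped N.H term
MUL R0.x, R0.x, R0.x;   # x^2
MUL R0.x, R0.x, R0.x;   # x^4
MUL R0.x, R0.x, R0.x;   # x^8
MUL R0.x, R0.x, R0.x;   # x^16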

Re the divergence factor, that's in the test.vfp shader, which I believe isn't used by any of the rendering paths (it would be arb2 or possibly exp if any). The factor in question comes from the normal map's alpha component; my guess is that this shader is an experiment to anti-alias specular highlights (which JC also talks about in an old .plan).

plasmonster
08-09-2004, 10:21 AM
Dorbie, I think we're on the same page. I'm not saying anything that hasn't been said 10 million times already. Perhaps I said it badly. Sorry for the OT digression.

dorbie
08-09-2004, 01:41 PM
Q, it's just a fun discussion, np.

sk, ahh, the divergence makes sense then. You measure the rate of local vector change in the normal map, store it in alpha, then adjust a texture LUT (I assume that adjustment would be a convolution filter for high exponents, so an exponent LUT would look crisp & bright on one end and blurred, diffuse & darker on the other). I see some funny related things: in one shader there is an attempt to move localnormal.a to localnormal.x (which seems like a swizzle & nothing else), but in another shader w is extracted and *almost* used :-)
How's this for some legacy code:

MOV R1.y, localNormal.w;
MOV R1.y, 0.2;
MOV R1.w, 1;
TEX R1, R1, texture[6], 2D;

Must've slipped through the cracks. I suppose that 0.2 is a fixed constant convolution (& possible clamp) reducing sparkle on high exponents & might explain the quality diff with a straight math exponent vs LUT.

Years ago I considered a similar alpha term in a different context, as a MIP LOD bias on bump-mapped environment maps. The idea was to use the normal divergence as a MIP LOD bias term for the LOD selection, to reduce aliasing of the environment map in situations where you can't predict the post-dependent-read derivatives of s & t in hardware (maybe 5+ years ago now). Ideally today you'd want to do something smarter considering the potential for anisotropic probes.

CatAtWork
08-09-2004, 04:18 PM
IIRC, the alpha and red components of localNormal are swapped because the alpha channel provides more bits when compressed using DXT.
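
In the arb2 interaction shader that comes out as something like this right after the bump map fetch (paraphrasing from memory, so treat it as a sketch rather than the shipping code):

TEX localNormal, fragment.texcoord[1], texture[1], 2D;
MOV localNormal.x, localNormal.w;                    # X was parked in alpha for better DXT precision
MAD localNormal.xyz, localNormal, scaleTwo, subOne;  # scaleTwo/subOne are the usual {2}/{-1} PARAMs, expand to [-1,1]
DP3 R1.x, localNormal, localNormal;                  # then renormalize
RSQ R1.x, R1.x;
MUL localNormal.xyz, localNormal, R1.x;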

sk
08-09-2004, 05:41 PM
Originally posted by dorbie:
sk, Ahh, the divergence makes sense then. You measure the rate of local vector change in the normal map, store it in alpha then adjust a texture LUT (I assume that adjustment would be a convolution filter for high exponents so an exponent LUT would look crisp & bright on one end and blurred diffuse & darker on the other).
Right, that's the general idea.

NVIDIA have a paper on a solution which uses the filtered normal length as a direct measure of variation:
http://developer.nvidia.com/object/mipmapping_normal_maps.html

Also "Algorithms for the Detection and Elimination of Specular Aliasing":
http://www.cs.yorku.ca/~amana/research/

This talks about clamping the exponent as well.

And as I mentioned before, see JC's last plan (near the end, which is quite some way) for his early take on this issue:
http://www.webdog.org/plans/1/


Originally posted by dorbie:
I see some funny related things: in one shader there is an attempt to move localnormal.a to localnormal.x (which seems like a swizzle & nothing else), but in another shader w is extracted and *almost* used :-)
How's this for some legacy code:

MOV R1.y, localNormal.w;
MOV R1.y, 0.2;
MOV R1.w, 1;
TEX R1, R1, texture[6], 2D;

Must've slipped through the cracks. I suppose that 0.2 is a fixed constant convolution (& possible clamp) reducing sparkle on high exponents & might explain the quality diff with a straight math exponent vs LUT.

As CatAtWork said, that a to x move is there for compressed normal maps, which isn't worth eliminating (at a complexity and maintainability cost) for the uncompressed case.

If that code is from test.vfp again -- I don't recall seeing it anywhere else -- I wouldn't classify it as a legacy code fragment so much as just debug assembler in an experimental shader, which seems to have slipped through as a whole.

sk
08-09-2004, 05:56 PM
Originally posted by dorbie:
Must've slipped through the cracks. I suppose that 0.2 is a fixed constant convolution (& possible clamp) reducing sparkle on high exponents & might explain the quality diff with a straight math exponent vs LUT.

Oh I think I see what you're saying now: the production version (interaction.vfp) could be using a LUT which folds in a fixed 'convolution' based on the average divergence? That's an interesting idea but I think you really want the convolution to vary per-pixel. I'm still inclined to believe that the 0.2 is just there from idle testing.

sk
08-10-2004, 02:57 AM
Originally posted by sk:
Having looked at the nv20 path, the specular power is approximated via register combiner math (it seems to be roughly power 12, but shifted a bit, so it saturates earlier). The specular table in the arb2 path may be attempting to match this quasi-power function.

JC confirms in an interview that the lookup is there to match the specular power approximation of older paths:
http://www.beyond3d.com/interviews/carmack04/index.php?p=2

Edit: Also my description of the power function above seems to be a bit off as I misread the RCs (not the easiest API to follow!). Other people seem to be doing a good job of matching the LUT with a couple of instructions.

dorbie
08-10-2004, 07:00 AM
It used to vary per texel; look at the code I pasted: "localNormal" is the bump map fragment, but he replaced it with a fixed value and kept the legacy texel-based MOV in there. That's why I was saying it slipped through the cracks; it's erm... less than optimal. Hopefully something in the driver optimizes it out, but it's interesting to notice that it has been changed from a per-texel value to a constant 'convolution' (or whatever) that ignores the bump alpha, AND that the constant isn't 0 or 1 but 0.2, hinting strongly at some constant modification of the exponent function (probably one that reduces specular aliasing through frequency and/or contrast reduction post-exponent).

sk
08-10-2004, 07:24 AM
Originally posted by dorbie:
It used to vary per texel; look at the code I pasted: "localNormal" is the bump map fragment, but he replaced it with a fixed value and kept the legacy texel-based MOV in there. That's why I was saying it slipped through the cracks; it's erm... less than optimal. Hopefully something in the driver optimizes it out, but it's interesting to notice that it has been changed from a per-texel value to a constant 'convolution' (or whatever) that ignores the bump alpha, AND that the constant isn't 0 or 1 but 0.2, hinting strongly at some constant modification of the exponent function (probably one that reduces specular aliasing through frequency and/or contrast reduction post-exponent).

Perhaps I wasn't clear before; optimality isn't an issue here, firstly because this is an unused test shader, and secondly because it's the sort of code one writes in the middle of testing/debugging.

dorbie
08-10-2004, 07:52 AM
I understand it looks like experimental code and it's obviously the kind of thing you get with a work in progress. The shader code fragment I posted was intercepted on the way to the graphics card while retail DOOM3 was running with a GeForce 6800. That doesn't guarantee it was actually used for much rendering but it is still informative.

P.S. I missed one of your earlier posts, thanks for the links.

sk
08-10-2004, 08:11 AM
Originally posted by dorbie:
I understand it looks like experimental code and it's obviously the kind of thing you get with a work in progress. The shader code fragment I posted was intercepted on the way to the graphics card while retail DOOM3 was running with a GeForce 6800. That doesn't guarantee it was actually used for much rendering but it is still informative.

Ah well that's interesting; I can see that it's not completely unused in the sense that it's loaded by the game.

When you intercepted it, was it on load or during a game frame? I've not seen that shader being set during frame rendering myself.

You'd really need to see a draw call use it, otherwise it's only going as far as the driver.

Edit:

Dorbie had this to say later:
P.S. I missed one of your earlier posts, thanks for the links.

No problem. Now stop editing your post, so that I can stop editing mine! ;)

plasmonster
08-10-2004, 04:14 PM
By the way, the normal mip-mapping technique that sk pointed to works as advertised. And as a bonus, you get free gloss-mapping. What do you know, another humble case for LUTs... *sniff*

:cool:

sqrt[-1]
08-11-2004, 04:21 AM
Slightly OT:

I was just wondering what tools (sk, dorbie and others) were using to intercept the shaders and determine when a shader was being used, etc. (As an author of such a tool, I was just curious what people use.)

dorbie
08-11-2004, 08:21 AM
I use yours of course, glIntercept; it's a great tool. It can be a bit slow and misses some legacy calls, for example glBindTextureEXT, but it is trivial to add a call like that using your header parsing system without recompilation. The configurability is great with the config file, but I think the glilog should go with all the other output (IMHO).

I'd be interested in your plans for .4 etc. (Edit: nm, I've seen "Future Features" on your web page; some optimization would be good.)

sk
08-11-2004, 11:26 AM
Hi sqrt[-1]:

I'm using glIntercept as well and I have to agree that the parsing system and configurability are great! XML output is also useful as are the thumbnails, plus single frame logging. I haven't been using it for long though so I can't think of areas for improvement yet, therefore all I can offer are congratulations and thanks for a great tool! :)

In addition I've also been using D3's built-in r_logFile option to dump GL calls. This is useful as there's some extra semantic / higher-level information (type of rendering, material names, etc). On the other hand it can't compare to the XSL presentation of glIntercept and it may well skip some calls (I've not compared the two closely).

sqrt[-1]
08-11-2004, 02:13 PM
dorbie:

I would be horrified to learn if it actually skipped logging even legacy calls. The worst that should happen is a note in the error log saying it is logging an unknown function and putting ???? in the parameters section of the function log. (Some of the more advanced loggers (image/shader) assume at least 1.1 support.)

I will be optimizing when feature complete, but I am pretty sure the overhead of just doing frame grabs is minimal. (When doing continual logging, the conversion from float->string in the parameter conversion kills speed. Try running without specifying a function config file and you should see a massive boost.)

dorbie
08-11-2004, 03:12 PM
It does report the unsupported calls.