Register coloring optimization? (GFFX, R300)

I’m looking into the possibility of assembling vertex and fragment programs at runtime. Our artists use Max-style materials, so depending on what maps are configured (environment, bump, transparency, etc) I need different programs. Rather than having humans write all the variations, it seems prudent to instead write little “linkable” snippets and chain them together as necessary.

My current question concerns the use of temporaries. Some cards seem to be quite sensitive to the number of temporaries used (GFFX), whereas other cards have strict limitations on dependency chains (R300) and sometimes get fooled by “false” dependencies. I have two ideas for how to manage temporaries:

  1. Each linkable program fragment exposes how many temporaries it needs, and calls them temp1-tempN; the prolog emits declarations for these temporaries. Keeps temporary usage low, but may cause false dependencies.

  2. Each linkable program fragment just allocates new temporaries in a linearly increasing fashion, with no name re-use. This removes all false dependencies, but may cause a large number of temporaries to be referenced.

I’m leaning towards 2), but for that to be practical, I need to be able to trust the GFFX drivers to do register coloring and reduce temporary register footprint to the smallest possible.
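To make option 2 concrete, here is a minimal sketch (in Python, with invented names like `link_snippets` and the `tmpN` naming scheme) of a linker that gives every snippet globally fresh temporaries, so no name is ever reused and no false dependency can arise:

```python
# Hypothetical sketch of option 2: each snippet refers to local temps
# t0..tN; linking renames them to globally fresh names (tmp0, tmp1, ...)
# so no temporary is ever reused across snippets.
import re

def link_snippets(snippets):
    """Concatenate snippet bodies, renaming each snippet's local temps
    to fresh global names. Returns (program text, total temps used)."""
    out, next_temp = [], 0
    for body in snippets:
        mapping = {}
        def fresh(m):
            nonlocal next_temp
            name = m.group(0)
            if name not in mapping:
                mapping[name] = "tmp%d" % next_temp
                next_temp += 1
            return mapping[name]
        out.append(re.sub(r"\bt\d+\b", fresh, body))
    return "\n".join(out), next_temp

prog, count = link_snippets([
    "MUL t0, a, b;\nADD t0, t0, c;",
    "MUL t0, d, e;",   # same local name, gets a fresh global one
])
```

The total temp count grows with the number of snippets, which is exactly why this approach leans on the driver to color the registers back down afterwards.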

So I guess that’s a long way of asking jra101 and/or Mark whether they’re l33t enough :slight_smile: Other ideas are also welcome.

Do you plan to feed some temporaries forward between different code fragments?
E.g., put multiple variations of texture sampling into code fragments and reference the result from another?

If so, you might want to give your code fragments named inputs/outputs and “clobber lists”. This also simplifies ‘linking’ quite a bit, as you can verify a combination for validity.
“Clobber list” temporaries can come from a shared pool, preferably with register-style names, and the maximum number you need for any given ‘linked’ program is the length of the longest clobber list.

Something along these lines
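A rough sketch of what such a descriptor scheme could look like (all names here, `Fragment`, `validate_chain`, the register names, are my own invention):

```python
# Each fragment declares named inputs/outputs plus a clobber list of
# scratch temps. Linking verifies that every input is produced by an
# earlier fragment, and the shared temp pool only needs as many
# registers as the longest clobber list.

class Fragment:
    def __init__(self, name, inputs, outputs, clobbers):
        self.name = name
        self.inputs = set(inputs)
        self.outputs = set(outputs)
        self.clobbers = clobbers

def validate_chain(frags, initial):
    """Check a chain for validity; return the shared pool size needed."""
    available = set(initial)
    for f in frags:
        missing = f.inputs - available
        if missing:
            raise ValueError("%s missing inputs %s" % (f.name, sorted(missing)))
        available |= f.outputs
    return max((len(f.clobbers) for f in frags), default=0)

bump = Fragment("bump", ["normal_ts"], ["normal"], ["r0", "r1"])
light = Fragment("light", ["normal"], ["diffuse"], ["r0"])
pool = validate_chain([bump, light], initial=["normal_ts"])  # pool of 2 temps
```

An invalid ordering (lighting before the bump fragment that produces `normal`) is rejected at link time rather than by the driver.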

#2 should work fine if you have a legal program, but I would encourage you to write your own register coloring/allocation code too, or write it in “simple Cg” and use the Cg compiler to do the register coloring/allocation for you.

Cg will also provide support for shader sub-objects, which will achieve the same end without having to “sprintf” programs.

Thanks -
Cass

Thanks for the responses!

Yes, feeding data forward is necessary, although I was thinking along the lines of using well-known names for such data.

Keeping a “clobber list” is similar to case 1, which may (for complex shaders) run into validation problems on ATI R300 hardware. There was another thread about this a few months back, where re-using temporaries could cause false dependencies.

And, Cass, Cg is something I’m looking at, but I haven’t measured the loading/preparation performance difference yet. We dynamically load/unload lots of shader data in the background, so I’d be inclined to go with whatever compiles the fastest, to reduce stuttering when new objects come onto screen.

Originally posted by jwatte:
Keeping a “clobber list” is similar to case 1, which may (for complex shaders) run into validation problems on ATI R300 hardware. There was another thread about this a few months back, where re-using temporaries could cause false dependencies.
Does this still apply?
I thought it’s pretty easy to figure this out: an unconditional write to a temp breaks the dependency chain on that temp (and moves its end to the last use of the temp as a source operand).
Or am I missing cases with that policy?
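That policy can be sketched as a toy live-range pass (made-up instruction tuples, not real ARB parsing): an unconditional write to a temp closes its current range at the last read and opens a new one, so the two ranges are independent.

```python
def live_ranges(instrs):
    """instrs: list of (dest, [sources]), full unconditional writes only.
    Returns {temp: [(def_index, last_use_index), ...]} per temp."""
    ranges = {}
    start = {}   # temp -> index of the write that opened its current range
    end = {}     # temp -> index of the last read in the current range
    for i, (dest, srcs) in enumerate(instrs):
        for s in srcs:                 # reads happen before the write
            if s in start:
                end[s] = i
        if dest in start:              # unconditional write: break the chain
            ranges.setdefault(dest, []).append((start[dest], end.get(dest, start[dest])))
        start[dest] = i
        end.pop(dest, None)
    for t in start:                    # close any still-open ranges
        ranges.setdefault(t, []).append((start[t], end.get(t, start[t])))
    return ranges

r = live_ranges([
    ("t0", ["a", "b"]),      # 0: define t0
    ("t1", ["t0"]),          # 1: last read of the first t0
    ("t0", ["c", "d"]),      # 2: rewrite breaks t0's dependency chain
    ("out", ["t0", "t1"]),   # 3
])
```

Here `t0` gets two disjoint ranges, (0, 1) and (2, 3), which is exactly the information a driver needs to avoid treating the rewrite as a dependency.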

edit: Just found the old thread. I’d be interested in Zeno’s thoughts, too.

[This message has been edited by zeckensack (edited 09-09-2003).]

I’d be interested in Zeno’s thoughts, too

Well, since you asked

I saw jwatte’s question a while ago and meant to respond but got bogged down with some other work.

I’ve run into the same problem myself. The current inflexibility of shader programs and the importance of optimizing them as much as possible necessitates creating an assembly language shader for every permutation of rendering parameters. For example, you might need a shader for one light and fog, two lights and no fog, diffuse only light, specular only light, specular only light with fog, etc.

Obviously this is a programming nightmare. I can see a few possible solutions:

  1. Go ahead and write all permutations by hand. This has the drawback of being extremely labor intensive and difficult to maintain. The advantage is that it’s conceptually simple and any individual shader can be nearly optimal.

  2. As jwatte is doing, write code snippets and link them together at runtime (or at program startup). The disadvantages here are complexity of implementation and getting optimal register usage, where optimal may vary depending on your hardware. Another disadvantage is that you will undoubtedly have extra instructions that could have been optimized away if you knew the whole program ahead of time. The advantage to this system is, of course, that you only have one copy of code to maintain for each state (light, fog, texture, etc).

  3. Wait for future hardware to support the things needed for flow control. Loops and function calls would solve most of the issues. Will these things be in the next generation of hardware or the one after that?

Anyway, I think (2) is the best option at this time if you have the time to write and debug such a system.
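One practical detail for option (2), sketched here with invented names (`get_shader`, the state tuple, the placeholder snippets): derive a key from the render-state permutation and cache the linked program text, so each variation is only assembled once instead of per frame.

```python
# Cache linked programs by their state permutation so each unique
# combination (lights, fog, bump, ...) is assembled exactly once.
_cache = {}

def get_shader(num_lights, fog, bump):
    key = (num_lights, fog, bump)
    if key not in _cache:
        parts = ["!!ARBfp1.0"]
        if bump:
            parts.append("# ...bump snippet...")
        for i in range(num_lights):
            parts.append("# ...light %d snippet..." % i)
        if fog:
            parts.append("# ...fog snippet...")
        parts.append("END")
        _cache[key] = "\n".join(parts)
    return _cache[key]
```

The same idea works whether the value cached is program text or a compiled/bound program object.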

To finally discuss jwatte and zeckensack’s questions: I don’t know if ATI has eliminated the false-dependency issue in their drivers, as I haven’t tested my old shader on the newest drivers (3.7 at this time). My hunch is that NVIDIA is more likely to optimize away excess temporaries than ATI is to eliminate false dependencies. The reason I say this is that NVIDIA’s cards’ performance is known to be heavily tied to the number of temp registers in use, so it’s probably a high priority for them, whereas ATI’s issue only comes up in abnormally long and rather unusual programs (not the kind used in games), so it probably receives a low priority. Given all that, I’d go with your second option above.

I’m sure that there are tons of problematic details you’ll come across when writing up a system to link programs. If you’re allowed, would you mind discussing some of the problems and solutions you come up with when you’re done?

Hope this brain-dump helped someone

For sharing registers, reusing them, and all that stuff: a look at softwire.sf.net / sw-shader.sf.net could possibly help, as he had the same problem for his runtime assembler, just with the x86, SSE, and MMX registers.

Zeno,
thanks.
I was asking mainly because I contemplated writing an ARB_fragment_program target. Meanwhile I found enough incentive to do so; it’s done and it works well. My shaders are probably too simple to cause many problems, though.
This is ‘the biggest one’ I fished out of the system:
This is ‘the biggest one’ I fished out of the system:

!!ARBfp1.0
TEMP ccom_tmp;
TEMP chroma_tmp;
TEMP tcom_out;
# projective lookup into the base texture
TXP tcom_out,fragment.texcoord[0],texture[0],2D;
# chroma key: weighted distance from the key color in env[1], minus the
# threshold in env[3].b; KIL discards the fragment if this goes negative
SUB chroma_tmp,tcom_out,program.env[1];
ABS chroma_tmp,chroma_tmp;
DP3 chroma_tmp,chroma_tmp,program.env[3].a;
SUB chroma_tmp.rgb,chroma_tmp,program.env[3].b;
KIL chroma_tmp;
# modulate the texel by the fragment color
MUL ccom_tmp.rgb,fragment.color,tcom_out;
MUL result.color.a,program.env[3].a,program.env[0].a;
# linear fog: blend towards the fog color in env[2] by fogcoord * env[3].g
MUL ccom_tmp.a,fragment.fogcoord.x,program.env[3].g;
LRP result.color.rgb,ccom_tmp.a,program.env[2],ccom_tmp;
END

Just a chroma-keyed, fragment-color-modulated texture lookup with linear fog. Nothing too earth-shattering.

Originally posted by cass:
…Cg will also be providing support for shader sub-objects as well that will achieve the same end without having to “sprintf” programs…

Maybe this is a bit off topic, but I still wonder how to mesh $yourFavoriteShadingLanguageHere with application-specific details.
For example, if I want to do something like $result = $this * $that, but “$this” and “$that” come from the app (and may vary with app release), how can I let the shader / the shader parser use the correct values? Say I want to blend everything toward red or black when the player is dying, so my app computes a float value for the scale, but… how do I get it to the shader? Parsing the shader file is the only approach I can think of, and every time I consider this issue I suspect this is why some applications have their own shader languages.

Obli: application-specified parameters are, as you noticed, very important, and usually glossed over in all the samples and explanations.

What you have to do is define specific names/types/expectations for all kinds of shaders, and have the application query the shader for these specific names. The DirectX/Cg concept of a “semantic” may help here; if you want to do it for assembly shaders, you’ll have to embed comments describing this meta-information in the program text and parse them when loading the program.
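A sketch of that comment-metadata idea (the `@param` convention and field layout here are invented for illustration, not any standard): annotate env-constant slots with semantics in comments, then parse them out when the program is loaded.

```python
# Parse invented "# @param <slot> <name> <type> <description>" comments
# out of an ARB-style program, so the app can bind values by name.
import re

SOURCE = """!!ARBfp1.0
# @param env[0] time        float   seconds since startup
# @param env[2] fog_color   float4  linear fog color
TEMP t0;
END"""

def parse_semantics(text):
    sem = {}
    for slot, name, typ in re.findall(
            r"^#\s*@param\s+(\S+)\s+(\S+)\s+(\S+)", text, re.M):
        sem[name] = {"slot": slot, "type": typ}
    return sem

params = parse_semantics(SOURCE)
```

The driver never sees the comments; only the app-side loader interprets them, which is exactly the “glue” layer being discussed.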

The ATI shader demo engine that they’ve shown in their “how to write a shader library” talks uses all kinds of special semantics to specify things such as render-target textures, whether they’re cube or flat render targets, etc.

Basically, while the way to express programs is getting pretty standardized, the “glue” to hook these between your app and the program is still wildly varying. Simple, translation-free “shader portability” is not a thing of the current.

Originally posted by jwatte:
…Basically, while the way to express programs is getting pretty standardized, the “glue” to hook these between your app and the program is still wildly varying. Simple, translation-free “shader portability” is not a thing of the current.

Ack. Oh well, thanks for the reply anyway, at least now I’m sure there’s no standard way to do that.