Fragment programs and ARB_shadow (on ATI hw)

harsman · July 4, 2003, 3:13am

This thread sort of forked into a shadow mapping discussion which got me thinking about shadow mapping in conjunction with fragment programs. Interestingly enough the spec states the following:

Interactions with ARB_shadow

The texture comparison introduced by ARB_shadow can be expressed in 
terms of a fragment program, and in fact use the same internal 
resources on some implementations.  Therefore, if fragment program 
mode is enabled, the GL behaves as if TEXTURE_COMPARE_MODE_ARB is 
NONE.

Which seems really weird. This means there is no way to use dedicated filter-after-compare hw a.k.a “you get free PCF shadow maps for the cost of a texture lookup”.

It obviously makes life easier for ATI’s driver engineers (since the Radeon doesn’t have dedicated PCF hw AFAIK) but not for anyone else. You can’t get the nice cheap PCF on NVIDIA hw (which supports it), for example. The sane thing would be to remedy this in the spec, so that a tex lookup on a shadow texture expands into multiple instructions on hw that doesn’t have dedicated PCF functionality.

Of course, you might sometimes want the actual depth from the depth texture in addition to the PCF value, for example to blur the shadow with occluder-receiver distance like Angus pointed out in that other thread. So the best thing would be two texture lookup instructions, a shadow lookup and a regular one. This is how it works in glslang, so it seems weird that fragment programs are broken.

Anyway, since ATI hw doesn’t have the fancy specialised PCF functionality I got to thinking about how to do fast PCF anyway and got a pretty good idea. If you use a four channel depth map you can store the depth of 4 shadow map texels in one so to speak. So the first channel holds the depth at the current texel, the next at u+1texel, u-1texel, v+1texel etc. I don’t know if it will work at polygon edges. I think it will if you just render one depth channel and then copy it to all the other channels with the required offsets. The edge cases might be tricky though. This eats bandwidth like hell of course, primarily in the shadow map generation phase but you have to fetch four times the number of bits in the shadow map when accessing it as well.
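A rough CPU-side sketch of the packing step described above, in Python rather than GL calls. The channel order follows the post (current texel, u+1, u-1, v+1); the clamp-to-edge handling and the name `pack_neighbours` are assumptions made for illustration, since the post leaves the edge cases open.

```python
# Offsets per channel, in the order given in the post: (u, v), (u+1, v), (u-1, v), (u, v+1).
OFFSETS = [(0, 0), (1, 0), (-1, 0), (0, 1)]

def pack_neighbours(depth_map):
    """Build a 4-channel map where each texel also stores three neighbours' depths.

    depth_map is a 2D list indexed [y][x]. Out-of-range neighbours are clamped
    to the nearest edge texel (an assumption, not something the post specifies).
    """
    height, width = len(depth_map), len(depth_map[0])

    def fetch(x, y):
        x = min(max(x, 0), width - 1)
        y = min(max(y, 0), height - 1)
        return depth_map[y][x]

    return [[[fetch(x + dx, y + dy) for dx, dy in OFFSETS]
             for x in range(width)]
            for y in range(height)]

depth_map = [[0.1, 0.2],
             [0.3, 0.4]]
packed = pack_neighbours(depth_map)
print(packed[0][0])  # [0.1, 0.2, 0.1, 0.3]
```

The packed map holds four times the data of the original, which is the bandwidth cost the post mentions.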

However, to filter you just need one texture lookup and DP4 with a weight vector to get your PCF value. Pretty neat. Has anyone done shadow maps on the Radeon? What did you do? Jason and Evan, didn’t one of you do the chimp demo, that had shadows didn’t it?

And I really think the fragment program spec should be updated to work in the sane way with shadow maps, the way it is now seems weird.

Korval · July 4, 2003, 11:08am

Which seems really weird. This means there is no way to use dedicated filter-after-compare hw a.k.a “you get free PCF shadow maps for the cost of a texture lookup”.

What is PCF?

Glslang has a sampler to handle shadow compares (though I wish they had used a more descriptive name than “shadow”). So, there’s really little point in updating the fragment program spec.

harsman · July 4, 2003, 12:33pm

Percentage Closer Filtering, i.e. fetch n values (four for bilinear filtering, more for a larger filter kernel), do the compare on all of these and then filter the resulting values (ones or zeros). This works, in contrast with just filtering the depth, which breaks down at mesh edges.
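To make the difference concrete, here is a small Python sketch of both orders of operation. The function names and the sample depths are made up for illustration, and the "lit when receiver depth <= stored depth" convention is an assumption about the compare direction.

```python
def pcf(depths, weights, receiver_depth):
    """Compare each stored depth first, then filter the resulting 0/1 values."""
    lit = [1.0 if receiver_depth <= d else 0.0 for d in depths]
    return sum(w * l for w, l in zip(weights, lit))

def filter_then_compare(depths, weights, receiver_depth):
    """Filter the depths first, then do a single compare."""
    blended = sum(w * d for w, d in zip(weights, depths))
    return 1.0 if receiver_depth <= blended else 0.0

# Four neighbouring texels straddling an occluder edge:
# two see a near occluder (0.2), two see the far background (0.9).
depths = [0.2, 0.2, 0.9, 0.9]
weights = [0.25, 0.25, 0.25, 0.25]
receiver = 0.6

print(pcf(depths, weights, receiver))                  # 0.5
print(filter_then_compare(depths, weights, receiver))  # 0.0
```

Filtering the depths produces a blended depth (about 0.55 here) that belongs to no real surface, so the single compare still gives a hard 0 or 1. Comparing first gives a fractional shadow value.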

I spotted an error in the above btw: I forgot the actual comparison in the code. You’ll need to fetch, do a SGE or similar and then DP4 the result with the weights.
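The corrected sequence (one fetch, SGE, DP4) can be sketched in Python as follows. `sge` and `dp4` only mimic what the ARB_fragment_program instructions of the same name do per component; `packed_shadow_lookup` and the sample values are hypothetical.

```python
def sge(a, b):
    """Component-wise 'set on greater or equal': 1.0 where a >= b, else 0.0."""
    return [1.0 if x >= y else 0.0 for x, y in zip(a, b)]

def dp4(a, b):
    """Four-component dot product."""
    return sum(x * y for x, y in zip(a, b))

def packed_shadow_lookup(packed_texel, receiver_depth, weights):
    """packed_texel is one RGBA fetch holding the depths of four neighbouring texels."""
    lit = sge(packed_texel, [receiver_depth] * 4)  # 1.0 where the stored depth is not closer
    return dp4(lit, weights)

texel = [0.9, 0.9, 0.3, 0.9]  # one neighbour is covered by a nearer occluder
print(packed_shadow_lookup(texel, 0.5, [0.25, 0.25, 0.25, 0.25]))  # 0.75
```

With constant weights this is a box filter over the four stored neighbours.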

However, the more I think about it, the more I’m convinced that actual bilinear filtering (i.e. PCF hw) will look better than a cheapo box filter, which is what you get if you do it the way I described above. Unless you do multiple tex lookups per pixel you won’t get that “smooth” look (at least I think you won’t, I haven’t implemented this yet), you’ll just get the shadow map box filtered but still with “nearest” filtering, so to speak.

evanGLizr · July 4, 2003, 3:53pm

Originally posted by harsman:
If you use a four channel depth map you can store the depth of 4 shadow map texels in one so to speak. So the first channel holds the depth at the current texel, the next at u+1texel, u-1texel, v+1texel etc. I don’t know if it will work at polygon edges. I think it will if you just render one depth channel and then copy it to all the other channels with the required offsets.

That would work if the depth texture is just 8bit per depth sample, but depth textures are either 16, 24 or 32bit per sample, so you have to fetch several texels anyway (you cannot have “multicomponent” depth textures).

I haven’t looked at ARB_fragment_program interaction with floating point textures, but I’m sure you cannot fetch several fp32 components using just one instruction :?
Maybe you could fetch two 16bit values at once (using fp16 multicomponent texture or just RGBA 8bit texture), but I would be surprised if you could fetch more than 32bits at once.

Regarding PCF, you seem to imply that the texels you fetch are bilinear filtered and then PCF filtered, but I don’t think that’s the case: they are nearest filtered and then PCF filtered (with weights), there’s no bilinear filtering (or trilinear, for that matter) going on, as that would be “geometrically incorrect”.

In addition to that, IIRC you cannot do non-nearest filtering on fp textures.

Edit:
After reading the two extensions on floating point textures:

ATI_texture_float Doesn’t specify interactions with ARB_fragment_program.
NV_float_buffer only allows floating-point textures to be used with NV_texture_rectangle, and specifies that fp textures can be read with NV_fragment_program as normal.

So even if it looks like you can do a >32bit texture load in only one ARB_fragment_program instruction, I doubt the hardware really reads >32bit per component textures in one pass (it would either replicate the TEX instruction or use internal loopback), so I still think you wouldn’t win anything with your method but wasting memory :-m.

Multiplying the texture data by four would also kill your texture cache, so with your method each texture load would actually take even more time to be executed.

[This message has been edited by evanGLizr (edited 07-04-2003).]

Korval · July 4, 2003, 5:32pm

ATI_texture_float Doesn’t specify interactions with ARB_fragment_program.

And why should it? There aren’t any. You fetch the components as normal.

So even if it looks like you can do a >32bit texture load in only one ARB_fragment_program instruction, I doubt the hardware really reads >32bit per component textures in one pass (it would either replicate the TEX instruction or use internal loopback), so I still think you wouldn’t win anything with your method but wasting memory :-m.

Based on what do you believe this? At the very least, I know you can use n-channel float textures just like regular n-channel color textures on ATi hardware. The only difference is that the results of fetches can be outside [0-1], and that there is no filtering applied to float textures. Outside of this, nothing is different.

Multiplying the texture data by four would also kill your texture cache, so with your method each texture load would actually take even more time to be executed.

Why? How big is the texture cache on ATi hardware?

harsman · July 5, 2003, 3:15am

Regarding PCF, you seem to imply that the texels you fetch are bilinear filtered and then PCF filtered, but I don’t think that’s the case: they are nearest filtered and then PCF filtered (with weights), there’s no bilinear filtering (or trilinear, for that matter) going on, as that would be “geometrically incorrect”.

Maybe I implied that, but that’s certainly not what I meant. I’m also very aware you don’t get any filtering on float textures. That was the whole point of doing the filtering yourself! Reading a 4 channel float texture should certainly be possible though; the tex instruction should just work like a normal RGBA lookup. However, you might be right about the risk of stalls when fetching more than 32 bits. There might be bus limitations when going higher, since that’s pretty uncommon.

Regarding bandwidth, I don’t think that will be as big of a problem as you think. If you’re doing per pixel lighting with shadows you probably have a pretty long fragment program anyway, so trading bandwidth for instructions should be sane.

However, since I realised you still need to compute the texcoords per pixel to get real PCF and not just a nearest-neighbour filtered, box filtered shadow, the idea is probably toast anyway. I’m going on vacation for two weeks tomorrow so I won’t be able to test any of this; it will have to wait until I get home.

evanGLizr · July 5, 2003, 3:29am

evanGLizr:

ATI_texture_float Doesn’t specify interactions with ARB_fragment_program.

Korval:
And why should it? There aren’t any. You fetch the components as normal.

It should specify interactions at least for completeness, NVIDIA extension does, for example.

evanGLizr:
So even if it looks like you can do a >32bit texture load in only one ARB_fragment_program instruction, I doubt the hardware really reads >32bit per component textures in one pass.

Korval:
Based on what do you believe this? At the very least, I know you can use n-channel float textures just like regular n-channel color textures on ATi hardware.

Once upon a time I was an OpenGL driver developer; that sentence is based on my experience with real hardware. It would be overkill to have hardware support reading 128 bits at once when that’s hardly going to be the usual case.

Note that the fact that the app can use floating point textures transparently doesn’t mean the driver or the hardware isn’t doing multiple reads behind your back.
All these “assemblers” do not map to hardware directly (but for one generation of one company, maybe), there’s a lot of work behind the scenes that the app never sees.

evanGLizr:
Multiplying the texture data by four would also kill your texture cache, so with your method each texture load would actually take even more time to be executed.

Korval:
Why? How big is the texture cache on ATi hardware?

I don’t know how big texture cache (or caches, as they probably have some hierarchy) are on ATI’s or NVIDIA’s hardware, but what I know is that with that method you make your cache 1/4 its size and that cannot be good.

Take into account that those caches, again, are tuned for the normal usage: 4 one byte components for two or three texture stages.
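The "1/4 its size" figure is plain arithmetic on texel size. A quick Python check, where the 8 KB cache size is a made-up number (nobody in the thread knows the real one) and only the ratio matters:

```python
def texels_in_cache(cache_bytes, bytes_per_texel):
    """How many whole texels fit in a cache of the given size."""
    return cache_bytes // bytes_per_texel

CACHE_BYTES = 8 * 1024  # hypothetical size, not a figure for any real ATI or NVIDIA part

rgba8 = texels_in_cache(CACHE_BYTES, 4)     # 4 one-byte components per texel
rgba32f = texels_in_cache(CACHE_BYTES, 16)  # 4 four-byte float components per texel

print(rgba8, rgba32f, rgba32f / rgba8)  # 2048 512 0.25
```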

Edit: Doh, UBB doesn’t support hierarchical quoting.

[This message has been edited by evanGLizr (edited 07-05-2003).]

evanGLizr · July 5, 2003, 3:51am

Originally posted by harsman:
I’m also very aware you don’t get any filtering on float textures. That was the whole point of doing the filtering yourself!

I thought the point of doing the filtering yourself was doing PCF, not bilinear filtering.

Regarding bandwidth, I don’t think that will be as big of a problem as you think. If you’re doing per pixel lighting with shadows you probably have a pretty long fragment program anyway, so trading bandwidth for instructions should be sane.

Hmm, you’ve got a point, but note that you are assuming that you (or the driver) can rearrange the instructions to hide the stalls that come from shrinking your effective cache size (these will be much longer than the usual stall from just reading a texture, and the driver will normally use “usual stall” metrics to rearrange instructions).
Obviously that depends on your specific scenario.

[This message has been edited by evanGLizr (edited 07-05-2003).]

Korval · July 5, 2003, 4:14am

It should specify interactions at least for completeness, NVIDIA extension does, for example.

Somehow, I prefer ATi’s. Maybe it’s because I can make any kind of normal OpenGL texture floating-point (luminance, intensity, etc), whereas nVidia’s only lets you have 4-channel versions.

In any case, the interactions are specified, though not directly, by this line:

Each R, G, B, and A value so generated is clamped based on the component type in the <internalFormat>. Fixed-point components are clamped to [0, 1]. Floating-point components are clamped to the limits of the range representable by their format.

There’s little more that needs to be said. How it interacts with ARB_fragment_program is exactly how it interacts with ARB_tex_env_combine or any other per-fragment operation: it works precisely as the spec says it will.

Besides, nVidia has to specify interactions. They don’t allow their fp textures to be used as render targets outside of NV_fragment_program; ATi’s extension is completely indiscriminate. As such, there is no issue to discuss.

Indeed, after reading the spec, I have some questions as to whether or not you can even use an fp texture outside of NV_fragment_program. I wonder if you can even access an fp texture with ARB_fragment_program. The ATi extension leaves no such doubts.

It would be overkill to have hardware support reading 128 bits at once when that’s hardly going to be the usual case.

A hardware implementation of ARB_shadow is hardly the common case of texture fetches. Yet, nVidia went through the trouble of implementing a fast, hardware-based version of this lookup/compare operation. That it doesn’t happen to fall under common usage doesn’t mean that hardware doesn’t get devoted to it, especially if their texture fetch units are built to handle variable-sized textures.

Now, it may be that the texture unit hardware itself may take multiple cycles, internally doing some form of loop to access memory over multiple cycles. However, I seriously doubt their shader compiler itself has to do anything to permit access to fp textures. After all, how would it know?

A shader is allowed to fail compiling, yes. But, it isn’t allowed to fail based on the current state of bound textures. Given that fact, ATi is unable to write a fragment program compiler that knows which texture accesses use an fp texture and which ones don’t. ARB_fragment_program allows such things to happen invisibly, so their hardware must be able to handle it.

If it couldn’t do it transparently in hardware, then they would have to re-compile a fragment program at every glBegin-equivalent call. Also, they would have no error recourse if the new instructions caused it to violate some hardware limit (too many instructions, too many texture accesses, etc).

but what I know is that with that method you make your cache 1/4 its size and that cannot be good.

And? It’s not like people using floating-point textures are expecting the performance you get out of an S3TC one. If you increase your bandwidth by a factor of 4 (at best), you can’t reasonably expect the same performance on a bandwidth-limited piece of hardware (like most modern video cards). You use fp textures because they are required for the effect you’re trying to achieve.

Granted that, ATi or nVidia could have expanded their cache size, such that, while it doesn’t really help regular textures too much, fp textures don’t hurt the cache as much.

harsman · July 5, 2003, 4:40am

I thought the point of doing the filtering yourself was doing PCF, not bilinear filtering

Well, yeah. That didn’t come out right. The fact that you don’t get PCF was the reason to do the filtering yourself. However, I think you were confused by my use of the word bilinear. You have to have a filter kernel for PCF as well. When I said “bilinear” I just meant sampling the four closest neighbours. After all, you could use a 16-sample filter kernel if you wanted, or use a cubic weighting instead of linear, etc. My point was that to find the four closest neighbours you really need more taps or different coordinates for each lookup. Otherwise you just get a cheapo box filter.
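A Python sketch of what "the four closest neighbours" means in practice: the weights depend on where the sample falls between texel centres, which is why a fixed weight vector only gives a box filter. The coordinate convention (integer coordinates land on texel centres), the edge clamping and the name `bilinear_pcf` are assumptions for illustration.

```python
import math

def bilinear_pcf(depth_map, x, y, receiver_depth):
    """Compare the four nearest texels, then blend the 0/1 results bilinearly.

    depth_map is a 2D list indexed [y][x]; x and y are continuous texel
    coordinates where integer values land exactly on texel centres.
    """
    height, width = len(depth_map), len(depth_map[0])
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0  # position of the sample between the four texels

    def lit(ix, iy):
        ix = min(max(ix, 0), width - 1)   # clamp to edge
        iy = min(max(iy, 0), height - 1)
        return 1.0 if receiver_depth <= depth_map[iy][ix] else 0.0

    top = (1 - fx) * lit(x0, y0) + fx * lit(x0 + 1, y0)
    bottom = (1 - fx) * lit(x0, y0 + 1) + fx * lit(x0 + 1, y0 + 1)
    return (1 - fy) * top + fy * bottom

# Left column is occluded (0.2), right column is lit (0.9).
depth_map = [[0.2, 0.9],
             [0.2, 0.9]]
print(bilinear_pcf(depth_map, 0.25, 0.5, 0.5))  # 0.25
print(bilinear_pcf(depth_map, 0.75, 0.5, 0.5))  # 0.75
```

A box filter with constant weights would return 0.5 at both sample positions, so the shadow edge would move in texel-sized steps instead of sliding smoothly.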

Doing some research here has revealed that evanGLizr was right however.

RADEON 9500/9700 can perform point or bilinear filtering of one texture request per clock cycle per pixel shader pipe, if texture format does not exceed 32 bits. For texture formats fatter than 32 bits it will take 2 clocks for processing 64 bit texture formats and 4 clocks for 128 bit formats. Trilinear filtering doubles number of clocks because it requires two bilinear blends. For all floating point formats on RADEON 9500/9700 only point filtering is supported.

So then I might as well do all four lookups and get full PCF filtering. It’s kind of annoying to burn four fetches on a shadow map lookup but I guess that’s the price you have to pay to get PCF on the Radeon. Unless anyone else has any bright ideas? There’s two sample AA of course, so it should be possible to get some sort of filtering even with say a two channel 16-bit fp lookup.
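The clock counts in the quoted ATI text can be turned into a small cost model. This Python sketch only encodes the three data points given there (up to 32 bits: 1 clock, 64 bits: 2, 128 bits: 4, trilinear doubles); rounding other sizes up to the next 32-bit multiple is an assumption, and the name `fetch_clocks` is made up.

```python
def fetch_clocks(bits_per_texel, trilinear=False):
    """Texture unit clocks per request, following the quoted RADEON 9500/9700 figures."""
    clocks = max(1, -(-bits_per_texel // 32))  # ceiling division by 32
    return clocks * 2 if trilinear else clocks

four_separate = 4 * fetch_clocks(32)  # four lookups into a 32-bit depth format
one_packed = fetch_clocks(128)        # one lookup into a 4 x fp32 texture
two_channel_fp16 = fetch_clocks(32)   # one lookup into a 2 x fp16 texture

print(four_separate, one_packed, two_channel_fp16)  # 4 4 1
```

Under this model the packed 128-bit fetch costs the same texture unit time as four separate 32-bit fetches, while the two-channel fp16 variant stays at one clock.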

EDIT: clarifications

[This message has been edited by harsman (edited 07-05-2003).]

evanGLizr · July 5, 2003, 5:05am

Indeed, after reading the spec, I have some questions as to whether or not you can even use an fp texture outside of NV_fragment_program. I wonder if you can even access an fp texture with ARB_fragment_program. The ATi extension leaves no such doubts.

I think it’s very clear from NVIDIA’s spec:

What happens if you try to use an floating-point texture without a fragment program?

RESOLVED: No error is generated, but that texture is effectively disabled. This is similar to the behavior if an application tried to use a normal texture having an inconsistent set of mipmaps.

Look at all the cases NVIDIA has considered when using a floating point buffer (readpixels, bitmap, clears, etc) even if it’s just to say that the extension works transparently in those cases. That’s the kind of wording I miss from ATI’s spec (even if ATI’s spec is just for textures, not for any kind of fp buffer).

I’m not talking about which spec is functionally better (obviously ATI’s orthogonal implementation is much better), but which has considered all the corner cases and interactions with other specs.

Now, it may be that the texture unit hardware itself may take multiple cycles, internally doing some form of loop to access memory over multiple cycles. However, I seriously doubt their shader compiler itself has to do anything to permit access to fp textures. After all, how would it know?

It will know the same way it knows for the fixed function pipeline. The compiler already depends on the OpenGL state for any other fixed function fragment pipeline setting.

A shader is allowed to fail compiling, yes. But, it isn’t allowed to fail based on the current state of bound textures. Given that fact, ATi is unable to write a fragment program compiler that knows which texture accesses use an fp texture and which ones don’t. ARB_fragment_program allows such things to happen invisibly, so their hardware must be able to handle it.

That’s not necessarily true; you can make pessimistic assumptions about the OpenGL state when you compile an ARB_fragment_program and fail. Or you can just assume nobody will reach that problematic case your hardware doesn’t support (and pray and worry only when someone logs a bug).

Note that I don’t say their fragment shader hardware is not independent of the OpenGL state, I just say that there are ways of making the app believe it is.

If it couldn’t do it transparently in hardware, then they would have to re-compile a fragment program at every glBegin-equivalent call. Also, they would have no error recourse if the new instructions caused it to violate some hardware limit (too many instructions, too many texture accesses, etc).

Recompiling on each glBegin is a non-issue. I guess they are already recompiling internal fragment shaders on every OpenGL state change, or do you really think that fixed function fragment pipeline is implemented via dedicated hardware resources?

It’s not like people using floating-point textures are expecting the performance you get out of an S3TC one. If you increase your bandwidth by a factor of 4 (at best), you can’t reasonably expect the same performance on a bandwidth-limited piece of hardware (like most modern video cards). You use fp textures because they are required for the effect you’re trying to achieve.

I agree with you on that and it’s exactly my point: the original discussion was whether quadrupling the size of a shadow texture by replicating the 4 adjacent texels in a 4-component fp texture would be a win or not for shadow mapping and PCF, compared with just doing 4 independent texture lookups.

My guess is that it won’t, for all the reasons I’ve laid out: the hardware will actually do (transparently or not) multiple lookups per >32bit component and you will shrink your effective texture cache.

davepermen · July 5, 2003, 8:14am

Originally posted by evanGLizr:
Look at all the cases NVIDIA has considered when using a floating point buffer (readpixels, bitmap, clears, etc) even if it’s just to say that the extension works transparently in those cases. That’s the kind of wording I miss from ATI’s spec (even if ATI’s spec is just for textures, not for any kind of fp buffer).

dunno… for me it’s not like nvidia considered all cases, it’s merely a list of where the fp textures will not work. not in fixed mode, not for this, not for that, etc…

the ati spec is merely “we have floating-point textures”. yes, it’s not much, and could be more (especially some clamping stuff in different situations…), but at least the ati way implies that they tried to fully implement it…

and all in all, the nvidia fp textures are only to laugh anyways. they are way too restricted to be really useful the way gl is made…

i think the only way is 4 samples, and doing bilinear manually… but not from the actual results, but sort of more weighted possibly… dunno, read once about more accurate shadowmap sampling…

and, while we’re at it… as you do the bilinear yourself, you could use a neat s-curve for the filtering, too. looks much nicer…
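One way to read the s-curve suggestion, sketched in Python: remap the fractional weight through a smooth curve before blending the compare results. The post doesn't say which curve is meant; the cubic smoothstep used here is an assumption, and `s_curve` and `blend` are illustrative names.

```python
def s_curve(t):
    """Cubic smoothstep: 0 at t=0, 1 at t=1, with zero slope at both ends."""
    return t * t * (3.0 - 2.0 * t)

def blend(lit_a, lit_b, frac, curve=None):
    """Blend two compare results (0.0 or 1.0) using an optionally remapped weight."""
    w = curve(frac) if curve else frac
    return (1.0 - w) * lit_a + w * lit_b

# A quarter of the way from a shadowed texel (0.0) to a lit texel (1.0):
print(blend(0.0, 1.0, 0.25))           # 0.25 with plain linear weights
print(blend(0.0, 1.0, 0.25, s_curve))  # 0.15625 with the s-curve
```

The curve keeps the endpoints and the midpoint where they were but flattens the transition near each texel centre.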

Korval · July 5, 2003, 7:47pm

Doing some research here has revealed that evanGLizr was right however.

That doesn’t show that it is using 4 texture accesses. It’s saying precisely what I said it might do: spend more time in the texture unit to access large texture formats. It doesn’t cause the fragment program to take one texture access instruction and convert it into 4. It simply makes it take 4 times longer.

Unless anyone else has any bright ideas?

Don’t bother with PCF?

I think it’s very clear from NVIDIA’s spec:

That doesn’t specify whether or not fp textures work with ARB_fragment_program. It’s a little nebulous on the details.

That’s not necessarily true; you can make pessimistic assumptions about the OpenGL state when you compile an ARB_fragment_program and fail.

And then nobody uses your hardware, because what was once 24 separate texture lookups becomes 6, which is highly unacceptable.

Is it really that hard to believe that their hardware can do the accessing internally, though it may take longer than a regular texture access?

Recompiling on each glBegin is a non-issue. I guess they are already recompiling internal fragment shaders on every OpenGL state change, or do you really think that fixed function fragment pipeline is implemented via dedicated hardware resources?

It is on nVidia hardware. On ATi’s it may be, but it may also not be.

And, even if it is, you don’t really think they create a program string and pass it like they were compiling a regular shader, do you? They probably have pre-compiled versions of pieces of shaders for different state (one for each tex-gen mode, for lights, etc). Then, when the state is changed, they simply slap them together, do some link-time binding of temporaries and other registers, and are done. It’s hardly compilation. It’s more like glslang program linking, though likely even simpler than that.

The original discussion was whether quadrupling the size of a shadow texture by replicating the 4 adjacent texels in a 4-component fp texture would be a win or not for shadow mapping and PCF, compared with just doing 4 independent texture lookups.

That wasn’t the discussion. The discussion was how to emulate PCF on hardware that doesn’t do it natively. That’s where the 4-sample method came from. It was never implied that this was, in either performance or quality, better than a hardware-based PCF method, because it could never possibly equal the hardware-based method (as nVidia’s does seem cheaper). It was simply a question of how to do it on non-PCF hardware.