Level of detail

How is the lambda parameter for texture lookup calculated in a fragment shader?
Does the implementation just calculate it as if I had supplied texture coordinate n to texture unit n? Or does it mirror all the calculations I perform to obtain the real coordinates, in order to compute their derivatives?

The hardware always processes several neighboring fragments simultaneously in lockstep. When it needs derivatives for a texturing instruction, it computes them from the texture coordinates passed to that instruction in the neighboring fragments.
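As a toy illustration (my own sketch, not actual GPU code), the derivatives can be formed as simple finite differences between the texture coordinates of adjacent fragments in a 2x2 quad; the coordinate values below are made up:

```python
# Sketch: approximate dFdx/dFdy from a 2x2 quad of fragments.
# quad[y][x] holds the (u, v) texture coordinate of each fragment.

def quad_derivatives(quad):
    """Return forward differences (dFdx, dFdy) for the top-left
    fragment of a 2x2 quad of texture coordinates."""
    (u00, v00), (u01, v01) = quad[0]   # top row
    (u10, v10), _ = quad[1]            # bottom row
    dfdx = (u01 - u00, v01 - v00)      # difference along x
    dfdy = (u10 - u00, v10 - v00)      # difference along y
    return dfdx, dfdy

# Hypothetical coordinates interpolated across a quad:
quad = [[(0.0, 0.0), (0.25, 0.0)],
        [(0.0, 0.5), (0.25, 0.5)]]
print(quad_derivatives(quad))  # ((0.25, 0.0), (0.0, 0.5))
```

This is why no extra derivative inputs are needed: whatever coordinate the shader actually passes to the texturing instruction, its neighbors in the quad passed theirs too, and the difference is taken directly.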

I always thought that totally independent processing of fragments is what makes massive parallelism in fragment shaders possible.
If execution is in lockstep, that would mean that if I have a loop in my fragment shader followed by a texture access, and the number of loop iterations varies greatly, all neighbours of a fragment that needs a large number of loop iterations would be slowed down.
However, this would explain why derivatives are undefined within the body of non-uniform conditionals.

I always thought that totally independent processing of fragments is what makes massive parallelism in fragment shaders possible.
The idea is that, since each of a 2x2 (or however many) pixel group is going to run the same instructions, just over several bits of data, then all you need is hardware that can do 4 simultaneous vector operations in a single cycle. It saves a lot in transistors to have each of the 2x2 groups using the same sequence of instructions. You don’t need to have 4 processing units; you only need one that can handle 4x the data.
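A toy model of that idea (my own illustration, not real hardware): one instruction sequence is fetched and decoded once, then applied to the data of all four lanes of a quad:

```python
# Sketch: a single instruction stream executed in lockstep over
# four lanes (one 2x2 quad). There is one "control unit" (the loop
# over the program) but 4x the data.

def simd_exec(program, lanes):
    for op in program:                    # same instruction for everyone...
        lanes = [op(x) for x in lanes]    # ...applied to each lane's data
    return lanes

program = [lambda x: x * 2, lambda x: x + 1]
print(simd_exec(program, [1, 2, 3, 4]))  # [3, 5, 7, 9]
```

The transistor saving comes from sharing the instruction fetch/decode/sequencing logic across all lanes, paying only for the replicated ALU datapaths.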

If execution is in lockstep, that would mean that if I have a loop in my fragment shader followed by a texture access, and the number of loop iterations varies greatly, all neighbours of a fragment that needs a large number of loop iterations would be slowed down.
Yes. And this is why doing loops like that is discouraged by hardware makers in their performance guides.
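To make the cost concrete, here is a small sketch (my own, with made-up iteration counts): in lockstep the quad can only retire when its slowest lane finishes, so a data-dependent loop costs the maximum iteration count in the group rather than the average:

```python
# Sketch: cost of a data-dependent loop under lockstep vs. truly
# independent execution. iteration_counts holds the per-fragment
# loop trip counts within one execution group.

def lockstep_loop_cost(iteration_counts):
    return max(iteration_counts)          # everyone waits for the slowest

def independent_loop_cost(iteration_counts):
    return sum(iteration_counts) / len(iteration_counts)  # average work

quad_iters = [2, 3, 2, 100]   # one fragment needs far more iterations
print(lockstep_loop_cost(quad_iters))     # 100
print(independent_loop_cost(quad_iters))  # 26.75
```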

However, this would explain why derivatives are undefined within the body of non-uniform conditionals.
Precisely.

Originally posted by Korval:
The idea is that, since each of a 2x2 (or however many) pixel group is going to run the same instructions, just over several bits of data, then all you need is hardware that can do 4 simultaneous vector operations in a single cycle. It saves a lot in transistors to have each of the 2x2 groups using the same sequence of instructions. You don’t need to have 4 processing units; you only need one that can handle 4x the data.

I doubt it saves a lot of hardware (floating-point computations are expensive anyway; the overhead for control logic shouldn't matter that much), but I can see that this saves a lot of instructions in shaders, since you get the derivatives for free.
However I have one question left: Suppose we have hardware that always processes groups of 2x2 fragments. How does that work at the border of a window with uneven height or width? There’s no neighbouring fragment that can be accessed to get an approximation of the derivatives.

Originally posted by PkK:
However I have one question left: Suppose we have hardware that always processes groups of 2x2 fragments. How does that work at the border of a window with uneven height or width? There’s no neighbouring fragment that can be accessed to get an approximation of the derivatives.
A similar situation happens at the border of geometry. There is no logical fragment at that position in the 2x2 quad; however, the hardware will still run the shader calculation as if there were one, and it ignores the results of calculations for fragments that should not exist.
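A sketch of that behavior (my own illustration; the coverage flags and shading function are made up): the whole quad is shaded so that derivatives can still be formed, and the lanes whose fragments are not actually covered are masked out afterwards:

```python
# Sketch: "helper" lanes at a triangle or window border. All four
# lanes of the quad execute the shader; results from uncovered
# positions are simply discarded.

def shade_quad(coords, coverage, shade):
    results = [shade(c) for c in coords]   # all 4 lanes run regardless
    return [r if covered else None         # uncovered lanes are masked out
            for r, covered in zip(results, coverage)]

coords = [(0.0, 0.0), (0.25, 0.0), (0.0, 0.5), (0.25, 0.5)]
coverage = [True, True, False, False]   # bottom row lies outside
print(shade_quad(coords, coverage, lambda c: c[0] + c[1]))
# [0.0, 0.25, None, None]
```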

Regarding branching, it's also the case that the branching granularity is typically bigger than 2x2. For instance, the X1800 has a granularity of 16 pixels, the X1900 48, and the HD 2900XT 64. If any fragment within such a block takes a different path, both paths need to be executed for all fragments in that block.
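A quick cost model of that (my own sketch; the per-path costs are arbitrary numbers): the block pays for one side of the branch only when the condition is uniform across the whole block, and for both sides as soon as a single fragment diverges:

```python
# Sketch: branch cost at the granularity of a hardware block
# (e.g. 16, 48, or 64 fragments depending on the GPU).

def branch_cost(taken_flags, cost_then, cost_else):
    if all(taken_flags):
        return cost_then            # uniform: only the "then" side runs
    if not any(taken_flags):
        return cost_else            # uniform: only the "else" side runs
    return cost_then + cost_else    # divergent: both sides execute

block = [True] * 63 + [False]       # one stray fragment in a 64-block
print(branch_cost(block, 10, 10))   # 20: both paths, for everyone
```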
