Instance shaders?



Golgoth
08-07-2006, 12:21 PM
Hi all!

I've noticed a major increase in loading time when loading a complex shader, especially one that uses for loops. Each loop iteration seems to increase the memory needed.

For example, this simple fragment shader:

void main(void)
{
    for (int i = 0; i < 1; ++i)
    {
        // Do something
    }
}

Will load about 8 times faster than:

void main(void)
{
    for (int i = 0; i < 9; ++i)
    {
        // Do something
    }
}

So my question is:

Let's say we render thousands of meshes using the same shader; the only difference is the uniform sampler2D values sent to it.

What would you recommend?

1 - Standard: load the same shader for each mesh, which takes a while to load.
2 - Load the shader once and use it for all meshes, updating the uniform sampler2D for each mesh every frame. Is this even possible? I have my doubts.
3 - Other suggestions?

thx

sqrt[-1]
08-07-2006, 08:39 PM
Update the uniform sampler2D for each mesh, if necessary. You could also just bind the new textures to a fixed set of texture stages; remember that a uniform sampler2D points to a texture stage, not a texture ID.

Michael Gold
08-07-2006, 10:02 PM
What you really want here is pseudo-instancing, not by changing a shader constant but setting the per-instance value with immediate mode in between each draw call. Enable all attributes required by the shader *except* the instanced attribute, which will instead be read from current state.

If you google "pseudo-instancing" you will find a white paper on nvidia.com which describes the technique.
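
To make the "read from current state" idea concrete, here is a toy model in C of how an attribute is fetched during a draw. It is only a sketch under stated assumptions: AttribStateModel, set_current_attrib and fetch_attrib are hypothetical names invented for illustration, not GL calls, and each attribute is reduced to a single float. The real calls it stands in for are mentioned in the comments.

```c
#include <assert.h>

#define NUM_ATTRIBS 4   /* assumed small attribute count for this model */

/* Toy model of vertex attribute fetch; not a real GL structure. */
typedef struct {
    int array_enabled[NUM_ATTRIBS];        /* 1 if the attribute reads from an array */
    const float *array_data[NUM_ATTRIBS];  /* one float per vertex, for simplicity */
    float current_value[NUM_ATTRIBS];      /* "current state", set in immediate mode */
} AttribStateModel;

/* Stands in for an immediate-mode call such as glMultiTexCoord4f or
   glVertexAttrib4f issued between draw calls. */
void set_current_attrib(AttribStateModel *s, int attrib, float value)
{
    assert(attrib >= 0 && attrib < NUM_ATTRIBS);
    s->current_value[attrib] = value;
}

/* Value the vertex shader sees for `attrib` at vertex `vertex`: array data
   if the array is enabled, otherwise the current (per-instance) value. */
float fetch_attrib(const AttribStateModel *s, int attrib, int vertex)
{
    assert(attrib >= 0 && attrib < NUM_ATTRIBS);
    if (s->array_enabled[attrib])
        return s->array_data[attrib][vertex];
    return s->current_value[attrib];
}
```

In this model, every vertex of a draw sees the same value for an attribute whose array is disabled, so calling set_current_attrib once before each draw gives a per-instance value without touching any uniform.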

API
08-08-2006, 03:05 AM
And about the load time using loops ...

Remember that in a shader, loops are commonly resolved by loop unrolling. So after compilation, the loop body of a shader with 9 iterations is emitted 9 times, which makes it roughly 8 to 9 times larger than the same shader with 1 iteration.

For example, on a card that doesn't support real branching in the pixel shader, like mine :-( , loop unrolling is the only way to handle loops (so I can only use a constant number of iterations).
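
As a rough illustration of why an unrolled shader grows with the iteration count, the C sketch below duplicates a loop body the way a fully unrolling compiler would. The function unroll_loop is a hypothetical helper made up for this example; it works on source text only and says nothing about what any particular driver actually generates.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a real driver API: writes `iterations` copies of
   `body` into `out`, the way a compiler that fully unrolls a constant-count
   loop would duplicate the loop body.
   Returns the number of characters written, or 0 if `out` is too small. */
size_t unroll_loop(const char *body, int iterations, char *out, size_t out_size)
{
    assert(body != NULL && out != NULL && iterations >= 0);

    size_t body_len = strlen(body);
    size_t needed = body_len * (size_t)iterations;
    if (needed + 1 > out_size)
        return 0;

    for (int i = 0; i < iterations; ++i)
        memcpy(out + (size_t)i * body_len, body, body_len);
    out[needed] = '\0';
    return needed;
}
```

Calling it with 1 and then 9 iterations on the same body returns lengths that differ by exactly a factor of 9, which is the kind of growth the compiler and the hardware then have to deal with.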

API

Golgoth
08-08-2006, 10:22 AM
Thx for your input, guys!

To sum up, I'm using a GeForce 7800 and I'm working on an uber shader that supports multiple lights, textures, bumps, shadows and so on. I think I've hit a brick wall with the loading time, though. If I can load the uber shader only once to render any mesh, it may solve my problem.


Originally posted by sqrt[-1]:
you could just bind the new textures to a fixed set of texture stages
Interesting, I'm not quite sure what you meant by texture stages. Did you mean render state classes that hold uniform data client side, Wild Magic engine style?


Originally posted by Michael Gold:
What you really want here is pseudo-instancing, not by changing a shader constant but setting the per-instance value with immediate mode in between each draw call. Enable all attributes required by the shader *except* the instanced attribute, which will instead be read from current state.
Sounds exactly like what I need.

I've found this:

http://download.nvidia.com/developer/SDK/Individual_Samples/DEMOS/OpenGL/src/glsl_pseudo_instancing/docs/glsl_pseudo_instancing.pdf

but going by what you described, the term pseudo-instancing doesn't seem to be related to my problem, so I'm still looking.


Originally posted by API:
the loops are commonly resolved using loop unrolling
Commonly? Does this mean that a card that supports real branching may unroll as well, depending on the shader complexity?

Thx again

k_szczech
08-08-2006, 10:50 AM
Originally posted by Golgoth:
Commonly? Does this mean that a card that supports real branching may unroll as well, depending on the shader complexity?
Why not? As long as it improves performance, that's what we want from the compiler.

Komat
08-08-2006, 11:14 AM
Originally posted by Golgoth:
Interesting, I'm not quite sure what you meant by texture stages. Did you mean render state classes that hold uniform data client side, Wild Magic engine style?
He meant texture units. The sampler uniform in a GLSL shader contains the index of the texture unit to which the texture is bound. So if you always bind the normal map to texture unit 1, you can set that sampler only once for each shader and then bind a different texture to that unit for each object.
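
A small C model may help picture the split between the sampler uniform and the texture binding. This is only a sketch: GlStateModel, ShaderModel and the three functions are hypothetical stand-ins invented here, and the unit count of 8 is an arbitrary assumption. The comments name the real GL calls each step corresponds to.

```c
#include <assert.h>

#define NUM_TEXTURE_UNITS 8   /* assumed size for this model */

/* Toy model of the relevant GL state; none of these names are real GL calls. */
typedef struct {
    unsigned bound_texture[NUM_TEXTURE_UNITS]; /* texture ID bound to each unit */
} GlStateModel;

typedef struct {
    int normal_map_unit;  /* value of the sampler uniform: a unit index */
} ShaderModel;

/* Stands in for glUniform1i(location_of_sampler, unit): done once per shader. */
void set_sampler_unit(ShaderModel *shader, int unit)
{
    assert(unit >= 0 && unit < NUM_TEXTURE_UNITS);
    shader->normal_map_unit = unit;
}

/* Stands in for glActiveTexture(GL_TEXTURE0 + unit) followed by
   glBindTexture(GL_TEXTURE_2D, texture_id): done once per object. */
void bind_texture(GlStateModel *gl, int unit, unsigned texture_id)
{
    assert(unit >= 0 && unit < NUM_TEXTURE_UNITS);
    gl->bound_texture[unit] = texture_id;
}

/* Which texture the shader's sampler would read from during a draw. */
unsigned sampled_texture(const GlStateModel *gl, const ShaderModel *shader)
{
    return gl->bound_texture[shader->normal_map_unit];
}
```

In this model, set_sampler_unit is called once with unit 1, and only bind_texture changes from object to object; the shader itself is never reloaded or modified.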



Originally posted by Golgoth:
but going by what you described, the term pseudo-instancing doesn't seem to be related to my problem, so I'm still looking.
Although it is called pseudo-instancing, this method can be used to send a limited number of per-object values to the shaders even if the objects are entirely unrelated. It will not, however, help you with changing the index of the sampler uniform.


Originally posted by Golgoth:
Commonly? Does this mean that a card that supports real branching may unroll as well, depending on the shader complexity?
The driver may assume that it is faster to do so, because branches and loops have a runtime cost. There are also hw limitations on which operations are supported inside some flow control constructs, so the driver may have to unroll the loop to work around them.

Golgoth
08-08-2006, 11:16 AM
My lack of HW knowledge slaps me in the face once more...


Originally posted by Golgoth:
Commonly? Does this mean that a card that supports real branching may unroll as well, depending on the shader complexity?
What I meant by that is: first, if the shader is not unrolled, how is it handled? Is a JIT compiling technique possible with modern hardware? And how does the card/compiler draw the line between unrolling a loop and using another method?

thx

Golgoth
08-08-2006, 12:23 PM
Hi Komat!


Originally posted by Komat:
He meant texture units. The sampler uniform in the GLSL shader contains the index of the texture unit to which the texture is bound. So if you always bind the normal map to texture unit 1, you can set that sampler only once for each shader and then bind a different texture to that unit for each object.
Yes, the texture units are fixed: colormap 0, specmap 1, bumpmap 2, reflecmap 3, projectmap 4. Keep in mind that I only have 1 shader. I think that pretty much covers the original question, but it leads me to another one:

What if the shader expects a bump map sampler2D but, for some reason, it has not been sent to the shader? Is there any known way to query whether a texture unit is valid inside a shader?

Something that would do this:

uniform sampler2D u_NormalMap;

void main(void)
{
    // Pseudocode for the check I'd like to have
    if (u_NormalMap != NULL)
    {
    }
}



Originally posted by Komat:
The driver may assume
I'm not crazy about the "ass u me" part... are there any handy specs on this?

thx

API
08-08-2006, 12:40 PM
Originally posted by Golgoth:
What I meant by that is: first, if the shader is not unrolled, how is it handled? Is a JIT compiling technique possible with modern hardware? And how does the card/compiler draw the line between unrolling a loop and using another method?
I suppose that if the compiler detects that the loop can be unrolled, it unrolls it to get better performance.

But there are situations where you can't unroll the loop, for example if the number of iterations is unknown at compile time. On my card the compiler should report an error in that case, but on your card the compiler can use the branch instructions.

I think that you don't need a very complex compiler to manage this; it is very common.

Another reason could be that the index is modified inside the loop, but I'm not sure if this is allowed.

Komat
08-08-2006, 12:52 PM
Originally posted by Golgoth:
What if the shader expects a bump map sampler2D but, for some reason, it has not been sent to the shader? Is there any known way to query whether a texture unit is valid inside a shader?
There is no way to detect that. All uniforms are initialized to zero when the shader is linked, so an uninitialized sampler should reference the first texture unit.
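
Since the shader cannot detect a missing texture, one possible workaround is for the application to tell it explicitly, in the same spirit as the u_BumpType uniform that appears later in this thread. The C sketch below is an assumption-laden illustration: Material, MaterialFlags and material_flags are hypothetical names, and it assumes the application stores 0 for "no texture" (0 is the name GL reserves for the default texture, so glGenTextures never returns it).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-mesh material; a texture ID of 0 means "no texture". */
typedef struct {
    unsigned color_map;
    unsigned normal_map;
} Material;

/* Hypothetical values the application would upload per mesh as int
   uniforms (for example with glUniform1i) so the shader can branch on them. */
typedef struct {
    int has_color_map;
    int has_normal_map;
} MaterialFlags;

/* Derive the flags on the client side, where the information exists. */
MaterialFlags material_flags(const Material *m)
{
    assert(m != NULL);
    MaterialFlags f;
    f.has_color_map = (m->color_map != 0);
    f.has_normal_map = (m->normal_map != 0);
    return f;
}
```

The shader would then test the uploaded flag instead of the sampler itself. Whether that branch is cheap depends on the hardware and driver, as discussed elsewhere in the thread.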




Originally posted by Golgoth:
I'm not crazy about the "ass u me" part... are there any handy specs on this?
As far as I know there is no document covering this, since it depends heavily on the hw and the driver version. Some information about hw/driver limitations can be found in the nVidia and ATI SDKs and papers, or in the DirectX documentation.

Golgoth
08-08-2006, 01:08 PM
I think this may help; the example here shows the structure I've chosen:

void main(void)
{
    // Start from zero so the += accumulation below is well defined.
    vec3 l_color = vec3(0.0);

    for (int i = 0; i < gl_MaxLights; ++i)
    {
        if (i < u_LightCount)
        {
            SetLight(i);

            if (u_BumpType == 1)
                l_color += SetDot3(i);
            else if (u_BumpType == 2)
                l_color += SetParallax(i);
            else if (u_BumpType == 3)
                l_color += SetRelief(i);
            else if (u_BumpType == 4)
                l_color += SetCurved(i);
        }
    }

    // l_ambient, l_diffuse, l_specular and l_lightColor are not declared in
    // this excerpt; presumably they are globals filled in by SetLight().
    l_ambient *= gl_FrontMaterial.ambient.rgb + gl_LightModel.ambient.rgb;
    l_diffuse *= gl_FrontMaterial.diffuse.rgb;
    l_specular *= gl_FrontMaterial.specular.rgb;
    l_lightColor = gl_FrontMaterial.emission.rgb + l_ambient + l_diffuse + l_specular;

    l_color *= l_lightColor;

    gl_FragColor = vec4(l_color, 1.0);
}

Note that the u_LightCount uniform is used because I use the gl_* variables, but the compiler still unrolls the loop 8 times (8 being gl_MaxLights on my GeForce 7800). The shader is quite complex and the loading time is massive when loading the same shader once per mesh, so loading it only once and changing the texture ID client side will work just fine!

What do you guys think about this? Can you see any brick wall I may hit doing so?

Komat
08-08-2006, 01:19 PM
Originally posted by Golgoth:
What I meant by that is: first, if the shader is not unrolled, how is it handled? Is a JIT compiling technique possible with modern hardware?
If the loop is not unrolled, the driver generates looping and/or jump instructions.

There is no JIT compilation on the GPU. The driver may keep several JIT-optimized variants of your shader based on the values set in the uniform variables, especially the boolean ones; however, it is not required to do so.



Originally posted by Golgoth:
And how does the card/compiler draw the line between unrolling a loop and using another method?
This is driver specific and it is probably based on known limitations of the HW and some heuristics.

For example, as far as I know there is limited support for relative indexing (e.g. foo[ loop_counter ] ) on current hw. In a fragment shader it is not possible to use it to address constants (uniforms), so if your loop does that, it will be unrolled.

Komat
08-08-2006, 01:35 PM
Originally posted by Golgoth:
What do you guys think about this? Can you see any brick wall I may hit doing so?
I assume that one problem might be that you are using the loop counter to index the uniforms, which is not supported on the GF7800.

Even if you succeed in removing the things that are not supported by the hw, the driver (current or future) may still choose to unroll the loop because it gives it better instruction reordering opportunities.

Golgoth
08-08-2006, 01:52 PM
Considering that unrolling was an issue for loading time, I've succeeded in using the same shader for all meshes. It takes about 3 sec to load on an AMD 3800, and may go up to 5 sec when everything is in place in the uber shader. As I go along, I might find ways to optimize the shader, but hey: make it work, optimize later!

Thx again gentlemen!