Well, here is a brainstorm for SM5 done in 2 minutes while I was licking a frog like Homer
-
Infinite shader length. Currently DX10 can execute a max of 64k instructions (OK, CUDA can execute 2M). Let's increase it a bit more (8-16M perhaps? Or better, make it unlimited). I'm not sure if this is really a HW limitation or a software/convention one, but more instructions would be very good for GPGPU programs and for iterating really big tree structures.
-
Shader function recursion. Currently all shader functions are inlined. Let's implement a small "stack" so we could call functions recursively to iterate trees, etc… A stack of 1M (considering most cards are reaching 1GB of VRAM) would be fine (VStudio uses 1M by default, that's why I say that number).
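For illustration, a hypothetical sketch of what recursion would allow: summing a tree whose nodes are packed into a texture (the node layout and names are invented for the example, and texelFetch2D assumes EXT_gpu_shader4). This does not compile today precisely because everything is inlined:

uniform sampler2D nodeTex; // hypothetical: x = node value, y/z = child indices, -1.0 = no child

float sumSubtree ( int node )
{
    vec4 n = texelFetch2D(nodeTex, ivec2(node, 0), 0);
    float s = n.x;
    if ( n.y >= 0.0 ) s += sumSubtree(int(n.y)); // recursive call: needs the proposed HW stack
    if ( n.z >= 0.0 ) s += sumSubtree(int(n.z));
    return s;
}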
-
Shader shared memory. Putting this in a fragment shader you would be able to synchronize the rendering thread blocks and read/write common memory without having to render another pass to another FBO.
Currently you cannot read values (or rewrite existing ones) in a FBO being rendered to because it is "in use". With this small quantity of shared memory (for example, 16KB) you could store some data there for reading or writing and synchronize thread blocks (like the CUDA __syncthreads()):
uniform sampler2D baseTexSampler;
varying vec2 uv;
__shared__ float maxAlpha; // hypothetical: shared across the thread block, assumed initialized to 0

void main ()
{
    vec4 base = texture2D(baseTexSampler, uv);
    lock(maxAlpha) // hypothetical critical section around the shared variable
    {
        if ( base.a > maxAlpha )
        {
            maxAlpha = base.a;
        }
    }
    __sync(); // sync thread blocks. Wait until all reach this point so maxAlpha is the same for all.
    gl_FragColor = base * maxAlpha;
}
This can be very good for GPGPU (I'm thinking of reductions, for example) and for image kernels… Imagine you want to do a Gaussian filter divided into horizontal and vertical passes… today you need two passes for that… with the shared memory you could do it in only one pass.
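For reference, here is a minimal sketch of the conventional horizontal pass of such a separable Gaussian (the vertical pass is identical with the offset on Y; the uniform names and the 5-tap binomial weights are just assumptions for the example). Today each pass needs its own FBO render, which is exactly the cost per-block shared memory could remove:

uniform sampler2D srcTex;
uniform float texelWidth; // 1.0 / texture width
varying vec2 uv;

void main ()
{
    float w[3];
    w[0] = 0.375; w[1] = 0.25; w[2] = 0.0625; // 1-4-6-4-1 binomial kernel, normalized
    vec4 sum = texture2D(srcTex, uv) * w[0];
    for ( int i = 1; i <= 2; ++i )
    {
        vec2 off = vec2(texelWidth * float(i), 0.0);
        sum += texture2D(srcTex, uv + off) * w[i];
        sum += texture2D(srcTex, uv - off) * w[i];
    }
    gl_FragColor = sum;
}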
The synchronize method could perhaps also be used to control multi-GPU cards like the GX2… but I'm not sure if all those thread locks would hurt performance too much… well… CUDA has this and seems to work OK.
Basically we can define three memory variable types (declarations sketched below):
shared would share memory between GPU thread blocks ( 2x2 or 8x8 pixels approx., depends )
global could access system RAM ( fear the speed… the new PCI Express 2.0 and HT 3.0 can help here )
device could access global data in VRAM across GPU multiprocessors ( fear the speed, but less than global )
Notice that with the new Nehalem and Fusion hybrids, global and device won't be as terrible as you think…
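To make the three proposed storage classes concrete, a purely hypothetical declaration sketch (none of these qualifiers exist in GLSL today, and the sizes are arbitrary):

shared float blockMax;            // visible to the 2x2 / 8x8 thread block, on-chip, like CUDA shared memory
global float sysHistogram[256];   // lives in system RAM, reached over PCI Express 2.0 / HT 3.0
device vec4 photonBuffer[65536];  // lives in VRAM, visible across all GPU multiprocessors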
- Advanced tessellation unit. Basically a more complex geometry shader. I'm not sure if the current GS model allows texture fetches (the DX10 docs are not very clear about this, and the OpenGL GF8500 drivers return zero for the geometry_shader4 MAX_GEOMETRY_TEXTURE_IMAGE_UNITS_EXT). Well, if it does not, please add that feature + filtering support and FP32/64. That way we could do adaptive tessellation for terrains, complex LOD, etc… (see the sketch below).
Also try to improve the speed of the GS a bit, because it is a bit slow currently… and increase the maximum output vertex limit a bit too (currently it is 1024?).
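As a starting point, here is a minimal EXT_geometry_shader4 sketch of heightmap displacement in the GS, assuming the driver really exposes geometry-shader texture units (the uniform names and the 0.01 / 20.0 scales are made up, and the VS is assumed to pass world-space positions through). With working fetches and a bigger output limit, the same idea extends to adaptive subdivision:

#version 120
#extension GL_EXT_geometry_shader4 : enable

uniform sampler2D heightMap; // hypothetical terrain heightmap
uniform mat4 mvp;

void main ()
{
    for ( int i = 0; i < gl_VerticesIn; ++i )
    {
        vec4 wsPos = gl_PositionIn[i]; // world-space position passed through by the VS
        float h = texture2DLod(heightMap, wsPos.xz * 0.01, 0.0).r; // explicit LOD: no derivatives in a GS
        gl_Position = mvp * (wsPos + vec4(0.0, h * 20.0, 0.0, 0.0));
        EmitVertex();
    }
    EndPrimitive();
}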
Perhaps Humus or another ATI person could explain the R600 tessellation unit to us better so we could add some ideas here.
-
Full IEEE754 double precision support in textures, blending, filtering, etc…
-
New pipeline. OK, what I'm going to post here I'm not sure is a good idea… but here it is:
[OPTIONAL]
Geometry shader #1
|
|
Vertex shader
|
|
Triangle soup update
In order to perform raytracing, there is an optional pipeline part. Basically the geometry shader and vertex shader are executed as always… but the vertex shader in this phase has to emit only a world-space position instead of an hClip position. All the triangles are then stored internally in VRAM inside a uniform grid.
This could also be done by supporting a "render to volume texture", but just storing triangles as 3 vertex positions.
This is completely optional; we will do it only if we need raytracing as described above.
Ok, now the rest of the pipeline:
Geometry shader
|
|
Vertex shader
|
|
Clipping
|
|
Fragment Shader
|
|
Blend shader
1) The GS #1 composes triangles from the original mesh or tessellates the mesh in local space. That works like it does now.
2) The vertex shader outputs a clip-space position and vertex Z and prepares varyings as always.
3) Clipping is performed using the vertex shader's hClip output.
4) The fragment shader is executed for each point in the clipped triangle.
Now, with the world-space triangle soup in the uniform grid (or from a render to volume texture) that we set up previously, we can add some new GLSL instructions called "closestHit", "hitTest", "gatherPoints" and "computeFlux".
uniform sampler2D baseTexSampler;
varying vec2 uv;
uniform vec3 wsPos, lightPos;

void main ( stream& outStream )
{
    vec4 base = texture2D(baseTexSampler, uv);
    base += hitTest(wsPos, lightPos); // hypothetical shadow-ray test against the world-space triangle grid
    outStream["base"] = base;
    outStream["depth"] = gl_Depth; // depth from triangle interpolants
    outStream["other"] = 12;
}
The hitTest accepts a ray origin and a ray end position. It returns 0 if there is no triangle-ray intersection, or 1 if there is. Basically it will be used for shadow tests. It just executes a very simple DDA over the uniform grid (or volume texture). This could also be implemented by just increasing the shader length limits and doing it manually.
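To give an idea of the manual route, here is a minimal sketch that stands in for the DDA with a fixed-step ray march over a binary occupancy volume texture (the grid texture, bounds and step count are assumptions; a real hitTest would also intersect the triangles stored in each occupied cell instead of stopping at the cell level):

uniform sampler3D gridTex;      // hypothetical occupancy grid: > 0 means the cell contains geometry
uniform vec3 gridMin, gridSize; // world-space bounds of the grid

float manualHitTest ( vec3 rayOrigin, vec3 rayEnd )
{
    const int STEPS = 64;                        // fixed step count keeps the loop unrollable
    vec3 p0 = (rayOrigin - gridMin) / gridSize;  // to [0,1] grid space
    vec3 p1 = (rayEnd - gridMin) / gridSize;
    for ( int i = 1; i < STEPS; ++i )
    {
        vec3 p = mix(p0, p1, float(i) / float(STEPS));
        if ( texture3D(gridTex, p).r > 0.5 )
            return 1.0;                          // occupied cell found -> treat as occluded
    }
    return 0.0;                                  // no hit
}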
The closestHit is the same but returns a list of barycentric hit points and triangle indices. This function is much more complicated… and perhaps too advanced for the current HW… but well, I should mention it for reflections.
It could also be used for GPU physics (don't they claim they can do GPU physics?… well, this function is basic in every physics engine…).
We will need some serious changes for this.
The gatherPoints can be used for dynamic AO (usage sketched below). It basically collects triangle hits around a hemisphere given a normal direction.
We will need some serious changes for this.
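Just to make the idea concrete, a hypothetical usage sketch (the signature, the 32-sample cap and the uniform names are invented for the example; nothing like this exists today):

uniform vec3 wsPos, wsNormal;
uniform float aoRadius;

float dynamicAO ()
{
    // count triangle hits inside the hemisphere around the surface normal and darken accordingly
    int hits = gatherPoints(wsPos, wsNormal, aoRadius);
    return 1.0 - clamp(float(hits) / 32.0, 0.0, 1.0); // 32 = assumed maximum gathered samples
}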
The computeFlux just gathers photon mapping samples around a sphere using the mentioned grid.
We will need some serious changes for this.
Since we have all the scene triangles inserted internally into the uniform grid (or the rendered volume texture), this is possible and works for dynamic scenes too! Notice I only want this for raytracing secondary rays… primary rays are nonsense because you can already do them well with rasterization.
The main problem with closestHit, gatherPoints and computeFlux is that, as Korval said well, they lead us to a raytracing API… and the current OpenGL structure could be seriously affected. For hitTest we could just hardcode the uniform grid stage in 3), or increase the shader length limit and do it manually (but the ideal is to hardcode it and implement gatherPoints, etc… too).
Well, whatever it is, if this won't fit well in OpenGL, notice I set the title of this post to "Shader Model 5 suggestions" and not "OpenGL 3.0 suggestions".
Just notice another thing… I write output values with "outStream[key]". That's because I want to write "raw structs", not fixed outputs like we do at the moment. That's vital for the blend shader that comes now:
- Now a "blend shader" is executed, so you can control the alpha blending individually for each pixel output by the fragment shader. Basically it works like this:
-
Each time a pixel is output from the fragment shader to the screen position [X,Y,Z], it is saved into a layered array list. To avoid implementing "linked lists" in the framebuffer, which is not very HW-friendly, we define a maximum layer depth… for example 32…
-
Once all the pixels have been written, we execute a one-pass blend shader for each screen position… Something like:
bool myDepthSort ( const stream& a, const stream& b )
{
    const float depthA = a.GetValue("depth");
    const float depthB = b.GetValue("depth");
    return depthA < depthB;
}

vec4 main ( const int x, const int y )
{
    // [x,y] are the screen coordinates of the pixel being processed in the framebuffer
    vec4 finalCol = vec4(0.0);
    const array<stream&>& entries = gl_LayerEntries(x, y);
    entries.sort(myDepthSort); // sort the layer list by depth
    foreach ( s in entries )
    {
        // "base" is the key written by the fragment shader above
        finalCol = blend(finalCol, s.GetValue("base").rgb, s.GetValue("base").a);
    }
    return finalCol;
}
Notice you fetch the framebuffer layer entries with gl_LayerEntries(x,y), which accepts the [X,Y] screen coordinates of the pixel being processed and returns a read-only array with the layer values for that pixel. That way you would be able to fetch ANY pixel in the framebuffer, not only the one being processed.
If this kills the performance we could restrict it to just the current pixel and not allow fetching other pixels.
With this one-pass shader you can do things like order independent transparency sorting, SSS, independent alpha blending for each point, etc…
The question is whether we could do the same just with multiple render targets and shaders like "depth peeling". Perhaps yes, but then we would need more than 4 MRTs (or more passes), much more texture bandwidth/memory, and a way to control the alpha blend mode per pixel.
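For comparison, a minimal sketch of the depth-peeling fragment shader that the multi-pass alternative would use (prevDepth is assumed to hold the depth buffer of the previous peel; the uniform names and the epsilon are made up). Each pass discards fragments already captured by an earlier peel, so N transparency layers cost N full geometry passes:

uniform sampler2D prevDepth;      // depth of the previously peeled layer
uniform sampler2D baseTexSampler;
uniform vec2 screenSize;
varying vec2 uv;

void main ()
{
    vec2 screenUV = gl_FragCoord.xy / screenSize;
    float peeled = texture2D(prevDepth, screenUV).r;
    if ( gl_FragCoord.z <= peeled + 0.00001 )
        discard;                   // this fragment was already captured by an earlier peel
    gl_FragColor = texture2D(baseTexSampler, uv);
}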
Well, what do you think about all this?