Shader Model 5 suggestions

Well, here is a brainstorm for SM5 done in 2 minutes while I was licking a frog like Homer.

  1. Infinite shader length. Currently DX10 can execute a maximum of 64k instructions (OK, CUDA can execute 2M). Let's increase it a bit more (8-16M perhaps? or better, unlimited). I'm not sure if this is really a HW limitation or a software/convention one, but more instructions would be very good for GPGPU programs and for iterating really big tree structures.

  2. Shader function recursion. Currently all shader functions are inlined. Let's implement a small "stack" so we could call functions recursively to iterate trees, etc. A stack of 1MB (considering most cards are reaching 1GB of VRAM) would be fine (Visual Studio uses 1MB by default, which is why I pick that number).
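Until something like that exists, the usual workaround is flattening the recursion yourself. Here is a rough CUDA-style sketch (the Node layout, STACK_SIZE and the "sum the values" visit are all invented for illustration, not any real API) of iterating a binary tree with an explicit per-thread stack instead of recursive calls:

// treeWalk.cu -- sketch of replacing recursion with an explicit per-thread
// stack, the usual workaround while shaders cannot call functions recursively.
// Node layout, STACK_SIZE and the "sum the values" visit are illustrative only.
#include <cuda_runtime.h>

#define STACK_SIZE 64                 // enough for a tree roughly 64 levels deep

struct Node { int left; int right; float value; };   // -1 means "no child"

__device__ float walkTree(const Node* nodes, int root)
{
    float sum = 0.0f;
    int   stack[STACK_SIZE];          // lives in per-thread local memory
    int   top = 0;
    stack[top++] = root;

    while (top > 0)
    {
        int  idx = stack[--top];      // pop
        Node n   = nodes[idx];
        sum += n.value;               // "visit" the node

        // Push children instead of recursing into them.
        if (n.left  >= 0 && top < STACK_SIZE) stack[top++] = n.left;
        if (n.right >= 0 && top < STACK_SIZE) stack[top++] = n.right;
    }
    return sum;
}

A HW-provided stack would just let the compiler do this for you, without the fixed STACK_SIZE cap.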

  3. Shader shared memory. With this in a fragment shader you would be able to synchronize the rendering thread blocks and read/write common memory without having to render another pass to another FBO.

Currently you cannot read values (or rewrite existing ones) in an FBO being rendered because it is "in use". With a small quantity of shared memory (for example, 16KB) you could store some data there for reading or writing and synchronize thread blocks (like CUDA's __syncthreads()).

uniform sampler2D baseTexSampler;
varying vec2 uv;
__shared__ float maxAlpha;

void main ()
{
   vec4 base = texture2D(baseTexSampler,uv);                             

   lock(maxAlpha)
   {
      if ( base.a>maxAlpha )
      {
         maxAlpha = base.a;
      }
   }

   __sync(); // sync thread blocks: wait until all threads reach this point so maxAlpha is the same for all.

   gl_FragColor = base*maxAlpha;
}

This can be very good for GPGPU (I'm thinking of reductions, for example) and for image kernels... Imagine you want to do a Gaussian filter split into horizontal and vertical passes: currently you need two passes for that, but with the shared memory you could do it in only one.

Perhaps the synchronization mechanism could also be used to control multi-GPU cards like the GX2... but I'm not sure how much all those thread locks would hurt performance... well, CUDA has this and it seems to work OK.
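For reference, here is roughly how the same block-wide maximum reduction looks in real CUDA today with __shared__ and __syncthreads(). It's a minimal sketch, assuming a 256-thread block; the kernel and buffer names are made up:

// blockMaxAlpha.cu -- minimal sketch of a block-wide max reduction in CUDA.
// Assumes blockDim.x == 256; names (blockMaxAlpha, alphaIn, blockMax) are illustrative.
#include <cuda_runtime.h>

__global__ void blockMaxAlpha(const float* alphaIn, float* blockMax, int n)
{
    __shared__ float sdata[256];            // one slot per thread in the block

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    // Load this thread's alpha value (or a neutral value past the end).
    sdata[tid] = (i < n) ? alphaIn[i] : 0.0f;
    __syncthreads();                        // like the proposed __sync()

    // Tree reduction in shared memory: each step halves the active threads.
    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            sdata[tid] = fmaxf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }

    // Thread 0 writes this block's maximum; a second pass (or atomics) would
    // combine the per-block results into a global maximum.
    if (tid == 0)
        blockMax[blockIdx.x] = sdata[0];
}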

Basically we can define three memory variable types:

shared would share memory between the threads of a GPU thread block (2x2 or 8x8 pixels approx., it depends)

global could access system RAM (fear the speed... the new PCI Express 2.0 and HT 3.0 can help here)

device could access global data in VRAM across the GPU multiprocessors (fear the speed too, but less than global)

Note that with the new Nehalem and Fusion hybrids, global and device won't be as terrible as you might think...

  4. Advanced tessellation unit. Basically a more complex geometry shader. I'm not sure if the current GS model allows fetching textures (the DX10 docs are not very clear about this, and the OGL GF8500 drivers return zero for the geometry_shader4 MAX_GEOMETRY_TEXTURE_IMAGE_UNITS_EXT). Well, if it does not, please add that feature + filtering support and FP32/64. That way we could do adaptive tessellation for terrains, complex LOD, etc.

Also, try to improve the speed of the GS a bit, because it is currently rather slow... and increase the maximum output vertex limit too (currently 1024?).

Perhaps Humus or another ATI person could explain the R600 tessellation unit to us better, so we could add some ideas here.

  5. Full IEEE754 double precision support in textures, blending, filtering, etc.

  6. New pipeline. OK, I'm not sure if what I'm going to post here is a good idea... but here it is:

[OPTIONAL]
Geometry shader #1
|
|
Vertex shader
|
|
Triangle soup update

To allow raytracing, there is an optional pipeline part. Basically the geometry shader and vertex shader are executed as always... but in this phase the vertex shader has to emit only a world-space position instead of a clip-space position. All the triangles are then stored internally in VRAM inside a uniform grid.
This could also be done by supporting a "render to volume texture", just storing triangles as 3 vertex positions.

This is completely optional, we will do this only if we need raytracing as described above.
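To make the grid idea concrete, here is a minimal CUDA-style sketch of how the driver could bin world-space triangles into a uniform grid with atomics. Everything here (buildGrid, cellCount, cellTris, GRID_RES, the fixed per-cell capacity) is invented for illustration, not any real API:

// buildGrid.cu -- hypothetical sketch of binning world-space triangles into a
// uniform grid.  All names and the fixed per-cell capacity are illustrative.
#include <cuda_runtime.h>

#define GRID_RES      64            // grid resolution per axis
#define MAX_TRIS_CELL 32            // fixed capacity per cell

struct Tri { float3 v0, v1, v2; };

__device__ int cellIndex(int x, int y, int z)
{
    return (z * GRID_RES + y) * GRID_RES + x;
}

__global__ void buildGrid(const Tri* tris, int numTris,
                          float3 sceneMin, float3 cellSize,
                          int* cellCount, int* cellTris)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numTris) return;

    Tri tri = tris[t];

    // Triangle AABB, per axis.
    float lox = fminf(fminf(tri.v0.x, tri.v1.x), tri.v2.x);
    float loy = fminf(fminf(tri.v0.y, tri.v1.y), tri.v2.y);
    float loz = fminf(fminf(tri.v0.z, tri.v1.z), tri.v2.z);
    float hix = fmaxf(fmaxf(tri.v0.x, tri.v1.x), tri.v2.x);
    float hiy = fmaxf(fmaxf(tri.v0.y, tri.v1.y), tri.v2.y);
    float hiz = fmaxf(fmaxf(tri.v0.z, tri.v1.z), tri.v2.z);

    // Range of cells the AABB overlaps (conservative binning).
    int x0 = max(0, (int)((lox - sceneMin.x) / cellSize.x));
    int y0 = max(0, (int)((loy - sceneMin.y) / cellSize.y));
    int z0 = max(0, (int)((loz - sceneMin.z) / cellSize.z));
    int x1 = min(GRID_RES - 1, (int)((hix - sceneMin.x) / cellSize.x));
    int y1 = min(GRID_RES - 1, (int)((hiy - sceneMin.y) / cellSize.y));
    int z1 = min(GRID_RES - 1, (int)((hiz - sceneMin.z) / cellSize.z));

    for (int z = z0; z <= z1; ++z)
      for (int y = y0; y <= y1; ++y)
        for (int x = x0; x <= x1; ++x)
        {
            int c    = cellIndex(x, y, z);
            int slot = atomicAdd(&cellCount[c], 1);   // append triangle index t
            if (slot < MAX_TRIS_CELL)
                cellTris[c * MAX_TRIS_CELL + slot] = t;
        }
}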

Ok, now the rest of the pipeline:

Geometry shader
|
|
Vertex shader
|
|
Clipping
|
|
Fragment Shader
|
|
Blend shader

1) GS#1 will assemble triangles from the original mesh or tessellate the mesh in local space. That works like it does now.

2) The vertex shader will output a clip-space position and vertex Z, and prepare varyings like always.

3) Clipping is performed using the vertex shader's clip-space output.

4) The fragment shader will be executed for each point in the clipped triangle.

Now, with the world-space triangle soup we stored previously in the uniform grid (or from a render to volume texture), we can add some new GLSL instructions called "closestHitTest", "hitTest", "gatherPoints" and "computeFlux".

uniform sampler2D baseTexSampler;
varying vec2 uv;
uniform vec3 wsPos, lightPos;

void main( stream& outStream )
{
   vec4 base = texture2D(baseTexSampler,uv);
   base += hitTest(wsPos,lightPos);

   outStream["base"] = base;
   outStream["depth"] = gl_FragCoord.z; // depth from triangle interpolants
   outStream["other"] = 12;
}

hitTest accepts a ray origin and a ray end position. It returns 0 if there is no triangle-ray intersection, or 1 if there is. Basically it would be used for shadow tests. It just executes a very simple DDA over the uniform grid (or volume texture). This could also be implemented manually, just by increasing the shader length limits.
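As a rough illustration of what the hardware (or a sufficiently long shader) would have to do, here is a minimal 3D-DDA grid walk sketched in CUDA. It reuses GRID_RES and cellIndex from the binning sketch above, and intersectCellTris() is a hypothetical helper that tests the ray against the triangles stored in one cell; all names are made up:

// hitTestDDA.cu -- sketch of a shadow-ray hit test walking a uniform grid with
// a 3D DDA (Amanatides & Woo style).  intersectCellTris() is a hypothetical
// helper; grid layout matches the binning sketch above.
__device__ bool intersectCellTris(int cell, float3 rayOrigin,
                                  float dx, float dy, float dz, float tLimit);

__device__ bool hitTest(float3 rayOrigin, float3 rayEnd,
                        float3 sceneMin, float3 cellSize)
{
    float dx = rayEnd.x - rayOrigin.x;
    float dy = rayEnd.y - rayOrigin.y;
    float dz = rayEnd.z - rayOrigin.z;

    // Cell containing the ray origin.
    int x = (int)((rayOrigin.x - sceneMin.x) / cellSize.x);
    int y = (int)((rayOrigin.y - sceneMin.y) / cellSize.y);
    int z = (int)((rayOrigin.z - sceneMin.z) / cellSize.z);

    int stepX = dx >= 0.0f ? 1 : -1;
    int stepY = dy >= 0.0f ? 1 : -1;
    int stepZ = dz >= 0.0f ? 1 : -1;

    // Parametric t of the next cell boundary on each axis, plus the t spacing
    // between boundaries (t runs from 0 to 1 over the ray segment).
    float nextX = sceneMin.x + (x + (stepX > 0 ? 1 : 0)) * cellSize.x;
    float nextY = sceneMin.y + (y + (stepY > 0 ? 1 : 0)) * cellSize.y;
    float nextZ = sceneMin.z + (z + (stepZ > 0 ? 1 : 0)) * cellSize.z;

    float tMaxX   = dx != 0.0f ? (nextX - rayOrigin.x) / dx : 1e30f;
    float tMaxY   = dy != 0.0f ? (nextY - rayOrigin.y) / dy : 1e30f;
    float tMaxZ   = dz != 0.0f ? (nextZ - rayOrigin.z) / dz : 1e30f;
    float tDeltaX = dx != 0.0f ? cellSize.x / fabsf(dx) : 1e30f;
    float tDeltaY = dy != 0.0f ? cellSize.y / fabsf(dy) : 1e30f;
    float tDeltaZ = dz != 0.0f ? cellSize.z / fabsf(dz) : 1e30f;

    // Walk cells until the segment ends (t > 1) or the ray leaves the grid.
    while (x >= 0 && x < GRID_RES && y >= 0 && y < GRID_RES &&
           z >= 0 && z < GRID_RES)
    {
        if (intersectCellTris(cellIndex(x, y, z), rayOrigin, dx, dy, dz, 1.0f))
            return true;                              // occluded: in shadow

        if (tMaxX < tMaxY && tMaxX < tMaxZ) { if (tMaxX > 1.0f) break; x += stepX; tMaxX += tDeltaX; }
        else if (tMaxY < tMaxZ)             { if (tMaxY > 1.0f) break; y += stepY; tMaxY += tDeltaY; }
        else                                { if (tMaxZ > 1.0f) break; z += stepZ; tMaxZ += tDeltaZ; }
    }
    return false;                                     // no occluder found
}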

closestHit is the same but returns a list of barycentric hit points and triangle indices. This function is much more complicated... and perhaps too advanced for the current HW... but well, I should mention it for reflections.
It could also be used for GPU physics (don't they claim they can do GPU physics? Well, this function is basic to every physics engine...).
We would need some serious changes for this.

gatherPoints can be used for dynamic AO. It basically collects triangle hits over a hemisphere given a normal direction.
We would need some serious changes for this.

computeFlux just gathers photon-mapping samples around a sphere using the mentioned grid.
We would need some serious changes for this too.

Since we have all the scene triangles inserted internally into the uniform grid (or the rendered volume texture), this is possible and it works for dynamic scenes too! Notice I only want this for raytracing secondary rays... primary rays make no sense because you can do them well with rasterization.

The main problem with closestHit, gatherPoints and computeFlux is that, as Korval rightly said, they lead us towards a raytracing API... and the current OpenGL structure could be seriously affected. For hitTest we could just hardcode the uniform grid stage in 3), or increase the shader length limit and do it manually (but the ideal is to hardcode it and implement gatherPoints, etc. too).
Well, whatever happens, if this doesn't fit well into OpenGL, notice that I titled this post "Shader Model 5 suggestions" and not "OpenGL 3.0 suggestions".

Just notice one other thing... I write output values with "outStream[key]". That's because I want to write "raw structs", not fixed outputs like we do at the moment. That's vital for the blend shader that comes next:

5) Now a "blend shader" is executed, so you can control the alpha blending individually for each pixel output by the fragment shader. Basically it works like this:
  • Each time a pixel is output from the fragment shader, its values for screen position [X,Y,Z] are saved into a layered array list. To avoid implementing "linked lists" in the framebuffer, which is not very HW-friendly, we define a maximum layer depth... for example 32...

  • Once all the pixels have been written we execute a one-pass blend shader for each screen position… Something like:

bool myDepthSort ( const stream& a, const stream& b )
{
   const float depthA = a.GetValue("depth");
   const float depthB = b.GetValue("depth");
   return depthA<depthB;
}

vec4 main ( const int x, const int y)
{
   //[x,y] are the screen coordinates of the pixel being processed in the framebuffer

   vec4 finalCol(0.0);
   
   const array<stream&>& a = gl_LayerEntries(x,y);

   a.sort(myDepthSort);                          

   foreach ( stream in a )
   {
      finalCol = blend(finalCol,stream.GetValue("color").rgb,stream.GetValue("color").a);
   }
   
   return finalCol;
}

Notice that you fetch the framebuffer layer entries with gl_LayerEntries(x,y), which takes the [X,Y] screen coordinate of the pixel being processed and returns a read-only array with the layer values for that pixel. That way you would be able to fetch ANY pixel in the framebuffer, not only the one being processed.
If this kills performance we could restrict it to just the current pixel and not allow fetching other pixels.

With this one-pass shader you can do things like order independent transparency sorting, SSS, independent alpha blending for each point, etc…

The question is whether we could do the same just with multiple render targets and techniques like "depth peeling". Perhaps, but then we would need more than 4 MRTs (or more passes), much more texture bandwidth/memory, and a way to control the alpha blend mode per pixel.
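For what it's worth, the layered array list itself is easy to express with atomics. Here is a minimal CUDA sketch of the capture side that a blend shader like the one above would later sort and composite; MAX_LAYERS, Fragment and appendFragment are all invented for illustration:

// layerBuffer.cu -- sketch of a bounded per-pixel layer buffer filled with
// atomics.  MAX_LAYERS, Fragment and appendFragment are illustrative only;
// a resolve pass would later sort each pixel's entries by depth and blend them.
#include <cuda_runtime.h>

#define MAX_LAYERS 32                       // fixed layer depth, as proposed

struct Fragment { float4 color; float depth; };

__device__ void appendFragment(Fragment* layers, int* layerCount,
                               int width, int x, int y,
                               float4 color, float depth)
{
    int pixel = y * width + x;
    int slot  = atomicAdd(&layerCount[pixel], 1);   // claim the next free layer
    if (slot < MAX_LAYERS)
    {
        Fragment f;
        f.color = color;
        f.depth = depth;
        layers[pixel * MAX_LAYERS + slot] = f;
    }
    // Overflowing fragments are simply dropped in this sketch.
}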

Well, what do you think about all this?

Btw, we should take into consideration the following upcoming CPU-GPU hybrids:

Larrabee/Desher/Polaris
Nehalem

Fusion

In the future I'm pretty sure the GPU and CPU will share transistors and instructions.

Perhaps this "fusion" could allow fragment shaders to fire CPU functions... but I bet it would kill performance... well, we could at least make something to exchange CPU and GPU data faster... who knows.

considering most cards are reaching 1Gb VRAM
I don’t know what graphics cards you’re talking about, but it’s only been in the last year that cards have cracked the 512MB mark.

If they provide recursion of any kind, you can expect the depth to be short. Like maybe 4K of stack space.

And when I say “they”, I mean ATi and nVidia. Intel’s new x86-based GPU may be able to have a more sizable stack, but I wouldn’t even bet on that.

Lets put some example so the people can understand it better… Imagine the transparency sorting…
You can pretty much forget that. Unless you’re on PowerVR hardware, transparency sorting is right out, as is any mechanism like this to make it work.

Now, with enough buffers, you can probably finesse it to work. And that’s the whole point of programmable hardware: to give you the ability to do the things you need done without having IHVs hardcode everything.

Let’s allow to do adaptive tesellation easier( for example, for terrains).
What’s getting in your way? I mean, geometry shaders can do it just fine.

Mesh hierarchical spatial iteration.
Hardcoded features are dead. They died the day the GeForce 3 came out. Everything’s moving to shaders.

If you want this, they will have to expose the ability for you to walk a piece of memory from a shader (which you can basically already do, given G80-class hardware and uniform buffers of huge size), and then you will have to walk it yourself. At no time should IHVs be implementing this themselves.

can be an independent shading stage between the vertex shader and the fragment shader.
I seem to recall that we just got an independent stage between the vertex shader and the fragment shader. Oh yeah, that’s right, it was called a Geometry Shader.

So why can’t you do this with a geometry shader?

I suspect that will be present in Mount Evans
I will bet you $1000 that double precision float support will not be in Mt Evans.

Originally posted by Korval:
I will bet you $1000 that double precision float support will not be in Mt Evans.
As an extension, for sure ;)

Originally posted by Korval:
I don’t know what graphics cards you’re talking about, but it’s only been in the last year that cards have cracked the 512MB mark.

http://techreport.com/reviews/2007q3/radeon-hd-2900xt-1gb/index.x?pg=1
http://www.madshrimps.be/vbulletin/f22/nvidia-geforce-9800-specifications-known-35701/
And the 8800 has 768MB, which is close to 1GB.

Also, I think SM5.0 won't be available until 2009-2010, so by then almost all decent cards will have 1GB.

If they provide recursion of any kind, you can expect the depth to be short. Like maybe 4K of stack space.

Well, any stack would be a good start.
With 700-1000MB of VRAM I thought they could manage a 1MB stack...

Originally posted by Korval:
You can pretty much forget that. Unless you’re on PowerVR hardware, transparency sorting is right out, as is any mechanism like this to make it work.

Well, the Kyro and PMX processors used a tile rendering system. It was indeed useful for sorting transparencies, but it was too slow. Also, I'm not sure a tile system in HW is really a good thing.

Originally posted by Korval:
Now, with enough buffers, you can probably finesse it to work. And that’s the whole point of programmable hardware: to give you the ability to do the things you need done without having IHVs hardcode everything.

Yep, this can be done with depth peeling, but it's such a pain and very slow. I'm not sure the proposed layered system would be faster, but at least you could do something you simply can't at the moment: control the alpha blending individually for each pixel in one pass. This kind of buffer could be used for more things too; it's not limited to transparency sorting (for example, raytraced SSS).

Originally posted by Korval:
What’s getting in your way? I mean, geometry shaders can do it just fine.

Nope. The current GS cannot use texture1D/texture2D/texture3D fetches, so it's hard to perform adaptive tessellation based on a heightmap. That's why I think the R600 has a tessellation unit, but the API is not public. As Humus said here:

http://forum.beyond3d.com/showpost.php?p=1006632&postcount=23
http://forum.beyond3d.com/showpost.php?p=1007104&postcount=26

that terrain demo was not using the GS.
See this demo:
http://www.youtube.com/watch?v=Bcalc8UoJzo

It is NOT using geometry shaders, according to all the sources.

This other one uses tessellation too:
http://www.youtube.com/watch?v=SgQj4JRgMo0&mode=related&search=

but for that one I'm not 100% sure it isn't using the GS.

Originally posted by Korval:
Hardcoded features are dead. They died the day the GeForce 3 came out. Everything’s moving to shaders.

If you want this, they will have to expose the ability for you to walk a piece of memory from a shader (which you can basically already do, given G80-class hardware and uniform buffers of huge size), and then you will have to walk it yourself. At no time should IHVs be implementing this themselves.

Already did it with shaders. Too slow and painful:
http://www.clockworkcoders.com/oglsl/journal.html
It also has some serious limitations due to the maximum shader length and the loop limits.

I think the G80 has a 64k instruction slot limit in DX10 (or is it even less?). I don't think that's enough to process a complex scene's spatial tree.

You need specialized HW for that, like the RPU:
http://graphics.cs.uni-sb.de/~woop/rpu/rpu.html#gallery
Btw, it runs at 66MHz and OWNS any current HW.

It's time to think about raytracing. If the SaarCor university project did it, I don't see why a company like NVIDIA with 10000x more resources cannot. In fact, if you look at page 16 of the Larrabee presentation you can see Intel wants to do exactly that (and that explains why they hired the Quake 4 raytracing guy).

Originally posted by Korval:
I seem to recall that we just got an independent stage between the vertex shader and the fragment shader. Oh yeah, that’s right, it was called a Geometry Shader.
So why can’t you do this with a geometry shader?

Because the current rasterization method cannot shade subpixels. The fragment shader stops at the pixel level. You need dedicated HW for this, based on the RenderMan REYES and fr-1 MTD architectures.

Originally posted by Korval:
I will bet you $1000 that double precision float support will not be in Mt Evans.

Are you sure?
See the new August 2007 DirectX SDK, Direct3D 10.1 appendix for Windows Vista SP1 (which is coming imminently in November with the G92). Also see http://www.madshrimps.be/vbulletin/f22/nvidia-geforce-9800-specifications-known-35701/

The current GS cannot use texture1D/texture2D/texture3D fetches so is hard to perform adaptive tessellation based on a heightmap.
Then ask for that, not some hardcoded feature. Not only that, unlike hardcoded features, it’s useful for other people, not just you.

Already did with shaders. Too slow and painful:
Then wait for shaders to get faster. That’s something that’s guaranteed to happen with new hardware; you don’t even have to ask for it.

btw, runs at 66Mhz and OWNS any current HW.
Ownage is in the eye of the beholder. Those screenshots look like something out of graphics ca 2002 or so (except for the shadows).

Now, when they can do something at interactive framerates that has complexity (both geometrically and in shaders) equal to modern games, let me know.

Is time to think in raytracing.
Maybe you weren’t paying attention, but OpenGL is not a ray tracing API. It is a rasterization API. The OpenGL specification is predicated on that. A ray tracing API would have a very different structure.

Because the current rasterization method cannot shade subpixels.
Sure they can. What is a subpixel but just another pixel. Just turn off multisampling and do super sampling manually. You can shade as many subpixels as you want.

Originally posted by Korval:
Then ask for that, not some hardcoded feature. Not only that, unlike hardcoded features, it’s useful for other people, not just you.

That's what I'm asking for. I never said otherwise... my words were "an advanced GS with adaptive tessellation support".

The current GS model is a bit weak; no texture fetches can be done (which makes adaptive tessellation very hard, if not impossible). That's why the ATI R600 tessellation API and the G100 are going to address this.

Then wait for shaders to get faster. That’s something that’s guaranteed to happen with new hardware; you don’t even have to ask for it.

I said "painful" too. Faster shaders won't solve it if it isn't possible to iterate a 1M-node tree structure due to the shader slot limit and inline unrolling. To pull that off I need to do really complex things (tons of passes, stackless spatial structures, etc.) and I'm very limited by the current HW.

Ownage is in the eye of the beholder. Those screenshots look like something out of graphics ca 2002 or so (except for the shadows).

That presentation is very old. The screenshots do look crappy, but keep in mind that the FPGA runs at 66MHz... also, it is a project by university students, not by a professional company. I think they did a very decent job.

Now, when they can do something at interactive framerates that has complexity (both geometrically and in shaders) equal to modern games, let me know.

If I could do that I would be rich :P
Let’s wait for Larrabee.

Maybe you weren’t paying attention, but OpenGL is not a ray tracing API. It is a rasterization API. The OpenGL specification is predicated on that. A ray tracing API would have a very different structure.

I'm just asking for a trace GLSL instruction in the fragment shader. I don't see how that would affect the OpenGL structure so badly. In fact, that's all I need. Notice I'm talking only about secondary rays; primary rays make no sense, since you can do them with rasterization + layered depth peeling for transparencies.

An internal driver change in the VBOs (0% structural change, just to store the uniform grid) plus that closestHit/hitTest GLSL instruction would be more than enough... No big structural changes would be needed for this pre-raytracing step. It's just like a texture3D fetch, but retrieving hit coordinates. It can be implemented internally in different ways (for example with the uniform grid iteration, or more complex ones).

Sure they can. What is a subpixel but just another pixel. Just turn off multisampling and do super sampling manually. You can shade as many subpixels as you want
Well, that's an option, but to do MDT I need to play with the z-buffer and write "outside" the triangle limits. That's not supported currently and I don't know how to solve it. I need something like what is described here:

http://forum.beyond3d.com/showthread.php?p=1039492#post1039492

and it requires the clipping and interpolant engine to be changed slightly.

Oh btw…

http://www.trustedreviews.com/graphics/review/2007/06/28/Intel-Research-Day-2007/p1

There is the 80-core thing, Quake 4 raytraced running like mad with only 8 cores, Larrabee and other things.

That’s what i’m asking. I never did the contrary… my words were “an advanced GS with adaptive tessellation support”.
Yes, but what you asked for was “an advanced GS with adaptive tessellation support”, not “texture fetches in geometry shader”.

You can implement “adaptive tessellation” in any number of ways, both fixed-function and programmatically. Adaptive tessellation is an end, not a means. Texture fetching is a means to that end. Ask for means, not ends.

I’m just asking for a trace GLSL instruction in the fragment shader.
You say that so simply, as though it were the most obvious thing after getting a dot product.

Having a “trace” function is about the last thing glslang should provide. It might provide adequate means to create a trace function, but it should never have hardware dedicated to the task. Doubly so considering the fact that there are innumerable ways to trace a scene, to describe a scene, and so forth, and all of them are more efficient for some tasks vs. others.

Well, that’s an option but to do MDT I need to play with the zbuffer and write “outside” the triangle limits.
The whole point of microtriangles is to have a pixel-sized or smaller triangle. Then you just use the shader to displace its position(s) appropriately. That way, you don’t need specialized hardware for pixel-level displacement.

The GS can do all of this (with texture fetches). So once again, you’re asking for a particular end rather than a means to an end.

Quake4 raytraced running like mad with only 8 cores, larabee and other things.
It was not “running like mad” with 8 cores. It was running pretty slowly with 8 cores.

And I seriously doubt that Larrabee will have the equivalent power of 8 cores tasked with raytracing.

Most important of all, even implementing this required not using OpenGL. Or, at the very least, violating large swaths of the OpenGL spec. That’s why if you want ray tracing, you need a special ray tracing API.

Originally posted by santyhamer:
The current GS cannot use texture1D/texture2D/texture3D fetches
I am sure it does. Could you point to the part of the spec that states the opposite?

Originally posted by Zengar:
I am sure it does. Could you point to the part of the spec that states the opposite?

tex2D (DirectX HLSL)
Samples a 2D texture.

ret tex2D(s, t) 

Parameters
s 
[in] The input sampler state.

t 
[in] The input texture coordinate.

Return Value
The value of the texture data.

Type Description

Name   In/Out   Template Type   Component Type   Size
s      in       object          sampler2D        1
t      in       scalar          float            1
ret    out      vector          float            4

Minimum Shader Model
This function is supported in the following shader models.

Shader Model                    Supported
Shader Model 4 (DirectX HLSL)   yes (vertex/pixel shader only)
Shader Model 3 (DirectX HLSL)   yes (vertex/pixel shader only)
Shader Model 2 (DirectX HLSL)   yes (pixel shader only)
Shader Model 1 (DirectX HLSL)   yes (pixel shader only)

See Also
Intrinsic Functions (DirectX HLSL)

Also, look at the DX10 GS examples... none of them fetch textures. If it's supported, it's very well hidden; there's no evidence in the SDK, but who knows...

I also tried to find it in the G80 OpenGL specs... nothing. Nor in the NVIDIA SDK 9.5. Nor in ATI's latest GI and GSNPatches examples.

I think the most important feature for SM5.0 could be to move the Raster Operation Processors (ROPs) into the fragment pipeline.

Many algorithms like depth peeling for order-independent transparency could be accelerated or replaced.

Information about the block granularity (blocks of fragments that share the instruction pointer) would be nice too. With that it would be possible to build a highly effective tile-based deferred renderer (1st pass: fill the G-buffer; 2nd pass: create a low-res texture with min/max Z values for each tile; 3rd pass: render per-tile light information into a texture; 4th pass: process all lights in a single pass).

It would be nice if it were possible to process fragments in blocks instead of each one on its own. That could help to accelerate blur or filter algorithms (or tile processing); see the sketch below.

(Only brainstorming stuff)
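A quick CUDA sketch of that last idea, processing fragments in tiles with shared memory instead of one by one; the tile size, kernel name and the simple 3x3 box filter are all illustrative assumptions, not an existing shader feature:

// tileBlur.cu -- sketch of block-wise fragment processing: each thread block
// loads a tile (plus a 1-pixel apron) into shared memory, syncs, then applies
// a 3x3 box filter in a single pass.  Tile size and names are illustrative.
#include <cuda_runtime.h>

#define TILE 16                              // 16x16 threads per block

__global__ void tileBoxBlur(const float* in, float* out, int w, int h)
{
    __shared__ float tile[TILE + 2][TILE + 2];   // tile plus 1-pixel apron

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Clamp-to-edge coordinates for loading.
    int cx = min(max(x, 0), w - 1);
    int cy = min(max(y, 0), h - 1);

    // Centre texel.
    tile[threadIdx.y + 1][threadIdx.x + 1] = in[cy * w + cx];

    // Border threads also load the apron texels (edges and corners).
    if (threadIdx.x == 0)        tile[threadIdx.y + 1][0]        = in[cy * w + max(cx - 1, 0)];
    if (threadIdx.x == TILE - 1) tile[threadIdx.y + 1][TILE + 1] = in[cy * w + min(cx + 1, w - 1)];
    if (threadIdx.y == 0)        tile[0][threadIdx.x + 1]        = in[max(cy - 1, 0) * w + cx];
    if (threadIdx.y == TILE - 1) tile[TILE + 1][threadIdx.x + 1] = in[min(cy + 1, h - 1) * w + cx];
    if (threadIdx.x == 0 && threadIdx.y == 0)
        tile[0][0] = in[max(cy - 1, 0) * w + max(cx - 1, 0)];
    if (threadIdx.x == TILE - 1 && threadIdx.y == 0)
        tile[0][TILE + 1] = in[max(cy - 1, 0) * w + min(cx + 1, w - 1)];
    if (threadIdx.x == 0 && threadIdx.y == TILE - 1)
        tile[TILE + 1][0] = in[min(cy + 1, h - 1) * w + max(cx - 1, 0)];
    if (threadIdx.x == TILE - 1 && threadIdx.y == TILE - 1)
        tile[TILE + 1][TILE + 1] = in[min(cy + 1, h - 1) * w + min(cx + 1, w - 1)];

    __syncthreads();                          // the whole block now sees the tile

    if (x >= w || y >= h) return;

    float sum = 0.0f;
    for (int dy = 0; dy <= 2; ++dy)
        for (int dx = 0; dx <= 2; ++dx)
            sum += tile[threadIdx.y + dy][threadIdx.x + dx];

    out[y * w + x] = sum / 9.0f;
}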

The G80 GS most certainly does support texture fetches, I have used them myself. DX tends to be vague about so many things. It may not be explicitly mentioned in the OGL extensions because all three shader stages now use the same instruction set - that is from the ext_gpu_shader4 extension, I think.

Originally posted by santyhamer:
The current GS cannot use texture1D/texture2D/texture3D fetch
Please grep the spec for MAX_GEOMETRY_TEXTURE_IMAGE_UNITS_EXT.

Originally posted by arekkusu:
Please grep the spec for MAX_GEOMETRY_TEXTURE_IMAGE_UNITS_EXT.
I found this there:

Texture Access

    Geometry shaders have the ability to do a lookup into a texture map, if
    supported by the GL implementation. The maximum number of texture image
    units available to a geometry shader is
    MAX_GEOMETRY_TEXTURE_IMAGE_UNITS_EXT; a maximum number of zero indicates
    that the GL implementation does not support texture accesses in geometry
    shaders.

It currently returns 0 on my GF8500. Perhaps that's due to the drivers.

Really, it would be VERY surprising if the G80 didn't support textures in the GS, as everything runs on the same hardware anyway.

Oh, I changed post #1 a bit with more concrete ideas, as Korval wanted. Basically I polished the new pipeline suggestion a bit. Sorry :P

I’m not sure if this is a HW limitation really or software/convention one
Well, there is one thing you have to remember: it’s a GPU.

This goes back to the whole GPGPU API discussion from another thread.

If you’re using a GPU for graphical rendering, a set shader execution length is very important. The absolute last thing a rendering user wants to hear is, “Sorry, I can’t render anything; your shader may be in an infinite loop.” Certainly no rendering API has any way to communicate that to the user, nor does it have any actual remedy for it. The most effective thing it can do is detect the degenerate case (quickly, so as not to lose too much performance) and kill the shader off.

Since we know that the halting problem can never be solved, there’s no point in trying to determine if the shader is doing useful work or stuck in an infinite loop. Thus, the best way to keep the GPU running is to simply count opcodes and stop at a suitably high number.

Now, a GPGPU API (that is, one that is designed for stream processing, not graphical rendering) would absolutely want to communicate this to the user. The user ought to know that his stream process is still running, and it should be the user who decides when to stop it. That’s because a user application can run for long periods of time and still be doing useful work.

The right choice for a rendering application is the wrong choice for a GPGPU app. Which is why we need an OpenGPGPU spec that can cater directly to the GPGPU crowd.

Basically will be used for shadow test.
See, I’ve never really understood this.

Since your “shadow test” can’t possibly be over the entire world geometry (because a rasterizer is a rasterizer, not a scene graph like a ray tracer), the only thing you could possibly be testing against is geometry from the same primitive. And even that may not be all of the same “model”.

Which means that your “self-shadowing” could miss your head (uses a different shader from clothing), etc. Furthermore, since you’re only talking about self-shadowing, you’re still going to have to use another method to do real shadowing. Both shadow volumes and shadow maps provide self-shadowing (fully), so either way, you get the same effect.

And without brutalizing your rendering speed.

In the future i’m pretty sure the GPU and CPU will share transistors and instructions.
You’re reading way too much into the Larrabee’s use of x86 and the integration of GPU’s onto the CPU die.

Just as one core of a multicore chip cannot “share” logic with another core, so too cannot a CPU core share logic with a GPU core on the same die.

Now, as you mention, inter-chip communication can certainly be improved. For example, a 1MB or so cache can be set aside for CPU-to-GPU communication, which would make all rendering commands stupidly fast to execute (though not to complete). But any such communication would only deal with the results of rendering; you couldn't shell your shader out to a CPU function call or something. Not even with Larrabee's x86 nature.

Originally posted by Korval:
If you’re using a GPU for graphical rendering, a set shader execution length is very important. The absolute last thing a rendering user wants to hear is, “Sorry, I can’t render anything; your shader may be in an infinite loop.”

I agree. In fact I think CUDA has a 5-second kernel execution limit because of this. That restriction does not apply if you use a G80 as a non-display device (or you use a Tesla card). But I agree, truly unlimited shaders are a problem because they can get stuck in an infinite loop. We just need to extend the shader length enough that we can iterate a large scene "triangle set" using some kind of spatial structure... I think 2-8M instructions should be enough for a very complex scene. The stack thing can help here too, even a small one (6KB or so).

The right choice for a rendering application is the wrong choice for a GPGPU app. Which is why we need an OpenGPGPU spec that can cater directly to the GPGPU crowd.

I agree again. Perhaps the solution to all this is to start building an OpenGPGPU API mixed with some kind of OpenRay API. Basically we could put generic streaming kernels, tree iteration, etc. in there.

But we would also need to define some kind of interconnect mechanism between OpenGL and that API (like CUDA does), so the data could be fed into VBOs/PBOs or used directly from shaders.

Well, what I'm trying to do is get some kind of very basic raytracing instructions, in a way that would not require changing OpenGL too much, mainly to improve shadows and reflections. But I am not sure whether this is totally possible, nor whether the results would actually be better than with the current model.

Since your "shadow test" can't possibly be over the entire world geometry (because a rasterizer is a rasterizer, not a scene graph like a ray tracer)

Well, if we consider the stage 3) I mentioned, forcing the user to always emit a world-space vertex position in the vertex shader, you could fill a uniform grid internally very easily.

Then in the fragment shader you would be able to call the hitTest GLSL instruction with a ray origin and end, so the GPU internally traverses the uniform grid using a Digital Differential Analyzer (DDA) to see if there is a hit (which is a very fast test, btw). Calling this several times you can get very good soft shadows.

That, in theory, can be done well while keeping the rasterization model almost intact.

On the other hand, I'm not sure if we could get this just by increasing the maximum shader length a bit (and iterating a stackless spatial structure) or by rendering the triangle set into a volume texture. From my own tests I can tell you that with the current HW it is impossible to work with complex scenes... the shader is aborted due to the instruction count. I can divide the work into multiple tasks, but then I have to play with AGP/PCIe transfers, multiple passes, etc. and it is too slow.

The real problems here are closestHit, gatherPoints and computeFlux... which require a really big API change (in fact, as you mention, perhaps a separate raytracing API) and will be hard to fit into the current model (because they need to fetch triangle indices, store and update minimum distances in an internal buffer, etc.).

You’re reading way too much into the Larrabee’s use of x86 and the integration of GPU’s onto the CPU die.

Hehehe, perhaps! They've converted me into a blind fanatic! :P

Nehalem and Fusion are supposed to arrive by the end of 2008. I think they will basically be a GPU + a CPU in the same BGA package, perhaps joined by some kind of high-bandwidth bus like HT 3.0 or CTM.

Larrabee is different... I think it is like the Cell: some general-purpose cores surrounded by tons of SIMD streaming units, but who knows...
That one is the 16-48 core monster and will arrive later (2010) as the first discrete Intel graphics card (well, it will be a CPU too)... we're far from that, and at the moment it is pure vaporware, I must admit... but if you ask Daniel Pohl he can probably tell you how amazing it could be. Page 16 of the Intel Research presentation is very clear... they want to do "realtime raytracing"... and that's why they hired Daniel. If Quake 4 runs decently with 8 cores, imagine how it could be with the 48 of the high-end Larrabee or the 80 of Polaris.

Btw, I heard you need to use the Ct language to program it (a highly parallel language based on templated C and SIMD). See http://www.intel.com/research/platform/terascale/TeraScale_whitepaper.pdf

In this video you can see the parent of Larrabee, Polaris (the 80-core experimental CPU):

http://www.youtube.com/watch?v=TAKG0UvtzpE

It gives you 2 teraflops of power today, which is a bit more than the future G92... but with a complete function stack and x86 compatibility (???)...

Originally posted by Korval:
Now, as you mention, inter-chip communication can certainly be improved. For example, a 1MB or so cache can be set aside for CPU-to-GPU communication, which would make all rendering commands stupidly fast to execute (though not to complete). But any such communication would only deal with the results of rendering; you couldn’t shell your shader out to a CPU function call or something. Not even with Larabee’s x86 nature.

You're right. Looking at the closest API we have for this (CUDA), you can see that reading/writing system RAM is pretty expensive (600 cycles on average) while R/W to thread-block shared memory is only about 2 cycles. But well... the shared memory can help a lot, for example to do image kernels like blur in one pass.

Calling a CPU function from a shader today looks excessive... what we can do is feed a VBO with the results, then use the DrawAuto() DX10 feature to re-feed the shader... pretty much what CUDA does at the moment for intercommunication.
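For reference, the CUDA side of that interchange looks roughly like this today, using the CUDA 1.x-era OpenGL interop calls (the kernel name and the assumption that the VBO already exists are mine; later CUDA releases replace these entry points with the cudaGraphics* API):

// cudaVboFeed.cu -- sketch of filling an OpenGL VBO from a CUDA kernel so a
// later draw call (or DrawAuto-style re-feed) can consume the results.
// Uses the legacy cudaGL* interop calls; writeResults is a made-up kernel.
#include <GL/gl.h>
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

__global__ void writeResults(float4* verts, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        float4 v;
        v.x = (float)i; v.y = 0.0f; v.z = 0.0f; v.w = 1.0f;   // placeholder data
        verts[i] = v;
    }
}

void fillVboWithCuda(GLuint vbo, int numVerts)
{
    cudaGLRegisterBufferObject(vbo);              // make the VBO visible to CUDA

    float4* devPtr = 0;
    cudaGLMapBufferObject((void**)&devPtr, vbo);  // map it into CUDA's address space

    int threads = 256;
    int blocks  = (numVerts + threads - 1) / threads;
    writeResults<<<blocks, threads>>>(devPtr, numVerts);

    cudaGLUnmapBufferObject(vbo);                 // hand the buffer back to OpenGL
    cudaGLUnregisterBufferObject(vbo);            // GL can now source vertices from it
}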

If the Quake4 moves decently with 8 cores imagine how can be with the 48 of the high-end Larrabee or with the 80 of the Polaris.
First of all, Polaris (TeraScale) is just a research project into how to design software for massively parallel streaming chips and so forth.

Second, those 48 cores aren’t real CPUs, just like the 8 SPUs in the Cell aren’t real CPUs. They’re in-order stream processors.

Ray tracing, while easily parallelizable, is ultimately very branchy code. Is there something in this voxel? Yes, then loop over the contents and do ray-surface intersection. Etc. That’s not suitable for in-order chips, which will make them much slower than a regular core.