This is a suggestion for OpenGL 5.0: provide a new programmable shader stage that will exist in the rasterization portion of the OpenGL pipeline. This new shader, which I will call the Rasterization Evaluation Shader, would be very much like the Tessellation Evaluation Shader, both in form and function. It would exist to rasterize GL_PATCHES primitives, and would only be permitted in shader programs that do not have any active tessellation shader stages. (Consequently, the same GPU hardware used to implement the Tessellation Evaluation Shader should also be able to implement the Rasterization Evaluation Shader.)
Note: this suggestion is a followup to my previous suggestion for OpenGL 5.0 to provide for native Bezier curve and surface rasterization. See:
http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=284118
As a consequence of that previous suggestion and the comments made regarding it, I spent a substantial amount of time reviewing existing research and conducting my own. I soon realized that it is very easy to generalize the rasterization of rational bicubic Bezier surface patches to work for any parametric patch: displacement-mapped surfaces, subdivision surfaces, and analytic surfaces, as well as more traditional parametric surfaces such as Bezier patches (and, by extension, NURBS surfaces).
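To make the idea concrete, here is one way such a parametric evaluation can look in C. This is only an illustrative sketch (the tVec3, bezier3, and bezier_patch names are mine, not part of the proposal): a bicubic Bezier patch evaluated at a (u, v) pair by de Casteljau's algorithm, which is the kind of per-coordinate computation a Rasterization Evaluation Shader would perform.

```c
typedef struct { float x, y, z; } tVec3;

/* Evaluate a cubic Bezier curve at t by repeated linear
 * interpolation (de Casteljau's algorithm). */
static tVec3 bezier3 (const tVec3 p[4], float t)
{
    tVec3 q[4];
    int i, r;
    for (i = 0; i < 4; i++) q[i] = p[i];
    for (r = 1; r < 4; r++) {
        for (i = 0; i < 4 - r; i++) {
            q[i].x = (1.0f - t) * q[i].x + t * q[i + 1].x;
            q[i].y = (1.0f - t) * q[i].y + t * q[i + 1].y;
            q[i].z = (1.0f - t) * q[i].z + t * q[i + 1].z;
        }
    }
    return q[0];
}

/* Evaluate a bicubic Bezier patch at (u, v): collapse each row of
 * the 4x4 control net along u, then the resulting column along v. */
tVec3 bezier_patch (tVec3 ctrl[4][4], float u, float v)
{
    tVec3 col[4];
    int i;
    for (i = 0; i < 4; i++)
        col[i] = bezier3 (ctrl[i], u);
    return bezier3 (col, v);
}
```

Any other parametric evaluator (a displacement map lookup, an analytic formula, a subdivision-surface limit evaluation) could be substituted for the Bezier math here without changing anything downstream, which is the point of the generalization.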
With a suitable implementation, there are many advantages to rasterizing parametric surfaces later in the GL pipeline rather than tessellating them before the geometry shader stage. However, this suggestion is not that any change should be made to the existing tessellation stages, only that this new stage should be added to the pipeline as a fully backwards compatible extension of the OpenGL 4 GL_PATCHES primitive. The advantages all derive from using a suitable automatically adaptive subdivision algorithm, such as the Chung & Field Simple Recursive Tessellator for Adaptive Surface Triangulation:
http://jgt.akpeters.com/papers/ChungField00/
To appreciate why subdivision in the rasterization portion of the pipeline can be superior to tessellation earlier in the pipeline, the limitations of that earlier tessellation must be understood. In OpenGL 4, the tessellation stage has a 64x64 tessellation grid limit, and the tessellation grid has very limited adaptability (adaptability that is not automatic, but must be programmed in the Tessellation Control Shader, which is neither a trivial task nor entirely effective). The 64x64 grid limit is very restrictive for highly detailed surfaces, or for surfaces that are very large in screen space.
Generally, to achieve highly detailed surfaces with a 64x64 grid limit, a surface must be pre-tessellated into a number of relatively simple patches, so that the 64x64 grid limit on each is acceptable. Even so, there is an implicit limit to how large any such surface can be rendered before the 64x64 grid reveals itself with artifacts, especially on silhouettes.
The adaptability provided by the Tessellation Control Shader is largely all-or-nothing: either the entire patch is tessellated and rendered, or none of it is. If only a small portion of a patch is visible, but at large scale, then a full 64x64 tessellation grid will be required (producing 8192 triangles), even though only a tiny fraction of a single triangle from it may be within the viewport. Not only is this computationally inefficient (and slow), but the visual results produced are still undesirable.
The solution to all these problems lies in an automatically adaptive subdivision algorithm. Implemented in hardware on the GPU, such an algorithm would take on responsibilities analogous to those of the Tessellation Control Shader plus the Tessellation Primitive Generator.
As I envision it, the Rasterization Evaluation Shader, just like the Tessellation Evaluation Shader, would have as GLSL inputs the parametric U and V variables (for quads and isolines), plus W (for triangular patches), as well as the GL_PATCHES patch points (up to 32 of them, just as in OpenGL 4). The shader would be responsible for defining gl_Position in clip coordinates for the given U, V (and, if appropriate, W) parametric variables.
The pipeline would then apply the Perspective Divide and Viewport Transform (as it currently does), producing window coordinates. The adaptive subdivision algorithm bases its clipping, culling, and subdivision decisions on the window coordinates of the triangles (or lines, for isoline patches). This allows it to subdivide only those portions of a patch that are inside the viewport (and scissor box, if enabled), down to some target size in pixel fragments, and to within some error tolerance, also in pixel fragments.
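As an illustration of the kind of window-space culling decision involved, here is a minimal sketch in C. The tri_maybe_visible name and data layout are hypothetical, not part of any real GL pipeline:

```c
/* A conservative visibility test the adaptive subdivider might
 * apply: a triangle whose window-space bounding box lies entirely
 * outside the viewport (or scissor box) can be culled, and none of
 * its interior need ever be evaluated. */
typedef struct { float x, y; } tWin;             /* window-space X, Y of a vertex */
typedef struct { float x0, y0, x1, y1; } tRect;  /* viewport or scissor rectangle */

int tri_maybe_visible (const tWin v[3], tRect vp)
{
    float minx = v[0].x, maxx = v[0].x;
    float miny = v[0].y, maxy = v[0].y;
    for (int i = 1; i < 3; i++) {
        if (v[i].x < minx) minx = v[i].x;
        if (v[i].x > maxx) maxx = v[i].x;
        if (v[i].y < miny) miny = v[i].y;
        if (v[i].y > maxy) maxy = v[i].y;
    }
    /* conservative test: does the bounding box overlap the rectangle? */
    return maxx >= vp.x0 && minx <= vp.x1 &&
           maxy >= vp.y0 && miny <= vp.y1;
}
```

Because this test runs per recursion step, whole subtrees of the subdivision recursion terminate immediately for off-screen portions of a patch, which is precisely what the OpenGL 4 tessellation stage cannot do.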
This target size would be a new OpenGL state variable, such as GL_STAMP_SIZE, set with a new GL function, such as glStampSize(). With this, the OpenGL programmer could trade off speed for quality. For example, Pixar currently uses a stamp size of 0.5 pixels in their REYES renderers for theatrical motion picture releases. For computer games, a value closer to 5.0 pixels might provide a good balance of quality and speed.
An additional benefit is that it becomes easy to add a simple and efficient Surface Normal Computation Unit to the fixed pipeline (for patches only). Once a triangle lying on the surface of a patch has been determined to be visible, within the error tolerance (another new GL state variable), and no larger than the target size, it is sent to the existing GPU triangle rasterization hardware (and from there to the Fragment Shader). En route, however, if the OpenGL programmer has enabled the automatic Surface Normal Computation Unit, the Rasterization Evaluation Shader would be executed twice more for each triangle vertex, once with a small displacement to the U parametric variable and once with a small displacement to V, so that the surface tangents at each vertex can be determined and, from their cross product, the surface normal. For this feature to work, the Rasterization Evaluation Shader would also have to output, in addition to gl_Position (in clip coordinates), a new variable, such as gl_Normal, containing the coordinates of the point in eye coordinates.
This would allow efficient, automatic, on-the-fly calculation of the surface normal vector for each visible pixel fragment. Not only is this simple to implement in a small amount of additional GPU hardware and efficient to compute, but it would also simplify shader programming, especially for really complex surfaces, such as a dynamically deforming hybrid surface built from multiple displacement maps on an analytic displacement on a Bezier surface.
At first glance, it might seem that this would be very difficult to implement and would require a huge amount of hardware resources. In fact, a few simple enhancements to the Chung and Field recursive surface subdivision algorithm make it remarkably simple and efficient.
Here is an example implementation, in C, of the heart of the algorithm:
/* subdivide_tri() recursively subdivides the current triangular
 * portion of the patch primitive into smaller triangles until it
 * can be rendered accurately.
 *
 * Vertices in parent_tri[] are indexed this way:
 *      2
 *    0   1
 *
 * Points in tri_grid[] are indexed this way:
 *       4
 *     5 6 3
 *    0  1  2
 *
 * 11/1/10 david_f_knight - initial version.
 */
void subdivide_tri (
    tPoint *parent_tri[3])     //i: pointers to three vertices of triangle about to be subdivided.
{
    tPoint points[4];          //new vertices of triangles subdivided from parent_tri[].
    tPoint *subtriangles[12];  //sequence of subtriangle vertices (subtriangles_cnt * 3 vertices).
    long subtriangles_cnt;     //number of triangles current triangle must be divided into.
    tPoint *tri_grid[7] = {parent_tri[0], &points[0], parent_tri[1],
                           &points[1], parent_tri[2], &points[2],
                           &points[3]};

    exec_rast_eval_shader_grid (parent_tri, tri_grid);
    error_tri (tri_grid, &subtriangles_cnt, subtriangles);
    switch (subtriangles_cnt) {
        case 4:  subdivide_tri (&subtriangles[9]);  /*fall through.*/
        case 3:  subdivide_tri (&subtriangles[6]);  /*fall through.*/
        case 2:  subdivide_tri (&subtriangles[3]);
                 subdivide_tri (&subtriangles[0]);
                 break;
        case 0:  rasterize_tri (parent_tri);
                 break;
        case -1: break;  //the triangle is not visible.
    } /*switch.*/
    return;
} /*subdivide_tri.*/
subdivide_tri() is recursive. In addition to calling itself, it calls:

- exec_rast_eval_shader_grid(), which executes the Rasterization Evaluation Shader at the parametric midpoints of the three edges of the triangle passed in, and at the parametric center of the triangle. (Keep in mind that the Perspective Divide and Viewport Transform are also performed on the gl_Position defined by each invocation of the Rasterization Evaluation Shader.)

- error_tri(), which evaluates the error at each of the four points calculated by exec_rast_eval_shader_grid() to determine which edges of the triangle need to be divided at their midpoints, or, if none do, whether the triangle needs to be divided at its center.

- rasterize_tri(), which is the interface to the existing GPU triangle rasterization hardware. If the proposed Surface Normal Computation Unit exists and is enabled, then it causes the Rasterization Evaluation Shader to be executed with a small offset to U and then to V, takes the cross product of the two resulting tangents, and normalizes the result for each of the triangle's corners before rasterizing the triangle.

The structure tPoint is defined something like this (though it could actually be simpler):
typedef struct {      //terms defining a point on the patch surface:
    float parm[3];    //parametric U, V (and, if triangle patch, W) coordinates of point;
                      // if quad or isoline patch then 0 <= {U, V} <= 1 and are Cartesian,
                      // if triangle patch then 0 <= {U, V, W} <= 1 and U + V + W = 1 and are barycentric.
    float window[4];  //window coordinates {X, Y, Z, 1/W} of point.
    union {
        float user_f[GL_MAX_VARYING_COMPONENTS];          //array of float variables declared "in" in Fragment Shader.
        long user_l[GL_MAX_VARYING_COMPONENTS];           //array of long variables declared "in" in Fragment Shader.
        unsigned long user_u[GL_MAX_VARYING_COMPONENTS];  //array of unsigned long variables declared "in" in Fragment Shader.
        double user_d[GL_MAX_VARYING_COMPONENTS/2];       //array of double variables declared "in" in Fragment Shader.
    };
} tPoint;
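The per-edge decision inside error_tri() could work like this: evaluate the surface at the edge's parametric midpoint, project it to window coordinates, and split the edge when that true midpoint deviates from the chord midpoint by more than the error tolerance. A sketch in C (edge_needs_split and tWin2 are illustrative names, not part of the proposal):

```c
/* Sketch of the per-edge flatness test inside error_tri(): the edge
 * needs to be split at its parametric midpoint when the surface's
 * actual midpoint, in window coordinates, deviates from the
 * straight-line midpoint of the edge (the chord midpoint) by more
 * than the error tolerance, in pixels. */
typedef struct { float x, y; } tWin2;  /* window-space X, Y of a point */

int edge_needs_split (tWin2 end0, tWin2 end1, tWin2 true_mid,
                      float tolerance_px)
{
    float lin_x = 0.5f * (end0.x + end1.x);  /* chord midpoint */
    float lin_y = 0.5f * (end0.y + end1.y);
    float dx = true_mid.x - lin_x;
    float dy = true_mid.y - lin_y;
    return dx * dx + dy * dy > tolerance_px * tolerance_px;
}
```

Comparing squared distances avoids a square root per edge, and because the test is performed in window coordinates it automatically adapts to the screen-space scale of the patch, exactly the adaptability the Tessellation Control Shader lacks.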
And finally, subdivide_tri() is initially called by any of the glDraw*() functions something like this:
tPoint points[4];  //surface evaluated at the patch's three or four corners.
long quad;         //TRUE if quad or isoline patch, FALSE if triangle patch;
                   // emulates the Rasterization Evaluation Shader input layout
                   // qualifier of {quads | isolines, triangles}.
tPoint *tri[3];    //pointers to the three corners of the current triangle
                   // dividing the quad patch.

exec_rast_eval_shader_quad (quad, points);

//rasterize the lower-left CCW triangular half of the quad-shaped
// patch, or the entire triangle-shaped patch:
tri[0] = &points[0];
tri[1] = &points[1];
tri[2] = &points[2];
subdivide_tri (tri);

if (quad) {
    //then the patch primitive is quad-shaped, so rasterize the
    // upper-right CCW triangular half of the quad patch:
    tri[0] = &points[1];
    tri[1] = &points[3];
    //tri[2] = &points[2];  //still has this value.
    subdivide_tri (tri);
}
This code, in addition to calling subdivide_tri(), calls exec_rast_eval_shader_quad(), which executes the Rasterization Evaluation Shader at the three or four corners of the patch. (As before, the Perspective Divide and Viewport Transform are also performed on the gl_Position defined by each invocation of the Rasterization Evaluation Shader.)
This is seriously simple. Nvidia’s and AMD’s engineers will have up to three billion new transistors to find a use for with the next generation of integrated circuit technology. I think it’s hard to argue that the algorithm shown above can’t be implemented in a reasonable fraction of three billion transistors. Doing so would allow pixel-perfect free-form shape rendering with per pixel surface normals and require NO level-of-detail programming on the CPU or in the shaders.