This is a suggestion for OpenGL 5.0: provide a new programmable shader stage that will exist in the rasterization portion of the OpenGL pipeline. This new shader, which I will call the Rasterization Evaluation Shader, would be very much like the Tessellation Evaluation Shader, both in form and function. It would exist to rasterize GL_PATCHES primitives, and would only be permitted in shader programs that do not have any active tessellation shader stages. (Consequently, the same GPU hardware used to implement the Tessellation Evaluation Shader should also be able to implement the Rasterization Evaluation Shader.)
Note: this suggestion is a followup to my previous suggestion for OpenGL 5.0 to provide for native Bezier curve and surface rasterization. See:
http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=284118
As a consequence of that previous suggestion and the comments made regarding it, I spent a substantial amount of time reviewing existing research and conducting my own. I soon realized that it is very easy to generalize the rasterization of rational bicubic Bezier surface patches to work for any parametric patch: displacement-mapped surfaces, subdivision surfaces, and analytic surfaces, as well as more traditional parametric surfaces such as Bezier patches (and, by extension, NURBS surfaces).
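To make the idea concrete, here is one way such a parametric evaluation can look in C. This is only an illustrative sketch (the tVec3, bezier3, and bezier_patch names are mine, not part of the proposal): a bicubic Bezier patch evaluated at a (u, v) pair by de Casteljau's algorithm, which is the kind of per-coordinate computation a Rasterization Evaluation Shader would perform.

```c
typedef struct { float x, y, z; } tVec3;

/* Evaluate a cubic Bezier curve at t by repeated linear
 * interpolation (de Casteljau's algorithm). */
static tVec3 bezier3 (const tVec3 p[4], float t)
{
    tVec3 q[4];
    int i, r;
    for (i = 0; i < 4; i++) q[i] = p[i];
    for (r = 1; r < 4; r++) {
        for (i = 0; i < 4 - r; i++) {
            q[i].x = (1.0f - t) * q[i].x + t * q[i + 1].x;
            q[i].y = (1.0f - t) * q[i].y + t * q[i + 1].y;
            q[i].z = (1.0f - t) * q[i].z + t * q[i + 1].z;
        }
    }
    return q[0];
}

/* Evaluate a bicubic Bezier patch at (u, v): collapse each row of
 * the 4x4 control net along u, then the resulting column along v. */
tVec3 bezier_patch (tVec3 ctrl[4][4], float u, float v)
{
    tVec3 col[4];
    int i;
    for (i = 0; i < 4; i++)
        col[i] = bezier3 (ctrl[i], u);
    return bezier3 (col, v);
}
```

Any other parametric evaluator (a displacement map lookup, an analytic formula, a subdivision-surface limit evaluation) could be substituted for the Bezier math here without changing anything downstream, which is the point of the generalization.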
With a suitable implementation, there are many advantages to rasterizing parametric surfaces later in the GL pipeline rather than tessellating them before the geometry shader stage. However, this suggestion is not that any change should be made to the existing tessellation stages, only that this new stage should be added to the pipeline as a fully backwards compatible extension of the OpenGL 4 GL_PATCHES primitive. The advantages all derive from using a suitable automatically adaptive subdivision algorithm, such as the Chung & Field Simple Recursive Tessellator for Adaptive Surface Triangulation:
http://jgt.akpeters.com/papers/ChungField00/
To appreciate why subdivision in the rasterization portion of the pipeline can be superior to tessellation earlier in the pipeline, the limitations of that earlier tessellation must be understood. In OpenGL 4, the tessellation stage has a 64x64 tessellation grid limit, and the tessellation grid has very limited adaptability (adaptability that is not automatic, but must be programmed in the Tessellation Control Shader, which is neither a trivial task nor entirely effective). The 64x64 grid limit is very restrictive for highly detailed surfaces, or for surfaces that are very large in screen space.
Generally, to achieve highly detailed surfaces with a 64x64 grid limit, a surface must be pre-tessellated into a number of relatively simple patches, so that the 64x64 grid limit on each is acceptable. Even so, there is an implicit limit to how large any such surface can be rendered before the 64x64 grid reveals itself with artifacts, especially on silhouettes.
The adaptability provided by the Tessellation Control Shader is largely all-or-nothing: either the entire patch is tessellated and rendered, or none of it is. If only a small portion of a patch is visible, but at large scale, then a full 64x64 tessellation grid will be required (producing 8192 triangles), even though only a tiny fraction of a single triangle from it may be within the viewport. Not only is this computationally inefficient (and slow), but the visual results produced are still undesirable.
The solution to all these problems lies in an automatically adaptive subdivision algorithm. Implemented in hardware on the GPU, such an algorithm would take on responsibilities analogous to those of the Tessellation Control Shader plus the Tessellation Primitive Generator.
As I envision it, the Rasterization Evaluation Shader, just like the Tessellation Evaluation Shader, would have as GLSL inputs the parametric U and V variables (for quads and isolines), plus W (for triangular patches), as well as the GL_PATCHES patch points (up to 32 of them, just as in OpenGL 4). The shader would be responsible for defining gl_Position in clip coordinates for the given U, V (and, if appropriate, W) parametric variables.
The pipeline would then apply the Perspective Divide and Viewport Transform (as it currently does), producing window coordinates. The adaptive subdivision algorithm bases its clipping, culling, and subdivision decisions on the window coordinates of the triangles (or lines, for isoline patches). This allows it to subdivide only those portions of a patch that are inside the viewport (and scissor box, if enabled), down to some target size in pixel fragments, and to within some error tolerance, also in pixel fragments.
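As an illustration of the kind of window-space culling decision involved, here is a minimal sketch in C. The tri_maybe_visible name and data layout are hypothetical, not part of any real GL pipeline:

```c
/* A conservative visibility test the adaptive subdivider might
 * apply: a triangle whose window-space bounding box lies entirely
 * outside the viewport (or scissor box) can be culled, and none of
 * its interior need ever be evaluated. */
typedef struct { float x, y; } tWin;             /* window-space X, Y of a vertex */
typedef struct { float x0, y0, x1, y1; } tRect;  /* viewport or scissor rectangle */

int tri_maybe_visible (const tWin v[3], tRect vp)
{
    float minx = v[0].x, maxx = v[0].x;
    float miny = v[0].y, maxy = v[0].y;
    for (int i = 1; i < 3; i++) {
        if (v[i].x < minx) minx = v[i].x;
        if (v[i].x > maxx) maxx = v[i].x;
        if (v[i].y < miny) miny = v[i].y;
        if (v[i].y > maxy) maxy = v[i].y;
    }
    /* conservative test: does the bounding box overlap the rectangle? */
    return maxx >= vp.x0 && minx <= vp.x1 &&
           maxy >= vp.y0 && miny <= vp.y1;
}
```

Because this test runs per recursion step, whole subtrees of the subdivision recursion terminate immediately for off-screen portions of a patch, which is precisely what the OpenGL 4 tessellation stage cannot do.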
This target size would be a new OpenGL state variable, such as GL_STAMP_SIZE, set with a new GL function, such as glStampSize(). With this, the OpenGL programmer could trade off speed for quality. For example, Pixar currently uses a stamp size of 0.5 pixels in their REYES renderers for theatrical motion picture releases. For computer games, a value closer to 5.0 pixels might provide a good balance of quality and speed.
An additional benefit is that it becomes easy to add a simple and efficient Surface Normal Computation Unit to the fixed pipeline (for patches only). Once a triangle lying on the surface of a patch has been determined to be visible, within the error tolerance (another new GL state variable), and no larger than the target size, it is sent to the existing GPU triangle rasterization hardware (and from there to the Fragment Shader). En route, however, if the OpenGL programmer has enabled the automatic Surface Normal Computation Unit, the Rasterization Evaluation Shader would be executed twice more for each triangle vertex, once with a small displacement to the U parametric variable and once with a small displacement to V, so that the surface tangents at each vertex can be determined and, from their cross product, the surface normal. For this feature to work, the Rasterization Evaluation Shader would also have to output, in addition to gl_Position (in clip coordinates), a new variable, such as gl_Normal, containing the coordinates of the point in eye coordinates.
This would allow efficient, automatic, on-the-fly calculation of the surface normal vector for each visible pixel fragment. Not only is this simple to implement in a small amount of additional GPU hardware and efficient to compute, but it would also simplify shader programming, especially for really complex surfaces, such as a dynamically deforming hybrid surface built from multiple displacement maps on an analytic displacement on a Bezier surface.
At first glance, it might seem that this would be very difficult to implement and would require a huge amount of hardware resources. In fact, a few simple enhancements to the Chung and Field recursive surface subdivision algorithm make it remarkably simple and efficient.
Here is an example implementation, in C, of the heart of the algorithm:
/* subdivide_tri() recursively subdivides the current triangular
 * portion of the patch primitive into smaller triangles until it
 * can be rendered accurately.
 *
 * Vertices in parent_tri[] are indexed this way:
 *      2
 *    0   1
 *
 * Points in tri_grid[] are indexed this way:
 *       4
 *     5 6 3
 *    0  1  2
 *
 * 11/1/10 david_f_knight - initial version.
 */
void subdivide_tri (
    tPoint *parent_tri[3])     //i: pointers to three vertices of triangle about to be subdivided.
{
    tPoint points[4];          //new vertices of triangles subdivided from parent_tri[].
    tPoint *subtriangles[12];  //sequence of subtriangle vertices (subtriangles_cnt * 3 vertices).
    long subtriangles_cnt;     //number of triangles current triangle must be divided into.
    tPoint *tri_grid[7] = {parent_tri[0], &points[0], parent_tri[1],
                           &points[1], parent_tri[2], &points[2],
                           &points[3]};

    exec_rast_eval_shader_grid (parent_tri, tri_grid);
    error_tri (tri_grid, &subtriangles_cnt, subtriangles);
    switch (subtriangles_cnt) {
        case 4:  subdivide_tri (&subtriangles[9]);  /*fall through.*/
        case 3:  subdivide_tri (&subtriangles[6]);  /*fall through.*/
        case 2:  subdivide_tri (&subtriangles[3]);
                 subdivide_tri (&subtriangles[0]);
                 break;
        case 0:  rasterize_tri (parent_tri);
                 break;
        case -1: break;  //the triangle is not visible.
    } /*switch.*/
    return;
} /*subdivide_tri.*/
subdivide_tri() is recursive. In addition to calling itself, it calls:

- exec_rast_eval_shader_grid(), which executes the Rasterization Evaluation Shader at the parametric midpoints of the three edges of the triangle passed in, and at the parametric center of the triangle. (Keep in mind that the Perspective Divide and Viewport Transform are also performed on the gl_Position defined by each invocation of the Rasterization Evaluation Shader.)

- error_tri(), which evaluates the error at each of the four points calculated by exec_rast_eval_shader_grid() to determine which edges of the triangle need to be divided at their midpoints, or, if none do, whether the triangle needs to be divided at its center.

- rasterize_tri(), which is the interface to the existing GPU triangle rasterization hardware. If the proposed Surface Normal Computation Unit exists and is enabled, then it causes the Rasterization Evaluation Shader to be executed with a small offset to U and then to V, takes the cross product of the two resulting tangents, and normalizes the result for each of the triangle's corners before rasterizing the triangle.

The structure tPoint is defined something like this (though it could actually be simpler):
typedef struct {      //terms defining a point on the patch surface:
    float parm[3];    //parametric U, V (and, if triangle patch, W) coordinates of point;
                      // if quad or isoline patch then 0 <= {U, V} <= 1 and are Cartesian,
                      // if triangle patch then 0 <= {U, V, W} <= 1 and U + V + W = 1 and are barycentric.
    float window[4];  //window coordinates {X, Y, Z, 1/W} of point.
    union {
        float user_f[GL_MAX_VARYING_COMPONENTS];          //array of float variables declared "in" in Fragment Shader.
        long user_l[GL_MAX_VARYING_COMPONENTS];           //array of long variables declared "in" in Fragment Shader.
        unsigned long user_u[GL_MAX_VARYING_COMPONENTS];  //array of unsigned long variables declared "in" in Fragment Shader.
        double user_d[GL_MAX_VARYING_COMPONENTS/2];       //array of double variables declared "in" in Fragment Shader.
    };
} tPoint;
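The per-edge decision inside error_tri() could work like this: evaluate the surface at the edge's parametric midpoint, project it to window coordinates, and split the edge when that true midpoint deviates from the chord midpoint by more than the error tolerance. A sketch in C (edge_needs_split and tWin2 are illustrative names, not part of the proposal):

```c
/* Sketch of the per-edge flatness test inside error_tri(): the edge
 * needs to be split at its parametric midpoint when the surface's
 * actual midpoint, in window coordinates, deviates from the
 * straight-line midpoint of the edge (the chord midpoint) by more
 * than the error tolerance, in pixels. */
typedef struct { float x, y; } tWin2;  /* window-space X, Y of a point */

int edge_needs_split (tWin2 end0, tWin2 end1, tWin2 true_mid,
                      float tolerance_px)
{
    float lin_x = 0.5f * (end0.x + end1.x);  /* chord midpoint */
    float lin_y = 0.5f * (end0.y + end1.y);
    float dx = true_mid.x - lin_x;
    float dy = true_mid.y - lin_y;
    return dx * dx + dy * dy > tolerance_px * tolerance_px;
}
```

Comparing squared distances avoids a square root per edge, and because the test is performed in window coordinates it automatically adapts to the screen-space scale of the patch, exactly the adaptability the Tessellation Control Shader lacks.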
And finally, subdivide_tri() is initially called by any of the glDraw*() functions something like this:
tPoint points[4];  //surface evaluated at the patch's three or four corners.
long quad;         //TRUE if quad or isoline patch, FALSE if triangle patch;
                   // emulates the Rasterization Evaluation Shader input layout
                   // qualifier of {quads | isolines, triangles}.
tPoint *tri[3];    //pointers to the three corners of the current triangle
                   // dividing the quad patch.

exec_rast_eval_shader_quad (quad, points);

//rasterize the lower-left CCW triangular half of the quad-shaped
// patch, or the entire triangle-shaped patch:
tri[0] = &points[0];
tri[1] = &points[1];
tri[2] = &points[2];
subdivide_tri (tri);

if (quad) {
    //then the patch primitive is quad-shaped, so rasterize the
    // upper-right CCW triangular half of the quad patch:
    tri[0] = &points[1];
    tri[1] = &points[3];
    //tri[2] = &points[2];  //still has this value.
    subdivide_tri (tri);
}
This code, in addition to calling subdivide_tri(), calls exec_rast_eval_shader_quad(), which executes the Rasterization Evaluation Shader at the three or four corners of the patch. (As before, the Perspective Divide and Viewport Transform are also performed on the gl_Position defined by each invocation of the Rasterization Evaluation Shader.)
This is seriously simple. Nvidia’s and AMD’s engineers will have up to three billion new transistors to find a use for with the next generation of integrated circuit technology. I think it’s hard to argue that the algorithm shown above can’t be implemented in a reasonable fraction of three billion transistors. Doing so would allow pixel-perfect free-form shape rendering with per pixel surface normals and require NO level-of-detail programming on the CPU or in the shaders.