Tessellation shaders and compatible hardware

Hi,

Hardware tessellation shaders are only available on tessellation-capable hardware. Right. But what makes this hardware capable of tessellation?

I mean, GeForce 8xxx or 2xx GPUs are already general-purpose processors. For instance, one can find CUDA tessellation libraries on the net.

What is this magical thing done by newer hardware that effectively enables hardware tessellation? What is this piece of silicon (the tessellation unit, maybe), and what does it do so much better than software? Could somebody perhaps explain to me how poorly a piece of CUDA code would perform doing the same thing?

Sorry if this question sounds stupid, but so far I can't find the answer.

Thanks,
Fred

There is a piece of silicon for it, if you want to put it that way.

GeForce 8 is a DX10 part.
You need DX11, which adds the hull shader, tessellator, and domain shader.
However, the Radeon HD series does include support for it: http://www.tomshardware.com/reviews/opengl-directx,2019-7.html

I would not know how it compares to CUDA. There is only one way to be sure.

Do you know what this piece of silicon does? My wild guess: from the outer/inner tessellation levels calculated in the tessellation control shader, this piece of silicon efficiently calls the tessellation evaluation shader, up to 64*64 = 4096 times. In other words, in CUDA terms this would mean scheduling the related warps on the various cores very quickly. The same scheduling done with GPU/CUDA/OpenCL code would be much slower and, for millions of primitives, would not perform very well. As I said, my wild guess.
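
For reference, the levels I'm talking about are the per-patch values the control shader writes out. Something like this minimal GLSL sketch, assuming triangle patches; the uTessFactor uniform is just a made-up name:

    #version 400 core
    layout(vertices = 3) out;

    // Hypothetical uniform: how finely to split each patch (clamped by the
    // implementation to its maximum tessellation level, typically 64).
    uniform float uTessFactor;

    void main()
    {
        // Pass the patch's control points through unchanged.
        gl_out[gl_InvocationID].gl_Position = gl_in[gl_InvocationID].gl_Position;

        // One invocation writes the per-patch levels the tessellator will use.
        if (gl_InvocationID == 0) {
            gl_TessLevelOuter[0] = uTessFactor;
            gl_TessLevelOuter[1] = uTessFactor;
            gl_TessLevelOuter[2] = uTessFactor;
            gl_TessLevelInner[0] = uTessFactor;
        }
    }

From those few numbers, the fixed-function unit decides how many evaluation shader invocations to launch.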

What is this magical thing done by newer hardware that effectively enables hardware tessellation?

Why does hardware still have fixed-function blending, when you can do blending just fine in shaders?

Because it’s faster.

Tessellation means taking one primitive of some particular size and breaking it up into many primitives. You put one triangle in, and you can get 16 or 32 out.

In order to make this process efficient, the hardware has to be designed to expect it. Remember: before tessellation, hardware vendors could assume that if one triangle comes into the vertex shader, then at most one triangle will be rendered (it could be culled as off-screen). Therefore, any buffering between the vertex shader and the primitive rasterizer existed primarily for making vertex shaders execute less frequently, i.e., the post-T&L vertex cache. This cache could hold maybe ~15-25 vertices, depending on their size.

If one triangle can become 16 triangles, that blows up the post-T&L cache. It blows up most of the memory buffers between the vertex shader and the primitive processor. Indeed, dealing with this explosion in data size requires a completely different way of handling vertex data. You need large buffers between the vertex shader and the rasterizer.

And then there's the actual process of tessellation. Shader-based tessellation is very free-form: it just creates arbitrary primitives. This also means that it's not predictable; it just spews vertex data arbitrarily.
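
A geometry shader is one way to picture that free-form style. A rough, purely illustrative GLSL sketch (not anything from a real engine): the number of triangles emitted per invocation depends on the input data, so the hardware knows only the declared worst case (max_vertices), never the actual count:

    #version 400 core
    layout(triangles) in;
    layout(triangle_strip, max_vertices = 30) out;

    void main()
    {
        // Derive an arbitrary, data-dependent copy count (1..10) from the input.
        int copies = 1 + (int(abs(gl_in[0].gl_Position.x * 8.0)) % 10);

        for (int c = 0; c < copies; ++c) {
            for (int i = 0; i < 3; ++i) {
                vec4 p = gl_in[i].gl_Position;
                // Each successive copy is scaled toward the centre of the view.
                gl_Position = vec4(p.xyz * (1.0 / float(c + 1)), p.w);
                EmitVertex();
            }
            EndPrimitive();
        }
    }

Contrast that with the fixed-function tessellator described next, whose output is completely determined up front.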

The tessellation shader stuff is tightly bound to a fixed-function tessellator. You say how much tessellation to do, and that's how much gets done. A triangle gets split N ways, where N is some value. The hardware can fully predict the possible outcomes and can thus adjust accordingly. Given a particular tessellation level, it knows how much vertex data will come out.

Also, because the tessellation process is split between two shaders (one feeding data into the tessellator, the other being fed tessellated primitives), there is no need for direct communication between the two shaders. The communication is handled entirely through hardware-mediated buffers.

Each shader is specialized for its task. The control shader figures out what gets tessellated and how much tessellation to do. The evaluation shader takes the tessellated results and interpolates the per-vertex attributes to produce actual primitives. Specialization helps performance, because each one acts as a stage in a pipeline.
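
To show what the evaluation side does, here is a minimal GLSL sketch, again assuming triangle patches: the fixed-function tessellator hands each invocation a barycentric gl_TessCoord, and the shader just weights the patch's control points with it:

    #version 400 core
    // Domain, spacing, and winding are declared statically, so the tessellator's
    // output pattern is fully determined by these qualifiers plus the levels.
    layout(triangles, equal_spacing, ccw) in;

    void main()
    {
        // Interpolate this generated vertex from the three control points,
        // weighted by the barycentric coordinate supplied by the tessellator.
        gl_Position = gl_TessCoord.x * gl_in[0].gl_Position
                    + gl_TessCoord.y * gl_in[1].gl_Position
                    + gl_TessCoord.z * gl_in[2].gl_Position;
    }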

If it were all one shader, then each stage in the pipe would have to wait on the previous one. That's not an efficient use of hardware resources. A control shader doesn't have to wait for its evaluation shader to finish chewing on one primitive before starting to work on another. Oh sure, if that happens consistently, the intermediate buffers between them fill up and the control shader will have to wait for them to drain. But if it was just a hiccup in the pipeline, it still works out to be faster on average.