What is this magical thing done by newer hardware that effectively enables hardware tessellation?
Why does hardware still have fixed-function blending, when you can do blending just fine in shaders?
Because it’s faster.
Tessellation means taking one primitive of a particular size and breaking it up into many primitives. You put one triangle in, and you can get 16 or 32 out.
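To make the blow-up concrete, here's a toy counting sketch (my own arithmetic, not any particular API's rules): uniformly subdividing a triangle's edges into `level` segments yields `level²` sub-triangles.

```python
def tessellated_counts(level):
    """Count the output of uniformly subdividing one triangle.

    Toy model: each edge split into `level` segments gives level**2
    sub-triangles and (level+1)(level+2)/2 unique vertices.
    """
    triangles = level * level
    vertices = (level + 1) * (level + 2) // 2
    return triangles, vertices

# One input triangle at level 4 becomes 16 triangles:
print(tessellated_counts(4))  # (16, 15)
print(tessellated_counts(6))  # (36, 28)
```

So even modest tessellation levels multiply the primitive count many times over.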
In order to make this process efficient, the hardware has to be designed to expect it. Remember: before tessellation, hardware vendors could assume that if one triangle comes into the vertex shader, then at most one triangle will be rendered (it could be culled as off-screen). Therefore, any buffering between the vertex shader and the primitive rasterizer existed primarily to make vertex shaders execute less frequently. That is, the post-T&L vertex cache. This cache could contain maybe 15-25 vertices, depending on their size.
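The post-T&L cache's job can be illustrated with a tiny simulation (the FIFO policy and cache size here are assumptions for illustration; real hardware policies vary):

```python
from collections import deque

def shader_invocations(indices, cache_size=16):
    """Toy model of a post-T&L vertex cache.

    Each index triggers a vertex-shader invocation unless that
    vertex's transformed result is still in a small FIFO cache.
    """
    cache = deque(maxlen=cache_size)
    invocations = 0
    for i in indices:
        if i not in cache:
            invocations += 1
            cache.append(i)
    return invocations

# A strip-like index pattern reuses vertices heavily, so far fewer
# shader runs than indices:
tris = [0, 1, 2,  1, 2, 3,  2, 3, 4,  3, 4, 5]
print(shader_invocations(tris))  # 6, not 12
```

The point being: this buffering only ever had to hold a handful of vertices, because one vertex in meant at most one vertex out.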
If one triangle can become 16 triangles, that blows up the post-T&L cache. It blows up most of the memory buffers between the vertex shader and the primitive processor. Indeed, dealing with this explosion in data size requires a completely different way of handling vertex data. You need large buffers between the vertex shader and the rasterizer.
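Some rough back-of-the-envelope numbers (all hypothetical, just to show the scale mismatch) make it clear why the old cache can't cope:

```python
# Classic post-T&L cache: a handful of vertices (per the text, ~15-25).
cache_vertices = 20

# A single tessellated patch at a high level (64 is a common API
# maximum, used here as an assumption) can produce thousands:
max_level = 64
worst_case = (max_level + 1) * (max_level + 2) // 2  # vertex count
print(worst_case)  # 2145 vertices from ONE input patch
```

Two orders of magnitude more output per input primitive is not something a 20-entry cache design can absorb; you need genuinely large intermediate buffers.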
And then there’s the actual process of tessellation. Shader-based tessellation is very free-form: a shader just creates arbitrary primitives. This also means that it’s not predictable; it can spew out however much vertex data it likes, so the hardware cannot plan for the output.
The tessellation shader stages, by contrast, are bound to a fixed-function tessellator. You say how much tessellation to do, and that’s how much gets done. A triangle gets split N ways, where N comes from the tessellation levels you provide. The hardware can fully predict the possible outcomes and can thus adjust accordingly. Given a particular tessellation level, it knows exactly how much vertex data will come out.
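A sketch of what makes the fixed-function tessellator predictable (this models uniform inner tessellation only and ignores per-edge outer levels, which real tessellators also handle): for a given level it emits a fixed grid of barycentric coordinates, so the output size is known before any work happens.

```python
def tessellate_barycentric(level):
    """Emit the fixed barycentric grid for one triangle at `level`.

    The pattern and count depend only on `level`, which is exactly
    what lets hardware size its buffers up front.
    """
    coords = []
    for i in range(level + 1):
        for j in range(level + 1 - i):
            k = level - i - j
            coords.append((i / level, j / level, k / level))
    return coords

pts = tessellate_barycentric(4)
print(len(pts))  # 15 -- knowable without running anything
```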
Also, because the tessellation process is bracketed by two shaders (one feeding data through the tessellator, the other being fed tessellated primitives), there is no need for direct communication between the two shaders. The communication is handled entirely through hardware-mediated buffers.
Each shader is specialized for its task. The control shader figures out what gets tessellated and by how much. The evaluation shader takes the tessellated results and interpolates the per-vertex attributes to produce actual primitives. Specialization helps performance, because each stage acts as part of a pipeline.
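The division of labor can be sketched like this (function names and the distance-based level choice are hypothetical, just to show that each stage does one narrow job): the "control" step only decides a level, and the "evaluation" step only interpolates attributes at a tessellator-supplied coordinate.

```python
def control(patch, camera_dist):
    # Decide the tessellation level for this patch; no vertices are
    # generated here. (Distance threshold is an arbitrary example.)
    return 8 if camera_dist < 10.0 else 2

def evaluate(patch, bary):
    # Interpolate per-vertex positions at one tessellated point,
    # using the barycentric coordinate handed over by the tessellator.
    (ax, ay), (bx, by), (cx, cy) = patch
    u, v, w = bary
    return (u * ax + v * bx + w * cx,
            u * ay + v * by + w * cy)

patch = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
level = control(patch, camera_dist=5.0)
print(level)                               # 8
print(evaluate(patch, (0.5, 0.25, 0.25)))  # (0.25, 0.25)
```

Neither function needs to know the other exists; the tessellator sits between them and supplies the coordinates.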
If it were all one shader, then each stage in the pipe would have to wait on the previous. That’s not an efficient use of hardware resources. A control shader doesn’t have to wait for its evaluation shader to finish chewing one primitive before starting to work on another. Oh sure, if it has to do this consistently, then the intermediary buffers between them fill up and it will have to wait for those to clear out. But if there was just a hiccup in the pipeline, it still works out to be faster on average.
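That buffering behavior looks roughly like a classic bounded producer/consumer (this is a software analogy for the hardware-mediated buffers, not how GPUs are literally programmed): the producer only stalls when the buffer is full, not on every item.

```python
import queue
import threading

def producer(buf, items):
    for x in items:
        buf.put(x)     # blocks only when the buffer is full
    buf.put(None)      # sentinel: no more primitives coming

def consumer(buf, out):
    while True:
        x = buf.get()
        if x is None:
            break
        out.append(x * 2)  # stand-in for evaluation-stage work

buf = queue.Queue(maxsize=4)  # small intermediary buffer
out = []
t1 = threading.Thread(target=producer, args=(buf, range(10)))
t2 = threading.Thread(target=consumer, args=(buf, out))
t1.start(); t2.start()
t1.join(); t2.join()
print(out)  # [0, 2, 4, ..., 18]
```

A brief stall on the consumer side just lets the buffer absorb the hiccup, which is exactly the "faster on average" behavior described above.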