I have been a graphics programmer for quite some time now (OpenGL and DirectX), but I have never fully understood some of the more intricate details of rendering or how they are implemented in hardware and drivers.

1) My first question, for anyone who knows how a driver works or should work, is about draw call consistency. I realize it exists in some form; I'm just curious whether the official specs talk about it, because I haven't read anything of the sort anywhere. So what do I mean? Take a draw call: I send vertices, they get turned into pixels, but let's say the hardware has 1000 streams and only the last 50 pixels are left to process. That presumably means 50 parallel threads running with 950 idling, because there are no more pixels, right? Does the GPU wait until all the work from a draw call has finished before beginning a new draw call? Or can it start a new draw call and process new vertices, even pixels, before the previous one has entirely finished? In theory, those 950 streams could get through even two small draw calls before the previous one finishes, if its pixel shader is complex enough. If the GPU does wait, however, that could explain why sending more data in one batch (on newer hardware) is faster than sending a dozen smaller batches.
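
To make the scenario in 1) concrete, here is a toy Python model of the two behaviours I am asking about. It is only a sketch under my own assumptions: `LANES`, `waves`, `serialized_cost`, `overlapped_cost` and `idle_lanes_in_last_wave` are hypothetical names, every work item (vertex or pixel) is assumed to cost exactly one time step, and it does not describe how any real GPU or driver schedules work.

```python
import math

LANES = 1000  # hypothetical number of parallel hardware "streams"


def waves(work_items, lanes=LANES):
    """Time steps needed to push work_items through `lanes` parallel lanes."""
    return math.ceil(work_items / lanes)


def idle_lanes_in_last_wave(work_items, lanes=LANES):
    """Lanes with nothing to do while the tail of a draw call finishes."""
    remainder = work_items % lanes
    return 0 if remainder == 0 else lanes - remainder


def serialized_cost(draws, lanes=LANES):
    """Assumption A: each draw call must drain fully before the next starts."""
    return sum(waves(n, lanes) for n in draws)


def overlapped_cost(draws, lanes=LANES):
    """Assumption B: idle lanes may pick up work from the next draw call."""
    return waves(sum(draws), lanes)


if __name__ == "__main__":
    draws = [1050, 400, 400]  # work items per draw call (made-up numbers)
    print("idle lanes at the tail of draw 0:", idle_lanes_in_last_wave(draws[0]))
    print("time steps if serialized:", serialized_cost(draws))
    print("time steps if overlapped:", overlapped_cost(draws))
```

With these made-up numbers, the first draw call leaves 950 lanes idle while its last 50 items finish. Under assumption A the three draw calls take 4 time steps in total, while under assumption B they take 2, which is the gap I am asking about.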

2) My second question actually comes a bit before this one: do pixel shaders get invoked only after all vertices have been processed? Say we again have 1000 streams and 48 vertices going down the pipe, leaving 952 streams idle. Can those 952 streams process pixels from the previous vertices, or do they all wait until every vertex has been processed? The pipeline is described as having all stages in order, but I haven't seen it stated that pixels must not be processed right after rasterization is complete.

3) The case for new hardware architectures: if we already have unified architectures that can do any type of computation in parallel, does current-generation hardware (HD7XXX and GTX 7XX) still have fixed-function hardware dedicated to, say, rasterization? Or alpha testing, or logical operations on a framebuffer? For example, when we had DirectX 10 level hardware and everyone was saying they had unified architectures, I would have assumed that moving to DirectX 11/GL4 features would not require new hardware. Why? Well, you could implement tessellation as a shader stage just like the other three stages; in effect it doesn't require new processor operations. I know DX11 also introduced bit shifting and some other things, but I don't see why tessellation needed new hardware in a unified architecture.

4) Do drivers work in server/client mode or just user/kernel/device mode? I think the second option is true, but I can't be 100% sure. I initially thought it was the first one because the GL specs talk about a "client" and a "server". What I imagined was that right after my GPU boots there is something like an operating system on it, technically a second computer running one program, the driver. When I sent commands to the GPU, it would be just like PC networking: sending messages into a socket, getting them out at the other end, doing the processing and sending the results back to me. I couldn't imagine true parallelism happening any other way. That was until I read some driver code from Mesa and saw that there is actually a ton of CPU code in the driver that doesn't look like it's dealing with sockets, and then some DirectX driver API where they even standardized GPU command buffers.

What I really wanted with 4), a few years back, was a way to know when, or whether, a GPU command had finished. I now realize there are synchronization APIs in DirectX 10+/GL3+ that deal with that.