GPU guts

While browsing the message board, I came across some posts discussing the internals of the GPU, more specifically the idea of integrating CPU techniques into the GPU. That reminded me of a report I once wrote for a course on the internals of the NVIDIA GF3 GPU, more specifically the CPU-like architecture used to execute vertex programs. I remembered that I had never made that information public. I have some time to spare tonight, so now might be a good moment.

The goal was to explore how advanced the vertex program (VP) architecture and its compiler are. This information is not publicly available, and many sources contradict each other. I do not claim I am 100% correct, but these are my findings:

  • There are two kinds of instructions in a VP: SIMD instructions and scalar instructions. Some sources say these take a different number of clock cycles to execute, but I came to the conclusion that this is not the case. I created a number of vertex programs, varying the scalar/SIMD instruction mix in steps from 100%/0% to 0%/100%. They all showed exactly the same framerate.

  • Does the VP CPU use pipelining? One can write a vertex program containing dependent instructions (data hazards) that would force a pipeline, if there is one, to stall or restart. I created a number of vertex programs with the same number of instructions, but with an increasing percentage of data hazard instructions. Result: going from 0% to 100% data hazard instructions, the framerate dropped from 101 to 55 FPS. This indicates that the VP CPU indeed uses pipelining.

  • Does the compiler optimize? I created a number of vertex programs in which 0% to 100% of the instructions do not contribute to the final result (the written output registers). The framerate went up as the share of non-contributing instructions increased, so the VP compiler indeed optimizes your code (although it is not too advanced either). E.g., it will remove a DP3 R1, R2, R3 instruction if R1 is never used afterwards.

  • Does the compiler reschedule instructions?
    Rearranging the instructions of the same vertex program so that it calculates the same result but contains fewer or more data hazards has a noticeable effect on the FPS. This means the VP compiler most likely does not reschedule your instructions (or is not very advanced in doing so).
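
To make the data hazard experiment more concrete, here is a minimal sketch of how such test programs could be generated. It is not the tool used for the original tests; the function names, the choice of ADD as the test instruction, and the use of c[0]..c[3] as the transform rows and c[4] as a small constant are all assumptions for illustration. Every generated instruction feeds o[COL0], so none of them should be removed as non-contributing.

```python
import re

def make_program(n_ops, hazard_fraction):
    """Return NV_vertex_program text with n_ops ADDs.

    The first round(n_ops * hazard_fraction) ADDs form a chain on R0
    (each reads the result of the previous one); the rest rotate over
    R0..R3 so that neighbouring instructions are independent.
    """
    n_hazard = round(n_ops * hazard_fraction)
    lines = ["!!VP1.0"]
    # Assumption: c[0]..c[3] hold the rows of the transform matrix.
    for i, comp in enumerate("xyzw"):
        lines.append(f"DP4 o[HPOS].{comp}, c[{i}], v[OPOS];")
    for r in range(4):
        lines.append(f"MOV R{r}, v[COL0];")
    for i in range(n_ops):
        r = 0 if i < n_hazard else i % 4
        lines.append(f"ADD R{r}, R{r}, c[4];")
    # Fold all four accumulators into the output so every ADD contributes.
    lines += ["ADD R0, R0, R1;", "ADD R2, R2, R3;",
              "ADD R0, R0, R2;", "MOV o[COL0], R0;", "END"]
    return "\n".join(lines)

def count_hazards(program):
    """Count instructions that read the temp written by the instruction just before."""
    prev_dst = None
    hazards = 0
    for line in program.splitlines():
        if line in ("!!VP1.0", "END"):
            continue
        _, rest = line.rstrip(";").split(" ", 1)
        operands = [s.strip() for s in rest.split(",")]
        dst, srcs = operands[0], operands[1:]
        if prev_dst is not None and prev_dst in srcs:
            hazards += 1
        # Only temporaries (R0, R1, ...) can be read back by a later instruction.
        prev_dst = dst if re.fullmatch(r"R\d+", dst) else None
    return hazards

if __name__ == "__main__":
    for pct in (0, 25, 50, 75, 100):
        prog = make_program(20, pct / 100)
        print(pct, count_hazards(prog))
```

All variants have the same instruction count, so any difference in framerate between them can only come from the ordering and the dependencies, which is the point of the test.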

Clearly, these compiler issues affect how the first tests have to be set up (e.g. if you insert non-contributing instructions, they will be optimized away).
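
When building test programs it helps to check up front which instructions a dead-code pass could remove. The following is a rough sketch of such a check, not the actual NVIDIA compiler logic, which is unknown and may well be less thorough. It assumes whole-register operands (no write masks or swizzles on temporaries), and strip_dead_code is a name made up for this example.

```python
def strip_dead_code(instructions):
    """Keep only instructions that write an output register or a temp read later.

    Walks the program backwards while tracking which temporaries are still needed.
    """
    live = set()
    kept = []
    for line in reversed(instructions):
        _, rest = line.rstrip(";").split(" ", 1)
        operands = [s.strip() for s in rest.split(",")]
        dst, srcs = operands[0], operands[1:]
        if dst.startswith("o[") or dst in live:
            # The write satisfies the later read; its own sources become needed.
            live.discard(dst)
            live.update(s for s in srcs if s.startswith("R"))
            kept.append(line)
    kept.reverse()
    return kept

program = [
    "MOV R2, v[NRML];",
    "MOV R3, c[4];",
    "DP3 R1, R2, R3;",               # R1 is never read: does not contribute
    "MOV o[COL0], v[COL0];",
    "DP4 o[HPOS].x, c[0], v[OPOS];",
]

if __name__ == "__main__":
    for line in strip_dead_code(program):
        print(line)
```

If a test program shrinks under this check, its framerate may say more about the optimizer than about the instructions you meant to measure.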

These experiments were run on an NVIDIA GF3 with the NV_vertex_program extension. Again, I do not claim that my conclusions are correct (but I do believe so). What surprised me most is that the VP CPU apparently uses some kind of pipelining, even though most instructions in a VP do depend on one another.