I am going to put in my two cents on the hardware differences between NVIDIA, AMD and Intel. Caveat: I've spent far less time with AMD than with the other two, and with Intel I've spent way too much time.
kRogue, AMD hardware is not SIMD; it uses a large "meta instruction" (I'm not sure what terminology they use) that is actually a packet of several independent scalar instructions. This is "instruction-level" parallelism. In practical compiler-optimization terms it is as good as plain scalar: real scalar code converts to this model almost trivially and without degradation. As for Intel, I read some of their specs and it looks like their architecture is something of a mix. It can operate both as scalar and as vector. The specs mention that their execution units run in scalar mode when processing fragments but in vector mode when processing vertices. So I guess a scalar binary format would do well with their hardware too.
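To make the "packet of independent scalar instructions" idea concrete, here is a minimal sketch of how a compiler could pack scalar code into such packets (this style is commonly called VLIW). Everything in it is an assumption for illustration: the instruction tuples, the register names, the packet width of 4, and the greedy strategy are all hypothetical and are not taken from any vendor's compiler. It also assumes each destination is written only once, so only read-after-write dependencies matter.

```python
from typing import Dict, List, Tuple

# Hypothetical scalar instruction: (dest, op, src_a, src_b).
Instr = Tuple[str, str, str, str]


def pack_bundles(instrs: List[Instr], width: int = 4) -> List[List[Instr]]:
    """Greedily pack scalar instructions into packets of independent ops.

    Assumes every destination is written exactly once, so an instruction
    only has to wait for the packets that produce its sources.
    """
    bundles: List[List[Instr]] = []
    ready_at: Dict[str, int] = {}  # value name -> first packet that may read it

    for ins in instrs:
        dest, _op, *srcs = ins
        # Sources not produced by this program (shader inputs) are ready at 0.
        slot = max((ready_at.get(s, 0) for s in srcs), default=0)
        # Skip packets that are already full.
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1
        while len(bundles) <= slot:
            bundles.append([])
        bundles[slot].append(ins)
        ready_at[dest] = slot + 1  # result is visible from the next packet on

    return bundles


# A 4-component dot product written as plain scalar code.
program: List[Instr] = [
    ("t0", "mul", "a.x", "b.x"),
    ("t1", "mul", "a.y", "b.y"),
    ("t2", "mul", "a.z", "b.z"),
    ("t3", "mul", "a.w", "b.w"),
    ("t4", "add", "t0", "t1"),
    ("t5", "add", "t2", "t3"),
    ("t6", "add", "t4", "t5"),
]

bundles = pack_bundles(program)
print([len(b) for b in bundles])  # [4, 2, 1]
```

The seven scalar instructions end up in three packets: the four multiplies are independent and issue together, and the adds follow as their inputs become available. The scalar code did not have to be written in any special way for this to work, which is the point being made above.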
I also checked the PowerVR hardware (it is the most popular in the mobile space) and sure enough, it is scalar too.