[QUOTE=l_belev;1260073]If issuing scalar instruction means 7/8 of the hardware resource is wasted then issuing vec4 instruction means 4/8 (50%) of the resources is wasted. But 4 is the biggest vector size available in glsl (and i don’t suppose their compiler is super-humanly smart as to be able to convert any and all 4-or-less vectored code to 8-vectored) so at all times at least 50% of the hardware resources is wasted? Unless the intel engineers are complete and utter idiots (which obviously is not true), what you state is simply impossible. You should have got something wrong
Please pay attention and DONT mix these two notions: 1) inter-work-item-SIMD and 2) intra-work-item-SIMD. All modern GPUs are the first but NOT the second! Being the first doesn’t mean they are not scalar from the POV of a single work item, but being the second means they are not scalar. My argument is that the assumption that no modern GPUs have the property 2), is good enough. If this assumption can really be made then we can have scalar-only binary code standard, which would greatly help the compilers.
I am pretty sure that i read in some of intel pdfs that their GPU execution units can be configured to work either as scalar or as vectored and their driver uses the scalar mode for fragment shaders but vectored mode for vertex shaders.
As for amd and nvidia, i am 100% sure their architectures are scalar all the way, and this has been so for a long time now. Of course I mean scalar from the POV of single work item, which is what concerns us here.
[/QUOTE]
You are missing my point. Let's first look at the hardware, pure hardware, and only then state how it is used in implementing an API. Here goes. Intel is a SIMD8 beast. It has a really flexible way to address registers, but at the end of the day the ALU is a SIMD8 thing at the ISA level. Thanks to that flexible addressing system, there are ways to issue a single instruction that operates on more than 8 things.
Now, how that is used for implementing graphics. For Gen7 and before, one vertex/geometry ISA invocation can do -2- vertices at a time. That is 2 vertices x 4 components = 8 lanes, so if the GL implementation can vectorize everything to fully used vec4 operations, then one gets 100% ALU utilization. For fragment shading, there are several modes: SIMD8, SIMD16 and SIMD32, which means that 8, 16, or 32 fragments are processed per fragment ISA invocation. The punchline is that the GL implementation does not need to vectorize for fragment shading at all. As a side note, the registers in Intel Gen hold 8 floats each and there are 128 registers.
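To put numbers on that, here is a rough sketch of the lane arithmetic. It is a simplified model of my description above, not of the actual hardware scheduler: the function name `alu_utilization` and the assumption that wider dispatches are issued as several full 8-lane passes are mine, for illustration only.

```python
SIMD_WIDTH = 8  # lanes per ALU instruction, as described above


def alu_utilization(items_per_invocation, components_per_op, simd_width=SIMD_WIDTH):
    """Fraction of SIMD lanes doing useful work for one operation.

    Simplified model (an assumption, not a hardware spec): every item
    contributes `components_per_op` active lanes, and anything wider than
    the ALU is issued as several full-width passes.
    """
    active = items_per_invocation * components_per_op
    passes = -(-active // simd_width)  # ceiling division
    return active / (passes * simd_width)


# Vertex shading, Gen7 and before: 2 vertices per invocation.
for components in (1, 2, 3, 4):
    print(f"vertex, vec{components} op: {alu_utilization(2, components):.0%}")

# Fragment shading: 8, 16 or 32 fragments per invocation, scalar ops.
for fragments in (8, 16, 32):
    print(f"fragment SIMD{fragments}, scalar op: {alu_utilization(fragments, 1):.0%}")
```

In this model a fully used vec4 vertex op fills all 8 lanes, a scalar vertex op fills only 2 of 8, and every fragment mode fills the ALU with purely scalar code.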
Don’t take my word for it: open up the i965 open source driver from Intel within Mesa, at src/mesa/drivers/dri/i965/, and see for yourself. For a -user- of Intel hardware this means that, functionally, fragment shading is scalar based and vertex shading is vec4 based (for Gen7 and before).
Talking about “work items” and such is really talking about the software API, not what the hardware actually is.
I agree with you: once the API makes it look scalar, it does not matter (mostly) to a software developer. However, for Gen7 and before on Intel, a scalar based IR for vertex and geometry shaders means something has to vectorize it back to vec4 operations, which is not pleasant work. This is my point. There is hardware out there for which a scalar based IR is not all cupcakes and cookies, though at least that hardware is older.
Worse, once we get to fp16, even the fragment shader will want to be at least vec2 vectorized. So a purely scalar based IR is not going to be ideal when fp16 support is wanted. The reason one wants fp16 is that one can get twice as many ops per clock compared to fp32.
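A small sketch of why fp16 pushes toward vec2: two half floats fit in the 32 bits that one fp32 value occupies, so an 8-float register could hold 16 halves. The helper names `pack_half2` and `unpack_half2` are made up for illustration, and the little-endian layout is just an assumption for the example, not a statement about how any particular GPU arranges its registers.

```python
import struct


def pack_half2(a, b):
    """Pack two values as IEEE fp16 into one 32-bit word (little-endian, `a` in the low half)."""
    return struct.unpack("<I", struct.pack("<2e", a, b))[0]


def unpack_half2(word):
    """Inverse of pack_half2: split a 32-bit word back into two fp16 values."""
    return struct.unpack("<2e", struct.pack("<I", word))


word = pack_half2(1.0, 2.0)
print(hex(word))           # 0x40003c00
print(unpack_half2(word))  # (1.0, 2.0)

# Same register size as 8 fp32 values, twice as many fp16 values.
FLOATS_PER_REGISTER = 8
print(FLOATS_PER_REGISTER * 4 // 2)  # 16 halves per register
```

A scalar fp16 op would leave the other half of each 32-bit slot idle, which is why the compiler would want to pair operations up as vec2.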
For nvidia and amd i can confirm this by actual performance tests: I have written some converter from microsoft’s binary shader code for dx9/dx8 (that code is unbelievable disgusting mess beyond words, full of exceptions, exceptions from the exceptions, nasty patches and hacks and so forth) to glsl and i implemented the converter in 2 variants, one that preserves the vectored operations and another that converts vectored to scalar. On both nvidia and amd both perform equally. There is no detectable slowdown for the scalar code. Haven’t tested on intel though because currently their opengl drivers are too buggy to run the application in question.
NVIDIA has this SIMT thing, which means their hardware is very, very scalar happy. Also, that test does NOT prove anything. Indeed, since the scalarization is machine generated and the code is not optimized out, a half-decent vectorizer could re-vectorize the code. Though for NVIDIA I know they don’t. For AMD I suspect they do not need to vectorize, though AMD is the one that contributed lots of vectorization magic to the LLVM project.
Again, for Intel Gen7 and before, try to keep your vertex shaders vec4-y to keep ALU utilization higher. However, the vast majority of applications are not geometry limited, so even if the ALU utilization for vertex shading is at 25%, it won’t matter. Indeed, most of the time Intel Gen is not even limited by float operations at all; it is limited by bandwidth. To get an idea of why: Intel Gen uses the same memory as system RAM, so that is DDR3 with bandwidth around 20-30 GB/s (higher numbers for newer hardware), shared with the CPU. In comparison, a dedicated video card, even a midrange one, using GDDR5 gets 200-300 GB/s.