Scalar binary/intermediate shader code



l_belev
05-23-2014, 08:09 AM
I think it's high time we got a standard binary shader format. But it should be scalar and NOT vectored! Why scalar? Here are my reasons:

1) Let's define the term "general scalar" as anything that can be (relatively) easily converted to and from simple scalar form, including any kind of instruction-level parallelism but excluding any kind of data-level parallelism.
As it turns out, "general scalar" GPU architectures are more efficient than vectored ones - they are able to utilize the hardware resources better. For this reason all major GPU architectures (for 10+ years now) are "general scalar". For them any vectored code is converted to their native "general scalar" code before it is executed. Thus vectored code only remains useful as a syntactical convenience, but that only applies to high-level languages intended to be used by people. The binary code in question is not intended for writing shaders in directly.

2) Converting code from vectored to scalar form is easy and incurs no code quality degradation. (In contrast, efficient conversion from scalar to vectored code is a very hard problem.) This means a scalar binary code would not place any additional burden on the compilers. Actually it's just the other way around, because:

3) Scalar code is much easier for optimization algorithms to analyze and process. This alone makes scalar ultimately better than vectored.

I have been watching how badly Microsoft's HLSL shader compiler performs. The code it generates is awful, mainly because it has to deal with the extreme burden that is the vectored model.

kRogue
05-26-2014, 01:11 AM
I think it's high time we got a standard binary shader format. But it should be scalar and NOT vectored! Why scalar? Here are my reasons:

1) Let's define the term "general scalar" as anything that can be (relatively) easily converted to and from simple scalar form, including any kind of instruction-level parallelism but excluding any kind of data-level parallelism.
As it turns out, "general scalar" GPU architectures are more efficient than vectored ones - they are able to utilize the hardware resources better. For this reason all major GPU architectures (for 10+ years now) are "general scalar". For them any vectored code is converted to their native "general scalar" code before it is executed. Thus vectored code only remains useful as a syntactical convenience, but that only applies to high-level languages intended to be used by people. The binary code in question is not intended for writing shaders in directly.



Depends on the hardware. Both Intel and AMD are SIMD based. Intel hardware is SIMD8 based. For fragment shading the scalar story is fine, since the hardware will invoke a SIMD8, SIMD16 or SIMD32 fragment shader to handle 8, 16 or 32 fragments in one go. However, for vertex and geometry shaders on Ivy Bridge and Sandy Bridge the hardware does 2 runs per invocation, so it really wants the code vectorized as much as possible. For the tessellation evaluation shader stage that performance can be important, since it might be invoked a great deal.




2) Converting code from vectored to scalar form is easy and incurs no code quality degradation. (In contrast, efficient conversion from scalar to vectored code is a very hard problem.) This means a scalar binary code would not place any additional burden on the compilers. Actually it's just the other way around, because:

3) Scalar code is much easier for optimization algorithms to analyze and process. This alone makes scalar ultimately better than vectored.

I have been watching how badly Microsoft's HLSL shader compiler performs. The code it generates is awful, mainly because it has to deal with the extreme burden that is the vectored model.



I think that vectorizing code is hard. It is heck-a-easier to optimize scalar code, just run with scalars, and then try to vectorize afterwards. The issue is that various optimizations on the scalars will then potentially prevent a vectorizer from doing its job. I am not saying it is impossible, but it is really freaking hard at times.

However, the entire need to vectorize will become moot as SIMD-based hardware shifts to invoking N vertex, geometry or tessellation instances per shot, where N is the width of the SIMD. Once we are there, we can stop worrying about vectorization entirely. Naturally NVIDIA can be giggling the entire time, since their SIMT-based architecture has been scalar since the GeForce 8 series, over 7 years ago.

mbentrup
05-26-2014, 01:13 AM
IMHO optimizing the generic shader code (except for size maybe) is a bad idea, because a GPU vendor will do a HW-specific optimization of the code anyway. If the generic optimizer decides to unroll a loop, but on the target hardware a loop would be faster, the optimizer would have to detect that there was a loop that had been unrolled and to un-unroll (re-roll ?) it.

kRogue
05-26-2014, 01:54 AM
IMHO optimizing the generic shader code (except for size maybe) is a bad idea, because a GPU vendor will do a HW-specific optimization of the code anyway. If the generic optimizer decides to unroll a loop, but on the target hardware a loop would be faster, the optimizer would have to detect that there was a loop that had been unrolled and to un-unroll (re-roll ?) it.

My personal pet preference would be for the chosen IR to be scalar-based and LLVM. That way, all the optimization passes embodied in the LLVM project become available. However, then a very hard, nasty part comes in: writing an LLVM backend for a GPU. That is hard, and really hard for SIMD GPUs, because there are so many different ways to access the registers. The jazz in LLVM that helps you write a backend (CodeGen or TableGen or SomethingGen) is really not up to writing backends for SIMD-based GPUs. So GPU vendors need to roll their own in that case.

l_belev
05-26-2014, 05:57 AM
kRogue, AMD hardware is not SIMD, it uses a large "meta instruction" (I'm not sure what terminology they use) that is actually a packet of several independent scalar instructions. This is "instruction-level" parallelism. In practical compiler-optimizing terms it is as good as plain scalar: real scalar code is almost trivially converted to this model without degradation. As for Intel, I read some of their specs and it looks like their architecture is somewhat of a mix. It can operate both as scalar and as vectored. It was mentioned that their execution units run in scalar mode when processing fragments but run in vectored mode while processing vertices. So I guess a scalar binary format would do well with their hardware too.
I checked for the PowerVR hardware (it is the most popular in the mobile space) and sure enough, it is scalar too.

The reason why a scalar GPU architecture is better is that in real-world shaders a big percentage of the code is often scalar (the shader logic/algorithm is simply like that), and when it runs on vectored hardware only one of the 4 channels does useful work. This is a great waste.
This problem does not exist on a scalar architecture. Unlike CPUs, GPUs have another way to parallelize the work - they process many identical items simultaneously (fragments, vertices). So the hardware architects figured that instead of staying idle, 3 of the 4 channels may process other fragments instead. Thus they arrived at the conclusion that a scalar GPU is better than a vectored one. This is not true for CPUs, because there you have no easy source of parallelism, so they had better provide SIMD instructions. Even if they are not often used, the few uses they have are still beneficial.

Ilian Dinev
05-26-2014, 06:38 AM
AMD stopped using VLIW ages ago; their GCN is scalar+vector; to get the best performance you need the vector instructions, which now operate almost identically to SSE1: 4 floats, no swizzle, no mask. http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf

IME, the optimizing GLSL compiler takes less time to parse+optimize than the backend, which further optimizes and converts to HW instructions. Thus a binary format will speed up compilation 2x at most. It'll also introduce the need to slowly try to UNDO optimizations that are bad for the specific GPU.
What needs to be done is to spread awareness that if you don't immediately query the results of glGetShaderiv()/glGetProgramiv() after glCompileShader() and glLinkProgram(), the driver can offload compilation/linkage of multiple shaders to multiple threads, speeding up the process 8-fold. Also, use shader binaries.
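A minimal C sketch of that pattern (assuming a loader/header that exposes the GL 2.0 entry points; shader-source loading and error handling are omitted):

#include <GL/gl.h>

/* Issue every compile and link first, query status only afterwards, so a
 * driver that compiles on worker threads is never forced to block early. */
void build_programs(GLuint *programs, const char **vs_src, const char **fs_src, int count)
{
    for (int i = 0; i < count; ++i) {
        GLuint vs = glCreateShader(GL_VERTEX_SHADER);
        GLuint fs = glCreateShader(GL_FRAGMENT_SHADER);
        glShaderSource(vs, 1, &vs_src[i], NULL);
        glShaderSource(fs, 1, &fs_src[i], NULL);
        glCompileShader(vs);                    /* no status query here...      */
        glCompileShader(fs);

        programs[i] = glCreateProgram();
        glAttachShader(programs[i], vs);
        glAttachShader(programs[i], fs);
        glLinkProgram(programs[i]);             /* ...and none here either      */
    }
    for (int i = 0; i < count; ++i) {           /* only now ask for the results */
        GLint linked = GL_FALSE;
        glGetProgramiv(programs[i], GL_LINK_STATUS, &linked);
        /* fetch the info log, fall back to a simpler shader, etc. */
    }
}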

l_belev
05-26-2014, 06:57 AM
I think you misunderstood the new AMD architecture. It is vectored, but each channel of the vector is a separate work item (fragment, vertex). From the point of view of a single work item, the architecture is scalar. At least that's what I gather from the PDF you pointed to. Unless I misunderstood something, it will be perfectly happy with scalar-only code.

Note that the vector width of the "vectored instructions" is 64, not 4. Also note that among the "scalar instructions" there are no floating-point ones, only integer, which means they are not intended for general-purpose calculations but mostly for control-flow logic. Also, as you pointed out, there are no swizzles and no write masks. This too says a lot. How do you imagine real-world shaders would fit in such a model, when they always rely heavily on swizzles?

Actually AMD switched from VLIW to pure scalar in the sense I was talking about, like NVIDIA did.
Please distinguish between "internal" vectored operations that are part of the shader math (e.g. a vec4 sum) and "external" vectored operations that span separate fragments. The first are manually written by the shader author, but the latter can be done automatically by the GPU because it processes many similar items at the same time.
Regarding the "internal" ones, the new AMD architecture doesn't have any, so it is scalar. This is all that matters for us here. How it optimizes its work by processing many fragments at once is not our concern.

Ilian Dinev
05-26-2014, 07:53 AM
You're completely right; somehow I missed that and "allocation granularity of 4 vgpr" misled me.

l_belev
05-26-2014, 07:58 AM
IMHO optimizing the generic shader code (except for size maybe) is a bad idea, because a GPU vendor will do a HW-specific optimization of the code anyway. If the generic optimizer decides to unroll a loop, but on the target hardware a loop would be faster, the optimizer would have to detect that there was a loop that had been unrolled and to un-unroll (re-roll ?) it.

There are both HW-specific (low-level) and general (high-level) optimizations. Both are very important. The high-level ones deal with things like replacing one math expression with another that is equivalent but faster to execute, common sub-expression elimination, copy propagation, dead code removal, etc. The high-level optimizations are much easier and more powerful to do on scalar code.

l_belev
05-26-2014, 08:19 AM
I think that vectorizing code is hard. It is heck-a-easier to optimize scalar code, just run with scalars, and then try to vectorize afterwards. The issue is that various optimizations on the scalars will then potentially prevent a vectorizer from doing its job. I am not saying it is impossible, but it is really freaking hard at times.


The idea is that a re-vectorizer is not needed because all modern GPUs are scalar. So my suggestion is to support a vectored format only in the high-level language, have the parser convert it to scalar, and from then on work only with the easy scalar format.

Hence let's have a scalar binary shader code standard for OpenGL.

Ilian Dinev
05-26-2014, 08:32 AM
What do you suppose would generate that binary? From what? How would you add new features and extensions?
How would you distribute the compiler? Who will create it, manage it, and under what license? How will compiler bugs be reported, fixed and then distributed to end users?

All this was solved automatically for shader-binary.

l_belev
05-26-2014, 01:23 PM
Well, one way is to use the existing infrastructure of glGetProgramBinary/glProgramBinary with a special new enum for the binaryFormat parameter. In this case the compiler will be built in (as it is now), but any external compiler will also be allowed, since the binary format is standard. A slightly tricky detail is that glProgramBinary deals with the whole program object and not with a specific shader. Maybe the new format will only be allowed for separable program objects that contain a single shader. Or maybe allow any program object, if that's not a problem.

Ah, since "binaryFormat" is an output parameter for glGetProgramBinary, then in order to tell this function to generate the standard format we want, we may have a new program object parameter, e.g.

glProgramParameteri(prog, GL_PROGRAM_BINARY_RETRIEVE_FORMAT, GL_the_new_format_enum);
While this parameter is not GL_ANY (the default), glGetProgramBinary will return a binary in the specified format. Something like that.
This function may fail with GL_INVALID_OPERATION if the given program object can't be retrieved in the requested format, for example if it was loaded from another binary format (and the driver can't recompile/convert it) or when the program object contains more than one shader.
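To illustrate, a rough sketch of how this might look from the application side. GL_PROGRAM_BINARY_RETRIEVE_FORMAT and GL_STANDARD_SCALAR_FORMAT are hypothetical enums from this proposal with made-up values; the rest is the existing ARB_get_program_binary machinery (assuming a GL 4.1-capable header/loader):

#include <GL/gl.h>
#include <stdlib.h>

#define GL_PROGRAM_BINARY_RETRIEVE_FORMAT 0x9999 /* hypothetical, proposed above */
#define GL_STANDARD_SCALAR_FORMAT         0x999A /* hypothetical, proposed above */

void *get_portable_binary(GLuint prog, GLsizei *out_len, GLenum *out_format)
{
    /* ask the driver to hand this program back in the standard scalar format */
    glProgramParameteri(prog, GL_PROGRAM_BINARY_RETRIEVE_FORMAT, GL_STANDARD_SCALAR_FORMAT);

    GLint len = 0;
    glGetProgramiv(prog, GL_PROGRAM_BINARY_LENGTH, &len);
    /* GL_INVALID_OPERATION here would mean the program cannot be expressed
     * in the requested format, as described above */

    void *blob = malloc(len);
    glGetProgramBinary(prog, len, out_len, out_format, blob);
    return blob;
}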

kRogue
06-17-2014, 04:52 AM
kRogue, AMD hardware is not SIMD, it uses a large "meta instruction" (I'm not sure what terminology they use) that is actually a packet of several independent scalar instructions. This is "instruction-level" parallelism. In practical compiler-optimizing terms it is as good as plain scalar: real scalar code is almost trivially converted to this model without degradation. As for Intel, I read some of their specs and it looks like their architecture is somewhat of a mix. It can operate both as scalar and as vectored. It was mentioned that their execution units run in scalar mode when processing fragments but run in vectored mode while processing vertices. So I guess a scalar binary format would do well with their hardware too.
I checked for the PowerVR hardware (it is the most popular in the mobile space) and sure enough, it is scalar too.


I am going to put my little bits in on the hardware differences between NVIDIA, AMD and Intel. Caveat: I've spent far less time with AMD than with the other two, and with Intel I've spent way too much time.

Here goes.

Intel is SIMD based all the way. The EU is a SIMD8 thing. Issuing a scalar instruction means that the results of 7 of the 8 slots are ignored completely. You can see this quite easily when looking at the advertised GFLOPS, clock speeds and number of EUs. Just to be clear: in Gen7 and before, vertex, geometry and tessellation shaders process 2 vertices per invocation, so the hardware really wants, so badly wants, vec4 ops in the code; vecN operations, for N<4, have N/4 utilization for vertex, geometry and tessellation shaders. So for Gen7 and before the compiler works hard to vectorize, and that sucks. Starting in Gen8 there is 8-wide dispatch, so 8 invocations are done at a time and utilization is 100%. For fragment shading there are several modes: SIMD8, SIMD16 and SIMD32. The benefit of the higher modes is that more fragments are handled per instruction; additionally, since the EU really is SIMD8, SIMD32 is great because instruction scheduling almost does not matter [think of one SIMD32 instruction as 4 SIMD8's; by the time the last SIMD8 is started, the first SIMD8 finishes].


NVIDIA is SIMT. An excellent article about it is here: http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html. When one first looks at SIMT vs SIMD, the difference seems like hair splitting. However, SIMT makes divergence of data sources much easier to handle, whereas in SIMD it is a giant headache [scatter-gather reads, anyone?]. Additionally, it dramatically simplifies the compiler backend work.

AMD is, as far as I know, SIMD with similar magicks to Intel's [though I think that for AMD each shader invocation handles many vertices, just like Intel Gen8, but the details are different].

What this auto-SIMD'ing in the Intel and AMD drivers does is make everything look scalar based, but the hardware is still a SIMD thing.

Lastly, that is really only about int and float multiply and add. The other operations - reciprocal, exp, ln, trig functions - are usually handled by a dedicated unit that operates on scalars, so those are expensive, i.e. rather than, say, 8 reciprocals per N clock cycles (where N is the number of iteration steps), it is just 1 reciprocal per N clock cycles. I do not know what NVIDIA really does, but I do not think each CUDA core has anything beyond an ALU to handle int and float multiplies and adds, so those iterative operations are also much more expensive.


But now, getting back to a nice IR form, the purpose of this thread. NVIDIA wants a scalar IR form as much as possible. Intel Gen7 (Ivy Bridge and before) wants vectorized code for everything but the fragment shader, and scalar for the fragment shader. For AMD, I am pretty sure it would want scalar too. However, for low-power parts, where float16 is important, the desire for vectorization will come back. The reason is that a SIMD-N unit that does N floats per op will also do 2N fp16's per op, so the compiler backend will need to vec2-vectorize fp16 at the shader level to get maximum utilization.

It would be really neat to have what D3D has had for ages: the ability to send byte code to the driver rather than source, where that byte code does not depend on hardware or driver. The main issue, as someone already pointed out, is who would create and maintain the dedicated compiler for that byte code format? Personally, I am all for an LLVM-based solution that is scalar based, but it won't be trivial. Even with LLVM, making a backend is diamond-rock-hard. To put it mildly, using LLVM CodeGen does not go well, so life is still hard.

l_belev
06-19-2014, 02:14 AM
Intel is SIMD based all the way. The EU is a SIMD8 thing. Issuing a scalar instruction means that the results of 7 of the 8 slots are ignored completely.

If issuing a scalar instruction means 7/8 of the hardware resources are wasted, then issuing a vec4 instruction means 4/8 (50%) of the resources are wasted. But 4 is the biggest vector size available in GLSL (and I don't suppose their compiler is so superhumanly smart as to be able to convert any and all 4-or-less vectored code to 8-vectored), so at all times at least 50% of the hardware resources are wasted? Unless the Intel engineers are complete and utter idiots (which obviously is not true), what you state is simply impossible. You must have got something wrong.

Please pay attention and DON'T mix up these two notions: 1) inter-work-item SIMD and 2) intra-work-item SIMD. All modern GPUs are the first but NOT the second! Being the first doesn't mean they are not scalar from the POV of a single work item, but being the second means they are not scalar. My argument is that the assumption that no modern GPUs have property 2) is good enough. If this assumption can really be made, then we can have a scalar-only binary code standard, which would greatly help the compilers.
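To make the distinction concrete, a rough C-style sketch (the 64-wide loop width is only an illustrative assumption, echoing the 64-wide vectored instructions mentioned earlier):

/* 1) inter-work-item SIMD: one "scalar" shader instruction is executed for
 *    many work items (fragments/vertices) at once by the hardware.
 *    From the point of view of a single item it is still scalar. */
void inter_work_item(float *r, const float *a, const float *b)
{
    for (int item = 0; item < 64; ++item)
        r[item] = a[item] * b[item];
}

/* 2) intra-work-item SIMD: a vec4 operation inside one work item.
 *    This is the kind of vectoring the binary format would NOT need. */
void intra_work_item(float r[4], const float a[4], const float b[4])
{
    for (int c = 0; c < 4; ++c)
        r[c] = a[c] * b[c];
}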

I am pretty sure that I read in some of Intel's PDFs that their GPU execution units can be configured to work either as scalar or as vectored, and that their driver uses the scalar mode for fragment shaders but the vectored mode for vertex shaders.
As for AMD and NVIDIA, I am 100% sure their architectures are scalar all the way, and this has been so for a long time now. Of course I mean scalar from the POV of a single work item, which is what concerns us here.

For NVIDIA and AMD I can confirm this with actual performance tests: I have written a converter from Microsoft's binary shader code for DX9/DX8 (that code is an unbelievably disgusting mess beyond words, full of exceptions, exceptions to the exceptions, nasty patches and hacks and so forth) to GLSL, and I implemented the converter in 2 variants: one that preserves the vectored operations and another that converts vectored to scalar. On both NVIDIA and AMD the two perform equally. There is no detectable slowdown for the scalar code. I haven't tested on Intel though, because currently their OpenGL drivers are too buggy to run the application in question.

mbentrup
06-19-2014, 04:47 AM
Doesn't NVIDIA convert all GLSL to ARB assembly, i.e. vectorized code, anyway?

l_belev
06-20-2014, 08:04 AM
I don't know what they do, but indeed their program "binaries" appear to be textual ARB assembly plus some binary metadata.
On the other hand, their CUDA/OpenCL stack uses their own assembly language called "PTX", which is much closer to their architecture and is scalar.
I wonder why they don't use it with OpenGL too. Smells like their OpenGL code contains thick layers of history of the kind that no one has the guts to attempt to dig into.

elFarto
06-20-2014, 10:32 AM
I've been looking at this very issue the past few days, and I think the best thing for OpenGL would be to ARBify NV_gpu_program{4,5} and friends (with a few tweaks) and use that as the base for all features. You can then modify the reference GLSL compiler to output ARB_gpu_program{4,5} programs, or have whatever middleware you're using generate it directly.

I've also noticed that the D3D10 HLSL bytecode maps almost perfectly to NV_gpu_program4 (minus the differences in samplers/textures D3D has).

Also, ARB_gpu_program{4,5} plus an ARB_separate_shader_samplers would make porting D3D games over to OpenGL very easy.

Regards
elFarto

kRogue
06-29-2014, 11:21 PM
If issuing a scalar instruction means 7/8 of the hardware resources are wasted, then issuing a vec4 instruction means 4/8 (50%) of the resources are wasted. But 4 is the biggest vector size available in GLSL (and I don't suppose their compiler is so superhumanly smart as to be able to convert any and all 4-or-less vectored code to 8-vectored), so at all times at least 50% of the hardware resources are wasted? Unless the Intel engineers are complete and utter idiots (which obviously is not true), what you state is simply impossible. You must have got something wrong.

Please pay attention and DON'T mix up these two notions: 1) inter-work-item SIMD and 2) intra-work-item SIMD. All modern GPUs are the first but NOT the second! Being the first doesn't mean they are not scalar from the POV of a single work item, but being the second means they are not scalar. My argument is that the assumption that no modern GPUs have property 2) is good enough. If this assumption can really be made, then we can have a scalar-only binary code standard, which would greatly help the compilers.

I am pretty sure that I read in some of Intel's PDFs that their GPU execution units can be configured to work either as scalar or as vectored, and that their driver uses the scalar mode for fragment shaders but the vectored mode for vertex shaders.
As for AMD and NVIDIA, I am 100% sure their architectures are scalar all the way, and this has been so for a long time now. Of course I mean scalar from the POV of a single work item, which is what concerns us here.



You are missing my point. Let's first talk about the hardware, pure hardware, and then state how it is used to implement an API. Here goes. Intel is a SIMD8 beast. It has a really flexible way to address registers, but at the end of the day the ALU is a SIMD8 thing at the ISA level. There are ways to issue instructions that operate on more than 8 things at once, coming from the flexible addressing system it has.

Now, how that is used to implement graphics. For Gen7 and before, one vertex/geometry ISA invocation can do -2- vertices at a time. So if the GL implementation can vectorize everything to fully used vec4 operations, then one gets 100% ALU utilization. For fragment shading there are several modes: SIMD8, SIMD16 and SIMD32, which means that 8, 16 or 32 fragments are processed per fragment ISA invocation. The punchline is that the GL implementation does not need to vectorize for fragment shading at all. As a side note, the registers in Intel Gen are 8 floats per register and there are 128 registers.

Don't take my word for it: open up, within Mesa, the i965 open source driver implementation from Intel at src/mesa/drivers/dri/i965/ and see for yourself. For a -user- of Intel hardware this means that, functionally, fragment shading is scalar based and vertex shading is vec4 based (for Gen7 and before).

Talking about "work items" and such is really talking about the software API, not what the hardware actually is.

I agree with you: once the API makes it look scalar, it does not matter (mostly) to a software developer. However, for Gen7 and before on Intel, a scalar-based IR for vertex and geometry shaders will mean something has to vectorize it back to vec4 operations, which is not pleasant work. This is my point. There is hardware out there for which a scalar-based IR is not all cupcakes and cookies, at least older hardware.

Worse, once we get to fp16, even the fragment shader will want to be at least vec2-vectorized. So a purely scalar-based IR is not going to be ideal once fp16 support is wanted. The reason one wants fp16 is that one can get twice as many ops per clock compared to fp32.





For NVIDIA and AMD I can confirm this with actual performance tests: I have written a converter from Microsoft's binary shader code for DX9/DX8 (that code is an unbelievably disgusting mess beyond words, full of exceptions, exceptions to the exceptions, nasty patches and hacks and so forth) to GLSL, and I implemented the converter in 2 variants: one that preserves the vectored operations and another that converts vectored to scalar. On both NVIDIA and AMD the two perform equally. There is no detectable slowdown for the scalar code. I haven't tested on Intel though, because currently their OpenGL drivers are too buggy to run the application in question.


For NVIDIA, they have this SIMT thing, which means they are very, very scalar-happy. Also, that test does NOT prove anything. Indeed, since the scalarization is machine generated and the code is not optimized further, a half-decent vectorizer could re-vectorize it. Though for NVIDIA I know they don't. For AMD I suspect they do not need to vectorize, though AMD is the one that contributed lots of vectorization magicks to the LLVM project.

Again, on Intel Gen7 try to keep your vertex shaders vec4-y to keep ALU utilization higher. However, the vast majority of applications are not geometry limited, so even if the ALU utilization for vertex shading is at 25%, it won't matter. Indeed, most of the time Intel Gen is not even limited by float operations at all; it is limited by bandwidth. To get an idea of why: Intel Gen uses the same memory as system RAM, so that is DDR3 with bandwidth around 20-30 GB/s (higher numbers for newer hardware), shared with the CPU. In comparison, a dedicated video card, even a midrange one using GDDR5, gets 200-300 GB/s.

l_belev
07-01-2014, 05:47 AM
To summarize your post, NVIDIA and AMD are 100% scalar-friendly while Intel is semi-scalar-friendly (only fragment shaders).

But what I know from Intel's documentation is that the issue with vertex shaders is not actually hardware- but software-related; that is, it is their driver that puts the hardware in vectored mode for vertex shaders and scalar mode for fragment shaders.
In other words the hardware can actually work in scalar mode for vertex shaders too; it's up to the driver. They could change this behavior with a driver update, which would be needed anyway in order to support the hypothetical new binary shader format.

Once the vectored mode is left unused, they could clean up their hardware by removing this redundant "flexibility", which would save power and die area. That's what all the other GPU vendors figured out already, some of them a long time ago.
That would also ease the job of their driver team.
One would expect that they should have learned the lesson from their long history as a chip maker that over-engineering does not result in more powerful hardware but in weaker hardware (remember Itanium?).
NVIDIA also learned a hard lesson with the GeForce 5 when they made it too "flexible" by supporting multiple precisions.

elFarto
07-01-2014, 07:13 AM
In other words the hardware can actually work in scalar mode for vertex shaders too; it's up to the driver.
Yes and no. Yes they could change it, but no, they still might be limited by hardware, specifically bandwidth.


They could change this behavior with a driver update, which would be needed anyway in order to support the hypothetical new binary shader format.
A new binary shader format that doesn't support already existing hardware isn't a particularly good format.

Also, I'm not sure I follow your reasoning in your first post. You say a new shader format should be scalar, but then proceed to say that it's easy and lossless to convert from vectored code to scalar code. So wouldn't keeping a vector format be preferable, since it's compatible with either hardware setup? Of course it's not quite true to say that going from vectored to scalar code is lossless, as you do lose the semantics, which are, as you mentioned, expensive to recover.

But as I said before, the NV_gpu_program{4,5} format is perfect for an intermediate format. Not only does the extension already exist, but all the drivers have partial implementations of it, as it's based on ARB_{fragment,vertex}_program, so there wouldn't be as much work for them to do to support it.

Regards
elFarto

l_belev
07-01-2014, 10:17 AM
Also, I'm not sure I follow your reasoning in your first post. You say a new shader format should be scalar, but then proceed to say that it's easy and lossless to convert from vectored code to scalar code. So wouldn't keeping a vector format be preferable, since it's compatible with either hardware setup? Of course it's not quite true to say that going from vectored to scalar code is lossless, as you do lose the semantics, which are, as you mentioned, expensive to recover.


OK, I will try to explain the reasoning again. I make the assumption that all modern GPUs are essentially scalar, which is mostly true, with Intel being a special case.
This means there is no need for any vectored semantics to be kept all the way to the GPU, and there is no need for expensive recovery (to use your words). And there is NO code degradation whatsoever.

Vectored code is much harder for compilers to optimize. Many of the standard optimizations are not even possible there. Others are much more limited and much less efficient.
For example, the Microsoft HLSL compiler produces an arbitrarily designed intermediate code that is vectored, and as a result it is essentially unoptimized, regardless of the HLSL compiler's efforts.
This forces all the GPU vendors to convert the code to scalar and then run full optimizations on it all over again, including the generic non-machine-dependent ones. This can cause long delays because those optimization passes are extremely complex and slow.

Now if the intermediate code were scalar instead, many of the optimizations could be done beforehand (e.g. by the game developer instead of at runtime on the user's machine). Of course the machine-dependent ones will still need to be done afterwards, but they are often the smaller part.
Another thing is that since the binary code would be standard, the conversion from GLSL to it would not be a vendor-specific task. So the OpenGL standard committee could form a working group to develop this compiler front-end
(it could be open source, etc.) and relieve the individual GPU vendors of this burden. The vendors would only need to take care of the smaller task of converting this already generically optimized scalar binary code to their specific machine's scalar binary code.

Please remember that one of the main purposes of a binary shader format is to minimize runtime delays from compiling/optimizing.
Another often-cited purpose is to obfuscate the shader logic to make it harder to steal. Scalar code obfuscates much better than vectored, because the high-level vector semantics are already discarded, which makes the code much harder to understand.

If you still don't follow my reasoning, I give up explaining.

elFarto
07-01-2014, 01:19 PM
I understand what you are saying, and agree that for the hardware a scalar representation is required. And yes, you can do better optimisations on scalar code vs vectorized code; there's no argument there. I just don't think that a generic binary representation is the right place to have this, especially as the vectorized form is really no harder for the drivers to understand, for example:

MUL R0, R1.xyzw, R2.x

is no different to:

MUL R0.x, R1.x, R2.x
MUL R0.y, R1.y, R2.x
MUL R0.z, R1.z, R2.x
MUL R0.w, R1.w, R2.x

In regard to Microsoft's HLSL compiler, most of the complaints I've seen about it are that it does too many optimisations, not that it is in vector format. The optimisations it's doing are detrimental to the drivers (see here (http://www.humus.name/Articles/Persson_LowlevelShaderOptimization.pdf), page 48). This is kinda the heart of the issue. The more optimisations you do without knowing what hardware you're targeting, the greater the risk you'll hinder the driver from being able to produce optimal code.

All of which makes this issue far more complicated than it needs to be.

There's also a talk from Ian Romanick from Intel about one of the shader IR's in Mesa that's worth a watch: https://www.youtube.com/watch?v=7I2oujSMDzg.

Regards
elFarto

l_belev
07-01-2014, 03:20 PM
There is a big difference between the two variants. Of course they will both produce the same result, but you miss the point.
The difference is that the second can be optimized further afterwards, while the first cannot.

As for your statement that "The more optimisations you do without knowing what hardware you're targeting, the greater the risk you'll hinder the driver from being able to produce optimal code" -
it may be true specifically for the Microsoft HLSL compiler, but it is not necessarily true for every possible compiler and general optimizer. The fact that their compiler does "optimizations" that actually make the code worse
is really their problem. I will say it plainly: their compiler is of very low quality. It is a "half-assed" piece of garbage, made just to give the game developers some "working" tool and be done with it.
I know very well how bad it is because I have very extensive experience with it; I have been watching what code it produces and trying to work around the numerous bugs and poor decisions it makes.
Microsoft aside, there are many possible generic optimizations that can be very powerful without ever making the code worse (if implemented well). Many (maybe all) general-purpose compilers employ such optimizations.
Most of them work only on scalar code. When the code is vectored and, even worse, can have arbitrary swizzles, they become extremely hard if not impossible.

mbentrup
07-02-2014, 01:08 AM
So, suppose the generic shader optimizer finds an opportunity to eliminate a common subexpression by storing it in a temporary register for later use. How should it decide whether the cost of the instructions to recompute it on the fly (if CSE is not done) exceeds the cost of using an extra register (if CSE is done), if it doesn't know about the underlying hardware?
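For concreteness, a small C sketch of the trade-off I mean (sinf/cosf just stand in for arbitrary expensive sub-expressions):

#include <math.h>

/* without CSE: a*b + c is computed twice, but no extra register is needed */
float without_cse(float a, float b, float c)
{
    return sinf(a * b + c) + cosf(a * b + c);
}

/* with CSE: computed once, but the temporary t occupies a register for the
 * whole expression - the cost the optimizer cannot judge without knowing
 * the hardware */
float with_cse(float a, float b, float c)
{
    float t = a * b + c;
    return sinf(t) + cosf(t);
}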

l_belev
07-02-2014, 03:37 AM
I don't know about CSE; maybe use some generic assumptions that are statistically best.
I consider arithmetic transformations among the more important things, since shaders are often very math-heavy.
To give a random simple example: something like sqrt(a)*b + sqrt(a)*c being replaced with sqrt(a)*(b + c), or sqrt(x*x) being replaced with abs(x). (If you wonder why someone would write things like sqrt(x*x) in the first place, people write all sorts of such pointless stuff. Sometimes it comes from different macros expanded together.)
Of course many of those can only be enabled by a compiler option that turns off IEEE strictness, which is fine for most cases (most graphics applications like games rarely care about IEEE strictness in their shaders).
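A tiny C sketch of the kind of rewrite I mean (each pair agrees only up to rounding, hence the IEEE-strictness caveat):

#include <math.h>

float before1(float a, float b, float c) { return sqrtf(a) * b + sqrtf(a) * c; }
float after1 (float a, float b, float c) { return sqrtf(a) * (b + c); }  /* one sqrt instead of two */

float before2(float x) { return sqrtf(x * x); }
float after2 (float x) { return fabsf(x); }                              /* no sqrt at all */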

kRogue
07-02-2014, 06:25 AM
To summarize your post, NVIDIA and AMD are 100% scalar-friendly while Intel is semi-scalar-friendly (only fragment shaders).

But what I know from Intel's documentation is that the issue with vertex shaders is not actually hardware- but software-related; that is, it is their driver that puts the hardware in vectored mode for vertex shaders and scalar mode for fragment shaders.
In other words the hardware can actually work in scalar mode for vertex shaders too; it's up to the driver. They could change this behavior with a driver update, which would be needed anyway in order to support the hypothetical new binary shader format.
Once the vectored mode is left unused, they could clean up their hardware by removing this redundant "flexibility", which would save power and die area. That's what all the other GPU vendors figured out already, some of them a long time ago.


For Intel's Gen7 and before, the fact that only 2 vertices are handled per invocation is a hardware limit. So for vertex and geometry shading, and tessellation too, the compiler backend should do everything it can to vectorize the code to vec4s. If a vertex shader cannot be vectorized at all, then for Intel Gen7 and before ALU utilization is at 25%. However, most programs are almost never vertex bottlenecked. Doing 8 vertices at a time requires more silicon and changes to the logic for pushing vertices into the pipeline, the post-vertex cache and so on. On the subject of tessellation, 8-wide dispatch would be great, but I suspect more silicon will then be needed on the GPU to buffer the tessellation. Geometry shader 8-wide dispatch is quite icky: lots of room is needed, one needs to execute it -every- time (no caching for GS really), and the output must be fed to the triangle-setup/rasterizer unit in API order, again requiring even more buffering, namely 4 times as much as 2-wide dispatch. I hate geometry shaders :p


For the case of fragment shading and the SIMD8, SIMD16 and SIMD32 modes, it is actually wise to have multiple fragment dispatch modes. The reasoning is register space: a hardware thread has 4KB of register space, so SIMD8 gives 512B of scratch per fragment, SIMD16 gives 256B and SIMD32 gives 128B. If the shader is too complicated, more scratch space is needed, hence the different modes for fragment shading.




That would also ease the job of their driver team.
One would expect that they should have learned the lesson from their long history as a chip maker that over-engineering does not result in more powerful hardware but in weaker hardware (remember Itanium?).
NVIDIA also learned a hard lesson with the GeForce 5 when they made it too "flexible" by supporting multiple precisions.

The reason the GeForce FX had such a hard time was that it was really designed for lower precisions in the fragment shader (fixed-point and half-float). Then DirectX 9 came around and said: you need fp32 support in the fragment shader. So although the FX could do it, it was not optimized for it at all. Worse, the word on the street was that MS and NVIDIA were having a spat over the price of the Xbox GPUs, which is why NVIDIA was left a little in the dark about fp32 in DX9. That is rumor, though, and I cannot find hard confirmation of it.

As for implementing fp16 vs fp32, the gist is this: let's say you have an ALU that is N-wide SIMD for fp32. It turns out adding fp16 to the ALU is not that bad, and then the ALU can do 2N-wide SIMD for fp16. That is a big deal, as one literally doubles the FLOPS if the code can be completely fp16. So the "easier" thing to do is to make the compiler emit vec2 fp16 operations. The easiest thing to do would be to support only fp16 ops and then double the fragment shader dispatch, but there are plenty of situations where fp32 is really needed, so pure fp16 fragment shaders are not going to happen. The case for vertex shading is similar and stronger: fp16 for vertex shading is just nowhere near good enough.

But back to a pre-compiled shader binary format (i.e. all the high-level stuff done). The main ickiness is vector vs scalar. When optimizing at the scalar level it is a heck of a lot easier to optimize for fewer ops. However, what ops should this thing have? It would need much more than GLSL has, because there are all sorts of gfx commands like, say, ClampAdd and so on. Getting worse is the scalar vs vector issue. On that side my thoughts are pretty simple: the hardware vendors get together and create several "modes" for such a thing, for example:

All ops are scalars
fp16 ops are vec2-loving, all others are scalars
vec4-loving and vec2-loving in general


Naturally it gets really messy when one considers that for some hypothetical hardware that likes vec ops for VS and supports fp16, one can imagine a nightmare like fp16-vec8. Ugly. So the hardware vendors would need to get together and make something that is worthwhile for each of their hardware. On the PC that just means AMD, Intel and NVIDIA. I shudder when I think about mobile, though.

elFarto
07-03-2014, 06:20 AM
So, suppose the generic shader optimizer finds an opportunity to eliminate a common subexpression by storing it in a temporary register for later use. How should it decide whether the cost of the instructions to recompute it on the fly (if CSE is not done) exceeds the cost of using an extra register (if CSE is done), if it doesn't know about the underlying hardware?
I agree that the compiler can't know the cost, but it should use as many registers as it needs so that no sub-expression needs to be recomputed. The driver can always undo that optimization later on if it is under register pressure.

Regards
elFarto

kRogue
07-03-2014, 10:56 AM
I agree that the compiler can't know the cost, but it should use as many registers as it needs so that no sub-expression needs to be recomputed. The driver can always undo that optimization later on if it is under register pressure.

Regards
elFarto

Very often compilers nowadays use SSA: Static Single Assignment. In SSA, a register can only be written -once- but can be read many times. LLVM is based on SSA. So for an IR, the representation is SSA with a virtual pool of (infinitely many) registers. A backend then does the hard work of instruction selection, instruction scheduling and register allocation (which requires tracking and understanding when a register is no longer read from).
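A tiny C-flavoured illustration of the single-assignment idea (plain C, not LLVM IR; the function names are just for illustration):

float not_ssa(float a, float b, float c, float d)
{
    float t = a * b;   /* t written here...       */
    t = t + c;         /* ...re-written here...   */
    t = t * d;         /* ...and a third time     */
    return t;
}

float ssa_like(float a, float b, float c, float d)
{
    float t0 = a * b;  /* each value gets its own once-written name,  */
    float t1 = t0 + c; /* which makes the data flow trivial to follow */
    float t2 = t1 * d;
    return t2;
}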

The main ugly bit for an IR is that what may appear to be an optimization may actually not be. Case in point: clamp(x, 0, 1). A naive optimization would be that if x is known to be non-negative it can be transformed to min(x, 1). However, on some GPUs many operations have a saturate analogue; for example there can be add and then add_saturate.

Someone somewhere mentioned this link, but I think it is most awesome to read: http://www.humus.name/Articles/Persson_LowlevelShaderOptimization.pdf

Other bat-crazy bits to think about: interpolation is usually a fragment shader instruction (this I know for a fact on Intel), texture "state" does not really need to exist on newer GPUs (think of bindless textures), and on some hardware vertex attribute loading is handled by shader code (which means that for such hardware changing the attribute format is just as bad as changing the vertex shader). In brutal honesty, neither DX11 nor GL4 really reflects the hardware enough now. Just the concept of "binding a texture" is archaic and meaningless for AMD or NVIDIA hardware of the last year or two.

elFarto
07-03-2014, 01:37 PM
I was going to mention something about SSA and saturate (and [-1,1] saturation too), but decided not to. Saturates would need to be moved to the operation that writes the register, so we get our ADD_SAT and MAD_SSAT or whatever. As for SSA, it would probably be good to have everything in SSA form as well, although I remember hearing something about SSA being a bit difficult for graphics/vector instruction sets, but I can't find the reference at the moment.

*edit* Found the link, it was in Mesa's GLSL Compiler README (http://cgit.freedesktop.org/mesa/mesa/tree/src/glsl/README#n126).

Regards
elFarto

kRogue
07-04-2014, 04:43 AM
Currently, Mesa's GLSL IR is going through the pain of being turned into an SSA form; in truth the IR in Mesa is, in spite of its age, technologically immature. It takes way too much memory, is difficult to manipulate when creating optimization passes, and so on. My personal opinion is that the FAQ and the Mesa GLSL IR itself were written by people who, although clever, were very ignorant of compiler matters. Additionally, much of the current compiler stuff in Mesa was written before or just at the time that SSA came to light.

LunarG has developed this: http://www.lunarglass.org/ which is a shader compiler stack using LLVM. It allows a backend to specify whether vectors must be kept, or whether drilling down to scalar is OK, and so on.

If we look at the D3D HLSL compiler, its main issue is that the instruction set it compiles to does not represent the underlying hardware well enough. If it had saturate ops, the various input modifiers that come for free, and a more realistic understanding of the various sampler commands, then the D3D HLSL compiler would not do so many odd things.

One evil and simple thought: look at the D3D HLSL compiler, identify where the instruction set to which it compiles does not match real hardware, and then augment that instruction set. However, one will need to realistically limit what hardware one is targeting, and that is the crux.

My opinion is that each major D3D version should just have a completely new compiler and instruction set to which it compiles, to better reflect the hardware of that generation. Ditto for an OpenGL intermediate portable form.

Possibly, we might have a few classes of "compile modes": an application would ship the compiled form of each shader for each mode, query the hardware for which mode is preferred/supported, and call it a day. Kind of like a bastard average between the current shader binary and the ideal of one portable form. The reasons why I advocate that are things like:

vec2 fp16 support. Some hardware has it, some doesn't. So vectorizing is only worthwhile on some hardware
vec4. Some hardware, for some shader stages (Intel Gen7 and before, for example), wants vec4 as much as possible

and there are more, especially in the land of texture sampling. By the way, shadow samplers are one of the ickiest things in Mesa's i965 driver. Every different comparison mode ends up as a different shader.

A piece of advice: if one is writing GL code for desktop and wants it to work, then make sure that your GL code is, outside of the various GL vs DX conventions, mostly trivially portable to DX. If an operation is core in GL and not available in DX, then chances are it is emulated. The only exceptions that I know of are the pixel center convention, the normalized z range, and vertex invocation [to a limited extent].

elFarto
07-04-2014, 05:44 AM
If we look at the D3D HLSL compiler, its main issue is that the instruction set it compiles to does not represent the underlying hardware well enough. If it had saturate ops, the various input modifiers that come for free, and a more realistic understanding of the various sampler commands, then the D3D HLSL compiler would not do so many odd things.
Not true, I got this bit of HLSL:

Output.Position = saturate((input.Position * -input.Position) + abs(input.Position));
to compile to this asm:

mad_sat o0.xyzw, v0.xyzw, -v0.xyzw, |v0.xyzw|

I'm not sure I've seen any instruction set that lets you have saturates on inputs though; abs and neg, sure, but sat and ssat only seem to be allowed on outputs.

I certainly agree with you that the Mesa guys don't really know how to do compilers. They seem to do a lot of their optimizations on the GLSL tree IR, which is just awkward. The TGSI instruction set it's meant to get compiled to doesn't support things like predicates or updating condition codes on any instruction. The fact that they have 3 IRs (Mesa IR, GLSL IR and TGSI) makes things even more confusing.

Regards
elFarto

kRogue
07-04-2014, 08:55 AM
Not true, I got this bit of HLSL:

Output.Position = saturate((input.Position * -input.Position) + abs(input.Position));
to compile to this asm:

mad_sat o0.xyzw, v0.xyzw, -v0.xyzw, |v0.xyzw|

I'm not sure I've seen any instruction set that lets you have saturates on inputs though; abs and neg, sure, but sat and ssat only seem to be allowed on outputs.


Looks like I should read everything on the internet with a huge grain of salt. It looks like the D3D HLSL compiler does know about sat, which I did not think it did. Learn something every day. However, just to make sure I do not have to learn something again, the D3D compiler bytecode is not scalar but vector? I know for D3D9 that is the case, but just to make sure...

Sadly, no matter what Microsoft does, the current situation is that some hardware will not be happy: most want scalar, except for Intel's, which wants vec4 for everything but fragment. Though I am of the opinion that for that case maybe there should be an option for a scalar or vector preference ... or just say to hell with it, make it all scalar and let the driver vectorize it.




I certainly agree with you that the Mesa guys don't really know how to do compilers. They seem to do a lot of their optimizations on the GLSL tree IR, which is just awkward. The TGSI instruction set it's meant to get compiled to doesn't support things like predicates or updating condition codes on any instruction. The fact that they have 3 IRs (Mesa IR, GLSL IR and TGSI) makes things even more confusing.

Regards
Stephen


TGSI is just for Gallium drivers. My take on it is that it is not meant to be the place where optimizations are performed. For Gallium drivers, the Gallium state tracker sits between Mesa and the driver. It reduces the API to a much simpler API. The shader API goes through TGSI: Gallium drivers are essentially fed TGSI for shaders. I lie a little bit there: there is an option to feed a Gallium driver LLVM IR, but AFAIK that LLVM IR is generated from TGSI anyway. The Mesa IR is not for GLSL, it is for the assembly interface.

elFarto
07-04-2014, 09:58 AM
However, just to make sure I do not have to learn something again, the D3D compiler bytecode is not scalar but vector?
Yes, it's vector. Although, I guess you could treat it as scalar and just specify one element at a time.


The Mesa IR is not for GLSL, it is for the assembly interface.
I know, I've been knee-deep in it for the past week, attempting to implement NV_gpu_program{4,5}. It doesn't look like it's going to be possible without large changes to the TGSI IR.

Regards
elFarto

kRogue
07-04-2014, 11:14 AM
To implement NV_gpu_program4 or 5, one will first need to attack just Mesa (not Gallium) to update the assembly interface to accept everything those extensions add; lots of pain there. Then you need to update the Gallium state tracker to convert the Mesa IR (that you updated) to TGSI. Then the real pain begins, as TGSI is really not good enough anymore. It is mostly fine-ish for D3D9 feature sets, but for features from NVIDIA's GeForce 8 series and up it just is not good enough. The situation is quite dire and at the same time really funny.

If I had the authority, I'd say a wise thing to do would be for Gallium to dump TGSI and use NVIDIA's PTX format that their CUDA stack uses.

l_belev
07-07-2014, 03:16 AM
My opinion is that the IL instruction set should not necessarily try to match any hardware, not even things that are common among all hardware.
For example, I think it should not have any modifiers (things like saturate on output and negate/abs on input). Instead it should have plain, simple and pure instructions, e.g. use a separate negate instruction instead of a negate modifier.
The reason for this opinion is that it is very easy for driver IL-to-HW converters to merge a negate instruction into modifiers on the following instructions that use the negated register. Same with the other modifiers.
But keeping the modifiers as separate instructions in the IL allows for a much simpler IL, which helps the frontend/optimization/whatever passes that work with the IL - now they don't have to worry about special cases and awkward rules and exceptions, e.g. having to remember that bitwise NOT and other unary operations are separate instructions while arithmetic negate and absolute value are input modifiers and saturate is an output modifier, and complicating our algorithms to handle the mess, or probably having many separate algorithms to handle the different cases and somehow tangling them together.

Sorry, by "il" i meant the intermediate binary code we are talking about. I recently read the AMD's intermediate language (il) documentation and i carried this word from there only by inertia :) (i don't consider it very appropriate because "language" sounds more like high level one, but it is assembly)

mbentrup
07-07-2014, 09:19 AM
If Khronos defines a generic assembly language, should there also be a strict GLSL-to-assembly mapping in the spec, or should every vendor be able to adjust its GLSL-to-assembly compiler to produce assembly that maps optimally to its hardware?

l_belev
07-07-2014, 10:32 AM
This is how I imagine it should be:

By no means should there be strict GLSL-to-assembly mapping requirements. The vendors are free to develop their own compilers if they feel they can do a better job at optimizing than the (hypothetical) standard reference compiler developed by Khronos.

On the other hand, a shader given in the standard assembly language form should be able to run on any implementation (and produce the same result) regardless of how the shader was produced - by the standard reference compiler, by a private vendor's compiler, converted from another shader assembly by a tool, or written by hand. After all, that's what a "standard" is about. This probably means there will be little to no incentive for the vendors to develop their own compiler front-ends, which is not bad. If they find a problem in the reference compiler, they had better contribute a fix to it rather than make a private branch. I think it'll also be easier for them, as it will lift some burden and let them focus more on developing their hardware instead of reinventing the wheel of writing compilers.

Then again, I'm not a HW vendor; they may have different opinions :)
And at the end of the day it's their opinion that matters. I am nobody; I just post suggestions.

elFarto
07-08-2014, 07:44 AM
...for example, I think it should not have any modifiers (things like saturate on output and negate/abs on input). Instead it should have plain, simple and pure instructions, e.g. use a separate negate instruction instead of a negate modifier.
The reason for this opinion is that it is very easy for driver IL-to-HW converters to merge a negate instruction into modifiers on the following instructions that use the negated register...

I'm not sure I buy that. It's surely going to be easier for a driver to de-optimize a MAD_SAT into a MAD and a SAT than it is to combine a separate MAD/SAT into a single instruction. Combining the instructions means looking through all the usages of the destination register to see if its only use is in a SAT instruction (which it might not be). The same goes for a neg/abs on the inputs. It's easy for the assembler to see that an input operand needs a negation/abs and (if the hardware doesn't support it) emit a NEG/ABS before the MAD.


If Khronos defines a generic assembly language, should there also be a strict GLSL-to-assembly mapping in the spec, or should every vendor be able to adjust its GLSL-to-assembly compiler to produce assembly that maps optimally to its hardware?

I have to agree with l_belev here: if the driver's got the GLSL, you might as well let it compile it as best it can.

Regards
elFarto

l_belev
07-09-2014, 02:35 AM
Sometimes the best way to do something is not the obvious way. Unfortunately, people often rush to do the obvious and fail to consider many important perspectives.

Nikki_k
07-10-2014, 01:49 AM
Sometimes the best way to do something is not the obvious way. Unfortunately, people often rush to do the obvious and fail to consider many important perspectives.


Correct. But before defining 'the best way' we first need to define precisely what a binary shader format is supposed to achieve.

I think your main concern is compilation time, right?
If you ask me, it's completely pointless to define the specifics of a binary format unless we know what part of the compilation process is the bottleneck here. Correct me if I'm wrong, but from my experience I'd guess it's the optimization process, not turning human-readable source into an equivalent binary representation. But if that's the case, I believe a binary format is utterly pointless, because the optimization results will be entirely different for different hardware (even for different generations from the same manufacturer, as hardware evolves!), so with a fixed low-level binary representation you'd inevitably run into other problems later in the game when the driver developers have to sort out the mess - and those are problems I'm seriously worried about, because they affect everybody.
So my conclusion is that for this case there is no better solution than the current method of doing a precompilation run and locally caching the generated binary.

On the other hand, if your concern is not having to provide human-readable code, the binary output should be as close as possible to the source code, not even trying to create pseudo-assembly out of it, so that the optimizers have data that is as generic as possible to work with. Any concept of scalar vs. vectorized would just be plain wrong in this case.

l_belev
07-10-2014, 03:54 AM
You make the assumption that the HW-independent work finishes with the parsing and that any and all possible optimizations are hardware-specific. This assumption is false, and that makes your reasoning, which is based on it, invalid. Typically the HW-independent optimizations are more expensive and time-consuming than the HW-specific ones. But giving you lectures on compiler optimizations is beyond our goal here.

At any rate, the suggestion was not directed at some random guy from the internet, who can neither make additions to the OpenGL spec nor be expected to even know what he is talking about. It was directed at the HW vendors/Khronos, who should know their stuff and can appreciate a suggestion if it is any good.