Part of the Khronos Group
OpenGL.org


Thread: Scalar binary/intermediate shader code


  1. #1
    Member Regular Contributor
    Join Date
    Apr 2004
    Posts
    260

    Scalar binary/intermediate shader code

    I think it's high time we got a standard binary shader format. But it should be scalar and NOT vectored! Why scalar? Here are my reasons:

    1) Let's define the term "general scalar" to mean anything that can (relatively) easily be converted to and from simple scalar form, including any kind of instruction-level parallelism but excluding any kind of data-level parallelism.
    As it turned out, "general scalar" GPU architectures are more efficient than vectored ones: they are able to utilize the hardware resources better. For this reason all major GPU architectures (for 10+ years now) are "general scalar". For them, any vectored code is converted to their native "general scalar" code before it is executed. Thus vectored code remains useful only as a syntactic convenience, and that applies only to high-level languages intended to be used by people. The binary code in question is not intended for writing shaders in directly.

    2) Converting code from vectored to scalar form is easy and incurs no loss of code quality. (In contrast, efficient conversion from scalar to vectored code is a very hard problem.) This means a scalar binary code would not place an additional burden on the compilers. Actually it's just the other way around, because:

    3) Scalar code is much easier for optimization algorithms to analyze and process. This reason makes scalar ultimately better than vectored.
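To make point 2 concrete, here is a minimal sketch of why the vectored-to-scalar direction is mechanical. The toy IR below (`VecMad`, `ScalarMad`, `scalarize_mad`) is hypothetical and invented for illustration; it is not any real driver's or compiler's representation, and it ignores swizzles.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy IR: dst = a * b + c over the components selected by mask. */
typedef struct {
    int dst, a, b, c;   /* register numbers */
    unsigned mask;      /* bit i set => component i (x, y, z, w) is written */
} VecMad;

/* The same operation restricted to a single component. */
typedef struct {
    int dst, a, b, c;   /* register numbers */
    int comp;           /* 0..3: the component this scalar op touches */
} ScalarMad;

/* Expand one vector MAD into one scalar MAD per written component.
   Returns how many scalar instructions were stored in out[0..3]. */
size_t scalarize_mad(const VecMad *v, ScalarMad out[4])
{
    assert(v != NULL && out != NULL);
    size_t n = 0;
    for (int comp = 0; comp < 4; ++comp) {
        if (v->mask & (1u << comp)) {
            out[n].dst = v->dst;
            out[n].a = v->a;
            out[n].b = v->b;
            out[n].c = v->c;
            out[n].comp = comp;
            ++n;
        }
    }
    return n;
}
```

The expansion is a single loop with no analysis. Going the other way would mean discovering which scalar ops can be regrouped into one vector op, and that is where the hard part lives.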

    I have been watching how badly Microsoft's HLSL shader compiler performs. The code it generates is awful, mainly because it has to deal with the extreme burden that is the vectored model.

  2. #2
    Advanced Member Frequent Contributor
    Join Date
    Apr 2009
    Posts
    607
    Quote Originally Posted by l_belev View Post
    I think it's high time we got a standard binary shader format. But it should be scalar and NOT vectored! Why scalar? Here are my reasons:

    1) Let's define the term "general scalar" to mean anything that can (relatively) easily be converted to and from simple scalar form, including any kind of instruction-level parallelism but excluding any kind of data-level parallelism.
    As it turned out, "general scalar" GPU architectures are more efficient than vectored ones: they are able to utilize the hardware resources better. For this reason all major GPU architectures (for 10+ years now) are "general scalar". For them, any vectored code is converted to their native "general scalar" code before it is executed. Thus vectored code remains useful only as a syntactic convenience, and that applies only to high-level languages intended to be used by people. The binary code in question is not intended for writing shaders in directly.
    Depends on the hardware. Both Intel and AMD are SIMD based. Intel hardware is SIMD8 based. For fragment shading, the scalar story is fine since it will invoke a SIMD8, SIMD16 or SIMD32 fragment shader to handle 8, 16 or 32 fragments in one go. However, for vertex and geometry shaders on Ivy Bridge and Sandy Bridge the hardware does 2 runs per invocation, so it really wants the code vectorized as much as possible. For the tessellation evaluation shader stage, that performance can be important since it might be invoked a great deal.


    2) Converting code from vectored to scalar form is easy and incurs no loss of code quality. (In contrast, efficient conversion from scalar to vectored code is a very hard problem.) This means a scalar binary code would not place an additional burden on the compilers. Actually it's just the other way around, because:

    3) Scalar code is much easier for optimization algorithms to analyze and process. This reason makes scalar ultimately better than vectored.

    I have been watching how badly Microsoft's HLSL shader compiler performs. The code it generates is awful, mainly because it has to deal with the extreme burden that is the vectored model.
    I think that vectorizing code is hard. It is heck-a-easier to optimize scalar code, just run with scalars, and then try to vectorize afterwards. The issue is that various optimizations on the scalars can then prevent a vectorizer from doing its job. I am not saying it is impossible, but it is really freaking hard at times.

    However, the entire need to vectorize will become moot as SIMD-based hardware shifts to invoking N vertex, geometry or tessellation instances per shot, where N is the width of the SIMD. Once we are there, we can stop worrying about vectorization entirely. Naturally NVIDIA can be giggling the entire time, since their SIMT-based arch has been scalar since the GeForce 8 series, over 7 years ago.

  3. #3
    Member Regular Contributor
    Join Date
    Apr 2004
    Posts
    260
    Quote Originally Posted by kRogue View Post
    I think that vectorizing code is hard. It is heck-a-easier to optimize scalar code, just run with scalars, and then try to vectorize afterwards. The issue is that various optimizations on the scalars can then prevent a vectorizer from doing its job. I am not saying it is impossible, but it is really freaking hard at times.
    The idea is that a re-vectorizer is not needed because all modern GPUs are scalar. So my suggestion is to support the vectored format only in the high-level language, have the parser convert it to scalar, and from then on work only with the easy scalar format.

    Hence, let's have a scalar binary shader code standard for OpenGL.

  4. #4
    Senior Member OpenGL Pro Ilian Dinev's Avatar
    Join Date
    Jan 2008
    Location
    Watford, UK
    Posts
    1,294
    What do you suppose would generate that binary? From what? How would new features and extensions be added?
    How would you distribute the compiler? Who will create it and manage it, and under what license? How will compiler bugs be reported, fixed and then distributed to end-users?

    All this was solved automatically for shader-binary.

  5. #5
    Member Regular Contributor
    Join Date
    Apr 2004
    Posts
    260
    Well, one way is to use the existing infrastructure of glGetProgramBinary/glProgramBinary with a special new enum for the binaryFormat parameter. In this case the compiler will be built in (as it is now), but any external compiler will also be allowed, since the binary format is standard. A slightly tricky detail is that glProgramBinary is about the whole program object and not about a specific shader. Maybe the new format will only be allowed for separable program objects that contain a single shader. Or maybe allow any program object, if it's not a problem.

    Ah, since "binaryFormat" is an output parameter of glGetProgramBinary, in order to tell this function to generate the standard format we want, we may have a new program object parameter, e.g.
    Code :
    glProgramParameteri(prog, GL_PROGRAM_BINARY_RETRIEVE_FORMAT, GL_the_new_format_enum);
    While this parameter is not GL_ANY (the default), glGetProgramBinary will return the binary in the specified format. Something like that.
    This function may fail with GL_INVALID_OPERATION if the given program object can't be retrieved in the requested format, for example if it was loaded from another binary format (and the driver can't recompile/convert it) or when the program object contains more than one shader.
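The rule proposed above can be sketched as a toy model in plain C, with no GL context involved. Every name here (`ToyProgram`, `toy_get_program_binary`, the `FORMAT_*` and `RESULT_*` constants) is hypothetical and only mirrors the proposal; none of it is an existing OpenGL enum or entry point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the proposed enum values. */
enum { FORMAT_ANY = 0, FORMAT_STANDARD_SCALAR = 1, FORMAT_VENDOR = 2 };
/* Hypothetical stand-ins for GL_NO_ERROR and GL_INVALID_OPERATION. */
enum { RESULT_OK = 0, RESULT_INVALID_OPERATION = 1 };

typedef struct {
    int  shader_count;               /* shaders contained in the program object */
    bool loaded_from_vendor_binary;  /* and the driver cannot convert it back */
    int  retrieve_format;            /* the proposed new program parameter */
} ToyProgram;

/* Models the proposed behaviour: the default returns whatever the driver
   likes; the standard format is refused when it cannot be produced. */
int toy_get_program_binary(const ToyProgram *p, int *format_out)
{
    assert(p != NULL && format_out != NULL);
    if (p->retrieve_format == FORMAT_STANDARD_SCALAR) {
        if (p->shader_count != 1 || p->loaded_from_vendor_binary)
            return RESULT_INVALID_OPERATION;
        *format_out = FORMAT_STANDARD_SCALAR;
        return RESULT_OK;
    }
    /* FORMAT_ANY or an explicit vendor request: the driver's own format. */
    *format_out = FORMAT_VENDOR;
    return RESULT_OK;
}
```

The single-shader restriction is the stricter of the two options floated above; relaxing it would only mean dropping the `shader_count` check.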
    Last edited by l_belev; 05-26-2014 at 12:59 PM.

  6. #6
    Advanced Member Frequent Contributor
    Join Date
    Apr 2009
    Posts
    607
    Quote Originally Posted by l_belev View Post
    kRogue, AMD hardware is not SIMD; it uses a large "meta instruction" (I'm not sure what terminology they use) that is actually a packet of several independent scalar instructions. This is "instruction-level" parallelism. In practical compiler-optimization terms it is as good as plain scalar: real scalar code is almost trivially converted to this model without degradation. As for Intel, I read some of their specs and it looks like their architecture is somewhat of a mix. It can operate both as scalar and as vectored. It was mentioned that their execution units run in scalar mode when processing fragments but in vectored mode when processing vertices. So I guess a scalar binary format would do well with their hardware too.
    I checked the PowerVR hardware (it is the most popular in the mobile space) and sure enough, it is scalar too.
    I am going to put my little bits in about the hardware differences between NVIDIA, AMD and Intel. Caveat: I've spent far less time with AMD than with the other two, and with Intel I've spent way too much time.

    Here goes.

    Intel is SIMD based all the way. The EU is a SIMD8 thing. Issuing a scalar instruction means that the results of 7 of the 8 slots are ignored completely. You can see this really easily when looking at the advertised GFLOPS, clock speeds and number of EUs [beware: when talking GFLOPS, everyone counts a MADD as 2 flops]. Just to be clear: in Gen7 and before, vertex, geometry and tessellation shaders process 2 vertices per invocation, so the hardware really wants, so badly wants, vec4 ops in the code; vecN operations, for N<4, have N/4 utilization for vertex, geometry and tessellation shaders. So for Gen7 and before, the compiler works hard to vectorize, and that sucks. Starting in Gen8 there is 8-wide dispatch, so 8 invocations are done at a time and utilization is 100%. For fragment shading there are several modes: SIMD8, SIMD16 and SIMD32. The benefit of the higher modes is that more fragments are handled per instruction; additionally, since the EU really is SIMD8, SIMD32 is great because instruction scheduling almost does not matter [think of one SIMD32 instruction as 4 SIMD8s: by the time the last SIMD8 is started, the first SIMD8 has finished].
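The two figures in that paragraph (N/4 utilization for vecN ops, and one SIMD32 instruction behaving like 4 SIMD8s) are simple enough to write down. This is only the arithmetic as stated in the post, not a model of real EU scheduling, and both function names are made up for the sketch.

```c
#include <assert.h>

/* Fraction of a vec4 slot doing useful work when a vecN op (1 <= n <= 4)
   is issued in the vec4-oriented mode described for Gen7 and earlier. */
double vec4_mode_utilization(int n)
{
    assert(n >= 1 && n <= 4);
    return (double)n / 4.0;
}

/* Number of SIMD8 passes that one SIMD8/16/32 instruction occupies
   on an 8-wide execution unit. */
int simd8_passes(int simd_width)
{
    assert(simd_width > 0 && simd_width % 8 == 0);
    return simd_width / 8;
}
```

So a purely scalar op in the vec4-oriented mode sits at 25% utilization, which is the waste the vectorizer is fighting against on that hardware.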


    NVIDIA is SIMT. An excellent article about it is here: http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html. When one first looks at SIMT vs SIMD, the difference seems like hair splitting. However, SIMT makes divergence in where data comes from so much easier to handle, whereas in SIMD it is a giant headache [scatter-scatter-read anyone?]. Additionally, it dramatically simplifies the compiler backend work.

    AMD is, as far as I know, SIMD with similar magicks to Intel's [though I think that on AMD each shader invocation handles many vertices, just like Intel Gen8, but the details are different].

    What this auto-SIMD'ing in the Intel and AMD drivers does is make it look like everything is scalar based, but the hardware is still a SIMD thing.

    Lastly, all of that is really about int and float multiply and add. The other operations (reciprocal, exp, ln, trig functions) are usually handled by a dedicated unit that operates on scalars, so those are expensive: rather than, say, 8 reciprocals per N clock cycles (where N is the number of iteration steps), it is just 1 reciprocal per N clock cycles. I do not know what NVIDIA really does, but I do not think each CUDA core has anything beyond an ALU to handle int and float multiplies and adds, so those iterative operations are also much more expensive there.


    But now, getting back to a nice IR form, the purpose of the thread. NVIDIA wants a scalar IR form as much as possible. Intel, for Gen7 (Ivy Bridge and before), wants vectorized for everything but the fragment shader, and scalar for the fragment shader. For AMD, I am pretty sure it would want scalar too. However, for low-power things where float16 is important, the desire for vectorized code will come back. The reason is that a SIMD-N thing that does N floats per op will also do 2N fp16s per op, so the compiler backend will need to vec2-vectorize fp16 at the shader level to get maximum utilization.
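A rough way to see the fp16 point, assuming (as the post does) that each SIMD lane can hold either one fp32 value or two fp16 values: count the instructions one invocation needs for k independent fp16 operations of the same kind. The function names are hypothetical.

```c
#include <assert.h>

/* One fp16 value per lane: every operation costs its own instruction
   and half of each lane is left unused. */
int fp16_ops_scalar(int k)
{
    assert(k >= 0);
    return k;
}

/* Two fp16 values packed per lane (vec2-vectorized at the shader level):
   operations pair up, with one half-empty instruction when k is odd. */
int fp16_ops_vec2(int k)
{
    assert(k >= 0);
    return (k + 1) / 2;
}
```

With k = 6 that is 6 instructions against 3, so a purely scalar IR leaves half the fp16 throughput unused unless the backend re-pairs the operations itself.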

    It would be really neat to have what D3D has had for ages: the ability to send byte code to the driver rather than source, where that byte code does not depend on hardware or driver. The main issue, as someone already pointed out, is who would create and maintain the dedicated compiler to that byte code format. Personally, I am all for an LLVM-based solution that is scalar based, but it won't be trivial. Even with the LLVM battle, making a backend is diamond-rock-hard. To put it mildly, using LLVM CodeGen does not go well and so life is still hard.
    Last edited by kRogue; 06-17-2014 at 04:50 AM.

  7. #7
    Junior Member Regular Contributor
    Join Date
    Dec 2009
    Posts
    241
    IMHO optimizing the generic shader code (except for size, maybe) is a bad idea, because a GPU vendor will do HW-specific optimization of the code anyway. If the generic optimizer decides to unroll a loop, but on the target hardware a loop would be faster, the vendor's optimizer would have to detect that there was a loop that had been unrolled and un-unroll (re-roll?) it.
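For readers who have not stared at optimizer output, here is the kind of pair in question, written as plain C rather than shader code. Both functions are made-up examples; the second is what a generic unroller might emit for the first when the trip count is known to be 4.

```c
#include <assert.h>

/* Original form: the loop structure is explicit. */
float weighted_sum_loop(const float *w, const float *x, int n)
{
    assert(n >= 0);
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += w[i] * x[i];
    return acc;
}

/* Unrolled form for n == 4: same operations in the same order,
   but nothing says "this used to be a loop" any more. */
float weighted_sum_unrolled4(const float *w, const float *x)
{
    float acc = 0.0f;
    acc += w[0] * x[0];
    acc += w[1] * x[1];
    acc += w[2] * x[2];
    acc += w[3] * x[3];
    return acc;
}
```

A backend that prefers the loop would have to pattern-match the four statements back into one, which is the re-rolling work described above.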

  8. #8
    Advanced Member Frequent Contributor
    Join Date
    Apr 2009
    Posts
    607
    Quote Originally Posted by mbentrup View Post
    IMHO optimizing the generic shader code (except for size, maybe) is a bad idea, because a GPU vendor will do HW-specific optimization of the code anyway. If the generic optimizer decides to unroll a loop, but on the target hardware a loop would be faster, the vendor's optimizer would have to detect that there was a loop that had been unrolled and un-unroll (re-roll?) it.
    My personal pet preference would be for the chosen IR to be scalar based and LLVM. This way, all the optimization passes embodied in the LLVM project become available. However, then a very hard, nasty part comes in: writing an LLVM backend for a GPU. That is hard, and really hard for SIMD GPUs because there are so many different ways to access the registers. The jazz in LLVM to help write a backend (CodeGen or TableGen or SomethingGen) really is not up to handling backends for SIMD-based GPUs. So GPU vendors need to roll their own in that case.

  9. #9
    Member Regular Contributor
    Join Date
    Apr 2004
    Posts
    260
    kRogue, AMD hardware is not SIMD; it uses a large "meta instruction" (I'm not sure what terminology they use) that is actually a packet of several independent scalar instructions. This is "instruction-level" parallelism. In practical compiler-optimization terms it is as good as plain scalar: real scalar code is almost trivially converted to this model without degradation. As for Intel, I read some of their specs and it looks like their architecture is somewhat of a mix. It can operate both as scalar and as vectored. It was mentioned that their execution units run in scalar mode when processing fragments but in vectored mode when processing vertices. So I guess a scalar binary format would do well with their hardware too.
    I checked the PowerVR hardware (it is the most popular in the mobile space) and sure enough, it is scalar too.

    The reason a scalar GPU architecture is better is that in real-world shaders a big percentage of the code is often scalar (the shader logic/algorithm is simply like that), and when it runs on vectored hardware only one of the 4 channels does useful work. This is a great waste.
    This problem does not exist on a scalar architecture. Unlike CPUs, GPUs have another way to parallelize the work: they process many identical items (fragments, vertices) simultaneously. So the hardware architects figured that instead of staying idle, 3 of the 4 channels may process other fragments. Thus they arrived at the conclusion that a scalar GPU is better than a vectored one. This is not true for CPUs, because there you have no easy source of parallelism, so they are better off providing SIMD instructions. Even if those are not often used, the few uses they have are still beneficial.
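A small sketch of that idea, with a 4-wide unit simulated in plain C. The function names and the fixed width of 4 are assumptions for illustration only: the same scalar multiply is executed for four different fragments in one go, one fragment per lane, instead of once per fragment with three lanes idle.

```c
#include <assert.h>

#define LANES 4

/* One 4-wide multiply in which each lane belongs to a different fragment. */
void mul4_across_fragments(const float a[LANES], const float b[LANES],
                           float out[LANES])
{
    for (int lane = 0; lane < LANES; ++lane)
        out[lane] = a[lane] * b[lane];
}

/* Instructions issued for one scalar op per fragment when every
   instruction serves a single fragment (1 of 4 lanes busy). */
int ops_one_fragment_per_instruction(int fragments)
{
    assert(fragments >= 0);
    return fragments;
}

/* Instructions issued when each lane carries a different fragment. */
int ops_one_fragment_per_lane(int fragments)
{
    assert(fragments >= 0);
    return (fragments + LANES - 1) / LANES;
}
```

For 8 fragments that is 8 instructions against 2, which is the utilization gap described above.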
    Last edited by l_belev; 05-26-2014 at 05:48 AM.

  10. #10
    Senior Member OpenGL Pro Ilian Dinev's Avatar
    Join Date
    Jan 2008
    Location
    Watford, UK
    Posts
    1,294
    AMD stopped using VLIW ages ago; their GCN is scalar+vector; to get best performance you need the vector instructions, which now operate almost identically to SSE1. 4 floats, no swizzle, no mask. http://developer.amd.com/wordpress/m...chitecture.pdf

    IME, the optimizing GLSL compiler takes less time to parse+optimize than the backend, which further optimizes and converts to HW instructions. Thus, a binary format will speed up compilation by 2 times at most. It'll also introduce the need to slowly try to UNDO optimizations that are bad for the specific GPU.
    What needs to be done is to spread awareness that if you don't immediately query the results with glGetShaderiv()/glGetProgramiv() after glCompileShader() and glLinkProgram(), the driver can offload compilation/linkage of multiple shaders to multiple threads, speeding up the process 8-fold. Also, use shader-binary.
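The ordering being recommended can be sketched without a GL context. In the block below, `ShaderApi` is a hypothetical function-pointer table standing in for glCompileShader() and a glGetShaderiv(..., GL_COMPILE_STATUS, ...) query, and the `stub_*` functions are a recording stub added only so the sketch runs on its own; in real code the two pointers would wrap the actual GL calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for glCompileShader and a compile-status query. */
typedef struct {
    void (*compile)(unsigned shader);
    int  (*compile_ok)(unsigned shader);
} ShaderApi;

/* Issue every compile first and only then start asking for results, so the
   driver is free to work on all of them in the background.
   Returns the number of shaders that failed to compile. */
size_t compile_all_then_check(const ShaderApi *api,
                              const unsigned *shaders, size_t n)
{
    assert(api != NULL && (shaders != NULL || n == 0));
    size_t failed = 0;
    for (size_t i = 0; i < n; ++i)
        api->compile(shaders[i]);
    for (size_t i = 0; i < n; ++i)
        if (!api->compile_ok(shaders[i]))
            ++failed;
    return failed;
}

/* Recording stub: remembers how many compiles had been issued
   when the first status query arrived. */
static int g_compiles_issued = 0;
static int g_compiles_issued_at_first_query = -1;

static void stub_compile(unsigned shader)
{
    (void)shader;
    ++g_compiles_issued;
}

static int stub_compile_ok(unsigned shader)
{
    if (g_compiles_issued_at_first_query < 0)
        g_compiles_issued_at_first_query = g_compiles_issued;
    return shader != 0;  /* the stub treats shader name 0 as a failure */
}

const ShaderApi stub_api = { stub_compile, stub_compile_ok };
```

The anti-pattern is the interleaved version (compile, query, compile, query), where each query forces the driver to finish that shader before the next compile is even submitted.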
    Last edited by Ilian Dinev; 05-26-2014 at 06:56 AM. Reason: Edit: striked-out wrong statement
