Thread: Scalar binary/intermediate shader code

  1. #21
    Member Regular Contributor
    Join Date
    Apr 2004
    Posts
    260
    Quote Originally Posted by elFarto View Post
    Also, I'm not sure I follow your reasoning in your first post. You say a new shader format should be scalar, but then proceed to say that it's easy and loss-less to convert from vectored code to scalar code. So wouldn't keeping a vector format be preferable since it's compatible with either hardware setup? Of course it's not quite true to say that going from vectored to scalar code is loss-less, as you do lose the semantics, which are as you mentioned expensive to recover.
    OK, I will try to explain the reasoning again. I start from the assumption that all modern GPUs are essentially scalar, which is mostly true, with Intel being a special case.
    This means there is no need to keep any vector semantics all the way to the GPU, and there is no need for the expensive recovery (to use your words). And there is NO code degradation whatsoever.

    Vector code is much harder for compilers to optimize. Many of the standard optimizations are not even possible on it; others are much more limited and much less efficient.
    For example, the Microsoft HLSL compiler produces an intermediate code of their own arbitrary design that is vector-based, and as a result it is essentially un-optimized, no matter how much effort they put into the HLSL compiler itself.
    This forces all the GPU vendors to convert the code to scalar form and then run the full set of optimizations on it all over again, including the generic, machine-independent ones. This can cause long delays because
    those optimization passes are extremely complex and slow.

    Now if the intermediate code were scalar instead, much of the optimization could be done beforehand (e.g. by the game developer, instead of at runtime on the user's machine). Of course the machine-dependent passes would still need to run afterwards, but they are often the smaller part.
    Another point is that since the binary format would be standard, the conversion from GLSL to it would not be a vendor-specific task. The OpenGL standards committee could form a working group to develop this compiler front-end
    (it could be open source, etc.) and relieve the individual GPU vendors of this burden. The vendors would only need to handle the smaller task of converting this already generically-optimized scalar binary code to their specific machine's scalar binary code.

    Please remember that one of the main purposes of a binary shader format is to minimize runtime delays from compiling/optimizing.
    Another often-cited purpose is to obfuscate the shader logic to make it harder to steal. Scalar code obfuscates much better than vector code because the high-level vector semantics are already discarded, which makes the code much harder to understand.

    If you still don't follow my reasoning, I give up on explaining it further.
    Last edited by l_belev; 07-01-2014 at 09:39 AM.

  2. #22
    Junior Member Regular Contributor
    Join Date
    Aug 2006
    Posts
    230
    I understand what you are saying, and agree that for the hardware, a scalar representation is required. And yes, you can do better optimisations on scalar code than on vectorized code; there's no argument there. I just don't think that a generic binary representation is the right place for this, especially as the vectorized form is really no harder for the drivers to understand. For example:

    MUL R0, R1.xyzw, R2.x

    is no different to:

    MUL R0.x, R1.x, R2.x
    MUL R0.y, R1.y, R2.x
    MUL R0.z, R1.z, R2.x
    MUL R0.w, R1.w, R2.x

    Regarding Microsoft's HLSL compiler, most of the complaints I've seen about it are that it does too many optimisations, not that its output is in vector format. The optimisations it does are detrimental to the drivers (see here, page 48). This is kind of the heart of the issue: the more optimisations you do without knowing what hardware you're targeting, the greater the risk that you'll hinder the driver from producing optimal code.

    All of which makes this issue far more complicated than it needs to be.

    There's also a talk by Ian Romanick of Intel about one of the shader IRs in Mesa that's worth a watch: https://www.youtube.com/watch?v=7I2oujSMDzg.

    Regards
    elFarto

  3. #23
    Member Regular Contributor
    Join Date
    Apr 2004
    Posts
    260
    There is a big difference between the two variants. Of course they both produce the same result, but you miss the point.
    The difference is that the second form can still be optimized after the fact, while the first cannot.
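
    For a concrete illustration (a hypothetical GLSL fragment shader, with made-up names), suppose the w component of a vector result is never read later in the shader:

    #version 330
    uniform vec4 v;
    uniform float s;
    out vec3 color;
    void main()
    {
        vec4 r = v * s;    // vector form: one MUL writing all four lanes
        color = r.xyz;     // r.w is never read anywhere
    }
    // In the scalar expansion, the multiply producing r.w is plainly dead code
    // and can be deleted, and each surviving lane can take part in copy
    // propagation, CSE and so on. In the vec4 form the single MUL looks live
    // because one instruction writes all four components at once.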

    As for your statement that "the more optimisations you do without knowing what hardware you're targeting, the greater the risk you'll hinder the driver from being able to produce optimal code",
    it may be true specifically for the Microsoft HLSL compiler, but it is not necessarily true for every possible compiler and general optimizer. The fact that their compiler does "optimizations" that actually make the code worse
    is really their problem. I will say it plainly: their compiler is of very low quality. It is a "half-assed" piece of garbage, made just to give the game developers some "working" tool and be done with it.
    I know very well how bad it is because I have very extensive experience with it; I have watched what code it produces and worked around the numerous bugs and poor decisions it makes.
    Microsoft aside, there are many possible generic optimizations that can be very powerful without ever making the code worse (if implemented well). Many (maybe all) general-purpose compilers employ such optimizations.
    Most of them work only on scalar code. When the code is vectorized and, even worse, can contain arbitrary swizzles, they become extremely hard if not impossible to apply.

  4. #24
    Junior Member Regular Contributor
    Join Date
    Dec 2009
    Posts
    248
    So, suppose the generic shader optimizer finds an opportunity to eliminate a common subexpression by storing it in a temporary register for later use. How should it decide whether the cost of the instructions to recompute it on the fly (if CSE is not done) exceeds the cost of using an extra register (if CSE is done), if it doesn't know about the underlying hardware?
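
    To make the trade-off concrete, here is a hypothetical GLSL helper (names invented for illustration):

    vec3 shade(vec3 n, vec3 l, vec3 diffuseColor, vec3 specColor)
    {
        float ndotl = dot(n, l);              // common subexpression, kept live in a register
        vec3 a = diffuseColor * ndotl;
        vec3 b = specColor * (ndotl * ndotl);
        return a + b;
    }
    // Recomputing dot(n, l) for the second use would cost extra ALU work but
    // shorten the value's live range; keeping it costs a register. Which choice
    // wins depends on the hardware's ALU throughput and register file size,
    // exactly the information a generic optimizer lacks.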

  5. #25
    Member Regular Contributor
    Join Date
    Apr 2004
    Posts
    260
    I don't know about CSE; maybe use some generic assumptions that are statistically best.
    I consider arithmetic transformations among the more important things, since shaders are often very math-heavy.
    To give a random simple example: something like sqrt(a)*b + sqrt(a)*c can be replaced with sqrt(a)*(b + c), or sqrt(x*x) with abs(x). (If you wonder why someone would write things like sqrt(x*x) in the first place: people write all sorts of such pointless stuff. Sometimes it comes from different macros expanded together.)
    Of course many of these can only be enabled by a compiler option that turns off IEEE strictness, which is acceptable for most cases (most graphics applications, like games, rarely care about IEEE strictness in their shaders).
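
    As a hypothetical GLSL sketch of those two rewrites (function names invented, and assuming a relaxed, non-IEEE-strict mode):

    // two sqrts, two multiplies and an add ...
    float f_before(float a, float b, float c) { return sqrt(a)*b + sqrt(a)*c; }
    // ... become one sqrt, one add and one multiply
    float f_after(float a, float b, float c) { return sqrt(a)*(b + c); }

    // sqrt of a square, often the residue of macro expansion ...
    float g_before(float x) { return sqrt(x*x); }
    // ... becomes a single abs; the results can differ only in edge cases such
    // as overflow or underflow of x*x, hence the need to relax IEEE strictness
    float g_after(float x) { return abs(x); }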
    Last edited by l_belev; 07-02-2014 at 02:51 AM.

  6. #26
    Advanced Member Frequent Contributor
    Join Date
    Apr 2009
    Posts
    607
    Quote Originally Posted by l_belev View Post
    To summarize your post, nvidia and amd are 100% scalar-friendly while intel are semi-scalar-friendly (only fragment shaders).

    But what I know from intel's documentation is that the issue with vertex shaders is not actually hardware-related but software-related; that is, it is their driver that puts the hardware in vectored mode for vertex shaders and in scalar mode for fragment shaders.
    In other words the hardware can actually work in scalar mode for vertex shaders too; it's up to the driver. They could change this behavior with a driver update, which would be needed anyway in order to support the hypothetical new binary shader format.
    Once the vectored mode is left unused, they could clean up their hardware by removing this redundant "flexibility", which would save power consumption and die area. That's what all the other GPU vendors figured out already, some of them a long time ago.
    For Intel's Gen7 and before, handling only 2 vertices per invocation is a hardware limit. So for vertex and geometry shading, and tessellation too, the backend of the compiler should do everything it can to vectorize the code to vec4s. If a vertex shader cannot be vectorized at all, then on Intel Gen7 and before, ALU utilization is at 25%. However, most programs are almost never vertex bottlenecked. Doing 8 vertices at a time requires more sand and changes to the logic for pushing vertices into the pipeline, the post-vertex cache and so on. On the subject of tessellation, 8-wide dispatch would be great, but I suspect more sand would then be needed on the GPU to buffer the tessellation output. Geometry shader 8-wide dispatch is quite icky: lots of room is needed, one needs to execute it -every- time (no caching for GS, really), and the output has to be fed to the triangle-setup/rasterizer unit in API order, again requiring even more buffering, namely 4 times as much as 2-wide dispatch. I hate geometry shaders.


    For the case of fragment shading and its SIMD8, SIMD16 and SIMD32 modes, it is actually wise to have multiple fragment dispatch modes. The reasoning is register space: a hardware thread has 4KB of register space, so SIMD8 gives 512B of registers per fragment, SIMD16 gives 256B, and SIMD32 gives 128B. If the shader is too complicated, more space per fragment is needed; hence the different dispatch modes for fragment shading.


    That would also ease the job of their driver team.
    One would expect that they should have learned, from their long history as a chip maker, the lesson that over-engineering does not result in more powerful hardware but in weaker hardware (remember Itanium?).
    Nvidia also learned a hard lesson with the GeForce 5, when they made it too "flexible" in supporting multiple precisions.
    The reason the GeForce FX had such a hard time was that it was really designed for lower precision in the fragment shader (fixed-point and half precision). Then DirectX 9 came around and said: you need fp32 support in the fragment shader. So although the FX could do it, it was not optimized for it at all. Worse, the word on the street was that NVIDIA was left a little in the dark about fp32 in DX9 because MS and NVIDIA were having a spat over the price of the Xbox GPUs. That is rumor, though, and I cannot find hard confirmation of it.

    As for implementing fp16 vs. fp32, the gist is this: let's say you have an ALU that is N-wide SIMD for fp32. It turns out that adding fp16 support to the ALU is not that bad, and then the ALU can do 2N-wide SIMD for fp16. That is a big deal, as it literally doubles the FLOPS if the code can be entirely fp16. So the "easier" thing to do is to have the compiler form vec2 fp16 operations. The easiest thing would be to support only fp16 ops and then double the fragment shader dispatch width, but there are plenty of situations where fp32 is really needed, so pure fp16 fragment shaders are not going to happen. The case for vertex shading is similar and stronger: fp16 for vertex shading is just nowhere near good enough.

    But back to a shader binary format that is pre-compiled (i.e. all the high-level stuff is done). The main ickiness is vector vs. scalar. When optimizing at the scalar level, it is a heck of a lot easier to optimize for fewer ops. However, what ops should this thing have? It would need much more than what GLSL has, because there are all sorts of graphics-oriented ops, like say ClampAdd and so on. Worse still is the scalar vs. vector issue. On that side my thoughts are pretty simple: the hardware vendors get together and create several "modes" for such a thing, for example:
    1. All ops are scalar
    2. fp16 ops are vec2-loving, all others are scalar
    3. vec4-loving and vec2-loving in general

    Naturally it gets really messy when one considers that for some hypothetical hardware that likes vector ops for the VS and supports fp16, one can imagine a nightmare like fp16-vec8. Ugly. So the hardware vendors would need to get together and make something that is worthwhile for each of their hardware lines. On PC that just means AMD, Intel and NVIDIA. I shudder when I think about mobile, though.

  7. #27
    Junior Member Regular Contributor
    Join Date
    Aug 2006
    Posts
    230
    Quote Originally Posted by mbentrup View Post
    So, suppose the generic shader optimizer finds an opportunity to eliminate a common subexpression by storing it in a temporary register for later use. How should it decide whether the cost of the instructions to recompute it on the fly (if CSE is not done) exceeds the cost of using an extra register (if CSE is done), if it doesn't know about the underlying hardware?
    I agree that the compiler can't know the cost, but it should use as many registers as it needs so no sub-expression needs to be recomputed. The driver can always undo that optimization later on if it has register pressure.

    Regards
    elFarto

  8. #28
    Advanced Member Frequent Contributor
    Join Date
    Apr 2009
    Posts
    607
    Quote Originally Posted by elFarto View Post
    I agree that the compiler can't know the cost, but it should use as many registers as it needs so no sub-expression needs to be recomputed. The driver can always undo that optimization later on if it has register pressure.

    Regards
    elFarto
    Very often compilers nowadays use SSA: Static Single Assignment. In SSA, a register can be written only -once- but can be read many times. LLVM's IR is based on SSA. So for an IR, the representation is SSA with a virtual pool of (infinitely many) registers. A backend then does the hard work of instruction selection, instruction scheduling and register allocation (which requires tracking and understanding when a register is no longer read from).
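
    As a rough sketch (not any vendor's actual IR), here is a small GLSL function and a hypothetical scalar SSA lowering of it, where every virtual register is defined exactly once:

    vec3 mad3(vec3 a, vec3 b)
    {
        return a * b + a;
    }
    // hypothetical scalar SSA over an unlimited pool of virtual registers:
    //   %1 = mul a.x, b.x      %4 = add %1, a.x
    //   %2 = mul a.y, b.y      %5 = add %2, a.y
    //   %3 = mul a.z, b.z      %6 = add %3, a.z
    //   ret {%4, %5, %6}
    // No %n is ever written twice; the backend later maps these virtual
    // registers onto the machine's finite register file during allocation.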

    The main ugly thing for an IR is that what may appear to be an optimization may actually not be one. Case in point: clamp(x, 0, 1). A naive optimization would be: if x is known to be non-negative, transform it to min(x, 1). However, on some GPUs many operations have a saturate analogue; for example a GPU can have add and also add_saturate.
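
    A hedged GLSL illustration of that point (the hardware behaviour described in the comments is assumed, not guaranteed for any particular GPU):

    float sum01(float x, float y)
    {
        // Left as a full clamp, a backend whose ALU supports a saturate
        // modifier can fold the whole expression into a single add_saturate.
        return clamp(x + y, 0.0, 1.0);
        // Had an IR pass rewritten this to min(x + y, 1.0) because x + y is
        // known to be non-negative, the backend would be left with a separate
        // min instruction instead of a free saturate.
    }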

    Someone somewhere mentioned this link, but I think it is most awesome to read: http://www.humus.name/Articles/Perss...timization.pdf

    Other bat-crazy bits to think about: interpolation is usually a fragment shader instruction (this I know for a fact on Intel); texture "state" does not really need to exist on newer GPUs (think of bindless textures); and on some hardware vertex attribute loading is handled by shader code (which means that for such hardware, changing the attribute format is just as bad as changing the vertex shader). In brutal honesty, neither DX11 nor GL4 really reflects the hardware enough any more. Just the concept of "binding a texture" is archaic and meaningless for AMD or NVIDIA hardware of the last year or two.

  9. #29
    Junior Member Regular Contributor
    Join Date
    Aug 2006
    Posts
    230
    I was going to mention something about SSA and saturate (and [-1,1] saturation too), but decided not to. Saturates would need to be moved to the operation that writes the register, so we get our ADD_SAT and MAD_SSAT or whatever. As for SSA, it would probably be good to have everything in SSA form as well, although I remember hearing something about SSA being a bit difficult for graphics/vector instruction sets; I can't find the reference at the moment.

    *edit* Found the link, it was in Mesa's GLSL Compiler README.

    Regards
    elFarto
    Last edited by elFarto; 07-03-2014 at 12:48 PM. Reason: added link

  10. #30
    Advanced Member Frequent Contributor
    Join Date
    Apr 2009
    Posts
    607
    Currently, Mesa's GLSL IR is going through the pain of being converted to an SSA form. In truth the IR in Mesa is, in spite of its age, technologically immature: it takes way too much memory, is difficult to manipulate when writing optimization passes, and so on. My personal opinion is that the FAQ and the Mesa GLSL IR itself were written by people who, although clever, were quite ignorant of compiler technology. Additionally, much of the current compiler code in Mesa was written before, or just at the time, that SSA came to light.

    LunarG has developed this: http://www.lunarglass.org/, a shader compiler stack built on LLVM. It allows a backend to specify whether vectors must be kept, or whether drilling all the way down to scalar is OK, and so on.

    If we look at the D3D HLSL compiler, its main issue is that the instruction set it compiles to does not represent the underlying hardware well enough. If it had saturate ops, the various input operations that hardware provides for free, and a more realistic understanding of the various sampler commands, then the D3D HLSL compiler would not do so many odd things.

    One evil and simple thought: Look at the D3D HLSL compiler, identify where the instruction set to which it compiles does not match real hardware and then augment that instruction set. However, one will need to realistically limit what hardware one is targeting, and that is the crux.

    My opinion is that each major D3D version should just have a completely new compiler and a new instruction set to compile to, one that better reflects the hardware of that generation. Ditto for an OpenGL intermediate portable form.

    Possibly, we might have a few classes of "compile modes": an application would ship the compiled form of each shader for each mode, query the hardware for which mode is preferred/supported, and call it a day. Kind of like a bastard average between the current shader binaries and the ideal of one portable form. The reasons I advocate that are things like:
    • vec2 fp16 support. Some hardware has it, some doesn't, so vectorizing into fp16 pairs is only worthwhile on some hardware
    • vec4. Some hardware, for some shader stages (Intel Gen7 and before, for example), wants vec4 as much as possible

    and there are more, especially in the land of texture sampling. By the way, shadow samplers are one of the ickiest things in Mesa's i965 driver: every different comparison mode ends up as a different shader.

    A piece of advice: if one is writing GL code for desktop and wants it to work, then make sure your GL code is, apart from the various GL vs. DX conventions, mostly trivially portable to DX. If an operation is core in GL and not available in DX, then chances are it is emulated. The only exceptions that I know of are the pixel center convention, the normalized z range, and vertex invocation [to a limited extent].
