
Thread: NVIDIA releases OpenGL 4.3 beta drivers

  1. #21 | Intern Newbie | Join Date: Oct 2007 | Posts: 47

    Questions and more bugs in OpenGL driver (II)

    Also, two questions before the second bug report:
    1. Is multi_draw_indirect fully supported in hardware on current NVIDIA GPUs (including Fermi), without CPU work in the driver, or is some part implemented in software? I ask because this functionality originated as an AMD extension, so it seems AMD really does support it in hardware.
    2. The fbo_no_attachments extension seems equivalent to D3D11.1 TIR (target-independent rasterization). If so, how is it implementable on NVIDIA's first-generation D3D11 hardware (Fermi), when I have read on some sites that TIR was the one thing current D3D11 hardware doesn't support? What are the differences, then?
    The second issue is in the 430-compute sample: I seem to get it fixed by removing the readonly and writeonly qualifiers from the storage buffers and then fixing a bug in the g-truc code, i.e. changing
    Code :
    Out.Ouput[gl_GlobalInvocationID].Texcoord = In.Input[gl_GlobalInvocationID].Texcoord;
    to
    Code :
    Out.Ouput[gl_GlobalInvocationID.x].Texcoord = In.Input[gl_GlobalInvocationID.x].Texcoord;

    So it seems this should work, but it doesn't:
    Code :
    layout(binding = BUFFER_INPUT) readonly buffer iBuffer
    {
        vertex Input[];
    } In;
     
    layout(binding = BUFFER_OUTPUT) writeonly buffer oBuffer
    {
        vertex Ouput[];
    } Out;

    I get:
    Compute info
    ------------
    Code :
    0(22) : error C0000: syntax error, unexpected identifier, expecting "::" at token "iBuffer"
    0(27) : error C0000: syntax error, unexpected identifier, expecting "::" at token "oBuffer"
     and fixing it, for example, with:
     
    //layout(binding = BUFFER_INPUT) readonly buffer iBuffer
    layout(binding = BUFFER_INPUT)  buffer iBuffer //works!

    At least for now with this fix it works!
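    For reference, here is a minimal sketch of the whole thing with both workarounds applied. The vertex struct layout, binding values and work-group size are assumptions for illustration; only the qualifier removal and the .x indexing are the actual fixes:
    Code :
    #version 430 core
    #extension GL_ARB_compute_shader : require
    #extension GL_ARB_shader_storage_buffer_object : require
     
    #define BUFFER_INPUT  0
    #define BUFFER_OUTPUT 1
     
    layout(local_size_x = 64) in;
     
    struct vertex { vec2 Texcoord; };
     
    layout(binding = BUFFER_INPUT) buffer iBuffer    // "readonly" removed
    {
        vertex Input[];
    } In;
     
    layout(binding = BUFFER_OUTPUT) buffer oBuffer   // "writeonly" removed
    {
        vertex Ouput[];
    } Out;
     
    void main()
    {
        // index with the scalar .x component, not the full uvec3
        Out.Ouput[gl_GlobalInvocationID.x].Texcoord = In.Input[gl_GlobalInvocationID.x].Texcoord;
    }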
    Last edited by oscarbg; 08-21-2012 at 09:45 PM.

  2. #22 | Intern Newbie | Join Date: Oct 2007 | Posts: 47

    More compute shader bugs and questions!

    One question: is the compute shader stage being exposed in NVIDIA's Cg language (where should I ask?), similar to how the tessellation stages were added in Cg 3.0? If not, another question is whether the NVIDIA Cg compiler cgc.exe will at least get support for it, so we can compile GLSL compute shaders to NV compute shader assembly similar to what I get from the driver when assembly errors occur (see below).
    So this is the kind of compute shader bug that I'm afraid NVIDIA will not fix, for security/hang-avoidance reasons:
    I was happy to see that I could port the NVIDIA matmul CUDA sample to an OpenGL compute shader without much effort, basically by doing two things:
    1. Adding defines:
    Code :
    #define blockIdx gl_WorkGroupID
    #define __syncthreads() barrier()
    #define threadIdx gl_GlobalInvocationID
    #define __shared__ shared
    and changing all int variables in the CUDA kernel to uint (another solution would have been to cast gl_WorkGroupID and gl_GlobalInvocationID to int; a sketch of that cast follows after the next code block).

    2. Moving the shared memory definitions out of the kernel into global scope:
    Code :
    __shared__ REAL As[BLOCK_SIZE][BLOCK_SIZE];
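    On the int/uint point above, a minimal sketch of the cast alternative (variable names mirror the CUDA kernel and are only illustrative):
    Code :
    void main()
    {
        // Instead of #define-ing blockIdx/threadIdx to the uvec3 built-ins,
        // cast them to signed vectors once, so the CUDA kernel body can keep
        // its original int arithmetic unchanged.
        ivec3 blockIdx  = ivec3(gl_WorkGroupID);
        ivec3 threadIdx = ivec3(gl_GlobalInvocationID);
     
        int bx = blockIdx.x;
        int ty = threadIdx.y;
        // ... rest of the ported kernel ...
    }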

    Then I was happy that the code compiled free of front-end errors, but unfortunately it fails at the assembly stage:
    Code :
    Internal error: assembly compile error for compute shader at offset 108543:
    -- error message --
    line 2094, column 6:  error: BAR not allowed inside flow control blocks.
    line 2108, column 6:  error: BAR not allowed inside flow control blocks.

    First, let's be clear: this code works in CUDA and is recommended by NVIDIA for its GPUs. Second, it relies on a fine detail of the hardware: barriers may appear inside control flow, provided all threads in the block follow the same path through that control flow.
    Why not in OpenGL compute shaders?
    Also, note that the "flow control" is caused by this loop:
    Code :
    for (uint a = aBegin, b = bBegin;
             a <= aEnd;
             a += aStep, b += bStep)

    It is not a branch, and moreover it executes the same number of iterations for every thread, but in the form stated that is not trivial for the compiler to infer. So I thought I'd fix it and changed it to (in my sample, matrix width = height = 1024 and block size = 32, so iterations = 1024/32 = 32):

    Code :
    uint a = aBegin, b = bBegin;
        for (int i=0; i<32; i++)
        {
    ..loop body containing barriers
     
    a += aStep, b += bStep
    }

    Even with this I get the same error, so the only way to fix it is to unroll the loop manually, which is ugly to say the least. I even added #pragma unroll to the loop to force the compiler to expand it for me, but that doesn't work. Is there an NVIDIA way to force unrolling, like CUDA's #pragma unroll?
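    For what it's worth, the NVIDIA GLSL compiler does accept a global unroll hint (it comes up again later in this thread), though it is not per-loop like CUDA's #pragma unroll. A minimal, self-contained sketch of the idea, not taken from the matmul sample (buffer layout and work-group size are illustrative):
    Code :
    #version 430 core
    #extension GL_ARB_compute_shader : require
     
    layout(local_size_x = 32, local_size_y = 32) in;
     
    // NVIDIA-specific hint: unrolls *all* loops in the shader (not a selected
    // one), which removes the flow control around the barriers but can
    // inflate the generated code considerably.
    #pragma optionNV(unroll all)
     
    layout(binding = 0) buffer Data { float data[]; };
     
    shared float tile[32][32];
     
    void main()
    {
        float acc = 0.0;
        for (int i = 0; i < 32; i++)   // fixed trip count, as in the matmul loop
        {
            tile[gl_LocalInvocationID.y][gl_LocalInvocationID.x] = data[gl_GlobalInvocationID.x];
            barrier();                 // fine once the loop is fully unrolled
            acc += tile[gl_LocalInvocationID.x][gl_LocalInvocationID.y];
            barrier();
        }
        data[gl_GlobalInvocationID.x] = acc;
    }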
    Full shader:
    Code :
    #version 430 core
    #extension GL_ARB_compute_shader : require
    #extension GL_ARB_shader_storage_buffer_object : require
    #extension GL_ARB_gpu_shader_fp64: require
     
    #define REAL float
    //#define REAL double
     
     
    #define BUFFER_A    0
    #define BUFFER_B    1
    #define BUFFER_C    2
     
    #define BLOCK_SIZE 32
     
     
    layout (local_size_x = BLOCK_SIZE, local_size_y = BLOCK_SIZE) in;
     
    layout(binding = BUFFER_A) buffer iBuffer
    {
        REAL A[];
    } ;
     
    layout(binding = BUFFER_B) buffer oBuffer
    {
        REAL B[];
    } ;
     
    layout(binding = BUFFER_C) buffer oBuffer
    {
        REAL C[];
    } ;
     
     
    //matrixMul(float *C, float *A, float *B, int wA, int wB)
    #define blockIdx gl_WorkGroupID
    #define __syncthreads() barrier()
    //memoryBarrierShared();
    #define threadIdx gl_GlobalInvocationID
    #define __shared__ shared
     
    #define wA 1024
    #define wB 1024
    #define AS(i, j) As[i][j]
    #define BS(i, j) Bs[i][j]
     
            // Declaration of the shared memory array As used to
            // store the sub-matrix of A
            __shared__ REAL As[BLOCK_SIZE][BLOCK_SIZE];
     
            // Declaration of the shared memory array Bs used to
            // store the sub-matrix of B
            __shared__ REAL Bs[BLOCK_SIZE][BLOCK_SIZE];
     
    void main()
    {    
     
        // Block index
        uint bx = blockIdx.x;
        uint by = blockIdx.y;
     
        // Thread index
        uint tx = threadIdx.x;
        uint ty = threadIdx.y;
     
        // Index of the first sub-matrix of A processed by the block
        uint aBegin = wA * BLOCK_SIZE * by;
     
        // Index of the last sub-matrix of A processed by the block
        uint aEnd   = aBegin + wA - 1;
     
        // Step size used to iterate through the sub-matrices of A
        uint aStep  = BLOCK_SIZE;
     
        // Index of the first sub-matrix of B processed by the block
        uint bBegin = BLOCK_SIZE * bx;
     
        // Step size used to iterate through the sub-matrices of B
        uint bStep  = BLOCK_SIZE * wB;
     
        // Csub is used to store the element of the block sub-matrix
        // that is computed by the thread
        REAL Csub = 0;
     
        // Loop over all the sub-matrices of A and B
        // required to compute the block sub-matrix
        /*for (uint a = aBegin, b = bBegin;
             a <= aEnd;
             a += aStep, b += bStep)*/
        uint a = aBegin, b = bBegin;
        for (int i=0; i<32; i++)
        {
     
     
            // Load the matrices from device memory
            // to shared memory; each thread loads
            // one element of each matrix
            AS(ty, tx) = A[a + wA * ty + tx];
            BS(ty, tx) = B[b + wB * ty + tx];
     
            // Synchronize to make sure the matrices are loaded
            __syncthreads();
     
            // Multiply the two matrices together;
            // each thread computes one element
            // of the block sub-matrix
    #pragma unroll
            for (int k = 0; k < BLOCK_SIZE; ++k)
                Csub += AS(ty, k) * BS(k, tx);
     
            // Synchronize to make sure that the preceding
            // computation is done before loading two new
            // sub-matrices of A and B in the next iteration
            __syncthreads();
     
            a += aStep, b += bStep;
        }
     
        // Write the block sub-matrix to device memory;
        // each thread writes one element
        uint c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
        C[c + wB * ty + tx] = Csub;
    }


    Full assembly:
    Code :
    Compute info
    ------------
    Internal error: assembly compile error for compute shader at offset 108543:
    -- error message --
    line 2094, column 6:  error: BAR not allowed inside flow control blocks.
    line 2108, column 6:  error: BAR not allowed inside flow control blocks.
    -- internal assembly text --
    !!NVcp5.0
    OPTION NV_shader_storage_buffer;
    OPTION NV_shader_atomic_float;
    GROUP_SIZE 32 32;
    # cgc version 3.1.0001, build date Aug  8 2012
    # command line args: 
    #vendor NVIDIA Corporation
    #version 3.1.0.1
    #profile gp5cp
    #program main
    #semantic As : SHARED
    #semantic Bs : SHARED
    #semantic oBuffer : SBO_BUFFER[2]
    #semantic iBuffer : SBO_BUFFER[0]
    #semantic oBuffer : SBO_BUFFER[1]
    #var float Bs[0][0] :  : shared_mem[8188] : -1 : 1
    # lots more
    #var float Bs[31][31] :  : shared_mem[8188] : -1 : 1
    #var uint3 gl_GlobalInvocationID : $vin.GBLID : GBLID[4] : -1 : 1
    #var uint3 gl_WorkGroupID : $vin.CTAID : CTAID[2] : -1 : 1
    #var float C[0] :  : sbo_buffer[2][0] : -1 : 1
    #var float A[0] :  : sbo_buffer[0][0] : -1 : 1
    #var float B[0] :  : sbo_buffer[1][0] : -1 : 1
    SHARED_MEMORY 8192;
    SHARED shared_mem[] = { program.sharedmem };
    STORAGE sbo_buf0[] = { program.storage[0] };
    STORAGE sbo_buf1[] = { program.storage[1] };
    STORAGE sbo_buf2[] = { program.storage[2] };
    TEMP R0, R1;
    MOV.F R0.y, {0, 0, 0, 0}.x;
    MUL.U R0.z, invocation.groupid.y, {32768, 0, 0, 0}.x;
    MUL.U R0.w, invocation.groupid.x, {32, 0, 0, 0}.x;
    REP.S {32, 0, 0, 0};
    MAD.U R1.x, invocation.globalid.y, {1024, 0, 0, 0}, R0.w;
    MAD.U R0.x, invocation.globalid.y, {1024, 0, 0, 0}, R0.z;
    ADD.U R0.x, invocation.globalid, R0;
    MUL.S R1.y, R0.x, {4, 0, 0, 0}.x;
    ADD.U R1.x, invocation.globalid, R1;
    MUL.S R0.x, invocation.globalid.y, {128, 0, 0, 0};
    MOV.U R1.z, R1.y;
    MAD.S R1.y, invocation.globalid.x, {4, 0, 0, 0}.x, R0.x;
    LDB.F32 R0.x, sbo_buf0[R1.z];
    MOV.U R1.y, R1;
    STS.F32 R0, shared_mem[R1.y];
    MUL.S R1.x, R1, {4, 0, 0, 0};
    MOV.U R0.x, R1;
    LDB.F32 R0.x, sbo_buf1[R0.x];
    STS.F32 R0, shared_mem[R1.y + 4096];
    BAR  ;
    MOV.S R1.y, {0, 0, 0, 0}.x;
    REP.S {32, 0, 0, 0};
    MUL.S R1.x, R1.y, {128, 0, 0, 0};
    MAD.S R1.x, invocation.globalid, {4, 0, 0, 0}, R1;
    MUL.S R0.x, invocation.globalid.y, {128, 0, 0, 0};
    MAD.S R0.x, R1.y, {4, 0, 0, 0}, R0;
    MOV.U R1.z, R1.x;
    MOV.U R1.x, R0;
    LDS.F32 R0.x, shared_mem[R1.z + 4096];
    LDS.F32 R1.x, shared_mem[R1.x];
    MAD.F R0.y, R1.x, R0.x, R0;
    ADD.S R1.y, R1, {1, 0, 0, 0}.x;
    ENDREP;
    BAR  ;
    ADD.U R0.z, R0, {32, 0, 0, 0}.x;
    ADD.U R0.w, R0, {32768, 0, 0, 0}.x;
    ENDREP;
    MUL.U R0.x, invocation.groupid, {32, 0, 0, 0};
    MAD.U R0.x, invocation.groupid.y, {32768, 0, 0, 0}, R0;
    MAD.U R0.x, invocation.globalid.y, {1024, 0, 0, 0}, R0;
    ADD.U R0.x, R0, invocation.globalid;
    MUL.S R0.x, R0, {4, 0, 0, 0};
    MOV.U R0.x, R0;
    STB.F32 R0.y, sbo_buf2[R0.x];
    END
    # 44 instructions, 2 R-regs
    Last edited by oscarbg; 08-21-2012 at 09:47 PM.

  3. #23 | Intern Newbie | Join Date: Oct 2007 | Posts: 47
    Following on from my post above, I want to ask some questions and make some suggestions:
    *It would be good if, similar to NVIDIA exposing the NV_CG_shader extension to allow "some" Cg functions (and syntax?) in GLSL code, NVIDIA exposed something like an NV_CUDA_shader extension relaxing things in compute shaders to allow easier porting of CUDA kernels to GLSL (am I smoking crack?):
    first, defines so that all the CUDA grid/thread-ID syntax and the like have GLSL equivalents, e.g.:
    #define blockIdx gl_WorkGroupID
    #define __syncthreads() barrier()
    #define threadIdx gl_GlobalInvocationID
    #define __shared__ shared
    and defining them as int rather than uint (unlike the GLSL built-ins), for the portability reasons mentioned in the previous post;
    and second, allow __shared__ definitions inside compute shader functions, which the compiler would hoist out internally if needed.
    With that, the matrix multiplication code could be ported immediately!
    *The specification doesn't state any restriction on the use of barriers inside control flow, nor a CUDA-like requirement that all threads follow the same path when reaching a barrier. Is this going to be fixed in the spec, or is it IHV dependent?
    If restrictions do end up in the spec, NVIDIA could allow a CUDA-like relaxation via an extension such as GL_NV_barriers_in_control_flow, or similar, to advertise that support.
    *Also, sorry, but what are the DirectCompute restrictions in this case? Are there any?
    *Finally, allow the work group size to be set programmatically, for example by not requiring the local_size_x values to be constant (a new extension such as GL_NV_uniform_local_size), e.g.:

    uniform int local_size_x_from_app;
    layout (local_size_x = local_size_x_from_app) in;

    or better, a new function

    DispatchComputeSetLocalSize(globx, globy, globz, locx, locy, locz)

    equivalent to CUDA's

    kernelname<<<globalsize, localsize>>>(..)

    I know that DirectCompute doesn't support setting the local group size at dispatch time, but CUDA does.

    Please answer these questions!

  4. #24 | Senior Member OpenGL Guru | Join Date: May 2009 | Posts: 4,948
    In the future, please use [ code ] blocks to format pieces of code in your posts.

  5. #25 | Intern Newbie | Join Date: Oct 2007 | Posts: 47

    One more question..

    I have fixed them all using [code] blocks.. thanks Alfonse.

    More bad news, although this would be NVIDIA-only anyway for now: pointers in compute shaders are not working on NVIDIA.

    I have tried to address another question about easy portability of GPGPU code to GLSL compute shaders:
    using pointers in compute programs, of course via GL_NV_shader_buffer_load.
    If I don't enable the extension, the parser correctly points out:
    Code :
    0(47) : error C7531: pointers require "#extension GL_NV_shader_buffer_load : enable" before use
    but if I do enable it, I get:
    Code :
    0(13) : warning C7547: extension GL_NV_shader_buffer_load not supported in profile gp5cp
    But isn't GL_NV_shader_buffer_load supported in every stage? Why not in compute?!
    So, NVIDIA, is this an implementation limitation for now or forever (well, more or less)? Anyway, this works in CUDA, so the GPU hardware has surely supported it for ages.

  6. #26 | Junior Member Newbie | Join Date: May 2012 | Location: Germany | Posts: 4
    Hi oscar,
    I ran into the barrier problem during development as well. Currently it seems like an oversight in the spec itself; if you look into the 4.3 GLSL specification:

    The barrier() function may only be placed inside the function main() of the tessellation control shader
    and may not be called within any control flow. Barriers are also disallowed after a return statement in the
    function main(). Any such misplaced barriers result in a compile-time error.
    This hasn't been specced in detail for compute yet. Ideally spec and implementation can be relaxed accordingly!

    In general things are at an early stage; it will take a couple of revisions until the implementation has all the features it should have.

    Regarding your tweet about NV_shader_buffer_load, a quick compile test using
    Code :
    #extension GL_NV_gpu_shader5 : enable
    did not give any errors when declaring some pointers. However, using pointers to shared memory currently won't be possible.
    Last edited by Christoph Kubisch; 08-22-2012 at 08:32 AM.

  7. #27 | Intern Contributor | Join Date: Mar 2010 | Posts: 59
    Hi oscarbg

    Thanks for the bug reports. We have looked at the g-truc OpenGL 4.3 samples and have a few driver fixes pending. There are also a few bugs in the OpenGL 4.3 samples, which I have reported to g-truc.

    1) The issue with calling imageSize() on an image declared with "coherent" has been fixed in the driver. The beta will be updated soon with this fix. Until then you can work around the issue by removing "coherent" from the declaration (a minimal sketch follows after this list).

    2) Yes, multi_draw_indirect is fully hardware accelerated on Fermi and above GPUs.

    3) fbo_no_attachments just allows an FBO to be "complete" without any attachments. I don't think it's equivalent to DX11 TIR.

    4) The "readonly" and "writeonly" memory qualifiers on buffer declarations is broken in the driver. We're working on a fix for this. For now just remove these qualifiers.

    5) Some of your later questions might be better served in the "Official feedback on OpenGL 4.3 thread".
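    Regarding 1) above, a minimal sketch of the temporary workaround (image name, binding and format are only illustrative):
    Code :
    #version 430 core
    #extension GL_ARB_compute_shader : require
     
    layout(local_size_x = 64) in;
     
    // Declaration that currently trips the imageSize() bug:
    //     layout(binding = 0, rgba8) coherent uniform image2D Diffuse;
    // Workaround until the fixed beta ships: drop "coherent".
    layout(binding = 0, rgba8) uniform image2D Diffuse;
     
    void main()
    {
        ivec2 size = imageSize(Diffuse);
        // ... use size ...
    }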

    Thanks for your feedback.

    cheers
    Piers

  8. #28 | Junior Member Regular Contributor | Join Date: Sep 2001 | Location: Wake Forest, NC, USA | Posts: 171
    Quote Originally Posted by oscarbg
    More bad news, although this would be NVIDIA-only anyway for now: pointers in compute shaders are not working on NVIDIA.
    Sorry I'm behind on looking at these forum updates.

    The error with #extension GL_NV_shader_buffer_load is a compiler bug that I was able to reproduce. The cause was an oversight in the logic that decides which #extension directives are supported in which shader types/architectures. I'll work on a fix for the next beta driver update.

    As Christoph Kubisch mentions, one workaround for this is to enable the extension GL_NV_gpu_shader5, which also includes the ability to store to pointers. GL_NV_shader_buffer_load only allows pointer reads.
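    For reference, a minimal sketch of that usage in a compute shader (pointer names are illustrative, and the application is assumed to supply the 64-bit GPU addresses, e.g. queried with glGetBufferParameterui64vNV after glMakeBufferResidentNV):
    Code :
    #version 430 core
    #extension GL_ARB_compute_shader : require
    #extension GL_NV_gpu_shader5 : enable   // pointer reads *and* writes
     
    layout(local_size_x = 64) in;
     
    // GPU addresses set from the application as 64-bit uniforms.
    uniform float *src;
    uniform float *dst;
     
    void main()
    {
        uint i = gl_GlobalInvocationID.x;
        dst[i] = 2.0 * src[i];   // the store needs GL_NV_gpu_shader5;
                                 // GL_NV_shader_buffer_load alone only allows the read
    }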

  9. #29 | Intern Newbie | Join Date: Oct 2007 | Posts: 47
    Many thanks for the responses, Christoph and Pat!
    Some updates on my findings:
    First, I think the comment about it not currently being possible to use code that has barriers inside control flow is correctly posted here, because if it isn't allowed in the spec, this is the place to motivate NVIDIA to relax the restriction.
    The reasons to allow it are big. First and foremost:
    1. D3D11 compute shaders allow it! Sorry to be pedantic, but then the ARB can't claim OpenGL 4.3 is a superset of D3D11.
    For example, I found this code in the D3D SDK ComputeShaderSort11 sample, in the BitonicSort kernel:
    Code :
    for (unsigned int j = g_iLevel >> 1 ; j > 0 ; j >>= 1)
        {
            unsigned int result = ((shared_data[GI & ~j] <= shared_data[GI | j]) == (bool)(g_iLevelMask & DTid.x))? shared_data[GI ^ j] : shared_data[GI];
            GroupMemoryBarrierWithGroupSync();
            shared_data[GI] = result;
            GroupMemoryBarrierWithGroupSync();
        }
    This code cannot be expressed without control flow, which I verified by getting the compiled DX IL assembly from fxc: it has barriers inside a "loop" (in DX IL parlance). Moreover, I compiled it down to AMD 5850 ISA using the AMD GPU ShaderAnalyzer, and yes, the barriers are there inside the control flow. (A hypothetical GLSL translation of this loop is sketched after this list.)
    2. Even if D3D11 didn't allow it, lots of *basic* GPGPU algorithms are only fully optimized under these conditions: sorts, perhaps even scans, and many more. I think searching for CUDA code on the web will show how important this support is in CUDA.
    3. I have found that AMD is going to allow it! Details follow:
    Using some leaked AMD drivers (9.00) and the headers that come with them, I found AMD_compute_shader in the fglrx drivers. By looking inside the driver binary to see which keywords are implemented from AMD_compute_shader and which from ARB_compute_shader, I successfully compiled a compute shader with a barrier inside control flow that the compiler cannot remove. I can post the full code and binary to check if you don't believe me.
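    For completeness, here is a hypothetical GLSL translation of that BitonicSort loop; names such as shared_data, GI, g_iLevel, g_iLevelMask and DTid simply stand in for the HLSL identifiers, and barrier() stands in for GroupMemoryBarrierWithGroupSync() (a memoryBarrierShared() may also be needed for shared-memory visibility). Presumably this is the same barrier placement the NVIDIA beta rejects with "BAR not allowed inside flow control blocks":
    Code :
    #version 430 core
    #extension GL_ARB_compute_shader : require
     
    layout(local_size_x = 512) in;
     
    shared uint shared_data[512];
     
    uniform uint g_iLevel;
    uniform uint g_iLevelMask;
     
    void main()
    {
        uint  GI   = gl_LocalInvocationIndex;   // HLSL SV_GroupIndex
        uvec3 DTid = gl_GlobalInvocationID;     // HLSL SV_DispatchThreadID
     
        for (uint j = g_iLevel >> 1; j > 0; j >>= 1)
        {
            uint result = ((shared_data[GI & ~j] <= shared_data[GI | j])
                           == bool(g_iLevelMask & DTid.x))
                          ? shared_data[GI ^ j] : shared_data[GI];
            barrier();
            shared_data[GI] = result;
            barrier();
        }
    }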

    *Allow selective unrolling of loops, similar to D3D's [unroll] (i.e. finer control than #pragma optionNV(unroll all)).
    Motivation:
    The matmul code, after all, has a loop with a fixed iteration count (assuming it is tuned for specific matrix dimensions), so we can unroll it. I have found that using #pragma optionNV(unroll all) solves the problem and the kernel compiles, *BUT* it produces a big binary. So the question is: is there a pragma in the NV compiler to unroll only selected loops, similar to D3D's [unroll]? Hey, that's another place D3D11 compute shaders are better than OpenGL. With that support I could unroll the outer loop of the matmul code posted before, while leaving the inner loop between the barriers rolled.


    I have found that using GL_NV_gpu_shader5 works perfectly, and it allows using more CUDA code with less massaging. I happily found that OpenCL code (CUDA code also uses this pattern) which passes shared memory pointers as function arguments compiles correctly up to the assembly compile, which then hits something like an LDS.? line and returns with an error.
    So, given your comment that "use of pointers on shared memory currently won't be possible", I hope this will become possible in the future. Please allow that use case!

  10. #30 | Senior Member OpenGL Guru | Join Date: May 2009 | Posts: 4,948
    Quote Originally Posted by oscarbg
    First, I think the comment about it not currently being possible to use code that has barriers inside control flow is correctly posted here, because if it isn't allowed in the spec, this is the place to motivate NVIDIA to relax the restriction.
    Presumably by releasing an extension, right? Because we don't want anyone (AMD, NVIDIA, Intel, etc.) just ignoring what the spec says because it isn't what they want. Your AMD_compute_shader example, for instance, is a separate specification, which allows AMD to expose whatever functionality they choose.
