Part of the Khronos Group
OpenGL.org


Thread: NVIDIA releases OpenGL 4.3 beta drivers

  1. #31
    Intern Newbie
    Join Date
    Oct 2007
    Posts
    47
    Alfonse, wouldn't it be better to say "hey, it's a bug in the spec definition"? After all, the spec is only days old and only implemented by NVIDIA right now, and we have two options:
    1. The spec doesn't prohibit barriers in control flow; it's implementation-dependent.
    2. The spec allows barriers inside control flow; after all, all D3D11 hardware and even OpenCL hardware allow it.

  2. #32
    Senior Member OpenGL Guru
    Join Date
    May 2009
    Posts
    4,948
    NVIDIA doesn't control the OpenGL specification. This would be a proposal for the ARB, not for NVIDIA.

  3. #33
    Junior Member Newbie
    Join Date
    Aug 2012
    Location
    Switzerland
    Posts
    14
    "barrier();" isn't allowed in control flow, but the new barriers like "groupMemoryBarrier();" (which would be the equivalent, right?) are allowed, I think, and I successfully used those with the NVIDIA driver (see line 269 here https://github.com/progschj/OpenGL-E...ader_nbody.cpp). Actually, the way I read it, OpenGL even has global barriers (?), which aren't even present in, for example, OpenCL.

    Edit: while I'm at it: one obvious bug I found with the 4.3 driver is that, according to the spec, "gl_WorkGroupSize" is supposed to be const so it can be used as an array size. Section 7.1 of the spec says:
    The built-in constant gl_WorkGroupSize is a compute-shader constant containing the local work-group
    size of the shader. The size of the work group in the X, Y, and Z dimensions is stored in the x, y, and z
    components. The values stored in gl_WorkGroupSize match those specified in the required local_size_x,
    local_size_y, and local_size_z layout qualifiers for the current shader. This value is constant so that it can
    be used to size arrays of memory that can be shared within the local work group.
    but when I try exactly that: "shared vec4 tmp[gl_WorkGroupSize.x];" (for example on line 258 of the example linked above),
    it errors with: "0(6) : error C1307: non constant expression for array size"

    Also, shared memory seems somehow broken. I have this example: http://ideone.com/aI4WL
    Essentially it first tries to fill a shared array with the numbers 0-255 and then write those out to memory. But it appears that line 14 writes to a "wrong offset", partially overwriting the first half of the local array.
    Last edited by JakobProgsch; 08-25-2012 at 01:18 PM.

  4. #34
    Junior Member Newbie
    Join Date
    Aug 2012
    Location
    Switzerland
    Posts
    14
    Hmm, the forum ate my first post. Let's try again:

    About the barriers: "barrier();" does indeed not work in flow control, which is in accordance with the spec. But the other barriers, such as groupMemoryBarrier() etc., do. See for example line 268 here: https://github.com/progschj/OpenGL-E...ader_nbody.cpp

    There are two issues I found which I think are bugs in the driver.

    The first is that the values in gl_WorkGroupSize are apparently not constant.
    Trying something like "shared float tmp[gl_WorkGroupSize.x];"
    results in a shader compiler error: "error C1307: non constant expression for array size".
    But the spec explicitly states that gl_WorkGroupSize is constant for exactly that purpose (in 7.1, page 112 of the non-annotated version):
    The built-in constant gl_WorkGroupSize is a compute-shader constant containing the local work-group
    size of the shader. The size of the work group in the X, Y, and Z dimensions is stored in the x, y, and z
    components. The values stored in gl_WorkGroupSize match those specified in the required local_size_x,
    local_size_y, and local_size_z layout qualifiers for the current shader. This value is constant so that it can
    be used to size arrays of memory that can be shared within the local work group.
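    To make the contrast concrete, here is a minimal sketch of the usage the spec mandates (the array name and kernel body are just placeholders; the gl_WorkGroupSize form is exactly what the current beta driver rejects):

    ```glsl
    // Sketch: sizing a shared array per GLSL 4.30, section 7.1.
    #version 430
    layout(local_size_x = 128) in;

    shared float tmp[gl_WorkGroupSize.x]; // spec-legal; the beta driver errors out
    // shared float tmp[128];             // workaround: repeat the literal size

    void main() {
        tmp[gl_LocalInvocationIndex] = 0.0;
    }
    ```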
    The other is that indexing/writing to shared memory seems "off". I broke down my example to the following shader:
    Code :
            #version 430
            layout(local_size_x=128) in;
     
            layout(r32f, location = 0) uniform imageBuffer data;
     
            shared float local[256];
            void main() {
               int N = imageSize(data);
               int index = int(gl_GlobalInvocationID);
               int localindex = int(gl_LocalInvocationIndex);
     
               local[localindex] = localindex;
               local[localindex+128] = localindex+128;
     
               groupMemoryBarrier();
               imageStore(data, index, vec4(local[localindex]));
               imageStore(data, index+128, vec4(local[localindex+128]));
            }

    which I run with a single work group. The expected result is that the imageBuffer gets filled with the values 0...255, but for some reason the result is this:
    Code :
        0    1    2    3    4    5    6    7  128  129  130  131  132  133  134  135
      136  137  138  139  140  141  142  143  144  145  146  147  148  149  150  151
      152  153  154  155  156  157  158  159  160  161  162  163  164  165  166  167
      168  169  170  171  172  173  174  175  176  177  178  179  180  181  182  183
      184  185  186  187  188  189  190  191  192  193  194  195  196  197  198  199
      200  201  202  203  204  205  206  207  208  209  210  211  212  213  214  215
      216  217  218  219  220  221  222  223  224  225  226  227  228  229  230  231
      232  233  234  235  236  237  238  239  240  241  242  243  244  245  246  247
      128  129  130  131  132  133  134  135  136  137  138  139  140  141  142  143
      144  145  146  147  148  149  150  151  152  153  154  155  156  157  158  159
      160  161  162  163  164  165  166  167  168  169  170  171  172  173  174  175
      176  177  178  179  180  181  182  183  184  185  186  187  188  189  190  191
      192  193  194  195  196  197  198  199  200  201  202  203  204  205  206  207
      208  209  210  211  212  213  214  215  216  217  218  219  220  221  222  223
      224  225  226  227  228  229  230  231  232  233  234  235  236  237  238  239
      240  241  242  243  244  245  246  247  248  249  250  251  252  253  254  255
    so the first half gets partially overwritten by the second somehow, but the overwriting starts at an offset of 8 elements...
    And if I change appearances of "local[localindex+128]" to "local[localindex+2*128]" it starts at 16 elements etc., so it's like the +128 somehow gets added in with a wrong factor?

    Here is the full test case:
    Code :
    #include <GL3/gl3w.h>
    #include <GL/glfw.h>
     
    #include <iostream>
    #include <iomanip>
    #include <algorithm>
    #include <string>
    #include <vector>
    #include <cstdlib>
    #include <cmath>
     
    bool running;
     
    // window close callback function
    int closedWindow()
    {
        running = false;
        return GL_TRUE;
    }
     
    // helper to check and display for shader compiler errors
    bool check_shader_compile_status(GLuint obj)
    {
        GLint status;
        glGetShaderiv(obj, GL_COMPILE_STATUS, &status);
        if(status == GL_FALSE)
        {
            GLint length;
            glGetShaderiv(obj, GL_INFO_LOG_LENGTH, &length);
            std::vector<char> log(length);
            glGetShaderInfoLog(obj, length, &length, &log[0]);
            std::cerr << &log[0];
            return false;
        }
        return true;
    }
     
    // helper to check and display for shader linker error
    bool check_program_link_status(GLuint obj)
    {
        GLint status;
        glGetProgramiv(obj, GL_LINK_STATUS, &status);
        if(status == GL_FALSE)
        {
            GLint length;
            glGetProgramiv(obj, GL_INFO_LOG_LENGTH, &length);
            std::vector<char> log(length);
            glGetProgramInfoLog(obj, length, &length, &log[0]);
            std::cerr << &log[0];
            return false;   
        }
        return true;
    }
     
    int main()
    {
        int width = 640;
        int height = 480;
     
        if(glfwInit() == GL_FALSE)
        {
            std::cerr << "failed to init GLFW" << std::endl;
            return 1;
        } 
        glfwOpenWindowHint(GLFW_OPENGL_PROFILE, GLFW_OPENGL_CORE_PROFILE);
        glfwOpenWindowHint(GLFW_OPENGL_VERSION_MAJOR, 4);
        glfwOpenWindowHint(GLFW_OPENGL_VERSION_MINOR, 3);
     
        // create a window
        if(glfwOpenWindow(width, height, 0, 0, 0, 8, 24, 8, GLFW_WINDOW) == GL_FALSE)
        {
            std::cerr << "failed to open window" << std::endl;
            glfwTerminate();
            return 1;
        }
     
        // setup windows close callback
        glfwSetWindowCloseCallback(closedWindow);
     
        glfwSwapInterval(0);
     
        if (gl3wInit())
        {
            std::cerr << "failed to init GL3W" << std::endl;
            glfwCloseWindow();
            glfwTerminate();
            return 1;
        }
        const char *source;
        int length;
     
        std::string test_source =
            "#version 430\n"
            "layout(local_size_x=128) in;\n"
     
            "layout(r32f, location = 0) uniform imageBuffer data;\n"
     
            "shared float local[256];\n"
            "void main() {\n"
            "   int N = imageSize(data);\n"
            "   int index = int(gl_GlobalInvocationID);\n"
            "   int localindex = int(gl_LocalInvocationIndex);\n"
     
            "   local[localindex] = localindex;\n"
            "   local[localindex+128] = localindex+128;\n"
     
            "   groupMemoryBarrier();\n"
            "   imageStore(data, index, vec4(local[localindex]));\n"
            "   imageStore(data, index+128, vec4(local[localindex+128]));\n"
            "}\n";
     
        // program and shader handles
        GLuint test_program, test_shader;
     
        // create and compiler vertex shader
        test_shader = glCreateShader(GL_COMPUTE_SHADER);
        source = test_source.c_str();
        length = test_source.size();
        glShaderSource(test_shader, 1, &source, &length); 
        glCompileShader(test_shader);
        if(!check_shader_compile_status(test_shader))
        {
            return 1;
        }
     
        // create program
        test_program = glCreateProgram();
     
        // attach shaders
        glAttachShader(test_program, test_shader);
     
        // link the program and check for errors
        glLinkProgram(test_program);
        check_program_link_status(test_program);   
     
     
        std::vector<float> data(256);
        //~ std::generate(data.begin(), data.end(), randf);
        std::fill(data.begin(), data.end(), 1.0f);
     
        for(int i = 0;i<256;++i)
        {
            data[i] = -1;
        }
     
        GLuint buffer;
     
        glGenBuffers(1, &buffer);
        glBindBuffer(GL_TEXTURE_BUFFER, buffer);
        glBufferData(GL_TEXTURE_BUFFER, sizeof(float)*data.size(),  &data[0], GL_STATIC_DRAW);                  
     
     
        // texture handle
        GLuint buffer_texture;
     
        glGenTextures(1, &buffer_texture);
        glBindTexture(GL_TEXTURE_BUFFER, buffer_texture);
        glTexBuffer(GL_TEXTURE_BUFFER, GL_R32F, buffer);
     
        // bind images
        glBindImageTexture(0, buffer_texture, 0, GL_FALSE, 0, GL_READ_WRITE, GL_R32F);
     
        glUseProgram(test_program);
        glUniform1i(0, 0);
     
        glDispatchCompute(1, 1, 1);
     
        glGetBufferSubData(GL_TEXTURE_BUFFER, 0, sizeof(float)*data.size(), &data[0]);
     
        for(size_t i = 0;i<data.size();i+=1)
        {
            if(i%16==0) std::cout << std::endl;
            std::cout << std::setw(5) << data[i];
        }
        std::cout << std::endl;
     
        GLint shared_size;
        glGetIntegerv(GL_MAX_COMPUTE_SHARED_MEMORY_SIZE, &shared_size);
        std::cout << "max shared: " << shared_size << std::endl;
     
        glfwCloseWindow();
        glfwTerminate();
        return 0;
    }

    Edit:
    I did some more research into the second issue and wondered whether, instead of writing "localindex+128", I could precompute a localindex128 and use that, since that would get rid of the inlined +128 that I suspected was being added incorrectly. At first that didn't change anything, which I then realized doesn't mean much, since the compiler might inline my localindex128 during static analysis. So I needed a way to write "+128" without the compiler going nuts with optimization. First I tried making localindex128 volatile, which didn't work. Then I thought "well, all I need is a constant 128 that the compiler thinks isn't constant"... See where this is going? So I used gl_WorkGroupSize.x from the first issue and ended up with this:
    Code :
            #version 430
            layout(local_size_x=128) in;
     
            layout(r32f, location = 0) uniform imageBuffer data;
     
            shared float local[256];
            void main() {
               int N = imageSize(data);
               int index = int(gl_GlobalInvocationID);
               int localindex = int(gl_LocalInvocationIndex);
               int localindex128 = int(gl_LocalInvocationIndex+gl_WorkGroupSize.x); //gl_WorkGroupSize.x is 128 but because of the first bug the compiler doesn't know it's constant
     
               local[localindex] = localindex;
               local[localindex128] = localindex128;
     
               groupMemoryBarrier();
               imageStore(data, index, vec4(local[localindex]));
               imageStore(data, index+128, vec4(local[localindex128]));
            }
    which gives the expected result. I guess that is about as far as I can narrow it down from my end.
    Last edited by JakobProgsch; 08-26-2012 at 04:52 AM.

  5. #35
    Junior Member Regular Contributor
    Join Date
    Sep 2001
    Location
    Wake Forest, NC, USA
    Posts
    171
    Jakob,

    Thanks for the feedback.

    I agree that the first issue (not treating gl_WorkGroupSize as a constant expression) looks like a driver compiler bug, where it is treating it as an "in" instead of a "const". Hopefully, this should be easy to fix.

    I'll have to look at the second issue in more detail. One thing that looks wrong about that shader is that there is no barrier() call between the shared memory stores and loads. There are two types of barriers with different purposes:

    - groupMemoryBarrier() ensures that memory transactions are flushed so other threads can see them
    - barrier() ensures that all threads have finished their stores before we continue

    Typically, you need both for safety. It's not clear to me that the lack of a barrier() call in your shader has anything to do with the problem here, because each thread appears to read only shared memory values written by its own thread (and not touched by any other thread).
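    As a sketch (the array name and sizes are illustrative, not from any shipping example), the pattern for a safe cross-invocation read of shared memory looks like:

    ```glsl
    #version 430
    layout(local_size_x = 128) in;

    shared float tile[128];

    void main() {
        uint i = gl_LocalInvocationIndex;
        tile[i] = float(i);       // each invocation stores its own slot

        groupMemoryBarrier();     // flush the stores so the group can see them
        barrier();                // wait until every invocation has stored

        // Only now is it safe to read a slot written by another invocation.
        float neighbor = tile[(i + 1u) % 128u];
    }
    ```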

  6. #36
    Junior Member Regular Contributor
    Join Date
    Sep 2001
    Location
    Wake Forest, NC, USA
    Posts
    171
    Quote Originally Posted by Alfonse Reinheart View Post
    NVIDIA doesn't control the OpenGL specification. This would be a proposal for the ARB, not for NVIDIA.
    Yes, that's correct, though it would certainly be possible to create a NVIDIA extension to GLSL that relaxes this restriction.

    The limitation on barrier() is inherited from tessellation control shaders. The intent of the restriction is to prevent you from writing shaders that will hang. For example, in the following code:
    Code :
    if (divergent_conditional_expression) {
      barrier();
    }
    the threads where the expression is true will call barrier() and then wait around for the other threads. But the other threads might not call barrier() at all.
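    The usual workaround under the current rule is to hoist the barrier() into uniform flow control, for example:

    ```glsl
    // Sketch: every invocation reaches barrier(); only the work diverges.
    if (divergent_conditional_expression) {
        // ... work that must finish before the sync point ...
    }
    barrier(); // uniform flow control: executed by all invocations
    if (divergent_conditional_expression) {
        // ... work that depends on the other invocations' stores ...
    }
    ```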

    This restriction wasn't a big deal for tessellation control shaders, as the places where you want barrier() calls are typically very limited and in well-defined places (i.e., before computing outer or inner tessellation levels). This is a bigger issue for compute shaders because there are algorithms where you may want to run multiple phases of computation in a loop, which would naturally result in barrier() calls inside the loop.

    There are a few options here, none of which are perfect:

    (1) Just allow barrier() anywhere, making hangs possible.

    (2) Allow barrier() in more places, but still have some limitations to avoid hangs. For example, allow it in loops with uniform flow control (e.g., uniform start/end points, no conditional "break", no conditional "continue" before the barrier). This will be fairly tricky to specify and implement.

    (3) Leave it as-is.

    I've already filed a Khronos bug report on this issue, but it was too late for GLSL 4.30.

  7. #37
    Junior Member Regular Contributor
    Join Date
    Sep 2001
    Location
    Wake Forest, NC, USA
    Posts
    171
    Jakob,
    Quote Originally Posted by pbrown View Post
    I'll have to look at the second issue in more detail. One thing that looks wrong about that shader is that there is no barrier() call between the shared memory stores and loads.
    I think I've root-caused it. It should be easy to fix.

    It has nothing to do with the absence of barrier() in your shader, though you will need barrier() calls for more complex shared memory usage as I noted in my previous comment.

  8. #38
    Junior Member Newbie
    Join Date
    Aug 2012
    Location
    Switzerland
    Posts
    14
    Yep, I just reread that section of the spec and also realized that this means the tiled n-body kernel I ported from CUDA is incorrect (and can't be fixed, since it would need the barrier inside flow control...). Thanks for looking into these.

  9. #39
    Intern Newbie
    Join Date
    Oct 2007
    Posts
    47
    Hi Pat,
    Is it too much to ask for NVIDIA to remove this check (barriers in control flow) in assembly *compute shader* code, so we can start playing with advanced compute codes that require it?
    Can we expect a solution before GL 4.3 gets into mainline drivers, i.e. in the next 4.3 beta drivers?

    I also tried to patch the binary shaders, but the binaries don't seem similar to CUDA binaries...

    I hope NVIDIA implements either solution #1 or #2.
    As said, #1 seems to be what D3D Compute, CUDA, and OpenCL allow, i.e. no restrictions, with programmers responsible for writing "good" code that doesn't hang.
    If you want to avoid hangs, realistically solution #2 can be enough for just those algorithms that require control flow.
    Either way, I'm waiting for NVIDIA to lift that restriction soon, for broad testing of some compute codes.

    Thanks for detailed info on the issue!
    Quote Originally Posted by pbrown View Post
    Yes, that's correct, though it would certainly be possible to create a NVIDIA extension to GLSL that relaxes this restriction.

    The limitation on barrier() is inherited from tessellation control shaders. The intent of the restriction is to prevent you from writing shaders that will hang. For example, in the following code:
    Code :
    if (divergent_conditional_expression) {
      barrier();
    }
    the threads where the expression is true will call barrier() and then wait around for the other threads. But the other threads might not call barrier() at all.

    This restriction wasn't a big deal for tessellation control shaders, as the places where you want barrier() calls are typically very limited and in well-defined places (i.e., before computing outer or inner tessellation levels). This is a bigger issue for compute shaders because there are algorithms where you may want to run multiple phases of computation in a loop, which would naturally result in barrier() calls inside the loop.

    There are a few options here, none of which are perfect:

    (1) Just allow barrier() anywhere, making hangs possible.

    (2) Allow barrier() in more places, but still have some limitations to avoid hangs. For example, allow it in loops with uniform flow control (e.g., uniform start/end points, no conditional "break", no conditional "continue" before the barrier). This will be fairly tricky to specify and implement.

    (3) Leave it as-is.

    I've already filed a Khronos bug report on this issue, but it was too late for GLSL 4.30.

  10. #40
    Junior Member Newbie
    Join Date
    May 2008
    Location
    Austin, TX
    Posts
    3
    Think I've run into a bug in the Linux driver. I reduced a misbehaving parallel scan to this minimal test case:

    Code :
    // ARB_compute_shader shared[] array test case.
    //
    // When using arrays of shared variables, index expressions involving
    // gl_LocalInvocationID lead to stores not being observed by subsequent
    // loads to the same location.
     
    #include <GL/glew.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
     
    #ifdef __linux__
    #  include <GL/glx.h>
    #  include <X11/Xlib.h>
    #endif
    #ifdef _WIN32
    #  include <Windows.h>
    #endif
     
    #define mChkGL(X) X; CheckGLError(__LINE__);
     
    const GLchar *ComputeCode =
      // This shader should write 1 to the second element of sharedBuf,
      // then read that value back and write it to OutputBuffer.
      // There is 1 work group of size 1. The code is guarded so that
      // only gl_LocalInvocationID.x == 0 executes the test.
      "#version 430 core                                                   \n"
      "                                                                    \n"
      "layout(std430, binding = 0) buffer Output { int OutputBuffer[1]; }; \n"
      "shared int sharedBuf[2];                                            \n"
      "                                                                    \n"
      "layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;    \n"
      "void main() {                                                       \n"
      "  if(gl_LocalInvocationID.x == 0) {                                 \n"
      "    /* Initialize second element with 0. */                         \n"
      "    sharedBuf[1] = 0;                                               \n"
      "                                                                    \n"
    // Uncomment the following line to make validation pass.
    //  "    sharedBuf[1] = 1;                                               \n"
      "                                                                    \n"
      "    /* This store to the second element is not seen by the load. */ \n"
      "    sharedBuf[1 + gl_LocalInvocationID.x] = 1;                      \n"
      "                                                                    \n"
      "    /* Copy second element out for validation. */                   \n"
      "    OutputBuffer[0] = sharedBuf[1];                                 \n"
      "  }                                                                 \n"
      "}                                                                   \n"
    ;
     
    void CheckGLError(int line) {
      GLint glErr = glGetError();
     
      if(glErr != GL_NO_ERROR) {
        printf("OpenGL error %d at line %d\n", glErr, line);
        exit(1);
      }
    }
     
    int main() {
      // Minimal OpenGL context setup.
    #ifdef __linux__
      Display *display = XOpenDisplay(NULL);
      Window window = XCreateSimpleWindow(display, DefaultRootWindow(display), 0, 0, 1, 1, 0, 0, 0);
      int visual[] = { GLX_RGBA, 0 };
      XVisualInfo *vInfo = glXChooseVisual(display, DefaultScreen(display), visual);
      GLXContext glCtx = glXCreateContext(display, vInfo, NULL, 1);
      glXMakeCurrent(display, window, glCtx);
    #endif
    #ifdef _WIN32
      HWND window = CreateWindow(L"edit", 0, 0, 0, 0, 1, 1, NULL, NULL, NULL, NULL);
      HDC dc = GetDC(window);
      PIXELFORMATDESCRIPTOR pfd;
      memset(&pfd, 0, sizeof(pfd));
      pfd.dwFlags = PFD_SUPPORT_OPENGL;
      SetPixelFormat(dc, ChoosePixelFormat(dc, &pfd), &pfd);
      HGLRC glCtx = wglCreateContext(dc);
      wglMakeCurrent(dc, glCtx);
    #endif
     
      glewInit();
     
      // Allocate buffer for both programs.
      GLuint buffer;
      mChkGL(glGenBuffers(1, &buffer));
      mChkGL(glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, buffer));
      mChkGL(glBufferData(GL_SHADER_STORAGE_BUFFER, sizeof(int), NULL, GL_DYNAMIC_DRAW));
     
      // Build compute shader.
      GLuint shader = glCreateShader(GL_COMPUTE_SHADER);
      mChkGL(glShaderSource(shader, 1, &ComputeCode, NULL));
      mChkGL(glCompileShader(shader));
     
      GLint compileLogSize;
      mChkGL(glGetShaderiv(shader, GL_INFO_LOG_LENGTH, &compileLogSize));
     
      if(compileLogSize > 0) {
        char *compileLog = new char[compileLogSize];
        mChkGL(glGetShaderInfoLog(shader, compileLogSize, NULL, compileLog));
        printf("%s", compileLog);
      }
     
      // Build compute program.
      GLuint program = glCreateProgram();
      mChkGL(glAttachShader(program, shader));
      mChkGL(glLinkProgram(program));
     
      GLint linkLogSize;
      mChkGL(glGetProgramiv(program, GL_INFO_LOG_LENGTH, &linkLogSize));
     
      if(linkLogSize > 0) {
        char *linkLog = new char[linkLogSize];
        mChkGL(glGetProgramInfoLog(program, linkLogSize, NULL, linkLog));
        printf("%s", linkLog);
      }
     
      // Invoke compute program and check result.
      mChkGL(glClearBufferData(GL_SHADER_STORAGE_BUFFER, GL_R32UI, GL_RED, GL_INT, NULL));
      mChkGL(glUseProgram(program));
      mChkGL(glDispatchCompute(1, 1, 1));
      mChkGL(glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT));
     
      int *bufferData = (int *)glMapBuffer(GL_SHADER_STORAGE_BUFFER, GL_READ_ONLY);
      int bufferVal = *bufferData;
      mChkGL(glUnmapBuffer(GL_SHADER_STORAGE_BUFFER));
     
      printf("Validation %s\n", (bufferVal == 1) ? "PASSED" : "FAILED");
    }

    The shader store to sharedBuf[1 + gl_LocalInvocationID.x] (where local id is 0) is not observed by the subsequent load from sharedBuf[1].

    This is running on a GTX 470. The test passes fine with the Windows driver.
    Last edited by jcornwall; 09-01-2012 at 06:40 AM.
