View Full Version : NVIDIA releases OpenGL 4.3 beta drivers



Khronos_webmaster
08-06-2012, 05:57 AM
NVIDIA is proud to announce the immediate availability of OpenGL 4.3 beta drivers for Windows and Linux.

You will need one of the following Fermi-based or newer GPUs to get access to the full OpenGL 4.3 and GLSL 4.30 functionality:
Quadro series: 6000, 600, 5000, 410, 4000, 400, 2000D, 2000
GeForce 600 series: GTX 690, GTX 680, GTX 670, GT 645, GT 640, GT 630, GT 620, GT 610, 605
GeForce 500 series: GTX 590, GTX 580, GTX 570, GTX 560 Ti, GTX 560 SE, GTX 560, GTX 555, GTX 550 Ti, GT 545, GT 530, GT 520, 510
GeForce 400 series: GTX 480, GTX 470, GTX 465, GTX 460 v2, GTX 460 SE v2, GTX 460 SE, GTX 460, GTS 450, GT 440, GT 430, GT 420, 405

For OpenGL 3 capable hardware, these new extensions are provided:
ARB_arrays_of_arrays
ARB_clear_buffer_object
ARB_copy_image
ARB_ES3_compatibility
ARB_explicit_uniform_location
ARB_fragment_layer_viewport
ARB_framebuffer_no_attachments
ARB_internalformat_query2
ARB_invalidate_subdata
ARB_program_interface_query
ARB_robust_buffer_access_behavior
ARB_stencil_texturing
ARB_texture_buffer_range
ARB_texture_query_levels
ARB_texture_storage_multisample
ARB_texture_view
ARB_vertex_attrib_binding
KHR_debug
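To give a flavor of the OpenGL 3-level additions, here is a minimal fragment shader sketch using one of them, ARB_explicit_uniform_location (the location value is an arbitrary illustration):

```glsl
#version 330 core
#extension GL_ARB_explicit_uniform_location : require

// The location is fixed in the shader source, so the application can
// call glUniform4fv(3, ...) directly without glGetUniformLocation.
layout(location = 3) uniform vec4 tint;

out vec4 color;

void main()
{
    color = tint;
}
```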

For OpenGL 4 capable hardware, these new extensions are provided:
ARB_compute_shader
ARB_multi_draw_indirect
ARB_shader_image_size
ARB_shader_storage_buffer_object
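And for the OpenGL 4-level additions, a minimal sketch combining two of them, ARB_compute_shader and ARB_shader_storage_buffer_object (the binding point and buffer contents are illustrative assumptions):

```glsl
#version 430 core
#extension GL_ARB_compute_shader : require
#extension GL_ARB_shader_storage_buffer_object : require

layout(local_size_x = 64) in;

// A shader storage buffer the application would bind to index 0
// with glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, buf).
layout(std430, binding = 0) buffer Data
{
    float values[];
};

void main()
{
    uint i = gl_GlobalInvocationID.x;
    values[i] *= 2.0; // each invocation doubles one element
}
```

The application side would then launch it with glDispatchCompute(N / 64, 1, 1).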

The drivers and extension documentation can be downloaded from http://www.nvidia.com/content/devzone/opengl-driver-4.3.html

Piers Daniell
08-06-2012, 09:53 AM
GLEW 1.9.0 has also been released which includes support for OpenGL 4.3:
http://glew.sourceforge.net/

This will make using the new API easier.

aqnuep
08-06-2012, 12:59 PM
GLEW 1.9.0 has also been released which includes support for OpenGL 4.3
Wow! That was the quickest GLEW update in history :)

malexander
08-10-2012, 09:36 AM
Just tried out glObjectLabel(), and it doesn't work with GL_BUFFER or GL_TEXTURE - it throws an INVALID_ENUM for both (at least the feedback is nice: "GL_INVALID_ENUM error generated. ObjectLabel: invalid <identifier> enum value"). I also tried GL_TEXTURE_2D just in case, but this didn't work either (nor should it, from the spec).

It works with GL_FRAMEBUFFER, GL_PROGRAM and GL_SHADER, though - at least it doesn't throw a GL error. However, a later performance message "Program/shader state performance warning: Fragment Shader is going to be recompiled because the shader key based on GL state mismatches." doesn't reference my shader name at all, which seems to defeat the purpose. It'd also be nice to know what part of the GL state it's referring to.

Alfonse Reinheart
08-10-2012, 09:41 AM
It'd also be nice to know what part of the GL state it's referring to.

It seems to always say this for me, for any program I create. But at least it's an initialization-time thing.

Brandon J. Van Every
08-10-2012, 09:56 AM
I installed NVIDIA beta driver 304.32 on Lubuntu 12.04 Linux using the xorg-edgers PPA (https://launchpad.net/~xorg-edgers/+archive/ppa) so that I wouldn't have to fool with .run scripts and turning off X servers. On my laptop I have an old GeForce 8600M GT card, which is OpenGL 3.3 class HW. The NVIDIA OpenGL driver page (http://developer.nvidia.com/opengl-driver) says, "For OpenGL 3 capable hardware, these new extensions are provided:" and gives a list of 18 extensions. Unfortunately none of them are shown as available by either glxinfo or nvidia-settings.


Anyone have OpenGL 3 class HW that's showing the new extensions, on either Windows or Linux?

malexander
08-10-2012, 09:59 AM
Hm, perhaps I will disable that warning.

Correction to my original post - glObjectLabel(GL_FRAMEBUFFER, ...) produces an invalid enum as well. I mistook that error for a GL_TEXTURE error.

Piers Daniell
08-10-2012, 02:34 PM
It appears we missed support for GL_TEXTURE, GL_FRAMEBUFFER and GL_RENDERBUFFER. All the others are supported. This is a trivial fix and it'll be in the next beta update next week. Sorry for the trouble and thanks for reporting the bug.

We'll also audit our error messages and make sure they use the object label instead of the number if it exists. This fix will come later. The extension is still useful because middleware tools can query objects to get names which the app sets. But clearly having the names in the debug_output strings makes sense.

malexander
08-11-2012, 10:53 AM
Thanks! I also wouldn't mind something like "object name (object id)" in the debug output.

Brandon J. Van Every
08-11-2012, 12:57 PM
What is the best way to file a bug report to NVIDIA for their beta drivers? I used their generic "Submit feedback on an NVIDIA product (http://nvidia-submit.custhelp.com/app/ask)" form, but the release notes for the driver direct one to an NVIDIA developer site (https://nvdeveloper.nvidia.com/). This site seems to be isolated from the main NVIDIA site; the developer area reached from the main site is different. I used to have an NVIDIA developer account eons ago, but it seems that has lapsed. I don't see any obvious way to create an account for the nvdeveloper.nvidia.com site, and it doesn't accept the username and password I have for the main site. So, I have the feeling that the left hand is separated from what the right hand is doing, and nobody's actually going to read the feedback I just filed. What's the best way?

ScottManDeath
08-12-2012, 02:18 AM
Please send Piers or me a private message so we can take care of this and file a bug.

(Side note: our developer forums were recently compromised by a third party, which might explain why your account is no longer working.)

Piers Daniell
08-13-2012, 08:31 PM
I installed NVIDIA beta driver 304.32 on Lubuntu 12.04 Linux using the xorg-edgers PPA (https://launchpad.net/~xorg-edgers/+archive/ppa) so that I wouldn't have to fool with .run scripts and turning off X servers. On my laptop I have an old GeForce 8600M GT card, which is OpenGL 3.3 class HW. The NVIDIA OpenGL driver page (http://developer.nvidia.com/opengl-driver) says, "For OpenGL 3 capable hardware, these new extensions are provided:" and gives a list of 18 extensions. Unfortunately none of them are shown as available by either glxinfo or nvidia-settings.


Anyone have OpenGL 3 class HW that's showing the new extensions, on either Windows or Linux?

I think if you installed a driver with version 304.32 you probably didn't use the OpenGL 4.3 beta driver, which is version 304.15.00.02 for Linux. See here for the driver location:
http://www.nvidia.com/content/devzone/opengl-driver-4.3.html

malexander
08-14-2012, 03:12 PM
Found another small one. I'm not sure when this crept in, but I was previously using 295.49 without a hitch. I have a vertex shader with the following outputs:


flat out ivec4 pickID;
out float pickZ;

When I call glGetProgramiv( pid, GL_TRANSFORM_FEEDBACK_VARYING_MAX_LENGTH, &max_len) on the parent program of the vertex shader (no other shader stages), it sets max_len = 6. According to the GL spec it should return the length of the largest string including the null terminator, in this case 7. When I then pass max_len to glGetTransformFeedbackVarying() as the bufSize parameter, 'pickID' is cut off to 'pickI' (which is expected if bufSize==6) and this messes up further rendering.

In the meantime I've accounted for the missing null terminator by adding one to max_len. I can afford the extra byte :)

pbrown
08-14-2012, 07:31 PM
When I call glGetProgramiv( pid, GL_TRANSFORM_FEEDBACK_VARYING_MAX_LENGTH, &max_len) on the parent program of the vertex shader (no other shader stages), it sets max_len = 6. According to the GL spec it should return the length of the largest string including the null terminator, in this case 7.

I've confirmed that this is a recent regression, and might only be in the OpenGL 4.3 beta driver. I'll try to get this into the next beta driver.

pbrown
08-14-2012, 07:58 PM
Please send Piers or me a private message so we can take care of this and file a bug.

You're also welcome to send me bug reports. My NVIDIA email ID is the same as my user ID on these forums.

guibou
08-15-2012, 10:49 AM
pbrown:

I have an issue with compute shaders, buffers, and the *usage* hint of a shader storage buffer:

http://www.opengl.org/discussion_boards/showthread.php/178764-Compute-shader-and-buffer-format?p=1241449#post1241449

I'm pretty sure I'm doing something ugly, but the behavior of the code changes completely depending on the usage hint of the buffer.

Should I send you an email?

Brandon J. Van Every
08-15-2012, 11:52 AM
I think if you installed a driver with version 304.32 you probably didn't use the OpenGL 4.3 beta driver, which is version 304.15.00.02 for Linux. See here for the driver location:
http://www.nvidia.com/content/devzone/opengl-driver-4.3.html

When I installed the exact driver version 304.15.00.02, I did get the 18 new extensions for my 3.x-class HW. Unfortunately it also sent me up a driver-installation and package-update learning curve that destabilized my system. I tried to use Duplicity to restore it and it totally failed, although I didn't lose any personal data. I don't recommend the xorg-edgers stuff: they don't actually have the beta driver, and all their extra package junk is pretty much what destabilized my system. Hopefully I'll have better luck working from mainstream Lubuntu 12.04. I've got a fresh system now; once I've found a more reliable system backup program, I'll try the NVIDIA .run install again and hopefully end up with a viable development system.

pbrown
08-16-2012, 11:40 AM
pbrown:

I have an issue with compute shaders, buffers, and the *usage* hint of a shader storage buffer:
...
Should I send you an email?

Someone already spotted that, and Piers root-caused and fixed the bug. Thanks for the helpful report and reproducer. If you have more issues that you suspect are likely driver things, feel free to shoot me an email.

Pat

gdewan
08-16-2012, 02:55 PM
EDIT: Something I thought was related to this driver was actually present in an earlier driver; making a separate post.

oscarbg
08-21-2012, 05:44 PM
First, is NVIDIA monitoring the OpenGL 4.3 sample pack by g-truc? Three of the samples are not working, and I'm concentrating on the two I'm most interested in.

In the 430-image_store sample I get what I think is an error in the generated assembly:

line 22, column 5: error: supported only on load, store, and atomic instructions

which refers to:

IMQ.COH R0.xy, images[R0.x], 2D;

This is related to the new function

ivec2 Size = imageSize(Diffuse);

in the shader, which I can fix by removing "coherent" from the image declaration, changing

layout(binding = 0, rgba8) coherent uniform image2D Diffuse; // doesn't work

to

layout(binding = 0, rgba8) uniform image2D Diffuse; // works!

Full details (including where it fails) with the patch applied:


#version 420 core
#extension GL_ARB_shader_image_size : require

#define FRAG_COLOR 0
#define DIFFUSE 0

in vec4 gl_FragCoord;
//layout(binding = 0, rgba8) coherent uniform image2D Diffuse; // doesn't work
layout(binding = 0, rgba8) uniform image2D Diffuse; // works!

layout(location = FRAG_COLOR, index = 0) out vec4 Color;

const int Border = 8;

void main()
{
    ivec2 Size = imageSize(Diffuse);

    if(gl_FragCoord.x < Border)
        Color = vec4(1.0, 0.0, 0.0, 1.0);
    if(gl_FragCoord.x > Size.x - Border)
        Color = vec4(0.0, 1.0, 0.0, 1.0);
    if(gl_FragCoord.y < Border)
        Color = vec4(1.0, 1.0, 0.0, 1.0);
    if(gl_FragCoord.y > Size.y - Border)
        Color = vec4(0.0, 0.0, 1.0, 1.0);
    else
        Color = imageLoad(Diffuse, ivec2(gl_FragCoord.xy));
}


Fragment info
-------------
Internal error: assembly compile error for fragment shader at offset 611:
-- error message --
line 22, column 5: error: supported only on load, store, and atomic instructions
-- internal assembly text --

!!NVfp5.0
OPTION ARB_shader_image_size;
OPTION NV_shader_atomic_float;
# cgc version 3.1.0001, build date Aug 8 2012
# command line args:
#vendor NVIDIA Corporation
#version 3.1.0.1
#profile gp5fp
#program main
#semantic Diffuse : IMAGE[0]
#var float4 Color : $vout.COL0 : COL0[0] : -1 : 1
#var int Diffuse.__remap : : c[0] : -1 : 1
#var float4 gl_FragCoord : $vin.WPOS : WPOS : -1 : 1
PARAM c[1] = { program.local[0] };
TEMP R0;
TEMP RC, HC;
IMAGE images[] = { image[0..7] };
OUTPUT result_color0 = result.color;
MOV.S R0.x, c[0];
SLT.F R0.z, fragment.position.x, {8, 0, 0, 0}.x;
TRUNC.U.CC HC.x, R0.z;
IMQ.COH R0.xy, images[R0.x], 2D;
IF NE.x;
MOV.F result_color0, {1, 0, 0, 0}.xyyx;
ENDIF;
ADD.S R0.x, R0, -{8, 0, 0, 0};
I2F.S R0.x, R0;
SGT.F R0.x, fragment.position, R0;
TRUNC.U.CC HC.x, R0;
IF NE.x;
MOV.F result_color0, {0, 1, 0, 0}.xyxy;
ENDIF;
SLT.F R0.x, fragment.position.y, {8, 0, 0, 0};
TRUNC.U.CC HC.x, R0;
IF NE.x;
MOV.F result_color0, {1, 0, 0, 0}.xxyx;
ENDIF;
ADD.S R0.x, R0.y, -{8, 0, 0, 0};
I2F.S R0.x, R0;
SGT.F R0.x, fragment.position.y, R0;
TRUNC.U.CC HC.x, R0;
IF NE.x;
MOV.F result_color0, {0, 1, 0, 0}.xxyy;
ELSE;
TRUNC.S R0.xy, fragment.position;
MOV.S R0.z, c[0].x;
LOADIM.U32.COH R0.x, R0, images[R0.z], 2D;
UP4UB.F result_color0, R0.x;
ENDIF;
END
# 31 instructions, 1 R-regs

oscarbg
08-21-2012, 06:02 PM
Also, two questions before the second bug report:
1. Is multi_draw_indirect fully supported in current NVIDIA hardware (Fermi included) without CPU work in the driver, or is some part implemented in software? I ask because this originated as an AMD extension, so it seems AMD really supports it in hardware.
2. The fbo_no_attachments extension seems equivalent to the D3D 11.1 TIR (target-independent rasterization) feature. If so, how is it implementable on NVIDIA's first-generation D3D11 hardware (Fermi), when I've read on some sites that TIR was the one thing current D3D11 hardware doesn't support? What are the differences?

The second sample is 430-compute. It seems I can get it fixed by removing the readonly and writeonly qualifiers from the storage buffers and then fixing a bug by g-truc, i.e. changing

Out.Ouput[gl_GlobalInvocationID].Texcoord = In.Input[gl_GlobalInvocationID].Texcoord;

to

Out.Ouput[gl_GlobalInvocationID.x].Texcoord = In.Input[gl_GlobalInvocationID.x].Texcoord;

So it seems this should work, but doesn't:

layout(binding = BUFFER_INPUT) readonly buffer iBuffer
{
    vertex Input[];
} In;

layout(binding = BUFFER_OUTPUT) writeonly buffer oBuffer
{
    vertex Ouput[];
} Out;

I get:

Compute info
------------

0(22) : error C0000: syntax error, unexpected identifier, expecting "::" at token "iBuffer"
0(27) : error C0000: syntax error, unexpected identifier, expecting "::" at token "oBuffer"

Fixing it like this:

//layout(binding = BUFFER_INPUT) readonly buffer iBuffer
layout(binding = BUFFER_INPUT) buffer iBuffer // works!

At least for now, with this fix, it works!

oscarbg
08-21-2012, 06:26 PM
One question: is the compute shader stage being exposed in NVIDIA's Cg language (where should I ask?), similar to how the tessellation stages were added in Cg 3.0? If not, will the Cg compiler cgc.exe at least get support for compiling GLSL compute shaders down to NV compute assembly (NVcp5.0), similar to what I get from the driver when assembly errors occur (see below)?

Now, here is the kind of compute shader bug I'm afraid NVIDIA will not fix, for security/hang-avoidance reasons. I was happy to see I could port NVIDIA's matmul CUDA sample to an OpenGL compute shader without much effort, basically by doing two things:

1. Adding defines:

#define blockIdx gl_WorkGroupID
#define __syncthreads() barrier()
#define threadIdx gl_GlobalInvocationID
#define __shared__ shared

and changing all int variables in the CUDA kernel to uint (another solution would have been to cast gl_WorkGroupID and gl_GlobalInvocationID to int).

2. Moving the shared memory definitions out of the kernel into global scope:

__shared__ REAL As[BLOCK_SIZE][BLOCK_SIZE];

With that, the code compiled free of front-end errors, but the assembly compile fails:

Internal error: assembly compile error for compute shader at offset 108543:
-- error message --
line 2094, column 6: error: BAR not allowed inside flow control blocks.
line 2108, column 6: error: BAR not allowed inside flow control blocks.

First, let's be clear: this code works in CUDA and is recommended by NVIDIA for its GPUs. Second, it relies on a fine detail of the hardware: barriers can appear in control flow as long as all threads in the block follow the same path. Why not in OpenGL compute shaders?

Also note the "flow control" is caused by this loop:

for (uint a = aBegin, b = bBegin;
     a <= aEnd;
     a += aStep, b += bStep)

not a branch, and it even executes the same number of iterations for every thread, but in the form stated that isn't trivial for the compiler to infer. So I thought I'd fix it, and changed it to the following (in my sample, matrix w = h = 1024 and block size = 32, so iters = 1024/32 = 32):


uint a = aBegin, b = bBegin;
for (int i=0; i<32; i++)
{
..loop body containing barriers

a += aStep, b += bStep
}

Even with this I get the same error, so the only way to fix it is to unroll the loop manually, which is ugly to say the least. I even added #pragma unroll to the loop to force the compiler to expand it for me, but that doesn't work. Is there an NVIDIA way to force unrolling, like #pragma unroll in CUDA?
Full shader:

#version 430 core
#extension GL_ARB_compute_shader : require
#extension GL_ARB_shader_storage_buffer_object : require
#extension GL_ARB_gpu_shader_fp64: require

#define REAL float
//#define REAL double


#define BUFFER_A 0
#define BUFFER_B 1
#define BUFFER_C 2

#define BLOCK_SIZE 32


layout (local_size_x = BLOCK_SIZE, local_size_y = BLOCK_SIZE) in;

layout(binding = BUFFER_A) buffer iBuffer
{
    REAL A[];
};

layout(binding = BUFFER_B) buffer oBuffer
{
    REAL B[];
};

layout(binding = BUFFER_C) buffer oBuffer
{
    REAL C[];
};


//matrixMul(float *C, float *A, float *B, int wA, int wB)
#define blockIdx gl_WorkGroupID
#define __syncthreads() barrier()
//memoryBarrierShared();
#define threadIdx gl_GlobalInvocationID
#define __shared__ shared

#define wA 1024
#define wB 1024
#define AS(i, j) As[i][j]
#define BS(i, j) Bs[i][j]

// Declaration of the shared memory array As used to
// store the sub-matrix of A
__shared__ REAL As[BLOCK_SIZE][BLOCK_SIZE];

// Declaration of the shared memory array Bs used to
// store the sub-matrix of B
__shared__ REAL Bs[BLOCK_SIZE][BLOCK_SIZE];

void main()
{
    // Block index
    uint bx = blockIdx.x;
    uint by = blockIdx.y;

    // Thread index
    uint tx = threadIdx.x;
    uint ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    uint aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    uint aEnd = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    uint aStep = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    uint bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    uint bStep = BLOCK_SIZE * wB;

    // Csub is used to store the element of the block sub-matrix
    // that is computed by the thread
    REAL Csub = 0;

    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    /*for (uint a = aBegin, b = bBegin;
         a <= aEnd;
         a += aStep, b += bStep)*/
    uint a = aBegin, b = bBegin;
    for (int i = 0; i < 32; i++)
    {
        // Load the matrices from device memory
        // to shared memory; each thread loads
        // one element of each matrix
        AS(ty, tx) = A[a + wA * ty + tx];
        BS(ty, tx) = B[b + wB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element
        // of the block sub-matrix
#pragma unroll
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += AS(ty, k) * BS(k, tx);

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();

        a += aStep, b += bStep;
    }

    // Write the block sub-matrix to device memory;
    // each thread writes one element
    uint c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}


Full assembly:

Compute info
------------
Internal error: assembly compile error for compute shader at offset 108543:
-- error message --
line 2094, column 6: error: BAR not allowed inside flow control blocks.
line 2108, column 6: error: BAR not allowed inside flow control blocks.
-- internal assembly text --
!!NVcp5.0
OPTION NV_shader_storage_buffer;
OPTION NV_shader_atomic_float;
GROUP_SIZE 32 32;
# cgc version 3.1.0001, build date Aug 8 2012
# command line args:
#vendor NVIDIA Corporation
#version 3.1.0.1
#profile gp5cp
#program main
#semantic As : SHARED
#semantic Bs : SHARED
#semantic oBuffer : SBO_BUFFER[2]
#semantic iBuffer : SBO_BUFFER[0]
#semantic oBuffer : SBO_BUFFER[1]
#var float Bs[0][0] : : shared_mem[8188] : -1 : 1
# lots more
#var float Bs[31][31] : : shared_mem[8188] : -1 : 1
#var uint3 gl_GlobalInvocationID : $vin.GBLID : GBLID[4] : -1 : 1
#var uint3 gl_WorkGroupID : $vin.CTAID : CTAID[2] : -1 : 1
#var float C[0] : : sbo_buffer[2][0] : -1 : 1
#var float A[0] : : sbo_buffer[0][0] : -1 : 1
#var float B[0] : : sbo_buffer[1][0] : -1 : 1
SHARED_MEMORY 8192;
SHARED shared_mem[] = { program.sharedmem };
STORAGE sbo_buf0[] = { program.storage[0] };
STORAGE sbo_buf1[] = { program.storage[1] };
STORAGE sbo_buf2[] = { program.storage[2] };
TEMP R0, R1;
MOV.F R0.y, {0, 0, 0, 0}.x;
MUL.U R0.z, invocation.groupid.y, {32768, 0, 0, 0}.x;
MUL.U R0.w, invocation.groupid.x, {32, 0, 0, 0}.x;
REP.S {32, 0, 0, 0};
MAD.U R1.x, invocation.globalid.y, {1024, 0, 0, 0}, R0.w;
MAD.U R0.x, invocation.globalid.y, {1024, 0, 0, 0}, R0.z;
ADD.U R0.x, invocation.globalid, R0;
MUL.S R1.y, R0.x, {4, 0, 0, 0}.x;
ADD.U R1.x, invocation.globalid, R1;
MUL.S R0.x, invocation.globalid.y, {128, 0, 0, 0};
MOV.U R1.z, R1.y;
MAD.S R1.y, invocation.globalid.x, {4, 0, 0, 0}.x, R0.x;
LDB.F32 R0.x, sbo_buf0[R1.z];
MOV.U R1.y, R1;
STS.F32 R0, shared_mem[R1.y];
MUL.S R1.x, R1, {4, 0, 0, 0};
MOV.U R0.x, R1;
LDB.F32 R0.x, sbo_buf1[R0.x];
STS.F32 R0, shared_mem[R1.y + 4096];
BAR ;
MOV.S R1.y, {0, 0, 0, 0}.x;
REP.S {32, 0, 0, 0};
MUL.S R1.x, R1.y, {128, 0, 0, 0};
MAD.S R1.x, invocation.globalid, {4, 0, 0, 0}, R1;
MUL.S R0.x, invocation.globalid.y, {128, 0, 0, 0};
MAD.S R0.x, R1.y, {4, 0, 0, 0}, R0;
MOV.U R1.z, R1.x;
MOV.U R1.x, R0;
LDS.F32 R0.x, shared_mem[R1.z + 4096];
LDS.F32 R1.x, shared_mem[R1.x];
MAD.F R0.y, R1.x, R0.x, R0;
ADD.S R1.y, R1, {1, 0, 0, 0}.x;
ENDREP;
BAR ;
ADD.U R0.z, R0, {32, 0, 0, 0}.x;
ADD.U R0.w, R0, {32768, 0, 0, 0}.x;
ENDREP;
MUL.U R0.x, invocation.groupid, {32, 0, 0, 0};
MAD.U R0.x, invocation.groupid.y, {32768, 0, 0, 0}, R0;
MAD.U R0.x, invocation.globalid.y, {1024, 0, 0, 0}, R0;
ADD.U R0.x, R0, invocation.globalid;
MUL.S R0.x, R0, {4, 0, 0, 0};
MOV.U R0.x, R0;
STB.F32 R0.y, sbo_buf2[R0.x];
END
# 44 instructions, 2 R-regs

oscarbg
08-21-2012, 06:39 PM
Following up on the post below, I want to ask some questions and make some suggestions:

* Similar to NVIDIA's NV_CG_shader extension, which allows "some" Cg functions (and syntax?) in GLSL code, it would be good to expose something like an NV_CUDA_shader extension that relaxes things in compute shaders to allow easier porting of CUDA kernels to GLSL (am I smoking crack?). First, defines mapping the CUDA grid/thread-ID syntax and the like to their GLSL equivalents:

#define blockIdx gl_WorkGroupID
#define __syncthreads() barrier()
#define threadIdx gl_GlobalInvocationID
#define __shared__ shared

with those intrinsics defined as int rather than uint, for the portability reasons mentioned in the previous post. Second, allow __shared__ definitions inside compute shader functions, which the compiler could hoist to global scope internally if needed. With that, the matrix multiply code could be ported immediately!

* The specification says nothing about restrictions on the use of barriers inside control flow, nor about a CUDA-like requirement that all threads follow the same path to a barrier. Is this going to be fixed in the spec, or is it IHV-dependent? If restrictions do apply in the spec, NVIDIA could allow a CUDA-like relaxation via an extension such as GL_NV_barriers_in_control_flow to advertise that support.

* Also, sorry, but what are the DirectCompute restrictions in this case? Are there any?

* Finally, allow the work group size to be set programmatically, e.g. by not requiring the local_size_x values to be constant (a new extension like GL_NV_uniform_local_size):

uniform int local_size_x_from_app;
layout (local_size_x = local_size_x_from_app) in;

or, better, a new function

DispatchComputeSetLocalSize(globx, globy, globz, locx, locy, locz)

equivalent to CUDA's

kernelname<<<globalsize, localsize>>>(..)

I know DirectCompute doesn't support setting the local group size at dispatch time, but CUDA does.

Please answer questions!!

Alfonse Reinheart
08-21-2012, 09:14 PM
In the future, please use [ code ] blocks to format pieces of code in your posts.

oscarbg
08-21-2012, 09:51 PM
I have fixed all my posts to use code blocks.. thanks Alfonse.

More bad news, although this would be NVIDIA-only for now anyway: pointers in compute shaders are not working on NVIDIA.

I tried to answer another question about easy portability of GPGPU code to GLSL compute shaders: using pointers in compute programs, of course via GL_NV_shader_buffer_load. If I don't enable it, the parser correctly points out:

0(47) : error C7531: pointers require "#extension GL_NV_shader_buffer_load : enable" before use

but if I do enable it, I get:

0(13) : warning C7547: extension GL_NV_shader_buffer_load not supported in profile gp5cp

Isn't GL_NV_shader_buffer_load supported in every stage? Why not in compute?! :mad:
So, NVIDIA, is this a limitation of the implementation for now, or forever (well, more or less :))? This is supported in CUDA, so surely the GPU hardware has supported it for ages.

Christoph Kubisch
08-22-2012, 03:13 AM
Hi oscar,
I ran into the barrier problem during development as well; currently it seems like an oversight in the spec itself. If you look in the GLSL 4.30 spec:


The barrier() function may only be placed inside the function main() of the tessellation control shader
and may not be called within any control flow. Barriers are also disallowed after a return statement in the
function main(). Any such misplaced barriers result in a compile-time error.

This hasn't been specced in detail for compute yet. Ideally spec and implementation can be relaxed accordingly!
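On the beta driver, compute currently behaves the same way; an illustrative minimal sketch (not from the spec):

```glsl
#version 430 core
#extension GL_ARB_compute_shader : require

layout(local_size_x = 64) in;

shared float scratch[64];

void main()
{
    scratch[gl_LocalInvocationID.x] = float(gl_LocalInvocationID.x);

    barrier(); // fine: straight-line code in main()

    if (gl_LocalInvocationID.x < 32u) {
        // barrier(); // uncommenting this reproduces the assembly error
        //            // "BAR not allowed inside flow control blocks"
    }
}
```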

In general things are in an early stage yet, it will take a couple revisions until the implementation has all the features it should have.

Regarding your tweet on NV_shader_buffer_load: a quick compile test using #extension GL_NV_gpu_shader5 : enable did not give any errors when declaring some pointers, although use of pointers on shared memory currently won't be possible.
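A minimal sketch of that workaround (the uniform name is a made-up example; the application side would supply the GPU address, e.g. via glUniformui64NV after making the buffer resident):

```glsl
#version 430 core
#extension GL_ARB_compute_shader : require
#extension GL_NV_gpu_shader5 : enable // instead of GL_NV_shader_buffer_load

layout(local_size_x = 64) in;

// Bindless pointer to buffer memory, supplied by the application.
uniform float *data;

void main()
{
    uint i = gl_GlobalInvocationID.x;
    data[i] *= 2.0; // NV_gpu_shader5 allows stores through pointers too
}
```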

Piers Daniell
08-22-2012, 01:41 PM
Hi oscarbg

Thanks for the bug reports. We have looked at the g-truc OpenGL 4.3 samples and have a few driver fixes pending. There are also a few bugs in the OpenGL 4.3 samples, which I have reported to g-truc.

1) The issue with calling imageSize() on an image declared with "coherent" has been fixed in the driver. The beta will be updated soon with this fix. Until then you can workaround the issue by removing "coherent" from the declaration.

2) Yes, multi_draw_indirect is fully hardware accelerated on Fermi and above GPUs.

3) fbo_no_attachments just allows an FBO to be "complete" without any attachments. I don't think it's equivalent to DX11 TIR.

4) The "readonly" and "writeonly" memory qualifiers on buffer declarations are broken in the driver. We're working on a fix. For now, just remove these qualifiers.

5) Some of your later questions might be better served in the "Official feedback on OpenGL 4.3 thread".

Thanks for your feedback.

cheers
Piers

pbrown
08-23-2012, 08:28 AM
More bad news, although this would be NVIDIA-only for now anyway: pointers in compute shaders are not working on NVIDIA.

Sorry I'm behind on looking at these forum updates.

The error with #extension GL_NV_shader_buffer_load is a compiler bug that I was able to reproduce. The cause was an oversight in the logic handling which "#extension" directives are supported in which shader types/architectures. I'll work on a fix for the next beta driver update.

As Christoph Kubisch mentions, one workaround for this is to enable the extension GL_NV_gpu_shader5, which also includes the ability to store to pointers. GL_NV_shader_buffer_load only allows pointer reads.

oscarbg
08-24-2012, 06:56 AM
Many thanks for the responses, Christoph and Pat! Some updates on my findings:

First, I think my comment about it not currently being possible to use code that has barriers inside control flow belongs in this thread: if the spec doesn't allow it, this is the place to motivate NVIDIA to relax the restriction. The reasons to allow it are big, first and foremost:

1. The D3D11 compute shader allows it! Sorry to be pedantic, but then the OpenGL ARB can't claim OpenGL 4.3 is a superset of D3D11. For example, in the D3D SDK ComputeShaderSort11 sample, the BitonicSort kernel contains this code:

for (unsigned int j = g_iLevel >> 1; j > 0; j >>= 1)
{
    unsigned int result = ((shared_data[GI & ~j] <= shared_data[GI | j]) ==
                           (bool)(g_iLevelMask & DTid.x)) ? shared_data[GI ^ j]
                                                          : shared_data[GI];
    GroupMemoryBarrierWithGroupSync();
    shared_data[GI] = result;
    GroupMemoryBarrierWithGroupSync();
}

This code can't be expressed without control flow, which I verified by getting the compiled DX IL assembly from fxc: it has barriers inside a "loop", in DX IL parlance. What's more, I compiled it down to AMD 5850 ISA using AMD's GPU ShaderAnalyzer, and yes, the barriers are there inside control flow.

2. Even if D3D11 didn't allow it, lots of *basic* GPGPU algorithms are fully optimized under these conditions: sorts, perhaps even scans, and many more. Searching CUDA code around the web will show the importance of this support in CUDA.

3. I have found that AMD is going to allow it! Details follow: using some leaked AMD drivers (9.00) and their headers, I found an AMD_compute_shader extension in the fglrx drivers. Looking inside the driver binary to see which keywords come from AMD_compute_shader and which from ARB_compute_shader, I successfully compiled a compute shader with a barrier inside control flow that the compiler could not possibly remove. I can post the full code and binary if you don't believe me.

* Allow selective unrolling of loops, similar to D3D's [unroll] (i.e. better control than #pragma optionNV(unroll all)).
Motivation: the matmul code, after all, has a loop with a fixed iteration count (assuming it's tuned for specific matrix dimensions), so it can be unrolled. I found that #pragma optionNV(unroll all) solves the problem and compiles the kernel, *BUT* it produces a big binary. So: is there a pragma in the NV compiler to unroll only selected loops, similar to D3D's [unroll]? Hey, that's another place where D3D11 compute shaders are better than OpenGL. With that support I could unroll the outer loop of the matmul code posted before while leaving the loop between the barriers rolled.


I have found that using GL_NV_gpu_shader5 works perfectly, and allows using more CUDA code with less massaging. Happily, OpenCL-style code (CUDA code uses this too) that passes shared memory pointers as function arguments compiles correctly up to the assembly stage, which then hits something like an LDS.? line and returns with an error.

So, given your comment "although use of pointers on shared memory currently won't be possible", I hope this will become possible in the future. Please allow that use case!

Alfonse Reinheart
08-24-2012, 08:59 AM
First, I think my comment about it not currently being possible to use code that has barriers inside control flow belongs in this thread: if the spec doesn't allow it, this is the place to motivate NVIDIA to relax the restriction

Presumably by releasing an extension, right? Because we don't want anyone (AMD, NVIDIA, Intel, etc.) just ignoring what the spec says because it isn't what they want. Your AMD_compute_shader example, for instance, is a separate specification, which is what allows AMD to expose whatever functionality they choose.

oscarbg
08-24-2012, 07:36 PM
Alfonse, wouldn't it be better to say, hey, it's a bug in the spec definition? After all, it's only days old and only implemented by NVIDIA right now, and we have two options:
1. The spec doesn't prohibit barriers in control flow; it's implementation dependent.
2. The spec supports barriers inside control flow; after all, all D3D11 hardware and even OpenCL hardware allow it.

Alfonse Reinheart
08-25-2012, 03:19 AM
NVIDIA doesn't control the OpenGL specification. This would be a proposal for the ARB, not for NVIDIA.

JakobProgsch
08-25-2012, 12:52 PM
"barrier();" isn't allowed in control flow, but the new barriers like "groupMemoryBarrier();" (which would be the equivalent, right?) are allowed, I think, and I successfully used those with the NVIDIA driver (see line 269 here: https://github.com/progschj/OpenGL-Examples/blob/master/experimental/XXcompute_shader_nbody.cpp). Actually, the way I read it, OpenGL even has global barriers (?), which aren't present in, for example, OpenCL.

Edit: while I'm at it, here's one bug with the 4.3 driver I found that I think is obvious: according to the spec, "gl_WorkGroupSize" is supposed to be const so it can be used as an array size. Section 7.1 of the spec says:

The built-in constant gl_WorkGroupSize is a compute-shader constant containing the local work-group
size of the shader. The size of the work group in the X, Y, and Z dimensions is stored in the x, y, and z
components. The values stored in gl_WorkGroupSize match those specified in the required local_size_x,
local_size_y, and local_size_z layout qualifiers for the current shader. This value is constant so that it can
be used to size arrays of memory that can be shared within the local work group.


but when I try exactly that: "shared vec4 tmp[gl_WorkGroupSize.x];" (for example on line 258 of the example linked above (https://github.com/progschj/OpenGL-Examples/blob/master/experimental/XXcompute_shader_nbody.cpp))
it errors with: "0(6) : error C1307: non constant expression for array size"

Also, shared memory seems somehow broken. I have this example: http://ideone.com/aI4WL
Essentially it first tries to fill a shared array with the numbers 0-255 and then write those out to memory. But it appears that line 14 writes to a "wrong offset", partially overwriting the first half of the local array.

JakobProgsch
08-26-2012, 03:07 AM
Hmm, the forum ate my first post. Let's try again:

About the barriers: "barrier();" does indeed not work in flow control which is in accordance with the specs. But the other barriers such as groupMemoryBarrier etc. do. See for example line 268 here: https://github.com/progschj/OpenGL-Examples/blob/master/experimental/XXcompute_shader_nbody.cpp

There are two issues I found which I think are bugs in the driver.

The first one being that the values in gl_WorkGroupSize are not constant it seems.
Trying something like "shared float tmp[gl_WorkGroupSize.x];"
results in a shader compiler error: "error C1307: non constant expression for array size"
But the spec explicitly states that gl_WorkGroupSize is constant for exactly that purpose (in 7.1, page 112 of the non-annotated version):

The built-in constant gl_WorkGroupSize is a compute-shader constant containing the local work-group
size of the shader. The size of the work group in the X, Y, and Z dimensions is stored in the x, y, and z
components. The values stored in gl_WorkGroupSize match those specified in the required local_size_x,
local_size_y, and local_size_z layout qualifiers for the current shader. This value is constant so that it can
be used to size arrays of memory that can be shared within the local work group.


The other is that indexing/writing to shared memory seems "off". I broke down my example to the following shader:


#version 430
layout(local_size_x=128) in;

layout(r32f, location = 0) uniform imageBuffer data;

shared float local[256];
void main() {
    int N = imageSize(data);
    int index = int(gl_GlobalInvocationID);
    int localindex = int(gl_LocalInvocationIndex);

    local[localindex] = localindex;
    local[localindex+128] = localindex+128;

    groupMemoryBarrier();
    imageStore(data, index, vec4(local[localindex]));
    imageStore(data, index+128, vec4(local[localindex+128]));
}


which I run a single group of. The expected result is that the imageBuffer gets filled with the values 0...255. But for some reason the result is this:


0 1 2 3 4 5 6 7 128 129 130 131 132 133 134 135
136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151
152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167
168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183
184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199
200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215
216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231
232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247
128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175
176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255

so the first half gets partially overwritten with the second somehow, and the overwriting starts at an offset of 8 elements...
If I change appearances of "local[localindex+128]" to "local[localindex+2*128]" it starts at 16 elements, etc., so it's like the +128 is getting added in with a wrong factor?
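For reference, here is a small Python model (illustrative only, with Python loops standing in for the 128 shader invocations) of what a correct run of this dispatch should produce, namely the values 0...255 in order:

```python
# Model of the compute shader's intended behavior: one work group of 128
# invocations, each writing local[i] = i and local[i+128] = i+128 to
# shared memory, then copying both halves out with imageStore. With
# correct shared-memory behavior the output buffer is simply 0..255.
def model_dispatch(local_size=128):
    local = [0.0] * (2 * local_size)
    for i in range(local_size):          # each invocation's shared stores
        local[i] = float(i)
        local[i + local_size] = float(i + local_size)
    # (a groupMemoryBarrier() would sit here in the shader)
    out = [0.0] * (2 * local_size)
    for i in range(local_size):          # each invocation's imageStores
        out[i] = local[i]
        out[i + local_size] = local[i + local_size]
    return out

model_dispatch()  # [0.0, 1.0, ..., 255.0]
```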

Here is the full test case:


#include <GL3/gl3w.h>
#include <GL/glfw.h>

#include <iostream>
#include <iomanip>
#include <algorithm>
#include <string>
#include <vector>
#include <cstdlib>
#include <cmath>

bool running;

// window close callback function
int closedWindow()
{
    running = false;
    return GL_TRUE;
}

// helper to check and display for shader compiler errors
bool check_shader_compile_status(GLuint obj)
{
    GLint status;
    glGetShaderiv(obj, GL_COMPILE_STATUS, &status);
    if(status == GL_FALSE)
    {
        GLint length;
        glGetShaderiv(obj, GL_INFO_LOG_LENGTH, &length);
        std::vector<char> log(length);
        glGetShaderInfoLog(obj, length, &length, &log[0]);
        std::cerr << &log[0];
        return false;
    }
    return true;
}

// helper to check and display for shader linker error
bool check_program_link_status(GLuint obj)
{
    GLint status;
    glGetProgramiv(obj, GL_LINK_STATUS, &status);
    if(status == GL_FALSE)
    {
        GLint length;
        glGetProgramiv(obj, GL_INFO_LOG_LENGTH, &length);
        std::vector<char> log(length);
        glGetProgramInfoLog(obj, length, &length, &log[0]);
        std::cerr << &log[0];
        return false;
    }
    return true;
}

int main()
{
    int width = 640;
    int height = 480;

    if(glfwInit() == GL_FALSE)
    {
        std::cerr << "failed to init GLFW" << std::endl;
        return 1;
    }
    glfwOpenWindowHint(GLFW_OPENGL_PROFILE, GLFW_OPENGL_CORE_PROFILE);
    glfwOpenWindowHint(GLFW_OPENGL_VERSION_MAJOR, 4);
    glfwOpenWindowHint(GLFW_OPENGL_VERSION_MINOR, 3);

    // create a window
    if(glfwOpenWindow(width, height, 0, 0, 0, 8, 24, 8, GLFW_WINDOW) == GL_FALSE)
    {
        std::cerr << "failed to open window" << std::endl;
        glfwTerminate();
        return 1;
    }

    // setup window close callback
    glfwSetWindowCloseCallback(closedWindow);

    glfwSwapInterval(0);

    if (gl3wInit())
    {
        std::cerr << "failed to init GL3W" << std::endl;
        glfwCloseWindow();
        glfwTerminate();
        return 1;
    }

    const char *source;
    int length;

    std::string test_source =
        "#version 430\n"
        "layout(local_size_x=128) in;\n"
        "layout(r32f, location = 0) uniform imageBuffer data;\n"
        "shared float local[256];\n"
        "void main() {\n"
        "    int N = imageSize(data);\n"
        "    int index = int(gl_GlobalInvocationID);\n"
        "    int localindex = int(gl_LocalInvocationIndex);\n"
        "    local[localindex] = localindex;\n"
        "    local[localindex+128] = localindex+128;\n"
        "    groupMemoryBarrier();\n"
        "    imageStore(data, index, vec4(local[localindex]));\n"
        "    imageStore(data, index+128, vec4(local[localindex+128]));\n"
        "}\n";

    // program and shader handles
    GLuint test_program, test_shader;

    // create and compile the compute shader
    test_shader = glCreateShader(GL_COMPUTE_SHADER);
    source = test_source.c_str();
    length = test_source.size();
    glShaderSource(test_shader, 1, &source, &length);
    glCompileShader(test_shader);
    if(!check_shader_compile_status(test_shader))
    {
        return 1;
    }

    // create program
    test_program = glCreateProgram();

    // attach shaders
    glAttachShader(test_program, test_shader);

    // link the program and check for errors
    glLinkProgram(test_program);
    check_program_link_status(test_program);

    // fill the buffer with sentinel values
    std::vector<float> data(256);
    std::fill(data.begin(), data.end(), -1.0f);

    GLuint buffer;
    glGenBuffers(1, &buffer);
    glBindBuffer(GL_TEXTURE_BUFFER, buffer);
    glBufferData(GL_TEXTURE_BUFFER, sizeof(float)*data.size(), &data[0], GL_STATIC_DRAW);

    // texture handle
    GLuint buffer_texture;
    glGenTextures(1, &buffer_texture);
    glBindTexture(GL_TEXTURE_BUFFER, buffer_texture);
    glTexBuffer(GL_TEXTURE_BUFFER, GL_R32F, buffer);

    // bind image
    glBindImageTexture(0, buffer_texture, 0, GL_FALSE, 0, GL_READ_WRITE, GL_R32F);

    glUseProgram(test_program);
    glUniform1i(0, 0);

    glDispatchCompute(1, 1, 1);

    glGetBufferSubData(GL_TEXTURE_BUFFER, 0, sizeof(float)*data.size(), &data[0]);

    for(size_t i = 0;i<data.size();i+=1)
    {
        if(i%16==0) std::cout << std::endl;
        std::cout << std::setw(5) << data[i];
    }
    std::cout << std::endl;

    GLint shared_size;
    glGetIntegerv(GL_MAX_COMPUTE_SHARED_MEMORY_SIZE, &shared_size);
    std::cout << "max shared: " << shared_size << std::endl;

    glfwCloseWindow();
    glfwTerminate();
    return 0;
}


Edit:
I did some more research into the second issue and wondered whether, instead of writing "localindex+128", I could precompute a localindex128 and use that, since it would get rid of the inlined +128 that I suspected was being added incorrectly. At first that didn't change anything, which I then realized doesn't mean much, since the compiler might inline my localindex128 during static analysis. So I needed a way to write "+128" without the compiler going nuts with optimization. First I tried making localindex128 volatile, which didn't work. Then I thought, "well, all I need is a constant 128 the compiler thinks isn't constant"... See where this is going? :D So I used gl_WorkGroupSize.x from the first issue and ended up with this:


#version 430
layout(local_size_x=128) in;

layout(r32f, location = 0) uniform imageBuffer data;

shared float local[256];
void main() {
    int N = imageSize(data);
    int index = int(gl_GlobalInvocationID);
    int localindex = int(gl_LocalInvocationIndex);
    // gl_WorkGroupSize.x is 128, but because of the first bug
    // the compiler doesn't know it's constant
    int localindex128 = int(gl_LocalInvocationIndex + gl_WorkGroupSize.x);

    local[localindex] = localindex;
    local[localindex128] = localindex128;

    groupMemoryBarrier();
    imageStore(data, index, vec4(local[localindex]));
    imageStore(data, index+128, vec4(local[localindex128]));
}

which gives the expected result. I guess that is about as far as I can narrow it down from my end.

pbrown
08-27-2012, 06:59 AM
Jakob,

Thanks for the feedback.

I agree that the first issue (not treating gl_WorkGroupSize as a constant expression) looks like a driver compiler bug: it is being treated as an "in" rather than a "const". Hopefully this will be easy to fix.

I'll have to look at the second issue in more detail. One thing that looks wrong about that shader is that there is no barrier() call between the shared memory stores and loads. There are two types of barriers with different purposes:

- groupMemoryBarrier() ensures that memory transactions are flushed so other threads can see them
- barrier() ensures that all threads have finished their stores before execution continues

Typically, you need both for safety. It's not clear to me that the lack of a barrier() call in your shader has anything to do with the problem here, because each thread appears to read only shared memory values written by its own thread (and not touched by any other thread).

pbrown
08-27-2012, 07:13 AM
NVIDIA doesn't control the OpenGL specification. This would be a proposal for the ARB, not for NVIDIA.
Yes, that's correct, though it would certainly be possible to create a NVIDIA extension to GLSL that relaxes this restriction.

The limitation on barrier() is inherited from tessellation control shaders. The intent of the restriction is to prevent you from writing shaders that will hang. For example, in the following code:

if (divergent_conditional_expression) {
barrier();
}
the threads where the expression is true will call barrier() and then wait around for the other threads. But the other threads might not call barrier() at all.
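As an illustrative analogy (Python threads standing in for shader invocations; this is not GLSL, and the helper names are made up), here is a minimal sketch of why a barrier under divergent control flow hangs:

```python
import threading

# A barrier only synchronizes a group if every member calls it. Under
# divergent control flow, threads that skip the barrier leave the others
# waiting forever; here a timeout breaks the barrier so the test ends.
def run_group(num_threads, skips_barrier, timeout=0.5):
    barrier = threading.Barrier(num_threads)
    outcomes = {}
    lock = threading.Lock()

    def invocation(tid):
        if skips_barrier(tid):
            outcome = "skipped"          # divergent branch: no barrier() call
        else:
            try:
                barrier.wait(timeout=timeout)
                outcome = "passed"
            except threading.BrokenBarrierError:
                outcome = "hung"         # waited on threads that never arrived
        with lock:
            outcomes[tid] = outcome

    threads = [threading.Thread(target=invocation, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes

uniform = run_group(4, lambda tid: False)       # everyone reaches the barrier
divergent = run_group(4, lambda tid: tid == 0)  # thread 0 skips it
```

In the uniform case all four "invocations" pass the barrier; in the divergent case the three that called it would wait forever for thread 0.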

This restriction wasn't a big deal for tessellation control shaders, as the places where you want barrier() calls are typically very limited and in well-defined places (i.e., before computing outer or inner tessellation levels). This is a bigger issue for compute shaders because there are algorithms where you may want to run multiple phases of computation in a loop, which would naturally result in barrier() calls inside the loop.

There are a few options here, none of them perfect:

(1) Just allow barrier() anywhere, making hangs possible.

(2) Allow barrier() in more places, but still have some limitations to avoid hangs. For example, allow it in loops with uniform flow control (e.g., uniform start/end points, no conditional "break", no conditional "continue" before the barrier). This will be fairly tricky to specify and implement.

(3) Leave it as-is.

I've already filed a Khronos bug report on this issue, but it was too late for GLSL 4.30.

pbrown
08-27-2012, 07:20 AM
Jakob,


I'll have to look at the second issue in more detail. One thing that looks wrong about that shader is that there is no barrier() call between the shared memory stores and loads.

I think I've root-caused it. It should be easy to fix.

It has nothing to do with the absence of barrier() in your shader, though you will need barrier() calls for more complex shared memory usage as I noted in my previous comment.

JakobProgsch
08-27-2012, 07:42 AM
Yep, I just reread that section of the spec and also realized that this means the tiled n-body kernel I ported from CUDA is incorrect (and can't be fixed, since it would need the barrier inside flow control... :( ). Thanks for looking into these.

oscarbg
08-27-2012, 09:02 AM
Hi Pat,
Is it too much to ask for NVIDIA to remove this check (barriers in control flow) in assembly *compute shader* code, so we can start playing with advanced compute code that requires it?
Can we expect a solution before GL 4.3 gets into the mainline drivers, i.e., in the next 4.3 beta drivers?

I also tried to patch the shader binaries, but they don't seem similar to CUDA binaries.

I hope NVIDIA implements either solution #1 or #2.
As said, #1 seems to be what D3D compute, CUDA, and OpenCL allow, i.e., no restrictions, with programmers responsible for writing "good" code that doesn't hang.
If you want to avoid hangs, realistically solution #2 can be enough for the algorithms that require control flow.
Either way, I'm waiting for NVIDIA to lift that restriction soon for broad testing of some compute codes.

Thanks for the detailed info on the issue!


jcornwall
09-01-2012, 06:31 AM
I think I've run into a bug in the Linux driver. I reduced a misbehaving parallel scan to this minimal test case:



// ARB_compute_shader shared[] array test case.
//
// When using arrays of shared variables, index expressions involving
// gl_LocalInvocationID lead to stores not being observed by subsequent
// loads to the same location.

#include <GL/glew.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#ifdef __linux__
# include <GL/glx.h>
# include <X11/Xlib.h>
#endif
#ifdef _WIN32
# include <Windows.h>
#endif

#define mChkGL(X) X; CheckGLError(__LINE__);

const GLchar *ComputeCode =
// This shader should write 1 to the second element of sharedBuf,
// then read that value back and write it to OutputBuffer.
// There is 1 work group of size 1. The code is guarded so that
// only gl_LocalInvocationID.x == 0 executes the test.
"#version 430 core \n"
" \n"
"layout(std430, binding = 0) buffer Output { int OutputBuffer[1]; }; \n"
"shared int sharedBuf[2]; \n"
" \n"
"layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in; \n"
"void main() { \n"
" if(gl_LocalInvocationID.x == 0) { \n"
" /* Initialize second element with 0. */ \n"
" sharedBuf[1] = 0; \n"
" \n"
// Uncomment the following line to make validation pass.
// " sharedBuf[1] = 1; \n"
" \n"
" /* This store to the second element is not seen by the load. */ \n"
" sharedBuf[1 + gl_LocalInvocationID.x] = 1; \n"
" \n"
" /* Copy second element out for validation. */ \n"
" OutputBuffer[0] = sharedBuf[1]; \n"
" } \n"
"} \n"
;

void CheckGLError(int line) {
    GLint glErr = glGetError();

    if(glErr != GL_NO_ERROR) {
        printf("OpenGL error %d at line %d\n", glErr, line);
        exit(1);
    }
}

int main() {
    // Minimal OpenGL context setup.
#ifdef __linux__
    Display *display = XOpenDisplay(NULL);
    Window window = XCreateSimpleWindow(display, DefaultRootWindow(display), 0, 0, 1, 1, 0, 0, 0);
    int visual[] = { GLX_RGBA, 0 };
    XVisualInfo *vInfo = glXChooseVisual(display, DefaultScreen(display), visual);
    GLXContext glCtx = glXCreateContext(display, vInfo, NULL, 1);
    glXMakeCurrent(display, window, glCtx);
#endif
#ifdef _WIN32
    HWND window = CreateWindow(L"edit", 0, 0, 0, 0, 1, 1, NULL, NULL, NULL, NULL);
    HDC dc = GetDC(window);
    PIXELFORMATDESCRIPTOR pfd;
    memset(&pfd, 0, sizeof(pfd));
    pfd.dwFlags = PFD_SUPPORT_OPENGL;
    SetPixelFormat(dc, ChoosePixelFormat(dc, &pfd), &pfd);
    HGLRC glCtx = wglCreateContext(dc);
    wglMakeCurrent(dc, glCtx);
#endif

    glewInit();

    // Allocate buffer for both programs.
    GLuint buffer;
    mChkGL(glGenBuffers(1, &buffer));
    mChkGL(glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, buffer));
    mChkGL(glBufferData(GL_SHADER_STORAGE_BUFFER, sizeof(int), NULL, GL_DYNAMIC_DRAW));

    // Build compute shader.
    GLuint shader = glCreateShader(GL_COMPUTE_SHADER);
    mChkGL(glShaderSource(shader, 1, &ComputeCode, NULL));
    mChkGL(glCompileShader(shader));

    GLint compileLogSize;
    mChkGL(glGetShaderiv(shader, GL_INFO_LOG_LENGTH, &compileLogSize));

    if(compileLogSize > 0) {
        char *compileLog = new char[compileLogSize];
        mChkGL(glGetShaderInfoLog(shader, compileLogSize, NULL, compileLog));
        printf("%s", compileLog);
    }

    // Build compute program.
    GLuint program = glCreateProgram();
    mChkGL(glAttachShader(program, shader));
    mChkGL(glLinkProgram(program));

    GLint linkLogSize;
    mChkGL(glGetProgramiv(program, GL_INFO_LOG_LENGTH, &linkLogSize));

    if(linkLogSize > 0) {
        char *linkLog = new char[linkLogSize];
        mChkGL(glGetProgramInfoLog(program, linkLogSize, NULL, linkLog));
        printf("%s", linkLog);
    }

    // Invoke compute program and check result.
    mChkGL(glClearBufferData(GL_SHADER_STORAGE_BUFFER, GL_R32UI, GL_RED, GL_INT, NULL));
    mChkGL(glUseProgram(program));
    mChkGL(glDispatchCompute(1, 1, 1));
    mChkGL(glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT));

    int *bufferData = (int *)glMapBuffer(GL_SHADER_STORAGE_BUFFER, GL_READ_ONLY);
    int bufferVal = *bufferData;
    mChkGL(glUnmapBuffer(GL_SHADER_STORAGE_BUFFER));

    printf("Validation %s\n", (bufferVal == 1) ? "PASSED" : "FAILED");
}


The shader store to sharedBuf[1 + gl_LocalInvocationID.x] (where local id is 0) is not observed by the subsequent load from sharedBuf[1].

This is running on a GTX 470. The test passes fine with the Windows driver.

Piers Daniell
09-04-2012, 11:53 AM
Thanks for the bug report. The issue with "sharedBuf[1 + gl_LocalInvocationID.x]" under Linux will be fixed in the next OpenGL 4.3 beta driver, which is scheduled for the end of next week.

The issue reported by JakobProgsch with "local[localindex+128] = localindex+128;" should also be fixed then.

Johnny Tremain
09-22-2012, 12:18 PM
Hi,

Is there a lack of support for glDebugMessageCallback in the Linux beta driver?

Thank you.

malexander
09-22-2012, 01:57 PM
Is there a lack of support for glDebugMessageCallback in the Linux beta driver?

I've been using it successfully on Ubuntu 11.04 64b (driver 304.15.00.02). Have you declared your callback function with APIENTRY? Here's the declaration I use:


static void APIENTRY
showDebugOutput(GLenum source, GLenum type, GLuint id, GLenum severity,
GLsizei length, const RE_GLchar *message, GLvoid *userParam)

Johnny Tremain
09-22-2012, 04:52 PM
Hi malexander. I am using the following code. My issue is that glDebugMessageCallback is NULL at runtime using freeglut and GLEW (both latest versions).
I also posted more code in a different thread: http://www.opengl.org/discussion_boards/showthread.php/179098-glDebugMessageCallback-on-nVidia-using-Ubuntu?p=1242744#post1242744

My system is ubuntu 12.04 x64 with nVidia 304.15 beta drivers.


static void debugLog(GLenum source, GLenum type, GLuint id,
                     GLenum severity, GLsizei /*length*/, const GLchar *message,
                     void * /*userParam*/) {
    std::cerr << " -- \n" << "Type: " << getStringForType(type).c_str() <<
        "; Source: " << getStringForSource(source).c_str() << "; ID: " << id <<
        "; Severity: " << getStringForSeverity(severity).c_str() << "\n" <<
        message << std::endl;
}

malexander
09-22-2012, 05:11 PM
Perhaps freeglut/GLEW is the issue, then. I'm using direct GLX calls (glXGetProcAddressARB) to get the function pointer to glDebugMessageCallback.

Johnny Tremain
09-23-2012, 05:53 PM
I filed a bug on glew, so hopefully an answer will come soon.

Johnny Tremain
10-03-2012, 10:30 AM
After moving to the new NVIDIA beta drivers AND defining GL_GLEXT_PROTOTYPES, glDebugMessageCallback works. The previous driver reported glDebugMessageCallback/ARB as missing when I ran glewinfo.

randall
10-04-2012, 08:38 AM
Texture sampling in a compute shader always returns vec4(0).
I use sampler2D in the shader and create the texture like this:



glGenTextures(1, &m_texture);
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, m_texture);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F, kWidth, kHeight, 0, GL_RGBA, GL_FLOAT, &texture_data[0]);

Alfonse Reinheart
10-04-2012, 11:26 AM
Just for the sake of completeness, what texture functions are you using in your shader?

randall
10-05-2012, 01:53 AM
Just for the sake of completeness, what texture functions are you using in your shader?

I have tried various functions: texelFetch, texture, textureProj, textureProjOffset, textureLodOffset, texelFetchOffset.

randall
10-12-2012, 06:29 AM
This compute shader fails to compile with an "array access is out of bounds" error. I think this is a driver bug: an unsized array as the last block member should be dynamically sized.



#version 430 core
buffer Output {
    vec4 g_output[];
};
void main() {
    g_output[128] = vec4(1, 2, 3, 4);
}

randall
10-15-2012, 03:29 AM
This compute shader fails to compile with an "invalid operands to !=" error. This is clearly a driver bug.



#version 430 core
layout(local_size_x = 1) in;
buffer Output { int g_output; };
uniform ivec3 g_uniform;
void main() {
    if (g_uniform != gl_MaxComputeWorkGroupCount) g_output = 0;
    else g_output = 1;
}

Piers Daniell
10-16-2012, 03:32 PM
I wasn't able to reproduce the "array access is out of bounds" error. I did need to add something like "layout(local_size_x=16, local_size_y=16) in;" to avoid a different error. What driver are you using?

randall
10-17-2012, 01:40 AM
Sorry, the shader must be more complicated, like this:



#version 430 core
layout(local_size_x = 128) in;
buffer Output {
    vec4 g_output[];
};
void main() {
    if (gl_LocalInvocationID.x == 0) {
        g_output[128] = vec4(1);
    } else {
        const uint index = gl_LocalInvocationIndex;
        g_output[index + 128] = vec4(1);
    }
}


I am using 306.63.

Piers Daniell
10-17-2012, 03:46 PM
Thanks, I was able to reproduce the problem with your new sample. This is a bug in our driver. Before ARB_shader_storage_buffer_object, which allows the last member of a buffer block to be an unsized array, it was possible to use unsized arrays, but you had to index the unsized array somewhere in the shader in such a way that the compiler could determine its size from use. In your example, g_output[128] is causing us to treat it as the old style of unsized array and size it to 128; the other access, g_output[index + 128], then looks out of bounds.

You can work around this bug by avoiding indexing into the buffer's unsized array with a constant. For example, change g_output[128] to g_output[gl_LocalInvocationID.x + 128] instead.

Piers Daniell
10-17-2012, 03:53 PM
The issue you found with "g_uniform != gl_MaxComputeWorkGroupCount" is another bug. We'll fix this shortly.

randall
10-18-2012, 04:16 AM
Thanks. One more thing. GLSL spec revision 7:

The control flow barrier built-in function barrier() is allowed inside uniform flow control for
compute shaders.

Currently (in 306.63) this is an error. Will this be fixed (for GLSL and assembly shaders)?

pbrown
10-23-2012, 07:10 AM
Thanks. One more thing. GLSL spec revision 7:

The control flow barrier built-in function barrier() is allowed inside uniform flow control for
compute shaders.

Currently (in 306.63) this is an error. Will this be fixed (for GLSL and assembly shaders)?

Yes. As I mentioned earlier in the thread (August 27), I had filed a Khronos bug on this issue after realizing that this behavior would be problematic. We decided to simply remove the restriction from the GLSL 4.30 specification, rather than postponing it to a future version of GLSL or leaving it as an extension. As you observe, this happened in revision 7. NVIDIA hasn't yet published a driver removing the error, but we will definitely do so.

Thanks,
Pat

Alfonse Reinheart
10-23-2012, 10:52 AM
Oh, there's another spec bug related to this. The expressions leading to the execution of a barrier() must be "dynamically uniform". However, the section titled "dynamically uniform expressions" states that the concept only applies to fragment shaders.

randall
10-26-2012, 07:26 AM
I think I have found a bug.

I have this shader code:



// ...

struct Struct0 {
    ivec2 m0;
};
layout(std430) buffer Input {
    int data0;       // offset 0
    Struct0 data1;   // I think the offset should be 16, per the rule: "the base
                     // alignment of the structure is N, where N is the largest base
                     // alignment value of any of its members, ... rounded up to the
                     // base alignment of a vec4". With the 310.33 driver the offset
                     // is 8, so the structure's base alignment is not being rounded
                     // up to that of a vec4.
} g_input;

// ...


Can you confirm? Or, am I missing something?

Thanks.

Alfonse Reinheart
10-26-2012, 10:05 AM
You're using std430, not std140. As stated in the last paragraph of 7.6.2.2, the base alignment of structures is not rounded up to that of a vec4 in std430.

That's a feature, not a bug.
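To make the arithmetic concrete, here is a small Python sketch (the helper names are illustrative, not from any API) of the offset calculation for this particular block under both layouts:

```python
# Offset of data1 in: layout(...) buffer Input { int data0; Struct0 data1; }
# where Struct0 holds a single ivec2. Per section 7.6.2.2: an ivec2 has base
# alignment 2 * 4 = 8; in std140, rule 9 rounds a structure's base alignment
# up to that of a vec4 (16 bytes), while std430 drops that rounding, so the
# struct aligns only to its largest member.
def align(offset, alignment):
    return -(-offset // alignment) * alignment  # round up to a multiple

def data1_offset(packing):
    int_size = 4       # data0 occupies bytes [0, 4)
    ivec2_align = 8    # 2-component vector of 4-byte ints
    struct_align = ivec2_align
    if packing == "std140":
        struct_align = max(struct_align, 16)  # rule 9: round up to vec4
    return align(int_size, struct_align)

data1_offset("std430")  # 8, matching what the 310.33 driver reports
data1_offset("std140")  # 16
```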

randall
10-26-2012, 11:05 AM
You're using std430, not std140. As stated in the last paragraph of 7.6.2.2, the base alignment of structures is not rounded up to that of a vec4 in std430.

That's a feature, not a bug.

Yes, I know I am using std430.

I missed the "and of structures in rule 9" part in the last paragraph of 7.6.2.2.

Thanks.

randall
10-30-2012, 06:44 AM
The 'shared' keyword is an invalid keyword in a compute shader when I use:
#version 420 core
#extension GL_ARB_shader_storage_buffer_object : require
#extension GL_ARB_compute_shader : require

It works as expected when I use:
#version 430 core