Compute Shader: shared variables



john_connor
04-28-2017, 11:10 AM
hi everyone,

let's assume we invoke a compute shader with 1 (x 1 x 1) workgroup of 1000 invocations, and the compute shader's task is to compute the sum of all numbers from 1 to n = 999. there's a formula to calculate the result:

result = n * (n + 1) / 2 = 499500

https://en.wikipedia.org/wiki/1_%2B_2_%2B_3_%2B_4_%2B_%E2%8B%AF

using uint variables:
#version 450 core

layout (local_size_x = 1000, local_size_y = 1, local_size_z = 1) in;

layout (binding = 1, std430) buffer OutputBlock { uint Result; };

shared uint Total;

void main()
{
    /* shared variables start out undefined, so one invocation zeroes the sum first */
    if (gl_LocalInvocationID.x == 0)
        Total = 0u;
    barrier();

    atomicAdd(Total, gl_LocalInvocationID.x);
    barrier();
    if (gl_LocalInvocationID.x == 50)
        Result = Total;
}

here the actual math (reading the variable, processing the temporary result, writing it back to the variable) is done by the single "atomicAdd(Total, gl_LocalInvocationID.x)" instruction. we then only need to wait for all other invocations to reach the "barrier()" point in the code; finally, one particular invocation (here: the 51st) is allowed to write the sum into the shader storage buffer.

that works.

question: how can we do that with float variables?

i currently have this:
#version 450 core

layout (local_size_x = 1000, local_size_y = 1, local_size_z = 1) in;

layout (binding = 1, std430) buffer OutputBlock { float Result; };

shared float Total;

void main()
{
    memoryBarrierShared();
    /* not atomic: this is a separate read, add and write, so invocations can overwrite each other */
    Total += float(gl_LocalInvocationID.x);
    memoryBarrierShared();
    barrier();
    if (gl_LocalInvocationID.x == 50)
        Result = Total;
}

but it doesn't deliver the correct result. this is how i query the result:
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, buffer);
glBufferData(GL_SHADER_STORAGE_BUFFER, 4, NULL, GL_STATIC_READ);

/* execute shader */
glUseProgram(program);
glDispatchCompute(1, 1, 1);
glUseProgram(0);

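/* reads via glGetBufferSubData are covered by GL_BUFFER_UPDATE_BARRIER_BIT, not GL_SHADER_STORAGE_BARRIER_BIT */
glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT);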

float result = 0.0f;
glGetBufferSubData(GL_SHADER_STORAGE_BUFFER, 0, 4, &result);
cout << "result = " << result << endl;
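
A likely reason for the wrong float result: "Total += float(gl_LocalInvocationID.x)" is a plain read, add and write, and memoryBarrierShared() only orders memory accesses, it does not make that sequence atomic. The sketch below is a CPU model in C++, not shader code, and it assumes one particular interleaving (every invocation reads Total before any of them writes it back); the real order on a GPU is unspecified. The function names racySum and expectedSum are made up for illustration.

```cpp
#include <vector>

// Hypothetical CPU model of "Total += float(id)" under the ASSUMPTION that
// all invocations read the shared variable before any of them writes it back.
float racySum(unsigned invocations)
{
    float total = 0.0f;

    // Step 1: every invocation reads the same old value of the shared variable.
    std::vector<float> seen(invocations, total);

    // Step 2: every invocation writes back "old value + own id".
    // Each write overwrites the previous one, so only the last one survives.
    for (unsigned id = 0; id < invocations; ++id)
        total = seen[id] + static_cast<float>(id);

    return total;
}

// Reference: the sum the shader was supposed to produce (0 + 1 + ... + n-1).
float expectedSum(unsigned invocations)
{
    float total = 0.0f;
    for (unsigned id = 0; id < invocations; ++id)
        total += static_cast<float>(id);
    return total;
}
```

With 1000 invocations the model keeps only the last write, while the reference gives the intended sum. On real hardware the lost updates would usually be partial, so the observed value can be anything in between.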

john_connor
04-29-2017, 03:04 AM
i managed to get it working, but not as originally intended: because there are no atomic floating-point operations available in core GLSL, all the different invocations have to write "their results" into a shared "results array", and when they are done, only 1 invocation is allowed to merge the results. am i missing something, or is that the usual way to compute things in a compute shader?


#version 450 core

layout (local_size_x = 1000, local_size_y = 1, local_size_z = 1) in;

layout (binding = 1, std430) buffer OutputBlock { float Result; };

shared float TotalArray[1000];

void synchronize()
{
    memoryBarrierShared();
    barrier();
}

void main()
{
    /* initialize shared memory */
    if (gl_LocalInvocationID.x == 50)
    {
        for (int i = 0; i < 1000; i++)
            TotalArray[i] = 0.0f;
    }
    synchronize();

    TotalArray[gl_LocalInvocationID.x] = float(gl_LocalInvocationID.x);
    synchronize();

    /* write result into buffer */
    if (gl_LocalInvocationID.x != 50)
        return;
    /* sum locally first: the buffer is created without initial data, so "Result +=" would add to an undefined value */
    float sum = 0.0f;
    for (int i = 0; i < 1000; i++)
        sum += TotalArray[i];
    Result = sum;
}
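
For completeness, a float add can be emulated with an integer compare-and-swap loop: store the float's bits in a uint, and retry until the swap succeeds. In GLSL the corresponding built-ins would be atomicCompSwap, floatBitsToUint and uintBitsToFloat. The sketch below shows the same idea on the CPU in C++ with std::atomic; it is not shader code, and the names floatBits, bitsFloat, atomicAddFloat and sumWithCas are made up for illustration.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

// Bit reinterpretation, analogous to GLSL floatBitsToUint / uintBitsToFloat.
static std::uint32_t floatBits(float f)
{
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

static float bitsFloat(std::uint32_t u)
{
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Hypothetical helper: adds `value` to a float whose bits live in an atomic uint.
void atomicAddFloat(std::atomic<std::uint32_t>& cell, float value)
{
    std::uint32_t expected = cell.load();
    // On failure compare_exchange_weak stores the current value in `expected`,
    // so the new sum is recomputed from fresh data on every retry.
    while (!cell.compare_exchange_weak(expected,
                                       floatBits(bitsFloat(expected) + value)))
    {
    }
}

// Hypothetical driver: `workers` threads share the ids 0 .. invocations-1.
float sumWithCas(unsigned invocations, unsigned workers)
{
    std::atomic<std::uint32_t> total(floatBits(0.0f));
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w)
        pool.emplace_back([&total, w, workers, invocations] {
            for (unsigned id = w; id < invocations; id += workers)
                atomicAddFloat(total, static_cast<float>(id));
        });
    for (std::thread& t : pool)
        t.join();
    return bitsFloat(total.load());
}
```

With many invocations hammering one variable, such a loop retries a lot, so the per-invocation array used in the shader above may well be the better choice. Float addition is also not associative in general, so the result of an atomic sum can depend on the order of the adds.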

GClements
04-29-2017, 12:30 PM
Quote from john_connor: "i managed to get it working, but not as originally intended: because there are no atomic floating-point operations available in core GLSL, all the different invocations have to write "their results" into a shared "results array", and when they are done, only 1 invocation is allowed to merge the results. am i missing something, or is that the usual way to compute things in a compute shader?"

Merging (fold/reduce) operations can typically be parallelised using a divide-and-conquer strategy. E.g. for summation, N^2 items can be summed by having N threads each sum N items, then a single thread sum the N partial sums to yield a final result. Additional steps increase the exponent. Even where an atomic operation is available, divide-and-conquer may be more efficient as it avoids contention.
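
The two-step scheme described above can be sketched on the CPU as follows. This is a minimal C++ model rather than shader code: the outer loop of stage 1 stands in for the threads that would run in parallel, and the names twoStageSum and makeIds are hypothetical. For 1000 items, 32 "threads" summing about 32 items each is close to the N threads by N items split.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Stage 1: each "thread" sums its own contiguous chunk into a partial sum.
// Stage 2: a single thread sums the partial sums.
float twoStageSum(const std::vector<float>& items, std::size_t threads)
{
    if (threads == 0 || items.empty())
        return 0.0f;

    // Chunk size rounded up so that threads * chunk covers all items.
    const std::size_t chunk = (items.size() + threads - 1) / threads;
    std::vector<float> partial(threads, 0.0f);

    // On a GPU these iterations would be separate invocations running in parallel.
    for (std::size_t t = 0; t < threads; ++t) {
        const std::size_t begin = std::min(t * chunk, items.size());
        const std::size_t end = std::min(begin + chunk, items.size());
        for (std::size_t i = begin; i < end; ++i)
            partial[t] += items[i];
    }

    // A barrier would go here in a shader, before the partial sums are read.
    float total = 0.0f;
    for (float p : partial)
        total += p;
    return total;
}

// Hypothetical helper: the values 0, 1, ..., count-1 as floats.
std::vector<float> makeIds(std::size_t count)
{
    std::vector<float> ids(count);
    for (std::size_t i = 0; i < count; ++i)
        ids[i] = static_cast<float>(i);
    return ids;
}
```

No thread writes to a location another thread writes to, so no atomics are needed; only the barrier between the two stages is required.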