Compute Shader: shared variables

Hi everyone,

Let's assume we invoke a compute shader with a single (1 x 1 x 1) work group, and the compute shader's task is to compute the sum of all numbers from 1 to n = 999. There's a formula for the result:

result = n * (n + 1) / 2 = 999 * 1000 / 2 = 499500

Using uint variables:

#version 450 core

layout (local_size_x = 1000, local_size_y = 1, local_size_z = 1) in;

layout (binding = 1, std430) buffer OutputBlock { uint Result; };

shared uint Total;

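void main()
{
	/* shared variables start out undefined, so one invocation clears the total first */
	if (gl_LocalInvocationID.x == 0)
		Total = 0;
	barrier();

	atomicAdd(Total, gl_LocalInvocationID.x);
	barrier();
	if (gl_LocalInvocationID.x == 50)
		Result = Total;
}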

Here the actual math (reading the variable, processing the temporary result, writing it back to the variable) is done by the “atomicAdd(Total, gl_LocalInvocationID.x)” call. We then only need to wait for all other invocations to reach the “barrier()” point in the code; finally, one specific invocation (here: the 51st) writes the sum into the shader storage buffer.

That works.

Question: how can we do the same with float variables?

I currently have this:

#version 450 core

layout (local_size_x = 1000, local_size_y = 1, local_size_z = 1) in;

layout (binding = 1, std430) buffer OutputBlock { float Result; };

shared float Total;

void main()
{
	memoryBarrierShared();
	Total += float(gl_LocalInvocationID.x);
	memoryBarrierShared();
	barrier();
	if (gl_LocalInvocationID.x == 50)
		Result = Total;
}

But it doesn't deliver the correct result. This is how I query the result:

glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, buffer);
glBufferData(GL_SHADER_STORAGE_BUFFER, 4, NULL, GL_STATIC_READ);

/* execute shader */
glUseProgram(program);
glDispatchCompute(1, 1, 1);
glUseProgram(0);

/* make the shader's buffer writes visible to glGetBufferSubData */
glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT);

float result = 0.0f;
glGetBufferSubData(GL_SHADER_STORAGE_BUFFER, 0, 4, &result);
cout << "result = " << result << endl;

I managed to get it working, but not as originally intended: because there are no atomic floating-point operations available, all the different invocations have to write “their results” into a shared “results array”, and when they are done, only one invocation is allowed to merge the results. Am I missing something, or is that the usual way to compute things in a compute shader?

#version 450 core

layout (local_size_x = 1000, local_size_y = 1, local_size_z = 1) in;

layout (binding = 1, std430) buffer OutputBlock { float Result; };

shared float TotalArray[1000];

void synchronize()
{
	memoryBarrierShared();
	barrier();
}

void main()
{
	/* every invocation writes its own slot, so no separate clearing pass is needed */
	TotalArray[gl_LocalInvocationID.x] = float(gl_LocalInvocationID.x);
	synchronize();

	/* a single invocation merges the results and writes the sum into the buffer */
	if (gl_LocalInvocationID.x != 50)
		return;

	/* accumulate locally: the buffer content is undefined before the first write */
	float sum = 0.0f;
	for (int i = 0; i < 1000; i++)
		sum += TotalArray[i];
	Result = sum;
}
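
On the point that no atomic floating-point operations are available: core GLSL 4.50 indeed has none, and a plain “Total += ...” on a shared float is a non-atomic read-modify-write, so concurrent invocations overwrite each other's updates no matter how many memory barriers surround it. One common workaround is to keep the float's bit pattern in a shared uint and retry with atomicCompSwap() until the update goes through. The following is only a minimal sketch under that assumption; the names TotalBits and atomicAddTotal are made up for illustration and are not part of any API.

```glsl
#version 450 core

layout (local_size_x = 1000, local_size_y = 1, local_size_z = 1) in;

layout (binding = 1, std430) buffer OutputBlock { float Result; };

/* hypothetical name: the float total, stored as its raw bit pattern */
shared uint TotalBits;

/* hypothetical helper: emulates an atomic float add with a compare-and-swap loop */
void atomicAddTotal(float value)
{
	uint assumed;
	uint old = TotalBits;
	do
	{
		assumed = old;
		uint desired = floatBitsToUint(uintBitsToFloat(assumed) + value);
		/* stores 'desired' only if TotalBits still equals 'assumed';
		   always returns the content found before the operation */
		old = atomicCompSwap(TotalBits, assumed, desired);
	} while (old != assumed);
}

void main()
{
	/* shared variables start out undefined, so clear the total first */
	if (gl_LocalInvocationID.x == 0)
		TotalBits = floatBitsToUint(0.0);
	memoryBarrierShared();
	barrier();

	atomicAddTotal(float(gl_LocalInvocationID.x));
	memoryBarrierShared();
	barrier();

	if (gl_LocalInvocationID.x == 50)
		Result = uintBitsToFloat(TotalBits);
}
```

With the values 0 to 999 every intermediate sum is a whole number well below 2^24, so the result is exactly 499500 regardless of the order in which the additions land. For arbitrary floats the rounding depends on that order, and all invocations contend for the same variable, so this is usually slower than collecting partial results.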

[QUOTE=john_connor;1286858]I managed to get it working, but not as originally intended: because there are no atomic floating-point operations available, all the different invocations have to write “their results” into a shared “results array”, and when they are done, only one invocation is allowed to merge the results. Am I missing something, or is that the usual way to compute things in a compute shader?
[/QUOTE]
Merging (fold/reduce) operations can typically be parallelised using a divide-and-conquer strategy. E.g. for summation, N^2 items can be summed by having N threads each sum N items and then having a single thread sum the N partial sums to yield the final result. Each additional step increases the exponent (three steps handle N^3 items, and so on). Even where an atomic operation is available, divide-and-conquer may be more efficient because it avoids contention on a single variable.
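
The two-step scheme described above might look like the following minimal sketch. Assumptions that are not in the original posts: a work group of 32 invocations, the names GroupSize, ItemCount and Partial, and the same 1000 values (0 to 999) that the shaders earlier in the thread add up.

```glsl
#version 450 core

/* assumption: 32 invocations, each summing up to 32 of the 1000 values */
layout (local_size_x = 32, local_size_y = 1, local_size_z = 1) in;

layout (binding = 1, std430) buffer OutputBlock { float Result; };

const uint GroupSize = 32u;     /* must match local_size_x */
const uint ItemCount = 1000u;   /* values 0 .. 999 */

shared float Partial[GroupSize];

void main()
{
	uint id = gl_LocalInvocationID.x;

	/* step 1: every invocation sums its own strided slice in a private variable */
	float sum = 0.0;
	for (uint i = id; i < ItemCount; i += GroupSize)
		sum += float(i);
	Partial[id] = sum;

	/* wait until all partial sums have been written */
	memoryBarrierShared();
	barrier();

	/* step 2: a single invocation sums the partial sums */
	if (id == 0u)
	{
		float total = 0.0;
		for (uint i = 0u; i < GroupSize; i++)
			total += Partial[i];
		Result = total;
	}
}
```

Each invocation only touches its own private variable and its own slot in Partial, so no atomics are needed at all. For these inputs the shader writes 499500 to Result. Halving the number of active invocations in each round (a tree reduction) is the same idea taken further, with more steps and less work per step.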
