PDA

View Full Version : Parallel Prefix Sum in compute shader - unexpected results



michalferko
10-05-2015, 03:02 AM
Hi, I'm trying to implement parallel radix sort through GLSL compute shaders. I need a prefix sum calculation for that, but the first step of calculating it using Blelloch scan is giving be trouble.

My problem size can be pretty high, up to approx. 2 million unsigned integers (stored in a 2D texture). I implemented the first step of Blelloch scan according to the Udacity course on parallel programming: https://www.udacity.com/course/viewer#!/c-cs344/l-86719951/e-88035124/m-87690374

My approach was the following: Execute one compute shader for every single array element. After each step of the Blelloch scan, call barrier() to ensure that results from the previous steps have been written. I even tried double-buffering (in the code below) to be completely sure nothing is writing over my data.

I am storing intermediate results in an SSBO and finally write the output to an image. As long as there's only one work group, it works fine. Increasing the global dimensions to 32x16 (2 workgroups) causes the result to be mostly correct (the following code should output 512), but sometimes, it just returns 400 + something.

Where could be the problem? I know I'm working just with global memory and that it's probably going to be slow at the moment, but the whole array will not fit into shared memory anyway (2 mil. uint values).

I was checking out the code from OpenGL Superbible, but it's limited to one workgroup and shared memory: https://github.com/openglsuperbible/sb6code/blob/master/bin/media/shaders/prefixsum/prefixsum.cs.glsl

Here's the compute shader code (for debugging purposes, I do not read anything from the input image, I just set every value to 1):


#version 430 core

layout (local_size_x = 16, local_size_y = 16) in;

uniform int problemSize;

//layout (binding = 0, rg32ui) readonly uniform uimage2D input_image;
layout (binding = 1, rg32ui) writeonly uniform uimage2D output_image;

layout (binding = 0, std430) buffer tempData
{
uint temp[];
};

void main(void)
{
bool whichBuffer = false;
uint offset = problemSize;
uint offset1 = 0;
uint offset2 = offset;
uint id = gl_GlobalInvocationID.x + gl_GlobalInvocationID.y * (gl_NumWorkGroups.x * gl_WorkGroupSize.x);
ivec2 C = ivec2(gl_GlobalInvocationID.xy);

temp[offset1 + id] = 1;
memoryBarrierBuffer();

const uint steps = uint(log2(problemSize)) + 1;

// Reduce
for (uint step = 0; step < steps; step++)
{
uint p = 1 << (step + 1);
uint q = 1 << step;
uint s = temp[offset1 + id];
uint t = temp[offset1 + id - q];
if (id % p == p - 1)
temp[offset2 + id] = s + t;
else
temp[offset2 + id] = s;
memoryBarrierBuffer();
offset = offset1;
offset1 = offset2;
offset2 = offset;
}

imageStore(output_image, C, uvec4(temp[offset2 + id], id, 0, 0));
}


I have seen some code snippets working with CUDA, but I believe this can be achieved with GLSL compute shaders as well (with comparable performance). Some of those CUDA implementations promise more than 1 billion values sorted per second, which means that for my problem size, the whole sort should take around 2-3ms (see for instance the graphs at the bottom of this page: https://code.google.com/p/back40computing/wiki/RadixSorting )

Any help is appreciated. Thanks.

Some PC data:
Windows 10, NVIDIA GeForce GT 740M laptop GPU