I've been working on computing an image histogram using OpenGL compute shaders, but it's very slow. I divide the image's rows among the threads, and each thread computes the histogram of its own rows. I use the imageLoad() function to read pixels from an image texture.
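To make the scheme concrete, here is a minimal sketch of a row-partitioned histogram shader. It illustrates the approach described above and is not the exact code in question: the r8ui image format, the binding points, the rowsPerThread uniform, and the single shared bins buffer updated with atomicAdd() are all assumptions.

```glsl
#version 430
// Sketch only: format, bindings, and all names below are assumptions.
layout(local_size_x = 64) in;

// Single-channel 8-bit unsigned image, read with imageLoad().
layout(binding = 0, r8ui) readonly uniform uimage2D inputImage;

// One 256-bin histogram shared by all invocations.
layout(std430, binding = 1) buffer Histogram {
    uint bins[256];
};

// Hypothetical uniform: number of image rows handled by each invocation.
uniform uint rowsPerThread;

void main() {
    ivec2 size = imageSize(inputImage);

    // Each invocation owns a contiguous band of rows [start, end).
    uint start = gl_GlobalInvocationID.x * rowsPerThread;
    uint end = min(start + rowsPerThread, uint(size.y));

    for (uint y = start; y < end; ++y) {
        for (uint x = 0u; x < uint(size.x); ++x) {
            uint value = imageLoad(inputImage, ivec2(x, y)).r;
            // Atomic increment, since other invocations update the same bins.
            atomicAdd(bins[value & 255u], 1u);
        }
    }
}
```

With this layout, an invocation whose band starts past the last row simply does no work, because end is clamped to the image height.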

To isolate the problem, I measured compute shader performance with a loop that only sums a constant value, and it is still very slow.

Code:
// start, end and sum are defined elsewhere in the shader
for (uint i = start; i < end; ++i) {
	for (uint j = 0u; j < 480u; ++j) {
		uint mask = 1u;
		uvec4 color = uvec4(1u);
		sum += color.r + mask;
	}
}

I want to know whether OpenGL compute shaders run inside the OpenGL rendering pipeline or directly on the CUDA multiprocessors. Right now the code above seems to run as slowly as fragment shader code.

My GTX 460 has 7 CUDA multiprocessors/OpenCL compute units running at 1526 MHz and 336 shader units. It should be possible to execute the above loop extremely fast on multiprocessors clocked at 1526 MHz, shouldn't it?

Please clarify for me the difference between OpenGL compute shaders and OpenCL. Where do they run? What's the cost of switching between OpenCL and OpenGL?

Best regards,