OpenGL Compute Shaders vs OpenCL

Hi,

I’ve been working on computing an image histogram using OpenGL compute shaders, but it’s very slow. What I do is divide the image rows between threads, and each thread computes the histogram of its rows. I use the imageLoad() function to read pixels from an image texture.

I tried to measure OpenGL compute shader performance by just summing up a constant value, but it’s still very slow:


for (uint i = start; i < end; ++i)
{
	for (uint j = 0; j < 480; ++j)
	{
		uint mask = 1;
		uvec4 color = uvec4(1);

		sum += color.r + mask;
	}
}

I want to know whether OpenGL compute shaders run inside the OpenGL rendering pipeline or on the CUDA multiprocessors. Right now the code above seems to run as slowly as fragment shader code.

On my GTX 460 I have 7 CUDA multiprocessors/OpenCL compute units running at 1526 MHz and 336 shader units. It should be possible to execute the above loop extremely fast on a 1526 MHz multiprocessor, shouldn’t it?

Please clarify for me the difference between OpenGL compute shaders and OpenCL. Where do they run? What’s the cost of switching between OpenCL and OpenGL?

Best regards,
Alin

First of all - the OpenGL pipeline runs on the CUDA multiprocessors. It is just a different view of the same hardware. Compute shaders are simply more tightly coupled with the rest of OpenGL than OpenCL is.

How many threads do you use for the calculation? A GPU generally needs to run several hundred threads to reach good performance. The code looks very single-threaded to me.

Re the difference between OpenCL and OpenGL compute shaders: compute shaders are easier to use if you need to add a bit of compute to an OpenGL application, because you don’t need to deal with all the complications of sharing devices and resources between OpenGL and OpenCL. Depending on the GPU vendor, compute shaders may also have less run-time overhead (particularly if the vendor doesn’t implement ARB_cl_event, which I don’t think NVIDIA did last time I checked). On the other hand, compute shaders are a bit more limited, since they are just GLSL with a few extras bolted on to support work groups rather than a language designed for compute.

Regarding the number of threads: you definitely need more threads, e.g. a thread per pixel rather than per row. Depending on how many histogram bins you have, you might get reasonable performance just by using atomic image operations to write a single histogram, as in the sketch below.
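
Something like this (an untested sketch, not code from this thread; it assumes the histogram lives in a 256x3 r32ui image, since imageAtomicAdd only works on r32ui/r32i formats, and that the dispatch covers the whole image):

#version 430

layout(local_size_x = 16, local_size_y = 16) in;   // 256 invocations per work group, one per pixel

layout(binding = 0, rgba8ui) uniform readonly uimage2D Texture;
layout(binding = 1, r32ui) uniform uimage2D GlobalHistograms;   // x = bin (0..255), y = channel (0..2)

void main()
{
	ivec2 position = ivec2(gl_GlobalInvocationID.xy);
	if (any(greaterThanEqual(position, imageSize(Texture))))
		return; // partial work groups at the right/bottom edge simply do nothing

	uvec4 color = imageLoad(Texture, position);
	imageAtomicAdd(GlobalHistograms, ivec2(int(color.r), 0), 1u);
	imageAtomicAdd(GlobalHistograms, ivec2(int(color.g), 1), 1u);
	imageAtomicAdd(GlobalHistograms, ivec2(int(color.b), 2), 1u);
}

With one invocation per pixel there is no loop in the shader at all; contention on the global atomics then becomes the main cost, which is what a per-work-group shared histogram can reduce.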

Hi,

Thanks for your replies.

Here is the code. Please tell me if something is wrong performance-wise. I want to compute the histogram of a 640x480 image. I’m putting the histogram in shared memory, as access to it should be a lot faster than using imageAtomicAdd, which operates on global memory.


#version 430 compatibility
#extension GL_ARB_compute_shader : enable
#extension GL_ARB_shader_image_load_store : enable
#extension GL_ARB_shader_storage_buffer_object : enable

uniform vec2 ImageSize;

layout(binding=0, rgba8ui) uniform readonly uimage2D Texture;
layout(std140, binding=0) buffer blocks {
	double result;
};

layout (local_size_x = 320, local_size_y = 1, local_size_z = 1) in;

shared uint histogram[3][256];

void main(void)
{	
	uint totalSize = gl_WorkGroupSize.x * gl_WorkGroupSize.y * gl_WorkGroupSize.z;
	uint index = gl_LocalInvocationIndex;
	
	uvec2 size = uvec2(ImageSize);
	uint start = index * size.y / totalSize;
	uint end = (index+1) * size.y / totalSize;
	
	if (index == totalSize-1)
		end = uint(ImageSize.y);
	
	for (uint i = start; i < end; ++i)
	{
		for (uint j = 0; j < size.x; ++j)
		{
			ivec2 position = ivec2(j, i);
			uvec4 color = imageLoad(Texture, position);
			
			atomicAdd(histogram[0][color.r], 1);
			atomicAdd(histogram[1][color.g], 1);
			atomicAdd(histogram[2][color.b], 1);
		}
	}
}



I need further clarification. GPU Caps Viewer says that the GTX 460 has 336 shaders and 7 multiprocessors with a warp size of 32. From what I understand, vertex and fragment shaders should run on the 336 shader units and CUDA/OpenCL/OpenGL compute shaders on the 7 multiprocessors. Please correct me if I’m wrong.

I then did a performance test with this code, as I suspected that imageLoad and shared memory might be slow. I removed the imageLoad operations and the access to shared memory, but it is still very slow.


#version 430 compatibility
#extension GL_ARB_compute_shader : enable
#extension GL_ARB_shader_image_load_store : enable
#extension GL_ARB_shader_storage_buffer_object : enable

uniform vec2 ImageSize;

layout(binding=0, rgba8ui) uniform readonly uimage2D Texture;
layout(std140, binding=0) buffer blocks {
	double result;
};

layout (local_size_x = 256, local_size_y = 1, local_size_z = 1) in;

shared uint histogram[3][256];

void main(void)
{	
	uint totalSize = gl_WorkGroupSize.x * gl_WorkGroupSize.y * gl_WorkGroupSize.z;
	uint index = gl_LocalInvocationIndex;
	
	uvec2 size = uvec2(ImageSize);
	uint start = index * size.y / totalSize;
	uint end = (index+1) * size.y / totalSize;
	
	if (index == totalSize-1)
		end = uint(ImageSize.y);
		
	uint sum = 0;
	
	for (uint i = start; i < end; ++i)
	{
		for (uint j = 0; j < size.x; ++j)
		{
			uint mask = 1;
			uvec4 color = uvec4(1);
			sum += mask + color.r;
		}
	}
}

I also tried to compute the histogram inside a fragment shader using imageAtomicAdd, but it takes about 1 ms, which is slow due to the fact that it accesses global memory, and that is very slow. I then moved to compute shaders, as they give access to shared memory, which should be faster than global memory.


#version 420

layout(early_fragment_tests) in;

uniform sampler2D Texture;

// here comes the GLSL 4.2 magic
layout(binding=0, r32ui) uniform uimage2D ImageHistograms;

in vec2 texcoordFB;
in vec2 texcoordCT;

out vec4 FragColor;

void histogram(vec3 color, int row)
{
	const int red = int(color.r * 255);
	imageAtomicAdd(ImageHistograms, ivec2(red, row+0), 1);
	
	const int green = int(color.g * 255);
	imageAtomicAdd(ImageHistograms, ivec2(green, row+1), 1);
	
	const int blue = int(color.b * 255);
	imageAtomicAdd(ImageHistograms, ivec2(blue, row+2), 1);
}

void main()
{
	vec4 color = texture(Texture, texcoordCT); // texture() instead of texture2D(), which is not available in core GLSL 420
	
	histogram(color.rgb, 0);
}

Please clarify this for me before I switch to OpenCL.

Those 336 shader cores and the 7 multiprocessors are the same hardware counted at a different granularity: each multiprocessor contains 48 shader cores. It’s kind of like OpenCL processing elements vs. compute units. Believe me, no matter whether you run D3D, OpenGL or any other graphics shader, or you run CUDA, OpenCL or OpenGL compute shaders, it will all run on the same piece of hardware. That’s why it is called a unified architecture.

The main problem with your code anyway ends up being the tight loop in the shader.

Also, while performing atomic global memory operations on images is indeed slower than accessing shared memory, the problem is that if you want to use shared memory, all of your shader invocations (or kernel instances) must run on the same compute unit, which in your case means you can use only one of your multiprocessors. You cannot share shared memory between compute units, so even if you have thousands of threads and properly parallel code, you will practically serialize your work on a single compute unit.

On the other hand, using atomic global memory operations (i.e. image load/store), you can use all the computing resources of your hardware.

Actually, there are better ways to take advantage of both shared memory and global memory atomics, but for that you need to learn more about parallel computing and GPU parallelism in general. For now, I think image load/store is your best bet. You should still be able to compute the histograms of thousands of textures per second.

Hi aqnuep,

Thanks for your quick reply.

if you want to use shared memory, all of your shader invocations (or kernel instances) must run on the same compute unit […] you will practically serialize your work on a single compute unit.

OK, I will then dispatch 7 work groups, each with its own shared memory. Each work group has to process 640/7 = 91 rows. Each work group will have 45 threads, so each thread has to process 91/45 = 2 rows. At the end, one thread from each work group will update the global histograms using imageAtomicAdd. I will use barrier() to synchronize the threads before updating the global histogram.

What do you think? Will this perform faster?


#version 430 compatibility
#extension GL_ARB_compute_shader : enable
#extension GL_ARB_shader_image_load_store : enable
#extension GL_ARB_shader_storage_buffer_object : enable
 
uniform vec2 ImageSize;
 
layout(binding=0, rgba8ui) uniform readonly uimage2D Texture;
layout(binding=1, r32ui) uniform uimage2D GlobalHistograms; // imageAtomicAdd requires an r32ui/r32i image format
layout(std140, binding=0) buffer blocks {
	double result;
};
 
layout (local_size_x = 45, local_size_y = 1, local_size_z = 1) in;
 
shared uint histogram[3][256];
 
void main(void)
{	
	uint localSize = gl_WorkGroupSize.x * gl_WorkGroupSize.y * gl_WorkGroupSize.z;
	uint groupCount = gl_NumWorkGroups.x * gl_NumWorkGroups.y * gl_NumWorkGroups.z;
	uint index = gl_LocalInvocationIndex;

	// NOTE: the shared histogram should also be zeroed here (shared memory is not
	// initialized automatically), followed by a barrier(), before accumulating into it

	// each work group processes a contiguous slice of rows, each thread a sub-slice of that
	uvec2 size = uvec2(ImageSize);
	uint rowsPerGroup = size.y / groupCount;
	uint groupStart = gl_WorkGroupID.x * rowsPerGroup;
	uint start = groupStart + index * rowsPerGroup / localSize;
	uint end = groupStart + (index + 1) * rowsPerGroup / localSize;

	if (gl_WorkGroupID.x == groupCount - 1 && index == localSize - 1)
		end = size.y;
 
	for (uint i = start; i < end; ++i)
	{
		for (uint j = 0; j < size.x; ++j)
		{
			ivec2 position = ivec2(j, i);
			uvec4 color = imageLoad(Texture, position);
 
			atomicAdd(histogram[0][color.r], 1);
			atomicAdd(histogram[1][color.g], 1);
			atomicAdd(histogram[2][color.b], 1);
		}
	}
	
	barrier();
	
	if (index == 0)
	{
		for (uint i = 0; i < 256; ++i)
		{
			imageAtomicAdd(GlobalHistograms, ivec2(0, i), histogram[0][i]);
			imageAtomicAdd(GlobalHistograms, ivec2(1, i), histogram[1][i]);
			imageAtomicAdd(GlobalHistograms, ivec2(2, i), histogram[2][i]);
		}
	}
}

Smart GPU guys at (probably) all of NVidia, AMD, and Intel have already written fast histogram reduction code (and other fast reduction code in general) – I know NVidia has – just check out their GPU computing SDK for things with histogram and/or reduc in the filename.

One thing you can do to help accelerate your development is to look at their OpenCL, CUDA, and/or D3D compute shader code for histogram reduction and convert it to OpenGL compute. That should give you a leg up on seeing how to employ shared memory and threading to maximum benefit.

How would that be any good? You still have a two-level nested loop inside your shader. Good would mean you have none.

The shader code you write is executed once for each work item in each work group, so the shader itself should practically not deal with anything other than a single pixel of the image (again, no loops).

Also, don’t use image variables if you don’t need to. Using a read-only image does not necessarily perform the same as a texture; at least I can guarantee you that it doesn’t perform the same on all hardware. If you only need read-only access, use texture fetches instead of image loads.

A simple pseudocode to demonstrate (assuming a work group size of 256):


void main()
{
    // (the shared histogram should be zeroed and a barrier() issued before this point)

    // get the current pixel's color and write the histogram information to shared memory
    ivec2 coords = ivec2(gl_GlobalInvocationID.xy);
    uvec4 color = texelFetch(Texture, coords, 0);
    atomicAdd(histogram[0][color.r], 1);
    atomicAdd(histogram[1][color.g], 1);
    atomicAdd(histogram[2][color.b], 1);

    barrier();
    memoryBarrierShared();

    // accumulate the shared memory histogram information into the load/store image
    // as we have 256 work items in our work group, each has to export only one value
    imageAtomicAdd(GlobalHistogram, ivec2(0, gl_LocalInvocationIndex), histogram[0][gl_LocalInvocationIndex]);
    imageAtomicAdd(GlobalHistogram, ivec2(1, gl_LocalInvocationIndex), histogram[1][gl_LocalInvocationIndex]);
    imageAtomicAdd(GlobalHistogram, ivec2(2, gl_LocalInvocationIndex), histogram[2][gl_LocalInvocationIndex]);
}

So your work group size will be 256 (let it be 16x16), and then your total work item count (domain) should practically be the size of the texture you calculate the histogram of.
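
For reference, with a 16x16 work group and the 640x480 image from this thread the numbers work out like this (just the image size divided by the work group size; round up if it doesn’t divide evenly):

// in the compute shader:
layout(local_size_x = 16, local_size_y = 16) in;   // 256 invocations per group

// on the host side (for a 640x480 image):
//   glDispatchCompute(640 / 16, 480 / 16, 1);     // 40 x 30 = 1200 work groups
//   round the group counts up when the image size is not a multiple of 16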

Sure, you could use other work group sizes, e.g. 64 or 128, but then each shader invocation has to take over multiple exports to the final histogram image.

First of all, forget about aligning your work group size to your particular hardware; it won’t help. Always use a work group size of at least 64 (power-of-two values preferred), but even more wouldn’t hurt, especially for more complicated compute shaders (unlike this one).

Anyways, the inefficiency of your approach DOES NOT come from the fact that you use image load/store, but from the fact that your code is serial, not parallel. Once you remove all the loops, you’ll see that no matter whether you use shared memory, or a work group size of A or B, it will still be at least an order of magnitude faster than what you have now.


Hi.
I found this thread using Google.

I have a framework with OpenGL 4.2 & OpenCL. Now I’m updating it to OpenGL 4.3 and trying to use compute shaders, but compute shaders are ~20% slower than my OpenCL kernels in my case.

Did anybody “seriously” compare compute shaders vs OpenCL?
What is the right strategy for working with global memory in compute shaders (i.e. shader storage buffers, imageStore/imageLoad, or imageStore/texelFetch)?
Do I always need glMemoryBarrier() after glDispatchCompute()?
Is a “fullscreen quad” better than a compute shader?

You practically hijacked the topic, because this isn’t really related to the original question, but I’ll still try to answer you, because you brought up an interesting subject.

So here are my answers:

It is possible that there is a completely different compiler behind OpenGL compute shaders and OpenCL kernels.
Also, not all compute capabilities present in OpenCL are available in OpenGL.
Not to mention that the way synchronization is handled in OpenCL is wildly different than that of OpenGL.
Finally, it may even vary from hardware to hardware.

All of these are options. Personally, for read-only data I’d prefer using texture fetches, simply because some hardware might have a different path for storage buffers or load/store images, as those are R/W data sources.
Also, there could be a difference between storage buffer and load/store image implementations as well, as the latter has a fixed element size while the former doesn’t really have a definition of an element at all, so dynamic indexing in particular could result in different performance in the two cases.
Another thing is that storage buffers, image buffers and texture buffers access linear memory, while other images and textures usually access tiled memory, so there can be a huge difference in performance because of this as well. A sketch of the three options follows below.
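
To make the options concrete, here is a rough sketch of how the same read-only data could be declared and read each way in a compute shader (the names are just made up for illustration):

#version 430

layout(local_size_x = 64) in;

// shader storage buffer: untyped linear memory, readable and writable
layout(std430, binding = 0) buffer PixelBuffer {
	uint pixels[];
};

// load/store image: fixed per-texel format, here restricted to reads
layout(binding = 0, rgba8ui) uniform readonly uimage2D InputImage;

// plain texture: read-only path, fetched with texelFetch
uniform usampler2D InputTexture;

void main()
{
	ivec2 coords = ivec2(gl_GlobalInvocationID.xy);
	uint  a = pixels[gl_GlobalInvocationID.x];       // storage buffer read
	uvec4 b = imageLoad(InputImage, coords);         // image load
	uvec4 c = texelFetch(InputTexture, coords, 0);   // texture fetch
}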

No, why would you? Unless you plan to use data written by the compute shader through image stores, storage buffer writes or atomic counter writes, you don’t have to. The memory barrier rules are the same as before.
Also note that while calling glMemoryBarrier is not free and people are afraid of its performance cost, don’t think that other write-to-read hazards, like those involved in framebuffer writes or transform feedback writes, are free; they are just implicit. No additional API call is made, but synchronization may still happen behind the scenes, which is arguably worse than the new mechanism, since here at least the app developer has explicit control over whether a sync is needed or not.

Maybe on some hardware, maybe not on others. Fragment shaders are kind of different from compute shaders. They are instantiated by the rasterizer, which means the granularity (the work group size, in compute shader terminology) might be different. Compute shaders provide more explicit behavior: if you specify a work group size of 16x16, you are guaranteed that those invocations will be on the same compute unit, as they may share memory, while the number of fragment shader instances issued on a single compute unit, and which fragments they process, is determined by the rasterizer and can vary wildly between different GPUs.

Also, the individual shader instances might be submitted to the actual ALUs in a different pattern for compute shaders and fragment shaders, so access to various types of resources (linear or tiled) also results in different patterns, and one can be worse than the other. But all this depends on the GPU design, the type of resource you access and the access pattern of your shader.

A benefit of using fragment shaders is that you can use framebuffer writes to output data, which is almost guaranteed to be faster than writing storage buffers or performing image writes. You can even perform limited atomic read-modify-write operations when doing framebuffer writes, thanks to blending, color logic ops and stencil operations.

Finally, note that GPUs don’t rasterize quads, so if you do compute with a fragment shader you are actually rendering two triangles, which means that across the diagonal edge where the two triangles meet, on some hardware, you might end up with half-full groups of shader invocations being executed on a compute unit, which on its own already results in a slight drop in overall performance.
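
A common workaround for that particular issue (not something from this thread) is to draw a single oversized triangle instead of two: it covers the whole viewport, so there is no interior diagonal edge. A sketch of the vertex shader, assuming you draw three vertices with glDrawArrays and no vertex attributes:

#version 430

out vec2 texcoord;

void main()
{
	// gl_VertexID 0,1,2 -> (0,0), (2,0), (0,2): one big triangle covering the viewport
	vec2 pos = vec2((gl_VertexID << 1) & 2, gl_VertexID & 2);
	texcoord = pos;
	gl_Position = vec4(pos * 2.0 - 1.0, 0.0, 1.0);
}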

To sum it up, there is no general answer whether OpenCL is better than OpenGL compute shaders, or that OpenGL compute shaders are better than fragment shaders. It all depends on the hardware, driver, your shader code, and the problem you want to solve.

What I can suggest based on what I’ve heard from developers, though, is that if you want to do some compute stuff in a graphics application that already uses OpenGL, you are better off not using OpenCL/GL interop, as the interop performance seems to be usually pretty bad, independent of GPU generation or vendor.

Did anybody “seriously” compare compute shaders vs OpenCL?

How can someone “seriously” compare them when:

1: NVIDIA’s Compute Shader implementation is barely 6 months old and hasn’t been out of beta for more than a few months.

2: AMD doesn’t even have a Compute Shader implementation yet.

In short, compute shaders are too immature at the present time to be “seriously” comparing them to anything. It’s simply too soon to answer most of your questions.

The other question is this: what exactly are you using them for?

Compute Shaders exist to do exactly one thing: make it more convenient to do GPU computations that directly support rendering operations. Before CS’s, if you wanted to do instanced rendering with frustum culling, you had to use Geometry Shaders via some hacky methods. This involved a pointless vertex shader, as well as working around the semantics of GS output streams and other such things.

Now, you can just use a CS (hardware and drivers willing).

Compute Shaders are not a replacement for all uses of OpenCL. If all you’re doing with your compute tasks is to read the data back on the CPU, stick with OpenCL. Compute Shaders are primarily intended to help rendering operations. They exist so that you don’t have to do expensive context switches and such that OpenCL/GL interop requires. If your compute tasks aren’t about rendering operations, then you shouldn’t be using CS’s.

Simply saying that you lost 20% performance “in my case” really isn’t enough information to know if that’s reasonable or not.

Thank you guys for quick and detailed answers!

AMD has had compute shader support for a few weeks now, in their latest 13.4 and 13.5 beta drivers. I tried them and they work for the ao_bench compute shader test.
Just google for Catalyst 13.4 or 13.5
Feature Highlights of AMD Catalyst™ 13.4

Support added for AMD Radeon HD7790 and AMD Radeon HD 7990

Delivers support for the following OpenGL 4.3 features:
GL_ARB_compute_shader
GL_ARB_multi_draw_indirect
GL_ARB_shader_storage_buffer_object
GL_ARB_arrays_of_arrays
GL_ARB_clear_buffer_object
GL_ARB_ES3_compatibility
GL_ARB_explicit_uniform_location
GL_ARB_fragment_layer_viewport
GL_ARB_invalidate_subdata
GL_ARB_program_interface_query
GL_ARB_shader_image_size
GL_ARB_stencil_texturing
GL_ARB_texture_buffer_range
GL_ARB_texture_query_levels
GL_ARB_texture_storage_multisample

The newer 13.5 beta driver also supports OpenGL compute shaders.