imageLoad/imageStore VERY slow



JasonRay
12-14-2015, 01:52 PM
Hello folks,

I need to write shaders that build per-pixel linked lists, sort the lists (one list per pixel) and draw the result. In the literature, this procedure is a well-known way to implement 'Order-Independent Transparency'.

My fragment shader that builds a list for the current pixel is:



in vec4 gl_FragCoord;

out vec4 out_col;

uniform layout(binding=0, r32i) coherent iimage2D head_buffer;
uniform layout(binding=1, r32i) coherent iimageBuffer next_buffer;
uniform layout(binding=2, rgba32f) coherent imageBuffer data_buffer;
uniform layout(binding=3, offset=0) atomic_uint ac;

void main()
{
    int index = int(atomicCounterIncrement(ac));
    if (index >= 1024 * 768 * 16) // this is the maximal number of elements in [head/next/data]_buffer
        discard;

    int indexOld = imageAtomicExchange(head_buffer, ivec2(gl_FragCoord.xy), index);
    imageStore(next_buffer, index, ivec4(indexOld, 0, 0, 0));
    float depth = gl_FragCoord.z;
    imageStore(data_buffer, index, vec4(1.0, 1.0, 1.0, depth)); // test-wise only white pixel used for simplification

    out_col = vec4(1.0, 1.0, 1.0, 1.0);
}



It's actually a fairly simple shader, but the performance is very poor. Executing the just shown shader needs around 120 ms. I would have expected much less!

My hardware is:

AMD Phenom 9650
4 GB DDR2 800 RAM
Palit GeForce GTX 460 with 768 MB VRAM
Ubuntu 14.04 LTS
nVidia binary driver 340.96

I tried out the newer nVidia driver '352.63', but that one is a catastrophe: my shader runs around two to three times slower on it.

I appreciate ANY comments, criticism included, but please bear with me, I'm a beginner when it comes to GLSL :)

Alfonse Reinheart
12-14-2015, 06:20 PM
Executing the just shown shader needs around 120 ms.

Executing it on what? What are you drawing? How much overdraw is there?

GClements
12-14-2015, 09:22 PM
I would expect using a global atomic counter to kill parallelism.

Unfortunately, most of the resources on contention-avoiding algorithms are for CUDA; there's almost nothing regarding the performance costs of atomics or coherence in GLSL.

JasonRay
12-16-2015, 05:03 PM
Thanks guys for your replies.


Executing it on what? What are you drawing? How much overdraw is there?

A 'sponza' scene. I tried 8 times to post a screen shot image or at least a link to a screen shot image here, but the OpenGL forum software 'denied' that. Please re-assemble the following URL: http://uploads.gamedev.net/monthly_08_2014/post-222765-0-11808500-1409079548.png
This is not exactly my scene but looks very close to mine.

I got this scene together with an OpenGL render framework by a fellow student.

My fragment shaders render the sponza scene smoothly when implementing e.g. screen space reflections that take the screen buffer data from a buffer texture of a previous, 'normal' scene render pass. But as soon as I involve one to two imageLoad() calls per pixel to get data from per-pixel linked lists instead of buffer textures, rendering one frame takes between 1.5 and 3 seconds.

As I already told, the fragment shaders I wrote before were all rather fast, but when doing imageLoad or imageStore calls, performance is suddenly very poor.

GClements
12-16-2015, 06:53 PM
As I already told, the fragment shaders I wrote before were all rather fast, but when doing imageLoad or imageStore calls, performance is suddenly very poor.

Have you confirmed that it's imageLoad/imageStore that causes the performance hit, rather than atomicCounterIncrement or imageAtomicExchange?

What effect, if any, does removing the "coherent" qualifier on the images have upon performance?

I'm aware that simply removing the calls and/or qualifier will break functionality, but it may also point toward a more viable approach.
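
One way to run the experiment suggested above is to compile variants of the list-building shader with individual operations switched off and time each variant. The following is only a sketch under stated assumptions: the three `USE_*` macros and the `SCREEN_WIDTH` and `MAX_NODES` constants are hypothetical names that do not appear in the original shader, and the 1024-pixel width is inferred from the `1024 * 768 * 16` limit in the original post. The variants deliberately produce wrong lists; they are meant for timing only.

```glsl
#version 420 core

// Hypothetical toggles (not in the original shader): set one to 0,
// recompile, and compare frame times to see which operation costs most.
#define USE_ATOMIC_COUNTER 1
#define USE_HEAD_EXCHANGE  1
#define USE_STORES         1

// Assumed from the 1024 * 768 * 16 limit in the original post.
const int SCREEN_WIDTH = 1024;
const int MAX_NODES    = 1024 * 768 * 16;

layout(binding = 0, r32i)    coherent uniform iimage2D     head_buffer;
layout(binding = 1, r32i)    coherent uniform iimageBuffer next_buffer;
layout(binding = 2, rgba32f) coherent uniform imageBuffer  data_buffer;
layout(binding = 3, offset = 0) uniform atomic_uint ac;

out vec4 out_col;

void main()
{
#if USE_ATOMIC_COUNTER
    int index = int(atomicCounterIncrement(ac));
#else
    // Stand-in index: one slot per pixel. The lists are no longer correct,
    // but the global counter is taken out of the picture.
    int index = int(gl_FragCoord.y) * SCREEN_WIDTH + int(gl_FragCoord.x);
#endif
    if (index >= MAX_NODES)
        discard;

#if USE_HEAD_EXCHANGE
    int indexOld = imageAtomicExchange(head_buffer, ivec2(gl_FragCoord.xy), index);
#else
    int indexOld = -1; // pretend every list was empty
#endif

#if USE_STORES
    imageStore(next_buffer, index, ivec4(indexOld, 0, 0, 0));
    imageStore(data_buffer, index, vec4(1.0, 1.0, 1.0, gl_FragCoord.z));
#endif

    out_col = vec4(1.0);
}
```

The same idea applies to the `coherent` question: dropping the qualifier from the three image declarations gives a fourth variant to time. Note that a driver may optimize away work whose result is unused, so the timings are a hint rather than an exact cost breakdown.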

Dark Photon
12-17-2015, 07:16 AM
I tried 8 times to post a screen shot image or at least a link to a screen shot image here, but the OpenGL forum software 'denied' that. Please re-assemble the following URL:

Hi JasonRay. Yes, new forum users aren't allowed to post links and images. This is to prevent folks from subscribing just to post spam (we've had this problem before). Sorry for the inconvenience. After you've posted a few more times, you'll be able to post links and images like everyone else here.

I've edited your post above and fixed-up the image link for other readers.

JasonRay
12-17-2015, 09:15 AM
Thanks for your reply and for assembling the link.


Have you confirmed that it's imageLoad/imageStore that causes the performance hit, rather than atomicCounterIncrement or imageAtomicExchange?

What effect, if any, does removing the "coherent" qualifier on the images have upon performance?

I'm aware that simply removing the calls and/or qualifier will break functionality, but it may also point toward a more viable approach.

Yes, it strongly seems that imageLoad() is the call that slows down e.g. my Screen Space Reflections fragment shader so badly.
That shader ran smoothly when reading the current fragment's depth value from a buffer texture. I replaced the read from the buffer texture with a traversal of the per-pixel linked list, and now I get render times in the range of seconds instead of milliseconds.

Here's an excerpt of the original shader code:



float current_z_buffer_value = get_depth(ray_pos_current_on_screen.x, ray_pos_current_on_screen.y, layer_index); // texture(depthbuffer_tex, vec2(ray_pos_current_on_screen.x, ray_pos_current_on_screen.y)).r;


Please note that the read from the buffer texture is now commented out, and I call get_depth() instead, which looks like this:



float get_depth(float x, float y, int layer_index)
{
    int iMax = min(debug_val2, layer_index);
    int next = imageLoad(head_buffer, ivec2(int(x * screendim.x), int(y * screendim.y))).r;
    for (int i = 0; i < iMax; i++)
    {
        next = imageLoad(next_buffer, next).r;
        if (next < 0)
            return 100000000.0; // HACK; use a constant here!
    }
    return imageLoad(data_buffer, next).w;
}


debug_val2 is 4. I tried 0, 1, 2, ... as well. With 0 the shader is hardly slowed down; with anything higher it quickly becomes too slow to be usable :(

I already tried removing 'coherent'. It had no (significant) effect on performance; in particular, the shader didn't become sufficiently fast.

I also tried 'readonly' and 'restrict'; neither had a measurable effect.

:doh:
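
For readers following the traversal, here is a hedged variant of the get_depth() function above that checks the head index before the first imageLoad() on next_buffer. It assumes that a negative index (for example -1) marks the end of a list and that head_buffer is cleared to that value each frame; the post does not state this, although the `next < 0` test suggests it. The `NO_DEPTH` constant and the `main()` wrapper are illustrative additions. This sketch is about correctness when a pixel has no fragments, not a claimed fix for the slowdown.

```glsl
#version 420 core

// Read-only pass: the lists were built by an earlier pass.
layout(binding = 0, r32i)    readonly uniform iimage2D     head_buffer;
layout(binding = 1, r32i)    readonly uniform iimageBuffer next_buffer;
layout(binding = 2, rgba32f) readonly uniform imageBuffer  data_buffer;

uniform vec2 screendim;  // viewport size in pixels, as in the excerpt
uniform int  debug_val2; // deepest layer to walk to, as in the excerpt

// Hypothetical name for the "no such layer" value used in the original.
const float NO_DEPTH = 100000000.0;

out vec4 out_col;

float get_depth(float x, float y, int layer_index)
{
    int iMax = min(debug_val2, layer_index);

    // Assumption: a negative index means "end of list" or "empty pixel".
    int node = imageLoad(head_buffer, ivec2(vec2(x, y) * screendim)).r;

    // Stop walking as soon as the list ends, so next_buffer is never
    // read with a negative index.
    for (int i = 0; i < iMax && node >= 0; i++)
        node = imageLoad(next_buffer, node).r;

    if (node < 0)
        return NO_DEPTH;
    return imageLoad(data_buffer, node).w;
}

void main()
{
    // Illustrative caller: look up layer 0 at the current pixel.
    vec2 uv = gl_FragCoord.xy / screendim;
    out_col = vec4(get_depth(uv.x, uv.y, 0));
}
```

Each call still performs up to iMax + 2 dependent image loads, so the cost per call grows with debug_val2 in the same way as in the original.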

JacobVR
01-06-2016, 03:41 AM
I have a version 420 GLSL vertex shader that writes to a texel (always the same one) of a uimage using the imageStore function. (In the actual release version, I use more than one texel, so this setup is just an extreme test case for performance measuring.)