Random memory access inside shader.



Feature420
12-01-2017, 10:41 AM
Is there any way to optimize this piece of code?

There is an 8x8 matrix. Each element is a u64vec4[3]. Every value depends on the previous one. As you can see, each pixel/cell has its own temporary matrix and state.

A random row index is selected 64 times in total. The matrix and state are modified many times before that, but it doesn't affect performance much.

The random row index is the bottleneck.
These lines reduce performance by half or even more.

mmatrix[c+0] ^= u64vec4(st[2].w, st[0].x, st[0].y, st[0].z);
mmatrix[c+1] ^= u64vec4(st[0].w, st[1].x, st[1].y, st[1].z);
mmatrix[c+2] ^= u64vec4(st[1].w, st[2].x, st[2].y, st[2].z);

I've tried both fragment and compute shader versions. The compute shader (depending on the local_size settings) can be up to 30% faster than the fragment shader using exactly the same code except for these lines:

const uvec2 coord = uvec2(gl_FragCoord.xy);
// or
const uvec2 coord = uvec2(uint(gl_GlobalInvocationID.x), uint(gl_GlobalInvocationID.y));

I've tried making mmatrix shared, but I have only 32768 bytes of shared memory available, which is only 5 matrices (32768/6144).
With these settings, shared memory is 2.3x faster:

layout(local_size_x=1, local_size_y=5, local_size_z=1) in;

This configuration, without shared memory, is 2.4x faster than those 5 threads, so shared memory makes no sense at all:

layout(local_size_x=8, local_size_y=8, local_size_z=1) in;
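For reference, the shared memory arithmetic above can be checked on the host side. This is only a sketch in C++: the 32768-byte limit and the 192-element matrix come from the post, while the helper names (matricesThatFit, sharedBytesNeeded) are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sizes taken from the post: one u64vec4 holds four 64-bit values.
constexpr std::size_t kBytesPerU64Vec4  = 4 * sizeof(std::uint64_t);             // 32
constexpr std::size_t kVectorsPerMatrix = 8 * 8 * 3;                             // 192
constexpr std::size_t kBytesPerMatrix   = kVectorsPerMatrix * kBytesPerU64Vec4;  // 6144

// Hypothetical helper: how many whole matrices fit into a shared memory budget.
constexpr std::size_t matricesThatFit(std::size_t sharedBytes) {
    return sharedBytes / kBytesPerMatrix;
}

// Hypothetical helper: shared memory needed if every invocation of a
// work group keeps its own matrix in shared memory.
inline std::size_t sharedBytesNeeded(unsigned localSizeX, unsigned localSizeY) {
    assert(localSizeX > 0 && localSizeY > 0);
    return std::size_t(localSizeX) * localSizeY * kBytesPerMatrix;
}
```

With these numbers a 1x5 work group needs 30720 bytes and fits, while an 8x8 work group would need 393216 bytes, twelve times the 32768 bytes available.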

Here is a part of this shader.

layout(binding = 0, rgba32ui) writeonly uniform uimage2DArray dynamicImage;
uniform usampler2DArray bImage;

// load input
const uvec2 coord = uvec2(uint(gl_GlobalInvocationID.x), uint(gl_GlobalInvocationID.y));
uvec4 i0 = texelFetch(bImage, ivec3(coord.x, coord.y, 0), 0);
uvec4 i1 = texelFetch(bImage, ivec3(coord.x, coord.y, 1), 0);

u64vec4 mmatrix[192];
u64vec4 st[4];
int stepp, ct, prev, row, ra;
uint a,b;

// compute matrix and state, from i0 and i1 values
//....

// This part takes 98% of time.
prev = 7;
row = 0;
for (ct = 1; ct <= 8; ++ct)
{
    stepp = (ct % 2 == 0) ? -1 : 3;

    // for each row
    for (int ik = 0; ik < 8; ++ik)
    {
        // select a pseudorandom row index. 8 possible values.
        ra = int((uint(st[0].x)) & (7U));

        a = prev*24;
        uint c = ra*24;
        b = row*24;

        // for each column
        for (int jj = 0; jj < 8; ++jj)
        {
            st[0] ^= (mmatrix[a+0] + mmatrix[c+0]);
            st[1] ^= (mmatrix[a+1] + mmatrix[c+1]);
            st[2] ^= (mmatrix[a+2] + mmatrix[c+2]);

            rotateState(st);

            mmatrix[b+0] ^= st[0];
            mmatrix[b+1] ^= st[1];
            mmatrix[b+2] ^= st[2];

            // performance killer
            mmatrix[c+0] ^= u64vec4(st[2].w, st[0].x, st[0].y, st[0].z);
            mmatrix[c+1] ^= u64vec4(st[0].w, st[1].x, st[1].y, st[1].z);
            mmatrix[c+2] ^= u64vec4(st[1].w, st[2].x, st[2].y, st[2].z);

            b += 3;
            c += 3;
            a += 3;
        }
        prev = row;
        row = (row + stepp) & (7);
    }
}

a = ra*24;
st[0] ^= mmatrix[a+0];
st[1] ^= mmatrix[a+1];
st[2] ^= mmatrix[a+2];

uvec4 firstHalf, secondHalf;
firstHalf.xy = unpackUint2x32(st[0].x);
firstHalf.zw = unpackUint2x32(st[0].y);
secondHalf.xy = unpackUint2x32(st[0].z);
secondHalf.zw = unpackUint2x32(st[0].w);

imageStore(dynamicImage, ivec3(coord.x, coord.y, 0), firstHalf);
imageStore(dynamicImage, ivec3(coord.x, coord.y, 1), secondHalf);
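To make the access pattern of the hot loop easier to follow, here is a small CPU-side sketch in C++ that replays only the prev/row bookkeeping from the shader above, without any matrix data. The function name rowSchedule and the RowStep struct are made up for illustration. It shows that prev and row follow a fixed schedule that is identical for every pixel, so only ra is data-dependent.

```cpp
#include <cassert>
#include <vector>

// One (prev, row) pair as seen at the top of the "for each row" loop.
struct RowStep { int prev; int row; };

// Replays only the index bookkeeping of the hot loop (no matrix data).
// 'passes' corresponds to the upper bound of ct (8 in the shader).
inline std::vector<RowStep> rowSchedule(int passes) {
    assert(passes >= 0);
    std::vector<RowStep> steps;
    int prev = 7;
    int row = 0;
    for (int ct = 1; ct <= passes; ++ct) {
        const int stepp = (ct % 2 == 0) ? -1 : 3;
        for (int ik = 0; ik < 8; ++ik) {
            steps.push_back({prev, row});
            prev = row;
            row = (row + stepp) & 7;  // wraps into 0..7, also for stepp = -1
        }
    }
    return steps;
}
```

For the first pass this gives rows 0, 3, 6, 1, 4, 7, 2, 5 and for the second pass 0, 7, 6, 5, 4, 3, 2, 1, so each pass touches every row exactly once in a known order.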

My GPU is an AMD Radeon HD 7850.
Obviously all drivers are bad with GL_ARB_gpu_shader_int64.
The AMD driver on Windows works OK (GL_AMD_gpu_shader_int64). I haven't tried AMDGPU PRO on Linux (my target platform); it is experimental for my GPU.
RadeonSI on Linux is 40% faster with exactly the same code, for both the fragment and compute shader versions. It doesn't work with shared variables (it freezes my OS).
I've also tried an Nvidia GPU on Linux. They claim that they support it, but they don't.

I am pretty sure there is something that blocks performance here.

john_connor
12-02-2017, 02:31 AM
Is there a reason why you don't use a shader storage buffer instead of the texture? You have an 8x8 matrix, each element is u64vec4[3], so I'd try to use:
layout (std430, binding = 1) buffer MATRIXBLOCK {
u64vec4 data[8][8][3];
};

Feature420
12-02-2017, 12:55 PM
The texture doesn't store the matrix. It is a TEXTURE_2D_ARRAY (RGBA32ui, 2 layers). It stores only the result (2 uvec4s).
The matrix u64vec4[192] is a GLSL local array, and it differs for every pixel/cell.

I've tried this approach.
Instead of a GLSL local array inside the shader, create an SSBO with size = sizeof(matrix)*imgWidth*imgHeight and use it as temporary storage/cache.

instead of:

u64vec4 matrix[192];
matrix[n] = someData;

it will be:

struct matrixBlock
{
u64vec4 matrixData[192];
};

layout(binding = 2, std430) restrict buffer matrixBuffer
{
matrixBlock matrices[];
};

uint gridRes = 32; // for 32x32 image
uint idx = gl_GlobalInvocationID.y*gridRes + gl_GlobalInvocationID.x;
matrices[idx].matrixData[n] = someData;

I thought that using an SSBO as storage for a temporary local variable (the matrix) would be slower than a simple GLSL local array. My shader reads/writes that matrix 1000+ times.
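The buffer size and the index computation above can be sketched on the host side like this. It is C++ and only an illustration: the names scratchBufferBytes and flatIndex are made up, and it assumes std430 packs the u64vec4[192] array with a 32-byte stride and no extra padding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// One matrix: 192 u64vec4 values of 32 bytes each (assumed std430 stride).
constexpr std::size_t kScratchBytesPerMatrix = 192 * 4 * sizeof(std::uint64_t);  // 6144

// Hypothetical helper: SSBO size for one matrix per pixel/cell,
// i.e. sizeof(matrix)*imgWidth*imgHeight from the post.
inline std::size_t scratchBufferBytes(std::size_t imgWidth, std::size_t imgHeight) {
    return kScratchBytesPerMatrix * imgWidth * imgHeight;
}

// Hypothetical helper: same mapping as
// idx = gl_GlobalInvocationID.y*gridRes + gl_GlobalInvocationID.x.
inline std::size_t flatIndex(std::size_t x, std::size_t y, std::size_t gridRes) {
    assert(x < gridRes);
    return y * gridRes + x;
}
```

That works out to 6291456 bytes for a 32x32 or 1024x1 grid and 25165824 bytes for 64x64.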

I was wrong. I've also changed the grid (input/output) size from two-dimensional (32x32) to one-dimensional (1024x1). It showed that the AMD Windows driver is inefficient.
As a result, I've got the same performance as RadeonSI on Linux, i.e. 40% faster on Windows.
It also depends on the grid and SSBO size. For 64x64 it is 21.5% slower, while for 32x32 it is 20% and for 1024x1 it is 40% faster than the GLSL local array version.
My target is Linux, and on RadeonSI this approach changed almost nothing (+0.7%). Performance is still the same.

It seems that performance depends directly on the memory layout.
I've tried creating a smaller SSBO, the size of one work group, and accessing it as:

matrices[gl_LocalInvocationIndex].matrixData[n] = someData;

It showed an insane performance boost, but it's not viable. Threads/invocations overwrite each other before the final value is computed.
It can work only in 1,1,1 mode, which makes no sense.


glDispatchCompute(1, 1, 1);
glMemoryBarrier(GL_TEXTURE_FETCH_BARRIER_BIT);
glDispatchCompute(1, 1, 1);
...
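A small sketch of why the work-group-sized SSBO gets overwritten, assuming the cause is that invocations from different work groups share the same gl_LocalInvocationIndex. It is C++ on the host side, and the Invocation struct and both helper names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Identifies one invocation: which work group it is in and its local x/y.
struct Invocation { unsigned group; unsigned lx; unsigned ly; };

// gl_LocalInvocationIndex for a 2D work group (local_size_z = 1).
inline std::size_t localIndex(const Invocation& inv, unsigned sizeX, unsigned sizeY) {
    assert(inv.lx < sizeX && inv.ly < sizeY);
    return std::size_t(inv.ly) * sizeX + inv.lx;
}

// Hypothetical alternative: a slot that is unique across work groups.
// It needs a buffer sized for all invocations, like the full-size SSBO above.
inline std::size_t globalSlot(const Invocation& inv, unsigned sizeX, unsigned sizeY) {
    return std::size_t(inv.group) * sizeX * sizeY + localIndex(inv, sizeX, sizeY);
}
```

With an 8x8 work group, local (3, 2) maps to slot 19 in every work group, so two groups running at the same time write to the same matrix. The group-unique slot avoids that but brings back the full-size buffer.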