Is there any way to optimize this piece of code?
There is an 8x8 matrix. Each element is a u64vec4[3]. Every value depends on the previous one, so each pixel/cell has its own temporary matrix and state.
The matrix and state are modified many times beforehand, but that doesn't affect performance much. The bottleneck is the selection of a pseudorandom row index (done 64 times in total).
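For scale, this is the per-thread working set the description above implies (sizes computed from the 8x8 layout; the comments are my arithmetic, not part of the shader):

```glsl
// 8 rows * 8 columns * 3 u64vec4 per cell = 192 u64vec4
// 192 * 4 lanes * 8 bytes = 6144 bytes of temporary matrix per thread,
// plus the 4 * 32 = 128-byte running state.
u64vec4 mmatrix[192];
u64vec4 st[4];
```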
The following lines reduce performance by half or even more:
mmatrix[c+0] ^= u64vec4(st[2].w, st[0].x, st[0].y, st[0].z);
mmatrix[c+1] ^= u64vec4(st[0].w, st[1].x, st[1].y, st[1].z);
mmatrix[c+2] ^= u64vec4(st[1].w, st[2].x, st[2].y, st[2].z);
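For context, these three XORs treat st[0..2] as a flat array of twelve 64-bit lanes and rotate it right by one lane before folding it into row c. An equivalent sketch using vector swizzles (same data movement, just written compactly):

```glsl
// st[0] holds lanes 0..3, st[1] lanes 4..7, st[2] lanes 8..11.
// The value XOR-ed into row c is the state rotated right by one lane.
mmatrix[c+0] ^= u64vec4(st[2].w, st[0].xyz); // lanes 11, 0, 1, 2
mmatrix[c+1] ^= u64vec4(st[0].w, st[1].xyz); // lanes  3, 4, 5, 6
mmatrix[c+2] ^= u64vec4(st[1].w, st[2].xyz); // lanes  7, 8, 9, 10
```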
I've tried both fragment and compute shader versions. The compute shader (depending on the local_size settings) can be up to 30% faster than the fragment shader with exactly the same code, except for these lines:
const uvec2 coord = uvec2(gl_FragCoord.xy);
// or
const uvec2 coord = uvec2(uint(gl_GlobalInvocationID.x), uint(gl_GlobalInvocationID.y));
I've tried making mmatrix shared. It is 2.3x faster, but I have only 32768 bytes of shared memory available, which fits only 5 matrices (32768 / 6144). Shared memory is 2.3x faster with these settings:
layout(local_size_x=1, local_size_y=5, local_size_z=1) in;
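For reference, the shared-memory variant was set up roughly like this (a sketch reconstructed from the description above; the name smatrix and the per-thread slicing are my assumptions):

```glsl
#version 430
#extension GL_ARB_gpu_shader_int64 : require

layout(local_size_x = 1, local_size_y = 5, local_size_z = 1) in;

// 5 threads per workgroup * 192 u64vec4 * 32 bytes = 30720 bytes,
// just under the 32768-byte shared-memory limit.
shared u64vec4 smatrix[5 * 192];

void main()
{
    // each thread works on its own 192-element slice
    const uint base = gl_LocalInvocationID.y * 192u;
    // ... same algorithm as below, indexing smatrix[base + a],
    // smatrix[base + b], smatrix[base + c] instead of mmatrix[...]
}
```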
This configuration without shared memory is 2.4x faster than those 5 threads, so shared memory makes no sense at all:
layout(local_size_x=8, local_size_y=8, local_size_z=1) in;
Here is the relevant part of the shader.
layout(binding = 0, rgba32ui) writeonly uniform uimage2DArray dynamicImage;
uniform usampler2DArray bImage;
// load input
const uvec2 coord = uvec2(uint(gl_GlobalInvocationID.x), uint(gl_GlobalInvocationID.y));
uvec4 i0 = texelFetch(bImage, ivec3(coord.x, coord.y, 0), 0);
uvec4 i1 = texelFetch(bImage, ivec3(coord.x, coord.y, 1), 0);
u64vec4 mmatrix[192];
u64vec4 st[4];
int stepp, ct, prev, row, ra;
uint a,b;
// compute matrix and state, from i0 and i1 values
//....
// This part takes 98% of time.
prev = 7;
row = 0;
for (ct = 1; ct <= 8; ++ct)
{
    stepp = (ct % 2 == 0) ? -1 : 3;
    // for each row
    for (int ik = 0; ik < 8; ++ik)
    {
        // select a pseudorandom row index; 8 possible values
        ra = int(uint(st[0].x) & 7U);
        a = prev * 24;
        uint c = ra * 24;
        b = row * 24;
        // for each column
        for (int jj = 0; jj < 8; ++jj)
        {
            st[0] ^= (mmatrix[a+0] + mmatrix[c+0]);
            st[1] ^= (mmatrix[a+1] + mmatrix[c+1]);
            st[2] ^= (mmatrix[a+2] + mmatrix[c+2]);
            rotateState(st);
            mmatrix[b+0] ^= st[0];
            mmatrix[b+1] ^= st[1];
            mmatrix[b+2] ^= st[2];
            // performance killer
            mmatrix[c+0] ^= u64vec4(st[2].w, st[0].x, st[0].y, st[0].z);
            mmatrix[c+1] ^= u64vec4(st[0].w, st[1].x, st[1].y, st[1].z);
            mmatrix[c+2] ^= u64vec4(st[1].w, st[2].x, st[2].y, st[2].z);
            b += 3;
            c += 3;
            a += 3;
        }
        prev = row;
        row = (row + stepp) & 7;
    }
}
a = ra*24;
st[0] ^= mmatrix[a+0];
st[1] ^= mmatrix[a+1];
st[2] ^= mmatrix[a+2];
uvec4 firstHalf, secondHalf;
firstHalf.xy = unpackUint2x32(st[0].x);
firstHalf.zw = unpackUint2x32(st[0].y);
secondHalf.xy = unpackUint2x32(st[0].z);
secondHalf.zw = unpackUint2x32(st[0].w);
imageStore(dynamicImage, ivec3(coord.x, coord.y, 0), firstHalf);
imageStore(dynamicImage, ivec3(coord.x, coord.y, 1), secondHalf);
My GPU is AMD Radeon HD 7850.
Apparently, all drivers handle GL_ARB_gpu_shader_int64 poorly.
The AMD driver on Windows works OK (via GL_AMD_gpu_shader_int64). I haven't tried AMDGPU-PRO on Linux (my target platform); it is experimental for my GPU.
RadeonSI on Linux is 40% faster with exactly the same code, in both the fragment and compute shader versions. However, it doesn't work with shared variables (it freezes my OS).
I've also tried an Nvidia GPU on Linux. They claim to support it, but they don't.
I am pretty sure something here is blocking performance.