Part of the Khronos Group

Thread: Random memory access inside shader.

  1. #1
    Junior Member
    Join Date: Jun 2014

    Random memory access inside shader.

    Is there any chance to optimize this piece of code?

    There is an 8x8 matrix; each element is a u64vec4[3]. Every value depends on the previous one, and, as you can see, each pixel/cell has its own temporary matrix and state.

    mmatrix and st are modified many times beforehand, but that doesn't affect performance much. The bottleneck is the pseudorandom row index, which is selected 64 times in total.
    These three lines cut performance in half or more.
    Code :
     mmatrix[c+0] ^= u64vec4(st[2].w, st[0].x, st[0].y, st[0].z);
     mmatrix[c+1] ^= u64vec4(st[0].w, st[1].x, st[1].y, st[1].z);
     mmatrix[c+2] ^= u64vec4(st[1].w, st[2].x, st[2].y, st[2].z);

    I've tried both fragment and compute shader versions. The compute shader (depending on the local_size settings) can be up to 30% faster than the fragment shader using exactly the same code, except for these lines:
    Code :
    const uvec2 coord = uvec2(gl_FragCoord.xy);
    // or
    const uvec2 coord = uvec2(uint(gl_GlobalInvocationID.x), uint(gl_GlobalInvocationID.y));

    I've tried making mmatrix shared. It is 2.3x faster, but I have only 32768 bytes of shared memory available, which fits only 5 matrices (32768 / 6144).
    With these settings, shared memory is 2.3x faster.
    Code :
    layout(local_size_x=1, local_size_y=5, local_size_z=1) in;

    This configuration without shared memory is 2.4x faster than those 5 threads, so shared memory makes no sense at all.
    Code :
    layout(local_size_x=8, local_size_y=8, local_size_z=1) in;

    Here is a part of this shader.
    Code :
    layout(binding = 0, rgba32ui) writeonly uniform uimage2DArray dynamicImage;
    uniform usampler2DArray bImage;
    // load input
    const uvec2 coord = uvec2(gl_GlobalInvocationID.xy);
    uvec4 i0 = texelFetch(bImage, ivec3(coord.x, coord.y, 0), 0);
    uvec4 i1 = texelFetch(bImage, ivec3(coord.x, coord.y, 1), 0);
    u64vec4 mmatrix[192];
    u64vec4 st[4];
    int stepp, ct, prev, row, ra;
    uint a, b;
    // compute mmatrix and st from the i0 and i1 values (omitted here)
    // The part below takes 98% of the time.
    prev = 7;
    row = 0;
    for (ct = 1; ct <= 8; ++ct) {
        stepp = (ct % 2 == 0) ? -1 : 3;
        // for each row
        for (int ik = 0; ik < 8; ++ik) {
            // select a pseudorandom row index, 8 possible values
            ra = int(uint(st[0].x) & 7U);
            a = prev * 24;
            uint c = ra * 24;
            b = row * 24;
            // for each column
            for (int jj = 0; jj < 8; ++jj) {
                st[0] ^= (mmatrix[a+0] + mmatrix[c+0]);
                st[1] ^= (mmatrix[a+1] + mmatrix[c+1]);
                st[2] ^= (mmatrix[a+2] + mmatrix[c+2]);
                mmatrix[b+0] ^= st[0];
                mmatrix[b+1] ^= st[1];
                mmatrix[b+2] ^= st[2];
                // performance killer
                mmatrix[c+0] ^= u64vec4(st[2].w, st[0].x, st[0].y, st[0].z);
                mmatrix[c+1] ^= u64vec4(st[0].w, st[1].x, st[1].y, st[1].z);
                mmatrix[c+2] ^= u64vec4(st[1].w, st[2].x, st[2].y, st[2].z);
                b += 3;
                c += 3;
                a += 3;
            }
            prev = row;
            row = (row + stepp) & 7;
        }
    }
    a = ra * 24;
    st[0] ^= mmatrix[a+0];
    st[1] ^= mmatrix[a+1];
    st[2] ^= mmatrix[a+2];
    uvec4 firstHalf, secondHalf;
    firstHalf.xy  = unpackUint2x32(st[0].x);
    firstHalf.zw  = unpackUint2x32(st[0].y);
    secondHalf.xy = unpackUint2x32(st[0].z);
    secondHalf.zw = unpackUint2x32(st[0].w);
    imageStore(dynamicImage, ivec3(coord.x, coord.y, 0), firstHalf);
    imageStore(dynamicImage, ivec3(coord.x, coord.y, 1), secondHalf);

    My GPU is AMD Radeon HD 7850.
    Obviously, all drivers handle GL_ARB_gpu_shader_int64 poorly.
    The AMD driver on Windows works OK (via GL_AMD_gpu_shader_int64). I haven't tried AMDGPU-PRO on Linux (my target platform); it is experimental for my GPU.
    RadeonSI on Linux is 40% faster with exactly the same code, both fragment and compute shader, but it doesn't work with shared variables (it freezes my OS).
    I've also tried an Nvidia GPU on Linux. They claim to support it, but they don't.

    I am pretty sure there is something that blocks performance here.

  2. #2
    Regular Contributor
    Join Date: May 2016
    Is there a reason why you don't use a shader storage buffer instead of the texture? You have an 8x8 matrix, each element is u64vec4[3], so I'd try:
    Code glsl:
    layout (std430, binding = 1) buffer MATRIXBLOCK {
        u64vec4 data[8][8][3];
    };

  3. #3
    Junior Member
    Join Date: Jun 2014
    The texture doesn't store the matrix. It is a TEXTURE_2D_ARRAY (RGBA32UI, 2 layers) and stores only the result (2 uvec4s).
    The matrix u64vec4[192] is a GLSL local array, and it differs for every pixel/cell.

    I've tried this approach:
    instead of a GLSL local array inside the shader, create an SSBO with size = sizeof(matrix) * imgWidth * imgHeight and use it as temporary storage/cache.

    instead of:
    Code :
    u64vec4 matrix[192];
    matrix[n] = someData;

    it will be:
    Code :
    struct matrixBlock {
        u64vec4 matrixData[192];
    };
    layout(binding = 2, std430) restrict buffer matrixBuffer {
        matrixBlock matrices[];
    };
    const uint gridRes = 32u; // for a 32x32 image
    uint idx = gl_GlobalInvocationID.y * gridRes + gl_GlobalInvocationID.x;
    matrices[idx].matrixData[n] = someData;
    I thought that using an SSBO as storage for a temporary local variable (the matrix) would be slower than a simple GLSL local array, since my shader reads/writes that matrix 1000+ times.

    I was wrong. I've also changed the grid (input/output) size from 2-dimensional (32x32) to 1-dimensional (1024x1). It proved that the AMD Windows driver is inefficient:
    as a result, I got the same performance as RadeonSI on Linux, i.e. 40% faster on Windows.
    It also depends on the grid and SSBO size: 64x64 is 21.5% slower, while 32x32 is 20% faster and 1024x1 is 40% faster than the GLSL local array version.
    My target is Linux, and on RadeonSI this approach changed nothing (+0.7%); performance is still the same.

    It seems that performance depends directly on the memory layout.
    I've tried to create a smaller SSBO, sized for a single workgroup, and access it as:
    Code :
    matrices[gl_LocalInvocationIndex].matrixData[n] = someData;
    It showed an insane performance boost, but it's not viable: threads/invocations overwrite each other before the final value is computed.
    It can only work with a 1,1,1 dispatch, which makes no sense:

    Code :
    glDispatchCompute(1, 1, 1);
