Part of the Khronos Group

The Industry's Foundation for High Performance Graphics


Thread: parallel work on various functions

  1. #1
    Newbie
    Join Date
    Dec 2017

    parallel work on various functions

    I have a problem with parallel work.

    This is an example from a larger program.

    uint4 a,b;

    a.s0 = b.s3 +1;
    a.s1 = b.s2 +2;
    a.s2 = b.s1 + 3;
    a.s3 = b.s0 + 4;

    I want to run this function in parallel; this is the structure of a larger loop. I tried to do it like this, but it doesn't work:

    Code (OpenCL C):
    for (int i = get_local_id(0); i < 4; i += 1) {
      if (i == 0)
        a.s0 = b.s3 + 1;
      if (i == 1)
        a.s1 = b.s2 + 2;
      if (i == 2)
        a.s2 = b.s1 + 3;
      if (i == 3)
        a.s3 = b.s0 + 4;
    }
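    For reference, the four assignments reverse b's lanes and add an increasing offset, which can be modeled in plain C as below (the struct, function name, and test values are illustrative; in OpenCL C the whole thing collapses to a single vector expression such as `a = b.s3210 + (uint4)(1u, 2u, 3u, 4u);`):

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Plain-C model of the uint4 computation from the post:
       reverse the lanes of b and add 1..4. */
    typedef struct { uint32_t s0, s1, s2, s3; } uint4_t;

    static uint4_t compute(uint4_t b) {
        uint4_t a;
        a.s0 = b.s3 + 1;
        a.s1 = b.s2 + 2;
        a.s2 = b.s1 + 3;
        a.s3 = b.s0 + 4;
        return a;
    }

    int main(void) {
        uint4_t b = {10, 20, 30, 40};
        uint4_t a = compute(b);
        assert(a.s0 == 41 && a.s1 == 32 && a.s2 == 23 && a.s3 == 14);
        return 0;
    }
    ```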
    Last edited by Dark Photon; 12-31-2017 at 04:09 PM.

  2. #2
    Super Moderator OpenGL Guru dorbie's Avatar
    Join Date
    Jul 2000
    Bay Area, CA, USA
    This is off topic so I'll be closing this thread, but in general you want to structure your parallel data so that each thread performs substantial work on uncontested memory, both to be worth the overhead and to actually run in parallel. Your code sample doesn't make much sense: it has an excess of branches (fine-grained conditionals are to be avoided even in single-threaded code) and it thrashes memory as threads contend for access to the same cache line.

    You should be doing something like:
    1) arrange your data for parallel work
    2) start n threads of work, each looping over its own (mostly) unique data
    3) consolidate the results.
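    The three steps above can be sketched with POSIX threads as follows; the sizes, names (DATA_SIZE, N_THREADS, partial[]), and the summing workload are illustrative, not from the original post:

    ```c
    #include <pthread.h>
    #include <assert.h>

    #define DATA_SIZE 1024
    #define N_THREADS 4
    #define BATCH (DATA_SIZE / N_THREADS)   /* assumes evenly divisible */

    static int data[DATA_SIZE];
    static long partial[N_THREADS];

    static void *worker(void *arg) {
        int t = (int)(long)arg;             /* thread number, 0 .. N_THREADS-1 */
        long sum = 0;
        for (int i = BATCH * t; i < BATCH * (t + 1); i++)
            sum += data[i];                 /* step 2: loop over own slice */
        partial[t] = sum;
        return NULL;
    }

    static long run_parallel_sum(void) {
        for (int i = 0; i < DATA_SIZE; i++) /* step 1: arrange the data */
            data[i] = 1;

        pthread_t tid[N_THREADS];
        for (long t = 0; t < N_THREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);

        long total = 0;                     /* step 3: consolidate results */
        for (int t = 0; t < N_THREADS; t++) {
            pthread_join(tid[t], NULL);
            total += partial[t];
        }
        return total;
    }

    int main(void) {
        assert(run_parallel_sum() == DATA_SIZE);
        return 0;
    }
    ```

    Each thread touches only its own contiguous slice and writes to its own slot of partial[], so the threads never contend for the same memory until the final consolidation.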

    The following would be better; it turns your large loop inside out, I think (although I haven't seen your code):

    int thread_num = get_local_id(0);          // assumes threads are numbered 0 .. thread_count-1
    int batch_size = data_size / thread_count; // assumes data_size is evenly divisible
    for (int i = batch_size * thread_num; i < batch_size * (thread_num + 1); i++) {
        // do work here; this becomes your LARGE loop,
        // and launching n threads effectively becomes your small loop
    }

    Note that this is only a win if batch_size is reasonably large. It is better than your code because it allows threads to work uncontended on contiguous blocks of memory, exploiting cache-line fetch, thread affinity, etc., and it properly amortizes the overhead of launching threads while eliminating a LOT of unnecessary conditional branching.
    It also assumes some compiler optimization in the inner loop. I'd be tempted to unroll the inner loop and use vector parallel instructions as a further optimization if the compiler doesn't handle it well.
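    A minimal sketch of that unrolling idea in plain C, assuming a hypothetical per-thread batch whose size is a multiple of 4: keeping four independent accumulators lets the compiler schedule the adds in parallel or map them onto vector instructions.

    ```c
    #include <assert.h>

    /* Illustrative manual unrolling of an inner loop by four.
       sum_batch and its workload are hypothetical examples, not
       from the original post; batch_size must be a multiple of 4. */
    static long sum_batch(const int *data, int batch_size) {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i < batch_size; i += 4) {
            s0 += data[i];
            s1 += data[i + 1];
            s2 += data[i + 2];
            s3 += data[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }

    int main(void) {
        int data[16];
        for (int i = 0; i < 16; i++)
            data[i] = i;
        assert(sum_batch(data, 16) == 120);  /* 0 + 1 + ... + 15 */
        return 0;
    }
    ```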

    There are other patterns of parallelism of course but based on your sample this is probably what you need.
    Last edited by dorbie; 12-28-2017 at 07:19 PM.
