parallel work on various functions

I have a problem with parallel work.

This is an example from a larger program.

uint4 a,b;

a.s0 = b.s3 + 1;
a.s1 = b.s2 + 2;
a.s2 = b.s1 + 3;
a.s3 = b.s0 + 4;

I want to make this function parallel; this is the structure of a larger loop. I tried to do it like this, but it does not work:


for(int i = get_local_id(0); i < 4 ; i +=1)
{
  if (i==0)
  {
    a.s0 =  b.s3 +1;
  }
  if (i==1)
  {
    a.s0 =  b.s3 +1;
  }
  if (i==2)
  {
    a.s0 =  b.s3 +1;
  }
  if (i==3)
  {
    a.s0 =  b.s3 +1;
  }
}

This is off topic, so I'll be closing this thread, but in general you want to structure your parallel data so that each thread performs substantial work on uncontested memory; otherwise the work is not worth the overhead and does not actually run in parallel. Your code sample doesn't make much sense: it has an excess of branches (fine-grained conditionals are to be avoided even when single-threaded) and it thrashes memory as threads contend for access to the same cache line.

You should be doing something like:

  1. arrange data for parallel work
  2. start n threads of work with loops over their own (mostly) unique data
  3. consolidate results.
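
As a rough illustration of those three steps, here is a minimal sketch in plain C. It is not taken from your program: the names THREAD_COUNT, DATA_SIZE, sum_batch and batched_sum are hypothetical, summing is just a stand-in for "work", and the threads are simulated by an ordinary loop so that only the data layout is shown. In a real kernel each call to sum_batch would run in its own thread.

```c
#include <stddef.h>

/* Hypothetical sizes chosen for illustration only. */
#define THREAD_COUNT 4
#define DATA_SIZE 16 /* assumed evenly divisible by THREAD_COUNT */

/* Step 2: the work one thread does on its own contiguous batch. */
static unsigned sum_batch(const unsigned *data, size_t begin, size_t end)
{
    unsigned acc = 0;
    for (size_t i = begin; i < end; i++)
        acc += data[i];
    return acc;
}

/* Steps 1 to 3, with the threads simulated by a plain loop. */
unsigned batched_sum(const unsigned *data)
{
    /* Step 1: one private result slot and one index range per thread. */
    unsigned partial[THREAD_COUNT];
    size_t batch_size = DATA_SIZE / THREAD_COUNT;

    /* Step 2: each (simulated) thread touches only its own batch and slot. */
    for (size_t t = 0; t < THREAD_COUNT; t++)
        partial[t] = sum_batch(data, batch_size * t, batch_size * (t + 1));

    /* Step 3: consolidate the per-thread results once all threads are done. */
    unsigned total = 0;
    for (size_t t = 0; t < THREAD_COUNT; t++)
        total += partial[t];
    return total;
}
```

The point of the layout is that no two threads ever write to the same location until the final consolidation step.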

The following would be better. I think it turns your large loop inside out (although I haven't seen your code):

int thread_num = get_local_id(0);          // assumes threads are numbered 0 .. thread_count-1
int thread_count = get_local_size(0);      // number of threads in the work-group
int batch_size = datasize / thread_count;  // assumes datasize is evenly divisible
for (int i = batch_size * thread_num; i < batch_size * (thread_num + 1); i++) {
    // do work here; this becomes your LARGE loop, and launching n threads effectively becomes your small loop
}

Note that this is only a win if batch_size is reasonably large. It is better than your code because it lets threads work uncontended on contiguous blocks of memory (exploiting cache line fetches, thread affinity, and so on), properly amortizes the overhead of launching threads, and eliminates a LOT of unnecessary conditional branching.
It also assumes some compiler optimization in the inner loop. If the compiler doesn't handle it well, I'd be tempted to unroll the inner loop and use vector instructions as a further optimization.
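
The loop bounds above assume that datasize divides evenly by thread_count. If that does not hold in your program, one possible way to compute the bounds is sketched below in plain C. The helper name batch_range is hypothetical; it rounds the batch size up and clamps both ends to datasize, so the last threads may get a short or empty batch.

```c
#include <stddef.h>

/* Hypothetical helper: computes the half-open range [*begin, *end)
 * that thread number thread_num should process. */
void batch_range(size_t datasize, size_t thread_count, size_t thread_num,
                 size_t *begin, size_t *end)
{
    /* Round up so that thread_count batches cover all of the data. */
    size_t batch_size = (datasize + thread_count - 1) / thread_count;

    size_t b = batch_size * thread_num;
    if (b > datasize)
        b = datasize; /* this thread has nothing to do */

    size_t e = b + batch_size;
    if (e > datasize)
        e = datasize; /* the last non-empty batch may be short */

    *begin = b;
    *end = e;
}
```

Each thread would then loop from begin to end instead of using the fixed batch_size bounds. With 10 items and 4 threads, for example, the threads get 3, 3, 3 and 1 items.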

There are other patterns of parallelism of course but based on your sample this is probably what you need.