Data Transfer, comparing efficiency

Atomic vs. non-atomic…
I know that atomic operations normally provide quite different functionality, so the decision to use atomic data operations (such as with the new shader storage buffer objects from OpenGL 4.3) rather than non-atomic data storage and operations is usually based on what needs to be done. If it doesn't need to be atomic, the general advice is not to make it so.

But I'm cursed with too much outside-the-box thinking, and I've managed to find ways to do what I want either atomically or non-atomically. There are considerable benefits to the atomic route, so I can't just dismiss it and go with the non-atomic route too easily.
Thus I've come here in somewhat desperate need of advice (plz plz plz), because the thought of writing and testing two completely different versions of the incredibly complex mess I'm working with has got me pacing in circles…

In short, I haven't gotten a chance to set up any good performance evaluations with the new OpenGL 4.3 additions, and I'm wondering if anyone could give me an estimate of the relative performance of the following:
Transferring a large amount of data from very large indexed VBOs (data transfer is the bottleneck) to the shaders and then outputting a significantly smaller amount of data with transform feedback (TF)
Or…
Transferring that same large amount of data from the new shader storage buffers into compute shaders atomically and then changing the same smaller amount of data in the shader storage buffer atomically from the shaders. Also, with this much data it's fairly unlikely that the exact same value would be edited at the same time by multiple parallel invocations, in case that matters…
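For what it's worth, the atomic route I have in mind would look very roughly like the sketch below (buffer layouts, names, workgroup size and the calculation are all placeholders, not my real code; in the sketch only the writes to the small output go through the atomic functions, since the reads of the large input can stay ordinary buffer loads):

#version 430

layout(local_size_x = 64) in;

// Large read-only input (placeholder layout and names).
layout(std430, binding = 0) readonly buffer InputData {
    float inData[];
};

// Much smaller output; several invocations may hit the same slot,
// hence the atomic update (SSBO atomics operate on int/uint members).
layout(std430, binding = 1) buffer OutputData {
    uint outData[];
};

void main() {
    uint i = gl_GlobalInvocationID.x;

    // Placeholder calculation on the large input.
    uint result = uint(inData[i] * 1000.0);

    // Placeholder mapping from input element to output slot.
    uint slot = i % 256u;

    // Accumulate into the shared output slot atomically.
    atomicAdd(outData[slot], result);
}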

Almost obviously, the atomic route shouldn't be as fast, but I really need an estimate of how much slower it would be. I think I can probably ignore what happens after the main data has been transferred into the shaders as far as performance is concerned. I know there are many factors to consider, but can anyone help me out with an estimate or explanation?

Sounds like a reduction operation of some sort. More details on the algorithm would be useful; it's not clear why you're even talking about atomics yet.

Is the reduction very data-parallel (i.e. can many stream processors be reducing different parts of the data at the same time)? For instance, could you do it with a ping-pong reduction? If so, you might get some speedup beyond that with a compute kernel reduction, where you can accelerate things by making use of the shared memory on the multiprocessors, optimizing the memory access pattern for the GPU, and performing multiple blocks of work per thread to distribute the work better. I don't know yet how much of this the compute shader lets you do, though… haven't picked through it yet.
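To illustrate the shared-memory part, here's a rough sketch of a standard tree reduction in a compute shader (the buffer names, workgroup size and the use of a sum are just examples, not tied to your calculation):

#version 430

layout(local_size_x = 256) in;

layout(std430, binding = 0) readonly buffer InputData    { float inData[];  };
layout(std430, binding = 1) writeonly buffer PartialSums { float outData[]; };

// One slot of fast on-chip shared memory per invocation in the work group.
shared float sdata[256];

void main() {
    uint lid = gl_LocalInvocationID.x;
    uint gid = gl_GlobalInvocationID.x;

    // Each invocation loads one element into shared memory.
    sdata[lid] = inData[gid];
    barrier();

    // Tree reduction within the work group: 256 -> 128 -> ... -> 1.
    for (uint s = gl_WorkGroupSize.x / 2u; s > 0u; s >>= 1u) {
        if (lid < s) {
            sdata[lid] += sdata[lid + s];
        }
        barrier();
    }

    // One partial result per work group; a second pass (or ping-pong)
    // reduces these partials further.
    if (lid == 0u) {
        outData[gl_WorkGroupID.x] = sdata[0];
    }
}

Reading multiple elements per invocation and arranging the loads for coalescing can be layered on top of that, which is roughly what I meant by distributing the work better.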

Well, it’s pretty complicated, bordering on insanely complicated. There isn’t really one algorithm, more like solid walls of calculations o_O…
Anyway, I suppose I could elaborate on some of the simpler details:
Non-atomically, the calculations are done in the tessellation shaders, with the patch primitives being the batches of data that are operated on (and some of this is then rendered; I'm using OpenGL for a reason). I can get by without needing any of the data outside the patch, roughly as in the sketch below…
Atomically, I would use compute shaders, but most of the benefit of this redesign would come from reading and writing data outside of the group of data that is being operated on (it's hard to explain…), so it's a bit of a different story…
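To give a rough idea of the non-atomic side, this is the kind of thing I mean by working only within the patch (a toy tessellation evaluation shader; the triangle domain, the input name and the averaging are placeholders, not my actual calculations):

#version 430

layout(triangles, equal_spacing, cw) in;

// Per-vertex data for the current patch only (placeholder name);
// nothing outside this patch is ever read.
in vec4 vElementData[];

out vec4 teResult;

void main() {
    // Placeholder calculation over just the vertices of this patch.
    vec4 acc = vec4(0.0);
    for (int v = 0; v < 3; ++v) {
        acc += vElementData[v];
    }
    teResult = acc / 3.0;

    // Standard barycentric interpolation of the patch corners.
    gl_Position = gl_TessCoord.x * gl_in[0].gl_Position
                + gl_TessCoord.y * gl_in[1].gl_Position
                + gl_TessCoord.z * gl_in[2].gl_Position;
}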
All I really want to know, though, is the difference in bandwidth between feeding large amounts of data into shaders from indexed VBOs versus atomically from shader storage buffers, without assuming any optimizations are made. This is my bottleneck and I can't really get by with less data. The same data element can be indexed multiple times in different groups of data, but indexed VBOs are already somewhat capable of exploiting that (repeated indices can be served from the vertex cache), so I don't know that the compute shader route would necessarily have an advantage there. It's also worth noting that my indices can be dynamic, so any optimization of memory access patterns has to be able to adapt to the changes…
So, does anyone know how much longer it takes to transfer an average chunk of data atomically to the shaders compared to the same amount of data fed as vertex data from indexed VBOs? Usually I'd expect a decent hit from atomic data transfers because of the extra steps needed to maintain linearizability, but if the main cost is just reading in all the data, as it is in my case, then it's less clear, and I don't know how efficient the new shader storage buffers will be for large data transfers…
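To make the comparison concrete, the compute route would have to do the indexed fetch by hand against the storage buffers, something like this rough sketch (buffer names, sizes and the calculation are placeholders):

#version 430

layout(local_size_x = 64) in;

// The same kind of indexed access an indexed VBO gives "for free",
// done by hand: the same element can appear under many dynamic indices,
// but unlike indexed vertex fetch there is no post-transform cache
// skipping repeated work for repeated indices.
layout(std430, binding = 0) readonly buffer Indices  { uint indices[];  };
layout(std430, binding = 1) readonly buffer Elements { vec4 elements[]; };
layout(std430, binding = 2) buffer Results           { uint results[];  };

void main() {
    uint i = gl_GlobalInvocationID.x;

    // Plain (non-atomic) load of the large input through a dynamic index.
    vec4 e = elements[indices[i]];

    // Placeholder calculation; only the small output is written atomically.
    atomicAdd(results[i % 128u], uint(dot(e, e)));
}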
o_O
So, what do you think?

Though my intuition is similar to yours, it's probably best to set up a test case and see. I'd expect there to potentially be more latency with atomics where you care about the return value, since that is a "sync point", whereas fetching vertex attributes is meat-and-potatoes GPU parallelism with no sync points.
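Roughly what I mean by the return value being a sync point (a toy sketch with made-up buffer names):

#version 430

layout(local_size_x = 64) in;

layout(std430, binding = 0) buffer Counters { uint counters[]; };
layout(std430, binding = 1) writeonly buffer Slots { uint slots[]; };

void main() {
    uint i = gl_GlobalInvocationID.x;

    // Fire-and-forget: the invocation doesn't need the old value back,
    // so it doesn't have to wait on the memory system before moving on.
    atomicAdd(counters[0], 1u);

    // Return value used: the invocation can't continue past this point
    // until the old value comes back, so the atomic sits on the latency path.
    uint slot = atomicAdd(counters[1], 1u);
    slots[slot] = i;
}

Whether the driver and hardware actually treat those two cases differently is exactly the kind of thing a test case would tell you.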