Wicked GLsync? or I am expecting TOO much?



herb.zhu
08-02-2017, 12:15 PM
I am working on a volume-rendering project. We have some legacy code that draws a frame roughly like this:


void renderFrame()
{
    // pseudocode: iterate over layers 0..N
    for (i : 0 .. N)
    {
        renderLayer(i);
    }
}
void renderLayer(i)
{
    // pseudocode: iterate over the M nodes in layer i
    for (node : nodes of layer i)
    {
        renderNode(node);
    }
}
void renderNode(node)
{
    glGetUniformLocation(…);
    glUniformxx(…);
    // other uniforms
    glDrawArrays(…);
}


The code was written back when graphics cards such as the Quadro FX xxxx typically had only around 256 or 512 MB of GPU memory. Given the amount of data that needs to be in GPU RAM, this strategy breaks the workload into small chunks (per-node rendering). Our applications usually have hundreds or thousands of nodes to render.

Nowadays, GPUs are getting better. According to the AZDO presentation at GTC 2014, the above strategy is likely bound by GL driver overhead. Therefore, we are considering aggregating draw calls into some sort of “layered rendering”, i.e., breaking the workload into larger chunks (per-layer rendering).

Recently, I have been experimenting with something like the following to render a frame, all within one thread:


void renderFrame()
{
    // pseudocode: iterate over layers 0..N
    for (i : 0 .. N)
        renderLayer(i);
}

void renderLayer(int layerIndex)
{
    glClientWaitSync(…);
    update_persistently_mapped_UBO_content();
    glBindBufferRange(GL_UNIFORM_BUFFER, …);
    glDrawArraysInstancedBaseInstance(GL_TRIANGLES,
                                      0,
                                      numVertices,
                                      numInstances,
                                      0);
    glFenceSync(…);
}


As can be seen, the per-node uniforms are aggregated into uniform structs and updated through a persistently mapped UBO. Following the AZDO presentation, I use one UBO sized for three layers' worth of data, and call glBindBufferRange() with the proper offset and size for each layer. Things work and generate the desired images, but I only see a 30-ish% performance improvement over the old strategy on an NVIDIA Quadro M2000 card. I am not really expecting the 7.5x speedup claimed in the AZDO presentation, but I am definitely expecting more than 30%.
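As a side note on the "proper offset" part: for GL_UNIFORM_BUFFER, the offset passed to glBindBufferRange() has to be a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT (queried with glGetIntegerv). Below is a minimal sketch of the region arithmetic only, with no GL calls. The names alignUp, RingLayout and makeRingLayout are made up for illustration, and the 256-byte alignment used in the usage note is just an example value, not something taken from the posts.

```cpp
#include <cassert>
#include <cstddef>

// Round `size` up to the next multiple of `alignment` (alignment must be > 0).
inline std::size_t alignUp(std::size_t size, std::size_t alignment)
{
    return (size + alignment - 1) / alignment * alignment;
}

// Hypothetical helper: one buffer split into equally sized, aligned regions.
struct RingLayout
{
    std::size_t regionStride; // aligned bytes reserved per region
    std::size_t totalSize;    // bytes to allocate for the whole buffer

    // Byte offset of a region, suitable as the `offset` of glBindBufferRange.
    std::size_t offsetOf(std::size_t regionIndex) const
    {
        return regionIndex * regionStride;
    }
};

inline RingLayout makeRingLayout(std::size_t bytesPerRegion,
                                 std::size_t offsetAlignment,
                                 std::size_t regionCount)
{
    const std::size_t stride = alignUp(bytesPerRegion, offsetAlignment);
    return RingLayout{stride, stride * regionCount};
}
```

For example, with 1000 bytes of uniform data per region, an alignment of 256 and 3 regions, makeRingLayout(1000, 256, 3) gives a stride of 1024 bytes, a total of 3072 bytes, and region 2 starts at offset 2048.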

Nsight's performance analysis shows that a significant amount of time is spent in the glClientWaitSync/glFenceSync calls. For example, the picture below shows the rendering of two layers with 32 nodes per layer.
[Attachment 2433: Nsight timeline of two layers, 32 nodes per layer]

I went ahead and tried increasing the number of nodes per layer to 64 or 128, hoping to reduce the syncing cost by having fewer layers (and therefore fewer sync-related calls). Unfortunately, I got about the same overall per-frame rendering performance as with 32 nodes per layer. In Nsight, the cost of rendering each layer grows in proportion to the node count, and the cost of the glClientWaitSync/glFenceSync calls ALSO grows by a similar ratio, ☹. What the heck?

BTW, I suspected that updating the UBO content triggers a DMA transfer request, which could stall the draw call and in turn cause the expensive glClientWaitSync/glFenceSync calls. So I tried having a second thread perform the UBO update (i.e., letting the UBO update and the GL draw call happen asynchronously), but so far no luck, ☹.

I am wondering if I am missing anything. Any comments or ideas are appreciated!

Dark Photon
08-02-2017, 08:15 PM
Nowadays, GPUs are getting better. ... AZDO ... Therefore, we are considering aggregating draw calls ...

As can be seen, the per-node uniforms are aggregated into uniform structs and updated through a persistently mapped UBO. Following the AZDO presentation, I use one UBO sized for three layers' worth of data,

No, as I recall the advice was to size your PERSISTENT COHERENT buffer to be able to hold at least 3 "frames" of data, not 3 layers (checking... yes, that's right; see this (http://media.steampowered.com/apps/steamdevdays/slides/beyondporting.pdf)). You want the ClientWaitSync to never or almost never wait here because the GPU has already gotten to this point in the command stream and so there's no need to wait.

What you're seeing suggests you are not giving the GPU enough work for this to be the case.



I went ahead and tried increasing the number of nodes per layer to 64 or 128, hoping to reduce the syncing cost by having fewer layers (and therefore fewer sync-related calls). ... In Nsight, the cost of rendering each layer grows in proportion to the node count, and the cost of the glClientWaitSync/glFenceSync calls ALSO grows by a similar ratio, ☹. What the heck?

It sounds like you don't completely understand how FenceSync/WaitSync work.

FenceSync drops a "breadcrumb" in the command stream you're sending to the GPU. ClientWaitSync tells GL to block your CPU thread until the GPU has executed past the point in the command-stream where a "breadcrumb" was inserted. If the GPU has already gotten past this point, the driver doesn't wait here. But if the GPU hasn't gotten to this point yet, then the driver waits and blocks your CPU thread here.
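The breadcrumb idea can be sketched with a toy model. This is plain C++ and NOT OpenGL: ToyQueue and all of its members are invented for illustration, commands are simply numbered in submission order, and the "GPU" is a counter that is advanced by hand.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Toy model of a command stream (not real GL; every name here is hypothetical).
struct ToyQueue
{
    std::uint64_t submitted = 0; // commands the CPU has issued so far
    std::uint64_t completed = 0; // commands the "GPU" has finished so far

    void submitCommand() { ++submitted; }

    // Analogous to glFenceSync: remember the current position in the stream.
    std::uint64_t insertFence() const { return submitted; }

    // A fence is signaled once everything issued before it has completed.
    bool isSignaled(std::uint64_t fence) const { return completed >= fence; }

    // Analogous to a blocking glClientWaitSync: returns how many commands
    // the CPU had to wait for (0 means the call did not block at all).
    std::uint64_t clientWait(std::uint64_t fence)
    {
        if (isSignaled(fence))
            return 0;
        const std::uint64_t waited = fence - completed;
        completed = fence; // pretend the GPU caught up while the CPU blocked
        return waited;
    }

    // Let the "GPU" make progress on up to `count` outstanding commands.
    void gpuAdvance(std::uint64_t count)
    {
        completed = std::min(completed + count, submitted);
    }
};
```

In this model, waiting on a fence right after inserting it always blocks for the outstanding work, while waiting on a fence the GPU has already passed returns immediately.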

The latter sounds like what is happening in your case.

The way you wrote your updated renderLayer() function, it effectively says:



Wait until the GPU has finished rendering the last layer.
Give it some new data for the next layer.
Tell it to render the next layer.


"iff" you are only using one sync object here. So as you can see, you're completely preventing any pipelining of multiple layers.

However, you should use multiple sync objects here (e.g. 3, one for every 1/3rd of the buffer). The whole idea of these is to make sure you don't stomp on some data you've written into this buffer before the GPU has consumed that data, not to make sure the GPU waits at every stage on the CPU.

Try expanding the size of your PERSISTENT COHERENT buffer to hold at least 3 frames of data, use multiple sync objects to prevent scribbling on data the GPU hasn't read, and then see if/how much you end up waiting in ClientWaitSync.
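To get a feel for how the number of buffer regions affects waiting, here is a rough timing model. It is a sketch under made-up assumptions, not a measurement: the function name simulate, the SimResult struct and the fixed per-layer costs are all hypothetical, and real drivers and GPUs behave far less regularly. Each layer costs the CPU cpuCost time units to update and submit, costs the GPU gpuCost units to render, and layer i reuses the region last used by layer i - regionCount, so the CPU must wait for that older layer to finish first.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct SimResult
{
    long cpuStall;  // total time the CPU spent blocked waiting on fences
    long gpuFinish; // time at which the GPU finished the last layer
};

// Toy timeline: one fence per region, regions reused round-robin.
SimResult simulate(int layerCount, int regionCount, long cpuCost, long gpuCost)
{
    std::vector<long> gpuDone(layerCount, 0); // GPU completion time per layer
    long cpuTime = 0;
    long stall = 0;
    for (int i = 0; i < layerCount; ++i)
    {
        if (i >= regionCount)
        {
            // The region is free once the layer that last used it is done.
            const long ready = gpuDone[i - regionCount];
            if (ready > cpuTime)
            {
                stall += ready - cpuTime;
                cpuTime = ready;
            }
        }
        cpuTime += cpuCost; // write uniforms and submit the draw
        // The GPU starts a layer once it is submitted and the previous is done.
        const long gpuStart = (i > 0) ? std::max(cpuTime, gpuDone[i - 1]) : cpuTime;
        gpuDone[i] = gpuStart + gpuCost;
    }
    return SimResult{stall, layerCount > 0 ? gpuDone[layerCount - 1] : 0};
}
```

With 6 layers and the CPU as the slower side (cpuCost 3, gpuCost 1), one region gives 5 units of stall and three regions give 0. With the GPU as the slower side (cpuCost 1, gpuCost 3), one region gives 15 units of stall and three regions still give 5, because in this model the CPU eventually catches up with the GPU no matter how many regions there are.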

herb.zhu
08-02-2017, 09:52 PM
Thanks, Dark Photon, for the quick inputs!


No, as I recall the advice was to size your PERSISTENT COHERENT buffer to be able to hold at least 3 "frames" of data, not 3 layers (checking... yes, that's right; see this (http://media.steampowered.com/apps/steamdevdays/slides/beyondporting.pdf)). You want the ClientWaitSync to never or almost never wait here because the GPU has already gotten to this point in the command stream and so there's no need to wait.

I thought about holding 3 "frames" of data and aggregating everything into a single glDrawArraysInstancedBaseInstance() call per frame. However, although GPU RAM is much bigger than in the old days, we still cannot assume it is large enough to hold all the data needed to render one frame in some of our use cases. Sorry, my pseudocode was not clear about this. Within my UBO there is a list of 3D texture handles, passed to the shaders to render one layer of nodes (a heavy GPU RAM footprint, :(, but there is no way around it).

However, for small datasets (i.e., when all the data for rendering a frame fits on my Quadro M2000), I am planning to do just one glDrawArraysInstancedBaseInstance() call per frame.



However, you should use multiple sync objects here (e.g. 3, one for every 1/3rd of the buffer). The whole idea of these is to make sure you don't stomp on some data you've written into this buffer before the GPU has consumed that data, not to make sure the GPU waits at every stage on the CPU.

Sorry again, I wasn't clear about how I use the sync objects in the pseudocode. I do use multiple sync objects, as below:




#define kNumBuffers 3
#define kOneSecondInNanoSeconds 1000000000

std::vector<GLsync> _syncs(kNumBuffers, 0);

int _currentBufferIdx = 0;

void renderLayer(int layerIndex)
{
    waitSync();

    update_persistently_mapped_UBO_content();
    glBindBufferRange(GL_UNIFORM_BUFFER, …);
    glDrawArraysInstancedBaseInstance(GL_TRIANGLES,
                                      0,
                                      numVertices,
                                      numInstances,
                                      0);
    fenceSync();
}

// Block until the GPU is done with the buffer region we are about to reuse.
void waitSync()
{
    auto& sync = _syncs[_currentBufferIdx];
    if (sync != 0)
    {
        GLbitfield waitFlags = 0;
        GLuint64 waitDuration = 0;
        while (1) {
            GLenum waitRet = glClientWaitSync(sync, waitFlags, waitDuration);
            if (waitRet == GL_ALREADY_SIGNALED || waitRet == GL_CONDITION_SATISFIED) {
                // Leave the loop (rather than returning) so the sync object is deleted below.
                break;
            }

            if (waitRet == GL_WAIT_FAILED) {
                assert(!"Not sure what to do here. Probably raise an exception or something.");
                break;
            }

            // After the first time, need to start flushing, and wait for a looong time.
            waitFlags = GL_SYNC_FLUSH_COMMANDS_BIT;
            waitDuration = kOneSecondInNanoSeconds;
        }

        glDeleteSync(sync);
        sync = 0;
    }
}

// Drop a fence for the region just submitted, then move on to the next region.
void fenceSync()
{
    GLsync syncName = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    _syncs[_currentBufferIdx] = syncName;

    _currentBufferIdx = (_currentBufferIdx + 1) % kNumBuffers;
}


As you can see, 3 layers' worth of work is pipelined before it hits glClientWaitSync(). This can be confirmed (assuming Nsight is right) by looking at the Nsight timeline. Let's look at the first four layers:

[Attachment 2435: Nsight timeline of the first four layers]

As you can see, for layers 0, 1, and 2 there is no glClientWaitSync() call, only an expensive glFenceSync() call. Starting from layer 3, the expensive glClientWaitSync() shows up in the timeline.

I have also tried a UBO sized for 8 layers with 8 sync objects, but I did not get any speedup.


What you're seeing suggests you are not giving the GPU enough work for this to be the case.
I think my experiments with increasing the layer size (i.e., including 64 or 128 nodes in one layer) add more GPU work between fenceSync/waitSync, but as I said, I did not gain anything compared to 32 nodes per layer, :(