I am working on a volume-rendering project. We have some legacy code that draws a frame like this:
void renderFrame()
{
    for (int i = 0; i < numLayers; ++i)
    {
        renderLayer(i);
    }
}

void renderLayer(int layerIndex)
{
    for (int n = 0; n < numNodesInLayer(layerIndex); ++n)
    {
        renderNode(n);
    }
}

void renderNode(int nodeIndex)
{
    // Per-node uniforms are located and set right before each small draw.
    glGetUniformLocation(…);
    glUniformxx(…);
    // other uniforms
    glDrawArrays(…);
}
The code was written back in the days of graphics cards like the Quadro FX xxxx, which typically had only around 256 or 512 MB of GPU memory. Given the amount of data that needs to reside in GPU RAM, the strategy above breaks the workload into small chunks (per-node rendering). In our applications there are usually hundreds or thousands of nodes to render per frame.
Nowadays GPUs are far more capable, and according to the AZDO presentation at GTC 2014, the strategy above is likely bound by GL driver overhead. We are therefore considering aggregating draw calls into a sort of "layered rendering", i.e., breaking the workload into larger chunks (per-layer rendering).
Recently I have been experimenting with something like the following for rendering a frame, within one thread:
void renderFrame()
{
    for (int i = 0; i < numLayers; ++i)
        renderLayer(i);
}

void renderLayer(int layerIndex)
{
    // Wait until the GPU has finished reading this third of the UBO
    // (fences[] entries assumed valid; first-use checks omitted).
    glClientWaitSync(fences[layerIndex % 3], GL_SYNC_FLUSH_COMMANDS_BIT, timeoutNs);
    glDeleteSync(fences[layerIndex % 3]);

    update_persistently_mapped_UBO_content();
    glBindBufferRange(GL_UNIFORM_BUFFER, …);
    glDrawArraysInstancedBaseInstance(GL_TRIANGLES,
                                      0,             // first vertex
                                      numVertices,
                                      numInstances,  // one instance per node
                                      0);            // baseInstance
    // Fence this layer's draw so its UBO region can later be reused safely.
    fences[layerIndex % 3] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}
As can be seen, the per-node uniforms are aggregated into uniform structs and updated through a persistently mapped UBO. Following the AZDO presentation, a single UBO sized for three layers' worth of data is used, and glBindBufferRange() is called with the proper offset and size for the layer being rendered. This works and produces the desired images, but I only see roughly a 30% performance improvement over the old strategy on an NVIDIA Quadro M2000. I am not expecting something like the 7.5x speedup claimed in the AZDO presentation, but I am definitely expecting more than 30%.
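For concreteness, here is a minimal sketch of the buffer setup I am describing; the names (kLayerBlockSize, mappedPtr, bindingPoint, layerUniforms) are placeholders for this post, not our actual code:

// Triple-buffered persistent UBO. kLayerBlockSize is one layer's worth of
// per-node uniform structs, rounded up to GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT.
const GLsizeiptr kLayerBlockSize = 64 * 1024;   // illustrative size only
const GLbitfield mapFlags = GL_MAP_WRITE_BIT
                          | GL_MAP_PERSISTENT_BIT
                          | GL_MAP_COHERENT_BIT;

GLuint ubo;
glGenBuffers(1, &ubo);
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
glBufferStorage(GL_UNIFORM_BUFFER, 3 * kLayerBlockSize, nullptr, mapFlags);
char* mappedPtr = static_cast<char*>(
    glMapBufferRange(GL_UNIFORM_BUFFER, 0, 3 * kLayerBlockSize, mapFlags));

// Then, per layer: write region (layerIndex % 3) and bind that range.
const GLintptr offset = (layerIndex % 3) * kLayerBlockSize;
memcpy(mappedPtr + offset, layerUniforms, layerUniformsSize);
glBindBufferRange(GL_UNIFORM_BUFFER, bindingPoint, ubo, offset, kLayerBlockSize);

With GL_MAP_COHERENT_BIT the writes become visible to the GPU without an explicit glFlushMappedBufferRange(); the fences are still needed so a region is not overwritten while an in-flight draw is reading it.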
Nsight's performance analysis shows that a significant amount of time is spent in the glClientWaitSync/glFenceSync calls; for example, the picture below shows the rendering of two layers with 32 nodes per layer.
[ATTACH=CONFIG]1518[/ATTACH]
I went ahead and tried increasing the number of nodes per layer, e.g., 64 or 128 nodes per layer, hoping that fewer layers (and hence fewer sync-related calls) would reduce the syncing cost. Unfortunately, it blew up in my face: I got essentially the same overall per-frame performance as with 32 nodes per layer. In Nsight, the cost of rendering each layer grows in proportion to the node count, and the cost of the glClientWaitSync/glFenceSync calls grows at a similar rate. What is going on?
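As a next step, to tell real GPU waits apart from per-call driver overhead, I am thinking of polling the fence with a zero timeout before blocking; a minimal sketch, using the same hypothetical fences[] array as above:

// A zero-timeout poll first: GL_ALREADY_SIGNALED means the region is already
// free, so any time the profiler attributes here is call overhead, not a wait.
GLenum status = glClientWaitSync(fences[region], 0, 0);
while (status == GL_TIMEOUT_EXPIRED)
{
    // The GPU really is still reading this UBO region; wait for real, with
    // GL_SYNC_FLUSH_COMMANDS_BIT so the fence is guaranteed to be submitted.
    status = glClientWaitSync(fences[region],
                              GL_SYNC_FLUSH_COMMANDS_BIT,
                              1000000);   // 1 ms per iteration
}
glDeleteSync(fences[region]);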
BTW, I suspected that the UBO content update might trigger a DMA transfer that stalls the draw call, which could explain the heavy glClientWaitSync/glFenceSync cost. So I tried having another thread responsible for the UBO update (i.e., letting the UBO update and the GL draw calls happen asynchronously), but so far no luck.
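In case it clarifies what I mean by that, the two-thread variant looked roughly like the sketch below (fillLayerUniforms/drawLayer are hypothetical stand-ins; mappedPtr, kLayerBlockSize, and fences[] are the same placeholder names as above):

#include <atomic>

std::atomic<int> writtenLayer{-1};   // last layer whose UBO region is filled

void uboWriterThread()               // worker thread: CPU writes only, no GL
{
    for (int i = 0; i < numLayers; ++i)
    {
        // Something must also guarantee region (i % 3) is idle before this
        // write, e.g. a semaphore the GL thread posts after its fence wait.
        fillLayerUniforms(mappedPtr + (i % 3) * kLayerBlockSize, i);
        writtenLayer.store(i, std::memory_order_release);
    }
}

void renderThreadLoop()              // GL context thread: binds, draws, fences
{
    for (int i = 0; i < numLayers; ++i)
    {
        while (writtenLayer.load(std::memory_order_acquire) < i)
            ;                        // spin; a condition variable would be nicer
        glBindBufferRange(GL_UNIFORM_BUFFER, bindingPoint, ubo,
                          (i % 3) * kLayerBlockSize, kLayerBlockSize);
        drawLayer(i);                // the instanced draw from renderLayer()
        fences[i % 3] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    }
}

The persistent mapping's pointer may be written from any thread; only the GL calls have to stay on the context thread, which is what makes this split possible at all.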
I am wondering if I am missing anything. Any comments or ideas are appreciated!