Poor glMultiDrawElementsIndirect Performance

Intent:
I am trying to implement a Shared Mesh structure as described in the Approaching Zero Driver Overhead video. I am looking for some help understanding why I’m getting poor performance.

Previous method:
My world’s terrain was originally broken up into about 9k regions, with about 300-400 visible and rendered at a single time. With my naive implementation, I gave each region a VAO/VBO and bound/rendered. The performance wasn’t bad, but I thought I could do better with a Shared Mesh.

New Method:
I implemented a “Block” style shared mesh where each region gains access to the mesh through a handle. The handle requests a range of “Blocks” from the vertex and index buffers. Worker threads then write to the buffer through the handle, using a pointer that was obtained at initialization with glMapBufferRange, and finalizing the handle produces a draw command for the application to use. The scene is rendered using a single glMultiDrawElementsIndirect call. When a “Block” is returned to the mesh, a sync is added to that range, and the application waits until that sync is signaled before handing the range out to other handles. The shared mesh was created using glBufferStorage with the persistent and coherent flags.
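
For reference, each command read by glMultiDrawElementsIndirect is five tightly packed 32-bit values. The sketch below shows how finalizing a handle might turn a block range into one such command. `BlockRange`, `make_command`, and the two block sizes are hypothetical names and values chosen for illustration; they are not taken from the pastebin code.

```cpp
#include <cassert>
#include <cstdint>

// Layout that glMultiDrawElementsIndirect reads from the indirect buffer
// (five 32-bit fields, 20 bytes, no padding).
struct DrawElementsIndirectCommand {
    std::uint32_t count;         // number of indices to draw
    std::uint32_t instanceCount; // number of instances
    std::uint32_t firstIndex;    // offset into the index buffer, in indices
    std::int32_t  baseVertex;    // value added to every fetched index
    std::uint32_t baseInstance;  // offsets the fetch of instanced attributes
};
static_assert(sizeof(DrawElementsIndirectCommand) == 20,
              "indirect commands must be tightly packed");

// Hypothetical block sizes; the real values depend on the mesh setup.
constexpr std::uint32_t kVerticesPerBlock = 1024;
constexpr std::uint32_t kIndicesPerBlock  = 1536;

// Hypothetical description of what one handle owns inside the shared mesh.
struct BlockRange {
    std::uint32_t firstVertexBlock; // first block in the vertex buffer
    std::uint32_t firstIndexBlock;  // first block in the index buffer
    std::uint32_t indexCount;       // indices actually written by the handle
    std::uint32_t instanceSlot;     // row in the per-instance buffers
};

// Builds one draw command for a range. Indices written by the handle are
// assumed to be relative to the range's first vertex, so baseVertex shifts
// them to the right place in the shared vertex buffer.
DrawElementsIndirectCommand make_command(const BlockRange& r) {
    assert(r.indexCount > 0);
    DrawElementsIndirectCommand cmd{};
    cmd.count         = r.indexCount;
    cmd.instanceCount = 1;
    cmd.firstIndex    = r.firstIndexBlock * kIndicesPerBlock;
    cmd.baseVertex    = static_cast<std::int32_t>(r.firstVertexBlock * kVerticesPerBlock);
    cmd.baseInstance  = r.instanceSlot;
    return cmd;
}
```

With instanced attributes at a divisor of 1, baseInstance is what lets every command pick its own ID and matrices out of the per-instance buffers.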

Shader Info:
The original, many VAO, approach had 4 Attribute pointers: Vert, Color, Norm, Uv.
The new, Shared Mesh, approach has 12 Attribute pointers: the previous 4, a per-instance ID, 4 pointers for a per-instance model matrix buffer, and 3 more pointers for a per-instance normal matrix buffer.
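
The matrices take several pointers each because a matrix attribute occupies one attribute location per column. Here is a small sketch of how those slots could be computed. It assumes tightly packed column-major float matrices and an arbitrary location numbering (0-3 mesh data, 4 ID, 5-8 model matrix, 9-11 normal matrix); `AttribSlot` and `matrix_slots` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One vertex attribute slot. In real code each slot would be handed to
// glVertexAttribPointer(location, components, GL_FLOAT, GL_FALSE, stride, offset)
// followed by glVertexAttribDivisor(location, 1) to make it per-instance.
struct AttribSlot {
    std::uint32_t location;   // attribute location
    std::int32_t  components; // floats per column
    std::size_t   offset;     // byte offset of the column inside one matrix
    std::size_t   stride;     // byte distance between consecutive matrices
};

// Splits a column-major float matrix into one attribute slot per column.
std::vector<AttribSlot> matrix_slots(std::uint32_t firstLocation, int columns, int rows) {
    assert(columns > 0 && rows > 0);
    const std::size_t columnBytes = static_cast<std::size_t>(rows) * sizeof(float);
    const std::size_t stride      = static_cast<std::size_t>(columns) * columnBytes;
    std::vector<AttribSlot> slots;
    for (int c = 0; c < columns; ++c) {
        AttribSlot slot;
        slot.location   = firstLocation + static_cast<std::uint32_t>(c);
        slot.components = rows;
        slot.offset     = static_cast<std::size_t>(c) * columnBytes;
        slot.stride     = stride;
        slots.push_back(slot);
    }
    return slots;
}
```

Under those assumptions, `matrix_slots(5, 4, 4)` describes the model matrix and `matrix_slots(9, 3, 3)` the normal matrix, which accounts for 7 of the 12 pointers.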

Performance Info:
My original, many VAO, approach rendered the scene with about 25% GPU load and around 13% shader work. This is using 3 passes for CSM and 1 pass for actual render.
My new, Shared Mesh, approach renders the scene with about 60-80% GPU load and around 13% shader work. This is using ONLY 1 pass for actual render. Anything more than one pass is just horrendous.
The number of triangles rendered in both methods is ~880,000, spread over ~300-400 commands/VAOs.

I thought the GPU load would be reduced. How is it that this new method renders twice as slowly on the GPU? I have to be doing something super wrong…

One thing I noticed is that I am allocating about 1.5GB of space using buffer storage; however, my GPU only reports ~0.8GB of memory usage, and my application reports additional memory usage close to 1.5GB. It is also worth noting that I am not CPU bound in any way. Also, the performance doesn’t just drop off suddenly; I can see a linear drop in performance as more and more chunks/triangles are rendered. For shits and giggles, I broke the scene up into even more commands, about 3 times as many as before, in an attempt to see if the number of commands was the cause, and the performance did not get worse. I stripped my shader of any light/shadow code and performance is still the same. I am using a GTX 980 Ti.

Code:
Here is SharedMesh.h:
pastebin/F5b37Xix

Here is SharedMesh.cpp:
pastebin/aiSZx3Sz

Usage is as follows:

SharedMesh mesh;
mesh.init(...);

SMHandle handle;
mesh.get_handle( handle );

handle.push_set( SMGSet(...) );
handle.buffer_data( ... );
handle.finalize_set( );

handle.submit_commands( );

mesh.buffer_commands( );
mesh.render( );
mesh.clear_commands( );

Please help me. I am going crazy over this.

I’ve tried not using the coherent bit and flushing the ranges manually. It still does not seem like the data itself is on the graphics card. I am still getting very poor performance.

Does anyone here know how to use glBufferStorage and glMapBufferRange correctly with each other?

If I am doing things completely wrong, does anyone have a resource which I can read that describes using these together?

[QUOTE=Amani77;1283479]Does anyone here know how to use glBufferStorage and glMapBufferRange correctly with each other?

If I am doing things completely wrong, does anyone have a resource which I can read that describes using these together?[/QUOTE]

https://www.opengl.org/wiki/Buffer_Object#Immutable_Storage
https://www.opengl.org/wiki/Buffer_Object#Mapping

Use glBufferStorage(…) ONLY once to allocate (immutable!) memory for your buffer.
Use glMapBufferRange(…) to upload/download data, and don’t forget to unmap it when finished.
And don’t forget to check for GL errors.

https://www.opengl.org/wiki/OpenGL_Error

[QUOTE=john_connor;1283481]https://www.opengl.org/wiki/Buffer_Object#Immutable_Storage
Buffer Object - OpenGL Wiki

Use glBufferStorage(…) ONLY once to allocate (immutable!) memory for your buffer.
Use glMapBufferRange(…) to upload/download data, and don’t forget to unmap it when finished.
And don’t forget to check for GL errors.[/quote]

The whole point of persistent mapping is that you don’t unmap it.

Exactly, and with coherent I should not need to flush ranges. I mean, everything renders without artifacts etc.; it’s just super taxing on the GPU.

Most drivers seem to always put the buffer on the client side (normal CPU system memory) if you map it even once.

Try to create a small transfer buffer that you map read/write and persistently, and also set the client side hint to true. Then use that memory like normal system memory with the added ability to call GPU DMA copy commands on it. Just make sure you wait for fences before touching that memory again after a copy command! (For streaming, maybe create more than one transfer buffer, or only use parts of it in ping-pong fashion.)

Then you create your drawing buffers unmapped and copy all the data into them via the transfer buffer(s). Again, test for fences before calling drawing commands; otherwise the driver must stall if the memory copy is not done.
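
The ping-pong bookkeeping can be sketched without any GL calls. In this minimal sketch, the fence is modeled as a plain callable so the logic can be shown in isolation; `StagingRing`, `FencePoll`, `try_acquire`, and `release` are hypothetical names. In real code the staging buffer would presumably be created with glBufferStorage using GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_CLIENT_STORAGE_BIT, the copy would be a glCopyBufferSubData into the unmapped drawing buffer, and the fence would come from glFenceSync and be polled with glClientWaitSync.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Stand-in for a GLsync: returns true once the GPU has passed the fence.
// A real version might wrap glClientWaitSync with a timeout of zero.
using FencePoll = std::function<bool()>;

// Two fixed-size regions of one persistently mapped staging buffer,
// used alternately so the CPU never writes memory the GPU is still copying from.
class StagingRing {
public:
    explicit StagingRing(std::size_t regionBytes) : regionBytes_(regionBytes) {
        assert(regionBytes > 0);
    }

    // On success, stores the byte offset of a region that is safe to write.
    // Returns false if the copy out of the next region has not finished yet.
    bool try_acquire(std::size_t& offset) {
        Region& r = regions_[next_];
        if (r.fence && !r.fence()) {
            return false; // GPU is still reading this region
        }
        r.fence = nullptr;
        offset = next_ * regionBytes_;
        return true;
    }

    // Call after issuing the GPU copy that reads from the acquired region;
    // `fence` must report completion of that copy.
    void release(FencePoll fence) {
        regions_[next_].fence = std::move(fence);
        next_ = (next_ + 1) % regions_.size();
    }

private:
    struct Region {
        FencePoll fence; // empty when no copy is pending
    };
    std::array<Region, 2> regions_{};
    std::size_t regionBytes_;
    std::size_t next_ = 0;
};
```

A worker would call `try_acquire`, write its data at the returned offset, issue the copy, then `release` with a fence created right after the copy. When `try_acquire` fails, the caller can do other work instead of stalling the driver.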

[QUOTE=Osbios;1283496]Most drivers seem to always put the buffer on the client side (normal CPU system memory) if you map it even once.

Try to create a small transfer buffer that you map read/write and persistently, and also set the client side hint to true. Then use that memory like normal system memory with the added ability to call GPU DMA copy commands on it. Just make sure you wait for fences before touching that memory again after a copy command! (For streaming, maybe create more than one transfer buffer, or only use parts of it in ping-pong fashion.)

Then you create your drawing buffers unmapped and copy all the data into them via the transfer buffer(s). Again, test for fences before calling drawing commands; otherwise the driver must stall if the memory copy is not done.[/QUOTE]

Hey this seems like a very good idea. I’m currently changing my code to try this out. Thanks for the suggestion!