Intent:
I am trying to implement a Shared Mesh structure as described in the Approaching Zero Driver Overhead video. I am looking for help understanding why I’m getting poor performance.
Previous method:
My world’s terrain was originally broken up into about 9k regions, with about 300-400 visible and rendered at any one time. In my naive implementation, I gave each region its own VAO/VBO and bound and rendered them one by one. The performance wasn’t bad, but I thought I could do better with a Shared Mesh.
New Method:
I implemented a “Block” style shared mesh where each region gains access to the mesh through a handle. The handle requests a range of “Blocks” from the vertex and index buffers. Worker threads then write to the buffer through the handle, using a pointer that was obtained once at initialization with glMapBufferRange. Finalizing the handle produces a command to be used by the application. The scene is rendered using a single glMultiDrawElementsIndirect call. When a “Block” is returned to the mesh, a sync is added to that range, and the application waits until that sync is signaled before it gives the range out again to other handles. The shared mesh was created using glBufferStorage with the persistent and coherent flags.
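To make the hand-out and reuse rule above concrete, here is a minimal sketch of the bookkeeping only. It is not the code behind the pastebin links: BlockPool, BlockRange, acquire, release and reclaim are hypothetical names, and the GPU fence is abstracted as a callback so the sketch runs without a GL context. In a real renderer the callback would presumably poll a fence created with glFenceSync.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical names throughout; this only models the bookkeeping.
struct BlockRange {
    std::size_t first;  // index of the first block in the range
    std::size_t count;  // number of consecutive blocks
};

class BlockPool {
public:
    explicit BlockPool(std::size_t block_count)
        : state_(block_count, State::Free) {}

    // First-fit search for `count` consecutive free blocks.
    bool acquire(std::size_t count, BlockRange& out) {
        if (count == 0 || count > state_.size()) return false;
        std::size_t run = 0;
        for (std::size_t i = 0; i < state_.size(); ++i) {
            run = (state_[i] == State::Free) ? run + 1 : 0;
            if (run == count) {
                out = BlockRange{i + 1 - count, count};
                mark(out, State::InUse);
                return true;
            }
        }
        return false;
    }

    // A returned range stays unavailable until `signaled` reports true.
    // Assumption: in GL this would wrap a zero-timeout glClientWaitSync.
    void release(BlockRange r, std::function<bool()> signaled) {
        mark(r, State::Pending);
        pending_.push_back({r, std::move(signaled)});
    }

    // Call once per frame: frees every pending range whose fence signaled.
    void reclaim() {
        std::vector<PendingRange> still_pending;
        for (auto& p : pending_) {
            if (p.signaled()) mark(p.range, State::Free);
            else still_pending.push_back(std::move(p));
        }
        pending_ = std::move(still_pending);
    }

private:
    enum class State { Free, InUse, Pending };
    struct PendingRange {
        BlockRange range;
        std::function<bool()> signaled;
    };

    void mark(BlockRange r, State s) {
        for (std::size_t i = r.first; i < r.first + r.count; ++i) state_[i] = s;
    }

    std::vector<State> state_;
    std::vector<PendingRange> pending_;
};
```

Polling in reclaim( ) instead of blocking at release time keeps the render thread from stalling on a fence; whether that matches the wait described above depends on the actual SharedMesh code.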
Shader Info:
The original, many VAO, approach had 4 Attribute pointers: Vert, Color, Norm, Uv.
The new, Shared Mesh, approach has 12 Attribute pointers: the previous 4, a per-instance ID, 4 pointers for a per-instance model matrix buffer, and 3 more pointers for a per-instance normal matrix buffer.
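As a rough illustration of how those 12 attribute pointers add up: a matrix attribute occupies one location per column, so a mat4 takes 4 locations and a mat3 takes 3. The sketch below only computes the slot table. The location numbers (0-3 for the original four, 4 for the ID, 5-8 and 9-11 for the matrices) and the tightly packed column-major float layout are assumptions, and AttribSlot and matrix_slots are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One entry per attribute pointer. Each entry would map to
// glVertexAttribPointer(location, components, GL_FLOAT, GL_FALSE, stride, offset)
// followed by glVertexAttribDivisor(location, divisor).
struct AttribSlot {
    unsigned location;    // shader attribute location
    int components;       // floats in this column (rows of the matrix)
    std::size_t stride;   // bytes between consecutive instances
    std::size_t offset;   // byte offset of this column inside one matrix
    unsigned divisor;     // 1 = advance once per instance
};

// Slots for a buffer of tightly packed column-major float matrices.
std::vector<AttribSlot> matrix_slots(unsigned first_location,
                                     unsigned columns, int rows) {
    std::vector<AttribSlot> slots;
    const std::size_t column_bytes = sizeof(float) * static_cast<std::size_t>(rows);
    const std::size_t stride = column_bytes * columns;
    for (unsigned c = 0; c < columns; ++c) {
        slots.push_back({first_location + c, rows, stride, column_bytes * c, 1u});
    }
    return slots;
}
```

With these assumed locations, matrix_slots(5, 4, 4) describes the model matrix and matrix_slots(9, 3, 3) the normal matrix, which together with the ID and the original four gives the 12 pointers.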
Performance Info:
My original, many VAO, approach rendered the scene with about 25% GPU load and around 13% shader work. This is using 3 passes for CSM and 1 pass for actual render.
My new, Shared Mesh, approach renders the scene with about 60-80% GPU load and around 13% shader work. This is using ONLY 1 pass for the actual render. Any more than one pass is just horrendous.
The number of triangles rendered in both methods is ~880,000, spread across ~300-400 commands/VAOs.
I thought the GPU load would be reduced. How is it that this new method is rendering twice as slow on the GPU? I have to be doing something super wrong…
One thing I noticed is that I am allocating about 1.5GB worth of space using buffer storage; however, my GPU only reports ~0.8GB memory usage, and my application reports additional memory usage close to 1.5GB. It is also worth noting that I am not CPU bound in any way. Also, the performance doesn’t just drop off suddenly; I can see a linear drop in performance as more and more chunks/triangles are rendered. For shits and giggles, I broke up the scene into even more commands, about 3 times as many as before, in an attempt to see if the number of commands was the cause, and the performance did not get worse. I stripped my shader of any light/shadow code and the performance is still the same. I am using a GTX 980 Ti.
Code:
Here is SharedMesh.h:
pastebin/F5b37Xix
Here is SharedMesh.cpp:
pastebin/aiSZx3Sz
Usage is as follows:
SharedMesh mesh;
mesh.init(...);
SMHandle handle;
mesh.get_handle( handle );
handle.push_set( SMGSet(...) );
handle.buffer_data( ... );
handle.finalize_set( );
handle.submit_commands( );
mesh.buffer_commands( );
mesh.render( );
mesh.clear_commands( );
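For reference, glMultiDrawElementsIndirect reads an array of five-field command records from the indirect buffer. The sketch below shows that record and one plausible way a finalized handle could fill it from its block ranges. It is an assumption about the general technique, not the code behind the pastebin links: make_command, the block-size parameters and the use of baseInstance as a per-draw ID are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Field order and sizes follow the command layout defined by OpenGL
// for glMultiDrawElementsIndirect.
struct DrawElementsIndirectCommand {
    std::uint32_t count;          // number of indices to draw
    std::uint32_t instanceCount;  // number of instances
    std::uint32_t firstIndex;     // offset into the index buffer, in indices
    std::int32_t  baseVertex;     // value added to every index
    std::uint32_t baseInstance;   // first instance for per-instance attributes
};
static_assert(sizeof(DrawElementsIndirectCommand) == 20,
              "commands must be tightly packed when the stride argument is 0");

// Hypothetical helper: turns block ranges into one draw command.
// Assumes each region is drawn once and that baseInstance selects the
// region's per-instance data (ID, model matrix, normal matrix).
DrawElementsIndirectCommand make_command(std::uint32_t index_count,
                                         std::uint32_t first_index_block,
                                         std::uint32_t first_vertex_block,
                                         std::uint32_t indices_per_block,
                                         std::uint32_t vertices_per_block,
                                         std::uint32_t draw_id) {
    DrawElementsIndirectCommand cmd;
    cmd.count = index_count;
    cmd.instanceCount = 1;
    cmd.firstIndex = first_index_block * indices_per_block;
    cmd.baseVertex = static_cast<std::int32_t>(first_vertex_block * vertices_per_block);
    cmd.baseInstance = draw_id;
    return cmd;
}
```

Because firstIndex and baseVertex are expressed in elements rather than bytes, the block sizes here are counted in indices and vertices.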
Please help me. I am going crazy over this.