[QUOTE=glnoob;1292914]First off, I was erroneously calling glBufferData when glBufferSubData would have sufficed.
Removing that error sped things up where I am now getting around 600 FPS, but that is still slow.
Our old single-threaded deferred renderer runs at 1000 FPS with the same scene, and the new multithreaded forward renderer should be as fast or faster.[/QUOTE]
Well, you do what you have to do to make it fast. But whether that is an error depends on your buffer usage and the driver. Are you calling glBufferData with a non-NULL pointer repeatedly and/or with a different size? If so, yes, you should probably avoid that.
On NVidia drivers, if you’re re-specifying the same amount of data each update, orphaning the buffer (see this page for details) can be very fast and can avoid the internal driver synchronization that is otherwise potentially required when other references to that same buffer object are still “in-flight” in the pipeline.
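A minimal orphaning sketch might look like this (assumes an existing GL context; `tboBuffer`, `dataSize`, and `newData` are illustrative names, not anything from your code):

```cpp
// Re-specify the data store with a NULL pointer and the SAME size/usage to
// "orphan" the old store: the driver detaches the old memory (which in-flight
// commands may still reference) and hands back a fresh block, so the update
// doesn't have to synchronize with the GPU.
glBindBuffer(GL_TEXTURE_BUFFER, tboBuffer);
glBufferData(GL_TEXTURE_BUFFER, dataSize, NULL, GL_STREAM_DRAW);  // orphan
glBufferSubData(GL_TEXTURE_BUFFER, 0, dataSize, newData);         // refill
```

The key point is that the size and usage hint stay constant every update; that’s the pattern drivers are tuned to recognize.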
First, I’d recommend you benchmark in milliseconds/frame rather than frames/sec … for many reasons, not the least of which is that it actually makes sense to talk about how much your texture buffer object (TBO) update method costs you, so you can optimize it. Please do read this for details.
Next, I would disable your per-frame TBO updates and time frames without them. How many msec/frame? Then add those TBO updates back in and re-time your frames. What’s the difference? With this crucial data, you know exactly how much time is tied up in the update specifically, and whether it’s the “big fish”. If it is, you can focus on optimizing its time consumption specifically.
As to methods to optimize your buffer object updates, see this page: Buffer Object Streaming. That said, TBOs are weird beasts that make it a little bit hard to apply some of these techniques. It’s easier when you’re just binding buffer objects directly to the shader. Especially with NVidia bindless extensions where you can pipe the GPU address for the buffer object(s) directly into the shader, bypassing all the binding mess.
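As one example from that page, round-robining through a small pool of buffers lets each frame write into a buffer the GPU finished reading several frames ago (again just a sketch, assuming a GL context; `kPoolSize`, `bufferPool`, `texPool`, and `updateTBO` are illustrative names):

```cpp
// Cycle through a small pool of pre-created TBO buffers (each with an
// associated buffer texture) so a frame never writes into storage the GPU
// may still be reading from.
const int kPoolSize = 3;
GLuint bufferPool[kPoolSize];   // created with glBufferData up-front
GLuint texPool[kPoolSize];      // buffer textures bound via glTexBuffer
int    frameIndex = 0;

void updateTBO(const void* data, GLsizeiptr size)
{
    frameIndex = (frameIndex + 1) % kPoolSize;
    glBindBuffer(GL_TEXTURE_BUFFER, bufferPool[frameIndex]);
    glBufferSubData(GL_TEXTURE_BUFFER, 0, size, data);
    // Bind the matching buffer texture for this frame's draws.
    glBindTexture(GL_TEXTURE_BUFFER, texPool[frameIndex]);
}
```

With a pool deep enough to cover your frames-in-flight, the glBufferSubData should never stall on the GPU.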
[QUOTE=glnoob]I suspect that texture buffers are slower than uniform buffers, but more testing is needed.[/QUOTE]
From what I’ve read, your guess is probably correct. Expect “ordinary uniforms” to be very fast, with uniform buffer objects being next. On NVidia, these can be cached in the fast shared memory local to the GPU multiprocessors. Next up is probably ordinary textures, with access sped up through texture tiling and texture caches. And tailing those are things which (due to their maximum sizes) virtually have to live in slower global GPU memory, such as TBOs and SSBOs. IIRC, the driver does use part of the GPU shared memory as a global memory access cache, which should help with accessing these latter types somewhat, if there’s repetition in or locality of access to global memory.