glBindBufferRange hugely expensive

GPU: Quadro K5000, Driver: 320.86, OS: Windows7_64

I just switched over from plain old uniforms to UBOs, so each material now maintains its own UBO for its properties and calls glBindBufferRange once per bind instead of calling glUniform 7 times per bind.
There are 33,000 material binds per frame (don’t ask, just assume that currently I can’t afford to reduce or sort this), and 33,000 mesh draws (so one material bind per mesh).

The frame time has gone from 115ms up to 2215ms since switching to UBOs.

Has anyone experienced something similar?
Is this just the cost of a buffer object bind? (More expensive than all those glUniform calls it was previously doing??!)
Is bindless possible with uniform buffers? (A quick scan of the docs seems to suggest not.)

Thanks in advance.

How are you updating the UBOs? In my experience this is the most likely source of trouble, as unfortunately GL appears to be quite inefficient with the most obvious way of updating them.

Instead of a single UBO per material, the way to go is to create a single huge UBO that contains all materials, update it only once per frame, then make your glBindBufferRange calls per material. That’s probably going to require some heavy reworking of your current code (and will be more fiddly if you want to provide a “without UBOs” path for downlevel hardware), but it does work; see this thread for further discussion.
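
A rough sketch of the CPU side of that layout, in case it helps. This is not from any real codebase: MaterialBlock, alignUp and packMaterials are made-up names, and the 256-byte alignment in the comments is only an example value. The real value has to be queried with glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, ...), because every offset passed to glBindBufferRange must be a multiple of it. The GL calls appear only as comments so the snippet compiles without a GL context.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical material block; the fields are placeholders, not the poster's data.
struct MaterialBlock {
    float diffuse[4];
    float specular[4];
    float params[4];
};

// Round size up to the next multiple of alignment.
inline std::size_t alignUp(std::size_t size, std::size_t alignment) {
    assert(alignment > 0);
    return (size + alignment - 1) / alignment * alignment;
}

// Copy every material into one staging array, one aligned slot per material.
inline std::vector<std::uint8_t> packMaterials(
        const std::vector<MaterialBlock>& materials,
        std::size_t offsetAlignment) {
    const std::size_t stride = alignUp(sizeof(MaterialBlock), offsetAlignment);
    std::vector<std::uint8_t> staging(stride * materials.size(), 0);
    for (std::size_t i = 0; i < materials.size(); ++i) {
        std::memcpy(staging.data() + i * stride, &materials[i],
                    sizeof(MaterialBlock));
    }
    return staging;
}

// Intended use with GL (assumed variable names, shown as comments only):
//   GLint align = 0;
//   glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &align);
//   auto staging = packMaterials(materials, align);
//   glBindBuffer(GL_UNIFORM_BUFFER, ubo);
//   glBufferData(GL_UNIFORM_BUFFER, staging.size(), staging.data(),
//                GL_STATIC_DRAW);
//   // per material i, with stride = alignUp(sizeof(MaterialBlock), align):
//   glBindBufferRange(GL_UNIFORM_BUFFER, bindingPoint, ubo,
//                     i * stride, sizeof(MaterialBlock));
```

The padding between slots wastes some memory, but it means there is only one buffer object and one upload, with the per-material call reduced to a different offset into the same buffer.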

In addition, I doubt that many driver teams have ever tested performance with such a huge number of binds and draws, and even if they have, the drivers are certainly not optimized for it. I’d say that it’s highly unlikely that you actually have 33,000 unique materials, and that what you’ve got here is a serious design problem. I know you said “don’t ask, etc” but seriously - if you’re looking for extra performance you’re going to get far more mileage out of tackling this problem.

thanks for the reply.

1/ I’m only updating the content of each UBO once at scene load time, not every frame.
2/ Your second suggestion wouldn’t solve this issue, because even if I just bind the same small UBO on every material bind I get the same incredibly slow frame time.
3/ I know, I appreciate what you’re saying. My main render path does sort everything. This is a secondary render path to deal with datasets where the CPU cost of sorting becomes the bottleneck. Any offline sorting is also not wanted by the customer. I can’t use a cached scenegraph approach because of latency issues for the customer (potentially we could get many material property changes per frame).

The question is, when all’s said and done, why is glBindBufferRange so costly? Way more so than 7 glUniform calls?
If I call it once per frame with the first material’s properties I get a decent frame rate, so it’s not the shader becoming slow when it’s using a std140 layout UBO.
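
For reference, a sketch of what a seven-property std140 block can look like when mirrored on the C++ side. The field names are invented for illustration and are not my real material properties. The offsets in the comments follow the std140 rules (a vec4 aligns to 16 bytes, a float to 4), and the static_asserts check that the C++ struct matches them.

```cpp
#include <cstddef>

// Assumed GLSL declaration this struct mirrors (hypothetical names):
//   layout(std140) uniform Material {
//       vec4 diffuse; vec4 specular; vec4 emissive;
//       float shininess; float opacity; float reflectivity; float ior;
//   };
struct alignas(16) MaterialStd140 {
    float diffuse[4];    // vec4,  offset 0
    float specular[4];   // vec4,  offset 16
    float emissive[4];   // vec4,  offset 32
    float shininess;     // float, offset 48
    float opacity;       // float, offset 52
    float reflectivity;  // float, offset 56
    float ior;           // float, offset 60
};

static_assert(offsetof(MaterialStd140, specular) == 16, "vec4 at offset 16");
static_assert(offsetof(MaterialStd140, emissive) == 32, "vec4 at offset 32");
static_assert(offsetof(MaterialStd140, shininess) == 48, "float at offset 48");
static_assert(offsetof(MaterialStd140, ior) == 60, "float at offset 60");
static_assert(sizeof(MaterialStd140) == 64, "block occupies 64 bytes");
```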

Thanks for your help, I really appreciate it.