Bindable Uniform Issues

I’d like to see a couple of the issues with bindable uniforms resolved, regarding packing (issue 2) and programmatic determination (issue 6).

The spec authors are aware of the need, so I’m just bringing it to the fore in an attempt to bring some attention to the subject.

All comments are welcome.

DX10 does specify a layout, but due to some unfortunate alignment constraints, it’s not the layout you might hope for.

I think the ideal would be transparent access of structs between CPU and GPU and the ability to memcpy them around.

There was some discussion of whether to match the DX10 memory layout or to leave it unspecified. It was an unfortunate decision (IMO) to leave it unspecified, because it just adds complex introspection where none would have been necessary.
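
For what it’s worth, here is roughly what the introspection path looks like with EXT_bindable_uniform as it stands (a sketch, not tested; assumes a linked program prog whose shader declares “bindable uniform vec4 lights[8];”):

    GLint loc = glGetUniformLocation(prog, "lights");

    /* The backing store size is implementation-defined, so ask for it. */
    GLint size = glGetUniformBufferSizeEXT(prog, loc);

    GLuint buf;
    glGenBuffers(1, &buf);
    glBindBuffer(GL_UNIFORM_BUFFER_EXT, buf);
    glBufferData(GL_UNIFORM_BUFFER_EXT, size, NULL, GL_DYNAMIC_DRAW);
    glUniformBufferEXT(prog, loc, buf);      /* attach the buffer to the uniform */

    /* Where element 3 lives is also implementation-defined, so query that too. */
    GLintptr off3 = glGetUniformOffsetEXT(prog, glGetUniformLocation(prog, "lights[3]"));
    GLfloat light3[4] = { 1.0f, 0.5f, 0.25f, 1.0f };
    glBufferSubData(GL_UNIFORM_BUFFER_EXT, off3, sizeof(light3), light3);

With a specified layout, the queries above would collapse into a single memcpy of an app-side array or struct into the (mapped) buffer.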

My preference would be just a big fixed data block that the GPU and CPU (the latter somewhat limited, since the block resides in video memory) can freely read from and write to as bits, bytes, floats (16-, 32-, and 64-bit), or integers.
Bindable uniforms would then just be a list of pointers into that memory block.

The main reasons you’d find naked pointers unpleasant are caching, memory layout, and pipeline synchronization.

If you just had a naked pointer like VAR gave you, there would be a lot less implementation flexibility on the driver / hardware side, and the burden of synchronization and cache management would fall on the app. Usually this means fences and uncached memory, both of which are painful for most apps.

The level of abstraction we have in OpenGL buffer objects is actually pretty good. You can get a pointer, which may eliminate a copy if the driver supports it, while any synchronization it requires is taken care of automatically by the driver. Or you can send in-band updates via BufferSubData. And buffers can be renamed via BufferData. Finally, it provides an abstraction from raw bytes to formatted data (like texture images). If you had to know all the memory layouts that formatted data can have, the alignment constraints, etc., you would thank your stars for the pixel transfer path.
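
To make those paths concrete, here is a rough sketch of the three update styles for an already-created buffer (assumes a current context; buf, src, size, offset, and chunk_size are placeholders):

    #include <string.h>   /* memcpy */

    /* Path 1: map the buffer.  May avoid a copy; the driver takes care of
     * synchronizing with any in-flight commands still reading the buffer. */
    glBindBuffer(GL_ARRAY_BUFFER, buf);
    void *ptr = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    memcpy(ptr, src, size);
    glUnmapBuffer(GL_ARRAY_BUFFER);

    /* Path 2: in-band update of a subrange; the data travels with the
     * command stream, no explicit synchronization needed by the app. */
    glBufferSubData(GL_ARRAY_BUFFER, offset, chunk_size, src);

    /* Path 3: "rename" the buffer.  Respecifying the data store lets the
     * driver hand back fresh memory while the old contents drain out of
     * the pipeline; the app then refills it. */
    glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_STREAM_DRAW);   /* orphan */
    glBufferSubData(GL_ARRAY_BUFFER, 0, size, src);              /* refill */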

Implementations of buffer objects have not been perfect, but having direct memory access is no panacea either.

That said, direct memory access is very useful to those who need it and are willing to handle the extra burdens associated. For consoles, it’s usually worth it because it’s a fixed platform. For PCs in general, it’s a much thornier situation.

Just from a practical usage perspective, I think vec4 packing would be preferable to nothing at all. It’s pretty easy to pack and pad structures out to match a shader, with any rule in place, but I can see the trouble in specifying this without bringing hw-specific terms into play, or becoming fixed to a packing scheme that may not be optimal now or at some point in the future. That’s why I asked; the spec mentions the possibility of an extension to address this, and it piqued my curiosity as to what it might look like.
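
For example, under DX10-style vec4 packing (where a trailing scalar can fill the fourth component of a preceding vec3), matching a shader struct from C is mostly a matter of explicit padding; a sketch with made-up names:

    /* Shader side (GLSL):
     *     struct Light {
     *         vec3  position;
     *         float radius;
     *         vec3  color;
     *     };
     */
    typedef struct {
        float position[3];
        float radius;      /* fills position's fourth component under DX10-style rules */
        float color[3];
        float pad;         /* explicit pad out to the next vec4 boundary */
    } Light;               /* 32 bytes: two vec4 slots */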

Yeah, I don’t like naked pointers either, and to be honest I don’t want unrestricted DMA. I’m thinking more along the lines of a large array (of, let’s say, 1KB of vec4s), where the pointers are really just offsets into that array, and reading/writing from the CPU would work much the same way PBOs do.
But the main thing is that there is a need for some persistent memory that all fragments have access to, for both reading and writing.

@modus:
I don’t expect to see a fix for 2.x; instead I think it will be redesigned as core in either 3.0 or 3.1, using the new object model.

Good point. It’s possible that GL3 could introduce something that would affect this extension, and as far as I know Mt. Evans will make this extension (or something like it) core, if that’s still the plan. But I figure it’ll probably be Mt. Evans before any of this is set in stone, so maybe now’s a good time to discuss it.

Cass touched on issue 3 (synchronization), which is also unresolved with respect to buffer interactions; e.g. ReadPixels and the relatively new texture buffer object extension. ReadPixels does seem to complicate things somewhat.

I’m a little surprised by the amount of “GL3 could do this” kind of speculation that seems to be going on.

As soon as the information stopped flowing, I think the logical conclusion should have been that there was nothing good to say. Any other conclusion is really foolishly optimistic. There’s no reason to expect to be pleasantly surprised come SIGGRAPH.

True; in this case it’s more like a high probability that it will be done slightly differently (due to the new object model). However, I wouldn’t dare speculate on whether it will be better or worse with OpenGL 3, just that they probably decided to push these issues forward a little.

Remember, similar things were promised for FBO (that certain shortcomings would be addressed in future extensions); instead it was pushed to OpenGL 3.

My only hope for SIGGRAPH is that they will reveal some more info and/or even the finished spec.
A simple example app would be more than enough.

Anything short of a finished spec and a timeline for implementation availability would be a massive failure at this point, and they should really provide that before SIGGRAPH so that people could come prepared with detailed questions and issues to discuss.

As it is, people are going to be expecting anything between GL 2.x and draft GL3 proposals with hardware support anywhere between DX9C class and DX10 class.

If OpenGL spec management were being run by a single company, the only possible response to the lack of real progress over the last couple of years would be to fire them. I know and really like a lot of the people who work on OpenGL standardization, but the lack of progress really hurts OpenGL.

But the hardware vendors themselves are also really hurting OpenGL. With the lone exception of NVIDIA, vendors don’t have OpenGL drivers that match their DX9 hardware features.

Cass can you spell out some specific hardware features that you would like to see addressed in OpenGL (3.0 or beyond)?

Here’s a quick list off the top of my head:

  • All DX10-class functionality. (This is a pretty long list.)
  • The ability to change objects without binding them into the state vector.
  • Better support for multi-threaded programs that need to work with GL objects at the same time.
  • Display lists that reference VBOs instead of copying them.
  • Improvements to occlusion query and conditional render.
  • Error callbacks, perf counters, and other development-centric facilities.

As I said, I’d also really like the vendors to implement the specs that have been out there for years now too. AFAIK, NVIDIA has the only OpenGL driver that you can do fp16 ReadPixels with, even though everybody’s hardware supports it under D3D.
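
For reference, the readback in question is nothing exotic; a sketch, assuming a current context with an fp16 color buffer bound and ARB_half_float_pixel available (w and h are placeholders):

    /* 16-bit halves read back as unsigned shorts. */
    unsigned short *pixels = malloc(w * h * 4 * sizeof *pixels);
    glReadPixels(0, 0, w, h, GL_RGBA, GL_HALF_FLOAT_ARB, pixels);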

I don’t think there’s any lack of things to do.

For the future of OpenGL, though, I think that Larrabee style programmability is the biggest question mark. If programmability successfully rips away the API veil (which a lot of people would really like), then OpenGL and D3D both become less relevant - at least how they are in their current incarnation.

I think if the ARB were future-looking, they would already be considering this and trying to understand what role, if any, the OpenGL abstraction should play, and how it can continue to be relevant and useful for more than just backward compatibility.

That’s a great list.

#1 could be re-stated as “the core GL spec needs to track available hardware in as timely a fashion as possible”, and it’s plain within the working group that this is high priority. NVIDIA did a good job on this for G80/8800 with their GL extension set; the obvious question relates to feature coverage in the core spec, so it becomes a multi-vendor thing.

#2 isn’t really hardware related but I agree would make a nice improvement to the classic GL API. I wouldn’t mind seeing ‘selectors’ go away for good.
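
To make the selector pain concrete, here’s the bind-to-edit dance today versus what a selector-free entry point might look like (the latter is purely hypothetical, not a real function):

    /* Today: the texture unit's binding is a selector the edit goes
     * through, so we save and restore it around the change. */
    GLint prev;
    glGetIntegerv(GL_TEXTURE_BINDING_2D, &prev);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glBindTexture(GL_TEXTURE_2D, (GLuint)prev);

    /* Hypothetical: name the object directly, never touch the state vector. */
    /* glTexParameteriDirect(tex, GL_TEXTURE_MIN_FILTER, GL_LINEAR); */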

#3 is a little vague, or could be interpreted a few ways; can you offer a use case?

#4 - also a clever improvement to the API and not tied to hardware. (API improvements not tied to hardware are really good to pursue, because they can reach more users.)

#5 - what kind of improvements?

#6 - these sound cool; callbacks are tough in an async / threaded-driver world, though.

W.r.t. long-term trends towards full programmability, I think the importance of having a ‘classic’ API will vary based on developer audience. There is always going to be a sector of developers who are ready to go “one step beyond” in the quest to get closer to the metal or to fully leverage some new silicon, and those people may be the early adopters that use non-classical approaches for their apps.

That said, you can already see some Khronos effort underway (the Compute working group) to bring about standardization for GPU/CPU programming along the lines of CUDA. I’d consider that somewhere in between near-term and future-looking.

just some thoughts,

A couple of examples for #3 are:

  • updating “virtual textures” out-of-band with a separate thread
  • altering the contents of buffer objects for skinned models or depth sorted triangles

The fact is, more cores means more threads and the graphics API needs to allow these things to get updated in parallel. Object allocation/destruction is trickier, but needs similar support.

OpenGL’s current context-per-thread binding model is far too coarse-grained and vaguely specified, and has always been a source of both performance and functional bugs.
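
As a very rough sketch of what people try to do today with shared contexts (platform-specific context creation omitted; the Frame type and the sync helpers are app-side placeholders, not real API):

    typedef struct { GLuint vbo; GLsizeiptr size; const void *skinned_verts; } Frame;
    extern Frame *wait_for_next_frame(void *arg);   /* app-side synchronization */
    extern void   signal_frame_ready(Frame *f);     /* app-side synchronization */

    /* Worker thread: streams skinned vertex data into a VBO that the render
     * thread draws from.  Assumes a second context, created with object
     * sharing enabled (wglShareLists / glXCreateContext's share argument),
     * has been made current on this thread. */
    void *skinning_thread(void *arg)
    {
        for (;;) {
            Frame *f = wait_for_next_frame(arg);
            glBindBuffer(GL_ARRAY_BUFFER, f->vbo);
            glBufferSubData(GL_ARRAY_BUFFER, 0, f->size, f->skinned_verts);
            glFlush();      /* make the update visible to the other context */
            signal_frame_ready(f);
        }
        return NULL;
    }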

Re #5,

  • fine-grained occlusion queries that give you a bitmask of prims that failed the test rather than a single bit for all the prims
  • a boolean result query that could early-out when buffer writes were masked instead of faithfully counting every sample (and taking precious time to do it)
  • predicated rendering that will render a bounding model with buffer writes masked if the predicate is false (or similar)
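
For contrast, the query we have today is all-or-nothing and count-everything; a sketch of the usual pattern (the draw calls are app-provided placeholders):

    extern void draw_bounding_boxes(void);   /* app-provided */
    extern void draw_real_models(void);      /* app-provided */

    GLuint q;
    glGenQueries(1, &q);
    glBeginQuery(GL_SAMPLES_PASSED, q);
    draw_bounding_boxes();
    glEndQuery(GL_SAMPLES_PASSED);

    /* One sample count for everything between Begin/End, no per-primitive
     * result, and the counter runs even when a yes/no answer would do.
     * Reading it back here may also stall the CPU. */
    GLuint samples = 0;
    glGetQueryObjectuiv(q, GL_QUERY_RESULT, &samples);
    if (samples > 0)
        draw_real_models();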

On #6, some attention needs to be paid to facilitating the development / debug process. A callback from the threaded driver is still better than no callback at all. It will usually get you within a few entry points of your trouble. My point is, today you have to know somebody who knows somebody to get help like good driver breakpoint addresses, and nobody is trying to solve these kinds of problems for OpenGL developers. I don’t even care if it’s platform or vendor specific. I just think these are things that drivers could help with but don’t.

The Compute working group is a different beast with different goals. What I’m talking about is a graphics-centric working group focused on OpenGL for a “mostly compute” device instead of the very biased OpenGL Machine we have today. To me, forward-looking would be trying to understand what OpenGL should look like with such an open, programmable hardware abstraction.
Neither D3D nor OpenGL is doing that.

Thanks -
Cass

I would like to say that I agree with the author of the post and with Cass (amongst others) that the programming interface of the EXT_bindable_uniform extension should be revised and redefined to make it easier to use and more portable (Cass referred to the DX10 layout as an example).