GLSL Program: uniform state packet

One bit that is… irritating is that the values of GLSL uniforms are part of the state of the GLSL program itself. The biggest ugly is for multi-context rendering (which, I admit, is not a common case).

Regardless, here is the suggestion:

Create a new type of object, say “GLSLProgramState” created by a function


/*
 Creates a GLSLProgramState and returns its name
 \param glslprogram name of a linked GLSL program
*/
GLuint glCreateProgramState(GLuint glslprogram);

With such a state object one can set the uniform values and other per-program values that are part of the state of a GLSL program:


void glProgramStateUniformT(GLuint programState, GLint location, T v);

and now the part which likely needs tweaking and/or discussion: using that GLSLProgramState object for the state of the GLSL program. Which particular GLSLProgramState is in use should be local to a context. Several hackish ideas:

  • For each draw call, create a new draw call that passes the GLSLProgramState object.
  • For each GLSL program, have a state value per context specifying which GLSLProgramState to use, where 0 indicates using the state of the GLSL program that is shared between contexts.
The latter has the advantage that it likely plays better with SSO.

Using uniform buffer objects for uniform data does not provide that which this would provide:

  • Texture unit sourcing
  • Fetching values from a UBO is slower than fetching from a uniform
  • A GLSLProgramState object would also store from which buffer objects a uniform block sources
Other uglies are what to do when a GLSL program is relinked. My preference is that any GLSLProgramState objects of that relinked GLSL program are then invalid. Additionally, using a GLSLProgramState with a GLSL program other than the one that generated it is an error.

Comments welcome, flames mostly not.

Hello,

if you want to use the same shader with different state, why not just generate two programs and attach the shaders to both? As OpenGL already has shaders and shader programs as separate entities, the implementation can optimize here and e.g. not duplicate the compiled binary (unless that has some advantages, but then that would also be true for program state objects).

Passing extra objects to the draw call means that all your draw code has to be aware of the current shader program :frowning:

A shader state is only ‘compatible’ with its own program, so I couldn’t use one state with two programs, only one program with two states (e.g. the uniforms would not match). That’s not as flexible as e.g. texture samplers (which are also a bundle of sampling state, but much more generic).

Would a shader state be a state for the whole pipeline or just a shader - how will it interact with program pipeline objects?

Sharing uniforms can be done much more flexibly with uniform buffers (I can share them between different shader programs!).
From the spec there is no reason why they should be slower; I see no reason not to map the uniform block from a buffer into the registers at program start (if it’s small enough). Swapping uniforms into registers from a buffer or from a state object makes no performance difference.

I see no use case that could not be solved by additional shader programs with already existing shaders and uniform buffers.

Texture unit sourcing

a GLSLProgramState object would also store from which buffer objects a block sources

Is that really a common use case? I imagine most people have a general set of common conventions for texture units and uniform buffer binding points. You know, diffuse comes from texture 0, normal map from texture 2, perspective matrix data comes from binding point 0, model-to-camera comes from binding point 1, etc.

Since the textures and buffers bound to the contexts are already context state, all you need to do is bind the appropriate textures/buffers to the right binding points. There’s no need to change which units/bind points the program itself uses after creation.

After all, that’s a good part of the reason why we can set them in the shader now.

Fetching values from a UBO is slower than fetching from a uniform

Is it? I’d be curious to see some profiling data on that.

I think the idea that UBOs have to be slower than plain uniforms comes from the notion that uniforms are stored in registers while UBOs are somewhere in memory like textures. The possibilities for storing something on the GPU are much wider, and in other languages (CUDA, OpenCL) more exposed than in OpenGL.

UBOs can be too large to fit into the register file of a Streaming Multiprocessor (in NVIDIA terminology), so in the worst case they have to be stored in larger but slower memory. But IMHO you can’t conclude from that that the driver will not try to store even UBO data in the same registers, side by side with regular uniforms, if it fits.

Some thoughts about access speed to buffers: http://rastergrid.com/blog/2010/11/texture-and-buffer-access-performance/

Nothing in the spec prevents uniform buffers from being as quick as uniforms, and nothing would prevent program-state-stored uniforms from performing as ‘bad’ as UBOs… So performance can’t be an argument here.

It should also be noted that ATI hardware (even the HD 3xxx series) reports 15 uniform buffer binding points per shader stage. That’s an irregular number; one would usually expect 16. So I suspect that they burn one for the actual program state, especially considering that DX10+ doesn’t have non-buffer uniforms anymore.

On modern GPUs (usually which support UBOs) the uniforms in the default uniform block (i.e. standalone uniforms) are usually stored in a default UBO too. Also, UBO memory is faster than texture memory and a uniform access barely ever means a global memory access but rather access to a local constant store.

In general, nobody should expect standalone uniforms to have better performance than uniforms in UBOs.

Because of this, the proposal is actually kind of useless: even though program state (and as such standalone uniform values) is not shared between contexts, buffer objects can be shared, thus UBOs can be shared, which should be enough.

Actually, I wrote that article so you can trust me that there’s nothing there about standalone uniforms being faster than UBO accesses.

That’s true, but there’s no reason to believe that they perform better. Actually, in 90% of the cases you’ll experience slower performance with standalone uniforms because of the following reasons:

  1. The driver has to either upload every single value on the fly or cache it in an internal structure and submit it all at once. Neither of these approaches is as efficient as UBOs.
  2. The API overhead is much higher with standalone uniforms, as updating a potentially large number of uniforms requires a lot more API calls. And as most applications are currently CPU bound, in most cases you’ll have better performance with UBOs.

In general, UBOs are the way to go. Actually, standalone uniforms are in practice deprecated (even if the specification did not deprecate them).

In general, UBOs are the way to go. Actually, standalone uniforms are in practice deprecated (even if the specification did not deprecate them).

Without access to actual hardware specifications I have trouble swallowing this. Classically, uniforms are stored in registers, and as such accessing their values is free, whereas uniform buffer objects access memory. There is caching, no denial of that, which can hide this somewhat or a lot.

  1. The API overhead is much higher with standalone uniforms, as updating a potentially large number of uniforms requires a lot more API calls. And as most applications are currently CPU bound, in most cases you’ll have better performance with UBOs.

There are interfaces to set many uniforms at once, i.e. the API calls to set arrays of uniforms. Additionally, if we want to talk efficiency and such, a simple jazz is this: the program state (i.e. the values of uniforms) is stored CPU-side, setting a uniform just writes the value on the CPU (no big deal), and the implementation simply tracks which ranges of bytes are dirty and sends those over… though one can quickly see that this is exactly how an application may choose to update “just those values” that change in its own uniform buffer objects.

Regardless though, even if uniforms of the default block are essentially implemented via memory fetches [which I have a hard time swallowing, really], bits of GLSL program state are still icky:

  • texture unit jazz
  • binding point sources for uniform buffer objects…

Also, the idea of essentially replicating the GLSL program and avoiding the recompile/relink overhead via the program binary jazz reeks of hackage. Additionally, plenty of archs do weird things with respect to GL state on a GLSL program [blending and masking in particular].

In general, UBOs are the way to go. Actually, standalone uniforms are in practice deprecated (even if the specification did not deprecate them).

This is a heck-a-strong statement. Source? I have a hard time swallowing this as well, since recent specs added glProgramUniform anyway…

Yes, there are only UBOs under the hood. Plain uniforms only existed on DX9 hw, where uniforms were stored on the chip. Nowadays it’s all just buffers in memory (RAM or VRAM).

Yes, there are only UBOs under the hood. Plain uniforms only existed on DX9 hw, where uniforms were stored on the chip. Nowadays it’s all just buffers in memory (RAM or VRAM).

Starting at what generation of hardware for each vendor? And again, a source would be ideal.

There’s no need to believe me. Benchmark it yourself on any OpenGL 3+ GPU.

However, if I would be you, I wouldn’t really expect the proposal to be accepted…

There’s no need to believe me. Benchmark it yourself on any OpenGL 3+ GPU.

All I am asking for is a source that the default block is implemented as a UBO, and with that source, what hardware does that. A benchmark will not prove it, nor will it disprove it [because of caching magicks and all the other variables in graphics performance]. Just a doc, a spec, anything. The reason I’d like a source or a reference is essentially just to know; it stems from curiosity. Additionally, there have been previous posts that have stated similar thoughts to mine (which you have also responded to):

Here it is:

Edit: since I wasn’t sure about that 8192 logic, I rechecked docs and tested the GLSL limits. The 8192 on G80 isn’t 8kB, but 8192 32-bit registers (32kB). On GTX2xx, it’s 16k registers, 64kB.

GL_MAX_VERTEX_UNIFORM_COMPONENTS = 4096 // 16kB
GL_MAX_FRAGMENT_UNIFORM_COMPONENTS = 2048 // 8kB
So, the driver reserves at least 40kB on the GTX for the thread-data of warps.

Furthermore, I checked if I raise the limit of instances to use the whole 16kB, and there was no performance penalty. (raising it further makes the program fail to link).

Thus, maybe simply we need to use-up the GL_MAX_VERTEX_UNIFORM_COMPONENTS instead of tuning.

4096/ 12 = 341 instances max, if only mat4x3 per instance.
4096/4 = 1024 instances if you use only “vec3 pos; float rotateY;”
4096/3 = 1365 instances if you use only “vec3 pos”;
4096/1 = 4096 instances if you use “int StaticID;” in combination with truly-constant UBOs, which can contain nice 800kB constant data with slightly slower access.

Above, from an older thread about an OpenGL 3 GPU, even Ilian is not too sure whether uniforms are stored in a regfile or in L1-cached constant memory. Benchmarking is not going to give a conclusive, well, conclusion :smiley: nor will benchmarking plus knowing some of the specs. That is why I ask for a source. It might be that for G80 vertex uniforms are in the regfile while fragment uniforms are in L1 cache… who knows! Or it might be that both are UBOs, or that both are in regfiles… who knows!

At any rate, let’s get back to the suggestion: the idea of a per-context GLSLProgramState. The want for it is not a performance issue (though IF default block uniforms are not in a UBO, it can help). The issue is using the same GLSL program in multiple contexts, possibly simultaneously. The only realistic way to do this currently is to recreate the GLSL program in each context, which means either recompiling and relinking (bad!) or using the binary interface to rebuild it (hack-city, and this is what a certain framework library for the N9 does for GLES2). Moreover, going back in time to OpenGL 2.x (or using a compatibility profile), GLSL programs had limited per-context data: the data of the GL state.

One can point out that the idea of GLSLProgramState is not what one wants then, really, but rather per-context GLSL program state values. This is fine too, except that the GLSLProgramState gives a little more. Indeed, since it is completely up to the IHV’s discretion how default block uniforms are implemented, the GLSLProgramState interface gives an implementation the opportunity to match more closely how the hardware implements it.

Just don’t modify the UBO VB data mid-frame; Sometimes drivers will have to clone the whole VB in order to keep the data for previous kernels intact. (to be able to run different kernels in parallel).
Small UBO bound ranges may be DMA’d into internal registers, I guess. (it’s hinted by some DX perf notes by AMD and nVidia for cbuffers). If that’s the case, then this “uniform state packet” brings only one new thing: you won’t need to care about mid-frame VB modifications.
Meanwhile MVP matrices and such states, usually placed in the default uniform buffer, probably are placed in an internal UBO/fifo, which the driver will know doesn’t need cloning ever.

The main purpose of this uniform-state-packet thing is not really about performance, it is about convenience of using the same shader in multiple contexts, and the pain for that is that values of uniforms are part of GLSL program state. Likely I should have named the thread something else like “Multi-context GLSLProgramState thingamajig”. eww.

I don’t think this will be on ARB agenda any time soon given niche usage and complications it involves.

Also, do you get any perf advantage from actually drawing from multiple threads (I assume multiple threads, otherwise I can’t see much point)?

it is about convenience of using the same shader in multiple contexts, and the pain for that is that values of uniforms are part of GLSL program state.

We get that. The thing is, you can get this effect now, with the current OpenGL. All you need to do is put all your uniforms in a block and use UBOs. Then establish a convention for your texture units and uniform block binding points, so that you don’t have to have different ones for each thread. Texture 0 is where both threads look for a texture. It’s just that each thread can have a different texture bound to its context, so they can render with different textures.

The use cases I have in mind fall under the category of making middleware framework type thingies. The jazz of a fixed program having a set of conventions for what a given texture and buffer binding point hold and do is fine, but this gets quite dicey in the framework middleware arena… Again, for a fixed program, UBOs with conventions for texture units and buffer binding points are likely enough, but for middleware framework like things, it can get icky and dicey…

You can either believe me, or obtain that info yourself, or not believe; I don’t care. There’s driver code in Mesa3D (AMD have been involved in its development) and freely available GPU documentation on the AMD website. There is a group of people who have been reverse-engineering NVIDIA GPUs ever since they launched, and those people often know more about NVIDIA GPUs than AMD have ever released about their own. Given that, Mesa3D is always a good starting point if you have technical questions about low-level GPU programming.

In a nutshell: There are 16 context registers on Radeon DX10 and later GPUs, each expecting a pointer to a uniform buffer in memory. The shader cores then fetch uniforms from those buffers. The registers are called SQ_ALU_CONST_CACHE_*, google R6xx 3D Registers PDF.

The most common slowdowns and frustration from using buffers come from the fact that people just don’t know how to upload data efficiently. There is a CPU-GPU synchronization (like glFinish) which happens every time you map a buffer without the correct flags. Here’s some advice on how to avoid the synchronization:

  • Use GL_MAP_INVALIDATE_RANGE_BIT or GL_MAP_INVALIDATE_BUFFER_BIT or GL_MAP_UNSYNCHRONIZED_BIT. The last one is usually used if you write to a buffer range that you know the GPU isn’t using.
  • Or use glBufferData (uses GL_MAP_INVALIDATE_BUFFER_BIT)
  • Or use glBufferSubData (uses GL_MAP_INVALIDATE_RANGE_BIT)
  • Or use sync objects to determine if a particular buffer range is not being used by the GPU and if it isn’t, map it with GL_MAP_UNSYNCHRONIZED_BIT. Otherwise you can’t map it.
  • If you can’t use any of the ways above and you still wanna map the buffer, it’s gonna cost a lot of time if the buffer is being used by the GPU at the time you call glMapBuffer*.
  • Also create buffers with the flag GL_STREAM_DRAW if you upload data frequently.

Easy there :D, I just wanted to know, and you have answered the question wonderfully (and I thank you for it) for the open-source ATI drivers; and if the public docs on AMD/ATI hardware are complete with respect to 3D rendering (which I would peg at a chance of like 99.99999999%), then for ATI/AMD hardware regardless of the driver.

Wonder what Intel and NVIDIA do in each of their generations of hardware…