Per-program uniform storage size vs. per-shader

Hi,

I have two questions regarding the total uniform storage available to shaders. By total, I mean the size of the default uniform block plus the user-defined uniform blocks (e.g. ARB_uniform_buffer_object).

A) The driver can expose different values/limits for vertex, tessellation, geometry and fragment shaders, e.g.:

MAX_COMBINED_VERTEX_UNIFORM_COMPONENTS,
MAX_COMBINED_FRAGMENT_UNIFORM_COMPONENTS,
MAX_COMBINED_GEOMETRY_UNIFORM_COMPONENTS,
MAX_COMBINED_TESS_CONTROL_UNIFORM_COMPONENTS and
MAX_COMBINED_TESS_EVALUATION_UNIFORM_COMPONENTS.

Let’s call the maximum of these values ‘ABSOLUTE_MAX’.

Uniforms are program-wide concepts. In every shader, I’m supposed to be able to declare:

uniform MyBlock
{
  float components[ABSOLUTE_MAX];
};

But the fragment shader limit (I pick the fragment shader as an example here) might indeed be lower. Does OpenGL allow me to:

  1. declare the above in my fragment shader? Or must I shrink the array specifically in the fragment shader, even though more data is available in the buffer?

  2. index the above uniform beyond MAX_COMBINED_FRAGMENT_UNIFORM_COMPONENTS, i.e.:


// Fragment shader
// (ABSOLUTE_MAX and MAX_COMBINED_FRAGMENT_UNIFORM_COMPONENTS stand in for
//  the values queried from the API; they are not GLSL identifiers.)
uniform MyBlock
{
  float components[ABSOLUTE_MAX];
};

void main(void)
{
  float f = components[MAX_COMBINED_FRAGMENT_UNIFORM_COMPONENTS-1]; // ok
  float f2 = components[MAX_COMBINED_FRAGMENT_UNIFORM_COMPONENTS]; // may crash
}

B) Second question.
MAX_COMBINED_TESS_CONTROL_UNIFORM_COMPONENTS is 198656 / 262144 on NVIDIA / ATI drivers, respectively; at 4 bytes per component, the latter represents 1 MB of storage.
These values are very high, and go well beyond the ‘constant’ memory space mentioned in the G80 / CUDA documentation, which is 64 KB.

Does this mean uniform blocks are not stored in constant memory any more?

Thanks,
Fred

Uniforms are program-wide concepts. In every shader, I’m supposed to be able to declare:

You’re not supposed to be able to do that.

That is a uniform block definition. Uniform blocks have a different maximum size, based on the byte storage of a uniform buffer. The value you want is GL_MAX_UNIFORM_BLOCK_SIZE: the maximum size (in bytes) of any one uniform block. That limit applies to each uniform block individually.

The max combined value is equivalent to (the per-stage number of uniform blocks * the maximum size of any one uniform block, divided by the size of a float) plus the maximum number of default-block uniform components for that stage. I honestly have no idea why they even exposed this number, as it’s functionally useless. If you are using this value to determine the size of anything you pass to OpenGL, you are using it wrong.
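
If it helps to see the relationship, here is a minimal C sketch of querying the pieces; it assumes a GL 3.1+ context is current, the header below stands in for whatever GL loader you normally use, and the helper name is made up:

#include <stdio.h>
#include <GL/glcorearb.h> /* assumption: substitute your loader's header (glew, glad, ...) */

/* Prints the fragment-stage limits and recomputes the "combined" figure. */
static void print_fragment_uniform_limits(void)
{
    GLint blockSize = 0, fragBlocks = 0, fragDefault = 0, fragCombined = 0;
    glGetIntegerv(GL_MAX_UNIFORM_BLOCK_SIZE, &blockSize);                      /* bytes per uniform block   */
    glGetIntegerv(GL_MAX_FRAGMENT_UNIFORM_BLOCKS, &fragBlocks);                /* blocks usable by the FS   */
    glGetIntegerv(GL_MAX_FRAGMENT_UNIFORM_COMPONENTS, &fragDefault);           /* default-block components  */
    glGetIntegerv(GL_MAX_COMBINED_FRAGMENT_UNIFORM_COMPONENTS, &fragCombined); /* the "combined" number     */

    /* The combined limit is (at least) blocks * blockSize/4 + default-block
       components; it is not the size of any single array you can declare. */
    printf("fragment: %d blocks * %d bytes / 4 + %d default components = %d (driver reports %d)\n",
           fragBlocks, blockSize, fragDefault,
           fragBlocks * blockSize / 4 + fragDefault, fragCombined);
}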

The rest of your post doesn’t make sense due to this. At no point can you allocate an array of any max combined uniform component size, whether in a uniform block or in a regular uniform definition.

Does this mean uniform blocks are not stored in constant memory any more?

Who knows? The driver is permitted to do whatever it wants. How do you know they were stored in constant memory before? My GT250 shows ~200,000 max uniform components in each stage, which is well over 64 KB.

Maybe if your buffers are sufficiently small, they get copied into constant memory, whereas if they’re too large, they go elsewhere? There’s no way to know.

Thanks for your reply.

OK, I had not seen that one. It is 65536 bytes and matches the constant memory size.

[quote]Does this mean uniform blocks are not stored in constant memory any more?

Who knows? The driver is permitted to do whatever it wants. How do you know they were stored in constant memory before? My GT250 shows ~200,000 max uniform components in each stage, which is well over 64 KB.

Maybe if your buffers are sufficiently small, they get copied into constant memory, whereas if they’re too large, they go elsewhere? There’s no way to know. [/QUOTE]
The question is: how are we supposed to optimize our code if we can’t have an idea about how things work behind the scenes?
With CUDA for instance (where I try to get most of my knowledge from, about how the GPU works) everything is explained: SMs, SPs, constant memory, texture memory, shared memory, cache sizes and so on. Every single detail is given.
What about GLSL? A detail or two would be welcome.

My shaders need to access a lot of data. I am torn between the following two possibilities:

  1. using a texture (with a TBO) and getting data from there. This is what I am currently doing. The CUDA programming guide - which, let’s face it, is more a GPU programming guide than anything else - talks about Texture Memory. So I can reasonably well understand the pros and cons of using this memory.

  2. using UBOs. My shader could use, say, 4 uniform blocks (as I can use several of them at the same time) of 65536 bytes each, instead of one single 262144-byte texture. Generally speaking, uniforms are said to be faster than texture fetches. Should I attempt to do this? Implementing it takes time, so it would be nice to have an idea in advance of whether it will pay off. One given shader execution/thread will do most of its accesses within a given 65536-byte range. (A sketch of binding several blocks this way follows below.)
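
For reference, here is a hedged sketch of possibility 2. It assumes a GL 3.1+ context, that the usual GL loader header is already included, a linked ‘program’ whose fragment shader declares four blocks named Block0…Block3 (hypothetical names), and one buffer carved into four 64 KB slices:

#include <stdio.h>
/* assumes your usual GL loader header is already included */

/* Bind four 64 KB slices of one buffer to four uniform blocks.
   Offsets must respect GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT; 64 KB
   multiples are safely aligned on every implementation I know of. */
static void bind_four_blocks(GLuint program)
{
    enum { BLOCK_BYTES = 65536, NUM_BLOCKS = 4 };

    GLuint ubo;
    glGenBuffers(1, &ubo);
    glBindBuffer(GL_UNIFORM_BUFFER, ubo);
    glBufferData(GL_UNIFORM_BUFFER, BLOCK_BYTES * NUM_BLOCKS, NULL, GL_STATIC_DRAW);

    for (int i = 0; i < NUM_BLOCKS; ++i) {
        char name[16];
        snprintf(name, sizeof name, "Block%d", i);       /* hypothetical block names   */
        GLuint index = glGetUniformBlockIndex(program, name);
        glUniformBlockBinding(program, index, i);        /* block -> binding point i   */
        glBindBufferRange(GL_UNIFORM_BUFFER, i, ubo,     /* binding point i -> slice i */
                          (GLintptr)(i * BLOCK_BYTES), BLOCK_BYTES);
    }
}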

The other, last question that I have is about how varyings are handled by the GPU.

When a given shader passes values down to subsequent stages, where are these values stored, typically? Registers (unlikely)? Shared memory (my guess)? Global memory?

I can potentially pass down some values, which I read just once from the UBO or TBO in my top-level shader (e.g. the vertex shader), as varyings. Will this be worth it?

Cheers,
Fred

The question is: how are we supposed to optimize our code if we can’t have an idea about how things work behind the scenes?

The hard way. Try things and test to see if they’re faster.

  1. using UBOs. My shader could use, say, 4 uniform blocks (as I can use several of them at the same time) of 65536 bytes each, instead of one single 262144-byte texture. Generally speaking, uniforms are said to be faster than texture fetches. Should I attempt to do this? Implementing it takes time, so it would be nice to have an idea in advance of whether it will pay off. One given shader execution/thread will do most of its accesses within a given 65536-byte range.

On Nvidia GL3 hardware at least, the constant storage memory is actually a cache for global memory. So if you were to access several UBOs of combined size > 64K, it would be swapping out portions of the constant cache (the nvidia whitepaper indicated 128B cache lines). If your access pattern were fairly random, this might work okay. If you were accessing the same element from all 4 UBOs for the same bit of output data, you’d just be doing global memory fetches.

In many cases I might be able to fit all my data in 64KB.

It seems it’s best for me to use uniforms/constant memory.

But indeed, the benefit of using constant memory vs texture memory is not completely clear to me yet.

CUDA Programming Guide
http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_CUDA_ProgrammingGuide.pdf

Excerpt from section 5.3.2.5 about Texture Memory:

Reading device memory through texture fetching present some benefits that can make it an advantageous alternative to reading device memory from global or constant memory:

  • If the memory reads do not follow the access patterns that global or constant memory reads must respect to get good performance (see Sections 5.3.2.1 and 5.3.2.4), higher bandwidth can be achieved providing that there is locality in the texture fetches (this is less likely for devices of compute capability 2.0 given that global memory reads are cached on these devices);
    […]

Now looking at the following PDF
www.networkmultimedia.org/Publications/practicals/beyer2009.pdf
the chart at the top of page 13 (PDF page 25) says:

Constant (C-cache) on-chip cache: register latency (i.e. very fast access)
Texture (Tex L1) on-chip cache: > 100 cycles

Does that mean that every texture access, even in the case of the Tex L1 cache hit, as this document mentions, takes about 100 cycles?

Does that mean that every texture access, even in the case of the Tex L1 cache hit, as this document mentions, takes about 100 cycles?

It’s quite likely that texture accesses are that slow; however, modern GPUs are bandwidth-optimized, not latency-optimized like CPUs. After a texture request, the thread will be swapped out for another thread and processing will continue. Since GPUs keep a huge number of threads in flight, this helps to hide any memory/texture latency. As long as you have some compute work to do in other threads, the GPU should be able to balance memory/texture bandwidth use with compute, and all should be well.

The number of locals and varyings you use in the program will directly affect the number of threads a shader module can handle at once, just as with CUDA/OpenCL. So depending on your program, you may not see the texture accesses have much of an effect, or you might have them completely bottleneck your shader. So, as Alfonse says, you pretty much need to test it to see :slight_smile:

Do you mean the number of varyings affects how many threads can run concurrently on the GPU?

What makes you believe this?

Agreed.

Do you mean the number of varyings affects how many threads can run concurrently on the GPU? What makes you believe this?

An educated guess based on bits of information here & there, so I could be wrong.

GLSL programs are compiled to the same assembly as CUDA/OpenCL, and have the same hardware resources, so a lot of the rules will be the same.

GL_ARB_separate_shader_objects indicates that shader stages are their own programs, so varyings must be written out to global memory and read back in again between each stage. The varyings can be accessed directly from global memory, or stored in local memory or registers. Since Nvidia’s CUDA/OpenCL best practices recommend loading data into local (shared) memory, I’m thinking that’s likely where the incoming and/or outgoing varyings reside (I suppose they could be put in constant memory, but since that’s global to the chip and is already handling uniforms, I don’t think it would be a good candidate). It would also allow the shader to double buffer the varyings to hide latency, i.e. load the next set while working on the current set.

The maximum number of threads supported is determined by the minimum of (#registers)/(registers used) and (local mem)/(local mem used), so as varyings fill up local memory, the maximum number of threads would go down. Loading the varyings into registers would also have the same effect as loading them into shared memory, reducing the number of threads.

However, if I’m completely wrong and the varyings are loaded from and stored directly to global memory, you’re still more likely to bump into the GPU’s bandwidth ceiling with more varyings and stall the GPU.

It would be pretty easy to test with a simple vertex shader that writes a bunch of vec4 varyings filled with uniform values, then sums them up in the fragment shader. Then create a second shader that does the same thing, but where the fragment shader accesses the uniforms directly and sums them, with only a single varying (position).
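
If you do run that comparison, GL 3.3 / ARB_timer_query gives a simple way to time each variant on the GPU. A minimal sketch, assuming timer queries are available, the usual GL loader header plus <stdio.h> are included, and the draw call for the variant under test goes where the comment says:

GLuint query;
GLuint64 elapsedNs = 0;

glGenQueries(1, &query);
glBeginQuery(GL_TIME_ELAPSED, query);
/* ... issue the draw call(s) for variant A or variant B here ... */
glEndQuery(GL_TIME_ELAPSED);

/* GL_QUERY_RESULT blocks until the GPU has finished the timed work. */
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsedNs);
printf("variant took %f ms\n", elapsedNs / 1.0e6);
glDeleteQueries(1, &query);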

It’s quite likely that texture accesses are that slow; however, modern GPUs are bandwidth-optimized, not latency-optimized like CPUs. After a texture request, the thread will be swapped out for another thread and processing will continue.

I have serious doubts that this is how it works. After all, most shaders run in lock-step; if one of them hits a texture access, many others do as well, probably at the same time.

In order to swap shaders, you would have to copy all of their state (which includes their constant memory) into global memory, then copy the state for another set of shader data (also from global memory) into the shader. That’s a lot of memory accessing.

GL_ARB_separate_shader_objects indicates that shader stages are their own programs, so varyings must be written out to global memory and read back in again between each stage.

What do you mean by “global memory”, exactly?

Outputs from vertex shaders are passed to the post-T&L cache. You could consider this “global memory”, in that it isn’t shader-local memory. But the post-T&L cache is different from actual global memory.

Interpolated values must pass through the rasterizer, which reads its data from the post-T&L cache, or post tessellation/geometry shader buffers. Are those “global memory?” The rasterizer generates fragments that are passed to the fragment shader. The rasterizer might have a queue of memory for per-fragment data, in case the fragment shader becomes a bottleneck. But again, does that constitute “global memory?”

Some hardware might write values to actual global memory, particularly if tessellation is going on. But this is on a case-by-case basis.

The maximum number of threads supported is determined by the minimum of (#registers)/(registers used) and (local mem)/(local mem used)

The maximum number of threads available is hard-coded into each execution unit. It’s part of the hardware. For example, each SIMD on a Radeon HD-class runs 4 threads. Period. No more, no less. Each thread has a fixed number of registers; it’s up to the compiler to decide how to allocate these registers and constant memory for a shader, so that the separate threads in the same memory pool don’t stomp on each other’s stuff.

Precisely.

If using GL_ARB_separate_shader_objects is forcing the use of global memory for varyings, then I will most likely not use it. But I’m not there yet, and we’re not sure of this. This is interesting though.

Let’s stick with “shared memory”. “Local memory” means something different in CUDA; “shared memory” is unambiguous.

I also think varyings are stored in shared memory, i.e. banked on a per-thread basis. I’ve read here and there that shared memory is not exposed in GLSL, but this is probably wrong, and my guess is also that varyings explain the existence of shared memory.

Yes, maybe, this is possible, but maybe not: 128 total varying components per shader = 512 bytes per thread. 512 bytes x 32 threads (one warp) = 16,384 bytes. The G80 hardware has 16,384 bytes of shared memory space… I would say no double buffering, here.

“local mem” = I’m assuming you mean shared memory here again.

We can make the link between the # of registers and shared memory. With respect to concurrency they’re kind of in the same boat.

Although varyings probably live in shared memory and not in registers, to keep things simple let’s only talk about registers here.

I’m still a bit lost with regards to how the number of registers used by a thread lowers the number of concurrent threads running at the same time.

A warp is 32 threads that run simultaneously, e.g. on G80 hardware that’s 4 threads per SP (4 x 8 = 32 threads running concurrently at any given time in a given SM). We have 8,192 registers for the SM. Let’s assume we only have 1 SM.

A shader is loaded into the SM. This shader uses all 8,192 registers - this is unlikely and almost impossible as the compiler probably won’t allow this, but let’s imagine this for a second.

A warp of 32 threads is scheduled and executed, then another warp of the same shader is scheduled and executed, and so on. Whether the shader uses 1 or 8,192 registers, I don’t see any difference - yet.
There is however a difference if 2 different shader programs were to be loaded into the SM, e.g. a vertex and a fragment shader. Is that the case? Maybe.
One shader = one kernel in this case.
In this case, and if the vertex kernel uses all 8,192 registers, the SM scheduler won’t be able to schedule out a warp of the vertex kernel and schedule in a warp of the fragment kernel. The vertex kernel will need to completely finish its work.
Is this what you mean by ‘more registers = less concurrency’?

Agreed, but unlikely to happen. Varyings probably reside in shared memory.

Agreed.

AMD’s GPU ShaderAnalyzer allows you to see how many general purpose registers etc. your shader will use on different AMD hardware. I’ve no idea how much effect using more GPRs has on shader performance though.

I tried, but I can’t determine what is done with varyings using ShaderAnalyzer. I see EXP/EXP_DONE in the output IL assembly code, but I don’t know much beyond this.
I haven’t tried compiling GLSL code with the Cg compiler yet, and looking at the generated PTX code.

I think trying to second-guess the compiler like this is a path that leads only to tears. Particularly when different hardware will have different answers.

Regardless of where a varying comes from, you should expect a compiler to minimize data access time by caching it locally if you access it more than once. That is, if the varying is actually stored in some non-local memory, the shader will store a copy of the value after the first read in a register or local memory. The next time you read it, it will read from the copy.

The only reason for a compiler to not do this is if the compiler is having trouble fitting resources into memory. For example, if it’s run out of registers to allocate or local memory to use. But then again, if that happens, what exactly are you going to do about it? If the compiler couldn’t find a way to avoid the repeated global memory fetch, what are the chances that you’re going to find a way to avoid it? Especially since you can’t do an end-run around GLSL and code an assembly routine yourself.

It’s one thing to ask questions like, “are uniform buffers faster than buffer textures?” This is a question that generally has an obvious answer (textures will tend to be slower), but could be wrong in some cases. It’s quite another thing to worry about where varyings just so happen to be stored in one OpenGL implementation or another. Especially since there’s nothing you can do about it one way or another.

If using GL_ARB_separate_shader_objects is forcing the use of global memory for varyings, then I will most likely not use it. But I’m not there yet, and we’re not sure of this. This is interesting though.

I’d like to remind you that program separation is normal for GPUs. Only in OpenGL and GLSL is direct program linking something that is even possible, let alone something that is done. Or, to put it another way, the GLSL model is an entirely artificial restriction compared to what GPUs are capable of.

You should expect no GPU performance difference.

As far as I know, no OpenGL extension (nor OpenGL itself) “exposes” shader core shared memory through the API to GLSL shaders (though I would love to be proven wrong here). :slight_smile:

I believe about the only way you have in GLSL to communicate (if you can call it that) with adjacent threads in flight at the same time is through the use of derivative instructions.

This is one reason I flipped to OpenCL to do some reductions of OpenGL-generated 2D image data recently.

Among other places, check out the description in NVIDIA’s OpenCL Best Practices Guide. Specifically, read the first few sections of Chapter 4 on “Occupancy”.

This explains why more registers per thread == fewer thread blocks on an SM.

Registers are one way the GPU hides memory latency. Thread block needs to read? Put it to sleep and run another one while you wait. Only works if you can keep them all “resident”, which means enough registers for all.
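
To put rough numbers on it, here is an illustrative back-of-the-envelope calculation; the 8,192 registers and 768-thread cap per SM are assumptions for G80-class hardware, so substitute your own chip’s figures:

#include <stdio.h>

int main(void)
{
    const int regs_per_sm    = 8192; /* assumed G80-class register file per SM              */
    const int max_threads_sm = 768;  /* assumed hardware cap on resident threads per SM     */

    for (int regs_per_thread = 8; regs_per_thread <= 64; regs_per_thread *= 2) {
        int by_registers = regs_per_sm / regs_per_thread;
        int resident = by_registers < max_threads_sm ? by_registers : max_threads_sm;
        printf("%2d regs/thread -> %4d resident threads (%2d warps)\n",
               regs_per_thread, resident, resident / 32);
    }
    return 0;
}

Fewer registers per thread means more warps can stay resident, which is exactly what gives the scheduler something to run while other warps wait on memory.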

[quote]As far as I know, no OpenGL extension (nor OpenGL itself) “exposes” shader core shared memory through the API to GLSL shaders (though I would love to be proven wrong here). :slight_smile: [/QUOTE]
Graphics card manufacturers needed a way to handle varyings in an efficient manner. They ended up with the design of shared memory as we know it today on G80+ hardware, and later explained what shared memory was all about in the CUDA programming guide. My wild guess.

I believe about the only way you have in GLSL to communicate (if you can call it that) with adjacent threads in flight at the same time is through the use of derivative instructions.

What do you mean by derivative instructions? __syncthreads?
EXT_shader_image_load_store and the barrier() function might be one way to achieve this now.
It is true that shared memory can be used to synchronize threads in the same block, but it might not have been designed for this purpose in the first place. Concurrent thread access (and varyings) might have been the primary concern.

What do you mean by derivative instructions? __syncthreads?

dFdx, dFdy.

You know, I wondered about that when I saw this extension, but it sure doesn’t look like it. With the atomics, for instance, it talks about writing to global memory addresses and ensuring that the operation works properly across all threads (without restriction).

It doesn’t talk about writes within a multiprocessor (as you’d expect for shared-memory atomics), or about transferring intermediates to/from global memory. It specifically guarantees that “no other memory transaction” anywhere will occur between the read and the write to global memory.

With umpteen threads in flight at once, and latency to global memory being high, this sounds pretty darn inefficient, but obviously I don’t have the “secret decoder ring” to translate what’s going on in this extension into GPU compute concepts. It’d be helpful to know, at least in general, what’s beneath the abstraction. A few slick usage examples in this extension, with comments on why you’d do it that way, would really help.