# Thread: how to implement bindless textures efficiently ???

1. the value baseVertex is not a byte-offset, but instead is how many vertices are in the VBO before the vertex we want this draw to start with, regardless of how many bytes in each vertex-structure (in AoS) or in each vertex-element (in SoA). Previously I had the impression this was a byte-offset into the VBO. But also, the name baseVertex confuses me, and always confused me. Why not call this firstVertex to match the firstIndex name? Don't both of these have the same meaning... the nth element in the IBO and VBO respectively? This difference in naming always implied to me that "something strange or different is going on with these two variables". Was I wrong?
That's all very confused.

This is an array of vertex data:

Code :
Vertex verts[100];

This is an array of index data:

Code :
int indices[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};

So, we have 100 vertices and 10 indices. Now, consider this loop:

Code :
for(int ix = 0; ix < 10; ++ix)
{
RenderWithVertex(verts[indices[ix]], 0);
}

This renders with all 10 of the indices we have in our index array. It fetches each value from `indices` from 0 to 9, uses that to fetch a `Vertex`, and then renders with that vertex. Since the index array references the vertices 0-9, the vertices that get rendered are `verts[0]` through `verts[9]`.

This loop corresponds to an indirect struct where `count` is 10, `firstIndex` is 0, `baseVertex` is 0, `instanceCount` is 1, and `baseInstance` is zero.

Now, consider this loop:

Code :
for(int ix = 0; ix < 5; ++ix)
{
RenderWithVertex(verts[indices[ix + 5]], 0);
}

This renders 5 vertices. But not the first 5 vertices in the `indices` list; the last 5 vertices. The vertices that get rendered are `verts[5]` through `verts[9]`.

This loop represents an indirect struct where `count` is 5, `firstIndex` is 5, `baseVertex` is 0, `instanceCount` is 1, and `baseInstance` is zero.

Now try this:

Code :
for(int ix = 0; ix < 10; ++ix)
{
RenderWithVertex(verts[indices[ix] + 10], 0);
}

This renders 10 vertices, just like the first loop. It iterates over all of the indices in the index list. But look at what it does. It adds 10 to each index fetched from the index array. So the vertices that get rendered are `verts[10]` through `verts[19]`.

This loop represents an indirect struct where `count` is 10, `firstIndex` is 0, `baseVertex` is 10, `instanceCount` is 1, and `baseInstance` is zero.

Do you see now how `baseVertex` is different from `firstIndex`? `firstIndex` affects where you start reading from the index array. `baseVertex` affects the value of the index after it has been read from the index array.

The idea is to be able to easily have mesh data for multiple objects in the same buffers. Consider indexed meshes in a disk file format. The indices are all relative to the beginning of their respective vertex arrays, of their own mesh. So the index 0 means "my first vertex". But if you load them into the same buffer, you need a way to tell it where "my first vertex" starts.

You could do this by manually changing the index for each mesh at load time. But you'd just be doing what `baseVertex` already does. You could do this by calling `glBindVertexBuffer` or `glVertexAttribPointer` and providing a different offset for the buffer(s). But that's a state change and therefore somewhat expensive. The `baseVertex` is a rendering parameter and therefore cheap.
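
To make the packing idea concrete, here is a minimal sketch in plain C++ with no OpenGL calls. `SharedMeshBuffers`, `MeshRange`, and `appendMesh` are hypothetical names made up for this illustration, and the vertex type is reduced to a single float. The point is only that each mesh's indices go in untouched, while its `firstIndex` and `baseVertex` are recorded from the sizes of the shared arrays at the moment it is appended.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical CPU-side staging arrays that would later be uploaded to one VBO and one IBO.
struct SharedMeshBuffers {
    std::vector<float>         vertices; // stand-in for an array of Vertex structs
    std::vector<std::uint32_t> indices;
};

// Where one mesh lives inside the shared arrays.
struct MeshRange {
    std::uint32_t count;      // number of indices in the mesh
    std::uint32_t firstIndex; // position of the mesh's first index in the shared index array
    std::int32_t  baseVertex; // position of the mesh's first vertex in the shared vertex array
};

// Append a mesh without touching its indices; baseVertex records the shift instead.
inline MeshRange appendMesh(SharedMeshBuffers& buffers,
                            const std::vector<float>& meshVertices,
                            const std::vector<std::uint32_t>& meshIndices) {
    for (std::uint32_t index : meshIndices) {
        assert(index < meshVertices.size()); // indices are relative to the mesh's own vertices
    }
    MeshRange range{};
    range.count      = static_cast<std::uint32_t>(meshIndices.size());
    range.firstIndex = static_cast<std::uint32_t>(buffers.indices.size());
    range.baseVertex = static_cast<std::int32_t>(buffers.vertices.size());
    buffers.vertices.insert(buffers.vertices.end(), meshVertices.begin(), meshVertices.end());
    buffers.indices.insert(buffers.indices.end(), meshIndices.begin(), meshIndices.end());
    return range;
}
```

The three values in the returned `MeshRange` are the ones that would go into the `count`, `firstIndex`, and `baseVertex` fields of that mesh's draw.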

FYI: the equivalent code for a full indirect draw command is:

Code :
void IndirectDraw(IndirectStruct params)
{
for(auto instance = 0; instance < params.instanceCount; ++instance)
{
for(auto ix = 0; ix < params.count; ++ix)
{
RenderWithVertex(verts[params.baseVertex + indices[params.firstIndex + ix]], params.baseInstance + instance);
}
}
}
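
If it helps, the same fetch arithmetic can be run on the CPU. The following is only a sketch: `fetchedVertices` is a made-up helper, it handles a single instance, and it returns positions in the vertex array instead of rendering anything. The struct's field order follows the indirect struct discussed above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Field order follows the indirect struct discussed in the thread.
struct IndirectStruct {
    std::uint32_t count;
    std::uint32_t instanceCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
    std::uint32_t baseInstance;
};

// Returns the position in the vertex array that each fetch would read, for one instance.
inline std::vector<std::int64_t> fetchedVertices(const IndirectStruct& params,
                                                 const std::vector<std::uint32_t>& indices) {
    assert(static_cast<std::size_t>(params.firstIndex) + params.count <= indices.size());
    std::vector<std::int64_t> result;
    result.reserve(params.count);
    for (std::uint32_t ix = 0; ix < params.count; ++ix) {
        // firstIndex moves the read position; baseVertex shifts the value that was read.
        result.push_back(static_cast<std::int64_t>(params.baseVertex) +
                         indices[params.firstIndex + ix]);
    }
    return result;
}
```

Feeding it the three parameter sets from the loops above (with the index array 0 through 9) gives vertices 0-9, 5-9, and 10-19 respectively.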

What does baseInstance do or control? By way of example, if baseInstance == 16, does this mean that the built-in gl_instance variable in the shader (assuming there is such a built-in variable) contains the integer value 16 ???
Going to this page and searching for "baseInstance" would eventually take you not only to how the base instance works, but a very clear warning on what it does not do:

Originally Posted by The Wiki
The input gl_InstanceID does not follow the baseinstance. gl_InstanceID always falls on the half-open range [0, instancecount). So the base instance is only useful when using Instanced Arrays. You can employ the ARB_shader_draw_parameters extension where available, which gives your shaders access to the base instance.
I genuinely cannot make this more clear.
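
For the `baseInstance == 16` case in the question, the difference can be written out as two small functions. This is a sketch with made-up names (`shaderInstanceId`, `instancedArrayElement`); the second function uses the rule for instanced arrays, where the fetched element is the instance number divided by the attribute divisor, plus the base instance.

```cpp
#include <cassert>
#include <cstdint>

// What the shader sees in gl_InstanceID for the given instance of a draw:
// it counts from zero and ignores baseInstance.
constexpr std::uint32_t shaderInstanceId(std::uint32_t instance, std::uint32_t /*baseInstance*/) {
    return instance;
}

// Which element of an instanced array (divisor > 0) is fetched for that instance.
inline std::uint32_t instancedArrayElement(std::uint32_t instance,
                                           std::uint32_t divisor,
                                           std::uint32_t baseInstance) {
    assert(divisor > 0); // divisor 0 means the attribute advances per vertex instead
    return instance / divisor + baseInstance;
}
```

So with a base instance of 16 and a divisor of 1, the first instance still sees `gl_InstanceID == 0`, but its instanced attributes come from element 16 of the instance buffer.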

But if what I infer in #4 immediately above is correct, what happens when instanceCount > 1 ??? If I want to call a draw function that recognizes instances, and potentially draws many or hundreds or thousands of instances, I would want the same exact matrices (at matrix array element 16) to be sent to the shaders in both of these cases.
Um... why? Generally speaking, instancing means rendering the same mesh multiple times, with different parameters. And one of the most important parameters for rendering a mesh multiple times is its transform.

So unless you want to render these instances on top of each other, it's not clear why they would have the "same exact matrices".

Oh, I just noticed the sizeof(DrawElementsIndirectCommand) * 0 argument to glMultiDrawElementsIndirect(). What is that and why zero?
Because the parameter is a byte offset. The conversion between "byte offset" and "array index" is "byte offset = sizeof(struct) * array index".

The index being 0 here is academic; if you want to change the byte offset later, you don't have to remember to add the `sizeof` part.
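
As a sketch of that conversion, assuming the commands are tightly packed in the indirect buffer (`commandByteOffset` is a made-up helper name):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Layout of one indexed indirect draw command: five tightly packed 32-bit values.
struct DrawElementsIndirectCommand {
    std::uint32_t count;
    std::uint32_t instanceCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
    std::uint32_t baseInstance;
};
static_assert(sizeof(DrawElementsIndirectCommand) == 20, "expected no padding");

// Byte offset of the command at arrayIndex, assuming the commands are tightly packed.
constexpr std::size_t commandByteOffset(std::size_t arrayIndex) {
    return sizeof(DrawElementsIndirectCommand) * arrayIndex;
}
```

With a buffer bound to `GL_DRAW_INDIRECT_BUFFER`, the result would be cast to `const void*` and passed as the `indirect` argument.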

Fact is, I can't literally go "all the way" to one draw call, because the engine will also offer rendering of real, amazingly rich, highly configurable starfields based upon a technique that I figured out
To be honest, the whole "make only one draw call ever" thing is really overblown. It's not even what AZDO is all about. The AZDO presentation made it perfectly clear that draw calls aren't the problem. The problem is state changes between draw calls. Making multiple draw calls in succession without any state changes is quite fast in OpenGL.

Overall, the thing you need to be worried about is not making one draw call ever. It's making sure that:

1. You're minimizing the number of state changes that happen between draw calls.

2. That the number of state changes between draw calls does not depend on scene complexity. That is, even if you're rendering more stuff, you're not increasing the number of state changes between draw calls.

That is, if your renderer does "draw all my shapes" in one step, then changes some state and "draw my starfield", that's fine. So long as "draw all my shapes" doesn't involve state changes (or the set of state changes is fixed and invariant with how many shapes you render), you're OK.

However, your message does imply one possible way to achieve this result.
The way to do it is, on the C++ side, use an array of `GLuint64`. That's it. It's just that simple.

The GLSL side has an array of `uvec4` with half as many elements as the C++ array of `GLuint64`, since each `uvec4` holds two 64-bit handles. Obviously, the `GLuint64` array must have an even number of elements.

Maybe you expected me to take the [very tiny] leap to understand I could extract u32vec4.zw elements, create a u32vec2.xy out of them... then follow the instructions in the ARB_GL_bindless_texture.txt specification (which I did read, but found only half informative and half confusing).
... there's no leap. I literally gave you the GLSL code for reading the data. It's right there in my first post:

Code :
uvec4 val = arr[i/2];
uint ix = (i % 2) * 2;
uvec2 handle = uvec2(val[ix], val[ix + 1]);

`arr` is the array of `uvec4` in the UBO. `i` is the index for the bindless handle you want to fetch from the array. `handle` is the `uvec2` containing the handle you wanted to fetch, ready to be cast directly into a `sampler` or `image` type.
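
The same arithmetic can be checked on the CPU. This is only a sketch: `UVec4`, `UVec2`, `packHandles`, and `fetchHandle` are made-up stand-ins, and the packing assumes the low 32 bits of each handle come first, which is what the GPU sees when a little-endian host uploads the raw `GLuint64` array.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using UVec4 = std::array<std::uint32_t, 4>; // stand-in for GLSL uvec4
using UVec2 = std::array<std::uint32_t, 2>; // stand-in for GLSL uvec2

// View an even-length array of 64-bit handles as uvec4 elements, low word first.
// This mirrors the memory layout of the raw array on a little-endian host.
inline std::vector<UVec4> packHandles(const std::vector<std::uint64_t>& handles) {
    assert(handles.size() % 2 == 0);
    std::vector<UVec4> arr(handles.size() / 2);
    for (std::size_t i = 0; i < handles.size(); ++i) {
        arr[i / 2][(i % 2) * 2]     = static_cast<std::uint32_t>(handles[i] & 0xFFFFFFFFu);
        arr[i / 2][(i % 2) * 2 + 1] = static_cast<std::uint32_t>(handles[i] >> 32);
    }
    return arr;
}

// Same arithmetic as the GLSL snippet above.
inline UVec2 fetchHandle(const std::vector<UVec4>& arr, std::uint32_t i) {
    const UVec4& val = arr[i / 2];
    const std::uint32_t ix = (i % 2) * 2;
    return UVec2{val[ix], val[ix + 1]};
}
```

Handle `i` comes back as its low word in the first component and its high word in the second, which is the order the `uvec2`-to-sampler constructors expect.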

All the examples I remember seeing
Stop looking at examples! "Examples" only represent the code that the person who wrote them wanted to write. Do not limit yourself to what other people want to write.

Stop being a copy-and-paste programmer.

2. baseVertex isn't really needed, but if you set it to 0, then every time you put vertices and indices of a mesh into a VBO and IBO, you have to pre-correct / shift all the indices by the number of vertices that were in the VBO before putting the mesh's vertices into it

was that understandable ?? if not read this:
https://www.khronos.org/opengl/wiki/...ing#Base_Index

so now you have 2 buffers (VBO and IBO), and you can address different meshes in them, to do so you can "pack" all the necessary infos in a struct. for example:

----------------------------------------------------------------------------

then comes the 3rd buffer: "instance buffer"
until now, you had to set a uniform mat4 to transform your mesh in the correct space.
but consider the case when you want to draw 10 x the same meshes with different transforms:
Code :
for (uint i = 0; i < 10; i++)
{
mat4 MVP = arrayoftransforms[i];
glUniformMatrix4fv(location, 1, GL_FALSE, value_ptr(MVP));

Draw(mesh);
}
or you could use an array of mat4s and upload the transforms once:
Code :
glUniformMatrix4fv(location, 10, GL_FALSE, arrayoftransforms);

for (uint i = 0; i < 10; i++)
{
Draw(mesh);
}

however you do it, you have to issue 10 draw calls. and if you want to draw meshes 1000x, or more,
then you run into problems:
--> you can't upload 1000 transforms (assuming max. 16kB uniform block space) at once
--> even if you could, you'd have to call Draw(mesh) 1000x, which can impact performance
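
the arithmetic behind the first point, as a small sketch (the helper names are made up, the 16 kB budget is the assumption from above, and a mat4 is taken to be 16 four-byte floats):

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kBytesPerMat4 = 16 * sizeof(float); // 4x4 floats = 64 bytes

// How many mat4 transforms fit in a uniform block of the given size.
constexpr std::size_t maxTransforms(std::size_t blockBytes) {
    return blockBytes / kBytesPerMat4;
}

// Bytes needed to hold the given number of mat4 transforms.
constexpr std::size_t transformBytes(std::size_t transformCount) {
    return transformCount * kBytesPerMat4;
}

static_assert(maxTransforms(16 * 1024) == 256, "a 16 kB block holds 256 matrices");
static_assert(transformBytes(1000) == 64000, "1000 matrices need 64000 bytes");
```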

solution: instanced rendering
if you could upload 1000 transforms, the solution would be "DrawInstanced(mesh, 1000);"
https://www.khronos.org/opengl/wiki/...ing#Instancing

if you can't upload 1000 transforms, then you have to put the transforms into another buffer: the "instance buffer", and you modify your vertex array object (VAO) so that it pulls 1 transform for each mesh instance. here some tutorials:
http://ogldev.atspace.co.uk/www/tuto...utorial33.html

important:
https://www.khronos.org/opengl/wiki/...stanced_arrays
https://www.khronos.org/opengl/wiki/...xAttribDivisor

----------------------------------------------------------------------------

now you can reduce the number of draw calls to the number of different meshes you have.
to reduce even that, make use of a 4th buffer: the "indirect buffer" which effectively stores the draw command parameters so that you have to issue only 1 glMultiDrawElementsIndirect(...) to render all your meshes.

that's what i explained previously ...

if you take a look at the arguments of glMultiDrawElementsIndirect(...), you can see a "stride" argument: the distance in bytes between 2 consecutive draw commands in the indirect buffer (0 means tightly packed), so you can leave "padding" between them. and because the commands live in a buffer, you can use the GPU to fill these structs, for example with a geometry or compute shader to perform instance culling (by setting "instancecount" to 0 if the mesh is behind the camera or so).
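
a rough CPU-side sketch of that idea, just to show the data layout. `PaddedCommand`, its `viewDepth` and `unused` fields, and `cullBehindCamera` are all made up for this example; on the GPU the loop body would live in a shader that writes into the indirect buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One draw command followed by extra per-draw data; the extra bytes are the "padding".
struct PaddedCommand {
    std::uint32_t count;
    std::uint32_t instanceCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
    std::uint32_t baseInstance;
    float         viewDepth;  // hypothetical: signed distance in front of the camera
    std::uint32_t unused[2];  // hypothetical filler to round the stride up to 32 bytes
};
static_assert(sizeof(PaddedCommand) == 32, "stride assumed by this sketch");

// Value that would be passed as the stride argument (0 would mean tightly packed).
constexpr std::size_t kCommandStride = sizeof(PaddedCommand);

// CPU stand-in for a culling pass: disable draws that lie behind the camera.
inline std::size_t cullBehindCamera(std::vector<PaddedCommand>& commands) {
    std::size_t culled = 0;
    for (PaddedCommand& command : commands) {
        if (command.viewDepth < 0.0f && command.instanceCount != 0) {
            command.instanceCount = 0; // the draw stays in the buffer but renders nothing
            ++culled;
        }
    }
    return culled;
}
```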

----------------------------------------------------------------------------

and .. yes, stop looking at examples, try to figure it out first by reading (the wiki / specs / book / etc), only when you run into problems look at them

3. Thanks to Alfonse Reinhart, Dark Photon, john_conner... and some videos I found on the internet... I've had some success. I decided to post some comments here about what worked for anyone who finds these messages in the future.

Also, as I hoped (and sorta thought), you will see that a very perverted but cool feature that I attempted actually works in OpenGL... even though gurus at nvidia and AMD claimed/implied in their videos that it wouldn't work. This was my desire to be able to switch texturemaps, normalmaps and potentially othermaps on a triangle-by-triangle basis (within a single object and draw). I'll explain why I think this worked as I expected, not failed as the gurus expected.

Note that what I got working is only the hyper-generalized bindless texture approach I was shooting for. Next I'll be diving into all the efficiency stuff we've been discussing based on eternal "Storage" instead of "Data", persistent buffers, coherent buffers, glMultiDrawElementsIndirect() and so forth.

But for now, just this one victory, which I nonetheless consider a big victory because it means dozens or even hundreds of images can be "resident" and any of them can be accessed by any number of objects in a single draw call, even on a triangle-by-triangle basis if you're as crazy as me. Though frankly, being able to apply totally different textures, normalmaps, conestepmaps and othermaps to different parts of objects seems like a very desirable situation to me. Otherwise I assume many objects would need to be broken up into separate objects and textured independently.

I removed all error checking (and most overly engine-specific code) from the following code to make it more readable.

Okay, first my engine created four 4096 element arrays of u64 bindless texture handles.

Code :
bytes = 32768;

u64* buffer0 = (u64*) memory_buffer_create (bytes);   // u64 buffer0[4096] for engine to hold  4096 u64 bindless texture handles for UBO #0
u64* buffer1 = (u64*) memory_buffer_create (bytes);   // u64 buffer1[4096] for engine to hold 4096 u64 bindless texture handles for UBO #1
u64* buffer2 = (u64*) memory_buffer_create (bytes);   // u64 buffer2[4096] for engine to hold 4096 u64 bindless texture handles for UBO #2
u64* buffer3 = (u64*) memory_buffer_create (bytes);   // u64 buffer3[4096] for engine to hold 4096 u64 bindless texture handles for UBO #3

glstate.glimage_handle_buffer0 = buffer0;      // keep CPU-side copy of the handles stored in UBO #0
glstate.glimage_handle_buffer1 = buffer1;      // keep CPU-side copy of the handles stored in UBO #1
glstate.glimage_handle_buffer2 = buffer2;      // keep CPU-side copy of the handles stored in UBO #2
glstate.glimage_handle_buffer3 = buffer3;      // keep CPU-side copy of the handles stored in UBO #3

u32 imap0 = 0;                               // uniform buffer object  identifier to contain 4096 u64 bindless texture handles
u32 imap1 = 0;                               // ditto for uniform block 1
u32 imap2 = 0;                               // ditto for uniform block 2
u32 imap3 = 0;                               // ditto for uniform block 3

u32 binding0 = 0;     // this binding must be specified in shaders:  layout (binding = 0) uniform imap00
u32 binding1 = 1;     // this binding must be specified in shaders:  layout (binding = 1) uniform imap01
u32 binding2 = 2;     // this binding must be specified in shaders:  layout (binding = 2) uniform imap02
u32 binding3 = 3;     // this binding must be specified in shaders:  layout (binding = 3) uniform imap03

glCreateBuffers (1, &imap0);             // create four buffer objects to become the four uniform buffers
glCreateBuffers (1, &imap1);             // that contain u64 bindless texture handles for four purposes,
glCreateBuffers (1, &imap2);             // namely texturemaps, normalmaps, conestepmaps, othermaps.
glCreateBuffers (1, &imap3);

glstate.bufferid_imap0 = imap0;        // engine must remember the OpenGL buffer object names
glstate.bufferid_imap1 = imap1;        // in order to put new bindless texture handles in
glstate.bufferid_imap2 = imap2;        // those buffers when new textures are created.
glstate.bufferid_imap3 = imap3;

glstate.binding_imap0 = binding0;     // engine remembers binding numbers for these uniform blocks
glstate.binding_imap1 = binding1;     // even though these are fixed by the engine specification.
glstate.binding_imap2 = binding2;
glstate.binding_imap3 = binding3;

bytes = 32768;        // sufficient for 4096 u64 bindless texture handles in each of the four uniform blocks

glNamedBufferStorage (imap0, bytes, 0, GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_DYNAMIC_STORAGE_BIT);
glNamedBufferStorage (imap1, bytes, 0, GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_DYNAMIC_STORAGE_BIT);
glNamedBufferStorage (imap2, bytes, 0, GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_DYNAMIC_STORAGE_BIT);
glNamedBufferStorage (imap3, bytes, 0, GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_DYNAMIC_STORAGE_BIT);

glBindBufferBase (GL_UNIFORM_BUFFER, binding0, imap0);
glBindBufferBase (GL_UNIFORM_BUFFER, binding1, imap1);
glBindBufferBase (GL_UNIFORM_BUFFER, binding2, imap2);
glBindBufferBase (GL_UNIFORM_BUFFER, binding3, imap3);
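
One thing that may be worth checking with blocks this large: OpenGL 4.5 only guarantees 16384 bytes for `GL_MAX_UNIFORM_BLOCK_SIZE`, so a 32768-byte block relies on the implementation reporting a larger limit. Here is a minimal sketch of the size check, with made-up helper names. In real code the limit would come from `glGetIntegerv(GL_MAX_UNIFORM_BLOCK_SIZE, ...)` and would not be hard-coded.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Smallest GL_MAX_UNIFORM_BLOCK_SIZE an OpenGL 4.5 implementation may report.
constexpr std::size_t kGuaranteedUniformBlockBytes = 16384;

// Bytes needed for a uniform block holding handleCount 64-bit handles.
constexpr std::size_t handleBlockBytes(std::size_t handleCount) {
    return handleCount * sizeof(std::uint64_t);
}

// True when the block fits in the limit reported by the implementation.
constexpr bool fitsInUniformBlock(std::size_t handleCount, std::size_t maxBlockBytes) {
    return handleBlockBytes(handleCount) <= maxBlockBytes;
}
```

With 4096 handles the check fails against the guaranteed minimum and passes only when the reported limit is at least 32768 bytes.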

When applications call image create functions, the textures are created in the usual OpenGL way.

Code :
glCreateTextures (GL_TEXTURE_2D, 1, &glimageid);

glTextureStorage2D (glimageid, levels, glinternal, width, height);   // specify image levels, width, height, pixel format
glTextureParameteri (glimageid, GL_TEXTURE_WRAP_S, GL_REPEAT);
glTextureParameteri (glimageid, GL_TEXTURE_WRAP_T, GL_REPEAT);
glTextureParameteri (glimageid, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
glTextureParameteri (glimageid, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);

if ((options & IG_OPTION_IMAGE_NOCLEAR) == 0) {
glClearTexImage (glimageid, 0, glformat, gltype, color);
}

if ((options & IG_OPTION_IMAGE_NOMIP) == 0) {
glGenerateTextureMipmap (glimageid);
}

Then we find an available slot in the appropriate u64 bindless texture array for this new image object:

Code :
switch (glblock) {       // glblock specifies what is the purpose of this image (see the four options below)
case 0:
for (i = 1; i < 4096; i++) {
gltester = glstate.glimage_handle_buffer0[i];    // for colormaps AKA texturemaps
if (gltester == 0) { glindex = i; break; }
}
break;
case 1:
for (i = 1; i < 4096; i++) {
gltester = glstate.glimage_handle_buffer1[i];    // for surfacemaps AKA normalmaps
if (gltester == 0) { glindex = i; break; }
}
break;
case 2:
for (i = 1; i < 4096; i++) {
gltester = glstate.glimage_handle_buffer2[i];    // for conestepmaps AKA parallaxmaps
if (gltester == 0) { glindex = i; break; }
}
break;
case 3:
for (i = 1; i < 4096; i++) {
gltester = glstate.glimage_handle_buffer3[i];    // for othermaps (shader dependent)
if (gltester == 0) { glindex = i; break; }
}
break;
default:
assert(0);
return (CORE_ERROR_INTERNAL);                                                        // invalid uniform block
}
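
Since the four cases differ only in which array they scan, the search could also be written once. This is a sketch, not the engine's code: `findFreeSlot` is a made-up helper, and it returns -1 when every slot is taken (a case the stripped-down loops above do not show).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns the first unused slot in [1, slotCount), or -1 when every slot is taken.
// Slot 0 is skipped, as in the loops above; a zero entry marks an unused slot.
inline int findFreeSlot(const std::uint64_t* handles, std::size_t slotCount) {
    assert(handles != nullptr);
    for (std::size_t i = 1; i < slotCount; ++i) {
        if (handles[i] == 0) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

Each case of the switch would then reduce to one call with the matching `glstate.glimage_handle_buffer#` array and 4096 as the slot count.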

Then get the u64 bindless texture handle for this image and make this image resident.

Code :
glhandle = glGetTextureHandleARB (glimageid);
glMakeTextureHandleResidentARB (glhandle);

Assign the texture handle to the empty slot we found in the appropriate glstate.glimage_handle_buffer#[] array:

Code :
switch (glblock) {
case 0:
glstate.glimage_handle_buffer0[glindex] = glhandle;    // put bindless texture handle into empty slot in buffer0[]
glNamedBufferSubData (glstate.bufferid_imap0, glindex *  sizeof(u64), sizeof(u64), &glhandle);    // and into imap0 in UBO #0  in GPU
break;
case 1:
glstate.glimage_handle_buffer1[glindex] = glhandle;    // put bindless texture handle into empty slot in buffer1[]
glNamedBufferSubData (glstate.bufferid_imap1, glindex *  sizeof(u64), sizeof(u64), &glhandle);    // and into imap1 in UBO #1  in GPU
break;
case 2:
glstate.glimage_handle_buffer2[glindex] = glhandle;    // put bindless texture handle into empty slot in buffer2[]
glNamedBufferSubData (glstate.bufferid_imap2, glindex *  sizeof(u64), sizeof(u64), &glhandle);    // and into imap2 in UBO #2  in GPU
break;
case 3:
glstate.glimage_handle_buffer3[glindex] = glhandle;    // put bindless texture handle into empty slot in buffer3[]
glNamedBufferSubData (glstate.bufferid_imap3, glindex *  sizeof(u64), sizeof(u64), &glhandle);    // and into imap3 in UBO #3  in GPU
break;
default:
return (CORE_ERROR_INTERNAL);    // impossible error
}

Then copy the image loaded from disk or blob into the texture in GPU memory... then generate mipmap levels.

Code :
glTextureSubImage2D (glimageid, 0, 0, 0, width, height, glformat,  gltype, tbuffer);    // copy image into bindless texture in GPU
if ((options & IG_OPTION_IMAGE_NOMIP) == 0) {
glGenerateTextureMipmap (glimageid);    // generate full mipmap
}

The engine creates procedurally generated content, including 3D physical objects. This is rather complex, even for fairly simple shapes, so I won't show how that works. Regardless, up to four images can be specified for each shape object created, one image for each of four purposes (mentioned several times in the code comments above).

Every vertex contains four u08 fields (in one element of a u32vec2 vertex attribute). Each of those u08 fields specify which image to access from each of the four uniform blocks created and filled in by code above. In other words, every vertex in every shape object can specify on a vertex by vertex basis which texturemap should be applied, which normalmap should be applied, which conestepmap should be applied, which othermap should be applied.
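
A small sketch of how those four fields could be packed on the CPU side. The bit layout is the one given in the fragment shader's comments (imap0 in bits 0-7 up to imap3 in bits 24-31); `packImageIndices` and `unpackImageIndex` are made-up names, not engine functions.

```cpp
#include <cassert>
#include <cstdint>

// Layout from the shader comments: imap0 in bits 0-7, imap1 in 8-15,
// imap2 in 16-23, imap3 in 24-31.
constexpr std::uint32_t packImageIndices(std::uint8_t imap0, std::uint8_t imap1,
                                         std::uint8_t imap2, std::uint8_t imap3) {
    return  static_cast<std::uint32_t>(imap0)
         | (static_cast<std::uint32_t>(imap1) << 8)
         | (static_cast<std::uint32_t>(imap2) << 16)
         | (static_cast<std::uint32_t>(imap3) << 24);
}

// Extract field number `block` (0 to 3), as the shader does with shift and mask.
inline std::uint32_t unpackImageIndex(std::uint32_t packed, unsigned block) {
    assert(block < 4);
    return (packed >> (block * 8u)) & 0xFFu;
}
```

As far as I can tell, an 8-bit field can select only 256 of the 4096 slots in each block, so the fields would need to be widened to reach the rest.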

Except that's not exactly true. For various reasons that have to do with the way shape objects are constructed, a function was called during engine initialization that specifies the first vertex of each triangle is the "provoking vertex". That means that all the invoked fragment shaders receive all the integer vertex-attributes ONLY from the first vertex of each triangle (which is why it is called the "provoking vertex"). Since integer vertex attributes are not interpolated like other vertex-attributes, the engine has no other choice except to choose the integer vertex attributes from one of the three vertices to send to all instances of the fragment shaders deployed to draw each triangle.
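
To illustrate the provoking-vertex rule for a plain triangle list, here is a sketch with made-up names. It assumes triangle t is built from vertices 3t, 3t+1 and 3t+2, and that the convention was selected with something like `glProvokingVertex(GL_FIRST_VERTEX_CONVENTION)` (OpenGL's default is the last vertex).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class ProvokingVertex { First, Last };

// Value of a flat (non-interpolated) attribute seen by every fragment of one triangle
// in a plain triangle list, where triangle t uses vertices 3t, 3t+1, 3t+2.
inline std::uint32_t flatValueForTriangle(const std::vector<std::uint32_t>& perVertexValue,
                                          std::size_t triangle,
                                          ProvokingVertex convention) {
    const std::size_t first = triangle * 3;
    assert(first + 2 < perVertexValue.size());
    return convention == ProvokingVertex::First ? perVertexValue[first]
                                                : perVertexValue[first + 2];
}
```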

Which is the hint, I believe, to why the perverse scheme of switching any combination of the images on a triangle-by-triangle basis works in this engine. To elevate to beyond a hint and state why I think this works, the story works like this (so I infer). When a GPU finishes processing the three vertices in a triangle, it needs to deploy a bunch of fragment shaders to process all the fragments within the triangle based upon the outputs of the vertex shader. The GPU needs to send the same information to ALL the fragment shaders it deploys... except for whatever hardware they have that performs the interpolation of floating-point outputs of the vertex shaders.

If somehow different fragment shaders tried to access different images/textures... that probably would not work. In fact, probably a great deal is "fixed, nailed down, cast in stone" when the fragment shaders are deployed to render the fragments within the triangle. And probably this includes images/textures and samplers. Originally I thought maybe it was possible to pass samplers from vertex shader to fragment shaders. And that would work for the purposes of this engine too, since either way the whole triangle is drawn with the same set of images/textures. But I haven't seen any way to pass samplers, so I guess the way the engine does this is more-or-less equivalent (wherein every fragment shader assembles the bindless texture handle from two 32-bit values that overlay the u64 bindless texture handle).

The fragment shader code follows. I'll make some comments about how the shader accesses the four images. (Actually, the current code only accesses the texturemap from uniform block #0 and the normalmap from block #1. The conestepmap code and othermap code are not yet written, but those images are available.)

Code :
//
//
// ###########################  Max Reason
// #####  igtan017.frag  #####  copyright 2005 - 2017+
// ###########################  part of the ICE projects
//
#version 450 core    // requires support for GLSL v4.50 (and OpenGL v4.50)

#extension GL_ARB_bindless_texture : require    // requires support for GL_ARB_bindless_texture

layout (location =  0) uniform    mat4     ig_transform;    // transformation matrix
layout (location =  4) uniform    vec4      ig_clight0;        // color of light #0
layout (location =  5) uniform    vec4      ig_clight1;        // color of light #1
layout (location =  6) uniform    vec4      ig_clight2;        // color of light #2
layout (location =  7) uniform    vec4      ig_clight3;        // color of light #3
layout (location =  8) uniform    vec4      ig_plight0;        // position light #0
layout (location =  9) uniform    vec4      ig_plight1;        // position light #1
layout (location = 10) uniform    vec4     ig_plight2;        // position light #2
layout (location = 11) uniform    vec4     ig_plight3;        // position light #3
layout (location = 12) uniform    vec4     ig_pcamera;     // position camera == currently active camera

// imap00 == 4096 samplers mapped to 4096  images AKA "bindless texture handles" backed by UBO containing 4096 u64 texture handles
// imap01 == ditto
// imap02 == ditto
// imap03 == ditto

// image0 == 2048 uvec4.xyzw vectors containing 4096  u64 bindless texture handles mapped into 2048 uvec4 vectors.xyzw
// image1 == ditto
// image2 == ditto
// image3 == ditto

layout (std140, binding = 0) uniform imap00 {    // std140 so each uvec4 sits at a fixed 16-byte stride
uvec4    image0[2048];    // backed by 4096 u64 bindless texture handles in UBO #0
};

layout (std140, binding = 1) uniform imap01 {
uvec4     image1[2048];   // backed by 4096 u64 bindless texture handles in UBO #1
};

layout (std140, binding = 2) uniform imap02 {
uvec4    image2[2048];    // backed by 4096 u64 bindless texture handles in UBO #2
};

layout (std140, binding = 3) uniform imap03 {
uvec4    image3[2048];    // backed by 4096 u64 bindless texture handles in UBO #3
};

out        vec4        outcolor;     // color to write to the fragment/pixel
out        float        outdepth;    // depth to write to the fragment/pixel

in        vec4        vcamera;
in        vec4        vlight0;
in        vec4        vlight1;
in        vec4        vlight2;
in        vec4        vlight3;
in        vec4        vtcoord;
in        vec4        vcolor;
flat in  ivec2       vmixmatsay;

// mixmatsay.x == imap0(00:07) | imap1(08:15) | imap2(16:23) |  imap3(24:31)
// mixmatsay.y == tmatid(00:15) | say(16:31)

const float   ambient = 0.1250;    // multiply light color for ambient lighting   ::: 0.0625 to 0.2500 --- or higher
const float   diffuse = 0.6250;      // multiply light color for diffuse lighting   ::: 0.5000 to 0.7500 --- or higher
const float   specular = 1.0000;    // multiply light color for specular lighting  ::: 0.7500 to 1.0000
//
// #################
// #####  main()  #####
// #################
//
void main () {

vec4 color = vec4(1.000, 1.000, 1.000, 1.000);    // color of this pixel == white (default color == starting color)
vec3 tweak;                                        // tweak surface.xyz vectors of this pixel (default == untweaked == vec3(0,0,1))

// say bit 0 : color = vertex.rgba : otherwise ignore vertex.rgba ::: this alone == emissive lighting
// say bit 1 : color = light.rgba : otherwise ignore light.rgba
// say bit 2 : color modified by imap0 (texturemap)
// say bit 3 : color modified by imap1 (surfacemap AKA normalmap AKA bumpmap)
// say bit 4 : color modified by imap2 (conemap and heightmap)
// say bit 5 : color modified by imap3

uint imap0 = (vmixmatsay.x >>  0) & 0x000000FF;
uint imap1 = (vmixmatsay.x >>  8) & 0x000000FF;
uint imap2 = (vmixmatsay.x >> 16) & 0x000000FF;
uint imap3 = (vmixmatsay.x >> 24) & 0x000000FF;
uint tmatid = (vmixmatsay.y >>  0) & 0x0000FFFF;
uint saybit = (vmixmatsay.y >> 16) & 0x0000FFFF;

// ivex0 is uvec2 containing one u64 bindless texture handle as one uvec2.xy ... which GLSL will convert to a sampler
// ivex1, ivex2, ivex3 == ditto

uvec2 ivex0 = bool(imap0 & 1) ? uvec2(image0[imap0 >> 1].zw) : uvec2(image0[imap0 >> 1].xy);
uvec2 ivex1 = bool(imap1 & 1) ? uvec2(image1[imap1 >> 1].zw) : uvec2(image1[imap1 >> 1].xy);
uvec2 ivex2 = bool(imap2 & 1) ? uvec2(image2[imap2 >> 1].zw) : uvec2(image2[imap2 >> 1].xy);
uvec2 ivex3 = bool(imap3 & 1) ? uvec2(image3[imap3 >> 1].zw) : uvec2(image3[imap3 >> 1].xy);

sampler2D isam0 = sampler2D(ivex0);    // isam0 is a 2D sampler to access bindless texture handle in imap00 uniform block
sampler2D isam1 = sampler2D(ivex1);    // isam1 is a 2D sampler to access bindless texture handle in imap01 uniform block
sampler2D isam2 = sampler2D(ivex2);    // isam2 is a 2D sampler to access bindless texture handle in imap02 uniform block
sampler2D isam3 = sampler2D(ivex3);    // isam3 is a 2D sampler to access bindless texture handle in imap03 uniform block

if (bool(saybit & 0x0001u)) {    // saybit:00 == vertex.rgba enabled
color = color * vcolor;          // color = color.rgba * vertex.rgba
}

if (bool(saybit & 0x0004u)) {        // saybit:02 == imap0 enabled
if (bool(saybit & 0x0010u)) {    // saybit:04 == imap2 enabled ::: conestepmap to be implemented later
color = color * texture(isam0, vtcoord.xy);    // temporary fallback to unperturbed texturemap
} else {
color = color * texture(isam0, vtcoord.xy);    // color = color.rgba * texturemap.rgba @ tcoord.xy
}
}

if (bool(saybit & 0x0002u)) {        // saybit:01 == light.rgba enabled
if (bool(saybit & 0x0008u)) {    // saybit:03 == isam1.xyzw  enabled (tweak surface normal to achieve "normal-mapping")
tweak = vec3((2.0 * texture(isam1, vtcoord.xy)) - 1.0);    // tweak surface normal to achieve "normal-mapping"
} else {
tweak = vec3(0.0000, 0.0000, 1.0000);    //
}
vec3 light0 = normalize(vlight0.xyz);        // pixel to light0 vector in surface-coordinates AKA tangent-coordinates
vec3 light1 = normalize(vlight1.xyz);        // pixel to light1 vector in surface-coordinates AKA tangent-coordinates
vec3 light2 = normalize(vlight2.xyz);        // pixel to light2 vector in surface-coordinates AKA tangent-coordinates
vec3 light3 = normalize(vlight3.xyz);        // pixel to light3 vector in surface-coordinates AKA tangent-coordinates
vec3 camera = normalize(vcamera.xyz);   // pixel to camera vector in surface-coordinates AKA tangent-coordinates
//
// add ambient contributions from lights 0,1,2,3
//
vec3 pixel = vec3(0.0, 0.0, 0.0);                              // pixel starts totally black
pixel += (color.xyz * (vec3(ig_clight0) * ambient));    // add ambient color from light0
pixel += (color.xyz * (vec3(ig_clight1) * ambient));    // add ambient color from light1
pixel += (color.xyz * (vec3(ig_clight2) * ambient));    // add ambient color from light2
pixel += (color.xyz * (vec3(ig_clight3) * ambient));    // add ambient color from light3
//
// add diffuse contributions from lights 0,1,2,3
//
pixel += (color.xyz * (vec3(ig_clight0) * diffuse) *  max(dot(tweak.xyz, light0.xyz), 0.0));    // add diffuse color from  light0
pixel += (color.xyz * (vec3(ig_clight1) * diffuse) *  max(dot(tweak.xyz, light1.xyz), 0.0));    // add diffuse color from  light1
pixel += (color.xyz * (vec3(ig_clight2) * diffuse) *  max(dot(tweak.xyz, light2.xyz), 0.0));    // add diffuse color from  light2
pixel += (color.xyz * (vec3(ig_clight3) * diffuse) *  max(dot(tweak.xyz, light3.xyz), 0.0));    // add diffuse color from  light3
//
// add specular contributions from lights 0,1,2,3 --- add code to support one-channel specularmap in imap01.w (now empty)
//
pixel += (color.xyz * (vec3(ig_clight0) * specular) *  pow(max(dot(camera.xyz, reflect(-light0.xyz,tweak.xyz)), 0.0), 128.0));
pixel += (color.xyz * (vec3(ig_clight1) * specular) *  pow(max(dot(camera.xyz, reflect(-light1.xyz,tweak.xyz)), 0.0), 128.0));
pixel += (color.xyz * (vec3(ig_clight2) * specular) *  pow(max(dot(camera.xyz, reflect(-light2.xyz,tweak.xyz)), 0.0), 128.0));
pixel += (color.xyz * (vec3(ig_clight3) * specular) *  pow(max(dot(camera.xyz, reflect(-light3.xyz,tweak.xyz)), 0.0), 128.0));
//
// limit the intensity of each pixel color to the range 0.0000 to 1.0000
//
color.xyz = clamp(pixel, 0.0, 1.0);                                             // assure 0.0000 <= color.rgba <= 1.0000
}
//
//  NOTE:  need to support alpha color blending here, and decide how to  handle z-depth of semi-transparent and fully transparent pixels
//
outcolor = color;
}

The important lines (as far as the topic of supporting unlimited numbers of bindless textures goes) are near the top.

First, the four uniform blocks bound to uniform block binding points 0, 1, 2, 3 are declared as arrays of 2048 u32vec4 elements. Each u32vec4 holds two 64-bit handles, so each of the uniform buffers backing those blocks can contain up to 4096 bindless texture handles.

Not far inside the function is code that extracts the various 1-bit, 8-bit and 16-bit fields of interest from the u32vec2 vertex attribute.

The u08 imap0, imap1, imap2, imap3 fields contain the indexes into the arrays of u64 bindless texture handles that this triangle wishes to access from the four corresponding uniform blocks (0,1,2,3) for the four purposes (texturemap, normalmap, conestepmap, othermap).

The tmapid value is an index to access the transformation matrices for this object (only useful to vertex shader).

The saybit value is a bit field in which each bit enables (or doesn't) various features like:
bit 0 : let vertex color.rgba impact the final color of the pixels --- this bit alone would cause emissive lighting (a light source).
bit 1 : let the positions, orientations and colors of lights impact the final color of pixels (lighting)
bit 2 : let the texturemap.rgba at the specified vtcoord.xy impact the final color of pixels (texture mapping)
bit 3 : let the normalmap.xyzw at the specified vtcoord.xy impact the final color of pixels (normal mapping AKA bump mapping)
bit 4 : let the conestepmap.xyzw at the specified vtcoord.xy impact the final color of pixels (conestep/parallax mapping)
bit x : a few more bits are relevant.

Quite likely the saybit field will be considerably shorter than 16-bits, and the tmapid field will be considerably wider. The tmapid field essentially contains the integer shape object identifier, which varies from 1 to number-of-objects. The vertex shader reads 1, 2, or 3 matrices from an SSBO with this tmapid being the index into the array of f32mat4x4 arrays. This is not needed in the fragment shader.
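
The field extraction described above can be sketched on the CPU side in C++. This is only an illustration: the bit layout below is an assumption (the real layout is whatever the shader earlier in this post uses), and the names `PackedAttr`, `Fields`, `pack`, `unpack`, `enabled` and the `SAY_*` constants are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// ASSUMED layout of the 64-bit integer vertex attribute (u32vec2), for
// illustration only:
//   word0 = imap0 | imap1 << 8 | imap2 << 16 | imap3 << 24
//   word1 = saybit (low 16 bits) | tmapid << 16
struct PackedAttr { std::uint32_t word0, word1; };

struct Fields {
    std::uint32_t imap[4];  // 8-bit indexes into the four handle arrays
    std::uint32_t saybit;   // 16-bit feature-enable bits
    std::uint32_t tmapid;   // 16-bit transformation-matrix index
};

// Feature bits 0..4 as listed in the post (constant names are hypothetical).
constexpr std::uint32_t SAY_VERTEX_COLOR = 1u << 0;
constexpr std::uint32_t SAY_LIGHTING     = 1u << 1;
constexpr std::uint32_t SAY_TEXTUREMAP   = 1u << 2;
constexpr std::uint32_t SAY_NORMALMAP    = 1u << 3;
constexpr std::uint32_t SAY_CONESTEPMAP  = 1u << 4;

// Pack the fields into two 32-bit words (what the application would upload).
inline PackedAttr pack(const Fields& f) {
    for (int i = 0; i < 4; ++i) assert(f.imap[i] <= 0xFFu);
    assert(f.saybit <= 0xFFFFu && f.tmapid <= 0xFFFFu);
    PackedAttr a;
    a.word0 = f.imap[0] | (f.imap[1] << 8) | (f.imap[2] << 16) | (f.imap[3] << 24);
    a.word1 = f.saybit | (f.tmapid << 16);
    return a;
}

// Extract the fields again with shifts and masks (what the shader would do).
inline Fields unpack(PackedAttr a) {
    Fields f;
    for (int i = 0; i < 4; ++i) f.imap[i] = (a.word0 >> (8 * i)) & 0xFFu;
    f.saybit = a.word1 & 0xFFFFu;
    f.tmapid = a.word1 >> 16;
    return f;
}

// True when the given feature bit is set in saybit.
inline bool enabled(const Fields& f, std::uint32_t bit) {
    return (f.saybit & bit) != 0u;
}
```

Under that assumed layout, four 8-bit indexes plus two 16-bit fields fill the 64 bits exactly, which is why widening tmapid would mean shrinking saybit.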

As far as bindless textures go, the key points are:

Each bindless texture handle is contained in two of the four elements of one of the uniform buffer u32vec4 arrays. The long lines that compute ivex0, ivex1, ivex2, ivex3 read the u64 bindless texture handle from two of the four elements of one of the u32vec4 array elements and convert them into a uvec2 vector (which GLSL knows how to convert to a sampler of the specified bindless texture).

Then the four lines that start with sampler2D convert the four u32vec2 values into four samplers for the specified bindless textures.
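
The addressing arithmetic behind those lines can be modeled in C++. This is a sketch, not the shader itself: it assumes handle number h lives in array element h/2, in the .xy pair when h is even and the .zw pair when h is odd, with the low 32 bits stored first. The types `U32Vec4`, `U32Vec2` and the helper names are hypothetical CPU-side stand-ins.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct U32Vec4 { std::uint32_t x, y, z, w; };  // CPU mirror of a GLSL u32vec4
struct U32Vec2 { std::uint32_t x, y; };        // CPU mirror of a GLSL u32vec2

// Store 64-bit handle number `index` into the block.
// ASSUMPTION: even index -> .xy, odd index -> .zw, low word first.
inline void storeHandle(std::vector<U32Vec4>& block, std::size_t index,
                        std::uint64_t handle) {
    assert(index / 2 < block.size());
    U32Vec4& e = block[index / 2];
    const auto lo = static_cast<std::uint32_t>(handle & 0xFFFFFFFFu);
    const auto hi = static_cast<std::uint32_t>(handle >> 32);
    if (index % 2 == 0) { e.x = lo; e.y = hi; } else { e.z = lo; e.w = hi; }
}

// Fetch handle number `index` as the two-word vector the shader would pass
// to the sampler constructor.
inline U32Vec2 fetchHandle(const std::vector<U32Vec4>& block, std::size_t index) {
    assert(index / 2 < block.size());
    const U32Vec4& e = block[index / 2];
    return (index % 2 == 0) ? U32Vec2{e.x, e.y} : U32Vec2{e.z, e.w};
}

// Reassemble the 64-bit value from the two words.
inline std::uint64_t toU64(U32Vec2 v) {
    return (static_cast<std::uint64_t>(v.y) << 32) | v.x;
}
```

With 2048 elements per block, this addressing gives room for 4096 handles, matching the figures above.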

After that, the rest is simple and typical of many fragment shaders... except for the fact that each computation is conditional upon one of the say bits being set == 1. Therefore, a line like the following conditionally lets the vertex color.rgba contribute to the fragment color:

if (bool(saybit & 0x0001u)) { color = color * vcolor; }

And ditto for letting the positions, colors, brightness of the four lights impact the color of the fragment.

And ditto for letting the normalmap impact the color of the fragment via the conventional normalmap computation.

And so forth.

But as far as the bindless textures go, only the lines that call texture(isam#, ...) matter, as they access the bindless textures.

That's just about it.

I will attach four images captured from tests I ran to see whether I could change the texturemap and normalmap applied to individual triangles within simple shape objects. As you will see, that worked (except for one artifact that I'll mention that is only a problem with the way I define certain shape objects, and unrelated to this issue).

To make sense of these images you'll need to understand just a little about how these shapes are generated. For almost all shapes the create functions take the following arguments (and a lot more):

level
level_first
level_count
sides
side_first
side_count

Except for the most trivial shape objects (like face3 == triangle and face4 == rectangle), shape objects can have between 1 and dozens of levels and 3 and dozens of sides. The shape objects shown in the attached images are called "faces" objects. They are essentially thin one or two sided disks with 3 or more sides and 1 or more levels.

A 1 level, 3 sided "faces" object is a triangle composed of three identical triangles. A 1 level, 4 sided "faces" object is a square composed of four identical triangles. A 1 level, 16 sided faces object is a 16-sided polygon composed of 16 identical triangles. All triangles have one vertex at the center.

A 2 or 3 or 16 level "faces" object with the same number of sides looks very similar. In fact, unless some levels or sides are omitted via those arguments, the number of levels does not change the outward shape or appearance of a faces object (but does change the shape or appearance of some other shapes).

By setting level_first to a value greater than zero, the triangles that compose one or more of the innermost levels will be absent, and the level_count argument will specify the number of levels to build with triangles beyond level_first.

By setting side_first to a value greater than zero, the triangles that compose one or more sides of the shape (at all levels) will be absent (those sides from side 0 to side_first will be absent). The side_count argument specifies how many of the "sides" sides to build with triangles.

The "levels" and "sides" arguments must be valid and specific (levels must be 1 or more, while sides must be 3 or more).

The other arguments can be zero to specify the "default" level_first (0), the "default" side_first (0), the "default" level_count (level_count == levels), and the "default" side_count (side_count == sides).
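
The argument rules above can be summarized in a small C++ sketch. The struct and function names are hypothetical, and the half-open range test in `isBuilt` (levels from level_first up to level_first + level_count, likewise for sides) is an assumption about how the create functions interpret the arguments.

```cpp
#include <cassert>

// Hypothetical bundle of the six arguments named in the post.
struct ShapeArgs {
    int levels, level_first, level_count;
    int sides, side_first, side_count;
};

// levels must be 1 or more, sides must be 3 or more.
inline bool valid(const ShapeArgs& a) {
    return a.levels >= 1 && a.sides >= 3;
}

// Replace zero counts with their defaults. level_first and side_first need
// no work because their default is already zero.
inline ShapeArgs resolveDefaults(ShapeArgs a) {
    assert(valid(a));
    if (a.level_count == 0) a.level_count = a.levels;
    if (a.side_count == 0) a.side_count = a.sides;
    return a;
}

// ASSUMPTION: the triangles at (level, side) are built when both values fall
// inside [first, first + count) and inside the shape itself.
inline bool isBuilt(const ShapeArgs& a, int level, int side) {
    const bool levelOk = level >= a.level_first &&
                         level < a.level_first + a.level_count &&
                         level < a.levels;
    const bool sideOk = side >= a.side_first &&
                        side < a.side_first + a.side_count &&
                        side < a.sides;
    return levelOk && sideOk;
}
```

For example, resolving `{3, 1, 2, 16, 4, 1}` would describe a 3 level, 16 sided shape with the innermost level omitted and only side 4 built.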

Anyway, the point is this. One can create partial shapes with the create functions. You will see examples of this in the attached images.

What I did to test whether the code would display different textures on a "per triangle" basis was to add one or two lines of code to the "faces" create function to set a different texturemap and normalmap for one side (one image), for two levels (another image), and for one side and two levels (final image). You can see the consequences of that in the images.

Which is... it works!

Except for one case that is my fault, not the fault of the bindless textures features. The problem is, the faces shape contains only one vertex at the center of the object, and all the triangles in level 0 share that vertex. This decision already caused me trouble once before, but I haven't gotten around to fixing the problem yet. I need to place as many vertices at the center as there are sides on level 0 of the shape.

Why so? In this case, there is no way to distinguish which "side" of the object the central vertex belongs to. Every other provoking vertex in the object is at the start of one side or one level of the object. In fact, the first == provoking vertex is always at the lower level and lower side of every triangle. But... which side is the central vertex the provoking vertex of? Oops! This is a classic singularity or pole problem. The central vertex is the provoking vertex on every single triangle in level 0, and thus the provoking vertex on every side. And therefore, the question that needs to be answered to decide which texturemap and normalmap to select for any side at level 0 is... undefined and problematic.

I ran into this problem before on the "faces", "disk", "peak", "cone", "ball", "globe" shapes. When creating objects, an option flag lets the calling function specify how to generate texture coordinates for every vertex. For the images displayed the texture coordinates were generated in a "flat mode" or "linear mode" manner (just lay the texturemap image flat onto the shape object and assign the x,y coordinates of the texturemap to the vertices they lie upon).

However, calling functions can specify other ways to generate texture coordinates, one of which is called "wrap mode". This "wrap mode" is sometimes sensible for the shapes I mentioned in the previous paragraph... especially for "peak", "cone", "ball", "globe". But a big problem arises when these shapes have only one vertex at the "tip" of the "peak" or "cone"... or at the "poles" of the "ball" or "globe". The problem is, with only one vertex at the pole, the texture-coordinates for that vertex in ALL the many triangles that touch either pole are always the same. And so, the tcoords along the sides of the triangles as they approach the pole can be very, very, extremely imprecise at various places.

The solution for both of these problems is to put as many vertices at these "singularity" points as the number of sides (and vertices) at the outside of the first level. This will fix the ugly texturemap problem with wrap mode, and also fix the ugly problem you will see in the images with the central level 0 (where the wrong or unexpected texture is displayed at the central level, even for the side we wanted to change the texture).
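
The difference between the two layouts can be shown with a small C++ model of a 1 level "faces" object. It assumes the first vertex of each triangle is the provoking vertex (as described above) and that the per-triangle texture choice is carried as an integer attribute on that vertex. All type and function names here are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vertex { float x, y; std::uint32_t texIndex; };
struct Mesh {
    std::vector<Vertex> verts;
    std::vector<std::uint32_t> indices;  // three per triangle, provoking vertex first
};

// Vertex on the unit circle at the start of the given side.
inline Vertex rimVertex(std::size_t side, std::size_t sides, std::uint32_t tex) {
    const double angle = 6.283185307179586 * double(side) / double(sides);
    return Vertex{float(std::cos(angle)), float(std::sin(angle)), tex};
}

// One shared center vertex: it can carry only ONE texIndex for all sides.
inline Mesh buildSharedCenter(const std::vector<std::uint32_t>& texForSide) {
    const std::size_t n = texForSide.size();
    assert(n >= 3);
    Mesh m;
    m.verts.push_back(Vertex{0.0f, 0.0f, texForSide[0]});
    for (std::size_t s = 0; s < n; ++s) m.verts.push_back(rimVertex(s, n, texForSide[s]));
    for (std::size_t s = 0; s < n; ++s) {
        m.indices.push_back(0u);
        m.indices.push_back(static_cast<std::uint32_t>(1 + s));
        m.indices.push_back(static_cast<std::uint32_t>(1 + (s + 1) % n));
    }
    return m;
}

// One center vertex per side: each triangle gets its own provoking vertex.
inline Mesh buildSplitCenter(const std::vector<std::uint32_t>& texForSide) {
    const std::size_t n = texForSide.size();
    assert(n >= 3);
    Mesh m;
    for (std::size_t s = 0; s < n; ++s) m.verts.push_back(Vertex{0.0f, 0.0f, texForSide[s]});
    for (std::size_t s = 0; s < n; ++s) m.verts.push_back(rimVertex(s, n, texForSide[s]));
    for (std::size_t s = 0; s < n; ++s) {
        m.indices.push_back(static_cast<std::uint32_t>(s));
        m.indices.push_back(static_cast<std::uint32_t>(n + s));
        m.indices.push_back(static_cast<std::uint32_t>(n + (s + 1) % n));
    }
    return m;
}

// Integer attribute a triangle receives under first-vertex provoking.
inline std::uint32_t flatTexIndex(const Mesh& m, std::size_t triangle) {
    return m.verts[m.indices[3 * triangle]].texIndex;
}
```

With per-side texture choices of 7, 7, 9, 7, the shared-center mesh reports 7 for side 2 (the artifact visible in the images), while the split-center mesh reports 9, at the cost of sides - 1 extra vertices.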

You can see the problem on the right-hand image in two of the three attached images, namely the image that shows different texturemap and normalmap on a certain side (a radial swath from center to one edge), and the image that shows different texturemap and normalmap on both certain sides and levels (but the sides issue is the problematic one). The problem is always the central area.

Anyway, thanks to all who helped so much recently. I hope you got something out of this too... at least a tiny bit. And I hope others get some benefit too, someday. I have to assume more people will be trying out bindless textures as time goes by. They're great for AZDO, which is what I'm headed for in as many ways as possible.

Now on to other AZDO opportunities, starting with all that multidraw jazz.

4. Also, as I hoped (and sorta thought), you will see that a very perverted but cool feature that I attempted actually works in OpenGL...
You should not claim that something "works" just because it doesn't appear to be broken. Undefined behavior is undefined; that it might do what you want in this particular case does not guarantee that it will continue to do so.

The ARB_bindless_texture extension makes it abundantly clear that the sampler value passed to texture functions must be dynamically uniform. And the definition of "dynamically uniform" explicitly disallows per-triangle data as being "dynamically uniform". Not unless the entire rendering command only has one triangle.

Now, that being said, `NV_gpu_shader5` explicitly allows non-dynamically uniform texture sampling. But as you might notice, that's an NVIDIA-only extension.

That means that all the invoked fragment shaders receive all the integer vertex-attributes ONLY from the first vertex of each triangle (which is why it is called the "provoking vertex"). Since integer vertex attributes are not interpolated like other vertex-attributes, the engine has no other choice except to choose the integer vertex attributes from one of the three vertices to send to all instances of the fragment shaders deployed to draw each triangle.
Which means precisely nothing as far as dynamically uniform expressions are concerned. The scope for a "graphical operation" is explicitly nebulous, but must be no greater than "a single rendering command, as defined by the client API." Multiple triangles in the same rendering operation are therefore able to be "a single rendering command". And therefore an expression which is dynamically uniform must result in the same value for every triangle rendered in that command.

NVIDIA of course can define this more narrowly (the standard requires it to be "at least as large as a triangle or patch"). But that's implementation-specific behavior; as far as following the actual specification, dynamically uniform is defined at the rendering command level.

Consider the behavior of hardware where multiple FS invocations from different primitives are executed in the same invocation group/wavefront/whatever. There's no reason why an implementation cannot have 4-16 pixel quads from different triangles running in the same group. This is why those expressions have to be dynamically uniform: the hardware may only use the sampler value from one of those invocations to get which texture to read from, then read from that texture using each FS invocation's different texture coordinates.

That is a perfectly valid OpenGL implementation. And your code would break on it.
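
A toy C++ model of that hypothetical hardware makes the failure concrete. It does not describe any real GPU: it only assumes a group of invocations where the texture is chosen once, from the first invocation, while each invocation keeps its own coordinate. All names are made up for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Texture = std::vector<int>;  // a "texture" is just a row of texel values

struct Invocation {
    std::size_t sampler;  // which texture this invocation asked for
    std::size_t texel;    // where in that texture it wants to read
};

// What the shader author expects: every invocation reads its own texture.
inline std::vector<int> samplePerInvocation(const std::vector<Texture>& textures,
                                            const std::vector<Invocation>& group) {
    std::vector<int> out;
    for (const Invocation& inv : group) {
        assert(inv.sampler < textures.size());
        out.push_back(textures[inv.sampler].at(inv.texel));
    }
    return out;
}

// The permitted implementation described above: take the sampler from ONE
// invocation of the group, then read with each invocation's own coordinate.
inline std::vector<int> sampleScalarized(const std::vector<Texture>& textures,
                                         const std::vector<Invocation>& group) {
    std::vector<int> out;
    if (group.empty()) return out;
    assert(group.front().sampler < textures.size());
    const Texture& chosen = textures[group.front().sampler];
    for (const Invocation& inv : group) out.push_back(chosen.at(inv.texel));
    return out;
}
```

When every invocation in the group names the same texture, both functions agree. When the group mixes pixel quads from two triangles that name different textures, the second function silently reads the wrong texture for half of them.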

5. Thanks for the additional information. One followup question. Do they consider glMultiDrawElementsIndirect() functions as one draw or <count> draws for these purposes?

Also, even after listening to a lot of videos about these topics (all I could find), I don't recall hearing an answer to the following question: If an application draws the exact same set of objects with the exact same set of vertices (same total number of vertices) with functions like glMultiDrawElementsIndirect() but one time the draw count is 100 and the next time the draw count is 1000... will the latter be substantially slower, or not much slower?

Obviously there are no state changes between draws either way, but perhaps there is measurable or significant [or substantial] overhead to kick off each draw call. Yet when I think about how this might be happening in the GPU, it seems also possible that the difference might be insignificant or nearly unmeasurable (unless taken to absurd extremes like 3 indices == 1 triangle per internal draw).

PS: While overall I'm a fan of all the great work the GPU makers have done over the years, one place I am very much not a fan is their apparent obscuration of what actually happens in the GPUs. The advent of CUDA and OpenCL has reduced that problem a bit, because certain aspects of GPU operation can be inferred from how compute is organized and works. But unless I missed some important articles somewhere (possible, because I finally gave up looking after a few years), much about GPU architecture and operation is still not stated. One reason I'm so bad at this stuff (OpenGL) is because I have (and always did have) the worst memory in the world. But even more important, because my past history was to develop products from [pretty much] the lowest-level possible and up from there, I knew how EVERYTHING worked. And my brain got habituated to needing to know how every element I worked with works. Everything I ever invented or designed this way was fairly easy and painless for me to develop and never problematic. For example, I used to design CPUs (which means invent the architecture, instruction set, everything) from scratch... meaning from SSI gates (AND, OR, XOR) or MSI (multiplexers and such, for which gate diagrams I could inspect and understand were always available). And every one of my CPU designs worked first time (though once there was a PCB layout error that required one patch wire to replace an omitted trace). The same has happened in my software career. I designed a great many products based upon microprocessors and microcontrollers (chips like 8052 and C8051F120 for example), which were fairly completely speced... and which almost without exception worked first time. And the same applies to the [sometimes rather sophisticated] software in these devices... easy to implement and debug. And why? Because I wrote the operating system too (or just integrated the required OS-like capabilities into the basic code)... every byte of code executing on the device.

But I also ran into cases where I tried to "build on the work of others". One of the first was my attempt to put a GUI around my optical design and analysis program (the first program I ever wrote, in junior high school). I chose Motif (on my UNIX computer) because... well... that's just about all that was available at the time. Oh, the allure of "let us save you all the time and effort of implementing a GUI". Yeah, right. I could never make the GUI work. Too many bugs, and way, way, way too much not specified. No way to know what was actually going on inside Motif, and no way to find out. So I dropped back to the Xt intrinsics, which were supposed to be lower-level. Same problem... no joy, and no way to know whether you could ever arrive at a solution. And of course, their hundred or so bug fixes every month. Eventually I created a new programming language (big mistake, but that's another story). Once the compiler and debugger were done, I wanted GUI capabilities. So I wrote my own interactive graphical GUI designer subsystem (in that language). Damn! I spent about 1% as much time to develop the whole damn thing as I spent trying (and failing) to make Motif and Xt work! Why? Because I based my graphics and GUI code on xlib, which was simple and did not hide [nearly as] much.

The fallout of all the above (and there are LOTS more supporting cases I'm certain you don't want to hear) is... maybe almost nobody else knows how horrific the situation is with anything we cannot know how it works to the lowest level. Okay, okay, if you're picky, I'm willing to accept that we don't need to know the material physics of transistors and basic gates in order to be able to design with gates (for example). But we still need to understand the consequences of that, the timing, the hysteresis, the temperature dependencies and so forth. But yeah, we don't necessarily need to understand the bumps on the sides of Quarks. So, while I push a long way towards the fundamental, I don't push all the way. Just far enough so I can know everything that matters to what I'm doing.

All this is just my way of saying "they should explain what's going on"... at least a lot more than they do. Or if they do, someone needs to point me to where those documents are, because I'm not in the mood to take a job at nvidia and AMD just to find out. AMD has been trying to hire me for decades, but I like being independent and working only on my own projects. In case it isn't obvious (!!! it is !!!), AMD has always wanted me to design CPUs, not GPUs. Hahaha!

Fact is, I've never even seen a fully coherent, understandable meaning of "dynamically uniform". The link you provided tries pretty damn hard to be clear, but I'm too stupid to glue all the abstraction together into a coherent idea in my head. For example, I cannot see for the life of me how texture coordinates can possibly be "dynamically uniform" when they are different for every freaking fragment shader that executes. Maybe that writeup on "dynamically uniform" makes sense to others. I dunno. It isn't complete or precise enough for me to comprehend. I suspect that's just the stubborn need to completely understand everything that pervades my consciousness from decades of training it to be that way (because no freaking way could I ever have developed all those products without fully understanding how the components I was working with actually worked).

If the GPU vendors would clearly explain "what happens" when the GPU processes three triangle vertices, then kicks off a [pile/bunch/collection/zorchplex] of fragment shaders, maybe we could more-or-less know what will work and what will not work before we waste millions of hours of our lives and the health of trillions of neurons (speaking collectively here). But like I said, they don't... as far as I can see. Even that link you provided (which is great to have, by the way) says it pretty much infers everything it says from Vulkan, not OpenGL... because OpenGL doesn't explain clearly. Yeah, no kidding! And yes, I understand OpenGL is a specification, not an application owned by a single GPU vendor. That's not an adequate excuse in my book. And no, I don't intend to take a job at Khronos either. If you knew what this engine is merely a subsystem of, then you'd understand why.

I am perfectly aware that my insistence on having fully coherent and specific statements of the fundamentals of the components I work with... makes me look stupid. I don't care how I look. All I care about is getting results... good results... better results... great results if possible. I'm also aware that my eternally terrible memory works against me too (both in reality and appearances). So be it, I have no choice. But you know what's funny? I've found I absolutely don't need more than my terrible memory... when I fully understand how something works. When I understand... everything is clear and obvious. True, "understanding" must necessarily involve some kind of memory. But somehow the memory required for understanding is different... as if every bit of memory involved in the understanding is supported and reinforced by the understanding itself (and/or by every other little bit of related memory involved in the understanding).

Frankly, I don't know how you guys manage to remember so many individual factoids, or where you found them --- or wrote them? You probably all have great memories. Or, you submerge yourself in this stuff for so many endless hours every day that this stuff becomes part of your very being.

I am so utterly, completely, thoroughly used to "knowing stuff like this" as a mere consequence of "understanding how everything works". Without that overall understanding, my IQ165 turns into IQ11 (plus or minus 10). I'm not sure whether that's an exaggeration or not, but that certainly feels about right.

A couple factoids about me and this specific project are the following. I always design ahead... well ahead. In this case, my target date is two years from now, because I still have a lot of "procedurally generated content" nonsense to expand or add. Second, this will almost certainly become open-source freeware, not a commercial product. Nonetheless, I hope a great many people create commercial products on the back of this engine. Third, it doesn't bother me one tiny bit if only the best high-end GPUs can support all the features of the engine... though the high-end GPUs of both nvidia and AMD must support them (no one-vendor-only nonsense). However, since this specific feature (triangle-by-triangle specification of images/textures/normalmaps/etc) is not crucial for everyone, if bilateral support doesn't exist (at least on the near horizon), it could remain as an "undocumented feature" for the short or medium term. In short, it seems like plenty of other game/physics engines exist out there, so if this engine is just a "niche engine" then that's just fine with me (in fact, less trouble for me). Don't want to create an application with lots of procedurally generated content (that you don't have to do yourself)? Then grab another engine and best wishes (no sarcasm).

And so, I have the "luxury" of including features that might be completely out of the question for other engines or products or applications. And so I will. I don't need to release an engine that runs on "anything back to OpenGL v3.2 or v3.3" for example. In fact, I'm perfectly fine releasing an engine that requires the latest-and-greatest version of OpenGL... plus a few ARB extensions (but no fringe extensions or one-vendor extensions). I consider ARB extensions to be "headed to core in some form" extensions.

I know most folks do not have such "luxury". Just know that I do.

PS: I'll read that "dynamically uniform" writeup another 5 or 6 times. Sometimes that gets difficult issues to gel a bit better.

PS: If someone with a new AMD GPU wants to try this to see whether it works the same, let me know what happens (or ask to run my code if you prefer). That could settle the question of this specific feature once and for all (if it works on AMD too).

PS: Most important, I appreciate all the time and effort you expend to help me [attempt to] overcome my memory [and other] disabilities. :-)

PS: I type at about warp 9.75, so I hope you read at least as fast, or don't mind such long [and sometimes rambling] messages (and typos).

6. Do they consider glMultiDrawElementsIndirect() functions as one draw or <count> draws for these purposes?
The answer is... unclear. The specification only says "single rendering command, as defined by the client API," but they never actually define that in anything but the most obvious way (aka: any `gl*Draw*` call).

However, `gl_DrawID` is explicitly required to be dynamically uniform. It's pretty difficult to make that happen without the individual draws being considered separate invocation groups.

Adding to that, Vulkan is explicit about this, "For indirect drawing commands with `drawCount` greater than one, invocations from separate draws are in distinct invocation groups". The reason why "indirect drawing commands" is mentioned is because Vulkan doesn't have non-indirect multidraw commands. You're expected to simply repeatedly issue multiple drawing operations if you want the CPU equivalent. Which will, by definition, be separate commands.
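
For reference, the way a multi-draw indirect call breaks down into sub-draws can be modeled on the CPU in C++. The command struct uses the five fields named earlier in this thread. The `Fetch` record and `emulateMultiDraw` function are hypothetical, and the sketch says nothing about how hardware groups invocations; it only shows that each sub-draw has one constant draw ID.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Field order as in the OpenGL indirect command structure.
struct DrawElementsIndirectCommand {
    std::uint32_t count;
    std::uint32_t instanceCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
    std::uint32_t baseInstance;
};

// Hypothetical record of one vertex fetch.
struct Fetch {
    std::uint32_t drawID;  // the value gl_DrawID would have for this sub-draw
    std::int64_t  vertex;  // index of the vertex fetched from the vertex arrays
};

// Walk the command array the way the equivalent loop of single indirect
// draws would: one sub-draw per command, one fetch per index per instance.
inline std::vector<Fetch> emulateMultiDraw(
        const std::vector<DrawElementsIndirectCommand>& cmds,
        const std::vector<std::uint32_t>& indices) {
    std::vector<Fetch> out;
    for (std::size_t d = 0; d < cmds.size(); ++d) {
        const DrawElementsIndirectCommand& c = cmds[d];
        for (std::uint32_t inst = 0; inst < c.instanceCount; ++inst) {
            for (std::uint32_t i = 0; i < c.count; ++i) {
                assert(c.firstIndex + i < indices.size());
                const std::int64_t v =
                    static_cast<std::int64_t>(indices[c.firstIndex + i]) + c.baseVertex;
                out.push_back(Fetch{static_cast<std::uint32_t>(d), v});
            }
        }
    }
    return out;
}
```

A value indexed by drawID is constant within each sub-draw, so it is uniform per sub-draw even though it differs across the whole multi-draw call.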

If an application draws the exact same set of objects with the exact same set of vertices (same total number of vertices) with functions like glMultiDrawElementsIndirect() but one time the draw count is 100 and the next time the draw count is 1000... will the latter be substantially slower, or not much slower?
There's no way to answer that question. It depends far too much on what is being drawn.

All this is just my way of saying "they should explain what's going on"... at least a lot more than they do.
Who? Who are you talking to? Who is this "they"?

OpenGL (and Vulkan. And D3D. And Metal) is an abstraction. The whole point of an abstraction is that you do not care about what's going on.

You write code against the model defined by the abstraction. The implementation implements the abstraction's model on concrete hardware. That's the way it works. You are not required to understand "what's going on"; you are simply required to understand the rules laid down by the abstraction. Follow those rules, and your code works (modulo bugs).

This is what allows you to write the same code that works effectively on multiple platforms. You're surrendering knowledge of the low-level details so that you can gain independence from the low level details.

Every platform does rasterization slightly differently. What good is it to know how Intel does it if NVIDIA does it differently? What good is it to know how Kepler hardware does it if Maxwell hardware changes the rules?

For example, I cannot see for the life of me how texture coordinates can possibly be "dynamically uniform" when they are different for every freaking fragment shader that executes.
Why do you think they have to be? It's the sampler itself that must be dynamically uniform. That is, all of the invocations must access the same texture. But they don't have to access it in the same place.

Maybe that writeup on "dynamically uniform" makes sense to others. I dunno. It isn't complete or precise enough for me to comprehend.
What is incomplete or imprecise about it? It's kind of hard to help you when all you can say in response to information is "that's not good enough!"

A dynamically uniform expression is an expression that evaluates to the same value in every shader invocation spawned from the same rendering command. While that is a bit of a simplification (the idea of "dynamic instances" of expressions is what allows loop counters to be dynamically uniform), it's pretty complete and precise.

Show me a shader, pick an expression, and I can tell you if it will be guaranteed dynamically uniform or potentially not uniform (depending on the values the user provides. Inputs can be dynamically uniform, but only if the user provides the same input value to all invocations).
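
The simplified definition above ("same value in every invocation spawned from the same rendering command") can be written as a one-function check in C++. It is only a mental model: it takes the per-invocation values as given and ignores the "dynamic instance" subtlety mentioned above. The function name is hypothetical.

```cpp
#include <cassert>
#include <vector>

// True when the expression evaluated to the same value in every invocation
// of one rendering command. An empty command is trivially uniform.
template <typename T>
bool sameInAllInvocations(const std::vector<T>& perInvocation) {
    for (const T& v : perInvocation) {
        if (!(v == perInvocation.front())) return false;
    }
    return true;
}
```

A texture index read from a uniform passes the check. A texture index that changes from triangle to triangle within one draw fails it. Texture coordinates fail it too, which is fine, because only the sampler has to be dynamically uniform.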

Even that link you provided (which is great to have, by the way) says it pretty much infers everything it says from Vulkan, not OpenGL... because OpenGL doesn't explain clearly. Yeah, no kidding!
To be fair, I wrote that notation before an update to OpenGL 4.5, when they (finally!) updated the standard to (mostly) explain these things to the level of detail that Vulkan does. Though to be fair as well, that probably happened because I filed a bug report on it.

Admittedly, they fixed this by literally copying-and-pasting a chunk of the SPIR-V standard into the GLSL specification (seriously, the "Dynamically Uniform Expressions and Uniform Control Flow" section is word-for-word), but at least it's there.

There are certainly holes in the OpenGL standard, thanks in part to it slowly evolving from a very different original standard. GL 1.1 is nearly unrecognizable in many ways from 4.6. By contrast, Vulkan started from scratch, which meant that they couldn't assume that something had already been said.

So sure, the update to GL that explained this didn't happen until I personally filed a bug report on it. But to be fair, Vulkan didn't fully present that information until I filed a bug report on it too. And it was literally the first public bug on the standard.

Oh, and FYI? I filed a bug asking for clarification on the multidraw issue.

Frankly, I don't know how you guys manage to remember so many individual factoids, or where you found them --- or wrote them? You probably all have great memories. Or, you submerge yourself in this stuff for so many endless hours every day that this stuff becomes part of your very being.
For what it's worth, I didn't "remember" that bindless texture requires dynamically uniform expressions. I remembered how to find the bindless texture extension, and I searched for "dynamically" and found the note that restricted it. I didn't remember that there was a restriction; I simply thought that there could be one.

That's not memory; that's experience. Or paranoia, which is really the same thing when it comes to programming.

Most of this comes from reading the standards. I've been doing graphics programming since before shaders were a thing. I've read a lot of the extension specifications when they were hot off the presses. I imagine that, in your CPU development work, you too read lots about code-gate design, substrates, and other things as research was being done. This is no different.

Also, you seem to divorce "understanding" from "memory", as though you could understand how something works without remembering how it works. It's also odd that you seem to suggest that we don't "understand" this stuff.

Originally Posted by bootstrap
I always design ahead... well ahead. In this case, my target date is two years from now, because I still have a lot of "procedurally generated content" nonsense to expand or add.
And yet, you've chosen to use the API that's behind... well behind.

I'm kidding, but only somewhat. If performance is so important, and your ship date is in the (relatively) far future... why aren't you using Vulkan? It'd make it a lot easier to do your procedurally generated stuff, thanks to better synchronization support and more explicit memory access. You're already using ubershaders and working to minimize state changes, so that's already in line with Vulkan best practices.

Vulkan doesn't have bindless textures, but since you're focused on more capable hardware anyway, you can simply require that implementations provide `shaderSampledImageArrayDynamicIndexing`, which lets you use arrays of textures which you can index from (with a dynamically uniform index, of course). Modern desktop implementations provide that, and you'll note that even Intel is on that list.

Originally Posted by bootstrap
If someone with a new AMD GPU wants to try this to see whether it works the same, let me know what happens (or ask to run my code if you prefer). That could settle the question of this specific feature once and for all (if it works on AMD too).
No, it won't. Undefined behavior is undefined. Even if it worked, that's no guarantee that it will continue to do so.

7. Originally Posted by Alfonse Reinheart
The answer is... unclear. The specification only says "single rendering command, as defined by the client API," but they never actually define that in anything but the most obvious way (aka: any `gl*Draw*` call).

However, `gl_DrawID` is explicitly required to be dynamically uniform. It's pretty difficult to make that happen without the individual draws being considered separate invocation groups.

Adding to that, Vulkan is explicit about this, "For indirect drawing commands with `drawCount` greater than one, invocations from separate draws are in distinct invocation groups". The reason why "indirect drawing commands" is mentioned is because Vulkan doesn't have non-indirect multidraw commands. You're expected to simply repeatedly issue multiple drawing operations if you want the CPU equivalent. Which will, by definition, be separate commands.

There's no way to answer that question. It depends far too much on what is being drawn.

Who? Who are you talking to? Who is this "they"?

OpenGL (and Vulkan. And D3D. And Metal) is an abstraction. The whole point of an abstraction is that you do not care about what's going on.

You write code against the model defined by the abstraction. The implementation implements the abstraction's model on concrete hardware. That's the way it works. You are not required to understand "what's going on"; you are simply required to understand the rules laid down by the abstraction. Follow those rules, and your code works (modulo bugs).

This is what allows you to write the same code that works effectively on multiple platforms. You're surrendering knowledge of the low-level details so that you can gain independence from the low level details.

Every platform does rasterization slightly differently. What good is it to know how Intel does it if NVIDIA does it differently? What good is it to know how Kepler hardware does it if Maxwell hardware changes the rules?

Why do you think they have to be? It's the sampler itself that must be dynamically uniform. That is, all of the invocations must access the same texture. But they don't have to access it in the same place.

What is incomplete or imprecise about it? It's kind of hard to help you when all you can say in response to information is "that's not good enough!"

A dynamically uniform expression is an expression that evaluates to the same value in every shader invocation spawned from the same rendering command. While that is a bit of a simplification (the idea of "dynamic instances" of expressions is what allows loop counters to be dynamically uniform), it's pretty complete and precise.

Show me a shader, pick an expression, and I can tell you if it will be guaranteed dynamically uniform or potentially not uniform (depending on the values the user provides. Inputs can be dynamically uniform, but only if the user provides the same input value to all invocations).

To be fair, I wrote that notation before an update to OpenGL 4.5, when they (finally!) updated the standard to (mostly) explain these things to the level of detail that Vulkan does. Though to be fair as well, that probably happened because I filed a bug report on it.

Admittedly, they fixed this by literally copying-and-pasting a chunk of the SPIR-V standard into the GLSL specification (seriously, the "Dynamically Uniform Expressions and Uniform Control Flow" section is word-for-word), but at least it's there.

There are certainly holes in the OpenGL standard, thanks in part to it slowly evolving from a very different original standard. GL 1.1 is nearly unrecognizable in many ways from 4.6. By contrast, Vulkan started from scratch, which meant that they couldn't assume that something had already been said.

So sure, the update to GL that explained this didn't happen until I personally filed a bug report on it. But to be fair, Vulkan didn't fully present that information until I filed a bug report on it too. And it was literally the first public bug on the standard.

Oh, and FYI? I filed a bug asking for clarification on the multidraw issue.

For what it's worth, I didn't "remember" that bindless texture requires dynamically uniform expressions. I remembered how to find the bindless texture extension, and I searched for "dynamically" and found the note that restricted it. I didn't remember that there was a restriction; I simply thought that there could be one.

That's not memory; that's experience. Or paranoia, which is really the same thing when it comes to programming.

Most of this comes from reading the standards. I've been doing graphics programming since before shaders were a thing. I've read a lot of the extension specifications when they were hot off the presses. I imagine that, in your CPU development work, you too read lots about code-gate design, substrates, and other things as research was being done. This is no different.

Also, you seem to divorce "understanding" from "memory", as though you could understand how something works without remembering how it works. It's also odd that you seem to suggest that we don't "understand" this stuff.

And yet, you've chosen to use the API that's behind... well behind.

I'm kidding, but only somewhat. If performance is so important, and your ship date is in the (relatively) far future... why aren't you using Vulkan? It'd make it a lot easier to do your procedurally generated stuff, thanks to better synchronization support and more explicit memory access. You're already using ubershaders and working to minimize state changes, so that's already in line with Vulkan best practices.

Vulkan doesn't have bindless textures, but since you're focused on more capable hardware anyway, you can simply require that implementations provide `shaderSampledImageArrayDynamicIndexing`, which lets you use arrays of textures which you can index from (with a dynamically uniform index, of course). Modern desktop implementations provide that, and you'll note that even Intel is on that list.

No, it won't. Undefined behavior is undefined. Even if it worked, that's no guarantee that it will continue to do so.
-----

I have a lot of trouble making multi-level nested comments work, so I won't try. Just know that the following addresses issues in the order of your message.

Wow, it would be very interesting to learn that drawing the exact same objects (in every respect) with glDrawElements*() can accomplish more than glMultiDrawElements*() can! As usual, I can't remember for sure, but I sorta think maybe I recall them even saying the results are the same as the client (or perhaps driver?) executing glDrawElementsIndirect() in a loop (which would imply [to stupid me at least] that nothing will happen different).
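For what it's worth, the equivalence being half-remembered here can be modelled on the CPU. This is only a sketch: the struct uses the indirect-command fields named earlier in the thread, while `VertexFetch`, `drawElementsIndirect` and `multiDrawElementsIndirect` are hypothetical stand-ins written for illustration, not the real GL entry points. It treats a multi-draw as a plain loop over single indirect draws, with the loop counter playing the role of `gl_DrawID`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// The five fields of one indirect draw command, as named earlier in the thread.
struct DrawElementsIndirectCommand {
    std::uint32_t count;
    std::uint32_t instanceCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
    std::uint32_t baseInstance;
};

// Hypothetical record of one vertex fetch: which draw asked for which vertex.
struct VertexFetch {
    int drawId;           // plays the role of gl_DrawID
    std::int64_t vertex;  // position in the vertex array that gets fetched
};

// CPU model of a single indirect draw (illustrative, not a GL call).
void drawElementsIndirect(const DrawElementsIndirectCommand& cmd,
                          const std::vector<std::uint32_t>& indices,
                          int drawId,
                          std::vector<VertexFetch>& out) {
    assert(static_cast<std::size_t>(cmd.firstIndex) + cmd.count <= indices.size());
    for (std::uint32_t inst = 0; inst < cmd.instanceCount; ++inst) {
        for (std::uint32_t i = 0; i < cmd.count; ++i) {
            // firstIndex offsets into the index array; baseVertex is added
            // to the fetched index value, counted in vertices, not bytes.
            const std::uint32_t index = indices.at(cmd.firstIndex + i);
            out.push_back({drawId, static_cast<std::int64_t>(index) + cmd.baseVertex});
        }
    }
}

// CPU model of a multi-draw: nothing more than a loop over single draws.
std::vector<VertexFetch> multiDrawElementsIndirect(
        const std::vector<DrawElementsIndirectCommand>& cmds,
        const std::vector<std::uint32_t>& indices) {
    std::vector<VertexFetch> out;
    for (std::size_t d = 0; d < cmds.size(); ++d) {
        drawElementsIndirect(cmds[d], indices, static_cast<int>(d), out);
    }
    return out;
}
```

Under this model a multi-draw fetches exactly the same vertices as the equivalent sequence of single draws. What the model cannot show is how an implementation groups the resulting shader invocations, which is the part under dispute.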

Of course, that only pushes the question back as far as drawing whole objects, not as far as I pushed it... back to drawing individual triangles.

Oh, just a tiny factoid in case you like to know these things. I tried this new test code on my ancient GTX 680 and... it works the same as the GTX 1080 TI... and draws different textures/normalmaps on a triangle-by-triangle basis. Interesting. Already worked on ancient history.

Oh, another probably irrelevant tidbit. I noticed the series of bindless texture handles I received back from OpenGL differed by... one. So it seems rather unlikely they are actually GPU addresses, because it seems unlikely the single byte at those addresses could have enough information to define a sampler... or even provide sufficient indirection to such (unless they have a literal limit of 256 textures/images, which doesn't sound right to me).

To answer your next question, my answer is EVERYONE. Khronos should. nvidia should. AMD should. Everyone should (even though I don't care if Intel does). That way we can understand more fully what's going on, by reading multiple perspectives... some abstract, some concrete.

The whole point of an abstraction is that you do not care what's going on? Well, that's where this engine (and probably a few other applications) differ. I don't care if the engine only runs on 20% of GPUs... as long as they're high-end GPUs from both nvidia and AMD. But sure, I totally understand that running on a huge majority of GPUs is very important to a majority of 3D engine/application developers. I do. But not everyone does (and I dare say I'm not the only one). If I can achieve a 100x performance increase by "cheating" (knowing how just two brands of high-end GPUs work)... then I am more than happy to take advantage of that. Even 10x or 5x performance increase for that matter. Maybe even 2x.

This is where honesty comes in... or I should say, this is where clarity and honesty SHOULD come in (but apparently often don't). For example, in the GDC AZDO video they show some speed comparisons that show array textures are the same speed as bindless sparse texture arrays (or something like that). These were both shown as exactly 18x faster than their nominal case (a DX11 example). However, I looked into array textures A LOT, but my conclusion was "no workie in real life practical situations". Compared to my bindless texture implementation, I don't see how their scheme can work... unless every texture is the same size, and thus easy to put into different layers of one array texture. Unless there is a huge pile of unmentioned wackiness going on in their benchmark, I don't see any way for an application to specify arbitrary array textures of varying size and configuration, bound to arbitrary texture units, to be specified on an object-by-object basis.

Let's assume they take advantage of every trick in the book and new tricks only limited to what is actually possible. Let's assume they call their favorite glMultiDrawElementsIndirect()-like function, and take advantage of that built-in gl_DrawID shader variable to be able to access information in uniform blocks or SSBOs about how to render each object. The problem is, the shader programs can't change between draws. So as far as I can see, shader programs can only access specific samplers... the samplers explicitly named in the source code. But one sampler only accesses one array texture, and each array texture only contains one size and configuration of textures. And therefore, if the texture you want to access for this object within this glMultiDrawElementsIndirect() call accesses the set of textures that is wrong-size or wrong-configuration... you can't render the object at all in that glMultiDrawElementsIndirect() call.

Nonetheless, the approach I took lets shaders access any texture of any size and any configuration whenever it wants for each object (actually each triangle, but we'll ignore that for the purposes of this part of the conversation). Obviously I could make my scheme work from values stored in a UBO/SSBO and indexed with gl_DrawID too. But that would be slower than what I have (albeit only modestly == two extra accesses of UBO/SSBO contents), more complex, and incapable of "stupid GPU tricks" that others disapprove of (yet wish they could achieve too). Of course, they want their applications to run on cell phones and every computer in the world, while I quite explicitly do not care. Seems only fair that I get something valuable for what I give up (execution on everything everywhere).

Hahaha... Khronos should pay you for your important contributions! Not sarcasm.

As an aside, I'd love to know how the GPU folks dispatch (and re-dispatch) fragment shaders when drawing triangles. For now at least, my inference is that they dispatch a fairly independent "horde" of them based upon the 3 vertices in a triangle. But yes, I understand it doesn't have to be this way, it just makes sense to me (which isn't saying much, since I really don't understand as much as I'd like about how GPUs work internally... in practice... in reality).

I might be wrong, but as much as nvidia would like to kick AMD in the nuts and vice versa, I suspect they both are coming to realize that most people want their code to run on as many brands of GPU as possible... and without needing to design, test and keep fixing different sections of code to do the same thing. I'm sure I'm not the only developer who doesn't want to redesign my whole freaking application (and invent a new architecture to make that efficient) to deal with endless growing piles of incompatibilities and alternate code paths.

I do benefit from reading the standards, and a few books (most are near worthless), and watching internet videos or reading PDFs and slide decks. However, so many terms are never defined in a way that makes them fully concrete in my mind that the more I read, the more the chaos of questions and confusions expands, and the less clear I get. Actually, that's not a fair statement. Maybe 60% become more clear the more I read (probably due to encountering different hints, comments and examples of application, which lets me draw inferences). But about 40% goes the other way, and prevents me from ever reaching clarity, it seems. But I do... eventually... get closer to clarity.

I didn't mean to imply you guys "don't understand this stuff". Obviously you understand vastly better than I do! However, I think you get this feeling based upon how I respond negatively to "not knowing what's really going on under the covers". You don't care! In fact, you seem to LOVE IT, because you get something you value more than understanding the guts and reasons and tradeoffs and justifications. You get something that works everywhere! So [maybe] you actually get positive feedback for the lack of complete explanation. I don't. I explained my experiences, so you should at least understand why a little bit.

Hahaha... you caught me! You're too clever by 99%.

For a while after I heard about Vulkan, it was much too vague and much too much hand waving. About three months ago I bought all the books that exist, and read/watched everything I could find about Vulkan to see if Vulkan is my xlib of 3D APIs (go read my previous message if you don't know what that means). Two factoids got me interested in Vulkan.

First, it is [supposedly] lower-level (and probably is). As I explained, unlike most people (I infer), I almost always do vastly better with lower-level interfaces, on both the hardware and software level. So my hope was, Vulkan would be much lower-level, based upon simpler (more fundamental) mechanisms, and also reveal more about how modern GPUs actually work (by exposing more to inference than OpenGL does). And second, maximize performance, which is extremely important to me. For my purposes, I want to unleash all the cores and threads of Threadripper on the work my engine does, including as much of interfacing with the 3D API as possible.

Reading the Vulkan books hurt my brain. Nothing new... trying to jam too much in too quickly. Takes time to see how the various factoids fit together. So it wasn't as thrilling a read as I hoped, but it was good enough to lead me to believe that's where I probably need to go, and fairly soon.

Then I thought about the porting process. Soon I realized that I really needed to clean up a few aspects of my engine that have just been "patched together well enough to function, but not yet implemented as I planned". One aspect of this was the "image object" in my engine, which includes texturemaps, surfacemaps, normalmaps, conestepmaps, specularmaps, heightmaps, volumemaps and everything else that looks much like a 1D, 2D, 3D (or 4D including time) construct that does not map more efficiently into some kind of buffer object [or other OpenGL construct].

One major reason I put off finishing the "image object" was because... right from the very start... the engine has been rendering thousands of objects in each glDrawElements() draw call. I always knew I had to massively reduce the number of batches and draw calls, so that's the way the engine was designed. I also knew that "this is such obvious nonsense" (the problem with switching textures all the time and thereby breaking batches) that "no way will the OpenGL or GPU guys not come up with more and better ways to deal with this... so I'll try to wait them out".

And in fact, my inference was correct, and with help from you and others here, I now have a massively general "image object" functionality wherein every 3D shape object can depend-upon, require and access as many images/textures/others as it wants without breaking batches within 3D shape objects or between 3D shape objects.

My fallback was to implement humongous texture atlases and deal with the hassle myself (replacing nominal texture-coordinates for each image with the modified texture-coordinates based upon where each texture was automatically put in a humongous texture in a many-level texture array). This would be messy as hell, especially when it comes to wrapping modes, but did represent a worst case fallback plan. Of course, even with 32K mega-textures, and all texture images being somewhere within one of the 32K mega-textures, there was still the nasty question of how to access many of these... before the array textures appeared.
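The coordinate replacement described for that fallback can be sketched in a few lines. Everything here is hypothetical (the `AtlasRegion` layout, the `remapToAtlas` name, the choice to emulate repeat wrapping by hand); it only illustrates why wrapping modes become the application's problem once many images share one big texture.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical placement of one image inside a larger atlas layer, in texels.
struct AtlasRegion {
    float x, y;           // corner of the image inside the atlas
    float width, height;  // size of the image
    int layer;            // layer of the array texture that holds this atlas
};

// Coordinate to sample the atlas with, plus the layer to sample from.
struct AtlasCoord {
    float u, v;
    int layer;
};

// Emulates repeat wrapping by hand: the hardware wrap mode would wrap around
// the whole atlas, not around the sub-image, so it cannot be used directly.
inline float wrapRepeat(float t) {
    return t - std::floor(t);
}

// Maps a nominal [0,1] coordinate of the original image into atlas space.
AtlasCoord remapToAtlas(float u, float v, const AtlasRegion& r,
                        float atlasWidth, float atlasHeight) {
    assert(atlasWidth > 0.0f && atlasHeight > 0.0f);
    const float wu = wrapRepeat(u);
    const float wv = wrapRepeat(v);
    return {(r.x + wu * r.width) / atlasWidth,
            (r.y + wv * r.height) / atlasHeight,
            r.layer};
}
```

This sketch ignores filtering: with bilinear or mipmapped sampling, texels from neighbouring images bleed across region edges unless each region is padded, which is part of what makes the atlas route messy.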

But I just got off the subject. To return to that path, the point is, I decided the process of switching over to Vulkan would be vastly easier if I cleaned up certain aspects of the engine AND tweak a few portions a bit to be more like "the Vulkan way" but still with OpenGL components and mechanisms. Which is how I ran into the AZDO video and a couple other PDFs that introduced some of the newer mechanisms like Storage instead of Data, persistence, coherence, rolling buffers (forget the proper term), multidraws (especially when combined with indirect draws... I think) and so forth.

I really do believe replacing OpenGL with Vulkan will be much easier once I replace as many well-chosen aspects of OpenGL with the very most modern OpenGL (so far I don't think I need v4.60, but I definitely need v4.50). And so, that's my answer to the question. Does that mean I'm flying under false pretenses here in the OpenGL forum? I don't think so. I hope you don't think so.

But to buttress that a bit... the better I understand AZDO principles, the more it seems possible that purposely taking the very most optimal approach with the latest and greatest OpenGL has to offer, might not be much slower than a properly written Vulkan application. Unfortunately, every presentation I've seen smells of being biased to appeal to whoever the listener is, so quite possibly I'll have to switch over to Vulkan to find out whether Vulkan is significantly, substantially or massively faster than the best OpenGL has to offer... or NOT. If they're not much different in speed, I will probably (eventually but definitely not yet) be more comfortable with Vulkan, because that's just how I always seem to roll (most fundamental and lowest-level with least hidden is best and easiest for me). But we shall see.

Oh, and BTW, I had not heard of Vulkan when I started this engine! So forgive me! Hahaha.

I almost hate to ask, but how does ShaderSampledImageArrayDynamicIndexing get around the problem I mentioned with OpenGL array textures, especially since the same "dynamically uniform" requirement still exists (in the spec if not reality). Of course, if the Vulkan spec doesn't support "per triangle" but it still works in practice, that's no more a mark against Vulkan than OpenGL. The problem with array textures in Vulkan seems to be the same as OpenGL, that all the images within one array texture need to be the same size and configuration. Which leads to the same problem (and potentially revolting solution) that I mentioned. Or does Vulkan have some other trick or specification to skirt around this problem?

Someday I'll get someone to test my code on a new AMD GPU. If it works, that's good enough for me. And if someday the "per-triangle" feature stops working, that won't stop the code from still offering the feature on a "per-object" basis. Which means, texturing won't stop working, only the "stupid GPU trick" of per-triangle image-switches will stop working. I won't like it, but I'll take the risk. That's a luxury I get for writing a non-mainstream, non-commercial application. And when people write games with my engine? They shall be informed of both the various risks and opportunities in the documentation, and can make their own decisions. To risk or not to risk... that is the question. Actually, just one of many.

Just curious. Other than better raw speed, how is the architecture or API of Vulkan better for procedurally generated content?

But yes, the "best practices" part of Vulkan has been fairly attractive to me so far. Not that I fully understand all of it yet, but that's normal for a slow learner like me. The hilarious part is, give me the raw GPU specs, and I'll write a better 3D API than OpenGL or Vulkan in my sleep in two months. That's my sweet spot. Fortunately, most other subsystems of my over-arching project match my strengths well. Assuming 3D doesn't kill me first! Hahaha.

8. On per-triangle texturing

Originally Posted by bootstrap
Of course, that only pushes the question back as far as drawing whole objects, not as far as I pushed it... back to drawing individual triangles.
But that question has been answered: there is no guarantee that different triangles in the same rendering command are considered different invocation groups. It may not be clear how big "rendering command" is, but it's definitely bigger than "single primitive".

If you want to rely on undefined behavior, that's up to you. But in architectures like GPUs, you cannot assume that because something appears to work right now that it will continue to do so.

And I don't mean "in newer hardware"; I mean literally tomorrow. I mean that adding another object to the render might cause it to break. Moving the camera might cause it to break. And so forth.

Unless a specification guarantees it, UB appearing to work should not be assumed to actually work.
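To make the distinction concrete, here is a small CPU-side model of the definition. It is a sketch under a stated assumption: each draw of a multi-draw is treated as its own invocation group (Vulkan says so explicitly; for OpenGL that is the unclear part). `Invocation`, `makeInvocations` and `isDynamicallyUniform` are hypothetical names invented for the illustration.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical record of one fragment shader invocation.
struct Invocation {
    int drawId;       // which draw spawned it (the role of gl_DrawID)
    int primitiveId;  // which triangle within that draw
};

// Builds the invocations for `draws` draws, each with the same number of
// triangles, each triangle covering the same number of fragments.
std::vector<Invocation> makeInvocations(int draws, int trianglesPerDraw,
                                        int fragmentsPerTriangle) {
    assert(draws >= 0 && trianglesPerDraw >= 0 && fragmentsPerTriangle >= 0);
    std::vector<Invocation> out;
    for (int d = 0; d < draws; ++d)
        for (int t = 0; t < trianglesPerDraw; ++t)
            for (int f = 0; f < fragmentsPerTriangle; ++f)
                out.push_back({d, t});
    return out;
}

// True if `expr` yields one single value per invocation group.
// ASSUMPTION: an invocation group is one draw, never one triangle.
bool isDynamicallyUniform(const std::vector<Invocation>& invocations,
                          const std::function<int(const Invocation&)>& expr) {
    std::map<int, int> firstValuePerDraw;
    for (const Invocation& inv : invocations) {
        const int value = expr(inv);
        const std::pair<std::map<int, int>::iterator, bool> result =
            firstValuePerDraw.emplace(inv.drawId, value);
        if (!result.second && result.first->second != value) {
            return false;  // two invocations of one draw disagree
        }
    }
    return true;
}
```

In this model a texture index taken from the draw ID passes the check, while one derived from the triangle ID fails it as soon as a draw contains two triangles that pick different textures. That second case is the per-triangle trick.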

Originally Posted by bootstrap
Oh, just a tiny factoid in case you like to know these things. I tried this new test code on my ancient GTX 680 and... it works the same as the GTX 1080 TI... and draws different textures/normalmaps on a triangle-by-triangle basis. Interesting. Already worked on ancient history.
As I said, the presence of the NV_gpu_shader5 extension explicitly nullifies the dynamically uniform requirement for bindless textures. And the GTX 680 supports NV_gpu_shader5. Indeed, pretty much every 4.x NVIDIA GPU supports this extension.

So you didn't need to test it; you're not relying on undefined behavior here.

Non-dynamically uniform texture accessing is not a matter of "ancient history". It's not that more modern hardware supports it and less modern hardware does not. It's a matter of specific hardware architecture that allows it to work.

And there's only one vendor who offers such hardware. None of AMD's extensions provide this. And if it's not in an actual specification, you should not rely on it.

On understanding hardware

Originally Posted by bootstrap
To answer your next question, my answer is EVERYONE. Khronos should. nvidia should. AMD should. Everyone should (even though I don't care if Intel does). That way we can understand more fully what's going on, by reading multiple perspectives... some abstract, some concrete.
Asking Khronos to explain how the hardware works is like asking a hair stylist to explain quantum mechanics. They might be able to do it, but it's clearly not the reason why most people go to them. The Khronos Group doesn't make hardware, so they have no reason/right to explain it.

NVIDIA will never explain the details of their hardware. It took them a year or so before they admitted that their GPU rasterization architecture included a pseudo-tile-based component.

AMD and Intel already publish detailed information on their hardware, for the purposes of driver developers.

Originally Posted by bootstrap
I'd love to know how the GPU folks dispatch (and re-dispatch) fragment shaders when drawing triangles.
Every GPU does it differently. Some of them do it radically differently. It even changes from generation to generation.

Originally Posted by bootstrap
I might be wrong, but as much as nvidia would like to kick AMD in the nuts and vice versa, I suspect they both are coming to realize that most people want their code to run on as many brands of GPU as possible... and without needing to design, test and keep fixing different sections of code to do the same thing. I'm sure I'm not the only developer who doesn't want to redesign my whole freaking application (and invent a new architecture to make that efficient) to deal with endless growing piles of incompatibilities and alternate code paths.

...

I didn't mean to imply you guys "don't understand this stuff". Obviously you understand vastly better than I do! However, I think you get this feeling based upon how I respond negatively to "not knowing what's really going on under the covers". You don't care! In fact, you seem to LOVE IT, because you get something you value more than understanding the guts and reasons and tradeoffs and justifications. You get something that works everywhere! So [maybe] you actually get positive feedback for the lack of complete explanation. I don't. I explained my experiences, so you should at least understand why a little bit.
You seem to be contradicting yourself here. The only way that we can get to a point where testing across platforms is unnecessary is if people stop coding outside of the abstraction. And the only way to do that is if people stop trying to learn things beyond what the abstraction says you can do. If the abstraction says that "invocation group" extends to a rendering command, then you code to that. It doesn't matter what the "reasons and tradeoffs and justifications" are. You do what the standard says.

It's trying to learn details beyond the standard that makes people write incompatible code (on purpose).

On array textures and AZDO

Originally Posted by bootstrap
However, I looked into array textures A LOT, but my conclusion was "no workie in real life practical situations". Compared to my bindless texture implementation, I don't see how their scheme can work... unless every texture is the same size, and thus easy to put into different layers of one array texture. Unless there is a huge pile of unmentioned wackiness going on in their benchmark, I don't see any way for an application to specify arbitrary array textures of varying size and configuration, bound to arbitrary texture units, to be specified on an object-by-object basis.
What constitutes "real life, practical situations" is in the eyes of the beholder. The restriction that array textures require, that each layer must be the same size, is not an onerous burden for many applications. For your application it may be, but for the primary audience of the AZDO presentation, it's just not a particularly painful issue.

Think about how you decide to limit changes to vertex format state. You pick a single vertex format, which all meshes must abide by. That means that all sources of mesh data have to agree on that format. That's easy for you, because you generate your vertex data in code. But that's not so easy for someone who loads their vertex data from user-provided meshes.

To many developers, picking a single texture size that all textures for a particular usage must abide by is no more onerous of a burden than yours was to use a single vertex format.

Similarly:

Originally Posted by bootstrap
And therefore, if the texture you want to access for this object within this glMultiDrawElementsIndirect() call accesses the set of textures that is wrong-size or wrong-configuration... you can't render the object at all in that glMultiDrawElementsIndirect() call.
... yes. So they don't do that. They ensure that the data is never "wrong-size" or "wrong-configuration". Generally speaking, most high-performance graphical applications have near-complete control over their input data.

Performance typically requires rigid control over the input data. Generality almost always comes at the expense of performance.

You can make a BSP or portal culling system over fixed geometry render faster than a set of arbitrary, unknown meshes. You can make rendering with fixed-size textures faster than rendering with arbitrarily sized textures. You can make rendering with a single format of vertex data faster than rendering with multiple formats. And so on.

Primary optimizations require knowledge of, and therefore control over, what the data is and how it is to be rendered. The more control you give up, the fewer your options for optimizing.

Originally Posted by bootstrap
Unless there is a huge pile of unmentioned wackiness going on in their benchmark, I don't see any way for an application to specify arbitrary array textures of varying size and configuration, bound to arbitrary texture units, to be specified on an object-by-object basis.
Then you have misunderstood the point of the AZDO presentation. It was not made to explain how to render with any texture "on an object-by-object basis". It was made to explain how to improve performance on your current rendering system. It's not there to explain how to render in snazzy new ways; it's purely a performance optimization.

A performance-based presentation is not going to explain how to draw things in ways that couldn't be done before.

On Vulkan

Originally Posted by bootstrap
I almost hate to ask, but how does ShaderSampledImageArrayDynamicIndexing get around the problem I mentioned with OpenGL array textures, especially since the same "dynamically uniform" requirement still exists (in the spec if not reality).
You're citing two distinct problems. The array texture issue is that each individual layer in the texture has to be the same size. The dynamically uniform issue is that you're not allowed to have the texture object which gets fetched from be determined by non-dynamically uniform means.

Arrays of textures are different from array textures. An array texture is a `sampler2DArray array_texture;`. An array of textures is a `sampler2D array_of_textures[array_count];`. See the difference?

An array texture is a single texture object; that's why each layer has the same size. An array of textures is an array of different texture objects. Each element of `array_of_textures` contains a different texture object. And therefore, each object can be of a different size.

With `ShaderSampledImageArrayDynamicIndexing`, you are allowed to use a dynamically uniform expression to index the `array_of_textures` array. But without that Vulkan feature, the index must be a constant expression.

My point was that Vulkan does not directly support bindless textures. But by using an array of textures with dynamic indexing, you get the same effect as bindless textures: the ability to pass an identifier into the system, which gets converted into a specific texture object. Therefore, the equivalent to OpenGL bindless texture residency in Vulkan is simply changing which images are in the descriptor binding for the array in the descriptor set.

It comes with the same limitations as the OpenGL feature: it has to use a dynamically uniform value. But it allows you to do in Vulkan what you would have done in OpenGL.

Originally Posted by bootstrap
Other than better raw speed, how is the architecture or API of Vulkan better for procedurally generated content?
If your "procedurally generated content" is generated on the CPU, then it's much better. You can allocate memory in whatever way you feel you need. You can ask for how much memory is available directly, and allocate chunks of it that work best for your content generation system. Most importantly of all, you have direct knowledge and control of the memory architecture.

For example, in an embedded GPU, there is generally only one pool of memory. In such cases, your procedural generation algorithm will generate data directly into the location it will be read from. But for discrete GPUs, you'll probably want to transfer it into device-local memory. That requires an explicit DMA operation. And you can create an appropriate dependency between that DMA and the rendering commands that use it.

Or maybe you don't do that. Maybe the GPU can read vertex data from host-accessible memory directly, so you don't need that DMA.

That's a question you can actually ask in Vulkan.

OpenGL abstracts all of these details away, in the hopes that the implementation will do the right thing. There, you just persistently map a buffer; maybe it'll be as efficient as doing the DMA. Maybe not. But how do you tell if the GPU even has multiple pools of memory? In OpenGL, you can't.

Vulkan allows you to adapt to the particulars of the hardware. Which means you can take advantage of those particulars.
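As a rough illustration of "asking the question", the sketch below scans a list of memory types for one that is both device-local and host-visible. It is deliberately self-contained and does not include the Vulkan headers: `MemoryType`, `kDeviceLocal`, `kHostVisible`, `findMemoryType` and `chooseUploadPath` are hypothetical names, standing in for the data you would get from `vkGetPhysicalDeviceMemoryProperties`. The decision rule is a simplification; a real application would probably also look at heap sizes before committing to a path.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in flags. The values match VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT (0x1)
// and VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT (0x2); the names are hypothetical.
constexpr std::uint32_t kDeviceLocal = 0x1;
constexpr std::uint32_t kHostVisible = 0x2;

// Stand-in for one entry of VkPhysicalDeviceMemoryProperties::memoryTypes.
struct MemoryType {
    std::uint32_t propertyFlags;
};

// Returns the first memory type index that is permitted by `allowedTypeBits`
// (bit i set means type i is allowed) and has every flag in `required`,
// or -1 if there is none.
int findMemoryType(const std::vector<MemoryType>& types,
                   std::uint32_t allowedTypeBits,
                   std::uint32_t required) {
    for (std::size_t i = 0; i < types.size() && i < 32; ++i) {
        const bool allowed = (allowedTypeBits & (1u << i)) != 0;
        const bool hasFlags = (types[i].propertyFlags & required) == required;
        if (allowed && hasFlags) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

enum class UploadPath { WriteInPlace, StagingCopy };

// Simplified policy: if some memory type is both device-local and
// host-visible, generate data in place; otherwise plan a staging copy (DMA).
UploadPath chooseUploadPath(const std::vector<MemoryType>& types) {
    const int unified = findMemoryType(types, ~0u, kDeviceLocal | kHostVisible);
    return unified >= 0 ? UploadPath::WriteInPlace : UploadPath::StagingCopy;
}
```

With a single memory type that carries both flags (the embedded case) this picks `WriteInPlace`; with separate device-local and host-visible types (a typical discrete layout) it picks `StagingCopy`.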

If your "procedurally generated content" is generated on the GPU, that's even better in Vulkan. You get all of the above advantages, plus improved synchronization support. Barriers in Vulkan are much more flexible than `glMemoryBarrier` in OpenGL.

And all of that ignores the fact that you can thread the construction of command buffers in Vulkan. Which you cannot do in OpenGL. Proper use of threading can dramatically improve CPU performance, thus allowing you to spend more time on CPU data generation.

Originally Posted by bootstrap
The hilarious part is, give me the raw GPU specs, and I'll write a better 3D API than OpenGL or Vulkan in my sleep in two months.
No, you won't. You will write one that is better for you and your needs. But you will not write one that is better or more broadly implementable for everyone.

9. I'm not sure why you care so much, or maybe don't believe I'm aware of what you said. So let me just say it clearly once again. Per triangle might break tomorrow, or when the moon rises, or when the camera moves, or when the next driver is released, or when someone at nvidia reads this thread and real quick like tells the software guys to make sure it stops working, or when the day of the week divided by the phase of Venus rounded to the nearest percent is a prime number.

To be clear in advance, nobody will even think to blame you... since you warned me urgently, repeatedly, and in no uncertain terms.

Then there's the other side of the coin. If you specify 0 to 4 images (texturemap and/or normalmap and/or conestepmap and/or othermap) in the shape object create function... those 0 to 4 values will be inserted into those fields in every vertex of the object. Which means, the application only has one choice for each of the 4 types of image objects --- no image, or one image for all vertices. And thus, nothing can go wrong, unless we find glMultiDrawElements*() requires the images not change across the individual draws in the multiple draws. In that case, my answer is "screw you, jerks" to the driver writers, followed by a bug report from me to the OpenGL folks to fix their documentation, which clearly (to me, anyway) strongly implies images can change between the individual draws. I will, however, then turn tail and admit I'm the jerk if they point out to me why it must work that way (every draw must keep the same single set of bindless textures). Of course, based upon my (yes, much too limited) tests, there's no need for them to be so restrictive.

Okay, sorry, got a bit over the top there. But my point was supposed to be this. The create functions for all shape objects will contain arguments to specify two image objects, which presumably but not necessarily will be "texturemap" and "normalmap" (where a one-channel "specularmap" can be put into the .w component of the "normalmap" if desired, though the specular power therefore must apply equally to all colors and all viewing angles, which isn't a perfect solution but often "good enough"). BUT, each image object contains an "imagetype" element that specifies which of the four types of image this image is, so those two images could be any two of the four.

Which means, the nominal functions and "use case" (hate that term, but value the concept) conform to what you advocate... I think... because the saybits that enable or disable each image are inserted into every vertex in each object, and the image identifiers (offsets into those bindless texture arrays in the shader) are also inserted into every vertex in each object.

However, the engine does provide access to the raw vertices (the object local-coordinate vertices). So, they can change anything they want. They could change the position.xyz of all or any combination of vertices, the zenith.xyz, north.xyz, east.xyz surface vectors of all or any combination of vertices, the color.rgba of all or any combination of vertices, the tcoord.xy of all or any combination of vertices, the index value that locates transformation matrices from SSBO (the index value being the integer object identifier), or... drum-roll ... the saybits or four image indices that specify the four image objects to access (courtesy "bindless textures").

I need to provide this general capability to avoid the need to provide 100 or 200 specialized functions to fiddle in 100 or 200 different ways. OTOH, the engine will offer 20 or 30 functions to support the most common operations, but seriously obscure or unusual or one-of-a-kind operations will have to be done by the application changing the vertices however they wish (with the responsibility and consequences falling squarely on the application by necessity).

In other words, "the engine cannot stop them".

But that's not the fully honest answer. Will I decide to offer a special-purpose function to make it easy and convenient for applications to selectively modify some but not all of the saybits or image identifiers? At the moment, I lean toward supplying such a function. Whether that function remains when release day arrives will not be known until much nearer release day. Quite possibly feedback from beta testers/adopters will help decide that question.

But... don't worry, whatever happens, it won't be your fault! :-) Really!

PS: And if you're correct that it doesn't work on AMD GPUs... no support or comments in the documentation will encourage anyone to attempt "per-triangle images"... or whatever we're [not] gonna call this capability... I mean travesty... when it doesn't work on AMD!!! :-)

----------

BTW, I do understand your attitude! Especially when you invest so much of your personal time, effort, sweat, blood and tears to help us morons out here. Every time some moron or jerk [like me] tries to make something cool work, you and your fellow saints (not sarcasm) find a big juicy pile of doodoo on the floor that never had to happen. And indeed, in most cases that's 100% true, because most people want their software to "run [virtually] everywhere" (perhaps absent 6+ year old GPUs... grudgingly). So I get it.

----------

As for your interpretation of the "array texture" practicality question, I think your answer to my complaint was... pretty damn good. And actually, this reminds me about some ways I'm willing to be bold where some others might not be willing. I'm the kind of "jerk" who would at least be willing to say "okay, potential [non-paying in my case] customers, thou shalt make all your images (texturemaps, normalmaps, conestepmaps, specularmaps, everyotherkindofmaps) all the same size and configuration, because we decided you're better off with 10x better speed than infinite flexibility on image size and configuration". Yup, that's the kind of decision I'm willing to make for other people, which some others are not, and which royally torques off some people. But, from what you say, it may not be as infuriating as I assumed.

I was thinking along the lines of "but... but... but... this is an engine not a single application, so imposing a requirement like that is not appropriate". But you know what? After reading your message and thinking a bit longer about that, it would bother me less (and probably less enough) to impose such a requirement. Nonetheless, I'm still bleeding profusely from trauma received in the "bindless texture wars", so for now anyway, I'll stick with this. But I admit you're right. And, in fact, if bindless textures didn't exist (or didn't/don't work reliably), those array textures sound plausible again. Well, more like "the only remotely viable approach" if bindless textures go down the tubes for some reason.

Maybe this is my inclination because I'm not the artist type (by even the most remote stretch of imagination), and as a consequence am not fluent with many support tools. Pretty much all I know is gimp. Of course gimp has no problem making all images the same size, so maybe I should just shut up while I'm far behind on this one.

PS: I also was glad to read your comment that seems to say application developers are quite used to going to [moderate] extremes to comply with requirements imposed to make performance good. Hence your comment about "requiring rigid control over input data".

I'm sure every author of applications like engines (which are purposely written to support hundreds or thousands of other diverse applications) loves to hear that they can make decisions, then say "my way or the highway", and the poor sucker application writers meekly say "okay boss", suck it up, and comply. And, in fact, for the reasons you explain, there is much sense in that... if the engine developers made wise decisions. So wish me wise decisions (and keep helping me make them). Okay, keep trying! :-) Given that I'm the first and primary "customer" for this engine (as a subsystem in a larger project), and I need [as close as feasible to] maximum performance, I'm probably even more receptive to that attitude than some others.

----------

My misunderstanding about the AZDO situation was probably caused much more by my assumption that engine authors would be extremely unwilling to force inconveniences on customers, like making all their textures and normalmaps and othermaps the same size. Once I accept I was wrong to think that way, everything seems a lot more sensible. And yes, AZDO is most certainly not ONLY about textures, but switching textures is one of the major problems AZDO tries to deal with.

Just to be clear, I was never trying to place importance on "per triangle rendering", but was trying to place importance on "per object rendering". The reason should be quite clear too! The reason is, because switching state like textures is a major reason applications slow down, and this almost always happens on a per-object basis based upon forum comments I read (and articles I read, and slide-decks I read, and GPU videos I watch). In other words, to say that "displaying different textures on an object by object basis" is some kind of "never been done before" seems wildly absurd to me. From what I can tell, this is as common as common can be (and has exactly nothing to do with switching textures on a triangle-by-triangle basis).

However, I suppose I can add a bit of color here...

In my engine the usual (and almost only) way to create non-trivial shapes/objects is to call functions that create fairly simple shapes/objects, then assemble them into more and more complex shapes/objects in a multi-step, multi-level hierarchical manner. When each shape object is created, it will normally be quite happy to have zero or one texturemap and/or normalmap... so very conventional. However, complex shape/objects are created by attaching shape/objects to existing shape/objects. When this is done, the attached object can rotate and/or translate along the natural axes of the attached shape/object... or the "parent" shape/object it is attached to.

When I first got this working it was great fun to build huge hierarchies of shapes with every branch articulating. They can be created in any order, but it quickly became apparent that attaching the objects furthest from the root first, to the shapes that will be the next level in, was more intuitive. The reason is, you first create the simple 2, 3, 4 component shapes/objects that make sense as "components" (say a gun turret), then you attach that to a huge rotating circular table shape object (that will become a moving deck component of a mothership), then you attach the circular moving deck to the mothership... and so forth.

Anyway, it was great fun to have 3, 4, 5... even 8 or 10 level deep hierarchies with all the levels articulating (rotating and sliding back-and-forth) against their parent objects. So many of these contraptions are so cool, incredible and impressive.

BUT...

That misses the important point. A great many complex objects (in fact, almost ALL complex objects in most games) are rigid, fixed, never-moving objects. They need to be assembled too, especially in this "procedurally generated content" engine. All that works the same whether any object will articulate or not...

EXCEPT...

Once a large, complex shape/object like that is assembled, all the components that do not articulate can be fused into a single object.

In the past (before glMultiDrawElements*() at least), that was a huge win! Now only one transformation matrix need consume space or be computed each frame... if any. Now only one object need be drawn rather than dozens... EXCEPT for the nasty issue of the dozens of different image objects that act as textures and normalmaps for all those component objects!!!

If the batch must be broken to switch textures for each object (or piece of a shape/object if the shape/objects were fused into one shape/object)... then hands get thrown up and engine and application designers say "what's the use... the performance sucks... and inherently must suck".

Anyway, I'm just saying all these considerations are part of any "how to design" decision. If the engine and application get virtually zero benefit from fusing all those shape objects into a single shape object, then... what's the point? Just leave them separate and set a single bit that says "don't articulate".
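For what it's worth, the fusing step itself can be sketched in a few lines of C++. This is only a guess at the shape of it, not the engine's actual code: `Vertex`, `Mesh`, `fuse` and the `imageIndex` field are hypothetical, and the sketch assumes the component vertices have already been transformed into one common coordinate space. Each vertex keeps its own image index, so the fused object can still be drawn as one batch, subject to the dynamically uniform caveat discussed earlier in the thread.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical vertex layout: position plus the slot of the image this
// vertex samples (an index into an array of textures in the shader).
struct Vertex {
    float position[3];
    std::uint32_t imageIndex;
};

struct Mesh {
    std::vector<Vertex> vertices;
    std::vector<std::uint32_t> indices;
};

// Fuses non-articulating parts into one mesh. Assumes every part's vertices
// are already expressed in the same coordinate space.
Mesh fuse(const std::vector<Mesh>& parts) {
    Mesh fused;
    for (const Mesh& part : parts) {
        // Number of vertices already in the fused buffer; each index of this
        // part must be shifted by that amount to keep pointing at its vertex.
        const auto base = static_cast<std::uint32_t>(fused.vertices.size());
        fused.vertices.insert(fused.vertices.end(),
                              part.vertices.begin(), part.vertices.end());
        for (std::uint32_t index : part.indices) {
            fused.indices.push_back(index + base);
        }
    }
    return fused;
}
```

Fusing two triangles that use images 0 and 3 gives one mesh with six vertices and six indices; the second triangle's indices become 3, 4, 5 and its vertices still carry `imageIndex` 3, so no texture switch has to break the draw.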

----------

And finally, I'll answer your last sentence in a somewhat funny, somewhat tongue in cheek way:

Alfonse: No, you won't. You will write one that is better for you and your needs. But you will not write one that is better or more broadly implementable for everyone.

Me: Since you just convinced me the best and accepted way to write applications is to "make all the important decisions for everyone else and make them comply", I feel provisionally justified in saying "yes I would".

Of course, we'll never know. But one factoid is certain. Unlike many people (apparently), the lower-level I work at, the faster, better, easier I work. And now that I think about it, I don't recall anyone ever complaining about the interface to the products I developed. Interesting that (and interesting that I never thought about that before, probably cuz nobody ever complained).
