How to implement bindless textures efficiently?

In the process of “fixing up” my 3D engine, after a lot of reading (and pestering helpful folks here a bit), I’ve chosen “bindless textures” as the basis of the new implementation of “images” in the engine. In this case, “image” refers to:

  • surfacemaps AKA normalmaps AKA bumpmaps
  • colormaps AKA texturemaps
  • displacementmaps
  • conestepmaps
  • specularmaps
  • heightmaps
  • etc…

Also I imagine other built-in features of the engine will be implemented with bindless textures, but not through the generalized mechanism I’m discussing now. For example, to implement rendering the sky background I divided my billion-plus star catalog (real catalog of real stars plus lots of information) into 1024x1024x6 “regions” that exactly correspond to the areas covered by the 1024x1024x6 pixels in a 1024x1024 cube-map (with 6 faces). I also imagine I’ll need to implement a typical general-purpose “skybox” or whatever they’re called these days to let people put a “far-away background image” into a 1024x1024x6 to 4096x4096x6 cubemap. But those will be handled separately in a custom way, not through the generalized “image” mechanism.

The first question I have is this. Does any way exist in OpenGL v4.6 core or an ARB extension (thus likely to become core… someday) that lets me pack those u64 bindless texture handles into a UBO without half the space being wasted? It seems clear from what I read so far that an array of u64 values is half empty, with only one u64 texture handle per 16-byte “location”.

My natural instinct was to create a structure like the following…

    struct u64vec2 {
        GLuint64 x;
        GLuint64 y;
    };

… then create an array of 256 or 1024 (or more) elements of that type in a UBO. That way the desired bindless texture handle (or apparently a “sampler”… I guess) would get accessed a bit strangely, but that’s easy enough.

But, I don’t see where OpenGL (or GLSL) specifies 2-element u64 vectors. Of course, that’s probably because I’m missing some ARB extension or other, right? At least I hope so (if not already core).

I guess GLSL doesn’t necessarily need to recognize that type for my purposes, since each element will look like a sampler to GLSL… I think.

I don’t much care how the u64 bindless texture handles get packed tight… I just want to find some way. Does some way exist?


The following is a little “color” to explain my plan… in case there is some fatal flaw in it.

My engine primarily creates “shape” objects AKA “shapes” procedurally. Which means, all cameras, all lights, all [shape] objects that are “drawn” or “rendered” are created by creating simple [shape] objects, then assembling them into larger, more complex, rigid and/or hierarchical [articulating] shapes. The simplest shapes are fairly standard, namely camera, light, points, lines, face3, face4, faces, disk, grid, grix, mesh, peak, cone, tube, pipe, torus, ball, globe, sphere, ring, bridge3, bridge4, bridges, and so forth.

The create functions for all shape objects include arguments to allow each shape to be customized. For example, the create functions for almost all shapes contain the following arguments:

  • sides
  • side_first
  • side_count
  • levels
  • level_first
  • level_count

… and where it makes sense, also…

  • twist
  • taper

… and so forth.

… as well as obvious arguments like options (up to 64 option bits), color (default color for all vertices), etc.

The objid (object identifier) of up to four image objects [plus other kinds of objects] can be specified in the shape object create functions.

Each of those four image objects gets turned into a u08 integer, and those four u08 values are delivered to the shader as a single u32 integer vertex-attribute, to be broken into up to four fields by the shader (as specified by 4~6 other bits in another u32 vertex-attribute).
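To make that concrete, the shader-side unpacking is nothing more than shifts and masks. A minimal fragment-shader sketch (the attribute name and bit layout here are only illustrative, not the engine’s actual names):

    #version 460 core

    // minimal sketch: unpack four 8-bit image indices from one 32-bit integer
    // vertex-attribute ("image_ids" is an illustrative name)
    flat in uint image_ids;

    void main()
    {
        uint imap0_index = (image_ids      ) & 0xFFu;   // index into image UBO #0
        uint imap1_index = (image_ids >>  8) & 0xFFu;   // index into image UBO #1
        uint imap2_index = (image_ids >> 16) & 0xFFu;   // index into image UBO #2
        uint imap3_index = (image_ids >> 24) & 0xFFu;   // index into image UBO #3
        // ... each index then selects one bindless texture handle from its UBO ...
    }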

Each of these [nominally] 8-bit image identifiers is not the same as the objid of the image in the engine. Instead, the 8-bit value is an index into one of four UBO blocks which contain samplers backed up by u64 bindless texture handles (as I understand this mechanism so far). By default with the current “uber-shaders” (and nominally), each of the four UBO blocks serves a different purpose, typically:

UBO #0 == colormaps AKA texturemaps
UBO #1 == surfacemaps AKA normalmaps AKA bumpmaps
UBO #2 == conestepmaps + heightmaps
UBO #3 == specularmaps or othermaps

At least that’s what happens right now (though monochrome specularmaps can be taken from the A channel of conventional colormaps if the A channel is not needed to specify transparency). Frankly, I’m fighting the battle between offering too much flexibility to accommodate existing textures versus simplicity, but that’s a side issue.

The important point is this. Each set of shaders (each “program object”) can specify what it does with the images in each of the four UBO blocks. That’s just a convention of each “program object” that can be defined by anyone who writes shaders for the engine (though they are expected to keep the same associations as the standard default “uber-shader program” for any image that serves one of the standard functions). So, if someone writes shaders that support a “displacement map” (but not “conestepmap”, obviously), they would put their “displacement map” in UBO #2.

So, to return to the basics.

When a shape object is created, the objid of up to 4 images is specified.

When each image object is created, its type is specified as an argument. A number of constants like the following are defined to specify image type:

IG_IMAGE_TYPE_IMAP0, IG_IMAGE_TYPE_COLORMAP, IG_IMAGE_TYPE_TEXTUREMAP — 3 intuitive names for one type == UBO #0
IG_IMAGE_TYPE_IMAP1, IG_IMAGE_TYPE_SURFACEMAP, IG_IMAGE_TYPE_NORMALMAP, IG_IMAGE_TYPE_BUMPMAP — 4 intuitive names for one type == UBO #1
IG_IMAGE_TYPE_IMAP2, IG_IMAGE_TYPE_CONESTEPMAP, IG_IMAGE_TYPE_HEIGHTMAP — 3 intuitive names for one type == UBO #2
IG_IMAGE_TYPE_IMAP3, IG_IMAGE_TYPE_SPECULARMAP, IG_IMAGE_TYPE_OTHERMAP — 3 intuitive names for one type == UBO #3

As shown above, the type each image object is given when created determines which UBO that image object is stored into.

When each shape object is created, zero or one image object can be specified for each purpose. Another way to say this is, each shape object can specify the image objid of zero or one image object in each of those four image UBO blocks. The engine converts each image objid to the appropriate 8-bit index into the sampler arrays in each UBO in order to access the specified image objects for the desired purposes.

The reason for this scheme (as opposed to a single UBO block) is to increase the number of simultaneously accessible image objects to 1024 rather than 256. Since the engine does all the complex work under the covers, the burden on the application is minimal (just say what is the purpose of each image object when it is created).

The other thing to remember is the following. What happens to the zero to four objid of image objects passed in the shape object create functions? Each objid is converted into the appropriate 8-bit index and put into the appropriate field of that 32-bit vertex-attribute (based on the purpose == type of each image object), so that 32-bit vertex-attribute contains the same value in every vertex in the shape object. Which means, the exact same texturemap, normalmap, conemap, specularmap, heightmap, othermap will be processed for every vertex… leading to uniform and consistent appearance throughout the shape object.

That happens automatically, by default.

However, after shape objects are created, the application running on the engine can call shape object modification functions to change any combination of vertex attributes in any combination of shape object vertices.

And so, even the simple shape objects I mentioned can be radically customized.

A couple of background facts. First, the “provoking vertex” is set to the first vertex of each triangle in this engine. Which means, the values of those two 32-bit integer vertex-attributes from 2/3 of the vertices never make it to the fragment shader… only the values from the first vertex of each triangle do. Which is good. Of course these two 32-bit integer attributes have flat qualifiers in the shaders, so they are constant across each triangle.

Second, as implied above, almost all shape objects are created out of an arbitrary number of levels and sides. What does this mean?

For example, consider a simple “faces” object. As you might expect, a faces object is just a flat, round disk (but not identical to the disk object in an important way, as you’ll learn shortly).

Let’s say an application program calls the ig_shape_create_faces() function and specifies the following values for the following arguments:
  • levels = 3
  • level_first = 1
  • level_count = 2
  • sides = 6
  • side_first = 1
  • side_count = 4

By default, as with all shape objects, the radius of the surface of the faces object is 1 meter (but can be scaled, sheared or otherwise modified at any time). Since the create function specified this faces object is composed of levels == 3, with level_first == 1 (not the default == 0) the surface of this faces object has a 1/3 meter radius hole in its center (like a hole in the center of a round disk).

Since the create function specified this faces object has sides == 6, the nominally round outer edge actually has only 6 sides… a hexagon. But side_first == 1 and side_count == 4, so only 4 of those 6 sides physically exist. Errrr… I mean “graphically exist”, meaning only those graphical surfaces exist (contain vertices).

And so, this faces object is 2 meters diameter, has a 2/3 meter hole in the center, and only 4 of the 6 sides exist.

One important point of the last several paragraphs is the following. The first (provoking) vertex of every triangle starts on the lowest level that triangle touches (every triangle spans between two adjacent levels) and also starts on the lowest side that triangle touches. This is important because this lets the engine provide a way to specify different image objects for vertices on a given range of levels and/or sides. So, for example, after an application creates any kind of object, the application can easily tell the engine to “set 1, 2, 3 or 4 of the image objects on any range of levels to be new image objects” and/or “set 1, 2, 3, or 4 of the image objects on any range of sides to be new image objects”. Okay, I’m no artist, so that probably sounds boring. But to give just one example of what that means, those two function calls could change the texturemap and normalmap of some middle level or levels from the default “cobblestones” to “grass” or “bricks”, and some arbitrary side or sides from the default “cobblestones” to “fancy artistic tiles”. Okay, I really am no artist. But you get the point.

Also, everything above leads to the following point. I mentioned many of the simple shape objects, each of which is radically configurable during the create process, and later on by simple function calls (by 3D scaling, 3D shearing, 3D twisting, 1~4D randomization of attributes, and an open-ended number of additional procedural processes).

But the engine is designed to support vastly complex procedurally generated shapes too. For example, the create functions for simple shapes like “cup” attach, bond or fuse 2 to a few of the simplest shapes together, while create functions for super-complex shapes like “spacecraft” or “planet” or “galaxy” attach, bond or fuse tens, hundreds, thousands or more simple [and complex but simpler] shapes together.

The point is, when objects are attached, bonded or fused together, all the configuration performed on the individual elements is retained. And yet, individual aspects can be changed in simple, coherent, intuitive ways. For example, the image object that contains those “fancy artistic tiles” can easily be replaced by any other texture to display something else.

Which leads to the following. Each create function (as well as modification functions) for super-complex objects like “house”, “office”, “spacecraft”, “planet” or pretty much anything else can generate an astronomical number and variety of shape objects of that general kind (house, office, spacecraft, etc). The relative dimensions of pretty much any subsection can easily be configured or changed, the appearance of any [natural] portion of any surface can be changed, just about every aspect of that “kind” of shape can be changed. Of course the create function (and supporting modify functions for that shape) can offer as many or few opportunities for configuration during shape creation… and/or later.


Anyway, that’s what I’m trying to achieve, and this scheme I described up near the beginning of this message is my attempt to implement some of these features and capabilities.

I wanted to implement a scheme with array textures, but they just don’t seem flexible enough. Unless all textures [of a given purpose] are the same size and work effectively with all the same sampler configuration, I don’t see any plausible way to make those four 8-bit imageid fields specify a rich enough variety of images for all the various purposes.

Maybe some guru out there sees a way to make that work. I don’t.

In contrast, the bindless textures do seem flexible enough, since (as I understand this), every texture can be a different size and configured differently.

But what do I know? Only enough to get the basic texturemap and normalmap stuff working (displaying on object surfaces). But they couldn’t even be selected before now… only the one texturemap and normalmap would display. :disgust:

Anyway, if anyone is a serious image/texture guru… especially bindless textures or via some other fancy texture-works tricks… I’m all ears. I mean eyes. Post your crazy ideas below. Thanks.

PS: And sorry my message is so long. I thought it might provide sufficient context to spur good ideas, or prevent waste of time to post ideas that just aren’t flexible enough.

PS: I’ve gone to a huge amount of effort to be able to render many shape objects in each call of glDrawElements() or similar. That’s another crucial reason for making so many image objects accessible simultaneously.

PS: As per my usual practice, I “design forward”. Which means, I’m open to any approach that is likely to be supported by most high-end cards from at least AMD and nvidia two years from now. My development rigs are Ryzen and Threadripper with nvidia GTX1080TI GPUs, 64-bit Linux Mint (later maybe windoze). These can be considered “absolute minimum required hardware”, since 2-ish years from now most folks running sophisticated 3D simulation/physics/game applications will have better. I do prefer to avoid non-ARB extensions, but I’ll consider anything likely to be supported by high-end AMD and nvidia GPUs (must be both, not just one or the other brand).

PS: Almost certainly the engine will be released as open-source. I don’t consider 3D engines a viable commercial market (not for me, anyway). For me, this engine is just one subsystem in a larger project.

Thanks!

maybe you’ll find this useful:

a good technique is to store materials in a buffer object and reference them, use bindless textures, and merge as many draw calls as possible, make use of indirect rendering and sort everything by render state, so that you only have to set each render state once (per frame).
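for example (only a sketch of the idea, all names invented), a material record in an SSBO could hold the bindless handles plus whatever parameters you need, and each draw just references a material by index:

    #version 460 core
    #extension GL_ARB_bindless_texture : require

    // sketch: one material record per surface, holding bindless handles plus parameters
    struct Material {
        uvec2 colormap_handle;    // 64-bit bindless texture handle packed as two 32-bit uints
        uvec2 normalmap_handle;
        vec4  params;             // e.g. tint / roughness / whatever else is needed
    };

    layout(std430, binding = 0) readonly buffer Materials { Material materials[]; };

    flat in uint material_index;  // e.g. selected per draw or per provoking vertex
    in vec2 uv;
    out vec4 frag_color;

    void main()
    {
        Material m = materials[material_index];
        frag_color = texture(sampler2D(m.colormap_handle), uv) * m.params;
    }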

It seems clear from what I read so far that an array of u64 values is half empty, with only one u64 texture handle per 16-byte “location”

Looking at the ARB_bindless_texture specification, it clearly says that sampler and image variables can be members of a UBO. It even says how their data converts to memory.

But it does not say what the size, base alignment, or array stride of them is for std140 layout. And the pathetic example of using them in a UBO doesn’t explain its behavior either.

So if you want an in-spec way to pass such values, you’ll have to pass them as uints. And since there’s a conversion from a uvec2 to sampler/image types, that should be pretty easy. Simply pass an array of uvec4s, where each array element is two 64-bit integers. So if you want the ith element of the array, you return arr[i/2].xy or .zw, depending on whether i is even or odd. Or you can index the vector to get the value:


uvec4 val = arr[i/2];
uint ix = (i % 2) * 2;
uvec2 handle = uvec2(val[ix], val[ix + 1]);
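Putting that together, a complete (if minimal) fragment shader doing the lookup could look something like this (block name, binding, and array size are arbitrary here):

    #version 460 core
    #extension GL_ARB_bindless_texture : require

    // 256 64-bit handles, packed two per uvec4 under std140, so no space is wasted
    layout(std140, binding = 0) uniform Handles {
        uvec4 arr[128];
    };

    flat in uint image_index;     // which of the 256 handles to use
    in vec2 uv;
    out vec4 frag_color;

    void main()
    {
        uvec4 val = arr[image_index / 2u];
        uint  ix  = (image_index % 2u) * 2u;
        uvec2 handle = uvec2(val[ix], val[ix + 1u]);
        frag_color = texture(sampler2D(handle), uv);   // uvec2 -> sampler2D constructor from the extension
    }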

[QUOTE=Alfonse Reinheart;1288406]So if you want an in-spec way to pass such values, you’ll have to pass them as uints. And since there’s a conversion from a uvec2 to sampler/image types, that should be pretty easy. Simply pass an array of uvec4s, where each array element is two 64-bit integers. …[/QUOTE]

What I also worry about is how those u64 integer bindless texture handles get turned into samplers in GLSL. We all know we can put a u64 integer into two u32 integers on the CPU, then put them back together again to recreate the original u64. Piece of cake. But somehow we’re supposed to put those u64 integers that are (or somehow “represent”) bindless texture handles into a UBO… and assume OpenGL or GLSL or some magic force will know how to turn them into the samplers declared in the GLSL shader programs.

I am fairly sure I remember reading that arrays of ANY type in UBOs end up mapping one value to one 16-byte layout location. That’s really gross, and probably lots of people find that revolting (except, I suppose, arrays of vec4 or dvec2 values, which fit nicely).

I’m pretty sure I read this in edition 7 of OpenGL SuperBible. That’s where I got the idea that each u64 AKA GLuint64 will consume half of each 16-byte location (thereby wasting half the space in the UBO buffer object). Maybe I shouldn’t moan, since arrays of s32 or u32 values will waste 3/4 of the space, arrays of s16 or u16 values will waste 7/8 of the space, and arrays of s08 or u08 will waste 15/16 of the space in UBO buffers.
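If I have that right, it boils down to the std140 rule that every array element’s stride is rounded up to 16 bytes. Just the block declaration, for illustration:

    // std140 rounds every array element's stride up to 16 bytes:
    layout(std140, binding = 1) uniform Example {
        uint values[256];   // consumes 256 * 16 bytes of the buffer, not 256 * 4
        vec4 colors[256];   // a vec4 already fills its 16-byte stride, so nothing is wasted
    };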

If I understand your code, I worry about the fact that somehow OpenGL or GLSL have to “map” or “convert” or otherwise associate the u64 AKA GLuint64 value to a sampler. Would OpenGL or GLSL be able to understand that your uvec2 could be turned into a bindless texture sampler? Maybe. But my wild guess would be “no”.

I also read about a shared layout alternative to std140 in the OpenGL SuperBible, but… it said absolutely nothing about how that layout scheme works.

Maybe I shouldn’t worry so much about this issue. After all, at most I’ll have 4 * 4KB UBO for bindless textures (or at most 4 * 64KB UBO). In the grand scheme of things, that’s not much for high-end GPU cards, which this engine caters to. Compare that to my need to have maybe one but more likely two f32mat4x4 for every shape object (transformation matrices). That’s 64-bytes or 128-bytes each shape object, and some simulations/games could have thousands or even tens of thousands of shape objects. 10,000 shape objects * 128-bytes each for transformation matrices == 1.28GB… which makes 4 * 4KB or 4 * 64KB appear laughably trivial.

Sometimes striving for efficiency gets me to forget the bigger picture temporarily. However, saying that made me remember why I worried about this… not so much memory waste, but CACHE MISSES. When every other 8-bytes is empty, the number of cache misses might soar. Since I don’t know how wide or deep are GPU caches, I don’t know how serious this is. Anyone know?

Which raises another question I will create another message to ask… but I’ll mention it here in case you know the answer.


A UBO can only hold 1024 transformation matrices (without screwball stunts that aren’t worth the hassle or smallish gains in my opinion). Which means, when applications have 10,000 or so shape objects, the engine would need 2 * 10 == 20 UBOs to hold the transformation matrices. Plus, the code would need to play some funky games to assure it only renders objects with the same mod 1024 shape object integer identifier.

What will be vastly easier, obviously, is to put all transformation matrices into a single SSBO. This SSBO can be marked “write-only” from the perspective of the engine on the CPU (meaning the CPU would write but never read its contents), and marked “read-only” from the perspective of GLSL programs (which would never write to those SSBO buffers).

The question is… on high-end GPUs, will an SSBO that is configured in such a simplistic manner (basically, like a huge UBO) run as fast or nearly as fast as UBO ???

I hope so, because that sure would simplify the transformation matrix implementation in a few ways.
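For what it’s worth, the GLSL side of the SSBO approach I have in mind would be tiny… something like this minimal sketch (names are placeholders; std430 packs mat4 arrays back-to-back):

    #version 450 core

    layout(location = 0) in vec3 position;
    layout(location = 2) in uint object_index;   // placeholder: which shape object this vertex belongs to

    // std430 packs mat4 elements back-to-back (64 bytes each); capacity is limited only by SSBO size
    layout(std430, binding = 0) readonly buffer Transforms {
        mat4 local_to_clip[];
    };

    void main()
    {
        gl_Position = local_to_clip[object_index] * vec4(position, 1.0);
    }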

[QUOTE=john_connor;1288405]maybe you’ll find this useful:

a good technique is to store materials in a buffer object and reference them, use bindless textures, and merge as many draw calls as possible, make use of indirect rendering and sort everything by render state, so that you only have to set each render state once (per frame).[/QUOTE]

Thanks for the link. I will view and listen to the presentation. I haven’t listened yet, but this might be similar to an “OpenGL AZDO” presentation I found recently, which was also helpful.

As you no doubt noticed, a lot of effort has gone into the engine to assure a large if not huge number of shape objects are rendered by each glDrawElements() function call. After I view, listen and think for a while, I may post a reply with further questions.

Thanks!

That’s why I said “pass an array of uvec4s”.

Or, you know, you could read the specification, where it clearly says “yes”:

So, what’s the concern?

in OpenGL 4.5 the UBO size is at least 16 KB; divided by 16 floats per mat4 and 4 bytes per float, you can store (at least) 256 mat4’s in a uniform block.

with SSBO’s you could fill (maybe?) the whole GPU memory with mat4’s

but if you put the matrices into a GL_ARRAY_BUFFER, then you can send a mat4 (or more) per instance with “instanced rendering”, that way you can send many more to the vertex shader

consider glMultiDrawElementsIndirect(…), you fill a struct like this to render your meshes:

    typedef  struct {
        uint  count;
        uint  instanceCount;
        uint  firstIndex;
        uint  baseVertex;
        uint  baseInstance;
    } DrawElementsIndirectCommand;

baseVertex and firstIndex are determined by the location of the mesh in your vertex / index buffers, count says how many indices the mesh needs to draw, with baseInstance you specify from where your VAO pulls the instanced data (ModelView and ModelViewProjection matrices for example), and you set instanceCount to 1 to draw the mesh (or 0 to skip it)

example a triangle:
vertices
(0, 0, 0)
(1, 0, 0)
(0, 1, 0)

indices
0, 1, 2

let’s say the vertices have X other vertices located before them in the buffer (the vertices of other meshes), let’s say Y indices are located before them in the index buffer, and let’s say M instances are located before the triangle’s MV and MVP matrices in the instance buffer.

    DrawElementsIndirectCommand cmd = {
        3, // count: a triangle needs 3 indices
        1, // instanceCount: draw 1 instance of the triangle
        Y, // firstIndex: the triangle's index offset in the IBO
        X, // baseVertex: the triangle's vertex offset in the VBO
        M, // baseInstance: the mesh's offset in the instance buffer
    };

put that data into a GL_DRAW_INDIRECT_BUFFER, bind that buffer, and call:

glMultiDrawElementsIndirect(
   GL_TRIANGLES, // to draw triangles
   GL_UNSIGNED_INT, // IBO data type
   (const void*)(sizeof(DrawElementsIndirectCommand) * 0), // byte offset of the draw command in the indirect buffer
   1, // issue 1 drawcall
   0 // assuming the commands in GL_DRAW_INDIRECT_BUFFER are tightly packed (no padding between 2 drawcalls)
);

that’s how you reduce the scene to 1 draw call (per render state and the mesh’s primitive type)
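and with GL 4.6 (or ARB_shader_draw_parameters) the vertex shader also gets gl_DrawID, which tells you which of the merged draw commands is currently being drawn, so instead of an instanced attribute you could just index a matrix buffer directly… roughly like this (buffer and binding names are only an example):

    #version 460 core

    layout(location = 0) in vec3 position;

    // one matrix per draw command in the multi-draw
    layout(std430, binding = 0) readonly buffer PerDrawMVP { mat4 MVP[]; };

    void main()
    {
        // gl_DrawID = index of the current command within glMultiDrawElementsIndirect()
        gl_Position = MVP[gl_DrawID] * vec4(position, 1.0);
    }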

[QUOTE=john_connor;1288413]in OpenGL 4.5 the UBO size is at least 16 KB … that’s how you reduce the scene to 1 draw call (per render state and the mesh’s primitive type)[/QUOTE]

Yeah, ever since I noticed the fancy draw functions like glMultiDrawElementsIndirect(), it has seemed like “this might help do what I want… somehow”.

Your message clarifies a few things that I could not understand previously. Let me state what those were, and you can correct any of them that are wrong.

#0: I assume count is the number of indices in the IBO specified in the active VAO to process to draw the desired object. Assuming the primitives to be drawn are triangles (not points or lines), the count value must always be a multiple of 3 to be valid and make sense.

#1: the value firstIndex is not a byte-offset, but instead is how many indices are in the IBO before the index we want this draw to start with, regardless of whether indices are 1-byte, 2-byte or 4-byte values. Previously I had the impression this value was a byte-offset into the IBO.

#2: the value baseVertex is not a byte-offset, but instead is how many vertices are in the VBO before the vertex we want this draw to start with, regardless of how many bytes in each vertex-structure (in AoS) or in each vertex-element (in SoA). Previously I had the impression this was a byte-offset into the VBO. But also, the name baseVertex confuses me, and always confused me. Why not call this firstVertex to match the firstIndex name? Don’t both of these have the same meaning… the nth element in the IBO and VBO respectively? This difference in naming always implied to me that “something strange or different is going on with these two variables”. Was I wrong?

I’m even more confused about the other variables.

#3: What does instanceCount do or control? It sounds like you’re saying instanceCount should contain 0 to “not draw this object at all == 0 instances”, and instanceCount should contain 1 to “draw one instance of this object”. I guess the instanceCount == 0 case is helpful because the software can then cull objects outside the frustum simply by writing 0 into this one location. But I guess I wonder what happens if instanceCount > 1. I assume this will make the GPU process those exact same count indices the number of times specified by the instanceCount variable. But what happens in the shaders (or elsewhere) to distinguish the different instances? A built-in gl_instance variable? I vaguely recall seeing something like that somewhere.

#4: What does baseInstance do or control? By way of example, if baseInstance == 16, does this mean that the built-in gl_instance variable in the shader (assuming there is such a built-in variable) contains the integer value 16 ??? From your message, it sounds like you’re saying the appropriate f32mat4x4 local-to-world matrix for this object would be element 16 in an array of f32mat4x4 matrices in one SSBO, and the appropriate f32mat4x4 local-to-world-to-view-to-projection matrix for this object would be element 16 in an array of f32mat4x4 matrices in another SSBO. Of course such matrices will pack tight and solid in pretty much any context (since their size is 64 bytes and they will always sit on 64-byte address boundaries).

But if what I infer in #4 immediately above is correct, what happens when instanceCount > 1 ??? If I want to call a draw function that recognizes instances, and potentially draws many or hundreds or thousands of instances, I would want the same exact matrices (at matrix array element 16) to be sent to the shaders in both of these cases. At least that’s what I would want. Yes, I would want some other number (perhaps this possibly fictitious gl_instance variable I conjured up from my imagination or memory) to change every instance, and I’d index into something else (somewhere based upon the objid == object identifier, which is in every vertex) to tell the shader how to alter the position and/or orientation of each instance given gl_instance (or equivalent).

I worry that maybe this structure (or functions that access them) increment that baseInstance value every instance in some way, though I don’t know how. Anyway, I’m just confused because I’m not certain exactly what they (and others) might be trying to achieve with a draw call designed for instancing. I’m sorta half guessing that you are applying a draw call designed for instancing to do what my situation requires, but not what the structure and related functions are designed for… seeing that you only put 0 or 1 into the instanceCount variable. And so, my general sense of “not knowing what OpenGL or GLSL is doing” is blaring in the back of my head.

Can you explain a bit further?

But it does sound like you’ve got a good approach here.

#5: Oh, I just noticed the sizeof(DrawElementsIndirectCommand) * 0 argument to glMultiDrawElementsIndirect(). What is that and why zero? Sometimes it looks like you’re drawing one triangle, and sometimes you mention drawing the entire scene with one draw call. So maybe I’m getting those two cases mixed up.

Thing is, I’ve done more in my engine to get to the point where drawing an entire scene (everything in the frame) in one draw call is actually not so absurd an idea. My nominal shader is an “uber-shader” that supports a moderately large vertex structure that contains integer fields and control bits that let every triangle specify the type of lighting (emissive, many-lights, etc), specify up to four images for purposes that include conventional texturemap application, conventional surfacemap/normalmap tweaking of the surface vectors to generate fake geometry via lighting, and conestepmap application for generating somewhat less-fake “perspective geometry” via parallax… and more. Given those bits and fields and the whole u64 bindless texture handle array stuff, every vertex can control its own fate in great detail.

Fact is, I can’t literally go “all the way” to one draw call, because the engine will also offer rendering of real, amazingly rich, highly configurable starfields based upon a technique that I figured out (but will no doubt need to hassle gurus like you about, to figure out how to make OpenGL do every trick I need). For reasons that probably seem counter-intuitive to you at first, the star background must be rendered last, not first. The main reason is not because it would be a huge waste to process through large parts of the catalog of more than 1 billion stars that I created (all real stars and extensive information like spectral-type, distance (via parallax measures), proper-motion and more) for portions of the sky covered by other objects… though that reason alone might be sufficient.

There is also the question of shadows. Ugh. I don’t look forward to getting to that part. The engine supports many cameras in the environment, all at different locations and pointing in different directions, each of which can be rendering onto a texture that will be pasted onto a rectangle that is (or lies over) the face of a display monitor that is also [potentially] visible in the scene (plus the similar case of mirrors and [partially] reflective surfaces like floors). The fallout of that is… how do you know where you need to capture information for shadow maps when… cameras are pointing in so many directions (each with a different field of view)? My off-the-top-of-my-head-but-I-bet-Im-right guess is… the shadowmap will need to be a cubemap. That plus who knows how many more [very simple and fast] passes will need to occur to fill that shadow cubemap with all the information to generate appropriate shadows in the rendering of all cameras? Aarg… I don’t want to think about this yet, cuz if I did, I might quit this project entirely! :frowning:

Nonetheless, it is important to render as much as possible in each draw call, and I’m going far out of my way to do so.

Thanks in advance for your message and your next reply.

My concern is… that I still don’t fully understand this stuff.

For example, does this mean I can put fully packed C-style u32vec2 arrays into CPU memory, then transfer with glBufferSubData() that array into the UBO buffer that backs the four sampler arrays in my fragment shaders? Sure, C is happy packing u32vec2 tightly without waste, but just because I put a packed array of those u32vec2 vectors into a UBO… doesn’t mean shaders can access every u32vec2.xy and convert to a sampler. In fact, nothing you posted in your message convinces me that the shader won’t skip every other u32vec2.xy in the uniform block that UBO backs.

However, your message does imply one possible way to achieve this result. And that is to make that array of u32vec2 into an array of u32vec4, with every even numbered u64 bindless texture handle split and jammed into the u32vec4.xy elements, and every odd numbered u64 bindless texture handle split and jammed into the u32vec4.zw elements. That sounds promising, and indeed, that’s probably what your first sentence is telling me by saying “pass an array of uvec4”.

Fact is, maybe I wasn’t such a moron, maybe I could understand how what you wrote tells me everything I need to know to get the desired result. Maybe you expected me to take the [very tiny] leap to understand I could extract u32vec4.zw elements, create a u32vec2.xy out of them… then follow the instructions in the ARB_GL_bindless_texture.txt specification (which I did read, but found only half informative and half confusing).

Frankly, I’m just now seeing that this is probably what happened. Wish I wasn’t so dense.

But also frankly… I still don’t see how this works on the shader code side of the situation. All the examples I remember seeing (though I have the worst memory in the entire solar system) implied that magic happens behind the scenes and samplers backed by bindless textures [must] automatically appear in shaders as samplers, not as u64 integers, and not as u32vec2 or u32vec4 integer vectors either. However, if you tell me that my shader code can receive an array of u32vec4 values [backed by a uniform block that contains an array of u32vec4 integers that contain sliced and diced portions of u64 integers/addresses], then after-the-fact convert them into whatever kind of sampler the shader code wants… well… then I suppose I’d be writing the code.

Which is what I’ll do if I now seem to understand this correctly. Do I ???

Maybe I drew that inference (about “magic happens here behind the scenes” and “shaders only receive samplers”) incorrectly. Seems like maybe I did.

Read this:

It’ll answer a lot of your questions.

the value baseVertex is not a byte-offset, but instead is how many vertices are in the VBO before the vertex we want this draw to start with, regardless of how many bytes in each vertex-structure (in AoS) or in each vertex-element (in SoA). Previously I had the impression this was a byte-offset into the VBO. But also, the name baseVertex confuses me, and always confused me. Why not call this firstVertex to match the firstIndex name? Don’t both of these have the same meaning… the nth element in the IBO and VBO respectively? This difference in naming always implied to me that “something strange or different is going on with these two variables”. Was I wrong?

That’s all very confused.

This is an array of vertex data:


Vertex verts[100];

This is an array of index data:


int indices[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};

So, we have 100 vertices and 10 indices. Now, consider this loop:


for(int ix = 0; ix < 10; ++ix)
{
  RenderWithVertex(verts[indices[ix]], 0);
}

This renders with all 10 of the indices we have in our index array. It fetches each value from indices, from 0 to 9, uses that to fetch a Vertex, and then renders with that vertex. Since the index array references vertices 0-9, the vertices that get rendered are verts[0] through verts[9].

This loop corresponds to an indirect struct where count is 10, firstIndex is 0, baseVertex is 0, instanceCount is 1, and baseInstance is zero.

Now, consider this loop:


for(int ix = 0; ix < 5; ++ix)
{
  RenderWithVertex(verts[indices[ix + 5]], 0);
}

This renders 5 vertices. But not using the first 5 indices in the index list; the last 5. The vertices that get rendered are verts[5] through verts[9].

This loop represents an indirect struct where count is 5, firstIndex is 5, baseVertex is 0, instanceCount is 1, and baseInstance is zero.

Now try this:


for(int ix = 0; ix < 10; ++ix)
{
  RenderWithVertex(verts[indices[ix] + 10], 0);
}

This renders 10 vertices, just like the first loop. It iterates over all of the indices in the index list. But look at what it does. It increments the index fetched from the index array by 10. So the vertices that get rendered are verts[10] through verts[19].

This loop represents an indirect struct where count is 10, firstIndex is 0, baseVertex is 10, instanceCount is 1, and baseInstance is zero.

Do you see now how baseVertex is different from firstIndex? firstIndex affects where you start reading from the index array. baseVertex affects the value of the index after it has been read from the index array.

The idea is to be able to easily have mesh data for multiple objects in the same buffers. Consider indexed meshes in a disk file format. The indices are all relative to the beginning of their respective vertex arrays, of their own mesh. So the index 0 means “my first vertex”. But if you load them into the same buffer, you need a way to tell it where “my first vertex” starts.

You could do this by manually changing the index for each mesh at load time. But you’d just be doing what baseVertex already does. You could do this by calling glBindVertexBuffer or glVertexAttribPointer and providing a different offset for the buffer(s). But that’s a state change and therefore somewhat expensive. The baseVertex is a rendering parameter and therefore cheap.

FYI: the equivalent code for a full indirect draw command is:


void IndirectDraw(IndirectStruct params)
{
  for(auto instance = 0; instance < params.instanceCount; ++instance)
  {
    for(auto ix = 0; ix < params.count; ++ix)
    {
      RenderWithVertex(verts[params.baseVertex + indices[params.firstIndex + ix]], params.baseInstance + instance);
    }
  }
}

What does baseInstance do or control? By way of example, if baseInstance == 16, does this mean that the built-in gl_instance variable in the shader (assuming there is such a built-in variable) contains the integer value 16 ???

Going to this page and searching for “baseInstance” would eventually take you not only to how the base instance works, but a very clear warning on what it does not do:

I genuinely cannot make this more clear.

But if what I infer in #4 immediately above is correct, what happens when instanceCount > 1 ??? If I want to call a draw function that recognizes instances, and potentially draws many or hundreds or thousands of instances, I would want the same exact matrices (at matrix array element 16) to be sent to the shaders in both of these cases.

Um… why? Generally speaking, instancing means rendering the same mesh multiple times, with different parameters. And one of the most important parameters for rendering a mesh multiple times is its transform.

So unless you want to render these instances on top of each other, it’s not clear why they would have the “same exact matrices”.

Oh, I just noticed the sizeof(DrawElementsIndirectCommand) * 0 argument to glMultiDrawElementsIndirect(). What is that and why zero?

Because the parameter is a byte offset. The conversion between “byte offset” and “array index” is “byte offset = sizeof(struct) * array index”.

The index being 0 here is academic; if you want to change the byte offset later, you don’t have to remember to add the sizeof part.

Fact is, I can’t literally go “all the way” to one draw call, because the engine will also offer rendering of real, amazingly rich, highly configurable starfields based upon a technique that I figured out

To be honest, the whole “make only one draw call ever” thing is really overblown. It’s not even what AZDO is all about. The AZDO presentation made it perfectly clear that draw calls aren’t the problem. The problem is state changes between draw calls. Making multiple draw calls in succession without any state changes is quite fast in OpenGL.

Overall, the thing you need to be worried about is not making one draw call ever. It’s making sure that:

  1. You’re minimizing the number of state changes that happen between draw calls.

  2. That the number of state changes between draw calls does not depend on scene complexity. That is, even if you’re rendering more stuff, you’re not increasing the number of state changes between draw calls.

That is, if your renderer does “draw all my shapes” in one step, then changes some state and “draw my starfield”, that’s fine. So long as “draw all my shapes” doesn’t involve state changes (or the set of state changes is fixed and invariant with how many shapes you render), you’re OK.

However, your message does imply one possible way to achieve this result.

The way to do it is, on the C++ side, use an array of GLuint64. That’s it. It’s just that simple.

The GLSL side has an array of uvec4 that is half the size of the C++ array of GLuint64. Obviously, the GLuint64 array must have an even number of elements.

Maybe you expected me to take the [very tiny] leap to understand I could extract u32vec4.zw elements, create a u32vec2.xy out of them… then follow the instructions in the ARB_GL_bindless_texture.txt specification (which I did read, but found only half informative and half confusing).

… there’s no leap. I literally gave you the GLSL code for reading the data. It’s right there in my first post:

uvec4 val = arr[i/2];
uint ix = (i % 2) * 2;
uvec2 handle = uvec2(val[ix], val[ix + 1]);

arr is the array of uvec4 in the UBO. i is the index for the bindless handle you want to fetch from the array. handle is the uvec2 containing the handle you wanted to fetch, ready to be cast directly into a sampler or image type. The cast itself is just constructor syntax, e.g. sampler2D(handle) or whichever sampler/image type matches your texture, using the uvec2 constructors that ARB_bindless_texture adds.

All the examples I remember seeing

Stop looking at examples! “Examples” only represent the code that the person who wrote them wanted to write. Do not limit yourself to what other people want to write.

Stop being a copy-and-paste programmer.

baseVertex isn’t really needed, but if you set it to 0, then every time you put the vertices and indices of a mesh into a VBO and IBO, you have to pre-correct / shift all the indices by the number of vertices that were in the VBO before putting the mesh’s vertices into it

was that understandable ?? if not read this:
https://www.khronos.org/opengl/wiki/Vertex_Rendering#Base_Index

so now you have 2 buffers (VBO and IBO), and you can address different meshes in them; to do so you can “pack” all the necessary infos (for example the mesh’s first index, index count, and base vertex) into a struct per mesh.


then comes the 3rd buffer: “instance buffer”
until now, you had to set a uniform mat4 to transform your mesh in the correct space.
but consider the case when you want to draw 10 x the same meshes with different transforms:

for (uint i = 0; i < 10; i++)
{
	mat4 MVP = arrayoftransforms[i];
	glUniformMatrix4fv(location, 1, GL_FALSE, value_ptr(MVP));
	
	Draw(mesh);
}

or you could use an array of mat4s and upload the transforms once:

glUniformMatrix4fv(location, 10, GL_FALSE, arrayoftransforms);

for (uint i = 0; i < 10; i++)
{
	Draw(mesh);
}

however you do it, you have to issue 10 draw calls. and if you want to draw meshes 1000x, or more,
then you run into problems:
–> you can’t upload 1000 transforms (assuming max. 16kB uniform block space) at once
–> even if you could, you’d have to call Draw(mesh) 1000x, which can impact performance

solution: instanced rendering
if you could upload 1000 transforms, the solution would be “DrawInstanced(mesh, 1000);”
https://www.khronos.org/opengl/wiki/Vertex_Rendering#Instancing

if you can’t upload 1000 transforms, then you have to put the transforms into another buffer: the “instance buffer”, and you modify your vertex array object (VAO) so that it pulls 1 transform for each mesh instance. here some tutorials:

important:
https://www.khronos.org/opengl/wiki/Vertex_Specification#Instanced_arrays
https://www.khronos.org/opengl/wiki/GLAPI/glVertexAttribDivisor
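the vertex shader side of that is just a mat4 attribute, which eats 4 consecutive attribute locations (the locations below are only an example; the divisor is set with glVertexAttribDivisor on each of the 4 locations):

    #version 450 core

    layout(location = 0) in vec3 position;
    layout(location = 3) in mat4 instance_MVP;   // occupies locations 3, 4, 5 and 6; divisor = 1 on each

    void main()
    {
        gl_Position = instance_MVP * vec4(position, 1.0);
    }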


now you can reduce the number of draw calls to the number of different meshes you have.
to reduce even that, make use of a 4th buffer: the “indirect buffer”, which effectively stores the draw command parameters so that you have to issue only 1 glMultiDrawElementsIndirect(…) to render all your meshes.

that’s what i explained previously …

if you take a look at the arguments of glMultiDrawElementsIndirect(…), you can see that you can specify “padding” in the indirect buffer between 2 drawcall instances. which means you can use the GPU to fill these structs, for example with a geometry or compute shader to perform instance culling (by setting “instancecount” to 0 if the mesh is behind the camera or so).


and … yes, stop looking at examples, try to figure it out first by reading (the wiki / specs / book / etc), only when you run into problems look at them :wink:

Thanks to Alfonse Reinheart, Dark Photon, john_connor… and some videos I found on the internet… I’ve had some success. I decided to post some comments here about what worked for anyone who finds these messages in the future.

Also, as I hoped (and sorta thought), you will see that a very perverted but cool feature that I attempted actually works in OpenGL… even though gurus at nvidia and AMD claimed/implied in their videos that it wouldn’t work. This was my desire to be able to switch texturemaps, normalmaps and potentially othermaps on a triangle-by-triangle basis (within a single object and draw). I’ll explain why I think this worked as I expected, not failed as the gurus expected.

Note that what I got working is only the hyper-generalized bindless texture approach I was shooting for. Next I’ll be diving into all the efficiency stuff we’ve been discussing based on immutable “Storage” instead of “Data”, persistent buffers, coherent buffers, glMultiDrawElementsIndirect() and so forth.

But for now, just this one victory, which I nonetheless consider a big victory because it means dozens or even hundreds of images can be “resident” and any of them can be accessed by any number of objects in a single draw call, even on a triangle-by-triangle basis if you’re as crazy as me. Though frankly, being able to apply totally different textures, normalmaps, conestepmaps and othermaps to different parts of objects seems like a very desirable situation to me. Otherwise I assume many objects would need to be broken up into separate objects and textured independently.

I removed all error checking (and most overly engine-specific code) from the following code to make the following more readable.

Okay, first my engine created four 4096 element arrays of u64 bindless texture handles.


    bytes = 32768;

    u64* buffer0 = (u64*) memory_buffer_create (bytes);   // u64 buffer0[4096] for engine to hold  4096 u64 bindless texture handles for UBO #0
    u64* buffer1 = (u64*) memory_buffer_create (bytes);   // u64 buffer1[4096] for engine to hold 4096 u64 bindless texture handles for UBO #1
    u64* buffer2 = (u64*) memory_buffer_create (bytes);   // u64 buffer2[4096] for engine to hold 4096 u64 bindless texture handles for UBO #2
    u64* buffer3 = (u64*) memory_buffer_create (bytes);   // u64 buffer3[4096] for engine to hold 4096 u64 bindless texture handles for UBO #3

    glstate.glimage_handle_buffer0 = buffer0;      // keep CPU-side shadow copy of the handles that back UBO #0
    glstate.glimage_handle_buffer1 = buffer1;      // keep CPU-side shadow copy of the handles that back UBO #1
    glstate.glimage_handle_buffer2 = buffer2;      // keep CPU-side shadow copy of the handles that back UBO #2
    glstate.glimage_handle_buffer3 = buffer3;      // keep CPU-side shadow copy of the handles that back UBO #3

    u32 imap0 = 0;                               // uniform buffer object  identifier to contain 4096 u64 bindless texture handles
    u32 imap1 = 0;                               // ditto for uniform block 1
    u32 imap2 = 0;                               // ditto for uniform block 2
    u32 imap3 = 0;                               // ditto for uniform block 3

    u32 binding0 = 0;     // this binding must be specified in shaders:  layout (binding = 0) uniform imap00
    u32 binding1 = 1;     // this binding must be specified in shaders:  layout (binding = 1) uniform imap01
    u32 binding2 = 2;     // this binding must be specified in shaders:  layout (binding = 2) uniform imap02
    u32 binding3 = 3;     // this binding must be specified in shaders:  layout (binding = 3) uniform imap03

    glCreateBuffers (1, &imap0);             // create four buffer objects to become the four uniform buffers
    glCreateBuffers (1, &imap1);             // that contain u64 bindless texture handles for four purposes,
    glCreateBuffers (1, &imap2);             // namely texturemaps, normalmaps, conestepmaps, othermaps.
    glCreateBuffers (1, &imap3);

    glstate.bufferid_imap0 = imap0;        // engine must remember the OpenGL buffer object names
    glstate.bufferid_imap1 = imap1;        // in order to put new bindless texture handles into
    glstate.bufferid_imap2 = imap2;        // these buffers when new textures are created.
    glstate.bufferid_imap3 = imap3;

    glstate.binding_imap0 = binding0;     // engine remembers binding numbers for these uniform blocks
    glstate.binding_imap1 = binding1;     // even though these are fixed by the engine specification.
    glstate.binding_imap2 = binding2;
    glstate.binding_imap3 = binding3;

    bytes = 32768;        // sufficient for 4096 u64 bindless texture handles in each of the four uniform blocks

    glNamedBufferStorage (imap0, bytes, 0, GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_DYNAMIC_STORAGE_BIT);
    glNamedBufferStorage (imap1, bytes, 0, GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_DYNAMIC_STORAGE_BIT);
    glNamedBufferStorage (imap2, bytes, 0, GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_DYNAMIC_STORAGE_BIT);
    glNamedBufferStorage (imap3, bytes, 0, GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_DYNAMIC_STORAGE_BIT);

    glBindBufferBase (GL_UNIFORM_BUFFER, binding0, imap0);
    glBindBufferBase (GL_UNIFORM_BUFFER, binding1, imap1);
    glBindBufferBase (GL_UNIFORM_BUFFER, binding2, imap2);
    glBindBufferBase (GL_UNIFORM_BUFFER, binding3, imap3);

When applications call image create functions, the textures are created in the usual OpenGL way.


    glCreateTextures (GL_TEXTURE_2D, 1, &glimageid);

    glTextureStorage2D (glimageid, levels, glinternal, width, height);   // specify image levels, width, height, pixel format
    glTextureParameteri (glimageid, GL_TEXTURE_WRAP_S, GL_REPEAT); 
    glTextureParameteri (glimageid, GL_TEXTURE_WRAP_T, GL_REPEAT);
    glTextureParameteri (glimageid, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    glTextureParameteri (glimageid, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);

    if ((options & IG_OPTION_IMAGE_NOCLEAR) == 0) {
        glClearTexImage (glimageid, 0, glformat, gltype, color);
    }

    if ((options & IG_OPTION_IMAGE_NOMIP) == 0) {
        glGenerateTextureMipmap (glimageid);
    }

Then we find an available slot in the appropriate u64 bindless texture array for this new image object:


    switch (glblock) {       // glblock specifies what is the purpose of this image (see the four options below)
        case 0:
            for (i = 1; i < 4096; i++) {
                gltester = glstate.glimage_handle_buffer0[i];    // for colormaps AKA texturemaps
                if (gltester == 0) { glindex = i; break; }
            }
            break;
        case 1:
            for (i = 1; i < 4096; i++) {
                gltester = glstate.glimage_handle_buffer1[i];    // for surfacemaps AKA normalmaps
                if (gltester == 0) { glindex = i; break; }
            }
            break;
        case 2:
            for (i = 1; i < 4096; i++) {
                gltester = glstate.glimage_handle_buffer2[i];    // for conestepmaps AKA parallaxmaps
                if (gltester == 0) { glindex = i; break; }
            }
            break;
        case 3:
            for (i = 1; i < 4096; i++) {
                gltester = glstate.glimage_handle_buffer3[i];    // for othermaps (shader dependent)
                if (gltester == 0) { glindex = i; break; }
            }
            break;
        default:
            assert(0);
            return (CORE_ERROR_INTERNAL);                                                        // invalid uniform block
    }

Then get the u64 bindless texture handle for this image and make this image resident.


    glhandle = glGetTextureHandleARB (glimageid);
    glMakeTextureHandleResidentARB (glhandle);

Assign the texture handle to the empty slot we found in the appropriate glstate.glimage_handle_buffer#[] array:


    switch (glblock) {
        case 0:
            glstate.glimage_handle_buffer0[glindex] = glhandle;    // put bindless texture handle into empty slot in buffer0[]
             glNamedBufferSubData (glstate.bufferid_imap0, glindex *  sizeof(u64), sizeof(u64), &glhandle);    // and into imap0 in UBO #0  in GPU
            break;
        case 1:
            glstate.glimage_handle_buffer1[glindex] = glhandle;    // put bindless texture handle into empty slot in buffer1[]
             glNamedBufferSubData (glstate.bufferid_imap1, glindex *  sizeof(u64), sizeof(u64), &glhandle);    // and into imap1 in UBO #1  in GPU
            break;
        case 2:
            glstate.glimage_handle_buffer2[glindex] = glhandle;    // put bindless texture handle into empty slot in buffer2[]
             glNamedBufferSubData (glstate.bufferid_imap2, glindex *  sizeof(u64), sizeof(u64), &glhandle);    // and into imap2 in UBO #2  in GPU
            break;
        case 3:
            glstate.glimage_handle_buffer3[glindex] = glhandle;    // put bindless texture handle into empty slot in buffer3[]
             glNamedBufferSubData (glstate.bufferid_imap3, glindex *  sizeof(u64), sizeof(u64), &glhandle);    // and into imap3 in UBO #3  in GPU
            break;
        default:
            return (CORE_ERROR_INTERNAL);    // impossible error
    }

Then copy the image loaded from disk or blob into the texture in GPU memory… then generate mipmap levels.


    glTextureSubImage2D (glimageid, 0, 0, 0, width, height, glformat,  gltype, tbuffer);    // copy image into bindless texture in GPU
    if ((options & IG_OPTION_IMAGE_NOMIP) == 0) {
        glGenerateTextureMipmap (glimageid);    // generate full mipmap
    }
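For completeness (this part isn’t shown above): when an image object is later destroyed, the engine presumably has to undo all of this. A minimal sketch, not the actual engine code, for the glblock == 0 case:


    u64 glzero = 0;
    glMakeTextureHandleNonResidentARB (glhandle);                            // make the handle non-resident before the texture goes away
    glDeleteTextures (1, &glimageid);                                        // delete the texture object itself
    glstate.glimage_handle_buffer0[glindex] = 0;                             // mark the CPU-side slot empty again
    glNamedBufferSubData (glstate.bufferid_imap0, glindex * sizeof(u64), sizeof(u64), &glzero);   // and zero the slot in UBO #0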

The engine creates procedurally generated content, including 3D physical objects. This is rather complex, even for fairly simple shapes, so I won’t show how that works. Regardless, up to four images can be specified for each shape object created, one image for each of four purposes (mentioned several times in the code comments above).

Every vertex contains four u08 fields (packed into one element of a u32vec2 vertex attribute). Each of those u08 fields specifies which image to access from each of the four uniform blocks created and filled in by the code above. In other words, every vertex in every shape object can specify, on a vertex-by-vertex basis, which texturemap, which normalmap, which conestepmap, and which othermap should be applied.
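For illustration, a hypothetical CPU-side packing helper (not from the engine source) that matches the bit layout the fragment shader decodes farther below:


    // hypothetical packing helper matching the layout decoded in the fragment shader:
    //   mix.x = imap0(00:07) | imap1(08:15) | imap2(16:23) | imap3(24:31)
    //   mix.y = tmatid(00:15) | say(16:31)
    typedef struct { GLuint x, y; } u32vec2;                                 // assumed engine type

    static u32vec2 pack_mixmatsay (GLubyte imap0, GLubyte imap1, GLubyte imap2, GLubyte imap3, GLushort tmatid, GLushort say) {
        u32vec2 v;
        v.x = (GLuint)imap0 | ((GLuint)imap1 << 8) | ((GLuint)imap2 << 16) | ((GLuint)imap3 << 24);
        v.y = (GLuint)tmatid | ((GLuint)say << 16);
        return (v);
    }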

Except that the “per vertex” part of that is not exactly true. For various reasons that have to do with the way shape objects are constructed, a function is called during engine initialization that makes the first vertex of each triangle the “provoking vertex”. That means every invoked fragment shader receives its integer vertex attributes ONLY from the first vertex of each triangle (which is why it is called the “provoking vertex”). Since integer vertex attributes are not interpolated like other vertex attributes, the engine has no choice but to send the integer vertex attributes from one of the three vertices to all instances of the fragment shader deployed to draw each triangle.
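For reference, that initialization call is presumably the standard provoking-vertex selector; a minimal sketch (the default convention is the last vertex, so it has to be changed explicitly):


    // take flat (non-interpolated) vertex attributes from the FIRST vertex of each
    // triangle instead of the default LAST vertex
    glProvokingVertex (GL_FIRST_VERTEX_CONVENTION);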

Which is the hint, I believe, to why this perverse scheme of switching any combination of the images on a triangle-by-triangle basis works in this engine. To go beyond a hint and state why I think it works, the story goes like this (so I infer). When a GPU finishes processing the three vertices of a triangle, it needs to deploy a bunch of fragment shaders to process all the fragments within the triangle, based upon the outputs of the vertex shader. The GPU needs to send the same information to ALL the fragment shaders it deploys… except for whatever hardware performs the interpolation of the floating-point outputs of the vertex shaders.

If somehow different fragment shaders tried to access different images/textures… that probably would not work. In fact, probably a great deal is “fixed, nailed down, cast in stone” when the fragment shaders are deployed to render the fragments within the triangle. And probably this includes images/textures and samplers. Originally I thought maybe it was possible to pass samplers from the vertex shader to the fragment shaders. That would work for the purposes of this engine too, since either way the whole triangle is drawn with the same set of images/textures. But I haven’t seen any way to pass samplers, so I guess the way the engine does this is more-or-less equivalent (wherein every fragment shader assembles the bindless texture handle from two 32-bit values that overlay the u64 bindless texture handle).

The fragment shader code follows. I’ll make some comments about how the shader accesses the four images. (Actually, the current code only accesses the texturemap from uniform block #0 and the normalmap from block #1. The conestepmap code and othermap code are not yet written, but those images are available.)


//
//
// ###########################  Max Reason
// #####  igtan017.frag  #####  copyright 2005 - 2017+
// ###########################  part of the ICE projects
//
#version 450 core    // requires support for GLSL v4.50 (and OpenGL v4.50)

#extension GL_ARB_bindless_texture : require    // requires support for GL_ARB_bindless_texture

layout (location =  0) uniform    mat4     ig_transform;    // transformation matrix
layout (location =  4) uniform    vec4      ig_clight0;        // color of light #0
layout (location =  5) uniform    vec4      ig_clight1;        // color of light #1
layout (location =  6) uniform    vec4      ig_clight2;        // color of light #2
layout (location =  7) uniform    vec4      ig_clight3;        // color of light #3
layout (location =  8) uniform    vec4      ig_plight0;        // position light #0
layout (location =  9) uniform    vec4      ig_plight1;        // position light #1
layout (location = 10) uniform    vec4     ig_plight2;        // position light #2
layout (location = 11) uniform    vec4     ig_plight3;        // position light #3
layout (location = 12) uniform    vec4     ig_pcamera;     // position camera == currently active camera

// imap00 == 4096 samplers mapped to 4096  images AKA "bindless texture handles" backed by UBO containing 4096 u64 texture handles
// imap01 == ditto
// imap02 == ditto
// imap03 == ditto

// image0 == 2048 uvec4 vectors whose .xyzw components hold 4096 u64 bindless texture handles (two handles per uvec4)
// image1 == ditto
// image2 == ditto
// image3 == ditto

layout (binding = 0) uniform imap00 {
    uvec4    image0[2048];    // backed by 4096 u64 bindless texture handles in UBO #0
};

layout (binding = 1) uniform imap01 {
    uvec4     image1[2048];   // backed by 4096 u64 bindless texture handles in UBO #1
};

layout (binding = 2) uniform imap02 {
    uvec4    image2[2048];    // backed by 4096 u64 bindless texture handles in UBO #2
};

layout (binding = 3) uniform imap03 {
    uvec4    image3[2048];    // backed by 4096 u64 bindless texture handles in UBO #3
};

out        vec4        outcolor;     // color to write to the fragment/pixel
out        float        outdepth;    // depth to write to the fragment/pixel

in        vec4        vcamera;
in        vec4        vlight0;
in        vec4        vlight1;
in        vec4        vlight2;
in        vec4        vlight3;
in        vec4        vtcoord;
in        vec4        vcolor;
flat in  ivec2       vmixmatsay;

// mixmatsay.x == imap0(00:07) | imap1(08:15) | imap2(16:23) |  imap3(24:31)
// mixmatsay.y == tmatid(00:15) | say(16:31)

const float   ambient = 0.1250;    // multiply light color for ambient lighting   ::: 0.0625 to 0.2500 --- or higher
const float   diffuse = 0.6250;      // multiply light color for diffuse lighting   ::: 0.5000 to 0.7500 --- or higher
const float   specular = 1.0000;    // multiply light color for specular lighting  ::: 0.7500 to 1.0000
//
// #################
// #####  main()  #####
// #################
//
void main () {

    vec4 color = vec4(1.000, 1.000, 1.000, 1.000);    // color of this pixel == white (default color == starting color)
     vec3 tweak;                                                  // tweak surface.xyz vectors of this pixel (default == untweaked == vec3(0,0,1))

// say bit 0 : color = vertex.rgba : otherwise ignore vertex.rgba ::: this alone == emissive lighting
// say bit 1 : color = light.rgba : otherwise ignore light.rgba
// say bit 2 : color modified by imap0 (texturemap)
// say bit 3 : color modified by imap1 (surfacemap AKA normalmap AKA bumpmap)
// say bit 4 : color modified by imap2 (conemap and heightmap)
// say bit 5 : color modified by imap3

    uint imap0 = (vmixmatsay.x >>  0) & 0x000000FF;
    uint imap1 = (vmixmatsay.x >>  8) & 0x000000FF;
    uint imap2 = (vmixmatsay.x >> 16) & 0x000000FF;
    uint imap3 = (vmixmatsay.x >> 24) & 0x000000FF;
    uint tmatid = (vmixmatsay.y >>  0) & 0x0000FFFF;
    uint saybit = (vmixmatsay.y >> 16) & 0x0000FFFF;

// ivex0 is a uvec2 containing one u64 bindless texture handle (either the .xy or .zw half of a uvec4) ... which GLSL will convert to a sampler
// ivex1, ivex2, ivex3 == ditto

    uvec2 ivex0 = bool(imap0 & 1) ? uvec2(image0[imap0 >> 1].zw) : uvec2(image0[imap0 >> 1].xy);
    uvec2 ivex1 = bool(imap1 & 1) ? uvec2(image1[imap1 >> 1].zw) : uvec2(image1[imap1 >> 1].xy);
    uvec2 ivex2 = bool(imap2 & 1) ? uvec2(image2[imap2 >> 1].zw) : uvec2(image2[imap2 >> 1].xy);
    uvec2 ivex3 = bool(imap3 & 1) ? uvec2(image3[imap3 >> 1].zw) : uvec2(image3[imap3 >> 1].xy);

    sampler2D isam0 = sampler2D(ivex0);    // isam0 is a 2D sampler to access bindless texture handle in imap00 uniform block
    sampler2D isam1 = sampler2D(ivex1);    // isam1 is a 2D sampler to access bindless texture handle in imap01 uniform block
    sampler2D isam2 = sampler2D(ivex2);    // isam2 is a 2D sampler to access bindless texture handle in imap02 uniform block
    sampler2D isam3 = sampler2D(ivex3);    // isam3 is a 2D sampler to access bindless texture handle in imap03 uniform block

    if (bool(saybit & 0x0001u)) {    // saybit:00 == vertex.rgba enabled
        color = color * vcolor;          // color = color.rgba * vertex.rgba
    }

    if (bool(saybit & 0x0004u)) {        // saybit:02 == imap0 enabled
        if (bool(saybit & 0x0010u)) {    // saybit:04 == imap2 enabled ::: conestepmap to be implemented later
            color = color * texture(isam0, vtcoord.xy);    // temporary fallback to unperturbed texturemap
        } else {
            color = color * texture(isam0, vtcoord.xy);    // color = color.rgba * texturemap.rgba @ tcoord.xy
        }
    }

    if (bool(saybit & 0x0002u)) {        // saybit:01 == light.rgba enabled
         if (bool(saybit & 0x0008u)) {    // saybit:03 == isam1.xyzw  enabled (tweak surface normal to achieve "normal-mapping")
            tweak = vec3((2.0 * texture(isam1, vtcoord.xy)) - 1.0);    // tweak surface normal to achieve "normal-mapping"
        } else {
            tweak = vec3(0.0000, 0.0000, 1.0000);    //
        }
        vec3 light0 = normalize(vlight0.xyz);        // pixel to light0 vector in surface-coordinates AKA tangent-coordinates
        vec3 light1 = normalize(vlight1.xyz);        // pixel to light1 vector in surface-coordinates AKA tangent-coordinates
        vec3 light2 = normalize(vlight2.xyz);        // pixel to light2 vector in surface-coordinates AKA tangent-coordinates
        vec3 light3 = normalize(vlight3.xyz);        // pixel to light3 vector in surface-coordinates AKA tangent-coordinates
        vec3 camera = normalize(vcamera.xyz);   // pixel to camera vector in surface-coordinates AKA tangent-coordinates
//
// add ambient contributions from lights 0,1,2,3
//
        vec3 pixel = vec3(0.0, 0.0, 0.0);                              // pixel starts totally black
        pixel += (color.xyz * (vec3(ig_clight0) * ambient));    // add ambient color from light0
        pixel += (color.xyz * (vec3(ig_clight1) * ambient));    // add ambient color from light1
        pixel += (color.xyz * (vec3(ig_clight2) * ambient));    // add ambient color from light2
        pixel += (color.xyz * (vec3(ig_clight3) * ambient));    // add ambient color from light3
//
// add diffuse contributions from lights 0,1,2,3
//
        pixel += (color.xyz * (vec3(ig_clight0) * diffuse) *  max(dot(tweak.xyz, light0.xyz), 0.0));    // add diffuse color from  light0
        pixel += (color.xyz * (vec3(ig_clight1) * diffuse) *  max(dot(tweak.xyz, light1.xyz), 0.0));    // add diffuse color from  light1
        pixel += (color.xyz * (vec3(ig_clight2) * diffuse) *  max(dot(tweak.xyz, light2.xyz), 0.0));    // add diffuse color from  light2
        pixel += (color.xyz * (vec3(ig_clight3) * diffuse) *  max(dot(tweak.xyz, light3.xyz), 0.0));    // add diffuse color from  light3
//
// add specular contributions from lights 0,1,2,3 --- add code to support one-channel specularmap in imap01.w (now empty)
//
         pixel += (color.xyz * (vec3(ig_clight0) * specular) *  pow(max(dot(camera.xyz, reflect(-light0.xyz,tweak.xyz)), 0.0), 128.0));
         pixel += (color.xyz * (vec3(ig_clight1) * specular) *  pow(max(dot(camera.xyz, reflect(-light1.xyz,tweak.xyz)), 0.0), 128.0));
         pixel += (color.xyz * (vec3(ig_clight2) * specular) *  pow(max(dot(camera.xyz, reflect(-light2.xyz,tweak.xyz)), 0.0), 128.0));
         pixel += (color.xyz * (vec3(ig_clight3) * specular) *  pow(max(dot(camera.xyz, reflect(-light3.xyz,tweak.xyz)), 0.0), 128.0));
//
// limit the intensity of each pixel color to the range 0.0000 to 1.0000
//
         color.xyz = clamp(pixel, 0.0, 1.0);                                             // assure 0.0000 <= color.rgb <= 1.0000
    }
//
//  NOTE:  need to support alpha color blending here, and decide how to  handle z-depth of semi-transparent and fully transparent pixels
//
    outcolor = color;
}

The important lines (as far as the topic of supporting unlimited numbers of bindless textures goes) are near the top.

First, the four uniform blocks (bound to uniform block bindings 0, 1, 2, 3) are declared as arrays of 2048 uvec4 elements. Each of those uniform blocks is backed by a uniform buffer that contains up to 4096 u64 bindless texture handles (two handles per uvec4, so nothing is wasted to the 16-byte std140 array stride).
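Since that packing is the whole point, here is a small sanity check of the index math (illustration only, not engine code), showing where handle number glindex lands inside the uvec4[2048] view:


    // which uvec4 element, and which half of it, contains handle number glindex
    int element  = glindex >> 1;                                             // shader reads image0[element]
    int highhalf = glindex &  1;                                             // 0 == components .xy ::: 1 == components .zw
    // byte offset inside the buffer : the same offset glNamedBufferSubData used above
    GLintptr offset = (GLintptr) glindex * (GLintptr) sizeof(u64);           // == (element * 16) + (highhalf * 8)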

Not far inside the function is code that extracts the various 1-bit, 8-bit and 16-bit fields of interest from the u32vec2 vertex attribute.

The u08 imap0, imap1, imap2, imap3 fields contain the indexes into the u64 bindless texture handle arrays that this triangle accesses from the four corresponding uniform blocks (0,1,2,3) for the four purposes (texturemap, normalmap, conestepmap, othermap).

The tmatid value is an index used to access the transformation matrices for this object (only useful to the vertex shader).

The saybit value is a bit field in which each bit enables (or doesn’t) various features like:
bit 0 : let vertex color.rgba impact the final color of the pixels — this bit alone would cause emissive lighting (a light source).
bit 1 : let the positions, orientations and colors of lights impact the final color of pixels (lighting)
bit 2 : let the texturemap.rgba at the specified vtcoord.xy impact the final color of pixels (texture mapping)
bit 3 : let the normalmap.xyzw at the specified vtcoord.xy impact the final color of pixels (normal mapping AKA bump mapping)
bit 4 : let the conestepmap.xyzw at the specified vtcoord.xy impact the final color of pixels (conestep/parallax mapping)
bit x : a few more bits are relevant.

Quite likely the saybit field will end up considerably narrower than 16 bits, and the tmatid field considerably wider. The tmatid field essentially contains the integer shape object identifier, which varies from 1 to number-of-objects. The vertex shader reads 1, 2, or 3 matrices from an SSBO, with this tmatid being the index into the array of f32mat4x4 arrays. This is not needed in the fragment shader.

As far as bindless textures go, the key points are:

Each bindless texture handle is contained in two of the four components of one element of a uniform-buffer uvec4 array. The long lines that compute ivex0, ivex1, ivex2, ivex3 read the u64 bindless texture handle from two of the four components of one uvec4 array element and convert it into a uvec2 vector (which GLSL knows how to convert into a sampler for the specified bindless texture).

Then the four lines that start with sampler2D convert the four uvec2 values into four samplers for the specified bindless textures.

After that, the rest is simple and typical of many fragment shaders… except for the fact that each computation is conditional upon one of the say bits being set == 1. Therefore, a line like the following conditionally lets the vertex color.rgba contribute to the fragment color:

if (bool(saybit & 0x0001u)) { color = color * vcolor; }

And ditto for letting the positions, colors, brightness of the four lights impact the color of the fragment.

And ditto for letting the normalmap impact the color of the fragment via the conventional normalmap computation.

And so forth.

But as far as the bindless textures, only the texture(isam#, lines matter, as they access the bindless textures.

That’s just about it.

I will attach four images captured from tests I ran to see whether I could change the texturemap and normalmap applied to individual triangles within simple shape objects. As you will see, that worked (except for one artifact that I’ll mention that is only a problem with the way I define certain shape objects, and unrelated to this issue).

To make sense of these images you’ll need to understand just a little about how these shapes are generated. For almost all shapes the create functions take the following arguments (and a lot more):

levels
level_first
level_count
sides
side_first
side_count

Except for the most trivial shape objects (like face3 == triangle and face4 == rectangle), shape objects can have between 1 and dozens of levels, and between 3 and dozens of sides. The shape objects shown in the attached images are called “faces” objects. They are essentially thin one- or two-sided disks with 3 or more sides and 1 or more levels.

A 1 level, 3 sided “faces” object is a triangle composed of three identical triangles. A 1 level, 4 sided “faces” object is a square composed of four identical triangles. A 1 level, 16 sided faces object is a 16-sided polygon composed of 16 identical triangles. All triangles have one vertex at the center.

A 2 or 3 or 16 level “faces” object with the same number of sides looks very similar. In fact, unless some levels or sides are omitted via those arguments, the number of levels does not change the outward shape or appearance of a faces object (but does change the shape or appearance of some other shapes).

By setting level_first to a value greater than zero, the triangles that compose one or more of the innermost levels will be absent, and the level_count argument will specify the number of levels to build with triangles beyond level_first.

By setting side_first to a value greater than zero, the triangles that compose one or more sides of the shape (at all levels) will be absent (those sides from side 0 to side_first will be absent). The side_count argument specifies how many of the “sides” sides to build with triangles.

The “levels” and “sides” arguments must be valid and specific (levels must be 1 or more, while sides must be 3 or more).

The other arguments can be zero to specify the “default” level_first (0), the “default” side_first (0), the “default” level_count (level_count == levels), and the “default” side_count (side_count == sides).

Anyway, the point is this. One can create partial shapes with the create functions. You will see examples of this in the attached images.

What I did to test whether the code would display different textures on a “per triangle” basis was to add one or two lines of code to the “faces” create function to set a different texturemap and normalmap on for one side (one image), for two levels (another image), and for one side and two levels (final image). You can see the consequences of that in the images.

Which is… it works!

Except for one case that is my fault, not the fault of the bindless textures features. The problem is, the faces shape contains only one vertex at the center of the object, and all the triangles in level 0 share that vertex. This decision already caused me trouble once before, but I haven’t gotten around to fixing the problem yet. I need to place as many vertices at the center as there are sides on level 0 of the shape.

Why so? In this case, there is no way to distinguish which “side” of the object the central vertex belongs to. Every other provoking vertex in the object is at the start of one side or one level of the object. In fact, the first == provoking vertex is always at the lower level and lower side of every triangle. But… which side is the central vertex the provoking vertex of? Oops! This is a classic singularity or pole problem. The central vertex is the provoking vertex on every single triangle in level 0, and thus the provoking vertex on every side. And therefore, the question that needs to be answered to decide which texturemap and normalmap to select for any side at level 0 is… undefined and problematic.

I ran into this problem before on the “faces”, “disk”, “peak”, “cone”, “ball”, “globe” shapes. When creating objects, an option flag lets the calling function specify how to generate texture coordinates for every vertex. For the images displayed, the texture coordinates were generated in a “flat mode” or “linear mode” manner (just lay the texturemap image flat onto the shape object and assign the x,y coordinates of the texturemap to the vertices they lie upon).

However, calling functions can specify other ways to generate texture coordinates, one of which is called “wrap mode”. This “wrap mode” is sometimes sensible for the shapes I mentioned in the previous paragraph… especially for “peak”, “cone”, “ball”, “globe”. But a big problem arises when these shapes have only one vertex at the “tip” of the “peak” or “cone”… or at the “poles” of the “ball” or “globe”. The problem is, with only one vertex at the pole, the texture coordinates for that vertex are the same in ALL the many triangles that touch either pole. And so, the tcoords along the sides of the triangles can be very, very imprecise at various places as they approach the pole.

The solution for both of these problems is to put as many vertices at these “singularity” points as the number of sides (and vertices) at the outside of the first level. This will fix the ugly texturemap problem with wrap mode, and also fix the ugly problem you will see in the images with the central level 0 (where the wrong or unexpected texture is displayed at the central level, even for the side we wanted to change the texture).

You can see the problem on the right-hand image in two of the three attached images, namely the image that shows different texturemap and normalmap on a certain side (a radial swath from center to one edge), and the image that shows different texturemap and normalmap on both certain sides and levels (but the sides issue is the problematic one). The problem is always the central area.

Anyway, thanks to all who helped so much recently. I hope you got something out of this too… at least a tiny bit. And I hope others get some benefit too, someday. I have to assume more people will be trying out bindless textures as time goes by. They’re great for AZDO, which is what I’m headed for in as many ways as possible.

Now on to other AZDO opportunities, starting with all that multidraw jazz.

Also, as I hoped (and sorta thought), you will see that a very perverted but cool feature that I attempted actually works in OpenGL…

You should not claim that something “works” just because it doesn’t appear to be broken. Undefined behavior is undefined; that it might do what you want in this particular case does not guarantee that it will continue to do so.

The ARB_bindless_texture extension makes it abundantly clear that the sampler value passed to texture functions must be dynamically uniform. And the definition of “dynamically uniform” explicitly disallows per-triangle data as being “dynamically uniform”. Not unless the entire rendering command only has one triangle.

Now, that being said, NV_gpu_shader5 explicitly allows non-dynamically uniform texture sampling. But as you might notice, that’s an NVIDIA-only extension.

That means that all the invoked fragment shaders receive all the integer vertex-attributes ONLY from the first vertex of each triangle (which is why it is called the “provoking vertex”). Since integer vertex attributes are not interpolated like other vertex-attributes, the engine has no other choice except to choose the integer vertex attributes from one of the three vertices to send to all instances of the fragment shaders deployed to draw each triangle.

Which means precisely nothing as far as dynamically uniform expressions are concerned. The scope for a “graphical operation” is explicitly nebulous, but must be no greater than “a single rendering command, as defined by the client API.” Multiple triangles in the same rendering operation are therefore able to be “a single rendering command”. And therefore an expression which is dynamically uniform must result in the same value for every triangle rendered in that command.

NVIDIA of course can define this more narrowly (the standard requires it to be “at least as large as a triangle or patch”). But that’s implementation-specific behavior; as far as following the actual specification, dynamically uniform is defined at the rendering command level.

Consider the behavior of hardware where multiple FS invocations from different primitives are executed in the same invocation group/wavefront/whatever. There’s no reason why an implementation cannot have 4-16 pixel quads from different triangles running in the same group. This is why those expressions have to be dynamically uniform: the hardware may only use the sampler value from one of those invocations to get which texture to read from, then read from that texture using each FS invocation’s different texture coordinates.

That is a perfectly valid OpenGL implementation. And your code would break on it.


Thanks for the additional information. One followup question. Do they consider glMultiDrawElementsIndirect() functions as one draw or <count> draws for these purposes?

Also, even after listening to a lot of videos about these topics (all I could find), I don’t recall hearing an answer to the following question: If an application draws the exact same set of objects with the exact same set of vertices (same total number of vertices) with functions like glMultiDrawElementsIndirect(), but one time the draw count is 100 and the next time the draw count is 1000… will the latter be substantially slower, or not much slower?

Obviously there are no state changes between draws either way, but perhaps there is measurable or significant [or substantial] overhead to kick off each draw call. Yet when I think about how this might be happening in the GPU, it seems also possible that the difference might be insignificant or nearly unmeasurable (unless taken to absurd extremes like 3 indices == 1 triangle per internal draw).
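One way to answer this for my own content, rather than guessing, is to bracket the two variants with a GPU timer query; a minimal sketch (the indirect buffer, its contents, and the drawcount variable are assumed to be set up already, and are not the engine code shown above):


    GLuint query;
    GLuint64 gpu_ns;                                                         // nanoseconds the GPU spent on the bracketed commands
    glCreateQueries (GL_TIME_ELAPSED, 1, &query);
    glBeginQuery (GL_TIME_ELAPSED, query);
    glMultiDrawElementsIndirect (GL_TRIANGLES, GL_UNSIGNED_INT, (const void *) 0, drawcount, 0);   // drawcount = 100 one run, 1000 the next
    glEndQuery (GL_TIME_ELAPSED);
    glGetQueryObjectui64v (query, GL_QUERY_RESULT, &gpu_ns);                 // blocks until the result is available
    glDeleteQueries (1, &query);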

PS: While overall I’m a fan of all the great work the GPU makers have done over the years, one place I am very much not a fan is their apparent obscuration of what actually happens in the GPUs. The advent of CUDA and OpenCL has reduced that problem a bit, because certain aspects of GPU operation can be inferred from how compute is organized and works. But unless I missed some important articles somewhere (possible, because I finally gave up looking after a few years), much about GPU architecture and operation is still not stated. One reason I’m so bad at this stuff (OpenGL) is because I have (and always did have) the worst memory in the world. But even more important, because my past history was to develop products from [pretty much] the lowest-level possible and up from there, I knew how EVERYTHING worked. And my brain got habituated to needing to know how every element I worked with works. Everything I ever invented or designed this way was fairly easy and painless for me to develop and never problematic. For example, I used to design CPUs (which means invent the architecture, instruction set, everything) from scratch… meaning from SSI gates (AND, OR, XOR) or MSI (multiplexers and such, for which gate diagrams I could inspect and understand were always available). And every one of my CPU designs worked first time (though once there was a PCB layout error that required one patch wire to replace an omitted trace). The same has happened in my software career. I designed a great many products based upon microprocessors and microcontrollers (chips like 8052 and C8051F120 for example), which were fairly completely speced… and which almost without exception worked first time. And the same applies to the [sometimes rather sophisticated] software in these devices… easy to implement and debug. And why? Because I wrote the operating system too (or just integrated the required OS-like capabilities into the basic code)… every byte of code executing on the device.

But I also ran into cases where I tried to “build on the work of others”. One of the first was my attempt to put a GUI around my optical design and analysis program (the first program I ever wrote, in junior high school). I chose Motif (on my UNIX computer) because… well… that’s just about all that was available at the time. Oh, the allure of “let us save you all the time and effort of implementing a GUI”. Yeah, right. I could never make the GUI work. Too many bugs, and way, way, way too much not specified. No way to know what was actually going on inside Motif, and no way to find out. So I dropped back to the Xt intrinsics, which were supposed to be lower-level. Same problem… no joy, and no way to know whether you could ever arrive at a solution. And of course, their hundred or so bug fixes every month. Eventually I created a new programming language (big mistake, but that’s another story). Once the compiler and debugger was done, I wanted GUI capabilities. So I wrote my own interactive graphical GUI designer subsystem (in that language). Damn! I spent about 1% as much time to develop the whole damn thing than I spent trying (and failing) to make Motif and Xt work! Why? Because I based my graphics and GUI code on xlib, which was simple and did not hide [nearly as] much.

The fallout of all the above (and there are LOTS more supporting cases I’m certain you don’t want to hear) is… maybe almost nobody else knows how horrific the situation is with anything we cannot know how it works to the lowest level. Okay, okay, if you’re picky, I’m willing to accept that we don’t need to know the material physics of transistors and basic gates in order to be able to design with gates (for example). But we still need to understand the consequences of that, the timing, the hysteresis, the temperature dependencies and so forth. But yeah, we don’t necessarily need to understand the bumps on the sides of Quarks. So, while I push a long way towards the fundamental, I don’t push all the way. Just far enough so I can know everything that matters to what I’m doing.

All this is just my way of saying “they should explain what’s going on”… at least a lot more than they do. Or if they do, someone needs to point me to where those documents are, because I’m not in the mood to take a job at nvidia and AMD just to find out. AMD has been trying to hire me for decades, but I like being independent and working only on my own projects. In case it isn’t obvious (!!! it is !!!), AMD has always wanted me to design CPUs, not GPUs. Hahaha!

Fact is, I’ve never even seen a fully coherent, understandable meaning of “dynamically uniform”. The link you provided tries pretty damn hard to be clear, but I’m too stupid to glue all the abstraction together into a coherent idea in my head. For example, I cannot see for the life of me how texture coordinates can possibly be “dynamically uniform” when they are different for every freaking fragment shader that executes. Maybe that writeup on “dynamically uniform” makes sense to others. I dunno. It isn’t complete or precise enough for me to comprehend. I suspect that’s just the stubborn need to completely understand everything that pervades my consciousness from decades of training it to be that way (because no freaking way could I ever have developed all those products without fully understanding how the components I was working with actually worked).

If the GPU vendors would clearly explain “what happens” when the GPU processes three triangle vertices, then kicks off a [pile/bunch/collection/zorchplex] of fragment shaders, maybe we could more-or-less know what will work and what will not work before we waste millions of hours of our lives and the health of trillions of neurons (speaking collectively here). But like I said, they don’t… as far as I can see. Even that link you provided (which is great to have, by the way) says it pretty much infers everything it says from Vulkan, not OpenGL… because OpenGL doesn’t explain clearly. Yeah, no kidding! And yes, I understand OpenGL is a specification, not an application owned by a single GPU vendor. That’s not an adequate excuse in my book. And no, I don’t intend to take a job at Khronos either. If you knew what this engine is merely a subsystem of, then you’d understand why.

I am perfectly aware that my insistence on having a fully coherent and specific statements of the fundamentals of the components I work with… makes me look stupid. I don’t care how I look. All I care about is getting results… good results… better results… great results if possible. I’m also aware that my eternally terrible memory works against me too (both in reality and appearances). So be it, I have no choice. But you know what’s funny. I’ve found I absolutely don’t need more than my terrible memory… when I fully understand how something works. When I understand… everything is clear and obvious. True, “understanding” must necessarily involve some kind of memory. But somehow the memory required for understanding is different… as if every bit of memory involved in the understanding is supported and reinforced by the understanding itself (and/or by every other little bit of related memory involved in the understanding).

Frankly, I don’t know how you guys manage to remember so many individual factoids, or where you found them — or wrote them? You probably all have great memories. Or, you submerge yourself in this stuff for so many endless hours every day that this stuff becomes part of your very being.

I am so utterly, completely, thoroughly used to “knowing stuff like this” as a mere consequence of “understanding how everything works”. Without that overall understanding, my IQ165 turns into IQ11 (plus or minus 10). I’m not sure whether that’s an exaggeration or not, but that certainly feels about right.

A couple factoids about me and this specific project are the following. I always design ahead… well ahead. In this case, my target date is two years from now, because I still have a lot of “procedurally generated content” nonsense to expand or add. Second, this will almost certainly become open-source freeware, not a commercial product. Nonetheless, I hope a great many people create commercial products on the back of this engine. Third, it doesn’t bother me one tiny bit if only the best high-end GPUs can support all the features of the engine… though the high-end GPUs of both nvidia and AMD must support them (no one-vendor-only nonsense). However, since this specific feature (triangle-by-triangle specification of images/textures/normalmaps/etc) is not crucial for everyone, if bilateral support doesn’t exist (at least on the near horizon), it could remain as an “undocumented feature” for the short or medium term. In short, it seems like plenty of other game/physics engines exist out there, so if this engine is just a “niche engine” then that’s just fine with me (in fact, less trouble for me). Don’t want to create an application with lots of procedurally generated content (that you don’t have to do yourself)? Then grab another engine and best wishes (no sarcasm).

And so, I have the “luxury” of including features that might be completely out of the question for other engines or products or applications. And so I will. I don’t need to release an engine that runs on “anything back to OpenGL v3.2 or v3.3” for example. In fact, I’m perfectly fine releasing an engine that requires the latest-and-greatest version of OpenGL… plus a few ARB extensions (but no fringe extensions or one-vendor extensions). I consider ARB extensions to be “headed to core in some form” extensions.

I know most folks do not have such “luxury”. Just know that I do.

PS: I’ll read that “dynamically uniform” writeup another 5 or 6 times. Sometimes that gets difficult issues to gel a bit better.

PS: If someone with a new AMD GPU wants to try this see whether this works the same, let me know what happens (or ask to run my code if you prefer). That could settle the question of this specific feature once and for all (if it works on AMD too).

PS: Most important, I appreciate all the time and effort you expend to help me [attempt to] overcome my memory [and other] disabilities. :slight_smile:

PS: I type at about warp 9.75, so I hope you read at least as fast, or don’t mind such long [and sometimes rambling] messages (and typos).

Do they consider glMultiDrawElementsIndirect() functions as one draw or <count> draws for these purposes?

The answer is… unclear. The specification only says “single rendering command, as defined by the client API,” but they never actually define that in anything but the most obvious way (aka: any gl*Draw* call).

However, gl_DrawID is explicitly required to be dynamically uniform. It’s pretty difficult to make that happen without the individual draws being considered separate invocation groups.

Adding to that, Vulkan is explicit about this, “For indirect drawing commands with drawCount greater than one, invocations from separate draws are in distinct invocation groups”. The reason why “indirect drawing commands” is mentioned is because Vulkan doesn’t have non-indirect multidraw commands. You’re expected to simply repeatedly issue multiple drawing operations if you want the CPU equivalent. Which will, by definition, be separate commands.
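For reference, each of those <count> records in the indirect buffer uses the standard layout from the GL specification, and each record is what counts as a separate “draw” here:


    typedef struct {                    // standard DrawElementsIndirectCommand layout from the GL spec
        GLuint  count;                  // number of indices for this draw
        GLuint  instanceCount;          // number of instances for this draw (usually 1)
        GLuint  firstIndex;             // offset into the element array
        GLint   baseVertex;             // added to every index
        GLuint  baseInstance;           // starting instance
    } DrawElementsIndirectCommand;

    // one call, <drawcount> separate draws; gl_DrawID runs from 0 to drawcount-1, one value per record
    glMultiDrawElementsIndirect (GL_TRIANGLES, GL_UNSIGNED_INT, (const void *) 0, drawcount, sizeof(DrawElementsIndirectCommand));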

If an application draws the exact same set of objects with the exact same set of vertices (same total number of vertices) with functions like glMultiDrawElementsIndirect() but one time the draw count is 100 and the next time the draw count is 1000… will the later be substantially slower, or not much slower?

There’s no way to answer that question. It depends far too much on what is being drawn.

All this is just my way of saying “they should explain what’s going on”… at least a lot more than they do.

Who? Who are you talking to? Who is this “they”?

OpenGL (and Vulkan. And D3D. And Metal) is an abstraction. The whole point of an abstraction is that you do not care about what’s going on.

You write code against the model defined by the abstraction. The implementation implements the abstraction’s model on concrete hardware. That’s the way it works. You are not required to understand “what’s going on”; you are simply required to understand the rules laid down by the abstraction. Follow those rules, and your code works (modulo bugs).

This is what allows you to write the same code that works effectively on multiple platforms. You’re surrendering knowledge of the low-level details so that you can gain independence from the low level details.

Every platform does rasterization slightly differently. What good is it to know how Intel does it if NVIDIA does it differently? What good is it to know how Kepler hardware does it if Maxwell hardware changes the rules?

For example, I cannot see for the life of me how texture coordinates can possibly be “dynamically uniform” when they are different for every freaking fragment shader that executes.

Why do you think they have to be? It’s the sampler itself that must be dynamically uniform. That is, all of the invocations must access the same texture. But they don’t have to access it in the same place.

Maybe that writeup on “dynamically uniform” makes sense to others. I dunno. It isn’t complete or precise enough for me to comprehend.

What is incomplete or imprecise about it? It’s kind of hard to help you when all you can say in response to information is “that’s not good enough!”

A dynamically uniform expression is an expression that evaluates to the same value in every shader invocation spawned from the same rendering command. While that is a bit of a simplification (the idea of “dynamic instances” of expressions is what allows loop counters to be dynamically uniform), it’s pretty complete and precise.

Show me a shader, pick an expression, and I can tell you if it will be guaranteed dynamically uniform or only potentially uniform, depending on the values the user provides (inputs can be dynamically uniform, but only if the user provides the same input value to all invocations).

Even that link you provided (which is great to have, by the way) says it pretty much infers everything it says from Vulkan, not OpenGL… because OpenGL doesn’t explain clearly. Yeah, no kidding!

To be fair, I wrote that notation before an update to OpenGL 4.5, when they (finally!) updated the standard to (mostly) explain these things to the level of detail that Vulkan does. Though to be fair as well, that probably happened because I filed a bug report on it :wink:

Admittedly, they fixed this by literally copying-and-pasting a chunk of the SPIR-V standard into the GLSL specification (seriously, the “Dynamically Uniform Expressions and Uniform Control Flow” section is word-for-word), but at least it’s there.

There are certainly holes in the OpenGL standard, thanks in part to it slowly evolving from a very different original standard. GL 1.1 is nearly unrecognizable in many ways from 4.6. By contrast, Vulkan started from scratch, which meant that they couldn’t assume that something had already been said.

So sure, the update to GL that explained this didn’t happen until I personally filed a bug report on it. But to be fair, Vulkan didn’t fully present that information until I filed a bug report on it too. And it was [i]literally[/i] the first public bug on the standard :wink:

Oh, and FYI? I filed a bug asking for clarification on the multidraw issue.

Frankly, I don’t know how you guys manage to remember so many individual factoids, or where you found them — or wrote them? You probably all have great memories. Or, you submerge yourself in this stuff for so many endless hours every day that this stuff becomes part of your very being.

For what it’s worth, I didn’t “remember” that bindless texture requires dynamically uniform expressions. I remembered how to find the bindless texture extension, and I searched for “dynamically” and found the note that restricted it. I didn’t remember that there was a restriction; I simply thought that there could be one.

That’s not memory; that’s experience. Or paranoia, which is really the same thing when it comes to programming :wink:

Most of this comes from reading the standards. I’ve been doing graphics programming since before shaders were a thing. I’ve read a lot of the extension specifications when they were hot off the presses. I imagine that, in your CPU development work, you too read lots about code-gate design, substrates, and other things as research was being done. This is no different.

Also, you seem to divorce “understanding” from “memory”, as though you could understand how something works without remembering how it works. It’s also odd that you seem to suggest that we don’t “understand” this stuff.

I always design ahead… well ahead. In this case, my target date is two years from now, because I still have a lot of “procedurally generated content” nonsense to expand or add.

And yet, you’ve chosen to use the API that’s behind… well behind :wink:

I’m kidding, but only somewhat. If performance is so important, and your ship date is in the (relatively) far future… why aren’t you using Vulkan? It’d make it a lot easier to do your procedurally generated stuff, thanks to better synchronization support and more explicit memory access. You’re already using ubershaders and working to minimize state changes, so that’s already in line with Vulkan best practices.

Vulkan doesn’t have bindless textures, but since you’re focused on more capable hardware anyway, you can simply require that implementations provide shaderSampledImageArrayDynamicIndexing, which lets you use arrays of textures which you can index from (with a dynamically uniform index, of course;) ). Modern desktop implementations provide that, and you’ll note that even Intel is on that list.
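For reference, that feature is just a flag in VkPhysicalDeviceFeatures; a minimal sketch of checking and requesting it at device-creation time, assuming a physicalDevice handle already exists:


    VkPhysicalDeviceFeatures supported = {0};
    VkPhysicalDeviceFeatures enabled   = {0};
    vkGetPhysicalDeviceFeatures (physicalDevice, &supported);
    if (!supported.shaderSampledImageArrayDynamicIndexing) {
        // fall back or refuse to run : the texture-array indexing path needs this feature
    }
    enabled.shaderSampledImageArrayDynamicIndexing = VK_TRUE;                // then pass &enabled as VkDeviceCreateInfo::pEnabledFeatures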

If someone with a new AMD GPU wants to try this see whether this works the same, let me know what happens (or ask to run my code if you prefer). That could settle the question of this specific feature once and for all (if it works on AMD too).

No, it won’t. Undefined behavior is undefined. Even if it worked, that’s no guarantee that it will continue to do so.



I have a lot of trouble making multi-level nested comments work, so I won’t try. Just know that the following addresses issues in the order of your message.

Wow, it would be very interesting to learn that drawing the exact same objects (in every respect) with glDrawElements*() can accomplish more than glMultiDrawElements*() can! As usual, I can’t remember for sure, but I sorta think maybe I recall them even saying the results are the same as the client (or perhaps the driver?) executing glDrawElementsIndirect() in a loop (which would imply, to stupid me at least, that nothing different will happen).

Of course, that only pushes the question back as far as drawing whole objects, not as far as I pushed it… back to drawing individual triangles.

Oh, just a tiny factoid in case you like to know these things. I tried this new test code on my ancient GTX 680 and… it works the same as the GTX 1080 TI… and draws different textures/normalmaps on a triangle-by-triangle basis. Interesting. Already worked on ancient history.

Oh, another probably irrelevant tidbit. I noticed the series of bindless texture handles I received back from OpenGL differed by… one. So it seems rather unlikely they are actually GPU addresses, because it seems unlikely the single byte at those addresses could have enough information to define a sampler… or even provide sufficient indirection to such (unless they have a literal limit of 256 textures/images, which doesn’t sound right to me).

To answer your next question, my answer is EVERYONE. Khronos should. nvidia should. AMD should. Everyone should (even though I don’t care if Intel does). That way we can understand more fully what’s going on, by reading multiple perspectives… some abstract, some concrete.

The whole point of an abstraction is that you do not care what’s going on? Well, that’s where this engine (and probably a few other applications) differ. I don’t care if the engine only runs on 20% of GPUs… as long as they’re high-end GPUs from both nvidia and AMD. But sure, I totally understand that running on a huge majority of GPUs is very important to a majority of 3D engine/application developers. I do. But not everyone does (and I dare say I’m not the only one). If I can achieve a 100x performance increase by “cheating” (knowing how just two brands of high-end GPUs work)… then I am more than happy to take advantage of that. Even 10x or 5x performance increase for that matter. Maybe even 2x.

This is where honesty comes in… or I should say, this is where clarity and honesty SHOULD come in (but apparently often don’t). For example, in the GDC AZDO video they show some speed comparisons that show array textures are the same speed as bindless sparse texture arrays (or something like that). These were both shown as exactly 18x faster than their nominal case (a DX11 example). However, I looked into array textures A LOT, and my conclusion was “no workie in real-life practical situations”. Compared to my bindless texture implementation, I don’t see how their scheme can work… unless every texture is the same size, and thus easy to put into different layers of one array texture. Unless there is a huge pile of unmentioned wackiness going on in their benchmark, I don’t see any way for an application to have arbitrary array textures of varying size and configuration, bound to arbitrary texture units, specified on an object-by-object basis. Let’s assume they take advantage of every trick in the book and new tricks only limited to what is actually possible. Let’s assume they call their favorite glMultiDrawElementsIndirect()-like function, and take advantage of that built-in gl_DrawID shader variable to be able to access information in uniform blocks or SSBOs about how to render each object. The problem is, the shader programs can’t change between draws. So as far as I can see, shader programs can only access specific samplers… the samplers explicitly named in the source code. But one sampler only accesses one array texture, and each array texture only contains one size and configuration of textures. And therefore, if the texture you want to access for this object within this glMultiDrawElementsIndirect() call lives in a set of textures that is the wrong size or wrong configuration… you can’t render the object at all in that glMultiDrawElementsIndirect() call.
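To make concrete what I mean by “one size and configuration per array texture”, here is a minimal sketch (not my engine code) of how an array texture gets allocated; every layer necessarily shares the same width, height, format and mip count:


    GLuint texid;
    glCreateTextures (GL_TEXTURE_2D_ARRAY, 1, &texid);
    // one storage call fixes width, height, format and mip count for ALL layers at once;
    // a 512x512 RGBA8 image and a 2048x2048 compressed image can never live in the same array texture
    glTextureStorage3D (texid, 10, GL_RGBA8, 512, 512, 256);                 // 10 mip levels, 512x512, 256 layers
    // individual images then go in layer by layer:
    // glTextureSubImage3D (texid, 0, 0, 0, layer, 512, 512, 1, GL_RGBA, GL_UNSIGNED_BYTE, pixels);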

Or maybe I’m wrong. Maybe if you put a huge long switch statement in the shader, it could contain code to access every single array texture and, based upon some variable in the block of data indexed by gl_DrawID (or something), get the results from only that one array texture. My naive (and quite possibly wrong) assumption circa the latest GPUs is that every access would be performed in every running shader, which would be horrible overhead for zillions of fragment shaders accessing many textures every fragment shader invocation when only one is necessary. Just imagine the memory bandwidth demand when a whole slew of fragment shaders access a dozen array textures or more! I never read a complete statement of how this stuff is resolved or executed, other than a claim a few years ago that every last bit of code is executed in every shader program, and that everything that did not need to be executed is simply discarded. For a simple ?: or one-line if-then block that’s not so bad. To access several pixels from 12 textures and perform filtering on them all, compared to just one texture… sounds excessive to me! But you’ll probably tell me that aspect of shaders was fixed long ago. I hope so.

Nonetheless, the approach I took lets shaders access any texture of any size and any configuration whenever they want, for each object (actually each triangle, but we’ll ignore that for the purposes of this part of the conversation). Obviously I could make my scheme work from values stored in a UBO/SSBO and indexed with gl_DrawID too. But that would be slower than what I have (albeit only modestly == two extra accesses of UBO/SSBO contents), more complex, and incapable of the “stupid GPU tricks” that others disapprove of (yet wish they could achieve too). Of course, they want their applications to run on cell phones and every computer in the world, while I quite explicitly do not care. Seems only fair that I get something valuable for what I give up (execution on everything everywhere).

Hahaha… Khronos should pay you for your important contributions! Not sarcasm.

As an aside, I’d love to know how the GPU folks dispatch (and re-dispatch) fragment shaders when drawing triangles. For now at least, my inference is that they dispatch a fairly independent “horde” of them based upon the 3 vertices in a triangle. But yes, I understand it doesn’t have to be this way; it just makes sense to me (which isn’t saying much, since I really don’t understand as much as I’d like about how GPUs work internally… in practice… in reality).

I might be wrong, but as much as nvidia would like to kick AMD in the nuts and vice versa, I suspect they both are coming to realize that most people want their code to run on as many brands of GPU as possible… and without needing to design, test and keep fixing different sections of code to do the same thing. I’m sure I’m not the only developer who doesn’t want to redesign my whole freaking application (and invent a new architecture to make that efficient) to deal with endlessly growing piles of incompatibilities and alternate code paths.

I do benefit from reading the standards, a few books (most are near worthless), and watching internet videos or reading PDFs and slide decks. However, so many terms are never defined in a way that becomes fully concrete in my mind that the more I read, the more the chaos of questions and confusions expands, and the less clear I get. Actually, that’s not a fair statement. Maybe 60% becomes clearer the more I read (probably due to encountering different hints, comments and examples of application, which lets me draw inferences). But the other 40% or so goes the other way, and prevents me from ever reaching clarity, it seems. But I do… eventually… get closer to clarity.

I didn’t mean to imply you guys “don’t understand this stuff”. Obviously you understand vastly better than I do! However, I think you get this feeling based upon how I respond negatively to “not knowing what’s really going on under the covers”. You don’t care! In fact, you seem to LOVE IT, because you get something you value more than understanding the guts and reasons and tradeoffs and justifications. You get something that works everywhere! So [maybe] you actually get positive feedback for the lack of complete explanation. I don’t. I explained my experiences, so you should at least understand why a little bit.

Hahaha… you caught me! You’re too clever by 99%.

For a while after I heard about Vulkan, it was all much too vague, with much too much hand-waving. About three months ago I bought all the books that exist, and read/watched everything I could find about Vulkan to see if Vulkan is my xlib of 3D APIs (go read my previous message if you don’t know what that means). Two factoids got me interested in Vulkan.

First, it is [supposedly] lower-level (and probably is). As I explained, unlike most people (I infer), I almost always do vastly better with lower-level interfaces, at both the hardware and software level. So my hope was that Vulkan would be much lower-level, based upon simpler (more fundamental) mechanisms, and would also reveal more about how modern GPUs actually work (by exposing more to inference than OpenGL does). And second, it [supposedly] maximizes performance, which is extremely important to me. For my purposes, I want to unleash all the cores and threads of Threadripper on the work my engine does, including as much of the interfacing with the 3D API as possible.

Reading the Vulkan books hurt my brain. Nothing new… trying to jam too much in too quickly. Takes time to see how the various factoids fit together. So it wasn’t as thrilling a read as I hoped, but it was good enough to lead me to believe that’s where I probably need to go, and fairly soon.

Then I thought about the porting process. Soon I realized that I really needed to clean up a few aspects of my engine that have just been “patched together well enough to function, but not yet implemented as I planned”. One aspect of this was the “image object” in my engine, which includes texturemaps, surfacemaps, normalmaps, conestepmaps, specularmaps, heightmaps, volumemaps and everything else that looks much like a 1D, 2D, 3D (or 4D including time) construct that does not map more efficiently into some kind of buffer object [or other OpenGL construct].

One major reason I put off finishing the “image object” was because… right from the very start… the engine has been rendering thousands of objects in each glDrawElements() draw call. I always knew I had to massively reduce the number of batches and draw calls, so that’s the way the engine was designed. I also knew that “this is such obvious nonsense” (the problem with switching textures all the time and thereby breaking batches) that “no way will the OpenGL or GPU guys not come up with more and better ways to deal with this”… so I decided to try to wait them out.

And in fact, my inference was correct, and with help from you and others here, I now have a massively general “image object” functionality wherein every 3D shape object can depend-upon, require and access as many images/textures/others as it wants without breaking batches within 3D shape objects or between 3D shape objects.

My fallback was to implement humongous texture atlases and deal with the hassle myself (replacing the nominal texture coordinates for each image with modified texture coordinates based upon where each texture was automatically placed within a humongous texture in a many-level texture array). This would be messy as hell, especially when it comes to wrapping modes, but it did represent a worst-case fallback plan. Of course, even with 32K mega-textures, and all texture images being somewhere within one of the 32K mega-textures, there was still the nasty question of how to access many of these… before array textures appeared.

But I just got off the subject. To return to that path, the point is, I decided the process of switching over to Vulkan would be vastly easier if I cleaned up certain aspects of the engine AND tweaked a few portions to be a bit more like “the Vulkan way”, but still with OpenGL components and mechanisms. Which is how I ran into the AZDO video and a couple other PDFs that introduced some of the newer mechanisms like Storage instead of Data, persistence, coherence, rolling buffers (I forget the proper term), multidraws (especially when combined with indirect draws… I think) and so forth.
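
As a concrete example of a couple of those pieces (my own minimal sketch, not engine code; the helper name and buffer size are arbitrary), the “Storage instead of Data” plus persistent/coherent mapping combination looks roughly like this with GL 4.5 DSA calls:

    #include <GL/glcorearb.h>   /* or your GL loader's header (glad, GLEW, ...) */

    /* Create an immutable, persistently-mapped buffer; returns the mapped pointer. */
    static void *create_persistent_buffer(GLuint *buf, GLsizeiptr size)
    {
        const GLbitfield flags = GL_MAP_WRITE_BIT
                               | GL_MAP_PERSISTENT_BIT
                               | GL_MAP_COHERENT_BIT;

        glCreateBuffers(1, buf);
        glNamedBufferStorage(*buf, size, NULL, flags);      /* "Storage", not "Data": immutable */
        return glMapNamedBufferRange(*buf, 0, size, flags); /* mapped once, kept mapped forever */
    }

    /* Each frame: write into the next region of the buffer, then drop a fence
       (glFenceSync / glClientWaitSync) so the CPU never overwrites data the GPU
       is still reading -- the "rolling buffer" idea mentioned above. */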

I really do believe replacing OpenGL with Vulkan will be much easier once I replace as many well-chosen aspects of OpenGL as possible with the very most modern OpenGL (so far I don’t think I need v4.6, but I definitely need v4.5). And so, that’s my answer to the question. Does that mean I’m flying under false pretenses here in the OpenGL forum? I don’t think so. I hope you don’t think so.

But to buttress that a bit… the better I understand AZDO principles, the more it seems possible that purposely taking the very most optimal approach with the latest and greatest OpenGL has to offer might not be much slower than a properly written Vulkan application. Unfortunately, every presentation I’ve seen smells of being biased to appeal to whoever the listener is, so quite possibly I’ll have to switch over to Vulkan to find out whether Vulkan is significantly, substantially or massively faster than the best OpenGL has to offer… or NOT. If they’re not much different in speed, I will probably (eventually, but definitely not yet) be more comfortable with Vulkan, because that’s just how I always seem to roll (most fundamental and lowest-level with least hidden is best and easiest for me). But we shall see.

Oh, and BTW, I had not heard of Vulkan when I started this engine! So forgive me! Hahaha.

I almost hate to ask, but how does ShaderSampledImageArrayDynamicIndexing get around the problem I mentioned with OpenGL array textures, especially since the same “dynamically uniform” requirement still exists (in the spec, if not in reality)? Of course, if the Vulkan spec doesn’t support “per triangle” but it still works in practice, that’s no more a mark against Vulkan than against OpenGL. The problem with array textures in Vulkan seems to be the same as in OpenGL: all the images within one array texture need to be the same size and configuration. Which leads to the same problem (and potentially revolting solution) that I mentioned. Or does Vulkan have some other trick or specification to skirt around this problem?

Someday I’ll get someone to test my code on a new AMD GPU. If it works, that’s good enough for me. And if someday the “per-triangle” feature stops working, that won’t stop the code from still offering the feature on a “per-object” basis. Which means, texturing won’t stop working; only the “stupid GPU trick” of per-triangle image-switches will stop working. I won’t like it, but I’ll take the risk. That’s a luxury I get for writing a non-mainstream, non-commercial application. And when people write games with my engine? They shall be informed of both the various risks and opportunities in the documentation, and can make their own decisions. To risk or not to risk… that is the question. Actually, just one of many.

Just curious. Other than better raw speed, how is the architecture or API of Vulkan better for procedurally generated content?

But yes, the “best practices” part of Vulkan has been fairly attractive to me so far. Not that I fully understand all of it yet, but that’s normal for a slow learner like me. The hilarious part is, give me the raw GPU specs, and I’ll write a better 3D API than OpenGL or Vulkan in my sleep in two months. That’s my sweet spot. Fortunately, most other subsystems of my over-arching project match my strengths well. Assuming 3D doesn’t kill me first! Hahaha.

On per-triangle texturing

But that question has been answered: there is no guarantee that different triangles in the same rendering command are considered part of different invocation groups. It may not be clear how big a “rendering command” is, but it’s definitely bigger than a “single primitive”.

If you want to rely on undefined behavior, that’s up to you. But in architectures like GPUs, you cannot assume that because something appears to work right now that it will continue to do so.

And I don’t mean “in newer hardware”; I mean literally tomorrow. I mean that adding another object to the render might cause it to break. Moving the camera might cause it to break. And so forth.

Unless a specification guarantees it, UB appearing to work should not be assumed to actually work.

As I said, the presence of the NV_gpu_shader5 extension explicitly nullifies the dynamically uniform requirement for bindless textures. And the GTX 680 supports NV_gpu_shader5. Indeed, pretty much every 4.x NVIDIA GPU supports this extension.

So you didn’t need to test it; you’re not relying on undefined behavior here.

Non-dynamically uniform texture accessing is not a matter of “ancient history”. It’s not that more modern hardware supports it and less modern hardware does not. It’s a matter of specific hardware architecture that allows it to work.

And there’s only one vendor who offers such hardware. None of AMD’s extensions provide this. And if it’s not in an actual specification, you should not rely on it.
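
Concretely (a sketch with made-up variable names, not code from the thread), the NVIDIA-only combination being discussed looks roughly like this in GLSL; without NV_gpu_shader5, feeding a per-triangle handle to the sampler like this is exactly the non-dynamically-uniform case the spec leaves undefined:

    #version 450
    #extension GL_ARB_bindless_texture : require
    #extension GL_NV_gpu_shader5 : enable   // NVIDIA-only: lifts the dynamically uniform rule

    flat in uvec2 colormap_handle;   // per-provoking-vertex (hence per-triangle) bindless handle
    in  vec2 tcoord;
    out vec4 frag_color;

    void main()
    {
        // ARB_bindless_texture allows constructing a sampler from the raw 64-bit
        // handle packed in a uvec2; whether a *varying* handle is legal is the issue above.
        frag_color = texture(sampler2D(colormap_handle), tcoord);
    }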

On understanding hardware

Asking Khronos to explain how the hardware works is like asking a hair stylist to explain quantum mechanics. They might be able to do it, but it’s clearly not the reason why most people go to them. The Khronos Group doesn’t make hardware, so they have no reason/right to explain it.

NVIDIA will never explain the details of their hardware. It took them a year or so before they admitted that their GPU rasterization architecture included a pseudo-tile-based component.

AMD and Intel already publish detailed information on their hardware, for the purposes of driver developers.

Every GPU does it differently. Some of them do it radically differently. It even changes from generation to generation.

[QUOTE=bootstrap;1288473]I might be wrong, but as much as nvidia would like to kick AMD in the nuts and vice versa, I suspect they both are coming to realize that most people want their code to run on as many brands of GPU as possible… and without needing to design, test and keep fixing different sections of code to do the same thing. I’m sure I’m not the only developer who doesn’t want to redesign my whole freaking application (and invent a new architecture to make that efficient) to deal with endlessly growing piles of incompatibilities and alternate code paths.

I didn’t mean to imply you guys “don’t understand this stuff”. Obviously you understand vastly better than I do! However, I think you get this feeling based upon how I respond negatively to “not knowing what’s really going on under the covers”. You don’t care! In fact, you seem to LOVE IT, because you get something you value more than understanding the guts and reasons and tradeoffs and justifications. You get something that works everywhere! So [maybe] you actually get positive feedback for the lack of complete explanation. I don’t. I explained my experiences, so you should at least understand why a little bit.[/quote]

You seem to be contradicting yourself here. The only way we can get to a point where testing across platforms is unnecessary is if people stop coding outside of the abstraction. And the only way to do that is if people stop trying to learn things beyond what the abstraction says you can do. If the abstraction says that an “invocation group” extends to a rendering command, then you code to that. It doesn’t matter what the “reasons and tradeoffs and justifications” are. You do what the standard says.

It’s trying to learn details beyond the standard that makes people write incompatible code (on purpose).

On array textures and AZDO

What constitutes “real life, practical situations” is in the eyes of the beholder. The restriction that array textures require, that each layer must be the same size, is not an onerous burden for many applications. For your application it may be, but for the primary audience of the AZDO presentation, it’s just not a particularly painful issue.

Think about how you decide to limit changes to vertex format state. You pick a single vertex format, which all meshes must abide by. That means that all sources of mesh data have to agree on that format. That’s easy for you, because you generate your vertex data in code. But that’s not so easy for someone who loads their vertex data from user-provided meshes.
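
For example (an illustrative sketch, not anyone’s actual code), with separate attribute format (ARB_vertex_attrib_binding, core since GL 4.3) the one agreed-upon vertex format can be stated once, and then buffers are swapped underneath it without ever touching the format again:

    /* One VAO, one fixed vertex format (position.xyz + tcoord.xy here, purely illustrative). */
    GLuint vao;
    glCreateVertexArrays(1, &vao);

    glEnableVertexArrayAttrib(vao, 0);
    glVertexArrayAttribFormat(vao, 0, 3, GL_FLOAT, GL_FALSE, 0);   /* position.xyz at offset 0 */
    glVertexArrayAttribBinding(vao, 0, 0);

    glEnableVertexArrayAttrib(vao, 1);
    glVertexArrayAttribFormat(vao, 1, 2, GL_FLOAT, GL_FALSE, 12);  /* tcoord.xy at offset 12   */
    glVertexArrayAttribBinding(vao, 1, 0);

    /* Any mesh that agrees to this format can be attached without changing the format.
       some_vertex_buffer: a GLuint buffer object created elsewhere. */
    glVertexArrayVertexBuffer(vao, 0, some_vertex_buffer, 0, 5 * sizeof(float));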

To many developers, picking a single texture size that all textures for a particular usage must abide by is no more onerous of a burden than yours was to use a single vertex format.

Similarly:

… yes. So they don’t do that. They ensure that the data is never “wrong-size” or “wrong-configuration”. Generally speaking, most high-performance graphical applications have near-complete control over their input data.

Performance typically requires rigid control over the input data. Generality almost always comes at the expense of performance.

You can make a BSP or portal culling system over fixed geometry render faster than a set of arbitrary, unknown meshes. You can make rendering with fixed-size textures faster than rendering with arbitrarily sized textures. You can make rendering with a single format of vertex data faster than rendering with multiple formats. And so on.

Primary optimizations require knowledge of, and therefore control over, what the data is and how it is to be rendered. The more control you give up, the fewer your options for optimizing.

Then you have misunderstood the point of the AZDO presentation. It was not made to explain how to render with any texture “on an object-by-object basis”. It was made to explain how to improve performance on your current rendering system. It’s not there to explain how to render in snazzy new ways; it’s purely a performance optimization.

A performance-based presentation is not going to explain how to draw things in ways that couldn’t be done before.

On Vulkan

You’re citing two distinct problems. The array texture issue is that each individual layer in the texture has to be the same size. The dynamically uniform issue is that you’re not allowed to have the texture object which gets fetched from be determined by non-dynamically uniform means.

Arrays of textures are different from array textures. An array texture is a “sampler2DArray array_texture;”. An array of textures is a “sampler2D array_of_textures[array_count];”. See the difference?

An array texture is a single texture object; that’s why each layer has the same size. An array of textures is an array of different texture objects. Each element of array_of_textures contains a different texture object. And therefore, each object can be of a different size.

With ShaderSampledImageArrayDynamicIndexing, you are allowed to use a dynamically uniform expression to index the array_of_textures array. But without that Vulkan feature, the index must be a constant expression.

My point was that Vulkan does not directly support bindless textures. But by using an array of textures with dynamic indexing, you get the same effect as bindless textures: the ability to pass an identifier into the system, which gets converted into a specific texture object. Therefore, the equivalent to OpenGL bindless texture residency in Vulkan is simply changing which images are in the descriptor binding for the array in the descriptor set.

It comes with the same limitations as the OpenGL feature: it has to use a dynamically uniform value. But it allows you to do in Vulkan what you would have done in OpenGL.
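
In Vulkan-flavored GLSL, that ends up looking roughly like the following (the set/binding numbers, the array size and the push-constant layout are all just illustrative):

    #version 450
    // An array of *distinct* textures (each may be a different size), indexed by a
    // dynamically uniform value. Requires the shaderSampledImageArrayDynamicIndexing feature.
    layout(set = 0, binding = 0) uniform sampler2D textures[256];

    layout(push_constant) uniform PerDraw { uint material_index; } pc;  // dynamically uniform per draw

    layout(location = 0) in  vec2 tcoord;
    layout(location = 0) out vec4 frag_color;

    void main()
    {
        frag_color = texture(textures[pc.material_index], tcoord);
    }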

If your “procedurally generated content” is generated on the CPU, then Vulkan is much better. You can allocate memory in whatever way you feel you need. You can ask directly how much memory is available, and allocate chunks of it that work best for your content generation system. Most importantly of all, you have direct knowledge and control of the memory architecture.

For example, in an embedded GPU, there is generally only one pool of memory. In such cases, your procedural generation algorithm will generate data directly into the location it will be read from. But for discrete GPUs, you’ll probably want to transfer it into device-local memory. That requires an explicit DMA operation. And you can create an appropriate dependency between that DMA and the rendering commands that use it.

Or maybe you don’t do that. Maybe the GPU can read vertex data from host-accessible memory directly, so you don’t need that DMA.

That’s a question you can actually ask in Vulkan.

OpenGL abstracts all of these details away, in the hopes that the implementation will do the right thing. There, you just persistent map a buffer; maybe it’ll be as efficient as doing the DMA. Maybe not. But how do you tell if the GPU even has multiple pools of memory? In OpenGL, you can’t.
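
In Vulkan it is a direct query. A minimal sketch (the helper function and its name are mine; the Vulkan calls and structure fields are the real API):

    #include <stdint.h>
    #include <vulkan/vulkan.h>

    /* Find a memory type that the resource allows (type_bits) and that has the
       properties we want, e.g. VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT. */
    uint32_t find_memory_type(VkPhysicalDevice gpu, uint32_t type_bits,
                              VkMemoryPropertyFlags wanted)
    {
        VkPhysicalDeviceMemoryProperties props;
        vkGetPhysicalDeviceMemoryProperties(gpu, &props);

        for (uint32_t i = 0; i < props.memoryTypeCount; ++i) {
            int allowed  = (type_bits & (1u << i)) != 0;
            int has_bits = (props.memoryTypes[i].propertyFlags & wanted) == wanted;
            if (allowed && has_bits)
                return i;
        }
        return UINT32_MAX;   /* no such pool: fall back to host-visible memory, do the DMA, etc. */
    }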

Vulkan allows you to adapt to the particulars of the hardware. Which means you can take advantage of those particulars.

If your “procedurally generated content” is generated on the GPU, that’s even better in Vulkan. You get all of the above advantages, plus improved synchronization support. Barriers in Vulkan are much more flexible than glMemoryBarrier in OpenGL.

And all of that ignores the fact that you can thread the construction of command buffers in Vulkan. Which you cannot do in OpenGL. Proper use of threading can dramatically improve CPU performance, thus allowing you to spend more time on CPU data generation.

No, you won’t. You will write one that is better for you and your needs. But you will not write one that is better or more broadly implementable for everyone.

I’m not sure why you care so much, or maybe don’t believe I’m aware of what you said. So let me just say it clearly once again. Per triangle might break tomorrow, or when the moon rises, or when the camera moves, or when the next driver is released, or when someone at nvidia reads this thread and real quick like tells the software guys to make sure it stops working, or when the day of the week divided by the phase of Venus rounded to the nearest percent is a prime number.

To be clear in advance, nobody will even think to blame you… since you warned me urgently, repeatedly, and in no uncertain terms.

Then there’s the other side of the coin. If you specify 0 to 4 images (texturemap and/or normalmap and/or conestepmap and/or othermap) in the shape object create function… those 0 to 4 values will be inserted into those fields in every vertex of the object. Which means, the application only has one choice for each of the 4 types of image objects: no image, or one image for all vertices. And thus, nothing can go wrong… unless we find glMultiDrawElements*() requires the images not change between the individual draws in the multidraw. In that case, my answer is “screw you, jerks” to the driver writers (followed by a bug report from me to the OpenGL folks to fix their documentation, which clearly (to me, anyway) strongly implies images can change between the individual draws). I will, however, turn tail and admit I’m the jerk if they point out to me why it must work that way (every draw must keep the same single set of bindless textures). Of course, based upon my (yes, much too limited) tests, there’s no need for them to be so restrictive.

Okay, sorry, got a bit over the top there. But my point was supposed to be this. The create functions for all shape objects will contain arguments to specify two image objects, which presumably (but not necessarily) will be a “texturemap” and a “normalmap” (where a one-channel “specularmap” can be put into the .w component of the “normalmap” if desired, though the specular power must then apply equally to all colors and all viewing angles, which isn’t a perfect solution but is often “good enough”). BUT, each image object contains an “imagetype” element that specifies which of the four types of image this image is, so those two images could be any two of the four.

Which means, the nominal functions and “use case” (hate that term, but value the concept) conform to what you advocate… I think… because the saybits that enable or disable each image are inserted into every vertex in each object, and the image identifiers (offsets into those bindless texture arrays in the shader) are also inserted into every vertex in each object.

However, the engine does provide access to the raw vertices (the object local-coordinate vertices). So, applications can change anything they want. They could change the position.xyz of all or any combination of vertices, the zenith.xyz, north.xyz, east.xyz surface vectors of all or any combination of vertices, the color.rgba of all or any combination of vertices, the tcoord.xy of all or any combination of vertices, the index value that locates transformation matrices in the SSBO (the index value being the integer object identifier), or… drum-roll… the saybits or four image indices that specify the four image objects to access (courtesy of “bindless textures”).
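
For illustration only, here is roughly what one such vertex might look like as a C struct. The field names follow my description above, but the exact types and packing are my guess here, not the engine’s actual declaration:

    #include <stdint.h>

    typedef struct Vertex {
        float    position[3];   /* local-coordinate position.xyz                          */
        float    zenith[3];     /* surface vectors: zenith, north, east                   */
        float    north[3];
        float    east[3];
        float    color[4];      /* color.rgba                                             */
        float    tcoord[2];     /* tcoord.xy                                              */
        uint32_t objid;         /* index of this object's transformation matrix in the SSBO */
        uint32_t saybits;       /* enable/disable bits, including the four image enables  */
        uint16_t image[4];      /* offsets into the bindless-texture arrays in the shader */
    } Vertex;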

I need to provide this general capability to avoid the need to provide 100 or 200 specialized functions to fiddle in 100 or 200 different ways. OTOH, the engine will offer 20 or 30 functions to support the most common operations, but seriously obscure or unusual or one-of-a-kind operations will have to be done by the application changing the vertices however they wish (with the responsibility and consequences falling squarely on the application by necessity).

In other words, “the engine cannot stop them”.

But that’s not the fully honest answer. Will I decide to offer a special-purpose function to make it easy and convenient for applications to selectively modify some but not all of the saybits or image identifiers? At the moment, I lean toward supplying such a function. Whether that function remains when release day arrives will not be known until much nearer release day. Quite possibly feedback from beta testers/adopters will help decide that question.

But… don’t worry, whatever happens, it won’t be your fault! :) Really!

PS: And if you’re correct that it doesn’t work on AMD GPUs… no support or comments in the documentation will encourage anyone to attempt “per-triangle images”… or whatever we’re [not] gonna call this capability… I mean travesty… when it doesn’t work on AMD!!! :)


BTW, I do understand your attitude! Especially when you invest so much of your personal time, effort, sweat, blood and tears to help us morons out here. Every time some moron or jerk [like me] tries to make something cool work, you and your fellow saints (not sarcasm) find a big juicy pile of doodoo on the floor that never had to happen. And indeed, in most cases that’s 100% true, because most people want their software to “run [virtually] everywhere” (perhaps absent 6+ year old GPUs… grudgingly). So I get it.


As for your interpretation of the “array texture” practicality question, I think your answer to my complaint was… pretty damn good. And actually, this reminds me about some ways I’m willing to be bold where some others might not be. I’m the kind of “jerk” who would at least be willing to say “okay, potential [non-paying in my case] customers, thou shalt make all your images (texturemaps, normalmaps, conestepmaps, specularmaps, everyotherkindofmaps) the same size and configuration, because we decided you’re better off with 10x better speed than infinite flexibility on image size and configuration”. Yup, that’s the kind of decision I’m willing to make for other people, which some others are not, and which royally torques off some people. But, from what you say, it may not be as infuriating as I assumed.

I was thinking along the lines of “but… but… but… this is an engine, not a single application, so imposing a requirement like that is not appropriate”. But you know what? After reading your message and thinking a bit longer about that, it would bother me less (and probably little enough) to impose such a requirement. Nonetheless, I’m still bleeding profusely from trauma received in the “bindless texture wars”, so for now anyway, I’ll stick with this. But I admit you’re right. And, in fact, if bindless textures didn’t exist (or didn’t/don’t work reliably), those array textures sound plausible again. Well, more like “the only remotely viable approach” if bindless textures go down the tubes for some reason.

Maybe this is my inclination because I’m not the artist type (by even the most remote stretch of imagination), and as a consequence am not fluent with many support tools. Pretty much all I know is gimp. Of course gimp has no problem making all images the same size, so maybe I should just shut up while I’m far behind on this one.

PS: I also was glad to read your comment that seems to say application developers are quite used to going to [moderate] extremes to comply with requirements imposed to make performance good. Hence your comment about “requiring rigid control over input data”.

I’m sure the authors of applications like engines (which are purposely written to support hundreds or thousands of other diverse applications) love to hear that they can make decisions, then say “my way or the highway”, and the poor sucker application writers meekly say “okay boss”, suck it up, and comply. And, in fact, for the reasons you explain, there is much sense in that… if the engine developers made wise decisions. So wish me wise decisions (and keep helping me make them). Okay, keep trying! :) Given that I’m the first and primary “customer” for this engine (as a subsystem in a larger project), and I need [as close as feasible to] maximum performance, I’m probably even more receptive to that attitude than some others.


My misunderstanding about the AZDO situation was probably caused mostly by my assumption that engine authors would be extremely unwilling to impose inconveniences on customers like making all their textures and normalmaps and othermaps the same size. Once I accept I was wrong to think that way, everything seems a lot more sensible. And yes, AZDO is most certainly not ONLY about textures, but switching textures is one of the major problems AZDO tries to deal with.

Just to be clear, I was never trying to place importance on “per-triangle rendering”; I was trying to place importance on “per-object rendering”. The reason should be quite clear too! The reason is that switching state like textures is a major reason applications slow down, and this almost always happens on a per-object basis, based upon the forum comments I read (and articles I read, and slide-decks I read, and GPU videos I watch). In other words, to say that “displaying different textures on an object-by-object basis” is some kind of “never been done before” seems wildly absurd to me. From what I can tell, this is as common as common can be (and has exactly nothing to do with switching textures on a triangle-by-triangle basis).

However, I suppose I can add a bit of color here…

In my engine the usual (and almost only) way to create non-trivial shapes/objects is to call functions that create fairly simple shapes/objects, then assemble them into more and more complex shapes/objects in a multi-step, multi-level hierarchical manner. When each shape object is created, it will normally be quite happy to have zero or one texturemap and/or normalmap… so very conventional. However, complex shape/objects are created by attaching shape/objects to existing shape/objects. When this is done, the attached object can rotate and/or translate along the natural axes of the attached shape/object… or the “parent” shape/object it is attached to.

When I first got this working it was great fun to build huge hierarchies of shapes with every branch articulating. They can be created in any order, but it quickly became apparent that first attaching the furthest-from-root objects to the shapes that will be the next level inward was more intuitive. The reason is, you first create the simple 2-, 3-, 4-component shapes/objects that make sense as “components” (say a gun turret), then you attach that to a huge rotating circular table shape/object (that will become a moving deck component of a mothership), then you attach the circular moving deck to the mothership… and so forth.

Anyway, it was great fun to have 3, 4, 5… even 8 or 10 level deep hierarchies with all the levels articulating (rotating and sliding back-and-forth) against their parent objects. So many of these contraptions are so cool, incredible and impressive.

BUT…

That misses the important point. A great many complex objects (in fact, almost ALL complex objects in most games) are rigid, fixed, never-moving objects. They need to be assembled too, especially in this “procedurally generated content” engine. All that works the same whether any object will articulate or not…

EXCEPT…

Once a large, complex shape/object like that is assembled, all the components that do not articulate can be fused into a single object.

In the past (before glMultiDrawElements*() at least), that was a huge win! Now only one transformation matrix need consume space or be computed each frame… if any. Now only one object need be drawn rather than dozens… EXCEPT for the nasty issue of the dozens of different image objects that act as textures and normalmaps for all those component objects!!!

If the batch must be broken to switch textures for each object (or piece of a shape/object if the shape/objects were fused into one shape/object)… then hands get thrown up and engine and application designers say “what’s the use… the performance sucks… and inherently must suck”.

Anyway, I’m just saying all these considerations are part of any “how to design” decision. If the engine and application get virtually zero benefit from fusing all those shape objects into a single shape object, then… what’s the point? Just leave them separate and set a single bit that says “don’t articulate”.


And finally, I’ll answer your last sentence in a somewhat funny, somewhat tongue-in-cheek way:

Alfonse: No, you won’t. You will write one that is better for you and your needs. But you will not write one that is better or more broadly implementable for everyone.

Me: Since you just convinced me the best and accepted way to write applications is to “make all the important decisions for everyone else and make them comply”, I feel provisionally justified in saying “yes I would”.

Of course, we’ll never know. But one factoid is certain. Unlike many people (apparently), the lower the level I work at, the faster, better, and easier I work. And now that I think about it, I don’t recall anyone ever complaining about the interface to the products I developed. Interesting, that (and interesting that I never thought about that before, probably cuz nobody ever complained).