
Thread: how to implement bindless textures efficiently ???

  1. #1
    Junior Member Regular Contributor
    Join Date
    Jan 2008
    Location
    phobos, mars
    Posts
    100

    Question how to implement bindless textures efficiently ???

    In the process of "fixing up" my 3D engine, after a lot of reading (and pestering helpful folks here a bit), I've chosen "bindless textures" as the basis of the new implementation of "images" in the engine. In this case, "image" refers to:

    - surfacemaps AKA normalmaps AKA bumpmaps
    - colormaps AKA texturemaps
    - displacementmaps
    - conestepmaps
    - specularmaps
    - heightmaps
    - etc...

    Also I imagine other built-in features of the engine will be implemented with bindless textures, but not through the generalized mechanism I'm discussing now. For example, to implement rendering the sky background I divided my billion-plus star catalog (a real catalog of real stars plus lots of information) into 1024x1024x6 "regions" that exactly correspond to the areas covered by the 1024x1024x6 pixels in a 1024x1024 cube-map (with 6 faces). I also imagine I'll need to implement a typical general-purpose "skybox" or whatever they're called these days to let people put a "far-away background image" into a 1024x1024x6 to 4096x4096x6 cubemap. But those will be handled separately in a custom way, not through the generalized "image" mechanism.

    The first question I have is this. Is there any way in OpenGL v4.6 core or an ARB extension (thus likely to become core... someday) to pack those u64 bindless texture handles into a UBO without half the space being wasted? It seems clear from what I've read so far that an array of u64 values is half empty, with only one u64 texture handle per 16-byte "location".

    My natural instinct was to create a structure like the following...

    struct u64vec2 {
    GLuint64 x;
    GLuint64 y;
    };

    ... then create an array of 256 or 1024 (or more) elements of that type in a UBO. That way the desired bindless texture handle (or apparently a "sampler"... I guess) would get accessed a bit strangely, but that's easy enough.

    But, I don't see where OpenGL (or GLSL) specifies 2-element u64 vectors. Of course, that's probably because I'm missing some ARB extension or other, right? At least I hope so (if not already core).

    I guess GLSL doesn't necessarily need to recognize that type for my purposes, since each element will look like a sampler to GLSL... I think.

    I don't much care how the u64 bindless texture handles get packed tight... I just want to find some way. Does some way exist?

    ----------

    The following is a little "color" to explain what is my plan... in case there is some fatal flaw in my plan.

    My engine primarily creates "shape" objects AKA "shapes" procedurally. Which means, all cameras, all lights, all [shape] objects that are "drawn" or "rendered" are created by creating simple [shape] objects, then assembling them into larger, more complex, rigid and/or hierarchical [articulating] shapes. The simplest shapes are fairly standard, namely camera, light, points, lines, face3, face4, faces, disk, grid, grix, mesh, peak, cone, tube, pipe, torus, ball, globe, sphere, ring, bridge3, bridge4, bridges, and so forth.

    The create functions for all shape objects include arguments to allow each shape to be customized. For example, the create functions for almost all shapes contain the following arguments:

    - sides
    - side_first
    - side_count
    - levels
    - level_first
    - level_count

    ... and where it makes sense, also...

    - twist
    - taper
    - and so forth.

    ... as well as obvious arguments like options (up to 64 option bits), color (default color for all vertices), etc.

    The objid (object identifier) of up to four image objects [plus other kinds of objects] can be specified in the shape object create functions.

    Each of those four image objects gets turned into a u08 integer, and the four of those u08 values are delivered to the shader as a single u32 integer vertex-attribute, to be broken into up to four fields by the shader (as specified by 4~6 other bits in another u32 vertex-attribute).
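    Just to illustrate what I mean by that packing (the byte order and names below are only placeholders, not necessarily what the engine ends up doing, with imap0..imap3 being the four 8-bit indices):

    Code :
    /* C side sketch: pack four 8-bit image indices into one 32-bit vertex attribute.
       Putting imap0 in the low byte is just an assumption for illustration. */
    GLuint packed_ids = (GLuint)imap0
                      | ((GLuint)imap1 << 8)
                      | ((GLuint)imap2 << 16)
                      | ((GLuint)imap3 << 24);

    Code :
    // GLSL fragment shader side sketch: the value arrives (via a flat vertex-shader
    // output) as a uint and is split back into its four 8-bit fields.
    flat in uint image_ids;

    void main()
    {
        uint imap0 = (image_ids      ) & 0xFFu;
        uint imap1 = (image_ids >>  8) & 0xFFu;
        uint imap2 = (image_ids >> 16) & 0xFFu;
        uint imap3 = (image_ids >> 24) & 0xFFu;
        // ... each imapN then indexes into one of the four UBO sampler arrays ...
    }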

    Each of these [nominally] 8-bit image identifiers is not the same as the objid of the image in the engine. Instead, the 8-bit value is an index into one of four UBO blocks which contain samplers backed by u64 bindless texture handles (as I understand this mechanism so far). By default with the current "uber-shaders" (and nominally), each of the four UBO blocks serves a different purpose, typically:

    UBO #0 == colormaps AKA texturemaps
    UBO #1 == surfacemaps AKA normalmaps AKA bumpmaps
    UBO #2 == conestepmaps + heightmaps
    UBO #3 == specularmaps or othermaps

    At least that's what happens right now (though monochrome specularmaps can be taken from the A channel of conventional colormaps if the A channel is not needed to specify transparency). Frankly, I'm fighting the battle between offering too much flexibility to accommodate existing textures versus simplicity, but that's a side issue.

    The important point is this. Each set of shaders (each "program object") can specify what it does with the images in each of the four UBO blocks. That's just a specification of each "program object" that can be specified by anyone who writes shaders for the engine (though they are expected to keep the same associations as the standard default "uber-shader program" for any image that serves one of the standard functions). So, if someone writes shaders that support a "displacement map" (but not a "conestepmap", obviously), they would put their "displacement map" in UBO #2.

    So, to return to the basics.

    When a shape object is created, the objid of up to 4 images is specified.

    When each image object is created, its type is specified as an argument. A number of constants like the following are defined to specify image type:

    IG_IMAGE_TYPE_IMAP0, IG_IMAGE_TYPE_COLORMAP, IG_IMAGE_TYPE_TEXTUREMAP --- 3 intuitive names for one type == UBO #0
    IG_IMAGE_TYPE_IMAP1, IG_IMAGE_TYPE_SURFACEMAP, IG_IMAGE_TYPE_NORMALMAP, IG_IMAGE_TYPE_BUMPMAP --- 4 intuitive names for one type == UBO #1
    IG_IMAGE_TYPE_IMAP2, IG_IMAGE_TYPE_CONESTEPMAP, IG_IMAGE_TYPE_HEIGHTMAP --- 3 intuitive names for one type == UBO #2
    IG_IMAGE_TYPE_IMAP3, IG_IMAGE_TYPE_SPECULARMAP, IG_IMAGE_TYPE_OTHERMAP --- 3 intuitive names for one type == UBO #3

    As shown above, the type each image object is given when created determines which UBO that image object is stored into.

    When each shape object is created, zero or one image object can be specified for each purpose. Another way to say this is, each shape object can specify the image objid of zero or one image object in each of those four image UBO blocks. The engine converts the image objid to the appropriate 8-bit index into the sampler arrays in each UBO in order to access the specified image objects for the desired purposes.

    The reason for this scheme (as opposed to a single UBO block) is to increase the number of simultaneously accessible image objects to 1024 rather than 256. Since the engine does all the complex work under the covers, the burden on the application is minimal (just say what is the purpose of each image object when it is created).

    The other thing to remember is the following. What happens to the zero to four objid of image objects passed in the shape object create functions? Each objid is converted into the appropriate 8-bit index and put into the appropriate field of that 32-bit vertex-attribute (based on the purpose == type of each image object), so that 32-bit vertex-attribute contains the same value in every vertex in the shape object. Which means, the exact same texturemap, normalmap, conemap, specularmap, heightmap, othermap will be processed for every vertex.... leading to uniform and consistent appearance throughout the shape object.

    That happens automatically, by default.

    However, after shape objects are created, the application running on the engine can call shape object modification functions to change any combination of vertex attributes in any combination of shape object vertices.

    And so, even the simple shape objects I mentioned can be radically customized.

    A couple background facts. First, the "provoking vertex" is set to the first vertex in each triangle in this engine. Which means, 2/3 of those two 32-bit integer vertex-attributes never make it to the fragment shader... only the first vertex in each triangle gets to the fragment shader. Which is good. Of course these two 32-bit integer attributes have flat modifiers in the shaders, so they are constant across each triangle.
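    (In code terms, that convention is just one GL call; OpenGL's default is the last vertex:)

    Code :
    /* C: use the FIRST vertex of each triangle as the provoking vertex
       (OpenGL's default convention is GL_LAST_VERTEX_CONVENTION). */
    glProvokingVertex(GL_FIRST_VERTEX_CONVENTION);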

    Second, as implied above, almost all shape objects are created out of an arbitrary number of levels and sides. What does this mean?

    For example, consider a simple "faces" object. As you might expect, a faces object is just a flat, round disk (but not identical to the disk object in an important way, as you'll learn shortly).

    Let's say an application program calls the ig_shape_create_faces() function and specifies the following values for the following arguments:
    - levels = 3
    - level_first = 1
    - level_count = 2
    - sides = 6
    - side_first = 1
    - side_count = 4


    By default, as with all shape objects, the radius of the surface of the faces object is 1 meter (but can be scaled, sheared or otherwise modified at any time). Since the create function specified this faces object is composed of levels == 3, with level_first == 1 (not the default == 0) the surface of this faces object has a 1/3 meter radius hole in its center (like a hole in the center of a round disk).

    Since the create function specified this faces object has sides == 6, the nominally round outer edge actually has only 6 sides... a hexagon. But side_first == 1 and side_count == 4, so only 4 of those 6 sides physically exist. Errrr... I mean "graphically exist", meaning only those graphical surfaces exist (contain vertices).

    And so, this faces object is 2 meters diameter, has a 2/3 meter hole in the center, and only 4 of the 6 sides exist.

    One important point of the last several paragraphs is the following. The first (provoking) vertex of every triangle starts on the lowest level that triangle touches (every triangle spans between two adjacent levels) and also starts on the lowest side that triangle touches. This is important because this lets the engine provide a way to specify different image objects for vertices on a given range of levels and/or sides. So, for example, after an application creates any kind of object, the application can easily tell the engine to "set 1, 2, 3 or 4 of the image objects on any range of levels to be new image objects" and/or "set 1, 2, 3, or 4 of the image objects on any range of sides to be new image objects". Okay, I'm no artist, so that probably sounds boring. But to give just one example of what that means, those two function calls could change the texturemap and normalmap of some middle level or levels from the default "cobblestones" to "grass" or "bricks", and some arbitrary side or sides from the default "cobblestones" to "fancy artistic tiles". Okay, I'm really not an artist. But you get the point.

    Also, everything above leads to the following point. I mentioned many of the simple shape objects, each of which is radically configurable during the create process, and later on by simple function calls (by 3D scaling, 3D shearing, 3D twisting, 1~4D randomization of attributes, and an open-ended number of additional procedural processes).

    But the engine is designed to support vastly complex procedurally generated shapes too. For example, the create functions for simple shapes like "cup" attach, bond or fuse 2 to a few of the simplest shapes together, while create functions for super-complex shapes like "spacecraft" or "planet" or "galaxy" attach, bond or fuse tens, hundreds, thousands or more simple [and complex but simpler] shapes together.

    The point is, when objects are attached, bonded or fused together, all the configuration performed on the individual elements is retained. And yet, individual aspects can be changed in simple, coherent, intuitive ways. For example, the image object that contains those "fancy artistic tiles" can easily be replaced by any other texture to display something else.

    Which leads to the following. Each create function (as well as modification functions) for super-complex objects like "house", "office", "spacecraft", "planet" or pretty much anything else can generate an astronomical number and variety of shape objects of that general kind (house, office, spacecraft, etc). The relative dimensions of pretty much any subsection can easily be configured or changed, the appearance of any [natural] portion of any surface can be changed, and just about every aspect of that "kind" of shape can be changed. Of course the create function (and supporting modify functions for that shape) can offer as many or as few opportunities for configuration during shape creation... and/or later.

    ----------

    Anyway, that's what I'm trying to achieve, and this scheme I described up near the beginning of this message is my attempt to implement some of these features and capabilities.

    I wanted to implement a scheme with array textures, but they just don't seem flexible enough. Unless all textures [of a given purpose] are the same size and work effectively with all the same sampler configuration, I don't see any plausible way to make those four 8-bit imageid fields specify a rich enough variety of images for all the various purposes.

    Maybe some guru out there sees a way to make that work. I don't.

    In contrast, the bindless textures do seem flexible enough, since (as I understand this), every texture can be a different size and configured differently.

    But what do I know? Only enough to get the basic texturemap and normalmap stuff working (displaying on object surfaces). But they couldn't even be selected before now... only the one texturemap and normalmap would display.

    Anyway, if anyone is a serious image/texture guru... especially bindless textures or via some other fancy texture-works tricks... I'm all ears. I mean eyes. Post your crazy ideas below. Thanks.

    PS: And sorry my message is so long. I thought it might provide sufficient context to spur good ideas, or prevent waste of time to post ideas that just aren't flexible enough.

    PS: I've gone to a huge amount of effort to be able to render many shape objects in each call of glDrawElements() or similar. That's another crucial reason for making so many image objects accessible simultaneously.

    PS: As per my usual practice, I "design forward". Which means, I'm open to any approach that is likely to be supported by most high-end cards from at least AMD and nvidia two years from now. My development rigs are Ryzen and Threadripper with nvidia GTX1080TI GPUs, 64-bit Linux Mint (later maybe windoze). These can be considered "absolute minimum required hardware", since 2-ish years from now most folks running sophisticated 3D simulation/physics/game applications will have better. I do prefer to avoid non-ARB extensions, but I'll consider anything likely to be supported by high-end AMD and nvidia GPUs (must be both, not just one or the other brand).

    PS: Almost certainly the engine will be released as open-source. I don't consider 3D engines a viable commercial market (not for me, anyway). For me, this engine is just one subsystem in a larger project.

    Thanks!
    Last edited by bootstrap; 09-05-2017 at 10:48 PM.

  2. #2
    Member Regular Contributor
    Join Date
    May 2016
    Posts
    443
    maybe you'll find this useful:
    http://on-demand.gputechconf.com/gtc...techniques.mp4

    a good technique is to store materials in a buffer object and reference them, use bindless textures, merge as many draw calls as possible, make use of indirect rendering, and sort everything by renderstate, so that you only have to set each renderstate once (per frame).

  3. #3
    Senior Member OpenGL Lord
    Join Date
    May 2009
    Posts
    5,932
    It seems clear from what I read so far that an array of u64 values is half empty, with only one u64 texture handle per 16-byte "location"
    Looking at the ARB_bindless_texture specification, it clearly says that `sampler` and `image` variables can be members of a UBO. It even says how their data converts to memory.

    But it does not say what the size, base alignment, or array stride of them is for std140 layout. And the pathetic example of using them in a UBO doesn't explain its behavior either.

    So if you want an in-spec way to pass such values, you'll have to pass them as `uint`s. And since there's a conversion from a `uvec2` to `sampler/image` types, that should be pretty easy. Simply pass an array of `uvec4`s, where each array element is two 64-bit integers. So if you want the `i`th element of the array, you return `arr[i/2].xy` or `.zw`, depending on whether `i` is even or odd. Or you can index the vector to get the value:

    Code :
    uvec4 val = arr[i/2];
    uint ix = (i % 2) * 2;
    uvec2 handle = uvec2(val[ix], val[ix + 1]);
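    For illustration, here's a minimal fragment-shader sketch of the whole idea (the block name, binding point and the index variable are placeholders, not anything from your engine):

    Code :
    #version 450
    #extension GL_ARB_bindless_texture : require

    // 256 64-bit handles packed two-per-uvec4; under std140 a uvec4 array has a
    // 16-byte stride, so nothing is wasted.
    layout(std140, binding = 0) uniform ColorMaps {
        uvec4 color_handles[128];
    };

    in vec2 uv;
    flat in uint image_index;    // which of the 256 handles to use
    out vec4 frag_color;

    void main()
    {
        uvec4 val     = color_handles[image_index / 2u];
        uvec2 handle  = (image_index % 2u == 0u) ? val.xy : val.zw;
        sampler2D tex = sampler2D(handle);   // uvec2 -> sampler, per the extension
        frag_color    = texture(tex, uv);
    }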

  4. #4
    Junior Member Regular Contributor
    Join Date
    Jan 2008
    Location
    phobos, mars
    Posts
    100
    Quote Originally Posted by Alfonse Reinheart View Post
    Looking at the ARB_bindless_texture specification, it clearly says that `sampler` and `image` variables can be members of a UBO. It even says how their data converts to memory.

    But it does not say what the size, base alignment, or array stride of them is for std140 layout. And the pathetic example of using them in a UBO doesn't explain its behavior either.

    So if you want an in-spec way to pass such values, you'll have to pass them as `uint`s. And since there's a conversion from a `uvec2` to `sampler/image` types, that should be pretty easy. Simply pass an array of `uvec4`s, where each array element is two 64-bit integers. So if you want the `i`th element of the array, you return `arr[i/2].xy` or `.zw`, depending on whether `i` is even or odd. Or you can index the vector to get the value:

    Code :
    uvec4 val = arr[i/2];
    uint ix = (i % 2) * 2;
    uvec2 handle = uvec2(val[ix], val[ix + 1]);

    What I also worry about is how those u64 integer bindless texture handles get turned into samplers in GLSL. We all know we can put a u64 integer into two u32 integers on a CPU, then put them back together again to recreate the original u64. Piece of cake. But somehow we're supposed to put those u64 integers that are (or somehow "represent") bindless texture handles into a UBO... and assume OpenGL or GLSL or some magic force will know how to turn them into the samplers declared in the GLSL shader programs.

    I am fairly sure I remember reading that arrays of ANY type in UBOs end up mapping one value to one 16-byte layout location. That's really gross, and probably lots of people find that revolting (except, I suppose, arrays of vec4 or dvec2 values, which fit nicely).

    I'm pretty sure I read this in edition 7 of OpenGL SuperBible. That's where I got the idea that each u64 AKA GLuint64 will consume half of each 16-byte location (thereby wasting half the space in the UBO buffer object). Maybe I shouldn't moan, since arrays of s32 or u32 values will waste 3/4 of the space, arrays of s16 or u16 values will waste 7/8 of the space, and arrays of s08 or u08 will waste 15/16 of the space in UBO buffers.

    If I understand your code, I worry about the fact that somehow OpenGL or GLSL have to "map" or "convert" or otherwise associate the u64 AKA GLuint64 value to a sampler. Would OpenGL or GLSL be able to understand that your uvec2 could be turned into a bindless texture sampler? Maybe. But my wild guess would be "no".

    I also read about a shared layout alternative to std140 in the OpenGL SuperBible, but... it said absolutely nothing about how that layout scheme works.

    Maybe I shouldn't worry so much about this issue. After all, at most I'll have 4 * 4KB UBO for bindless textures (or at most 4 * 64KB UBO). In the grand scheme of things, that's not much for high-end GPU cards, which this engine caters to. Compare that to my need to have maybe one but more likely two f32mat4x4 for every shape object (transformation matrices). That's 64-bytes or 128-bytes each shape object, and some simulations/games could have thousands or even tens of thousands of shape objects. 10,000 shape objects * 128-bytes each for transformation matrices == 1.28GB... which makes 4 * 4KB or 4 * 64KB appear laughably trivial.

    Sometimes striving for efficiency gets me to forget the bigger picture temporarily. However, saying that made me remember why I worried about this... not so much memory waste, but CACHE MISSES. When every other 8 bytes is empty, the number of cache misses might soar. Since I don't know how wide or deep GPU caches are, I don't know how serious this is. Anyone know?

    Which raises another question I will create another message to ask... but I'll mention it here in case you know the answer.

    -----

    A UBO can only hold 1024 transformation matrices (without screwball stunts that aren't worth the hassle or the smallish gains, in my opinion). Which means, when an application has 10,000 or so shape objects, the engine would need 2 * 10 == 20 UBOs to hold the transformation matrices. Plus, the code would need to play some funky games to assure it only renders objects with the same mod 1024 shape object integer identifier.

    What will be vastly easier, obviously, is to put all transformation matrices into a single SSBO. This SSBO can be marked "write-only" from the perspective of the engine on the CPU (meaning the CPU would write but never read its contents), and marked "read-only" from the perspective of GLSL programs (which would never write to those SSBO buffers).
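    For what it's worth, a minimal sketch of what I have in mind on the GLSL side (std430 packs mat4 tightly; the block and variable names are just placeholders):

    Code :
    // GLSL: one large read-only SSBO holding every shape object's matrix.
    // Bound on the CPU side with glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, ssbo).
    layout(std430, binding = 1) readonly buffer ShapeTransforms {
        mat4 local_to_clip[];        // unsized array: one mat4 per shape object
    };

    // ... later, in the vertex shader ...
    // mat4 m = local_to_clip[shape_id];   // shape_id taken from a vertex attribute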

    The question is... on high-end GPUs, will an SSBO that is configured in such a simplistic manner (basically, like a huge UBO) run as fast or nearly as fast as a UBO ???

    I hope so, because that sure would simplify the transformation matrix implementation in a few ways.
    Last edited by bootstrap; 09-06-2017 at 08:09 PM.

  5. #5
    Junior Member Regular Contributor
    Join Date
    Jan 2008
    Location
    phobos, mars
    Posts
    100
    Quote Originally Posted by john_connor View Post
    maybe you'll find this useful:
    http://on-demand.gputechconf.com/gtc...techniques.mp4

    a good technique is to store materials in a buffer object and reference them, use bindless textures, merge as many draw calls as possible, make use of indirect rendering, and sort everything by renderstate, so that you only have to set each renderstate once (per frame).

    Thanks for the link. I'll view and listen to the presentation. I haven't listened yet, but this might be similar to an "OpenGL AZDO" presentation I found recently, which was also helpful.

    As you no doubt noticed, a lot of effort has gone into the engine to assure a large if not huge number of shape objects are rendered by each gl*DrawElements*() function call. After I view, listen and think for a while, I may post a reply with further questions.

    Thanks!

  6. #6
    Senior Member OpenGL Lord
    Join Date
    May 2009
    Posts
    5,932
    Quote Originally Posted by bootstrap View Post
    I am fairly sure I remember reading that arrays of ANY type in UBOs end up mapping one value to one 16-byte layout location. That's really gross, and probably lots of people find that revolting (except, I suppose, arrays of vec4 or dvec2 values, which fit nicely).
    That's why I said "pass an array of `uvec4`s".

    Quote Originally Posted by bootstrap View Post
    Would OpenGL or GLSL be able to understand that your uvec2 could be turned into a bindless texture sampler? Maybe. But my wild guess would be "no".
    Or, you know, you could read the specification, where it clearly says "yes":

    Quote Originally Posted by The Specification
    Code :
          // In the following four constructors, the low 32 bits of the sampler
          // type correspond to the .x component of the uvec2 and the high 32 bits
          // correspond to the .y component.
          uvec2(any sampler type)     // Converts a sampler type to a
                                      //   pair of 32-bit unsigned integers
          any sampler type(uvec2)     // Converts a pair of 32-bit unsigned integers to
                                      //   a sampler type
          uvec2(any image type)       // Converts an image type to a
                                      //   pair of 32-bit unsigned integers
          any image type(uvec2)       // Converts a pair of 32-bit unsigned integers to
                                      //   an image type
    So, what's the concern?

  7. #7
    Member Regular Contributor
    Join Date
    May 2016
    Posts
    443
    A UBO can only hold 1024 transformation matrices (without screwball stunts that aren't worth the hassle or smallish gains in my opinion). Which means, when application have 10,000 or so shape objects, the engine would need 2 * 10 == 20 UBOs to hold the transformation matrices.
    in OpenGL 4.5 the UBO size is at least 16 KB; divided by 16 floats per mat4 and 4 bytes per float, you can store (at least) 256 mat4's in a uniform block.

    with SSBO's you could fill (maybe?) the whole GPU memory with mat4's

    but if you put the matrices into a GL_ARRAY_BUFFER, then you can send a mat4 (or more) per instance with "instanced rendering", that way you can send many more to the vertex shader
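    for example, the per-instance mat4 setup could look like this (location 4 and the names are just placeholders; a mat4 attribute takes 4 consecutive locations, one vec4 column each):

    Code :
    /* C side: one mat4 per instance from a GL_ARRAY_BUFFER.  Each column is a
       separate vec4 attribute with divisor 1, so the whole matrix advances once
       per instance instead of once per vertex. */
    glBindBuffer(GL_ARRAY_BUFFER, instance_vbo);
    for (int i = 0; i < 4; ++i) {
        glEnableVertexAttribArray(4 + i);
        glVertexAttribPointer(4 + i, 4, GL_FLOAT, GL_FALSE,
                              (GLsizei)(16 * sizeof(float)),
                              (const void*)(sizeof(float) * 4 * i));
        glVertexAttribDivisor(4 + i, 1);   // advance once per instance
    }

    Code :
    // GLSL vertex shader side: the matrix just shows up as an attribute
    // (occupying locations 4..7), already selected per instance via baseInstance.
    layout(location = 4) in mat4 instance_matrix;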

    consider glMultiDrawElementsIndirect(...), you fill a struct like this to render your meshes:
    Code :
        typedef  struct {
            uint  count;
            uint  instanceCount;
            uint  firstIndex;
            uint  baseVertex;
            uint  baseInstance;
        } DrawElementsIndirectCommand;

    baseVertex and firstIndex are determined by the location of the mesh in your vertex / index buffer, count says how many indices the mesh needs to draw, with baseInstance you specify from where your VAO pulls the instanced data (ModelView and ModelViewProjection matrices for example), and you set instanceCount to 1 to draw the mesh (or 0 to skip it)

    example a triangle:
    vertices
    (0, 0, 0)
    (1, 0, 0)
    (0, 1, 0)

    indices
    0, 1, 2

    let's say the vertices have X other vertices located before them in the buffer (the vertices of other meshes), let's say Y indices are located before them in the index buffer, and let's say M instances are located before the triangle's MV and MVP matrices in the instanced buffer.

    Code :
        DrawElementsIndirectCommand cmd = {
            3,  // count: the triangle needs 3 indices
            1,  // instanceCount: draw 1 instance of it (or 0 to skip)
            Y,  // firstIndex: the triangle's index offset in the IBO
            X,  // baseVertex: the triangle's vertex offset in the VBO
            M   // baseInstance: the mesh's offset in the instance buffer
        };

    put that data into a GL_DRAW_INDIRECT_BUFFER, bind that buffer, and call:
    Code :
    glMultiDrawElementsIndirect(
       GL_TRIANGLES,     // to draw triangles
       GL_UNSIGNED_INT,  // IBO data type
       (const void*)(sizeof(DrawElementsIndirectCommand) * 0), // byte offset of the draw command in the indirect buffer
       1,                // issue 1 drawcall
       0                 // stride 0: the commands in GL_DRAW_INDIRECT_BUFFER are tightly packed, no padding
    );

    that's how you reduce the scene to 1 draw call (per renderstate and the mesh's primitive type)
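    and for the whole scene, a sketch assuming "commands" is a CPU-side array of mesh_count filled DrawElementsIndirectCommand structs (names are placeholders):

    Code :
    /* C: upload all commands into one GL_DRAW_INDIRECT_BUFFER and issue them
       with a single call; drawcount = mesh_count, stride = 0 (tightly packed). */
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirect_buffer);
    glBufferData(GL_DRAW_INDIRECT_BUFFER,
                 sizeof(DrawElementsIndirectCommand) * mesh_count,
                 commands, GL_DYNAMIC_DRAW);
    glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT,
                                (const void*)0, mesh_count, 0);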
    Last edited by john_connor; 09-07-2017 at 04:18 AM.

  8. #8
    Junior Member Regular Contributor
    Join Date
    Jan 2008
    Location
    phobos, mars
    Posts
    100
    Quote Originally Posted by john_connor View Post
    in OpenGL 4.5 the UBO size is at least 16 KB; divided by 16 floats per mat4 and 4 bytes per float, you can store (at least) 256 mat4's in a uniform block.

    with SSBO's you could fill (maybe?) the whole GPU memory with mat4's

    but if you put the matrices into a GL_ARRAY_BUFFER, then you can send a mat4 (or more) per instance with "instanced rendering", that way you can send many more to the vertex shader

    consider glMultiDrawElementsIndirect(...), you fill a struct like this to render your meshes:
    Code :
        typedef  struct {
            uint  count;
            uint  instanceCount;
            uint  firstIndex;
            uint  baseVertex;
            uint  baseInstance;
        } DrawElementsIndirectCommand;

    baseVertex and firstIndex are determined by the location of the mesh in your vertex / index buffer, count says how many indices the mesh needs to draw, with baseInstance you specify from where your VAO pulls the instanced data (ModelView and ModelViewProjection matrices for example), and you set instanceCount to 1 to draw the mesh (or 0 to skip it)

    example a triangle:
    vertices
    (0, 0, 0)
    (1, 0, 0)
    (0, 1, 0)

    indices
    0, 1, 2

    let's say the vertices have X other vertices located before them in the buffer (the vertices of other meshes), let's say Y indices are located before them in the index buffer, and let's say M instances are located before the triangle's MV and MVP matrices in the instanced buffer.

    Code :
        DrawElementsIndirectCommand cmd = {
            3,  // count: the triangle needs 3 indices
            1,  // instanceCount: draw 1 instance of it (or 0 to skip)
            Y,  // firstIndex: the triangle's index offset in the IBO
            X,  // baseVertex: the triangle's vertex offset in the VBO
            M   // baseInstance: the mesh's offset in the instance buffer
        };

    put that data into a GL_DRAW_INDIRECT_BUFFER, bind that buffer, and call:
    Code :
    glMultiDrawElementsIndirect(
       GL_TRIANGLES,     // to draw triangles
       GL_UNSIGNED_INT,  // IBO data type
       (const void*)(sizeof(DrawElementsIndirectCommand) * 0), // byte offset of the draw command in the indirect buffer
       1,                // issue 1 drawcall
       0                 // stride 0: the commands in GL_DRAW_INDIRECT_BUFFER are tightly packed, no padding
    );

    that's how you reduce the scene to 1 draw call (per renderstate and the mesh's primitive type)

    Yeah, ever since I noticed the fancy draw functions like glMultiDrawElementsIndirect(), it has seemed like "this might help do what I want... somehow".

    Your message clarifies a few things that I could not understand previously. Let me state what those were, and you can correct any of them that are wrong.

    #0: I assume count is the number of indices in the IBO specified in the active VAO to process to draw the desired object. Assuming the primitives to be drawn are triangles (not points or lines), the count value must always be a multiple of 3 to be valid and make sense.

    #1: the value firstIndex is not a byte-offset, but instead is how many indices are in the IBO before the index we want this draw to start with, regardless of whether indices are 1-byte, 2-byte or 4-byte values. Previously I had the impression this value was a byte-offset into the IBO.

    #2: the value baseVertex is not a byte-offset, but instead is how many vertices are in the VBO before the vertex we want this draw to start with, regardless of how many bytes in each vertex-structure (in AoS) or in each vertex-element (in SoA). Previously I had the impression this was a byte-offset into the VBO. But also, the name baseVertex confuses me, and always confused me. Why not call this firstVertex to match the firstIndex name? Don't both of these have the same meaning... the nth element in the IBO and VBO respectively? This difference in naming always implied to me that "something strange or different is going on with these two variables". Was I wrong?

    I'm even more confused about the other variables.

    #3: What does instanceCount do or control? It sounds like you're saying instanceCount should contain 0 to "not draw this object at all == 0 instances", and instanceCount should contain 1 to "draw one instance of this object". I guess the instanceCount == 0 case is helpful because the software can then cull objects outside the frustum simply by writing 0 into this one location. But I guess I wonder what happens if instanceCount > 1. I assume this will make the GPU process those exact same count indices the number of times specified by the instanceCount variable. But what happens in the shaders (or elsewhere) to distinguish the different instances? A built-in gl_instance variable? I vaguely recall seeing something like that somewhere.

    #4: What does baseInstance do or control? By way of example, if baseInstance == 16, does this mean that the built-in gl_instance variable in the shader (assuming there is such a built-in variable) contains the integer value 16 ??? From your message, it sounds like you're saying the appropriate f32mat4x4 local-to-world matrix for this object would be element 16 in an array of f32mat4x4 matrices in one SSBO, and the appropriate f32mat4x4 local-to-world-to-view-to-projection matrix for this object would be element 16 in an array of f32mat4x4 matrices in another SSBO. Of course such matrices will pack tight and solid in pretty much any context (since their size is 64 bytes and they will always sit on 64-byte address boundaries).

    But if what I infer in #4 immediately above is correct, what happens when instanceCount > 1 ??? If I want to call a draw function that recognizes instances, and potentially draws many or hundreds or thousands of instances, I would want the same exact matrices (at matrix array element 16) to be sent to the shaders in both of these cases. At least that's what I would want. Yes, I would want some other number (perhaps this possibly fictitious gl_instance variable I conjured up from my imagination or memory) to change every instance, and I'd index into something else (somewhere based upon the objid == object identifier, which is in every vertex) to tell the shader how to alter the position and/or orientation of each instance given gl_instance (or equivalent).

    I worry that maybe this structure (or functions that access them) increment that baseInstance value every instance in some way, though I don't know how. Anyway, I'm just confused because I'm not certain exactly what they (and others) might be trying to achieve with a draw call designed for instancing. I'm sorta half guessing that you are applying a draw call designed for instancing to do what my situation requires, but not what the structure and related functions are designed for... seeing that you only put 0 or 1 into the instanceCount variable. And so, my general sense of "not knowing what OpenGL or GLSL is doing" is blaring in the back of my head.

    Can you explain a bit further?

    But it does sound like you've got a good approach here.

    #5: Oh, I just noticed the sizeof(DrawElementsIndirectCommand) * 0 argument to glMultiDrawElementsIndirect(). What is that and why zero? Sometimes it looks like you're drawing one triangle, and sometimes you mention drawing the entire scene with one draw call. So maybe I'm getting those two cases mixed up.

    Thing is, I've done a lot in my engine to get to the point where drawing an entire scene (everything in the frame) in one draw call is actually not so absurd an idea. My nominal shader is an "uber-shader" that supports a moderately large vertex structure that contains integer fields and control bits that let every triangle specify the type of lighting (emissive, many-lights, etc), and specify up to four images for purposes that include conventional texturemap application, conventional surfacemap/normalmap tweaking of the surface vectors to generate fake geometry via lighting, conestepmaps for generating somewhat less-fake "perspective geometry" via parallax... and more. Given those bits and fields and the whole u64 bindless texture handle array stuff, every vertex can control its own fate in great detail.

    Fact is, I can't literally go "all the way" to one draw call, because the engine will also offer rendering of real, amazingly rich, highly configurable starfields based upon a technique that I figured out (but will no doubt need to hassle gurus like you to figure out how to make OpenGL do every trick I need). For reasons that probably seem counter-intuitive to you at first, the star background must be rendered last, not first. The main reason is not because it would be a huge waste to process through large parts of the catalog of more than 1 billion stars that I created (all real stars and extensive information like spectral-type, distance (via parallax measures), proper-motion and more) for portions of the sky covered by other objects... though that reason alone might be sufficient.

    There is also the question of shadows. Ugh. I don't look forward to getting to that part. The engine supports many cameras in the environment, all at different locations and pointing in different directions, each of which can be rendering onto a texture that will be pasted onto a rectangle that is (or lies over) the face of a display monitor that is also [potentially] visible in the scene (plus the similar case of mirrors and [partially] reflective surfaces like floors). The fallout of that is... how do you know where you need to capture information for shadow maps when... cameras are pointing in so many directions (each with a different field of view)? My off-the-top-of-my-head-but-I-bet-I'm-right guess is... the shadowmap will need to be a cubemap. That, plus who knows how many more [very simple and fast] passes will need to occur to fill that shadow cubemap with all the information to generate appropriate shadows in the rendering of all cameras? Aarg... I don't want to think about this yet, cuz if I did, I might quit this project entirely! :-(

    Nonetheless, it is important to render as much as possible in each draw call, and I'm going far out of my way to do so.

    Thanks in advance for your message and your next reply.
    Last edited by bootstrap; 09-08-2017 at 01:00 PM.

  9. #9
    Junior Member Regular Contributor
    Join Date
    Jan 2008
    Location
    phobos, mars
    Posts
    100
    My concern is... that I still don't fully understand this stuff.

    For example, does this mean I can put fully packed C-style u32vec2 arrays into CPU memory, then transfer that array with glBufferSubData() into the UBO that backs the four sampler arrays in my fragment shaders? Sure, C is happy packing u32vec2 tightly without waste, but just because I put a packed array of those u32vec2 vectors into a UBO... doesn't mean shaders can access every u32vec2.xy and convert it to a sampler. In fact, nothing you posted in your message convinces me that the shader won't skip every other u32vec2.xy in the uniform block that UBO backs.

    However, your message does imply one possible way to achieve this result. And that is to make that array of u32vec2 into an array of u32vec4, with every even numbered u64 bindless texture handle split and jammed into the u32vec4.xy elements, and every odd numbered u64 bindless texture handle split and jammed into the u32vec4.zw elements. That sounds promising, and indeed, that's probably what your first sentence is telling me by saying "pass an array of uvec4".
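    In other words, something along these lines on the CPU side, if I understand correctly (just a sketch; the names handles, handle_count and ubo are placeholders):

    Code :
    /* C: split each GLuint64 handle into two 32-bit words (low word first, per the
       spec text quoted above) and upload the tightly packed array into the UBO. */
    GLuint packed[2 * 256];                                       /* room for 256 handles */
    for (int i = 0; i < handle_count; ++i) {
        packed[2*i + 0] = (GLuint)(handles[i] & 0xFFFFFFFFull);   /* low 32 bits  -> .x or .z */
        packed[2*i + 1] = (GLuint)(handles[i] >> 32);             /* high 32 bits -> .y or .w */
    }
    glBindBuffer(GL_UNIFORM_BUFFER, ubo);
    glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(GLuint) * 2 * handle_count, packed);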

    Fact is, maybe I wasn't such a moron; maybe what you wrote does tell me everything I need to know to get the desired result. Maybe you expected me to take the [very tiny] leap to understand I could extract the u32vec4.zw elements, create a u32vec2 out of them... then follow the instructions in the ARB_bindless_texture specification (which I did read, but found only half informative and half confusing).

    Frankly, I'm just now seeing that this is probably what happened. Wish I wasn't so dense.

    But also frankly... I still don't see how this works on the shader code side of the situation. All the examples I remember seeing (though I have the worst memory in the entire solar system) implied that magic happens behind the scenes and samplers backed by bindless textures [must] automatically appear in shaders as samplers, not as u64 integers, and not as u32vec2 or u32vec4 integer vectors either. However, if you tell me that my shader code can receive an array of u32vec4 values (backed by a uniform block that contains an array of u32vec4 integers holding sliced and diced portions of u64 integers/addresses), then after-the-fact convert them into whatever kind of sampler the shader code wants... well... then I suppose I'd be writing the code.

    Which is what I'll do if I now seem to understand this correctly. Do I ???

    Maybe I drew that inference (about "magic happens here behind the scenes" and "shaders only receive samplers") incorrectly. Seems like maybe I did.

  10. #10
    Senior Member OpenGL Guru Dark Photon's Avatar
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    4,169
    Read this:
    * Vertex Rendering (OpenGL Wiki)

    It'll answer a lot of your questions.
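    On #3/#4 specifically: gl_InstanceID runs from 0 to instanceCount-1 and does not include baseInstance; instanced vertex attributes, on the other hand, do take baseInstance into account automatically. If you index matrices out of an SSBO instead, you can add in gl_BaseInstance (GLSL 4.60, or gl_BaseInstanceARB with ARB_shader_draw_parameters). A minimal sketch, using a placeholder SSBO layout:

    Code :
    #version 460
    // gl_InstanceID counts 0..instanceCount-1; gl_BaseInstance carries the
    // command's baseInstance field (GLSL 4.60 / ARB_shader_draw_parameters).
    layout(std430, binding = 1) readonly buffer ShapeTransforms {
        mat4 local_to_clip[];
    };
    layout(location = 0) in vec3 position;

    void main()
    {
        uint matrix_index = uint(gl_BaseInstance + gl_InstanceID);
        gl_Position = local_to_clip[matrix_index] * vec4(position, 1.0);
    }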
