Thanks for all the great feedback. Keep it coming! I’ll ask follow-up questions to everyone in this one reply.
— VAOs —
Would I be correct to say that a VAO is simply a VBO with the offsets/strides/datatypes of the contained vertices permanently bound? Presumably this is done to eliminate the need to specify offsets/strides/datatypes every time you make a VBO active and render it. Is that all VAOs are, or did I miss something?
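To make sure I am picturing this right, here is how I imagine VAO usage would look (vao/vbo/ibo/Vertex/indexCount are placeholder names from our engine, not anything official):

```c
/* Setup, once: record the vertex layout (and the IBO binding) in the VAO. */
glGenVertexArrays(1, &vao);
glBindVertexArray(vao);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                      (void*)offsetof(Vertex, position));
glEnableVertexAttribArray(0);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);  /* captured by the VAO too */
glBindVertexArray(0);

/* Per draw: one bind replaces all the offset/stride/datatype calls. */
glBindVertexArray(vao);
glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, 0);
```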
— UBOs —
Would I be correct to say that a UBO holds the complete set of uniform variables the shaders expect? If I understand this correctly, an application would need to define the offsets/strides/datatypes for the individual elements of a specific “uniform buffer object” just once. Then, just before the application calls an OpenGL render function, it would update the values in its CPU-memory image of the buffer and tell the driver where that image is; the driver would then load all the uniform variables into the GPU before rendering begins. Is this approximately correct?
— DrawElementsInstanced —
I recall reading somewhere that a new built-in shader variable, called gl_VertexID or similar, came into existence in some version of OpenGL/GLSL after v2.1/v1.20. My assumption is that this identifies which vertex in the VBO each vertex shader invocation is currently processing (starting at zero, I assume). I guess this would be the value fetched from the IBO (the VBO that contains indices into the vertex VBO). That would seem to provide what is required for “instancing”, so I don’t see a need for special instancing draw calls. What am I missing?
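For reference, the special draw call in question looks like this as I understand it (vao/indexCount/instanceCount are placeholders); my question is what it provides beyond the per-vertex ID:

```c
glBindVertexArray(vao);
/* Replays the same index range instanceCount times in one call;
   inside the vertex shader, gl_VertexID carries the index fetched
   from the IBO and gl_InstanceID counts 0..instanceCount-1. */
glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT,
                        (void*)0, instanceCount);
```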
— MapBufferRange —
This one I do not understand. Our engine calls glBufferSubData() regularly, which we assume updates a portion of a VBO (sometimes the entire IBO or VBO, in our case). I must be missing something about the intent of glMapBufferRange().
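As I understand the two paths (offset/size/cpuVertices being our own variables), the difference would be roughly:

```c
/* What we do now: the driver copies out of our CPU array. */
glBufferSubData(GL_ARRAY_BUFFER, offset, size, cpuVertices);

/* The glMapBufferRange path: write directly into driver-owned
   memory, with flags that let the driver skip a stall or a copy. */
void *dst = glMapBufferRange(GL_ARRAY_BUFFER, offset, size,
                             GL_MAP_WRITE_BIT |
                             GL_MAP_INVALIDATE_RANGE_BIT);
memcpy(dst, cpuVertices, size);
glUnmapBuffer(GL_ARRAY_BUFFER);
```

So presumably the intent is to avoid the extra copy and the implicit synchronization that glBufferSubData() can cause?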
— OpenGL standard versus extension —
You are both correct. We are interested in opportunities that are extensions today IF they are fairly likely to become core eventually (in similar form). As long as ATI remains a viable and popular source of high-performance video cards, we prefer not to lock our software into nvidia (even though we have been 100% nvidia since the beginning). nvidia has been great, but we have nothing against AMD — all our CPUs are Phenom IIs!
— bindless graphics —
I read the nvidia PDF and the two extension text files, but have not gotten my brain around this yet. First, I find it difficult to believe that cache misses in the driver caused by looking up GPU addresses can slow any application by 7%, much less 700%. However, I applaud on principle the practice of letting CPU software control the GPU on the lowest feasible level.
It appears that VAOs eliminate the need to specify the offsets/strides/datatypes before each render. How much more efficiency does this extension offer over VAOs (which presumably are standard OpenGL)?
— texture arrays —
Is a texture array [?object?] different from a 3D texture? Are they different in the sense that each texture in a texture array can be a different size [and format]? That would be very nice indeed, and much more convenient than our “hack” with 3D textures.
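For context, here is how I understand the two allocations compare (w/h/depth/layers are placeholders):

```c
/* 3D texture: one w*h*depth volume, sampled with sampler3D; the
   sampler filters across the depth axis, which our "hack" has to
   work around. */
glTexImage3D(GL_TEXTURE_3D, 0, GL_RGBA8, w, h, depth, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, NULL);

/* Texture array: 'layers' separate 2D images, sampled with
   sampler2DArray; no filtering between layers, and mipmaps are
   per layer. */
glTexImage3D(GL_TEXTURE_2D_ARRAY, 0, GL_RGBA8, w, h, layers, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, NULL);
```

Since both forms are allocated with a single glTexImage3D() call here, my size/format question above still stands.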
— maximum speed techniques —
Currently our engine has large IBOs and VBOs (65536 elements each), and typically we render each IBO/VBO pair in one or two glDrawElements() or glDrawRangeElements() calls. We can do this because we make the CPU transform every vertex to world coordinates (we need world coordinates anyway for collision detection and for simulating several physical processes). Each vertex contains a 16-bit integer (a true integer now!) of flag bits that can change the behavior of the shader. All this combines to let us render up to 65536 vertices per draw call, thereby amortizing the overhead of state changes over 65536 vertices. Every once in a while we think “maybe this way is a mistake”, but so far our analysis and tests say this way is best, all things considered (for our engine, anyway).
Lately we have been wondering whether we should take this approach even further, switch to 32-bit indices, and put all our vertices into one huge IBO/VBO pair (up to ~30 million vertices). We could render large subsets of the IBO/VBO by calling glDrawRangeElements(), then update vertices outside that range by calling glBufferSubData(). That’s what we do now, except we always update the contents of each VBO before we render it (we never modify an IBO or VBO being rendered).
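Concretely, the per-frame loop we have in mind would look something like this for each visible volume (firstIndex/indexCount delimit that volume’s slice of the shared IBO; minVert/maxVert are the range of vertices it touches; dirty* are our own bookkeeping):

```c
/* Patch any vertices that changed since last frame. */
glBufferSubData(GL_ARRAY_BUFFER, dirtyOffset, dirtySize, dirtyVerts);

/* Draw one volume's slice of the shared 32-bit IBO. */
glDrawRangeElements(GL_TRIANGLES, minVert, maxVert,
                    indexCount, GL_UNSIGNED_INT,
                    (void*)(firstIndex * sizeof(GLuint)));
```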
Our main motivation is not performance: our batch size is already huge, so making batches even larger would likely not improve throughput measurably.
Instead, our main motivation is flexibility: to allow our engine to dynamically regroup “objects” in any way it wishes, simply by reloading modest subsections of the IBO alone (vertices never need to move when they are all inside one VBO).
Why would we want to do this? Here is one possibility, for example. Imagine a cube/tetrahedron/icosahedron (or whatever polyhedron is opportune) centered on the camera/viewpoint, with the camera pointing through the center of one face (or through a vertex). This divides the universe into 6/8/20/more volumes, each containing the centroids of some subset of all [game/simulation] objects. The objects in several to many of these volumes are not visible given the direction the camera is pointing (and its field of view). The engine can simply NOT DRAW any portion of the IBO that corresponds to these invisible volumes.
As objects move around in the environment from frame to frame, zero to a few objects will pass from one volume into another volume on each frame. The object can be removed from one volume and put into another simply by moving the object indices from one section of the IBO to another (and recompacting the “from” section of the IBO).
This is just one of several opportunities we find interesting, none of which work without switching to a single huge IBO/VBO pair. Any ideas and comments are welcome.