The short (context-free) version of the question is this. How much slower will a huge array of local-to-world transformation matrices be in an SSBO versus a UBO, assuming shaders only read the contents and never write to them?
For those willing to read a much longer statement of context and considerations…
##########
I want to make some changes in how my (work-in-progress) 3D simulation/physics/game engine works. My original idea was to create a uniform block that contains a simple array of transformation matrices, something like this:
uniform local_to_world_transformation_matrix {
    mat4 transform[65536];
} local_to_world;
But then I realized that array could only contain 1024 matrices, because each mat4 is 64 bytes and the maximum size in bytes of a uniform block is the value returned by:
int max_uniform_block_size = 0;
glGetIntegerv (GL_MAX_UNIFORM_BLOCK_SIZE, &max_uniform_block_size);
In games or applications that contain a large number of graphical objects in the environment, that complicates processing batches of objects. The reason is, my vertex structures contain a 32-bit integer that holds objid, the “object number” AKA “object identifier” of the object the vertex belongs to. My plan was to put the local-to-world transformation matrix for every object into one huge UBO associated with the uniform block specified above. Then the vertex shader can transform every vertex to world coordinates by multiplying the incoming vertex coordinates attribute by the matrix in local_to_world.transform[objid].
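In GLSL that lookup is a one-liner. A minimal vertex-shader sketch (the attribute names and the world_to_clip uniform are my own inventions; the array is sized 1024 here because of the UBO limit discussed above — wanting 65536+ entries is the whole problem):

```glsl
#version 330 core

layout(location = 0) in vec4 position;  // object-local coordinates
layout(location = 1) in int  objid;     // per-vertex object identifier
                                        // (fed via glVertexAttribIPointer)

uniform local_to_world_transformation_matrix {
    mat4 transform[1024];               // capped by GL_MAX_UNIFORM_BLOCK_SIZE
} local_to_world;

uniform mat4 world_to_clip;             // assumed view-projection matrix

void main()
{
    vec4 world_pos = local_to_world.transform[objid] * position;
    gl_Position = world_to_clip * world_pos;
}
```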
Very simple and straightforward. And as everyone who makes 3D engines knows, it is already a fair bit of hassle to segregate objects into batches (collections of objects that share exactly the same set of shaders… and texturemaps… and surfacemaps… and conemaps… and heightmaps… and every other kind of resources). I try to make this more efficient by taking advantage of all four texture units and keeping as many resources in four array textures (so many textures available on each texture unit). To do this I have four u08 attributes on each vertex that specify which element in the four array texture to access, plus an additional bit field to specify whether each is to be applied or not (or in any of 16 to 64 arbitrary combinations). I think people call this the “uber shader” approach.
The engine doesn’t require all this flexibility be taken advantage of (especially not in every pass), but this flexible approach is the nominal standard, especially for game or application developers who are not expert at programming shaders.
Anyway, it is already quite a bit of work to segregate objects into batches that can be rendered in a single draw operation. When the local-to-world transformation matrix can be specified as easily as placing the object identifier in the objid field of each vertex, at least specifying the local-to-world transformation matrix is easy.
But then I found a UBO can only hold 65536 bytes, which is only 1024 f32mat4x4 transformation matrices, since each matrix consumes 64 bytes of memory. While I have done a lot of work to keep my batch sizes large (some might say huge compared to many), I am not in any way bothered by the inability to draw more than 1024 objects per draw call. :-o That’s plenty big to be extremely efficient. No, that’s not the problem. The problem is that I can’t just leave the objid object identifier in all the vertices and let the vertex shader index into local_to_world.transform[objid] to perform transformations (unless the game/application environment contains fewer than 1024 objects total). That will not usually be the case, especially in this 3D engine, because this 3D engine is designed for “procedurally generated content”, including 3D objects. This makes it fun and easy to create gazillions of objects — even without artists!!! :-o
To accommodate the limit of 1024 matrices per uniform block, the engine would need to create batches of <= 1024 objects and replace the objid object identifiers in the vertices with matid identifiers that select the appropriate local-to-world transformation matrix in local_to_world.transform[matid]. Whenever an object needed to move into a different batch, its current matid would likely clash with an existing object in that batch. The engine would then have to rewrite the matid field of every vertex of that object, and copy the object’s transformation matrix into the array of local-to-world transformation matrices assigned to that batch.
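To make that bookkeeping concrete, the remap step might look roughly like this (a sketch with invented types and names, not engine code):

```c
#include <stdint.h>

#define BATCH_MAX 1024

typedef struct { float m[16]; } mat4;

typedef struct {
    uint32_t matid;   /* slot index into the batch's matrix array */
    /* position, surface vectors, texture indices, ... omitted */
} vertex;

typedef struct {
    mat4     transform[BATCH_MAX];  /* becomes the batch's UBO contents */
    uint32_t count;
} batch;

/* Move an object into a batch: claim the next free matid slot, rewrite
 * every vertex of the object, and store its matrix in the batch's array.
 * Returns the assigned matid, or -1 if the batch is already full. */
static int batch_add_object(batch *b, vertex *verts, uint32_t nverts,
                            const mat4 *local_to_world)
{
    if (b->count >= BATCH_MAX)
        return -1;
    uint32_t matid = b->count++;
    for (uint32_t i = 0; i < nverts; i++)
        verts[i].matid = matid;      /* every vertex must be touched... */
    b->transform[matid] = *local_to_world;
    /* ...and the modified vertices re-uploaded to the VBO afterwards. */
    return (int)matid;
}
```

This is exactly the per-vertex rewriting (and the resulting VBO re-upload) that the never-changing objid scheme avoids.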
But that’s not all:
- every object given a new matid would need its modified vertices re-transferred to the VBO in GPU memory.
- all local-to-world matrices of objects moved into a new batch would need to be transferred to the corresponding UBO in GPU memory.
Keeping track of everything is also non-trivial.
In contrast, consider how this works in the nominal approach where the one local_to_world.transform[] uniform block could hold 65536 or even millions of transformation matrices indexed by the unique, never changing objid object identifier!
Then:
- object vertices never need to be updated (they stay in GPU memory indefinitely).
- the buffer that holds the local_to_world.transform[] array (65536+ entries) only gets updated once per frame (in convenient portions).
##########
Okay, all the above provides context to consider the following question.
Are “shader storage blocks” and SSBO a rational solution to the above problem?
Several times I’ve seen statements that “shader storage blocks” are slower than “uniform blocks”. BUT… is this true even if no shader writes into the block? And even if they are slower, are they enough slower to offset all the extra work my 3D engine would otherwise need to perform? I’m not so much concerned with CPU time (though maybe I should be); I’m more worried that performing all those updates to the VBO that contains all objects might slow the GPU down significantly. Remember, the contents of objects in the VBO never need to change in the naive/simple approach, where the transformation matrix array can be huge (because we put that array into a shader storage block instead of a uniform block).
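For reference, the shader-storage version of the block I have in mind would look like this (requires GL 4.3 or ARB_shader_storage_buffer_object; the readonly qualifier at least declares to the driver that shaders never write it):

```glsl
#version 430 core

layout(std430, binding = 0) readonly buffer local_to_world_transformation_matrix {
    mat4 transform[];   // unsized: the bound buffer can hold millions of mat4s
} local_to_world;
```

The capacity limit here is GL_MAX_SHADER_STORAGE_BLOCK_SIZE, which the spec guarantees to be at least 128 MB — versus the 16 KB minimum (64 KB typical) for uniform blocks.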
##########
A few comments about this 3D engine.
I know 99% of objects (or more) in many games never or rarely move. In this case the vertex arrays for those objects can contain vertices in world coordinates and no local-to-world transformation is necessary (or just multiply by the unit matrix). While this engine should be appropriate for “normal games” like this (few moving objects), several of its first applications are for simulations (and games) that occur in space… where every object is constantly subject to forces, motion, rotation, collisions and collision responses, etc.
The following is how this 3D engine creates “batches” of objects to draw. First, all vertex structures of all objects are contained in a single VBO in GPU memory, with position and surface vectors (zenith/normal, north, east) in object local coordinates. Each frame the CPU creates one index array of 32-bit indices (called “elements” by OpenGL) for each batch, makes that IBO part of the VAO, copies that IBO to the GPU, then executes the draw call. This draws those objects in the VBO that are specified by indices in the IBO. All objects drawn by the IBO are rendered with exactly the same shaders and resources (texturemaps, surfacemaps, conemaps, othermaps, etc) and fully or partially overlap the viewport frustum. By the time the GPU has rendered a batch, the one CPU thread responsible for this work has created the IBO for the next batch (while all other threads are busy processing the next frame).
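The per-batch index-array construction described above can be sketched as follows (a simplification with invented types: it treats each object as an independent triangle list with no shared vertices, which real meshes would of course exploit):

```c
#include <stdint.h>

/* Where an object's vertices live inside the single shared VBO. */
typedef struct {
    uint32_t first_vertex;
    uint32_t vertex_count;   /* triangle list: a multiple of 3 */
} object_range;

/* Append the indices of every object in the batch to one element array
 * (the per-batch IBO), ready for a single glDrawElements call.
 * Returns the element count to pass to glDrawElements. */
static uint32_t build_batch_elements(const object_range *objects,
                                     const uint32_t *batch_objids,
                                     uint32_t nobjects, uint32_t *out)
{
    uint32_t n = 0;
    for (uint32_t i = 0; i < nobjects; i++) {
        const object_range *o = &objects[batch_objids[i]];
        for (uint32_t v = 0; v < o->vertex_count; v++)
            out[n++] = o->first_vertex + v;
    }
    return n;
}
```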
Sorry for all the excess detail. Some folks like to know the details (so they can give better advice), while others hate to read so much. Can’t please everyone.