The short (context-free) version of the question is this. How much slower will a huge array of local-to-world transformation matrices be in an SSBO versus a UBO, assuming shaders only read the contents and never write to them?
For those willing to read a much longer statement of context and considerations…
##########
I want to make some changes in how my (work-in-progress) 3D simulation/physics/game engine works. My original idea was to create a uniform block that contains a simple array of transformation matrices, something like this:
uniform local_to_world_transformation_matrix {
    mat4 transform[65536];
} local_to_world;
But then I realized that array could only contain 1024 matrices, because each mat4 is 64 bytes and the maximum size in bytes of a uniform block is the value returned by:
int max_uniform_block_size = 0;
glGetIntegerv (GL_MAX_UNIFORM_BLOCK_SIZE, &max_uniform_block_size);
In games or applications that contain a large number of graphical objects in the environment, that complicates processing batches of objects. The reason is, my vertex structures contain a 32-bit integer that holds objid, the “object number” AKA “object identifier” of the object the vertex belongs to. My plan was to put the local-to-world transformation matrix for every object into one huge UBO associated with the uniform block specified above. Then the vertex shader can transform every vertex to world coordinates by multiplying the incoming vertex coordinates attribute by the matrix in local_to_world.transform[objid].
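In GLSL that lookup is a one-liner. A minimal vertex-shader sketch (the attribute names and the world_to_clip uniform are my own inventions; the array is sized 1024 here because of the UBO limit discussed above — wanting 65536+ entries is the whole problem):

```glsl
#version 330 core

layout(location = 0) in vec4 position;  // object-local coordinates
layout(location = 1) in int  objid;     // per-vertex object identifier
                                        // (fed via glVertexAttribIPointer)

uniform local_to_world_transformation_matrix {
    mat4 transform[1024];               // capped by GL_MAX_UNIFORM_BLOCK_SIZE
} local_to_world;

uniform mat4 world_to_clip;             // assumed view-projection matrix

void main()
{
    vec4 world_pos = local_to_world.transform[objid] * position;
    gl_Position = world_to_clip * world_pos;
}
```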
Very simple and straightforward. And as everyone who makes 3D engines knows, it is already a fair bit of hassle to segregate objects into batches (collections of objects that share exactly the same set of shaders… and texturemaps… and surfacemaps… and conemaps… and heightmaps… and every other kind of resources). I try to make this more efficient by taking advantage of all four texture units and keeping as many resources in four array textures (so many textures available on each texture unit). To do this I have four u08 attributes on each vertex that specify which element in the four array texture to access, plus an additional bit field to specify whether each is to be applied or not (or in any of 16 to 64 arbitrary combinations). I think people call this the “uber shader” approach.
The engine doesn’t require all this flexibility be taken advantage of (especially not in every pass), but this flexible approach is the nominal standard, especially for game or application developers who are not expert at programming shaders.
Anyway, it is already quite a bit of work to segregate objects into batches that can be rendered in a single draw operation. When the local-to-world transformation matrix can be specified as easily as placing the object identifier in the objid field of each vertex, at least specifying the local-to-world transformation matrix is easy.
But then I found a UBO can only hold 65536 bytes, which is only 1024 f32mat4x4 transformation matrices, since each matrix consumes 64 bytes of memory. While I have done a lot of work to keep my batch sizes large (some might say huge compared to many), I am not in any way bothered by the inability to draw more than 1024 objects per draw call. :-o That’s plenty big to be extremely efficient. No, that’s not the problem. The problem is that I can’t just leave the objid object identifier in all the vertices and let the vertex shader index into local_to_world.transform[objid] to perform transformations (unless the game/application environment contains fewer than 1024 objects total). That will not usually be the case, especially in this 3D engine, because this 3D engine is designed for “procedurally generated content”, including 3D objects. This makes it fun and easy to create gazillions of objects — even without artists!!! :-o
To accommodate the limit of 1024 matrices per uniform block, the engine would need to create batches of <= 1024 objects and replace the objid object identifiers in the vertices with matid identifiers that select the appropriate local-to-world transformation matrix in local_to_world.transform[matid]. Whenever an object needed to move into a different batch, its current matid would likely clash with an existing object in that batch. The engine would then have to rewrite the matid field of every vertex of that object, and copy the object’s transformation matrix into the array of local-to-world transformation matrices assigned to that batch.
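To make that bookkeeping concrete, the remap step might look roughly like this (a sketch with invented types and names, not engine code):

```c
#include <stdint.h>

#define BATCH_MAX 1024

typedef struct { float m[16]; } mat4;

typedef struct {
    uint32_t matid;   /* slot index into the batch's matrix array */
    /* position, surface vectors, texture indices, ... omitted */
} vertex;

typedef struct {
    mat4     transform[BATCH_MAX];  /* becomes the batch's UBO contents */
    uint32_t count;
} batch;

/* Move an object into a batch: claim the next free matid slot, rewrite
 * every vertex of the object, and store its matrix in the batch's array.
 * Returns the assigned matid, or -1 if the batch is already full. */
static int batch_add_object(batch *b, vertex *verts, uint32_t nverts,
                            const mat4 *local_to_world)
{
    if (b->count >= BATCH_MAX)
        return -1;
    uint32_t matid = b->count++;
    for (uint32_t i = 0; i < nverts; i++)
        verts[i].matid = matid;      /* every vertex must be touched... */
    b->transform[matid] = *local_to_world;
    /* ...and the modified vertices re-uploaded to the VBO afterwards. */
    return (int)matid;
}
```

This is exactly the per-vertex rewriting (and the resulting VBO re-upload) that the never-changing objid scheme avoids.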
But that’s not all:
- every object given a new matid would need its modified vertices re-transferred to the VBO in GPU memory.
- all local-to-world matrices of objects moved into a new batch would need to be transferred to the corresponding UBO in GPU memory.
Keeping track of everything is also non-trivial.
In contrast, consider how this works in the nominal approach where the one local_to_world.transform[] uniform block could hold 65536 or even millions of transformation matrices indexed by the unique, never changing objid object identifier!
Then:
- object vertices never need to be updated (they stay in GPU memory indefinitely).
- the buffer that holds the local_to_world.transform[] array (65536+ entries) only gets updated once per frame (in convenient portions).
##########
Okay, all the above provides context to consider the following question.
Are “shader storage blocks” and SSBO a rational solution to the above problem?
Several times I’ve seen statements that “shader storage blocks” are slower than “uniform blocks”. BUT… is this true even if no shader writes into the block? And even if they are slower, are they enough slower to offset all the extra work my 3D engine would otherwise need to perform? I’m not so much concerned with CPU time (though maybe I should be); I’m more worried that performing all those updates to the VBO that contains all objects might slow the GPU down significantly. Remember, the contents of objects in the VBO never need to change in the naive/simple approach, where the transformation matrix array can be huge (because we put that array into a shader storage block instead of a uniform block).
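For reference, the shader-storage version of the block I have in mind would look like this (requires GL 4.3 or ARB_shader_storage_buffer_object; the readonly qualifier at least declares to the driver that shaders never write it):

```glsl
#version 430 core

layout(std430, binding = 0) readonly buffer local_to_world_transformation_matrix {
    mat4 transform[];   // unsized: the bound buffer can hold millions of mat4s
} local_to_world;
```

The capacity limit here is GL_MAX_SHADER_STORAGE_BLOCK_SIZE, which the spec guarantees to be at least 128 MB — versus the 16 KB minimum (64 KB typical) for uniform blocks.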
##########
A few comments about this 3D engine.
I know 99% of objects (or more) in many games never or rarely move. In this case the vertex arrays for those objects can contain vertices in world coordinates and no local-to-world transformation is necessary (or just multiply by the unit matrix). While this engine should be appropriate for “normal games” like this (few moving objects), several of its first applications are for simulations (and games) that occur in space… where every object is constantly subject to forces, motion, rotation, collisions and collision responses, etc.
The following is how this 3D engine creates “batches” of objects to draw. First, all vertex structures of all objects are contained in a single VBO in GPU memory, with position and surface vectors (zenith/normal, north, east) in object local coordinates. Each frame the CPU creates one index array of 32-bit indices (called “elements” by OpenGL) for each batch, makes that IBO part of the VAO, copies that IBO to the GPU, then executes the draw call. This draws those objects in the VBO that are specified by indices in the IBO. All objects drawn by the IBO are rendered with exactly the same shaders and resources (texturemaps, surfacemaps, conemaps, othermaps, etc) and fully or partially overlap the viewport frustum. By the time the GPU has rendered a batch, the one CPU thread responsible for this work has created the IBO for the next batch (while all other threads are busy processing the next frame).
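The per-batch index-array construction described above can be sketched as follows (a simplification with invented types: it treats each object as an independent triangle list with no shared vertices, which real meshes would of course exploit):

```c
#include <stdint.h>

/* Where an object's vertices live inside the single shared VBO. */
typedef struct {
    uint32_t first_vertex;
    uint32_t vertex_count;   /* triangle list: a multiple of 3 */
} object_range;

/* Append the indices of every object in the batch to one element array
 * (the per-batch IBO), ready for a single glDrawElements call.
 * Returns the element count to pass to glDrawElements. */
static uint32_t build_batch_elements(const object_range *objects,
                                     const uint32_t *batch_objids,
                                     uint32_t nobjects, uint32_t *out)
{
    uint32_t n = 0;
    for (uint32_t i = 0; i < nobjects; i++) {
        const object_range *o = &objects[batch_objids[i]];
        for (uint32_t v = 0; v < o->vertex_count; v++)
            out[n++] = o->first_vertex + v;
    }
    return n;
}
```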
Sorry for all the excess detail. Some folks like to know the details (so they can give better advice), while others hate to read so much. Can’t please everyone.