Fast Sprites and VBOs

I wanted to ask if anyone had any opinions on the best way to render lots of sprites using vertex buffer objects.

Each sprite has several parameters like:
position (updated often)
texture2darray frame number (for animated sprites, updated a lot)
rotation (updated less frequently, probably never used for some)
scale (updated rarely, probably almost never used for most)

I’m trying to figure out the best way to send these attributes to the GPU so I can render as many sprites as possible, as fast as possible. The first optimization, of course, is to build the quad in the geometry shader, so each sprite only requires a single vertex. Still, the big question is: what is the best way to pass the rest of the sprite attributes to the GPU?

I had a couple of ideas:

  1. Treat each sprite the same way you’d treat a traditional mesh. In this case all the parameters would be uniforms (like a transformation matrix) and the vertex buffer would not even really do anything. This seems like it would require too many API calls as you’d be issuing at least one call to push the uniform data and then a draw call for every single sprite.

  2. Use a large dynamic vertex buffer for all of the sprites and pack everything into vertex attributes. Update it once every frame and then sort and batch the draw calls by shader/texture state change.

Is it better to use one large vbo or break everything into a set of smaller ones? If the answer is to break it up, how do you determine when to do this?

For the purposes of dealing with VBO locks when you need to write to them, would it make sense to use a double-buffering approach? Maybe have two VBOs and write updates to the other one every other frame? How about a VBO that’s twice as large, where you always write to the opposite half of it?

Packing everything into vertex attributes seems like it could be somewhat wasteful. If 80% of my sprites have a rotation of 0, then it doesn’t make much sense to store this value for every sprite in a vertex attribute.
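To make idea 2 concrete, here is a minimal CPU-side sketch of packing every sprite into one interleaved buffer, plus the offset arithmetic for the "write to the opposite half" variant. It is written in Python only to keep it short; the 20-byte layout and the names SPRITE_FORMAT, pack_sprites and half_offset are assumptions for illustration, not anything from a real library. The resulting bytes are what you would hand to a buffer upload call.

```python
import struct

# Hypothetical per-sprite layout: x, y, rotation, scale as float32,
# then the texture array frame as uint32. Little-endian, no padding.
SPRITE_FORMAT = "<4fI"
SPRITE_STRIDE = struct.calcsize(SPRITE_FORMAT)  # bytes per sprite


def pack_sprites(sprites):
    """Pack (x, y, rotation, scale, frame) tuples into one interleaved buffer."""
    buf = bytearray(SPRITE_STRIDE * len(sprites))
    for i, (x, y, rotation, scale, frame) in enumerate(sprites):
        struct.pack_into(SPRITE_FORMAT, buf, i * SPRITE_STRIDE,
                         x, y, rotation, scale, frame)
    return bytes(buf)


def half_offset(frame_index, half_size):
    """Byte offset of the half to write this frame in a double-sized buffer."""
    return (frame_index % 2) * half_size
```

With this layout, rotation and scale cost 8 of the 20 bytes per sprite even when they are unused, which is the waste the question is asking about.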

  3. Use a combination of vertex buffer and something else. Perhaps a texture buffer object and/or uniforms for things like rotation and scaling, which happen much less often.

One idea would be to load the texture buffer object each frame with only the rarer attributes that are actually in use (like scaling, and maybe some other esoteric things like tinting by a color), then set a vertex attribute to index into it and have a conditional in the shader check whether this index is >= 0.

For example, if I’m rendering 2000 sprites but only 5 of them are scaled, then the texture would hold those 5 scale values, and the vertex attribute of each scaled sprite would index into the texture where its scale data is located. For the rest of the sprites the index attribute would be something like -1, indicating that they have no special attributes to look up.
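The index-plus-table idea can be sketched on the CPU side like this. It is only an illustration: build_sparse_scales is a hypothetical helper, and it assumes "not scaled" means a scale of exactly 1.0.

```python
def build_sparse_scales(scales, default=1.0):
    """Return (indices, table).

    indices[i] is -1 when sprite i uses the default scale, otherwise it is
    the slot in table that holds that sprite's scale value.
    """
    indices = []
    table = []
    for scale in scales:
        if scale == default:
            indices.append(-1)
        else:
            indices.append(len(table))
            table.append(scale)
    return indices, table
```

The indices list would go into a per-vertex integer attribute and the table into the texture buffer. Note that the first special sprite lands in slot 0, so a shader-side test has to accept 0 as a valid index.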

Of course if only 5 things out of 2000 have a special parameter, they could just be given separate draw calls and have that parameter specified by a uniform.

Basically I have a bunch of parameters. Some of them will change almost every frame and on a lot of sprites. Some of them will change less frequently. Between vertex attributes, uniforms, and texture buffer objects (or normal textures), what’s the best way to get this data from the CPU to my shader?

Personally I’d keep things simple, which would hopefully keep it fast - although you can’t be 100% sure it’s the fastest without extensive testing of different techniques on different hardware. Technique A might be fastest on nVidia, but B might be fastest on AMD. Questions like this are very hard to answer because there is no right or wrong way.

Each sprite has several parameters like:
position (updated often)
texture2darray frame number (for animated sprites, updated a lot)
rotation (updated less frequently, probably never used for some)
scale (updated rarely, probably almost never used for most)

I’d condense these down to the following:
ModelMatrix (rotation,scale,position)
Frame Number

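In fact, you could ‘pack’ the frame number safely into the 16th float of the ModelMatrix, since that element is always 1.0 for this kind of matrix. The shader just has to read the frame number out and put 1.0 back before using the matrix.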

Every frame, the CPU only has to update the translational part of the ModelMatrix 90% of the time, and on the odd occasion rebuild the matrix according to rotation and scale. For the GPU, having the ModelMatrix like this is very quick, and it can be used directly to transform the sprite into camera space.
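A minimal sketch of that condensed form, assuming a 2D sprite (rotation about Z, uniform scale) and a column-major 4x4 matrix stored as 16 floats. The function names are made up for illustration, and storing the frame number in element 15 relies on the shader restoring 1.0 there before it multiplies by the matrix.

```python
import math


def build_model_matrix(x, y, rotation=0.0, scale=1.0, frame=0):
    """Column-major 4x4 matrix with the frame number in the 16th float."""
    c = math.cos(rotation) * scale
    s = math.sin(rotation) * scale
    return [
        c,   s,   0.0, 0.0,           # column 0: scaled X axis
        -s,  c,   0.0, 0.0,           # column 1: scaled Y axis
        0.0, 0.0, 1.0, 0.0,           # column 2: Z axis, unused in 2D
        x,   y,   0.0, float(frame),  # column 3: translation + packed frame
    ]


def set_translation(matrix, x, y):
    """Common-case update: only the position moved, so touch two floats."""
    matrix[12] = x
    matrix[13] = y
```

For example, build_model_matrix(3.0, 4.0, frame=5) gives an identity rotation block with 3.0, 4.0 in the translation slots and 5.0 in the last element, and a later set_translation call leaves everything else untouched.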

Now, how to push these attributes to the GPU for each ‘instance’.
Again, you have choices as mentioned before. Before we start, how many sprite instances do you have?
Are they all sharing the same shader? Unless they share the same shader and texture, they can’t really be called instances.

Instancing techniques available for this are:

  1. No instancing. Just use a loop and send the ModelMatrix via a glUniform* call. For low numbers of sprites (< 15000) this may actually be the fastest method.

  2. ARB_instanced_arrays. Use glVertexAttribDivisor to send the ModelMatrix. Implementation is slightly trickier: since a single vertex attribute can only hold 4 floats, you’ll have to use 4 vertex attributes (one per matrix column) and 4 glVertexAttribDivisor calls.

  3. Uniform Buffer Object. This is a precious resource for the shader and one with limited size. AMD and nVidia differ in how much memory is available for the buffer object. If you are rendering too many instances then they won’t fit into the memory available and you’ll have to break the draw call down into batches. Kinda self-defeating when this happens.

  4. Texture Buffer Object. Absolutely tons of memory available for all your instancing. The per-instance data is accessed as if it were stored in a texture (so in the shader you use the texelFetch function). Not quite as fast as UBO, but much more flexible and can handle millions of instances. Also has the advantage of being very easy to code - unlike UBO, which is a bit of a pain to integrate into existing frameworks.
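Two bits of bookkeeping from options 2 and 3 can be sketched without touching GL at all: how a mat4 splits into four vec4 attribute slots, and how an instance list splits into batches when a uniform block is too small. The helper names are hypothetical, and the 65536-byte limit in the usage note is only an example value; the real limit would come from querying GL_MAX_UNIFORM_BLOCK_SIZE.

```python
FLOAT_SIZE = 4
MAT4_BYTES = 16 * FLOAT_SIZE  # 64 bytes per instance matrix


def mat4_attribute_slots(base_location, stride=MAT4_BYTES):
    """One (location, components, stride, byte_offset) tuple per column.

    Each tuple corresponds to one attribute pointer setup, followed by a
    divisor of 1 on that location so it advances once per instance.
    """
    return [(base_location + col, 4, stride, col * 4 * FLOAT_SIZE)
            for col in range(4)]


def split_into_batches(instance_count, max_block_bytes,
                       bytes_per_instance=MAT4_BYTES):
    """Return (first_instance, count) ranges that each fit in one block."""
    per_batch = max_block_bytes // bytes_per_instance
    if per_batch == 0:
        raise ValueError("uniform block too small for a single instance")
    return [(first, min(per_batch, instance_count - first))
            for first in range(0, instance_count, per_batch)]
```

With a 65536-byte block and 64 bytes per instance, 2000 sprites would need two draw calls (1024 and 976 instances), which is the batching overhead option 3 warns about.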

So there you have it. For low instance counts don’t even bother - you won’t get any speed up because you are only saving triangle setup overhead by instancing.
Instanced Arrays could be fairly simple for you to implement if you fancy benchmarking the difference. Each frame you can just use glBufferSubData to upload the new set of ModelMatrix data to the VBO. I don’t think you need two sets of VBOs as it won’t make a lot of difference. If, however, you can perform this task in a second thread (i.e. update position, rotation and scale), then that’s a different story and you can take advantage of the CPU power. I’d avoid uploading to GL in the second thread (you would need a separate context); instead have two sets of instance data (set A for thread A; set B for thread B), where one set is read-only, is uploaded to GL, and is the ‘current’ data, while the other set is write-only and being updated. Next frame, switch pointers and repeat.
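The two-sets-and-swap arrangement might look roughly like this. It is a sketch under assumptions: the class name is made up, plain Python lists stand in for the instance arrays, and thread synchronization is deliberately left out, so swap() is assumed to be called only at a frame boundary when neither thread is touching the data.

```python
class DoubleBufferedInstances:
    """Two instance-data sets: one being uploaded, one being updated."""

    def __init__(self, count, floats_per_instance=16):
        size = count * floats_per_instance
        self._sets = [[0.0] * size, [0.0] * size]
        self._read = 0  # index of the set the render thread reads

    @property
    def current(self):
        """Set the render thread uploads this frame (treat as read-only)."""
        return self._sets[self._read]

    @property
    def pending(self):
        """Set the update thread writes next frame's data into."""
        return self._sets[1 - self._read]

    def swap(self):
        """Switch roles; call once per frame when both threads are idle."""
        self._read = 1 - self._read
```

The render thread would upload current with a single buffer update each frame while the update thread fills pending; only the swap needs coordinating between the two.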