GL3.3+ Instancing and VAOs

Hi all,
I would like to implement a renderer making heavy use of instancing.
Here’s what I see as a global pipeline for rendering :

  • Update Scene Graph
  • Perform visibility tests and store model matrices of geometry about to be drawn
  • Sort matrices per instance (mesh) and per depth (early Z!)
  • For each instance (mesh)
    • Stream matrices in a VBO (I’ll call it the instance VBO) in a specified range.
    • Bind VAO
    • For each submesh, bind material (UBO, textures and shader program) and call a drawinstanced function.
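As an illustration of the sorting step, here is a minimal sketch, assuming each visible instance is stored as a record holding a mesh id, a view-space depth and its model matrix. The names VisibleInstance and sortForInstancing are hypothetical, not part of any library. Instances are grouped by mesh so each group can go out in one instanced draw, and ordered front to back within a group to help early Z.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical record for one instance that passed the visibility tests.
struct VisibleInstance {
    std::uint32_t meshId;   // which mesh this instance draws
    float viewDepth;        // distance from the camera (smaller = closer)
    float modelMatrix[16];  // column-major model matrix
};

// Group instances by mesh, then order each group front to back.
inline void sortForInstancing(std::vector<VisibleInstance>& instances) {
    std::sort(instances.begin(), instances.end(),
              [](const VisibleInstance& a, const VisibleInstance& b) {
                  if (a.meshId != b.meshId) return a.meshId < b.meshId;
                  return a.viewDepth < b.viewDepth;
              });
}
```

After the sort, each run of equal meshId is a contiguous block whose matrices can be copied into the instance VBO in one go.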

I’d read each model matrix in the vertex shader as four vec4 input attributes and use glVertexAttribDivisor, instead of a TBO (slower) or a UBO (less flexible for a dynamic number of instances).
Now I would like to update the buffer with the algorithm described by Rob Barris here (using glMapBufferRange and GL_MAP_UNSYNCHRONIZED_BIT): http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=273484&page=4
Unfortunately, VAOs seem to make this impossible: the pointers to vertex attributes are constant, so in some use cases (adding instances dynamically during the application loop) the GPU could be reading a range that I am writing to.
This is quite annoying, and I might just try not to use VAOs (or at least bind one and then forget about it)… Is there anything I can/could/should do? (Or perhaps my idea isn’t good…)
Thanks !

I would like to implement a renderer making heavy use of instancing.

Are you rendering a scene where “heavy use of instancing” would be useful? Instancing is an optimization; use it when you think you need performance. You should not just assume that instancing is worthwhile for what you’re doing.

Now I would like to update the buffer with the algorithm described by Rob Barris just here (using mapbufferrange and gl_unsynced_bit)

Rob doesn’t talk about using unsynchronized. He’s talking about using invalidate/orphaning. These are two very different things.

Unfortunately, VAOs make this impossible since pointers to vertex attributes are constant making it possible for the gpu to read in a range to which I could be writing in some usecases (adding instances dynamically during the application loop)

I don’t see how VAOs affect whether the buffer is being read when you want to write to it or not. Just because the buffer object does not happen to be bound to anything currently doesn’t mean that it isn’t being used by the GPU.

Anyway, what Rob was talking about is that you don’t need to know anything about whether the GPU is using a buffer object or not if you orphan it. Once you call glBufferData with a NULL data pointer, or glMapBufferRange with GL_MAP_INVALIDATE_BUFFER_BIT, the old memory is gone and there is new memory in its place. There’s no stalling.

The use of VAOs does not affect this one way or the other.

The key point I was making, and this may or may not apply to the OP here, is that if you have a source of modest-sized, dynamically generated chunks of vertex data that need to be transferred, you can combine orphaning and unsynchronized map to pack all those batches into a buffer sequentially without delays, and that you can transition from one buffer-full to the next without need for fences. But the pattern only holds up if your data transfer pattern is write-once per region within the buffer… you can’t go back and tread on previously written data safely unless you do some synchronization.

So you could have a 2MB buffer, you could have hundreds of batches of data dropped into it with a draw call then placed on each one - that process isn’t efficient unless you use the unsynchronized mode, since the second attempt to map that buffer would block on the first draw being in flight.

To sum up, the unsynchronized map capability is key to being able to interleave data delivery with issuance of draws on those data chunks. Orphaning/invalidation is key when you have filled up such a buffer and need to rewind the cursor to the start and repeat with some fresh storage.
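The cursor logic behind that pattern can be sketched without any GL calls. This is only an illustration under stated assumptions: StreamCursor and StreamRange are hypothetical names, and every batch is assumed to fit in the buffer. When orphan is false, the range would be mapped with GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT. When it is true, the buffer would first be orphaned (glBufferData with a NULL pointer, or a map with GL_MAP_INVALIDATE_BUFFER_BIT) before writing at offset 0.

```cpp
#include <cstddef>

// Result of reserving space for one batch.
struct StreamRange {
    std::size_t offset;  // byte offset at which to map and write
    bool orphan;         // true: orphan the buffer before writing here
};

// Write-once cursor over a fixed-size streaming buffer (hypothetical helper).
class StreamCursor {
public:
    explicit StreamCursor(std::size_t capacity)
        : capacity_(capacity), head_(0) {}

    // Reserves `size` bytes. Assumes size <= capacity.
    StreamRange reserve(std::size_t size) {
        StreamRange r{head_, false};
        if (head_ + size > capacity_) {
            // Out of room: rewind to the start and ask for fresh storage,
            // so data still in flight on the GPU is never overwritten.
            r.offset = 0;
            r.orphan = true;
        }
        head_ = r.offset + size;  // regions are never handed out twice
        return r;
    }

private:
    std::size_t capacity_;  // total buffer size in bytes
    std::size_t head_;      // first byte not yet handed out
};
```

Because each region is handed out only once between two orphans, no fence is needed: the unsynchronized map never touches bytes that an earlier draw may still be reading.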

Okay, I’ll detail the instance VBO update now:
What I have :

  • A list of instanced drawing algorithm (meshes)
  • A list of pairs offset/count
  • A list of matrices
  • Instance VBO sized, say, 2 MB

Now, to render a mesh with instancing, I get its offset/matrix_count pair (matrix_count is also the number of instances to render), map the range of the VBO associated with that mesh, and write the matrices. Then I perform the draw call. On to the next mesh: map the new range, write its instance matrices, and so on. But these ranges are recomputed every cycle (using the results of the visibility culling algorithms, user parameters, etc.).

Here’s some pseudo code for drawing:

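for each mesh
   // Map this mesh's range of the instance buffer.
   glBindBuffer(GL_ARRAY_BUFFER, instanceBufferID);
   // If offset + size is bigger than the VBO size, orphan first.
   GLsizeiptr size = currentMesh.numInstancesToRender * sizeof(mat4);
   float* data = (float*)glMapBufferRange(GL_ARRAY_BUFFER, currentMesh.offset, size, GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_RANGE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_FLUSH_EXPLICIT_BIT);
   // Copy the matrices into the mapped range (assigning the pointer copies nothing).
   memcpy(data, &currentMesh.matrixArray[0], size);
   // The flush offset is relative to the start of the mapped range.
   glFlushMappedBufferRange(GL_ARRAY_BUFFER, 0, size);
   glUnmapBuffer(GL_ARRAY_BUFFER);
   // bind VAO, material, and draw
   currentMesh.drawCall(currentMesh.numInstancesToRender);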

Now here’s what the VAO construction would look like for my drawable objects:


// Enable Vertex Attrib Arrays
glEnableVertexAttribArray(semantic::vertex_pos);
glEnableVertexAttribArray(semantic::instance_mat_column0);
glEnableVertexAttribArray(semantic::instance_mat_column1);
glEnableVertexAttribArray(semantic::instance_mat_column2);
glEnableVertexAttribArray(semantic::instance_mat_column3);

// Bind vertex data
glBindBuffer(GL_ARRAY_BUFFER, vertexBufferID);
glVertexAttribPointer(semantic::vertex_pos, 4, GL_FLOAT, GL_FALSE, 0, NULL);

// Bind instanced attributes
// I need some_const_val (a byte offset into the instance buffer) to be dynamic ...
// Each column is a vec4, so consecutive columns are 4 floats (16 bytes) apart.
glBindBuffer(GL_ARRAY_BUFFER, instanceBufferID);
glVertexAttribPointer(semantic::instance_mat_column0, 4, GL_FLOAT, GL_FALSE, sizeof(mat4), (const GLvoid*)(some_const_val));
glVertexAttribPointer(semantic::instance_mat_column1, 4, GL_FLOAT, GL_FALSE, sizeof(mat4), (const GLvoid*)(some_const_val + sizeof(GLfloat)*4));
glVertexAttribPointer(semantic::instance_mat_column2, 4, GL_FLOAT, GL_FALSE, sizeof(mat4), (const GLvoid*)(some_const_val + sizeof(GLfloat)*8));
glVertexAttribPointer(semantic::instance_mat_column3, 4, GL_FLOAT, GL_FALSE, sizeof(mat4), (const GLvoid*)(some_const_val + sizeof(GLfloat)*12));

// instanciate the perInstance_matrix attributes
glVertexAttribDivisor(semantic::instance_mat_column0,1);
glVertexAttribDivisor(semantic::instance_mat_column1,1);
glVertexAttribDivisor(semantic::instance_mat_column2,1);
glVertexAttribDivisor(semantic::instance_mat_column3,1);

My problem isn’t about updating the buffer but using it for the final draw calls : the pointer to the instanced attributes in my array buffer is changed (almost) every new rendering cycle.

Are you rendering a scene where “heavy use of instancing” would be useful?
Using one draw call instead of N>1 is better whenever possible, isn’t it ?

My problem isn’t about updating the buffer but using it for the final draw calls : the pointer to the instanced attributes in my array buffer is changed (almost) every new rendering cycle.

And? VAOs are dynamic objects; changing their data doesn’t cost anything more than changing the attribute binding point without VAOs.
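To make the re-specification concrete, here is a small sketch of the byte offsets involved, assuming each instance is a tightly packed column-major 4x4 float matrix. The names columnOffset, kColumnBytes and kMatrixBytes are hypothetical helpers for illustration only.

```cpp
#include <cstddef>

// Assumption: one instance is a 4x4 float matrix stored as four vec4 columns.
constexpr std::size_t kColumnBytes = 4 * sizeof(float);  // one vec4 column
constexpr std::size_t kMatrixBytes = 4 * kColumnBytes;   // stride per instance

// Byte offset of `column` (0..3) for a range that starts at instance
// index `firstInstance` in the instance buffer.
constexpr std::size_t columnOffset(std::size_t firstInstance,
                                   std::size_t column) {
    return firstInstance * kMatrixBytes + column * kColumnBytes;
}
```

With the VAO and the instance buffer bound, each of the four matrix attributes would be re-pointed once per mesh by calling glVertexAttribPointer with a stride of kMatrixBytes and the offset cast to a pointer, e.g. reinterpret_cast<const void*>(columnOffset(first, k)). The enables and divisors stored in the VAO stay as they are.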

Using one draw call instead of N>1 is better whenever possible, isn’t it ?

No. It isn’t. Instancing, like most optimizations, has very specific circumstances under which it is useful.

Before you even consider instancing, you must first be certain that you are CPU limited. Until you are, instancing is of no value.

Second, you must have a data set that is conducive to instancing. This means:

1: Your mesh instance must be of reasonable size.

2: Your instances should not differ by much. That is, the difference in rendering one instance vs. another should be controllable via a few uniform values. Maybe a mat4 or so at the absolute maximum. No textures can change per-instance.

3: You must be rendering no less than 1,000 instances per frame, on average. Generally more.

Unless all of these conditions are met, instancing will be of no value to you.

As you can see, instancing is a special-case technique. It solves a very specific kind of problem. If you can coerce some of your rendering to look like that problem, and you are CPU bound in your rendering, then instancing may be helpful to you. Otherwise don’t bother.

And? VAOs are dynamic objects; changing their data doesn’t cost anything more than changing the attribute binding point without VAOs.
Okay so I’d be re-specifying the VAO every frame. No big deal if it’s not costly.

[…]Unless all of these conditions are met, instancing will be of no value to you.
So no benefit okay, but do you also mean performance loss compared to OneInstanceDraw calls?