Request: variadic instancing

Currently, the available instancing calls repeat the exact same indices for every instance (someone please correct me if I missed something in the specifications, as I have before :o )

The API addition I propose is pretty simple:


glDrawElementsInstancedVariadic(GLenum primitive_type, 
                                GLenum index_type, 
                                GLint count, const GLuint *params, const void *indices)

is equivalent to:


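for(int i=0; i<count; ++i)
{
   gl_InstanceID=i; // pseudocode: the value the shader sees for this draw
   // params[2*i] is the offset into indices, params[2*i+1] the index count
   glDrawElements(primitive_type, params[2*i+1], index_type, indices+params[2*i]);
}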
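
To make the intended semantics concrete, here is a minimal CPU-side sketch of how the params array would expand into per-instance draws. It does not call GL at all; `RecordedDraw` and `expand_variadic_instances` are hypothetical names made up for illustration, and the offset is recorded as-is without deciding whether it is in bytes or in indices.

```c
#include <assert.h>
#include <stddef.h>

/* One recorded draw: what a single "variadic instance" would amount to. */
typedef struct {
    unsigned instance_id; /* value gl_InstanceID would take for this draw */
    unsigned offset;      /* params[2*i]:   offset into the index data */
    unsigned count;       /* params[2*i+1]: number of indices to draw */
} RecordedDraw;

/* Hypothetical stand-in for the proposed call: instead of drawing, it
 * expands params into one record per instance. Returns the number of
 * records written (never more than capacity). */
static size_t expand_variadic_instances(int count, const unsigned *params,
                                        RecordedDraw *out, size_t capacity)
{
    assert(count >= 0 && params != NULL && out != NULL);
    size_t written = 0;
    for (int i = 0; i < count && written < capacity; ++i) {
        out[written].instance_id = (unsigned)i;
        out[written].offset = params[2 * i];
        out[written].count = params[2 * i + 1];
        ++written;
    }
    return written;
}
```

Each record corresponds to one iteration of the loop above, so three (offset, count) pairs yield three draws with instance IDs 0, 1 and 2.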


Naturally this can be extended further, for example by allowing params to be sourced from a buffer object (either via a new target or by naming the buffer object in the call rather than passing an offset).

The goal of the requested extension is for situations where one wishes to draw many, many small “things” that “live on the same buffer objects”. The most typical example is glyphs, when you draw them as something other than a quad. Other examples abound.

Naturally, the above assumes that a GL implementation can save some overhead compared with an application issuing multiple glUniform-glDrawElements call pairs.

Thoughts (constructive preferred) are welcome.

But what you propose is not instancing anymore but rather something similar to glMultiDrawArrays.

If your problem is that glMultiDrawArrays does not increase the instance count, then look at the specs of the following extensions to accomplish the same thing:

GL_ARB_draw_indirect
GL_ARB_base_instance
GL_AMD_multi_draw_indirect

You can also do the same thing without indirect draw commands; you’ll just need a client-side loop.

Anyway, instancing is always about rendering the same geometry, not for using different index lists.

The point is to advance the gl_InstanceID. Also, the vertex attribute divisor thing is… quite awkward and icky, and we are still left with each instance having the exact same number of primitives. [It is quite possible that this is one of the reasons why it might help performance!] As stated, there is no point to this suggestion if a GL implementation will not save any overhead vs. a glUniform-glDrawElements pair per variadic instance.

Anyway, instancing is always about rendering the same geometry, not for using different index lists.

That is the point of the proposal: to move instancing towards allowing greater variation. As an example of its utility, we need only look at drawing letters via triangles (à la Loop and Blinn). Doing that now requires one draw call per letter or massive replication of attribute data. Considering that one often draws LOTS of letters, this is not good. Along these lines, GL_NV_path_rendering also allows for a kind of variadic instancing, with the transformations set per path.

But it can, with GL_ARB_base_instance as you just set the base instance field in the indirect command in a way that you always start the next draw command with an instance ID greater than the last one from the previous. E.g:

  • indirect draw command #1: base instance 0, instance count 100
  • indirect draw command #2: base instance 100, instance count 250
  • indirect draw command #3: base instance 350, instance count 150
  • etc.

Another thing that you forgot is that we also have atomic counters via GL_ARB_shader_atomic_counters. gl_InstanceID is no different from any other atomic counter, so if you don’t like the solution above, you can still use your very own atomic counter instead of gl_InstanceID.

The point is there is nothing in your proposal that cannot be already solved using the existing toolset.

Edit: okay, using atomic counters could be a bit tricky but the first solution is really simple and possible.
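
The base-instance scheme from the list above can be sketched as a running sum over the commands. This is only an illustration: the struct is redeclared locally with the field layout I understand GL_ARB_draw_indirect plus GL_ARB_base_instance to define, so it compiles without GL headers, and `assign_base_instances` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Local redeclaration of the indexed indirect command layout (assumed to
 * match the extension specs; no GL headers are needed for this sketch). */
typedef struct {
    unsigned count;
    unsigned instanceCount;
    unsigned firstIndex;
    int      baseVertex;
    unsigned baseInstance;
} DrawElementsIndirectCommand;

/* Give each command a baseInstance equal to the total instance count of
 * all commands before it. Returns the total number of instances. */
static unsigned assign_base_instances(DrawElementsIndirectCommand *cmds,
                                      size_t n)
{
    assert(cmds != NULL || n == 0);
    unsigned next = 0;
    for (size_t i = 0; i < n; ++i) {
        cmds[i].baseInstance = next;
        next += cmds[i].instanceCount;
    }
    return next;
}
```

With instance counts of 100, 250 and 150 this produces base instances 0, 100 and 350, matching the list.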

I’m trying to wrap my brain around what is being suggested, but I’m not quite certain I get it.

It seems like what you’re asking is to essentially provide a per-instance offset into the index array. That each instance is rendered with a different index count and base index.

One problem with this is that the index counts and base indices are all in CPU memory. I don’t see that as being particularly good in terms of helping performance. Maybe since GL 3.x hardware has those new BaseInstance draw functions, the driver can handle the instance increment manually. But, much like the glMultiDraw functions, that’s something you could do yourself.

Other examples abound.

OK, what are they?

Considering that one often draws LOTS of letters, this is not good.

“LOTS” is a relative term. Let’s look at the worst case scenario. A 12pt font, rendered at 1920x1200. Full screen, and all the text flushes fully left and right.

You might get 50 words per line. Given an average word length of perhaps 5 letters per word, that’s 250 letters. 1200 pixels of height might give you 75 lines. Total letter count on the screen: 18,750.

Assuming that one uses reasonable compression on the data (2D positions as shorts, 2D texcoords as shorts), each vertex will be 8 bytes in size. Let’s also assume the worst case scenario: one is using the core profile so no GL_QUADS (and no cheating with geometry shaders), each quad takes up 6 vertices. Therefore, each quad will take 48 bytes.

Total byte size for the worst-case screen text count: 900,000 bytes per frame. Less than 1MB.

If you change the entire screen’s text all at once, constantly, every frame, at 60fps, that will require ~52MB per second of transfer.
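
For anyone who wants to check the arithmetic, here is a small sketch that reproduces the estimate. The function names are made up for illustration; the inputs are the figures quoted above (50 words of 5 letters on each of 75 lines, 8 bytes per vertex, 6 vertices per quad, 60fps).

```c
#include <assert.h>

/* Letters visible on a full screen of text. */
static long letters_on_screen(long words_per_line, long letters_per_word,
                              long lines)
{
    assert(words_per_line >= 0 && letters_per_word >= 0 && lines >= 0);
    return words_per_line * letters_per_word * lines;
}

/* Vertex bytes needed to draw every letter as one quad. */
static long bytes_per_frame(long letters, long bytes_per_vertex,
                            long vertices_per_quad)
{
    assert(letters >= 0 && bytes_per_vertex >= 0 && vertices_per_quad >= 0);
    return letters * bytes_per_vertex * vertices_per_quad;
}

/* Upload rate if the whole screen of text changes every frame. */
static long bytes_per_second(long per_frame, long fps)
{
    assert(per_frame >= 0 && fps >= 0);
    return per_frame * fps;
}
```

Plugging in the numbers gives 18,750 letters, 900,000 bytes per frame and 54,000,000 bytes per second, which is about 51.5 MiB and so consistent with the ~52MB figure.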

In any case, I think most video cards can handle ~52MB per second. Also, if you’re drawing that much text, then it’s likely that your application consists primarily of text drawing (for example, OpenGL-accelerating a web browser). So even if this pushed your graphics card to its limit, it’s not like you’re rendering a bunch of other complex stuff too.

So I’m not seeing a use case for this.

Yes, your method (if directly supported by hardware, which I highly doubt for modern hardware) would be faster and take up less memory. But as we see here, even in the worst case, you’re not even taking up a full MB of memory. And I simply don’t see drawing 18,750 quads as being particularly difficult for any GPU, even low-end ones.

Do you have a use case where this speeds up something that is actually a bottleneck?

But it can, with GL_ARB_base_instance as you just set the base instance field in the indirect command in a way that you always start the next draw command with an instance ID greater than the last one from the previous. E.g:

indirect draw command #1: base instance 0, instance count 100
indirect draw command #2: base instance 100, instance count 250
indirect draw command #3: base instance 350, instance count 150
etc.

Reading the spec of GL_AMD_multi_draw_indirect again… AHHH … :o shame and embarrassment… no wait.

It was so close that at first I thought I was wrong :smiley: but the struct DrawElementsIndirectCommand does not have the instanceID, only the baseInstance (which makes sense given what DrawElementsIndirect is)…

So… it is so close, and getting that gl_InstanceID set is… uh, icky. The only way I see to do this realistically is to vary the baseInstance, set the divisor for one of the attributes to 1, and fill that attribute array such that Attribute[i]=i; reading that attribute then gives what one would want… but that is just silly, working around the API rather than having the API work for me. Any other suggestions are welcome though!
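
For reference, here is a rough sketch of that workaround, emulated on the CPU so it runs without a GL context. Everything named here is hypothetical: the struct is a local redeclaration of the indirect command layout, `build_per_glyph_commands` emits one command per glyph with instanceCount 1, and `fetch_instanced_attribute` mimics how an attribute with a divisor is indexed (instance / divisor + baseInstance), as I understand the base-instance spec. It also assumes params[2*i] is an offset counted in indices.

```c
#include <assert.h>
#include <stddef.h>

/* Local redeclaration of the indexed indirect command layout (assumed). */
typedef struct {
    unsigned count;
    unsigned instanceCount;
    unsigned firstIndex;
    int      baseVertex;
    unsigned baseInstance;
} DrawElementsIndirectCommand;

/* One command per glyph: a single instance whose baseInstance is the glyph
 * number. ids[] is the identity array (Attribute[i] = i) that would back
 * the divisor-1 attribute. */
static void build_per_glyph_commands(const unsigned *params, size_t n,
                                     DrawElementsIndirectCommand *cmds,
                                     unsigned *ids)
{
    assert(params != NULL && cmds != NULL && ids != NULL);
    for (size_t i = 0; i < n; ++i) {
        cmds[i].count = params[2 * i + 1];  /* index count for this glyph */
        cmds[i].instanceCount = 1;
        cmds[i].firstIndex = params[2 * i]; /* assumed to be in indices */
        cmds[i].baseVertex = 0;
        cmds[i].baseInstance = (unsigned)i;
        ids[i] = (unsigned)i;
    }
}

/* CPU emulation of fetching an instanced attribute with a nonzero divisor. */
static unsigned fetch_instanced_attribute(const unsigned *attr,
                                          unsigned divisor, unsigned instance,
                                          unsigned base_instance)
{
    assert(attr != NULL && divisor > 0);
    return attr[instance / divisor + base_instance];
}
```

The shader would then read the attribute in place of gl_InstanceID, which is exactly the indirection I am complaining about.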

and to Alfonse:

Assuming that one uses reasonable compression on the data (2D positions as shorts, 2D texcoords as shorts), each vertex will be 8 bytes in size. Let’s also assume the worst case scenario: one is using the core profile so no GL_QUADS (and no cheating with geometry shaders), each quad takes up 6 vertices. Therefore, each quad will take 48 bytes.

Um, the idea was that one is NOT rendering text as quads. Ahem. If you take the time to google you will find that the technique I am referring to has that each glyph is rendered as a fair number of triangles so that zooming, rotation, etc are still rendered well. Roughly speaking, the number of attributes per glyph is the total number of end points and off-curve control points of the glyph, not 4 attributes per glyph, and the index count is not 4 (GL_QUADS), 6 (GL_TRIANGLES) or 4 (GL_TRIANGLE_STRIP with primitive restart) per glyph. For the quad situation, one allocates each instance of each letter anyway and goes to town; the ugly part is, ahem, the other methods.

One problem with this is that the index counts and base indices are all in CPU memory. I don’t see that as being particularly good in terms of helping performance. Maybe since GL 3.x hardware has those new BaseInstance draw functions, the driver can handle the instance increment manually. But, much like the glMultiDraw functions, that’s something you could do yourself.

I specifically wrote that sourcing the offsets and counts from a buffer object is also on the table as part of the proposal.

EDIT: I realized GL_AMD_multi_draw_indirect was not enough!

If you take the time to google you will find that the technique I am referring to has that each glyph is rendered as a fair number of triangles so that zooming, rotation, etc are still rendered well.

And if you Google a bit further, you will find that Valve has developed techniques wherein one can get zoom, rotation, etc while using quads :wink:

So no, outline fonts are not necessary for arbitrary scaling.

And if you Google a bit further, you will find that Valve has developed techniques wherein one can get zoom, rotation, etc while using quads

So no, outline fonts are not necessary for arbitrary scaling.

You know what: I am pretty well acquainted with the various text rendering techniques. Let me educate you:

  • Distance field text rendering: all corners get rounded, so you are not drawing the glyphs as they really are. Additionally, there are issues near corners (like the stem of an r) where it gets bumpy. Additionally, the distance field glyph needs to be a pretty high resolution to get reasonable results, and text minification gets ugly. Chinese glyphs are particularly nasty business.
  • There are some techniques where quads are used to draw the glyph and the “texture” data for the glyph lets one run a fragment shader to get better per-pixel accuracy. I’ve even made my own techniques. They all suffer from the fragment shader getting pretty heavy pretty quickly, and some suffer from a very, very heavy pre-process step.

I was not the only one unhappy with this: witness GL_NV_path_rendering. It has bits just for letters!