Efficiency of multiple glDrawElements commands



letslearn
02-27-2012, 12:54 AM
Let's say I have a buffer object with vertex data on the GPU. And I also have a buffer object with indices on the GPU, in preparation for use of glDrawElements.

Let's also say the geometry I am working with is very precisely defined and simple enough that I can do this kind of thing:

Multiple calls to glDrawElements, using different modes of primitive generation (GL_QUAD_STRIP, then GL_TRIANGLES).

Doing this will allow me to use a smaller buffer object for the indices. But is there a performance penalty for sending multiple commands rather than just one?
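For a sense of the index savings in question: a GL_QUAD_STRIP covering N quads needs 2N + 2 indices, while the same quads drawn as GL_TRIANGLES need 6N. A minimal C sketch of that expansion, assuming 16-bit indices; `quad_strip_quad_count` and `quad_strip_to_triangles` are hypothetical helper names, not GL functions:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of quads described by a GL_QUAD_STRIP
   with n_strip indices. */
static size_t quad_strip_quad_count(size_t n_strip) {
    return n_strip >= 4 ? n_strip / 2 - 1 : 0;
}

/* Hypothetical helper: expand a GL_QUAD_STRIP index sequence into a
   GL_TRIANGLES list. Quad j uses strip[2j], strip[2j+1], strip[2j+3],
   strip[2j+2] in that winding order; each quad becomes two triangles.
   `out` must hold 6 * quad_strip_quad_count(n_strip) entries.
   Returns the number of indices written. */
static size_t quad_strip_to_triangles(const unsigned short *strip, size_t n_strip,
                                      unsigned short *out) {
    size_t quads = quad_strip_quad_count(n_strip);
    size_t k = 0;
    assert(quads == 0 || (strip != NULL && out != NULL));
    for (size_t j = 0; j < quads; j++) {
        unsigned short a = strip[2 * j], b = strip[2 * j + 1];
        unsigned short c = strip[2 * j + 3], d = strip[2 * j + 2];
        out[k++] = a; out[k++] = b; out[k++] = c;  /* first triangle  */
        out[k++] = a; out[k++] = c; out[k++] = d;  /* second triangle */
    }
    return k;
}
```

For the two-quad strip {0, 1, 2, 3, 4, 5} this produces the 12 indices {0,1,3, 0,3,2, 2,3,5, 2,5,4}, i.e. twice the index memory of the 6-index strip.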

mhagain
02-27-2012, 08:06 AM
A single call with GL_TRIANGLES will always be better. Don't worry about memory usage and buffer sizes (at least not in this context) - they're not as important as you might think.

thokra
02-27-2012, 05:17 PM
But is there a performance penalty for sending multiple commands rather than just one?

Depends on the number of calls. :) Multiple can mean 10, 100, a few thousand and so on. If the overall number of batches is low enough, you don't need to bother reducing draw calls, since your app will most likely not be CPU-limited anyway. This means that even if you condense all calls into a single call, you will probably gain nothing performance-wise, because your bottleneck is somewhere else.

As with any optimization in programming: first profile your app, then change the parts of your code which you figure are reducing performance, then profile and see if anything changed.

mhagain
02-27-2012, 07:05 PM
Most of the performance impact is not going to come from multiple calls, but from switching your primitive type, switching between indexed and non-indexed modes, and doing triangle setup. Any type of primitive needs to be converted to triangles for the GPU to process it, and - obviously - GL_TRIANGLES have the win here because they're already triangles.

Just using GL_TRIANGLES will give you much, much simpler code too, with a single path for everything. That means fewer bugs, better maintainability, and code that is easier to expand/enhance in the future. A minuscule bit of extra index bandwidth is a small enough price to pay for that alone.

I would encourage you to set up both a multi-type version and a GL_TRIANGLES-only version, add a switch to toggle at runtime, and do a performance comparison.

Alfonse Reinheart
02-27-2012, 08:39 PM
Most of the performance impact is not going to come from multiple calls, but from switching your primitive type, switching between indexed and non-indexed modes, and doing triangle setup.

Do you have some evidence that issuing multiple draw calls for different primitive types actually impacts performance compared to using the same primitive type? Or are you just speculating?

Because I seriously doubt that any of what you said is true.


Any type of primitive needs to be converted to triangles for the GPU to process it, and - obviously - GL_TRIANGLES have the win here because they're already triangles.

... what? It costs nothing to render with GL_TRIANGLE_STRIP in terms of what the hardware has to do. You can make a case for using TRIANGLES based on indexing and building post-T&L buffer optimized lists. But just in terms of the basic effort the hardware has to do, they're equivalent.
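To make the strip/list equivalence concrete: a strip is just a compact way of listing triangles that share edges, and the triangles it defines can be written out as a plain GL_TRIANGLES list. The sketch follows the usual strip rule (vertices t, t+1, t+2 form triangle t, with the first two swapped on odd t so the winding stays consistent). `triangle_strip_to_list` is a hypothetical helper, and this says nothing about how any particular GPU assembles strips internally:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: expand a GL_TRIANGLE_STRIP index sequence into the
   equivalent GL_TRIANGLES list. Triangle t uses strip[t], strip[t+1],
   strip[t+2]; on odd t the first two are swapped to keep the winding.
   `out` needs room for 3 * (n - 2) indices. Returns indices written. */
static size_t triangle_strip_to_list(const unsigned short *strip, size_t n,
                                     unsigned short *out) {
    size_t k = 0;
    if (n < 3)
        return 0;
    assert(strip != NULL && out != NULL);
    for (size_t t = 0; t + 2 < n; t++) {
        int odd = (int)(t & 1);
        out[k++] = strip[odd ? t + 1 : t];
        out[k++] = strip[odd ? t : t + 1];
        out[k++] = strip[t + 2];
    }
    return k;
}
```

For the strip {0, 1, 2, 3, 4, 5} this gives {0,1,2, 2,1,3, 2,3,4, 4,3,5}. Read as GL_QUAD_STRIP, the same six indices describe the quads (0,1,3,2) and (2,3,5,4), and those four triangles cover exactly the two quads with the same winding. That is why a quad strip's index data can usually be drawn with GL_TRIANGLE_STRIP unchanged; with flat shading, the provoking vertex of each triangle may differ from that of the original quad.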


Just using GL_TRIANGLES will give you much much simpler code too, with a single path for everything.

By "single path", you mean "not having to read from a variable and pass it to glDrawElements". Somehow, I don't think that's a major performance bottleneck.

aqnuep
02-27-2012, 08:59 PM
Most of the performance impact is not going to come from multiple calls, but from switching your primitive type, switching between indexed and non-indexed modes, and doing triangle setup.

Switching your primitive type should not involve any performance hit, and neither should changing between indexed and non-indexed modes. Actually, the only reason to stick with a single setup is if you would like to submit all your draw calls in a single MultiDraw* command, which can actually be faster.
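For reference, glMultiDrawElements takes parallel arrays of index counts and index-buffer offsets but only one primitive mode and one index type, which is why the draws being merged need a common setup. A minimal sketch of building those arrays, assuming a shared 16-bit element buffer; `SubMesh` and `build_multidraw_args` are hypothetical names, and the GL call itself is left as a comment because it needs a live context:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one sub-mesh inside a shared index buffer. */
typedef struct {
    size_t first_index;  /* position of its first index in the buffer */
    int    index_count;  /* number of indices it uses                 */
} SubMesh;

/* Hypothetical helper: fill the parallel arrays glMultiDrawElements expects.
   With an element buffer bound, each "pointer" is a byte offset into that
   buffer. counts[] uses int because GLsizei is a typedef for int. */
static void build_multidraw_args(const SubMesh *meshes, size_t n,
                                 int *counts, const void **offsets) {
    assert(n == 0 || (meshes != NULL && counts != NULL && offsets != NULL));
    for (size_t i = 0; i < n; i++) {
        counts[i]  = meshes[i].index_count;
        offsets[i] = (const void *)(uintptr_t)(meshes[i].first_index * sizeof(uint16_t));
    }
}

/* With a GL context and the element buffer bound, one call then replaces n:
   glMultiDrawElements(GL_TRIANGLES, counts, GL_UNSIGNED_SHORT, offsets, (GLsizei)n); */
```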


Any type of primitive needs to be converted to triangles for the GPU to process it, and - obviously - GL_TRIANGLES have the win here because they're already triangles.

GL_POINTS, GL_LINES, GL_TRIANGLES and GL_TRIANGLE_STRIP are native on all GPUs that I'm aware of, however, you should probably avoid the other ones.

mhagain
02-28-2012, 03:21 AM
Do you have some evidence that issuing multiple draw calls for different primitive types actually impacts performance compared to using the same primitive type? Or are you just speculating?

Benchmarking. I suggest you try it before going off on a rant. Go on.

Alfonse Reinheart
02-28-2012, 08:30 AM
Benchmarking. I suggest you try it before going off on a rant. Go on.

You're the one making the claim, so it's up to you to substantiate it with benchmarking results. So please post your results and the program you used to obtain them.

letslearn
02-29-2012, 01:21 AM
OP here. Thanks for all your comments.

Wouldn't a more important consideration be this:

If you use GL_QUAD_STRIP instead of GL_TRIANGLES, wouldn't you gain the benefit of having to run the vertex shader fewer times to render the same object?

Alfonse Reinheart
02-29-2012, 02:50 AM
Not necessarily. If it were a post-T&L buffer optimized triangle list, it probably wouldn't.

The post-T&L buffer is a spot of memory that contains vertices that have gone through the vertex shader, as well as the index that created them. Vertex shaders are entirely deterministic; what you get out is determined by the uniforms (which don't change within a draw call) and the inputs. Since the same index will resolve to the same vertex shader inputs, the same index will produce the same outputs from the vertex shader.

So the hardware can look at an incoming index, check whether it is in the post-T&L buffer, and if it is, simply use that data instead of needlessly running the vertex shader again.

You use more indices with triangle lists overall. But you still get fast results.
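This also speaks to the earlier GL_QUAD_STRIP question. A rough model of the post-T&L cache, assuming a simple FIFO of the most recently transformed indices (real hardware differs in cache size and replacement policy, so treat this as an illustration only); `count_vs_invocations` is a hypothetical helper:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CACHE 64

/* Hypothetical helper: count vertex-shader invocations for an indexed draw,
   modelling the post-T&L cache as a FIFO of the last `cache_size` indices.
   An index already in the cache reuses the stored output; any other index
   runs the shader and is pushed into the FIFO, evicting the oldest entry. */
static size_t count_vs_invocations(const unsigned *indices, size_t n,
                                   size_t cache_size) {
    unsigned cache[MAX_CACHE];
    size_t filled = 0, next = 0, runs = 0;
    assert(cache_size >= 1 && cache_size <= MAX_CACHE);
    for (size_t i = 0; i < n; i++) {
        int hit = 0;
        for (size_t c = 0; c < filled; c++)
            if (cache[c] == indices[i]) { hit = 1; break; }
        if (hit)
            continue;                    /* reuse cached shader output */
        runs++;                          /* cache miss: shader runs    */
        cache[next] = indices[i];        /* overwrite the oldest slot  */
        next = (next + 1) % cache_size;
        if (filled < cache_size)
            filled++;
    }
    return runs;
}
```

For the two-triangle quad {0, 1, 2, 2, 1, 3}, a 16-entry (or even 2-entry) FIFO runs the shader only 4 times, once per unique vertex, which is the same as drawing those 4 vertices as a strip; a 1-entry cache runs it 5 times. Ordering a triangle list so that shared vertices come back while still cached is what a post-T&L optimized list does.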

letslearn
02-29-2012, 09:42 AM
Ah, did not know about that. I was wondering why uniforms cannot be changed within a draw call. I'm assuming the OpenGL implementation takes care of that post-T&L buffer optimized triangle list itself, without exposing it to the user.

I'm guessing glDrawElementsInstanced cannot take advantage of the post-T&L buffer, as it reads a different gl_InstanceID for each draw.

Alfonse Reinheart
02-29-2012, 11:00 AM
I was wondering why uniforms cannot be changed within a draw call.

How could you change them within a draw call? The only functions that change uniforms are API functions. And a draw call is a single, atomic API call.

And glBegin/glEnd expressly forbid calling most functions between them, glUniform included.



I'm guessing glDrawElementsInstanced cannot take advantage of the post-T&L buffer, as it reads a different gl_InstanceID for each draw.

It can't use it for vertices between instances. But it certainly can use it for vertices within an instance.

letslearn
02-29-2012, 11:32 AM
How could you change them within a draw call? The only functions that change uniforms are API functions. And a draw call is a single, atomic API call.

I was wondering why shaders themselves could not modify the uniform values they read.


It can't use it for vertices between instances. But it certainly can use it for vertices within an instance.

Ah, that makes a lot more sense than how I thought glDrawElementsInstanced worked.

Alfonse Reinheart
02-29-2012, 11:42 AM
I was wondering why shaders themselves could not modify the uniform values they read.

Because then they wouldn't be uniform ;)

thokra
02-29-2012, 11:49 AM
I was wondering why shaders themselves could not modify the uniform values they read.

A more pressing question is: Why would you want to modify values that are inherently constant across a primitive, across multiple objects, across the whole frame or over the whole runtime?

letslearn
02-29-2012, 12:01 PM
Thokra, I can't remember precisely why I wanted such functionality. I think I wanted to emulate glDrawElementsInstanced's capabilities using glDrawElements (in case OpenGL 3.x is not available on the target system), and this involved using uniforms as "loop counters". Looking back, the whole idea doesn't even make sense to me, so it's probably best to forget about it.

That said, I'm not even sure shaders would need to be able to modify uniforms for that.

mbentrup
03-01-2012, 04:37 AM
You should read the pseudo instancing paper from NVidia: http://developer.download.nvidia.com/SDK/9.5/Samples/samples.html#glsl_pseudo_instancing

Alfonse Reinheart
03-01-2012, 10:16 AM
It should be noted that this paper is from the pre-DX10 days, and it is not known how efficient this is for hardware that isn't a GeForce FX or GeForce 6xxx.

Someone should get an OpenGL performance test suite together.

mbentrup
03-02-2012, 02:11 AM
Hardware that is OpenGL 3 capable should use native instancing, so pseudo instancing is only a fallback for those old chips anyway.

This is especially nice as a fallback for GL_ARB_instanced_arrays (pseudocode):


<bind non-instanced attribs>
if (GL_ARB_instanced_arrays_available) {
    /* instanced path: the attribute advances once per instance */
    glEnableVertexAttribArray(<instanced attr>);
    glVertexAttribPointer(<instanced attr>, 4, GL_FLOAT, GL_FALSE, 0, attribs);
    glVertexAttribDivisorARB(<instanced attr>, 1);
    glDrawElementsInstancedARB(GL_TRIANGLES, ..., num_instances); /* instance count is the last argument */
} else {
    /* fallback: no array, set the attribute as a constant before each draw */
    glDisableVertexAttribArray(<instanced attr>);
    for (i = 0; i < num_instances; i++) {
        glVertexAttrib4fv(<instanced attr>, attribs[i]);
        glDrawElements(GL_TRIANGLES, ...);
    }
}


Your application can use the same shaders, VBOs, etc. in both paths.
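What the divisor in the instanced path controls, as a small C model: with divisor 0 an attribute advances per vertex, and with divisor d > 0 it advances once every d instances. With divisor 1, instance i reads attribs[i], which is exactly the value the fallback loop supplies through glVertexAttrib4fv. `attrib_element` is a hypothetical helper, and it ignores the base-instance offset added in later GL versions:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: which element of a vertex-attribute array is read
   for a given vertex and instance.
   divisor == 0: ordinary per-vertex attribute, element = vertex index.
   divisor  > 0: element = instance / divisor (integer division).
   Base-instance offsets are not modelled. */
static size_t attrib_element(size_t vertex_index, size_t instance, unsigned divisor) {
    return divisor == 0 ? vertex_index : instance / divisor;
}
```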

+1 for the performance test suite.