performance using glDrawElements or glArrayElement

Hi,

I tried to improve the performance of my display (around 1 million quads) by using glVertexPointer and glDrawElements. Since I have to display in flat shade mode, I loop over each quad, supply one normal, and then call glDrawElements for that element, as below:
glPushClientAttrib(GL_CLIENT_VERTEX_ARRAY_BIT);
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, (const GLvoid*)xyz);
for(i=0; i<n_elements; i++, pIndices+=4, nxyz+=3)
{
    glNormal3fv((const GLfloat *)nxyz);
    glDrawElements(GL_QUADS, 4, GL_UNSIGNED_INT, pIndices);
}
But this turns out to be slower than my immediate mode rendering, which uses 4 glVertex3fv calls per quad instead of glVertexPointer and glDrawElements.

Can you please help me understand why using glVertexPointer is slower than using glVertex3fv?
With immediate mode rendering I get around 13 fps, and with glVertexPointer and glDrawElements I get 8 fps :(

TIA
satya

Also use a vertex array for your normals. Yes, you have to replicate each normal 4 times to match the vertices it belongs to. It would be best to pad each normal with a 4th (unused) float to align the per-vertex data to 16-byte boundaries. You then have to use a stride of 16 bytes in the glNormalPointer() call.
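A minimal sketch of the replication described above, assuming the positions are shared between quads through an index list as in the original post. The helper name expand_flat_quads and its parameters are hypothetical, not part of OpenGL. It gives every quad its own four vertices, copies the quad's normal to each of them, and pads both arrays to 4 floats (16 bytes) per vertex.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenGL call): builds unshared position and
   normal arrays for flat-shaded quads, 4 floats (16 bytes) per vertex. */
void expand_flat_quads(const float *xyz,            /* shared positions, 3 floats each   */
                       const unsigned int *indices, /* 4 indices per quad                */
                       const float *quad_normals,   /* 1 normal per quad, 3 floats each  */
                       size_t n_quads,
                       float *out_pos,              /* room for 4 * n_quads * 4 floats   */
                       float *out_nrm)              /* room for 4 * n_quads * 4 floats   */
{
    assert(xyz && indices && quad_normals && out_pos && out_nrm);

    for (size_t q = 0; q < n_quads; ++q) {
        const float *n = quad_normals + 3 * q;
        for (size_t c = 0; c < 4; ++c) {
            const float *p = xyz + 3 * (size_t)indices[4 * q + c];
            float *dp = out_pos + 4 * (4 * q + c);
            float *dn = out_nrm + 4 * (4 * q + c);
            /* Copy the corner position; the 4th float is unused padding. */
            dp[0] = p[0]; dp[1] = p[1]; dp[2] = p[2]; dp[3] = 0.0f;
            /* Same normal for all four corners; the 4th float is unused padding. */
            dn[0] = n[0]; dn[1] = n[1]; dn[2] = n[2]; dn[3] = 0.0f;
        }
    }
}
```

With arrays built this way, the per-quad loop could be replaced by glVertexPointer(3, GL_FLOAT, 16, out_pos), glNormalPointer(GL_FLOAT, 16, out_nrm) and a single glDrawArrays(GL_QUADS, 0, 4 * n_quads), with both GL_VERTEX_ARRAY and GL_NORMAL_ARRAY enabled. Whether the padding actually helps is likely to depend on the driver.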

BTW, the same is true for the vertex positions, which could give you an additional speed up…

Your code seems a little strange… I think you are bottlenecked by that loop.

The idea with vertex arrays is you put all the data into one big array, and get rid of the immediate mode calls, and if possible use one single glDrawElements() command to do the lot.

You can make a Normal array you know.

You could also use interleaved arrays here so that you have vertex and normal data intertwined.
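As a rough illustration of the interleaved layout, here is a sketch that assumes the positions and normals already exist as separate per-vertex arrays of 3 floats each. The FlatVertex struct and the interleave_n3f_v3f function are made-up names for this example. The normal is placed before the position so that the layout matches what the GL_N3F_V3F format of glInterleavedArrays expects.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vertex layout: normal first, then position (GL_N3F_V3F order). */
typedef struct {
    float normal[3];
    float position[3];
} FlatVertex;

/* Hypothetical helper: merges separate per-vertex arrays into one interleaved array. */
void interleave_n3f_v3f(const float *positions, /* 3 floats per vertex */
                        const float *normals,   /* 3 floats per vertex */
                        size_t n_vertices,
                        FlatVertex *out)        /* room for n_vertices entries */
{
    assert(positions && normals && out);

    for (size_t i = 0; i < n_vertices; ++i) {
        for (int k = 0; k < 3; ++k) {
            out[i].normal[k]   = normals[3 * i + k];
            out[i].position[k] = positions[3 * i + k];
        }
    }
}
```

The result could then be handed over with glInterleavedArrays(GL_N3F_V3F, 0, out), or with glNormalPointer(GL_FLOAT, sizeof(FlatVertex), out[0].normal) plus glVertexPointer(3, GL_FLOAT, sizeof(FlatVertex), out[0].position).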

Is your platform low on memory? I can’t see any other reason for the way you are doing things above. Perhaps I am missing something.

Vertex arrays will only be faster if you use them right. The way you do it at the moment is, pardon, completely wrong.

To get good performance, with each glDraw* call you should render 300 triangles or more. You can render like 50-100 triangles and performance will be decent (still much faster than immediate mode), but the more, the better. Your current idea to render each quad individually means much much more work for the driver, than with immediate mode.

As stated above, you can create normal arrays and texture-coordinate arrays, too. To be able to render so many triangles in one batch, it is usually necessary to duplicate a lot of data (like the normals), but there is no way around it.

Also, quads are OK, but triangles are the GPU’s preferred primitive.
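If you want to feed triangles instead of quads, the index list can be converted once up front. This is only a sketch; quads_to_triangles is a hypothetical helper name, and it assumes each quad is convex and planar so that splitting along the 0-2 diagonal is acceptable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: splits each quad (v0, v1, v2, v3) into the two
   triangles (v0, v1, v2) and (v0, v2, v3), keeping the winding order. */
void quads_to_triangles(const unsigned int *quad_indices, /* 4 per quad          */
                        size_t n_quads,
                        unsigned int *tri_indices)        /* room for 6 per quad */
{
    assert(quad_indices && tri_indices);

    for (size_t q = 0; q < n_quads; ++q) {
        const unsigned int *v = quad_indices + 4 * q;
        unsigned int *t = tri_indices + 6 * q;
        t[0] = v[0]; t[1] = v[1]; t[2] = v[2];
        t[3] = v[0]; t[4] = v[2]; t[5] = v[3];
    }
}
```

The whole batch could then be drawn with one glDrawElements(GL_TRIANGLES, 6 * n_quads, GL_UNSIGNED_INT, tri_indices) call.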

When you have it working properly with vertex arrays, you also might want to look into the “vertex buffer object (VBO)” extension for further speed improvements.

Jan.

Consider using one or more ring buffers (circular update with no overwrite) for lots of smallish batches, and large dedicated buffers for huge objects. The trade-off is setup and copy overhead at a finer granularity versus lump-sum submission at a much coarser granularity. My guess is that drivers are doing something ring-like internally to implement the immediate mode stuff these days (probably a win for relatively short bursts of quads and tris, etc).
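To make the "circular update with no overwrite" idea concrete, here is a CPU-side sketch of the bookkeeping only. Everything in it (RingBuffer, RingSpan, ring_alloc, ring_release) is hypothetical naming for illustration; it does not touch any GL buffer, and it assumes the caller knows when the GPU has finished with the oldest span (for example by waiting a frame or using a fence) and releases spans in the order they were allocated.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t capacity; /* total bytes in the backing buffer        */
    size_t head;     /* offset where the next span would start   */
    size_t used;     /* bytes still in flight, including padding */
} RingBuffer;

typedef struct {
    size_t offset;   /* where the caller may write                    */
    size_t consumed; /* bytes charged to the ring; 0 means "no room"  */
} RingSpan;

/* Reserves size contiguous bytes, wrapping to offset 0 when the tail end of
   the buffer is too small. Never hands out bytes that are still in flight. */
RingSpan ring_alloc(RingBuffer *r, size_t size)
{
    RingSpan span = {0, 0};
    assert(r != NULL);
    if (size == 0 || size > r->capacity)
        return span;

    /* If the span would run past the end, skip (waste) the remainder. */
    size_t skip = (r->head + size > r->capacity) ? r->capacity - r->head : 0;
    if (r->used + skip + size > r->capacity)
        return span; /* would overwrite data that has not been released */

    span.offset = skip ? 0 : r->head;
    span.consumed = skip + size;
    r->head = (span.offset + size) % r->capacity;
    r->used += span.consumed;
    return span;
}

/* Gives back the oldest span; pass the consumed value ring_alloc returned. */
void ring_release(RingBuffer *r, size_t consumed)
{
    assert(r != NULL && consumed <= r->used);
    r->used -= consumed;
}
```

A batch that fails to allocate would have to wait until older spans are released, which is the "no overwrite" part; how you learn that the GPU is done with a span is outside this sketch.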

GL3 has a new buffer mapping API that’ll probably make this kind of pattern easier to see and implement.

“To get good performance, with each glDraw* call you should render 300 triangles or more. You can render like 50-100 triangles and performance will be decent (still much faster than immediate mode), but the more, the better. Your current idea to render each quad individually means much much more work for the driver, than with immediate mode.”

Agreed. As far as I know, sending more primitives at once might let the driver spawn a few threads to work through them faster than one by one. There needs to be a balance though, like everywhere else: not too many, not too few.

Thank you very much, all. I must say that I did get a 2x speedup with this kind of implementation (over immediate mode) on my older laptop (with a Quadro FX 2500, 512 MB), but the same piece of code runs slower than immediate mode on my new laptop (with a Quadro FX 3600M, 512 MB).

  • I am not able to understand the change in performance between the 2 cards/laptops

  • Is duplicating the information the only way? I might have to duplicate the normal and vertex information even for smooth shaded display, to render sharp edges properly. Also, if I have animation data, I would have to manage a different set of normals and vertices for each frame, and duplicating them for each frame of animation might negate the speedup I gain from vertex arrays.

  • I forgot to mention one more thing: the 1M quads are not rendered as a whole. They are stored in some kind of collector, and each collector would typically have around 10K quads (so each collector can have different attributes, like a different color, …)

Please suggest,

Thanks again for all your suggestions

satya

Differences in the drivers and hardware will give different results. So it’s not really something that is worth worrying about too much for specific instances like this. First get the algorithm the best you can.

If you want to avoid data duplication the only real way is to use Element Arrays, or to generate things like geometry dimensions or normals in shaders. But these are always problem specific. So it may or may not be possible for you.
Also most elegant methods of doing this require Geometry shaders.

At the end of the day the fastest way to do anything is to move as little data as fast as possible, and to get rid of as much looping and logic / repeat calls as possible.

If you are moving huge amounts of data then perhaps look at the available sizes you can have to store data. Colours can be bytes, and some geometry can be half sized.
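As one example of shrinking the data, colours stored as 4 floats (16 bytes) can be packed into 4 unsigned bytes. This is a sketch under the assumption that the source colours are RGBA floats in the 0 to 1 range; float_to_ubyte and pack_colors_rgba8 are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

/* Maps a float in [0, 1] to a byte in [0, 255], clamping anything outside. */
static unsigned char float_to_ubyte(float c)
{
    if (!(c > 0.0f)) return 0;   /* also catches NaN */
    if (c >= 1.0f)   return 255;
    return (unsigned char)(c * 255.0f + 0.5f); /* round to nearest */
}

/* Hypothetical helper: converts n_colors RGBA float colours to RGBA bytes. */
void pack_colors_rgba8(const float *rgba,  /* 4 floats per colour         */
                       size_t n_colors,
                       unsigned char *out) /* room for 4 bytes per colour */
{
    assert(rgba && out);

    for (size_t i = 0; i < 4 * n_colors; ++i)
        out[i] = float_to_ubyte(rgba[i]);
}
```

The packed array could then be supplied with glColorPointer(4, GL_UNSIGNED_BYTE, 0, out), a quarter of the size of the float version.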

Also putting things into Buffers on the GPU is another option. If they are being updated a lot then perhaps stream and double buffer the data so that the GPU can be drawing from one source while you get the next load of data ready elsewhere.