Vertex-Arrays not faster than immediate mode?

Hi!

first of all: latest Nvidia-drivers, WinXP pro, Visual Studio 6, C++ :)

I wrote a program which loads a model and displays it. nothing special. 23500 faces, 70500 vertices, normals and texcoords.
divided into 69 Meshes.
Drawing it mesh-per-mesh with one glBegin/glEnd per mesh and then glVertex3f... and so on, I get a framerate of 26fps.

using ONE display-list for the whole object gets me 550fps. also ok. but I have the need to modify the vertex-data often. so display-lists aren't the thing for me.

ok, I thought. Let's use Vertex Arrays.

I put all the vertices from all meshes into one big float-array. same with normals and texture-data.
so I had three float-arrays.
then I used glEnableClientState for vertex, normals and texturecoord-arrays.
used gl*Pointer-functions to set the pointers and then go to the draw loop.
(setting up the matrix, and then calling:)
glDrawArrays(GL_TRIANGLES, 0, iVertexCount);
the modelview-matrix-calculations are the same in all three methods. (displaylists, immediatemode, vertex arrays).
the model showed up. displayed correctly. all ok. BUT: only 24fps???
Vertex Arrays slower than Immediate-mode???

what's wrong here?

Greetings, Sebastian

Streaming data from three disjoint arrays at the same time may explain your performance issues.

Try interleaved arrays.

struct Vertex
{
  float tx,ty;     //tex coords
  float nx,ny,nz;  //normal
  float x,y,z;     //position
};

Vertex* vertex_array=(Vertex*)malloc(70500*sizeof(Vertex));
//fill array
.
.
.

//render:
glInterleavedArrays(GL_T2F_N3F_V3F,sizeof(Vertex),vertex_array); //format matches the struct: texcoord, normal, position
glDrawArrays(<...> );

Note that you can use other vertex layouts than those prefabbed for glInterleavedArrays. Look it up in the spec to see what it does, and more importantly, how. This will allow you to pull the same trick for arbitrary vertex layouts (say, for multitexturing).
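
A minimal sketch of the "how", assuming the Vertex struct from above: for this layout, glInterleavedArrays amounts to pointing each gl*Pointer call at the same buffer with a shared stride and a per-attribute byte offset. The names AttribLayout and describeLayout are made up for illustration, and the GL calls are only shown in comments because they need a live context.

```cpp
#include <cassert>
#include <cstddef>

// Same layout as the Vertex struct above: texcoord, normal, position.
struct Vertex {
    float tx, ty;
    float nx, ny, nz;
    float x, y, z;
};

// Hypothetical helper type: stride and byte offsets for the gl*Pointer calls.
struct AttribLayout {
    std::size_t stride;
    std::size_t texCoordOffset;
    std::size_t normalOffset;
    std::size_t positionOffset;
};

inline AttribLayout describeLayout() {
    AttribLayout l;
    l.stride = sizeof(Vertex);
    l.texCoordOffset = offsetof(Vertex, tx);
    l.normalOffset = offsetof(Vertex, nx);
    l.positionOffset = offsetof(Vertex, x);
    return l;
}

// With a GL context, the equivalent setup would be roughly:
//   const char* base = reinterpret_cast<const char*>(vertex_array);
//   glEnableClientState(GL_TEXTURE_COORD_ARRAY);
//   glEnableClientState(GL_NORMAL_ARRAY);
//   glEnableClientState(GL_VERTEX_ARRAY);
//   glTexCoordPointer(2, GL_FLOAT, (GLsizei)l.stride, base + l.texCoordOffset);
//   glNormalPointer(GL_FLOAT, (GLsizei)l.stride, base + l.normalOffset);
//   glVertexPointer(3, GL_FLOAT, (GLsizei)l.stride, base + l.positionOffset);
```

For a layout that has no prefabbed format (say, a second set of texcoords), the same idea applies: add the field to the struct and one more gl*Pointer call with its own offset.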

I agree with interleaving leading to better performance.

Also, you want to use DrawRangeElements() rather than DrawElements(). If you can, try also using the LockArraysEXT() extension; it may have an impact on some hardware.
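
glDrawRangeElements needs to be told the lowest and highest index used by the index list. A rough sketch of computing that once at load time, assuming 16-bit indices; the helper name indexRange is made up:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Returns (lowest, highest) index referenced by the list.
// The list must not be empty.
inline std::pair<unsigned, unsigned>
indexRange(const std::vector<unsigned short>& indices) {
    assert(!indices.empty());
    unsigned short lo = *std::min_element(indices.begin(), indices.end());
    unsigned short hi = *std::max_element(indices.begin(), indices.end());
    return std::pair<unsigned, unsigned>(lo, hi);
}

// With a GL context, the draw call would then look roughly like:
//   std::pair<unsigned, unsigned> r = indexRange(indices);
//   glDrawRangeElements(GL_TRIANGLES, r.first, r.second,
//                       (GLsizei)indices.size(), GL_UNSIGNED_SHORT, &indices[0]);
```

This only applies once the model is drawn with an index list; with plain glDrawArrays the range is already explicit in the first/count arguments.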

Last, for really good performance, you could look into using ARB_vertex_buffer_object, which has suddenly become the Preferred Method™ for submitting geometry, although you need really up-to-date drivers for it to work well. ARB_VBO should give you performance similar to display lists.

I could believe that DrawArrays is worse than immediate mode.

You've got the overhead of array validation but none of the benefit of vertex reuse.

Originally posted by cass:
[b]
I could believe that DrawArrays is worse than immediate mode.

You've got the overhead of array validation but none of the benefit of vertex reuse.[/b]

even if the vertices are ordered cache-friendly?

I think VBOs are the best solution for this problem...

Originally posted by AdrianD:
even if the vertices are ordered cache-friendly?

Yes. DrawArrays is cache-friendly by definition. There's no index list.

You send the same amount of data in both cases, you just pay the additional expense of array setup and validation overhead with DrawArrays.

Originally posted by cass:
[b]
I could believe that DrawArrays is worse than immediate mode.

You've got the overhead of array validation but none of the benefit of vertex reuse.[/b]
Interesting ...
Why do you need to validate on glDrawArrays? It already has an explicit 'range'.

I can understand the vertex reuse thing, but what if I do glDrawArrays(GL_TRIANGLE_STRIP,<...> ), wouldn't it be irrelevant then?

glDrawArrays could even be faster than glDrawRangeElements, depending on primitive type and whether or not the hardware needs indices.

Originally posted by zeckensack:
[b] Interesting ...
Why do you need to validate on glDrawArrays? It already has an explicit 'range'.

I can understand the vertex reuse thing, but what if I do glDrawArrays(GL_TRIANGLE_STRIP,<...> ), wouldn't it be irrelevant then?

glDrawArrays could even be faster than glDrawRangeElements, depending on primitive type and whether or not the hardware needs indices.[/b]

You need to validate on any Pointer changes or array enable/disable. It's inherent array overhead. Not too expensive, but it depends on how frequently you change pointers or array enables.

Indexed stuff can usually get more benefit from vertex sharing than say strips alone.

If you're not getting additional vertex sharing by using indexes, then all you're doing is paying for the overhead of the index list.

Thanks -
Cass

edit: some UBB screwiness (take 2)

[This message has been edited by cass (edited 12-17-2003).]

I'd tend to say that unless your batches are very small, drawArrays is better than immediate mode. On ATI hardware, with their different driver/hardware model, you'll see a bigger benefit from using drawArrays than on NVIDIA hardware.

Hi!

from what I read here, I think I should use VBOs.
but I don't understand the performance difference.
each call to glVertex3f, glNormal3f and glTexCoord2f has its own overhead. and this 3 times per triangle...
and glDrawArrays has that overhead only once...
my problem is that the vertices must be modifiable.
the next thing is that the code runs (vertex arrays and immediate) with 23fps/26fps on an amd1800+ with geforce2mx.
and only 18fps on an amd1000 with geforce4ti
I think it doesn't depend on the graphics hardware. driver and OS are the same on both machines...

bastian

display list : 550 fps
immediate : 26 fps
vertex array : 24 fps

do you calculate the coordinates again and again for every frame ?

Seems strange that drawarrays should be slower than immediate mode. It should still alleviate a good deal of CPU work and I thought it facilitated DMA transfers. With some of the HW I used to work with waaay back, drawarrays was actually the fastest method, faster than drawelements. It's been far too long since I used it so I can't say about any recent experiences with it.

Cass,
you made me want to bring up another question. There have been a few discussions about large vs many small VBOs. You said the cost was in the gl*Pointer calls. What I didn't find clear is whether the cost of these calls is greater when a different VBO is bound or if it's the same even under the same VBO.
In other words, as an example, would we be well off binding a single VBO and then specifying different offsets through gl*Pointer calls (possibly maintaining smaller index formats) or should we minimise the number of gl*Pointer calls and use larger indices and rely on DrawRangeElements to reduce the index sizes?

I have calculated the fps only for the drawing-loop. no mods are made on the vertices.

same loop for all 3 possibilities, only with a call to another drawing handler.

the 3 handlers are:

  1. draw triangle-per-triangle, immediate mode, (for-loop ), the data is in STL-Containers
  2. call to glDrawArrays(), the data is in 3 float-arrays
  3. call to glCallList(), the data is in a displaylist

Tried the interleaved arrays for #2 yet? I'm sure it will give you at least a small boost.

Switched around:

Originally posted by cass:
Indexed stuff can usually get more benefit from vertex sharing than say strips alone.
Yes, obviously.

Originally posted by cass:
[b]You need to validate on any Pointer changes or array enable/disable. It's inherent array overhead. Not too expensive, but it depends on how frequently you change pointers or array enables.

If you're not getting additional vertex sharing by using indexes, then all you're doing is paying for the overhead of the index list. [/b]
I'm not sure I understand.

The benefit should be that you can stream many vertices through a single call (function call overhead, higher bandwidth efficiency). Right?

As for the setup overhead, ... um. Let's see
1) look for occurrences of stride==0, replace
2a) check if vertex layout is 'single stream'
2b) check if vertex layout is 'compact'
2c) select an appropriate copy method

optional) build an index list if required (or just reuse [parts of] an old one)

4) transfer all vertices. Exceptions on bad client memory are fully acceptable here, no?
5) copy last array element attributes to 'current' state.

What you said makes a lot of sense to me in the context of VBOs, but does it fully apply to system memory arrays as well?

Originally posted by Madoc:
Seems strange that drawarrays should be slower than immediate mode. It should still alleviate a good deal of CPU work and I thought it facilitated DMA transfers. With some of the HW I used to work with waaay back, drawarrays was actually the fastest method, faster than drawelements. It's been far too long since I used it so I can't say about any recent experiences with it.

If you render a lot of static geometry with a single DrawArrays call, it should be faster, I agree.

If you're changing pointers frequently and rendering with lots of DrawArrays calls, it may well be slower.

If your geometry is dynamic, and you build the whole array up front beforehand, then you may not be getting the CPU/GPU parallelism that you would with immediate mode.

Mainly, I wanted to point out that "arrays are faster" is not a simple truism. In order to make things faster, the feature/mechanism must be widening a bottleneck that is currently limiting performance.


Cass,
you made me want to bring up another question. There have been a few discussions about large vs many small VBOs. You said the cost was in the gl*Pointer calls. What I didn't find clear is whether the cost of these calls is greater when a different VBO is bound or if it's the same even under the same VBO.
In other words, as an example, would we be well off binding a single VBO and then specifying different offsets through gl*Pointer calls (possibly maintaining smaller index formats) or should we minimise the number of gl*Pointer calls and use larger indices and rely on DrawRangeElements to reduce the index sizes?

This will vary some among implementations, but for NVIDIAs, the performance will be mostly driven by the number of gl*Pointer calls, not so much by how many VBOs are involved.

Too many VBOs and you pay some (marginal) penalty for more frequent VBO state changes. Too few VBOs and you pay a (potentially very high) penalty for forcing a coherent CPU/GPU view of an unnecessarily large chunk of memory. Forcing this coherency requires either synchronization stalling or lots of in-band data copying. This is a real waste if that coherency is not essential.

Small VBOs solve the coherency problem and make driver-side memory management much easier. In the long term, I expect one or two attribs for a few hundred vertexes per VBO to be "free". And it will never hurt (though it may not help much) to pack multiple attributes (perhaps from multiple objects) into a single VBO - if they are static or nearly static. This is probably a good idea if you have lots of static objects with very few vertices - though if you don't render these things all at the same time, immediate mode may be better still.

Does that help?

Thanks -
Cass

edit: clarification ...

[This message has been edited by cass (edited 12-18-2003).]

Originally posted by zeckensack:
What you said makes a lot of sense to me in the context of VBOs, but does it fully apply to system memory arrays as well?

My main point is that both methods have overhead.

All other things being equal (especially things like vertex reuse), if you consider primitives per glBegin call (immediate) or (group of) gl*Pointer calls (arrays), there is usually a threshold below which immediate mode is just faster.
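
One way to picture that threshold is a toy linear cost model: immediate mode pays a per-primitive cost only, arrays pay a fixed setup/validation cost plus a smaller per-primitive cost. The sketch below just solves for the break-even batch size. The function name and all cost constants are hypothetical, in arbitrary units; none of this is measured on real drivers.

```cpp
#include <cassert>

// Toy linear cost model (all constants hypothetical, arbitrary units):
//   immediate mode: cost(n) = n * immPerPrim
//   arrays:         cost(n) = setupCost + n * arrPerPrim
// Returns the batch size above which arrays win,
// or -1.0 if arrays never win under this model.
inline double breakEvenPrimitives(double setupCost,
                                  double immPerPrim,
                                  double arrPerPrim) {
    assert(setupCost >= 0.0);
    if (immPerPrim <= arrPerPrim) {
        return -1.0;  // arrays have no per-primitive advantage to amortize setup
    }
    return setupCost / (immPerPrim - arrPerPrim);
}
```

For example, with a made-up setup cost of 100 and per-primitive costs of 3 (immediate) and 1 (arrays), batches smaller than 50 primitives would favor immediate mode in this model.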

Of course this depends on the actual hw implementation. Much SGI hw was probably always faster in immediate mode, because that was its native interface. Likewise, hardware that has no direct support for immediate mode may not ever be faster than arrays.

Hope this helps...

Cass

Ok: here are my "benchmarks": (now I'm using my home machine, not my office machine (geforce2mx))

only the drawing-routines changed. the vertices are all the same all the time. the arrays are static... (just for benchmarking). 23500 faces, 70500 vertices, 70500 normals, 70500 texcoords

AMD Athlon1000, Geforce4Ti4200, WinXPpro, latest nvidia-drivers

  1. Immediate mode - 18.5fps
  2. Display Lists - 485.5fps
  3. glInterleavedArrays (one Array for whole object) - 66.3fps
  4. gl*Pointer (one Array for whole Object for vertices, one for normals...) - 75.3fps
  5. gl*Pointer (Object split up into 70 meshes) - 77.5fps

to number 5: every mesh is a C++ object and has its own vertex, normal and texcoord-arrays in it. every mesh does a glEnableClientState() for all three coord-types, Pointer-Setup using gl*Pointer, glDrawArrays, and glDisableClientState().
in no. 3 and 4 those functions (glEnableClientState, gl*Pointer, glDisableClientState) are only called once in the Init-phase of the mesh (glDisable in the destroy-phase)...

so I'm a bit confused now... more overhead (but smaller arrays) leads to more performance than less function-overhead but bigger arrays...???

[This message has been edited by mcbastian (edited 12-18-2003).]

(haven't read til here)
but:
>>23500 faces, 70500 vertices, 70500
>>normals, 70500 texcoords
what about optimizing to shared vertices and using indices? i think this would also increase the speed.
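
A rough sketch of what that optimization could look like, assuming the flat "3 vertices per triangle" array from the first post and the interleaved Vertex struct suggested earlier. Vertices are merged only when position, normal and texcoord are all bitwise identical, so how much this saves depends on the model (flat-shaded faces with per-face normals will not merge). IndexedMesh and buildIndexedMesh are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <map>
#include <utility>
#include <vector>

struct Vertex {
    float tx, ty;
    float nx, ny, nz;
    float x, y, z;
};

// Orders vertices by their raw bytes so they can be used as std::map keys.
// This is an exact, bitwise comparison; nearly-equal floats are NOT merged.
struct VertexLess {
    bool operator()(const Vertex& a, const Vertex& b) const {
        return std::memcmp(&a, &b, sizeof(Vertex)) < 0;
    }
};

struct IndexedMesh {
    std::vector<Vertex> vertices;       // unique vertices
    std::vector<unsigned int> indices;  // one index per original vertex
};

inline IndexedMesh buildIndexedMesh(const std::vector<Vertex>& flat) {
    IndexedMesh out;
    std::map<Vertex, unsigned int, VertexLess> seen;
    for (std::size_t i = 0; i < flat.size(); ++i) {
        std::map<Vertex, unsigned int, VertexLess>::iterator it = seen.find(flat[i]);
        if (it == seen.end()) {
            // First time we see this vertex: store it and hand out a new index.
            unsigned int idx = static_cast<unsigned int>(out.vertices.size());
            seen.insert(std::make_pair(flat[i], idx));
            out.vertices.push_back(flat[i]);
            out.indices.push_back(idx);
        } else {
            // Already stored: just reference the existing copy.
            out.indices.push_back(it->second);
        }
    }
    return out;
}

// With a GL context, after the usual gl*Pointer setup on out.vertices:
//   glDrawElements(GL_TRIANGLES, (GLsizei)out.indices.size(),
//                  GL_UNSIGNED_INT, &out.indices[0]);
```

For two triangles sharing an edge, the 6 input vertices collapse to 4 unique ones plus 6 indices.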