Vertex-Arrays not faster than immediate mode?



mcbastian
12-17-2003, 05:01 AM
Hi!

First of all: latest NVIDIA drivers, WinXP Pro, Visual Studio 6, C++ :-)

I wrote a program which loads a model and displays it. Nothing special: 23500 faces, 70500 vertices, normals and texcoords,
divided into 69 meshes.
Drawing it mesh by mesh, with one glBegin/glEnd per mesh and then glVertex3f... and so on, I get a framerate of 26fps.

using ONE display-list for the whole object gets me 550fps. also ok. but I have the need to modify the vertex-data often. so display-lists aren't the thing for me.

ok, I thought. Let's use Vertex Arrays.

I put all the vertices from all meshes into one big float-array. same with normals and texture-data.
so I had three float-arrays.
Then I used glEnableClientState for the vertex, normal and texcoord arrays,
used the gl*Pointer functions to set the pointers, and then went to the draw loop
(setting up the matrix, and then calling
glDrawArrays(GL_TRIANGLES, 0, iVertexCount);).
The modelview-matrix calculations are the same in all three methods (display lists, immediate mode, vertex arrays).
The model showed up and displayed correctly. All ok. BUT: only 24fps????
Vertex arrays slower than immediate mode????

what's wrong here?

Greetings, Sebastian

zeckensack
12-17-2003, 07:21 AM
Streaming data from three disjoint arrays at the same time may explain your performance issues.

Try interleaved arrays.


struct Vertex
{
float tx,ty; //tex coords
float nx,ny,nz; //normal
float x,y,z; //position
};

Vertex* vertex_array=(Vertex*)malloc(70500*sizeof(Vertex)) ;
//fill array
.
.
.

//render:
glInterleavedArrays(GL_T2F_N3F_V3F,sizeof(Vertex),vertex_array);
glDrawArrays(<...> );
Note that you can use other vertex layouts than those prefabbed for glInterleavedArrays. Look it up in the spec to see what it does, and more importantly, how. This will allow you to pull the same trick for arbitrary vertex layouts (say, for multitexturing).
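
A minimal sketch of what the GL_T2F_N3F_V3F layout implies, assuming tightly packed 4-byte floats. The constant names (kTexCoordOffset and so on) are made up for illustration; the gl*Pointer calls in the trailing comment are the standard entry points and show roughly the manual setup that glInterleavedArrays performs, which is the pattern to extend for layouts it has no format for.

```cpp
#include <cassert>
#include <cstddef>

// Same member order as the Vertex struct above: T2F, N3F, V3F.
struct Vertex {
    float tx, ty;       // texture coordinates
    float nx, ny, nz;   // normal
    float x, y, z;      // position
};

// Byte offsets of each attribute inside one vertex (illustrative names).
constexpr std::size_t kTexCoordOffset = offsetof(Vertex, tx);
constexpr std::size_t kNormalOffset   = offsetof(Vertex, nx);
constexpr std::size_t kPositionOffset = offsetof(Vertex, x);
constexpr std::size_t kStride         = sizeof(Vertex);

// The interleaved format only works if the compiler added no padding.
static_assert(kStride == 8 * sizeof(float), "Vertex must be tightly packed");

// Manual setup for a Vertex* named vertex_array (needs a GL context):
//   const char* base = reinterpret_cast<const char*>(vertex_array);
//   glEnableClientState(GL_TEXTURE_COORD_ARRAY);
//   glEnableClientState(GL_NORMAL_ARRAY);
//   glEnableClientState(GL_VERTEX_ARRAY);
//   glTexCoordPointer(2, GL_FLOAT, kStride, base + kTexCoordOffset);
//   glNormalPointer(GL_FLOAT, kStride, base + kNormalOffset);
//   glVertexPointer(3, GL_FLOAT, kStride, base + kPositionOffset);
```

With this in hand, adding a second texcoord set is a matter of one more struct member and one more gl*Pointer call with the matching offset.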

jwatte
12-17-2003, 09:28 AM
I agree with interleaving leading to better performance.

Also, you want to use DrawRangeElements() rather than DrawElements(). If you can, try also using the LockArraysEXT() extension; it may have an impact on some hardware.

Last, for really good performance, you could look into using ARB_vertex_buffer_object, which has suddenly become the Preferred Method (tm) for submitting geometry, although you need really up-to-date drivers for it to work well. ARB_VBO should give you similar performance to display lists.
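
glDrawRangeElements needs the lowest and highest index that the index list references. A small sketch of computing that range, assuming 32-bit indices held in a std::vector; the helper name indexRange is hypothetical, and the GL call in the comment needs a live context.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Returns {lowest, highest} vertex index referenced by a non-empty index list.
std::pair<std::uint32_t, std::uint32_t>
indexRange(const std::vector<std::uint32_t>& indices) {
    assert(!indices.empty());
    auto mm = std::minmax_element(indices.begin(), indices.end());
    return {*mm.first, *mm.second};
}

// Usage with a GL context, for an index vector idx:
//   auto r = indexRange(idx);
//   glDrawRangeElements(GL_TRIANGLES, r.first, r.second,
//                       static_cast<GLsizei>(idx.size()),
//                       GL_UNSIGNED_INT, idx.data());
```

For static index lists the range only has to be computed once, at load time.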

cass
12-17-2003, 10:10 AM
I could believe that DrawArrays is worse than immediate mode.

You've got the overhead of array validation but none of the benefit of vertex reuse.

AdrianD
12-17-2003, 10:14 AM
Originally posted by cass:

I could believe that DrawArrays is worse than immediate mode.

You've got the overhead of array validation but none of the benefit of vertex reuse.

even if the vertices are in cache-friendly order?

AdrianD
12-17-2003, 10:18 AM
I think VBO's are the best solution for this problem....

cass
12-17-2003, 10:31 AM
Originally posted by AdrianD:
even if the vertices are in cache-friendly order?

Yes. DrawArrays is cache-friendly by definition. There's no index list.

You send the same amount of data in both cases, you just pay the additional expense of array setup and validation overhead with DrawArrays.
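
The "same amount of data" point can be put in numbers for the model from the first post. This is only back-of-the-envelope arithmetic, assuming 4-byte floats and 8 floats per vertex (2 texcoord, 3 normal, 3 position); the function name bytesPerFrame is made up here.

```cpp
#include <cassert>
#include <cstddef>

// Bytes of vertex data submitted per frame when no vertex is shared.
constexpr std::size_t bytesPerFrame(std::size_t vertexCount,
                                    std::size_t floatsPerVertex) {
    return vertexCount * floatsPerVertex * sizeof(float);
}

// 2 texcoord + 3 normal + 3 position floats per vertex.
constexpr std::size_t kFloatsPerVertex = 8;

// 70500 vertices * 8 floats * 4 bytes = 2,256,000 bytes per frame,
// whether they arrive through glVertex3f calls or through glDrawArrays.
constexpr std::size_t kModelBytes = bytesPerFrame(70500, kFloatsPerVertex);
```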

zeckensack
12-17-2003, 10:40 AM
Originally posted by cass:

I could believe that DrawArrays is worse than immediate mode.

You've got the overhead of array validation but none of the benefit of vertex reuse.

Interesting ...
Why do you need to validate on glDrawArrays? It already has an explicit 'range'.

I can understand the vertex reuse thing, but what if I do glDrawArrays(GL_TRIANGLE_STRIP,<...> ), wouldn't it be irrelevant then?

glDrawArrays could even be faster than glDrawRangeElements, depending on primitive type and whether or not the hardware needs indices.

cass
12-17-2003, 11:03 AM
Originally posted by zeckensack:
Interesting ...
Why do you need to validate on glDrawArrays? It already has an explicit 'range'.

I can understand the vertex reuse thing, but what if I do glDrawArrays(GL_TRIANGLE_STRIP,<...> ), wouldn't it be irrelevant then?

glDrawArrays could even be faster than glDrawRangeElements, depending on primitive type and whether or not the hardware needs indices.


You need to validate on any Pointer changes or array enable/disable. It's inherent array overhead. Not too expensive, but it depends on how frequently you change pointers or array enables.

Indexed stuff can usually get more benefit from vertex sharing than say strips alone.

If you're not getting *additional* vertex sharing by using indexes, then all you're doing is paying for the overhead of the index list. ;)

Thanks -
Cass

edit: some UBB screwiness (take 2)


[This message has been edited by cass (edited 12-17-2003).]

Enbar
12-17-2003, 03:35 PM
I'd tend to say unless your batches are very small drawArrays is better than immediate mode. On ATI hardware with their different driver/hardware model you'll see a bigger benefit from using drawArrays than NVIDIA hardware.

mcbastian
12-18-2003, 12:03 AM
Hi!

From what I read here, I think I should use VBOs.
But I don't understand the performance difference.
Each call to glVertex3f, glNormal3f and glTexCoord2f has its own overhead, and each is made three times per triangle...
and glDrawArrays has its overhead only once...
My problem is that the vertices must be modifiable.
The next thing is that the code runs (vertex arrays and immediate mode) at 23fps/26fps on an AMD 1800+ with a GeForce2 MX,
and at only 18fps on an AMD 1000 with a GeForce4 Ti.
I think it doesn't depend on the graphics hardware. Driver and OS are the same on both machines.......

bastian

HamsterofDeath
12-18-2003, 12:52 AM
display list : 550 fps
immediate : 26 fps
vertex array : 24 fps

do you calculate the coordinates again and again for every frame ?

Madoc
12-18-2003, 03:44 AM
Seems strange that drawarrays should be slower than immediate mode. It should still alleviate a good deal of CPU work and I thought it facilitated DMA transfers. With some of the HW I used to work with waaay back, drawarrays was actually the fastest method, faster than drawelements. It's been far too long since I used it so I can't say about any recent experiences with it.


Cass,
you made me want to bring up another question. There have been a few discussions about large vs. many small VBOs. You said the cost was in the gl*Pointer calls. What I didn't find clear is whether the cost of these calls is greater when a different VBO is bound or if it's the same even under the same VBO.
In other words, as an example, would we be well off binding a single VBO and then specifying different offsets through gl*Pointer calls (possibly maintaining smaller index formats), or should we minimise the number of gl*Pointer calls, use larger indices, and rely on DrawRangeElements to reduce the index sizes?

mcbastian
12-18-2003, 06:40 AM
I have calculated the fps only for the drawing-loop. no mods are made on the vertices.

same loop for all 3 possibilities, only with a call to another drawing handler.

the 3 handlers are:
1. draw triangle-per-triangle, immediate mode (for-loop :) ), the data is in STL containers
2. call to glDrawArrays(), the data is in 3 float-arrays
3. call to glCallList(), the data is in a displaylist

zeckensack
12-18-2003, 06:46 AM
Tried the interleaved arrays for #2 yet? I'm sure it will give you at least a small boost.

zeckensack
12-18-2003, 07:14 AM
Switched around:

Originally posted by cass:
Indexed stuff can usually get more benefit from vertex sharing than say strips alone.

Yes, obviously.


Originally posted by cass:
You need to validate on any Pointer changes or array enable/disable. It's inherent array overhead. Not too expensive, but it depends on how frequently you change pointers or array enables.

If you're not getting *additional* vertex sharing by using indexes, then all you're doing is paying for the overhead of the index list. ;)

I'm not sure I understand.

The benefit should be that you can stream many vertices through a single call (function call overhead, higher bandwidth efficiency). Right?

As for the setup overhead, ... um. Let's see
1)look for occurrences of stride==0, replace
2a)check if vertex layout is 'single stream'
2b)check if vertex layout is 'compact'
2c)select an appropriate copy method

optional)build an index list if required (or just reuse [parts of] an old one)

4)transfer all vertices. Exceptions on bad client memory are fully acceptable here, no?
5)copy last array element attributes to 'current' state.
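
Steps 1 and 2a above could look something like the following. This is purely a guess at driver-side logic, not what any actual driver does; ArrayDesc, effectiveStride and isSingleStream are hypothetical names, and addresses are taken as integers to keep the comparison well defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Step 1: OpenGL treats stride == 0 as "tightly packed", so the real step
// between consecutive elements is componentCount * componentSize.
constexpr std::size_t effectiveStride(std::size_t stride,
                                      std::size_t componentCount,
                                      std::size_t componentSize) {
    return stride != 0 ? stride : componentCount * componentSize;
}

// One enabled client array: base address and effective stride in bytes.
struct ArrayDesc {
    std::uintptr_t base;
    std::size_t stride;
};

// Step 2a: two arrays belong to one interleaved stream if they share a
// stride and their base addresses lie within one stride of each other.
constexpr bool isSingleStream(ArrayDesc a, ArrayDesc b) {
    return a.stride == b.stride &&
           (a.base > b.base ? a.base - b.base : b.base - a.base) < a.stride;
}
```

Three separate float arrays, as in the original post, fail the single-stream test, while the interleaved struct passes it.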

What you said makes a lot of sense to me in the context of VBOs, but does it fully apply to system memory arrays as well?

cass
12-18-2003, 08:26 AM
Originally posted by Madoc:
Seems strange that drawarrays should be slower than immediate mode. It should still alleviate a good deal of CPU work and I thought it facilitated DMA transfers. With some of the HW I used to work with waaay back, drawarrays was actually the fastest method, faster than drawelements. It's been far too long since I used it so I can't say about any recent experiences with it.


If you render a lot of static geometry with a single DrawArrays call, it should be faster, I agree.

If you're changing pointers frequently and rendering with lots of DrawArrays calls, it may well be slower.

If your geometry is dynamic, and you build the whole array up front beforehand, then you may not be getting the CPU/GPU parallelism that you would with immediate mode.

Mainly, I wanted to point out that "arrays are faster" is not a simple truism. In order to make things faster, the feature/mechanism must be widening a bottleneck that is currently limiting performance.



Cass,
you made me want to bring up another question. There have been a few discussions about large vs. many small VBOs. You said the cost was in the gl*Pointer calls. What I didn't find clear is whether the cost of these calls is greater when a different VBO is bound or if it's the same even under the same VBO.
In other words, as an example, would we be well off binding a single VBO and then specifying different offsets through gl*Pointer calls (possibly maintaining smaller index formats), or should we minimise the number of gl*Pointer calls, use larger indices, and rely on DrawRangeElements to reduce the index sizes?

This will vary some among implementations, but for NVIDIAs, the performance will be mostly driven by the number of gl*Pointer calls, not so much by how many VBOs are involved.

Too many VBOs and you pay some (marginal) penalty for more frequent VBO state changes. Too few VBOs and you pay a (potentially very high) penalty for forcing a coherent CPU/GPU view of an unnecessarily large chunk of memory. Forcing this coherency requires either synchronization stalling or lots of in-band data copying. This is a real waste if that coherency is not essential.

Small VBOs solve the coherency problem and make driver-side memory management much easier. In the long term, I expect one or two attribs for a few hundred vertexes per VBO to be "free". And it will never hurt (though it may not help much) to pack multiple attributes (perhaps from multiple objects) into a single VBO -- if they are static or nearly static. This is probably a good idea if you have lots of static objects with very few vertices - though if you don't render these things all at the same time, immediate mode may be better still.

Does that help?

Thanks -
Cass

edit: clarification ...

[This message has been edited by cass (edited 12-18-2003).]

cass
12-18-2003, 08:50 AM
Originally posted by zeckensack:
What you said makes a lot of sense to me in the context of VBOs, but does it fully apply to system memory arrays as well?

My main point is that both methods have overhead.

All other things being equal (especially things like vertex reuse), if you consider primitives per glBegin call (immediate) or (group of) gl*Pointer calls (arrays), there is usually a threshold below which immediate mode is just faster.

Of course this depends on the actual hw implementation. Much SGI hw was probably always faster in immediate mode, because that was its native interface. Likewise, hardware that has no direct support for immediate mode may not ever be faster than arrays.

Hope this helps...

Cass

mcbastian
12-18-2003, 09:20 AM
Ok, here are my "benchmarks" (now I'm using my home machine, not my office machine (GeForce2 MX)):

Only the drawing routines changed. The vertices are the same all the time; the arrays are static (just for benchmarking :) ). 23500 faces, 70500 vertices, 70500 normals, 70500 texcoords.

AMD Athlon 1000, GeForce4 Ti 4200, WinXP Pro, latest NVIDIA drivers

1. Immediate mode - 18.5fps
2. Display lists - 485.5fps
3. glInterleavedArrays (one array for the whole object) - 66.3fps
4. gl*Pointer (one array for the whole object for vertices, one for normals...) - 75.3fps
5. gl*Pointer (object split up into 70 meshes) - 77.5fps
About no. 5: every mesh is a C++ object with its own vertex, normal and texcoord arrays. Every mesh does a glEnableClientState() for all three array types, pointer setup using gl*Pointer, glDrawArrays(), and glDisableClientState().
In no. 3 and 4 those functions (glEnableClientState, gl*Pointer, glDisableClientState) are called only once, in the init phase of the mesh (glDisableClientState in the destroy phase)...

So I'm a bit confused now.... More overhead (but smaller arrays) leads to more performance than less function overhead with bigger arrays.....???



[This message has been edited by mcbastian (edited 12-18-2003).]

DJSnow
12-18-2003, 01:23 PM
(haven't read til here)
but:
>>23500 faces, 70500 vertices, 70500
>>normals, 70500 texcoords
What about optimizing to shared vertices and using indices? I think this would also increase the speed.
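
One way to get from an unshared triangle list to shared vertices plus indices is to deduplicate exact matches. A rough sketch, assuming vertices count as shared only when all eight floats are bitwise equal and none is NaN; MeshVertex, IndexedMesh and buildIndexedMesh are illustrative names, not anything from the original program.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

struct MeshVertex {
    float tx, ty, nx, ny, nz, x, y, z;
};

// Lexicographic ordering over all eight floats so MeshVertex can be a map key.
inline bool operator<(const MeshVertex& a, const MeshVertex& b) {
    return std::tie(a.tx, a.ty, a.nx, a.ny, a.nz, a.x, a.y, a.z) <
           std::tie(b.tx, b.ty, b.nx, b.ny, b.nz, b.x, b.y, b.z);
}

struct IndexedMesh {
    std::vector<MeshVertex> vertices;    // unique vertices
    std::vector<std::uint32_t> indices;  // one entry per input vertex
};

// Collapses identical vertices and records an index for every input vertex.
IndexedMesh buildIndexedMesh(const std::vector<MeshVertex>& soup) {
    IndexedMesh mesh;
    std::map<MeshVertex, std::uint32_t> seen;
    for (const MeshVertex& v : soup) {
        auto it = seen.find(v);
        if (it == seen.end()) {
            auto next = static_cast<std::uint32_t>(mesh.vertices.size());
            it = seen.emplace(v, next).first;
            mesh.vertices.push_back(v);
        }
        mesh.indices.push_back(it->second);
    }
    return mesh;
}
```

The result can then be drawn with glDrawElements or glDrawRangeElements instead of glDrawArrays. How much it shrinks the 70500 vertices depends on how many normals and texcoords really coincide at shared corners.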

HamsterofDeath
12-18-2003, 01:53 PM
display list : 480 fps
glbegin : 18 fps

lol ? sure your geforce isn't cheating ?
is the whole mesh visible all the time ?

i heard that nvidias display lists perform a visibility check by constructing a cube around the whole drawn vertices and then check if you can see the cube

mcbastian
12-18-2003, 11:24 PM
I don't know how my geforce should cheat here..
the fps are calculated by measuring the time used to draw.
The meshes are visible all the time. (they're rotating...)
The performance loss with glBegin could have other causes, because the vertices etc. are stored in STL containers. So I think it's more the CPU than the GPU that limits the performance there.

JustHanging
12-21-2003, 05:46 AM
I think the performance difference is because drawing with glBegin doesn't use T&L on geforce cards but display lists do.

Thanks to this thread I'm changing my next project to default to display lists, I had forgotten how good they are. All I'm worried about is the memory hit: how much memory does a display list take compared to the corresponding vertex arrays? And where do they get stored, main or video memory? I know this is implementation dependent, but I'd like to hear some of the common solutions.

-Ilkka

JanHH
12-21-2003, 06:30 AM
"I think the performance difference is because drawing with glBegin doesn't use T&L on geforce cards but display lists do."

Are you SURE?? That would mean that the whole great achievement of a T&L engine is completely useless when not using display lists!? I can't believe that.. and after all, a display list also contains glBegin and glEnd, only that it is in the graphics card memory and probably compiled.

But this thread also strengthens my belief in display lists.. experiences:

My "small" terrain engine has about 50,000 faces (triangle strips) and runs absolutely flawlessly in terms of memory. The "large" terrain (all of Germany, flightsim) has about 1,000,000 to 2,000,000 faces that are all in display lists at the same time, and there some swapping seems to occur, as it sometimes slows down a little bit ("ruckeln", what is the english word?).

Jan

zeckensack
12-21-2003, 06:24 PM
VBOs with STATIC_DRAW usage should perform just as well as display lists.

I don't think T&L gets switched off in immediate mode. Immediate mode is slow for large batches, but I'd attribute that to comparatively slow transfer bandwidth and function call overhead, not to T&L ...


Originally posted by JanHH
<...> it slows down sometimes a little bit ("ruckeln", what is the english word?).

It depends ... :)
If it runs smooth overall but some single frames take a bit longer, you're looking for "hitches" or the popular "stuttering".
Another nice one is "[it runs] choppy", but that often refers to overall low performance.

Miguel_dup1
12-21-2003, 07:18 PM
I have not helped in this forum in a long time... But it is never too late...

Are you drawing your mesh as lines or as filled polygons?

Draw them as filled polygons...


Miguel C

Miguel_dup1
12-21-2003, 07:20 PM
Sorry, I need to be more explicit... Both faces must be filled.


Regards,
mc



JanHH
12-21-2003, 08:44 PM
"stuttering" is the right word. It runs very fluently but sometimes slows down for a few frames; I guess that is when the graphics card memory gets swapped. But I was really impressed to see how much data can be put into display lists.

Jan

DJSnow
12-22-2003, 07:27 AM
@JanHH:
the word for the effect, you are searching for is "to jerk/jerking".

JustHanging
12-22-2003, 09:57 AM
I've read on this forum (Matt I think it was) a long time ago that T&L isn't used in immediate mode, but I'm not sure if that's valid information anymore. The results speak for themselves, though. I'm sure you can take advantage of T&L when using vertex arrays, but I don't know if it's used with the basic setup; perhaps you need to use cva, var or vbo to get it.

So how is it, NVidia guys?

-Ilkka

Miguel_dup1
12-22-2003, 10:17 AM
Guys, I doubt that the issue is due to the T&L and video card combination... I have both a Radeon and an NVIDIA card, and I get consistent results with both when using the same method of data feeding, e.g. when both are running with vertex arrays or both with immediate mode.

The thing I saw is that when I was using vertex arrays to gain performance, I had in fact lost a significant amount of it compared to immediate mode; my head almost popped open from the steam built up.

At the very least, vertex arrays are going to be as fast as immediate mode, regardless of T&L...

The problem is that vertex arrays are optimized for filled polygons. When I set my polygons to be filled (it must be both faces), vertex arrays lived up to their reputation of being faster.

In case you don't know what I mean by filled polygons, simply place these two lines of code where you initialize the OpenGL settings.

glPolygonMode( GL_FRONT, GL_FILL );
glPolygonMode( GL_BACK, GL_FILL );


Give it a shot.



Madoc
12-22-2003, 10:24 AM
Originally posted by JustHanging:
I've read on this forum (Matt I think it was) a long time ago that T&L isn't used in immediate mode, but I'm not sure if that's valid information anymore. The results speak for themselves, though. I'm sure you can take advantage of T&L when using vertex arrays, but I don't know if it's used with the basic setup; perhaps you need to use cva, var or vbo to get it.

So how is it, NVidia guys?

-Ilkka

That was probably the "T&L vertex cache", which won't work with immediate mode or DrawArrays, simply because you're not reutilising any vertices.

Of course HW T&L is still used regardless of how you specify your geometry.


[This message has been edited by Madoc (edited 12-22-2003).]

JustHanging
12-22-2003, 10:49 PM
Damn, I was wrong. I just went through the archives, and yes, all modes seem to exploit T&L just fine. Sorry.

Mancha: Isn't GL_FRONT_AND_BACK, GL_FILL the default mode? Therefore you shouldn't have to do anything about it unless you've messed it up, right?

-Ilkka

Miguel_dup1
12-22-2003, 11:12 PM
JustHanging, that would be right... But when our fellow requesting help says "a mesh" it sounds like he is not using filled polygons. I am assuming too much I guess.

But if he is getting those results with filled enabled, I would like to look at the code because I am sure it is a wrong setting he has.


mc



mcbastian
12-23-2003, 10:31 AM
...
I'm using textured polygons.
if no texture is available, GL_COLOR_MATERIAL is also enabled :)

in my initgl():

glPolygonMode(GL_FRONT_AND_BACK, GL_FILL);
glEnable(GL_TEXTURE_2D);
glEnable(GL_COLOR_MATERIAL);

;)

but to get back to my initial question: what the hell am I doing wrong? :)

greetings, Bastian

Miguel_dup1
12-23-2003, 06:46 PM
If you want you can send me your code and I will inspect it...
miguel@wayne.edu


mc



jwatte
12-24-2003, 08:37 AM
Originally posted by mcbastian:
the fps are calculated by measuring the time used to draw.


That will not give meaningful results. You have to measure time right after calling SwapBuffers(), then measure time right after the next SwapBuffers(), and subtract your first time; that's your frame-draw-time. One over frame-draw-time equals FPS.

If you only measure the time it takes to issue the commands, no doubt the lists will be very fast, as they're all in memory already, and "drawing" the list consists of putting a command in some queue, pointing it at the memory -- that doesn't actually DRAW the list to screen; it just queues the command.
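
The swap-to-swap measurement described above can be sketched as follows. FrameTimer is a made-up helper, std::chrono::steady_clock is just one reasonable clock choice, and the explicit time-point parameter exists only so the arithmetic can be checked without a window. Averaging over many frames is an extra assumption on my part, to smooth out frames where the swap returns early.

```cpp
#include <cassert>
#include <chrono>

// Measures swap-to-swap time. Call tick() right after each SwapBuffers().
class FrameTimer {
public:
    using Clock = std::chrono::steady_clock;

    // Returns seconds elapsed since the previous tick, or 0.0 on the first call.
    double tick(Clock::time_point now = Clock::now()) {
        double dt = 0.0;
        if (hasLast_) {
            dt = std::chrono::duration<double>(now - last_).count();
        }
        last_ = now;
        hasLast_ = true;
        return dt;
    }

    // One over frame time; 0.0 if no frame time is available yet.
    static double fps(double frameSeconds) {
        return frameSeconds > 0.0 ? 1.0 / frameSeconds : 0.0;
    }

private:
    Clock::time_point last_{};
    bool hasLast_ = false;
};

// In the render loop (Windows, hdc being the device context):
//   drawScene();
//   SwapBuffers(hdc);
//   double dt = timer.tick();   // time for one complete frame
```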

dorbie
12-24-2003, 09:21 AM
SwapBuffers doesn't always block unless there is already another swap queued, so this isn't 100% reliable. When it does block, you're actually measuring the earlier frame, or rather a strange hybrid: code and dispatch from the later frame and drawing from the earlier one. Because the work is pipelined, only the slower of the two gets timed, in a naive view of the scenario.

[This message has been edited by dorbie (edited 12-24-2003).]

mcbastian
12-25-2003, 08:06 AM
Hi!

STUPID ME!!!!! STUPID ME!!!!! STUPID ME!!!!!

Thank you for your help. but I finally found my stupid mistake. I didn't measure the time used for drawing a frame, instead I measured the time it took to call only the gl*()-functions, without taking a look at the time including SwapBuffers().....

now I got the following results:

1) Immediate-Mode: 18.9FPS
2) Display-List: 75.6FPS
3) DrawArrays() 73.9FPS
4) DrawArrays() 70.2FPS

3) uses gl*Pointer
4) uses glInterleavedArrays()

Now everything is clear for me. You shouldn't code after midnight. And if you do, you should double-check your result :)

Thanks for help :-)

Bastian