ATI VAO performance problems



CAD_Swede
03-08-2003, 01:01 PM
Hey, everyone.

I'm currently putting the ATI extensions into the CAD engine I'm working on right now, and I'm running into some problems with performance.

~1,000,000 triangles (lit, non-textured, vertex-and-normals only, GL_FILL not GL_LINES, etc.) in my seemingly optimal case gives me about 50 fps. That's nice. I'm happy. ATI rules! :-)

However, when I load pretty much the same model (proprietary format, sorry) with a slightly different tristrip layout, I fall out of the "Fast" render path and I end up with 400 ms render times. I'm not so happy.

So, I'm trying to analyze what the heck causes this. I even have one 90k triangle model that runs at 140 ms. It's so slow.

So, obviously the tristrip layout is not very good in the "bad" cases, and given that the models are quite large, it's tricky to analyze the actual model layout (although I will start that next week). I'm away from the code right now and it's driving me mad :-)

What I'm wondering about is: What could cause this?

Since the 90k model is obviously much smaller than the 1M model but renders 5 times slower, something is flakey.

Does anyone know what recommendations ATI has put out? Max tristrip length? Minimum tristrip length? Max number of tristrips? Do collapsed triangles (in longer tristrips) matter? Max indices? Anything? Or is my Radeon 9700 Pro simply busted? ;-)

Since the data format is the same in both cases (vertex and normal array objects in conjunction with the element array extension, etc.), I doubt that the data format is the problem. (i.e. byte alignment etc. shouldn't be the root cause of this, right?)
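
For reference, the setup looks roughly like this (typed from memory since I'm away from the code, so treat it as a sketch; numVerts, interleaved, stripIndices, stripStart and stripLength are placeholders, and in the real code the index data lives in an object buffer too):

    /* vertex+normal data interleaved in one ATI object buffer */
    GLuint buf = glNewObjectBufferATI(numVerts * 6 * sizeof(GLfloat),
                                      interleaved, GL_STATIC_ATI);
    glArrayObjectATI(GL_VERTEX_ARRAY, 3, GL_FLOAT, 6 * sizeof(GLfloat), buf, 0);
    glArrayObjectATI(GL_NORMAL_ARRAY, 3, GL_FLOAT, 6 * sizeof(GLfloat), buf,
                     3 * sizeof(GLfloat));
    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_NORMAL_ARRAY);

    /* one draw call per tristrip: the part that seems to fall off the fast path */
    for (int s = 0; s < numStrips; ++s) {
        glElementPointerATI(GL_UNSIGNED_INT, stripIndices + stripStart[s]);
        glDrawElementArrayATI(GL_TRIANGLE_STRIP, stripLength[s]);
    }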

Ok, hopefully someone will understand what I'm trying to say here. I'm basically falling off the SuperFast rendering path and end up plodding along on the lesser trodden paths of the card/driver. :-)

Thanks for any help!

/Henrik

roffe
03-08-2003, 06:03 PM
I remember reading somewhere that for NVIDIA cards at least you need to be sending 500 primitives or more for "good" performance. I'm sure ATI cards are not that far off.

I would suggest rendering every model with both GL_TRIANGLES and GL_TRIANGLE_STRIP to see the difference. And btw, if this "proprietary format" is of your own creation, I can't see why you couldn't analyze the model layout.

CAD_Swede
03-08-2003, 10:11 PM
Originally posted by roffe:
I remember reading somewhere that for NVIDIA cards at least you need to be sending 500 primitives or more for "good" performance. I'm sure ATI cards are not that far off.

500 primitives? Do you mean per strip or in total? 500-triangle tristrips are fairly large strips, so it might cause problems elsewhere, but I'll look into that.



I would suggest rendering every model with both GL_TRIANGLES and GL_TRIANGLE_STRIP to see the difference. And btw, if this "proprietary format" is of your own creation, I can't see why you couldn't analyze the model layout.

What I meant was that with 1 million triangles in the model, it's a lot of data to sift through to try to find the problem. I didn't mean that I *can't* do it, but I meant that it's a huge undertaking and before I went into that, I figured I should post here to see if there's anyone who could advise me on what to actually look for. :-)

Thanks for replying. I'll keep going through the good and the bad model to see what the major differences seem to be. If anyone can think of something, though, I'd appreciate any help.

Thanks!

/Henrik

MZ
03-09-2003, 06:55 AM
From the OpenGL Hardware Registry for the Radeon 9700:
DrawRangeElements: Max. recommended index count = 65535

Do any of your triangles reference vertices that are more than 65535 indices apart?

roffe
03-09-2003, 09:21 AM
Originally posted by CAD_Swede:
500 primitives? Do you mean per strip or in total? 500-triangle tristrips are fairly large strips, so it might cause problems elsewhere, but I'll look into that.

I would say it means per batch, i.e. per glDraw***() call. Unless you call glFinish after every draw call, it is impossible to say when the GPU will start processing vertices.
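
If you want to see the per-batch cost for yourself, something crude like this will do (just a sketch; clock() is only for illustration, a proper high-resolution timer would be better):

    /* time a single batch: glFinish() blocks until the GPU has drained it */
    #include <time.h>

    clock_t t0 = clock();
    glDrawElements(GL_TRIANGLE_STRIP, indexCount, GL_UNSIGNED_INT, pIndices);
    glFinish();
    double ms = 1000.0 * (double)(clock() - t0) / CLOCKS_PER_SEC;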

CAD_Swede
03-09-2003, 11:16 AM
Originally posted by MZ:
From the OpenGL Hardware Registry for the Radeon 9700: Do any of your triangles reference vertices that are more than 65535 indices apart?

Thanks for giving me that number. Although I *doubt* that this would be the case, I can't positively rule this one out before I've done a long, hard and boring analysis of the differences between the slow and the fast model. :-)

I'll post back here when I've figured it out. Don't hold your breath, though, folks :-)

/Henrik




Graham
03-10-2003, 01:25 AM
I personally have found ATI cards suffer enormously depending on the number of glDrawElements/glDrawRangeElements calls you make, which nVidia cards don't seem to...

Drawing in batches of ~50 tris I can only manage around 400,000 tris/sec on a Radeon 9500P, which is appalling... around 3 million tris/sec on a GeForce _1_, and 9 million when using VAR.

My advice: don't use triangle strips, use one large triangle list. On ATI cards at least (for me), they seem to be no slower than strips. Cranking up the D3D9 mesh viewer, rendering in triangle list mode or single triangle strip mode seems to go slightly in favour of lists for sheer tris/sec (around 70 million to be precise, 55-60 for the strip). I realise this isn't exactly very scientific, since there are far more factors at play, but at least it shows ATI cards handle triangle lists really, really well :-)
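
In other words, instead of one glDraw* call per strip, the whole mesh goes down in a single call (sketch; 'indices' is assumed to hold 3 entries per triangle):

    /* one big indexed triangle list, one call for the entire mesh */
    glDrawElements(GL_TRIANGLES, numTris * 3, GL_UNSIGNED_INT, indices);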


Humus
03-10-2003, 02:12 PM
Are you using an 'odd' format? The hardware support for vertex formats is limited to floats for all data types plus unsigned bytes for colors and shorts for texture coordinates. If you're using any other format, VAO will be slower than plain vertex arrays.
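
So a layout that stays on the fast path would look something like this (hypothetical struct, just to illustrate the constraint):

    typedef struct {
        GLfloat pos[3];     /* positions: float only */
        GLfloat normal[3];  /* normals: float only */
        GLubyte color[4];   /* colors: unsigned byte is fine */
    } FastVertex;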

Korval
03-10-2003, 04:06 PM
Graham, how do Radeons handle a single glMultiDrawElementsEXT call vs. multiple glDrawElements calls (assuming they support multi-draw)?

CAD_Swede
03-10-2003, 09:35 PM
Originally posted by Humus:
Are you using an 'odd' format? The hardware support for vertex formats is limited to floats for all data types plus unsigned bytes for colors and shorts for texture coordinates. If you're using any other format, VAO will be slower than plain vertex arrays.


No, I'm using floats. In the good "fast" run (~50 Mtris/second) the tri strips are pretty long, and in the bad run (2 Mtris/s) some of the tri strips are very short. I suppose this is what's causing my problem: some of the strips are just too damn short (< 5 triangles).

Thanks to everyone who's contributed.

It seems ATI's 9700 just stalls more than the other cards when the tri strips get really short.

/Henrik

Nicolas Lelong
03-11-2003, 01:14 AM
Hi,

FWIW, I've stumbled upon the same problem recently.

As my strips were all very short, the glDraw* calls had _far_ too much overhead [the frame rate fell from about 45 fps without strips to 1.5 fps with short strips!!] - I tried using GL_EXT_multi_draw_arrays without noticing much performance improvement.

For the moment, I disable the stripping on R9700 but my first experiments are leading me to think that strips shorter than 32 triangles are slower than raw triangles... to be confirmed...

Cheers,
Nicolas

CAD_Swede
03-11-2003, 01:18 AM
Originally posted by Nicolas Lelong:
For the moment, I disable the stripping on R9700 but my first experiments are leading me to think that strips shorter than 32 triangles are slower than raw triangles... to be confirmed...


Cool. I'll do some further tests to see if I can find the "breaking point" for tristrip length. Maybe some really nasty code will have to be added to the deepest, tightest render loop to distinguish between long and short tristrips. Yikes. :-(

I'll keep adding stuff to this thread if I find out more. Thanks for replying!

/Henrik

Ozzy
03-11-2003, 07:27 AM
Originally posted by Nicolas Lelong:

For the moment, I disable the stripping on R9700 but my first experiments are leading me to think that strips shorter than 32 triangles are slower than raw triangles... to be confirmed...

Nicolas


Well, the bottleneck is at the primitive level.. That means it is preferable to use triangle lists instead of split strips.. Or in other words, to be as close as possible to the perfect shot -> 1 object = 1 prim = 1 strip. ;)

Concerning the VAO implementation now, the vertex format limitations (floats, RGB bytes) and so on make this EXT a *bit* too rigid, as Humus was saying.

Anyway, I don't know with today's drivers and the latest VAO extensions available (GL_ATI_vertex_attrib_array_object for instance), maybe they have significantly enhanced performance.. But frankly, VAO compared to NV_VAR was *not* challenging at all a few months ago.. (but this is another story :)

CAD_Swede
03-11-2003, 10:26 AM
Originally posted by Ozzy:
Well, the bottleneck is at the primitive level.. That means it is preferable to use triangle lists instead of split strips.. Or in other words, to be as close as possible to the perfect shot -> 1 object = 1 prim = 1 strip. ;)

I feel like a complete dummy here, but.. What are "triangle lists"? I mean, I thought I knew this stuff. :-) Do you mean non-indexed triangles? I can't just give OpenGL my vertex set and just say "render", I still have to feed it some kind of index list? Right? I'm not THAT ignorant, am I? ;-)


Anyway, I don't know with today's drivers and the latest VAO extensions available (GL_ATI_vertex_attrib_array_object for instance), maybe they have significantly enhanced performance.. But frankly, VAO compared to NV_VAR was *not* challenging at all a few months ago.. (but this is another story :)

Do you mean challenging as in difficult? No, they're sweet as candy. Pretty straightforward and very efficient. However, with short strips, performance hits the sewer.

I'm going to do a radical move tomorrow and simply try to "extend" the short strips with the last index in order to pad the strip to a length that the card likes. The reduction in speed is so significant for short strips that I'm *pretty* sure that it'll be faster to render strips that look like:

{0,1,2,3,4,5,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6}
rather than
{0,1,2,3,4,5,6}

even if the idea sounds completely wack on paper.

oh well, the joy of random tweaking to improve speed. You've just got to love it :-)

Again, updates will be made to this thread.

/Henrik



Korval
03-11-2003, 11:49 AM
I'm going to do a radical move tomorrow and simply try to "extend" the short strips with the last index in order to pad the strip to a length that the card likes. The reduction in speed is so significant for short strips that I'm *pretty* sure that it'll be faster to render strips that look like:

The point we've been trying to make is that it isn't the length of the strip that is the problem. It's the number of glDraw* calls you're making. In a model with short strips, you're obviously going to call glDrawElements more often than in a model with long strips.

What you should do is either use a triangle list, or attach all your strips together with degenerate triangles and render them in a single glDrawElements call. Or try glMultiDrawElementsEXT.
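
Joining strips with degenerates is just a matter of repeating indices (sketch; depending on the parity of the first strip's length you may need one extra duplicate to keep the winding consistent):

    /* strip A = {0,1,2,3}, strip B = {10,11,12,13}; repeating A's last
       index and B's first index produces only zero-area triangles between
       the two strips */
    GLuint joined[] = { 0, 1, 2, 3,   3, 10,   10, 11, 12, 13 };
    glDrawElements(GL_TRIANGLE_STRIP, sizeof(joined) / sizeof(joined[0]),
                   GL_UNSIGNED_INT, joined);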

Humus
03-11-2003, 01:48 PM
Originally posted by Ozzy:
Anyway, i don't know with today's drivers and the latest VAO extensions available (GL_ATI_vertex_attrib_array_object for instance) maybe they have significantly enhanced performances.. But frankly, VAO compared to NV_VAR were *not* challenging at all a few months ago.. (but this is another story http://www.opengl.org/discussion_boards/ubb/smile.gif


The VAO implementation is very solid IMO. And it will only get better with ARB_vertex_buffer_object. I've been able to squeeze out 300.1 million vertices/s on a R9700. That's pretty close to the theoretical maximum.


Dodger
03-11-2003, 05:05 PM
From the spec:

EXT_multi_draw_arrays

"These functions behave identically to the standard OpenGL 1.1 functions glDrawArrays() and glDrawElements() except they handle multiple lists of vertices in one call. Their main purpose is to allow one function call to render more than one primitive such as triangle strip, triangle fan, etc."

Supported on all NV cards if I'm not mistaken, and I would be surprised if ATI's drivers didn't support it too. In my experience, it does make a big difference. Yes, the number of glDraw* calls can hurt you, and that's exactly what this extension was created for.
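
Usage is straightforward (sketch; counts[i] is assumed to be the index count of strip i, and indexPtrs[i] a pointer to its first index):

    /* all strips submitted in a single call */
    glMultiDrawElementsEXT(GL_TRIANGLE_STRIP, counts, GL_UNSIGNED_INT,
                           (const GLvoid **)indexPtrs, numStrips);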

CAD_Swede
03-11-2003, 09:33 PM
Originally posted by Korval:
The point we've been trying to make is that it isn't the length of the strip that is the problem. It's the number of glDraw* calls you're making. In a model with short strips, you're obviously going to call glDrawElements more often than in a model with long strips.

Nicolas Lelong, who commented further up, says that he's run into the same problem with short strips and that GL_EXT_multi_draw_arrays didn't help in that case. So, since neither he nor I have had any success with it, I'll have to try something else, right?

Also, when you say "triangle list", what do you mean? Just non-indexed triangles? Or is it some magic I've never heard of? Again, as I wrote in my previous post, I don't recognize this way of drawing triangles. If it's something trivial, then I just don't recognize the corresponding OpenGL calls for it... or maybe my brain is just fried. (I'm taking a C# course this week for work. Gawks! ;-)

/Henrik

roffe
03-11-2003, 10:25 PM
I would assume this means regular indexed triangles using vertex arrays:

glDrawElements(GL_TRIANGLES, numIndices, type, pIndices)

Ozzy
03-11-2003, 10:57 PM
Originally posted by Humus:
The VAO implementation is very solid IMO. And it will only get better with ARB_vertex_buffer_object. I've been able to squeeze out 300.1 million vertices/s on a R9700. That's pretty close to the theoretical maximum.


cool ;)

(very solid, but fixed to one type of vertex format if you want to get nice perfs)

did you use float + byte colors?
what about GL_ATI_vertex_attrib_array_object?
what about performance while enabling/disabling lighting?

just curious.. :) i know you've done your homework! ;)

CAD_Swede
03-12-2003, 05:00 AM
Sorry..Managed to hit Reply instead of Edit:

More stuff:

What I realize now (I hate writing here while being away from my code :-) is that glMultiDrawElementsEXT from the EXT_multi_draw_arrays extension isn't really compatible with the ATI_element_array extension, is it?

I mean, normally I use DrawElementArrayATI to draw each strip. The strip data array for the current mesh has been uploaded, together with the vertex and the normal arrays, to the card according to the ATI_vertex_array_object extension.

I assume that this means that I can't use glMultiDrawElementsEXT, since it's not part of the ATI_element_array extension and therefore (reasonably) not aware of the VAO stuff.

SO, I *guess* I'm sh*t-out-of-luck until the ATI_element_array extension contains something similar to glMultiDrawElementsEXT :-)

Unless, as usual, I've failed to understand the whole extension hullabaloo :-)

/Henrik




Humus
03-12-2003, 01:42 PM
Originally posted by Ozzy:
cool ;)

(very solid, but fixed to one type of vertex format if you want to get nice perfs)

did you use float + byte colors?
what about GL_ATI_vertex_attrib_array_object?
what about performance while enabling/disabling lighting?

just curious.. :) i know you've done your homework! ;)

For that 300.1 Mv/s figure I used plain vertices only. Unsigned bytes should work fine; I've tested that too with good performance. The vertex_attrib_array_object extension only makes sure you can use vertex attributes from GL_ARB_vertex_program in VAO too. VAO was designed when EXT_vertex_shader was around, and since that uses its own set of draw routines, a new call for attributes was needed when ARB_vertex_program came around. This will not be a problem with VBO.

Lighting will of course reduce performance. The R9700 doesn't have any fixed function lighting hardware, so all lighting ends up as a long vertex shader. The more lights enabled, the longer the vertex shader running under the hood. Enabling stuff like double sided lighting etc. will make it slower too.

Korval
03-12-2003, 05:26 PM
Well, CAD, you could try the other thing I suggested:


attach all your strips together with degenerate triangles and render them in a single glDrawElements call

CAD_Swede
03-12-2003, 10:41 PM
Originally posted by Korval:
Well, CAD, you could try the other thing I suggested: Putting them together with degenerate triangles.


As we're working with a CAD program, where GL_LINES will be used as much as GL_FILL to draw the triangles, I think degenerate triangles would still show up and look weird, especially if they're connecting triangles that are in separate planes and "far" away from each other. Maybe there's some logic I could slap on to make it less of a problem, but I believe it would still exist... Thanks for the idea, though. Maybe some kind of mix for the separate cases can be achieved.

Thanks!

/Henrik

CAD_Swede
03-13-2003, 06:16 AM
Originally posted by CAD_Swede:
Nicolas Lelong, who commented further up, says that he's run into the same problem with short strips and that GL_EXT_multi_draw_arrays didn't help in that case. So, since neither he nor I have had any success with it, I'll have to try something else, right?


Just for the record, I'd like to add that I've now tried to pad the strips so that they're always at least 32 vertices long. That didn't help either. (Which is good. That'd be a weird bug :-) So, what's left for me to do is to check for some kind of strip merging technique, such as the degenerate triangles someone mentioned, when that's a possibility.

Or, I could just wait for the ATI drivers to get as good/fast as the nVidia and WildCat drivers are when it comes to rendering short strips. ;-)

Thanks, everyone, for a good discussion and for taking the time to respond.

/Henrik


Korval
03-13-2003, 08:44 AM
As we're working with a CAD program, where GL_LINES will be used as much as GL_FILL to draw the triangles,

Well, there's no such thing as a "degenerate line", so there's nothing you can do on those primitives.


I think degenerate triangles would still show up and look weird

Degenerate triangles (triangles that use the same index twice) will never appear. The GL spec, and virtually every hardware renderer I know of, guarantees this.

Also, if your data is really that poorly strippable, you should really consider a triangle list (GL_TRIANGLES) rather than a bunch of strips (GL_TRIANGLE_STRIP).


Also, when you say "triangle list", what do you mean? Just non-indexed triangles? Or is it some magic I've never heard of? Again, as I wrote in my previous post I don't recognize this way of drawing triangles.

It's a list of triangles. If you're drawing indexed with glDrawElements, each set of 3 indices represents a particular triangle. It's just like a face list. If you're not drawing indexed (and unless you have significant vertex reuse, I'm not sure I would suggest it), then you just arrange your vertex data into an array with 3 vertices per triangle, duplicating data where necessary.
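
Converting a strip to list triangles is mechanical (sketch; a strip of n indices yields n-2 triangles, and the odd-numbered ones get two indices swapped to keep the winding consistent):

    /* append one strip's triangles to a triangle-list index array */
    int j = 0;                               /* write position in listIdx */
    for (int i = 0; i < stripLen - 2; ++i) {
        if (i & 1) {                         /* odd triangle: flip winding */
            listIdx[j++] = strip[i + 1];
            listIdx[j++] = strip[i];
        } else {
            listIdx[j++] = strip[i];
            listIdx[j++] = strip[i + 1];
        }
        listIdx[j++] = strip[i + 2];
    }
    /* later: glDrawElements(GL_TRIANGLES, j, GL_UNSIGNED_INT, listIdx); */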

CAD_Swede
03-13-2003, 11:48 AM
Originally posted by Korval:
Well, there's no such thing as a "degenerate line", so theres nothing you can do on those primitives.

I wrote "GL_LINES", not "GL_LINE". GL_LINES draws the edges of a triangle instead of filling it as GL_FILL does. Thus, degenerate triangles can't be used when I'm drawing GL_LINES as those lines will be clearly visible...or so I believe. We have an outline mode where degenerate triangles definitely makes things look like crap. :-)



Also, if your data is really that poorly strippable, you should really consider a triangle list (GL_TRIANGLES) rather than a bunch of strips (GL_TRIANGLE_STRIP).


The data is in general very nicely strippable. We often get strips over 2000 vertices long. Very nice. Very fast. However, we stumble across shorter strips at times, and on those models rendering speed drops from 50 fps for 1 million triangles to 20 fps for 150,000 triangles... or, in a few horror examples, way below that.

So, degenerate triangles will work nicely for rendering modes where we only use GL_FILL (which is *most* of the time, so it should be ok). In a few cases, though, we're going to have to render with GL_LINES instead, and in those cases we'll have to ditch the degenerate triangles.

Thanks, though! I really appreciate it!


Edited to add: Now, the *real* problem isn't the tristrip techniques used, to be honest.

The problem is that the Radeon 9700 Pro kicks nVidia's and the WildCat's butt when being fed good strips. However, in the "bad" cases, the WildCat and the nVidia GeForce 4 render the models A LOT faster than the 9700 Pro does. Why, ATI, why!?!? :-)

Why can't the ATI card be just as content as the nVidia card is when rendering many short strips? :-)

/Henrik




Madoc
03-13-2003, 11:02 PM
The simplest solution to your problem is to render the long triangle strips as strips, and then all the triangles making up strips of less than 32 in a single GL_TRIANGLES call. I would try this before anything else.
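
Something like this, roughly (sketch; stripLength/stripIndices stand in for however you store your strips, and shortListIdx is the short strips converted to list triangles as described above):

    /* long strips stay as strips, everything short goes into one list call */
    for (int s = 0; s < numStrips; ++s) {
        if (stripLength[s] >= 32)
            glDrawElements(GL_TRIANGLE_STRIP, stripLength[s],
                           GL_UNSIGNED_INT, stripIndices[s]);
    }
    glDrawElements(GL_TRIANGLES, shortListCount, GL_UNSIGNED_INT, shortListIdx);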

When doing lines you might find it's much faster to actually pass line geometry rather than changing the polygon mode. I found this to be especially true on Wildcats.