ATI VAO performance problems

Hey, everyone.

I’m currently adding the ATI extensions to the CAD engine I’m working on, and I’m running into some performance problems.

~1,000,000 triangles (lit, non-textured, vertex-and-normals only, GL_FILL not GL_LINES, etc.) in my seemingly optimal case gives me about 50 fps. That’s nice. I’m happy. ATI rules! :slight_smile:

However, when I load pretty much the same model (proprietary format, sorry) with a slightly different tristrip layout, I fall out of the “fast” render path and end up with 400 ms render times. I’m not so happy.

So I’m trying to analyze what the heck causes this. I even have one 90k-triangle model that runs at 140 ms. That’s painfully slow.

So obviously the tristrip layout is not very good in the “bad” cases, and given that the models are quite large, it’s tricky to analyze the actual model layout (although I will start on that next week; I’m away from the code right now and it’s driving me mad :slight_smile: ).

What I’m wondering about is: What could cause this?

Since the 90k model is obviously much smaller than the 1M model but renders five times slower, something is flaky.

Does anyone know what recommendations ATI has put out? Max tristrip length? Minimum tristrip length? Max number of tristrips? Do collapsed triangles (in longer tristrips) matter? Max index count? Anything? Or is my Radeon 9700 Pro simply busted? :wink:

Since the data format is the same in both cases (vertex and normal array objects in conjunction with the element array extension, etc.), I doubt that the data format is the problem. (I.e., byte alignment etc. shouldn’t be the root cause of this, right?)
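To be concrete, the setup is roughly along these lines (just a sketch, not my actual code; the buffer handles, counts and index pointer are placeholders, and I’m quoting the GL_ATI_vertex_array_object / GL_ATI_element_array entry points from memory):

/* positions and normals live in ATI object buffers */
GLuint vbuf = glNewObjectBufferATI(vertexCount * 3 * sizeof(GLfloat), vertices, GL_STATIC_ATI);
GLuint nbuf = glNewObjectBufferATI(vertexCount * 3 * sizeof(GLfloat), normals, GL_STATIC_ATI);

glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_NORMAL_ARRAY);
glArrayObjectATI(GL_VERTEX_ARRAY, 3, GL_FLOAT, 0, vbuf, 0);
glArrayObjectATI(GL_NORMAL_ARRAY, 3, GL_FLOAT, 0, nbuf, 0);

/* indices go through the element array extension, one call per strip */
glElementPointerATI(GL_UNSIGNED_INT, stripIndices);
glDrawElementArrayATI(GL_TRIANGLE_STRIP, stripIndexCount);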

OK, hopefully someone will understand what I’m trying to say here. I’m basically falling off the SuperFast rendering path and ending up plodding along the less-trodden paths of the card/driver. :slight_smile:

Thanks for any help!

/Henrik

I remember reading somewhere that for NVIDIA cards, at least, you need to be sending 500 primitives or more for “good” performance. I’m sure ATI cards are not that far off.

I would suggest rendering every model with both GL_TRIANGLES and GL_TRIANGLE_STRIP to see the difference. And by the way, if this “proprietary format” is of your own creation, I can’t see why you couldn’t analyze the model layout.

Originally posted by roffe:
I remember reading somewhere that for NVIDIA cards, at least, you need to be sending 500 primitives or more for “good” performance. I’m sure ATI cards are not that far off.

500 primitives? Do you mean per strip or in total? 500-triangle tristrips are fairly large strips, so they might cause problems elsewhere, but I’ll look into that.


I would suggest rendering every model with both GL_TRIANGLES and GL_TRIANGLE_STRIP to see the difference. And by the way, if this “proprietary format” is of your own creation, I can’t see why you couldn’t analyze the model layout.

What I meant was that with 1 million triangles in the model, there’s a lot of data to sift through to find the problem. I didn’t mean that I can’t do it, just that it’s a huge undertaking, and before I dove into it I figured I should post here to see if anyone could advise me on what to actually look for. :slight_smile:

Thanks for replying. I’ll keep going through the good and the bad models to see what the major differences are. If anyone can think of something, though, I’d appreciate any help.

Thanks!

/Henrik

From the OpenGL Hardware Registry entry for the Radeon 9700:

DrawRangeElements: Max. recommended index count = 65535

Do any of your triangles reference vertices that are more than 65535 indices apart?
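If it helps, the recommended per-call limits can also be queried at run time rather than hard-coded; roughly like this (the variable names are just for illustration):

/* driver-recommended limits for glDrawRangeElements (GL 1.2) */
GLint maxVerts, maxIndices;
glGetIntegerv(GL_MAX_ELEMENTS_VERTICES, &maxVerts);
glGetIntegerv(GL_MAX_ELEMENTS_INDICES, &maxIndices);

/* start/end should bound the smallest and largest index the call actually references */
glDrawRangeElements(GL_TRIANGLE_STRIP, minIndex, maxIndex, indexCount, GL_UNSIGNED_INT, indices);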

Originally posted by CAD_Swede:
500 primitives? Do you mean per strip or in total? 500-triangle tristrips are fairly large strips, so they might cause problems elsewhere, but I’ll look into that.

I would say it means per batch, i.e. per glDraw***() call. Unless you call glFinish after every draw call, it’s impossible to say exactly when the GPU will start processing vertices.

Originally posted by MZ:
From the OpenGL Hardware Registry entry for the Radeon 9700: Do any of your triangles reference vertices that are more than 65535 indices apart?

Thanks for giving me that number. Although I doubt that this is the case, I can’t positively rule it out before I’ve done a long, hard and boring analysis of the differences between the slow and the fast model. :slight_smile:

I’ll post back here when I’ve figured it out. Don’t hold your breath, though, folks :slight_smile:

/Henrik

[This message has been edited by CAD_Swede (edited 03-09-2003).]

I personally have found that ATI cards suffer enormously depending on the number of glDrawElements/glDrawRangeElements calls you make, which NVIDIA cards don’t seem to…

Drawing in batches of ~50 tris, I can only manage around 400,000 tris/sec on a Radeon 9500 Pro, which is appalling… around 3 million tris/sec on a GeForce 1, 9 million when using VAR.

My advice: don’t use triangle strips, use one large triangle list. On ATI cards at least (for me), they seem to be no slower than strips. Cranking up the D3D9 mesh viewer, rendering in triangle list mode versus single triangle strip mode comes out slightly in favour of lists for sheer tris/sec (around 70 million to be precise, 55-60 for the strip). I realise this isn’t exactly very scientific, since there are far more factors at play, but at least it shows ATI cards handle triangle lists really, really well.
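Something like this is what I mean by flattening a strip into a plain GL_TRIANGLES list (only a sketch; appendIndex() and the names are made up):

/* every strip index after the first two emits one triangle; flip the winding
   on odd steps so the faces keep a consistent orientation */
for (unsigned int i = 0; i + 2 < stripLen; ++i) {
    if (i & 1) {
        appendIndex(strip[i + 1]);
        appendIndex(strip[i]);
        appendIndex(strip[i + 2]);
    } else {
        appendIndex(strip[i]);
        appendIndex(strip[i + 1]);
        appendIndex(strip[i + 2]);
    }
}

/* then the whole model can go down in one call */
glDrawElements(GL_TRIANGLES, listIndexCount, GL_UNSIGNED_INT, listIndices);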

[This message has been edited by Graham (edited 03-10-2003).]

Are you using an ‘odd’ format? The hardware support for vertex formats is limited to floats for all data types plus unsigned bytes for colors and shorts for texture coordinates. If you’re using any other format, VAO will be slower than plain vertex arrays.

Graham, how do Radeons handle a single glMultiDrawElementsEXT call vs. multiple glDrawElements calls (assuming they support multi-draw)?

Originally posted by Humus:
Are you using an ‘odd’ format? The hardware support for vertex formats is limited to floats for all data types plus unsigned bytes for colors and shorts for texture coordinates. If you’re using any other format, VAO will be slower than plain vertex arrays.

No, I’m using floats. In the good, “fast” run (~50 Mtris/second) the tri strips are pretty long, and in the bad run (2 Mtris/s) some of the tri strips are very short. I suppose that’s what’s causing my problem: some of the strips are just too damn short (< 5 triangles).

Thanks to everyone who’s contributed.

It seems ATI’s 9700 just stalls more than the other cards when the tri strips get really short.

/Henrik

Hi,

FWIW, I’ve stumbled upon the same problem recently.

As my strips were all very short, the glDraw* calls had far too much overhead (the frame rate fell from about 45 fps without strips to 1.5 fps with short strips!!). I tried using GL_EXT_multi_draw_arrays without noticing much performance improvement.

For the moment, I disable stripping on the R9700, but my first experiments lead me to think that strips shorter than 32 triangles are slower than raw triangles… to be confirmed…

Cheers,
Nicolas

Originally posted by Nicolas Lelong:
For the moment, I disable stripping on the R9700, but my first experiments lead me to think that strips shorter than 32 triangles are slower than raw triangles… to be confirmed…

Cool. I’ll do some further tests to see if I can find the “breaking point” for tristrip length. Maybe some really nasty code will have to be added to the deepest, tightest render loop to distinguish between long and short tristrips. Yikes. :frowning:
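Something along these lines is what I have in mind (just a sketch; the threshold and the appendStripAsTriangles() helper are made up):

#define SHORT_STRIP_LIMIT 32  /* made-up cutoff, to be tuned per card */

/* render long strips as-is, expand short ones into a shared GL_TRIANGLES list */
if (strip->indexCount - 2 >= SHORT_STRIP_LIMIT)
    glDrawElements(GL_TRIANGLE_STRIP, strip->indexCount, GL_UNSIGNED_INT, strip->indices);
else
    appendStripAsTriangles(strip, &triangleList);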

I’ll keep adding stuff to this thread if I find out more. Thanks for replying!

/Henrik

Originally posted by Nicolas Lelong:
For the moment, I disable stripping on the R9700, but my first experiments lead me to think that strips shorter than 32 triangles are slower than raw triangles… to be confirmed…

Nicolas

Well, the bottleneck is at the primitive level… That means it’s preferable to use triangle lists instead of split strips… or, in other words, to get as close as possible to the perfect shot: 1 object = 1 primitive = 1 strip. :wink:

Concerning the VAO implementation, the vertex format limitations (floats, RGB bytes and so on) make this extension a bit too rigid, as Humus was saying.

Anyway, I don’t know; with today’s drivers and the latest VAO extensions available (GL_ATI_vertex_attrib_array_object, for instance), maybe performance has been significantly enhanced… But frankly, VAO was no match for NV_VAR a few months ago… (but this is another story :slight_smile:

Originally posted by Ozzy:
Well, the bottleneck is at the primitive level… That means it’s preferable to use triangle lists instead of split strips… or, in other words, to get as close as possible to the perfect shot: 1 object = 1 primitive = 1 strip.

I feel like a complete dummy here, but… what are “triangle lists”? I mean, I thought I knew this stuff. :slight_smile: Do you mean non-indexed triangles? I can’t just give OpenGL my vertex set and say “render”; I still have to feed it some kind of index list, right? I’m not THAT ignorant, am I? :wink:

Anyway, I don’t know; with today’s drivers and the latest VAO extensions available (GL_ATI_vertex_attrib_array_object, for instance), maybe performance has been significantly enhanced… But frankly, VAO was no match for NV_VAR a few months ago… (but this is another story

Do you mean challenging as in difficult? No, they’re sweet as candy. Pretty straightforward and very efficient. However, with short strips, performance hits the sewer.

I’m going to try a radical move tomorrow and simply “extend” the short strips by repeating the last index, padding each strip to a length that the card likes. The reduction in speed for short strips is so significant that I’m pretty sure it’ll be faster to render strips that look like:

{0,1,2,3,4,5,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6}
rather than
{0,1,2,3,4,5,6}

even if the idea sounds completely wack on paper.
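In code it would be something as dumb as this (a sketch; MIN_STRIP_INDICES is a made-up threshold):

/* pad a short strip by repeating its last index; the repeats only produce
   zero-area triangles, so the picture shouldn't change */
while (stripLen < MIN_STRIP_INDICES) {
    strip[stripLen] = strip[stripLen - 1];
    ++stripLen;
}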

Oh well, the joys of random tweaking to improve speed. You’ve just got to love it :slight_smile:

Again, updates will be made to this thread.

/Henrik

[This message has been edited by CAD_Swede (edited 03-11-2003).]

I’m going to try a radical move tomorrow and simply “extend” the short strips by repeating the last index, padding each strip to a length that the card likes. The reduction in speed for short strips is so significant that I’m pretty sure it’ll be faster to render strips that look like:

The point we’ve been trying to make is that it isn’t the length of the strip that is the problem; it’s the number of glDraw* calls you’re making. In a model with short strips, you’re obviously going to call glDrawElements more often than in a model with long strips.

What you should do is either use a triangle list, or stitch all your strips together with degenerate triangles and render them in a single glDrawElements call. Or try glMultiDrawElementsEXT.
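Roughly like this (a sketch only; the joined[] buffer and the strip structures are placeholders):

/* stitch all strips into one index list, inserting degenerate triangles between them;
   the zero-area triangles are rejected by the hardware almost for free */
int n = 0;
for (int s = 0; s < stripCount; ++s) {
    const unsigned int *idx = strips[s].indices;
    if (s > 0) {
        joined[n] = joined[n - 1];  /* repeat the last index of the previous strip */
        ++n;
        joined[n++] = idx[0];       /* and the first index of this strip */
        /* if the previous strip had an odd number of indices, one more repeat
           is needed here to keep the winding consistent */
    }
    for (int i = 0; i < strips[s].count; ++i)
        joined[n++] = idx[i];
}
glDrawElements(GL_TRIANGLE_STRIP, n, GL_UNSIGNED_INT, joined);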

Originally posted by Ozzy:
Anyway, I don’t know; with today’s drivers and the latest VAO extensions available (GL_ATI_vertex_attrib_array_object, for instance), maybe performance has been significantly enhanced… But frankly, VAO was no match for NV_VAR a few months ago… (but this is another story

The VAO implementation is very solid, IMO. And it will only get better with ARB_vertex_buffer_object. I’ve been able to squeeze out 300.1 million vertices/s on an R9700. That’s pretty close to the theoretical maximum.

[This message has been edited by Humus (edited 03-11-2003).]

From the spec:

EXT_multi_draw_arrays

“These functions behave identically to the standard OpenGL 1.1 functions glDrawArrays() and glDrawElements() except they handle multiple lists of vertices in one call. Their main purpose is to allow one function call to render more than one primitive such as triangle strip, triangle fan, etc.”

Supported on all NV cards if I’m not mistaken, and I would be surprised if ATI’s drivers didn’t support it too. In my experience, it does make a big difference. Yes, the number of glDraw* calls can hurt you, and that’s exactly what this extension was created for.
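Usage is about as simple as it gets (counts[] and indexPointers[] here are assumed to be built from the model’s strips; the names are just for illustration):

/* one call submits every strip in the model */
glMultiDrawElementsEXT(GL_TRIANGLE_STRIP, counts, GL_UNSIGNED_INT,
                       (const GLvoid **)indexPointers, stripCount);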

Originally posted by Korval:
The point we’ve been trying to make is that it isn’t the length of the strip that is the problem; it’s the number of glDraw* calls you’re making. In a model with short strips, you’re obviously going to call glDrawElements more often than in a model with long strips.

Nicolas Lelong, who commented further up, says that he ran into the same problem with short strips and that multi-draw didn’t help in his case. So, since neither he nor I have had any success with multi-draw, I’ll have to try something else, right?

Also, when you say “triangle list”, what do you mean? Just non-indexed triangles? Or is it some magic I’ve never heard of? Again, as I wrote in my previous post, I don’t recognize this way of drawing triangles. If it’s something trivial, then I just don’t recognize the corresponding OpenGL calls for it… or maybe my brain is just fried. (I’m taking a C# course this week for work. Gawks! :wink: )

/Henrik

I would assume this means regular triangles using vertex arrays,

glDraw***(GL_TRIANGLES, nbrOfIndices, type, pIndices)

Originally posted by Humus:
The VAO implementation is very solid, IMO. And it will only get better with ARB_vertex_buffer_object. I’ve been able to squeeze out 300.1 million vertices/s on an R9700. That’s pretty close to the theoretical maximum.

[This message has been edited by Humus (edited 03-11-2003).]

cool :wink:

(very solid, but fixed to one type of vertex format if you want to get nice performance)

Did you use floats + byte colors?
What about GL_ATI_vertex_attrib_array_object?
What about performance with lighting enabled vs. disabled?

Just curious… :slight_smile: I know you’ve done your homework! :wink: