strange performance issue

Hi,

With my application I get strange performance results:

P4 2.4 GHz with Radeon 9800 non-pro: about 10 fps
P4 3.2 GHz with Quadro FX 3000 (driver ver. 61.76): only between 1 fps and at most 13 fps

This is for a scene with 70k tris and Phong GLSL shading (3 shader changes (one switch between GLSL and fixed function), about 2900 texture changes, and 3000 glDrawElements calls).
On the P4 2.4 GHz + Radeon my application is CPU bound: changing the window size changes nothing.
With the P4 3.2 GHz + Quadro FX 3000 the application seems to be fragment shader limited. Depending on how many tris are visible and how high the resolution is, the performance gets extremely slow. With the render window maximized it is <= 1 fps. The 13 fps are only reached when the whole scene is out of view and no tris are rendered.

Most of the time is spent in glDrawElements. I use GLSL shaders and static VBOs. I checked for the usual OpenGL errors; the shaders compile, link and validate fine, and no problem is reported anywhere.
I have also tried plain vertex arrays instead of VBOs, but that changes nothing…
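
For reference, a rough way to separate submission time from GPU time (just a sketch; Timer() and DrawScene() are stand-ins for whatever high-resolution timer and draw code are already in the app) is to bracket the frame with glFinish():

[code]
/* rough CPU-vs-GPU split; Timer() is a hypothetical high-resolution timer */
double t0 = Timer();
DrawScene();              /* all state changes + glDrawElements calls      */
double t1 = Timer();      /* t1 - t0: time spent submitting (CPU/driver)   */
glFinish();               /* wait for the GPU to finish the frame          */
double t2 = Timer();      /* t2 - t1: time the GPU still needed            */
printf("submit %.1f ms, gpu %.1f ms\n", (t1 - t0) * 1000.0, (t2 - t1) * 1000.0);
[/code]

If the submit time dominates, the app is CPU/driver bound; if the glFinish() wait dominates, it is GPU (e.g. fragment) bound.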

Does anybody have an idea what could cause this slow performance, or how to track it down?

70K tris, 3000 glDrawElements calls, and 2900 texture state changes means every glDrawElements call draws an average of only 24(!) tris.
this is what i call a “worst case” for your HW.
even if most of the time is spent in the glDrawElements call, the main reason is the texture & state changes (because the state changes are processed when they are needed for rendering, NOT when they are set through the OpenGL API).
actually it would make no difference to draw 5K tris per call… you would get the same performance with this number of state changes on your HW.

to reduce the number of state changes you can do this:

  1. sort your tris by textures/shaders and draw all tris for a material with a single glDrawElements call. this should reduce the number of state changes a lot (see the sketch after this list).

  2. merge your textures into bigger ones (a so-called “texture atlas”). this also reduces the number of state changes (but you have to change your texture coordinate mapping, you can’t use it for repeating textures, and you’ll also get problems with mipmapping… i do not like this method).

  3. and of course, don’t forget to cache your geometry in video/AGP memory using VBOs.
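
for point 1, the per-material draw could look roughly like this (just a sketch: the Batch struct, the contiguous index ranges and the names are assumptions about how you could lay out the data; glUseProgram stands for glUseProgramObjectARB if you are still on the ARB entry points):

[code]
/* sketch: one glDrawElements per material, state changed only when it differs.
   assumes vertices and indices live in bound VBOs and each material owns one
   contiguous range inside the index buffer. names are illustrative. */
typedef struct {
    GLuint     texture;      /* texture object for this material            */
    GLuint     program;      /* GLSL program handle (0 = fixed function)    */
    GLsizei    indexCount;   /* number of indices in this batch             */
    GLsizeiptr indexOffset;  /* byte offset into the bound index buffer     */
} Batch;

void DrawLevel(const Batch *batches, int numBatches)
{
    GLuint curTex = 0, curProg = 0;

    for (int i = 0; i < numBatches; ++i) {
        if (batches[i].texture != curTex) {          /* bind only on change */
            glBindTexture(GL_TEXTURE_2D, batches[i].texture);
            curTex = batches[i].texture;
        }
        if (batches[i].program != curProg) {
            glUseProgram(batches[i].program);
            curProg = batches[i].program;
        }
        /* one draw call per material instead of one per surface */
        glDrawElements(GL_TRIANGLES, batches[i].indexCount, GL_UNSIGNED_SHORT,
                       (const GLvoid *)batches[i].indexOffset);
    }
}
[/code]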

Originally posted by AdrianD:
[b]70K tris, 3000 glDrawElements calls, and 2900 texture state changes means every glDrawElements call draws an average of only 24(!) tris.
this is what i call a “worst case” for your HW.
even if most of the time is spent in the glDrawElements call, the main reason is the texture & state changes (because the state changes are processed when they are needed for rendering, NOT when they are set through the OpenGL API).
actually it would make no difference to draw 5K tris per call… you would get the same performance with this number of state changes on your HW.

to reduce the number of state changes you can do this:

  1. sort your tris by textures/shaders and draw all tris for a material with a single glDrawElements call. this should reduce the number of state changes a lot.
  2. merge your textures into bigger ones (a so-called “texture atlas”). this also reduces the number of state changes (but you have to change your texture coordinate mapping, you can’t use it for repeating textures, and you’ll also get problems with mipmapping… i do not like this method).
  3. and of course, don’t forget to cache your geometry in video/AGP memory using VBOs.[/b]
yeah, I know this isn’t good batching (it’s just brute-force rendering of a complete Quake (mod) level :wink: ). And I don’t have problems with the Radeon performance, but:

I don’t think the state change/driver overhead causes the slow performance on the P4 3.2 GHz + Quadro FX 3000. If it did, the slow performance should be independent of the window size/resolution, but it isn’t. The driver overhead should be more or less independent of the number of rendered fragments, so imo it is not a problem of bad batching. If it were, it would also mean that the NVIDIA driver overhead is at least a magnitude greater than ATI’s, which I can’t believe.

No, I think there has to be another issue…

Oh, and as mentioned, I already use VBOs.

Any other ideas?

how do you draw the level exactly?

a) one pass:
for every shader: set texture, set shader, draw polys.

or
b) multipass:
1st pass: disable all shaders and textures and draw all geometry with a single draw call into the z-buffer.
2nd pass: set the z-test to GL_LEQUAL, disable z-writes, and then draw all primitives like in method a).

with method a) - when your geometry is not front-to-back sorted - all triangles are drawn, even if they are invisible or only partially visible. this produces a lot of invisible fragments (= useless fragment program executions).
using method b) you can at least make sure that no more fragments are shaded than there are visible pixels on the screen.
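
in code, method b) boils down to something like this (just a sketch; DrawAllGeometry() and DrawShadedBatches() are placeholders for your own draw code):

[code]
/* pass 1: depth only - no shading, no texturing, no color writes */
glUseProgram(0);                                    /* fixed function        */
glDisable(GL_TEXTURE_2D);
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_TRUE);
glDepthFunc(GL_LESS);
DrawAllGeometry();                                  /* ideally one draw call */

/* pass 2: shade only the fragments that survived the depth pass */
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_FALSE);                              /* z-writes off          */
glDepthFunc(GL_LEQUAL);
DrawShadedBatches();                                /* textures + GLSL as in a) */
[/code]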

if this still does not help then let’s talk about your fragment programs (if you don’t reduce the precision of the fragment programs using nvidia extensions, then nvidia cards run slower than ati cards of the comparable generation).

Originally posted by AdrianD:
[b]how do you draw the level exactly?

a) one pass:
for every shader: set texture, set shader, draw polys.

or
b) multipass:
1st pass: disable all shaders and textures and draw all geometry with a single draw call into the z-buffer.
2nd pass: set the z-test to GL_LEQUAL, disable z-writes, and then draw all primitives like in method a).

with method a) - when your geometry is not front-to-back sorted - all triangles are drawn, even if they are invisible or only partially visible. this produces a lot of invisible fragments (= useless fragment program executions).
using method b) you can at least make sure that no more fragments are shaded than there are visible pixels on the screen.

if this still does not help then let’s talk about your fragment programs (if you don’t reduce the precision of the fragment programs using nvidia extensions, then nvidia cards run slower than ati cards of the comparable generation).[/b]
a)

I know my current rendering algorithm is ****ty, but I doubted that the Quadro FX 3000 would suffer so much more than a Radeon 9800 non-pro. I mean, is everything working as intended if a high-end workstation card is easily outperformed by a mid-to-high range consumer card just because of a GPU-unfriendly rendering algorithm?

Mhh, seems so… probably the weak performance at higher precision combined with the huge overdraw is the problem. I just hadn’t thought it would make such an enormous difference.

Thanks for the tips, now it makes a little more sense :slight_smile:

I mean, is everything working as intended if a high-end workstation card is easily outperformed by a mid-to-high range consumer card just because of a GPU-unfriendly rendering algorithm?

Doesn’t surprise me, as the big deal with Quadro cards is that they can render nice-looking lines very fast. Because of this, other parts take a performance hit. This is why they don’t recommend using a Quadro for gaming.

About your poor performance: yeah, you are making WAY too many drawelements calls. All of those state changes and the overhead in drawelements pile up like crazy. I wrote a Q3 BSP renderer with bump mapping and other effects a while back and I had seriously bad performance. It turned out I was calling drawelements about 1500 times. I found out what was causing my batching code to fail, and once that was fixed, my number of drawelements calls dropped to the number of textures in my scene, which turned out to be something like 8, I think. My fps went from about 10 to 300+. Even if the state changing wasn’t the problem, as I said before there is a bit of overhead in each drawelements call, and if you make thousands of calls, this overhead piles up to huge amounts, slowing you way down. I did a test on this once where I didn’t do any state changing, just thousands of drawelements calls each drawing one triangle with 3 colors, and all I can say is I sat back and watched a nice slide show. :slight_smile:

Also, VBOs are not going to help you unless you are AGP-transfer limited; given the number of triangles you are rendering, you are most likely not limited in this area. Of course you should still use them, so they are there in case you do load up a butt load of triangles.

What’s going to help an extreme amount is to batch (sort) everything by texture, then render each batch. It would also be good to sort the polys in each batch front to back (excluding transparent surfaces; those should be sorted back to front, unless you use depth peeling, and should be rendered last) before you render each batch. This lets you take advantage of early depth rejection, reducing the number of processed fragments a bunch.
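
Roughly, the front-to-back ordering could be done like this (just a sketch; DrawItem and its fields are made-up names, with depth being the distance of the batch or poly centroid from the camera):

[code]
#include <stdlib.h>

/* hypothetical per-batch record; depth is the camera distance used for sorting */
typedef struct {
    float depth;        /* view-space distance of the batch centroid  */
    int   firstIndex;   /* where the batch starts in the index buffer */
    int   indexCount;
} DrawItem;

/* ascending depth = nearest first, so early-z can reject hidden fragments */
static int CompareFrontToBack(const void *a, const void *b)
{
    float da = ((const DrawItem *)a)->depth;
    float db = ((const DrawItem *)b)->depth;
    return (da > db) - (da < db);
}

/* opaque batches front to back; transparent ones sorted the other way and drawn last */
/* qsort(opaqueItems, numOpaque, sizeof(DrawItem), CompareFrontToBack); */
[/code]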

-SirKnight

I have found the Radeon 9800 to be surprisingly CPU bound – possibly doing CPU T&L instead of using the hardware when using OpenGL. A profiler like VTune will show a lot of time spent in the ATI drivers, and moving vertex data to system memory, instead of the AGP/VRAM memory that vertex buffer objects use, makes things run faster. I have no solution to offer, unfortunately.