
Thread: Performance Question

  1. #1
    Intern Newbie
    Join Date
    Jun 2003
    Posts
    32

    Performance Question

    I feel a bit foolish asking this (and this might not be the correct forum) but here it goes:

    In my spare time I am setting up an engine, but I am unsure whether its performance is good or bad.
    A bit of needed info: I have a GF4 ti4200 (x4 AGP 64Meg) and a 1.6 Gig P4.

    I checked the Nvidia site and looked at the peak performances. I think I read somewhere that the nvidia performance statistics are extremely optimized and that I should not expect results close to it in my application. What I wondered is if I am "on the mark" or is there something I am doing wrong (since my results are much lower).

    I set up my test app to generate "lots" of tris in order to check how well it is performing. It has a skydome with approx 21.6K tris, a patch of ground with 8K tris, and a "pillar" in the center with 8K tris as well. At this primitive stage I just send everything to the card.

    3 Textures totalling 38016 tris. Each texture is done in order (ie: Only 3 texture changes) using vertex lighting.
    Static VBOs for everything.
    Indexes are GL_UNSIGNED_SHORT and I use glDrawRangeElements.
    I am sending GL_TRIANGLES.
    I only clear the Z buffer each frame.
    My "hud" displays position, rotation, and FPS. This is the only part that is still done with VAs.

    Currently the VBO indexes are sent row by row. I read an article from nvidia about sending everything in one tri strip and using degenerates. Does this help much or is it not worth the bother?

    I set up some simple profiling code and included the percentages (in case it helps).

    Performance (24 bit color, 8 alpha) fullscreen:
    395 fps @ 800x600 Processing Time: glDrawRangeElements 38% SwapBuffers 24%
    300 fps @ 1024x768 Processing Time: glDrawRangeElements 44% SwapBuffers 47%

    At first glance I thought it was good but when doing the maths it does not seem that great. Opinions? Suggestions? Any help in this matter would be greatly appreciated as I am hesitant to continue until I feel this is resolved.
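
One way to do "the maths" mentioned in this post is to multiply the triangle count by the frame rate. The sketch below is only an illustration in Python: it uses the figures quoted in the post, and the function name is made up.

```python
def triangles_per_second(tris_per_frame, fps):
    """Triangle throughput implied by a measured frame rate."""
    return tris_per_frame * fps


TRIS_PER_FRAME = 38016  # total triangle count quoted in the post

for label, fps in (("800x600", 395), ("1024x768", 300)):
    rate = triangles_per_second(TRIS_PER_FRAME, fps)
    print(f"{label}: {rate / 1e6:.1f} M tris/s")
```

That works out to roughly 15.0 and 11.4 million triangles per second. The drop with resolution, at a constant triangle count, is a hint that vertex throughput is not the only thing being measured.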

  2. #2
    Junior Member Regular Contributor
    Join Date
    Mar 2002
    Location
    NI, Germany
    Posts
    114

    Re: Performance Question

    I think you are experiencing normal/good results. You could further check your application with a performance analyzer like VTune or similar. Otherwise it's hard to say where you could speed something up without having your complete source.
    I don't think that triangle strips will help much, because if you are sending the triangle indices in a cache-friendly way, the driver automatically generates strips.
    From an OpenGL point of view, I don't think there is much you can do in terms of optimization if you already do your own state caching in your application to keep state changes to a minimum.
    My tip is to test everything out for yourself. This way you'll get the optimal performance for your application, because every app is different (well, almost).

    God, i've got to go to bed now....

    Hope this helps

  3. #3
    Advanced Member Frequent Contributor
    Join Date
    Apr 2000
    Posts
    748

    Re: Performance Question

    You need to learn what a bottleneck is.

    There is no point in trying to increase the data transfer rate or the transform rate if you are fillrate limited, which you are at these kinds of framerates. The fact that you lose 95 fps by going up from 800x600 to 1024x768 proves that.

    Y.

  4. #4
    Member Regular Contributor
    Join Date
    Aug 2003
    Location
    France
    Posts
    299

    Re: Performance Question

    I agree with Ysaneya, and furthermore, your HUD may be causing some overdraw.

    NVidia web site has tons of bottleneck tutorials. Try to catch an up to date one though.

    As for strip optimization, I fear the hardware won't generate strips automatically, but rather works with a pre-transform and a post-transform cache. Remapping the indices of your vertices in your array may drastically increase performance. The small tricky point is the pre-transform cache: when you "upload" a vertex to the pipeline, the hardware takes a bunch of its neighbours along with it and stores all of them in the pre-transform cache. Remapping the buffers will make good use of this cache. As a second step, you can try to use tri strips, but in all my tests they gave poor results.

    SeskaPeel.

  5. #5
    Junior Member Regular Contributor
    Join Date
    Mar 2002
    Location
    NI, Germany
    Posts
    114

    Re: Performance Question

    Just to clarify what I meant by "sending triangle indices in a cache friendly way":
    If your triangles look like the fine ASCII art below, you should send your indices in the order 0, 1, 2, 2, 1, 3, assuming clockwise order (next would be 3, 1, 5, 5, 1, 4). This will let the driver reuse the same vertex (here index 2 for the first two tris) without fetching it again from memory. So it takes only two vertex transformations to draw a triangle following the first one (which takes three). This is equal to a triangle strip, IIRC. I think that's what has been going on in nVidia and ATI drivers since the GeForce days (dunno 'bout other vendors).

    Code :
    0-------1-------4
    |     / | \     |
    |   /   |   \   |
    | /     |     \ |
    2-------3-------5

    SeskaPeel:
    Well, of course the HUD will cause a bit of overdraw, but how would you change it other than not drawing it?
    I do agree with you about triangle strips not being a good idea. They cause too much work for the little speedup (if there really is one). Just think about breaking your geometry into pieces because of different textures/texture coordinates...bah.
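
The vertex reuse described in this post can be illustrated with a toy simulation. The Python sketch below assumes a simple FIFO post-transform cache with a made-up size of 16 entries. Real hardware cache sizes and replacement policies vary and are not stated in the thread, so treat the numbers as illustrative only.

```python
from collections import deque


def count_transforms(indices, cache_size=16):
    """Count vertex transforms under a hypothetical FIFO post-transform cache."""
    cache = deque(maxlen=cache_size)  # oldest entry drops out when full
    transforms = 0
    for idx in indices:
        if idx not in cache:
            transforms += 1  # cache miss: the vertex has to be transformed
            cache.append(idx)
    return transforms


# Index order from the post: tris (0,1,2), (2,1,3), (3,1,5), (5,1,4)
indices = [0, 1, 2, 2, 1, 3, 3, 1, 5, 5, 1, 4]

print(count_transforms(indices))                # 6 transforms for 4 triangles
print(count_transforms(indices, cache_size=0))  # 12 when nothing is cached
```

Under this idealised model the first triangle costs three transforms and each following triangle costs one, because two of its three indices are still in the cache.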

  6. #6
    Intern Newbie
    Join Date
    Jun 2003
    Posts
    32

    Re: Performance Question

    Thanks for the tips

    Currently I am sending the rows in 0-1-2-2-1-3 order. Floats all around except bytes for the color. Each array is tightly packed.

    The strip optimization I was reading about is here:
    http://developer.nvidia.com/object/devnews005.html
    Scroll down to the "Coding Tip" section. (Don't use the links near the top of the page.)

    Jens Scheddin: As you were mentioning, the 0-1-2-2-1-3 order takes two vertex transformations to draw a triangle but the method in that link states "only 1 vertex for every 2 triangles needs to be computed".

    Has anyone tried the method described in that Nvidia link? I always was curious as to whether or not that method was worth the effort.
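
For readers unfamiliar with the idea, here is a minimal Python sketch of one common way to stitch the rows of a regular grid into a single triangle strip using degenerate (zero-area) triangles. It is not necessarily the exact method from the linked article, and the function names and the row-major vertex layout are assumptions made for illustration.

```python
def grid_strip_indices(cols, rows):
    """Indices for one triangle strip over a cols x rows row-major vertex grid."""
    indices = []
    for r in range(rows - 1):
        if r > 0:
            # Repeat the previous index and the next row's first index.
            # The resulting zero-area triangles join the two rows.
            indices.append(indices[-1])
            indices.append(r * cols)
        for c in range(cols):
            indices.append(r * cols + c)        # vertex on the current row
            indices.append((r + 1) * cols + c)  # vertex on the row below
    return indices


def is_degenerate(a, b, c):
    """A triangle with a repeated index has zero area."""
    return a == b or b == c or a == c


strip = grid_strip_indices(3, 3)
tris = list(zip(strip, strip[1:], strip[2:]))
real = [t for t in tris if not is_degenerate(*t)]
print(strip)
print(len(tris), "strip triangles,", len(real), "with nonzero area")
```

For the 3x3 grid this gives 14 indices for 8 visible triangles, versus 24 indices when the same triangles are sent as a GL_TRIANGLES list. Whether that saving matters once a vertex cache is in play is exactly the question being asked here.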

    Ysaneya: I wasn't sure about being purely fillrate limited. Let's say that the 800x600 mode was fillrate limited - at 1024x768 I would assume that I would only be pushing about 241 fps. I get 213 fps @1280x1024 which (to me) looks like it's more than a fillrate issue. (I'm not claiming to be even close to being an expert in this area by the way.) It can't be AGP bound since everything currently resides in VBOs. Conversely, if the 1280x1024 mode was fillrate limited, then I should be getting up to 355 fps @ 1024x768 or up to 580 fps in 800x600. This is what got me to think that something else must be a limiting factor - which I am hoping to identify. That's why I was wondering if tri strips (as mentioned in that link) was the way to go.
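
The estimates in the paragraph above follow from scaling the frame rate by pixel count. A small Python sketch of that arithmetic, assuming a purely fillrate-bound frame (the function name is made up):

```python
def predicted_fps(known_fps, known_res, target_res):
    """Scale a frame rate by pixel count, assuming a purely fillrate-bound frame."""
    known_pixels = known_res[0] * known_res[1]
    target_pixels = target_res[0] * target_res[1]
    return known_fps * known_pixels / target_pixels


# From the 800x600 measurement up to 1024x768
print(round(predicted_fps(395, (800, 600), (1024, 768))))    # 241
# From the 1280x1024 measurement down to the smaller modes
print(round(predicted_fps(213, (1280, 1024), (1024, 768))))  # 355
print(round(predicted_fps(213, (1280, 1024), (800, 600))))   # 582
```

The measured 300 fps at 1024x768 sits between the two predictions (241 and 355), which is the mismatch being pointed out. The last figure comes out nearer 582 than the 580 quoted.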

    About the HUD: It's only a bit of text at the moment. I was mentioning the fact I was using VA for it right now. I am planning to convert the HUD system over to VBO but was planning it for a later stage once I finalized other aspects of my code. I mentioned it just in case there were any issues in mixing VA and VBO.

    I now did some searches on optimizing performances and am going to try out some tests they mention. I might be worrying for nothing - I just wasn't sure if the results were good for VBOs. If not, where could I look to improve things.

    Thanks for your time and effort.

  7. #7
    Advanced Member Frequent Contributor
    Join Date
    Oct 2000
    Location
    Belgium
    Posts
    857

    Re: Performance Question

    What is it that makes people want their apps to run at 500 fps? If anything, if your app runs at hundreds of frames per second, you should be looking for ways to make it slower, not faster (i.e. give the video card more work).

    You're never going to reach your card's advertised peak triangle rates with a benchmark that runs at 500 fps, so if that's what you're interested in measuring, optimizing your current code won't help unless you feed the card a lot more than 38000 triangles. I would suggest one million as a good starting point.

    -- Tom

  8. #8
    Intern Newbie
    Join Date
    Jun 2003
    Posts
    32

    Re: Performance Question

    Tom: My response to your first question is your own answer to it - I do want to make it slower. I am planning multiple passes and hope to get reasonable support on lower end cards. I just wanted to start off making a "basic pass" that is as efficient as possible. Although this equates to a higher fps, that is not my goal. The less time I spend on each pass, the more time there is for other things. I also figured nailing this portion down now would save me much grief later on as the project evolved.

    I posted here since I felt my app wasn't completely fillrate bound (contrary to Ysaneya's opinion). For something as simple as this test - I thought it should be. fwiw: I realised today that the new monitor I got allows me to test 1600x1200 and I got 153 fps. Comparing that to the 213 fps @1280x1024 makes me think that around the 1280x1024 mark is when I get fillrate limited - not at 800x600. I simply wanted to see if I am correct in my assumption and what the possible cause is - and see if it was worth the bother to "fix" it.
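
One way to examine the four measurements quoted so far is to convert them to frame times and fit a straight line against pixel count. The Python sketch below assumes a very simple model, frame time = fixed cost + per-pixel cost, which is a simplification introduced here and not something claimed in the thread. The function name is made up.

```python
measurements = [  # (width, height, fps) as reported in the thread
    (800, 600, 395),
    (1024, 768, 300),
    (1280, 1024, 213),
    (1600, 1200, 153),
]


def fit_frame_time(samples):
    """Least-squares fit of frame_ms = fixed_ms + ms_per_mpixel * megapixels."""
    xs = [w * h / 1e6 for w, h, _ in samples]     # megapixels per frame
    ys = [1000.0 / fps for _, _, fps in samples]  # milliseconds per frame
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


fixed_ms, ms_per_mpixel = fit_frame_time(measurements)
print(f"fixed cost ~{fixed_ms:.2f} ms, per-megapixel cost ~{ms_per_mpixel:.2f} ms")
```

With these numbers the fit suggests a resolution-independent cost of roughly a millisecond per frame on top of a cost that grows with pixel count. Four data points at these frame rates are thin evidence, so this is a way of looking at the data and not a diagnosis.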

    If it means 1 million+ polies to help me write optimal code, then so be it; I'll give that a shot too. (I thought 38K was good enough. Obviously I'm wrong then - I did say I'm no expert on this.) Thanks for the tip.

  9. #9
    Advanced Member Frequent Contributor
    Join Date
    Apr 2000
    Posts
    748

    Re: Performance Question

    I don't believe it's entirely fillrate limited, but mostly, yeah. The thing is, there are different kinds of bottlenecks in an application, and usually your framerate is limited by the most important one, fillrate in your case. And at these kinds of framerates I would take any benchmark with a grain of salt.

    In addition, your calculations are suspicious. Although good in theory, in practice there are a lot of things to consider, like GPU/CPU parallelism and things "behind the scenes". You just can't apply a linear formula and expect to guess the framerate for a given resolution. For instance, imagine the VSync issue (I know it's not your problem here, but just to show my point). If you are running at 60.0001 fps, you will see a virtual framerate of 60 fps. Let's say you add a single extra triangle, and your framerate goes down to 59.99999 fps. Suddenly you'll see a virtual framerate of 30 fps. But you just can't conclude that adding one polygon costs 50% of your performance each time. It's pretty much the same at every level in the driver.
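
The VSync example above can be reproduced with a few lines of Python. This sketch assumes a 60 Hz display and plain double buffering, where a frame that misses a retrace has to wait for the next one. The function name is made up, and triple buffering would behave differently.

```python
import math


def vsynced_fps(raw_fps, refresh_hz=60.0):
    """Displayed frame rate when every frame must wait for a vertical retrace."""
    # Number of whole refresh periods each frame occupies
    intervals = math.ceil(refresh_hz / raw_fps)
    return refresh_hz / intervals


print(vsynced_fps(60.0001))   # 60.0
print(vsynced_fps(59.99999))  # 30.0
```

The step from 60 to 30 comes from the rounding up to whole refresh periods, which is why a tiny change in raw frame time can halve the displayed rate.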

    Y.

  10. #10
    Advanced Member Frequent Contributor
    Join Date
    Oct 2000
    Location
    Belgium
    Posts
    857

    Re: Performance Question

    Originally posted by JotDot:
    I am planning multiple passes and hope to get reasonable support on lower end cards. I just wanted to start off making a "basic pass" that is as efficient as possible.

    A noble goal, but the problem with premature optimization is that you may spend time optimizing something that will turn out not to be a bottleneck at all. If the 38K triangle scene you're using now is indicative of what you're aiming for in the long run, you're unlikely to become T&L-limited even when doing multiple passes.

    Indeed, adding more passes may only make you even more fillrate-limited than you already are. In this case, your optimization efforts should be focused on reducing overdraw, not on improving your vertex throughput. The two require very different approaches. You can do both, of course, but chances are 50% of your time will be wasted if you do.

    -- Tom
