Strange loss of performance



neomind
01-20-2004, 11:20 PM
Hi! My name is Morgan Johansson and I am working on (among other things) an OpenGL-based graphics engine for a game with a very high triangle count.

The last few days I have been optimizing the code and I found something strange - I seem to have a black hole in my code draining performance.

The program is multithreaded (through SDL) and rendering has its own thread. Profiling has shown me that 99.91% of the time in the demo program is spent waiting for rendering to finish. Nearly all of the CPU-intensive tasks (of my creation) are performed in that remaining 0.09%.

At first I simply thought the graphics card was the limiting factor, but that is not the case. Moving from an Intel 865 integrated chip to a GeForce 3 or an ATI FireGL X1 gives no more than twice the framerate (from 20 to 40 fps).

The scene is a rendering of 300 objects of 550 triangles each (though only 6 geometries). This is currently displayed using vertex arrays (glDrawElements with GL_TRIANGLES). Each vertex has position, normal and texture coordinates in floats. There is a single texture on the triangles.

So far I have tried the following with no or very little effect on the framerate:
* Turned off texturing and blocked all calls that upload textures to graphics memory or bind them.
* Decreased the number of triangles in each object to 260.
* Used display lists for all the drawing (6 lists in total).
* Switched between matrix loading and calls to glTranslate etc.

The only thing that seems to have effect on the framerate is decreasing the number of instances drawn of the six meshes.

Some statistics I have:
* I only get a vertex processing rate of 5-10M vertices/second on a GeForce 3 (Athlon 1.33 GHz). As I don't use strips, that is about 2-3M triangles/second.
* I change material settings 9 times each frame.
* I do one push, load, pop on the modelview matrix for each instance.

I realize that there are plenty of things I can do to boost performance. But what I would like to know is where I lose performance. It seems to me it is probably either some CPU-intensive task hidden in the drivers or some bus that isn't fast enough.

Any help with this problem is appreciated! Sorry about the lengthy post, I wanted to describe the problem in detail.

Cheers,
Morgan Johansson

M/\dm/\n
01-20-2004, 11:45 PM
VBO?

Adrian
01-21-2004, 12:21 AM
Check whether you are fill rate limited: render to a much smaller window and see if your frame rate increases.

If it's not fill rate, then using VBO or VAR/VAO should give you better performance.

Ysaneya
01-21-2004, 12:42 AM
In summary, you are drawing 300 * 550 = 165 000 triangles per frame @ 40 fps = 6.6 MTris/sec on a GF3 only using standard vertex arrays. The performance seems to be normal in that case.

You say that 99.91% of your time is spent in "waiting for rendering to finish". I'm assuming you're speaking of swapbuffers here. If your rendering code is quite simple with no CPU work, one theory is that all your calls are queued up by the driver; when swapping the buffers, the queue might be full and need to be emptied before control is handed back to the program.

It could be a bandwidth problem; try modifying your vertex format to position only, and check whether the performance changes. Since you're not using any fancy shading or texturing, I don't think you're fillrate limited, but as it's easy to test (just decrease the resolution of your window), check that too.

Y.

neomind
01-21-2004, 12:57 AM
VBO support is planned for the future, but the question is not really what I can do to improve performance a bit, but rather what could possibly be wrong, since performance is so bad. I would have expected much higher vertex processing rates than I currently see. Display lists should give me fair performance, should they not?

Fill rate is probably not the problem: 800x600 rendered at the same rate as 640x480. Also, I would have expected the graphics cards to make a difference if fill rate were the problem.

I only get 20 fps with the GeForce 3 (the same as with the Intel 865). The 40 fps was with the ATI FireGL X1. As far as I can tell these are very low numbers. The Intel 865 should never be as fast as the GeForce 3 unless the CPU is the problem.

"Waiting for rendering to finish" is actually the thread syncronization. The main loop sleeps until there is work to do (mostly transforms in this case).

I will try modifying the vertex format, but as I have tried rendering everything using display lists, I wonder if it could really be a bandwidth problem? Aren't display lists always stored in graphics memory?

Ysaneya
01-21-2004, 03:02 AM
Yeah, if you were bandwidth limited, I would have expected a gain in performance when switching to display lists. How are you building them? You also said that you tried reducing the number of triangles per object and didn't see a difference, which seems to suggest the problem is not geometry or bandwidth related. Although with standard vertex arrays, if you're getting 10M vertices/second, that's a throughput of 320 MB/second, which is still quite high.

All of this suggests a CPU bottleneck. Are you sure you're not forcing the renderer thread to wait somewhere? You mentioned multiple threads; what happens if you use only one thread?

Y.

neomind
01-21-2004, 03:39 AM
I am building the display lists like this:
*glGenLists
*activate textures and array pointers
*glNewList
*glDrawElements
*glEndList

Calling them works, and it does improve performance a little (from ~18 to ~21 fps).
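In code form the setup looks roughly like this (just a sketch; the mesh fields and helper names are placeholders, not the actual engine code):

/* Compile one display list per geometry; the vertex arrays are
   dereferenced at compile time, so the data ends up in the list. */
GLuint list = glGenLists(1);

glBindTexture(GL_TEXTURE_2D, mesh->texture);
glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_NORMAL_ARRAY);
glEnableClientState(GL_TEXTURE_COORD_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, mesh->positions);
glNormalPointer(GL_FLOAT, 0, mesh->normals);
glTexCoordPointer(2, GL_FLOAT, 0, mesh->texcoords);

glNewList(list, GL_COMPILE);
glDrawElements(GL_TRIANGLES, mesh->indexCount, GL_UNSIGNED_INT, mesh->indices);
glEndList();

/* later, once per instance: */
glCallList(list);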

I do have one place where there could be some waiting, but that is a single mutex lock in the rendering (called once per frame), and it can only cause waiting during those 0.09% of the time. I'll have to check that further, though. But the way I see it, there is hardly any other way to write a multithreaded engine.

Good suggestions! Thank you! It will take me some time to check this.

hoshi55
01-21-2004, 06:52 AM
neomind, you are using SDL's multithreading?

There can be very strange things happening when using SDL multithreading and OpenGL. I tried it once and remember alpha blending simply refusing to work. I spent days on the problem, but in the end SDL's threading turned out to be the cause, and when I removed it, everything worked fine.

Apart from that (obviously you are not having this problem), I think the SDL docs say something about mutex fairness for threads not being 100% guaranteed, or something like that. I also had some problems synchronizing my multithreaded engine, both with concurrency and speed.

So if I were you I'd try rendering without multithreading and see what happens. <EDIT>see my post below</EDIT>
If all else fails, you might want to try using Windows API threads instead. If you don't have to be cross-platform compatible, they work OK.

hope that helps
hoshi55



hoshi55
01-21-2004, 07:12 AM
I couldn't find the section about SDL concurrency imprecisions; maybe I read that on some mailing list. But this is from SDL's documentation project site:



In general, you must be very aware of concurrency and data integrity issues when
writing multi-threaded programs. Some good guidelines include:

Don't call SDL video/event functions from separate threads

Don't use any library functions in separate threads

...


The SDL FAQ has some more specific advice:



Q:
Can I call SDL video functions from multiple threads?

A:
No, most graphics back ends are not thread-safe, so you should only call SDL video functions from the main thread of your application.


Didn't you mention that you call your rendering stuff from one thread and leave the transforms etc. to the main thread? Maybe you should try rendering from the main thread.

mariuss
01-21-2004, 07:36 AM
Do you have backface culling enabled?
What about the depth test?
When you do glClear, do you clear
GL_DEPTH_BUFFER_BIT
or
(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT)?
Try just GL_DEPTH_BUFFER_BIT.

Anyway, you won't see all 550*300 polygons at once, because some of them occlude each other and some of them may be off screen.
So...
you have to reject all unseen polygons before submitting them to the video card if they are out of the camera's field of view. You can start with the objects: bounding-box them and do a fast check whether each object is outside the viewing frustum (see the sketch below).
Alternatively, if the scene is static, I would build a BSP tree out of it. This allows you to do back-to-front drawing and backface culling on the CPU (disabling the depth test and backface culling on the video card) and to cull unseen items against the frustum pretty fast.
Doing this on the processor side will improve your frame rate considerably.
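A rough sketch of the per-object rejection test (I use a bounding sphere here instead of a box just to keep it short; extracting the six planes from the projection*modelview matrix is assumed to happen elsewhere):

/* Returns 0 if the bounding sphere is completely outside the frustum,
   1 if it is inside or intersecting (i.e. the object must be drawn).
   Planes are assumed normalized and pointing into the frustum.       */
typedef struct { float nx, ny, nz, d; } Plane;

int sphere_in_frustum(const Plane planes[6],
                      float cx, float cy, float cz, float radius)
{
    int i;
    for (i = 0; i < 6; ++i) {
        float dist = planes[i].nx * cx + planes[i].ny * cy +
                     planes[i].nz * cz + planes[i].d;
        if (dist < -radius)
            return 0;   /* entirely behind this plane: cull the object */
    }
    return 1;
}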




neomind
01-21-2004, 07:56 AM
Yes, I am using SDL multithreading. I am, however, using the render thread for all OpenGL calls. So far I have not had any problems (except this one, if that is the cause).

But given the strange nature of the problem it seems wise to try without the threading. It could very well be the cause.

I had not heard of the mutex fairness problem. That in itself should not cause this, I think, but it might still be wise to use caution.

Thank you for your advice, hoshi!

neomind
01-21-2004, 08:27 AM
Back face culling is enabled.
Depth test is enabled.

The scene is dynamic so I am using view frustum culling and a kind of scene partitioning that allows me to cull multiple objects at once. Nothing fancy, but it works well. Anyways, the problem is that I cannot draw enough triangles (those that aren't culled).


maximian
01-21-2004, 08:56 AM
Did you try interleaving the data? I also found that it made a noticeable improvement if the data was byte-aligned, i.e. 32 bytes per vertex as opposed to 24.
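Something like this, for example (a sketch only; verts is a placeholder for your own vertex array, and with position, normal and a 2D texcoord in floats the interleaved vertex happens to be exactly 32 bytes, so no padding is even needed):

/* Interleaved vertex: 3 + 3 + 2 floats = 32 bytes per vertex. */
typedef struct {
    float pos[3];
    float normal[3];
    float uv[2];
} Vertex;

/* One array and one stride, instead of three separate arrays: */
glVertexPointer(3, GL_FLOAT, sizeof(Vertex), &verts[0].pos);
glNormalPointer(GL_FLOAT, sizeof(Vertex), &verts[0].normal);
glTexCoordPointer(2, GL_FLOAT, sizeof(Vertex), &verts[0].uv);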

I also found glDrawArrays worked faster than glDrawElements in certain cases.

EDIT:
Just remembered: I read an article once which found the GeForce 3 to be only about twice as fast as integrated solutions when not dealing with programmable code.


Cyranose
01-21-2004, 10:26 AM
Here's a quick speedup trick, not guaranteed for all apps but worth a try: move the SwapBuffers call to the start of the frame rather than the end. If needed, add your own MP-safe flag to indicate that all client-side rendering has finished, so you can start processing your next frame immediately (animation, physics, etc.).
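In loop form the idea is roughly this (a sketch; SDL_GL_SwapBuffers and the helper names are just placeholders based on the SDL setup you mentioned):

/* Swap at the top of the frame instead of the bottom: any blocking on
   the driver/GPU is concentrated here, and the CPU work for the next
   frame overlaps the GPU finishing the frame issued below.           */
while (running) {
    SDL_GL_SwapBuffers();   /* present the frame issued last iteration */
    render_scene();         /* issue this frame's GL calls             */
    update_simulation();    /* animation/physics for the next frame    */
}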

Also, are you windowed or full-screen? There's a difference in swap behavior, though I'm fairly sure you're right about being stuck in synchronization for some reason. This might mean you have plenty of rendering time left but SwapBuffers is blocking (hence the first suggestion). You might also test this with and without vsync turned on.

Avi

neomind
01-21-2004, 12:15 PM
Ok, I've tried the engine without threading. Made no difference at all.

Haven't had the time to change the data format. It is quite deeply embedded in the engine, so it will take some time to change.

The program is running full screen 800x600 in 16 bit color depth.

Korval
01-21-2004, 01:13 PM
Neomind, you seem to have missed a critical piece of information, so I'll quote what Ysaneya said:


In summary, you are drawing 300 * 550 = 165 000 triangles per frame @ 40 fps = 6.6 MTris/sec on a GF3 only using standard vertex arrays. The performance seems to be normal in that case.

To reiterate, your performance is pretty reasonable, even at 20fps, for a GeForce 3 without VBO or VAR.

Now, to answer your "where do I lose performance?": in the drivers.

The GPU can't directly access system memory unless it is AGP memory. So when you set up your vertex arrays and call glDraw* to render them, the driver has to copy those vertices out to memory that the GPU can read directly. It must also do two other things:

1: Make sure that the vertex data is in a format the GPU can read. So, if the GPU can't handle unsigned shorts, and you have positions as unsigned shorts, it must convert them into floats during the copy operation.

2: Make sure that the number of indices is smaller than the hardware-defined limits on number of indices that can be drawn at once.

You probably aren't hitting #2. But you may be hitting #1 if you're using a vertex format that is not supported by the hardware. Stick with floats if you want to guarantee support.

Note that the driver must do these copy operations each time you render an instance, since it can't guarantee that you haven't changed the vertices since the last call, even if you haven't called a gl*Pointer since then.

In short, the driver has to do a lot of copying. A GeForce 3 using VAR or VBO can get much better performance, thanks largely to the copy operations that no longer have to happen.
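For reference, the upload-once VBO pattern looks roughly like this (a sketch, not drop-in code: the mesh fields and counts are placeholders, and on drivers of this era you may need the ARB-suffixed entry points, glGenBuffersARB and so on, fetched through the extension mechanism):

/* Upload the geometry once; subsequent draws reference GPU-side
   buffers, so the driver no longer copies the arrays on every call. */
const GLsizei stride = 8 * sizeof(float);   /* pos(3)+normal(3)+uv(2) */
GLuint vbo, ibo;

glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, vertexCount * stride,
             mesh->vertices, GL_STATIC_DRAW);

glGenBuffers(1, &ibo);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
glBufferData(GL_ELEMENT_ARRAY_BUFFER, indexCount * sizeof(GLuint),
             mesh->indices, GL_STATIC_DRAW);

/* Per draw: the pointer arguments become byte offsets into the buffers. */
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
glVertexPointer(3, GL_FLOAT, stride, (const GLvoid*)0);
glNormalPointer(GL_FLOAT, stride, (const GLvoid*)(3 * sizeof(float)));
glTexCoordPointer(2, GL_FLOAT, stride, (const GLvoid*)(6 * sizeof(float)));
glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, (const GLvoid*)0);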

Ysaneya
01-21-2004, 02:39 PM
I don't think it's that simple. He mentioned his vertex format (position, normal and tex coords in floats), which is very standard. Also, if it were that kind of problem, decreasing the polycount should affect the framerate, which he says it doesn't.

Y.

jwatte
01-21-2004, 03:56 PM
Note that when vendors say "twenty-hojillion triangles per second," you can only achieve that if you terminate every other thread on your computer, and bake all of those triangles in a single display list, and use a single modelview matrix and texture for all of them, and it takes an entire second to render that frame.

If you have a regular application that draws many (say, hundreds of) objects per frame, with many (say, hundreds of) material, light, texture and modelview state changes, then your performance in triangles per second will be nowhere near the rated peak throughput. That's just a fact; learn to live with it.

Also, the first thing I was thinking was "fillrate bound, probably?" -- have you tried with a smaller window? (Say, 1/8 the size)

Last, if you're bound on just waiting for the pipeline to flush in swapbuffers, you may be able to add a lot more CPU processing on your side with no drop in frame rate on HT&L cards -- i.e., your CPU usage could go from 1% to 10%, the wait usage would go from 99% to 90%, and you'd still have the same frame rate. On non-HT&L cards (Intel Extreme, for example) that's less often the case, unless you're extremely fill-bound and your CPU work doesn't stall on memory much.

neomind
01-21-2004, 08:49 PM
Thank you all for your advice on this subject.

To further mystify this problem, I have now tried (I should have done it long ago) using much more complex models (10,000 triangles and 30,000 vertices). Suddenly I am getting the performance I was expecting.

Displaying the same number of objects, I am now getting around 50M processed vertices each second. Still using nothing more than triangles in display lists. The vertex format remains the same.

Korval
01-21-2004, 08:57 PM
Displaying the same number of objects, I am now getting around 50M processed vertices each second. Still using nothing more than triangles in display lists.

What this is saying is that the number of vertices is clearly irrelevant; it's the number of instances. Which leads to the following questions:

1: How much state are you setting up per instance?
2: How many different textures are you using?
3: Do you still get this performance if you don't use display lists?

neomind
01-21-2004, 10:29 PM
To track down this error I have lately been using a stripped-down version of the engine.

1. Lights are off and remain so. No materials are set. The only thing I do for each instance is set the modelview matrix, either by multiplying in a matrix or by calling glTranslate (I tried both). No fog. And as far as I can tell, there is no other state change in there either.

2. No textures in the stripped-down version. I am sending texture coordinates but I am not using them.

3. I am not able to test that at the moment, but I doubt it. I do not think I could reach 50M vertices/sec using plain vertex arrays: 50M * (3+3+2) * 4 bytes per second is roughly 1.5 Gbytes/second!

dorbie
01-21-2004, 10:45 PM
AGP 8X can theoretically push 2.1 gigabytes per second.

neomind
01-21-2004, 10:51 PM
The GeForce 3 I am using is on AGP 4X.




Ysaneya
01-22-2004, 12:17 AM
Well, if you have switched to display lists, make sure all of your scene is visible to the camera. The NVidia drivers perform frustum culling on display lists, so it can mess up your benchmarking figures.

If I follow you, you're now drawing 3 million triangles per frame? What framerate do you get?

Y.

neomind
01-22-2004, 12:38 AM
With 300 of the 10,000-triangle objects I am getting a framerate of about 5-6 fps. I frustum cull the objects before sending them to rendering, so it should not mess with my benchmark figures. Also, most of the scene is visible at all times.

Adrian
01-22-2004, 12:47 AM
Originally posted by neomind:
I am building the display lists like this:
*glGenLists
*activate textures and array pointers
*glNewList
*glDrawElements
*glEndList

You are only generating the display lists once, before the main render loop, right?

What happens to performance if you comment out any changes to the modelview matrix?

Personally, I think you should just use VAR/VBO. There's no point worrying about performance unless you are using the optimal method. If you still have a performance problem then, it's worth investigating. I haven't found display lists that fast for rendering small amounts of geometry many times.

neomind
01-22-2004, 01:05 AM
You are only generating the display lists once, before the main render loop, right?
Yes.


What happens to performance if you comment out any changes to the modelview matrix?
I think I have tried this with no change, but I will check it again later.


Personally I think you should just use VAR/VBO. No point worrying about performance unless you are using the optimum method.
The reason I want vertex arrays (and display lists) is that I want to be able to run the game even on systems with few extensions, for many different reasons. I will use VAR or VBO in the final version and keep vertex arrays as the fallback. And as it should also run on MacOS X and Linux, that restricts my choices even more (no wgl).



Joel
01-22-2004, 01:08 AM
It reminds me of a file I found on the nVidia web site, called BatchBatchBatch.pdf, by Matthias Wloka for a GDC presentation.
In a nutshell, what matters most is the number of batches you send, not the number of triangles in each one.
According to its numbers you need to send more than 130-200 triangles per batch to avoid being CPU limited. As you send 550 you are above this limit, but the rise in performance when you increase the number of triangles per object seems logical. 300 batches with a low triangle count may simply be a lot for a GeForce 3?

I don't have the link but it should be easy to find on their site.

[edit] Reading it again, it gives numbers equivalent to those you get. Please post the link to this file if you find it, for future readers.


neomind
01-22-2004, 01:19 AM
http://developer.nvidia.com/docs/IO/8230/BatchBatchBatch.pdf

neomind
01-22-2004, 02:56 AM
That article was enlightening. It seems to explain most of the performance problems I have (or at least their general characteristics). It was certainly something I didn't know; I had always thought that the cost of a batch was quite small.

Thank you for pointing this out. I wonder if the cost of a batch is smaller with VARs or VBOs?

Joel
01-22-2004, 04:17 AM
Originally posted by neomind:
I wonder if the cost of a batch is smaller with VARs or VBOs?


As many have said, it should be, as it is easier for the driver to fill the pipe (no copy, or a faster one...). But as I didn't do any tests I can't really say. The one in the best position to tell us what the gain would be (apart from the nVidia guys) is probably you, if you benchmark it ;)

orbano
01-22-2004, 10:15 AM
I don't know if VBO will help in solving the glDrawElements call overhead. I'm using VBOs, but calling it becomes a bottleneck above a few hundred models. I can only achieve about 6 Mtriangles/sec with 1000 objects of <100 faces each, but reach 10-14 Mtri/sec with 3 models each having a million (!!!) faces (using a Radeon 8500LE with separate arrays and 24-byte unaligned vertex data).

Korval
01-22-2004, 10:42 AM
That article was enlightening.

I thought so, too. Until Cass told me this applies to Direct3D only.

neomind
01-22-2004, 11:23 AM
Oh well. I guess I can't do much more than implement VBOs and see if there is any difference. I'll benchmark using vertex arrays, display lists and VBOs and post the results to the forum.
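For the measurement itself I'll probably do something simple like this (a sketch; SDL_GetTicks is assumed since the engine already uses SDL, and render_frame/trisPerFrame are placeholders):

/* Needs <stdio.h> and SDL.h. Runs one code path (vertex arrays,
   display lists or VBOs) for ten seconds and reports the averages. */
Uint32 start = SDL_GetTicks();
unsigned long frames = 0;
double triangles = 0.0;

while (SDL_GetTicks() - start < 10000) {
    render_frame();              /* the path being benchmarked */
    frames    += 1;
    triangles += trisPerFrame;
}

double seconds = (SDL_GetTicks() - start) / 1000.0;
printf("%.1f fps, %.2f Mtris/sec\n",
       frames / seconds, triangles / (seconds * 1.0e6));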