
View Full Version : Octree speed slower on faster computer, odd??



soconne
01-08-2004, 07:07 AM
I've finished my own octree system, and currently it simply loads a Milkshape model of about 5,000 triangles. My problem is that I'm running it on two computers. One has

P4 2.8Ghz HT 800MHz FSB
1Gb DDR 400
Radeon 9800 Pro 128MB

and the other has

Athlon XP 1800
1Gb DDR 333
Geforce 4 TI 4400 128MB

My problem is that if I move the camera completely outside the octree, so that basically no recursion occurs since the entire octree is within the frustum, the frame rate on my Radeon 9800 is about 600 fps, and on the GeForce it's about 350-400 fps.

But when I move inside the octree, especially down into deep nodes, you would think the frame rate would increase since a lot of geometry is being culled. On the GeForce it does, but on the Radeon it actually decreases.

For instance, I set the camera in the exact same position on both computers, making sure I was within a node deep in the octree, and the GeForce got 600 fps while my Radeon got 400 fps. This is very weird, and I'm pretty sure it shouldn't be happening.

I'm drawing each triangle with multiple calls to glVertex3f, because I'm not worried about speed at the moment. But I'm pretty sure the Radeon and GeForce process those calls at the same rate, am I wrong?

I thought it was a memory problem, since so much recursion is going on, but I ran memory benchmarks and the Radeon machine is faster; I've also run 3DMark2001 and 2003 on both.

Has anybody else run into a problem similar to this, where the frame rate drops because of recursion on one computer but not on others?

ml
01-08-2004, 08:08 AM
I've only used an octree to narrow down triangles to check for collisions, but when do you stop dividing nodes?
How many nodes are you drawing when you have your camera set to the position you tested on both computers?
How do you check whether or not a node is in the frustum?
How many times do you call glBegin(GL_TRIANGLES) for each node?

Also, 5000 triangles is a pretty low figure to benchmark with.
Render the bounding box of each node to see if everything looks like it should.


jwatte
01-08-2004, 10:41 AM
If you're running at more than 100 fps, you're not giving the cards and systems enough work to really show any difference. Add 10x the amount of geometry and try again.

Also, it's unclear what is limiting your frame rate. If your octree is really FP-math heavy, the Athlon may beat the P4. Use a profiler to find out what the problem is -- asking a question on a forum, with no measurements or code to back it up, isn't going to do it :-)

JanHH
01-08-2004, 07:59 PM
a) When the whole tree is inside the viewing frustum, the maximum amount of recursion happens; every node is visited (or I got something wrong).

b) What you are measuring is whether a node is faster to cull or to draw. It seems that on one system culling is faster, on the other drawing.

In my experience, on modern hardware drawing something can in fact be faster than checking whether it's visible, and walking the tree itself costs time (stack pushes and pops, I guess). Simply checking every node for visibility turned out to be faster for me than walking the tree.

Jan


soconne
01-08-2004, 08:05 PM
Thanks for the help guys. I've decided the best way to go is to just draw everything using either display lists or vertex arrays, and then use octrees for collision detection, since my world will only consist of fewer than 15,000 triangles anyway. I've found it faster to just do this instead of trying to cull.

EG
01-08-2004, 09:57 PM
In addition to the FP issue on the P4, be aware that the P4 suffers a lot more from branch misprediction than the Athlon.

If you have lots of "if"s whose conditions don't evaluate to an overwhelming majority of either true or false, that can cripple the P4 pretty badly due to its longer pipeline... and that isn't an uncommon situation with octrees and other space-partitioning schemes. So you may actually benefit more from a shallow octree than from a deep, very discriminating one.

JustHanging
01-09-2004, 12:50 AM
JanHH: If the entire terrain is visible, you don't have to visit each node. Just organize your culling as follows:

Starting from root node, if the node isn't in frustum, stop. If it's completely in frustum, draw everything under it. If it's partially in frustum, recurse to child nodes.

This'll eliminate a lot of unnecessary culling and recursion.

-Ilkka

AdrianD
01-09-2004, 02:07 AM
oconnellseanm:

I recently had the same problem with my octree implementation. On my ATI 9500 Pro I got a slowdown with an octree compared to simple spit-it-out mesh rendering. On GF-based systems (GF4/GF2 Go), however, I got the expected speedup. What was going wrong?
My test scene had 17K triangles, and I discovered that on the ATI it is much faster just to display the whole mesh with a single DrawElements call than to clip & display 2000 triangles in 50-100 small batches (or so). But that way it's terribly slow on GF-based systems (not the GeForce FX, I haven't tested it on those systems yet).
My first solution was simple and well known: a PVS (for every node, a list of flags for all nodes possibly visible from within that node).
This way you can create a static mesh (actually you are only creating a new index list) which contains only the possibly visible nodes, and you can make sure that as long as your camera is inside a node, you don't need to do any computations on the CPU - nice. This makes it faster on ATIs, but it's still slower on NVIDIAs compared to the octree/frustum-culling approach
(because there are still far too many unseen-but-submitted triangles).
So I added simple frustum clipping (which is only applied to the possibly visible nodes).
This gave me the needed speedup on my NVIDIA systems, but it also slowed down my ATI implementation again (because this way I must recreate the index list every time the camera rotation changes - and that usually happens very often in an animation...).

Finally, I decided to mix both methods:
every frame, I create a "current visible set" based on the PVS and the clipping results, and recreate the index list only when this set has changed compared to the last frame.
If I don't make my nodes too small, this approach works very well, and I get more speedup than I expected - on all systems.
Btw., my average node size is 250-500 tris, and my maximum tree depth is 5; anything smaller creates too much overhead. I tested it with scenes from 15K to 250K tris.

And of course, all these optimizations are kind of pointless if you don't store the geometry & indices on the hardware using VAR/VAO/VBO. ;)

Azazel
01-12-2004, 05:10 AM
LOL! Did ya know that the Athlon is seriously faster at floating-point operations than Intel's products at the same frequency? Also, at such frame rates you shouldn't be surprised by variations of less than 50%.