anybody using only GL_TRIANGLES?

In my engine I render meshes either as triangles (I cull faces before I send them to OpenGL) or as strips + tris.
Today I forced everything to render as triangles, i.e. no strips (like Carmack says Q3A does). I was surprised by the resulting drop in performance, 30-40%. So why does Q3A just use triangles and no strips? I can see there are a few benefits, e.g. simplicity and fewer calls to glDrawElements(), but the performance drop is unacceptable. Is anyone else out there using only triangles, and if so, why?

What do you mean by unacceptable?
Do you mean the frame rate goes down to something like 5-10 fps?

Or does it drop from 100 to 70?

Are you presorting the triangles? JC also said that triangles which would form a strip are stored one directly after the other.

Mr Carmack is telling us to use triangles instead of triangle strips,
while all the 3D accelerator manufacturers are telling us to use strips?

Sounds strange. Any link?

[This message has been edited by Ingenu (edited 03-12-2001).]

A 30-40% drop in performance, which is a lot.
The tris are pretty much sorted (not perfectly), but I'm sending them practically in strip order. It does sort of make sense though.

E.g. take a strip of 10 triangles.
Thus for a tri strip I'm sending 12 vertices/normals/texcoords*4,
whereas with triangles I'm sending 10 * 3 = 30 verts, which with 45% culled is about 16 verts (the longer the strip, the bigger the difference).

Granted, for triangles I'm using far fewer glDrawElements() calls, though I believe the extra effort to build the triangle list is offset by this.
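
The vertex-count arithmetic in the post above can be checked with a few lines of C. This is only a sketch: the function names are made up for illustration, and the culled fraction is treated as an average, so the result is an estimate rather than an exact multiple of 3.

```c
#include <stddef.h>

/* Vertices submitted for one strip of n_tris triangles: n_tris + 2. */
size_t strip_vertex_count(size_t n_tris)
{
    return n_tris == 0 ? 0 : n_tris + 2;
}

/* Estimated vertices submitted for n_tris independent triangles when
   culled_percent (0..100) of them are removed on the CPU first.
   Integer division truncates, so 16.5 is reported as 16. */
size_t list_vertex_estimate(size_t n_tris, unsigned culled_percent)
{
    if (culled_percent > 100)
        culled_percent = 100;
    return n_tris * 3 * (100 - culled_percent) / 100;
}
```

For 10 triangles and 45% culled this gives 12 vertices for the strip against about 16 for the list; for 100 triangles it gives 102 against 165, which is the "longer strip, bigger difference" effect.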

If you use glDrawElements(), and your vertices are used more or less in triangle strip order, then most vertices will be reused three times in a very short time span. Let me try some ASCII art:

2--4--6
|\ |\ |
| \| \|
1--3--5

You might render this with an index array like this: (1, 2, 3; 2, 4, 3; 3, 4, 5; 4, 6, 5). If you reuse vertices this quickly, they will still be in your 3D card’s vertex cache, so they only need to be transferred and T&L’ed once. That way, it shouldn’t be much slower than actually using a strip. The downside is that it probably will be a lot slower on a non-T&L card.
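
To see why the index order above is cache friendly, here is a minimal sketch that counts how many indices would miss a small FIFO cache. The FIFO replacement policy and the function name are assumptions made for illustration; real hardware may behave differently.

```c
#include <stddef.h>

#define MAX_CACHE 64

/* Count how many entries of an index list miss a FIFO cache holding
   cache_size vertices. Each miss stands for one vertex that has to be
   transferred and transformed again. Unsupported cache sizes are
   reported as "every index misses". */
size_t count_cache_misses(const unsigned *indices, size_t n, size_t cache_size)
{
    unsigned cache[MAX_CACHE];
    size_t used = 0, next = 0, misses = 0;

    if (cache_size == 0 || cache_size > MAX_CACHE)
        return n;

    for (size_t i = 0; i < n; i++) {
        int hit = 0;
        for (size_t j = 0; j < used; j++) {
            if (cache[j] == indices[i]) {
                hit = 1;
                break;
            }
        }
        if (hit)
            continue;
        misses++;
        cache[next] = indices[i];        /* overwrite the oldest slot */
        next = (next + 1) % cache_size;
        if (used < cache_size)
            used++;
    }
    return misses;
}
```

Feeding it the 12 indices (1, 2, 3; 2, 4, 3; 3, 4, 5; 4, 6, 5) with a 16-entry cache gives 6 misses, one per unique vertex, which is the same amount of vertex work as the 6-vertex strip.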

  • Tom

[This message has been edited by Tom Nuydens (edited 03-12-2001).]

Yes, I think it's time for me to get a hardware T&L card (probably a GeForce2 MX).
Taken from the OpenGL performance FAQ ver2:

All NVIDIA GPUs have a 16 element post-T&L vertex cache (also called the “vertex file”),
though the effective size is closer to 10 elements when you consider pipelining.

So tristrips with non-T&L cards, and tris (strips if possible) with T&L cards. Got it.

But I imagine it wouldn’t hurt a T&L card to use strips either. Because, unless I misunderstand here, you need to half-way stripify your triangles anyway even on a T&L card to take advantage of the vertex caching.
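
One way to get a triangle list that is already in strip order is to stripify first and then expand each strip back into discrete triangles. A rough sketch of the expansion step follows; the function name is hypothetical, and degenerate triangles (as used to stitch strips together) are not filtered out here.

```c
#include <stddef.h>

/* Expand a triangle strip of n_strip indices into an independent
   triangle list. out needs room for 3 * (n_strip - 2) indices.
   Every second triangle has its first two indices swapped so that all
   triangles keep the same winding. Returns the number of indices written. */
size_t strip_to_triangles(const unsigned *strip, size_t n_strip, unsigned *out)
{
    size_t w = 0;

    if (n_strip < 3)
        return 0;

    for (size_t i = 0; i + 2 < n_strip; i++) {
        if (i % 2 == 0) {
            out[w++] = strip[i];
            out[w++] = strip[i + 1];
        } else {
            out[w++] = strip[i + 1];
            out[w++] = strip[i];
        }
        out[w++] = strip[i + 2];
    }
    return w;
}
```

The strip (1, 2, 3, 4, 5, 6) expands to (1, 2, 3; 3, 2, 4; 3, 4, 5; 5, 4, 6), which is the list Tom gave earlier up to rotating the indices inside each triangle.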

BTW, how are you culling the non-visible faces? Are you actually transforming each face of the mesh and doing a dot product to cull it? And then transferring the info of the visible faces into a vertex buffer? Is this really worth it? Is GL_CULL_FACE not enough? It just seems to me that if you had a lot of meshes to render, that'd be A LOT of work to do just to remove some faces.

Also, I saw someone say earlier that nVidia had some kind of utility that would “stripify” your meshes? Anybody got a link? Thanks!

True, that's why I say strips if possible. All triangles were in world space, thus doing the culling before sending them to the card is pretty cheap. You do remove more than some faces doing it, typically 35%-45%.
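
For reference, CPU-side backface culling on world-space triangles can look roughly like the sketch below. This is not zed's actual code: the types, function names and the counter-clockwise front-face convention are all assumptions made for the example.

```c
#include <stddef.h>

typedef struct { float x, y, z; } Vec3;

static Vec3 v_sub(Vec3 a, Vec3 b)
{
    Vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z };
    return r;
}

static Vec3 v_cross(Vec3 a, Vec3 b)
{
    Vec3 r = { a.y * b.z - a.z * b.y,
               a.z * b.x - a.x * b.z,
               a.x * b.y - a.y * b.x };
    return r;
}

static float v_dot(Vec3 a, Vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Returns 1 if the counter-clockwise triangle (a, b, c) faces the eye. */
int is_front_facing(Vec3 a, Vec3 b, Vec3 c, Vec3 eye)
{
    Vec3 normal = v_cross(v_sub(b, a), v_sub(c, a));
    return v_dot(normal, v_sub(eye, a)) > 0.0f;
}

/* Copy only the front-facing triangles from idx into out.
   out needs room for n_idx indices. Returns the number of indices kept. */
size_t cull_backfaces(const Vec3 *verts, const unsigned *idx, size_t n_idx,
                      Vec3 eye, unsigned *out)
{
    size_t w = 0;

    for (size_t i = 0; i + 2 < n_idx; i += 3) {
        if (is_front_facing(verts[idx[i]], verts[idx[i + 1]],
                            verts[idx[i + 2]], eye)) {
            out[w++] = idx[i];
            out[w++] = idx[i + 1];
            out[w++] = idx[i + 2];
        }
    }
    return w;
}
```

Because the vertices are already in world space, the per-triangle cost is one cross product and one dot product (or just the dot product if face normals are stored), with no transform needed.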

(nice friendly link to the nvidia tristrip program) http://www.nvidia.com/Marketing/Developer/DevRel.nsf/pages/0B738E907A7DC3A28825693F0065F772

Thanks, zed! BTW, how much of a performance increase can you expect to see doing your own backface culling? Is it very significant? Was the percentage you gave the percentage of faces removed or the performance increase? Thanks!

I suggest:

Don't try to increase your rendering speed by applying techniques which work on a per-triangle basis.
But if you really want to use culling, try grouping the polys that share the same plane (planar surfaces in Quake) and cull them all at once.
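
A minimal sketch of the grouping idea, assuming each planar surface stores its plane equation and a contiguous range in a shared index array (the struct layout and names are hypothetical): one plane test accepts or rejects the whole group.

```c
#include <stddef.h>

typedef struct {
    float nx, ny, nz;    /* plane normal, pointing out of the front side */
    float d;             /* plane equation: n . p = d */
    size_t first_index;  /* start of this surface in the shared index array */
    size_t index_count;  /* number of indices belonging to this surface */
} PlanarSurface;

/* One test for the whole surface: is the eye in front of the plane? */
int surface_faces_eye(const PlanarSurface *s, float ex, float ey, float ez)
{
    return s->nx * ex + s->ny * ey + s->nz * ez - s->d > 0.0f;
}

/* Append the indices of every front-facing surface to out.
   out needs room for all indices. Returns the number of indices written. */
size_t collect_visible(const PlanarSurface *surfaces, size_t n_surfaces,
                       const unsigned *idx, float ex, float ey, float ez,
                       unsigned *out)
{
    size_t w = 0;

    for (size_t s = 0; s < n_surfaces; s++) {
        if (!surface_faces_eye(&surfaces[s], ex, ey, ez))
            continue;
        for (size_t i = 0; i < surfaces[s].index_count; i++)
            out[w++] = idx[surfaces[s].first_index + i];
    }
    return w;
}
```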


Cem UZUNLAR turing@anet.net.tr

Of course all this is bollux if you're fill bound, but anyway.
All my testing has been done with real game data.
>>Don't try to increase your rendering speed by applying techniques which work on a per-triangle basis.
But if you really want to use culling, try grouping the polys that share the same plane (planar surfaces in Quake) and cull them all at once.<<
Of course I'm doing this already; it's only the stuff that makes it through the scenegraph that gets tested for backface culling. I'm not too sure how much quicker that is (doing your own backface culling), I think 10% or something.

I'm in the process of making an artificial test program, which I'll post a link to when I'm done (should be about an hour or 2).

OK, first version finished (no backface culling yet, I'll save that for tonight): http://members.nbci.com/myBollux/benchmarkSRC.zip

The data probably isn't arranged perfectly for GPUs; I believe they want data in an order like this:
| |-| |- etc
|-| |-|

my results (whilst on the internet)

ARRAYS_STRIPS 0.973000 seconds 2104830.314100 tris per second
ARRAYS_STRIPS 0.946000 seconds 2164904.772007 tris per second
ARRAYS_TRIANGLES
ARRAYS_TRIANGLES 1.246000 seconds 1643659.643141 tris per second
ARRAYS_TRIANGLES 1.222000 seconds 1675941.076273 tris per second
INTERMEDIATE_TRI_STRIPS
INTERMEDIATE_STRIPS 1.034000 seconds 1980657.565334 tris per second
INTERMEDIATE_STRIPS 1.005000 seconds 2037810.954942 tris per second
INTERMEDIATE_TRIANGLES
INTERMEDIATE_TRIANGLES 2.657000 seconds 770794.109904 tris per second
INTERMEDIATE_TRIANGLES 2.665000 seconds 768480.242437 tris per second
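
To put the numbers above on the same footing as the "30-40%" figure from the first post, the relative throughput drop can be worked out as below. The helper name is made up for this example; the inputs are simply the tris-per-second values from the first run of each mode.

```c
/* Percentage drop in throughput when going from `fast` to `slow`
   (both in triangles per second). */
double percent_drop(double fast, double slow)
{
    return 100.0 * (fast - slow) / fast;
}
```

With these figures, percent_drop(2104830.3141, 1643659.643141) comes out at roughly 22% for ARRAYS_STRIPS versus ARRAYS_TRIANGLES, and percent_drop(1980657.565334, 770794.109904) at roughly 61% for INTERMEDIATE_STRIPS versus INTERMEDIATE_TRIANGLES.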

I understood Carmack’s post to say that Q3 only uses discrete triangles where the compiled vertex array extension is available; otherwise it finds and uses strips.

If EXT_compiled_vertex_array is not present, we set up the same vertex arrays, but we do strip finding ourselves and issue glBegin() / glArrayElement() / … / glEnd(). This is faster than the discrete triangle path for most drivers that don’t have compiled vertex arrays (because they don’t retransform every vertex), but results in a lot more API overhead and limits batch processing. You can change between this behavior and the single draw elements call with the variable “r_drawstrips 0/1”. The optimal path is to have compiled vertex arrays and take it as one big glDrawElements call.

True, GL_TRIANGLES with CVAs is quick using the Q3 format. (I have the 10.80 NVIDIA drivers, which I thought were going to accelerate more data formats with CVAs than just the Q3 one. They do, but it seems one method, i.e. Q3's, gets accelerated better than the others.)

>> one big drawelements call<<
It can't be too big (more than 64k) or else performance will die completely.
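
If a batch size limit like this applies, one option is to split a large index list into several draw calls. The sketch below is an illustration only: the Batch type and function name are hypothetical, and the 64k figure is taken from the post as-is (it is not clear whether it counts indices or vertices, so the limit is a parameter).

```c
#include <stddef.h>

typedef struct {
    size_t first;  /* offset of the first index in this batch */
    size_t count;  /* number of indices in this batch */
} Batch;

/* Split n_indices triangle-list indices into batches of at most
   max_indices each, rounded down to a multiple of 3 so no triangle is
   cut in half. Writes at most max_batches entries to out and returns
   how many were written. */
size_t split_batches(size_t n_indices, size_t max_indices,
                     Batch *out, size_t max_batches)
{
    size_t per_batch = max_indices - max_indices % 3;
    size_t first = 0, n = 0;

    if (per_batch == 0)
        return 0;

    while (first < n_indices && n < max_batches) {
        size_t count = n_indices - first;
        if (count > per_batch)
            count = per_batch;
        out[n].first = first;
        out[n].count = count;
        n++;
        first += count;
    }
    return n;
}
```

Each batch would then go to its own call, something like glDrawElements(GL_TRIANGLES, count, GL_UNSIGNED_INT, indices + first). With a 65536 limit, 150000 indices end up as batches of 65535, 65535 and 18930.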

No, the Q3 format is accelerated the same amount as any other format with CVAs on the latest drivers. You should be able to get good performance regardless of your vertex format.

  • Matt