
View Full Version : typical polygon throughput on accelerated hw



PT_Barn
07-12-2007, 09:31 PM
I'm using a 3rd-party API that sits on top of Linux OpenGL 1.2+, running on an ATI X600 card. I'm currently getting less than 1M polys per second (around 900,000), which seems low. Polys are sent as tri strips. The ATI X600 driver is OpenGL 2.0. I would expect 5M polys/second. Is this reasonable? I'm going to poke around in the API's source code to see how the polys are being sent over. I believe it's via display lists. Thanks for any info.

-PT

dorbie
07-12-2007, 10:19 PM
Are you sending strips using DrawElements with cache-coherent index reuse of verts? How complex is the T&L state or vertex program? Are you sure you're T&L limited and not pixel-fill limited? What's the overhead for the app, etc.? Are you drawing enough stuff to amortize overheads like screen clear? Are you avoiding other stalls & flushes like glFinish?

There's VERY little info in your post and you really need to look at all of this stuff to even get close to an answer to your question.

PT_Barn
07-13-2007, 08:17 AM
Originally posted by dorbie:
Are you sending strips using DrawElements with cache coherent index reuse of verts? How complex is the T&L state or vertex program? Are you sure you're T&L limited and not pixel fill? What's the overhead for app etc. Are you drawing enough stuff to amortize overheads like screen clear. Are you avoiding other stalls & flushes like glFinish?

There's VERY little info in your post and you really need to look at all of this stuff to even get close to an answer to your question.

Avoiding stalls and flushes, yes. I turned off textures and it didn't improve performance much, maybe 1-2 fps. Can you describe cache coherent index reuse of verts? I'm about to look through the source code now, I'll post more details later today. thanks.

-PT

dorbie
07-13-2007, 08:13 PM
The T&L results are cached on most desktop cards.

If your strip reuses the same indices it will save a lot on T&L. This means a good DrawElements strip will issue degenerates to make sure it is always looping back on itself to better exploit vertex cache coherency.
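To put a number on the index reuse point, here is a toy model of a post-transform vertex cache. It is only a sketch: the FIFO policy and the default size of 16 entries are assumptions for illustration (real cards differ), and count_vertex_transforms is a made-up helper name, not part of any GL API.

```python
from collections import deque


def count_vertex_transforms(indices, cache_size=16):
    """Count cache misses, i.e. vertices that would have to go through T&L.

    Assumes a simple FIFO cache of `cache_size` entries (an assumption,
    not a description of any particular card).
    """
    cache = deque(maxlen=cache_size)  # oldest entry is dropped when full
    misses = 0
    for idx in indices:
        if idx not in cache:
            misses += 1        # vertex has to be transformed
            cache.append(idx)  # and its result is cached
    return misses


# Two triangles sharing an edge, sent with shared indices...
shared = [0, 1, 2, 2, 1, 3]
# ...and the same two triangles with the shared verts duplicated.
unshared = [0, 1, 2, 3, 4, 5]

print(count_vertex_transforms(shared))    # 4 transforms
print(count_vertex_transforms(unshared))  # 6 transforms
```

The shared version transforms 4 verts instead of 6. On a real mesh the gap grows, since interior verts are typically shared by several triangles, but only if the reuse happens while the vert is still in the cache.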

A popular way to generate these strips is to use nvtristrip.

http://developer.nvidia.com/object/nvtristrip_library.html
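For reference, the degenerate trick mentioned above can be sketched without any library. This is not nvtristrip's algorithm, only the joining step: two index strips are merged into one by repeating indices so the connecting triangles have zero area and get rejected. The helper names stitch_strips and strip_to_triangles are hypothetical.

```python
def stitch_strips(strips):
    """Join index strips into one strip using degenerate triangles."""
    out = []
    for strip in strips:
        if not strip:
            continue
        if out:
            out.append(out[-1])    # repeat last index of previous strip
            out.append(strip[0])   # repeat first index of next strip
            if len(out) % 2 == 1:  # keep the next strip on an even slot
                out.append(strip[0])  # so its winding order is preserved
        out.extend(strip)
    return out


def strip_to_triangles(strip):
    """Expand a strip into triangles, skipping degenerates."""
    tris = []
    for i in range(len(strip) - 2):
        a, b, c = strip[i], strip[i + 1], strip[i + 2]
        if a == b or b == c or a == c:
            continue  # zero-area triangle, nothing gets drawn
        # Odd triangles in a strip have their first two verts swapped.
        tris.append((a, b, c) if i % 2 == 0 else (b, a, c))
    return tris


joined = stitch_strips([[0, 1, 2, 3], [4, 5, 6, 7]])
print(joined)                      # [0, 1, 2, 3, 3, 4, 4, 5, 6, 7]
print(strip_to_triangles(joined))  # [(0, 1, 2), (2, 1, 3), (4, 5, 6), (6, 5, 7)]
```

The joined strip draws the same four triangles as the two separate strips, in one call. The extra indices cost almost nothing when they hit the vertex cache.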

You should also look into VBOs and GL_EXT_draw_range_elements (or GL_NV_vertex_array_range or GL_NV_vertex_array_range2)

A combination of all three should approach best performance.

Disabling texture is not enough; there can be other limits like framebuffer read-modify-write, and there's still overall scene complexity vs. fixed overheads (and also synchronized swap etc., which can bite you).

You MUST draw a lot of stuff in a small window if you want to measure this kind of throughput reliably.

Use nvtristrip
Draw meshable triangles (i.e. ensure you have connected, strippable tris *using shared indices* that were sent to nvtristrip)
Use VBO
Use draw_range_elements (or NV_vertex_array_range or NV_vertex_array_range2)
Use a small viewport
Draw LOTS of stuff
Ensure synch to vertical retrace is OFF
No glFinish calls anywhere
Time stuff over many frames
Amortize clear buffer and other overheads by drawing many frames without a clear.

Have tight drawing code that simply issues the draw elements calls, no extraneous bull****.
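The "time stuff over many frames" item from the checklist above could look roughly like this. It is a minimal sketch: draw_frame is a hypothetical stand-in for whatever issues your draw elements calls, and tris_per_frame is whatever your app knows it submitted. Nothing here talks to GL.

```python
import time


def measure_throughput(draw_frame, tris_per_frame, frames=1000,
                       clock=time.perf_counter):
    """Return triangles per second averaged over `frames` frames.

    draw_frame: hypothetical callable that draws one frame.
    clock: injectable timer, defaults to time.perf_counter.
    """
    start = clock()
    for _ in range(frames):
        draw_frame()  # tight loop: nothing but the draw calls
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time too small to measure; use more frames")
    return tris_per_frame * frames / elapsed


# Usage with a dummy draw function standing in for the real one.
rate = measure_throughput(lambda: sum(range(1000)), tris_per_frame=50000,
                          frames=200)
print(f"{rate:.0f} tris/sec")
```

Timing one big batch of frames rather than each frame keeps the timer overhead and per-frame jitter out of the number.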

Only then can you look at other overheads like backface culling on or off, attribute binding alignment and access patterns, and vertex shader or T&L state as an influence. This last section will be what your groundwork allows you to measure & compare to estimate the performance of a content path.