Drawing speed: GL_TRIANGLES vs. GL_POLYGON



devdept
11-27-2009, 07:00 AM
Hi All,

Is there any performance benefit in using GL_TRIANGLES or GL_QUADS instead of a generic GL_POLYGON with the appropriate number of vertices?

Thanks,

Alberto

Julien Gouesse
11-27-2009, 07:43 AM
Hi!

GL_POLYGON is extremely slow. Drawing with triangles is highly optimized on most modern graphics cards.

pjcozzi
11-27-2009, 08:26 AM
Julien is right; using triangles is the way to go. In particular, use indexed triangle lists to take advantage of the GPU's vertex caches.

One reason GPUs are optimized for triangles is that the three vertices of a triangle always lie in the same plane, which is not always the case for a polygon. Also, GL_POLYGON won't render non-convex polygons correctly, and it is removed in the core OpenGL 3.2 profile.

Regards,
Patrick
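
For a convex polygon, moving from GL_POLYGON to GL_TRIANGLES only needs an index list. Here is a minimal sketch in C, assuming the polygon is convex and its vertices are stored in order; the name fan_triangulate is made up for illustration and is not part of OpenGL:

```c
#include <assert.h>
#include <stddef.h>

/* Writes 3 * (n - 2) indices that split a convex polygon with vertices
   numbered 0 .. n-1 (in order) into a fan of triangles around vertex 0.
   Returns the number of indices written.
   Assumption: the polygon is convex; a fan is wrong for non-convex input. */
size_t fan_triangulate(unsigned int n, unsigned int *out_indices)
{
    size_t count = 0;
    assert(n >= 3);
    for (unsigned int i = 1; i + 1 < n; ++i) {
        out_indices[count++] = 0;      /* shared fan origin */
        out_indices[count++] = i;
        out_indices[count++] = i + 1;
    }
    return count;
}
```

The resulting indices can be drawn with glDrawElements using GL_TRIANGLES and GL_UNSIGNED_INT, so many polygons can share one vertex array and one draw call.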

devdept
11-27-2009, 02:20 PM
use indexed triangle lists to take advantage of the GPU's vertex caches
Are indexed triangle lists better than display lists even if the geometry does not change?

Thanks,

Alberto

pjcozzi
11-27-2009, 04:46 PM
Are indexed triangle lists better than display lists even if the geometry does not change?
I don't have a solid answer to this, so let me give you some background. I've read that NVIDIA's display list compiler is very good and sometimes even outperforms static VBOs. I've never tested this myself, so I cannot confirm it. However, display lists were removed from the core 3.2 profile, so I no longer use them.

When rendering a large amount of static geometry, I recommend using static VBOs with indexed triangle lists that are reordered for the GPU's vertex caches. There are several reordering algorithms; see Tom Forsyth's algorithm (http://home.comcast.net/~tom_forsyth/papers/fast_vert_cache_opt.html) and Fast Triangle Reordering for Vertex Locality and Reduced Overdraw (http://www.cs.princeton.edu/gfx/pubs/Sander_2007_%3ETR/tipsy.pdf). We have implemented the latter algorithm with good results.

I've also read that interleaved vertex attributes are faster than non-interleaved, and that unsigned short indices are faster than unsigned ints, but I haven't noticed much difference in either case.

Note that this is just for raw rendering horsepower. You can, of course, use culling, LOD, lay down z first, etc, and then render with optimized triangle lists in a static VBO.

Take care,
Patrick
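
To make the interleaving and index-size remarks concrete, here is a small sketch in C. The Vertex layout and the helper name narrow_indices are assumptions for illustration; only the GL calls named in the comments are real API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Interleaved layout: position and normal of one vertex sit side by side.
   With such a buffer bound, the fixed-function setup would be roughly
     glVertexPointer(3, GL_FLOAT, sizeof(Vertex), (void *)offsetof(Vertex, position));
     glNormalPointer(GL_FLOAT, sizeof(Vertex), (void *)offsetof(Vertex, normal));  */
typedef struct {
    float position[3];
    float normal[3];
} Vertex;

/* Copies 32-bit indices into 16-bit storage (GL_UNSIGNED_SHORT).
   Returns 1 on success, 0 if any index does not fit in 16 bits. */
int narrow_indices(const uint32_t *src, size_t count, uint16_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < count; ++i) {
        if (src[i] > UINT16_MAX)
            return 0;
        dst[i] = (uint16_t)src[i];
    }
    return 1;
}
```

Sixteen-bit indices halve the index buffer size but limit a single draw to 65536 distinct vertices, so larger meshes would need splitting or 32-bit indices.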

Dark Photon
11-27-2009, 07:41 PM
Are indexed triangle lists better than display lists even if the geometry does not change?
...I've read that NVIDIA's display list compiler is very good and sometimes even outperforms static VBOs. I've never tested myself so I cannot confirm this.
I have. This is definitely true. And it's not a tiny outperformance either.

Aleksandar
11-28-2009, 06:07 AM
I have. This is definitely true. And it's not a tiny outperformance either.
Exactly! It depends on drivers and hardware, but even with the latest NVIDIA drivers DLs can be up to two times faster than static VBOs. I have proven it with many test-applications.

But DLs have other drawbacks, and they are deprecated, unfortunately. :(
I hope that things will get better when driver developers have less work to do (after excluding all deprecated functionality) and can optimize VBOs a little better, so that they reach the speed of DLs.

devdept
11-28-2009, 06:16 AM
Aleksandar

So, practically, instead of doing:


glBegin(GL_TRIANGLES);
// first tri
glVertex3d();
glVertex3d();
glVertex3d();
// second tri
glVertex3d();
glVertex3d();
glVertex3d();
glEnd();
What shall we do to use static VBOs?

Thanks,

Alberto

Aleksandar
11-28-2009, 06:42 AM
I apologize for the next question, because it cannot be considered as a "beginners coding question", but I would like to avoid starting a new thread/topic, and it is related to performance issues...

Question: Can anyone point me to an official NVIDIA paper or an academic paper, preferably no more than a few years old, with an in-depth explanation of rendering strategy on real GPUs? Or, at least, some charts showing how FPS depends on polygon count.

Reason: I have discovered a non-linear dependency between polygon count and rendering speed. For example, I can raise the number of triangles four times and the frame rendering time rises by only about a third of its value (100K triangles take 7.14 ms and 400K triangles take 10.4 ms). All triangles are distributed over about 3K VBOs of different sizes. Of course, beyond some limit, for example more than 7M triangles, the frame rate drops dramatically.
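
One way to read the two timings above is as a fixed per-frame cost plus a cost per triangle. The sketch below fits that two-parameter model in C. The linear model itself is an assumption, not something measured in this thread, and two points cannot confirm it; FrameCostModel and fit_two_points are hypothetical names.

```c
#include <assert.h>

/* Assumed model: frame_ms = fixed_ms + per_triangle_ms * triangle_count */
typedef struct {
    double fixed_ms;        /* cost that does not depend on triangle count */
    double per_triangle_ms; /* marginal cost of one extra triangle */
} FrameCostModel;

/* Solves the model exactly from two (triangle_count, frame_ms) samples. */
FrameCostModel fit_two_points(double n1, double t1, double n2, double t2)
{
    FrameCostModel m;
    assert(n1 != n2);
    m.per_triangle_ms = (t2 - t1) / (n2 - n1);
    m.fixed_ms = t1 - m.per_triangle_ms * n1;
    return m;
}
```

Called as fit_two_points(100000, 7.14, 400000, 10.4), this gives about 6.05 ms of fixed cost and about 1.09e-5 ms per triangle. Under that assumption most of the frame time at 100K triangles would not be geometry at all, which is consistent with a bottleneck elsewhere, for example in the roughly 3K separate draws.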

Aleksandar
11-28-2009, 07:13 AM
Presuming you are using fixed functionality (and glVertex*() function calls mean exactly that...), I think the following link will help you:
http://www.opengl.org/wiki/VBO
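
As a rough illustration of the step from the glBegin/glEnd code above to an indexed static VBO, the C sketch below collapses a flat triangle list into unique vertices plus indices. The helper build_index_list and the Vec3 type are made up for this example, and the linear search is only meant to keep it short. The GL calls appear in a comment because they need a live context.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { double x, y, z; } Vec3;

/* Turns a flat triangle list (three vertices per triangle, in the order
   they would be passed to glVertex3d) into unique vertices plus indices.
   Both output arrays need room for soup_count entries.
   Returns the number of unique vertices. O(n^2), fine for a sketch. */
size_t build_index_list(const Vec3 *soup, size_t soup_count,
                        Vec3 *unique, unsigned int *indices)
{
    size_t unique_count = 0;
    assert(soup_count % 3 == 0);
    for (size_t i = 0; i < soup_count; ++i) {
        size_t j = 0;
        while (j < unique_count &&
               !(unique[j].x == soup[i].x &&
                 unique[j].y == soup[i].y &&
                 unique[j].z == soup[i].z))
            ++j;
        if (j == unique_count)
            unique[unique_count++] = soup[i];   /* first time we see it */
        indices[i] = (unsigned int)j;
    }
    return unique_count;
}

/* With a GL context, upload once and draw every frame, roughly:
     glGenBuffers(1, &vbo);
     glBindBuffer(GL_ARRAY_BUFFER, vbo);
     glBufferData(GL_ARRAY_BUFFER, unique_count * sizeof(Vec3), unique, GL_STATIC_DRAW);
     glGenBuffers(1, &ibo);
     glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
     glBufferData(GL_ELEMENT_ARRAY_BUFFER, soup_count * sizeof(unsigned int), indices, GL_STATIC_DRAW);
     glEnableClientState(GL_VERTEX_ARRAY);
     glVertexPointer(3, GL_DOUBLE, 0, (void *)0);
     glDrawElements(GL_TRIANGLES, (GLsizei)soup_count, GL_UNSIGNED_INT, (void *)0);
*/
```

Exact floating-point comparison is used to detect shared vertices here; real data with slightly different copies of the same corner would need a tolerance or a hash on quantized coordinates.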

devdept
11-28-2009, 07:23 AM
Yes, it is exactly what I was looking for.

Thanks,

Alberto

Ketracel White
11-29-2009, 09:56 AM
I hope that things will be better when driver-developers have less work to do (after excluding all deprecated functionality)


That will never happen because way too much software depends on the old stuff.

Anyway, VBOs have one big problem and that's requiring the programmer to do everything and preventing the driver from really optimizing the data (I ran into that issue with a program that despite all optimizations I did still runs faster in immediate mode.) With display lists the driver can do whatever it wants and organize the data any way it likes so if done well it will naturally be faster.

Aleksandar
11-29-2009, 10:42 AM
That will never happen because way too much software depends on the old stuff.

Who are you telling about deprecation problems? :(
I've got a lot of "old" code too. I tore my hair out when I spent many hours getting one old application back on its feet using only GL 3.2 Core functionality, only to realize that I had lost a lot of functionality and gained no speed boost.

I know that the old functionality will stay, and I'm glad of that. But I also hope that NVIDIA/AMD will issue something like "lite" drivers with only Core functionality, where performance would be higher. Then again, a proliferation of drivers would bring other problems. Who knows what the future will bring...

M/\dm/\n
11-29-2009, 10:42 AM
I apologize for the next question, because it cannot be considered as a "beginners coding question", but I would like to avoid starting a new thread/topic, and it is related to performance issues...

Question: Can anyone point me to an official NVIDIA paper or an academic paper, preferably no more than a few years old, with an in-depth explanation of rendering strategy on real GPUs? Or, at least, some charts showing how FPS depends on polygon count.

Reason: I have discovered a non-linear dependency between polygon count and rendering speed. For example, I can raise the number of triangles four times and the frame rendering time rises by only about a third of its value (100K triangles take 7.14 ms and 400K triangles take 10.4 ms). All triangles are distributed over about 3K VBOs of different sizes. Of course, beyond some limit, for example more than 7M triangles, the frame rate drops dramatically.

I guess you won't find such data, the reason being the multiple stages of the pipeline.

If the bottleneck is in a stage other than raw geometry processing, you can increase the geometry load without any problem at all; then you cross the critical point, geometry becomes the slowest link, and your program takes a nosedive.

The best strategy is to follow general "best practices" when you design the program, but to worry about the performance of a particular stage only when it is the culprit of the overall slowdown.

Actually, everyone recommends increasing the workload of the other stages, like pixel shading or texture sizes, to bring them on par.

But unified shaders open a whole new can of worms: the vertex stage will affect the fragment stage and so on.

http://developer.amd.com/media/gpu_assets/PerformanceTuning.pdf

M/\dm/\n
11-29-2009, 10:52 AM
Anyway, VBOs have one big problem and that's requiring the programmer to do everything and preventing the driver from really optimizing the data (I ran into that issue with a program that despite all optimizations I did still runs faster in immediate mode.) With display lists the driver can do whatever it wants and organize the data any way it likes so if done well it will naturally be faster.


Well, you only have to optimize if geometry transfer/processing is the bottleneck.

If it is, static_draw VBO with indexed draws and cache friendly indices will probably be as fast as display list.

I don't know how it works nowadays, but older GPUs used to reuse a transformed vertex from the cache if it was shared by multiple triangles, instead of recalculating it. So you have the bulk vertex data; say you want to draw triangles: you pass an index array and the driver starts to draw.

From index array (GL_TRIANGLES)
Use index 1
Use index 2
Use index 3
---- next tri ----
Use index 2 (taken from cache)
Use index 3 (taken from cache)
Use index 4
---- next tri ----
Use index 4 (taken from cache)
Use index 2 (taken from cache)
Use index 1 (taken from cache)

The cache was about 15 vertices long, so if you preprocess the data for reuse it can get really fast.

glDrawArrays doesn't have this luxury.
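
The walk-through above can be reproduced with a tiny simulator. This C sketch counts how many indices miss a fixed-size cache, assuming a plain FIFO replacement policy; real hardware may behave differently, and count_cache_misses and MAX_CACHE are names invented here.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CACHE 64   /* upper bound for this sketch only */

/* Returns how many entries of the index list miss a FIFO vertex cache
   holding cache_size entries. Fewer misses means fewer vertices that
   have to be transformed again. */
size_t count_cache_misses(const unsigned int *indices, size_t count,
                          size_t cache_size)
{
    unsigned int cache[MAX_CACHE];
    size_t filled = 0, next = 0, misses = 0;
    assert(cache_size > 0 && cache_size <= MAX_CACHE);
    for (size_t i = 0; i < count; ++i) {
        int hit = 0;
        for (size_t j = 0; j < filled; ++j) {
            if (cache[j] == indices[i]) { hit = 1; break; }
        }
        if (!hit) {
            cache[next] = indices[i];        /* overwrite the oldest slot */
            next = (next + 1) % cache_size;
            if (filled < cache_size)
                ++filled;
            ++misses;
        }
    }
    return misses;
}
```

For the index list 1,2,3, 2,3,4, 4,2,1 and a 15-entry cache this reports 4 misses, matching the five "taken from cache" entries in the listing above. Dividing misses by the triangle count gives the average cache miss ratio that reordering algorithms try to minimize.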

Aleksandar
11-29-2009, 04:40 PM
Thank you, M/\dm/\n!

But those are general terms I already know. I need a reference to cite, and a starting point for my further research. I want to prove that my algorithm is good enough, and I need to measure its performance. The number of rendered primitives is relatively low, but in some cases the number of function calls can explode. I need to measure the impact of the number of function calls on the drop in frame rate, but...

There are two problems with benchmarks:

1. cold-start
2. power saving

All modern processors have power management that reduces power consumption, and also execution speed, if the task is not demanding. So it is almost impossible to measure the real speed of a GPU using the same test on all machines. The other problem, cold start, is that the speed of a test depends on the task that ran before it.

Probably there are thousands of other problems, but those two are currently the most important for me.

Trying to solve those problems, I have carried out many experiments to find "the raw power of the GPU" by measuring the number of triangles that can be rendered per unit of time. I discovered a non-linear dependency and a sudden drop in "triangles per millisecond" speed when changing the size of the VBOs. That was the reason for my previous question.
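
A common way to soften both effects is to discard the first frames and report a robust statistic instead of a single reading. The C sketch below does that on an array of measured frame times. It is only an assumption about how such a benchmark could be post-processed; it does not control power management itself, and median_after_warmup is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for doubles in ascending order */
static int compare_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Skips the first `warmup` samples (cold start, clocks still ramping up),
   sorts the remaining samples in place and returns their median. */
double median_after_warmup(double *frame_ms, size_t count, size_t warmup)
{
    assert(count > warmup);
    double *kept = frame_ms + warmup;
    size_t n = count - warmup;
    qsort(kept, n, sizeof *kept, compare_double);
    return (n % 2) ? kept[n / 2]
                   : 0.5 * (kept[n / 2 - 1] + kept[n / 2]);
}
```

The median ignores occasional spikes, so one slow frame in the middle of a run does not distort the result the way an average would.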

I'm sorry for this long post, but ... for the last two days testing GPUs has been my main occupation. :(

Aleksandar
11-29-2009, 04:42 PM
If it is, static_draw VBO with indexed draws and cache friendly indices will probably be as fast as display list.

Should be even faster than DLs, but unfortunately they are not. :(

M/\dm/\n
11-30-2009, 12:46 AM
If it is, static_draw VBO with indexed draws and cache friendly indices will probably be as fast as display list.

Should be even faster than DLs, but unfortunately they are not. :(

Well, in that case I can't tell you much. I haven't been coding/researching OpenGL since the 1.5 days and I'm picking everything up again right now. There's a lot I've missed.

Maybe someone from the advanced forum knows the answer. NVIDIA and ATI developers used to post there.

Aleksandar
11-30-2009, 12:30 PM
Thank you M/\dm/\n, anyway!

And I'm glad you are back to OpenGL programming again. :)

Ketracel White
12-01-2009, 04:57 AM
If it is, static_draw VBO with indexed draws and cache friendly indices will probably be as fast as display list.

Should be even faster than DLs, but unfortunately they are not. :(


Why should that be faster? I can't imagine anything being theoretically faster than having the driver create a raw list of GPU commands for a drawing operation, including vertex optimization. If implemented well, I don't think anything could be faster than a display list.

devdept
12-01-2009, 05:06 AM
I believe that the reason is that you specify the same vertex many times instead of one time only using VBOs...

Ketracel White
12-01-2009, 05:54 AM
Yes, but a well-designed draw list compiler should be able to take care of that.

Aleksandar
12-01-2009, 07:24 AM
The reason static VBOs should be faster than DLs is that DLs are more complicated than VBOs. Generally, DLs can contain transformations, state changes, et cetera, besides the raw data. VBOs are just buffers of data.

The deprecation model was introduced not to make programmers' lives easier, but to make driver optimization easier. Optimizing display lists is not an easy task to implement.

Your comment is very good. Well-designed drivers WOULD gain the highest performance with DLs, but... Optimization becomes harder and harder with every new feature added to GL.

Aleksandar
12-01-2009, 07:28 AM
I believe that the reason is that you specify the same vertex many times instead of one time only using VBOs...
The repetition of vertices depends on the function used for drawing. VBOs often contain multiple copies of the same vertex. This is the only way glDrawArrays() can work.
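
To show what that duplication means in practice, this C sketch expands an indexed triangle list into the flat array that glDrawArrays consumes. The Vec3f type and the function expand_indexed are illustrative names, not GL API.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z; } Vec3f;

/* Writes one full vertex per index into `out`, which must hold
   index_count entries. Shared vertices are copied once per use,
   which is the layout glDrawArrays(GL_TRIANGLES, 0, index_count) reads. */
void expand_indexed(const Vec3f *vertices, size_t vertex_count,
                    const unsigned int *indices, size_t index_count,
                    Vec3f *out)
{
    for (size_t i = 0; i < index_count; ++i) {
        assert(indices[i] < vertex_count);
        out[i] = vertices[indices[i]];
    }
}
```

For a quad drawn as two triangles this turns 4 vertices into 6; on a large closed mesh, where each vertex is typically shared by several triangles, the flat array can be several times larger than the indexed one.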

Jan
12-01-2009, 10:34 AM
When you look at all the new extensions of recent years, the question often comes up: "should this and that be included in display lists?" Very often the answer is "no", even if it would make sense from the general idea of display lists. I think most of the time the reason is simply to keep display lists from becoming even more complicated to implement.

In the future I would like to see something like display lists return, but with a much better design, because from a performance point of view, especially considering multithreaded rendering, DLs COULD be the best thing, but only if it can be guaranteed that all vendors can implement them with limited effort.

Jan.

Aleksandar
12-01-2009, 11:01 AM
Completely agree!
I also hope that DLs will be included into the Core of GL.