glMultiDrawElements or glPrimitiveRestartIndex?



padawan
04-19-2011, 07:48 PM
From what I understand they are basically the same; the wiki confirmed my impression. So I don't see what the advantage of one over the other would be. Is it just the memory management overhead in glMultiDrawElements? Should I prefer glPrimitiveRestartIndex for that reason?
I actually might prefer glMultiDrawElements; it looks "cleaner".

Alfonse Reinheart
04-19-2011, 08:21 PM
Well, consider this: glMultiDrawElements uses a client-side array that specifies offsets into the buffer object for rendering. Primitive restart-based rendering is all done on the GPU. It's entirely possible that multi-draw is done with a loop right on the CPU and therefore doesn't buy you any performance gains you couldn't get yourself.

Also, primitive restart works in combination with all other rendering methods. There's no glMultiDrawRangeElements, for example, nor is there glMultiDrawElementsInstanced.
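
To make the comparison concrete, here is a minimal sketch of both approaches (assuming a bound VAO and an element buffer that holds three 8-index strips back to back, with 32-bit indices):

/* Option 1: glMultiDrawElements -- counts and offsets are client-side arrays. */
GLsizei counts[3] = { 8, 8, 8 };
const GLvoid *offsets[3] = {
    (const GLvoid *)(0  * sizeof(GLuint)),
    (const GLvoid *)(8  * sizeof(GLuint)),
    (const GLvoid *)(16 * sizeof(GLuint))
};
glMultiDrawElements(GL_TRIANGLE_STRIP, counts, GL_UNSIGNED_INT, offsets, 3);

/* Option 2: primitive restart -- the strips are separated in the index
 * buffer itself by a sentinel index (0xFFFFFFFF here), so a single
 * glDrawElements covers all three: 3*8 strip indices + 2 restarts = 26. */
glEnable(GL_PRIMITIVE_RESTART);
glPrimitiveRestartIndex(0xFFFFFFFFu);
glDrawElements(GL_TRIANGLE_STRIP, 26, GL_UNSIGNED_INT, (const GLvoid *)0);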

padawan
04-19-2011, 08:36 PM
Thanks for the answer


Well, consider this: glMultiDrawElements uses a client-side array that specifies offsets into the buffer object for rendering. Primitive restart-based rendering is all done on the GPU. It's entirely possible that multi-draw is done with a loop right on the CPU and therefore doesn't buy you any performance gains you couldn't get yourself.
Really? Isn't the purpose of methods like this to send a batch to your dedicated hardware? If it's as you say, glMultiDrawElements is basically useless. Why would I need something like that?


Also, primitive restart works in combination with all other rendering methods. There's no glMultiDrawRangeElements, for example, nor is there glMultiDrawElementsInstanced.
I'm not sure what you mean. If I understand correctly I can do both of these with glMultiDrawElements. Do you mean it's easier to do that with glPrimitiveRestartIndex?

EDIT: actually maybe I can't do the instanced part that easily

Alfonse Reinheart
04-19-2011, 08:46 PM
Really? Isn't the purpose of methods like this to send a batch to your dedicated hardware? If it's as you say, glMultiDrawElements is basically useless. Why would I need something like that?

Because, at some point in the past, it sounded like a good idea.

Also, that doesn't mean that there's a guarantee of zero performance improvement with its use. There are some things that drivers need to do (verify that buffer objects exist and are on the GPU, get their pointers, etc) that would have to be done with multiple repeated calls.

As with anything else, if performance is important, you should test and see what happens.


If I understand correctly I can do both of these with glMultiDrawElements.

Not according to the quick reference card (http://www.khronos.org/files/opengl41-quick-reference-card.pdf). There are only 3 multi-draw functions: MultiDrawArrays, MultiDrawElements, and MultiDrawElementsBaseVertex.
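
And to illustrate the combination point from my earlier post (indexCount and instanceCount are placeholders; assumes a GL 3.1+ context): primitive restart drops straight into the instanced path, which the multi-draw family cannot express:

glEnable(GL_PRIMITIVE_RESTART);
glPrimitiveRestartIndex(0xFFFFu);   /* sentinel value for 16-bit indices */
glDrawElementsInstanced(GL_TRIANGLE_STRIP, indexCount, GL_UNSIGNED_SHORT, 0, instanceCount);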

padawan
04-19-2011, 09:13 PM
Because, at some point in the past, it sounded like a good idea.

Also, that doesn't mean that there's a guarantee of zero performance improvement with its use. There are some things that drivers need to do (verify that buffer objects exist and are on the GPU, get their pointers, etc) that would have to be done with multiple repeated calls.

As with anything else, if performance is important, you should test and see what happens.
Or I can use glPrimitiveRestartIndex :)
I could stand some extra memory management, and I'm not even sure about that. But if this is the price I have to pay, I'll go straight to primitive restart. At this point I don't see a real advantage in using glMultiDrawElements. Why isn't it deprecated?


Not according to the quick reference card (http://www.khronos.org/files/opengl41-quick-reference-card.pdf). There are only 3 multi-draw functions: MultiDrawArrays, MultiDrawElements, and MultiDrawElementsBaseVertex.
I know that. I always do my research :)
What I mean is that I could get the same result as a hypothetical glMultiDrawRangeElements using glMultiDrawElements, since I can specify my range there. Doing instanced drawing seems a bit more problematic though.

However, I would say this is now more of academic interest, as glPrimitiveRestartIndex seems the way to go.
I would just be curious to know why glMultiDrawElements is still in the specification.

mhagain
04-19-2011, 11:59 PM
Actually just using GL_TRIANGLES as the primitive type would be what I'd recommend. No need for any special case handling, multiple complex chains of triangles can be grouped together, you can keep the same vertexes in the same order as before, and you still get all the benefits of indexing.

padawan
04-20-2011, 12:33 AM
Actually just :) using GL_TRIANGLES as the primitive type would be what I'd recommend. No need for any special case handling, multiple complex chains of triangles can be grouped together, you can keep the same vertexes in the same order as before, and you still get all the benefits of indexing.
I've seen a few discussions about using triangles or strips. In the end, as far as I understand, the problem is just maximizing hits in the vertex cache. At the moment I'm favoring strips, but I'll have to look at this issue in detail later. I already have a question I want to ask about that :)
Indeed this is a point for triangles: you don't need restarting.

Aleksandar
04-20-2011, 11:27 AM
This topic is probably exhausted, but I want to add some lines from my personal experience:

1. glDrawElements using primitive restart is certainly faster than glMultiDrawElements. The reason is not the execution speed on the GPU (the difference there is only a few percent), but the way drivers interpret the glMultiDrawElements command. I don't know exactly what happens in the drivers, but it seems that they execute glDrawElements in a loop. (I think Alfonse also stated that several years ago, but I can't find that post.) The reason I believe this is that I got a 4x speedup in my code when I switched to primitive restart. But that speedup was only in CPU time! GPU time stayed almost the same. So primitive restart may unload your CPU, but it will not make your drawing faster. Please don't misunderstand me; unloading the CPU is a very useful effect of the primitive restart approach.

2. Although indexing enables more flexible use of the GL_TRIANGLES primitive, GL_TRIANGLE_STRIP is not for the recycle bin yet. For many regular structures (like regular grids), GL_TRIANGLE_STRIP enables more compact index buffers. For example, for an NxN regular grid (N vertices per side):
- GL_TRIANGLES requires 3*2*(N-1)*(N-1) = 6(N^2-2N+1) ≈ 6N^2 indices
- GL_TRIANGLE_STRIP requires 2*N*(N-1) = 2N^2-2N ≈ 2N^2 indices
- GL_TRIANGLE_STRIP with the cache optimization proposed by Ignacio Castaño requires 2N^2 - 2N + floor(N/(cacheSize-1))*2N indices
It would be interesting to discuss the efficiency of the vertex post-transform cache, but at first glance it seems that newer cards are less sensitive to this optimization (I need to carry out some experiments in order to confirm or reject this premise). But the point is that strips consume about a third of the index memory for a regular grid compared to plain triangles.
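
To put numbers on this (a worked example; N = 256 is an arbitrary choice):

unsigned N = 256;
unsigned triIndices   = 6u * (N - 1) * (N - 1);  /* GL_TRIANGLES:      390,150 */
unsigned stripIndices = 2u * N * (N - 1);        /* GL_TRIANGLE_STRIP: 130,560 */
/* triIndices / stripIndices ~= 3.0 -- strips need about a third of the index memory */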

padawan
04-20-2011, 12:00 PM
This topic is probably exhausted, but I want to add some lines from my personal experience:

1. glDrawElements using primitive restart is certainly faster than glMultiDrawElements. The reason is not the execution speed on the GPU (the difference there is only a few percent), but the way drivers interpret the glMultiDrawElements command. I don't know exactly what happens in the drivers, but it seems that they execute glDrawElements in a loop. (I think Alfonse also stated that several years ago, but I can't find that post.) The reason I believe this is that I got a 4x speedup in my code when I switched to primitive restart. But that speedup was only in CPU time! GPU time stayed almost the same. So primitive restart may unload your CPU, but it will not make your drawing faster. Please don't misunderstand me; unloading the CPU is a very useful effect of the primitive restart approach.

2. Although indexing enables more flexible use of the GL_TRIANGLES primitive, GL_TRIANGLE_STRIP is not for the recycle bin yet. For many regular structures (like regular grids), GL_TRIANGLE_STRIP enables more compact index buffers. For example, for an NxN regular grid (N vertices per side):
- GL_TRIANGLES requires 3*2*(N-1)*(N-1) = 6(N^2-2N+1) ≈ 6N^2 indices
- GL_TRIANGLE_STRIP requires 2*N*(N-1) = 2N^2-2N ≈ 2N^2 indices
- GL_TRIANGLE_STRIP with the cache optimization proposed by Ignacio Castaño requires 2N^2 - 2N + floor(N/(cacheSize-1))*2N indices
It would be interesting to discuss the efficiency of the vertex post-transform cache, but at first glance it seems that newer cards are less sensitive to this optimization (I need to carry out some experiments in order to confirm or reject this premise). But the point is that strips consume about a third of the index memory for a regular grid compared to plain triangles.
Obviously the topic wasn't really exhausted :)

Both points are interesting, but I find the second one especially interesting, for the simple reason that I have never heard of this strip cache optimization. I did know the name Ignacio Castaño though; I believe he wrote some stripifying code.
My first thought is that strips are indeed more convenient, because you need less data; it's the reason I use them. However, I found this (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=290338&page=1) thread, where there are some points in favor of triangles. I still have to clearly understand what it says though.
From what I see, the point is basically maximizing cache hits. It would be ideal to know the capabilities of your cache, so you could optimize your strips/triangles for the hardware you have. But I'm afraid it's not so easy.
Another point is that you might have to do the stripification every time before rendering, overloading the CPU, but this is not necessary if you have a fixed set of objects. In that case you only need to do it once, maybe even offline.

The first point is for me a useful confirmation that multidraw has to go into the trashcan. Utterly useless method.

Aleksandar
04-20-2011, 12:48 PM
I missed that thread. :)
Now it is several months old so I'll not add new comments.

mhagain overemphasizes a little the influence of the vertex post-transform cache by taking only vertices into account. Currently I'm trying to remove vertex attributes from my renderings entirely. In that case indices are the only data streamed to the GPU (along with textures), and having three times less data is significant. When reading the thread you've quoted, carefully read what Daniel (aqnuep) and Dark Photon said.

You can read more about Ignacio's method here: Optimal Grid Rendering (http://www.ludicon.com/castano/blog/2009/02/optimal-grid-rendering/). But I get almost the same rendering speed with simple primitive restart; that's why I said I have to carry out more experiments. The vertex shaders I used were pretty simple.

mhagain
04-20-2011, 01:14 PM
Some more of my thoughts on this matter.

I'm not too concerned about memory and/or bandwidth overhead for indexes. One would need to be really using huge amounts of indexes for this to become a dominant factor. I tend to prefer 16-bit indexes myself so impact is halved anyway.

The requirement of 6n^2 indexes for GL_TRIANGLES needs to take into account that you'll most commonly get far superior vertex reuse, with a theoretical best case reuse of double that which you'll get with GL_TRIANGLE_STRIP (and consequently half the memory/bandwidth for vertexes). That's assuming best-case for both; real-world will be somewhere in between. Let's go nuts. At one extreme I've seen meshes where the original stripification was not that great, and which, when converted to plain indexed triangles, needed as little as 10% of the vertexes as before. And bouncing back again, worst-case with plain triangles will get you no better than with strips, but with the additional overhead of those extra indexes.
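
For a rough sense of where "double" comes from (a back-of-envelope, not a measurement): in a large regular grid, a perfectly cache-optimized indexed GL_TRIANGLES ordering transforms each vertex once, i.e. about N^2 transforms for roughly 2N^2 triangles, so ~0.5 transforms per triangle; a strip brings in one new vertex per triangle, so ~1.0 per triangle. Half the transforms per triangle is double the effective reuse.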

As vertexes are always going to be heavier than indexes, this can all add up to accepting additional overhead in one area in exchange for reduced overhead in another, which can translate into huge performance gains if the use case is appropriate. But it's something you need to measure on your own datasets before making a decision either way, and you do need to balance both sides of the equation (more indexes in exchange for fewer vertexes), otherwise you're just working from incorrect data.

The point about vertex caches becoming less relevant on newer hardware is well taken and needs further discussion, testing, benchmarking and experimenting.

As always - profile, profile, profile. And do it before putting in the work. If vertex submission or the vertex pipeline are not bottlenecks in your program, none of this will do you any good. That can't be stressed enough.

Summary is that neither approach is inherently better than the other all of the time. For the most common types of meshes you'll be using, plain triangles with indexes are more likely to outperform strips (with or without indexes) on bandwidth and vertex reuse terms, so they are what I'd personally recommend in the general case. You can always find something that reverses that however, so it's up to you whether or not you want to add special-case code for that something. And if your program's bottleneck is actually elsewhere, you can completely ignore all of this. ;)

Alfonse Reinheart
04-20-2011, 01:41 PM
I tend to prefer 16-bit indexes myself so impact is halved anyway.

Do you prefer this to the extent that you perform multiple draw calls (possibly with BaseVertex)? That is, if you have a model with more than 2^16 vertices in it, do you issue multiple draw calls to render it, or just a single draw call?
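
To be clear, the kind of splitting I mean would look something like this (a sketch; the 'chunks' bookkeeping is hypothetical, just to illustrate local 16-bit indices plus a per-chunk base):

for (int i = 0; i < chunkCount; ++i)
    glDrawElementsBaseVertex(GL_TRIANGLES, chunks[i].indexCount,
                             GL_UNSIGNED_SHORT,
                             (GLvoid *)chunks[i].indexByteOffset,
                             chunks[i].baseVertex);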


The requirement of 6n^2 indexes for GL_TRIANGLES needs to take into account that you'll most commonly get far superior vertex reuse, with a theoretical best case reuse of double that which you'll get with GL_TRIANGLE_STRIP (and consequently half the memory/bandwidth for vertexes).

This is the statement that is in dispute.

Forsyth's calculations dealt with index reuse, not vertex reuse. That is, the number of indices specified by the user that get reused in the space of a draw call. What he didn't take into account was that strips have fewer indices by their very nature, and will therefore naturally show less index reuse. This makes his numeric comparisons less useful for determining which one works better.

That being said, the following is true: given an arbitrary mesh, it is easier to get good vertex reuse with a triangle-based algorithm than with a strip-based one. Just writing the code to do it is easier in the triangle case.

mhagain
04-20-2011, 01:49 PM
I tend to prefer 16-bit indexes myself so impact is halved anyway.

Do you prefer this to the extent that you perform multiple draw calls (possibly with BaseVertex)? That is, if you have a model with more than 2^16 vertices in it, do you issue multiple draw calls to render it, or just a single draw call?

Neither. I prefer it to the extent that it's more widely supported in hardware.

Alfonse Reinheart
04-20-2011, 02:08 PM
Neither. I prefer it to the extent that it's more widely supported in hardware.

I was asking what you do if you're presented with a mesh that has more than 2^16 vertices in it.

32-bit indices are widely supported in hardware, and have been for quite some time. They're certainly slower on a lot of hardware, but they've been supported for a long time.

kyle_
04-20-2011, 02:10 PM
Where did you encounter hardware without support for 32-bit indices?

Aleksandar
04-20-2011, 02:24 PM
I completely agree with you!

For an arbitrary complex mesh, TRIANGLES is probably more flexible than STRIPS for achieving more vertex-cache hits. I just said that there are special cases, like regular grids, where STRIPS are much better. So we agree about this too. :)

I also agree with you that if the bottleneck is not in the VS stage, all this is totally irrelevant. That's why I said I need more experiments with very complex vertex shaders in order to estimate the efficiency of the vertex cache.

The only disagreement is with the statement that STRIPS could require more vertices (or maybe I misunderstood the point of "more indexes in exchange for less vertexes"). The number of vertices is exactly the same in both cases.

Please consider the case of rendering a mesh that doesn't have any vertex attributes. In that case, the indices are the only data required. The position of each vertex can be generated from its ID, and other data can be read from textures. In this case using STRIPS, where possible, can reduce the memory footprint of the mesh. Of course, for an arbitrary mesh it is probably very hard to achieve good vertex-cache utilization, much harder than with triangles. The point of my previous post is that TRIANGLE_STRIPS shouldn't be rejected a priori in favor of TRIANGLES. ;)

Alfonse Reinheart
04-20-2011, 02:29 PM
Please consider the case of rendering a mesh that doesn't have any vertex attributes. In that case, the indices are the only data required.

Technically, you don't even need that; you can (theoretically) just call glDrawArrays with some number of vertices. Note that I have not seen this tested, nor have I tested it myself, so I can't be sure that this actually works. But core GL 3.x and above should be able to do it.
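
A sketch of the idea (untested, as I said; it assumes a core 3.x context, where some VAO must be bound even if it enables no attributes, and 'vertexCount' is a placeholder):

GLuint emptyVao;
glGenVertexArrays(1, &emptyVao);
glBindVertexArray(emptyVao);        /* no attribute arrays enabled */
glDrawArrays(GL_TRIANGLES, 0, vertexCount);

/* The vertex shader would derive everything from gl_VertexID, e.g.
 *   vec2 p = vec2(gl_VertexID % gridWidth, gl_VertexID / gridWidth);
 */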

Aleksandar
04-20-2011, 02:43 PM
I'm doing that on a regular basis, and it works.
Although I'm using glDrawElements because of its ability to control vertex-cache reuse through index reordering. glDrawArrays also implies duplicating shared vertices, which requires some gymnastics with the vertex ID that can be skipped entirely with index-based drawing (like glDrawElements).

Dark Photon
04-20-2011, 06:32 PM
2. Although indexing enables more flexible use of the GL_TRIANGLES primitive, GL_TRIANGLE_STRIP is not for the recycle bin yet.
Agreed. "Indexed" TRIANGLE_STRIPS have some marginal utility, under some limited circumstances.

However, I think we may agree (as I mentioned in another thread (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=295091&Searchpage=1&Main=56035&Words=atvr&Search=true#Post295091)) that "non-indexed" TRIANGLE_STRIPS should be avoided. ATVR (average transform-to-vertex ratio) up near 2.0 -- nowhere near the 1.0 you can get with indexed TRIANGLES.

padawan
04-21-2011, 04:51 AM
Okay, I have something I wanted to ask later, but I'll take advantage of this thread to do that now. So:

1) can someone explain when or why triangles are better than strips? I can't imagine a situation where the layout is so complex that strips miss the cache so much.
Actually, how big can a cache be?

2) would it be possible and reasonable to optimize the process for a specific cache? OpenGL doesn't have a way to get information about the cache, does it?

Thanks

Aleksandar
04-21-2011, 04:04 PM
You have scratched the surface of a pretty complex topic; anyway, I wonder why nobody dares to answer these questions. :)

1) In order to preserve locality, it is sometimes better to traverse triangles in a pseudo-circular manner, using closed space-filling curves (like the Sierpinski curve). Triangle strips can rapidly progress in a certain direction and hence easily violate locality. Furthermore, long strips might also hurt the input (pre-transform) cache.

The size of the vertex post-transform cache differs from card to card. Each processing unit has its own cache, whose size is usually expressed as a number of entries (transformed vertices). But even for the same GPU, the number of entries may vary depending on the number and sizes of the vertex output attributes. Furthermore, in order to parallelize vertex processing, GPUs divide a single VBO across multiple processing units and duplicate shared vertices. Even with perfect vertex ordering there are still many redundant calculations.

In short:
- the size of the vertex post-transform cache is usually unknown,
- it depends on the GPU architecture and on the number and size of the vertex shader output attributes (because we need the size in entries, not in bytes),
- how a VBO will be split across multiple execution units is also unknown,
- whether the cache uses a FIFO or an LRU policy is also unknown.

You could run some benchmarks (even at run time) and determine all those parameters, or do it offline and optimize your application for a certain architecture.
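
A rough sketch of such a run-time probe (assumes GL 3.3 timer queries and a trivial vertex shader; buildProbeIndices() is a hypothetical helper that rebuilds the index buffer so each vertex is referenced again 'stride' indices later):

GLuint q;
glGenQueries(1, &q);
for (int stride = 4; stride <= 64; stride *= 2) {
    buildProbeIndices(ibo, stride);                     /* hypothetical helper */
    glBeginQuery(GL_TIME_ELAPSED, q);
    glDrawElements(GL_TRIANGLES, PROBE_INDEX_COUNT, GL_UNSIGNED_INT, 0);
    glEndQuery(GL_TIME_ELAPSED);
    GLuint64 ns = 0;
    glGetQueryObjectui64v(q, GL_QUERY_RESULT, &ns);     /* waits for the GPU */
    printf("stride %2d: %llu ns\n", stride, (unsigned long long)ns);
}
/* GPU time should jump once 'stride' exceeds the post-transform cache size. */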

But there are much better methods. Many cache-optimization algorithms are cache-oblivious: assuming an LRU strategy, they try to exploit vertex locality as much as possible. They are not optimal, but they are good enough for a wide variety of hardware.

There are many algorithms for optimizing vertex post-transform cache usage. Some achieve better ACMR (average cache miss ratio), while others are extremely fast and can be used for real-time optimization.

To make things even more complicated, efficient utilization of early-Z for very complex objects with expensive fragment shaders requires calculating and storing several index buffers for a single object, depending on the viewing angle.

2) Yes and no! Yes, it is possible, but usually it is not reasonable because of the wide variety of hardware. I think I already answered this question with the previous one. ;)

ZbuffeR
04-21-2011, 04:12 PM
Sierpinski is not space-filling, it is more Emmental-filling ...
Try a Hilbert curve instead, as detailed on page 556 of "Real-Time Rendering" by Tomas Möller, Eric Haines, and Naty Hoffman.

padawan
04-21-2011, 07:53 PM
You have scratched the surface of a pretty complex topic; anyway, I wonder why nobody dares to answer these questions. :)
Yeah, I can imagine that. As I said, I didn't want to touch it, it's way too advanced for me, but the discussion got started. Hell, after all it's my thread :)

I guess nobody is answering because I'm too annoying :D


The size of the vertex post-transform cache differs from card to card. Each processing unit has its own cache, whose size is usually expressed as a number of entries (transformed vertices). But even for the same GPU, the number of entries may vary depending on the number and sizes of the vertex output attributes. Furthermore, in order to parallelize vertex processing, GPUs divide a single VBO across multiple processing units and duplicate shared vertices. Even with perfect vertex ordering there are still many redundant calculations.
Damn. Where do I get all this information? Looks like important stuff to me.


You could run some benchmarks (even at run time) and determine all those parameters, or do it offline and optimize your application for a certain architecture.
What kind of benchmarks? Like progressively increasing the size of a buffer and see when the performance goes down?


But there are much better methods. Many cache-optimization algorithms are cache-oblivious: assuming an LRU strategy, they try to exploit vertex locality as much as possible. They are not optimal, but they are good enough for a wide variety of hardware.

There are many algorithms for optimizing vertex post-transform cache usage. Some achieve better ACMR (average cache miss ratio), while others are extremely fast and can be used for real-time optimization.
uhmmm... does that mean that in the end triangles are not so useful?


2) Yes and no! Yes, it is possible, but usually it is not reasonable because of the wide variety of hardware. I think I already answered this question with the previous one. ;)
I'm not sure what you mean.
What I'm proposing is to get information about the cache on the specific hardware, for example when you install the application, or when you load it. Then, that single time, you can run your optimization and save the results, so you end up with a layout that is optimal for that specific machine.
Something similar to what you said about the benchmarks, just without the benchmarks: you get the information in a more efficient way, if that's possible.


Sierpinski is not space-filling, it is more Emmental-filling ...
Try a Hilbert curve instead, as detailed on page 556 of "Real-Time Rendering" by Tomas Möller, Eric Haines, and Naty Hoffman.
I'm on it :)

Aleksandar
04-22-2011, 02:26 AM
Well, testing capabilities at startup time, or whenever the hardware (or driver) changes, is a feasible solution. Just keep in mind that different shaders have different output attributes. It would probably be better to use just a single strategy based on some cache-oblivious approach. If you decide to discover the cache size, it is better to guess a good starting size and then try to adjust it. In my application 32 entries proved to be a good guess. :)