PDA

View Full Version : NV_primitive_restart anyone?



namespace
06-07-2004, 06:00 AM
Hi!

Has anyone tested the GL_NV_primitive_restart Extension?

Linky (http://oss.sgi.com/projects/ogl-sample/registry/NV/primitive_restart.txt)

The possibility to insert glDraw calls into the vertex data looks like a real improvement to me.
Unfortunately I'm not able to benchmark it (NV30 emulation); I just want to know whether it's worth spending some time on it.
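According to the spec linked above, the client registers a reserved index with glPrimitiveRestartIndexNV and enables GL_PRIMITIVE_RESTART_NV via glEnableClientState; every occurrence of that index in the element array then begins a new primitive. Here is a minimal sketch of the index-buffer side in C, with the marker value and helper name being my own illustration (the GL calls are referenced only in comments):

```c
#include <stddef.h>

/* Illustrative restart marker: any index value never used for a real
 * vertex. With NV_primitive_restart you would register it once via
 *   glEnableClientState(GL_PRIMITIVE_RESTART_NV);
 *   glPrimitiveRestartIndexNV(RESTART);
 * and then issue a single glDrawElements(GL_TRIANGLE_STRIP, ...) call. */
#define RESTART 0xFFFFu

/* Concatenate several triangle strips into one index array, separating
 * them with the restart marker, so one draw call can replace one call
 * per strip. Returns the combined index count. */
size_t merge_strips(const unsigned *strips[], const size_t lengths[],
                    size_t strip_count, unsigned *out)
{
    size_t n = 0;
    for (size_t s = 0; s < strip_count; ++s) {
        if (s > 0)
            out[n++] = RESTART;  /* end previous strip, start the next */
        for (size_t i = 0; i < lengths[s]; ++i)
            out[n++] = strips[s][i];
    }
    return n;
}
```

Two strips of 4 and 3 indices merge into a single 8-index array (4 + 1 restart marker + 3), so k strips cost k-1 extra indices instead of k-1 extra draw calls.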

plasmonster
06-07-2004, 01:05 PM
I have used it, and it's a thing of beauty, a simple, elegant solution.

Unfortunately, it doesn't look like there's going to be widespread support for this anytime soon, AFAIK. That's the only problem I would see with being dependent on it.

Obli
06-08-2004, 02:51 AM
Originally posted by Sean:
I have used it, and it's a thing of beauty, a simple, elegant solution.

Unfortunately, it doesn't look like there's going to be widespread support for this anytime soon, AFAIK. That's the only problem I would see with being dependent on it.
Please, can you tell me how much faster it goes?
It does not need to be accurate; "somewhat faster" or "fast enough" would be enough.
I have tried to use it, but it didn't fit my data path, so I decided to drop it. I would still like to hear those rumors.
Thank you in advance.

Adrian
06-08-2004, 03:15 AM
If your app is CPU limited and makes hundreds/thousands of drawelements calls per frame then you may benefit from this extension. It's difficult to quantify how much difference it makes because it depends on many factors.

There is a small but significant CPU overhead each time you call drawelements(or similar). This extension allows you to reduce the number of those calls.

I haven't used this extension yet, but it looks extremely useful.

plasmonster
06-08-2004, 04:27 AM
@Obli
It was *faster* in the context of a LOD terrain. I'd like to be more specific, but I've since done away with the extension, growing weary of waiting for the ARB version. But I would use it just for the facility of primitive batching, without the performance perk.

You could get a big win for LOD terrains, though. Some algorithms require the insertion of degenerate triangles in order to form a single continuous tristrip (Lindstrom, et al). This is where it could really come in handy.
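For comparison, the degenerate-triangle stitching that such algorithms use can be sketched as follows; the helper name and layout are illustrative, not taken from any particular paper:

```c
#include <stddef.h>

/* Stitch strip B onto strip A by repeating A's last index and B's first
 * index. The repeats produce zero-area (degenerate) triangles that the
 * GPU culls, so both strips render in a single glDrawElements call.
 * Note: if strip A has an odd index count, one extra duplicate index is
 * needed to keep B's winding order correct; that case is skipped here. */
size_t stitch_strips(const unsigned *a, size_t na,
                     const unsigned *b, size_t nb, unsigned *out)
{
    size_t n = 0;
    for (size_t i = 0; i < na; ++i) out[n++] = a[i];
    out[n++] = a[na - 1];  /* degenerate bridge: repeat last index of A */
    out[n++] = b[0];       /* ...and first index of B                   */
    for (size_t i = 0; i < nb; ++i) out[n++] = b[i];
    return n;              /* na + nb + 2 indices total                 */
}
```

Each join costs 2 extra indices and 4 degenerate triangles, whereas a restart index would cost 1 index and no degenerates.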

I've often wondered if it would be possible to extend this to accept a primitive type after the restart index.

End();
Begin(new_type);

This might make for some nifty batching possibilities, most notably in the context of generic meshes, which are rarely uniform wrt primitive sequencing.

namespace
06-08-2004, 11:03 AM
The big chance I see is that triangle strips finally become effective, compared to a simple but cache-killing triangle-list solution.

Korval
06-08-2004, 11:49 AM
I've since done away with the extension, growing weary of waiting for the ARB version.
Why would you expect an ARB version? This is something that has to be built into the hardware specifically, and the choice for the restart value can easily be different depending on the hardware.


The big chance I see is that triangle strips finally become effective
I don't know what strips you're using, but they have almost always been effective. More effective than lists.

namespace
06-08-2004, 01:14 PM
I don't know what strips you're using, but they have almost always been effective. More effective than lists.
Having multiple draw calls for every mesh can produce a "nice" overhead in the engine and driver.
Add some more draw calls because you have to change textures/shaders for a submesh, and you (i.e. me ;) )
end up with too many calls and small batches.

plasmonster
06-08-2004, 02:42 PM
Why would you expect an ARB version?
Most of the good ideas eventually make their way to the ARB. It just so happens that I think this extension is a particularly good one. I take it you do not agree?


This is something that has to be built into the hardware specifically, and the choice for the restart value can easily be different depending on the hardware.
I agree that each implementation will have to deal with the particulars, as is certainly the case for any extension. But I fail to see why this should dash my hopes for the extension making it to the ARB. Can you explain?

And as the restart index can be any user defined 32 bit value (nVidia spec), I don't understand why this would hinder an implementation, since conformant implementations have to support uint indices anyway.

Korval
06-08-2004, 04:31 PM
Having multiple draw calls for every mesh can produce a "nice" overhead in the engine and driver.
Add some more draw calls because you have to change textures/shaders for a submesh, and you (i.e. me )
end up with too many calls and small batches.
Unless, of course, you stitch all your strips together with degenerate triangles. And, if you're doing material changes in the middle of your mesh, you already have problems that no primitive restart can solve.


I take it you do not agree?
It's a fine idea, but the only implementation that gives good performance is one where the hardware actively understands the restart command. It's like VAR; it's very specific to a particular hardware implementation.


And as the restart index can be any user defined 32 bit value (nVidia spec)
Really? I haven't read the spec in a while (I don't use nVidia hardware at the moment, so NV extensions aren't something I keep up with).

Even so, one would still need to make hardware to do this. To be honest, there are more important performance problems to be solved before taking this one on.

Ozzy
06-09-2004, 01:12 AM
Well, primitive initialisation is *quite* expensive anyway; that's why in most cases the best configuration is one primitive per strip to get the best performance. Too bad that we can't specify/load a modelview matrix in conjunction with nv_primitive_restart; that would be a big plus for performance when drawing identical primitives at different locations, for instance. Maybe there is a trick for this kind of situation, but it doesn't seem possible according to the spec. By the way, why couldn't this kind of feature be added if this NV extension goes ARB? What are the real limitations, when you only need to apply a new modelview while the rest of the primitive description remains unchanged?

Jan
06-09-2004, 01:44 AM
My engine internally works with polygons instead of triangles. I "triangulate" them when filling the buffer for a draw call, simply by sorting the indices in such a way that each polygon gets sent as several triangles.
This has some nice advantages: for collision detection, etc., I can work with a lot fewer polygons than I would have to work with triangles, and by reusing the vertices of a polygon for several triangles, I can make heavy use of the pre- and post-T&L caches.

Now I was thinking whether it could speed up the engine if I used the primitive_restart extension (although at the moment I have a Radeon) and sent polygons instead of triangles, because most polygons are actually quads, and each quad gets split into 2 triangles, which is 6 vertices instead of 4.
I am not sure if I would get a speedup in this case, so what do you think?

On the other hand, in the spec it is said:


Is it feasible to guarantee fast performance even in the non-VAR, non-CVA, non-DRE case??? Possibly not.
So? What about VBOs? I don't see why this should be restricted to VAR.

Jan.

zeckensack
06-09-2004, 10:04 AM
Originally posted by Jan:
My engine internally works with polygons instead of triangles. I "triangulate" them when filling the buffer for a draw call, simply by sorting the indices in such a way that each polygon gets sent as several triangles.
This has some nice advantages: for collision detection, etc., I can work with a lot fewer polygons than I would have to work with triangles, and by reusing the vertices of a polygon for several triangles, I can make heavy use of the pre- and post-T&L caches.

Now I was thinking whether it could speed up the engine if I used the primitive_restart extension (although at the moment I have a Radeon) and sent polygons instead of triangles, because most polygons are actually quads, and each quad gets split into 2 triangles, which is 6 vertices instead of 4.
No, they are certainly 6 indices instead of 4, not vertices.
With NV_primitive_restart it would be 5 indices, and that's it. Everything else (namely vertex traffic, index numeric range, post-transform cache hits) is the same. I really wouldn't expect to gain anything in that special case.

It starts to make more sense for batches of larger polygons, and for strips and fans. For a convex polygon, if n is the vertex count, you need (n-2)*3 indices if you "tessellate" it into indexed triangles. But still, the only thing you can really save by using this extension is index traffic.
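The arithmetic above can be made concrete; these two helpers (names are my own) compare the index cost of a convex n-gon as an indexed triangle list versus as a fan followed by one restart marker in a batch:

```c
/* Index cost of one convex n-gon (n >= 3):
 * - "tessellated" into an indexed triangle list: (n - 2) * 3 indices;
 * - drawn as a fan in a restart-separated batch: n indices plus one
 *   restart marker. For a quad (n = 4) that is 6 vs. 5 indices, which
 *   is why the saving is marginal for small polygons and grows with n. */
unsigned indices_as_triangle_list(unsigned n) { return (n - 2) * 3; }
unsigned indices_as_restart_fan(unsigned n)  { return n + 1; }
```

For an octagon the gap is already 18 vs. 9 indices, though as noted, index traffic is all that is saved.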

V-man
06-09-2004, 03:17 PM
Originally posted by zeckensack:

It starts to make more sense for batches of larger polygons, and for strips and fans. For a convex polygon, if n is the vertex count, you need (n-2)*3 indices if you "tessellate" it into indexed triangles. But still, the only thing you can really save by using this extension is index traffic.
The idea is to reduce function call overhead, kind of like the multi_draw_arrays extension.

But instead of making multiple calls with the primitive being GL_TRIANGLE_STRIP, or making use of dead triangles, you can expand the index array and make a single call with GL_TRIANGLES being the primitive.

How many glDrawElements (or whatever) calls are you making per model?

IIRC, using dead triangles doesn't cost too much now on NVidia. Not sure about the others.

What is recommended in this area with the next gen hardware?

plasmonster
06-09-2004, 06:00 PM
For the interested reader:
http://www.nvidia.com/dev_content/nvopenglspecs/GL_NV_primitive_restart.txt


To be honest, there are more important performance problems to be solved before taking this one on.
@Korval

I'd be willing to go along with that, if batching weren't such an important issue, and IHVs were unable to walk and chew gum at the same time (work on more than one pipe issue at a time) :) .

But your point is well taken: The extension is more or less targeted at triangle strips at this point and, as such, is not a generic solution, and not likely to get the lion's share of attention across the board.


Now I was thinking whether it could speed up the engine if I used the primitive_restart extension (although at the moment I have a Radeon) and sent polygons instead of triangles, because most polygons are actually quads, and each quad gets split into 2 triangles, which is 6 vertices instead of 4.
I am not sure if I would get a speedup in this case, so what do you think?
@Jan
The best way to utilize this extension is to group your geometry by strips or fans, inserting a restart index where there's a break. It will work with triangles and quads too, but there's not much to gain from it. Unfortunately, at this time, you can't change the primitive type mid-stream (that sure would be cool, IMHO).


The idea is to reduce function call overhead, kind of like the multi_draw_arrays extension.
@V-man
This is a big win indeed, if you have lots of geometry.


IIRC, using dead triangles doesn't cost too much now on NVidia. Not sure about the others.
That's a good point. Today's hardware can handle degenerate triangles quite handily.


What is recommended in this area with the next gen hardware?
Well, batching is likely to be a huge issue for the foreseeable future; the question is whether this extension, or its ilk, will be part of the solution. I wish I knew the answer to that one. Maybe there's something leaner lurking out there.

Korval
06-09-2004, 06:49 PM
if batching weren't such an important issue, and IHVs were unable to walk and chew gum at the same time (work on more than one pipe issue at a time).
Batching isn't that important of an issue on OpenGL. You have some overhead for calling glDraw* 5 times rather than 1 (more than a mere function call), but good VBO use is far more important than that. The marshalling of GPU commands is very good these days.

If hardware makers could get state changes to be less costly, that would go much more to rendering performance than any primitive restart.

plasmonster
06-09-2004, 11:46 PM
Batching isn't that important of an issue on OpenGL.
I disagree. I see batching as among the biggest problems in the future of graphics. As worlds and characters get ever more complex, issuing all the draw commands will weigh heavily on performance.


You have some overhead for calling glDraw* 5 times rather than 1 (more than a mere function call), but good VBO use is far more important than that.
We're talking about thousands of calls here, possibly many more. It depends on many factors.
But this extension is orthogonal to VBOs. It's not enough to simply give the driver the vertex data; you have to tell the driver what to do with it. VBOs are a great way to manage data, but you still have to issue draw commands.


If hardware makers could get state changes to be less costly, that would go much more to rendering performance than any primitive restart.
I agree. But why can't we have both? The batching issue has to be addressed by someone. I can do everything possible to optimize my side, but eventually, I have to tell the driver what to do. This extension simply makes that communication more efficient.

Alas, the point of this discussion is probably moot, as there doesn't seem to be any sign of ATi joining the throng, AFAIK.

BTW, I never meant to suggest that this was a cure-all. I just think it's a pretty darn good idea.

Korval
06-10-2004, 12:29 AM
I see batching as among the biggest problems in the future of graphics. As worlds and characters get ever more complex, issuing all the draw commands will weigh heavily on performance.
You seem to misunderstand my point.

Let's say you have a program with plenty of CPU time to spare. So, you decide to change the stripping. For every 1 strip, wherever possible, you split it up into 5. Hence, you will need to call glDraw* 5x more than before.

Assuming that the program was vertex-transfer limited to begin with, the performance will drop primarily because of caching behavior dealing with vertex data. That is, the card works best when it reads a long, unbroken string of indices. You can mitigate this easily enough by putting the indices into a contiguous array. At this point, the performance penalty comes from only 3 potential places:

1: Function call overhead on glDraw*. In our case, we have plenty of CPU time, so this is negligible.

2: Driver marshalling of GPU commands. The boneheaded way of implementing glDraw* is to immediately put the commands into the GPU's FIFO, which could require a switch to Ring0 on the CPU (a slow operation). Few GL drivers do it this way. Drivers marshal GPU commands pretty efficiently these days.

3: Some oddball GPU problem. For whatever reason, the GPU has some significant delay between primitive batches. I have no factual, or even speculative, reason why a significant delay would exist.

1 is trivial, 3 doesn't exist, and drivers are pretty good at 2. Where's the batching problem?

Now, you might have read a PDF on nVidia's site about the importance of batching primitives. They suggest taking drastic measures to get large batches of primitives, because a 1GHz CPU only gets something like 10,000 batches. This PDF only refers to D3D, because D3D can't do #2 well at all. It has to use the "boneheaded" method, because of how the D3D driver model works. GL drivers can, and do, perform appropriate marshalling of GL commands.

None of this is to say that you should send a mesh as a sequence of 1-triangle-sized glDraw* calls. While #3 may not be significant, it is still there, and for rendering large numbers of polygons it can add up quickly. But for realistic numbers, it is quite negligible.

Note that this assumes the use of VBO index buffers as well as ATi hardware. I'm not sure about FX hardware, but I do recall that nVidia hardware through the GeForce 4 definitely had issues with the concept of index buffers. While they clearly support VBO index buffers well enough, it is clearly stated that the buffer object containing indices should be a different object from the actual mesh data, as this allows for implementations that can't handle video/AGP memory for indices. The general assumption about this level of nVidia hardware was that the driver, upon receiving a glDraw* command, was required to copy the given indices directly into the FIFO/marshal queue, which obviously doesn't work well if they are in video/AGP memory.

ATi hardware, of R300 caliber or better (if not R200 hardware), doesn't have this limitation. As such, all it needs to do is copy a 16-32 byte instruction opcode sequence into the FIFO (telling the GPU where the index buffer is, how long it is, and the format) for each glDraw* operation.

It is likely that NV30 fixed the nVidia issue, since NV30 supports primitive restart, which presupposes a better command processor/primitive unit.


The batching issue has to be addressed by someone.
The batching issue is, to my mind, resolved with degenerate triangles in strips. With the exception of one thing: triangle fans. I would dearly love to fan my terrain, but I can't due to the performance impact.

As such, the only times I make multiple glDraw* calls are for either particles or for state changes. My batches tend to be broken up by state change far more than by anything else.

harsman
06-10-2004, 02:28 AM
Originally posted by Sean:
As worlds and characters get ever more complex, issuing all the draw commands will weigh heavily on performance.
As worlds and characters get more complex, the number of triangles obviously increases, which means more triangles per draw call, no?

Maybe I didn't understand your point. I agree that ways to decrease batching overhead, like varying vertex stream frequencies and instancing, would be useful for many reasons, but I really don't think increasing geometric complexity is one of them.

zeckensack
06-10-2004, 06:34 AM
You shouldn't be comparing NV_primitive_restart supported rendering of fans, strips and polygons to "normal" rendering of the same.

Rendering large numbers of these primitive types may be prohibitive for whatever overhead there is, but then, you shouldn't be doing that anyway. What you do is use GL_TRIANGLES as the primitive type, and use indices. This is far more efficient, and it is what you should use as a baseline of comparison when you try to figure out what NV_primitive_restart can do for you.

And if you use degenerates, there is no issue to begin with.

PS: the notes about indirect rendering seem very plausible. We're talking about direct rendering here, right?

Adrian
06-10-2004, 07:12 AM
The way I see it is:

GL_TRIANGLE_STRIP -> CPU overhead, since more draw calls are required.
GL_TRIANGLES -> (CPU/AGP?) overhead due to more indices being sent.
GL_TRIANGLE_STRIP + NV_primitive_restart -> Almost no overhead

Degenerate triangles are a possible alternative; are there any issues with that method? If not, why did NVidia create this extension?

- Edited

zeckensack
06-10-2004, 07:24 AM
Originally posted by Adrian:
GL_TRIANGLES -> Vertex transform overhead
Why that? Use indices, problem solved. That's what post-transform vertex caches are for ...

Adrian
06-10-2004, 07:25 AM
Originally posted by zeckensack:

Originally posted by Adrian:
GL_TRIANGLES -> Vertex transform overhead
Why that? Use indices, problem solved. That's what post-transform vertex caches are for ...
Yeah, I corrected my post :)

Adruab
06-10-2004, 07:59 AM
There's a hack for triangle primitives if you want to jump triangles. Say you had:

123 324 789 985

you could string those together by doing the following:

1234477895

which would expand to:

123 324 447 477 789 985

Many (most... at least on DX) graphics cards will recognize this and automatically not render the degenerates. True, it doesn't work for fans, but who uses those anyway (hooray, fan users, you have a useful extension) :p . It does require a few more indices, but overall it would still be just about as fast as the restart version (which needs one fewer index) and only slightly less generic. Given that option, this seems like a problem that doesn't really need to be solved (one fewer index... come on...).
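That expansion can be checked mechanically. Decoding the strip 1 2 3 4 4 7 7 8 9 5 yields eight triangles, four of them degenerate, leaving exactly the four real triangles 123, 324, 789 and 985. A small counting helper (purely illustrative, my own naming):

```c
#include <stddef.h>

/* Decode a triangle strip and count how many of its triangles are
 * degenerate (two indices equal), as produced by the stitching trick
 * above. Each consecutive index triple forms one strip triangle. */
size_t count_degenerates(const unsigned *idx, size_t n, size_t *total)
{
    size_t deg = 0;
    *total = (n >= 3) ? n - 2 : 0;
    for (size_t k = 0; k + 2 < n; ++k) {
        unsigned a = idx[k], b = idx[k + 1], c = idx[k + 2];
        if (a == b || b == c || a == c)
            ++deg;  /* zero-area triangle, culled by the GPU */
    }
    return deg;
}
```

The restart-index version of the same batch would be 1 2 3 4 R 7 8 9 5, which is indeed one index fewer and carries no degenerates at all.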

I'm all for reducing state change overhead though, that seems to be a big rough point for even some of the most optimized engines.

plasmonster
06-10-2004, 02:35 PM
1: Function call overhead on glDraw*. In our case, we have plenty of CPU time, so this is negligible.

2: Driver marshalling of GPU commands. The boneheaded way of implementing glDraw* is to immediately put the commands into the GPU's FIFO, which could require a switch to Ring0 on the CPU (a slow operation). Few GL drivers do it this way. Drivers marshal GPU commands pretty efficiently these days.
Well, I don't have plenty of CPU time. As for 2, if I have a boneheaded implementation, I'd be better off buying a new card, as any extension supposedly designed to increase performance would likely be useless anyway. I didn't quite understand point 3.


I would dearly love to fan my terrain, but I can't due to the performance impact.
Why the fans?


True it doesn't work for fans, but who uses those anyways
I use them. It's a common method for rendering n-sided polygons. With this extension, I can render a multitude of polygons (strips or fans), all sorted by shader, in one call. Moreover, I can do this quickly and easily on the fly, as it only requires inserting a single index when building the final list, making it painfully easy to jump around my database. In any case, this will still enable you to batch polygons more efficiently.

Even if only marginally better, worst case, it's still better. As I said earlier, I enjoy using this for the convenience, any performance boost is icing on the cake.

Korval
06-10-2004, 03:42 PM
As for 2, if I have a boneheaded implementation, I'd be better off buying a new card, as any extension supposedly designed to increase performance would likely be useless anyway.
Driver updates do more than just add extensions. The bonehead implementation can be fixed, for example. And, for OpenGL, I wouldn't worry too much about it, since both ATi and nVidia seem to be pretty much on the ball with their triangle rendering code.


Why the fans?
Why would I want to fan terrain? Because heightmapped, hex-based terrain is naturally fan-like. Terrain looks better as a sequence of fans than as a sequence of tri-strips. It tends to light better per-vertex, too.


It's a common method for rendering n-sided polygons.
How often do you have an n-sided polygon? Or, perhaps more specifically, why are your artists creating environments that are not inherently strip-worthy? They are the ones responsible for the state of your terrain, after all.

plasmonster
06-10-2004, 07:16 PM
Driver updates do more than just add extensions. The bonehead implementation can be fixed, for example. And, for OpenGL, I wouldn't worry too much about it, since both ATi and nVidia seem to be pretty much on the ball with their triangle rendering code.
The point is that I shouldn't have to worry about something I have no control over anyway. How the driver deals with things internally is its business. I have to trust it to do a good job; if it doesn't, there's nothing I can do about it. If, on the other hand, the vendor exposes an extension, presumably for the express purpose of making things faster and/or more convenient, then I have to assume it's a good thing, as the vendor knows infinitely more about the implementation details than I do.


Because heightmapped, hex-based terrain is naturally fan-like. Terrain looks better as a sequence of fans than as a sequence of tri-strips. It tends to light better per-vertex, too.
Ah, I both see and agree. Curious: do you mean to say that you use a hexagonal grid? The reason I ask is that some LOD algorithms produce fan-like strippings just by virtue of the winding order. I know, for example, that ROAM flavors use fans heavily, while some of the newer SOAR implementations rely exclusively on strips, and end up with very nice fan-like windings.


How often do you have an n-sided polygon? Or, perhaps more specifically, why are your artists creating environments that are not inherently strip-worthy?
I have a fairly substantial BSP model in addition to the terrain. The editor is set up to allow an artist to ham-fistedly add brushes to the world, and apply an arbitrary shader to any side. During preprocessing, I can do quite a bit to optimize; but I can't strip across different shaders, as much as I'd like to.

Adruab
06-11-2004, 10:06 AM
Yes, fanning would be nice for rendering convex polygons. That could help for accelerated 2D rendering.... How often do you use this brush technique on terrain? If it is used completely overboard, then yes, I can see why this would be really helpful. However, if you do have a lot of edge coherency between the same shaders, it would probably still be more optimal to get an ideal triangulation of the polygons and do some surface analysis, using stripping and/or just indexed triangles (to get better vertex caching and the like).

Now for the editor I could see it working better with fans, but in an actual game engine.... Is that what you're using it for? What type of preprocessing are you doing?

Plus, with height maps you can still use fan-shaped triangulations and strip them, right? (or even just use normal indexed triangles...)

Obli
06-11-2004, 11:01 AM
Ok, I see it's often somewhat faster.
I will leave it out for now; it does not look particularly promising unless it gets widespread (and the collective wisdom reaches an agreement on it). I have no time to put it in now just to get a minor speedup that only some people could get (that doesn't change even if it's +40%).

Thank you.

plasmonster
06-11-2004, 02:40 PM
Adruab, imagine placing a Quake level on top of the terrain, or underneath it (interior only, like in Morrowind). I briefly considered a BSP terrain, but couldn't get the scale and detail I wanted.