Space-filling curve primitive

OpenGL currently has two primitives for drawing multiple connected triangles: TRIANGLE_STRIP and TRIANGLE_FAN. Unfortunately, these prims are not always the most efficient. The common case of a subdivided mesh, a terrain created from a heightfield for instance, generates many strips or fans.

A Hilbert curve could draw such a patch as a single prim. Hilbert curves are well understood, can be generated with simple state machines, and could potentially be very efficient with real-world high-polygon data.
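
For the curious, the distance-to-cell mapping really is just a small state machine. A minimal sketch in C, using the standard rotate-and-flip algorithm for a 2^k x 2^k grid (nothing specific to this proposal):

    /* Map distance d along a Hilbert curve covering an n x n grid (n a
       power of two) to (x, y) cell coordinates. One loop step per level. */
    void hilbert_d2xy(int n, int d, int *x, int *y)
    {
        int t = d;
        *x = *y = 0;
        for (int s = 1; s < n; s *= 2) {
            int rx = 1 & (t / 2);
            int ry = 1 & (t ^ rx);
            if (ry == 0) {                        /* rotate the quadrant */
                if (rx == 1) {
                    *x = s - 1 - *x;
                    *y = s - 1 - *y;
                }
                int tmp = *x; *x = *y; *y = tmp;  /* swap x and y */
            }
            *x += s * rx;                         /* step into the quadrant */
            *y += s * ry;
            t /= 4;
        }
    }

Walking d from 0 to n*n - 1 visits every cell exactly once, and consecutive cells are always adjacent, which is where the locality argument comes from.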

(NV_primitive_restart also addresses this, but that's not as fun, is it?)

It takes 3 vertices to specify a triangle. The good thing about strips (in terms of a hardware implementation) is that any 3 consecutive verts represent a triangle. In a fan, the first vert plus any 2 consecutive verts after it is a triangle.
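
Put another way, decoding an index stream into triangles is a fixed one-liner per primitive. A rough sketch, with emit() standing in as a hypothetical triangle-setup hook:

    extern void emit(unsigned short a, unsigned short b, unsigned short c);

    void decode_strip(const unsigned short *idx, int count)
    {
        for (int i = 0; i + 2 < count; ++i)        /* any 3 consecutive verts */
            emit(idx[i], idx[i + 1], idx[i + 2]);  /* (winding flips on odd i) */
    }

    void decode_fan(const unsigned short *idx, int count)
    {
        for (int i = 1; i + 1 < count; ++i)        /* hub vert plus next 2 */
            emit(idx[0], idx[i], idx[i + 1]);
    }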

In the cases you have presented, there is no clear way to define a triangle. Take the 2nd-order Hilbert, for example. Look at indices 2-7. There are 4 triangles specified by these indices. Filling in triangle 7, 2, 3 requires keeping the vertex data for 2 and 3 around for quite some time. That means that 4, 5, and 6 have to be sent while still keeping 2 and 3 around. Not only that, 4 has to be stored until you reach 7 as well, due to the triangle 4, 6, 7 (or 4, 5, 7, either way).

Your second method is even worse in this respect. The amount of storage needed grows significantly as the length of the geometry sequence increases.

As it stands right now, triangle setup is a very simple process. In fact, it is one of the fastest processes in modern graphics cards. It takes up very little in the way of resources and very little time. To implement what you are suggesting would require sacrificing both time (it is slower to work through one of these algorithms than through strips/fans) and resources (more silicon for the added memory storage). As it stands, triangle setup is fast enough to be negligible. Adding this facility could very well make it slower than simple (or more complex multi-pipelined) vertex programs, thus imposing a cap on maximum vertex throughput. In short, your attempt to make geometry transfer faster may well drop overall vertex throughput.

BTW, you can get a lot of mileage out of degenerate strips with vertex arrays.
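
For anyone who hasn't seen the trick: stitching strips is just index repetition. A small made-up example:

    /* Strips {0,1,2,3} and {4,5,6,7} stitched into one stream: repeating
       3 and 4 produces four zero-area triangles, (2,3,3), (3,3,4),
       (3,4,4), (4,4,5), which setup rejects, so one call draws both. */
    static const unsigned short stitched[] = { 0, 1, 2, 3,  3, 4,  4, 5, 6, 7 };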

I’m not suggesting keeping a deep history. Imagine quad strips. You have the 4 vertices from the previous quad. You can then define a new quad using two additional vertices and two from the previous quad. Whereas a quad strip is defined from the last 4 vertices, a quad curve is defined from 4 of the last 6 vertices.
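
The post doesn't pin down an encoding, but one way to picture it is a decoder with a six-deep vertex window plus a per-quad tag that says which old pair to reuse. Everything below (the tag values, the slot table, emit_quad and its winding) is illustrative guesswork, not a spec:

    extern void emit_quad(unsigned short a, unsigned short b,
                          unsigned short c, unsigned short d);  /* hypothetical */

    enum turn { STRAIGHT, TURN_L, TURN_R };        /* hypothetical per-quad tags */

    /* Which two slots of the six-deep history each tag reuses (slot 5 is
       newest). STRAIGHT matches quad-strip behaviour; the turn tags reach
       one or two quads further back, which is what lets the curve wind. */
    static const int reuse[3][2] = { { 4, 5 }, { 2, 3 }, { 0, 1 } };

    void decode_quad_curve(const unsigned short *idx,  /* 4 verts, then 2 per quad */
                           const unsigned char *tags, int nquads)
    {
        unsigned short hist[6];
        for (int i = 0; i < 6; ++i)                /* seed; slots 0-1 are padding */
            hist[i] = idx[i < 2 ? 0 : i - 2];
        emit_quad(idx[0], idx[1], idx[3], idx[2]); /* first quad as given */
        idx += 4;
        for (int q = 1; q < nquads; ++q, idx += 2) {
            const int *r = reuse[tags[q]];
            emit_quad(hist[r[0]], hist[r[1]], idx[0], idx[1]);
            for (int i = 0; i < 4; ++i)            /* slide the window */
                hist[i] = hist[i + 2];
            hist[4] = idx[0];
            hist[5] = idx[1];
        }
    }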

I'm not suggesting this to give greater reuse of vertices, but rather reduced function call overhead, as I've found degenerate strips to be significantly slower than lists.


Oh, just in case it's not clear, I'm only referring to the Hilbert curve. The Z-curve just happens to also be in the image I found. Ignore it.

I'd be curious to know how you would like to change the OpenGL spec.

The concept is good, but I'm thinking of an implementation where vertices would be dubbed, which is worse than defining multiple triangle strips. That's why I'd like to see how you would implement it.

And, IMHO, Hilbert is pretty much limited to quad-based square meshes.

I'm not sure what you mean by vertices being "dubbed."

Do you mean transformed more than once? This already happens when using triangle strips to draw an N x N patch as each strip overlaps one row of vertices from the previous strip.

Or do you mean the implementation you're thinking of would require having duplicate vertices within a vertex array?


Do you mean transformed more than once?

Yes, that's it.
I know that they're transformed more than once with current triangle/quad strips.

What I would expect from Hilbert curves is rendering the mesh faster. IMO the only way to accelerate current triangle/quad strips is to optimize the pipeline. And if Hilbert does duplicate vertices, then it's not faster than triangle/quad strips. And because triangle/quad strips are much easier to reason about (honestly), I would stick to my good old strips if Hilbert is not faster.

It's all about performance. If Hilbert is not faster, then why would you use it?

A curve should have better vertex locality than strips and be capable of drawing more triangles per function call.

They'll be able to reuse the vertices of the previous triangle/quad just like a strip. In addition, vertices from earlier quads/triangles will often be available as the curve winds, because their tighter locality results in vertex cache hits a strip wouldn't get.

Basically they should help get rid of the additional function call overhead and effective cache invalidation at the start of each new strip.

Some quick calculations.

We want to draw a patch of 8 x 8 quads. We assume our video card architecture has an effective vertex cache depth of 10 (as described in NVidia's GeForce performance documentation).

Strips: We draw 8 quad strips of 8 quads each. We reverse the drawing order (left-right, right-left) in alternate rows for better vertex cache use.

We end up sending 18 (8 * 2 + 2) vertices per strip to the card, 144 (8 * 18) vertices altogether.

For every strip past the first one, we get 4 vertex cache hits from reversing the direction of the rows. 28 ((8-1)*4) cache hits altogether.

So our overhead is 144 indices, 116 (144-28) vertex transformations, and 8 function calls.

Curve: We draw one curve of 64 quads.

We end up sending 130 (64*2+2) vertices to the card.

A quick scribble with some graph paper shows we get about 7 cache hits in every 4x4 block of quads. 28 (7*4) cache hits altogether (same as strips, didn’t expect that).

So our overhead is 130 indices, 102 (130-28) vertex transformations, and 1 function call.
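
If anyone wants to double-check these counts, a FIFO cache of the stated depth is easy to simulate. A rough model only (real hardware may behave differently):

    #include <string.h>

    /* Count transforms (cache misses) for an index stream run against a
       FIFO post-T&L cache of the given depth (depth <= 64 assumed). */
    static int transforms(const unsigned short *idx, int count, int depth)
    {
        unsigned short fifo[64];
        int used = 0, misses = 0;
        for (int i = 0; i < count; ++i) {
            int hit = 0;
            for (int j = 0; j < used; ++j)
                if (fifo[j] == idx[i]) { hit = 1; break; }
            if (hit)
                continue;
            ++misses;                              /* a miss costs a transform */
            if (used == depth)                     /* full: evict the oldest */
                memmove(fifo, fifo + 1, (size_t)(used - 1) * sizeof fifo[0]);
            else
                ++used;
            fifo[used - 1] = idx[i];
        }
        return misses;
    }

Feeding it the 144-index strip stream and the 130-index curve stream should land close to the 116 and 102 figures above, depending on exactly how the row reversal is indexed.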

If you’re really looking for optimal performance out of strips with a post-T&L cache, you’re doing it wrong.

BTW, your numbers seem to be for an 8x8 array of vertices. If you have 8x8 quads, you actually have a 9x9 array of vertices. So, my numbers will use a 9x9 array, rather than an 8x8 array like yours does. Even with this increase in vertex count, you will see that this is faster in number of transforms.

Given a cache depth of 10 (GeForce 3s have something like 16-20), you do this: break the model into two columns, 5 verts wide each (sharing the middle column of verts).

Send the following in order:

  1. A degenerate strip containing the first 5 verts in the top row (5 indices). This is called seeding.
  2. Degenerately add (2 indices) a strip containing the top 5 verts and the next 5 (10 indices).
  3. Repeat step 2 another 7 times, for 8 row strips in all (12 * 8 = 96 indices).
  4. Degenerately (2 indices) seed the next column (5 verts), but starting at the bottom, not the top.
  5. Repeat steps 2-3 going up the second column (12 * 8 = 96 indices).

Now, to compute the number of transforms. Step 1 causes 5 transforms. Steps 2/3 cause only 5 * 8, or 40 transforms (degenerate tris don't count towards or against). Step 4 causes 4 transforms (one vert is in the cache already). Step 5 causes 5 * 8, or 40 transforms. Total: 89 transforms. The minimum number of transforms is 81 (9 * 9), so we're running pretty close to optimal here. That's a 10%+ improvement over your "curve" method, and it's rendering more triangles.

Since this is just one gigantic degenerate strip, it is one function call. However, if we wanted to lose the degenerate strips, we could use glMultiDrawArraysEXT instead, where we draw several arrays in one shot. This is a pretty new extension, however.

Assuming degeneracy, this costs 206 indices. Granted, that’s more than either method given here, but it’s probably faster to do it this way. Without degeneracies (still seeding), it costs 170 indices.
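
Concretely, building that stitched index stream is a short loop. Here is a sketch for the 9x9 grid following the steps above (winding parity is ignored for brevity, as in the post); by my count it comes to 204 indices with the degenerates, in the same ballpark as the 206 above:

    #include <stddef.h>

    #define W 9   /* 9x9 vertices = 8x8 quads */

    static unsigned short v(int row, int col)
    {
        return (unsigned short)(row * W + col);
    }

    /* Degenerate join: repeat the last emitted index, then the first index
       of the next piece. The zero-area triangles this creates are free. */
    static size_t join(unsigned short *out, size_t n, unsigned short next)
    {
        out[n]     = out[n - 1];
        out[n + 1] = next;
        return n + 2;
    }

    /* One 5-vert-wide column: seed a full row into the cache, then add one
       10-index row strip per remaining row. 'down' picks the direction. */
    static size_t column(unsigned short *out, size_t n, int col0, int down)
    {
        int first = down ? 0 : W - 1;
        int step  = down ? 1 : -1;
        if (n > 0)
            n = join(out, n, v(first, col0));      /* step 4's 2 indices */
        for (int c = 0; c < 5; ++c)                /* steps 1/4: seeding */
            out[n++] = v(first, col0 + c);
        for (int r = first + step; r >= 0 && r < W; r += step) {
            n = join(out, n, v(r - step, col0));   /* 2 degenerate indices */
            for (int c = 0; c < 5; ++c) {          /* steps 2/3/5: row strip */
                out[n++] = v(r - step, col0 + c);
                out[n++] = v(r,        col0 + c);
            }
        }
        return n;
    }

    /* Usage: one stitched strip, one call.
       unsigned short idx[256];
       size_t n = 0;
       n = column(idx, n, 0, 1);   left column, top to bottom (101 indices)
       n = column(idx, n, 4, 0);   right column, bottom up (103 more, 204 total)
       glDrawElements(GL_TRIANGLE_STRIP, (GLsizei)n, GL_UNSIGNED_SHORT, idx); */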

With nVidia's restarting primitive extension, the driver doesn't even really have to be involved (like it probably does with glMultiDrawArraysEXT, copying to multiple buffers and all). You can even put them in VAR memory with the new VAR extensions.
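
For completeness, the restart path looks something like this (a fragment, assuming an NV_primitive_restart-capable context; the entry points would normally be fetched with wglGetProcAddress, and 'count'/'indices' are set up elsewhere):

    /* One index buffer, many strips, no degenerates: declare 0xFFFF the
       restart value, and every occurrence of it starts a fresh strip. */
    glEnableClientState(GL_PRIMITIVE_RESTART_NV);
    glPrimitiveRestartIndexNV(0xFFFF);
    glDrawElements(GL_TRIANGLE_STRIP, count, GL_UNSIGNED_SHORT, indices);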

BTW:

We reverse the drawing order (left-right, right-left) in alternate rows

That’s a state change (assuming you’re talking about changing the cull mode). Therefore, it’s likely that this will kill any performance this method would gain.

BTW, your numbers seem to be for an 8x8 array of vertices

No, they're for an array of 9x9 vertices. That's why each strip has 18 (2 * 9) vertices.

Given a cache depth of 10, (GeForce 3’s have something like 16-20), you do this. Break the model into two columns, 5 verts each (sharing the middle column of verts).

This is a good idea. I noticed when doing the math for my post above that smaller strips could save the redundant transform of each row, but I thought it would increase the function call count enough that any gains would be small. I hadn't pondered degenerate triangles.

With nVidia’s restarting primitive extension, the driver doesn’t even really have to be involved

Like I said in my first post, not as fun.

That’s a state change (assuming you’re talking about changing the cull mode).

I'm not talking about changing the cull mode. Assuming clockwise winding, alternate bottom, top, right for the start of one row and then top, bottom, left for the start of the next row.

Thanks for your response. In my own testing I've found stitching strips together with degenerate triangles to be slower than lists, and haven't really considered using them because of that. That was with more arbitrary model data, however. I'll do some testing on regular meshes.

Thanks for your response. In my own testing I've found stitching strips together with degenerate triangles to be slower than lists, and haven't really considered using them because of that.

One note: my entire post was predicated on using VAR. If you’re not using VAR, you aren’t using the post-T&L cache (according to nVidia at one time), so most of the techniques don’t matter.

The vertex cache works with compiled vertex arrays and DrawRangeElements in addition to VAR.
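
So something along these lines should still be cache-friendly without VAR. A sketch reusing idx, n, and W from the grid example earlier (glLockArraysEXT is from EXT_compiled_vertex_array):

    /* Locked arrays plus a ranged indexed draw: the lock and the min/max
       index hints give the driver enough information to transform the
       range once and keep the post-T&L cache effective. */
    glLockArraysEXT(0, W * W);
    glDrawRangeElements(GL_TRIANGLE_STRIP, 0, W * W - 1,
                        (GLsizei)n, GL_UNSIGNED_SHORT, idx);
    glUnlockArraysEXT();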