PDA

View Full Version : tri strips/fans hardware side: reset, degenerates, and anything else



michagl
02-18-2005, 03:20 PM
it looks like i need to build a triangle stripper... for a couple years i've been using nvidia's utility, but it has a tendancy to produce swiss cheese meshes (which i would consider a bug). however at this point i also have tighter constraints... the stripper should be able to flip edges inside quads and maybe integrate fans.

so i have a few questions... if someone could just point me to documentation that would suffice my curiosity but discussion is welcomed as well.

basicly i'm curious about triangle strips on the hardware side. i'm curious about the rules surrounding degenerate triangles... and i get the impression that there is a way to embed a command to the processor to reset the stripping process inside the strip indices.

i'm also curious if there is a way to somehow get the processor to switch from strip mode to fan mode without resetting the cache or changing out indices... maybe by sending a -1 index or something.

i'm also curious about any information falling within this category... any hardware tool i can use basicly.

michael

sqrt[-1]
02-18-2005, 03:43 PM
Perhaps you could modify the soure of these stippers:
http://www.plunk.org/~grantham/public/meshifier/oldmesh.html

http://users.pandora.be/tfautre/softdev/tristripper/

dorbie
02-18-2005, 04:04 PM
AFAIK, one of the interesting features of nvidia's tri stripper is the consideration of cache coherency w.r.t. the reuse of indexes in the mesh. Bearing that in mind you might want to check out the following links:

http://plunk.org/~grantham/public/actc/

http://www.cs.sunysb.edu/~stripe/

dorbie
02-18-2005, 04:07 PM
I'd strongly suggest avoiding the old striper by Grantham and look at his newer ACTC.

michagl
02-18-2005, 04:43 PM
yeah, hardware cacheing is the only thing that really scares me away from this project... i really don't want to have know the hardware to the last detail, assuming i even find the energy to take things so far.

i will probably start with the nvidia stripper source as i figure it has a good bias towards the hardware.

anyone know anything about the possibility of changing primitive mode midstream? that would be a nice feature wouldn't it?

the meshes i have to work with are relatively small and specialized. automated actually... no way to manage them by hand. their structure is also generally lowsey for stripping, hence the desire to flip edges. i worry that the nvidia stripper might be having a hard time with them, though there aren't any holes like i often get with large meshes. the meshes look more like fan meshes, so i worry that they may require a lot of short strips and or degenerate triangles.

i would like to be able to find the penultimate optimal stripping for each mesh.

PkK
02-19-2005, 11:40 PM
There are extensions for changing the primitive mode:

GL_NV_primitive_restart,
GL_SUN_triangle_list.

michagl
02-21-2005, 11:06 AM
i have a few questions about the nvidia primitive restart externsion. here is an excerpt from the specs:

*What should the default primitive restart index be?

RESOLVED: Zero. It's tough to pick another number that is meaningful for all three element data types. In practice, apps are likely to set it to 0xFFFF or 0xFFFFFFFF.


i have had some issues with the nvidia triangle stripper utility. i was under the impression that it was be default using primitive restart in its output strip(s), passed in a single buffer.

this excerpt seems to either imply that there is no default restart index, or that the index is by default 'zero', or 0.

i wonder if i need to enable the restart, and if the stripper is embedding zeros then that might explain why my meshes sometimes come out with holes, though 95% of the time they are fine, and when they do have holes there are no other sort of anomalies like bogus triangles.

however i did try the latest library which actually produced worse meshes with some of the models that had produced meshes with holes in the 2001 library. it did have a bogus triangle, which might have been going back to vertex zero. however lately i've been using the newest version of the stripper extensively without any apparent issues.

so really i'm a bit confused about it all... i would look at the stripper source and try some expiriments, but my development machine has been running the last couple of days building a data base and i'd rather just get that over with first. takes about an hour to stop and pick back up anyhow.

so anyhow, if anyone uses the nvidia stripper along with primitive restart and would like to clarify some of this for me with a little bit of code, i would be very grateful.

and for anyone at nvidia, wouldn't it be useful for a primitive restart that actually changes the primitive mode. being able to change between strips and fans on the fly would be useful... would be more useful if the hardware could change modes without missing a step... like going from the end of a strip to a fan where the last two strip vertices are the first two of the fan and then back from the fan to the triangle.

sincerely,

michael

Adrian
02-21-2005, 11:52 AM
Primitive restart is disabled by default. The default primitive restart index is zero. You do not have to enable primitive restart, it should work fine with it disabled.

In my app I do this
glEnableClientState(GL_PRIMITIVE_RESTART_NV);
glPrimitiveRestartIndexNV(65535);

and then

SetCacheSize(24);
SetStitchStrips(1);
EnableRestart(65535);
GenerateStrips(pIndex, NumIndices, &pPrimitiveGroup, &NbGroups);

michagl
02-21-2005, 12:24 PM
so teh generatestrip routine actually queries the restart states...

thanks a lot, good to have a solid answer. assuming you are 100% sure your methods are correct.

btw, anyone want to offer a run down on the cache sizes for recent cards... with my own card right now i've been going with 24 which is recommended for geforce3 i think... but the card these days is a quadrofx series, though one of the low end more experimental models... still i'm curious if 24 is still the optimal cache size for modern cards.

michael

Adrian
02-21-2005, 01:16 PM
No, I think you have misunderstood me.

If you want to use primitive restart you must call EnableRestart.

If you dont want to use primitive restart you dont have to call DisableRestart since the default state is disabled.

generatestrip does not query the restart states.

michagl
02-21-2005, 01:39 PM
"Primitive restart is disabled by default. The default primitive restart index is zero. You do not have to enable primitive restart, it should work fine with it disabled. "

i don't have it in front of me, but in a header with the nvidia c++ source i believe it says that primitive restart is used by default. i will have to take a look at the file, i believe this would mean that it would embed restart indices inside the strip array. do if it doesn't query teh state then how does it know what index to use for your restart? does it assume 65535? i seem to recall that generate strips does not have overloads for non unsigned 16bit types. this really isn't a problem as i'm sure it is best used as a preprocessing tool, but how then do you communicate your restart index to it, or does it just assume 65535 which you can change later if needed?

sorry for the trouble btw, i should probably just wait until my development machine is handy to continue any possible further discussion.

just for the record, i might look into those third party stripper efforts in a bit.

michael

Adrian
02-21-2005, 02:31 PM
Originally posted by michagl:
i don't have it in front of me, but in a header with the nvidia c++ source i believe it says that primitive restart is used by default.

if it doesn't query teh state then how does it know what index to use for your restart? does it assume 65535?From the header...

////////////////////////////////////////////////////////////////////////////////////////
// EnableRestart()
//
// For GPUs that support primitive restart, this sets a value as the restart index
//
// Restart is meaningless if strips are not being stitched together, so enabling restart
// makes NvTriStrip forcing stitching. So, you'll get back one strip.
//
// Default value: disabled
//
void EnableRestart(const unsigned int restartVal);

michagl
02-21-2005, 04:19 PM
yeah, i feel like a total ass now, seeing that line in your original post:

EnableRestart(65535);

i don't know how i missed it. not accustomed to reading your coding style maybe... i still swear that i recall reading, probably in that header, that restart was the default. maybe it was stitching is default, and i originally jumped to the conclusion that restart would facilitate the stitching, not understanding at the time exactly what restart is technicly... or at least not having a clue that there was an API for it or anything of the sort... i assumed it was a new built in feature or something.

anyhow, i'm real sorry about not noticing that. in the future hopefully this incident will serve as a reminder to pay more attention.

sincerely,

michael

michagl
02-21-2005, 04:32 PM
so i have NvTriStrip.h in front of me, and i'm just curious about a couple more things if you or anyone is game.

how do you get back strips that are not stitched together and don't utilize restart?

also curious about the last function RemapIndices... its not one hundred percent clear, but as i imagine this function takes a strip and reformats it according to specifications. that should come in handy.

michagl
02-26-2005, 10:23 AM
Originally posted by michagl:
so i have NvTriStrip.h in front of me, and i'm just curious about a couple more things if you or anyone is game.

how do you get back strips that are not stitched together and don't utilize restart?

also curious about the last function RemapIndices... its not one hundred percent clear, but as i imagine this function takes a strip and reformats it according to specifications. that should come in handy.so i've gotten around to looking at the RemapIndices function... and to answer my own question i believe that if strip stitching was disabled the 'numGroups' would be set to greater than 1.

i realize i must appear to be either stupid or at best short sighted at this point...

but just to confound matters. it was my thinking that RemapIndices would basicly take an input strip and give back a new strip according to the present options.

but after deeper investigation this definately does not appear to be the case. it appears that some how the function returns a remapping of your vertex buffer rather than your indices... then it is at that point somehow your responsibility to remap your vertex buffer for 'improved spacial locality'...

if someone would like to explain this notion of spacial locality to me i would apreciate it. i guess i would presume that it would be more reasonable to improve the spacial locality of the indices rather than the actual vertices... does hardware for some reason like vertices to ajacent to one another or even in the same general region? i can understand that for non index based rendering a pointer might be more easilly incrimented along a vertex buffer, but i assumed that on hardware, assuming vertices are generally within the same segment, that there would be no optimization benefit to spacial locality within the vertex buffer. is there some kind of block cacheing that goes on with vertices on hardware?

any info is welcomed. i'm actually considering starting a new hardware performance thread due to drastic performance anomolies i've observed lately. not sure if it would just be better to address in this thread or not.

sincerely,

michael

AlexN
02-26-2005, 05:54 PM
I think remap indices is there to allow you to find out which vertex from the original set you passed into the tristripper corresponds to which vertex in the output. NVTriStrip does all the reorganizing of the vertices itself; however if you have extra per-vertex information other than position, texcoords, and normals, then you would need this to carry this extra information over from your original array to the new array (since the tristripper does not move the extra information when it is reorganizing the vertices).

Adrian
02-28-2005, 09:20 AM
I think that RemapIndices is an optional additional optimisation. It is quite seperate from the generatestrips function.

The way I think it works is you pass in your indices and it reorders them to be more optimal. You must then manually reorder your vertex data to corrrespond to the new indice list.

So say your indice list was 1,3,2 and remapindices returns them as 1,2,3 you would have to swap you second and third vertex in the vertex buffer.

michagl
02-28-2005, 11:08 AM
Originally posted by Adrian:
I think that RemapIndices is an optional additional optimisation. It is quite seperate from the generatestrips function.

The way I think it works is you pass in your indices and it reorders them to be more optimal. You must then manually reorder your vertex data to corrrespond to the new indice list.

So say your indice list was 1,3,2 and remapindices returns them as 1,2,3 you would have to swap you second and third vertex in the vertex buffer.thats how i read it as well i think... but i just can't see the usefullness of such an optimization... but then i don't pretend to understand the hardware or drivers intimately as well.

i plan to discuss this stuff and some other stuff further... but i have about 200 thousand small meshes which need to be restripped, and i've fouled it up 2 times but expect the 3rd to be the charm.

the process takes about 24 hours... the first time i woke up bleary eyed, broke out of the program to check the progress... then looking at the code in the debugger, i saw what at first glance appeared to be a very serious memory bug in the code... but it wasn't on closer examination... but i had exited by that time for some reason... moral of the story, don't fool with code straight out of bed in the morning.

second time, i had stuck in a last minute change in the disk writing code, where i copied and edited a similar line which included an &ddress operator i forgot to remove... so that costed me that run.

this run aught to go over well though... should be finished by tomorrow morning.

then maybe i will discuss a new and very promising arbitrary geometry ROAM aproach which has matured very well as of late and i'm very excited about. i believe it may aproach theoretical limits very closely, and just might be the dawning of a new 'era' of massive infinitely scalable geometric data processing. i will be mostly interested in discussing very fine level hardware concerns.

GPSnoopy
03-01-2005, 12:07 AM
Originally posted by michagl:
thats how i read it as well i think... but i just can't see the usefullness of such an optimization...
[...]
the process takes about 24 hours...
[...]
The remapping of the indices is to render the access of vertex data more sequential; thus improving memory access.

One of the reasons why I wrote Tri Stripper was because NvTriStrip is so slooooowww. :D
AFAIK NvTriStrip has some algorithms that have a complexity of O(n^2).

michagl
03-01-2005, 07:13 AM
Originally posted by GPSnoopy:
The remapping of the indices is to render the access of vertex data more sequential; thus improving memory access.

One of the reasons why I wrote Tri Stripper was because NvTriStrip is so slooooowww. :D
AFAIK NvTriStrip has some algorithms that have a complexity of O(n^2).i would be grateful if you could try to explain that first line in detail, or maybe point me to technical docs.

as for your 'Tri Stripper', i'm not familiar with it. i figure the nvidia tristripper is so slow because it starts with no connectivity information. i've thought about hacking to allow that info to be uploaded, but for now i'd rather just crack my whip at a one of my poor machines. i feel particularly bad about it because i've made the poor thing do it three times now for no good reason.

if your TriStripper produces the same or better hardware optimizations than nvidias, then i would be happy to borrow it. i will eventually have to build a little standard io utility to do this once i get around to hosting it publicly. i plan to hack the triangle stripper to speed it up as much as possible for my particular situation, but not sacrifice any hardware optimizations.

i'm pretty happy with the nvidia stripper... but after i get around to speeding it up some, i might consider trying to hack it so that it can flip quads to get better strips.

edit: oh i see that your tristripper is one of the recommended strippers... the one with the big gaudy drop shadowed red lettered framed internet prescence if i recall correctly.

do you care to discuss it? its cache coherency capabilities... can it do fans? if so can it determine whether a mesh would be better stripped as fans or strips from the perspective of the cache?

GPSnoopy
03-04-2005, 06:36 AM
The remapping of the indices is for reorganizing the vertex arrays (and not the indices themselves). The goal is to improve the memory accesses by improving the cache coherency (but do not mistake this for the post-T&L cache, which is something completely different).
For example, a vertex array that is accessed sequentially is faster than one that is indexed randomly.
1, 2, 3, 2, 3, 4, 3, 4, 5, etc. will be faster than 1, 456456, 2547, 84125, etc.

IMHO, one good way to reorganize the vertex array is to take the indices one by one and to reconstruct the vertex array in the same order (it's bond to work fine with triangle strips generated by NvTriStrip or Tri Stripper, because of the post-T&L cache algorithms used there).

The triangle stripifier with the red lettered logo is Stripe, not Tri Stripper. ;)

On Tri Stripper webpage you can find a comparison with NvTripStrip, although a bit outdated now 'cause Tri Stripper has improved since them (and probably NvTriStrip too).

The main advantage that NvTriStrip has is that it outputs better results, but the difference compared to Tri Stripper is minimal. Although I had people telling me Tri Stripper was giving better results for them, but I could not benchmark this myself.

Tri Strippers is close to NvTriStrip in the way that it takes a triangle list and only outputs triangle strips. It can take into consideration the presence of a T&L cache, but you can also disable that (which is an advantage over NvTriStrip).
Tri Strippers is also more robust than NvTriStrip, it can handle not-so-common geometry (e.g. a triangle that has more than 3 neighbours). But it's also much faster (1 sec vs minutes for tens of thousands of triangles).

NvTriStrip is not slow cause of the lack of connectivity information (Tri Stripper also has to compute it), but IIRC it has some algorithms that have a complexity of O(n^2).

I suggest you look at Tri Stripper webpage for more info on the algorithms that are used; again it's a bit outdated and my English is not so bright but it'll give you a good overview.

michagl
03-05-2005, 05:31 AM
Originally posted by GPSnoopy:
The remapping of the indices is for reorganizing the vertex arrays (and not the indices themselves). The goal is to improve the memory accesses by improving the cache coherency (but do not mistake this for the post-T&L cache, which is something completely different).
For example, a vertex array that is accessed sequentially is faster than one that is indexed randomly.
1, 2, 3, 2, 3, 4, 3, 4, 5, etc. will be faster than 1, 456456, 2547, 84125, etc.
i definately would never have guessed that. it seems odd to me that the hardware would be organized that way. can you or anyone make an architecture argument for this?



The triangle stripifier with the red lettered logo is Stripe, not Tri Stripper. ;)
hmmm, maybe i overlooked yours then. i promise to give it a look as soon as i pass this post on. i think i looked at two, one 'atc' or something, and probably stripe.



The main advantage that NvTriStrip has is that it outputs better results, but the difference compared to Tri Stripper is minimal. Although I had people telling me Tri Stripper was giving better results for them, but I could not benchmark this myself.
i'm actually working on an interesting project right now that might make for a good benchmark. i'm stripping about 200,000 small very similar meshes. you can find a recent visual at:

http://arcadia.angeltowns.com//share//genesis-mosaics-lores.jpg

assuming that url is correct. to me, the resulting strips don't look to be terribly cache friendly, especially given the simplicity of the meshes. looks like there is a lot of room for improvement to be had. those strips were done with nvidia's utility library btw.



NvTriStrip is not slow cause of the lack of connectivity information (Tri Stripper also has to compute it), but IIRC it has some algorithms that have a complexity of O(n^2).
yeah, i imagined so much... just wishful thinking on my part that connectivity might be an issue. just out of curiosity, what percentage is connectivity versus cache constraints/strip fitting?

i regularly strip fairly large meshes for general projects, but this project in the screen above is basicly composed of a whole lot of relatively small meshes... so in that case the exponential performance is not a major concern. i have had to walk away from my machine in the past for about 10minutes while nvidia's stripper spits out a fair sized mesh... but stripping these 200k little meshes generally runs about a day. and i suspect in the future the size of the meshes will probably go up significantly for different optional resolution databases. i'm a little bit frightened about how that might scale... 2 or 3 days, maybe more... so basicly i am looking for faster alternatives.

especially if and when i provide a conversion tool for end users to use on their own machine. for the users, there option would be either to download the database for their hardware and needs, or generate it locally. the download for a database that takes about a day to generate is around 20MB. but downloading also adds traffic to the download server. and for larger databases and users with lower end modems, it might be in thse users interest to generate the data locally. so ideally, to reach a happy medium, i would just like to be able to get the database generation tool as streamlined as possible, and just host common, smaller databases... because the range of options could be too high to satisfy everyone's hardware and application needs.



I suggest you look at Tri Stripper webpage for more info on the algorithms that are used; again it's a bit outdated and my English is not so bright but it'll give you a good overview.i will give it a look. just out of curiosity... are you fully comfortable with the nvidia algorithm? or are the poorer quality strips just a natural result of the obvious performance boost? in the end i could just compile yours and nvidias code (and maybe others) into the utility, and it would just be up to the user to choose between time and performance when choosing options for compiling their database.

if you are interested in what i'm doing with all of this stripping... there is another very imformative thread in this forum (advanced) titled something like, "unique ROAM VBO issues and a clincher", if my memory serves me.

sincerely,

michael

PS: in defense of nvidia's stripper. my meshes are really not very strip friendly on the face... much more fan like in organization. so that could explain to a fair degree why the strips might not look too cache friendly. what would really be nice i figure would be a stripper that can make the decision, against some user defined visual bias, to flip the inside edges of quads here and there to promote a better stripping. how much trouble do you think it would be to integrate that kind of feature into your system?

PPS: counter to nvidia's defense, they could've stuck a degenerate triangle here and there to get a much better cache layout it seems. there appears to be some degenerates in the highlighted mesh in the screen referenced above, even with primitive_restrat enabled... meaning nvidia's stripper does at least have some capacity to negotiate between the appropriateness of a degenerate versus a restart.

jwatte
03-05-2005, 07:15 PM
can you or anyone make an architecture argument for thisYes. It's just like any other cache: the big cost is latency, so when you take the time to read your vertex (which might be 40 bytes), you read some more (say, 64 or 128 bytes) in an aligned burst, which takes very little extra time compared to reading only the 40 bytes. With luck, the next amount of data you read is close to what you read previously, so there's no need to go to the bus at all -- a net savings of your read latency (which is usually much bigger than your burst time).

Anyway, if you're spending significant time on your tri stripper, FIRST make sure you actually need it. There ARE cases where a tristrip will draw faster than a triangle list, but in most cases, a triangle list will be just as fast as a tristrip, because of the post-TnL cache. Don't do complicated work if you don't have to. The difference is the amount of index data you have to push across the bus, but again, sequential bursting is very fast.

michagl
03-08-2005, 03:53 PM
thanks for the explanation... and i would've replied sooner, but i didn't realize there was a new post in this thread.

for what its worth, like you said was my best guess like i said above... but its good to hear something for sure.

is that how the cache works period? because i always imagined the cache working by recognizing when an already used vertex is being used. or does it just rely on big reading chunks?

if it doesn't do both, i get the impression that this is a bad move for hardware. because it wouldn't make so much sense for the cache model to work best for triangle lists, while most would use strips for optimized rendering.

however for the app i'm interested in here, i'm pretty much out in the cold on the burst read, because my vertices are all over the place due to a subdivision pattern. however there is no reason why i can't use an index for a level of indirection, other than the extra cpu side offset addressing... or technicly the vertices are hardly ever accessed directly, remapping the triangle vertex index etc would due, without the extra indirection. strip dependant vertex buffers would be out of the question for my app, but lists would just take up extra memory, even if they were comparably efficient. i figure strippers do the best they can anyhow to account for the vertex buffer order, if that isn't the only thing the pre tnl cache cares about.

is that how it works... is it a pre-tnl cache that reads video memory in bursts, and a post-tnl which works like a traditional cache.

i really don't understand this completely. do hardware caches hold addressable copies of memory, or just ajacent copies of memory, or one or the other?

so thats another optimization to think about.

just to reemphasize this question:

does the cache not use any model that matches already used vertices by index? and is the cache always linear? or can it be filled from vertices throughout the bound buffer?

--------

as long as i'm here. there is definately a major bug in the nvidia triangle stripper which flips the winding of strips. it is dependant on the triangle indices fed into the generate function. which means, you can potentially avoid the bug by just rotating your triangle indices per triangle.

i just upped my triangle count 3 fold, and the bug showed up all over the place. fortunately i was able to rotate my vertices to get rid of the bug's effect.

just to be totally clear, when i say 'rotate' the winding remains the same, x becomes y, y becomes z, z becomes x, etc...