View Full Version : unique ROAM VBO issues and a clincher



michagl
03-02-2005, 01:45 PM
first let me say that i realize there are a lot of topics covering this domain already available in these forums... but i think in this case it would be better to start fresh.

i do have a specific hardware and VBO api issue, but i would also like very much to discuss general strategies for this unique situation.

before i try to offer some brief context, here is an illustrative aid to refer to:

http://arcadia.angeltowns.com/share/genesis-mosaics-lores.jpg

a new screen hot off the presses. next i will try to describe what is going on in this image.

basically, this is a ROAM (real-time optimally adapting mesh) system. however i believe it is quite unique in contrast with its predecessors, and i believe, in spirit, perhaps as optimally efficient as is possible with contemporary hardware...

there are essentially two subdivision tiers to this system. the first tier is dynamic, whereas the second is relatively static. splitting takes place in both tiers, but merging only in the first.

the first tier is an equilateral sierpinski tessellation... think taking a triangle, and fitting an upside down copy of itself in its center with vertices at the parent triangle's edges' midpoints. a discontinuity of plus or minus one is allowed along the edges of the first tier, as they will be mended in the second tier. this ROAM system is designed with arbitrary manifolds (non-planar) in mind, so sierpinski is an ideal fit for the first tier.

managing the first tier is a complex memory problem as with any such system, comparable perhaps only to a complex real-time dynamic compiler (interpreter), in my opinion. the second tier however, where most of the resolution exists, is designed to offload as much of this memory management as possible.

each triangle, or node, in the first tier forms a discrete component of the second tier. i will refer to this as a 'mesh' or maybe a 'mosaic' at some point. all of these meshes share the exact same connectivity data and require the exact same amount of system and video memory. just about everything about them can be computed just once at boot. all that is left for each 'instance' is essentially per-vertex data (position, texcoords, normals, colour, etc), per-vertex 'variance' weights, and a single flag byte for each face in the mesh. multiple cameras viewing the same mesh only require that the face flags be duplicated so that unique views can be produced for each camera.

for many reasons it turns out that 8x8 is the optimal resolution for the second tier meshes, which means that each edge of the triangular mesh can be subdivided into 8 edges. in the end the only real task arises from mending the borders between second tier meshes, but a lot of data can be precomputed to aid in this mapping process, which is essentially limited to only 6 cases (xx yy zz xy xz yz) as i recall. there are also many compromises which could speed up the border mending process while sacrificing the accuracy of the tessellation slightly.

finally, the connectivity of the base mesh is more complex than normal connectivity, as essentially it is a multi-resolution connectivity containing all of the information of any view/variance based tessellation of the mesh. though the fully tessellated 8x8 mesh contains 64 faces, the total multi-resolution mesh contains 127 (64+32+16+8+4+2+1) faces, as does each instance, which requires 127 bytes per frustum to store its state.

with this in mind, it is possible to compile a 128bit signature for each possible permutation... it is this signature which i will more accurately refer to as a 'mosaic'. the signature is a series of bits which are set on or off depending upon whether the corresponding face is a leaf (visible) or not.
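
roughly, the signature amounts to something like this (a minimal sketch with made-up names, not the actual code):

#include <cstdint>

// sketch: pack the leaf flag of each of the 127 multi-resolution faces
// into a 128-bit key; the face numbering and names here are hypothetical
struct MosaicKey
{
    uint64_t lo, hi;                        // 128 bits, bit 127 simply unused

    void set(int face, bool leaf)
    {
        uint64_t& w = (face < 64) ? lo : hi;
        uint64_t  m = 1ull << (face & 63);
        if (leaf) w |= m; else w &= ~m;
    }

    bool operator<(const MosaicKey& k) const
    { return hi != k.hi ? hi < k.hi : lo < k.lo; }
};

// per-frustum face flags (one byte per face) -> key
MosaicKey buildKey(const unsigned char faceFlags[127])
{
    MosaicKey k = { 0, 0 };
    for (int f = 0; f < 127; ++f)
        k.set(f, (faceFlags[f] & 0x1) != 0);   // bit 0 assumed to mean 'is a leaf'
    return k;
}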

i've built an empirical database of all possible mosaics. ( through a process which basically involves me setting up a gerbil wheel simulation and strapping a rubber band around my joystick thruster... which ran for about 3 days that way with various ever finer splitting constants )

the final result is about 200,000 mosaics, technically around 193,300... that number is still growing now and then as extremely rare mosaics are found, but i don't expect it to grow too much further.

with that in mind, offline i have used the nvidia triangle stripper utility to compute strips for each mosaic... a database which on disk requires about 20MB, about 4MB of which are 128bit keys and 32bit offsets.

finally, the basic task is to solve the signature of each second tier mesh given its per-vertex weights and the camera's position, use that signature as a key to quickly look up the appropriate mosaic, and assign it to that mesh for the later rendering routine.
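
the lookup itself can be as simple as a sorted map from key to strip offset -- again just a sketch, assuming the MosaicKey type sketched above and a strip pool loaded from disk:

#include <cstdint>
#include <map>
#include <vector>

// hypothetical layout: all strip indices concatenated into one pool,
// each key mapping to an offset/count pair
struct StripRef { uint32_t offset; uint32_t count; };

std::map<MosaicKey, StripRef> g_mosaicTable;   // ~193,300 entries
std::vector<unsigned char>    g_stripPool;     // byte-encoded strip indices

const StripRef* findMosaic(const MosaicKey& key)
{
    std::map<MosaicKey, StripRef>::const_iterator it = g_mosaicTable.find(key);
    return (it == g_mosaicTable.end()) ? 0 : &it->second;
}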

as far as VBO is concerned, each mesh upon creation uploads its per-vertex data to video memory set to DYNAMIC_DRAW. when a mesh dies, it is recycled and its video memory is handed over to an incoming mesh, which then simply writes its per-vertex data over the previous owner's data... that is to say that the handle is not deleted.

as for the mosaics, as soon as a new mosaic is discovered, its index data is uploaded to video memory in STATIC_DRAW mode, as it will never be overwritten. the mosaic handles are just passed around to meshes as their signature changes. that is to say as well that it is possible for multiple meshes to share the same mosaic handles.
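
in GL terms the recycling boils down to something like this (a sketch only; 'Vertex', 'newMeshVertices' and 'stripBytes' are placeholders, and it assumes the VBO entry points are already loaded via GLEW or manual extension loading):

#include <GL/gl.h>

// one vertex buffer per recycled mesh slot -- created once, overwritten in place
GLuint vbo;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, sizeof(Vertex) * 45, NULL, GL_DYNAMIC_DRAW);

// when a new mesh takes over the slot, just respecify the contents
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferSubData(GL_ARRAY_BUFFER, 0, sizeof(Vertex) * 45, newMeshVertices);

// a newly discovered mosaic uploads its byte indices once and is never touched again
GLuint ibo;
glGenBuffers(1, &ibo);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
glBufferData(GL_ELEMENT_ARRAY_BUFFER, stripByteCount, stripBytes, GL_STATIC_DRAW);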

in all there are 45 vertices in each mesh. unlike faces, vertices are shared at every level of subdivision. also, the mean length of the mosaics (tristrip indices) was ~130 before primitive_restart builds, now slightly less. armed with these facts it is possible for the mosaics to be byte encoded, because values greater than 0xFF are never required. 0xFF is the primitive_restart index. this saves considerable memory, but if there is a performance hit in using byte indices i would like to know.
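
for what it's worth, drawing one byte-indexed mosaic with NV_primitive_restart looks roughly like this (a sketch using the handles from above; i've seen 16-bit indices recommended over 8-bit for the hardware path, so it may be worth benchmarking both):

glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_PRIMITIVE_RESTART_NV);
glPrimitiveRestartIndexNV(0xFF);                    // restart marker used in the strips

glBindBuffer(GL_ARRAY_BUFFER, vbo);                 // this mesh's 45 vertices
glVertexPointer(3, GL_FLOAT, sizeof(Vertex), 0);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);         // this mesh's mosaic strip
glDrawElements(GL_TRIANGLE_STRIP, stripLength, GL_UNSIGNED_BYTE, 0);

glDisableClientState(GL_PRIMITIVE_RESTART_NV);
glDisableClientState(GL_VERTEX_ARRAY);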

mostly i would simply like to know how best to satisfy hardware constraints. in the future it would probably be useful to aggressively track VBO references and delete them as appropriate, so any insight into that sort of driver behavior would be useful to me. i wonder if i should set the per-vertex uploads to STATIC_DRAW rather than DYNAMIC_DRAW to ensure video memory. the life span of a first tier node is pretty long in computational terms, but depends mostly on circumstance, though the minimum life span is also regulated.

EDIT (to new readers): BUG SOLVED -- still open to discussing optimization, however.

my major concern though, without which i probably would not at this time be sharing this information here, is a severe performance anomaly. the 'mosaic' component of this system was a relatively new idea which has caused me to revisit the system and devote a fair amount of attention to it. since implementing it the performance has been as good as i had hoped, but occasionally the system appears to fail in hardware. i'm fairly certain the source of the matter exists in the graphics hardware.

essentially, i work with the card synched at 60 frames per second. if i push the resolution up so that normal performance is right around 100%, i am best able to gauge slow down. presently, i occasionally experience a sharp 50% drop in performance. i believe this drop occurs at all resolutions, but i have yet to break the card's vsync (i believe it is called) status to see. in any case this behavior is pretty much inexplicable. it is not fill limited, nor geometry limited; it occurs when both fill and geometry cannot be an issue. also, this effect occurs when the frustum is aimed into particular regions, the boundaries of which could be said to be no more than a single pixel (skirting perspective division). that is to say, i can move the camera only the very slightest, and suddenly i will see the 50% hit; move it back, and frames go back to normal.

i can do this with virtually no cpu restrictions whatsoever. once the view is set, i can drop the camera, which basically reduces the simulation to a pure rendering loop, which leads me to believe this is NOT* happening on the cpu (at least not in my code). also it happens simply by changing the gpu modelview matrix, whereas my numbers don't change a bit. so it seems like some kind of hardware culling is slowing things down.

it doesn't occur from calling DrawElements more or fewer times

i'm tempted to think it is some kind of driver bug... i might see if i can find alternative drivers. but it seems more like something that would happen purely on hardware. or maybe the driver is hitting a bug while trying to cull an entire vertex buffer for me or something... which would be much more than i would ask of a driver... especially as i already do this for myself.

maybe something at that point is causing the driver to offload my VBOs to agp memory ( assuming it isn't there in the first place ) ... but even still, i don't see how it could make that decision when the only thing changing for hardware is the modelview matrix. i can't stress that more than anything... the only factor which causes the hit is the modelview matrix. so it must be some form of hardware/driver culling, or culling based memory management, causing this best i can figure.

i don't know what else to say, except that this hit is massive and comes out of the blue, and i can't live with it, and it is totally inexplicable.

for the record, i'm using nvidia drivers from nvidia.com, which i downloaded for glsl support not too long ago.

sincerely,

michael

PS: i'm not here to discuss ROAM vs. static geometry... this work isn't about seeing raw performance gains, it's about managing massive streaming geometry with infinite scalability... yes one day all top class simulations will utilize ROAM systems, because that is the only way it is realistically possible to seamlessly study the volumetric texture of say... 'tree bark', from 10 centimeters to 10 kilometers. or more practically perhaps, drive a simulation of planet earth from an extremely high resolution elevation map. (which i have done with this system with the ETOPO2 2-minute earth topography database)

*edit: added missing negative (NOT)

knackered
03-02-2005, 02:26 PM
Have you looked into geometry clipmaps?
As far as I can see, they offer the best compromise between performance and memory, and pretty much leave the cpu to just upload small amounts of vertex data spread over many frames. CPU/bandwidth usage can be throttled on a per-frame basis....etc.etc. there's loads of good things about them.
Vertex and texture detail can be sampled from compressed data, or generated using any algorithm you like, such as perlin noise.
The active regions (the regions which decide what data you want to 'view' essentially) can be dynamically changed depending on whether you want to zoom into minute detail or into the stratosphere to get an overview of the whole planet....leaving the clip regions to catch up using whatever number of cpu cycles you want to allocate to the task. The only penalty to allocating less cpu time is the rendering of less detail, while the rendering speed actually goes up!
ROAM, even a modified version such as yours, is pretty much redundant as a concept. Streaming into static vertex buffers is where it's at!

michagl
03-02-2005, 03:45 PM
Originally posted by knackered:
Have you looked into geometry clipmaps?
As far as I can see, they offer the best compromise between performance and memory, and pretty much leave the cpu to just upload small amounts of vertex data spread over many frames. CPU/bandwidth usage can be throttled on a per-frame basis....etc.etc. there's loads of good things about them.
Vertex and texture detail can be sampled from compressed data, or generated using any algorithm you like, such as perlin noise.
The active regions (the regions which decide what data you want to 'view' essentially) can be dynamically changed depending on whether you want to zoom into minute detail or into the stratosphere to get an overview of the whole planet....leaving the clip regions to catch up using whatever number of cpu cycles you want to allocate to the task. The only penalty to allocating less cpu time is the rendering of less detail, while the rendering speed actually goes up!

the algorithm in the screen linked above satisfies all of these constraints. i could've said a whole lot more about the operation, and actually intended to say a little bit more that i forgot... but anyhow, i admit that i've never heard of 'geometry clipmaps' as a conventional terminology... sounds like space partition rendering though, which would not begin to facilitate LOD realistically, or at least smoothly.



ROAM, even a modified version such as yours, is pretty much redundant as a concept. Streaming into static vertex buffers is where it's at!

if you are streaming into a buffer it is no longer static as far as i know. the algorithm i described, even as scantly clad as above, if you pay attention, you will see that it approaches rendering static geometry very closely, and probably hits just about at par with the algorithm you've described, only with much cleaner features.

anyhow, i would really like to discuss hardware, primarily VBOs as they relate to the algorithm described.

my highest priority here though is to find the source of this crazy driver/hardware performance hit described above.

i would also eventually like to discuss cpu/gpu parallelism and whatever options might exist there.

i meant to say a lot more in the introductory post, but i will save it for if and when it is able to draw attention.

knackered
03-02-2005, 11:14 PM
I must admit, I just didn't read most of your original post...looked to be a hell of a lot of cpu work, which isn't acceptable....still haven't read it all, it's way too long. Take a look at geometry clipmaps, that's my advice, and if your method works out more efficient then write it up and publish it, then I'll read it...until then, life's too short to read unqualified essays on newsgroups.
Oh, yes obviously if you change the content of a vertex buffer it ceases to be *literally* static, but if you only update small portions of it every 10 or 20 frames, then, in all but wording, it is static.
http://research.microsoft.com/~hoppe/#geomclipmap

Adrian
03-03-2005, 12:24 AM
You need to turn off vsync before doing any kind of performance testing. It's in performance and quality settings. Click on vertical sync then the 'application controlled' checkbox and move the slider to off.

btw the image link is broken.


Originally posted by michagl:
it's about managing massive streaming geometry with infinite scalability...
You don't have to use ROAM to achieve that.

michagl
03-03-2005, 05:49 AM
Originally posted by knackered:
I must admit, I just didn't read most of your original post...looked to be a hell of a lot of cpu work, which isn't acceptable....still haven't read it all, it's way too long. Take a look at geometry clipmaps, that's my advice, and if your method works out more efficient then write it up and publish it, then I'll read it...until then, life's too short to read unqualified essays on newsgroups.
Oh, yes obviously if you change the content of a vertex buffer it ceases to be *literally* static, but if you only update small portions of it every 10 or 20 frames, then, in all but wording, it is static.
http://research.microsoft.com/~hoppe/#geomclipmap

the post requires about a minute to read if you are comfortable with english. i kept it very succinct. the system as described is gpu limited rather than cpu limited. the cpu is basically just responsible for calculating distances for lod testing... but as i understand pciexpress technology that could probably be offloaded to the gpu in time. as for reading in small portions every 10 or 20 frames, that is exactly what i'm doing, only less often and in quite small portions -- 8x8 blocks. finally i don't keep up with pop terminology, but if hoppe has published it i've read it, unless it is very new.

edit: ok, in fairness, 3 or 4 minutes. but for what it's worth, the first half is a brief description, the second half troubleshooting.

michagl
03-03-2005, 06:05 AM
Originally posted by Adrian:
You need to turn off vsync before doing any kind of performance testing. It's in performance and quality settings. Click on vertical sync then the 'application controlled' checkbox and move the slider to off.

btw the image link is broken.


Originally posted by michagl:
it's about managing massive streaming geometry with infinite scalability...
You don't have to use ROAM to achieve that.

yes, i'm aware of vsync. i welcome the concern nonetheless though.

as for the image, the original url had 'www.' in it... i have no idea why i stuck that in there, habit i guess. anyhow, it should work now. keep in mind the image is very low resolution for illustrative purposes. the highlighted triangle is a fully tessellated second tier mesh.

as for your final comment, i use the term ROAM very literally. i'm not associating with any past project(s), i simply mean 'real-time optimally adapting mesh'... which as far as i'm concerned applies to any algorithm which dynamically and smoothly manages a mesh with respect to the frustum and topological turbulence, and perhaps other similar features. the 'smoothly' qualifier means that the mesh must make seamless transitions, meaning using 'blending' geometry doesn't count in my opinion.
i must admit i'm partial to elegant solutions.

michagl
03-03-2005, 06:21 AM
before i sign off, i had a pretty good idea last night. i was thinking about the ROAM critique*. the only 'flaw' i could find in the system is having to calculate frustum distances for every updated face. i came up with some ways to approximate this process in one fell swoop per mesh.

but i had another idea which would not preclude any others: to up the resolution of the mesh, without sacrificing any of the great qualities of 8x8. it is possible to add another vertex in the center of each face, and break the faces up into 3 self-contained triangles.

i will spare the details, but this would up the number of triangles in a mesh threefold for free, without sacrificing any existing qualities of the system.

the only remaining issue is that the triangles would be slightly acute, but i figure they will look ok... if not, i don't plan to make the change cold turkey... it will have to be optional.

*edit: brainflop - replace 'technique' with 'critique'

-FOLLOW UP------------------------

i found a beautiful solution in this vein... as it turns out, it is possible to flip the resulting scalene triangles along their bases (which is the hypotenuse of a quad). this flipping operation can be performed before building the preprocessed mosaics. the result is a mesh with much better fit triangles across the board, fairly closely approaching equilateral triangles.

the resulting mesh shares exactly the vertices of the mesh fed into the subdivision algorithm, but opposite edges. the edges exist only in the offline preprocessed strip indices as far as the gpu is concerned.

the result is that a much smaller mesh can drive the lod based tessellation of a much finer mesh with optimal triangle cover with zero online performance hit.
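
the first half of that idea (the 1-to-3 centroid split, before the flipping pass) is simple enough to sketch (made-up types; the flip itself happens offline before stripping):

struct Vec3 { float x, y, z; };
struct Tri  { Vec3 a, b, c; };

// split one face into three by fanning around its centroid
void fanSplit(const Tri& t, Tri out[3])
{
    Vec3 c = { (t.a.x + t.b.x + t.c.x) / 3.0f,
               (t.a.y + t.b.y + t.c.y) / 3.0f,
               (t.a.z + t.b.z + t.c.z) / 3.0f };
    Tri t0 = { t.a, t.b, c };
    Tri t1 = { t.b, t.c, c };
    Tri t2 = { t.c, t.a, c };
    out[0] = t0; out[1] = t1; out[2] = t2;
}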

Adrian
03-03-2005, 06:38 AM
i work with the card synched at 60 frames per second
but i have yet to break the card's vsync (i believe it is called) status

From your original post I read it that you had vsync on and hadn't figured out how to turn it off.

Your screenshot shows an FPS of 102%. I've never seen an fps as a percentage. What is it a percentage of?

michagl
03-03-2005, 08:43 AM
Originally posted by Adrian:
From your original post I read it that you had vsync on and hadn't figured out how to turn it off.

Your screenshot shows an FPS of 102%. I've never seen an fps as a percentage. What is it a percentage of?

yes i have vsync on, but i never said i don't know how to turn it off... i just don't feel like turning it off generally. i can generally gauge the efficiency of my code by its organization... if people want to ask for hard numbers i might fool with it. but there is no reason i can see to do so now.

as for 102%, that means the frames are running at 102% of 60 frames a second. the vsync is limiting at 102%, though vsync is set for 60. if i turned it off, that percentage would go up outrageously depending on render modes. if i want to test performance, i just turn up the subdivision attenuation coefficients until the frames drop below 100%, then i know i'm in the ball park for realistic testing. i tend to animate my machines... that is, i attribute animistic qualities to them... so i feel bad about making them do excessive work unnecessarily. also turning off vsync is a good way to make windows scheduling even more difficult to work with if things get out of hand.

as for FPS, in short that number is derived from the seconds per frame averaged over 60 frames, updated once a second so that it doesn't jump around too erratically to keep up with. below 50% it is displayed in red, up to 90% in yellow, then green up to 100%, and white at 100% and above.

3B
03-03-2005, 09:18 AM
Do you actually get values between 30 and 60 FPS with vsync on? I would have expected you to be limited to factors of 60 (60,30,20,15, etc), unless you are right on the edge and not all frames are taking the same amount of time, or frame times are varying a lot within the 1 second sample...

michagl
03-03-2005, 09:26 AM
Originally posted by 3B:
Do you actually get values between 30 and 60 FPS with vsync on? I would have expected you to be limited to factors of 60 (60,30,20,15, etc), unless you are right on the edge and not all frames are taking the same amount of time, or frame times are varying a lot within the 1 second sample...

that is interesting. if vsync really works as you describe it, that might actually explain the crazy 50% hit i'm seeing.

i don't understand why vsync would work like that, but assuming it does, i will look into it asap. if you can point me to technical docs regarding the nature of vsync, i would appreciate it.

yes i do get numbers varying across the board, but like you notice, i am averaging over a second for readability, so the average value could be misleading if that is indeed what is happening.

in the past i've noticed better performance across the board when programming various systems with vsync disabled. i realize there is an extension for manually managing vsync at run-time which i've considered using when the frames go over a certain threshold.

in any case, i appreciate the insight, and i would like very much to pursue this line of reasoning.

3B
03-03-2005, 10:01 AM
Basically what happens with vsync on is that you can only swap buffers when the monitor finishes displaying the current frame (aka at the beginning of the vertical retrace interval, which is triggered by the vertical sync signal, thus the name 'vsync'), in this case 60 times per second. If you take too long rendering, you miss a chance to swap buffers, and have to wait for the next retrace, so it takes a total of 1/30 sec since the previous frame before the new one is displayed, leading to the sudden drop to 30 FPS.
In other words, vsync effectively rounds your frame time up to the next integer multiple of 1 monitor frame, so you get 60/1=60Hz, 60/2=30Hz, 60/3=20Hz, etc.
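
As a tiny sketch of that rounding (a hypothetical helper, assuming a 60Hz refresh):

#include <cmath>

// under vsync, frame time rounds up to a whole number of refresh intervals
double vsyncedFPS(double frameSeconds, double refreshHz)
{
    double interval = 1.0 / refreshHz;
    int    periods  = (int)ceil(frameSeconds / interval);  // 1, 2, 3, ...
    return refreshHz / periods;                            // 60, 30, 20, ...
}

// e.g. a 16.5ms frame gives vsyncedFPS(0.0165, 60) = 60,
// while a 16.8ms frame gives vsyncedFPS(0.0168, 60) = 30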

Korval
03-03-2005, 10:53 AM
What's probably happening is that most of the time, you're pushing the hardware so that rendering takes about 16.5 milliseconds. Enough for 60fps, but just barely. However, sometimes, the hardware has to do some extra clipping or something (camera-dependent) that makes it take 16.7 milliseconds or so, which is just past the 16.6667ms threshold for maintaining 60fps. Since you're v-sync'd, you're going to drop to 30fps.

To verify this, turn off v-sync. If you find a mild framerate drop (60 to 58, for example), then this is probably what is happening.

michagl
03-03-2005, 11:16 AM
Originally posted by 3B:
Basically what happens with vsync on is that you can only swap buffers when the monitor finishes displaying the current frame (aka at the beginning of the vertical retrace interval, which is triggered by the vertical sync signal, thus the name 'vsync'), in this case 60 times per second. If you take too long rendering, you miss a chance to swap buffers, and have to wait for the next retrace, so it takes a total of 1/30 sec since the previous frame before the new one is displayed, leading to the sudden drop to 30 FPS.
In other words, vsync effectively rounds your frame time up to the next integer multiple of 1 monitor frame, so you get 60/1=60Hz, 60/2=30Hz, 60/3=20Hz, etc.

seems like you are correct... i disabled the vsync, which unfortunately is not as easy as it should be on a win2k and linux machine with 3 monitors using nview and a pci card --- i'm hoping nvidia will eventually fix this bug, in case anyone from nvidia is reading.

anyhow, that seems to explain the sharpness of the hit... it would just cross the tiniest threshold due to hardware culling and then round off to the next edge.

i worry though about the visual effects which might occur from not adhering to vsync... honestly i just mostly use it as a cheap built-in performance limiter, and have never really thought of what visual artifacts might occur from being out of sync with the monitor. care to explain anyone?

----

and finally, as a major bonus to me... i noticed some funny behavior when i had the system set to not render any primitives. i only recently picked this system back up for a couple reasons, but the major reason i've stuck with it lately is this really great 'mosaics' idea i had in the process. anyhow, turns out last time i dropped the system... i believe i had to run off to visit portland/seattle out of the blue... anyhow, i was trying out the hardware occlusion solutions out there, which i found completely inadequate, but there was a line of code i had forgotten to comment out, which was effectively rendering everything twice, once to the occlusion system... and to make matters worse, it was not retrofitted for mosaics, so it was passing an 8bit index buffer as 16bits, which meant A) it was overflowing, and B) there was no telling what kind of crazy index values were coming out of combining the bytes in that buffer.

i could've sworn the system had a lot more kick when i left it than lately... now it is running incomprehensibly fast... with the new mosaic system and removal of this major bug... so i will have to spend a little time playing with it before i can really comment too much.
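
for anyone curious, the bug was the classic index-type mismatch; roughly (hypothetical names, same idea):

// the bug, roughly: the leftover occlusion path drew the byte-encoded mosaic
// as if it held 16-bit indices, fusing pairs of bytes into bogus values and
// reading past the intended range
glDrawElements(GL_TRIANGLE_STRIP, count, GL_UNSIGNED_SHORT, 0);  // wrong for this buffer

// the mosaic buffers actually hold 8-bit indices (0xFF reserved for restart)
glDrawElements(GL_TRIANGLE_STRIP, count, GL_UNSIGNED_BYTE, 0);   // correct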

however i'm still very interested in pursuing the hardware aspects of this system, as i feel it is fairly special, and could very well be a cornerstone of future graphics systems. it is an ideal candidate for hardware mapping as well i believe... if and when progressive meshing is integrated at the hardware level -- much better than the hardware nurbs tessellators out there, though i admit i'm really not familiar with them as a developer.

michagl
03-03-2005, 11:21 AM
Originally posted by Korval:
What's probably happening is that most of the time, you're pushing the hardware so that rendering takes about 16.5 milliseconds. Enough for 60fps, but just barely. However, sometimes, the hardware has to do some extra clipping or something (camera-dependent) that makes it take 16.7 milliseconds or so, which is just past the 16.6667ms threshold for maintaining 60fps. Since you're v-sync'd, you're going to drop to 30fps.

To verify this, turn off v-sync. If you find a mild framerate drop (60 to 58, for example), then this is probably what is happening.

yeah, i think your diagnosis is spot on. i missed your post before posting my last, for what it's worth, or i would've given you credit.

michagl
03-03-2005, 11:43 AM
ok, first... i was happy because vsync definitely seems to be the reason for the sudden sharpness of the performance hit... but i was still a bit worried, because even with vsync disabled, i still got the gradual performance hit.

i haven't tested anything, but i'm pretty sure i know where the hit was coming from. it seemed to occur most when the final mosaic was on screen... which happens to be the full mosaic. the reason it would occur is because the mosaics all share the same buffer, so when the occlusion system would pass crazy index values to hardware, they would still be inside the buffer, and get relatively reasonable values. but when it got near the end of the buffer, the values would jump outside the consolidated buffer, meaning anything could get passed to the hardware for rendering... meaning crazy fill was likely going on, which was not visible because it was not rendered to the screen buffer, but to the offscreen occlusion 'buffer' and probably simultaneously to the depth buffer.

edit: ok, that reasoning really isn't accurate, and i knew that when i was writing it... but something similar to that is going on, though more likely the real nature would depend more on the organization of the video memory than system memory. the final factor is that some mosaics would create weird fill rates, while others might not, because the vertices are all normalized in a local space for optimal precision, and the whole system itself takes place in a normalized space around 1.0... so likely none of the vertices on the card would have caused serious fill issues, until an index picked outside a vertex region.

so that is my theory for the performance lulls... not really a theory though, because i'm sure it's correct. there probably aren't any other performance issues out there.

so finally, if anyone feels like they understand what i'm doing, sees promise in it, or would just like to be helpful, i'm very open to whatever hardware and opengl api related advice anyone feels they can contribute. i will try to take names and give credit if you like.

i'm mostly interested in what i can do to optimize parallelism, and how most effectively to utilize VBOs. i would very much like to know if there is any way to make a pact with the driver when uploading VBOs... for instance to tell it, "here is a buffer, it needs to be uploaded, but it won't change until the buffer needs to be rendered, so take your time uploading if you need to"... i figure there could be some parallelism in DMA (direct memory access) or something, if you can promise not to mess with some system memory, so that the driver can get around to copying it when it feels like it. i don't see any room for that in the VBO api... does parallelism stop when you need to upload a buffer? is it immediately copied into AGP memory for upload to video memory? or can DMA come around and catch it later... i take it everything must go through AGP memory to get to video memory, correct?

i don't pretend to understand any of this stuff... so if anyone feels like weighing in, i would appreciate it greatly.
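
to make the question concrete, the kind of call i mean is just this (a made-up sketch):

// after this call returns, GL has conceptually taken its copy of 'verts',
// so the application is free to reuse that memory -- the open question is
// whether the driver copies to AGP right away and DMAs to video memory
// later, or stalls until the transfer is finished
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, sizeof(Vertex) * 45, verts, GL_DYNAMIC_DRAW);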

knackered
03-03-2005, 11:44 AM
Sorry mate, you started waffling on about gerbils at which point I lost my optimism and gave up reading.
I *still* haven't read it. You need diagrams and such like to best describe your algorithm. Oh, and you need to properly analyse it under stress tests and a good profiler...unfortunately, this also requires knowledge of such fundamentals as vsyncing, thread throttling, load analysis, cache coherency...before declaring your method optimal.
Have you read hoppe's paper yet? If you consider siggraph'04 as being new, then yes it's new.

michagl
03-03-2005, 01:03 PM
Originally posted by knackered:
Sorry mate, you started waffling on about gerbils at which point I lost my optimism and gave up reading.

the gerbil is an analogy. solving every possible permutation of a base to apex subdivision is a lot more complex than it sounds. so it's reasonable to build an empirical database before attempting an automated algorithm... so you have something to compare your results to. the 'gerbil' simulation facilitates building the empirical database. the camera is the gerbil; the environment is actually a model in the works, of the alien/god/satellite known as gaea from the classic scifi novels penned by john varley, Titan, Wizard, and Demon... part of a promotional demo. essentially though it's a gerbil wheel, or a world with simulated centripetal 'gravity', such as the space module from 2001. a 'gerbil wheel'. anyhow, it could've been done on a model of earth the same way... but the gerbil wheel seemed more appropriate for the task.


Originally posted by knackered:
I *still* haven't read it. You need diagrams and such like to best describe your algorithm. Oh, and you need to properly analyse it under stress tests and a good profiler...unfortunately, this also requires knowledge of such fundamentals as vsyncing, thread throttling, load analysis, cache coherency...before declaring your method optimal.
Have you read hoppe's paper yet? If you consider siggraph'04 as being new, then yes it's new.

sorry mate, but i'm not an academic, so all that will have to wait until a complete system is online, and continued development has become an afterthought.

as far as vsyncing, thread throttling, load analysis, and whatnot, that is all quite secondary to the theoretical algorithm... little more than an implementation concern. and i'm not claiming optimal, and certainly not claiming perfect, but if you know your stuff, then you ought to be able to look at the algorithm and apply your knowledge of such hardware and api level concerns.

as far as thread throttling, i'm not sure if you mean managing a multithreaded app, or fooling around with os scheduling, but i've stayed away from the windows multi-threading api, because i don't like it, and haven't felt like wrapping it... though the thought does cross my mind on occasion.

as for load analysis, i know where my loads are... but if you know of a good free app for analysis, my ears are open.

as for cache coherency, i don't know if you are referring to cpu, or gpu, but i assume hardware manufacturers don't expect application programmers to keep up with every little hardware development, and that is why nvidia freely provides tools such as their stripper, to manage such matters.

besides, don't get me wrong, because i'm nowhere near those stages with this system. a lot more work needs to be done, and there is a sister system as well even further behind in development, which is responsible for stream writing and reading massive multiresolution images to disk with an opengl type interface. it does everything but triangles right now, which i intend to pick up very soon.

end of the story, chill out, i'm not trying to step on anyone's toes.

edit: sorry about not saying anything about hoppe's paper... i've been meaning to look at it, but a lot has been happening. i'm fairly certain i've read it though... i'm assuming it's his streaming simulation of a washington area mountain region. i didn't find anything personally interesting in that paper... i could offer impressions, but it was a long time ago that i read it, so my memories probably would not be accurate. of course the paper may be new, but either way i will give it a look.

-- FOLLOW UP --------

yeah, that's the one i read... though the abstract says it's the entire USA rather than just a washington area... i just made a generalization on that account. anyhow i don't remember the details, but as i recall, it is really more of a demo than a real simulation environment... and the fact that it is done on a regular grid renders it fairly useless for non-planar geometry. i recall reading it all before, and finding nothing useful... i probably remember it particularly because i usually do come away with something useful from hoppe's papers... but not that time.

if you insist i will give it a look... but microsoft's pdfs are very hard to get at, because they don't allow downloads, and i'm on a 56k modem in the middle of nowhere, a long way away from the nearest hub... so it's really more like 28k at best.

knackered
03-03-2005, 03:27 PM
You're actually saying nothing about your own algorithm. You're just pompously rambling on to yourself about essentially nothing, self editing your own text using some kind of third persona. Who are you really talking to?
You've read hoppe's paper? Good, you found nothing interesting in it? Because it concerned itself specifically with planar regular grids? That's because it's an optimisation technique for planar regular grids, it wasn't intended to be another progressive mesh technique. It's intended for extremely large planar datasets, and it's the most optimal technique I've come across so far for these datasets.
Your technique is for general meshes then? Yet you say you think it's up to nvidia to provide you with geometry optimisation tools, which suggests you precompute your tessellations offline, which suggests it's limited to relatively small datasets...either that or there's a hell of a lot of cpu work going on at runtime.
I have actually read what you've said now, and I'm none the wiser, all I see is gerbils playing chess inside "equilateral sierpinski's".
For me, try to explain your algorithm in no more than, say, 10 sentences using words of no more than, say, two syllables (avoiding all gerbil analogies, no matter how tempting).
Or, if you prefer, dismiss me with a patronising wave of your keyboard.

3B
03-03-2005, 04:52 PM
Originally posted by michagl:

i worry though about the visual effects which might occur from not adhereing to vsync... honestly i just mostly use it as a cheap built in performance limiter, and have never really thought of what visual artifacts might occur from being out of sync with the monitor. care to explain anyone?
The visual problem with running without vsync is that you end up showing parts of 2 frames (or more for particularly high FPS) during one monitor frame. You end up with the top part of the screen showing the old frame, and the new frame showing on the bottom (or stripes if you render more than 2 frames). It's more noticeable when moving quickly with high contrast scenes, or if you manage to get a consistent enough rendering rate that the transition is stable relative to the screen.

As far as your algorithm, I don't see enough detail to really tell what you are doing either (or I might be just getting confused by your terminology, not sure...).

Questions I had from what I did get tho:
You said 'equilateral sierpinski tessellation', but the screenshot you posted looked more like right triangles than equilateral triangles. If you meant right triangles, what benefits do you get over quads (aside from the 8bit indices)? I looked at a similar tessellation at one point, but the way I was doing it, it looked to be just a messier version of my quadtree code.

When you generate triangle strips, is this for an entire chunk of 64 tris? or is it some sort of partially tesselated subset of those 64? or something else entirely?

Do you support features like caves and arches?

When you talk about the 2 tiers in your mesh, is tier 2 all the geometry that is actually rendered, or is there geometry in tier 1 also?

michagl
03-03-2005, 04:53 PM
Originally posted by knackered:
You're actually saying nothing about your own algorithm. You're just pompously rambling on to yourself about essentially nothing, self editing your own text using some kind of third persona. Who are you really talking to?

i will pay no mind to this bit...



You've read hoppe's paper? Good, you found nothing interesting in it? Because it concerned itself specifically with planar regular grids? That's because it's an optimisation technique for planar regular grids, it wasn't intended to be another progressive mesh technique. It's intended for extremely large planar datasets, and it's the most optimal technique I've come across so far for these datasets.

i don't doubt it... it just isn't very robust. i assumed he was doing a small tract of land in washington, because i recalled his was a planar algorithm, and the idea of mapping the entire continental US to a plane is just ridiculous on the face of it to me.


Your technique is for general meshes then?

yes, 'non-planar geometry' as stated... i prefer 'anisotropic manifolds', but i was trying to keep the terminology simple. if you would read, and think a little bit, then you could save yourself moaning and groaning. if i spare you this much, i hope you know something considerable about VBO internals and will be willing to trade me.


Yet you say you think it's up to nvidia to provide you with geometry optimisation tools...

if they want to sell hardware, yes... i certainly would provide tools were i in their situation.


...which suggests you precompute your tessellations offline, which suggests it's limited to relatively small datasets...either that or there's a hell of a lot of cpu work going on at runtime.

yes, the tessellations are computed offline, as i've said plenty of times. and no, the datasets are infinite in scale... the tessellations are combined to form every possible configuration. there are around 200k 8x8 tessellations, which saturate the array of states assumable by what i think is conventionally called a 'right triangle subdivision'... split a triangle at the middle of its base, to its apex, then the two formed triangles assume the split line as their bases...

EDIT: this is wrong... the resulting triangles assume the midpoint of the base as their apex.

repeat until there are 8 edges on either side of the triangle... there are around 200k possible states resulting from this process. a triangle strip is computed offline for each state, which can be easily retrieved at run-time, to provide an automatically 'theoretically' perfect 'piece-wise' stripping.
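
in code, that split rule is just base-midpoint bisection applied recursively, something like this (a bare sketch with made-up types):

#include <vector>

struct Vec3 { float x, y, z; };
struct Tri  { Vec3 apex, left, right; };   // base runs from left to right

Vec3 midpoint(const Vec3& a, const Vec3& b)
{
    Vec3 m = { (a.x + b.x) * 0.5f, (a.y + b.y) * 0.5f, (a.z + b.z) * 0.5f };
    return m;
}

// each split adds the midpoint of the base; both children take it as their apex
void split(const Tri& t, int depth, std::vector<Tri>& out)
{
    if (depth == 0) { out.push_back(t); return; }
    Vec3 m = midpoint(t.left, t.right);
    Tri a = { m, t.apex, t.left  };
    Tri b = { m, t.right, t.apex };
    split(a, depth - 1, out);
    split(b, depth - 1, out);
}

// split(root, 6, out) yields the fully tessellated 64-triangle patch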


I have actually read what you've said now, and I'm none the wiser, all I see is gerbils playing chess inside "equilateral sierpinski's".

then i'm probably wasting my time. i'm reminded of the parable that says something ironic about humans having two ears but only one mouth...


For me, try to explain your algorithm in no more than, say, 10 sentences using words of no more than, say, two syllables (avoiding all gerbil analogies, no matter how tempting).
Or, if you prefer, dismiss me with a patronising wave of your keyboard.

you typed a lot, so i will take you at your word. as for myself, i specialize in irreducibly complex algorithms... there is a whole world of algorithms which cannot be explained in "10" steps or less from the top down... and they tend not to make it into academic papers, if only because academic professors were not trained to think beyond 10 steps.

but anyhow, you are extremely hostile and insensible for some reason, but i will patronize you... just keep in mind that i owe you nothing, and i have nothing to prove. it's not every day that i get to work with a system that i can be so public about, so presenting my work is admittedly probably not a strong suit.

glossing over all caveats and technicalities:

0)preprocess template multi-resolution connectivity, and compute a triangle strip for every possible permutation of a subdivided triangle to a given depth.

1)at run-time dynamically and recursively subdivide projective or parameterized geometry from the perspective of the view frustum.

2)each leaf node generated by step 1 is replaced with an instance of the template mesh generated in step 0.

3)per-vertex data is computed from streamed image data according to the subdivided texture coordinates from step 1 and uploaded into video memory.

4)according to schedule and frustum activity further subdivide each second level mesh with respect to per-vertex weights and frustum.

5)solve 128bit state for each leaf node's mesh and use it to look up the preprocessed triangle stripping for that state.

6)render visible leaf node meshes according to per-vertex data and assigned stripping.
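
strung together, a frame looks roughly like this (a sketch in pseudo-C++ with made-up names, just mirroring steps 1-6; step 0 happens offline):

void frame(const Frustum& cam)
{
    tier1.update(cam);                              // 1) split/merge the first tier

    for (size_t i = 0; i < tier1.newLeaves.size(); ++i)
    {
        Node* n = tier1.newLeaves[i];               // 2) new leaves get a recycled
        n->mesh = meshPool.recycle();               //    template mesh instance
        n->mesh->fillVertices(n);                   // 3) stream + upload per-vertex data
    }

    for (size_t i = 0; i < tier1.visibleLeaves.size(); ++i)
    {
        Node* n = tier1.visibleLeaves[i];
        n->mesh->retessellate(cam);                 // 4) second-tier split pass
        const StripRef* s =
            findMosaic(n->mesh->buildKey());        // 5) 128-bit state -> precomputed strip
        n->mesh->draw(s);                           // 6) one DrawElements per visible mesh
    }
}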

michagl
03-03-2005, 05:45 PM
Originally posted by 3B:
The visual problem with running without vsync is that you end up showing parts of 2 frames (or more for particularly high FPS) during one monitor frame. You end up with the top part of the screen showing the old frame, and the new frame showing on the bottom (or stripes if you render more than 2 frames). It's more noticeable when moving quickly with high contrast scenes, or if you manage to get a consistent enough rendering rate that the transition is stable relative to the screen.

i was afraid of that, though i've actually never noticed any visual artifacts with vsync disabled.

do lcd monitors and such have the same characteristics for full screen apps?



As far as your algorithm, I don't see enough detail to really tell what you are doing either (or I might be just getting confused by your terminology, not sure...).
that is understandable... i would've liked to have had more space and time to explain matters... i think it is all there at least superficially, but it might take a little bit of extra thought to piece together.



Questions I had from what I did get tho:
You said 'equilateral sierpinski tessellation', but the screenshot you posted looked more like right triangles than equilateral triangles.

there are both actually, if that wasn't clear... the system is two tier. the first subdivision layer is equilateral sierpinski, which maps well to non-planar geometry, but has a side effect of creating very noticeable furrows, especially once the geometry reaches a point where it can be considered planar. the right triangle tessellation takes the furrows out.

however there are deeper reasons as well why sierpinski is a better fit for the top level, and right triangles a better fit for the second level... but those explanations could become quite complex and wordy.

but just to clear things up, your eyes do not deceive you, they are both there. the sierpinski tessellation is dynamic, whereas the right triangle tessellation is static. the sierpinski is allowed to merge and recurse down to an infinite depth, but the right triangle has a fixed depth, and is not managed dynamically... which is to say, when you subdivide the second level, you simply reset it before subdividing, and merging is not desirable.


If you meant right triangles, what benefits do you get over quads (aside from the 8bit indices)? I looked at a similar tesselation at one point, but the way I was doing it, it looked to be just a messier version of my quadtree code.
a lot of the algorithm is very commonplace; it is just all brought together in an extremely lucrative way. it takes a big picture perspective to take it all in.

first off, as for 8bit indices, that is just a side effect of the fixed resolution of the second tier system. however the second tier system is not limited to 8x8 for the sake of 8bit indices. it is limited to 8x8 because a 4x4 mesh has 68 possible states i believe... an 8x8 mesh has around 200,000... so you begin to get the picture: the next jump is 16x16, and the possible states in that might never be reasonable for even the grandest supercomputer.

it's kind of like precomputing all of the possible trees of a chess match... merging them into a single tree, and then always following the shortest route to victory based on each successive move. of course for chess the size of that tree would be astronomical, but for an 8x8 right triangle tessellation it is doable.

from this approach, it is fairly trivial to determine an optimal piece-wise stripping of an ever changing dynamic mesh, whereas in the past ROAM systems could never manage strips over 5 triangles, and had to manically piece them together in real-time with buckets.

then it just happens that the 127 faced multi-resolution meshes can be encoded quickly into a 128bit key, which can be used to quickly retrieve the stripping of each subregion. for a 16x16 resolution, the key would be 512bits.

so we are pretty much up to the level of static meshes where stripping is concerned. the only real remaining issue where static meshes might win out in terms of raw power (putting scalability aside) is the fact that a really tight solving of the mesh requires a fair number of distance calculations between the vertices and the frustum... but there are a lot of ways to go about tackling that issue, and the approaches are not even necessarily mutually exclusive.

finally, as for quads... the system is designed for arbitrary geometry, so a simple answer is that fitting a quad into geometry is not always necessarily going to be possible, or even optimal. it is planned in the future, once the sister system responsible for streaming in transformed triangle regions of maps is fully operational, to pair up the first tier triangles whenever possible, because it is more lucrative to manage texture images in a rectangular fashion. cases where a pairing is not possible, more rare than not, would mean one half of the rectangular texture buffer would go unused.



When you generate triangle strips, is this for an entire chunk of 64 tris? or is it some sort of partially tesselated subset of those 64? or something else entirely?
the maximum tessellation is 64 triangles currently, though in the future that figure will likely be inflated to 64x3, though this inflation will not change the number of possible tessellations, as the new triangles will be completely self-contained. the strips can be of any number of triangles between 1 and 64 depending on the subdivision state of that mesh, which is a function of matters such as topological turbulence (mountains) and the relationship of the view frustum, as well as possibly other features. the right triangle tessellation follows a general rule, which i'm assuming you are familiar with.



Do you support features like caves and arches?
well, right now the system is working with projective type geometry... think raytracing... or 'geometry images', such as you can find in hoppe's material. fully parameterized geometry such as nurbs models would work well. but the specs of the system are designed with the goal of being able to take opengl type polygonal geometry... so a future api might look a lot like opengl. the system which is responsible for streaming multi-resolution 'texture' data is designed with an opengl api in mind.



When you talk about the 2 tiers in your mesh, is tier 2 all the geometry that is actually rendered, or is there geometry in tier 1 also?

yes, that is correct, only tier 2 is rendered, technically speaking. if you look in the image i've referenced, the thick lines are the first tier... if you examine them, you are likely to notice a sierpinski pattern... tier two is the highlighted node with edges highlighted with thinner lines.

it's tricky to visualize right now perhaps, but i'm planning to essentially add a third tier. to see it, imagine placing vertices in the center of each triangle in the highlighted second tier mesh. then each triangle is split at its vertices, forming three triangles with their apex being the central vertex... these are traditionally called 'voronoi' polygons or something, or just fans really in opengl terminology. anyhow, that results in slightly acute triangles, which is really not a good thing visually i imagine. so the final step, before offline preprocessing the strips, is to reverse the edges, which yields triangles with much better cover properties. the end result is that now each distance test yields three triangles rather than one... and the overall resolution goes up considerably without sacrificing the great subdivision qualities of the 8x8 mesh. all this comes at zero cost, save for rendering the extra triangles, which is really a gain though, because it means not needing to render so many meshes.

so i hope this cleared something up... another one of my goals perhaps by posting here is to build up a significant record to claim 'prior art' in case anyone tries to pull a patent out from underneath me.

sincerely,

michael

3B
03-03-2005, 08:54 PM
OK, I think I more or less understand now...

Can you change the topology of the mesh as you tessellate it in tier 1, or do things like holes need to be in the coarsest level? for example: could a plane with small holes be simplified to a plane with no holes?

Are you sure you need to worry about LOD within the tier 2 patches? From what I've seen, the current trend is towards doing LOD as coarsely as possible, since the GPU can handle polygons faster than the CPU can handle the LOD (or even the drawing itself).

With 64 tris per batch, you will have a good bit of CPU overhead just sending the draw commands to the card...On my system (2.6G p4, 6800GT), I seem to remember being able to send ~1M batches per second last time I tested, which would give a max of 64Mtri/sec. With larger batches, the GPU can easily* handle twice as many triangles.

*: The main limitation I run into seems to be triangle setup, with a large dependency on how many texcoords the vertex program outputs (ranging from 100M visible tris with no texcoords, to 20M with 4 values in all 8 texcoords). Back facing and culled tris are significantly faster (limited by vertex program length), so for normal views with ~half of the scene backfacing, numbers are a good bit higher.

knackered
03-04-2005, 01:07 AM
Oh, I get it now. No, batch size too small, batch count too high. I wasn't imagining the CPU cycles, you were just not considering that the driver eats cpu cycles.
Back to the drawing board, michagl.

michagl
03-04-2005, 06:36 AM
Originally posted by 3B:
OK, I think I more or less understand now...

Can you change the topology of the mesh as you tesselate it in tier 1, or do things like holes need to be in the coarsest level? for example: could a plane with small holes be simplified to a plane with no holes?
that is a complicated question at this stage of development... i'm sure there are many non-exclusive strategies which could rectify such a situation. like i've said, it is not a cure-all system, but the specs are always chosen to leave as much room for robustness in the future. presently, as i've stated, the system is limited to projective type geometry... presently the best model with a hole it could do is something like a torus. the basic problem is there needs to be some way to derive the curvature of the ever finer model.

it isn't a 'precomputed progressive' mesh type system as i understand them (basically a mesh with a tree type structure); it isn't that. and it doesn't commit to a single LOD across the board. so it is best used in cases where the scope of the geometry is large enough so that only a portion of it might be desirable to detail. so if the entire geometry can be fit inside the view frustum, it is probably best at that point to switch to a different LOD system.

if you are thinking in a video game mindset, try to imagine a vr world, where the detail never wanes no matter how close the camera comes to surface geometry.

the basic philosophy is you don't want your triangles to be smaller than a pixel, and you want your transitions to be seamless.


Are you sure you need to worry about LOD within the tier 2 patches? From what I've seen, the current trend is towards doing LOD as coarsely as possible, since the GPU can handle polygons faster than the CPU can handle the LOD (or even the drawing itself).

yeah sure, if you don't want to take surface topology into consideration, then i'm sure you could use a limited set of tileable 'batches' at any resolution you please. i will probably program that as an option in the future.


With 64 tris per batch, you will have a good bit of CPU overhead just sending the draw commands to the card...On my system (2.6G p4, 6800GT), I seem to remember being able to send ~1M batches per second last time I tested, which would give a max of 64Mtri/sec. With larger batches, the GPU can easily* handle twice as many triangles.

well, for what it's worth, the final batch size for the approach outlined here will probably be a max of 192, which would be three times as many triangles rather than twice as many. to be honest though, i really don't program for hardware... hardware changes with trends and perceived needs and paradigms; i try to think in terms of as few computational steps as possible. and for what it's worth, i imagine this triangle 'fluffing' process could probably be applied recursively without limit, meaning that a single lod tested vertex could drive as many triangles as is desirable... accommodating this would just mean large offline preprocessing times, and would probably only affect run-time positively, as long as the numbers are tailored to the operating hardware.



*: The main limitation I run into seems to be triangle setup, with a large dependency on how many texcoords the vertex program outputs (ranging from 100M visible tris with no texcoords, to 20M with 4 values in all 8 texcoords). Back facing and culled tris are significantly faster (limited by vertex program length), so for normal views with ~half of the scene backfacing, numbers are a good bit higher.

well, there is a system i'm developing which i call MAPS, or the "map system", or maybe "massive array processing system". it basically offers an opengl interface for stream reading and writing to massive on-disk images. at the point that it is fairly functional, i will shift the texturing mode of the 'genesis' system... which is the system we've been discussing. currently vertices are given texture coordinates which reference a shared map(s), but once i complete the MAPS triangle drawing system, then each node will get its own personal texture map, at which point, at least under many conformal cases, the texture and tangent coordinates would be implicit. another way to relieve triangle setup would simply be to embed the triangle vertices directly in a float aligned map file. setting the vertices up from there would simply be a matter of streaming them off disk.

i'm not painting any particular approach as a panacea, this is just a starting point.

more than anything though, i would like to gather any inside insight i can get with respect to the VBO api and other concerns here, mostly optimizing graphics memory management and cpu/gpu parallelism.

michagl
03-04-2005, 06:44 AM
it's also worth saying that i don't intend to send anything close to a million batches a second to the gpu. i also am not designing the system with the intention of maximizing the gpu across the board. it is meant to be a simulation environment, meaning that it shouldn't require much more than 10% of the total frame time. that is to say that most objects within a visual scene probably would not require such considerations... unless of course the camera is intensely focused, to the exclusion of all else, upon a roam mesh, in which case some sort of 'performance throttling' might be desirable.

3B
03-04-2005, 09:09 AM
Originally posted by michagl:

Can you change the topology of the mesh as you tesselate it in tier 1, or do things like holes need to be in the coarsest level?
that is a complicated question at this stage of development...
Yeah, that's why I was curious if you had a solution for it :)


it isn't a 'precomputed progressive' mesh type system as i understand them (basically a mesh with a tree type structure). and it doesn't commit to a single LOD across the board.

I've seen at least one paper that described variable LOD progressive meshes, seem to recall it being just sort of a hand waving 'and we could add this' bit at the end though...



if you are thinking in a video game mind set, try to imagine a vr world, where the detail never wanes no matter how close the camera comes to surface geometry.
That was the situation I was thinking about...with things like tunnels, and arches, that would need to exist in the mesh even if the entire feature was < 1 pixel.


yeah sure, if you don't want to take surface topology into consideration, then i'm sure you could use a limited set of tilable 'batches' at any resolution you please. i will probably program that as an option in the future.

well for what it's worth, the final batch size for the approach outlined here will probably be a max of 192, which would be three times as many triangles rather than twice as many. to be honest though, i really don't program for hardware... hardware changes with trends and perceived needs and paradigms; i try to think in terms of as few computational steps as possible. and for what it's worth, i imagine this triangle 'fluffing' process could probably be applied recursively without limit, meaning that a single lod tested vertex could drive as many triangles as is desirable...

The point is, with modern hardware, you don't need to take local surface topology into consideration... basically, just take the x3 idea farther, and draw say a 32x32 patch instead of a 3 triangle fan, at which point you have the equivalent of doing no LOD in tier 2, and you can stop tesselating tier 1 sooner :)
It isn't so much programming for specific hardware, as it is programming for the trend of GPU power increasing faster than CPU power or monitor resolution...

Not saying its definitely better than what you are doing, just curious whether you have tested to see if you really need the extra work to determine triangle level LOD (and the extra GPU ram for the 200k strip index sets). Numbers I would be curious about : how much performance do you lose if you always draw all 64 tris in tier 2? if you stop tier 1 a level higher, and draw 16x16 tris in tier2, or 2 levels in tier 1 and 32x32 tris.




The main limitation I run into seems to be triangle setup,

setting the vertices up from there would simply be a matter of streaming them off disk.

sorry, wasn't clear...I meant triangle setup as in the step between vertex program and fragment program on the GPU, not the streaming from disk part.


more than anything though, i would like to gather any inside insight i can get with respect to the VBO api and other concerns here. mostly optimizing graphics memory management and cpu/gpu parallelism.

The biggest answer here is, send bigger batches to the GPU :)

Aside from that, if your LOD scheme can handle data not being loaded immediately, you might try streaming data in another thread : map the VBO in main thread (or whichever does all the GL calls), pass the pointer to a loader thread, then signal the 1st thread when loading is done to unmap the buffer and start using the data.
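As a rough sketch of the map-then-load-in-a-worker-thread pattern described above, something along these lines might work. All names here (loadPatchFromDisk, the polling helper) are illustrative placeholders, not part of any real API, and the important constraint is that GL calls stay on the thread that owns the context while the worker only touches the mapped pointer.

#include <GL/glew.h>
#include <atomic>
#include <thread>
#include <cstddef>

std::atomic<bool> g_loadDone{false};

void loadPatchFromDisk(float* dst, std::size_t vertexCount);   // app-specific, placeholder

void beginStreaming(GLuint vbo, std::size_t vertexCount, std::thread& worker)
{
    // GL calls happen only on the thread that owns the context.
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, vertexCount * 3 * sizeof(float),
                 nullptr, GL_STREAM_DRAW);             // allocate fresh storage
    float* ptr = static_cast<float*>(
        glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY));  // pointer is usable from another thread
    g_loadDone = false;
    worker = std::thread([ptr, vertexCount] {
        loadPatchFromDisk(ptr, vertexCount);           // no GL calls in here
        g_loadDone = true;
    });
}

bool finishStreamingIfReady(GLuint vbo)
{
    if (!g_loadDone) return false;                     // poll once per frame
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glUnmapBuffer(GL_ARRAY_BUFFER);                    // only now is the buffer safe to draw from
    return true;                                       // caller joins the worker before reusing it
}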


it's also worth saying that i don't intend to send anything close to a million batches a second to the gpu. i also am not designing the system with the intention of maximizing the gpu across the board. it is meant to be a simulation environment, meaning that it shouldn't require much more than 10% of the total frame time.

Even more reason to dump extra calculations. If you are minimizing CPU load and aren't pushing the GPU, it seems kind of silly to do extra CPU work to cut down on the GPU load :)

michagl
03-04-2005, 11:50 AM
ok, there is a lot in this last post to respond to, so i'm not going to attempt to detangle all of the quotes. i have two windows open, and i'm just going to dictate between the two, and not miss any points.

as for holes in meshes: the way i imagine it, this system is really designed to kick in once the camera gets close enough to a mesh that normal geometry, like holes, could already be built into the base mesh. it's really a system for fully parameterized meshes, like a nurbs model for instance. the idea is to never have visible jagged edges in your geometry.

however it can be used in a projective fashion to build very complex geometry out of a very simple base mesh, a model of earth out of a dipyramid for instance. a simple plane with holes in it, on the other hand, would probably be best modeled with a raytracing csg type system. that is, you would have your plane, and then maybe subtractive cylinders embedded in it.

you're right that it is impossible to derive detail that just isn't there... that is just a fact of reality. if you wanted to smooth a polygonal model for instance, you would basically just be left with the task of parameterizing every triangle by fitting curves to the mesh, essentially turning it into a parameterized patch (nurbs) type model.

smoothing is not everything though, the system also does surface displacement very well. modeling the shaft of a roman column for instance would be quite simple. you could model the bark of a tree by tesselating the base mesh and displacing it, but progressively smoothing the curvature of the tree would not be possible, unless the tree was fully parameterized (nurbs), or smoothing displacements were actually built into a displacement map -- which is really not unreasonable, as i believe Doom3 actually does something like this, to produce low poly count normal mapped models from high poly count models.

the point is, you just can't get detail where it isn't unless you just want to make a best guess. the system also would do a lot better with highly regular meshes... your triangles would need to be as close to equilateral as possible for ideal results.

at the end of the day, it is really just a better real-time tesselator... however you derive your coordinates is another matter altogether.

for instance, i've seen popular modeling systems like maya trying to tesselate nurbs models in real time, and the results are just laughable. i don't know if there is dedicated hardware or not for this -- i do know opengl has a nurbs-type api however.

for what it's worth though, i haven't yet implemented nurbs type support, though i believe i am planning on it, but it is a fairly low priority well after getting everything else solid.
what it does really well at the moment is building planetary worlds... it could build them on a sphere, inside a cylinder, in a torus, and just about anything like that. it could also be hacked pretty easily to do convex geometry where a base mesh is available, or do geometry images like in hoppe's papers. you can also build up some interesting models from typical raytracing primitives... one interface is simply a projective function which takes a point and a time (t) and returns the proper projection of that point.

it's also worth noting that the system handles scale very well. the first tier is done with double precision; the second tier gets by with single precision. this is another major win, because pretty much all of the data points are in the 2nd tier. the second tier geometry is all normalized and transformed into a local space for optimal precision and easy computations.
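A minimal sketch (not the author's code) of how that double/float split can be combined at draw time: tier-1 node origins live in double precision, tier-2 vertices are stored as single-precision offsets from their node's origin, and the camera-relative translation is computed in double before being handed to GL. The struct and field names are assumptions for illustration only.

#include <GL/gl.h>

struct NodeInstance {
    double origin[3];          // tier-1 position, double precision
    GLuint vbo;                // tier-2 vertices stored as float offsets from origin
    unsigned indexCount;
};

void drawNode(const NodeInstance& n, const double camera[3])
{
    // Subtract in double so the large world coordinates cancel first,
    // then the small remainder fits comfortably in float.
    const float tx = static_cast<float>(n.origin[0] - camera[0]);
    const float ty = static_cast<float>(n.origin[1] - camera[1]);
    const float tz = static_cast<float>(n.origin[2] - camera[2]);

    glPushMatrix();
    glTranslatef(tx, ty, tz);
    // ... bind n.vbo, set vertex pointers, glDrawElements(..., n.indexCount, ...)
    glPopMatrix();
}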


I've seen at least one paper that described variable LOD progressive meshes, seem to recall it being just sort of a hand waving 'and we could add this' bit at the end though...

i'm not sure about the nature of the attempt there, but this system does achieve a variable LOD mesh very well, very elegantly, and extremely efficiently i would argue.



That was the situation I was thinking about...with things like tunnels, and arches, that would need to exist in the mesh even if the entire feature was < 1 pixel.
yeah, there is no way to get around that fact... it is generally a good idea for continuity to have mipmaps as well, to avoid non-representative sampling. basically the data has to be available somewhere. to aid in this though, the MAPS system i'm developing facilitates streaming very robustly, and can locally compress and even encrypt on-disk data... local compression would definitely be a good idea if the data is on optical disk.

in the future though, simulations will just have to share local databases. that is to say, at some point a horse simulation will be so realistic, and the database so large, that there would be no rational reason for every game to have its own local disk space for its own proprietary horse database, when in the end the result is all the same. the same can basically be said about detailed databases, like maybe the leaf characteristics of every documented tree. corporations will either have to get together on this, or people will build their own cooperative non-proprietary centralized databases, and entertainment corporations won't be able to keep up... of course complete corporate consolidation (megacorp) is probably the more realistic alternative for the commercial sector.

---

here is the thing about knocking out tesselation in tier 2. it sounds tempting, but if you do it, you will get star shaped seams in the mesh. personally i prefer a nice smooth spherical tesselation. if you just saturate every tier 1 triangular region, then you are going to get zones where triangles of higher density stick out. i can provide this as an option, but it isn't going to be the center of my thesis. the problem with this kind of approach is that you are always artificially inflating the detail, hoping to over saturate the scars in the mesh. and as well, trust me, it is much more noticeable when a whole line of triangles all of a sudden changes resolution, rather than doing it gradually one at a time; the eye just picks it out much more easily. and finally, you have to decide from where you are going to gauge the lod test. do you do it on the center of the triangle? or the nearest vertex? when you start breaking up massive blocks of triangles based on the nearest vertex, you tend to get greater inconsistencies along the borders... that is, your subdivision pattern macroscopically might not be a nice octagon, but maybe something more resembling a buzzsaw.

finally, as far as tesselating for topological turbulence, it is really easy enough to do once you are already testing against the frustum. you just scale the frustum test by the topological variance weight of the tested vertex. if there is no variance, then the lod test comes to zero, and no subdivision is done. also, adding this kind of noise to the tesselation tends to create a less regular tesselation pattern, especially in heavily topologically variable regions. the eye notices when, say, a shock wave of tesselation is emanating from it. and you really don't want to have to do 'vertex morphing'. a general rule that is good for avoiding vertex morphing is to never update the mesh unless the camera is in motion. it's virtually impossible to notice slight swimming while the camera itself is swimming. you can get away with lower counts this way.
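A sketch of the split test as described, with made-up names: the distance-based priority is scaled by the vertex's precomputed variance weight, so a vertex on perfectly flat geometry (variance == 0) never triggers a split no matter how close the camera gets. The attenuation constants and threshold are assumptions standing in for whatever the real system uses.

#include <cmath>

struct LodParams {
    float constant;   // constant attenuation
    float linear;     // linear attenuation
    float quadratic;  // quadratic attenuation
    float threshold;  // split when priority exceeds this
};

bool shouldSplit(const float v[3], float variance,
                 const float cam[3], const LodParams& p)
{
    const float dx = v[0] - cam[0];
    const float dy = v[1] - cam[1];
    const float dz = v[2] - cam[2];
    const float d  = std::sqrt(dx * dx + dy * dy + dz * dz);

    // Attenuated importance, weighted by surface variance; zero variance never splits.
    const float priority = variance / (p.constant + p.linear * d + p.quadratic * d * d);
    return priority > p.threshold;
}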

as far as pushing the 'fluffing' system: first of all, just in case it isn't clear, the 3 triangle fans would not be rendered as fans, but integrated into the triangle strip for each given permutation. this approach would scale as 3+9+27+81..., so that at level two each lod tested face would yield 12 faces, at level 3 each face would yield 39 faces, and at level 4, 120 faces.

so for a fully tesselated tier 2 mesh, the triangle counts would be:

0:64 - 1:192 - 2:768 - 3:2496 - 4:7680

vertex counts would be like:

0:45 - 1:172 - 2:553 ...

might have to do something special along the borders with vertices, but it should be doable.

basically, though, if you go past level 1, then it means using 16bit strip buffers. and i have a feeling level one would look good enough, and i intend to implement it soon. deeper levels get more complex; i figure i won't implement them for a long time, but probably will eventually. no matter what happens though, it's all done offline.
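For what it's worth, the per-level triangle counts quoted above follow directly from the 3+9+27+... expansion applied to the 64 faces of a fully split tier-2 mesh; a tiny throwaway program reproduces them (the loop structure is just illustrative arithmetic, not anyone's actual code).

#include <cstdio>

int main()
{
    const int baseFaces = 64;                 // fully split 8x8 tier-2 mesh
    for (int level = 0; level <= 4; ++level) {
        // past level 0, each LOD-tested face becomes 3 + 9 + ... + 3^level faces
        int perFace = 0, power = 1;
        for (int k = 1; k <= level; ++k) { power *= 3; perFace += power; }
        if (level == 0) perFace = 1;          // no fluffing: the face itself
        std::printf("level %d: %d tris\n", level, baseFaces * perFace);
    }
    return 0;                                 // prints 64, 192, 768, 2496, 7680
}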

---

one goal for this kind of system would be to introduce extremely ornate geometry into vr. the true underlying detail would not be revealed until the user investigates very closely, at which point all resources could be focused on the ornate geometry. it is especially well suited for geometry which can not be fit entirely within the view frustum at close range.



The biggest answer here is, send bigger batches to the GPU :)

Aside from that, if your LOD scheme can handle data not being loaded immediately, you might try streaming data in another thread : map the VBO in main thread (or whichever does all the GL calls), pass the pointer to a loader thread, then signal the 1st thread when loading is done to unmap the buffer and start using the data.
i'm extremely interested in the 'mapped' version of the VBO api. the last time i looked at the VBO specs... the mapped version to me seemed to be more for the convenience of the application programmer than the drivers. did i take away the wrong impression? and can the mapped interface be used to bolster parallelism?

in the end though, you might just be surprised how little it takes to LOD tesselate the static tier 2 meshes. unlike every ROAM system i've ever seen, the connectivity data is all completely precomputed. it is really little more than a single distance calculation scaled by constant linear and quadratic attenuation factors, for each face actually split. after that is done, a repair algorithm quickly splits necessary triangles to keep the mesh legal. then a border mending operation corrects the seams between meshes. everything is extremely lightweight.

an optimization with pciexpress might be to let the video card compute the scaled distance factors for every vertex and pass them back to system memory through its read lane. the only loss would be that the distances would be computed for every face even if the deeper faces are never split. but for the gpu, calculating distance and a little scaling ought to be trivial, and it could even do 4 vertices at a time and return their 4 factors in the out vector. the vertices would already be in video memory, and just the camera position would need to be uploaded as a 'uniform' variable. (i don't work with shaders on a regular basis, so 'uniform' might be the wrong terminology)
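As a hedged sketch of what such a pass might compute per vertex: since vertex units cannot write to memory, the positions would have to be fed in through a texture and the factors rendered out and read back, with the camera position and attenuation constants supplied as uniforms. The names and packing here are illustrative only; all the render-to-texture plumbing is omitted.

static const char* kLodFactorShader =
    "uniform sampler2D positions;   // tier-2 vertex positions packed as RGB texels\n"
    "uniform vec3      cameraPos;\n"
    "uniform vec3      attenuation; // (constant, linear, quadratic)\n"
    "void main()\n"
    "{\n"
    "    vec3  p = texture2D(positions, gl_TexCoord[0].xy).xyz;\n"
    "    float d = distance(p, cameraPos);\n"
    "    float f = 1.0 / (attenuation.x + attenuation.y * d + attenuation.z * d * d);\n"
    "    gl_FragColor = vec4(f);    // read back over the bus and used by the CPU split pass\n"
    "}\n";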

knackered
03-04-2005, 01:37 PM
It's not the weight of the lod calculation he's referring to, it's the presence of the lod calculation on such small batches. You're feeding a hungry gpu at the same rate you're calculating your lods, therefore you're not working in parallel with the gpu. Think about it. That's the whole point of having a co-processor.
I fail to see why you feel you needn't concern yourself with hardware specific details when the whole point of lod techniques is to work around hardware limitations. Otherwise why not throw the full resolution mesh at the hardware and just complain that the vendors aren't doing their job properly?
It's these kind of implementation-specific details which should *dictate* your approach to the high level design. Something which you seem to keep side-stepping, while maintaining an aloof and arrogant stance. This is why I, at least, am responding to you in a somewhat aggressive manner.
Oh, and for gods sake go to an internet cafe or something...you're giving the impression you're communicating via hand signals, strapped to an oil drum in peru or something.

michagl
03-04-2005, 03:18 PM
Originally posted by knackered:
It's not the weight of the lod calculation he's refering to, it's the presence of the lod calculation on such small batches. You're feeding a hungry gpu at the same rate you're calculating your lods, therefore you're not working in parallel with the gpu. Think about it. That's the whole point of having a co-processor.
I fail to see why you feel you needn't concern yourself with hardware specific details when the whole point of lod techniques is to work around hardware limitations. Otherwise why not throw the full resolution mesh at the hardware and just complain that the vendors aren't doing their job properly?
It's these kind of implementation-specific details which should *dictate* your approach to the high level design. Something which you seem to keep side-stepping, while maintaining an aloof and arrogant stance. This is why I, at least, am responding to you in a somewhat aggressive manner.
Oh, and for gods sake go to an internet cafe or something...you're giving the impression you're communicating via hand signals, strapped to an oil drum in peru or something.

first of all, i don't give much credence to you because you are a ridiculously negative sort... as for your complaints about my communication, i don't understand what you are trying to say... as for your more redeemable concerns, i will try to address them.

first of all, as for parallelism, i doubt this makes much of a difference, but the lod and uploading is done in its own pass. rendering is done in another pass along with frustum culling. cpu operations are all very lightweight, and for what it's worth lod is managed in an extremely staggered fashion, meaning only a very small fraction of 'nodes' are updated with respect to lod each frame. it's not as if the whole lot of them are being updated every frame. and even if they were, it is still very lightweight. all the cpu is really doing while the gpu is rendering is a few dot products against frustum planes per batch, that is, if the frustum is active... so in reality if i really wanted to be focusing on parallelism, i should probably be looking into giving the cpu more to do.

i'm leaving the lod phase open for the gpu so i can integrate pciexpress reading capabilities there. right now only VBO uploads are done in that phase, and only if new nodes are created, which is generally fairly seldom.

i'm not trying to knock the socks off anyone, just looking for hardware advice. i know my hard constraints as far as algorithms are concerned well enough.

michagl
03-04-2005, 03:41 PM
let me just make it clear that i understand your points about pushing a measly 64 triangle max stripped mesh through the gpu.

however there really isn't that much the cpu has to do while rendering, save for the driver's use of the cpu and some quick frustum tests, which max out for visible nodes (and only occur if the camera is active -- optimizations could easily be made to only test nodes in the non-intersecting regions of the new and last frustum)... the mesh is recursively traversed, but this could be avoided with a queue filled on the first pass. other than that, nothing is going on.
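A minimal example of the kind of per-batch test being described: a node's bounding sphere against the six frustum planes, one dot product per plane. The plane convention (inward-facing, normalized normals) and the struct names are assumptions for illustration.

struct Plane  { float nx, ny, nz, d; };              // inside when nx*x + ny*y + nz*z + d >= 0
struct Sphere { float x, y, z, radius; };

bool sphereInFrustum(const Sphere& s, const Plane planes[6])
{
    for (int i = 0; i < 6; ++i) {
        const Plane& p = planes[i];
        const float dist = p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
        if (dist < -s.radius)
            return false;                            // completely outside this plane
    }
    return true;                                     // inside or intersecting the frustum
}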

plus in my defence, it is only very lately that i've switched to the 8x8 model... i've been working with unstripped 16x16, which pushed the gpu fairly well. 32x32 was not possible, and the planned 8x8x3 ought to approach the 16x16 in terms of gpu load.

finally, you are also assuming that the shader model is very simple, if anything. a heavy shader model could easily fill that gap up real quick. i'm assuming you've seen the shaders going into games like 'half-life2', i think it was called. i don't think there would be much of a cpu/gpu gap left there, assuming that fill is not the limiting factor -- ie. triangles not too big in screen space. i intend to try to provide a wide range of options as best i can. i can't however begin to even consider dismissing the technique, which i feel is one of the most promising i've seen in a long time. nothing to do with personal pride.

knackered
03-05-2005, 03:27 AM
Ok, I understand. You obviously know what you're doing.

michagl
03-05-2005, 09:39 AM
oiii, is there no way to limit the number of posts per page in these forums?


Originally posted by knackered:
Ok, I understand. You obviously know what you're doing.

i appreciate the apparent pause in your relentless thrashing, but i can't help but wonder if i'm supposed to glean a hint of sarcasm from this???

nevertheless, taking you at face value, i appreciate your attention and effort to grasp my business.

i wish i could've made the situation more clear from page 1, but the mean attention span of bbs patrons is pretty narrow, and don't forget that you even insisted that i be less specific.

still, i admit, i really don't keep up with hardware developments down to the last drop. i've watched hardware come and go just like everyone else who has been around for more than a short spell... and for the projects i typically concern myself with, hardware is ever an afterthought which can be caught up with momentarily at any time as needed.

so what i'm trying to say, i'm still very interested in discussing hardware innovations. especially the mapped VBO api.

i'm also curious if anyone can say... what is the best approach to work out how many pixels per vertex for a given piece of hardware? that is, for a given number of shader cycles, there ought to be a ratio which would tell you your optimal projected triangle size, in order to synchronize the vertex and pixel shader units. i figure the pixel unit must work at a much higher turnover than the vertex unit... and i figure gpus probably support more than one of each... or at least will someday.

like i say, i'm definitely no hardware guru. i work with abstract systems... that is, i'm not trying to turn something out to the market every other quarter. that is why i'm here looking for related hardware advice.

sincerely,

michael

zed
03-05-2005, 10:41 AM
i finally understand the term 'to waffle'


what is the best approach to work out how many pixels per vertex for a given piece of hardware.

throw theory and expected results out the window (if u care about actual results) and test by writing a (*)benchmarking test that iterates through the various conditions

(*)though of course benchmark in itself aint a natural situation, so mightnt correspond to an actual app, but at least it will give a better idea than theory
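A rough skeleton of the kind of benchmark being suggested: sweep one condition (here, on-screen triangle size), drain the pipeline with glFinish() around the timed region, and record triangles per second. drawTestPatch() and the chosen sweep range are placeholders for whatever the real app renders.

#include <GL/glew.h>
#include <chrono>
#include <cstdio>

void drawTestPatch(float triangleScreenArea, int triangleCount);  // app-specific, placeholder

void benchmarkTriangleSizes()
{
    for (float area = 1.0f; area <= 256.0f; area *= 2.0f) {
        const int triangleCount = 1 << 20;
        glFinish();                                   // start from an idle pipeline
        const auto t0 = std::chrono::steady_clock::now();
        drawTestPatch(area, triangleCount);
        glFinish();                                   // wait for the GPU to finish the work
        const auto t1 = std::chrono::steady_clock::now();
        const double sec = std::chrono::duration<double>(t1 - t0).count();
        std::printf("%6.1f px/tri : %.1f Mtri/s\n", area, triangleCount / sec / 1e6);
    }
}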

knackered
03-05-2005, 10:51 AM
Originally posted by michagl:
i'm still very interested in discussing hardware innovations. especially the mapped VBO api.

There's nothing to discuss, it does exactly what the spec says it does. It's really simple, there's no real caveats anywhere...otherwise I'd tell you, and I've spent a fair length of time using it.
If the main reason you're filling up these pages with your monologues is to get some physical record that you had the idea first, then you're going the wrong way about accomplishing it. The idea (or rather, the specifics) are still unclear to my mtv-addled mind, certainly not clear enough to defend a patent.
Can't you write it down, then physically post it to yourself? Or deposit it in a bank?
That's what I did with my theory about George Bush and the popes 'illness'.

michagl
03-05-2005, 12:51 PM
hostile....


Originally posted by zed:
i finally understand the term 'to waffle'


what is the best aproach to work out how many pixels per vertices for a given piece of hardware.throw theory and expected results out the window (if u care about actual results) and test by writing a (*)benchmarking test that iterates through the various conditions

(*)though of course benchmark in itself aint a natural situation, so mightnt correspond to an actual app, but at least it will give a better idea than theory

waffling is like when you are talking to hitler, doing your best not to show expression, or be found out that you are really not a stone cold soulless sob... then you lose your composure. congratulations, you waffled... you're part of the real human race.

as for a vertex/pixel shader ratio: there ought to be a way to calculate what size of triangle in screen space is optimal for best synchronizing the vertex and pixel units. basically, assuming your shaders are linear, how many pixels can the pixel unit process in the time it takes the vertex unit to process 3 vertices, or fewer given stripping. yeah, i get the point that a benchmark is a good idea... but there ought to be a way to get an average number, assuming some golden ratio of cache hits.
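A back-of-the-envelope version of that ratio, under the stated assumption that both shaders scale roughly linearly: the screen-space triangle area at which the fragment unit finishes its pixels in about the time the vertex unit finishes its vertices. The throughput numbers below are illustrative placeholders, not measurements of any particular card.

#include <cstdio>

// Screen-space triangle area (in pixels) at which the vertex and fragment
// units finish at roughly the same time.
double balancedTriangleArea(double vertexRate,        // vertices/second the vertex unit sustains
                            double fragmentRate,      // fragments/second the fragment unit sustains
                            double vertsPerTriangle)  // ~0.5-1.0 with good strips/cache reuse, 3 worst case
{
    const double triangleRate = vertexRate / vertsPerTriangle;  // triangles/second
    return fragmentRate / triangleRate;                         // pixels available per triangle
}

int main()
{
    // e.g. 300 Mverts/s, 3 Gfrags/s, ~1 vertex per triangle with good reuse
    std::printf("balanced size ~ %.1f px/tri\n",
                balancedTriangleArea(300e6, 3e9, 1.0));
    return 0;
}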


arb_vertex_buffer_object specs

The latter technique is known as "mapping" a buffer. When an
application maps a buffer, it is given a pointer to the memory. When
the application finishes reading from or writing to the memory, it is
required to "unmap" the buffer before it is once again permitted to
use that buffer as a GL data source or sink. Mapping often allows
applications to eliminate an extra data copy otherwise required to
access the buffer, thereby enhancing performance. In addition,
requiring that applications unmap the buffer to use it as a data
source or sink ensures that certain classes of latent synchronization
bugs cannot occur.

i pored over the entire specification a year or more ago... and i still don't understand 100% what exactly is going on with the mapped api.

i'm assuming you map agp memory directly, so you can put your calculations directly into the buffer, which is only useful if you don't want a copy of the memory for yourself. is the pointer still valid after you unmap the buffer? how would an app behave if you mapped and unmapped the buffer regularly just to use it like system memory? does the data put in the mapped buffer ever go to video memory?

how come there can't be an interface to give the driver a pointer to your system memory and tell it to copy the buffer whenever it likes... maybe with a dma system bypassing the cpu entirely? doesn't the proposed pciexpress architecture allow for an expanded dma system?
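For reference, the two upload paths the spec excerpt above is contrasting look roughly like this: glBufferData/glBufferSubData do take a client pointer and leave the copy scheduling to the driver, while glMapBuffer hands back the buffer's own storage so the data can be written in place. Buffer usage hints, sizes and fillVertices() are illustrative only.

#include <GL/glew.h>

void fillVertices(float* dst, int vertexCount);       // app-specific, placeholder

void uploadByCopy(GLuint vbo, const float* src, int floatCount)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    // Driver copies from 'src' into its own storage; 'src' remains the app's afterwards.
    glBufferData(GL_ARRAY_BUFFER, floatCount * sizeof(float), src, GL_STATIC_DRAW);
}

void uploadByMapping(GLuint vbo, int floatCount)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, floatCount * sizeof(float), nullptr, GL_STREAM_DRAW);
    float* dst = static_cast<float*>(glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY));
    if (dst) {
        fillVertices(dst, floatCount / 3);            // write straight into the buffer's storage
        glUnmapBuffer(GL_ARRAY_BUFFER);               // the pointer is invalid after this call
    }
}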

that's all i have to say. i'm losing my tolerance for this crowd.

would be happy to hear from '3B' again though before i trot off.

knackered
03-05-2005, 02:50 PM
I'd hardly say you drew a crowd.
Good luck with your revolutionary new approach to lod. Be sure to credit me in your paper.

michagl
03-05-2005, 03:31 PM
Originally posted by knackered:
I'd hardly say you drew a crowd.
Good luck with your revolutionary new approach to lod. Be sure to credit me in your paper.

credit for what? and what papers? nothing personal, i'm just not crazy about this sort of atmosphere... the 'crowd' allows its members to be nasty to guests. it's all extremely counterproductive. it's a shame people can't communicate functionally, especially with the lack of easy access to technical docs and ever changing hardware. i will never get over how juvenile the graphics programming scene is.

and as for my previous post, please don't bother responding with an 'implementation dependent' rag.

Korval
03-05-2005, 03:47 PM
credit for what? and what papers? nothing personal, i'm just not crazy about this sort of atmosphere... the 'crowd' allows its members to be nasty to guests.

Yeah, I have no idea why Knackered hasn't been banned by the moderators. I guess it's because he's been here for a while. The best thing you can do is just ignore him. He likes getting a rise out of people, so it's best not to indulge him. Just pretend that he didn't post at all, and you should be fine.

knackered
03-05-2005, 04:26 PM
Sorry Korval, didn't see you there.
What would you have me banned for this time, stalin?

zed
03-05-2005, 08:05 PM
but there ought to be a way to get an average number, assuming some golden ratio of cache hits.

like i think i said, dont trust theory, in nvidia pdf's about things u usually(always?) see benchmark numbers, i assume these guys know the hardware well, ie theory and reality are often different


waffling is like when you are talking to hitler

cant say ive had the experience, though i was talking to a whitepower neonazi at work on friday, i was waffling on about nz being the laughing stock of the world (i was listening to nationalradio at the time) anyways i was saying thats not a bad thing at all.

your posts here (as well as knackered's i might add) have been a source of amusement to me, well done, continue to spread that joy around

knackered
03-06-2005, 03:16 AM
Thank you for your kind words, zed.
You should all look on me as the deformed son nobody in the family ever talks about, who occasionally stumbles up the cellar steps during dinner parties.

I'm sure michalg understands - he sounds like a tolerant man of the world with a keen sense for self ridicule.

michagl
03-06-2005, 07:49 AM
at least this thread is finally more than a single page.


Originally posted by zed:

but there ought to be a way to get an average number, assuming some golden ratio of cache hits.

like i think i said, dont trust theory, in nvidia pdf's about things u usually(always?) see benchmark numbers, i assume these guys know the hardware well, ie theory and reality are often different


waffling is like when you are talking to hitler

cant say ive had the experience, though i was talking to a whitepower neonazi at work on friday, i was waffling on about nz being the laughing stock of the world (i was listening to nationalradio at the time) anyways i was saying thats not a bad thing at all.

your posts here (as well as knackered's i might add) have been a source of amusement to me, well done, continue to spread that joy around

first of all, i don't have time to be amusing.... and if i did, i certainly wouldn't do it directly in this venue.

as for vertices versus pixels, the whole idea is to be able to decide an optimal area for triangles in screen space.

finally as for nazis... you missed the point. i chose hitler because i've been told that bringing up hitler, has traditionally meant calling for the end of a thread within the bbs community. but at the same time, i was making the observation that criticizing people for 'waffling' is essentially a fascist notion.

finally as for banning... normally i'm not a big advocate. i believe its up to the community to see that it presents a welcoming face to its potential contributors and guests.

however in this case... first off, opengl affects so many people's lives in such a significant way. if you want to be aggressively confrontational in a destructive manner... i would suggest finding an unofficial opengl venue, or maybe visiting the social forum if there is one here at opengl.org.

this forum (coding advanced) however should not be a social or even political forum... it is a technical forum, and for this reason personal dispositions should be left at the door, especially if they are time consuming and destructive.

in this particular forum (coding advanced)... in order to foster a constructive service for opengl contributors, i personally would see fit to ban confrontational aggressors, and discourage defensive confrontation. you simply can't ban a person for defending themselves.

i wouldn't recommend a total ban. just a few warnings, then start with week long bans, and work up to months. a lifetime ban is a lot like a death sentence, and i could not recommend it, as it allows no spirit for personal reformation.

and no, i'm seriously irked by knackered... though it has possibly forced me on occasion to think more critically about what i'm doing. and it may have been the contributor that brought up FPS, which inadvertently led to the vsync discussion. but on the whole it has been a total bother.

knackered
03-06-2005, 10:04 AM
I believe you to have been the primary aggressor in this thread, michalg.
After I pointed out that your initial post was a little long, you responded:-

the post requires about a minute to read if you are comfortable with english.......finally i don't keep up with pop terminology, but if hoppe has published it i've read it, unless it is very new.

This is aggressive and patronising by anyone's standards, and was totally unprovoked by anything I had previously said.
You set the tone for the rest of the thread.
I don't mind at all, however. I don't want you banned for being a condescending, patronising, disrespectful, pompous, intellectual snob. It's entirely your right to conduct a discussion in whatever frame of mind you like, and it's up to the others in the discussion to decide whether or not to respond to you. Like korval says, don't respond to me and I'll go away.
Alternatively, if you click on my 'profile' button you can add me to your 'ignore list', which, although I've never tried it myself, presumably makes all my posts magically disappear from your browser.

michagl
03-06-2005, 10:31 AM
the post requires about a minute to read if you are comfortable with english.......finally i don't keep up with pop terminology, but if hoppe has published it i've read it, unless it is very new.
This is aggressive and patronising by anyone's standards, and was totally unprovoked by anything I had previously said.

there is nothing provocative, much less aggressive or patronising, in those words. they are quite matter of fact. why would i chastise you if you cannot find a handful of minutes in your schedule to read, much less address, a matter to which you owe no obligation whatsoever.

i do not keep up with pop terminology... it is not healthy for science, and only serves to create an elitist environment which precludes people from concerning themselves with science.

'clipmap' cannot be found in any contemporary dictionary, and even the words dissected, 'clip' and 'map', say nothing meaningful as a compound construction.

but this is beside the point... the simple fact that i do not keep up with pop terminology reflects nothing upon your character... in which case such a disposition can hardly be said to be confrontational, much less aggressive.

i would, were i you, try to take people at face value, especially when communicating via means which allow no real construct for conveying emotional expression.

finally, if you take offense at something, bring the matter up then and there, rather than bury it and feed from what is very likely only a misunderstanding.

sincerely,

michael

SirKnight
03-06-2005, 11:34 AM
Yeah, I have no idea why Knackered hasn't been banned by the moderators.
Dorbie once said they can't see IPs or ban anyone and that the mods have no control over the site.

-SirKnight

SirKnight
03-06-2005, 11:40 AM
You should all look on me as the deformed son nobody in the family ever talks about, who occasionally stumbles up the cellar steps during dinner parties.
haha! When you say deformed can we think 'deformed in the head' or do you prefer physical deformity? ;)

-SirKnight

SirKnight
03-06-2005, 11:48 AM
Alternatively, if you click on my 'profile' button you can add me to your 'ignore list', which, although I've never tried it myself, presumably makes all my posts magically disappear from your browser.
From the FAQ:



Ignore lists apply to private messages only. Anyone you add to your ignore list can no longer send you a private message. You can put someone on your ignore list by viewing the member's profile and clicking on the "add to ignore list" link. Private messaging must be enabled on this site in order to use buddy lists.
So no, everyone still gets to enjoy knackered in all his glory no matter what. :)

-SirKnight

plasmonster
03-06-2005, 03:46 PM
Hi Michagl,

I'm still trying to get my head around your LOD algorithm, but I'm a bit thick these days. I think most everyone who's given LOD any real thought has considered the possibility of a LUT of some kind. I'm a huge LUT fan, and use them wherever they make sense, and occasionally where they don't. I'm a huge fan of theory too: Theory is where all the good stuff starts. But I'm cautious in developing ideas in a vacuum, with little concern for the underlying hardware, for example. I'm keenly aware, however, that theory can and does inspire others, and inspired folks create and program the hardware we use, and contribute to this forum.

I would love to discuss LOD and terrain rendering in general, but I wonder if the math forum might be a better place for that. I'm up to my armpits in other stuff right now, so I couldn't really give it the attention it deserves. But I'm certain that there are others that could.

Also, in the absence of a complete understanding of your algorithm, it would be difficult to advise you on the API specifics. Maybe you could refine your questions with that in mind. Once you have a paper we could read, then it might be easier to target the API details, though it's likely to involve VBOs at the outset, as stated already. I believe, from your point of view, it would be better to simply gain an understanding of the API, such as it is, then apply it as only your intimate knowledge of the algorithm would allow. In other words, I think it would be easier for you to learn the VBO API than it would be for all of us to learn your algorithm. Anyway, just a few thoughts, and I would very much like to understand your process.

Oh, and consider excusing the behavior of some of the other guests. We can't have the yang without the yin ;) Rest assured, the vast majority of the visitors here are both friendly and generous.

michagl
03-06-2005, 06:07 PM
i appreciate your consideration graham. the basic structure of the algorithm is really very simple to take in. i think it has been described just about as well as i could ever really get around to describing it... though you might have to look here and there in the bloated first page. if you are really interested in continuous LOD, then i figure it would be worth your time. i believe i'm pretty familiar with most every fairly publicized system in this vein... and though like i've said before, there is nothing totally new in this approach, save for maybe the strip LUT, which i've never seen anything close to, it's really the confluence of all the different elements which all happen to complement one another just right where the numbers are concerned. in fact there are a lot of points in the system where it would totally fall apart if the numbers worked out just a little bit different.

as for working in a vacuum, i actually go out of my way to do so to some extent. i find that approaching a problem by beginning with the work of those who came before you tends to narrow the imagination. i mean, the approach, iterative steps versus leaps and bounds, depends on the sort of person. i personally do better, i find, starting from scratch, which tends to allow me to position myself "outside the box"... and so far the approach has given me nothing but success. so i plan to stick with it.

as for the vbo api, i utilize it all the time and know it well. i've pored over the readily available resources, but still there is just the chance of missing something... then there are the people who maybe work in the video game industry or something, who might specialize in this sort of thing way more than i intend to invest of myself, just in time for the nitty gritty hardware to change down the road.

i'm primarily an algorithm programmer. i don't fancy myself a hardware programmer, and if i really wanted intimate hardware api optimized code, simd, assembly, and what have you... i would probably be more comfortable asking an expert to do it for me, in a collaborative effort of course.

i can't say much more without either getting even more wordy, or revealing more personal information than i would like.

but like i said... i had a specific issue i wanted to solve... which has been solved, and i appreciate everyone who helped in that effort. (*i've worked for years without giving much thought to vsync; you miss things here and there when you are self taught*) discussing specific hardware/driver matters would just be icing on the cake. if no one is into it, it's not going to be the end of the world... and at this stage i'm definitely NOT* going to be compelled to step out into the wild blue yonder of the internet and try to seek out hypothetical solutions to problems which might not even really exist as far as contemporary hardware can provide.

i'm just trying to say that i'm content. i might have a few more opengl curiosities for other threads under my belt... though i don't keep a list unfortunately. i noticed texture borders changed with my last drivers. and i'm sure inevitably i might have some advanced opengl issues. and maybe i won't have to think twice before resorting to bbs next time. bbs systems usually intimidate me. too many high spirited types for my tastes. but on the whole everyone here has been very helpful and cordial. i only wish i could be of some service, but my opengl experience is generally more general than the sort of situations that tend to pop up in this forum. i utilize opengl a lot, but my applications are generally more on the general purpose and far reaching side... not too many far out demos, none actually.

don't get the wrong idea... i'm all up for discussion. but please do so from interest, rather than any perceived obligation.

it really is an interesting system for anyone interested in the domain.

sincerely,

michael

*edit: added negative and comment

knackered
03-07-2005, 01:40 AM
Originally posted by michagl:
i find that approaching a problem by beginning with the work of those who came before you tends to narrow the imagination.....i personally do better, i find, starting from scratch, which tends to allow me to position myself "outside the box"...

Perfectly honourable approach.


Originally posted by michagl:
but if hoppe has published it i've read it, unless it is very new

Why, given your plan to stay ignorant of other approaches in order to 'think out of the box'? Perhaps you're standing on the shoulders of giants, but aren't willing to give them any credit?


Originally posted by michagl:
it really is an interesting system for anyone interested in the domain.

How do you know?


Originally posted by graham:
I'm still trying to get my head around your LOD algorithm, but I'm a bit thick these days.
Originally posted by michagl:
i think it has been described just about as well as i could ever really get around to describing it...

Perhaps the readers of your description would be the best judge of that.

knackered
03-07-2005, 01:49 AM
Originally posted by SirKnight:
haha! When you say deformed can we think 'deformed in the head' or do you prefer physical deformity? ;)
-SirKnight

There's no need to bring my 12lb tumour into this conversation. As I keep saying, I'm scared of anaesthetics. :)

Adrian
03-07-2005, 04:01 AM
Originally posted by michagl:
and it may have been the contributor that brought up FPS, which inadvertently led to the vsync discussion.

Michael, I just wanted to point out *I* picked out vsync from your original post :) because I thought it was likely to be causing the performance issue you were seeing. I didn't explain exactly why because I wanted you to try it and see first. You decided not to.

One important thing to remember with vsync is that when you update your drivers vsync is reset to being on.

michagl
03-07-2005, 09:52 AM
Why, given your plan to stay ignorant of other approaches in order to 'think out of the box'? Perhaps you're standing on the shoulders of giants, but aren't willing to give them any credit?

well, like everything in the real world, it is a mixed bag. of course i wouldn't recommend tilting at windmills without at least getting your feet wet. i keep up with papers by the names of the authors, not by subject. you run across hoppe and others, usually presenting their work to various colloquia... you check them out, read up on their papers. i tend not to put subjects into google looking for solutions... besides, if i did, i doubt i would find much, with the use of pop terminology that no one could find without first being exposed to the terminology through often exclusive channels. as for credit, i can't give any to hoppe for anything related to this effort. of course i'm not the first to discretize lod geometry (at least not for planar algorithms), or utilize equilateral/right triangle recursive subdivision. other than whoever first did that stuff, i can't think of any credit to bestow upon anyone.


How do you know?

because i wouldn't say so if i didn't know.


Perhaps the readers of your description would be the best judge of that.

how can they possibly judge how much effort i intend to put into publicizing this stuff? i'm not an academic, or an advocate. i'm happy to share; it's refreshing actually to have a system which there is no harm in sharing. if anyone wants to formally document it, i would be happy to do my best to describe it, and maybe scan some penciled diagrams. but i have at least 10 other systems with equally high priorities underway, and i'm not going to stop building systems just to rigorously document anything. i will already be off on a new project, or picking up an old one, as soon as i hit the ground running.

so like i said, i'm not going to personally rigorously describe this system... and no matter how people judge me nothing is going to change that.

and thats that.

PS: there will be a demo for download asap, but don't hold your breath. anyone can see for themselves the results then.

edit: oh forgot... thanks Adrian. i appreciate it. i intend to keep vsync on, but it's good of course to know what causes that behavior. i have to wonder if higher refresh rates would relieve this tension a bit. is vsync tied to your monitor's refresh rate? or is it always around 60fps, which i feel is good for real-time rendering?

michagl
03-07-2005, 12:18 PM
in sympathy for contributors struggling with the concept presented here, i felt it appropriate to add as the final word that if you have any questions, just look at the screen referenced at the top of the thread.

if you don't get it, just look harder. every relevant bit of it so far is in that image.

i'm working on some significant enhancements which will up the triangle count (perhaps potentially without limit) without adding any new lod overheads.

but other than that very recent development. every little bit is illustrated in that image.

i'm certain that i could tell more or less exactly what is going on in that image, even if i had zero prior familiarity. the information is all there if you study it logically... which would be a lot quicker than me trying to verbalize an essentially visual algorithm.

and that's all i have to add.

sincerely,

michael

knackered
03-08-2005, 02:15 AM
Originally posted by michagl:


How do you know?

because i wouldn't say so if i didn't know.
That doesn't answer the question, it merely states your motive.
Still, it's in line with the rest of your ponderings - vague and insufficient, garnished with sideways insults at peoples lack of intelligence. Easy stance to adopt, but doesn't achieve a whole lot, except to make you feel good about yourself.
Why did you bother?

michagl
03-08-2005, 08:40 AM
because i'm quite aware of the options in this domain... and the popular ones don't stack up. there is a major sense of failure in the ROAM domain, as even you acknowledged, i believe. that's not to say that the benefits of achieving an effective ROAM solution are not immense. right now ROAM versus static meshes is basically a brains versus brawn affair. the gpu provides the brawn, but logistically the algorithm is much more efficient.

effective utilization of pciexpress technology will probably render static meshes nothing but a tool of the amateur. in fact i suspect the reading capacity of pciexpress will be one of the three or five biggest things to ever happen in contemporary hardware graphics.

but even in the absence of pciexpress, the optimizations possible from this particular approach very closely approach the raw power of static meshing... without suffering from the lack of scalability of the static mesh, or the lack of continuity of non-continuous LOD.

by asking how i know so, you imply that i'm unaware of the 'competition'. i really did not wish to offer such an involved response, but as you insult my honor, my ego must've dragged the rest of me along.

now give me a break please.

sincerely,

michael

PS: i get no self satisfaction from any of this squandering of time and energy. i only lack the capacity for rudeness. that is, if i was not prepared to follow through with the thread, i would never have begun it.

knackered
03-08-2005, 09:53 AM
So you're saying that if I keep talking to you, you have no option but to continue talking back? Because you began the thread?
Oh goody! I've found a friend...
Re pciexpress - I agree it will shift the priorities of certain techniques, but I'm not sure lod is one of them. LOD is still a mechanism for gracefully degrading and compressing graphical data, which sounds like a GPU job to me, not something a general purpose CPU should be concerning itself with.
Obviously it depends on the application...if all you're doing is rendering stuff, then it makes sense to divide the work across all available processors. But usually there's collision detection/response, physics, logic, ai, decompression, input polling, audio management, etc.etc. So the CPU has plenty to do, without adding dynamic re-meshing onto the list.

michagl
03-08-2005, 12:38 PM
i definitely won't go on forever, but i'm happy to discuss pciexpress, because i'm not sure if i totally get it. i feel like the main intent of this thread has died, so anyone can feel free to go off topic.

the way i understand it, pciexpress basically has equally fast read and write buses, and comes with a super dma system on the motherboard. i figure the expanded dma systems will differ from board to board.

i figure what this would allow you to do, at the least, is always be using the gpu for anything vaguely related to 3d and pixel type operations. the gpu goes beyond general purpose simd, to the extent that it also presumably has built in hardware for operations like vector normalization.

so i figure something like collision processing could probably easily be spread out across the cpu and gpu.

where the gpu fails is with complexity. you want fine grain parallelism with the gpu. so doing something like space partitioning isn't possible on the gpu alone... but if you are smart in choosing the way you go about it, it might be able to batch process some operations.

like i've said, batch processing distances is perfect for the gpu.

as for ROAM, you are right, it isn't good for everything. generally you don't want to use it if you can fit the entire mesh in the view frustum. but if you can't, static meshes won't scale well, and eventually you'll find yourself rendering more than one triangle within a single pixel, and that is just a waste of power. saturating the resolution of the mesh just to hide scars in non-continuous meshes is also a waste of power... especially when a continuous mesh can be dynamically generated elegantly at no cost.

only thing else i can add is that i do think that this lod algorithm is suitable for hardware mapping. that is the whole thing could be done on hardware... would be great for a real-time nurbs tesselator if nothing else.

Korval
03-08-2005, 02:55 PM
the way i understand it, pciexpress basically has equally fast read and write buses, and comes with a super dma system on the motherboard. i figure the expanded dma systems will differ from board to board.

Not really.

PCI Express is much like AGP, only twice the speed of regular 8x AGP. And it has good readback performance, compared to the typical implementation of AGP.

However, this is all theoretical. In practice, PCIe is good, but not great. Some hardware prefers to use a crossover bridge to convert PCIe to AGP, thus limiting its transfer speeds to that of AGP. Other hardware just doesn't allow for the readback speed. And other times, drivers get in the way.

In short, relying on bus transfers to put data on the GPU is not a wise decision. And even in the most perfect of circumstances, PCIe has nowhere near the speed of video memory. Modern graphics cards can outstrip the PCIe bus in terms of mesh data in microseconds. If you limit yourself to PCIe transfer speeds, you can never run the graphics chip at its maximum speed.


especially when a continuous mesh can be dynamically generated elegantly at no cost.

That's kinda the point of contention. There is a cost associated with ROAM; you can't reach the hardware's theoretical maximums, or anywhere near them.

You're wasting power by sending small batches and uploading data; indeed, it can be argued that you're wasting more by doing this than a static algorithm would when it renders more than one triangle per pixel. Despite how elegant ROAM may seem, hardware likes having lots of triangles thrown at it. Sometimes brute force is the way to go.

BTW, remember when I said to just ignore Knackered and he'd go away? At the bottom of the last page? You may want to consider heeding that advice.

michagl
03-08-2005, 03:28 PM
i vaguely realize so much about pcie, but i figure it will take a while to get the drivers and supporting hardware up to speed; at least it's a step in the direction of a two way interchange with the gpu.

i'm in a bit of hurry, have to feed horses...

i believe i've read that the read and write bus rates are identical. but of course they are still like agp compared to video memory.

but the read hardware, i figure, is independent of the write and gpu side hardware, so it would be a good idea, i figure, to keep the read bus busy, unless there were scheduling conflicts.

i really don't know what the trend is with video game developers as far as how they like their cpu/gpu threads... asynchronous versus synchronous, or unthreaded even... that would depend a lot on how you would use the pcie possibilities of the gpu.

eventually there might even be pcie cards working with multiple gpus in parallel, though it doesn't look like there is a big push for this on consumer class machines.

has opengl made any plans for pcie reading capabilities other than their standard readbacks?

wouldn't it be nice if vertex shaders could write to video memory, as i presume PixelBuffers do. or can that already be done?

it seems like the pixel and vertex processing capabilities of hardware right now are fairly partitioned... maybe the unified VBO type interface and floating point precision buffers will change that.

just to warn everyone, i really don't know what i'm talking about when it comes to such matters. my conception of this stuff is a floating cloud of concepts at best. i prefer to stick to higher level coding, rather than get tangled up in hardware. my turnover rates can't really keep up with hardware changes to the extent that knowing the hardware intimately would be less harmful than not.

Adrian
03-08-2005, 03:50 PM
As far as I can tell it has been the graphics card that has been the main bottleneck in readback speed not the bus type/speed.

The biggest jump in readback speed came with the release of the GF6(AGP) not with pcie. Readback went from 200Mb/sec(GF5) to 1Gb/sec. In fact the only benchmark figures I've seen from a gf6 pcie card suggested (marginally) slower readback. That was a while ago though and maybe the drivers have changed things since then.

I'm ignoring the asynchronous aspect of pcie I know.

plasmonster
03-08-2005, 06:14 PM
Michagl, here's a great overview of pipeline optimization:
http://developer.nvidia.com/docs/IO/8230/GDC2003_PipelinePerformance.pdf

The concepts presented should apply to most any modern hardware. Perhaps ATI has a similar document?

I read a paper not too long ago describing some novel techniques in LOD and visibility. This one deals primarily in shadow generation, but many other interesting concepts are presented, some of which relate to the application of multiple GPUs:
http://gamma.cs.unc.edu/Shadow

Multiple GPUs... that's kinda scary.

michagl
03-08-2005, 08:04 PM
i will give the pipeline docs a look tomorrow... i suspect some downtime while my development machine is preprocessing some data... i'm timing it on a test run tonight. i really ought to be doing this preprocessing on another machine, but i'm that lazy i guess, and could use a good excuse to stay away from the computer.

listen, i appreciate everyone playing devil's advocate... but i'm very excited about the development of this system. and yes i've worked with CLOD systems in the past, and i know the ups and downs.

there is a lot of other data management, such as streaming geometry/maps, that happens to work very well with a CLOD approach as well... and yeah i'm aware of other ways to go about that.

but as long as someone is talking about piling triangles on pixels: it seems to me like one trend will be increasing the smoothness of meshes... no jagged edges etc. and another trend seems to be that monitor resolution is and will continue to be costly... not to mention when head mounted displays become popular. so combined with ultra fine meshes and limited screen resolution, throwing redundant non-scalable geometry at the gpu looks like a bad idea on the face of it. if you have a mesh that is pixel perfect in the foreground, assuming there is no LOD whatsoever, you could have hundreds or thousands of triangles piled on top of a single pixel. yeah, i realize there are non-continuous detail saturating techniques that would probably knock those numbers down a bit... 1000s of triangles is just a dramatization, but by the time you do that (especially for non-planar geometry)... you are practically halfway to a ROAM algorithm already.

plus gpus can't scale forever... and i don't know how often you all get outside, especially in a natural environment... but there is a whole lot of detail out there to simulate. when pushed to the absolute limits, quantum computing even... brains will eventually have to win out over brawn.

quantum physicists talk about uncertainty in observation... sounds like a breakdown in a ROAM algorithm to me... just kidding.

anyhow, it works very well for me. i'd rather wrestle with something i understand than the hardware. better to build castles on rock than sand. i can't get the kind of numbers you all talk about out of hardware. when i work with big meshes, they can get out of hand pretty quickly performance-wise... same goes for all of the games i'm seeing on the market right now. all i know is that the screens i'm getting stack up pretty well, the triangles are uniform in screen space from front to back, and the load times are nominal, and will only get better, along with performance.

yooyo
03-08-2005, 11:30 PM
wouldn't it be nice if vertex shaders could write to video memory, as i presume PixelBuffers do. or can that already be done?
AFAIK it is not possible on current hardware. Best workaround will be to use fragment shaders for vertex processing.

Store your vertices in a texture, write a fragment shader which treats texels as vertices, and render a screen-aligned quad into a pbuffer. If you need more than position you can use MRT. Make sure you use a pixel format with enough precision. After that, do glReadPixels into a PBO and use that PBO as a VBO (just rebind the same buffer)! Set up the pointers and do the job.
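Roughly, the readback half of that idea might look like the sketch below. This is only my illustration of the buffer juggling, not yooyo's actual code; width, height and the index data are placeholders, and it assumes the ARB_pixel_buffer_object and ARB_vertex_buffer_object entry points have already been fetched by your extension loader.

// Sketch only: read the pbuffer's float RGBA texels (xyzw positions) into a
// buffer object, then source that same buffer as a vertex array.
void drawFromPbuffer(int width, int height,
                     const GLushort *indices, GLsizei indexCount)
{
    GLuint buf;
    glGenBuffersARB(1, &buf);

    // 1) pack the rendered "vertices" into the buffer as pixel data
    glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, buf);
    glBufferDataARB(GL_PIXEL_PACK_BUFFER_ARB,
                    width * height * 4 * sizeof(GLfloat), NULL, GL_STREAM_COPY_ARB);
    glReadPixels(0, 0, width, height, GL_RGBA, GL_FLOAT, (GLvoid*)0); // offset 0 into the PBO
    glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, 0);

    // 2) rebind the very same buffer object as a vertex array source and draw
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, buf);
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(4, GL_FLOAT, 0, (const GLvoid*)0);                // offset 0 into the VBO
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, indices);
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, 0);

    glDeleteBuffersARB(1, &buf);  // a real app would keep the buffer around
}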

I'm planning to use this approach for character skinning. I need it because I have a multipass solution in my app and I want to avoid doing the same skinning in each pass.

yooyo

michagl
03-09-2005, 08:14 AM
just real quick... this morning i came up with an optional optimization that offers a realistic compromise in the direction of a suggestion made by 3B to take LOD out of the second tier.

basically i would call this approach saturation mode... the trick is: if you test the lod of the three corners of a second tier mesh, and probably the center just to be safe, and they are within the same LOD, then it is generally safe to assume that the entire node lies within the same LOD, so you can skip any further LOD tests and just pick a signature for that fully saturated level from a LUT.

of course to use this approach, it isn't possible to tessellate the mesh based on irregular surface displacement mapping... that is, you must tessellate it as if the curvature is completely uniform. however, perhaps the only benefit in tessellating due to surface topology is to add noise to the tessellation, which generally gives the human mind more trouble in noticing dynamic shifting within the mesh.

you can even argue that having noise in the mesh is inefficient in terms of synchronizing the vertex and pixel shader hardware... unless the vertex and pixel units share resources somehow. i assume maybe they will share power and possibly cooling resources in the future. i don't know that they share any chips though. i assume they are completely partitioned units.

finally, it's also possible to take that optimization further. i've never really counted, but i think the 8x8 mesh has about 6 depths, so if you create a LUT keyed on the corner and center depths, there is a possibility of about 1300 keys, each mapping to the best fit regular tessellation for that mesh. that would give you a best guess, and you could forgo any further lod testing across the board. it wouldn't give you the final mesh... because that would depend on the mending process along the borders of adjacent meshes. but it would cut the LOD testing down to 4 points per mesh, rather than a maximum of 63 tests, and would produce a fairly smooth tessellation... though transitions might be a bit rougher... but like i said, these are just optional optimizations which would depend on the application's needs.
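just to make the corner test concrete, here is a rough sketch. all of the names here (Vec3, MeshNode, lodLevelAt, saturatedSignatureLUT and so on) are made up for illustration; the real system would key the LUT off the precomputed mosaic signatures, and the border mending pass would still run afterwards.

// Hypothetical scaffolding, for illustration only.
struct Vec3      { float x, y, z; };
struct Camera    { Vec3 position; };
struct MeshNode  { Vec3 corner[3]; Vec3 center; };
struct Signature { unsigned char bits[16]; };         // 127 leaf flags padded to 128 bits

int lodLevelAt(const Vec3 &point, const Camera &cam); // distance/variance -> LOD band
extern const Signature saturatedSignatureLUT[7];      // one uniform signature per depth (~6 depths)

// 'saturation mode' short cut: if the three corners and the centre of a
// second-tier mesh fall in the same LOD band, skip the per-face variance
// tests and pull a uniformly tessellated signature from the lookup table.
bool trySaturatedSignature(const MeshNode &node, const Camera &cam, Signature &outSig)
{
    int a = lodLevelAt(node.corner[0], cam);
    int b = lodLevelAt(node.corner[1], cam);
    int c = lodLevelAt(node.corner[2], cam);
    int d = lodLevelAt(node.center, cam);              // centre test "just to be safe"

    if (a == b && b == c && c == d)
    {
        outSig = saturatedSignatureLUT[a];             // skip the per-face tests entirely
        return true;                                   // border mending still happens later
    }
    return false;                                      // fall back to the full per-face walk
}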

sincerely,

michael

michagl
03-09-2005, 08:23 AM
Originally posted by yooyo:
After that do glReadPixels in some PBO and use that PBO as VBO (just rebind same buffer)! Setup pointers and do the job.
i was wondering if this is a legal operation... i'm assuming you've confirmed that this works? i wasn't sure if VBO and texture memory could be mixed... though i know that is supposed to be the ideal direction for hardware manufacturers.

i assume PixelBuffers go to texture memory, and i seem to recall that PBO wasn't really feasible last i heard.

and just to be clear, i realize there is not supposed to be any such thing as "texture memory", but hardware sure seems to treat textures specially as far as video memory is concerned in my experience... at least in the past this definitely seemed the case.

yooyo
03-09-2005, 08:41 AM
PBO & VBO are just buffers. You can create them with the same call (glGenBuffers) and bind them with the same call (glBindBuffer), just with a different binding target. Buffers contain data, never mind which kind of data.
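In other words (just an illustration of that point, not a complete program):

GLuint buf;
glGenBuffersARB(1, &buf);                         // one buffer object...
glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, buf);   // ...acts as a PBO under this target
glBindBufferARB(GL_ARRAY_BUFFER_ARB, buf);        // ...and as a VBO under this one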

yooyo

VikingCoder
03-09-2005, 09:43 AM
there is nothing provocative, much less aggressive or patronising in those words. they are quite matter of fact.
As a completely dispassionate and uninterested party, you're wrong. Those words are fairly provocative, aggressive, and patronising.

Anyone with a brain can see that.

See what I did just there? I disagreed with you, and I said things that from my perspective are only true, but for some reason you took offense. ;)

Implying that someone is not "comfortable with english" is really offensive bulletin board behavior.

"Waffling" just means being indecisive. I don't think fascists are traditionally very evasive, conversationally speaking.

And for the record, bringing up Hitler means that you LOSE the argument:

http://en.wikipedia.org/wiki/Godwin's_law

yooyo
03-09-2005, 09:49 AM
Originally posted by michagl:

Originally posted by yooyo:
After that do glReadPixels in some PBO and use that PBO as VBO (just rebind same buffer)! Setup pointers and do the job.
i was wondering if this is a legal operation... i'm assuming you've confirmed that this works? i wasn't sure if VBO and texture memory could be mixed... though i know that is supposed to be the ideal direction for hardware manufacturers.

i assume PixelBuffers go to texture memory, and i seem to recall that PBO wasn't really feasible last i heard.

and just to be clear, i realize there is not supposed to be any such thing as "texture memory", but hardware sure seems to treat textures specially as far as video memory is concerned in my experience... at least in the past this definitely seemed the case.
Here is an example of how to render a quad into a pbuffer, grab the image into a PBO, bind it as a VBO and render it again, but this time as a regular mesh.

http://rttv.users.sbb.co.yu/GLFramework01.zip

yooyo

michagl
03-09-2005, 11:29 AM
Originally posted by VikingCoder:
As a completely dispassionate and uninterested party, you're wrong. Those words are fairly provocative, aggressive, and patronising.
i'm sorry i'm not a certified anthropologist social worker... nothing i've said is laden with any emotional baggage, however you want to interpret it.



Anyone with a brain can see that.
first of all, you are asserting a fact that only i can possibly know, as it entails intimate knowledge of the inner workings of my mind, subconscious or not. then you are saying anyone with a brain can see that. when in reality, only someone with *my* brain can really see that... and even my brain might not be able to see that.



See what I did just there? I disagreed with you, and I said things that from my perspective are only true, but for some reason you took offense. ;)
no, i don't take offence, your premise is just unfounded. and please don't be offended by that, as it is just a matter of fact. 'anyone with a brain' is a very loaded term for knowledge which can be only sentimental at best. you are projecting your own socialization onto me.


Implying that someone is not "comfortable with english" is really offensive bulletin board behavior.
this is a good example. i did not imply that anyone is uncomfortable with english, nor did i imply that being uncomfortable with english is a character flaw. most of the world is uncomfortable with english... i was only offering my apologies to those people for not making an effort to accommodate them.


"Waffling" just means being indecisive. I don't think fascists are traditionally very evasive, conversationally speaking.indecision is a sign of doubt. fascists must be doubtless or risk austrosization from their community, or worse.


And for the record, bringing up Hitler means that you LOSE the argument:
i was never arguing. like i said in my first post, i did not want a ROAM versus brute force meshing argument. besides, i believe there is also a clause in there that the law doesn't count when it's invoked intentionally. it was just a joke, as was my intention in raising hitler from the dead.

please give me a break now.

plasmonster
03-09-2005, 11:41 AM
On the subject of image based CLOD and procedural detail:
ftp://ftp9.informatik.uni-erlangen.de/pub/users/cndachsb/procdetail.pdf

There's a very cool water simulation to be found there too. This utilizes a reverse-projection approach to LOD generation:
http://claes.galaxen.net/ex/

These can be found at www.vterrain.org (http://www.vterrain.org)

So much research... so little time :(

michagl
03-09-2005, 02:48 PM
here is a new screen for anyone interested.

http://arcadia.angeltowns.com/share/genesis-mosaics2-lores.jpg

basically you can see here the triangle 'fluffing' mechanism. i call it granularity, though it is more like inverse granularity, because i predict that as it is recursively applied it will produce a granular type effect, sort of like cauliflower.

the thinner lines are the new lines. i plan to compute a strip database for the full ~200k mosaic database next... should take a bit more than a day, maybe a lot longer.

the next step however would be to consider every thicker line to be the inner edge of a quad. before building the strips, you would flip each thick line, reversing the quads. if you look at the image, it is easy to tell that this would produce much more 'square' triangles, and potentially set the mesh up for another round of 'fluffing'. doing this once is trivial, and comes with no cost... doing it recursively will be a bit trickier though, if it is even possible to manage efficiently at run-time. that is to say, it will take a lot more thought.

this mesh is not completely saturated though... only the bottom left corner is, to a degree... most of it could go down one more level. the max triangle count per batch is basically 192 at this point.

oh, and i read the nvidia architecture paper... i was surprised how well my preconceptions of the hardware aligned with the info there... pretty much spot on actually. the paper really wasn't what i was hoping for. however my goals are a fully run-time configurable, general purpose VR system... a kind of graphics/simulation 'interpreter'... so the paper is nice as a centralized collection of all the customizable functionality which needs to be considered for a given configuration.

i'm kind of in the process of pulling all of my work together into some sort of superior robust system that can be reasonably distributed to the public for shared VR resource utilization and development. genesis is the first system which i've actually built with this goal in mind... otherwise i probably wouldn't be working with it at all. it is designed to be a general purpose simulation environment. it's funny how much effort i've put into it (about 2 full-time months)... a lot of strange circumstances keep popping up that egg on its development. at this point it's pretty interesting as a project i can readily share with the public, and one that would actually potentially draw public interest.

oh, and as for what that paper has to say about so-called 1000s-of-triangle batches... i believe that is misleading, because my numbers don't show anything like this. i think it is assuming that you are doing considerable cpu work in the background. it can't mean that just the driver's cpu usage alone is equivalent to passing 1000s of raw triangles through the gpu -- though technically it might be a tiny bit useful to issue fewer batch instructions to the driver when applicable, but not really.

Adrian
03-09-2005, 03:09 PM
i believe that is misleading, because my numbers don't show anything like this.
What are your numbers (tris/sec) and what hardware are you using?

michagl
03-09-2005, 05:08 PM
Originally posted by Adrian:

i believe that is misleading, because my numbers don't show anything like this.
What are your numbers (tris/sec) and what hardware are you using?
well for the record, right now i'm using an nvidia QuadroFX500. which is a quirky little card, because it basically has all of the features of the quadrofx series (except for the stencil shadow culling planes), but not nearly the numbers.

i think it was about as good as the next-to-best consumer card at the time, but had NV30 and double sided stencil etc.

it's getting pretty old i figure... but last i checked the newer cards didn't look exciting enough for an upgrade. i'm hanging out for a decent pciexpress implementation i think.

but anyhow, perhaps 'numbers' was a bad term... because i don't have the time to actually wrangle up any numbers. but i work with static meshes of 1000s of triangles often enough, with no cpu load, and believe me, my card can't spit that out in the time it takes to dispatch a glDrawElements sequence. and i know that, because lately i'm going through a few hundred glDrawElements a frame, and the performance is not much worse, if at all, in technical terms.

the litmus test, i guess, is how long it takes to put out ten 100-triangle batches versus a single 1000-triangle batch. i don't think it would be much of a performance hit. by the time the first 100-triangle batch got started, each subsequent dispatch would be running in parallel, even for just 100 triangles.

worst case you could probably use a display list.
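for what it's worth, that litmus test is easy enough to time directly. something along these lines would do (purely illustrative; getSeconds() here is just a crude stand-in built on clock(), and it assumes the vertex arrays or VBO are already set up and bound):

#include <cstdio>
#include <ctime>
#include <GL/gl.h>   // or your platform's usual GL headers

static double getSeconds()              // crude; swap in a real high-resolution
{                                       // timer (QueryPerformanceCounter, gettimeofday)
    return (double)clock() / CLOCKS_PER_SEC;
}

// indices points at 3000 unsigned shorts = 1000 triangles.  glFinish() keeps
// the timing honest, but it is itself a stall, so treat the output as a
// rough comparison rather than a throughput figure.
void timeBatches(const GLushort *indices)
{
    glFinish();
    double t0 = getSeconds();
    for (int i = 0; i < 10; ++i)        // ten 100-triangle batches
        glDrawElements(GL_TRIANGLES, 300, GL_UNSIGNED_SHORT, indices + i * 300);
    glFinish();
    double t1 = getSeconds();
    glDrawElements(GL_TRIANGLES, 3000, GL_UNSIGNED_SHORT, indices); // one 1000-triangle batch
    glFinish();
    double t2 = getSeconds();
    printf("10 x 100 tris: %f ms   1 x 1000 tris: %f ms\n",
           (t1 - t0) * 1000.0, (t2 - t1) * 1000.0);
}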

michagl
03-09-2005, 05:26 PM
Originally posted by graham:
On the subject of image based CLOD and procedural detail:
ftp://ftp9.informatik.uni-erlangen.de/pub/users/cndachsb/procdetail.pdf

There's a very cool water simulation to be found there too. This utilizes a reverse-projection approach to LOD generation:
http://claes.galaxen.net/ex/

These can be found at www.vterrain.org (http://www.vterrain.org)

So much research... so little time :(
yeah, i'm familiar with vterrain... but i'm really not interested in planar algorithms. they are not nearly robust enough to be worth my time. it's time computer graphics took a voyage over the edge of the world. i'm not saying planar optimizations are not useful in limited cases at a planar scale, but i have bigger fish to fry right now.

plasmonster
03-09-2005, 06:28 PM
Actually, those links were for anyone interested. Though, it's a pity that you can't benefit from them. Since you suggested that this thread could go off-topic, I only sought to interject a few links related to some of the ideas that were bouncing around.

For the interested, here's the area that deals with spherical LOD:
www.vterrain.org/LOD/spherical.html (http://www.vterrain.org/LOD/spherical.html)

Anyways, I wish you luck, Michagl.

michagl
03-09-2005, 07:41 PM
Originally posted by graham:
Actually, those links were for anyone interested. Though, it's a pity that you can't benefit from them. Since you suggested that this thread could go off-topic, I only sought to interject a few links related to some of the ideas that were bouncing around.

For the interested, here's the area that deals with spherical LOD:
www.vterrain.org/LOD/spherical.html (http://www.vterrain.org/LOD/spherical.html)

Anyways, I wish you luck, Michagl.
yeah, no problem, it's a party in here as far as i'm concerned.

personally i'm familiar with every attempt at non-planar clod, such as the suggested project. as far as i know, all attempts so far have been limited to spherical... but probably only because there is not a large demand for other shapes. right now i'm working with a cylindrical world. i intend to expand it at least as far as handling real-time nurbs tessellation and displacement mapping.

if anyone is interested in the subject... these spherical ROAM type models are probably a good place to familiarize yourself with the basic process. there are probably well documented papers out there as well on the central matter.

sincerely,

michael

PS: i prefer to be referred to as 'michael' where applicable. i'm sure though that that will only serve to increase incidents of 'michagl'... you can think of the 'g' as a reverse 'e'. it's sort of a pun... sorry i'm not more creative when it comes to global handles.

edit: there are actually a lot more such systems out there than are sampled on that page. someone told me around 13 a while back. i built one of the first such systems out there near the end of 2001, i think. i never published anything online... 'sean oneil's' system surfaced around the same time. we corresponded for a little while. back then there was nothing like this out there. he got a part-time contract with Maxis, but his project didn't work out as far as i know. nothing in this project remains from the original project, except they both use a sierpinski tessellation and an interesting space partitioning system which is basically just useful for embedding objects inside the model's local space... for managing stuff like gravity, automatic heading correction, and simplified collision processing. notice the low resolution of most of those projects... not big successes, no static buffers, and no stripping as far as i know... probably some limited fanning somewhere though.

Adrian
03-10-2005, 01:06 AM
Originally posted by michagl:
I'm using an nvidia QuadroFX500.
The newest cards can process geometry 10-20x faster than that card. You're designing a new algorithm based on old hardware performance. To get the best out of future hardware, batch sizes are going to have to increase, not decrease.

How does your terrain engine cope with one million polys on screen? That is what the next generation of terrain algorithms should be achieving. Your engine would need to do half a million draw calls per second for 30fps.

You say you don't have time to come up with numbers, but surely it's just a case of dividing the number of rendered triangles by the time it takes to render them :)

Performance is critical yet you don't have any performance figures. I find that strange.

What cpu are you using?

T101
03-10-2005, 02:00 AM
You know, maybe we should get off Michael's case a bit.

Michael comes across as an "architecture astronaut" - head in the clouds and having difficulty explaining himself in terms understandable to us mere mortals.
I only get about half of what he's saying - some pictures would probably help a lot.

But that doesn't necessarily mean his idea is useless. For example, what use was there for fractals until someone thought of using them for compression?

Many people are looking at this assuming that the CPU would have to do a lot of work.

Suppose that precalculating everything in this way IS a smarter approach than just trying to render every triangle, or any of the other LOD mechanisms.

Also suppose that this is a cheap method to select those precalculated tris.

And suppose it is simple (and cheap in terms of space) to implement in hardware.

Under those conditions it might well be interesting.

Of course it requires Michael to work out the details, and it may well not be worth it on any current or currently planned hardware, but that's not necessarily the point of research.

Michael, it may be an understatement to say Knackered doesn't play well with others.
Usually I just skip his posts, but he (well, I assume it's a he) did give you a couple of useful suggestions.

knackered
03-10-2005, 04:04 AM
Originally posted by michagl:
oh, and i read the nvidia architecture paper... i was surprised how well my preconceptions of the hardware aligned with the info there... pretty much spot on actually.
Now I don't know about the rest of you, but I for one am really impressed by this.

knackered
03-10-2005, 04:12 AM
Originally posted by T101:
Michael comes across as an "architecture astronaut" - head in the clouds and having difficulty explaining himself in terms understandable to us mere mortals.
I only get about half of what he's saying - some pictures would probably help a lot.
No, Michael comes across as someone who thinks everyone is stupid, except him.

brinck
03-10-2005, 09:15 AM
I think he's Derek Smart in disguise.

/A.B.

michagl
03-10-2005, 09:27 AM
there is a lot to respond to since my last post.


Originally posted by knackered:

Originally posted by T101:
Michael comes across as an "architecture astronaut" - head in the clouds and having difficulty explaining himself in terms understandable to us mere mortals.
I only get about half of what he's saying - some pictures would probably help a lot.
No, Michael comes across as someone who thinks everyone is stupid, except him.
quickly, i don't believe 'everyone is stupid'... i'm here specifically because i assume at least most of the people here know this stuff better than i do as far as the hardware is concerned. and in my defence, i do my best to keep my distance from people and contemporary television programming, so my socialization might seem strange to you. just think of me as being from another culture than yours. take my words very literally. i'm not a silicon valley lounge lizard.

my cpu right now i believe is around 2.8GHz, 2.5 at the minimum -- AMD.

i'm sorry i don't have the time and energy to describe this stuff better... but i also very strongly get the impression that possibly interested parties are really not trying. i'm willing to meet people halfway, but i really don't think this is an appropriate forum for an in-depth explanation. and frankly my explanation has been at least as explanatory as most academic papers, which generally do very little more than scratch the surface.

after i feel like i've followed this line of thought as far as it can go, and have a release build to demo, then and only then will i look into collaborating with people to help really tweak it (and keep it tweaked) for hardware, and help develop rigorous arguments on paper. but i develop a lot of systems; i have at least 10 very exciting projects in development right now... ranging from a revolutionary operating system environment design (designed to run atop windows and debian linux so far, though independent operation is the goal), to an equally exciting GUI system, a dynamic universal interpreter capable of parsing any unambiguous sequential language(s) interoperably (currently does basic c++, lisp, and opengl), a slew of graphics systems, a 2D dynamic disk system for streaming massive maps to disk with an opengl interface, which basically treats one map as the frame buffer and the other as a texture, with a bios system in between... bios targets can be ram to disk, disk to disk, or disk to ram. and other much more exciting work which i'm not at liberty to discuss. after Genesis is mature enough, i'm planning to set my targets on what will essentially be a portaling system called Daedalus... i won't give myself much time to concern myself specifically with papers and what have you... i'd rather be building something tangible. it's a lot more complicated than just this naturally, but for what it's worth everything is scheduled for free release.

as for new hardware running 10x faster... please give me the price points on those cards.

i guess my biggest issue, however, with this many triangles is that i don't see how disk io can keep up with these numbers. it seems like heavy load times would be in order, especially for procedurally generated data. maybe Nintendo style RAM cards will go back in style... because from what i hear, disk io is not keeping up.

finally, i'm trying to cope as best i can with these numbers... but even if the algorithm can support infinitely large batches, it seems like the bottleneck would be in getting that data into ram in the first place.

and yes, i do believe the algorithm could be mapped completely to a chip. if anyone at nvidia wants to talk, i will try to draw up a more involved explanation of how i would imagine the chip. it would make an awesome hardware nurbs solution if nothing else. also, i should add that i do not believe there is a computationally more efficient way to go about ROAM. so if this algorithm fails, so would the future of ROAM, i believe. but i don't believe ROAM will fail; i imagine it will be the cornerstone of graphics come the end of time.

finally, there is a batching solution i intend to implement next. i plan to replace all the driver dispatches with a single glCallLists dispatch, and two bytes per mesh in 255-mesh blocks. each byte will name a display list; the first will transform the modelview into local space if desirable -- ie hold a matrix -- as well as the basic glDrawElements setup. the matrix might be optional; it helps precision for large models, but for smaller models with possible skinning, world space (no matrix) would be more desirable. the second list per mesh will probably just have a single glDrawElements.

the lists would just be recompiled as necessary (fairly rarely). list zero in each block would be a NOP list (it does nothing)... meshes not visible to the frustum would be set to zero in the list string.

so in the end, all driver instructions would be reduced to a couple of glCallLists calls, and some short byte strings to be passed through AGP.
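roughly, one block of that scheme could look like the sketch below. this is just an illustration with made-up names, and it collapses the two lists per mesh into one for brevity; the byte for mesh i would be i+1 when it is visible, and 0 (the NOP list) when it is culled.

// Build one block of up to 255 mesh lists plus a do-nothing list at slot 0.
GLuint buildListBlock(int meshCount, const GLsizei *indexCount,
                      const GLushort * const *indexPtr)
{
    GLuint base = glGenLists(256);

    glNewList(base, GL_COMPILE);          // byte 0: the NOP list, compiled empty
    glEndList();

    for (int i = 0; i < meshCount && i < 255; ++i)
    {
        glNewList(base + 1 + i, GL_COMPILE);
        // a matrix list (glMultMatrixf) could go in a second list per mesh
        glDrawElements(GL_TRIANGLES, indexCount[i], GL_UNSIGNED_SHORT, indexPtr[i]);
        glEndList();                      // recompiled only when the mesh itself changes
    }
    return base;
}

// Per frame: one byte per mesh; culled meshes are written as 0 in the string.
void drawListBlock(GLuint base, int meshCount, const GLubyte *batch)
{
    glListBase(base);                     // byte value k calls list base + k
    glCallLists(meshCount, GL_UNSIGNED_BYTE, batch);
}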

sincerely,

michael

michagl
03-10-2005, 09:48 AM
before i log off real quick i just wanted to say that most everyone contributing here has been of great service, if only because it got me thinking very critically. many developments have emerged since the beginning of this thread which, if nothing else, would've taken some time longer to emerge were i not to have stumbled into this forum looking for a solution which ultimately turned out to be as trivial as vsync.

and yes, i appreciate even knackered, i think, to the extent that he has maliciously attempted to corner me and forced me to adapt my critical thinking... but then on the other hand maybe i'm just confusing him with other contributors, and he really only served to muck up the discussion... but at least maybe he spiced the thread up enough to rally some onlookers.

all said, i appreciate your efforts... i realize the personal 'costs' of your contributions... and i'm very grateful. i only hope my personal mannerisms do not give the wrong impression.

if any ill will is detected in my words, i'm sure it is only a misunderstanding. i might be critically honest, but i'm not spiteful.

sincerely,

michael

michagl
03-10-2005, 09:54 AM
Originally posted by T101:
some pictures would probably help a lot.
there have been two images shared on my behalf throughout the thread, which do well as diagrams.

i'm assuming if you've read from cover to cover you are familiar with them... but just to reiterate, here they are again:

http://arcadia.angeltowns.com/share/genesis-mosaics-lores.jpg

http://arcadia.angeltowns.com/share/genesis-mosaics2-lores.jpg

maybe later i will share some older, more picturesque (photo-realistic) images.

knackered
03-10-2005, 10:07 AM
I've just realised I haven't contributed anything positive to this thread in...a good while now, so I'll leave you alone from now on. This is most certainly my last "contribution" to your thread. Enjoy the peace and quiet so you lot can continue to thrash out the nitty gritty implementation details of this algorithm nobody but michael understands.
I'll go and tease some beginners before supper.

michagl
03-10-2005, 12:17 PM
Originally posted by Adrian:

Originally posted by michagl:
I'm using an nvidia QuadroFX500.
How does your terrain engine cope with one million polys on screen? That is what the next generation of terrain algorithms should be achieving. Your engine would need to do half a million draw calls per second for 30fps.
i don't think a good 'terrain' needs 1 million polygons to be pixel perfect on foreseeable displays. i would guess that most of those triangles would be wasted power. with that many data points, you may as well not even render triangles... a point renderer could probably saturate realistic displays quickly. just throw a bunch of points at the framebuffer, and then run a pass through, blending the one or two pixels that didn't get hit.

my big question is how the hell you are going to get that much data into ram without making users wait minutes for it to load up. and finally, how long will users put up with very small flat-world partitioned environments... and that is just discussing terrain engines. i bet those online rpg players would like a planet they can circumnavigate seamlessly, and once someone really delivers that, nothing less will suffice from then on.

flat worlders be damned.

the days of asteroids are coming to a close.

and i'm building for a system i won't have to replace till the end of time. i'm not crazy about tackling the same problem twice.

plasmonster
03-10-2005, 12:20 PM
Michael, I had a chance this morning to play around with Sean's demo, over at Gamasutra. It's interesting--flying about a solar system, bouncing into planets--a jolly good time. I admit I haven't given spherical LOD much thought, as I see quite a few obstacles to overcome in certain game-world scenarios. And you're right, the spherical LOD section at vterrain is beginning to atrophy (the aforementioned demo is from 2001).

That notwithstanding, I don't see why some of the more hardware amenable algorithms couldn't be applied to a sphere, or any other geometry for that matter. Consider, for example, geo-mipmaps being bent onto a sphere. Additionally, I don't see why an image based system wouldn't work, provided a good metric could be found. Or clipmaps: Again, if a suitable metric could be provided, who knows. The point is that these are all very hardware friendly approaches to detail reduction.

It's just that after playing with ROAM again, I'm reminded of the major drawback of its premise, which is to minimize rendering costs at the expense of the CPU. This just doesn't fly with modern hardware. Unless, of course, you can manage to alter the hardware. Even with a LUT: this would be essentially tantamount to an image based system. The link given earlier describes such a system. If one were to render multiple sections and warp them onto a sphere or cylinder, for instance, this could work quite nicely, and be very (ultra-modern) hardware friendly. In both cases, the warping is uniform, so the side-effects would be marginal, and would depend largely on the scale. Other geometries could be problematic, however, if warping artifacts are to be minimized. But by and large, it seems to me that the most effective algorithms are simple, and fit very neatly with the way modern hardware works.

I think hardware innovation is a real liberator. The algorithms in use today are only possible with the hardware that supports them. They simply don't make any sense in a vacuum. By not taking advantage of these advances, I think you do yourself a disservice. Perhaps in the distant future, when most, if not all, graphics code is executed on the GPU, then a ROAM-like approach might find its way back into the fray. But my vision of the future is far simpler. I envision increasingly simple algorithms executed by increasingly prodigious graphics chips.

Anyway, I've enjoyed this discussion. This is all such entertaining stuff.

Incidentally, the cylindrical world you mentioned reminded me of the book "Rendezvous with Rama." Don't know why they haven't made a movie out of that one :)

[edit: format]

gdewan
03-10-2005, 12:56 PM
You may just get your wish:
Rendezvous with Rama (http://www.themovieinsider.com/movies/mid/206/)

Korval
03-10-2005, 01:14 PM
i don't think a good 'terrain' needs 1 million polygons to be pixel perfect on foreseeable displays.
I'd say that a million is a bit of a stretch, but 500,000-700,000 is not unreasonable. And 1 million is good for the ambitious developer.


how long will users put up with very small flat-world partitioned environments... and that is just discussing terrain engines.
There have been a number of games in the recent past that stream data rather than having specific loading points and partitioned levels. Dungeon Siege, for example. GTA3+, for another example.

Roam is not required for this. Only a system for streaming static mesh data from the disc is needed.

Adrian
03-10-2005, 03:17 PM
Originally posted by michagl:
i don't think a good 'terrain' needs 1 million polygons to be pixel perfect on foreseeable displays. i would guess that most of those triangles would be wasted power.
Display technology is moving quite fast now. You can buy a 24inch Dell LCD 1920x1200 for $1200 US. I'm sure the price will fall quickly this year like the 20inch version did last year.

It's better that the gpu at least has the opportunity to waste some vertex processing power rather than sitting idle. Despite the wastage, the brute force methods will still render more useful tris/sec than a cpu-heavy approach.

The 10x faster card I was referring to was an ATI X850XT Platinum, capable of transforming 800M vertices/sec, compared to the 40-80M of the FX500 (I'm not familiar with the FX500; I found conflicting specs for its transform speed). I believe the X850XT sells for around $500 US.

SirKnight
03-10-2005, 04:06 PM
Originally posted by knackered:
I'll go and tease some beginners before supper.
That's the spirit, ol' chap! :D

-SirKnight

michagl
03-10-2005, 04:12 PM
not to be facetious, but you are drastically oversimplifying the problem, graham. it's easy to sit back and play devil's advocate when you have nothing to lose... but the truth is, if it was as simple to produce a real 'game' under such constraints, it would've been done a few years ago. i really don't have time to get into the logistics of it all... but graphics programmers are really addicted to the plane, and its counterpart euclidean geometry.

doing everything on the gpu is fun for tech demos, but highly impractical for a real simulation environment.

as for streaming... yes, i surely hope all exploration based games utilize some form of streaming. but consider the example of 1 million polygons... unless you are going to force your players to travel down long corridors or whatever to get to a 'save point'... this isn't going to work out. if the player could save in the field, the system has to be prepared to spit those 1 million polygons in view out immediately... and then grab its ass and do its best to stream the polygons which will soon come into view asap.

there is no way to accomplish this without some discretized LOD based system... and by the time you have that, you are halfway to a ROAM solution. taking the last step is just a matter of relying on your brain rather than abusing the gpu. but that doesn't begin to change the fact that you have to get those 1 million triangles on the screen, and fast. hard disks aren't even that fast in my experience... that affects clod or not. the difference between continuous and non-continuous LOD, which is the only remaining question for terrain, is that non-continuous LOD relies on artificially inflating the triangle count to try to obscure seams. in fact it revels in the fact that its triangles are only a pixel big... because if they were bigger than that, the transitions would be obvious.

this kind of thinking appears to be the only way out. granted, ROAM algorithms in the face of hardware have not been successful... that is why i'm presenting a novel approach which really walks right up to the efficiency of static meshing, but comes with much less of a disk IO overhead, and saves geometry for elsewhere. if you understood how it worked, you would see just how non-existent the cpu overhead is. like T101 i believe contributed... everything is precalculated. selecting which 'mosaics' to tile your surfaces with is practically a non-event that can be tucked in just about anywhere, and optimized to non-existence for a slight drop in precision.

finally, as for comments about tailoring a system to spherical or cylindrical geometry: first of all, this is not nearly as simple as it might seem on the face of it. secondly, i'm not trying to turn out 'a' game or something in the next quarter. i'm building a robust system which can be used to produce any 'game' with a minimal code base and a maximally abstracted and detangled interface. so for me, a robust solution wins out over a specialized solution. i have no direct plans of trying to turn a profit, so i have no reason to produce disposable code. i'm building a tool which will be ready for next generation hardware when it hits the ground running, while the rest will still be very invested in archaic, highly hardware dependent code bases. hell, most code bases are built to be disposable... this is so wasteful, because if people had half the sense they like to think they do, we wouldn't still be compiling games in 2005; there should be a stable, all purpose, unified, run-time configurable solution out there... but all of that sweat was squandered on building second rate, rushed, disposable systems... a workflow which should've died with 16bit computing. high level development is hardly bound by 32bit computing, and is completely unbound by 64bits.

so i think i will jump ship here, now that i've steered it way off course. *splash*

plasmonster
03-10-2005, 05:03 PM
Thanks, Gdewan, that really blew me away!

:D

michagl
03-10-2005, 05:07 PM
Originally posted by Adrian:

Originally posted by michagl:
i don't think a good 'terrain' needs 1 million polygons to be pixel perfect on foreseeable displays. i would guess that most of those triangles would be wasted power.
Display technology is moving quite fast now. You can buy a 24inch Dell LCD 1920x1200 for $1200 US. I'm sure the price will fall quickly this year like the 20inch version did last year.

It's better that the gpu at least has the opportunity to waste some vertex processing power rather than sitting idle. Despite the wastage, the brute force methods will still render more useful tris/sec than a cpu-heavy approach.

The 10x faster card I was referring to was an ATI X850XT Platinum, capable of transforming 800M vertices/sec, compared to the 40-80M of the FX500 (I'm not familiar with the FX500; I found conflicting specs for its transform speed). I believe the X850XT sells for around $500 US.
$1200 is quite a bit if you are not an ultra consumer. i could get 4 computers easy for that price.

guess which i would rather have.

i have the same crt monitors i've had for the past 8 years... they cost about $150 a pop then, like i figure they did for a while before, and still do pretty much... though shipping might make a significant dent compared to the flat displays... but those suckers are still damn heavy last i checked.

those tvs suck up a lot of power too... especially the plasmas... be sure to add that to the cost. add the EPA bill to it while you are at it too.

those little portable lcd screens cost a pretty penny too.

once head mounted displays are reasonable... expect low-res to suddenly become all the rage.

as for the $500 card... i shelled out $250 for the quadrofx500 (a good deal)... $400 for the one before that... saw maybe a 2-fold speedup at best, but the features were what i was really interested in.

from your $500 card, i would expect realistically a 2 or 3 fold boost tops... but i don't plan to wrangle up $500 anytime soon. $250 would be more my style.

with these huge batch numbers though... as far as they ring true... i guess people want bigger, better geometry over quantity. that's a little sad... but maybe the bus will overtake the triangles someday and throw all the numbers on their heads. the hardware manufacturers should really do their best to not force developers in any direction. it kills capacity for creative growth.

edit: this last thought must've come out of the blue. in response though, of course developers are going to improve the hardware however they can, i guess. but something like favoring dense geometry over dispersed geometry tends to force people into thinking in one way... and might even adversely affect hardware development trends. of course, if those numbers are to be taken seriously, it would lead to the conclusion that it is equally as costly to render 1000 triangles to a pixel as it is to render 1... and with those numbers any attempt at LOD seems futile. but this is simply not reality... because a system without LOD will come to a halt instantly. try working with large models in a 3D modeling environment like Maya. there is no real geometric LOD there, and it will crash pretty quickly. i figure display lists could take up whatever driver overhead there is to batch processing... but not if you have to regenerate the lists more often than not... so complex particles, like maybe Macross style space debris, could not be done unless most of the particles shared transform matrices.

Korval
03-10-2005, 06:02 PM
if the player could save in the field, the system has to be prepared to spit those 1 million polygons in view out immediately... and then grab its ass and do its best to stream the polygons which will soon come into view asap.
I'm not sure I follow your logic here.

If you're reloading the game (after a previous save), a hard-load operation is expected. Since the data isn't there already, you pop up a "Loading..." screen and load it. There is no problem, and you don't need to force the player into specific locations to save.


the difference between continuous and non-continuous LOD, which is the only remaining question for terrain, is that non-continuous LOD relies on artificially inflating the triangle count to try to obscure seams.
No, the primary performance difference between them is that static LOD gives the card the data as it wants it (large, contiguous streams), while the continuous LOD scheme is unable to do so.

The static scheme is not streaming data up to the card very often; maybe once every 5 seconds at maximum player speed. Not only is this eventuality rare, it also happens as the card would prefer: long, continuous streams of data. This is beneficial for the CPU and the GPU.

The continuous scheme has to frequently update the data, possibly in large quantities, but possibly in small portions. In either case, it hurts more on the CPU and the GPU side.


hell, most code bases are built to be disposable... this is so wasteful, because if people had half the sense they like to think they do, we wouldn't still be compiling games in 2005; there should be a stable, all purpose, unified, run-time configurable solution out there...
I'm not sure you understand the impossibility of what you're considering. There is no one-size-fits-all approach. The reason codebases are "built to be disposable" is because constant hardware changes and competition require it. You can't predict the future, and everyone who has tried to in this industry has been burned by it. People who thought that non-programmatic hardware was the future built engines around that, and those engines needed almost total rewrites because the future didn't turn out as they hoped.

The needs of an RTS game are not the needs of an FPS game; indeed, their needs in terms of terrain, field-of-vision, and all manner of other things are very different. They have some basic needs in common (they draw meshes if it is a 3D RTS, etc), but there are plenty of needs that they do not have in common, and those games have little in common with something like GTA or Zelda. As such, running one kind of game on the engine of another, while possible, is not terribly wise. You will never make a generalized engine that can run any kind of game in any kind of genre nearly as well as you can if you made a specialized system.

More importantly, the present exists. If I knew for a fact that programmability on GPUs was coming in 2 years, would I build my engine for a game that was coming out in 1.5 years with programmability in mind? Of course not; it makes no sense. It serves no purpose to the present; the needs of a programmability-aware engine are not the same as one that isn't. Building such a system for a non-programmability-aware GPU makes no sense, and can even make it more difficult to get that game out in time. The future isn't here yet, so there's no need to go to drastic lengths to plan for it.

That's not to say that you should write ridiculously short-sighted code either. Flexibility can be built into systems without going too far. Programmers prefer to build flexible systems that can be "easily" replaced by others if they are no longer appropriate for the new task. This does not mean building a gigantic monolithic system that does everything.

Maybe someday ROAM and other similar algorithms will be the right way to go. Maybe you even do ROAM-type stuff on the GPU completely. In which case, all your work will have meaning. However, for today, it is the slower path. For today, it has only limited applications, and few of them are high-performance. For today, there are better alternatives. And if we ignore today just because we believe that things will be different tomorrow, we lose out to those who are living in the present rather than the future.

Just because there's a cliff 1 mile in front of you doesn't mean you turn; you still have 5,279 feet before that cliff becomes an issue.


i guess people want bigger, better geometry over quantity.
I don't know what that means. Bigger, better geometry means more (re: quantity) geometry. Clearly, the size of a triangle has no impact on how long it takes in terms of vertex processing, so the performance benefit of using static schemes is in having more triangles.


the hardware manufacturers should really do their best to not force developers in any direction. it kills capacity for creative growth.
With that kind of logic, we'd still be back in GeForce 1-levels of performance.

You have to optimize, and you optimize where it counts. And if it means that large strings of triangles are the optimal path, so be it; at least we have some path that is optimal. The fact that it happens to be brute force (the path that hardware tends to favor) only makes it easier to use.

michagl
03-10-2005, 06:24 PM
yo Korval,

sorry, but you misunderstood me on almost every point in your last post... and i take most of the blame for that, if not all of it. when i get the time i will respond to your words though, if only to clarify my original intent... which probably wasn't as explicit as i would've liked it to be in a perfect world.

i really appreciate the concern though.

sincerely,

michael

michagl
03-10-2005, 07:35 PM
If you're reloading the game (after a previous save), a hard-load operation is expected. Since the data isn't there already, you pop up a "Loading..." screen and load it. There is no problem, and you don't need to force the player into specific locations to save.
my point is simply that for 1 million triangles on screen, unless there is some development in disk technology i'm unaware of, some serious 'hard' load times are going to be in order. personally, if i have to look at a loading screen for more than 10 seconds, i'm bothered... i'd just as soon do something else while it's loading... it breaks people's lives up into little unretrievable chunks, kind of like commuting and commercial tv. i don't play too many video games, if any by any measurable standard. but the ones i do have, have ridiculous loading procedures, which are generally unnecessary, like reloading the environment you were just in for a replay or something. i would probably play more often if not for that. the load times these days are just ridiculous, but if people are accustomed to it, i guess they get what they ask for.



No, the primary performance difference between them is that static LOD gives the card the data as it wants it (large, contiguous streams), while the continuous LOD scheme is unable to do so.

The static scheme is not streaming data up to the card very often; maybe once every 5 seconds at maximum player speed. Not only is this eventuality rare, it also happens as the card would prefer: long, continuous streams of data. This is beneficial for the CPU and the GPU.

The continuous scheme has to frequently update the data, possibly in large quantities, but possibly in small portions. In either case, it hurts more on the CPU and the GPU side.
just curious: does uploading data halt the cpu in current implementations if the app is not multi-threaded? this is something i don't know much about. and just for the record, the algorithm discussed here only uploads continuously, but how large remains to be seen. the thing that scares me about large uploads isn't the uploading, it's where to get all of the data to upload. for a nurbs surface, for instance, all of the vertices must be procedurally generated, and reading from disk is not much better right now given cpu trends versus disk io. any LOD algorithm which utilizes 'mipmapping' to generate displaced geometry must recompute the mesh for each level, or risk non-representative sampling by not utilizing mipmapping, which would mean more visual popping in the LOD. so i have to ask if we are talking about streaming non-continuous LOD here, or just streaming in a static mesh with no LOD in chunks?


I'm not sure you understand the impossibility of what you're considering. There is no one-size-fits-all approach. The reason codebases are "built to be disposable" is because constant hardware changes and competition requires it. You can't predict the future, and everyone who has tried to in this industry has been burned by it. People who thought that non-programmatic hardware was the future built engines around that, and these engines needed almost total rewrites because the future didn't turn out as they hoped.
there are reasonable ways to go about providing hardware hooks without throwing the baby out with the bath water... assuming the baby is worth keeping in the first place. producing low quality systems has just become habitual. it's a waste of energy... but i can understand why it has flourished, because building high quality systems is hard to coordinate with large teams, especially large teams who do not actually rely on the applications they develop -- though that is less of a problem in the video game industry, i figure.



The needs of an RTS game are not the needs of an FPS game; indeed, their needs in terms of terrain, field-of-vision, and all manor of other things are very different. They have some basic needs in common (they draw meshes if it is a 3D RTS, etc), but there are plenty of needs that they do not have in common, and those games have little in common with something like GTA or Zelda. As such, running one kind of game on the engine of another, while possible, is not terribly wise. You will never make a generalized engine that can run any kind of game in any kind of genre nearly as well as you can if you made a specialized system.
yeah, of course, different systems would be used for a typical top down RTS versus a first person type configuration... same as, say, a largely 2D versus a 3D configuration. the trick is just to paint as wide a swath as possible as far as abstraction is concerned, without introducing overburdening dependencies.



More importantly, the present exists. If I knew for a fact that programmability on GPUs was coming in 2 years, would I build my engine for a game that was coming out in 1.5 years with programmability in mind? Of course not;
no, you'd be wise to leave hardware to hardware and software to software. that is, keep the two separate dependency-wise. i would personally never build hardware into a system directly. you are right though that graphics hardware 2 or 4 years ago was a very chaotic affair. i think it is fairly well stabilizing though, to the point that its behavior can be fairly well predicted and abstracted, with the introduction of the programmable gpu... in my experience 64bit processing allows for a much more advantageous environment for recoverable development than 32bit really allowed for... though i believe it could've and should've been done with 32bit. the window is very wide for 64bit though... and the restrictions of 32bits on matters such as run-time programming (lisp), precision, and addressing will not be present with 64bits. and i don't expect 128bit processing anytime soon... i think there is a noticeable curve in the amount of time to transition from 8 to 16 to 32 to 64 bits, which probably correlates with the order of magnitude of the values achievable by such alignments. in other words, developments are becoming increasingly stabilized in the big picture.


it makes no sense. It serves no purpose to the present; the needs of a progammability-aware engine are not the same as one that isn't. The purpose in building such a system for a non-programmability-aware GPU makes no sense, and can even make it more difficult to get that game out in time. The future isn't here yet, so there's no need to go to drastic lengths to plan for it.
if you can save time on the next iteration it is worthwhile... especially if you invested in the last cycle... and i'm sure this is done to a degree.



That's not to say that you should write rediculously short-seighted code either. Flexibility can be built into systems without going too far. Programmer prefer to build flexible systems that can be "easily" replaced by others if they no longer are appropriate for the new task. This does not mean building a gigantic monolithic system that does everything.
yeah, i agree... now is a good time, i think, to start shifting in this direction. as the days pass by, this approach becomes more and more reasonable. the trick is just being able to recognize and admit that the old ways are counterproductive when the times come and the environments change.

the real trick with monolithic systems though i think is, throwing more people at them does no good. so you need a sort of 'heroic' programmer or two who can think with the same mind to pull it off, or at least get it started.



Maybe someday ROAM and other similar algorithms will be the right way to go. Maybe you even do ROAM-type stuff on the GPU completely.
i suspect it will go completely on hardware. the question i think isn't a matter of 'if', but 'when'. it would probably work well with the regular mesh images advocated by 'hoppe', and nurbs parameters. but in the meantime i don't believe it is inefficient to pull off with contemporary hardware capacities... it just takes a lot more thought to implement.



In which case, all your work will have meaning. However, for today, it is the slower path. For today, it has only limitted applications, and few of them are high-performance. For today, there are better alternatives. And if we ignore today just because we believe that things will be different tomorrow, we lose out to those who are living in the present rather than the future.
i'm not advocating anything... the results will speak for themselves if i have anything to say for it. the only strong counter-argument i see is this whole batching thing, which purports that a single glDrawElements dispatch sent across agp memory is as fast as drawing 1000 simple triangles. first of all this breaks down when your shaders have enough instructions... notice how coarse the geometry is in modern games. but even for simple shaders, i have a feeling utilizing display lists will totally rip the performance hit out of that batch argument. if you have doubts about the cpu overhead of my ROAM algorithm, i have a feeling they are unfounded, because the cpu overhead is nothing.
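
just to be concrete about the display list idea, something like this is what i have in mind (a sketch only; the arrays here are stand-ins for one of my second-tier permutations, not code from my engine):

// sketch: bake one mosaic permutation into a display list once, then a
// single glCallList per frame replays the whole batch with no per-frame
// vertex traffic on the application side.
#include <GL/gl.h>

GLuint CompileMosaicList(const float* verts, const unsigned short* indices,
                         int indexCount)
{
    // client arrays are dereferenced at compile time, so the geometry
    // ends up stored on the driver/server side of the list
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, verts);

    GLuint list = glGenLists(1);
    glNewList(list, GL_COMPILE);
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, indices);
    glEndList();

    glDisableClientState(GL_VERTEX_ARRAY);
    return list;
}

// per frame, per visible mosaic:  glCallList(list);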



Just because there's a cliff 1 mile in front of you doesn't mean you turn; you still have 5,279 feet before that cliff becomes an issue.
this depends on just how fast you are going :)

for my own needs, i'm going very fast... or if you don't like the 'fast' analogy... then let's just say my reaction time is very slowwww.


I don't know what that means. Bigger, better geometry means more (re: quantity) of geometry. Clearly, the size of a triangle has no impact on how long it takes in terms of vertex processing, so the performance benefit of using static schemes is in having more triangles.

yeah, i was just commenting that, just due to the bus versus the rasterization hardware, current trends favor an environment with a few large high resolution objects... versus a lot of small high resolution objects. say we can't have 1000 rocks rolling down a ravine, we have to have a single boulder. i'm not saying that is anyone's fault necessarily, just unfortunate. however if people grow too accustomed to this thinking, they might pass up an opportunity to improve the bus while focusing too much on the rasterizer. tunnel vision or something.


With that kind of logic, we'd still be back in GeForce 1-levels of performance.

You have to optimize, and you optimize where it counts. And if it means that large strings of triangles are the optimal path, so be it; at least we have some path that is optimal. The fact that it happens to be brute force (the path that hardware tends to favor) only makes it easier to use.

yeah i agree that any improvement is improvement... but if resources are just being focused along one path versus another, i think the approach is lopsided, and only favors competitive mediocrity.

michagl
03-11-2005, 06:55 AM
just to clear some possible misconceptions up.

i'm really not committed to this aproach as far as my programming work is concerned as much as it may seem.

the truth is the work flow is amenable enough to quite easily be modulated to support pretty much all of the common streaming/lod approaches presented here and elsewhere.

i will probably even set it up to do some algorithms i'm not so crazy about just for the sake of benchmarking performance.

if you have a favorite LOD or streaming approach, feel free to share it, and i will try to provide it as an option.

shouldn't require much more than a day's work to pull off each one.

the end result should be a fair list of basic LOD and streaming approaches to compare results against.

the tesselation patterns will probably be the same, without some extra work. but i can do things like not varying LOD (push the constant factor up a little and take the linear and quadratic terms down to zero), or saturating the second tier mesh with blending strips (there is no problem with pushing the batch sizes up as high as you like, as long as you don't care that the second tier mesh is static save for the triangles right on the border). see the sketch below for roughly what i mean by those constant/linear/quadratic factors.

you probably wouldn't want to do the first option without using the second. but i believe those two approaches are standard LOD practices, and i'm sure i could add them both in a day or so.
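
roughly the kind of thing i mean by those factors (illustrative only; this is not my actual variance formula, just the shape of the knobs):

// a split priority with constant, linear and quadratic distance terms.
// setting k_linear and k_quad to zero and raising k_const reduces it to
// a fixed, non view-dependent tesselation depth.
float SplitPriority(float variance, float distance,
                    float k_const, float k_linear, float k_quad)
{
    float d = (distance > 0.0f) ? distance : 1e-6f;  // guard against divide-by-zero
    return variance * (k_const + k_linear / d + k_quad / (d * d));
}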

knackered
03-11-2005, 07:17 AM
Now you're talking sense. Good luck with it, and despite the impression I give, I will be interested in the results.

michagl
03-11-2005, 08:35 AM
i really don't get this bit:



No, the primary performance difference between them is that static LOD gives the card the data as it wants it (large, contiguous streams), while the continuous LOD scheme is unable to do so.

The static scheme is not streaming data up to the card very often; maybe once every 5 seconds at maximum player speed. Not only is this eventuality rare, it also happens as the card would prefer: long, continuous streams of data. This is beneficial for the CPU and the GPU.

The continuous scheme has to frequently update the data, possibly in large quantities, but possibly in small portions. In either case, it hurts more on the CPU and the GPU side.

this is something i really don't understand and would like to have very mechanistically explained to me.

it seems obvious to me that large continuous streams would, according to the numbers, be more efficient. but the idea of adding a potentially major hit every five seconds is suicide as far as stability is concerned. it looks like a bottleneck to me. you would either have to cut performance all the time so that the upload activity isn't noticeable, or break the upload up across multiple frames so that the hit is less obvious.

i don't understand what is going on here as far as exactly how memory is being transported... where code is halted, how buffers are locked. is there any kind of built-in parallelism going on whatsoever? or do you have to set up your own threads, and check for 'critical sections' across threads? the nvidia pipeline pdf suggested above appears to report that uploading buffers in a staggered fashion is safe as long as they are not being drawn and uploaded to at the same time. but i still don't even know if using an upload dispatch like glBufferData is going to halt the calling process or not. why is there no built-in api for locking data? i would be more comfortable with that. how is memory transferred from system to agp memory? is there a dma system in between?

its this kind of stuff that i'm really looking for answers to more than anything.

that is, how can i set up opengl to ensure parallelism between uploading and rendering buffers, in very hard terms.
don't say hardware/implementation specific unless you are willing to spell out a number of the most common cases.

i want the last detail. and believe me, i would appreciate definite answers in this vein more than anything else in this thread.

sincerely,

michael

Korval
03-11-2005, 11:56 AM
producing low quality systems has just become habitual. it's a waste of energy

So, just because they are optimized for a specific path of data (a datapath that is fast, btw, because that's how PC architecture wants to work. Maybe you should suggest that they rearchitect PCs), you consider this a "low quality system"? It seems wrongheaded to consider it low quality just because it doesn't do what you think it should.

I'd love to see subdivision surfaces built into hardware. It isn't going to happen, and their lack doesn't mean that modern hardware is somehow low quality. There just happens to be a specific feature that I would like that doesn't exist.


i would personally never build hardware into a system directly.

You've clearly never coded on a PS2 ;)

There are plenty of times and perfectly valid reasons when hardware will dictate the shape and form of various parts of your engine. If you don't expect your users to have multiple CPUs, then you don't bother threading your engine significantly. If, however, you're going to make a PS3 game, with its multiprocessor Cell chip, then yes, you are going to thread your engine, and this will require some significantly different design.


if you can save time on the next iteration it is worthwhile... especially if you invested in the last cycle... and i'm sure this is done to a degree.

You don't get a "next iteration" if the current iteration doesn't get out of the door.


first of all this breaks down when your shaders have enough instructions...

If you're optimizing your vertex transfer when the bottleneck is in processing the data, then yes, it is a waste of time.


but the idea of adding a potentially major hit every five seconds is suicide as far as stability is concerned

You know that you're going to be streaming in data, so you plan for it. You don't load the entire next piece all at once, nor do you process that data all at once; you amortize the cost of loading over many frames. Streaming relies on asynchronous file IO and file processing. This means that you only do, say, 1 millisecond worth of file processing each frame.
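
For example (a sketch only; the job type and its methods are placeholders, not from any particular engine):

// Hypothetical per-frame streaming step: pull work items off a queue and
// stop once the time budget (~1 ms) is spent.  Each DoSomeWork() call does
// a small slice of file reading/decompression/upload and returns true when
// the job is finished.
#include <chrono>
#include <queue>

struct StreamJob { virtual bool DoSomeWork() = 0; virtual ~StreamJob() {} };

void UpdateStreaming(std::queue<StreamJob*>& jobs, double budgetMs = 1.0)
{
    using Clock = std::chrono::steady_clock;
    const Clock::time_point start = Clock::now();

    while (!jobs.empty())
    {
        if (jobs.front()->DoSomeWork())
        {
            delete jobs.front();
            jobs.pop();
        }
        const double elapsed =
            std::chrono::duration<double, std::milli>(Clock::now() - start).count();
        if (elapsed >= budgetMs)
            break;   // out of budget; resume next frame
    }
}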

More importantly, if you have a goal of 1 million triangles in the world at any given time, this doesn't mean that every time you stream new data you get 1 million tris. You break the world up into streaming segments. From any position in the world, you have loaded the current segment and the "nearby" ones (however far you wish to define nearby). Also, for the sake of good streaming, you have one more level of segments in memory, and those are the ones that you can't see yet. So, when the camera moves into another segment, you already have the data that this segment needs, and you can just start streaming in the data for the neighboring segments that isn't already loaded.
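
For instance (again just a sketch; the grid and loader calls are invented for illustration):

// Keep the camera's segment plus a ring of neighbours resident: queue
// loads for anything newly in range, drop anything that fell out of range.
#include <set>
#include <utility>

typedef std::pair<int, int> SegCoord;

// Stand-in hooks for the streaming system (not a real API).
void QueueSegmentLoad(const SegCoord&) { /* push onto the streaming queue */ }
void UnloadSegment(const SegCoord&)    { /* free its geometry/textures */ }

void UpdateResidentSegments(const SegCoord& camSeg,
                            std::set<SegCoord>& resident,
                            int radius)              // e.g. 2 = visible ring + prefetch ring
{
    std::set<SegCoord> wanted;
    for (int x = camSeg.first - radius; x <= camSeg.first + radius; ++x)
        for (int y = camSeg.second - radius; y <= camSeg.second + radius; ++y)
            wanted.insert(SegCoord(x, y));

    for (std::set<SegCoord>::const_iterator it = wanted.begin(); it != wanted.end(); ++it)
        if (!resident.count(*it))
            QueueSegmentLoad(*it);       // newly in range: start streaming it in

    for (std::set<SegCoord>::const_iterator it = resident.begin(); it != resident.end(); ++it)
        if (!wanted.count(*it))
            UnloadSegment(*it);          // out of range: drop it

    resident = wanted;                   // 'resident' = requested-or-loaded set
}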


but i still don't even know if using an upload dispatch like glBufferData is going to halt the calling process or not. why is there no built-in api for locking data? i would be more comfortable with that.

glBufferData is going to take some CPU time, no matter what. It needs to copy data out of your buffer and into a place that the GPU can DMA it. After that, it can be async... as long as you aren't using the buffer, of course.

As for locking, we have that; it's called mapping.
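
Something along these lines (error handling trimmed; this assumes the ARB_vertex_buffer_object entry points have already been fetched):

// Map the buffer object, write vertex data straight into driver-owned
// memory, then unmap.  While it is mapped, the buffer must not be used
// for drawing.
#include <GL/gl.h>
#include <GL/glext.h>
#include <cstring>

void UploadWithMapping(GLuint vbo, const float* verts, GLsizeiptrARB bytes)
{
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);

    // Respecify the store first so the driver can hand back fresh memory
    // instead of stalling on a buffer the GPU may still be reading from.
    glBufferDataARB(GL_ARRAY_BUFFER_ARB, bytes, NULL, GL_STREAM_DRAW_ARB);

    void* dst = glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB);
    if (dst)
    {
        memcpy(dst, verts, bytes);               // you "own" the store until unmap
        glUnmapBufferARB(GL_ARRAY_BUFFER_ARB);   // returns GL_FALSE if the store was lost
    }
}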


don't say hardware/implementation specific unless you are willing to spell out a number of the most common cases.

You choose to use OpenGL. You have to take the good side with the bad. And part of the bad side is that you can never be entirely sure what is going on behind the scenes. OpenGL doesn't guarantee parallelism; it simply allows for the possibility of an implementation that can offer it. We can make estimates about what the driver is doing and run some tests to verify those estimates. However, we simply do not know what the driver is doing.

michagl
03-11-2005, 01:06 PM
So, just because they are optimized for a specific path of data (a datapath that is fast, btw, because that's how PC architecture wants to work. Maybe you should suggest that they rearchitect PCs), you consider this a "low quality system"? It seems wrongheaded to consider it low quality just because it doesn't do what you think it should.

by low quality all i mean is unsalvageable code. constantly rewriting code is a waste of time.



I'd love to see subdivision surfaces built into hardware. It isn't going to happen, and their lack doesn't mean that modern hardware is somehow low quality. There just happens to be a specific feature that I would like that doesn't exist.

we are not thinking on the same page at all. i can't get my head around this quote.


You've clearly never coded on a PS2 ;)

inlining low-level code is a simple process assuming standardized compilers are available. PS2 is a backwater approach to programming i figure... PS3 probably will be more in the PC vein, as i figure the Xbox is.



There are plenty of times and perfectly valid reasons when hardware will dictate the shape and form of various parts of your engine. If you don't expect your users to have multiple CPUs, then you don't bother threading your engine significantly. If, however, you're going to make a PS3 game, with its multiprocessor Cell chip, then yes, you are going to thread your engine, and this will require some significantly different design.

i'm not going to defend this type of thinking. it's just a matter of whether total optimization is a higher priority than flexibility and a long lifespan.


You don't get a "next iteration" if the current iteration doesn't get out of the door.

i'm not a corporate programmer. personally, if i program something, i get it right the first time, and i don't plan on having to ever do it again. the problem is when you rely too much on hardware. it's better to think in terms of computational efficiency... from there it's easy to insert hardware 'here' and 'there' to be taken advantage of. hardware comes and goes; computation is a safer metric for a system built for survival.


you amortize the cost of loading over many frames. Streaming relies on asynchronous file IO and file processing. This means that you only do, say, 1 millisecond worth of file processing each frame.

be more specific. would you stream disk to ram, then when it's all done ram to agp? or stream a little from disk, a little to agp? be a lot more specific actually please. i take it you are only using 'BufferData' once... and somehow amortizing that over multiple frames... as that is what you are advocating, isn't it. please try to be more clear if you know what you are talking about.



More importantly, if you have a goal of 1 million triangles in the world at any given time, this doesn't mean that every time you stream new data you get 1 million tris. You break the world up into streaming segments. From any position in the world, you have loaded the current segment and the "nearby" ones (however far you wish to define nearby). Also, for the sake of good streaming, you have one more level of segments in memory, and those are the ones that you can't see yet. So, when the camera moves into another segment, you already have the data that this segment needs, and you can just start streaming in the data for the neighboring segments that isn't already loaded.
no kidding e_e



glBufferData is going to take some CPU time, no matter what. It needs to copy data out of your buffer and into a place that the GPU can DMA it. After that, it can be async... as long as you aren't using the buffer, of course.

what is agp memory? is it just a designated region of system ram, or is it some ram closer to the agp bus? might it ever be possible to dma from system to agp, and then from agp to video? or must the processor always manage memory from system to agp, as i understand it? that is, the agp bus has no dma capabilities in itself then? would that not be a useful piece of hardware? should opengl anticipate something like that?



As for locking, we have that; it's called mapping.

that is not how i understand 'mapping' at all. of course using mapping does lock the buffer. never mind i guess.

Korval
03-11-2005, 02:46 PM
i'm not going to defend this type of thinking. it's just a matter of whether total optimization is a higher priority than flexibility and a long lifespan.

As time passes, priorities shift, but even now, highly competitive performance arenas (i.e. games) don't have the luxury to waste performance. Optimization is crucial to beating the competition. And so is having the fastest hardware available, even if it means using a brute-force rendering algorithm.


be more specific.

No. There is a clear right answer and a clear wrong answer to all the questions you just asked. I'm not going to hold an impromptu lecture on how to write streaming code. Instead, if you are interested, I would suggest either investigating it yourself, or doing some research into these concepts online.


what is agp memory? is it just a designated region of system ram, or is it some ram closer to the agp bus? might it ever be possible to dma from system to agp, and then from agp to video?

AGP memory is system memory that is designated to be accessible by the Accelerated Graphics Port bus (and the graphics card at the end of it). It is uncached memory, so that the DMA process doesn't have to go through the CPU's cache.

DMA means, in this context, from system memory to video memory. Typically, regular system memory cannot be DMA'd. In order to have memory that can be DMA'd from, it must be allocated with the OS in some special (possibly undocumented unless you're a driver developer) way. It is uncached memory, for reasons as explained above. If the DMA-able memory is not also AGP memory, then it will not go across the AGP bus, and will therefore use the shared PCI bus (and be up to 8x slower than it could be).

There is no way to DMA memory internally (that I know of); that is, there is no way to asynchronously copy data from one block of memory to another.


should opengl anticipate something like that?

You're thinking of OpenGL in the wrong terms. It does not specify how hardware works, nor does it specify how memory works. It simply specifies an API and the results of that API. How it gets implemented on hardware is up to the various OpenGL IHVs.

michagl
03-11-2005, 03:28 PM
As time passes, priorities shift, but even now, highly competitive performance arenas (i.e. games) don't have the luxury to waste performance. Optimization is crucial to beating the competition. And so is having the fastest hardware available, even if it means using a brute-force rendering algorithm.

if it isn't obvious, i'm not a game programmer... it is more difficult, but there are preprocessor-type solutions which can yield the same sort of flexibility without giving up the compile step. for me personally though, i can probably count the highly competitive hardware games i could care for on a single hand... games have only gotten worse really since hardware has taken center stage in such an aggressive way. and still the best games by most any really acculturated person are games of yesteryear at least 9 times out of 10. so if i was a game company exec... personally i would focus on managing recoverable resources, so more effort could go into content and creative thinking. but really the decline in game content, like movie content, is probably more just a function of a popular culture in a nosedive. in the end though, technology doesn't make classic games... if you want to talk games. i used to play a lot of games when i was a kid. i didn't stop playing games because i got older... i stopped because i ran out of games i hadn't already played the life out of. i'm glad really that the game industry has gone to hell; i get a lot more work done without the distraction of a new game every other month.


No. There is a clear right answer and a clear wrong answer to all the questions you just asked. I'm not going to hold an impromptu lecture on how to write streaming code.

that's unnecessary, and i figure you know it. you write more than you need to... maybe because you underestimate my competence, but it comes off as ridiculously condescending at times. i'm not faulting you... but that is the vibe i'm getting.

i know as much about code as anyone, down to the meta-meta level... i requested hardware-specific answers, if you are giving them, that is.


AGP memory is system memory that is designated to be accessible by the Accelerated Graphics Port bus ...

... shared PCI bus (and be up to 8x slower than it could be).

There is no way to DMA memory internally (that I know of); that is, there is no way to asynchronously copy data from one block of memory to another.

thank you for the thorough explanation.

it seems however that DMAing memory internally ought to be legal as long as it goes through some sort of OS trap... hardware providing of course.

as long as you go through an API/driver which assures that the memory is going to a harmless place such as a VBO, then drivers should be able to request a DMA transfer from the OS. the OS knows where the memory is coming from and going to, so everything should be securable. i've seen diagrams of PCI Express architecture which touted extensive DMA-type units on PCIe boards... maybe someone is pushing for something like this.


You're thinking of OpenGL in the wrong terms. It does not specify how hardware works, nor does it specify how memory works. It simply specifies an API and the results of that API. How it gets implemented on hardware is up to the various OpenGL IHVs.

i understand fully this facet of opengl. but i hardly think it would be inappropriate to provide some sort of HINT system for future expansions in hardware, such as the internal DMAing process discussed above.

by not anticipating hardware developments, opengl restricts what drivers could possibly do with legacy code on future hardware.

it seems logical to me that there should be a helper function to call prior to BufferData which would lock the system memory buffer you are about to give it. you would call another function to unlock the buffer. if the memory has not yet been transferred by the time you unlock it, then the process would halt until the transfer is complete. even if hardware does not exist to make use of this now, it is a reasonable development to predict. it would be basically just a pact with the driver which says:

'you can assume that i'm not going to write to this memory until i unlock the buffer, so take your time to DMA it to agp rather than halting my process now to make a transfer through the cpu'

this is just common sense policy... the interface does not need to be tied to any hardware.

making room for the possibility of hardware developments allows code to be upwardly compatible... not doing so discourages people from utilizing good sense in developing upwardly compatible systems.

Korval
03-11-2005, 05:55 PM
and still the best games by most any really acculturated person are games of yesteryear at least 9 times out of 10.

We're not going to get into that discussion on this forum. Suffice it to say that I strongly disagree with this statement, and leave it at that.

However, the outcome of said discussion notwithstanding, we still have a game industry. And that industry is very competitive. And it needs high performance and the fastest possible hardware available.


you write more than you need to... maybe because you underestimate my competence, but it comes off as ridiculously condescending at times.

I don't intend to sound condescending, but I find it is much more productive to explain something in too much detail than to have it misunderstood in a post and then have to keep adding more detail in subsequent posts. Better to get it all out in the open initially.


it seems however that DMAing memory internally ought to be legal as long as it goes through some sort of OS trap... hardware providing of course.

I think you've misunderstood what DMA is. It isn't just an async memory transfer; it's a piece of hardware.

All of the DMA processes that I am aware of involve transferring memory from one memory bank to another. In those cases, the medium for the transfer is some kind of bus. This is what leads me to believe that there is no such thing as an async DMA transfer across system memory. Even if you tried to thread it, all that would do is amortize the time spent on the CPU, which the user is just as capable of doing by threading it on his end. Since Buffer Objects are objects, they are sharable across contexts and threads.

Also, you must remember that Buffer(Sub)Data has specific requirements. Among them, when the function returns, the user can play with their memory again. So, when the function returns, the driver must have copied the data out of the memory. Since you're not going to find an async version of memcpy anyway, this is hardly an unreasonable restriction.

Once again, I may be wrong, but I do strongly doubt the idea that there is a way to DMA from system memory to system memory. Presumably, old 2D accelerator devices, those with hardware blitting, could do it, but then again, they had dedicated hardware and on-chip memory on their side. Nowadays, similar operations go through 3D hardware, and therefore can easily pipeline the operation.

michagl
03-11-2005, 06:45 PM
yes, i agree with you 100%, except for maybe the non-presumptuous posting, which i either have to disregard, or respond to, or look like i was unaware of the fact.

when i communicate with people at different levels of experience, my general rule is to assume knowledge, and let the other end ask questions when confused. i insist on this, for my own sanity, both ways.

as for internal DMAing... sorry i wasn't more explicit... when i said something like 'hardware willing' or 'hardware provided'... what i meant was 'if future architectures allow so much'.

of course DMA is hardware. i could be wrong, but i've seen some proposed PCIe architectures, with a lot of DMA systems on board, which all route through a single executive unit. one of them might provide DMA transfer within system memory, which personally i think would be an awesome tool, without going full-fledged multiprocessing.

as for BufferData, what i would propose is another function, called something like LockBufferData, which would be called before BufferData... if not called, BufferData would return directly to the process as it does now... if used, however, BufferData would assume the buffer is not written to until the buffer is unlocked... in this way, if a future DMA system were on board, the process could feel safe to move forward.
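
just to make the idea concrete, a sketch of the sort of interface i mean (purely hypothetical; these lock/unlock entry points do not exist in opengl, and the stubs below are only stand-ins):

// Hypothetical lock/unlock pair: a promise that the client memory handed
// to BufferData stays untouched until unlocked, so a capable driver could
// DMA it lazily instead of copying it up front.  On today's drivers these
// would just be no-ops and BufferData would copy as usual.
#include <GL/gl.h>
#include <GL/glext.h>

static void LockBufferData(const void* /*ptr*/, GLsizeiptrARB /*size*/) {}
static void UnlockBufferData(const void* /*ptr*/) {}   // would block if a transfer were pending

void UploadLazily(GLuint vbo, const float* verts, GLsizeiptrARB bytes)
{
    LockBufferData(verts, bytes);        // pact: 'i won't write to this until i unlock it'
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
    glBufferDataARB(GL_ARRAY_BUFFER_ARB, bytes, verts, GL_STATIC_DRAW_ARB);
    // ... other cpu work could overlap a (hypothetical) background transfer here ...
    UnlockBufferData(verts);             // halts here if the copy weren't finished yet
}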

when you think about long-term programming, you think about this sort of stuff. when i program something, i expect it to last to the end of time, with maybe a little minimal upkeep along the way. if i didn't think it could last that long, i wouldn't program it in the first place. i sit down and grapple with the problem until i'm sure there is no better way to go at it in computational terms. basically, i build perfect systems or nothing at all. i work extremely fast, but i don't set deadlines for myself like games going to market might have. i think in terms of years, not quarters. so my constraints are not exactly what most might be accustomed to.

i appreciate the service though... i've been out of the hardware loop for a relatively long time. since i picked up my last card, i guess.

sincerely,

michael

michagl
03-13-2005, 08:49 AM
for what it is worth, for anyone following the algorithmic component of this thread.

it is possible and very lucrative to simply continue the right triangle tesselation of the mesh infinitely, only with the caveat that after the 8x8 resolution is passed, further tesselation to a given depth is super-saturated.

this approach is fairly lucrative in terms of vertex reuse... and allows for an infinite and highly regular batch size. however it does break the 'continuous' qualifier of CLOD, but only at the deepest level, at which point such inconsistency would probably become fairly negligible.

michagl
04-02-2005, 07:35 AM
i said i would report back here after i got around to reversing the meshing. you can see the results in this image:

http://arcadia.angeltowns.com/share/genesis-hyperion4-lores.jpg

the triangles in the foreground, though really twice as big as i feel they should be, demonstrate a nearly perfectly smooth contour and silhouette. i can also swear that i made no attempt whatsoever to find a good example; this is actually the first screen i took after fully implementing the reversal. what clunkiness there is, is really only the result of a very clunky topology database, an exaggerated scale, plus the limitations of a local 3x3 pixel power-4 radial filter.

excerpt from a more recent thread:

"in the first image i've successfully reversed the mesh producing perhaps the best possible tiling for smoothly aproximating arbitrary surfaces. it is basicly a tightly packed grid of octagons with a vertex in the middle of each, and a vertex in the middle of the diamond created in the space between the octagons(assuming lod is constant). anyone using displacement mapping might want to look into this. the properites are really amazing. much better than a right triangle fan tesselation, and infinitely better than a regular stripping. i haven't tested it directly yet, but i have a feeling its properties for vertex shading are astounding as well."