VBO, float and ATI

a revival of an old problem that wont stop haunting me. the situation:

a large terrain takes a lot of memory, so saving memory wherever possible is a good thing. that includes storing positions as bytes or at least shorts instead of floats.

this approach works well when using vertex arrays, but slows to a crawl with vbo. so i tested a few things and the result was: using anything but float in a vertex buffer would be horribly slow. i was aware of the conversion to float that would have to happen, so i wouldnt have been surprised about a small slowdown, but were talking a factor of 40 here (1200fps vs. 30fps).

so i wrote a simpler test case, just a quad, once as float in vb, then as byte. this time, the byte version wouldnt even show up. i doublechecked all sizes and just to be sure tried it with short. this time it worked AND was as fast as the float version. so i tried using short in the terrain again -> crawl

ok, so maybe the vertex position is supposed to be float. imagine my surprise, when i added a vb with colors as unsigned bytes and it again slowed down.

as ati didnt bother to reply over the last two weeks (probably have a lot to do making their drivers work with all those single games) i uploaded a test program and today got the chance to try it on a different machine.

result: works perfectly fine on nvidia cards (quadro fx, gf4 -though i wonder how-), while different ati cards including my 9800 would slow down when switching to short.

am i missing an important line in the vbo specs that says something like “youre supposed to use floats and anything else will result in undefined behaviour”? that kind of thing is forcing me to use 4 times the memory it would actually need (more even, but there arent any 5bit types ,-) ). its frustrating to see something require a 64mb buffer if you’d need less than 16… because only the latter has a chance to work.

whats so extremely different between va and vbo except where the data is stored and whos managing the memory? what is ati doing so completely differently from nvidia in their vbos? why shouldnt i try to cast a pointer acquired by mapping the buffer to another type (the cast seems to be ignored: no matter how much explicit casting you use, it keeps storing the original type)?

did anyone have similar effects? are they supposed to be like that? and if ati drivers cause so much less trouble than a few years ago, how awful must it have been to develop for ati cards back then?

Sounds strange. Are you 100% sure it’s not a bug in your code? Can you post a link to this test program and its code?

Y.

should contain program and source (config.cfg to switch, terrain.cpp has the relevant part)
http://festini.device-zero.de/VBODemo.zip

for a long time i was thinking “must be my mistake”, and again when the simple quad worked without problems (though not after it just disappeared when changing to bytes).

as mentioned, it draws correctly and on nvidia cards its as fast as the float version. im trying to think of a situation where that could happen with a bug in my code.

if you happen to have an ati around (and some time you desperately need to fill) you could write a simple test using vbo and float/non-float types (especially unsigned char). i’ll try and throw something together thats not as much of a mess as the program above (though i had that problem right from the start, so it appeared long before the mess).

On a 9800 the hardware only natively supports dword aligned arrays. Bytes should be fast if your array uses 4 components. Shorts should be fast if your array uses 2 or 4 components.

The reason vbo takes a bigger hit than standard vertex arrays for non-native arrays is that va always copies array data over to the vpu, and during this copy the data can be converted into a native format. VBO with non-native arrays could be made faster in ati’s drivers, but it would always be better for developers to use natively supported formats.

If you need more info devrel@ati.com is there to help. Sending them a test app that shows bad performance or a rendering bug helps them solve your issues faster.
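One way to read the dword rule above is that a single array element (component size times component count) should fill a whole number of 4-byte words. The following is a minimal C++ sketch of that reading only. The helper names `isDwordSized` and `paddingToDword` are made up for illustration; they are not part of OpenGL or of any ATI API, and the real driver criteria may be stricter than this.

```cpp
#include <cstddef>

// Hypothetical helper (not an OpenGL call): true when one array element,
// i.e. componentSize * componentCount bytes, is a multiple of 4 bytes.
constexpr bool isDwordSized(std::size_t componentSize, std::size_t componentCount) {
    return (componentSize * componentCount) % 4 == 0;
}

// Hypothetical helper: padding bytes needed per element to reach the next
// 4-byte boundary (0 if the element already ends on one).
constexpr std::size_t paddingToDword(std::size_t componentSize, std::size_t componentCount) {
    return (4 - (componentSize * componentCount) % 4) % 4;
}

// The cases named in the post above.
static_assert(isDwordSized(1, 4), "4 bytes fill one dword");
static_assert(isDwordSized(2, 2), "2 shorts fill one dword");
static_assert(isDwordSized(2, 4), "4 shorts fill two dwords");
static_assert(!isDwordSized(1, 3), "3 bytes leave a dword incomplete");
static_assert(!isDwordSized(2, 3), "3 shorts leave a dword incomplete");
```

Under this reading, 3-component float arrays (12 bytes per element) are always fine, which matches the float results reported in the thread.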

for a second i thought that might be it, but then… 3 shorts worked well in a simple test. as did 3 bytes. though i might simply not have enough vertices yet. i’ll make it a few hundred and if that still isnt enough to hurt performance i have to check if its the combination of vbo and vertex program (at least on my old gf3 that had weird results like crashes when using a normal array, unless i did NOT enable normal_array and just used it).

edit: after being about to give up and adding dummy vertex programs and whatnot i finally made it slow again.

using enough vertices (1000) makes it a bit more obvious. with floats, things like index buffers or vertex/fragment programs dont make any difference, while with bytes the programs alone make a difference of 450fps compared to 600. same for index buffers. with both turned off it reaches 900, so using index buffers and programs wont make a difference for floats but will halve the speed for (unaligned) bytes.

floats btw. perform at 2600fps.

the source for this test: http://festini.device-zero.de/vbo.zip

unfortunately the difference seems to scale with the number of vertices. changing one vertex to be 4 consecutive bytes however resulted in 2600fps too. so it seems that ati is simply not as good in handling unaligned data. considering, that this more or less ruins the plan to avoid redundant data im tempted to say f… as in the end, if data needs to be dword aligned i cant see any way to get around this and a 4096x4096 terrain will ALWAYS eat 64mb instead of 16 unless not using vbo.
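The 16 vs. 64 figure can be checked with a bit of arithmetic. This is only a sketch, and it assumes one byte of data per vertex in the tight case and four bytes per vertex in the dword aligned case; the names `kTerrainSide`, `bufferBytes` and `toMiB` are invented for the example.

```cpp
#include <cstddef>

// Assumed terrain size from the post: 4096 x 4096 vertices.
constexpr std::size_t kTerrainSide = 4096;
constexpr std::size_t kVertexCount = kTerrainSide * kTerrainSide; // 16,777,216

// Size in bytes of a buffer holding bytesPerVertex bytes for every vertex.
constexpr std::size_t bufferBytes(std::size_t vertexCount, std::size_t bytesPerVertex) {
    return vertexCount * bytesPerVertex;
}

// Whole mebibytes (1024 * 1024 bytes), rounded down.
constexpr std::size_t toMiB(std::size_t bytes) {
    return bytes / (1024 * 1024);
}

// Assumption: 1 byte per vertex when tightly packed, 4 when dword aligned.
constexpr std::size_t kTightMiB  = toMiB(bufferBytes(kVertexCount, 1));
constexpr std::size_t kPaddedMiB = toMiB(bufferBytes(kVertexCount, 4));
```

With these assumptions `kTightMiB` is 16 and `kPaddedMiB` is 64, so the padding alone accounts for the factor of four.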

[This message has been edited by Jared (edited 02-17-2004).]

Originally posted by ribblem:
VBO with non-native arrays could be made faster in ati’s drivers, but it would always be better for developers to use natively supported formats.
I don’t think “best fit” data conversion is possible at all with VBOs. The vertex layout isn’t declared when sending the data, and worse yet, VBOs can be mapped.

phew… vertex arrays dont seem to be as slow as i remembered them. its probably eating a lot of cpu time to send all the vertices, but at least it allows for nice 4096x4096 heightmaps in less than 17mb which with frustum culling alone still render at 8fps on a radeon9800. lowest lod still does well at 500fps so i guess it should work well enough and as a bonus i get a “free” fallback if vb allocation fails.

still feels like an ugly workaround, and i guess theres not much hope that ati will do whatever it takes to make it work as smoothly as on an nvidia?

On a 9800 the hardware only natively supports dword aligned arrays. Bytes should be fast if your array uses 4 components. Shorts should be fast if your array uses 2 or 4 components.

To make sure that I understand you, if I have an array of positions stored as bytes, I should pad each position to a 4-byte boundary? So each position takes 4 bytes?

Now, if I had both position and normal in the same array, I should pad each to 4-bytes? So each vertex in this case takes 8 bytes?

Originally posted by Korval:
To make sure that I understand you, if I have an array of positions stored as bytes, I should pad each position to a 4-byte boundary? So each position takes 4 bytes?

Now, if I had both position and normal in the same array, I should pad each to 4-bytes? So each vertex in this case takes 8 bytes?

at least that made a huge difference in my test. using 2 or 3 bytes for position resulted in 400fps, using 4 gave the same 2600fps as floats… i should mention those fps values are a little off, as i removed platform dependent timer code and used 2ghz as fixed speed instead of 1.73.

i think this might be a pretty bad thing, especially since nvidia isnt showing any problems, so i guess a few developers arent aware of that and of the consequences for ati users (though at least all commercial developers will have test systems i hope).

Originally posted by Korval:
To make sure that I understand you, if I have an array of positions stored as bytes, I should pad each position to a 4-byte boundary? So each position takes 4 bytes?

Now, if I had both position and normal in the same array, I should pad each to 4-bytes? So each vertex in this case takes 8 bytes?

That is correct. If you want position and normal in the same array, pad each out to 4 bytes, for 8 bytes total.
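The padded layout confirmed above could look roughly like the struct below. This is a sketch under assumptions, not code from the test program: the struct name `PackedVertex` and its padding fields are hypothetical, and signed bytes are assumed for both attributes.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical interleaved vertex: byte position and byte normal,
// each padded from 3 to 4 bytes so every attribute starts on a dword.
struct PackedVertex {
    std::int8_t position[3];
    std::int8_t positionPad;  // unused, only there for alignment
    std::int8_t normal[3];
    std::int8_t normalPad;    // unused, only there for alignment
};

// Stride to pass to the gl*Pointer calls for this layout.
constexpr std::size_t kVertexStride = sizeof(PackedVertex);

static_assert(sizeof(PackedVertex) == 8, "each vertex takes 8 bytes");
static_assert(offsetof(PackedVertex, position) == 0, "position at dword 0");
static_assert(offsetof(PackedVertex, normal) == 4, "normal at dword 1");
```

Note that fixed-function `glVertexPointer` does not accept `GL_BYTE` (only short, int, float and double), so byte positions like these would have to be supplied as generic attributes, for example through `glVertexAttribPointerARB` when a vertex program is in use. `glNormalPointer` does accept `GL_BYTE`.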

Originally posted by zeckensack:
Originally posted by ribblem:
VBO with non-native arrays could be made faster in ati’s drivers, but it would always be better for developers to use natively supported formats.
I don’t think “best fit” data conversion is possible at all with VBOs. The vertex layout isn’t declared when sending the data, and worse yet, VBOs can be mapped.

These cases can be handled by keeping a copy in the format the user wants in system memory and then converting that copy to a format the hardware supports at unmap time (remember you can’t draw while mapped). This of course would mean that the waste of video memory Jared is concerned about would be done by the driver behind the user’s back.

Originally posted by ribblem:
These cases can be handled by keeping a copy in the format the user wants in system memory and then converting that copy to a format the hardware supports at unmap time (remember you can’t draw while mapped).
That’s not a good idea. The client application can still make changes to the vertex layout after unmapping … you’ll have to wait until the next glBegin (or equivalent) until you decide what to do, unless you want to risk doing it twice.

I agree that it can be done. That doesn’t make it any less absurd. I wouldn’t want to implement this stuff on the driver side …