ATI VBO with SHORT geometry insanely slow

I’ve done some testing here on a 9250 using Cat. 5.3, and seen some results I can’t explain as anything but driver errors. I’d like some input.

With both plain and compiled vertex arrays, SHORT vertices render at roughly the same speed as the same geometry converted to float.

However, out of curiosity I wanted to see how VBO behaved: converted to FLOAT I do get a noticeable speedup, but leaving it as SHORT I get a slowdown of orders of magnitude.

  1. Shouldn’t the GPU be able to convert SHORT to its internal float representation on the fly? Surely even 9250-class hardware can do that?

  2. If it can’t, shouldn’t the driver convert the data on upload?

I know, I know, I’m stretching it a bit here. But still, my CPU is AFAIK in no way able to convert that amount of geometry in the time frame given, which is why I (perhaps falsely) assume the GPU does do the conversion in the VA and CVA cases, but for VBO falls back to an interpreted vertex buffer or something. :slight_smile:

Seems like I read in the spec that floats were likely to be the common usage type. I suspect that’s what they optimize for.

why would you want the hardware choking on conversions?

I’ve done some testing here on a 9250 using Cat. 5.3
There’s probably your answer. You’re using a Radeon 9250; all bets are off regarding that. ATi posted information detailing which formats are supported in hardware (though all require 4-byte alignment), but they only did it for R300 or better hardware.

I wouldn’t be too surprised if 9250s lacked the hardware to process non-float values directly. However, your alignment may be off too, as each component must start on a 4-byte boundary.

bonehead,
I don’t want the h/w choking. That was the point. I also read the same thing about floats being preferred; I seemingly didn’t realize how much weight that preference had. :slight_smile:

Korval,
Thanks, I had completely forgotten about 4-byte alignment. The vertex data was delivered to me as shorts, and I wanted to test whether it could be a viable format to upload to the server as-is, without CPU overhead.

What surprised me was that (C)VA was seemingly able to handle the shorts quite well while VBO wasn’t. Now I see more reasons to not use this distribution-format for rendering. :slight_smile:

Thanks.

What surprised me was that (C)VA was seemingly able to handle the shorts quite well while VBO wasn’t. Now I see more reasons to not use this distribution-format for rendering.
Well, assuming shorts are not a native format for ATI hardware, the driver has to convert the data on the fly (on the CPU!). With VBO the data resides mainly in video or AGP memory, so the CPU has to read the data back, convert it, and hand the converted data to the GPU (slow!).
With (C)VA the data is in main memory. The CPU can read from it very fast, so the conversion is fast(er) and the transfer is sysmem->videomem only.

Try creating the VBO with an “I will read it many times” usage flag, so the driver might place it in sysmem right away. It should be as fast as (C)VA again.

I’m not sure the problem here is alignment. After all, that’s just a few bits left over at the end of segments, and graphics card memory buses are around 256 bits wide if I’m not wrong.

This might rather be due to the fact (maybe) that these cards expect floating-point data (with a mantissa, an exponent and so on), so they might have to convert the integers to floating point. That may also require more memory in order not to lose precision. All of this could explain the encountered slowdown.

Also, I’d like to raise something else: if that’s true, isn’t it a problem that using indices in vertex arrays requires integers, not floats? Does that mean indices should be avoided? (I’m not using indices at the moment, but I might need them some day.)

Sorry to go a little off topic, but I’m interested in the CVA behaviour. In my tests using both VBO and standard vertex arrays, CVA didn’t give me any useful advantage (though I must say I’m not sure how meaningful the test itself was).
Can you tell me more about that?

Originally posted by jide:
Also, I’d like to raise something else: if that’s true, isn’t it a problem that using indices in vertex arrays requires integers, not floats? Does that mean indices should be avoided? (I’m not using indices at the moment, but I might need them some day.)
Absolutely not. Using indices is highly recommended. Not only does your geometry data get smaller, but you also save bandwidth and vertex shading power. Using indices is the only way to utilize the post-transform vertex cache.

For reference: Testing done on a rather slow computer running Windows 2000 with ATI 9250 (128-bit).

Obli:
From a test I did, with an old game’s (a bit unusual) terrain data, I got an almost 50% speedup in FPS just from locking (CVA) the VA for each sector before drawing it. I suspect this is to some extent vendor-, card- and even architecture-specific (I’d suspect CVA over GLX would show a higher speedup than CVA using a local AGP card), but I think it still shows CVA indeed has benefits (given sufficient amounts of data, obviously).

Once I got tired of the VBO-with-shorts issue and just converted the terrain geometry to floats, uploading it to a number of static VBOs (one per terrain sector), I more than doubled the speed compared to plain VA. Please note this was still plain DrawElements with indices from host memory.

Other tests I have performed with this card suggest that putting the index data in a VBO (shouldn’t it be called an IBO then? :slight_smile: ) gives performance advantages too - but please note I only tested static data.

Humus:
The geometry data gets smaller when using indices? Can I have one too, please? :slight_smile: (j/k)

Besides the obvious of reducing bandwidth, which I personally think isn’t that important anymore unless you upload much texture data every frame, I think the more important issue is to reduce total latency. It costs loads of CPU ticks to perform e.g. an AGP transaction. Add CPU context switches needed to do it and you have a measurable performance impact.

There’s also the issue of how easy or hard the data is to manage. I still remember my first stumbling steps in OpenGL using glVertex calls, and I must say geometry management is way easier when handling just indices. OK, if you need to concatenate data to get decently sized VBs it might initially be a challenge, but once you have written the code to re-base indices you’ll have gained much valuable experience in handling geometry efficiently, and you’ll never want to go back to glVertex.

Originally posted by tamlin:
From a test I did, with an old game’s (a bit unusual) terrain data, I got an almost 50% speedup in FPS
Thank you, I’ll redo all my tests then. Since I’ll have to write a synthetic benchmark from scratch, maybe I’ll also check this behaviour with SHORTs and such.

Besides the obvious of reducing bandwidth, which I personally think isn’t that important anymore unless you upload much texture data every frame
Proper analysis of bandwidth matters in terms of per-cycle bandwidth, not per-second. If you have a bandwidth of 20GB/sec, you can’t transfer 20GB of data in one cycle, then sit idle for the entire rest of the second. So bandwidth is really a “use it or lose it” kind of thing. Either you are using every cycle’s worth of bandwidth or you aren’t.

So, if you need to transfer more data at once than can fit in a single cycle, then the transfer will take multiple cycles. If your bandwidth is only something like 16 bytes per cycle, then a vertex that is 32 bytes in size will take 2 cycles to reach the card. If you could cut it down to 16 bytes, then the transfer would only take 1 cycle.

Of course, T&L/vertex shader time tends to dominate vertex transfer bandwidth from video memory. However, transfer from system memory can still be slower than the rest of the vertex pipe; that’s why it is so important to use VBOs.