Hardware VBO is ABSOLUTELY slower than software VB



fuzzy3d
07-31-2004, 10:27 PM
http://20030214.vip.sina.com/VBOSpeed.rar
Run VBOSpeed_r.exe, or compile it in VC7. You can see the result.
I tested it on Geforce4Max, Geforce3, Geforce FX5200, Geforce FX5800, ATI 7500, ATI 9200, ATI 9600, ATI 9700 and ATI 9800se.
CPUs range from a P3 1G to a P4 2.4G, both Intel and AMD.
The result is: when filling fewer than 10000 dynamic vertices, the software VB HAS THE TOP SPEED!

Here we go :
1) Software Vertex Buffer
float vertex[1000*3];
unsigned int index[1000*3];

...write something to vertex buffer
...write something to index buffer

glVertexPointer(3,GL_FLOAT,0,vertex);
glDrawElements(GL_TRIANGLES,1000,GL_UNSIGNED_INT,index);

2) Hardware Vertex Buffer with mapping
glBindBufferARB(GL_ARRAY_BUFFER_ARB,nVB);
float* vb = (float*)glMapBufferARB(GL_ARRAY_BUFFER_ARB,GL_WRITE_ONLY_ARB);
glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB,nEB);
unsigned int* eb = (unsigned int*)glMapBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB,GL_WRITE_ONLY_ARB);

...write something to vb
...write something to eb

glVertexPointer(3,GL_FLOAT,0,0);
glDrawElements(GL_TRIANGLES,1000,GL_UNSIGNED_INT,0);

3) Hardware Vertex Buffer with SubData
float vertex[1000*3];
unsigned int index[1000*3];

glBindBufferARB(GL_ARRAY_BUFFER_ARB,nVB);
glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB,nEB);

...write something to vertex
...write something to index

glBufferSubDataARB(GL_ARRAY_BUFFER_ARB,0,1000*3*sizeof(float),vertex);
glBufferSubDataARB(GL_ELEMENT_ARRAY_BUFFER_ARB,0,1000*3*sizeof(unsigned int),index);
glVertexPointer(3,GL_FLOAT,0,0);
glDrawElements(GL_TRIANGLES,1000,GL_UNSIGNED_INT,0);

On all cards, the hardware buffer with mapping is absolutely slower than the software buffer.
The fastest is the software buffer. Second is the HW VBO with SubData!

If anyone can prove me wrong I will be soooo glad, because otherwise I have to modify my code to use software VBs to render terrain, skeletons without VP, and particle systems. Please check my code.
My email is :
20030214@sina.com

Dtag
07-31-2004, 11:47 PM
I didn't download your app, but from what I can see it looks like you are respecifying the VBO every frame. Of course that cannot be faster than ordinary memory.
Secondly, the VBO will not be a lot faster (if at all) while you are not AGP limited. An example:
The Radeon 9800 XT is specified with roughly 24GB/sec transfer.
Assume you are rendering 333 tris (999 vertices); your code is a little ambiguous there, since your buffers are bigger than they need to be. 999 vertices require 999 * 3 (floats) * 4 (sizeof(float)) = 11988 bytes in memory, and 999 indices another 999 * sizeof(uint) = 3996 bytes. So you are submitting a total of 15984 bytes per draw call. Coming back to the 24GB: you would need to submit 24 000 000 000 bytes per second. With that setup, you would need about 1500000 draw calls per second to be transfer limited ;) .
You are much more likely limited by the overhead of the draw calls or by something else. If you want to become transfer limited, submit more attributes such as texcoords, colors etc. and draw a lot more vertices per draw call. If you are looking for information on being draw-call limited, look at this paper (http://developer.nvidia.com/docs/IO/8230/BatchBatchBatch.pdf)
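The arithmetic above can be checked with a few lines of C. This is only a sketch: the function names are made up for illustration, and it assumes 4-byte float and unsigned int, with one index per vertex as in the example.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes submitted by one draw call: 3 floats per vertex plus one
   unsigned int index per vertex (hypothetical helper, not from the demo). */
size_t bytes_per_drawcall(size_t vertex_count)
{
    assert(vertex_count > 0);
    size_t vertex_bytes = vertex_count * 3 * sizeof(float);
    size_t index_bytes = vertex_count * sizeof(unsigned int);
    return vertex_bytes + index_bytes;
}

/* Draw calls per second needed to fill the given bandwidth. */
double drawcalls_to_saturate(double bytes_per_second, size_t vertex_count)
{
    return bytes_per_second / (double)bytes_per_drawcall(vertex_count);
}
```

Calling drawcalls_to_saturate(24e9, 999) gives the "about 1500000" figure quoted above.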

Ysaneya
08-01-2004, 12:16 AM
Well, I have to agree that with "only" 10000 vertices, your results are not very significant. Also, what kind of framerates do you get? If you're getting more than 200 fps, the results don't really mean anything.

After looking at your code for just 30 seconds, I already saw one major flaw. Video memory is uncached, so if you're getting a pointer directly to video memory, you should write to it sequentially (without "holes"). That's not what your "CreateTerrain" function is doing, so the map/unmap method is bound to be the slowest.
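A sequential fill could look roughly like the sketch below. It is not the demo's CreateTerrain; terrain_height and fill_terrain_sequential are hypothetical names, and dst stands in for whatever pointer glMapBufferARB returns (here it can be any writable buffer).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical height function, standing in for real terrain data. */
float terrain_height(size_t x, size_t z)
{
    return (float)((x + z) % 4);
}

/* Fill a w*h grid of xyz positions, writing every float exactly once
   and in increasing address order, so there are no "holes" and no
   revisits. Returns the number of floats written. */
size_t fill_terrain_sequential(float *dst, size_t w, size_t h)
{
    assert(dst != NULL);
    float *out = dst;
    for (size_t z = 0; z < h; ++z) {
        for (size_t x = 0; x < w; ++x) {
            *out++ = (float)x;
            *out++ = terrain_height(x, z);
            *out++ = (float)z;
        }
    }
    return (size_t)(out - dst);
}
```

The point is the single running pointer: nothing is read back from dst and no element is skipped and filled in later.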

You can also test with interleaved arrays instead of flat arrays.
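For the interleaved variant, a minimal sketch of the layout, assuming a position plus one texcoord set per vertex (the struct and function names are made up for illustration). In the real program sizeof(InterleavedVertex) would be the stride and offsetof(InterleavedVertex, uv) the texcoord offset passed to glVertexPointer / glTexCoordPointer.

```c
#include <assert.h>
#include <stddef.h>

/* One vertex with all of its attributes stored next to each other. */
typedef struct {
    float pos[3];
    float uv[2];
} InterleavedVertex;

/* Build an interleaved array from two flat arrays
   (3 floats per position, 2 floats per texcoord). */
void interleave(const float *positions, const float *uvs,
                InterleavedVertex *out, size_t count)
{
    assert(positions != NULL && uvs != NULL && out != NULL);
    for (size_t i = 0; i < count; ++i) {
        out[i].pos[0] = positions[i * 3 + 0];
        out[i].pos[1] = positions[i * 3 + 1];
        out[i].pos[2] = positions[i * 3 + 2];
        out[i].uv[0] = uvs[i * 2 + 0];
        out[i].uv[1] = uvs[i * 2 + 1];
    }
}
```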

Y.

fuzzy3d
08-01-2004, 01:14 AM
Thanks. I do not respecify the VBO every frame; I create it before the main loop.

About the transfer limit: I just want to compare software VB and VBO, so 10000 vertices is enough. In a real game, like Quake3 or another indoor BSP scene, when you render the triangles of one shader in a single glDrawElements call, you only need to push 50-5000 vertices sharing that shader into the dynamic VB (with texcoord, TBN, or vertex color).

I was using MapBuffer/UnmapBuffer wrong, but after I fixed it, MapBuffer is still the slowest.
Has anyone run NV's VAR demo on a P4 with a Geforce3? It runs at the same speed with or without VAR. VAR is just like map/unmap.
Also, I have to keep a duplicate of my VB in application RAM if I use glBufferSubDataARB, and even when I do, it is still slower than a software VB when there are fewer than 5000 vertices to push. WHY do we need dynamic VBOs? I don't understand.

Dtag
08-01-2004, 03:38 AM
"about transfer limit ,i just want to compare software vb and VBO. so 10000 vertices is enough"
( You used 1000 before )
Yeah, but then the VBO will not get to play out its real advantage: the data does not need to be transferred across the AGP bus, and so it does not depend on the transfer speed.

"when you render triangles of same shader in one glDrawElement call, only 50-5000 vertices which have same shader you need push into dynamic vb"
Yes but Quake is also not transfer limited and it does not try to get transfer limited.

CrazyButcher
08-01-2004, 06:37 AM
I ran the app
p4 2.2, geforce4 ti 4200 (56.72)

with max size (tris = 20 000)
I get
software: 160 fps
vbo map: 140 fps
vbo sub: 200 fps

the "mapping" is known to be slow, so that's no real surprise; other than that, vbo is faster...

I might also add that if tris is below 8 192, then software is faster, but at and above that count the advantage of vbo grows with the tris count

fuzzy3d
08-01-2004, 09:28 AM
Originally posted by Dtag:
"about transfer limit ,i just want to compare software vb and VBO. so 10000 vertices is enough"
( You used 1000 before )
Yes but Quake is also not transfer limited and it does not try to get transfer limited.

In my real code, I used 16 to 10000 vertices. You can increase/decrease the vertex count with the A/Z keys.
I tested the demo on a Geforce3, and there VBO is the fastest. But on the Geforce 4max, FX5200, FX5800 and all the ATI cards, VBO is slower than software VB when drawing fewer than 10000 triangles in one glDrawElements call (the vertex count is about 5000).
My main point and main question is: in a real game, whether an indoor BSP engine or a patch terrain scene, we always use dynamic VBs of fewer than 5000 vertices. But the DRIVERS ALWAYS MAKE SOFTWARE VB FASTER. WHY DO THE DRIVERS DO THAT?

MrShoe
08-01-2004, 02:49 PM
I don't understand why you're saying that you never have a lot of data in VBOs in games... In my terrain engine, I have a single VBO with 1024x1024 vertices and something like 32 bytes per vertex. That all adds up to 32MB in the VBO. If I were transferring all of that over the AGP bus every frame... well, yeah :-)
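To put a number on that, here is a small sketch of the size calculation. The function names are made up, and the frame rate passed to the second function is an assumption for illustration, not something measured in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Total size of a grid-shaped vertex buffer in bytes. */
size_t vbo_size_bytes(size_t grid_w, size_t grid_h, size_t bytes_per_vertex)
{
    assert(grid_w > 0 && grid_h > 0 && bytes_per_vertex > 0);
    return grid_w * grid_h * bytes_per_vertex;
}

/* Bus traffic if the whole buffer were re-sent every frame. */
double resend_bytes_per_second(size_t vbo_bytes, double frames_per_second)
{
    return (double)vbo_bytes * frames_per_second;
}
```

vbo_size_bytes(1024, 1024, 32) is 33554432 bytes, i.e. the 32MB above; at an assumed 60 frames per second, re-sending it every frame would mean roughly 2 GB of traffic per second.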

Obli
08-03-2004, 12:32 PM
I personally think all these comparisons should be done on real-world scenarios.
Synthetic benchmarks are only good to a degree.
For example, VBO could gain only a minor speedup (or even a slowdown) in a synthetic bench but gain much more in a real app that actually uses the data bus.

I personally have always had huge speedups from enabling VBOs, so I think this is an ugly configuration artifact, but everyone is free to say anything after all.