VBO vs. Vertex Arrays on a Quadro FX 1500



AnselmG
05-15-2007, 04:19 AM
I just downloaded NeHe lesson 45, which demonstrates VBO rendering. I am using a Quadro FX 1500 board, and here comes the interesting part:

When I activate VBOs I get 60 fps while rendering 526,000 triangles. Using plain vertex arrays I get 260 fps in the same demo!

Can anyone explain this effect to me? I thought VBOs (especially static ones) should definitely be faster (or at least not slower) than vertex arrays - but it seems as if I am wrong...

Zengar
05-15-2007, 05:58 AM
What drivers are you using?

AnselmG
05-15-2007, 07:51 AM
The official (Quadro) ForceWare 91.85.

Jens Scheddin
05-15-2007, 08:32 AM
Didn't notice you started a new topic, so I'm posting my answer here too:

Why so many triangles? The standard tutorial has only 32K triangles. Maybe you exceeded the maximum size of the buffer object. The reason for bad performance with VBOs could be too much or too little data in a single buffer object.

AnselmG
05-15-2007, 08:47 AM
OK, maybe that is the problem, but I expected to be able to render 500K triangles efficiently with one VBO. (Actually, 500K vertices are only a few MB of data...)

(I decreased the number of pixels per vertex to 1.0 in the demo - just for evaluation)
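For what it's worth, the path I expected to be fast looks roughly like this - a sketch with placeholder names ('g_vertices', 'g_vertexCount'), not NeHe's actual code:


// Upload the mesh once into a static VBO at load time.
GLuint vbo;
glGenBuffersARB( 1, &vbo );
glBindBufferARB( GL_ARRAY_BUFFER_ARB, vbo );
glBufferDataARB( GL_ARRAY_BUFFER_ARB,
                 g_vertexCount * 3 * sizeof(GLfloat), // position data only: a few MB for 500K vertices
                 g_vertices, GL_STATIC_DRAW_ARB );    // upload once, draw many times

// Per frame: no vertex data should cross the bus, only the draw command.
glEnableClientState( GL_VERTEX_ARRAY );
glBindBufferARB( GL_ARRAY_BUFFER_ARB, vbo );
glVertexPointer( 3, GL_FLOAT, 0, (char*)NULL );       // offset 0 into the bound VBO
glDrawArrays( GL_TRIANGLES, 0, g_vertexCount );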

Ysaneya
05-15-2007, 01:18 PM
There must be something wrong either in your fps code or in the number of rendered triangles. There's no way you can get 260 fps @ 526K tris/frame on a Quadro 1500. That'd be nearly 137 MTris/second, with all the bus transfers going on for each rendering call.

Y.

AnselmG
05-16-2007, 12:50 AM
you can download the demo from NeHe - just change code lines 26-28 like this:


#define MESH_RESOLUTION 1.0f
#define MESH_HEIGHTSCALE 1.0f
#define NO_VBOS

I was using FRAPS for the fps measurements (60 with VBOs / ~250 without).

Ysaneya
05-16-2007, 02:14 AM
GeForce 7800 GTX, Pentium dual-core 3 GHz, 2 GB RAM, and the settings you just posted above: 110 fps with VBOs, 42 fps without. Pretty much what I expected.

I have no idea why it reports a framerate of 260 fps without VBOs on your machine, but it has to be wrong. Maybe a driver bug, not rendering all the vertices?

Y.

Ysaneya
05-16-2007, 02:31 AM
By the way, I must point out the proof that the number is wrong: a vertex is 20 bytes, and you've got 1.7 million vertices per frame. If you were indeed rendering at 260 fps from system memory, that'd be a bus bandwidth of 1.7M * 20 * 260 = 8840 MB/sec - a lot more than even a PCIe x16 bus can achieve.
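Spelled out (the ~4 GB/s figure for a first-generation PCIe x16 slot is my own assumption, not something from the demo):


// Back-of-the-envelope check of the claimed framerate.
const double bytesPerVertex   = 20.0;   // 3 floats position + 2 floats texcoord
const double verticesPerFrame = 1.7e6;
const double framesPerSecond  = 260.0;
const double mbPerSec = bytesPerVertex * verticesPerFrame * framesPerSecond / 1.0e6;
// mbPerSec == 8840.0 - more than twice the ~4000 MB/s a PCIe x16 slot can move,
// so the vertex data cannot really be crossing the bus every frame at 260 fps.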

Y.

knackered
05-16-2007, 02:41 AM
Use FRAPS.

Jens Scheddin
05-16-2007, 02:53 AM
Tested on an ATI X1950 pro AGP:

11 fps w/o VBO
111 fps with VBO

With standard VAs it looks like the AGP bus or my rather slow CPU (Athlon XP 2700+) is the bottleneck.

My guess would be that there is probably something wrong with your drivers or system in general.

Why are you using FRAPS instead of running the app in windowed mode and taking the displayed fps into account?

AnselmG
05-16-2007, 03:52 AM
well, actually I am using windowed mode WITH the internal fps counter and the external one (FRAPS) - btw, I have a Core 2 Duo 1.8 GHz and 2 GB RAM.

Take a look at these screenshots:

Without VBOs:

http://gonzo.uni-weimar.de/~grundhoe/VBO/NO_VBO.jpg

VBOs:

http://gonzo.uni-weimar.de/~grundhoe/VBO/VBO.jpg

AnselmG
05-16-2007, 03:54 AM
actually the framerate is jittering between 230 and 270, which is why there is a difference between the two frame counters - but it is definitely extremely strange behaviour... :confused:

knackered
05-16-2007, 04:20 AM
FRAPS is pretty reliable.
Are you sure the VBO version doesn't have vsync enabled?
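If you want to be certain, you can also force it off from the app itself - a sketch using the WGL_EXT_swap_control extension (Windows-only; a real app would check the extension string first):


typedef BOOL (APIENTRY *PFNWGLSWAPINTERVALEXTPROC)( int interval );
PFNWGLSWAPINTERVALEXTPROC wglSwapIntervalEXT =
    (PFNWGLSWAPINTERVALEXTPROC)wglGetProcAddress( "wglSwapIntervalEXT" );
if( wglSwapIntervalEXT )
    wglSwapIntervalEXT( 0 );  // 0 = never wait for the vertical retrace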

AnselmG
05-16-2007, 04:37 AM
vsync is disabled in both cases - I took both screenshots with exactly the same configuration.

I checked my driver settings and disabled "maximize texture memory" - and now I get ~400 fps without VBOs and ~100 fps with them.
Everything measured with FRAPS. I really have no idea what's going on here...

RigidBody
05-16-2007, 05:43 AM
just tested it on a Quadro FX 3500 (driver 2.0.2 NVIDIA 87.56), dual Xeon @ 3.0 GHz, SUSE Linux 10.0.

with vbo: ~120 fps
without vbo: ~40 fps

(according to the window's titlebar)

hardly any difference in fps between 640x480 and 1280x1024 resolution.

RigidBody
05-16-2007, 06:35 AM
by the way - it's not a good idea to calculate fps like in that example:


if( (SDL_GetTicks() - g_dwLastFPS) >= 1000 ) // When A Second Has Passed...
{
    g_dwLastFPS = SDL_GetTicks();            // Update Our Time Variable
    g_nFPS = g_nFrames;                      // Save The FPS
    g_nFrames = 0;
    ...
}

the time difference (SDL_GetTicks() - g_dwLastFPS) can for instance be 1500 (= 1.5 sec). in that case the condition is true, and if 100 frames were drawn, fps will be set to 100, although it is really only 100/1.5 = 67.

a better way would look like this:


if( g_nFrames == 100 )                       // after every 100 frames...
{
    // ...divide by the actually elapsed time instead of assuming one second
    float dt = 0.001f * (float)(SDL_GetTicks() - g_dwLastFPS);
    g_dwLastFPS = SDL_GetTicks();
    g_nFPS = (int)(100.0f / dt);
    g_nFrames = 0;
    ...
}

AnselmG
05-16-2007, 06:51 AM
you are right - normally I don't use this for frame counting - I use code similar to the one you proposed, but I also use FRAPS as a reference.

I assume it is a driver bug/feature (?) - I also cannot reproduce it on any other GeForce card. If there's someone with a Quadro card, please try it with the same driver version!

tranders
05-17-2007, 11:00 AM
I modified the test to render the default 32K mesh 100 times (3.3M triangles) in a 1024x1024 window and added logic to switch between modes instead of recompiling with a switch. On Vista I see what I would expect, but on XP I'm seeing differences similar to yours (i.e., VAs are faster than VBOs). I also added logic to test display lists.

Quadro FX 3450

VISTA, Driver 160.03

VA: 4 fps
VBO: 12 fps
DL: 20 fps
NULL(1): 85 fps(2)


XP, Driver 91.36

VA: 26 fps
VBO: 13 fps
DL: 21 fps
NULL(1): 4950 fps


(1) Loop overhead, no draw
(2) VSYNC is non-functional on Vista Aero
--

VBOs and DLs behaved the same between Vista and XP, with DLs being the clear winner for static data. There is a very odd anomaly with VAs on Quadro cards on XP -- and only an NVIDIA developer can answer that question.

I did notice that the NeHe test never calls glFlush() or glFinish() prior to swapping buffers. If I insert a glFinish() prior to calling SwapBuffers(), the VA frame rate is nearly identical to the DL frame rate and the NULL draw frame rate drops to 2750 fps.
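Concretely, the modification is just this ('hDC' standing in for whatever device context the app created):


glFinish();         // block until all queued GL commands have completed
SwapBuffers( hDC ); // only then perform (and time) the swap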

Jackis
05-17-2007, 11:53 AM
Actually, if VBOs are used carefully, there is no measurable difference between VBOs and DLs on modern drivers (especially if you are using glDrawArrays(), not glDrawElements()).

tranders
05-17-2007, 12:06 PM
for (int i = 0; i < LOOP_ITERATIONS; i++)
    glDrawArrays( GL_TRIANGLES, 0, g_pMesh->m_nVertexCount );

vs.


for (int i = 0; i < LOOP_ITERATIONS; i++)
    glCallList(g_DisplayList);

FWIW, the display list was created by recording the glDrawArrays call with the VBO data.

Unless you can identify the incorrect usage of VBOs in NeHe lesson 45, there actually is a measurable difference between VBOs and DLs (at least on a Quadro FX graphics card). I would really like to know if anyone can make this data render faster using a VBO instead of a DL.
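For reference, the recording step looks roughly like this (a sketch of the setup, not my exact code):


// Compile the VBO-sourced draw call into a display list once at load time;
// GL dereferences and captures the vertex data while the list is compiled.
GLuint g_DisplayList = glGenLists( 1 );
glNewList( g_DisplayList, GL_COMPILE );
glDrawArrays( GL_TRIANGLES, 0, g_pMesh->m_nVertexCount );
glEndList();

// Per frame, only the list is replayed.
glCallList( g_DisplayList );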

Jackis
05-17-2007, 12:16 PM
Ah, sorry, I meant that on common GeForce hardware (not Quadro) the difference between VBOs and DLs is unnoticeable. I can't say anything about Quadro, sorry again.

tranders
05-17-2007, 02:17 PM
Further investigation shows that VBOs and DLs display at about the same speed if the object has ~128K triangles in a single draw call. Anything less and the DL is faster; anything more and the VBO is faster. This would indicate a difference in some internal blocking/batching factor. On a Quadro FX on XP, VAs are faster in all cases I have tested - go figure.

Considering that most of my objects are static and have fewer than 100K triangles, I see no benefit to using VBOs. Longs Peak needs to take these metrics into account and either retain geometry-only DL technology or improve VBOs so that they display more efficiently for smaller objects. If there is a difference between Quadros and GeForce cards, one would think that the professional card would simply do the fastest thing regardless of which path was taken.

Komat
05-17-2007, 06:03 PM
Originally posted by tranders:
If there is a difference between Quadros and GeForce cards, one would think that the professional card would simply do the fastest thing regardless of which path was taken.

The professional cards and their drivers are optimized for use in professional modeling programs. If such programs use VAs and DLs to draw huge amounts of geometry most of the time, the driver behavior (e.g. the memory allocation strategy) will be optimized for that kind of operation. The gaming cards, on the other hand, will have their drivers optimized for the methods used by popular games, so the VBO path might be the more optimized one.

tranders
05-17-2007, 06:49 PM
I'm not at all familiar with how games display their data. I assume it could be built into one giant VBO (or several large ones), with subsets managed as components change. That would also lend some credibility to the support for instancing. However, I still think that (for an identical data set) a professional driver should take additional steps to optimize the DL for maximum performance (e.g., allocate a large VBO behind the curtain if that will improve performance). Application users pay a premium for these cards, so they should expect them to be fast regardless.

It would be interesting to see if there is a similar break-even point on the GeForce cards.

tamlin
05-23-2007, 12:41 AM
It used to be that the "professional" stuff was geared more towards geometry throughput, and the "gaming" stuff more towards pixel speed.

But nowadays, when games throw hundreds of thousands of triangles, thousands of state changes, and hundreds of both vertex and fragment programs at the card every frame and still get interactive speeds, and cards have on-board memory of half a gigabyte and sometimes more, I wonder what the measurable benefit of "professional" cards really is.

Is it that their drivers are less buggy, or simply more precise (both internally and on the card)? I'm thinking of something like going float->double in your program, and possibly switching the x87 FPU to 80-bit precision, i.e. getting more precision at the cost of speed.

Related, but a bit o/t: just the other day I tried 3ds Max 8 in OpenGL mode on my 7600, and OMG was it buggy! :-) Perhaps this is an area where a Quadro and matching drivers would have done better; perhaps it's a bug in Windows when using (even if completely hidden) layered windows (the OS-provided kind that alpha-blends); or it's simply a Max bug. Either way, both software and D3D modes worked as expected.

Komat
05-23-2007, 01:32 AM
Originally posted by tamlin:
But nowadays, when games throw hundreds of thousands of triangles, thousands of state changes, and hundreds of both vertex and fragment programs at the card every frame and still get interactive speeds, and cards have on-board memory of half a gigabyte and sometimes more, I wonder what the measurable benefit of "professional" cards really is.

The professional cards have drivers certified for various 3D modeling applications, so there may be better support from the application's vendor if a problem occurs while you are using a driver version that was certified with that program.

Some applications (e.g. 3ds Max, AutoCAD) also support special application drivers which can be used instead of the OGL backend (e.g. the MAXtreme drivers from Nvidia). From what I have read, such a driver can significantly increase the performance of modeling-related tasks.

Additionally, the Quadro series has some features (although some of them are likely driver-only) that are useful for professional applications, such as overlay planes (for more efficient visualization of selections in high-polygon geometries), a unified back buffer (which allows more efficient usage of video memory in applications using multiple OGL windows), and support for synchronized swapping of multi-monitor output and OGL stereo. I think some old Quadros also supported OGL logical operations. They probably also have better support for antialiased lines and wireframe rendering.

V-man
05-24-2007, 02:51 AM
Originally posted by tamlin:
Related, but a bit o/t: just the other day I tried 3ds Max 8 in OpenGL mode on my 7600, and OMG was it buggy! :-) Perhaps this is an area where a Quadro and matching drivers would have done better; perhaps it's a bug in Windows when using (even if completely hidden) layered windows (the OS-provided kind that alpha-blends); or it's simply a Max bug. Either way, both software and D3D modes worked as expected.

I'm surprised. nVidia tended to do everything perfectly. Even ATI works well. I remember rare cases on a Radeon 9500 where it would crash when line rendering was used; at times you would get random lines all over the screen.

RigidBody had some decent numbers there, showing VBOs are better. Perhaps the NeHe code is not good.

Jens Scheddin
05-24-2007, 06:53 AM
Originally posted by V-man:
RigidBody had some decent numbers there, showing VBOs are better. Perhaps the NeHe code is not good.

Well, as far as I can tell, you need at least ~2000 vertices in a draw call for a VBO to be as fast as or faster than general vertex arrays.
This has been tested on ATI, so it may be a bit different on nVidia hardware. The size of the VBO doesn't matter, but it shouldn't exceed a few MB.

This might be the case because the calls to the gl*Pointer functions are really expensive, for whatever reason.

I noticed this when I wanted to skip conventional vertex arrays and use VBOs everywhere in my current engine. After some experimenting, I was pretty disappointed with the performance of VBOs (most of the time you'll end up rendering fewer than 2000 vertices at once). They are only fast if you can batch a lot of geometry into a single draw call, as sketched below. Otherwise they are even _slower_ than VAs!
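To illustrate the difference (all names are made up, and this assumes the objects share vertex format and render state):


// Slow: one gl*Pointer + draw call per small object - the per-call setup dominates.
for( int i = 0; i < numObjects; i++ )
{
    glBindBufferARB( GL_ARRAY_BUFFER_ARB, objectVbo[i] );
    glVertexPointer( 3, GL_FLOAT, 0, (char*)NULL );
    glDrawArrays( GL_TRIANGLES, 0, objectVertexCount[i] ); // often < 2000 vertices
}

// Fast: everything packed into one shared VBO at load time, drawn in one call.
glBindBufferARB( GL_ARRAY_BUFFER_ARB, sharedVbo );
glVertexPointer( 3, GL_FLOAT, 0, (char*)NULL );
glDrawArrays( GL_TRIANGLES, 0, totalVertexCount );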

How fast are D3D vertex buffers in such cases? Do they behave the same, or are they faster even for small buffers? (I bet it's the latter :( )

tamlin
05-25-2007, 08:56 AM
IIRC, NVIDIA recommended batch sizes of 10K-20K several years ago (!). 2K is nowadays considered so small that the overhead of setting up the buffers may dwarf the actual transfers.

AFAIK, mapping the buffers has much larger overhead than simply uploading manually (possibly that's where the sometimes-suggested "upload instead of mapping" comes from?). But whether you map or upload manually, once both vertex and index data are on the "server" side, drawing should be much faster (sending only commands) than using VAs (sending both commands and data).

The only idea/advice I have is: collate, collate, collate. Put as much data into the buffers as you can. If you have more than 64K indices (so the ushort limit kicks in), or you have put many index buffers into one index VBO but all of them start at vertex zero (rebasing them on the CPU is possibly faster, but then we may again reach the ushort limit), you can re-base what the server considers index zero for each batch (using e.g. glVertexPointer), so that what's at offset 47911 in the buffer is considered vertex[0] - see the sketch below. (Note: 47911 is obviously a bad choice for an offset to start a vertex at :-) Try to keep it at least 8-byte, but preferably even 32-byte, aligned, especially with a 256-bit memory bus, where 256/8 = 32.)
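In code, the re-basing reads roughly like this (a sketch; the names and the offset are illustrative):


// Point the vertex "pointer" into the big VBO at a byte offset, so the
// ushort indices of this batch are relative to that offset, not to vertex 0.
glBindBufferARB( GL_ARRAY_BUFFER_ARB, bigVbo );
size_t baseVertex = 48000;  // keep this nicely aligned
glVertexPointer( 3, GL_FLOAT, 0,
                 (char*)NULL + baseVertex * 3 * sizeof(GLfloat) );
glDrawElements( GL_TRIANGLES, batchIndexCount,
                GL_UNSIGNED_SHORT, batchIndices ); // client-side ushort indices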

But indeed, you need "larger" amounts of data for a VBO to be efficient. Even Begin/Vertex*N/End can probably be (is?) faster than a VBO for small batches.

Rounding off: unless you were already aware of it, always set the vertex "pointer" last, just as if you were doing immediate-mode drawing - see the sketch below. The majority of the (required buffer) work is done when setting the vertex "pointer", which is why no other attribute "pointers" should be modified after it. This implies that if you use multiple batches of vertex attributes in a single VBO, you should always end the current batch by unmapping it, so the driver knows it no longer has to track the other attribute "pointers" for this batch. Otherwise every following call to e.g. glNormalPointer could trigger a lot of work that was really intended for the next batch of vertices.
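As a sketch (one VBO holding all attributes; the offset names are illustrative):


glBindBufferARB( GL_ARRAY_BUFFER_ARB, batchVbo );
// Set all other attribute "pointers" first...
glNormalPointer( GL_FLOAT, 0, (char*)NULL + normalOffset );
glTexCoordPointer( 2, GL_FLOAT, 0, (char*)NULL + texcoordOffset );
// ...and the vertex "pointer" last, since (per the above) that is
// where the driver does the heavy per-batch validation work.
glVertexPointer( 3, GL_FLOAT, 0, (char*)NULL + vertexOffset );
glDrawArrays( GL_TRIANGLES, 0, batchVertexCount );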

++luck;

Jens Scheddin
05-27-2007, 07:13 AM
Originally posted by tamlin:
This implies that if you use multiple batches of vertex attributes in a single VBO, you should always end the current batch by unmapping it, so the driver knows it no longer has to track the other attribute "pointers" for this batch. Otherwise every following call to e.g. glNormalPointer could trigger a lot of work that was really intended for the next batch of vertices.

Thanks for your detailed reply. But what exactly do you mean by "unmapping" the current batch?

Korval
05-27-2007, 10:23 AM
Originally posted by Jens Scheddin:
How fast are D3D vertex buffers in such cases? Do they behave the same, or are they faster even for small buffers?

Due to the design of the D3D driver model, almost certainly not.

Every D3D DrawPrimitive call makes a call into the driver, which provokes a CPU switch from user mode to kernel mode. This switch takes a long time (relatively speaking). An nVidia paper a while back suggested that you get approximately 100,000 such calls per second with a 1 GHz CPU (since it's CPU-time limited).

By contrast, calling glDrawElements does not always require a kernel mode switch. The OpenGL implementation can marshal such calls so that they happen when the GPU is running out of stuff to do, thus provoking fewer kernel mode switches.