VBO Performance

Hello

I am using VBOs for large triangle data with GL_STATIC_DRAW. The vertex array is interleaved with normal and position: NX,NY,NZ, X,Y,Z.
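For clarity, one vertex in the buffer looks roughly like this (the struct name is just for illustration):

// Layout of one interleaved vertex, matching GL_N3F_V3F:
// 3 floats of normal followed by 3 floats of position, 24 bytes total.
struct InterleavedVertex
{
    FLOAT nx, ny, nz;   // normal
    FLOAT x, y, z;      // position
};
// vertexSize == sizeof(InterleavedVertex) == 6 * sizeof(FLOAT)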
The strange thing is that I get very different performance on two computers:

Computer A:
Intel Pentium 4, 2.8 GHz
512 MB RAM
NVIDIA Quadro4 750 XGL 128 MB Driver: 85.96
Windows XP SP2

Computer B:
AMD Athlon 64 FX-60 Dual-Core, 2.61 GHz
3072 MB RAM
NVIDIA GeForce 7900 GTX 512 MB Driver: 85.96
Windows XP SP2

The rendered scene has 1,789,819 triangles, which are drawn indexed with glDrawElements. Number of patches/VBOs: 9168. Total VBO size: 26.24 MB.
I also tried a scene with 384,219 triangles in 70 VBOs with a total size of 4.81 MB.

On computer A the speed-up gained from VBOs is 250% compared to vertex arrays.
The performance gain on computer B is just 10%, or in some cases 0%! I checked for failures with NV PerfKit 2.0, but with no results. The new driver version 91.31 (?) has no effect. :frowning: NV PerfKit also shows that the VBOs on computer B are stored in video memory, so there should be a huge performance gain. I stored the triangle index array in a VBO as well - nothing. Is the PCI-Express bus of computer B so fast that no performance gain from video-memory VBOs is possible? No:
Another test showed that another PC with the same Quadro card as computer A and the newest driver has the same frame rates as computer A - a 250% gain. On a PC with a GeForce 5200 FX and a 6X.XX driver the performance gain is also very good. But a GeForce 5950 Ultra with the 84.21 driver does not render faster with VBOs. None of these are PCI-Express systems.
So it has to be the card? What is wrong with the VBO implementation on the different cards? Or what could be the reason?

My VBO code does the following:
Init:

glGenBuffersARB(1, &name);
glBindBufferARB(GL_ARRAY_BUFFER_ARB, name);
glBufferDataARB(GL_ARRAY_BUFFER_ARB, nVertz*vertexSize, pVertz, GL_STATIC_DRAW_ARB);

Render:

glBindBufferARB(GL_ARRAY_BUFFER_ARB, name);
glInterleavedArrays(GL_N3F_V3F, stride, (CHAR*)NULL);
glDrawElements(GL_TRIANGLES, nIndices, GL_UNSIGNED_INT, pIndices);

Any suggestions or comments on this problem?
Thanks in advance!

Looks like a driver issue.
Try it without glInterleavedArrays.

Hmm, very interesting.
I’ve looked at your VBO initialization and drawing - I can’t see anything incorrect there. I have some thoughts, but only as suggestions.

  1. Try using GL_UNSIGNED_SHORT indices. You have about 5500 triangles/batch in the second scene and about 200 triangles/batch in the first one, so 16 bits per index is quite sufficient (see the sketch after this list).

  2. Maybe the point is not that VBOs are slow, but that plain vertex arrays are faster on newer cards )))
    What about overall performance? I mean, not the gain, but the overall FPS.
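Just a rough sketch of what I mean for point 1, reusing your variable names (and assuming every patch has fewer than 65536 vertices):

#include <vector>

// Convert the 32-bit indices to 16-bit once, during init.
std::vector<unsigned short> shortIndices(nIndices);
for (unsigned int i = 0; i < nIndices; ++i)
    shortIndices[i] = (unsigned short)pIndices[i];

// Draw with GL_UNSIGNED_SHORT instead of GL_UNSIGNED_INT.
glDrawElements(GL_TRIANGLES, nIndices, GL_UNSIGNED_SHORT, &shortIndices[0]);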

@holdeWaldfee: It’s improbable that it’s a driver problem. As I posted in the gamedev.net thread, the Quadro4 and the GeForce 7900 use the same driver version.

@Jackis: I also guess that vertex arrays are faster on modern cards. But that doesn’t make sense. The VBOs are stored in graphics memory and the VAs in system memory, so the VBOs should be faster. I profiled the program and could verify that the VBOs are in video memory and the VAs are not. But why are they not faster!?!?!?
This sucks! :mad:

Thanks for the answers.
On Monday I will try your suggestions. Enough for today. :wink:

Please see my complete post on gamedev.net
Either post here or don’t. Don’t do this half-post thing where you put the real post on some other forum and expect us to answer it anyway.

Posting in both places is fine, but it should be the entire post in both places. I shouldn’t have to link somewhere to help you.

As I posted in the gamedev.net thread, the Quadro4 and the GeForce 7900 use the same driver version.
So do the TNT2 and the GeForce 7900; you can’t go by which .exe you install. The unified driver model uses whatever code is appropriate for the particular card, but they all install from the same .exe.

But why are they not faster!?!?!?
First, calm down. Nothing is being served by excessive punctuation.

Second, let’s look at the facts. VBOs are faster than vertex arrays. People developing GL applications use and rely on VBOs all the time; they’ve almost completely replaced standard vertex array usage except on legacy applications.

The conclusion, therefore, is that there’s something wrong in your code, not the driver.

My suggestion is the same as holdeWaldfee’s: stop using glInterleavedArrays. It’s not a good function, and it’s always better to interleave manually than to use it.
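Something along these lines, as a sketch only (using the names from your post and assuming the tightly packed NX,NY,NZ, X,Y,Z layout):

// Manual setup equivalent to glInterleavedArrays(GL_N3F_V3F, ...):
glBindBufferARB(GL_ARRAY_BUFFER_ARB, name);
glEnableClientState(GL_NORMAL_ARRAY);
glEnableClientState(GL_VERTEX_ARRAY);
glNormalPointer(GL_FLOAT, 6 * sizeof(FLOAT), (CHAR*)NULL);
glVertexPointer(3, GL_FLOAT, 6 * sizeof(FLOAT), (CHAR*)NULL + 3 * sizeof(FLOAT));
glDrawElements(GL_TRIANGLES, nIndices, GL_UNSIGNED_INT, pIndices);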

@Korval: Thanks for the answer. I changed the first posting to include the whole text without a link.
Sorry for my excessive punctuation. I was really pissed off yesterday. Please imagine my situation: it’s Friday afternoon and you are happy that your code works fine on your machine and that the weekend is near. :slight_smile: Then you try the code on some other machines and it doesn’t really work, so the weekend is far away for now. :frowning:

Is there any cache on the new graphics cards which stores the VAs? If so, are there any cache line sizes I have to pay attention to? What is the perfect size of a vertex? In my case I use interleaved arrays: X1,Y1,Z1, NX1,NY1,NZ1, X2,Y2,Z2, NX2,NY2,NZ2, …

You think I should not use:
glBindBufferARB(GL_ARRAY_BUFFER_ARB, name);
glInterleavedArrays(GL_N3F_V3F, stride, (CHAR*)NULL);
glDrawElements(GL_TRIANGLES, nIndices, GL_UNSIGNED_INT, pIndices);

So I tried glVertexPointer & co., but the program crashed:
glBindBufferARB(GL_ARRAY_BUFFER_ARB, name);
glNormalPointer(GL_FLOAT, 3*sizeof(FLOAT), (CHAR*)NULL);
glVertexPointer(3, GL_FLOAT, 3*sizeof(FLOAT), (CHAR*)NULL + 3*sizeof(FLOAT));
glDrawElements(GL_TRIANGLES, nIndices, GL_UNSIGNED_INT, pIndices);

What other dependencies could slow down the VBOs? Anti-Aliasing, Alpha-Blending, …?

Or did I miss a new technique for rendering large triangle data on modern cards? I use the default T&L stage of the card. Do I have to use a custom shader to get the full power of the new cards?

Thanks in advance!

Your stride should be 6*sizeof(float), not 3*sizeof(float).

First of all, which frame rates are we speaking of in each case? Is it slower on the Athlon FX-60 config? Could it be vsync limited?

1,789,819 triangles in 9168 VBOs, that’s an average of 195 triangles per VBO… that’s quite low. If possible, try to use a smaller number of VBOs (even if it’s not your problem, it can do no harm).
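One possible way to do that, just as a sketch (the patch/offset names here are made up):

// Pack several patches into one big VBO during init and remember a byte
// offset per patch; at render time one bind then covers many patches.
glGenBuffersARB(1, &bigName);
glBindBufferARB(GL_ARRAY_BUFFER_ARB, bigName);
glBufferDataARB(GL_ARRAY_BUFFER_ARB, totalSize, NULL, GL_STATIC_DRAW_ARB);
GLintptrARB offset = 0;
for (int i = 0; i < nPatches; ++i)
{
    glBufferSubDataARB(GL_ARRAY_BUFFER_ARB, offset,
                       patch[i].nVertz * vertexSize, patch[i].pVertz);
    patch[i].vboOffset = offset;   // later used in the gl*Pointer calls
    offset += patch[i].nVertz * vertexSize;
}

At render time you then call glVertexPointer/glNormalPointer with (CHAR*)NULL + patch[i].vboOffset and keep the per-patch indices as they are.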

I also agree that you should avoid glInterleavedArrays.

Y.

Thanks for the answers!

@Trahern: You are right. Stupid mistake of mine. :rolleyes:

I did
glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_NORMAL_ARRAY);
on every rendering of the VBO. Now I have kicked it out of the render loop and do it during init.
I also use glVertexPointer & co. now and store the indices in a VBO too.

My code is now
Init:
glGenBuffersARB(1, &nameInd);
glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, nameInd);
glBufferDataARB(GL_ELEMENT_ARRAY_BUFFER_ARB, sizeof(unsigned int)*nIndices, pIndices, GL_STATIC_DRAW_ARB);
glGenBuffersARB(1, &name);
glBindBufferARB(GL_ARRAY_BUFFER_ARB, name);
glBufferDataARB(GL_ARRAY_BUFFER_ARB, nVertz*vertexSize, pVertz, GL_STATIC_DRAW_ARB);

Render:
if(glIsBufferARB(name)==GL_TRUE)
{
glBindBufferARB(GL_ARRAY_BUFFER_ARB, name);
glVertexPointer(3, GL_FLOAT, vertexSize, (CHAR*)NULL + 3*sizeof(FLOAT));
glNormalPointer(GL_FLOAT, vertexSize, (CHAR*)NULL);
glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, nameInd);
glDrawElements(GL_TRIANGLES, nIndices, GL_UNSIGNED_INT, (CHAR*)NULL);
}

The gain on the GeForce 7900 GTX is now 50%. Better than before, but not the 250% of the older Quadro4 750 XGL. And on a GeForce 5950 Ultra it does not render faster. :confused:
Unsigned SHORT instead of unsigned INT indices did not speed anything up. Which other state change or effect could slow down the VBOs? Or do the Quadro cards have better drivers?
Which alternative technique could I try?

P.S.: The absolute frame rate on the 7900 GTX with VAs is higher than on the Quadro with VBOs. The frame rate is not affected by VSync.

[offtopic]Is it a good idea to store indices in a VBO the way he did it? I thought it was better to store them in system memory.[/offtopic]

Why? If you don’t change them, then I don’t see a reason to keep them in system memory.
And even if it really is better to keep them there, the driver can do it for you (GL_STATIC_DRAW is just a hint… the driver can place them wherever it thinks is most suitable).

Thanks for your help.
I commented out everything other than the VBO-relevant parts of the framework and voilà, the gain is 250% on the new 7900 GTX, the same as on the old Quadro4 750 XGL. So the VBO implementation is fine. There seems to be another technique used in the framework which is really slow on newer cards and affects the frame rate that much. Any ideas which OpenGL technique is so much slower on newer cards?

Oh…My…God.