NVIDIA VBO + glDrawElements() bug or feature?

It seems that NVIDIA drivers do an internal copy of the indices even when they sit in a properly set up GL_STATIC_DRAW_ARB VBO?!? [I concluded that from the AMD CodeAnalyst hotspot disassembly.] I then tried glDrawArrays() instead, which reduced the ~80% CPU usage to ~6%… Or perhaps I am somehow not on a fast path? (It would be nice to have some clear way of checking that from the API.)

This is a major bottleneck since I am making an outdoor arcade game with many simply rendered polys.

Also, where can NVIDIA driver bugs be reported officially? I stumbled upon a really stupid and nasty one or two in the latest WHQL Detonators and made a test app to prove it…

Have you tried using glDrawRangeElements instead of glDrawElements?

I don’t know whether glDrawElements() does a copy of the indices, but it is said to do a range scan of them. That is the reason glDrawRangeElements() exists.
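For clarity, here is a minimal sketch of the difference (variable names like index_count and max_vertex_index are just illustrative, not from speedy's app; the element array VBO is assumed to be bound, so the last parameter is an offset of 0 into it):

glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_SHORT, 0);

// The extra start/end parameters tell the driver the smallest and largest
// vertex index referenced, so it does not have to scan the index array itself:
glDrawRangeElements(GL_TRIANGLES, 0, max_vertex_index, index_count, GL_UNSIGNED_SHORT, 0);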

You’ve read this? http://developer.nvidia.com/object/using_VBOs.html

Originally posted by Asgard:
Have you tried using glDrawRangeElements instead of glDrawElements?

***** ONLY FIRST TIME INIT

#ifdef USE_VBO
// Upload the vertex data into buffer object 1 (the app binds hard-coded names, no glGenBuffersARB)
glBindBufferARB(GL_ARRAY_BUFFER_ARB, 1);
glBufferDataARB(GL_ARRAY_BUFFER_ARB, vertexsno * sizeof(*vertexs), vertexs, GL_STATIC_DRAW_ARB);
#endif

#ifdef USE_VBO
// Upload the 16-bit index data into buffer object 2
glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, 2);
glBufferDataARB(GL_ELEMENT_ARRAY_BUFFER_ARB, facesno * 3 * sizeof(*faces_short), faces_short, GL_STATIC_DRAW_ARB);
#endif

#ifdef USE_VBO
// Bind both buffers and point the vertex array at offset 0 of the vertex VBO
glBindBufferARB(GL_ARRAY_BUFFER_ARB, 1);
glVertexPointer(3, GL_FLOAT, 0, 0);
glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, 2);
#endif
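(Side note: I bind the hard-coded buffer names 1 and 2 directly here. The more conventional setup is to let the driver generate the names; a rough sketch, with vbo[] just as an illustrative name, which should not change any of the results below:)

GLuint vbo[2];
glGenBuffersARB(2, vbo);

glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo[0]);
glBufferDataARB(GL_ARRAY_BUFFER_ARB, vertexsno * sizeof(*vertexs), vertexs, GL_STATIC_DRAW_ARB);

glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, vbo[1]);
glBufferDataARB(GL_ELEMENT_ARRAY_BUFFER_ARB, facesno * 3 * sizeof(*faces_short), faces_short, GL_STATIC_DRAW_ARB);

glEnableClientState(GL_VERTEX_ARRAY);   // only the vertex array is enabled
glVertexPointer(3, GL_FLOAT, 0, 0);     // offset 0 into the bound vertex VBO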

***** MAIN LOOP

QueryPerformanceCounter((LARGE_INTEGER*)&start_time);

// batches_per_frame = 200, faces_per_batch = 1000. only vertex array enabled
for( j = 0 ; j < batches_per_frame ; j++ ) {
    glDrawRangeElementsEXT(GL_TRIANGLES, 0, faces_per_batch*3, faces_per_batch*3, GL_UNSIGNED_SHORT, 0);
    ret.polys_rendered += faces_per_batch;
    ret.objects_rendered++;
}

QueryPerformanceCounter((LARGE_INTEGER*)&cur_time);
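(For reference, the triangle rates quoted below are derived from these two counter samples roughly like this; the exact bookkeeping in the test app may differ slightly:)

LARGE_INTEGER freq;
QueryPerformanceFrequency(&freq);

// start_time / cur_time are the __int64 values sampled above
double seconds   = (double)(cur_time - start_time) / (double)freq.QuadPart;
double mtris_sec = (double)(batches_per_frame * faces_per_batch) / seconds / 1.0e6;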

glDrawRangeElements() Det 52.16 hotspot

EAX = 0A739340
EDX = 0A5FED24 // looks like some memory mapped heap, maybe AGP memory?!?

******* HOTSPOT LOOP STARTS // looks like implicit flush like in NV_VAR?!?

08B0443C movzx ecx,word ptr [edx] // they take short int index here
08B0443F add edx,2 // advance the index pointer
08B04442 mov esi,dword ptr ds:[0A4D6110h] // load the source base addr
08B04448 lea edi,[ecx+ecx*2] // clever way to multiply by 3 (3 floats per vertex)
08B0444B lea esi,[esi+edi*4] // calculate the final pointer into the source vertex array, base + (index * 3 * sizeof(float))
08B0444E mov edi,dword ptr [esi] // load vertex.x
08B04450 mov ebp,dword ptr [esi+4] // load vertex.y
08B04453 mov dword ptr [eax],edi // store vertex.x
08B04455 mov dword ptr [eax+4],ebp // store vertex.y
08B04458 mov edi,dword ptr [esi+8] // load vertex.z
08B0445B mov dword ptr [eax+8],edi // store vertex.z
08B0445E add eax,0Ch // add sizeof(float) * 3 to destination
08B04461 cmp edx,dword ptr [esp+20h] // end of the copy loop?
08B04465 jne 08B0443C

******* LOOP ENDS

glDrawRangeElements() is no good either; it is even worse in Det 52.16 because, as you can see, it copies vertices rather than indices. Can people from NVIDIA hint at what this copying is about and, more importantly, whether there is a way to avoid it other than waiting for better VBO support?
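(For completeness, the glDrawArrays() path that avoided the copying in my first post looks roughly like this; it assumes the vertex VBO holds de-indexed, flattened triangles, so it trades extra vertex memory for not touching the index array at all:)

for( j = 0 ; j < batches_per_frame ; j++ ) {
    glDrawArrays(GL_TRIANGLES, 0, faces_per_batch * 3);   // no index array involved
    ret.polys_rendered += faces_per_batch;
    ret.objects_rendered++;
}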

[This message has been edited by speedy (edited 11-12-2003).]


Doesn’t this depend upon which card you have? I think storing indices in graphics memory is only possible on FX cards. The original proposals for extending VAR for indices were an NV30-only thing.

Matt

Yes, but the VBO spec doesn’t force the driver to put the indices in video memory. Regardless of which flag you create your VBO with, the driver is free to stick the data in system memory if video memory is for any reason not an option.

– Tom

TESTBED 1
GFX CARD: MSI GeForce FX 5600 VIVO @ AGP 4x
CPU: Athlon XP 1700+
MOBO: Gigabyte GA-7VAX
DRIVER REVISION: Detonator 45.23 & Detonator 52.16

TESTBED 2
GFX CARD: EAGLE GeForce 4MX
CPU: Duron 1200
MOBO: ???
DRIVER REVISION: Detonator 45.23

TESTBED 3
GFX CARD: Sapphire ATI 9600
CPU: ???
MOBO: ???
DRIVER REVISION: ???

INDICES TYPE: unsigned short
VBO USED: YES
VBO TYPE: 1 for indices + 1 for arrays
ARRAYS ENABLED: vertex array only
PRIMITIVE TYPE: GL_TRIANGLES

1 batch per frame, 300000 triangles per batch = 300000 triangles per frame

glDrawArrays(): <1% CPU BUSY : >99% CPU IDLE (triangle rate: 19.980000 MilTris/s) [GFFx 5600 + Det 45.23]
glDrawArrays(): 97% CPU BUSY : 3% CPU IDLE (triangle rate: 10.380000 MilTris/s) [GFFx 5600 + Det 52.16]
glDrawRangeElements(): 50% CPU BUSY : 49% CPU IDLE (triangle rate: 18.240000 MilTris/s) [GFFx 5600 + Det 45.23]
glDrawRangeElements(): 97% CPU BUSY : 3% CPU IDLE (triangle rate: 9.70000 MilTris/s) [GFFx 5600 + Det 52.16]
glDrawRangeElements(): <1% CPU BUSY : >99% CPU IDLE (triangle rate: 23.580000 MilTris/s) [ATI 9600 + Cat]

100 batches per frame, 3000 triangles per batch = 300000 triangles per frame

glDrawArrays(): <1% CPU BUSY : 99% CPU IDLE (triangle rate: 19.980000 MilTris/s) [GFFx 5600 + Det 45.23]
glDrawRangeElements(): 47% CPU BUSY : 53% CPU IDLE (triangle rate: 19.680000 MilTris/s) [GFFx 5600 + Det 45.23]
glDrawRangeElements(): 90% CPU BUSY : 9% CPU IDLE (triangle rate: 19.680000 MilTris/s) [GFFx 5600 + Det 52.16]
glDrawRangeElements(): <1% CPU BUSY : >99% CPU IDLE (triangle rate: 23.640000 MilTris/s) [ATI 9600 + Cat]

1 batch per frame, 100000 triangles per batch = 100000 triangles per frame

glDrawArrays(): <1% CPU BUSY : 99% CPU IDLE (triangle rate: 17.060000 MilTris/s) [GFFx 5600 + Det 45.23]
glDrawRangeElements(): 27% CPU BUSY : 73% CPU IDLE (triangle rate: 16.380000 MilTris/s) [GFFx 5600 + Det 45.23]

100 batches per frame, 500 triangles per batch = 50000 triangles per frame

glDrawArrays(): <1% CPU BUSY : 99% CPU IDLE (triangle rate: 14.260000 MilTris/s) [GFFx 5600 + Det 45.23]
glDrawRangeElements(): 11% CPU BUSY : 89% CPU IDLE (triangle rate: 14.180000 MilTris/s) [GFFx 5600 + Det 45.23]
glDrawRangeElements(): 60% CPU BUSY : 40% CPU IDLE (triangle rate: 14.360000 MilTris/s) [GFFx 5600 + Det 52.16]
glDrawRangeElements(): <1% CPU BUSY : >99% CPU IDLE (triangle rate: 14.80000 MilTris/s) [ATI 9600 + Cat]

NOTES:

  • NVIDIA CPU burning was also tested on a Duron 1200 + GeForce 4MX (Det 45.23) with similar crap results
  • Similar behaviour also occurs with Det 44.xx on TESTBED 1
  • Hotspot disassembly shows that Detonator 45.23 seems to copy indices, while ForceWare 52.16 copies vertex data internally
  • (Only) Detonator 52.16 has an asynchronous SwapBuffers(), so glFinish() was inserted for correct measurements (see the sketch after these notes)
  • Detonator 52.16 proved unstable while profiling (e.g. AMD CodeAnalyst, Intel VTune) and under some types of unexpected usage patterns
  • When saturating fill rate with high triangle counts, complete system stalls occurred (Winamp stops playing music, the mouse moves erratically). NVIDIA bug?!?
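Rough sketch of how the glFinish() mentioned in the notes fits into the measurement (hDC is just an illustrative device context handle; the test app's exact ordering may differ):

QueryPerformanceCounter((LARGE_INTEGER*)&start_time);
// ... issue the draw calls for this frame ...
glFinish();   // wait for the GPU, so the timing isn't fooled by an asynchronous SwapBuffers()
QueryPerformanceCounter((LARGE_INTEGER*)&cur_time);
SwapBuffers(hDC);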

Is this a Brown Paper Bag bug for NVIDIA, or some deep unmarked pitfall or hardware limitation I stumbled upon?

Anyone interested can double-check the results with my open-source test app:
http://galileo.spaceports.com/~speedy1/OpenGL%20test%20app%20v2.rar

If you modify/enhance it, please send me the patches so we can all share the effort… thanks.

You must have a mistake in your program: I don’t see anything, I get tri counts much lower than with my own programs, and the last tests gave me a result of 7600 fps (???).

Rather than trying to analyze NVidia’s drivers to death, maybe you should try to write a working test app first. My log says things like 50K fps and 785 MTris/sec on a GeForce3. I wish! I don’t know what you think you’re measuring, but it’s wrong. Very, very wrong.

– Tom

Tom, Zengar, thank you for bearing with me, there is more quite interesting & in-depth info coming…

> Rather than trying to analyze NVidia’s drivers to death, maybe you should try to write a working test app first.
> My log says things like 50K fps and 785 MTris/sec on a GeForce3. I wish! I don’t know what you
> think you’re measuring, but it’s wrong. Very, very wrong.

Tom, You have uncovered yet another NVIDIA driver bug which is solved in ForceWare 5x.xx drivers. In Det’s 4x
they do not support primitive counts > 65536 when using VBO on gfx cards up to the FX; the FX does all the stuff neatly.
I guess you are using Dets 4x + a GeForce 3?

If you check the log files you’ll see that when the number of triangles per batch is less than 65536/3, the results are accurate.
(using GL_UNSIGNED_INT on indexed calls)
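A possible workaround sketch (not what the test app does, just an illustration): split one large indexed draw into chunks that each stay below 65536 indices, e.g.

const GLsizei max_indices = 65535;   // the limit the Det 4x.xx drivers appear to hit
GLsizei total  = facesno * 3;
GLsizei offset = 0;
while (offset < total) {
    GLsizei count = total - offset;
    if (count > max_indices)
        count = max_indices;
    count -= count % 3;              // keep whole triangles in each chunk
    glDrawElements(GL_TRIANGLES, count, GL_UNSIGNED_INT,
                   (const GLvoid*)(offset * sizeof(GLuint)));
    offset += count;
}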

So the breakdown of the new issue is:

ATI 9600 Cat 3.6: OK!
ATI 9600 Cat 3.9: OK!
GFFx 5600 + 45.23: OK! (but CPU burning)
GFFx 5600 + 52.16: OK! (but CPU burning, even more so)
GF 4MX + 52.16: OK! (but CPU burning, even more so)

GF 4MX + 45.23: NON CONFORMANT! (& CPU burning)
GF 3 + your rev.: NON CONFORMANT!

> You must have a mistake in your program: I don’t see anything, I get the tricounts much lower then with my own programs,
> and last tests gave me the result of 7600 fps(???).

Zengar, which GFX card do you have? What tri counts do you reach with your programs? Are you using VBO? I suppose you are using triangle strips
to reach higher tri counts; they give approx. a 2.5x boost. The test program is working really fine on the 3 computers I currently have access to,
and I have really tried to make it as bulletproof as possible.

Check out
http://galileo.spaceports.com/~speedy1/OpenGL%20test%20app%20v3.rar

This is the new open-source conformance + performance + CPU usage test app with graphical feedback, so you’ll know it is working…
Please check your hardware + drivers with it.

P.S. NVIDIA’s own John Spitzer recommends VTune hotspot analysis in his Graphics Performance Optimisation.ppt for this kind of problem.

>>>>Tom, You have uncovered yet another NVIDIA driver bug which is solved in ForceWare 5x.xx drivers. In Det’s 4x
they do not support primitive counts > 65536
<<<

That’s not a bug.

Originally posted by V-man:
[b]>>>>Tom, You have uncovered yet another NVIDIA driver bug which is solved in ForceWare 5x.xx drivers. In Det’s 4x
they do not support primitive counts > 65536
<<<

That’s not a bug.[/b]

EXT_draw_range_elements

‘The specification of glDrawElements does not allow optimal performance
for some OpenGL implementations, however. In particular, it has no
restrictions on the number of indices given…’

… and the VBO spec doesn’t say anything about any limit on the index count.

[This message has been edited by speedy (edited 11-18-2003).]

So you (they) are exceeding the limits and nothing is rendering?

Alright, I thought by
“they do not support primitive counts > 65536”

you were not happy with their MAX values.

Actually, it looks like both MAX values are 4096. hmmm…
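(Those two are presumably GL_MAX_ELEMENTS_VERTICES and GL_MAX_ELEMENTS_INDICES; a quick sketch of the query for anyone who wants to check their own driver. Keep in mind they are recommended maxima from EXT_draw_range_elements / GL 1.2, not hard limits:)

GLint max_verts = 0, max_inds = 0;
glGetIntegerv(GL_MAX_ELEMENTS_VERTICES, &max_verts);
glGetIntegerv(GL_MAX_ELEMENTS_INDICES,  &max_inds);
printf("MAX_ELEMENTS_VERTICES = %d, MAX_ELEMENTS_INDICES = %d\n", max_verts, max_inds);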

Originally posted by speedy:
[b]
Check out
http://galileo.spaceports.com/~speedy1/OpenGL%20test%20app%20v3.rar

This is the new open-source conformance + performance + CPU usage test app with graphical feedback, so you’ll know it is working…
Please check your hardware + drivers with it.

P.S. NVIDIA’s own John Spitzer recommends VTune hotspot analysis in his Graphics Performance Optimisation.ppt for this kind of problem. [/b]

I get “Not Found
The requested URL was not found on this server.” Hmmm…

Originally posted by Elixer:
[b] I get “Not Found
The requested URL was not found on this server.” Hmmm…

[/b]

Uh oh, I have just tried it and it works. Please try again; it could have been some temporary error?

Just tried again, same error. It also wants to install CJB management plugin… No thanks.

I get redirected to http://www.spaceports.com/404.html from http://galileo.spaceports.com/~speedy1/OpenGL%20test%20app%20v3.rar

Maybe you still have it cached?

Try using a download manager program thing. I had the same problem.

First, I get the download problem too. Second, I’m not using any optimised vertex format. I’ve got a GF5600 and I get about 60-70 MTris per sec. Your program is buggy, believe me. Why, if I render a 3ds mesh with about 40000 triangles, do I get 800 fps? And with your program I get very weird results.