NVIDIA VBO + glDrawElements() bug or feature?



speedy
11-11-2003, 09:47 PM
It seems that the NVIDIA drivers do an internal copy of the indices even when they sit in a properly set up GL_STATIC_DRAW_ARB VBO?!? [I concluded that from the hotspot disassembly in AMD CodeAnalyst.] Then I tried glDrawArrays() instead, which reduced the CPU usage from 80% to ~6%... Or am I somehow off the fast path? (it would be nice to have some clear way of checking that from the API :) )

This is a major bottleneck for me, since I am making an outdoor arcade game with many simply rendered polygons.

Also, where can NVIDIA driver bugs be reported officially? I stumbled upon a really stupid and nasty one or two ;) in the latest WHQL Detonators and made a test app to prove it...

Asgard
11-12-2003, 12:19 AM
Have you tried using glDrawRangeElements instead of glDrawElements?

Christian Schüler
11-12-2003, 01:37 AM
I don't know whether glDrawElements() copies the indices, but it is said to scan them to determine the referenced vertex range. That is why glDrawRangeElements() exists.
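In other words (a rough sketch, not from either post; vertex_count and index_count are placeholders for your mesh sizes):

// plain glDrawElements(): the driver only sees the index pointer, so it may
// scan (or copy) the indices to find out which vertices are referenced
glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_SHORT, 0);

// glDrawRangeElements(): the caller promises that every index i satisfies
// 0 <= i <= vertex_count - 1, so no scan should be needed
glDrawRangeElements(GL_TRIANGLES, 0, vertex_count - 1, index_count, GL_UNSIGNED_SHORT, 0);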

Relic
11-12-2003, 01:58 AM
You've read this? http://developer.nvidia.com/object/using_VBOs.html

speedy
11-12-2003, 03:02 AM
Originally posted by Asgard:
Have you tried using glDrawRangeElements instead of glDrawElements?


***** ONLY FIRST-TIME INIT

#ifdef USE_VBO

// upload the vertex data into VBO 1
glBindBufferARB(GL_ARRAY_BUFFER_ARB, 1);
glBufferDataARB(GL_ARRAY_BUFFER_ARB, vertexsno * sizeof(*vertexs), vertexs, GL_STATIC_DRAW_ARB);

// upload the index data into VBO 2
glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, 2);
glBufferDataARB(GL_ELEMENT_ARRAY_BUFFER_ARB, facesno * 3 * sizeof(*faces_short), faces_short, GL_STATIC_DRAW_ARB);

// set the pointers once; both buffers stay bound
glBindBufferARB(GL_ARRAY_BUFFER_ARB, 1);
glVertexPointer(3, GL_FLOAT, 0, 0);
glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, 2);

#endif
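(Side note: binding the hard-coded buffer names 1 and 2 is legal under ARB_vertex_buffer_object, but the more idiomatic route is to let the driver hand out the names; a minimal sketch:)

GLuint bufs[2];
glGenBuffersARB(2, bufs);                              // driver picks two unused buffer names
glBindBufferARB(GL_ARRAY_BUFFER_ARB, bufs[0]);         // vertex data
glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, bufs[1]); // index data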

***** MAIN LOOP

QueryPerformanceCounter((LARGE_INTEGER*)&start_time);

// batches_per_frame = 200, faces_per_batch = 1000; only the vertex array is enabled
for( j = 0 ; j < batches_per_frame ; j++ ) {
    glDrawRangeElementsEXT(GL_TRIANGLES, 0, faces_per_batch*3, faces_per_batch*3, GL_UNSIGNED_SHORT, 0);
    ret.polys_rendered += faces_per_batch;
    ret.objects_rendered++;
}

QueryPerformanceCounter((LARGE_INTEGER*)&cur_time);

glDrawRangeElements() Det 52.16 hotspot

EAX = 0A739340
EDX = 0A5FED24 // looks like some memory mapped heap, maybe AGP memory?!?

******* HOTSPOT LOOP STARTS // looks like an implicit flush, like in NV_VAR?!?

08B0443C movzx ecx,word ptr [edx] // they take short int index here
08B0443F add edx,2 // advance the index pointer
08B04442 mov esi,dword ptr ds:[0A4D6110h] // load the source base addr
08B04448 lea edi,[ecx+ecx*2] // clever way to multiply by 3 (3 floats per vertex)
08B0444B lea esi,[esi+edi*4] // calculate the final pointer to the source vertex array, base + (index * 3 * sizeof(float))
08B0444E mov edi,dword ptr [esi] // load vertex.x
08B04450 mov ebp,dword ptr [esi+4] // load vertex.y
08B04453 mov dword ptr [eax],edi // store vertex.x
08B04455 mov dword ptr [eax+4],ebp // store vertex.y
08B04458 mov edi,dword ptr [esi+8] // load vertex.z
08B0445B mov dword ptr [eax+8],edi // store vertex.z
08B0445E add eax,0Ch // add sizeof(float) * 3 to destination
08B04461 cmp edx,dword ptr [esp+20h] // end of the copy loop?
08B04465 jne 08B0443C

******* LOOP ENDS

glDrawRangeElements() is no good either; it is even worse in Det 52.16 because it copies vertices rather than indices, as you can see :( Can people from NVIDIA hint at what this copying is about and, more importantly, whether there is a way to avoid it other than waiting for better VBO support?


[This message has been edited by speedy (edited 11-12-2003).]


MattS
11-12-2003, 08:52 AM
Doesn't this depend on which card you have? I think storing indices in graphics memory is only possible on FX cards. The original proposal for extending VAR to indices was an NV30-only thing.

Matt

Tom Nuydens
11-12-2003, 09:46 AM
Yes, but the VBO spec doesn't force the driver to put the indices in video memory. Regardless of which flag you create your VBO with, the driver is free to stick the data in system memory if video memory is for any reason not an option.
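(For what it's worth, the usage enum can be read back, but the actual placement cannot; a small sketch, assuming a bound element buffer named index_vbo — the variable name is made up:)

GLint usage = 0, size = 0;
glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, index_vbo);
glGetBufferParameterivARB(GL_ELEMENT_ARRAY_BUFFER_ARB, GL_BUFFER_USAGE_ARB, &usage); // what you asked for, e.g. GL_STATIC_DRAW_ARB
glGetBufferParameterivARB(GL_ELEMENT_ARRAY_BUFFER_ARB, GL_BUFFER_SIZE_ARB, &size);   // bytes uploaded
// where the driver actually put the data (video/AGP/system memory) stays opaque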

-- Tom

speedy
11-17-2003, 05:44 AM
TESTBED 1
GFX CARD: MSI GeForce FX 5600 VIVO @ AGP 4x
CPU: Athlon XP 1700+
MOBO: Gigabyte GA-7VAX
DRIVER REVISION: Detonator 45.23 & Detonator 52.16

TESTBED 2
GFX CARD: EAGLE GeForce 4MX
CPU: Duron 1200
MOBO: ???
DRIVER REVISION: Detonator 45.23

TESTBED 3
GFX CARD: Sapphire ATI 9600
CPU: ???
MOBO: ???
DRIVER REVISION: ???

INDICES TYPE: unsigned short
VBO USED: *YES*
VBO TYPE: 1 for indices + 1 for arrays
ARRAYS ENABLED: vertex array only
PRIMITIVE TYPE: GL_TRIANGLES

1 batch per frame, 300000 triangles per batch = 300000 triangles per frame

glDrawArrays(): <1% CPU BUSY : >99% CPU IDLE (triangle rate: 19.980000 MilTris/s) [GFFx 5600 + Det 45.23]
glDrawArrays(): 97% CPU BUSY : 3% CPU IDLE (triangle rate: 10.380000 MilTris/s) [GFFx 5600 + Det 52.16]
glDrawRangeElements(): 50% CPU BUSY : 49% CPU IDLE (triangle rate: 18.240000 MilTris/s) [GFFx 5600 + Det 45.23]
glDrawRangeElements(): 97% CPU BUSY : 3% CPU IDLE (triangle rate: 9.700000 MilTris/s) [GFFx 5600 + Det 52.16]
glDrawRangeElements(): <1% CPU BUSY : >99% CPU IDLE (triangle rate: 23.580000 MilTris/s) [ATI 9600 + Cat]

100 batches per frame, 3000 triangles per batch = 300000 triangles per frame

glDrawArrays(): <1% CPU BUSY : 99% CPU IDLE (triangle rate: 19.980000 MilTris/s) [GFFx 5600 + Det 45.23]
glDrawRangeElements(): 47% CPU BUSY : 53% CPU IDLE (triangle rate: 19.680000 MilTris/s) [GFFx 5600 + Det 45.23]
glDrawRangeElements(): 90% CPU BUSY : 9% CPU IDLE (triangle rate: 19.680000 MilTris/s) [GFFx 5600 + Det 52.16]
glDrawRangeElements(): <1% CPU BUSY : >99% CPU IDLE (triangle rate: 23.640000 MilTris/s) [ATI 9600 + Cat]

1 batch per frame, 100000 triangles per batch = 100000 triangles per frame

glDrawArrays(): <1% CPU BUSY : 99% CPU IDLE (triangle rate: 17.060000 MilTris/s) [GFFx 5600 + Det 45.23]
glDrawRangeElements(): 27% CPU BUSY : 73% CPU IDLE (triangle rate: 16.380000 MilTris/s) [GFFx 5600 + Det 45.23]

100 batches per frame, 500 triangles per batch = 50000 triangles per frame

glDrawArrays(): <1% CPU BUSY : 99% CPU IDLE (triangle rate: 14.260000 MilTris/s) [GFFx 5600 + Det 45.23]
glDrawRangeElements(): 11% CPU BUSY : 89% CPU IDLE (triangle rate: 14.180000 MilTris/s) [GFFx 5600 + Det 45.23]
glDrawRangeElements(): 60% CPU BUSY : 40% CPU IDLE (triangle rate: 14.360000 MilTris/s) [GFFx 5600 + Det 52.16]
glDrawRangeElements(): <1% CPU BUSY : >99% CPU IDLE (triangle rate: 14.800000 MilTris/s) [ATI 9600 + Cat]


NOTES:

* NVIDIA CPU burning was also tested on a Duron 1200 + GeForce 4MX (Det 45.23), with similarly bad results
* Similar behaviour with Det 44.xx on TESTBED 1
* Hotspot disassembly shows that Detonator 45.23 seems to copy indices, while ForceWare 52.16 copies vertex data internally
* (Only) Detonator 52.16 has asynchronous SwapBuffers(), so glFinish() was inserted for correct measurements
* Detonator 52.16 proved unstable while profiling (e.g. AMD CodeAnalyst, Intel VTune) and under some types of unexpected usage patterns ;)
* When saturating the fill rate with high triangle counts, complete system stalls occurred (Winamp stops playing music, the mouse moves erratically). NVIDIA bug?!?

Is this a Brown Paper Bag bug for NVIDIA, or some deep unmarked pitfall or hardware limitation I stumbled upon?

speedy
11-17-2003, 07:02 AM
Anyone interested can double-check the results with my open-source test app:
http://galileo.spaceports.com/~speedy1/OpenGL%20test%20app%20v2.rar

If you modify/enhance it, please send me the patches so we can all share the effort... thanks.

Zengar
11-17-2003, 11:36 PM
You must have a mistake in your program: I don't see anything, I get much lower tri counts than with my own programs, and the last tests gave me a result of 7600 fps (???).

Tom Nuydens
11-18-2003, 12:54 AM
Rather than trying to analyze NVidia's drivers to death, maybe you should try to write a working test app first. My log says things like 50K fps and 785 MTris/sec on a GeForce3. I wish! I don't know what you think you're measuring, but it's wrong. Very, very wrong.

-- Tom

speedy
11-18-2003, 10:09 AM
Tom, Zengar, thank you for bearing with me; there is more quite interesting & in-depth info coming...

> Rather than trying to analyze NVidia's drivers to death, maybe you should try to write a working test app first.
> My log says things like 50K fps and 785 MTris/sec on a GeForce3. I wish! I don't know what you
> think you're measuring, but it's wrong. Very, very wrong.

Tom, you have uncovered yet another NVIDIA driver bug, which is solved in the ForceWare 5x.xx drivers. Dets 4x
do not support primitive counts > 65536 when using VBO on gfx cards up to the FX :( The FX handles all of it neatly.
I guess you are using Dets 4x + a GeForce 3?

If you check the log files you'll see that when the number of triangles per batch is less than 65536/3, the results are *accurate*.
(using GL_UNSIGNED_INT for the indexed calls)

So the breakdown of the new issue is:

ATI 9600 Cat 3.6: OK!
ATI 9600 Cat 3.9: OK!
GFFx 5600 + 45.23: OK! (but CPU burning)
GFFx 5600 + 52.16: OK! (but CPU burning, even more so)
GF 4MX + 52.16: OK! (but CPU burning, even more so)

GF 4MX + 45.23: NON CONFORMANT! (& CPU burning)
GF 3 + your rev.: NON CONFORMANT!

> You must have a mistake in your program: I don't see anything, I get much lower tri counts than with my own programs,
> and the last tests gave me a result of 7600 fps (???).

Zengar, which GFX card do you have? What tri counts do you reach with your programs? Are you using VBO? I suppose you are using triangle strips
to reach higher tri counts; they give approx. a 2.5x boost ;) The test program works just fine on the 3 computers I currently have access to,
and I have really tried to make it as bulletproof as possible.

Check out
http://galileo.spaceports.com/~speedy1/OpenGL%20test%20app%20v3.rar

This is the new open-source conformance + performance + CPU usage test app, with graphical feedback so you'll know it is working...
please check your hardware + drivers with it :)

P.S. NVIDIA's own John Spitzer recommends VTune hotspot analysis in his Graphics Performance Optimisation.ppt for this kind of problem ;)

V-man
11-18-2003, 11:12 AM
>>> Tom, you have uncovered yet another NVIDIA driver bug, which is solved in the ForceWare 5x.xx drivers. Dets 4x
do not support primitive counts > 65536
<<<

That's not a bug.

speedy
11-18-2003, 12:23 PM
Originally posted by V-man:
>>> Tom, you have uncovered yet another NVIDIA driver bug, which is solved in the ForceWare 5x.xx drivers. Dets 4x
do not support primitive counts > 65536
<<<

That's not a bug.

EXT_draw_range_elements

'The specification of glDrawElements does not allow optimal performance
for some OpenGL implementations, however. In particular, it has no
restrictions on the number of indices given...'

... and the VBO spec says nothing about any limit on the number of indices.
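A workaround sketch (mine, untested on the drivers in question): stay below the limit by splitting one big indexed draw into chunks. total_tris is a placeholder for your mesh size; the indices are GL_UNSIGNED_INT in a bound element-array VBO, so the last argument is a byte offset into it:

unsigned int first = 0;
const unsigned int max_tris = 65536 / 3; // keep each call under the buggy 64K index limit
while (first < total_tris) {
    unsigned int n = total_tris - first;
    if (n > max_tris) n = max_tris;
    glDrawElements(GL_TRIANGLES, n * 3, GL_UNSIGNED_INT, (char*)0 + first * 3 * sizeof(GLuint));
    first += n;
}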


[This message has been edited by speedy (edited 11-18-2003).]

V-man
11-18-2003, 01:11 PM
So you (they) are exceeding the limits and nothing renders?

Alright, by
"they do not support primitive counts > 65536"

I thought you were not happy with their MAX values.

Actually, it looks like both MAX values are 4096. Hmmm...

Elixer
11-18-2003, 02:13 PM
Originally posted by speedy:

Check out
http://galileo.spaceports.com/~speedy1/OpenGL%20test%20app%20v3.rar

This is the new open-source conformance + performance + CPU usage test app, with graphical feedback so you'll know it is working...
please check your hardware + drivers with it :)

P.S. NVIDIA's own John Spitzer recommends VTune hotspot analysis in his Graphics Performance Optimisation.ppt for this kind of problem ;)

I get "Not Found
The requested URL was not found on this server." Hmmm...

speedy
11-18-2003, 02:39 PM
Originally posted by Elixer:
I get "Not Found
The requested URL was not found on this server." Hmmm...



Uh oh, I just tried it and it works. Please try again; it could have been some temporary error?

Elixer
11-18-2003, 04:16 PM
Just tried again, same error. It also wants to install the CJB management plugin... No thanks. :)

I get redirected to http://www.spaceports.com/404.html from http://galileo.spaceports.com/~speedy1/OpenGL%20test%20app%20v3.rar

Maybe you still have it cached?

AndrewM
11-18-2003, 05:01 PM
Try using a download manager program thing. I had the same problem.

Zengar
11-18-2003, 06:16 PM
First, I get the download problem. Second, I'm not using any optimised vertex format. I've got a GF5600 and I get about 60-70 MTris per second. Your program is buggy, believe me. Why else, if I render a 3ds mesh with about 40000 triangles, do I get 800 fps? And with your program I get very weird results.

Tom Nuydens
11-19-2003, 01:32 AM
I am 100% sure that it's perfectly possible to draw more than 64K vertices per call with glDrawElements() (with or without VBO). I'm using a GF3 with the 52.16 drivers, but I don't recall this ever being a problem with older drivers either.

There is a hard limit of 64K vertices when using VAR on GF2-class cards, but this limit was raised to 1M vertices on the GF3 and up. It also does not exist when not using VAR.

There is also a recommended maximum index/vertex count for glDrawRangeElements(), but it is nothing more than that: a recommendation. This number is 4096 for all GeForces. For Radeon 9x00 cards, it's 64K indices/2M vertices.
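(Those recommended maxima are queryable, by the way; exceeding them is legal, just potentially slow. A two-line sketch, assuming a current GL context:)

GLint max_verts = 0, max_inds = 0;
glGetIntegerv(GL_MAX_ELEMENTS_VERTICES, &max_verts); // e.g. 4096 on GeForces
glGetIntegerv(GL_MAX_ELEMENTS_INDICES,  &max_inds);  // e.g. 64K on Radeon 9x00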

Your new version consistently reports between 5 and 8 MTris/sec on my machine. The numbers look more reliable, but they also look low. I still don't trust your timing code, though. Could you explain exactly what you're measuring to produce all these numbers?


glDrawXXX(): 85.959003% [CPU BUSY]
SwapBuffers(): 13.900877% [IDLE]
fps: 38.000000
batches/s: 38.000000
triangle rate: 7.600000 MilTris/s

I'm particularly interested in what conclusion you think you can draw from those first two lines.

-- Tom

Relic
11-19-2003, 02:01 AM
SwapBuffers(): 13.900877% [IDLE]

VSYNC off?

Tom Nuydens
11-19-2003, 02:08 AM
Originally posted by Relic:
VSYNC off?

Yes.

speedy
11-19-2003, 05:35 AM
Your new version consistently reports between 5 and 8 MTris/sec on my machine. The numbers
look more reliable, but they also look low. I still don't trust your timing code, though. Could you
explain exactly what you're measuring to produce all these numbers? :


You have the source, and here are the measured loops. It is such simple *basic* OpenGL usage
that I really wonder whether anything could be made more straightforward. AND it works as expected on
ATI cards.




QueryPerformanceCounter((LARGE_INTEGER*)&start_time);

unsigned int count = 1;

for( j = 0 ; j < batches_per_frame ; j++ ) {
    if( use_vbo == USE_VBO_MULTIPLE_BUFFERS ) {
        glBindBufferARB(GL_ARRAY_BUFFER_ARB, count++);
        check_for_opengl_error();

        glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, count++);
        check_for_opengl_error();

        glVertexPointer(3, GL_FLOAT, 0, 0);
        check_for_opengl_error();
    }

    glDrawArrays(GL_TRIANGLES, 0, faces_per_batch*3);
    check_for_opengl_error();

    ret.polys_rendered += faces_per_batch;
    ret.objects_rendered++;
}

QueryPerformanceCounter((LARGE_INTEGER*)&cur_time);


QueryPerformanceCounter((LARGE_INTEGER*)&start_time);

glFinish();
glfwSwapBuffers();

QueryPerformanceCounter((LARGE_INTEGER*)&cur_time);
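(For clarity, the printed percentages presumably fall out of those two intervals roughly like this — a sketch with made-up variable names, not lifted from the actual source:)

double draw_s  = (double)(draw_end.QuadPart - draw_start.QuadPart) / freq.QuadPart; // first timed block: submitting the batches
double swap_s  = (double)(swap_end.QuadPart - swap_start.QuadPart) / freq.QuadPart; // second block: glFinish() + swap
double frame_s = draw_s + swap_s;
printf("glDrawXXX(): %f%% [CPU BUSY]\n", 100.0 * draw_s / frame_s); // CPU busy feeding the driver
printf("SwapBuffers(): %f%% [IDLE]\n",   100.0 * swap_s / frame_s); // CPU waiting on the GPU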




[This message has been edited by speedy (edited 11-19-2003).]

speedy
11-19-2003, 05:45 AM
BTW, I am NOT using VBOs with multiple buffers! You could try that simply by changing one typedef enum parameter in the run_test() calls.

rp = run_test(GL_DRAW_ELEMENTS, USE_VBO, INDICES_UNSIGNED_INT, FILL_UPPER_HALF_OF_THE_SCREEN, 2, 1, 200000);

-->

rp = run_test(GL_DRAW_ELEMENTS, USE_VBO_MULTIPLE_BUFFERS, INDICES_UNSIGNED_INT, FILL_UPPER_HALF_OF_THE_SCREEN, 2, 1, 200000);


[This message has been edited by speedy (edited 11-19-2003).]

speedy
11-23-2003, 10:59 PM
Tom, Zengar, thanks for the responses, I think I have got close to the bottom of this mess.

I have made a small web presentation with the measured relation between the CPU usage by the
game engine and the GPU triangle count per second (without rasterization).
(source & high-res graphs are included in the archive at the bottom of the page)

http://kickme.to/speedy1

Zengar, the tri counts are low because I am not optimizing the indices in the test case
(they are 0 1 2 3 4 5 6 7 8 ...), and you could be using the 52.16 drivers...

e.g. for ATI:

indices 012 012 012 012 ... give ~4-5MilTris/s

indices 012 345 678 ... give ~34MilTris/s

indices 012 123 234 345 ... give ~60MilTris/s
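To see why the index order matters, here is a rough FIFO post-T&L vertex cache simulator (my addition; the real cache size differs per chip, 16 entries is just an assumption). It counts how many indices miss the cache, i.e. how many vertices actually have to be transformed; that captures part, though not all, of the spread between the patterns above:

#define CACHE_SIZE 16
unsigned int cache[CACHE_SIZE];
int cache_used = 0, head = 0;
unsigned int i, misses = 0;

for (i = 0; i < index_count; i++) {    // 'indices'/'index_count' describe your index array
    int c, hit = 0;
    for (c = 0; c < cache_used; c++)
        if (cache[c] == indices[i]) { hit = 1; break; }
    if (!hit) {
        misses++;                      // this vertex must be transformed (again)
        cache[head] = indices[i];      // FIFO replacement of the oldest entry
        head = (head + 1) % CACHE_SIZE;
        if (cache_used < CACHE_SIZE) cache_used++;
    }
}
// misses / (double)index_count = fraction of indices that cost a vertex transform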

You can download the new OpenGL test app v4 from the following URL and test this stuff for yourself:
http://galileo.spaceports.com/~speedy1/OpenGL%20test%20app%20v4.rar

new features:

* supports engine CPU usage simulation
* erroneous serializing glFinish() removed

PITFALLS encountered:

* Detonator 45.23 does not support primitive counts > 65536 when using VBO
* ForceWare 52.16 has weak CPU/GPU decoupling, or I have not found a way to utilize it
* *DO NOT USE glFinish() before SwapBuffers()*; let the ICD driver be smart and find the way on its own (see the sketch below)
* index optimizations that target the pre- and post-T&L buffers/caches can be of MAJOR significance for triangle rates
* ForceWare 52.16 has a fully async SwapBuffers()
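A sketch of the non-serializing alternative (my addition; draw_frame() is a placeholder for the batch loop above): instead of glFinish(), time many whole frames and average, so the pipeline stays full and the async SwapBuffers() is amortized:

LARGE_INTEGER freq, t0, t1;
int frame;

QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&t0);

for (frame = 0; frame < 100; frame++) {
    draw_frame();       // your rendering, e.g. the batch loop above
    glfwSwapBuffers();  // no glFinish() before it
}

QueryPerformanceCounter(&t1);

double avg_frame_s = (double)(t1.QuadPart - t0.QuadPart) / freq.QuadPart / 100.0;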

Can anyone from NVIDIA please shed some light on the 52.16 issues?!? Thanks in advance.