NVidia 5x slower than ATI

Hi,

I am working on a 3D app that is CAD-like.
It is windowed and has a single main render viewport.
The OpenGL rendering consists of a mix of DisplayLists, Text, a few immediate commands, and mostly Index/Vertex/Normal/Color arrays.

I have been testing the app on a number of Windows systems, and every one with an NVidia card renders the array objects approximately 5x slower than the ATI systems. The few minor DisplayLists etc. contain so few entities that they render blazingly fast on both, so I cannot use them as a performance guideline.

As an example, if I render an array object of 64k indexes + vertex/normal/color data, it takes approximately 1ms on systems with ATI and 5ms on systems with NVidia; this is totally consistent on every system I have tested.
If I add a second array object, I get approximately 2x the render time or 10ms on NVidia and 2ms on ATI.
If I add a third array object, I get approximately 3x on both (15ms NV, 3ms ATI).

The index/vertex/normal/color arrays are very well laid out and optimized. Even if they were a mess and caused vertex-cache misses, there shouldn’t be such a discrepancy between NVidia and ATI.
The arrays are managed and sent individually to the GPU because I need easy modification of their data CPU-side. FYI, I did try interleaving the arrays so that a single array is sent to OpenGL, and it made no difference to the NVidia performance issue; the overall performance on all systems improved by about 5%.
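For reference, the interleaving attempt was essentially one packed client-side array with strides passed to the gl*Pointer calls, roughly like this (the struct layout shown is only an illustration, not my actual format):

/* Rough illustration of the interleaved layout; the Vertex struct is a placeholder. */
typedef struct {
    GLfloat pos[3];
    GLfloat normal[3];
    GLubyte color[4];
} Vertex;

/* the stride tells GL how far apart consecutive attributes of the same kind are */
static void set_interleaved_pointers(const Vertex *verts)
{
    glVertexPointer(3, GL_FLOAT, sizeof(Vertex), &verts[0].pos);
    glNormalPointer(GL_FLOAT, sizeof(Vertex), &verts[0].normal);
    glColorPointer(4, GL_UNSIGNED_BYTE, sizeof(Vertex), &verts[0].color);
}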

This performance difference occurs on all systems I have tested, which are numerous Core2Duo, Core2Quad, i3, i5, all around the 3GHz range; Windows XP Pro, Windows Vista x64, Windows 7 x64; and NVidia 8800GTS, GTX275, and ATI HD3870, HD4870, HD6870 video.

It isn’t a v-sync or flush issue, I’ve already checked that. Also, the render time grows by a pretty much constant amount as each additional array object is added to the render loop, which can only mean that the performance issue is in NVidia’s handling of the arrays.

Any ideas?

If I understand correctly, you use client-side vertex arrays and no VBOs.
Maybe client-side vertex arrays are simply faster on ATI cards; however, I have never tested that, so I cannot confirm it.

Do you use VBOs or plain memory pointers in gl*Pointer?

Client-side arrays, i.e. gl*Pointer and glDrawElements.
The arrays are dynamic CPU-side because often multiple arrays will be updated per frame.
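Roughly, the per-object render path looks like this (simplified; the component counts and types are placeholders for my actual formats):

/* Simplified sketch of the client-side vertex array path; the pointers are
   plain CPU-side arrays, nothing is stored in GL buffer objects.           */
void draw_object_va(const GLfloat *verts, const GLfloat *norms,
                    const GLubyte *colors, const GLuint *indices,
                    GLsizei indexCount)
{
    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_NORMAL_ARRAY);
    glEnableClientState(GL_COLOR_ARRAY);

    glVertexPointer(3, GL_FLOAT, 0, verts);
    glNormalPointer(GL_FLOAT, 0, norms);
    glColorPointer(4, GL_UNSIGNED_BYTE, 0, colors);

    glDrawElements(GL_QUADS, indexCount, GL_UNSIGNED_INT, indices);

    glDisableClientState(GL_COLOR_ARRAY);
    glDisableClientState(GL_NORMAL_ARRAY);
    glDisableClientState(GL_VERTEX_ARRAY);
}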

A further note of interest is that two of the Windows systems I test on are exactly identical hardware except one is NVidia (275) and the other is ATI (4870) video, so the issue shouldn’t/can’t be CPU/mobo/PCIe related, but should be GPU as far as I can tell since that is all that differs.

The issue isn’t really noticeable until I start getting into the million+ triangle range; e.g. in a 2M-triangle scene the ATI 4870 runs at 55fps whereas the NVidia 275 renders at 8fps. Unfortunately, million+ triangle scenes will be common in the software.
It’s also not a fill-rate issue, as I’ve looked into that.

Does NVidia perhaps require specific index and vertex array layouts, or have some oddity that ATI doesn’t?

Thanks.

Try using VBOs; it should be much faster.

I don’t think VBOs will be of much use to me, will they? They are server-side (GPU RAM), which is an issue.
This application has to use as small a GPU footprint as possible, preferably less than 256MB, as there can easily be 2GB to 4GB of render data arrays client-side (CPU).
So currently the scene uses dynamic culling, frustum, etc. to determine what client-side objects/arrays get sent to the GPU for each frame.
Using VBOs would mean updating/replacing the VBOs each frame; what kind of performance hit will that have?

VBOs aren’t necessarily server-side. They can also be client-side; it depends on the usage parameter of the glBufferData command (set it to GL_DYNAMIC_DRAW if you change your geometry every frame). When you use VBOs the driver can make use of DMA, so this path should be faster. I recommend always using VBOs.
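For example, creating such a buffer might look like this (rough sketch; dataSize and vertexData are placeholders for your own CPU-side array, and it assumes an extension loader such as GLEW is already initialized):

GLuint vbo;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
// dataSize / vertexData are placeholders for your own array
glBufferData(GL_ARRAY_BUFFER, dataSize, vertexData, GL_DYNAMIC_DRAW);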

VBOs give your application the opportunity to work in parallel with the driver.

In my experience updating dynamic buffers is very fast. Try this code:


// size is a placeholder for the size of your buffer
void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
                             GL_MAP_INVALIDATE_BUFFER_BIT | GL_MAP_WRITE_BIT);
// ... copy data into ptr
glUnmapBuffer(GL_ARRAY_BUFFER);

Thanks, I’ll give them a try and see what happens.
I’ll post back the results when I get it re-coded.

VAs should also use DMA for data transfer. It is up to the drivers how this is implemented, but when the data is on the client side I don’t see any reason not to use DMA.

On the other hand, usage is just a hint; it doesn’t oblige vendors to do anything. I tried playing with usage values on NVIDIA cards/drivers and found no difference. Maybe the drivers are smart enough to recognize the usage pattern, or they simply don’t care about the value of the usage parameter (I bet on the latter). I’d be glad to hear a firm claim that someone has achieved a significant change in application performance by changing the usage parameter.

Can you elaborate on this statement, please?

Basically, when using client-side VAs (and no VBOs), the vertex data travels through the CPU to the server when the draw function is called. On the other hand, when the data is stored in a VBO, the driver can decide when to pull the data to the server (and this transfer can be done in parallel with your application).

See SA2008_Modern_OpenGL_Kilgard.pdf, page 127, for details.

Thank you for the reply; I’ll take a look at the slides again.

The driver does not DMA client memory to GPU memory directly. It has to copy it into driver memory with the CPU (memcpy) and then use DMA to upload it to GPU memory. There are several reasons for this behaviour (memory alignment, page-locked memory).

** ISSUE UPDATE **

I have added the ability to render using VBOs to my application but I am seeing the exact same performance issue.

NVidia-based computers are still approximately 5x slower than ATI, regardless of whether I use VAs or VBOs.
In other words, using VBOs gives the same performance result as using VAs: NVidia is 5x slower than ATI.

My code is too long to post here as this is a full CAD-like application.
However, I have found that the VBO/VA demo application on Songho’s OpenGL website exhibits the identical problem: it runs 5x slower on NVidia-based systems compared to ATI-based systems.
http://www.songho.ca/opengl/gl_vbo.html

With Songho’s VBO.zip VBO.exe demo I get: ATI=570fps NVidia=59fps

FYI: these test computers are virtually identical except ATI HD4870 vs NVidia GTX275 so this is not a computer difference issue.
I have also tested this on 4 other similar Core2 computers with the same results (ATI HD6970, NV 8800GTX, NV 8300GS, NV Quadro 200 series).

On my software application, as mentioned at the beginning of this thread, I am getting 5x slower on NVidia, with a typical 64k-quads VA or VBO data rendering in 1ms on ATI and 5ms on NVidia.
There is no texturing or other stuff going on at this time, just index, color, normal, and vertex arrays, and one light.
I have also tried every mix of VA, VBO, single array, interleaved array, individual arrays, etc.
Note that I am using VBOs in GL_DYNAMIC_DRAW usage because they are typically changing each frame.

Edit: if I change to GL_STATIC_DRAW then NVidia performs well, but should I choose this if the buffers are modified often, as often as every frame?
To clarify, when STATIC_DRAW is used on the VBOs, scene camera changes are as fast on NVidia as they are on ATI, however, object changes that will require updates to the VBO arrays are still just as slow on NVidia, 5x slower than ATI.

So it is very apparent that VBOs do NOT use DMA to transfer data quicker than VAs. They are using identical array transfer. It does appear that ATI may be using DMA while NVidia is not, which would explain the poor NVidia performance.

Any ideas?
I really need to get NVidia performance up to ATI’s.

Edit: I tried GL_STREAM_DRAW and that seems to work well.
I can perform camera-movement (static scene) updates as well as object array updates, and it all runs better on NVidia than with any other setting I have tried.
It would still be nice to hear from others/gurus as to whether this is the appropriate choice for me to use.
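For anyone comparing, the per-frame update amounts to re-sending the changed arrays, roughly like this (simplified; the buffer ids and sizes are placeholders, shown here with glBufferData rather than a mapped write):

// vertexVbo, normalVbo, etc. are placeholder buffer ids created earlier with glGenBuffers
glBindBuffer(GL_ARRAY_BUFFER, vertexVbo);
glBufferData(GL_ARRAY_BUFFER, vertexBytes, vertexData, GL_STREAM_DRAW);

glBindBuffer(GL_ARRAY_BUFFER, normalVbo);
glBufferData(GL_ARRAY_BUFFER, normalBytes, normalData, GL_STREAM_DRAW);

// ...and the same for the color and index buffers when they change...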

That 59fps number looks suspicious: are you benchmarking with vsync on? NVidia drivers default to vsync on.

The OP has already said:

It isn’t v-sync or flush issues, I’ve already tried that

On his own app maybe. I’m talking about the Songho demo (look at the quote). If the demo isn’t displaying the same behavior then perhaps there’s some code difference that might be informative.

I have an NVidia myself (geforce 9800 GT) and I get >1000 fps on that demo with vsync off, but 59 fps with it on.

You are correct: on this Songho test I had mistakenly turned VSync back on at some point on my main NVidia computer.
However, vsync is not relevant to the issue, as I am timing the milliseconds taken by the render loop in my software.

If I change my application to use VBOs with GL_STREAM_DRAW, I get virtually identical render loop times on both the ATI and the NVidia (regardless of whether the NVidia has vsync on).
This is what I want, and apparently what I will have to use in my software, but I would like to know why this is occurring.

With VAs or with VBOs in DYNAMIC I get slow NVidia performance, typically 5x or more slower than ATI.
If I open a medium scene of 2M triangles (16+ 64k quad objects) I get 18ms on ATI 4870 and 114ms on NVidia 275 on my near-identical computers.
If I open a large scene of 8M triangles (64+ 64k quad objects) I get 70ms on ATI 4870 and 444ms on NVidia 275 on my near-identical computers.

On every test of any scene size and on every ATI and NVidia computer I check (I have tested on more than 6 computers) I get the same results: NVidia is typically 5x slower on VA or VBO DYNAMIC.

Note that I update the scene on every camera movement (update the vertex,normal,color buffers and re-send the arrays of data with DrawElements) since the displayed data can change each frame or every few frames.

I am doing nothing unusual in my software.
For VBO mode it is simply calls to glGenBuffers, then creating the arrays of vertex, normal, color, and index data for the objects to be rendered and binding buffers for them, then rendering the objects with frustum culling in my render loop using glColorPointer, glNormalPointer, glVertexPointer, and glDrawElements.
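In sketch form, the per-object draw is basically this (simplified; the Object struct and field names are just placeholders, and the client states are enabled once elsewhere):

/* Simplified sketch of the VBO draw path described above. */
typedef struct {
    GLuint vertexVbo, normalVbo, colorVbo, indexVbo;   /* ids from glGenBuffers */
    GLsizei indexCount;
} Object;                                              /* placeholder type      */

void draw_object_vbo(const Object *o)
{
    glBindBuffer(GL_ARRAY_BUFFER, o->colorVbo);
    glColorPointer(4, GL_UNSIGNED_BYTE, 0, (const GLvoid *)0);

    glBindBuffer(GL_ARRAY_BUFFER, o->normalVbo);
    glNormalPointer(GL_FLOAT, 0, (const GLvoid *)0);

    glBindBuffer(GL_ARRAY_BUFFER, o->vertexVbo);
    glVertexPointer(3, GL_FLOAT, 0, (const GLvoid *)0);

    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, o->indexVbo);
    glDrawElements(GL_QUADS, o->indexCount, GL_UNSIGNED_INT, (const GLvoid *)0);
}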

The fact that it renders “properly” on NVidia with GL_STREAM_DRAW makes me think that NVidia interprets those hints differently than ATI does, and differently from what I expected those “usage” hints to mean.

I would at least expect GL_DYNAMIC_DRAW to give the same performance on a render loop where the arrays are updated each frame or every few frames.

Since no one has replied yet with any information as to whether they see the same issue, or what might be causing it, I am simply going to add an option in the application to choose the rendering method, so that the user can pick whatever performs best on their system.

You are correct. See my previous post.

I looked at Songho’s code and he is using GL_STREAM_DRAW, so I cannot use his demo app as a comparison for the issue I see with VA or VBO DYNAMIC, since STREAM_DRAW also works fine on my app.

Well, first of all, it’s a usage hint; vendors aren’t actually obliged to do anything with it at all. Secondly, I think the spec and documentation could be a lot clearer about what the hints actually mean, especially with regard to modification and usage frequencies. The way I read it is that GL_STREAM_DRAW is what you really want here, because your usage pattern is modify/draw/modify/draw. GL_DYNAMIC_DRAW is actually intended for modify/modify/modify/draw/draw/modify/draw/draw/draw/etc. usage patterns.

Yeah, I got the same impression as you did when I first read them, and it was only a long time afterwards that I learned to view them from a different perspective.

One thing you could do is add some code at app startup that simulates and times a typical usage pattern, and then adjust the parameters for the real rendering based on that.
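Something along these lines (a very rough sketch; now_seconds() stands in for whatever high-resolution timer you already use, and the actual draw is elided):

/* Rough sketch: time N simulated per-frame uploads with a given usage hint.
   now_seconds() is a placeholder for your own timer; glFinish keeps the GPU
   from hiding the cost.                                                      */
double time_usage_hint(GLenum usage, const void *data, GLsizeiptr bytes, int frames)
{
    GLuint vbo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);

    double start = now_seconds();
    for (int i = 0; i < frames; ++i)
    {
        glBufferData(GL_ARRAY_BUFFER, bytes, data, usage);  /* simulated per-frame update */
        /* ...draw from the buffer here, as in the real render loop...                    */
    }
    glFinish();                          /* wait for the GPU before stopping the clock */
    double elapsed = now_seconds() - start;

    glDeleteBuffers(1, &vbo);
    return elapsed;
}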

Yes, I agree. I understand they are “hints”, but you would think the vendors could agree on and standardize what the hints actually mean, or at least how they perform. :)

I have been reading the lengthy ARB docs on VBOs.
From their information it sounds like there is some usage overlap between the two.

What I find irritating is that ATI’s OpenGL driver is significantly better/more optimized than NVidia’s.
ATI is (apparently) DMA’ing for all vertex array styles: VA and all VBO usages.
Whereas NVidia is only DMA’ing for VBO STREAM.
NVidia needs to get on the ball.