Map Me Again But Don't Waste My Time!

glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY)

glUnmapBuffer(GL_ARRAY_BUFFER)

Incredibly slow when used to draw a string of characters (textured quads) with a very simple shader.
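Roughly what I do each frame is something like this (simplified sketch; the QuadVertex struct and names are just placeholders, and the VBO is created once at startup):

/* Simplified per-frame update of the text VBO. Assumes a GL 2.1 context,
   GL function pointers loaded (e.g. via GLEW or similar), <string.h> for
   memcpy, and a VBO created once at startup with glBufferData.
   QuadVertex is a placeholder vertex layout. */
typedef struct { float x, y, u, v; } QuadVertex;

void update_text_vbo(GLuint vbo, const QuadVertex *verts, size_t vertCount)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    QuadVertex *dst = (QuadVertex *)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    if (dst) {
        memcpy(dst, verts, vertCount * sizeof(QuadVertex)); /* write the quad vertices */
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }
    /* ...then glDrawArrays(GL_QUADS, 0, (GLsizei)vertCount) with the simple shader bound... */
}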

Any thoughts?

Latest ATI drivers! OpenGL 2.1!

I’ve experienced a phenomenon like that on an ATI 5450 with the latest drivers; then I switched to glBufferSubData() and it was OK. The problem then was only that glBufferSubData() worked fast but failed to update the VBO accurately.
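That is, instead of the map/unmap pair I just re-uploaded the changed bytes directly, roughly like this (sketch; dataSize and vertexData are placeholder names, and the buffer is assumed to be created and bound already):

/* Sketch of the replacement: upload the new vertex data in one call.
   Assumes the VBO was created earlier with glBufferData and is bound
   to GL_ARRAY_BUFFER. */
glBufferSubData(GL_ARRAY_BUFFER,
                0,           /* byte offset into the buffer */
                dataSize,    /* number of bytes to overwrite */
                vertexData); /* pointer to the new data in system memory */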

Try creating a new buffer every time instead of mapping the existing one.
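Taken literally, that would look something like this each frame (sketch; dataSize/vertexData are placeholders). Re-specifying the store with glBufferData on the existing buffer object (orphaning) has a similar effect:

/* Literal "new buffer every time": generate, fill, draw with, then delete.
   The orphaning variant keeps one buffer object and just calls
   glBufferData(GL_ARRAY_BUFFER, dataSize, vertexData, GL_STREAM_DRAW) on it. */
GLuint vbo;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, dataSize, vertexData, GL_STREAM_DRAW);
/* ... set attribute pointers and draw ... */
glDeleteBuffers(1, &vbo);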

glBufferSubData did the trick! :smiley:

Thanks a lot for your help!

The buffer object performance Ouija Board strikes again!

Vendors (or ARB): Would you please just tell us how you want us to do this on your hardware so we can make your GPUs look as sexy and competitive as possible?

Reading similar posts about VBO performance, I conclude:

This proves that the idea of VBOs (imitating the Direct3D approach) was a horribly bad idea. And the argument that new hardware has changed the way it pulls vertex data is misleading…

I mean, D3D was built from its birth on the idea of buffers (execution buffers), and I believe this evolved into the relatively more elegant and developer-friendly vertex buffers.

OpenGL was built from the start on the concept of a state machine: switches, on/off, enable/disable, begin/end. That is already very developer-friendly, neat, and performs excellently (Quake, Serious Sam, Doom, etc., CAD applications, etc.); no one complained.

Now we are heading toward mysterious, awkward approaches that were abandoned by the other API, and we are adopting all that garbage while replacing the features and functionality that made this API epic for a very long time.

Let the driver take care of this internally; with VBOs we are already copying the vertex data from system memory to the GPU. Forget about dynamic data? A function that draws a mesh or polygons? We MUST copy from CPU to GPU buffers.

Next time, remove VBOs and go back to glBegin/glEnd and traditional vertex arrays. They are much faster and more reliable.

Cheers!

Vendors (or ARB): Would you please just tell us how you want us to do this on your hardware so we can make your GPUs look as sexy and competitive as possible?

First, the ARB can’t explain such a thing. The OpenGL specification does not define performance, nor can it.

Second, what exactly does “this” entail? Streaming? What kind of streaming? How frequently are you streaming? Are you talking about vertex formats?

Third, and most important of all, if you want to pressure IHVs to do this, you have to give them incentive. Every application that cops out with display lists or client-side vertex arrays is another reason for IHVs to not bother.

Make your applications rely on buffer object performance, and you’ll find out more about it. The squeaky wheel gets the grease.

I mean, D3D was built from its birth on the idea of buffers (execution buffers), and I believe this evolved into the relatively more elegant and developer-friendly vertex buffers.

This belief is not congruent with reality.

Execution buffers were an old, old Direct3D 3.0 thing. And Direct3D 3.0 was the first iteration of D3D. The next was 5.0, which abandoned execution buffers in favor of a more OpenGL-like vertex array model.

It wasn’t until 7.0 that vertex buffers appeared in D3D. And the primary purpose of that was, well, pretty obvious.

So I have no idea where you got this idea from. Vertex buffers and execution buffers have nothing to do with each other. Execution buffers are more reminiscent of display lists than buffer objects. And display lists are 1.0 functionality.

OpenGL was built from the start on the concept of a state machine: switches, on/off, enable/disable, begin/end. That is already very developer-friendly, neat, and performs excellently (Quake, Serious Sam, Doom, etc., CAD applications, etc.); no one complained.

I’m not sure how listing a bunch of games that use OpenGL is an argument that it “performs excellently”.

Quake was released in the days before hardware T&L. Without hardware T&L, buffer objects make no sense, because T&L had to be done on the CPU. Why upload vertex data to GPU memory, only to download it right back to the CPU to transform? Serious Sam was released before buffer objects existed. It might have made use of NV_vertex_array_range, which had all of the pitfalls of buffer objects (though only from a single vendor).

Oh, and Doom 3 (I assume you mean Doom 3, since Doom wasn’t an OpenGL game)? It uses buffer objects. It also uses NV_vertex_array_range, where applicable. So again, I have no idea what you’re talking about.

Let the driver take care of this internally; with VBOs we are already copying the vertex data from system memory to the GPU. Forget about dynamic data? A function that draws a mesh or polygons? We MUST copy from CPU to GPU buffers.

Do you honestly believe that using regular vertex arrays, or immediate mode, doesn’t copy data from the CPU to GPU buffers?

By letting the driver “take care of this internally,” you’re effectively ensuring that, every frame, you are transferring megabytes of vertex attribute data from the CPU to the GPU across a PCIe bus. You’re willing to cede all of that performance?

If I have a million+ vertex model that uses 64-byte vertex data, I don’t want to have to transfer 64MB of data across the PCIe bus every frame. And that goes double if I’m doing shadowing and have to render it twice.
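That is the whole point of a static buffer object: the data crosses the bus once at load time, and every draw after that reads it from buffer storage the GPU can reach directly (sketch; the names and the 64-byte stride are illustrative):

/* At load time: the ~64MB of vertex data crosses the PCIe bus once. */
glBindBuffer(GL_ARRAY_BUFFER, meshVBO);
glBufferData(GL_ARRAY_BUFFER, vertexCount * 64, vertexData, GL_STATIC_DRAW);

/* Per frame (and again for the shadow pass): no vertex data is re-sent;
   the draw just references the buffer already uploaded. Attribute
   pointers are assumed to be set up elsewhere. */
glBindBuffer(GL_ARRAY_BUFFER, meshVBO);
glDrawArrays(GL_TRIANGLES, 0, vertexCount);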

Next time, remove VBOs and go back to glBegin/glEnd and traditional vertex arrays. They are much faster and more reliable.

No they aren’t. The only thing that can be said to be consistently faster than buffer objects is display lists, and that is only on some hardware.

The simple fact is this: the closer to the hardware you get, the greater the chance that you’ll fall off the fast path. However, the rewards are also greater if you happen to remain on it.

Client vertex arrays will give generally consistent levels of performance. But they will never give as good performance as proper use of buffer objects will. However, you can do the wrong thing with buffer objects and get poor performance. That’s the nature of getting low level.

It’s like instruction scheduling. If you could write shader assembly directly, it’s possible you could beat the compiler/linker’s scheduling and improve performance. However, you might also fail miserably at this task, thus making your performance worse.

Seconded. I’ve really come to believe that the VBO API is just not well conceived or well thought through. It badly needs to be much clearer what happens, when it happens, and under what conditions it happens, all the way through. A more prescriptive specification, one that maybe needs to abandon some of the core philosophies of OpenGL (don’t sweat the hardware details, let the driver handle it)? Maybe.

But they will never give as good performance as proper use of buffer objects will.

What’s “proper” use of buffer objects?

IHV-Driver-Bug Dependent?

Any official guidelines?

What’s “proper” use of buffer objects?

Time once was that using the wrong vertex format on the wrong hardware with client arrays would kill your performance. When compiled vertex arrays were in vogue, there was precisely one vertex format that was accelerated: the one that Quake 3 was using. If you used anything else, your performance died on the spot.

My point is that this is not some new problem that buffer objects created. This has always been around. Finding the fast path has always been fraught with peril and potential performance disaster. Going low-level means taking performance into your own hands.

Longs Peak was going to introduce a vertex format object, which drivers could reject creation of if the vertex format was sub-optimal (similar to GL_FRAMEBUFFER_UNSUPPORTED). But that died.

Well, when I found out about this workaround, I just thought: there must be a reason why game producers test their products on different GPUs. You can check the GL vendor on startup, then select the best method to update a VBO from that. This sort of thing has been around for a while, as Alfonse observed. What I wonder about is whether glfreak has tested both approaches on different GPUs.
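At startup that amounts to something like this (crude sketch; the vendor-string matching and the enum are mine, not any official mechanism):

/* Crude vendor sniffing to pick a VBO update path. Purely illustrative;
   needs <string.h> for strstr, and real code should still profile both
   paths on the target hardware. */
typedef enum { UPDATE_MAP_BUFFER, UPDATE_BUFFER_SUB_DATA } VboUpdatePath;

VboUpdatePath choose_update_path(void)
{
    const char *vendor = (const char *)glGetString(GL_VENDOR);
    if (vendor && strstr(vendor, "ATI") != NULL)
        return UPDATE_BUFFER_SUB_DATA; /* the workaround that helped in this thread */
    return UPDATE_MAP_BUFFER;
}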

I use Dark Photon’s caching VBO approach, and I really don’t care much about the perf of glMapBuffer()/glBufferSubData(). If there is a cache miss, the VBO is orphaned and the data has to be reloaded, which is not particularly fast in any case - it causes some frame breakage regardless of the method I use.
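For reference, the orphan-and-reload on a cache miss is roughly this (sketch; cacheVBO, bufferSize, and newData are placeholders):

/* Orphan the old store so the driver doesn't synchronize on it, then reload. */
glBindBuffer(GL_ARRAY_BUFFER, cacheVBO);
glBufferData(GL_ARRAY_BUFFER, bufferSize, NULL, GL_DYNAMIC_DRAW); /* orphan */
glBufferSubData(GL_ARRAY_BUFFER, 0, bufferSize, newData);         /* reload */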

Why can’t this be added to OpenGL today? It looks quite simple for driver developers to implement something like glGetPerformanceHint(GL_DOUBLE, GL_RGBA) or something similar…

In the land of VBOs and GL… I suspect (but I do not know!) that when a vendor makes a D3D driver, the OS (Windows) selects the allocation/mapping jazz of buffers, not the vendor’s driver. Indeed, if folks remember when Vista was first around, some games ran slower on Vista… the cause was not Vista, but rather that some of those games were doing the D3D analogue of glTexImage instead of glTexSubImage in places… the drivers for XP and before did more of the memory jazz, and so the driver implementations could “hack” it. Back to GL: the hardware vendor can choose how they do the buffer objects, and all the flags are just hints… so VBO behavior can vary wildly across vendors and possibly even driver versions.

My opinion is that there is nothing wrong with the API, but the issue is that each GL implementation can choose how to interpret the hints, and so we get, as Dark Photon says, the “VBO Ouija board”.

On a related note, I wonder if, under the Apple platform with different GPUs, there is a big difference in performance between the various VBO handling strategies. Anyone have some (recent) numbers comparing the VBO Ouija board of Apple GL with different GPUs?

On another note, ATI gives some advice for attribute alignment as well (I cannot remember exactly, but I think it was 64-bit alignment).
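If I understand the advice, it boils down to keeping the vertex stride and attribute offsets on nice boundaries, e.g. something like this (purely illustrative; the 32-byte stride is my example, not ATI’s exact number):

/* Illustrative interleaved vertex: explicit padding rounds the stride
   up to 32 bytes so every vertex starts on an aligned boundary.
   The exact alignment ATI recommends is not confirmed here. */
typedef struct {
    float         position[3]; /* bytes  0..11 */
    float         texcoord[2]; /* bytes 12..19 */
    unsigned char color[4];    /* bytes 20..23 */
    unsigned char pad[8];      /* bytes 24..31, pads the stride to 32 */
} Vertex;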