Issue with Client Arrays

I’m using client arrays in my program to render a little over 100,000 triangles; I chose them because I’m concerned about VBO creation time at runtime.

Unfortunately, every five or so frames I get an enormous rendering spike (~50 ms frames). Is that normal? I switched from client arrays to VBOs and the spikes disappeared completely. The rendering time also improved considerably. From what I can see, VBO creation time is as fast as or faster than client arrays. Is there ever a reason to use client arrays over VBOs?
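For reference, here is a minimal sketch of the client-array path described above (verts, indices, and index_count are placeholder names; fixed-function vertex arrays are assumed):

[code]
#include <GL/gl.h>

/* Client-array path: the pointer refers to application memory, so the
   driver must copy (or DMA) the data out on every draw call. */
void draw_with_client_arrays(const GLfloat *verts, const GLuint *indices,
                             GLsizei index_count)
{
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, verts);
    glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_INT, indices);
    glDisableClientState(GL_VERTEX_ARRAY);
}
[/code]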

Thanks.

Is there ever a reason to use client arrays over VBOs?

No. Well… no. Even if your data is changing every frame, there are enough tools (mapping, etc.) to make VBOs much faster than client arrays.
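For instance, a minimal sketch of the mapping tool mentioned here (vbo, new_data, and n_bytes are placeholder names; the GL 1.5 buffer entry points are assumed to be loaded):

[code]
#include <string.h>
#include <GL/gl.h>

/* Rewrite a buffer's contents each frame by mapping it into client memory. */
void update_vbo(GLuint vbo, const void *new_data, GLsizeiptr n_bytes)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    void *ptr = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    if (ptr) {
        memcpy(ptr, new_data, n_bytes); /* write this frame's vertices */
        glUnmapBuffer(GL_ARRAY_BUFFER); /* must unmap before drawing */
    }
}
[/code]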

Unqualified, that statement is blatantly false. Under some circumstances, client arrays are faster than VBOs. Under others, the reverse is true.

Perhaps you’ll write up the definitive “How to Cook VBOs That Always Beat Client Arrays” whitepaper with an open-source test tool, since the vendors haven’t stepped up to the task? That’ll put some meat in this assertion – code everyone can run and verify with. Please address various vendor GPU cards, driver versions, vertex attribute formats, interleaving schemes, batch sizes, multibatch packing schemes, VBO update schemes, delta times between upload and use, background frame time required for VBO allocation and update, etc., and identify under what subset of circumstances VBOs will beat client arrays.

And no, this cannot assume that all the VBOs will be created and uploaded prior to “render time”. Many GL apps frequently operate on datasets where all potential batches are much, much larger than all of GPU (and even CPU) memory. And no spurious frame time spikes are allowed.

Bottom line, this isn’t as stupid-simple as you pretend that it is, and you do a disservice to readers by being so frivolous.

Pretty harsh words, although his general advice is sound.

VBOs are the method to use these days. Sure, with drivers still supporting client-side arrays there MIGHT be corner cases where they could be faster, but the general intention is that VBOs replace them completely. Of course you are right that only after extensive testing can one say anything for sure, but without any hard evidence that he is incorrect AT ALL, you might not want to shout so loudly about the statements being “blatantly false”.

ViolentHamster: Use VBOs. They are the way to go, or at least they are intended to be. There are a few dos and don’ts to get good performance, but as long as you master them you can be pretty sure to be on a fast path. With client-side arrays, even if you find a case that might be faster in some test, it is very possible that some other hardware/driver combo will be slower. VBOs are the thing drivers optimize today. Client-side arrays are only there for backwards compatibility.

Of course Dark Photon is right that the whole thing isn’t “stupid-simple” to get right, but one thing IS “stupid-simple”: the ARB and all vendors agree that VBOs are the “true” way to specify vertex arrays today. At least you can be sure that they do a lot to guarantee optimal performance in the most common use cases. Now all you need to do is actually use them in the proper way.

Jan.

Yeah, sorry. As an engineer, assertions like this:

[quote]Is there ever a reason to use client arrays over VBOs?
No.[/quote]
really bug me, even if I didn’t know it was false. No qualifications, no proof, no nothing. This says that policy and politics have more to do with the statement than fact. GL3 says X, so I’ll say X. GL3 also says no display lists, and those are the fastest method out there (on NVidia), bar none!

The fact is (based on personal experience) that VBOs aren’t always faster than client arrays. Are they the future? Yes. When writing new code, should you try them first? Yes. But only when performance is sufficient or irrelevant (i.e., academia, not industry) should you skip trying client arrays and display lists as well. Particularly the latter, as it usually beats out both (on NVidia).
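For readers who haven’t used them: a display list records GL commands once and replays them later by handle. A minimal sketch (the triangle is a placeholder):

[code]
#include <GL/gl.h>

/* Compile a list once at load time. */
GLuint build_list(void)
{
    GLuint list = glGenLists(1);
    glNewList(list, GL_COMPILE);   /* record, don't execute */
    glBegin(GL_TRIANGLES);
    glVertex3f(0.0f, 0.0f, 0.0f);
    glVertex3f(1.0f, 0.0f, 0.0f);
    glVertex3f(0.0f, 1.0f, 0.0f);
    glEnd();
    glEndList();
    return list;
}

/* Per frame: replay the precompiled commands. */
void draw_list(GLuint list)
{
    glCallList(list);
}
[/code]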

…only after extensive testing one can say anything for sure, but without having hard evidence for him being incorrect AT ALL, you might not want to shout so loud about the statements being “blatantly false”.

Well, as soon as Alfonse publishes his test code to prove his original assertion, we’ll have many examples of this.

Some of it makes intuitive sense: with client arrays, the driver can just DMA the vertex data into temp ring buffers for pipeline rendering, and once the app reaches a steady state, no allocation is required (presumably). However, with VBOs, not only CPU (driver) but GPU allocation/garbage collection is required to make space for those VBO buffers – because you’re going to call them up by handle later and the driver has to remember the contents. Also, as I recall, VBO binds are relatively expensive, thus the repeated recommendation from GL folks here that you often have to pack multiple batches into shared VBOs to get the perf up. But who knows – with VBOs the box is too black, with so many permutations and so few hard rules.
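A minimal sketch of that shared-VBO packing recommendation (batch_a, batch_b, and their sizes are illustrative assumptions; fixed-function attribute setup is assumed):

[code]
#include <stdint.h>
#include <GL/gl.h>

/* Pack two batches back to back into one buffer object. */
GLuint pack_two_batches(const void *batch_a, GLsizeiptr bytes_a,
                        const void *batch_b, GLsizeiptr bytes_b)
{
    GLuint vbo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, bytes_a + bytes_b, NULL, GL_STATIC_DRAW);
    glBufferSubData(GL_ARRAY_BUFFER, 0, bytes_a, batch_a);
    glBufferSubData(GL_ARRAY_BUFFER, bytes_a, bytes_b, batch_b);
    return vbo;
}

/* One bind serves both draws; only the attribute offset changes. */
void draw_two_batches(GLuint vbo, GLsizei verts_a, GLsizei verts_b,
                      GLsizeiptr bytes_a)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, (const GLvoid *)0);
    glDrawArrays(GL_TRIANGLES, 0, verts_a);
    glVertexPointer(3, GL_FLOAT, 0, (const GLvoid *)(uintptr_t)bytes_a);
    glDrawArrays(GL_TRIANGLES, 0, verts_b);
    glDisableClientState(GL_VERTEX_ARRAY);
}
[/code]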

…but as long as you master them … the whole thing isn’t that “stupid-simple” to get right … Now all you need to do, is to actually use them in the proper way.

Therein lies the rub. Got some URLs you can point folks to for the definitive word on using them the “proper way” (not just a few general tips)? I’m hoping you do and that I just haven’t tripped over them yet. This is a recurring topic on the boards.

GL3 also says no display lists, and those are the fastest method out there (on NVidia), bar none!

This may be hard for you to believe, but OpenGL is a cross platform API. It isn’t NVIDIA_GL, it’s OpenGL. Just because NVIDIA spent the effort to optimize display lists doesn’t mean that ATI or Intel will (or should, for that matter). The same goes for client arrays.

If you absolutely must have 100% maximum performance across all platforms, expect your rendering code to be a nightmare of switches and special cases, jumping from display lists to buffer objects to client arrays based on a myriad of factors like IHV, driver version numbers, and the like. Expect to have to test each of these cases across all OpenGL implementations, each time a new driver version comes out, and ship new patches of your code to stay ahead of the curve.

If you want your rendering code to be sane, then just use buffer objects. If you actually run into an issue on one platform or another, deal with it then. Premature optimization is always the wrong move.

Show me a reasonable usage case, something that someone would actually want to do, where buffer objects can’t be made faster than client arrays.

Common sense tells me that client arrays should never beat VBOs, if you think about it. What would a modern GPU driver have to do in the case of client arrays?

  1. Scan the indices for the min/max vertex index (glDrawRangeElements might prevent that) in order to find out how much memory is needed.
  2. Allocate temporary buffer(s) on the GPU (for vertex attribs and indices).
  3. Upload all associated data (synchronously!).
  4. Issue a draw call.

What would a VBO based code do?

  1. Allocate a VBO of appropriate size (the app should know this).
  2. Upload the data (via glBufferData/glBufferSubData/glMapBuffer).
  3. Issue a draw call.
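A minimal sketch of those three steps (vertex_data, n_bytes, and n_verts are placeholder names):

[code]
#include <GL/gl.h>

void draw_streamed(GLuint vbo, const void *vertex_data,
                   GLsizeiptr n_bytes, GLsizei n_verts)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    /* 1. Allocate a buffer of the size the app already knows. */
    glBufferData(GL_ARRAY_BUFFER, n_bytes, NULL, GL_STREAM_DRAW);
    /* 2. Upload the data. */
    glBufferSubData(GL_ARRAY_BUFFER, 0, n_bytes, vertex_data);
    /* 3. Issue the draw call, sourcing from the buffer at offset 0. */
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, (const GLvoid *)0);
    glDrawArrays(GL_TRIANGLES, 0, n_verts);
    glDisableClientState(GL_VERTEX_ARRAY);
}
[/code]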

Assumed that the GPU cannot render directly from system memory, how could the client array code be any faster?

Assumed that the GPU cannot render directly from system memory, how could the client array code be any faster?

If the driver doesn’t implement buffer objects correctly, this is very possible.

For example, a naive driver (like the one you describe) could implement all buffer objects as video memory. Thus, when you map a buffer, you get back a video memory pointer. The problem here is with streaming: frequently changing the vertex data.

If you are using and changing the data every frame, then every time you map the buffer, the thread must halt until the renderer is finished pulling vertex data from the buffer. The client-arrays approach will simply allocate new memory for you, and a malloc is almost always faster than a glFinish.

This is why the STREAM hint exists. This is why GL_MAP_INVALIDATE_BUFFER_BIT exists. This is why the glBufferData(NULL) trick exists. If a driver doesn’t implement these correctly, then you’re not going to get improved performance with buffer objects in the streaming case (not without manual double or triple buffering). Indeed, you’re going to get worse performance.
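A minimal sketch of the glBufferData(NULL) orphaning trick just mentioned (names are placeholders):

[code]
#include <string.h>
#include <GL/gl.h>

void stream_update(GLuint vbo, const void *new_data, GLsizeiptr n_bytes)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    /* Orphan the old storage: the driver keeps it alive until the GPU
       finishes reading it, and hands the app fresh storage instead. */
    glBufferData(GL_ARRAY_BUFFER, n_bytes, NULL, GL_STREAM_DRAW);
    void *ptr = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    if (ptr) {
        memcpy(ptr, new_data, n_bytes); /* no stall: storage is brand new */
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }
}
[/code]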

That being said, I still say to use buffer objects exclusively. Yes, a bad driver or driver bug can ruin your day. But you’re already beholden to the driver for client array optimization (a naive implementation of which, like the one described above, has atrocious performance). Assuming your driver will do the wrong thing isn’t going to get you good performance.