Mixing VBO and client memory arrays?

Is there any penalty involved when mixing VBO arrays and non-VBO arrays when doing glDrawElements()?

Situation: I am trying to emulate GL_QUADS on OpenGL ES (which does not have GL_QUADS) to draw many separate rectangular areas. I set up a pre-calculated index array (0,1,2, 0,2,3, 4,5,6, 4,6,7, etc.) that I store in a VBO.
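
For reference, building that index pattern once and uploading it into a static VBO might look like the sketch below (MAX_QUADS is a hypothetical upper bound; 16-bit indices assumed):

#define MAX_QUADS 256  // hypothetical upper bound on quads per flush

GLushort indices[MAX_QUADS * 6];
GLuint ibo;

// two triangles (0,1,2 and 0,2,3) per group of four vertices
for (int q = 0; q < MAX_QUADS; ++q)
{
    GLushort base = (GLushort)(q * 4);
    indices[q * 6 + 0] = base + 0;
    indices[q * 6 + 1] = base + 1;
    indices[q * 6 + 2] = base + 2;
    indices[q * 6 + 3] = base + 0;
    indices[q * 6 + 4] = base + 2;
    indices[q * 6 + 5] = base + 3;
}

// upload once; the pattern never changes
glGenBuffers(1, &ibo);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(indices), indices, GL_STATIC_DRAW);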

The vertex array, on the other hand, is 100% dynamic, so using a VBO and doing glBufferSubData() before glDrawElements() seems to be the wrong solution.

Actually, when using glDrawElements() without any VBOs at all, I get better performance on my PowerVR device.

Question is: are there any drawbacks to keeping the fixed index array in a static VBO, and the vertex array in a client memory array (i.e. only doing glBindBuffer() for the index array, but not for the vertex array)?

As soon as you have a vertex array in client memory, it will be a bottleneck. I can even imagine that rendering this way, with indices from a buffer object and vertex arrays from client memory, could be even slower than having everything in client memory, but that depends on the platform.

Problem is, I more or less need to have the vertex array in client memory in any case, since it’s like a draw queue that is successively and dynamically built until it needs to be flushed (it’s never reused). Doing the copy-to-VBO just before doing DrawElements seems like extra overhead (and as I said, it’s a performance hit on my device).

The only sane approach that I can see would be to use two VBOs: one that is being rendered from and one that is being filled, like double buffering, using glMapBuffer to fill the new array (unfortunately, glMapBuffer is not available in OpenGL ES).
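
A rough sketch of that idea, substituting glBufferData for the unavailable glMapBuffer (vbo, flush_queue, and the 2D vertex format are made up for illustration; the static index VBO is assumed to be bound already):

GLuint vbo[2];   // created elsewhere with glGenBuffers
int current = 0;

void flush_queue(const GLfloat *verts, GLsizeiptr size, GLsizei index_count)
{
    current = 1 - current;                 // alternate between the two VBOs
    glBindBuffer(GL_ARRAY_BUFFER, vbo[current]);
    glBufferData(GL_ARRAY_BUFFER, size, verts, GL_DYNAMIC_DRAW);
    glVertexPointer(2, GL_FLOAT, 0, 0);    // GL_VERTEX_ARRAY client state assumed enabled
    glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_SHORT, 0);
}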

The only sane approach that I can see would be to use two VBOs: one that is being rendered from and one that is being filled, like double buffering, using glMapBuffer to fill the new array (unfortunately, glMapBuffer is not available in OpenGL ES).

Double buffering can help, but this kind of performance characteristic is usually platform-dependent. The problem with BufferSubData is that it’s a blocking command. If it weren’t, the original approach would most probably be satisfactory.

Mixing VBOs and client memory is the worst case on NV hardware (I don’t know about other vendors). It is better to have everything in client memory. Of course, the best choice is to have everything in VBOs.

The vertex array, on the other hand, is 100% dynamic, so using a VBO and doing glBufferSubData() before glDrawElements() seems to be the wrong solution.

The solution to your problem is to replace the data via:
glBufferData(GL_ARRAY_BUFFER, data_size, data_ptr, GL_STREAM_DRAW);

It allows the driver to asynchronously work on your batches.
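
In context, the per-frame pattern would be roughly this sketch (placeholder names; note this is desktop GL usage, since ES 1.1 has no GL_STREAM_DRAW):

glBindBuffer(GL_ARRAY_BUFFER, dynamic_vbo);
// Respecifying the whole store lets the driver hand back a fresh block
// (orphaning) instead of waiting for the GPU to finish with the old one.
glBufferData(GL_ARRAY_BUFFER, data_size, data_ptr, GL_STREAM_DRAW);
glVertexPointer(2, GL_FLOAT, 0, 0);
glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_SHORT, 0);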

Question: What kind of geometry do you render? Is it by chance (bounding) boxes?

It allows the driver to asynchronously work on your batches.

That’s true, if the driver does orphaning, but I’m not sure whether such a thing is implemented for an embedded system like a PowerVR device. Besides, BufferData also blocks, so the CPU is stalled until the copy is done, which means the GPU is stalled as well because no commands are fed to it meanwhile.

I know exactly what you mean.

For years I kept reading “Use VBOs! It’s the new, cool way”. But for years, NVidia client arrays kept smoking VBOs for dynamically populated batches (and even “statically” populated batches in many cases). So what to do? Well, use client arrays of course. Did that for many years. Until at last Rob Barris helped design an OpenGL extension that made “streaming with VBOs” faster than NVidia client arrays (see post below). You get this speed-up when you can reuse batches (i.e. not have to upload every batch every frame), and you magnify this speed-up when you can dispatch them with NVidia bindless.

For details, read these:

No, while that is possible, it is not necessarily true. The GL driver pipelines like crazy where it can. And it’s not like a client arrays batch freezes the GPU pipeline. The GPU may already be pretty darn busy doing something else right then so the latency could be hidden. But a bunch of tiny client arrays batches, yeah, that’s increasingly likely to cause a CPU-side bottleneck (that is, if the GPU finishes everything ahead in the work queue, and the CPU is still mucking around trying to feed it the next command, then yeah, you’re submission bottlenecked). This is less likely with bindless, as it reduces the CPU-side batch dispatch overhead, but it’s still possible.

Agree on the first. That’s definitely an ugly case.

But the latter (best = all in VBOs) is not necessarily true. On NVidia, client arrays are pretty darn efficient. They can smoke VBO perf if you don’t do your VBO mucking just right. Naive VBO updating will often lose against client arrays. See previous post for details.

I would say best = in VBOs with NV bindless dispatch, OR display list dispatch. Those are roughly equivalent. Anything else I’ve tried is slower (on NVidia). If I had to cite a general “2nd place”, I’d say that is client arrays for smaller batches and classic VBOs for huge batches.

I’m no GL driver writer, but from what I’ve read that’s not true. For a single GL buffer handle, there can be multiple “memory blocks” associated with it behind the scenes (orphans). GL may be ripping batches off one “orphan” while you’re loading into another. No synchronization bottleneck here. Again, a good GL driver pipelines like crazy where it can. You can help it by telling it when to orphan, and advising it when your modifications will not “stomp on its toes” so you can avoid needless orphaning when possible. See the link to Rob Barris’ post above for more details on this.
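
For illustration, that kind of steering looks roughly like this with desktop GL’s glMapBufferRange (a sketch; buffer names, sizes, and offsets are placeholders):

glBindBuffer(GL_ARRAY_BUFFER, stream_vbo);

// Ask for an orphan: we promise not to care about the old contents.
void *dst = glMapBufferRange(GL_ARRAY_BUFFER, 0, buf_size,
                             GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
memcpy(dst, data_ptr, buf_size);
glUnmapBuffer(GL_ARRAY_BUFFER);

// Or, when appending to a region no pending draw reads from, promise not
// to stomp on the driver's toes, so neither a sync nor an orphan is needed.
dst = glMapBufferRange(GL_ARRAY_BUFFER, write_offset, chunk_size,
                       GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
memcpy(dst, chunk_ptr, chunk_size);
glUnmapBuffer(GL_ARRAY_BUFFER);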

Thanks for all the info! In a way, I’m happy to see that I wasn’t too far off when doubting the usefulness of VBOs for dynamic geometry.

As I said, I’m sitting on OpenGL ES (1.1), so I’m basically lacking all the fancy features (no GL_STREAM_DRAW, no glMapBuffer, no glMakeBufferResidentNV, etc).

So I guess I’ll simply go with the client memory arrays solution then, and be fairly confident that it’s (at least close to) the optimal solution.

This is less likely with bindless, as it reduces the CPU-side batch dispatch overhead, but it’s still possible.

Please don’t always refer to bindless; everybody does that nowadays. That’s an NVIDIA-only tech, and in its current form it will never make it to core or to a multivendor extension (fortunately). OpenGL is a cross-vendor, cross-platform API, so talking about a vendor extension as a generic solution is rather useless.

BufferData also blocks, so the CPU is stalled until the copy is done, which means the GPU is stalled as well

I’m no GL driver writer, but from what I’ve read that’s not true.

It has to be true, because there is no other solution. While the driver is copying the data you passed in as a pointer to BufferData, the call must not return, because there is no guarantee that the application won’t free the client-side buffer right after passing it to BufferData. This means that at least during the copy, the application is stalled and no new draw commands are fed to the driver meanwhile (at least in the current context/thread).
I know about orphaning, and I was not referring to the stall caused by the replacement of the buffer data. I know that the old and the new buffer data can coexist (at least this is how it is done in desktop drivers; maybe it is not the same for embedded drivers). I was only referring to the stall incurred by the completion of the actual BufferData call.

This means that at least during the copy, the application is stalled and no new draw commands are fed to the driver meanwhile (at least in the current context/thread).

With client-side vertex arrays exactly the same must happen. glDrawElements won’t return until either
a) the draw call itself is done, or
b) the driver has made a copy of the data first. The problem is that it is difficult for the driver to find out how much data will be referenced by the draw call (without scanning all the indices), so the driver will probably do a).

So, glBufferData() is the better solution. The driver won’t have to guess how much data has to be copied, because we explicitly told it how many bytes to copy.

glMapBuffer/glUnmapBuffer will be even slower. In my experience, mapping is not faster than glBufferData for very small buffers. That is because, even though glMapBuffer gives you a pointer back, there’s NO guarantee that the data won’t get copied a second time. This is because some drivers might return
a) a pointer to directly mapped GPU memory (where mapping is probably a heavyweight operation), or
b) a pointer to some system memory, in which case the driver does a copy on unmapping.
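
For reference, the map/unmap path being compared is this (desktop GL; names are placeholders):

glBindBuffer(GL_ARRAY_BUFFER, vbo);
// The returned pointer may be real GPU memory (case a) or a driver-side
// staging copy (case b); the API gives no way to tell which.
void *ptr = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
memcpy(ptr, data_ptr, data_size);
glUnmapBuffer(GL_ARRAY_BUFFER);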

Reading the original post, I guess it’s about rendering many very small objects that have no more than 8 vertices per draw call. The actual memory bandwidth needed to copy this data around is probably negligible. He should think about using a vertex shader that creates the quads from a “unit quad” via a few shader parameters (passing position and size to the vertex shader). This results in a kind of instancing, and the problem of uploading data for each draw call would then just go away :slight_smile:

The calls to render such quads would then look like:



glBindBuffer(GL_ARRAY_BUFFER, unit_quad_vbo);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, unit_quad_ibo);
glEnableVertexAttribArray(0);
glVertexAttribPointer(0, 2, GL_FLOAT, GL_FALSE, 0, 0);
for (int q = 0; q < num_quads; ++q)
{
    // pass position and size of the quad into the vertex shader via dangling attributes
    glVertexAttrib4fv(POS_SIZE_ATTRIB, quad_pos_size[q]);  // quad_pos_size: four floats per quad
    glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_SHORT, 0); // one quad = two triangles
}

This would allow rendering many, many quads (all from the same VBO data) with minimal effort.

This means that at least during the copy, the application is stalled and no new draw commands are fed to the driver meanwhile (at least in the current context/thread).

But that’s not what you said. Let’s look back:

The GPU is not stalled. No new draw commands are issued from the CPU, but the GPU can (and will) continue to process previously issued commands. It will only stall in the (highly unlikely) event that it has finished with all of the previous work.

Furthermore, your statement suggests that the CPU is stalled until after the DMA is complete. This is unlikely. It’s much more likely that the implementation will allocate a temporary buffer, copy your data into it, and DMA from there as needed. This is generally true of any OpenGL command that pulls directly from client memory.

So, glBufferData() is the better solution. The driver won’t have to guess how much data has to be copied, because we explicitly told it how many bytes to copy.

Which incidentally is the primary reason why glDrawRangeElements exists.
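
For example (a client-side-array sketch with hypothetical counts):

// The start/end arguments tell the driver that only vertices
// [0, vertex_count - 1] are referenced, so it knows how much client
// memory to copy without scanning the index array first.
glVertexPointer(2, GL_FLOAT, 0, client_vertices);
glDrawRangeElements(GL_TRIANGLES, 0, vertex_count - 1,
                    index_count, GL_UNSIGNED_SHORT, client_indices);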

With client-side vertex arrays exactly the same must happen. glDrawElements won’t return until either
a) the draw call itself is done, or
b) the driver has made a copy of the data first. The problem is that it is difficult for the driver to find out how much data will be referenced by the draw call (without scanning all the indices), so the driver will probably do a).

So, glBufferData() is the better solution.

Agree with that, but if you check my posts, I didn’t say that vertex arrays are any better. I was just suggesting that he should not mix client- and server-side vertex arrays. It is usually platform-specific whether client-side-only arrays or BufferData would be more efficient: e.g. I would go with BufferData on desktop, but maybe I would stick to client-side vertex arrays on embedded devices.

The GPU is not stalled. No new draw commands are issued from the CPU, but the GPU can (and will) continue to process previously issued commands. It will only stall in the (highly unlikely) event that it has finished with all of the previous work.

Okay, consider the following situation:

  1. some draw commands are issued
  2. then BufferData is used to update the vertex buffer
  3. DrawElements is issued
  4. some further draw commands are issued
Well, in this case, around the time BufferData is stalling the CPU, the GPU is in fact processing the previously issued draw commands. However, steps 3 and 4 still have to be processed by the driver, and meanwhile it is very probable that the GPU has already finished the earlier tasks, so it will be stalled until the driver can send the processed draw commands.
Of course, this is not certain, and it strongly depends on whether the application is CPU- or GPU-limited: if it is GPU-limited, there would be no performance hit. However, if the CPU and GPU load would otherwise have been well balanced, or if there is a CPU limit (which, to be honest, is much more common nowadays), then BufferData can in fact incur a GPU stall.
Everything depends on the application, so maybe my suggestion is not valid in this case, but we cannot say that there can be no GPU stall at all when using BufferData often.

Furthermore, your statement suggests that the CPU is stalled until after the DMA is complete. This is unlikely. It’s much more likely that the implementation will allocate a temporary buffer, copy your data into it, and DMA from there as needed.

That’s also true, but the copy to the temporary buffer still takes time. Not much, but in a real-time application even a little stall can have severe effects.

Okay, consider the following situation:

  1. some draw commands are issued
  2. then BufferData is used to update the vertex buffer
  3. DrawElements is issued
  4. some further draw commands are issued

And therein lies your problem: don’t do it this way.

There is no need for your glBufferData or glMapBufferRange(GL_MAP_INVALIDATE_BUFFER_BIT) call to immediately precede the glDrawElements call. The way it should work is:

1: Generate vertex data and upload (glBufferData or glMapBuffer, however you wish).
2: Draw some stuff.
3: Draw some stuff with the generated vertex data.
4: Draw some more stuff.
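
A minimal sketch of that ordering (draw_static_scene and draw_overlays are invented stand-ins for “some stuff”):

// 1: upload early, while the GPU is still chewing on older work
glBindBuffer(GL_ARRAY_BUFFER, dynamic_vbo);
glBufferData(GL_ARRAY_BUFFER, data_size, data_ptr, GL_STREAM_DRAW);

// 2: draw unrelated stuff; the upload can complete in the meantime
draw_static_scene();

// 3: draw with the freshly generated vertex data
glBindBuffer(GL_ARRAY_BUFFER, dynamic_vbo);
glVertexPointer(2, GL_FLOAT, 0, 0);
glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_SHORT, 0);

// 4: draw some more stuff
draw_overlays();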

meanwhile it is very probable that the GPU has already finished the earlier tasks, so it will be stalled until the driver can send the processed draw commands.

Actually no. If you’re doing any serious rendering work, it is very probable that the GPU is still rendering commands you issued from last frame. And if it’s not, then you probably have bigger problems with feeding the GPU than dynamic rendering.

That’s also true, but the copy to the temporary buffer still takes time. Not much, but in a real-time application even a little stall can have severe effects.

The copy to the temporary buffer takes the same amount of time whether you’re using glBufferData or glDrawRangeElements with client-side arrays (if you are using CSAs, use glDrawRangeElements). The driver has to copy the data either way. So you lose nothing with the glBufferData method.

If you’re doing any serious rendering work, it is very probable that the GPU is still rendering commands you issued from last frame.

That’s true, but if the CPU time of a frame is more than the GPU time, then the GPU will be stalled anyway, and this is a much more common problem in today’s renderers.

The copy to the temporary buffer takes the same amount of time whether you’re using glBufferData or glDrawRangeElements with client-side arrays

Yes, I know, but as I mentioned earlier, I’m not against BufferData as an alternative to client vertex arrays. You missed the point of why I talked about the blocking issue of BufferData: I mentioned it because skynet said that the solution to avoid the blocking of BufferSubData is to use BufferData. Both have their advantages (BufferSubData is more lightweight, BufferData allows orphaning, thus eliminating sync), but both are blocking calls. That’s the only thing I wanted to point out.

Both have their advantages (BufferSubData is more lightweight, BufferData allows orphaning, thus eliminating sync), but both are blocking calls.

What kind of “lightweight” are you referring to? From an application performance standpoint, glBufferData() is the more lightweight operation. If you refer to what the driver has to do/manage behind the scenes… that’s none of our business :slight_smile:

I mean lightweight in the sense that no new GPU memory allocation has to be done by the driver. I’m pretty sure that it takes more CPU time for the driver to allocate new buffer storage than to update an existing one, and it is our business, as we should care how much time our API calls cost in order to make our renderers efficient, even though we do not care about the actual implementation details.
It is not a coincidence that BufferSubData and TexSubImage are mentioned everywhere as lightweight operations compared to BufferData and TexImage.

I’m pretty sure that it takes more CPU time for the driver to allocate new buffer storage than to update an existing one

That seems unlikely. Did you read those links Dark Photon pointed you to? Particularly this one?

If you call glBufferData(NULL), or use glMapBufferRange(GL_MAP_INVALIDATE_BUFFER_BIT), then you’re asking for new, uninitialized memory. If you call glBufferSubData, you’re asking to copy client memory into already existing memory.

Updating a fresh piece of memory requires no GPU synchronization. Updating an already existing piece of memory requires synchronization. The latter is generally going to hurt more than the former.

Orphaning dynamic streamed buffers is a technique that has been recommended since the earliest days of buffer objects. There’s probably a reason for that.
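
Side by side, as a sketch with placeholder sizes:

// Fresh memory: orphan first, then fill; no GPU synchronization needed.
glBufferData(GL_ARRAY_BUFFER, buf_size, NULL, GL_STREAM_DRAW);
glBufferSubData(GL_ARRAY_BUFFER, 0, buf_size, data_ptr);

// Existing memory: update in place; the driver may have to wait for
// pending draws that still read from this storage.
glBufferSubData(GL_ARRAY_BUFFER, update_offset, update_size, update_ptr);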

If you call glBufferData(NULL), or use glMapBufferRange(GL_MAP_INVALIDATE_BUFFER_BIT), then you’re asking for new, uninitialized memory… Orphaning dynamic streamed buffers is a technique that has been recommended since the earliest days of buffer objects.

I know that, and I know that this way you don’t have sync issues. I’m aware of all these techniques, and I agree that if you update the whole buffer this should be better, but consider updating only 1 KB of a 1 MB buffer; then orphaning is maybe not the best choice.
The problem is that we are arguing about something that we all know and agree about, and you keep misunderstanding my statements. I don’t really want to continue it, and it obviously won’t help marcus256 if we are just picking at each other’s statements.