OpenGL glBufferData causes stuttering while loading terrain LOD

twippe · November 1, 2017, 11:51am

Hello,

I have implemented a chunked lod terrain system (https://www.classes.cs.uchicago.edu/archive/2015/fall/23700-1/final-project/chunked-lod.pdf).

Every time I move my camera and the LODs change, my application stutters a lot, with big FPS drops whenever chunks switch LOD.

I’ve found that it’s not my algorithms that are slow, so it must come from uploading to the GPU (glBufferData).

One solution I can imagine would be to preload a mesh for each LOD of each chunk and just switch meshes, so I won’t have to make any transfers to the GPU while moving the camera. I would still have to dynamically load the mesh that fills the cracks, but it’s not too big.

Each of my chunks has 16x16 tiles, and each tile is a quad made of 2 triangles. So at the maximum LOD I transfer 16 * 16 * 2 * 3 = 1536 vertices per chunk.
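As a quick sanity check of that arithmetic, here is a minimal sketch that computes the vertex count and the upload size for one 16x16 chunk. The function names are made up for illustration, and the attribute layout (float xyz position, xyz normal and a 2-component UV per vertex) is an assumption, not something taken from the actual mesh format.

```cpp
#include <cstddef>

// Hypothetical helper: vertex count for an unindexed chunk of
// tilesPerSide x tilesPerSide quads, 2 triangles per quad, 3 vertices each.
constexpr std::size_t chunkVertexCount(std::size_t tilesPerSide) {
    return tilesPerSide * tilesPerSide * 2 * 3;
}

// Hypothetical helper: bytes of vertex data per chunk, ASSUMING float
// positions (3), normals (3) and UVs (2) for every vertex.
constexpr std::size_t chunkVertexBytes(std::size_t tilesPerSide) {
    return chunkVertexCount(tilesPerSide) * (3 + 3 + 2) * sizeof(float);
}

// Matches the figure quoted in the post.
static_assert(chunkVertexCount(16) == 1536, "16 * 16 * 2 * 3 vertices");
```

With 4-byte floats this layout would come to 48 KiB of vertex data per chunk at the maximum LOD, before counting any index data.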

Do you have any other ideas for optimizing the transfer to the GPU? I use OpenGL ES 2.0.
Thanks

Silence · November 2, 2017, 12:16am

glBufferData will allocate memory on the GPU (also, since you did not say: do you explicitly delete the old memory that you don’t use anymore?). It will be faster if you allocate once and just copy into that memory when required (use glBufferSubData for this).
You can also have a look at glMapBuffer and glMapBufferRange. Depending on your needs and what you are doing, glBufferSubData or glMapBuffer will give you better performance.

Edit: it seems that the buffer-mapping functions might be absent on GL ES 2, so you might have to stick with glBufferSubData.

mhagain · November 2, 2017, 1:17am

The thing that’s not immediately obvious from this discussion is that the CPU and GPU operate asynchronously. The CPU will queue up commands and data, and at some point in time (when its queue is full, at the end of a frame, whenever) submit them for execution. At some other arbitrary point in time after that the GPU will pick them up and execute them. This allows both processors to operate simultaneously for theoretical high performance.

What this means in terms of buffer objects is that if you attempt to overwrite or replace a buffer that has outstanding draw calls pending on it, everything must stop, and those outstanding draw calls must run to completion before the overwrite or replace operation can continue. This completely breaks the asynchronous operation - it’s the moral equivalent of putting a big dirty glFinish call at that place in your code - and hence you get stuttering or otherwise poor performance.

OpenGL in general terms recognizes this model - hence glFlush and glFinish - but regrettably the first implementation of buffer objects didn’t, and so requires careful nursing and hand-holding to avoid performance pitfalls. In other words you can’t just call one of the original bunch of glBuffer APIs and expect it to be well-behaved or expect the driver to do the right thing. You need some degree of awareness of what’s actually happening behind the scenes, and to build your code around that. It’s not a good abstraction.

This has of course been patched through subsequent evolution of the API, which is of scant consolation to those targeting downlevel hardware or API versions.

twippe · November 2, 2017, 2:43am

[QUOTE=Silence;1289169]glBufferData will allocate memory on the GPU (also, since you did not say: do you explicitly delete the old memory that you don’t use anymore?). It will be faster if you allocate once and just copy into that memory when required (use glBufferSubData for this).
You can also have a look at glMapBuffer and glMapBufferRange. Depending on your needs and what you are doing, glBufferSubData or glMapBuffer will give you better performance.

Edit: it seems that the buffer-mapping functions might be absent on GL ES 2, so you might have to stick with glBufferSubData.[/QUOTE]

I do not delete the old memory, but I don’t use glBufferSubData either; I use glBufferData.
I have one VBO per mesh (per chunk), so I just call glBufferData when a mesh changes.

If I used glBufferSubData when a mesh changes, I would also have to clear the old mesh state (glBufferData does that), so I don’t know whether I would gain any performance?

GClements · November 2, 2017, 6:06am

Replacement (glBufferData) can copy the data into unused memory and “orphan” the existing memory (mark it for deallocation once pending commands have completed). This is also possible for overwriting (glBufferSubData), but harder to implement if you’re only overwriting a portion of the buffer: either the unmodified portions must be copied, or the modified portion must be copied to unused memory and then the overwrite must be enqueued in the command stream.

Whether or not a given implementation will actually perform such optimisations is unspecified.

The one constant is that the implementation has to allow for the possibility that the client memory from which the data is sourced may be modified at any point after the function returns. So it either has to copy the data during the function call, or use some form of copy-on-write mechanism (which in turn would risk client-side stalls).

The safest option is to always copy into memory which is known not to be used by pending commands. I.e. newly-allocated buffers or previously-unused portions of existing buffers. But even then, if the command queue is deep and you’re updating the data rapidly, the system may simply not have enough memory to hold all of the different versions of the data over the interval between uploading and rendering. The result is that allocating new buffers will stall until existing buffers can be discarded.
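To make the "previously-unused portions of existing buffers" idea concrete, here is a minimal sketch of the CPU-side bookkeeping for one large buffer treated as a ring. Everything in it (the class name, the frame counter, the `retire` call) is hypothetical; it makes no GL calls, and it simply assumes the application has some way of knowing which frame the GPU has finished. It also shows the failure mode described above: when all remaining space is still referenced by pending frames, the allocation fails, which is where a real implementation would have to stall.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical bookkeeping for one big buffer object used as a ring.
// It only hands out byte offsets; the upload itself (for example a
// glBufferSubData call at that offset) is left to the caller.
class RingAllocator {
public:
    explicit RingAllocator(std::size_t capacity)
        : capacity_(capacity), head_(0), used_(0) {
        assert(capacity > 0);
    }

    // Reserve `size` contiguous bytes for data first drawn in `frame`.
    // Returns false when the only space left is still in flight.
    bool allocate(std::size_t size, unsigned frame, std::size_t& offset) {
        if (size == 0 || size > capacity_) return false;
        if (used_ == 0) head_ = 0;              // empty ring: restart at 0
        if (used_ == capacity_) return false;   // completely full
        const std::size_t tail =
            (head_ >= used_) ? head_ - used_ : head_ + capacity_ - used_;
        std::size_t padding = 0;
        if (head_ >= tail) {                    // live data does not wrap
            if (capacity_ - head_ < size) {     // no room at the end
                if (tail < size) return false;  // nor at the start
                padding = capacity_ - head_;    // skip the unusable end
                head_ = 0;
            }
        } else if (tail - head_ < size) {       // live data wraps around
            return false;
        }
        offset = head_;
        head_ += size;
        used_ += padding + size;
        inFlight_.push_back(Block{padding + size, frame});
        return true;
    }

    // Free every region belonging to frames up to `completedFrame`.
    // Assumes allocations were made in non-decreasing frame order.
    void retire(unsigned completedFrame) {
        while (!inFlight_.empty() &&
               inFlight_.front().frame <= completedFrame) {
            used_ -= inFlight_.front().bytes;
            inFlight_.pop_front();
        }
    }

private:
    struct Block { std::size_t bytes; unsigned frame; };
    std::size_t capacity_, head_, used_;
    std::deque<Block> inFlight_;
};
```

The trade-off is the one already described: a bigger ring tolerates a deeper command queue, but the memory for every in-flight version has to exist somewhere.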

Dark_Photon · November 2, 2017, 6:36am

I’d recommend reading this wiki page as it is very relevant to your question: Buffer Object Streaming.

Reading the chapter in OpenGL Insights on efficient buffer transfers is also worthwhile. Here is a copy of it online: Asynchronous Buffer Transfers. Note that it pre-dates PERSISTENT COHERENT buffer mapping, so see the previously referenced wiki page (and other tutorials) for that.

twippe · November 2, 2017, 8:59am

I just tried the orphaning method and it is worse now.

If I understand correctly, this is how I implemented it (code snippet for updating a mesh):

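    glBindBuffer(GL_ARRAY_BUFFER, buffers[0]);

    // Orphan the old store: re-specify it at full size with no data.
    glBufferData(GL_ARRAY_BUFFER,
                 maxPosSize,
                 NULL,
                 glMesh->renderMode);
    // Upload the new positions into the start of the fresh store.
    glBufferSubData(GL_ARRAY_BUFFER,
                    0, mesh.getPositions().size() * 3 * sizeof(float),
                    glm::value_ptr(mesh.getPositions().front()));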

maxPosSize should always be the same (I capped it according to the maximum LOD vertex count).
I have 3 other VBOs like that for normals, UVs and indices for each mesh.

I think I am going to try the multiple-buffering method. I guess I am going to need a lot more than 2 buffers, otherwise the orphaning method should have worked (it should be equivalent to double buffering?)

twippe · November 2, 2017, 3:33pm

So now I cycle through 6 buffers and there is still no difference.

I render buffer n-1 until an update, update buffer n, render buffer n until the next update, update buffer n+1, render buffer n+1 until the next update, etc.

No matter how many buffers I use, there is no difference. I think I am doing something wrong, or maybe I am transferring too much data…

Dark_Photon · November 2, 2017, 7:00pm

[QUOTE=twippe;1289178]I just tried the orphaning method and it is worse now
…
I use opengl ES 2.0[/QUOTE]

I’m sorry. I missed seeing the ES2 mention the first time through. Which GPU(s) and GLES drivers?

My first response was primarily geared toward feeding data via buffer objects to desktop GL, where the most common vendor GL drivers tend to support more buffer streaming capabilities than GLES.

On mobile it’s different. Buffer streaming capability is more limited (especially in ES2 drivers/GPUs, which are getting old), the drivers tend to be less compliant (partly due to the poor match of tile-based GPUs to GLES), and the consequences of feeding vertex data to GLES poorly via buffer objects can be much more severe than on desktop GL due to GPU architecture differences, blocking the draw thread for as much as 1-2 full frames when you “get it wrong”.

Your best bet here is to get very familiar with the OpenGL ES Programming Guide for your GPU and GLES driver. It should provide recommendations on how to get the best performance when updating buffer objects on their GLES implementation. If not, contact your GPU vendor or check their developer support forums for this information.

In the absence of this valuable vendor GLES driver info, just use client arrays for streaming vertex and index data to the GPU for starters. Particularly in ES2 drivers, the vendor has probably spent some time ensuring that streaming of vertex data through the API and to the GPU with client arrays is efficient. The only batch data I wouldn’t stream via client arrays for starters would of course be vertex data that is defined on app startup and doesn’t need to change at runtime. There you’d of course create and populate buffer objects for those on startup and then just use them at draw time (which shouldn’t result in any draw thread blocking – aka implicit synchronization)

If you do want to try your hand at some buffer object streaming without vendor driver guidance, here are some recommendations. First, the issue is this. Mobile pipelines are “very” deep (frames deep). This is necessary to minimize the RAM bandwidth needed for rasterization such that slow CPU DRAM can be used instead of fast VRAM common on discrete desktop GPUs. Consequently, the amount of time between 1) when a draw call referencing a buffer object has been submitted to the driver and 2) when the driver/GPU is actually finished reading from that buffer object can be a fairly long time compared to desktop GPUs. If you try to change a buffer object within this period, the driver may hard-block your draw thread until the GPU reaches #2, depending on driver architecture. However, wait until after #2 to change the buffer object, and you’re usually OK. So the generally recommended strategy for avoiding these draw thread blocks is to not change a buffer object until ~3 frames after you last submitted a draw call to the driver reading from it.
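One way to apply that "wait ~3 frames" rule is to track, per buffer object, the frame of the last draw call that read from it, and only hand a buffer back out for rewriting once enough frames have passed. The sketch below shows just that bookkeeping. The class and method names are hypothetical, each "slot" merely stands in for one GL buffer object, no GL calls are made, and the delay of 3 is the rule of thumb from above rather than a guarantee for any particular driver.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical pool of buffer "slots" with delayed reuse.
class DelayedReusePool {
public:
    // reuseDelay: frames to wait after the last draw before a slot
    // may be overwritten again.
    explicit DelayedReusePool(unsigned reuseDelay)
        : reuseDelay_(reuseDelay) {}

    // Return a slot that is safe to overwrite during `frame`. If every
    // free slot was drawn from too recently, grow the pool instead of
    // touching a buffer that may still be in flight.
    std::size_t acquire(unsigned frame) {
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            Slot& s = slots_[i];
            if (s.free && frame >= s.lastDrawFrame &&
                frame - s.lastDrawFrame >= reuseDelay_) {
                s.free = false;
                return i;
            }
        }
        slots_.push_back(Slot{false, frame});
        return slots_.size() - 1;
    }

    // The owner stops using `slot`; its last draw was submitted in `frame`.
    void release(std::size_t slot, unsigned frame) {
        assert(slot < slots_.size() && !slots_[slot].free);
        slots_[slot].free = true;
        slots_[slot].lastDrawFrame = frame;
    }

    std::size_t size() const { return slots_.size(); }

private:
    struct Slot { bool free; unsigned lastDrawFrame; };
    unsigned reuseDelay_;
    std::vector<Slot> slots_;
};
```

For a chunk that changes LOD, the intended flow would be: acquire a slot, upload the new mesh into it, start drawing from it, then release the old slot with the current frame number. The pool grows until it has enough slots to cover the driver's latency and stays at that size afterwards.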

To really see how your GL command stream is being executed on the hardware, you want to use the GPU vendor’s profiling tool. That will tell you a lot, and clearly point out the places where you are doing something inefficient like blocking in the driver or not keeping the GPU units busy. It’ll show you when you successfully clear a bottleneck, which you can’t always tell from profiling the draw thread.

If I understand correctly this is how I implemented :

glBindBuffer(…);
glBufferData(…, NULL, …);
…

I’ve definitely seen GLES drivers that don’t support orphaning, choosing to block (aka “implicitly synchronize”) in this case instead of orphan. It could be your driver does this. Check your GPU vendor’s OpenGL ES Programming Guide for details.

twippe · November 3, 2017, 7:17am

The GPU I am testing on is an Adreno 320 (it supports GLES 3, but I am targeting lower-end devices so I use GLES 2). I’ve just looked into the related resources (Adreno GPU SDK - Tools - Qualcomm Developer Network) and did not find anything relevant for the moment.
I think I am just going to avoid transferring buffers that are too big: I’ll preload all meshes for the different LODs and it should be fine.

Dark_Photon · November 3, 2017, 7:06pm

Check out:

Qualcomm Adreno OpenGL ES Developer Guide (May 1, 2015)

and search for references to “vertex buffer object” and “buffer object” in general.

Here’s an interesting snippet:

So in this case (BufferSubData) it sounds like they’re saying that instead of implicit synchronization (draw thread blocks), they’re going to ghost the buffer object (aka resource renaming) if the buffer object is referenced by a draw call still in-flight.