GL_MAP_UNSYNCHRONIZED_BIT and glFenceSync

glFenceSync() adds a fence sync into the OpenGL command stream, and it will be signalled once all previous GL calls have exited the pipeline on the GPU (correct?).

I am not really sure what ‘exited the pipeline’ means in combination with GL calls that initiate a DMA transfer to the GPU.

If I insert a glFenceSync right after glFlushMappedBufferRange, on a buffer that was mapped with

GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_FLUSH_EXPLICIT_BIT
when exactly will the returned sync be signalled? When the GPU starts the DMA transfer of the flushed data (presumably the moment glFlushMappedBufferRange exits the GPU command queue), or only once the data has actually finished transferring to the GPU?

If I structure my commands like so:


glFlushMappedBufferRange(GL_ARRAY_BUFFER, vertex_interval_start, vertex_interval_length);
glFlushMappedBufferRange(GL_ELEMENT_ARRAY_BUFFER, index_interval_start, index_interval_length);
glUnmapBuffer(GL_ARRAY_BUFFER);
glUnmapBuffer(GL_ELEMENT_ARRAY_BUFFER);
GLsync gpusync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glFlush();
glWaitSync(gpusync, 0, GL_TIMEOUT_IGNORED);

glDrawElements(…);

Should this ensure that the GPU won’t start rendering the buffer before the data has been made fully visible to the device? I am asking because I am doing this at the moment but still see some artifacts when GL_MAP_UNSYNCHRONIZED_BIT is set.

[QUOTE=genesys;1280280]glFenceSync() adds a fence sync into the opengl command stream and it will be signalled once all previous GL calls have exited the pipeline on the GPU (correct?).

I am not really sure what ‘exited the pipeline’ means…[/QUOTE]

You’re mixing several concepts together inadvertently.

Read this:

GL_SYNC_GPU_COMMANDS_COMPLETE means pretty much what it says. This isn’t (directly) about when you submitted the work to the GPU. It has to do with when all previous commands have been completed on the GPU.

Think of it as inserting a bookmark in the GPU pipeline stream that floats along between the work items you’re submitting. When that bookmark gets to the complete end of the GPU pipeline, your fence is signaled.
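For illustration, the basic fence pattern looks something like this (a sketch only; the draw call, index_count, and the 1-second timeout are placeholders for whatever work and wait policy you actually use):

glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_INT, 0);   // some GPU work

GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);    // insert the "bookmark"

// Later: block the CPU until everything submitted before the fence has completed.
// GL_SYNC_FLUSH_COMMANDS_BIT makes sure the fence itself actually gets sent to the GPU.
GLenum status = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT,
                                 1000000000);                    // timeout in nanoseconds
if (status == GL_TIMEOUT_EXPIRED || status == GL_WAIT_FAILED) {
    // the GPU hasn't finished yet (or something went wrong)
}
glDeleteSync(fence);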

The rest of your question seems to pertain to puzzlement as to when the data you copy into your buffer gets pushed to the GPU at the front of the GPU pipeline. It will be pushed when the buffer flush event occurs. You’ve demanded control of that, so it’ll be when you explicitly request it.

I am asking because I am doing this at the moment but still see some artifacts when GL_MAP_UNSYNCHRONIZED_BIT is set.

Most likely what you’re doing is either:

  1. Submitting batches for regions of the buffer you haven’t modified and flushed to the GPU yet, or
  2. Changing regions of the buffer after submitting batches for those regions but before the GPU has finished rendering those batches.

Orphaning (buffer respecification) avoids #2 without requiring you to do your own manual synchronization (with sync objects).
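For reference, orphaning typically looks something like this (a sketch; buf and buf_size are placeholders for your own buffer object and its size):

glBindBuffer(GL_ARRAY_BUFFER, buf);
// Re-specify the data store with the same size/usage and a NULL pointer.
// The driver can hand you fresh storage while in-flight draws keep reading the old one.
glBufferData(GL_ARRAY_BUFFER, buf_size, NULL, GL_DYNAMIC_DRAW);
// ...now refill the (new) storage, e.g. via glBufferSubData or a map...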

I highly recommend you read this:

Thank you, Dark Photon, for your help; it is highly appreciated!

[QUOTE=Dark Photon;1280281]You’re mixing several concepts together inadvertantly.

Read this:

I’ve read this before, and I’m not sure exactly which concepts I am mixing up.

[QUOTE=Dark Photon;1280281]
GL_SYNC_GPU_COMMANDS_COMPLETE means pretty much what it says. This isn’t (directly) about when you submitted the work to the GPU. It has to do when all previous commands have been completed on the GPU.[/QUOTE]

Exactly, this is clear. What is unclear to me is when exactly a command has been completed on the GPU if this command involves a DMA data transfer from CPU-side memory to GPU memory. Does it mean the command "has been completed on the GPU" exactly once the data has become visible to the GPU?

[QUOTE=Dark Photon;1280281]
The rest of your question seems to pertain to puzzlement as to when the data you copy into your buffer gets pushed to the GPU at the front of the GPU pipeline. It will be pushed when the buffer flush event occurs. You’ve demanded control of that, so it’ll be when you explicitly request it.[/QUOTE]

That’s exactly my question, yes. What does “It will be pushed when the buffer flush event occurs” mean in particular? Is the DMA transfer initiated once the command enters the GPU pipeline, and does the command exit the pipeline only after the DMA transfer has completed?

[QUOTE=Dark Photon;1280281]
Most likely what you’re doing is either:

  1. Submitting batches for regions of the buffer you haven’t modified and flushed to the GPU yet, or
  2. Changing regions of the buffer after submitting batches for those regions but before the GPU has finished rendering those batches.

Orphaning (buffer respecification) avoids #2 without requiring you to do your own manual synchronization (with sync objects).

I highly recommend you read this:

Yes, I know this wiki page and am trying to implement a flavour of it. I do not want to orphan the buffer, though, since it is very large and in each frame I change only a handful of small subregions of it. Preferably I want to keep the full buffer allocated in video memory, map a small subrange of it into client memory, fill and flush it, and once the subregion has been transferred (preferably without stalling the CPU so I can do other work, which is why I use GL_MAP_UNSYNCHRONIZED_BIT) render the buffer.

It currently works properly; I got rid of the artifacts through syncing. My current problem, though, is that the driver is moving the buffer from video memory to the pinned system heap and renders directly from there. This seems to be the bottleneck of my rendering. Since I make only small changes to subregions every frame, I want the driver to keep the buffer in video memory and push only the changed ranges. (I am making sure to map only the minimal necessary subrange that is going to receive changes, and to explicitly flush every subregion of this range. Still, as soon as I map the buffer for the first time, the driver frees the video RAM allocation and reallocates the buffer on the system heap.)

It seemed to me you might be confusing when the work is submitted and when the work is completed. I apologize if I misunderstood.

Exactly, this is clear. What is unclear to me is when exactly a command has been completed on the GPU if this command involves a DMA data transfer from CPU-side memory to GPU memory. Does it mean the command "has been completed on the GPU" exactly once the data has become visible to the GPU?

No, not just after it has become visible to the GPU. After all work associated with the command has been performed by the GPU. If a component of that work involved a memory transfer, that has to be complete. Here’s the language from the GL spec:

That’s exactly my question, yes. What does “It will be pushed when the buffer flush event occurs” mean in particular?

Here we’re switching to talking about the front end of the GPU command queue, not the back end (as with SYNC_GPU_COMMANDS_COMPLETE).

When you’re filling a buffer object on the CPU (e.g. with Map), you’re providing data to the GL driver. After you submit your changes “and they are flushed”, then the GL driver is free to start moving them toward the GPU. For Map, flush happens by default on Unmap. However, you can request that not happen on Unmap (which you are, with FLUSH_EXPLICIT), and then it happens when you say it happens (with FlushMappedBufferRange).
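In code, that explicit-flush path looks roughly like this (a sketch; region_offset, region_length and new_data are placeholders):

void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, region_offset, region_length,
                             GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT |
                             GL_MAP_FLUSH_EXPLICIT_BIT);
memcpy(ptr, new_data, region_length);                  // fill the mapped range
// Nothing counts as submitted until you flush it yourself; note that the offset
// passed here is relative to the start of the mapped range, not the buffer.
glFlushMappedBufferRange(GL_ARRAY_BUFFER, 0, region_length);
glUnmapBuffer(GL_ARRAY_BUFFER);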

It currently works properly; I got rid of the artifacts through syncing. My current problem, though, is that the driver is moving the buffer from video memory to the pinned system heap and renders directly from there. This seems to be the bottleneck of my rendering.

Ah yes. I remember hitting this. Are you working on an NVidia GPU? If so, there’s a work-around for this. Just let me know.

I am curious how you’re detecting that the driver did this migration behind-the-scenes.

Thank you a lot for your answers; it all makes sense to me now!

[QUOTE=Dark Photon;1280308]
Ah yes. I remember hitting this. Are you working on an NVidia GPU? If so, there’s a work-around for this. Just let me know.

I am curious how you’re detecting that the driver did this migration behind-the-scenes.[/QUOTE]

Exactly, I am on NVIDIA. I am currently changing the app to use glBufferStorage instead of glBufferData for the initial data transfer, in the hope that this will keep it in video memory. If there is another workaround, I’m very eager to hear it. I was seeing worse than expected performance on the GPU, and profiling with Nsight does not indicate any kind of bottleneck, so I created a debug context and added a debug callback with glDebugMessageCallback (a minimal sketch of the callback setup follows the log), which gives me the following output:

----------opengl-callback-message----------
message: Buffer detailed info: Trying to allocate VBO (1687) with size:, 50.00 Mb to location: VID
------------------------------ //looking good, this is what I want

----------opengl-callback-message----------
message: Buffer detailed info: Buffer object 1687 (bound to GL_ARRAY_BUFFER_ARB, usage hint is GL_STATIC_DRAW) will use VIDEO memory as the source for buffer object operations.
------------------------------ //why is the driver stating that hint is static_draw? I am passing GL_DYNAMIC_DRAW with glBufferData!

----------opengl-callback-message----------
message: Buffer detailed info: Buffer object 1687 (bound to GL_ARRAY_BUFFER_ARB, usage hint is GL_DYNAMIC_DRAW) has been mapped in HOST memory.
------------------------------ //this happens when mapping the buffer. Now it recognizes it is dynamic_draw and maps it into HOST memory

----------opengl-callback-message----------
message: Buffer detailed info: Freeing VBO (1687) with size:, 50.00 Mb from location: VID
------------------------------ //now it’s freeing the buffer from video memory, this is not what I want!

----------opengl-callback-message----------
message: Buffer detailed info: Trying to allocate VBO (1687) with size:, 50.00 Mb to location: SYSHEAP
------------------------------ //now the driver is reallocating the VBO on the System Heap, this is not what I want!

----------opengl-callback-message----------
message: Buffer detailed info: Buffer object 1687 (bound to GL_VERTEX_ATTRIB_ARRAY_BUFFER_BINDING_ARB (0), GL_VERTEX_ATTRIB_ARRAY_BUFFER_BINDING_ARB (1), and GL_ARRAY_BUFFER_ARB, usage hint is GL_DYNAMIC_DRAW) stored in SYSTEM HEAP memory has been updated.

----------opengl-callback-message----------
message: Buffer detailed info: Buffer object 1687 (bound to GL_VERTEX_ATTRIB_ARRAY_BUFFER_BINDING_ARB (0), GL_VERTEX_ATTRIB_ARRAY_BUFFER_BINDING_ARB (1), and GL_ARRAY_BUFFER_ARB, usage hint is GL_DYNAMIC_DRAW) will use SYSTEM HEAP memory as the source for buffer object operations.
------------------------------ //not good
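For reference, a minimal setup for getting this kind of debug output looks roughly like the following (a sketch; it assumes a context created with the debug flag, and GLAPIENTRY may be spelled APIENTRY depending on your headers):

#include <stdio.h>

static void GLAPIENTRY DebugCallback(GLenum source, GLenum type, GLuint id,
                                     GLenum severity, GLsizei length,
                                     const GLchar *message, const void *userParam)
{
    fprintf(stderr, "----------opengl-callback-message----------\nmessage: %s\n", message);
}

// after context creation:
glEnable(GL_DEBUG_OUTPUT);
glEnable(GL_DEBUG_OUTPUT_SYNCHRONOUS);   // deliver messages on the calling thread
glDebugMessageCallback(DebugCallback, NULL);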

why is the driver stating that hint is static_draw? I am passing GL_DYNAMIC_DRAW with glBufferData!

You could use a std::random_device to pick which GL usage hint you pass and it probably wouldn’t matter. Because people have used hints incorrectly so often in the past, and because the API doesn’t actually stop you from doing the wrong thing with the wrong hint, drivers have no choice but to basically ignore whatever you tell them and allocate the memory based on how you actually use it.

There’s a reason why ARB_buffer_storage uses a completely different “hint” system.

I am currently changing the app to use glBufferStorage instead of glBufferData for the initial data transfer, in the hope that this will keep it in video memory.

If you’ve got access to glBufferStorage, I’d suggest ditching the flushing altogether and relying on a persistently mapped buffer. Some people have had good success with that.
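A rough sketch of that approach (buf, buf_size and the flags are illustrative; with GL_MAP_COHERENT_BIT you avoid explicit flushes, but you still need your own fencing or multi-buffering so you don’t overwrite ranges the GPU is still reading):

GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
glBindBuffer(GL_ARRAY_BUFFER, buf);
glBufferStorage(GL_ARRAY_BUFFER, buf_size, NULL, flags);            // immutable storage
void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, buf_size, flags);  // stays mapped for the buffer's lifetime
// Each frame: write the changed subranges through ptr, then draw.
// Writes become visible to the GPU without FlushMappedBufferRange or Unmap.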

Check this out:

[QUOTE=Dark Photon;1280315]Check this out:

Unfortunately, it seems this didn’t work. While with this method I no longer see the driver freeing the VBO from video memory (and I also no longer see the message that it reallocates it on the system heap), the driver still says that it will use the buffer from the system heap, and I see the corresponding performance impact.

Any other ideas on how to force the driver to actually render the buffer from video memory? I am updating less than 5% of the entire buffer each frame, so performance would be much better if the driver only uploaded the few percent of changed data and rendered the whole thing from VRAM.

[QUOTE=Alfonse Reinheart;1280311]
There’s a reason why ARB_buffer_storage uses a completely different “hint” system.
If you’ve got access to glBufferStorage, I’d suggest ditching the flushing altogether and relying on a persistently mapped buffer. Some people have had good success with that.[/QUOTE]

Using glBufferStorage unfortunately didn’t change the situation. The driver is still rendering from System Heap memory. Persistent mapping won’t give me an advantage if I can’t get the driver to let the GPU render the buffer from VRAM first, as I am not driver overhead limited.

If the driver decides that the usage pattern of your application would better be served by moving the buffer to system memory, then it’s going to move the buffer to system memory. And there’s nothing you can do to stop it.

Would performance be “much better” in this case? DMAs aren’t exactly free, no matter their size. The fact that you’re not changing much of the buffer is less important than how many DMAs you’re issuing per frame. So the driver has to weigh the cost of your DMA operations against the cost of rendering across the PCIe bus.

I’m not saying that you aren’t right. I’m saying that you don’t really have proof that you’re right. Since you can’t force the driver to do things the way you want, you can’t prove that your way will be faster.

Persistent mapping won’t give me an advantage if I can’t get the driver to let the GPU render the buffer from VRAM first, as I am not driver overhead limited.

You assume that the only purpose of persistent mapping is to avoid the driver overhead of map/unmap calls.

Once you persistently map a buffer, the driver isn’t going to be able to easily shuffle it around (which is why persistent mapping is tied to immutable buffer storage). So it could (no guarantees) lock the buffer into video memory if you map it right after it gets allocated there, but before you do the things that cause it to drop out to CPU memory.

It’s certainly worth a shot.

I am very positive it will! Rendering directly from the system heap is also a DMA process (of the whole buffer instead of only the few percent I am actually changing per frame) and is the current bottleneck. You are right that the number of actual changes also has an impact, but we’re talking about fewer than 10 discrete subranges per frame, which I am explicitly flushing. Rendering the whole thing statically from video RAM takes about 1 ms compared to 9 ms from the SYSTEM HEAP.

Also, the driver doesn’t even bother to observe my usage patterns: it shifts the buffer to the system heap on the very first glMapBufferRange() call.

I tried it: if I create the buffer with glBufferStorage and GL_MAP_PERSISTENT_BIT, the driver allocates the buffer on the SYSTEM HEAP to begin with. This seems to me like a really lazy implementation on the driver side :frowning: (updating the driver didn’t help either, by the way).