Does the driver detect that the resource is being used and schedule a DMA transfer? Would it then stall the CP until that transfer has completed?
No. By using a persistently mapped buffer, you surrender any and all automatic stalling of the GPU or CPU for buffer read/write operations. That is, if you need to make sure the GPU doesn’t execute some command before the CPU writes the data, then you cannot send the GPU command before the CPU has finished writing the data. And if you need to make sure the GPU has written or read some data before the CPU reads or writes it, then you must use a fence (or similar synchronization) to make sure the GPU has finished that operation before the CPU starts its work.
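Here’s a minimal sketch of what that fence-based ordering looks like, assuming a GL 4.4+ context with a function loader already initialized. The names `ptr`, `frame_fence`, `write_region`, and `mark_gpu_work_submitted` are illustrative, not part of any real API:

```c
#include <stdint.h>
#include <string.h>
/* Assumes a GL loader (glad, GLEW, etc.) has been set up. */

static GLsync frame_fence = 0;

/* CPU write: wait until the GPU is done with this range, then write. */
void write_region(void *ptr, const void *data, size_t size)
{
    if (frame_fence) {
        /* Block until all commands issued before the fence complete.
         * GL_SYNC_FLUSH_COMMANDS_BIT flushes the queue so we can't
         * wait forever on a fence the GPU never saw. */
        glClientWaitSync(frame_fence, GL_SYNC_FLUSH_COMMANDS_BIT, UINT64_MAX);
        glDeleteSync(frame_fence);
        frame_fence = 0;
    }
    memcpy(ptr, data, size); /* safe: the GPU is finished with this range */
}

/* Call after submitting the GPU commands that read the mapped range. */
void mark_gpu_work_submitted(void)
{
    frame_fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}
```

In practice you would use several fences (e.g. one per in-flight frame, over separate regions of the buffer) so the CPU rarely has to actually wait.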
Will it always copy the full range? Or can it somehow detect which ranges were written to by the CPU?
Or does it work completely differently?
Nothing is being copied; that’s the whole point of persistent mapping. A persistent buffer doesn’t need to detect what you modified. All coherency means is that you don’t have to tell anyone which ranges of memory are affected by your reads and writes. It doesn’t mean you can violate the operation ordering described above.
Flushing and the client mapped buffer barrier (the things you avoid when you do coherent mapping) are all about cache coherency. See, memory operations are expensive, so it would be silly to do a whole memory operation just to read one byte. So CPUs don’t; they read many contiguous bytes at once (a cache line) and load them into local memory called the “cache”. When you access bytes on the CPU, they’re first loaded into the cache, so if you need to read 4 adjacent bytes, you only do one memory operation.
Any changes you make to those bytes are made to the cache, not to the memory behind that cache. Eventually, changes to the cached bytes are written back out to main memory. But until that happens, main memory and the CPU’s cache are not coherent.
This means that if some operation outside of the CPU’s control wants to look at those memory addresses, it will see the old data. Also, if some operation outside of the CPU’s control writes data to those memory addresses, the CPU’s cached copy won’t be updated. And since the GPU is usually “outside of the CPU’s control”, this is a problem.
The flush operation (glFlushMappedBufferRange) exists to ensure coherency by forcing all cache entries in that range to be written to main memory, thus ensuring that the GPU’s reads will see everything the CPU has written. Issuing the client mapped buffer barrier bit (glMemoryBarrier with GL_CLIENT_MAPPED_BUFFER_BARRIER_BIT) causes all cached bytes in that range to be removed from the cache, so that any new reads you do from the buffer will read what the GPU has written to those locations.
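As a rough sketch, the non-coherent path looks like this. It assumes a buffer bound to GL_ARRAY_BUFFER, mapped persistently with GL_MAP_READ_BIT | GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_FLUSH_EXPLICIT_BIT (and without GL_MAP_COHERENT_BIT); the function names and parameters are illustrative:

```c
/* Assumes <stdint.h>, <string.h>, and a GL loader, as above. */

/* CPU -> GPU: write through the persistent mapping, then flush the
 * written range so the GPU's subsequent reads see it. */
void upload(void *ptr, GLintptr offset, const void *data, GLsizeiptr size)
{
    memcpy((char *)ptr + offset, data, size);
    glFlushMappedBufferRange(GL_ARRAY_BUFFER, offset, size);
    /* ...now submit the GPU commands that read the buffer... */
}

/* GPU -> CPU: barrier after the GPU's writes, then fence and wait. */
void wait_for_gpu_writes(void)
{
    glMemoryBarrier(GL_CLIENT_MAPPED_BUFFER_BARRIER_BIT);
    GLsync s = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glClientWaitSync(s, GL_SYNC_FLUSH_COMMANDS_BIT, UINT64_MAX);
    glDeleteSync(s);
    /* Reads through the mapping now observe the GPU's writes. */
}
```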
A coherent mapped buffer doesn’t need those because it ensures coherency. How? Well, if the GPU really is “outside of the CPU’s control”, then the obvious way to ensure coherency is to not use the cache for memory accesses involving that mapped buffer range. And if the GPU can actually see the CPU’s caches (Intel’s on-die GPUs can), then there’s no problem at all with using the cache, since the GPU has just as much access as another CPU. Notably, Intel’s Vulkan implementations don’t offer mappable memory that isn’t coherent.
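For completeness, here is a sketch of how such a coherent persistent mapping is created; the function name and buffer size are illustrative:

```c
/* Allocate an immutable buffer and map it persistently and coherently. */
void *map_coherent_buffer(GLuint *out_buf)
{
    const GLsizeiptr SIZE = 1 << 20; /* 1 MiB, for illustration */
    const GLbitfield flags = GL_MAP_WRITE_BIT
                           | GL_MAP_PERSISTENT_BIT
                           | GL_MAP_COHERENT_BIT;

    glGenBuffers(1, out_buf);
    glBindBuffer(GL_ARRAY_BUFFER, *out_buf);
    /* Immutable storage: the persistent/coherent flags must be passed
     * both at allocation time and at map time. */
    glBufferStorage(GL_ARRAY_BUFFER, SIZE, NULL, flags);
    /* The returned pointer stays valid for the buffer's lifetime. No
     * flushes or barriers are needed for visibility, but ordering
     * still requires fences as shown earlier. */
    return glMapBufferRange(GL_ARRAY_BUFFER, 0, SIZE, flags);
}
```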