Does the driver detect that the resource is being used and schedule a DMA transfer? Would it then stall the CP until that transfer has completed?
No. By using a persistently mapped buffer, you surrender any and all automatic stalling of the GPU or CPU for buffer read/write operations. That is, if you need to make sure the GPU doesn’t execute some command before the CPU writes the data, then you cannot send the GPU command before the CPU has finished writing the data. And if you need to make sure the GPU has written or read some data before the CPU reads or writes it, then you must use a fence (or similar synchronization) to make sure the GPU has finished that operation before the CPU starts its work.
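Here’s a minimal sketch of what that fence-based ordering looks like, assuming a GL 4.4+ context with a function loader already initialized. The names `ptr`, `frame_fence`, `write_region`, and `mark_gpu_work_submitted` are illustrative, not part of any real API:

```c
#include <stdint.h>
#include <string.h>
/* Assumes a GL loader (glad, GLEW, etc.) has been set up. */

static GLsync frame_fence = 0;

/* CPU write: wait until the GPU is done with this range, then write. */
void write_region(void *ptr, const void *data, size_t size)
{
    if (frame_fence) {
        /* Block until all commands issued before the fence complete.
         * GL_SYNC_FLUSH_COMMANDS_BIT flushes the queue so we can't
         * wait forever on a fence the GPU never saw. */
        glClientWaitSync(frame_fence, GL_SYNC_FLUSH_COMMANDS_BIT, UINT64_MAX);
        glDeleteSync(frame_fence);
        frame_fence = 0;
    }
    memcpy(ptr, data, size); /* safe: the GPU is finished with this range */
}

/* Call after submitting the GPU commands that read the mapped range. */
void mark_gpu_work_submitted(void)
{
    frame_fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}
```

In practice you would use several fences (e.g. one per in-flight frame, over separate regions of the buffer) so the CPU rarely has to actually wait.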
Will it always copy the full range? Or can it somehow detect which ranges were written to by the CPU?
Or does it work completely differently?
Nothing is being copied; that’s the whole point of persistent mapping. A persistent buffer doesn’t need to detect what you modified. All coherency means is that you don’t have to tell anyone which ranges of memory are affected by your reads and writes. It doesn’t mean you can violate the operation ordering described above.
Flushing and the client mapped buffer barrier (the things you avoid when you do coherent mapping) are all about cache coherency. See, memory operations are expensive, so it would be silly to do a whole memory operation just to read one byte. So CPUs don’t; they read many contiguous bytes at once (a cache line) and load them into local memory called the “cache”. When you access bytes on the CPU, they’re first loaded into the cache, so if you need to read 4 adjacent bytes, you only do one memory operation.
Any changes you make to those bytes are made to the cache, not to the memory behind that cache. Eventually, changes to the cached bytes are written back out to main memory. But until that happens, main memory and the CPU’s cache are not coherent.
This means that if some operation outside of the CPU’s control wants to look at those memory addresses, it will see the old data. Also, if some operation outside of the CPU’s control writes data to those memory addresses, the CPU’s cached copy won’t be updated. And since the GPU is usually “outside of the CPU’s control”, this is a problem.
The flush operation (glFlushMappedBufferRange) exists to ensure coherency by forcing all cache entries in that range to be written to main memory, thus ensuring that the GPU’s reads will see everything the CPU has written. Issuing the client mapped buffer barrier bit (glMemoryBarrier with GL_CLIENT_MAPPED_BUFFER_BARRIER_BIT) causes all cached bytes in that range to be removed from the cache, so that any new reads you do from the buffer will read what the GPU has written to those locations.
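As a rough sketch, the non-coherent path looks like this. It assumes a buffer bound to GL_ARRAY_BUFFER, mapped persistently with GL_MAP_READ_BIT | GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_FLUSH_EXPLICIT_BIT (and without GL_MAP_COHERENT_BIT); the function names and parameters are illustrative:

```c
/* Assumes <stdint.h>, <string.h>, and a GL loader, as above. */

/* CPU -> GPU: write through the persistent mapping, then flush the
 * written range so the GPU's subsequent reads see it. */
void upload(void *ptr, GLintptr offset, const void *data, GLsizeiptr size)
{
    memcpy((char *)ptr + offset, data, size);
    glFlushMappedBufferRange(GL_ARRAY_BUFFER, offset, size);
    /* ...now submit the GPU commands that read the buffer... */
}

/* GPU -> CPU: barrier after the GPU's writes, then fence and wait. */
void wait_for_gpu_writes(void)
{
    glMemoryBarrier(GL_CLIENT_MAPPED_BUFFER_BARRIER_BIT);
    GLsync s = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glClientWaitSync(s, GL_SYNC_FLUSH_COMMANDS_BIT, UINT64_MAX);
    glDeleteSync(s);
    /* Reads through the mapping now observe the GPU's writes. */
}
```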
A coherent mapped buffer doesn’t need those because it ensures coherency. How? Well, if the GPU really is “outside of the CPU’s control”, then the obvious way to ensure coherency is to not use the cache for memory accesses involving that mapped buffer range. And if the GPU can actually see the CPU’s caches (Intel’s on-die GPUs can), then there’s no problem at all with using the cache, since the GPU has just as much access as another CPU. Notably, Intel’s Vulkan implementations don’t offer mappable memory that isn’t coherent.
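For completeness, here is a sketch of how such a coherent persistent mapping is created; the function name and buffer size are illustrative:

```c
/* Allocate an immutable buffer and map it persistently and coherently. */
void *map_coherent_buffer(GLuint *out_buf)
{
    const GLsizeiptr SIZE = 1 << 20; /* 1 MiB, for illustration */
    const GLbitfield flags = GL_MAP_WRITE_BIT
                           | GL_MAP_PERSISTENT_BIT
                           | GL_MAP_COHERENT_BIT;

    glGenBuffers(1, out_buf);
    glBindBuffer(GL_ARRAY_BUFFER, *out_buf);
    /* Immutable storage: the persistent/coherent flags must be passed
     * both at allocation time and at map time. */
    glBufferStorage(GL_ARRAY_BUFFER, SIZE, NULL, flags);
    /* The returned pointer stays valid for the buffer's lifetime. No
     * flushes or barriers are needed for visibility, but ordering
     * still requires fences as shown earlier. */
    return glMapBufferRange(GL_ARRAY_BUFFER, 0, SIZE, flags);
}
```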