Why would a driver that can map non-system memory be required to create a copy on BufferSubData? That’s contradictory. If it can map, and that turns out to perform reasonably, it can surely use that trick for its BufferSubData implementation.
I don’t think you understand what BufferSubData has to do. Here’s the situation.
I have a VBO that’s 1MB in size. Now, a VBO is, basically, just a char*, right? So, let’s pretend that the driver implements VBOs purely in client memory, not AGP or video. So, the pointer is in your client space and all is good. The thing is, the driver owns this memory, even though it allocated it in your client memory space. But, of course, drivers do this all the time.
When you call glMapBuffer, all it returns is the char* that it allocated. You can freely treat this as an array of 1024*1024 bytes and do with it what you will. If I’m generating data, clearly the best thing to do is to generate it directly into the mapped buffer.
However, glBufferSubData is different. It takes a pointer to an already existing array of bytes. If I’m generating data, the best performance I can get is by generating my data into a 1MB block that I allocated myself, then calling glBufferSubData, which copies that data into the VBO memory.
Clearly the mapped case is faster, as the BufferSubData case has to copy data out of my array and into the VBO.
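To make the copy count concrete, here’s a toy model in C. This is not real driver or OpenGL code: ToyBuffer, toy_map, toy_sub_data and generate are made-up names that only stand in for a driver-owned VBO in client memory, glMapBuffer, glBufferSubData and whatever generates your data. It’s a sketch of the argument, nothing more.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a VBO the driver keeps in client memory. */
typedef struct {
    unsigned char *storage;  /* memory the "driver" owns */
    size_t size;
    size_t bytes_copied;     /* bytes the "driver" had to memcpy */
} ToyBuffer;

static int toy_buffer_init(ToyBuffer *b, size_t size) {
    b->storage = malloc(size);
    b->size = size;
    b->bytes_copied = 0;
    return b->storage != NULL;
}

static void toy_buffer_free(ToyBuffer *b) {
    free(b->storage);
    b->storage = NULL;
}

/* Models glMapBuffer: hand back the driver's own pointer, no copy. */
static void *toy_map(ToyBuffer *b) {
    return b->storage;
}

/* Models glBufferSubData: copy the caller's array into driver storage. */
static void toy_sub_data(ToyBuffer *b, size_t offset, size_t n, const void *src) {
    assert(offset <= b->size && n <= b->size - offset);
    memcpy(b->storage + offset, src, n);
    b->bytes_copied += n;
}

/* Hypothetical generator: writes sequentially, never reads the destination. */
static void generate(unsigned char *dst, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = (unsigned char)(i & 0xFF);
}

/* Path A: generate straight into the mapped storage.
   Returns how many bytes the "driver" copied. */
static size_t upload_via_map(ToyBuffer *b) {
    generate(toy_map(b), b->size);
    return b->bytes_copied;
}

/* Path B: generate into my own block, then let the "driver" copy it. */
static size_t upload_via_sub_data(ToyBuffer *b) {
    unsigned char *staging = malloc(b->size);
    assert(staging != NULL);
    generate(staging, b->size);
    toy_sub_data(b, 0, b->size, staging);
    free(staging);
    return b->bytes_copied;
}
```

Both paths leave the same bytes in the buffer. In this model the map path copies nothing, while the sub-data path copies the whole buffer once on top of generating it.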
Now, if the VBO were in AGP memory, nothing changes. Assuming that glMapBuffer works as one would expect (i.e., returning a pointer to the VBO data in AGP memory), I can use it like before. Now, I have to be careful to generate my data sequentially and to never read from this pointer. But that’s all.
Nothing changes for the BufferSubData case either. It still requires an extra copy.
If the VBO were in video memory, and the driver can map video memory directly, then nothing changes. In the map case, I’m still generating data directly into the destination. The BufferSubData case still needs an extra copy that the map case doesn’t.
Now, here’s the thing. Let’s say that the driver can’t map video memory directly. This is the only bad case, as the driver now must allocate a 1MB block of memory, download the VBO data from the card, and give it to you. However, many drivers (with the proper VBO hints) cache such data in main memory, to eliminate the allocation and download steps. At which point, mapping is no slower than BufferSubData, as both require copying.
Indeed, mapping is probably faster, since the driver memory is probably uncached and properly aligned for DMA purposes, whereas client-allocated memory is not. That means the driver may need to make a second copy of the buffer when you call BufferSubData.
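Here’s the same kind of toy model for the case where video memory can’t be mapped. Again, everything here is hypothetical: ShadowedBuffer and the shadowed_* functions are invented for illustration, "vram" is just another malloc’d block, and the assumption is that the driver keeps a DMA-suitable shadow of the buffer in main memory. Real drivers may do something quite different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model: "vram" is not CPU-mappable; "shadow" is the driver's
   main-memory cache of the buffer, assumed to be suitable for DMA. */
typedef struct {
    unsigned char *vram;
    unsigned char *shadow;
    size_t size;
    size_t cpu_copies;  /* memcpys from client memory into the shadow */
    size_t transfers;   /* shadow -> vram uploads (DMA in a real driver) */
} ShadowedBuffer;

static int shadowed_init(ShadowedBuffer *b, size_t size) {
    b->vram = calloc(1, size);
    b->shadow = calloc(1, size);
    b->size = size;
    b->cpu_copies = 0;
    b->transfers = 0;
    return b->vram != NULL && b->shadow != NULL;
}

static void shadowed_free(ShadowedBuffer *b) {
    free(b->vram);
    free(b->shadow);
}

/* Stand-in for the upload to the card. */
static void shadowed_upload(ShadowedBuffer *b) {
    memcpy(b->vram, b->shadow, b->size);
    b->transfers++;
}

/* Models glMapBuffer with a cached shadow: no allocation, no readback. */
static void *shadowed_map(ShadowedBuffer *b) {
    return b->shadow;
}

/* Models glUnmapBuffer: the shadow is pushed to the card. */
static void shadowed_unmap(ShadowedBuffer *b) {
    shadowed_upload(b);
}

/* Models glBufferSubData: the client array is first copied into
   DMA-suitable memory, then pushed to the card. */
static void shadowed_sub_data(ShadowedBuffer *b, size_t offset, size_t n,
                              const void *src) {
    assert(offset <= b->size && n <= b->size - offset);
    memcpy(b->shadow + offset, src, n);
    b->cpu_copies++;
    shadowed_upload(b);
}
```

If you write into the pointer from shadowed_map and then unmap, the model does one transfer and no CPU copy. Going through shadowed_sub_data does one CPU copy plus the same transfer, which is the "second copy" I mean.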
So, yes, mapping is better when you’re generating data. But if you’re not actually generating the data, if you already have it in an array (from the disk, for example), you may as well use BufferSubData and let the driver do the optimized copy.