I wanted to add this to the technical wiki too, but I didn’t find an obvious place for it. If someone with more insight sees a good spot for it, please add the info (massaged as needed).
Some time ago someone (on this board) had a problem with memcpy to a mapped buffer; the last 32 bytes didn’t “take”. I think I may have found the reason for it.
Seemingly starting with Visual Studio 8 (2005), the runtime library uses the assembler instruction “movnti” (a non-temporal store) for at least memset - and it is fast! 4-5 times faster than the old “rep stosd” loop on the box I tested with (C2D, DDR2-533 in dual-channel config). The older memset “only” reached ~970MB/s, while the movnti version reached almost 4.5GB/s, at least for “larger” amounts.
That’s a helluvalot of memory to write in a second!
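For anyone who wants to play with this outside the runtime library, here is a minimal sketch of a movnti-based fill. It is not the CRT's actual memset; `stream_fill32` is a name I made up, and it assumes an x86 compiler with SSE2, where the `_mm_stream_si32` intrinsic is the one that maps to movnti.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_stream_si32 (MOVNTI); pulls in _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT the runtime library's memset:
   fill `count` 32-bit words at `dst` with `value` using
   non-temporal stores that bypass the cache. */
void stream_fill32(int *dst, int value, size_t count)
{
    size_t i;

    /* Keep every store naturally aligned so it stays inside one line. */
    assert(((uintptr_t)dst & 3u) == 0);

    for (i = 0; i < count; ++i)
        _mm_stream_si32(dst + i, value);   /* one MOVNTI per word */

    /* Make the streamed stores globally visible before returning. */
    _mm_sfence();
}
```

Timing this against a plain loop for large buffers should show whether the speedup holds on a given box; I would not expect it to help for small, cache-resident buffers.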
While I haven’t checked memcpy, I’m willing to bet it also uses that very instruction for streamed writes. (feel free to fill in)
If that suspicion is correct, the hardware uses write-combining, and the last write does not complete a write-combine chunk (cache-line sized AFAIK, but I could be wrong), then my hypothesis is that this last, partial chunk could sit in the write-combine buffer and therefore not be committed when the user expected it to be.
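To put a number on that hypothesis, here is a small sketch of the arithmetic. The 64-byte chunk size is an assumption on my part (it is the cache-line size on the CPUs I know, but I haven't verified it is the write-combine granularity everywhere), and `partial_tail_bytes` is just an illustrative name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: write-combine chunks are 64 bytes (one cache line). */
#define ASSUMED_WC_CHUNK 64u

/* Hypothetical helper: for a write of `bytes` bytes starting at a
   chunk-aligned address, return how many trailing bytes land in a
   chunk that is only partially filled (0 if the write ends exactly
   on a chunk boundary). */
size_t partial_tail_bytes(uintptr_t start, size_t bytes)
{
    /* The simple modulo below only holds for an aligned start. */
    assert(start % ASSUMED_WC_CHUNK == 0);
    return bytes % ASSUMED_WC_CHUNK;
}
```

Under that assumption a buffer of 4096 + 32 bytes leaves exactly 32 bytes in a half-filled chunk, which would match the “last 32 bytes” symptom, though that could of course be a coincidence.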
For CPU-only work this likely wouldn’t matter, but for memory being pushed to a mapped region of a gfx card (over a bus!) it could wreak havoc unless the copy is terminated by a flush that also sends those last writes to the “remote” memory.
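As a sketch of what “terminated by a flush” could look like: a streaming copy that ends with an sfence, which as far as I know is the instruction meant for draining pending write-combined stores. `stream_copy_flush` is a hypothetical name, the destination here is ordinary memory rather than a real mapped buffer, and I have not verified that this cures the original poster's problem.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_stream_si32 (MOVNTI); pulls in _mm_sfence */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy `bytes` bytes with non-temporal stores,
   then fence so no trailing write is left pending. */
void stream_copy_flush(void *dst, const void *src, size_t bytes)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    size_t words = bytes / sizeof(int);
    size_t i;

    /* Streamed stores are kept naturally aligned. */
    assert(((uintptr_t)dst & (sizeof(int) - 1)) == 0);

    for (i = 0; i < words; ++i) {
        int w;
        memcpy(&w, s + i * sizeof w, sizeof w);       /* alignment-safe load */
        _mm_stream_si32((int *)(d + i * sizeof w), w); /* MOVNTI store */
    }

    /* Leftover bytes (fewer than one word) go out as ordinary stores. */
    memcpy(d + words * sizeof(int), s + words * sizeof(int),
           bytes - words * sizeof(int));

    /* The "flush": make all stores above globally visible. */
    _mm_sfence();
}
```

The fence goes after the tail on purpose, so that the final partial chunk is covered as well.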
I know, this isn’t strictly OpenGL as such, and it’s restricted to x86-ish CPUs, but it is advanced and it could help both users and IHVs find potential problems at the end of buffers, not to mention offer a possibility to speed up local memory writes (where movnti is available).