Compiler-specific memcpy issue to watch out for

tamlin · January 6, 2008, 4:13am

I wanted to add this to the technical wiki too, but I couldn’t find an obvious place for it. If someone with more insight sees this, please add the info there (massaged as needed).

Some time ago someone (on this board) had a problem with memcpy to a mapped buffer; the last 32 bytes didn’t “take”. I think I may have found the reason for it.

Seemingly starting with Visual Studio 8 (2005), the runtime library uses the assembler instruction “movnti” (a non-temporal store that bypasses the cache) for at least memset, and it is fast! It is 4-5 times faster than the old “rep stosd” on the box I tested with (C2D, DDR2-533 in dual-channel config). The older memset “only” reached ~970MB/s, while the movnti version reached almost 4.5GB/s, at least for “larger” amounts.
That’s a helluvalot of memory to write in a second!
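
For anyone who wants to experiment, here is a minimal sketch of what a movnti-based fill could look like. It is not the runtime library's actual code, just an illustration under assumptions: the helper name stream_fill32 and the HAVE_SSE2_INTRINSICS macro are made up for this example, and the code falls back to plain stores when SSE2 intrinsics are not available. It deliberately issues no fence after the stores, which is the situation the hypothesis in this post is about.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>          /* _mm_stream_si32 compiles to MOVNTI */
#define HAVE_SSE2_INTRINSICS 1
#else
#define HAVE_SSE2_INTRINSICS 0  /* non-x86 or no SSE2: use ordinary stores */
#endif

/* Hypothetical helper (not a library function): fill `count` 32-bit words
   at `dst` with `value`, using non-temporal stores where available.
   No fence is issued here on purpose. */
void stream_fill32(uint32_t *dst, uint32_t value, size_t count)
{
    /* A uint32_t pointer should be naturally aligned anyway. */
    assert(((uintptr_t)dst % sizeof(uint32_t)) == 0);

    for (size_t i = 0; i < count; ++i) {
#if HAVE_SSE2_INTRINSICS
        _mm_stream_si32((int *)&dst[i], (int)value);  /* cache-bypassing store */
#else
        dst[i] = value;                               /* plain store fallback */
#endif
    }
}
```

On ordinary cached memory the same thread will read back what it wrote, so this behaves like a normal fill there. Whether it is actually faster depends on the CPU and the buffer size; the numbers above are from one machine only.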

While I haven’t checked memcpy, I’m willing to bet it also uses that very instruction for streamed writes. (feel free to fill in)

If that suspicion is correct, and the hardware uses write-combining, and the last write does not fill the last “word” of the write-combine chunk (cache-line sized AFAIK, but I could be wrong), then my hypothesis is that this last write could sit in the write-combine buffer and therefore not be committed by the time the user expects it to be.

For CPU-only work this likely wouldn’t matter, but for memory being pushed to a mapped region of a gfx card (over a bus!) it could wreak havoc unless the copy is terminated by a flush that also sends the last writes to the “remote” memory.

I know, this isn’t strictly OpenGL as such, and it’s restricted to x86-ish CPUs, but it is advanced and it could help both users and IHVs find potential problems at the end of buffers, not to mention offering a possibility to speed up local memory writes (where movnti is available).

Jan · January 6, 2008, 5:08am

So, you mean the driver should do an implicit flush when unmapping a buffer? Or how could one solve this problem?

Jan.

Zengar · January 6, 2008, 5:28am

Always write 32 bytes more

Nicolas_Lelong · January 6, 2008, 7:49am

I’m far from being a cache-guru, but, along with ‘movnti’, Intel provides instructions like ‘sfence’ that may help in this case. It looks like the ‘_WriteBarrier’ intrinsic may do the trick.

This is vastly untested speculation though… :}

imported_jwatte · January 6, 2008, 12:15pm

Yes, if that was the problem, then adding an inline call to SFENCE would fix it. You can try it yourself. However, it would be prudent for the drivers to insert that instruction in the unmap call, to make sure they are correct.
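
A minimal sketch of the user-side version of that fix, for anyone who wants to try it. The helper name copy_then_fence and the HAVE_SSE2_INTRINSICS macro are made up for this example. One caveat on the intrinsic mentioned earlier: as far as I know, MSVC documents _WriteBarrier as a compiler-only barrier that does not emit an instruction, whereas _mm_sfence is the intrinsic that actually emits SFENCE, so that is what is used here. Whether this cures the original problem is untested.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>          /* pulls in _mm_sfence (SFENCE) */
#define HAVE_SSE2_INTRINSICS 1
#else
#define HAVE_SSE2_INTRINSICS 0  /* non-x86 or no SSE2: nothing to fence */
#endif

/* Hypothetical helper (not part of OpenGL or any driver): copy `bytes`
   bytes into a destination that may be write-combined memory, then make
   sure pending streaming stores are pushed out before returning. */
void copy_then_fence(void *dst, const void *src, size_t bytes)
{
    assert(dst != NULL && src != NULL);

    /* The runtime's memcpy may or may not use streaming stores internally. */
    memcpy(dst, src, bytes);

#if HAVE_SSE2_INTRINSICS
    /* SFENCE orders all earlier stores, including non-temporal ones,
       before any later store and drains write-combining buffers. */
    _mm_sfence();
#endif
}
```

The intended use would be to call this with the pointer returned by glMapBuffer and then call glUnmapBuffer right afterwards. If the driver already fences in unmap, the extra SFENCE is redundant but harmless.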

tamlin · January 6, 2008, 1:08pm

Jon is right. I suspect the driver vendor hadn’t anticipated this (or simply ignored it): someone effectively used a fast-path memcpy (or streaming write), and unmap didn’t safeguard against it.

I’d expect drivers produced today to do the right thing (/me expects to see new drivers released shortly).