Hey, if I had to copy from non-write-combined memory to write-combined memory, what is the fastest way? Is memcpy any good?
It's not documented - I had to find out the hard way... (and I would really appreciate a good explanation for this)
memcpy actually performs quite well for copies to AGP or video memory (if fast writes are enabled). On some machines you might gain a few percent using SSE/MMX, but I doubt it's worth it (your bottleneck is likely to be elsewhere)...
Michael
Copying to AGP is one of the few cases where memcpy() is good. For copying regular memory to regular memory, memcpy() is pretty poor (as implemented in the MSVC 6.0 library and GLibc, anyway).
The documentation for VertexArrayRange is that the memory must be aligned on 4 byte boundaries IIRC. If you weren't aligned on 4 byte boundaries, then aligning to 32 will certainly fix that :-) You might want to go back and align to "only" 4 to see if that, too, helps.
32 bytes is the fetch buffer/write combiner/cache line size on a Pentium III. That shouldn't have much to do with your AGP memory, except if you (or the driver) forget to use SFENCE properly and you don't do complete-write-combiner overwrites.
"If you can't afford to do something right,
you'd better make sure you can afford to do it wrong!"
If I read the spec correctly, 4-byte alignment is only necessary for NV10 (GeForce2)... For NV20, there aren't any pointer alignment restrictions (except that <pointer> must be 32-byte aligned, which I take to be the pointer to the beginning of the VAR memory range).
Michael
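For anyone who wants to try the 4-byte versus 32-byte alignment experiment discussed above, here is a minimal sketch of rounding a pointer up to an alignment boundary. The helper name align_up is made up for illustration and is not part of any driver or extension API; it only assumes the alignment is a power of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: round p up to the next multiple of `alignment`.
 * `alignment` must be a power of two (4 and 32 are the values discussed
 * in the thread). Returns p unchanged if it is already aligned. */
static void *align_up(void *p, size_t alignment)
{
    uintptr_t addr = (uintptr_t)p;
    uintptr_t mask = (uintptr_t)alignment - 1;
    return (void *)((addr + mask) & ~mask);
}
```

Switching the second argument between 4 and 32 is enough to compare the two cases; the caller has to leave up to alignment - 1 bytes of slack in the buffer.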
I'm certain, and according to SourceSafe I was already 4-byte aligning, but today when I switched back to 4-byte alignment everything seems to work fine. I think it might be a deeper bug (or something that had more to do with another piece of code than the alignment).
Hopefully the bug will reoccur and I can track it down.
so are you on nv20 (gf3) or nv10 (gf2)?
Michael
memcpy is fast but not the fastest. The main reason memcpy is slow is that it copies one byte at a time. If you are using floats for the verts (which you probably are) you need to copy 4 bytes at a time. Wait, I am on to something: 4 bytes = 32 bits, which is a float.
Learn to program assembly. It's four lines to write a memcpy function that copies 4 bytes at a time. I wish I could give you source, but to be honest it's been a while since I have done it. (Although I really should get back into the habit.) Let me go look at the Intel website and I will post the code.
Devulon
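The assembly promised above was never posted, so as a stand-in here is a rough C sketch of the same idea: move 32 bits per iteration and mop up the 0 to 3 leftover bytes at the end. The name copy_dwords is invented for this example. The fixed-size memcpy calls are just the portable way to express an unaligned 32-bit load and store; mainstream optimizing compilers are expected to turn each into a single move, but that is an assumption about the compiler, not a guarantee.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 4-bytes-at-a-time copy; buffers must not overlap. */
void copy_dwords(void *dst, const void *src, size_t bytes)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Bulk: one 32-bit load and one 32-bit store per iteration. */
    while (bytes >= 4) {
        uint32_t word;
        memcpy(&word, s, 4);   /* load 32 bits without alignment/aliasing issues */
        memcpy(d, &word, 4);   /* store 32 bits */
        d += 4;
        s += 4;
        bytes -= 4;
    }

    /* Tail: sizes that are not a multiple of 4 leave 1 to 3 bytes. */
    while (bytes > 0) {
        *d++ = *s++;
        bytes--;
    }
}
```

Whether this beats the library memcpy is a separate question; time it on the target machine before replacing anything.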
There is source for plenty of memcpy routines available on the net.
As jwatte pointed out, memcpy is actually very fast for AGP/Vidmem, and I don't think you will need anything faster for any real-world application (at least not until AGP8x or more comes out...)
Michael
I believe there are better versions of memcpy that copy 32 bits at a time and can handle array sizes that are not divisible by 4 bytes. I already have something like this.
I haven't tried it, but I heard using MMX is better for this since you can move 64 bits at a time.
Does anyone know if there are instructions for copying large chunks of data? Something that can move 1 KB with a single instruction perhaps?
V-man
------------------------------
Sig: http://glhlib.sourceforge.net
an open source GLU replacement library. Much more modern than GLU.
float matrix[16], inverse_matrix[16];
glhLoadIdentityf2(matrix);
glhTranslatef2(matrix, 0.0, 0.0, 5.0);
glhRotateAboutXf2(matrix, angleInRadians);
glhScalef2(matrix, 1.0, 1.0, -1.0);
glhQuickInvertMatrixf2(matrix, inverse_matrix);
glUniformMatrix4fv(uniformLocation1, 1, FALSE, matrix);
glUniformMatrix4fv(uniformLocation2, 1, FALSE, inverse_matrix);
Please, people, read the fine source before posting on this forum. If you don't, you'll just end up perpetuating bad myths.
The MSVC implementation of memcpy() turns into a REP MOVSD, which copies 32 bits at a time, with minimal loop overhead. Any optimized UNIX libc will do a similar thing.
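To make the REP MOVSD point concrete, here is a sketch of what such a copy boils down to, written as GCC/Clang inline assembly. It is not the MSVC library source; the function name copy_rep_movsd is made up, and on compilers or CPUs other than GCC/Clang on x86 the sketch simply falls back to memcpy.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical REP MOVSD style copy; buffers must not overlap. */
void copy_rep_movsd(void *dst, const void *src, size_t bytes)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    size_t dwords = bytes / 4;  /* number of 32-bit words to move */
    size_t tail = bytes % 4;    /* 0 to 3 leftover bytes */

    /* "rep movsl" is the AT&T spelling of REP MOVSD. It copies `dwords`
     * 32-bit words from [src] to [dst], advancing both pointers.
     * The direction flag is assumed clear, as the C ABI requires. */
    __asm__ volatile("rep movsl"
                     : "+D"(dst), "+S"(src), "+c"(dwords)
                     :
                     : "memory");

    /* Copy the remaining bytes one at a time with REP MOVSB. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(tail)
                     :
                     : "memory");
#else
    memcpy(dst, src, bytes);    /* portable fallback */
#endif
}
```

The whole bulk loop is one instruction, which is why a hand-written 4-bytes-at-a-time loop has little room to improve on it.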
The issue is more that the CPU is so much faster than the memory subsystem these days, that copying longwords is not really faster than copying bytes :-/
When copying to cached memory, memcpy() wastes a lot of time write-allocating cache lines, which leads to pretty poor performance. Any "plain" instruction copy operation will have the same problem. The way to get copy to cached memory to go fast is to bypass the cache for the output buffer, or if you're on AMD or PPC, to pre-clear the output buffer cache lines.
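A sketch of the "bypass the cache for the output buffer" approach, using non-temporal (streaming) stores followed by an SFENCE. This is an illustration under assumptions, not code from the thread: it uses SSE2 intrinsics because they are available on every x86-64 compiler, whereas a Pentium III has only SSE and would need the single-precision variant (_mm_stream_ps). The name copy_streaming is invented, and without SSE2 the function degrades to a plain memcpy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2_STREAM 1
#endif

/* Hypothetical cache-bypassing copy; buffers must not overlap. */
void copy_streaming(void *dst, const void *src, size_t bytes)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

#ifdef HAVE_SSE2_STREAM
    /* Head: byte copies until dst is 16-byte aligned, which
     * _mm_stream_si128 requires for its destination. */
    while (bytes > 0 && ((uintptr_t)d & 15) != 0) {
        *d++ = *s++;
        bytes--;
    }

    /* Bulk: unaligned load from the source, non-temporal store to the
     * destination, so no destination cache line is write-allocated. */
    while (bytes >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v);
        d += 16;
        s += 16;
        bytes -= 16;
    }

    /* Make the streamed data globally visible before anything that
     * depends on it. */
    _mm_sfence();
#endif

    /* Tail (or the whole copy when SSE2 is unavailable). */
    memcpy(d, s, bytes);
}
```

Streaming stores tend to pay off only for buffers larger than the cache; for small copies that are read back right away, the ordinary cached path is usually the better choice.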
When writing TO AGP memory, you're writing to un-cached memory, so the write allocation is not a problem. You can get some amount of speed-up by properly streaming DRAM pages and pre-warming the cache for the input buffer, but that's about it. And it's not like the ratio CPU : Memory speed will go DOWN anytime soon, so it's only bound to get mooter.
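One way to read "pre-warming the cache for the input buffer" is a block-wise copy: touch one byte per cache line of the next source chunk so the lines are pulled in as a burst, then copy that chunk. The sketch below does this in portable C. The chunk size (4096) and line size (32, the Pentium III figure mentioned earlier in the thread) are assumptions to tune, and the name copy_prewarmed is made up.

```c
#include <stddef.h>
#include <string.h>

enum {
    PREWARM_CHUNK = 4096,  /* assumed block size, tune per machine */
    PREWARM_LINE = 32      /* assumed cache line size in bytes */
};

/* Hypothetical block-wise copy that pre-reads the source; buffers must
 * not overlap. */
void copy_prewarmed(void *dst, const void *src, size_t bytes)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (bytes > 0) {
        size_t chunk = bytes < PREWARM_CHUNK ? bytes : PREWARM_CHUNK;
        size_t i;

        /* Pass 1: read one byte per cache line. The volatile access keeps
         * the compiler from discarding the otherwise unused loads. */
        for (i = 0; i < chunk; i += PREWARM_LINE)
            (void)*(const volatile unsigned char *)(s + i);

        /* Pass 2: copy the chunk, whose source lines should now be cached. */
        memcpy(d, s, chunk);

        d += chunk;
        s += chunk;
        bytes -= chunk;
    }
}
```

As the post says, the gain from this kind of tuning is limited, so it is worth measuring against a plain memcpy before keeping it.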