Fastest way to stream CPU-derived vertex data

I’m writing an app that does some complex character animation, which I’m computing on the CPU. I’ve implemented two methods of streaming the data into OpenGL:
1. Compute the values in a memory buffer, then call glBufferSubData.
2. Call glMapBuffer (with GL_WRITE_ONLY) and stream the results into it.

I would expect the latter to be faster since it is zero-copy. However, it is actually about 10% slower on my system (GF6600GT, 96.40 driver, Athlon XP 3000+). I’m guessing that there is a stall somewhere, although it makes essentially no difference if I call glBufferData with a NULL pointer just before glMapBuffer (to indicate that the current data may be discarded). Incidentally, the buffer is allocated as GL_STREAM_DRAW.
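For readers following along, the two upload paths might look roughly like the sketch below. This is not my actual code: `compute_fn`, `upload_with_subdata` and `upload_with_map` are made-up names, and it assumes a current OpenGL 1.5+ context plus headers that expose the buffer-object prototypes (here via `GL_GLEXT_PROTOTYPES`, which is a Mesa/Linux-style assumption).

```c
#define GL_GLEXT_PROTOTYPES 1
#include <GL/gl.h>
#include <GL/glext.h>
#include <stddef.h>

/* Hypothetical callback: writes `bytes` bytes of vertex data to `dst`. */
typedef void (*compute_fn)(void *dst, size_t bytes);

/* Method 1: compute into system memory, then let the driver copy it. */
void upload_with_subdata(GLuint vbo, void *scratch, size_t bytes,
                         compute_fn compute)
{
    compute(scratch, bytes);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferSubData(GL_ARRAY_BUFFER, 0, (GLsizeiptr)bytes, scratch);
}

/* Method 2: map the buffer and compute straight into the mapping.
 * Returns 1 on success, 0 if mapping failed or the data store was lost. */
int upload_with_map(GLuint vbo, size_t bytes, compute_fn compute)
{
    void *dst;

    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    /* NULL data pointer: tells the driver the old contents may be discarded. */
    glBufferData(GL_ARRAY_BUFFER, (GLsizeiptr)bytes, NULL, GL_STREAM_DRAW);

    dst = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    if (dst == NULL)
        return 0;

    compute(dst, bytes);  /* write only; never read from the mapping */

    /* glUnmapBuffer returns GL_FALSE if the buffer contents were corrupted. */
    return glUnmapBuffer(GL_ARRAY_BUFFER) == GL_TRUE;
}
```

Both functions are meant to be called once per frame before drawing from `vbo`.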

What’s the recommended way to stream vertex data from the CPU without either double-copying or stalling waiting for the GPU to finish with the previous data?

Originally posted by Bruce Merry:
I’m guessing that there is a stall somewhere, although it makes essentially no difference if I call glBufferData with a NULL pointer just before glMapBuffer (to indicate that the current data may be discarded).

It might be a stall, or the returned buffer may be allocated from memory that is sensitive to the way it is accessed (e.g. write-combined memory, which expects sequential writes and might take a performance hit when a cache miss happens).

The code that computes the vertices writes to the memory sequentially. The vertices are computed with SSE code, which uses _mm_stream_ps (the MOVNTPS instruction) to store each computed value to the mapped memory, so I don’t see why write combining should break (not that I know much about it).
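As a rough sketch of that write pattern (not the real animation code: `stream_scaled` and its trivial scaling kernel are stand-ins I made up for illustration), the inner loop has this shape. It assumes the destination is 16-byte aligned and the element count is a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Hypothetical stand-in for the animation kernel: multiplies each input
 * float by `factor` and streams the result to `dst` in ascending order.
 * `dst` must be 16-byte aligned; `count` must be a multiple of 4. */
void stream_scaled(float *dst, const float *src, size_t count, float factor)
{
    const __m128 f = _mm_set1_ps(factor);
    size_t i;

    assert(((uintptr_t)dst & 15u) == 0);
    assert(count % 4 == 0);

    for (i = 0; i < count; i += 4) {
        __m128 v = _mm_mul_ps(_mm_loadu_ps(src + i), f);
        _mm_stream_ps(dst + i, v);  /* non-temporal store (MOVNTPS) */
    }

    /* Make the streamed stores globally visible before the buffer is used. */
    _mm_sfence();
}
```

In the mapped-buffer case, `dst` would be the pointer returned by glMapBuffer, and the buffer would be unmapped after the call returns.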