Filling a PBO with multiple threads

I allocate a single PBO, map it, and want to use multiple threads to fill it faster than a single thread could.

Are there any problems or performance issues related to this?

The only thing I can think of is that NumThreads must be less than or equal to NumCPUs, since each CPU has its own write combiner cache and two threads on the same CPU would screw up the write combining. (Or are write combiner caches not per CPU, but per something else?)

Does anyone know if dual core architectures have a write combiner per core?
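For concreteness, here is a minimal sketch of the kind of partitioned fill I have in mind. It writes into a plain byte pointer standing in for the mapped PBO. The names fill_parallel and generate_byte and the 64-byte alignment are my own assumptions for illustration, not anything taken from a driver or spec.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for whatever per-byte data generation the app does.
inline std::uint8_t generate_byte(std::size_t i) {
    return static_cast<std::uint8_t>(i * 31u + 7u);
}

// Fill [dst, dst + size) using at most num_threads workers.
// Each worker gets one contiguous slice that starts at a multiple of
// kChunkAlign bytes from dst, so (assuming dst itself is 64-byte aligned)
// no two threads ever write into the same 64-byte block.
void fill_parallel(std::uint8_t* dst, std::size_t size, unsigned num_threads) {
    assert(dst != nullptr || size == 0);
    constexpr std::size_t kChunkAlign = 64;  // assumed line / WC block size
    if (num_threads == 0) num_threads = 1;

    // Slice length, rounded up to a multiple of kChunkAlign.
    std::size_t slice = (size + num_threads - 1) / num_threads;
    slice = (slice + kChunkAlign - 1) / kChunkAlign * kChunkAlign;

    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < size; begin += slice) {
        const std::size_t end = std::min(begin + slice, size);
        workers.emplace_back([dst, begin, end] {
            // Write sequentially and never read back from dst.
            for (std::size_t i = begin; i < end; ++i) {
                dst[i] = generate_byte(i);
            }
        });
    }
    // All writes must be finished before the buffer is unmapped.
    for (auto& w : workers) w.join();
}
```

In the real application dst would be the pointer returned by glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY), num_threads could come from std::thread::hardware_concurrency(), and the map and unmap calls would stay on the thread that owns the GL context. Whether this is actually faster than a single writer is exactly what I am asking.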

Just don’t.

Your CPU can write to its caches at tens of GB/s.
Data can go from the CPU to the chipset at no more than 6.4GB/s. Data can travel across AGP at no more than 2GB/s (and a little more for PCIe). So what’s the max bandwidth when moving data from the CPU to graphics memory? Surely that’s determined by the slowest link in the chain, which happens to be AGP or PCIe, unless of course you’re not actually working with a PC architecture.
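To spell out the arithmetic, here is a rough sketch: the sustained rate of a chain of links is just the minimum over the links. The function name chain_bandwidth_gbps is made up for illustration, and the numbers in the usage note are the round figures quoted above, not measurements.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sustained throughput of a chain of links is capped by its slowest link.
// Input: per-link bandwidths in GB/s (illustrative values, not measured).
double chain_bandwidth_gbps(const std::vector<double>& links_gbps) {
    assert(!links_gbps.empty());
    return *std::min_element(links_gbps.begin(), links_gbps.end());
}
```

With a placeholder of 20 GB/s for the cache writes, 6.4 GB/s to the chipset and 2 GB/s across AGP, chain_bandwidth_gbps({20.0, 6.4, 2.0}) returns 2.0, i.e. the AGP figure.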

Pardon my French, but what in the heck are you thinking? You’re trying to widen a bus bottleneck by throwing more CPU at it, all while the CPU is the least of all problems when it comes to data movement. This just can’t work.

I’m using Nvidia’s PBO upload/download test app to find the max bandwidth, and I’m trying to match that. Right now, our CPU usage is maxed out while uploading, so I know it’s a CPU limitation. Two CPUs can generate the data twice as fast.

Summary: our application is not limited by memory bandwidth or AGP bandwidth, but rather by how fast the CPU can generate the data and fill the PBO.

I’m also curious whether anyone knows if graphics hardware has dedicated hardware for float32 <-> float16 conversion, or whether the host CPU(s) are used to do it.