
Upload speed to GPU on several machines....



jimmycox
01-12-2010, 06:09 AM
Hi all,

I'm developing an application that frequently uploads large textures (~100 MB) to the graphics card...

Recently, as the upload time is becoming a bottleneck, I've started timing it...

On the two desktops where I can time it, the upload time is about the same, 300 ms (a gaming desktop with an Asus P5Q Pro motherboard and a GeForce GTX 260, and an HP Z800 workstation with a Quadro FX 3800)...

When I tried it on my laptop (Lenovo T400 with ATI Mobility HD3470), the time was divided by 3!!!

I find this quite puzzling...

Can anyone give me some pointers on how to investigate this difference?
Isn't this time related to the speed of the PCIe link?

I would have expected a laptop to be behind...
Any hints about what's going on?

thx

Pierre Boudier
01-12-2010, 09:03 AM
hi,

you are comparing different products from different vendors, so the performance difference can have many root causes.

in general, when doing data transfer, several different bottlenecks can happen:
- cpu overhead in the driver
- data conversion / reformatting
- pcie bus while transferring the data
- cpu bandwidth while copying the data from user space

depending on the size of the data and on the formats you use (input and target formats), you will see different maximum speeds.
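
To put the numbers from the first post in perspective, the measured times can be turned into an effective throughput. This is only a minimal sketch: it assumes the "~100" texture size means roughly 100 megabytes and that "divided by 3" means about 100 ms, and the function name is made up for illustration.

```c
#include <assert.h>

/* Effective upload throughput in MB/s for a transfer of size_mb megabytes
 * that took elapsed_ms milliseconds.
 * Hypothetical helper for illustration; it is not from the thread. */
double upload_throughput_mb_s(double size_mb, double elapsed_ms)
{
    assert(size_mb >= 0.0);
    assert(elapsed_ms > 0.0);

    /* Convert milliseconds to seconds, then divide size by time. */
    return size_mb / (elapsed_ms / 1000.0);
}
```

Under those assumptions, `upload_throughput_mb_s(100.0, 300.0)` comes out at roughly 333 MB/s for the desktops and `upload_throughput_mb_s(100.0, 100.0)` at roughly 1000 MB/s for the laptop.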

ZbuffeR
01-12-2010, 09:17 AM
Do you use PBOs for async uploads?

jimmycox
01-12-2010, 09:22 AM
Originally posted by Pierre Boudier:
    hi,

    in general, when doing data transfer, several different bottlenecks can happen:
    - cpu overhead in the driver
    - data conversion / reformatting
    - pcie bus while transferring the data
    - cpu bandwidth while copying the data from user space


Maybe it's not relevant, but it's actually the ATI hardware that is fast...
Is it conceivable that ATI uses some kind of optimization in its driver to speed up such transfers? Maybe compressing the data or whatever...
If so, even to that extent?

jimmycox
01-12-2010, 09:24 AM
Originally posted by ZbuffeR:
    Do you use PBO for async uploads ?

No... that is planned for later, though. As I understand it, it would be a way to avoid stalling execution for 300 ms or so...

but right now, it's just glTexSubImage2D....

Pierre Boudier
01-12-2010, 09:44 AM
on ATI hardware, a pbo does not necessarily speed up data transfer; we are able to transfer async most of the time anyway.

I noticed that ATI numbers were better but did not want to show off; I am happy that someone noticed it still.

there has been significant improvement over the last year in that area, and most data transfers are fully hardware accelerated on ATI (data conversion, data alignment, ...), which might be what happens here.

jimmycox
01-13-2010, 06:24 AM
Originally posted by Pierre Boudier:
    on ATI hardware, a pbo does not necessarily speed up data transfer; we are able to transfer async most of the time anyway.

    there has been significant improvement over the last year in that area, and most data transfers are fully hardware accelerated on ATI (data conversion, data alignment, ...), which might be what happens here.


On the hardware level, I have trouble believing that a $3000+ workstation would be slower than a 1.5-year-old laptop... so I'm tempted to believe that something happens at the driver level...

If this is correct, this is an impressive feat from AMD/ATI...

I think I can access a mobile workstation with a FireGL... might be interesting to see what kind of numbers it would give...

pjcozzi
01-13-2010, 07:10 AM
Are you timing just the data upload? Is it possible that the laptop's GPU shares main memory with the CPU? In that case, on the laptop the upload is just main memory to main memory, but on the desktop the upload is main memory to video memory.

Regards,
Patrick
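
If the laptop upload really is main memory to main memory, a plain `memcpy` timing gives a rough baseline to compare the texture upload against. The sketch below is one possible way to take that baseline, not something posted in the thread: the function name and the buffer sizes are assumptions, and it uses `clock()`, which reports CPU time and is only a reasonable stand-in for wall time in a single-threaded copy like this one.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Copies `bytes` bytes between two heap buffers `repeats` times and returns
 * the average copy rate in MB/s (1 MB = 1e6 bytes).
 * Returns -1.0 if allocation fails, if the copy cannot be verified, or if
 * the timer is too coarse to measure the run.
 * Hypothetical helper for illustration only. */
double memcpy_bandwidth_mb_s(size_t bytes, int repeats)
{
    assert(bytes > 0 && repeats > 0);

    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch every page up front so page faults are not part of the timing. */
    memset(src, 0xAB, bytes);
    memset(dst, 0, bytes);

    clock_t start = clock();
    for (int i = 0; i < repeats; ++i) {
        src[0] = (unsigned char)i;   /* vary the input between copies */
        memcpy(dst, src, bytes);
    }
    clock_t end = clock();

    /* Read the result back so the copies have an observable effect. */
    int copied_ok = dst[0] == (unsigned char)(repeats - 1)
                 && dst[bytes - 1] == src[bytes - 1];
    double seconds = (double)(end - start) / CLOCKS_PER_SEC;

    free(src);
    free(dst);

    if (!copied_ok || seconds <= 0.0) {
        return -1.0;
    }
    return ((double)bytes * repeats / 1e6) / seconds;
}
```

Calling it with a buffer close to the texture size, for example `memcpy_bandwidth_mb_s((size_t)100 * 1000 * 1000, 4)`, gives a number that can be set against the texture upload rate on the same machine. A shared-memory upload would be expected to land in the same general range, while a much lower upload rate points at one of the other bottlenecks.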

Pierre Boudier
01-13-2010, 08:43 AM
Originally posted by jimmycox:
    On the hardware level, I have trouble believing that a $3000+ workstation would be slower than a 1.5-year-old laptop... so I'm tempted to believe that something happens at the driver level...

    If this is correct, this is an impressive feat from AMD/ATI...

    I think I can access a mobile workstation with a FireGL... might be interesting to see what kind of numbers it would give...

high-end ASICs tend to have more ALUs, texture units, memory bandwidth, ROPs, ...

however, some performance aspects tend not to change at all:
- bandwidth of the pcie bus (related to the number of lanes and to pcie gen 1 or 2)
- primitive throughput (related to engine clock)

so it is possible for a low-end product to be faster than a high-end one in primitive throughput, if the low-end one has a faster engine clock.
similarly, it is possible to be faster in data throughput if the cpu and system bandwidth of your laptop are faster than those of the system used with the high-end ASIC.

data transfer performance is often limited by cpu speed and cpu memory bandwidth, rather than always by the pcie bus.
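
A quick calculation shows why the pcie bus is unlikely to be the limit for the desktop numbers in the first post. The sketch below uses the standard per-lane rates for pcie gen 1 and gen 2 (250 MB/s and 500 MB/s per lane in each direction, after 8b/10b encoding). The x16 link width used in the example is an assumption, since the thread does not say how the cards are connected, and the function names are made up for illustration.

```c
#include <assert.h>

/* Theoretical one-direction PCIe bandwidth in MB/s for a given generation
 * and lane count. Only gen 1 and gen 2 are covered, matching the thread;
 * any other generation returns -1.0.
 * Hypothetical helper for illustration only. */
double pcie_peak_mb_s(int generation, int lanes)
{
    assert(lanes > 0);

    switch (generation) {
    case 1:  return 250.0 * lanes;  /* 2.5 GT/s per lane, 8b/10b encoded */
    case 2:  return 500.0 * lanes;  /* 5.0 GT/s per lane, 8b/10b encoded */
    default: return -1.0;
    }
}

/* Fraction of the theoretical link bandwidth used by a measured rate. */
double link_utilization(double measured_mb_s, int generation, int lanes)
{
    double peak = pcie_peak_mb_s(generation, lanes);
    assert(peak > 0.0);
    return measured_mb_s / peak;
}
```

Assuming an x16 link, the peak is 4000 MB/s for gen 1 and 8000 MB/s for gen 2. If the upload is about 100 MB in 300 ms, or roughly 333 MB/s, `link_utilization(333.0, 1, 16)` is under 10% even for gen 1, which fits the point that the cpu side or the driver is more likely to be the limit than the bus.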