NV40 glReadPixels..



qzm
04-18-2004, 06:03 PM
Just to try to continue Won's thread on this area of the NV40 with a slightly better topic...

Does anyone have any info yet on the glReadPixels performance of the new NV40? Either the current AGP implementation, or even the future PCI Express ones?

(I know they are the same chip, but it is quite possible the bridge chip will make a difference to this particular feature).

There are more than a few developers out here who are waiting for this information.. I know it will increase the functionality of these cards massively for me if they can up the readout speed.

And for those people who don't care - there ARE very valid reasons to want to get images back out of cards! These units are providing a whole new level of graphics CAPABILITY to computers.. not just for display to a monitor!

Regards,
Stuart.

Adrian
04-18-2004, 08:22 PM
I found this information about the Quadro FX4000 which I think is based on the NV40.
"...accelerated pixel read-back performance that improves graphics throughput to more than 5X the performance of previous generation graphics systems"
Source: www.nvnews.net (http://www.nvnews.net)

I did also email one of the NV40 reviewers and asked them to run a readpixels benchmark but I haven't had a reply yet.

dorbie
04-18-2004, 08:37 PM
Did you send the reviewer a benchmark?

Adrian
04-18-2004, 08:52 PM
I sent them a link to download a benchmark.

Nutty
04-19-2004, 03:22 AM
I don't think readpixels is ever going to be truly fast. But I like NV's asynchronous version, which can mask the slowness if delayed data works for you.
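
For anyone who hasn't used it, the NV path looks roughly like this (a sketch against the GL_NV_pixel_data_range and GL_NV_fence specs; WIDTH/HEIGHT and the wglAllocateMemoryNV frequency/priority values here are just illustrative):

GLuint fence;
/* memory the driver can DMA into; the frequency/priority arguments are a
   guess at readback-friendly values */
void *pdr = wglAllocateMemoryNV(WIDTH * HEIGHT * 4, 1.0f, 0.0f, 1.0f);
glPixelDataRangeNV(GL_READ_PIXEL_DATA_RANGE_NV, WIDTH * HEIGHT * 4, pdr);
glEnableClientState(GL_READ_PIXEL_DATA_RANGE_NV);

glGenFencesNV(1, &fence);
glReadPixels(0, 0, WIDTH, HEIGHT, GL_BGRA_EXT, GL_UNSIGNED_BYTE, pdr); /* returns early */
glSetFenceNV(fence, GL_ALL_COMPLETED_NV);

/* ... do CPU work while the transfer is (hopefully) in flight ... */

glFinishFenceNV(fence); /* block only when the data is actually needed */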

Won
04-19-2004, 04:01 AM
Thanks qzm.

Nutty --

There is no reason why ReadPixels shouldn't be fast. NV and ATI fail to take advantage of AGP transfers in that direction, having long been optimized as texturing machines. 3Dlabs does use AGP ReadPixels, which is quite fast (fast enough that the lack of async readback might not be an issue). Having both would be fantastic, particularly if I can get my hands on hardware that can do this within a month. In any case, 3Dlabs already has it, and will probably still have it for their next part. Choice/competition is always good for developers and our customers. We'll also have to see about ATI, which certainly isn't standing still.

I expect that by next generation hardware (the stuff not yet talked about), you will be able to effectively saturate the bus in both directions, and in the case of PCI-Express, you should be able to do both directions simultaneously.

BTW, IHVs do keep tabs on these forums, so you do get a chance to indirectly influence what becomes important to them, assuming you suggest something sane (which is for the IHVs to decide, for the most part). So the "ReadPixels will never be fast" attitude is both incorrect (it already is fast on some hardware) and counterproductive.

-Won

dorbie
04-19-2004, 07:57 AM
Well, the issue is supporting DMA transfers from the card to system memory instead of falling back on PCI transfers, and that may not be trivial, especially when you consider the GART page map. The available information suggests NVIDIA is now saying they're 5 times faster than earlier generations. Good, but not exactly blowing the doors off AGP (or PCI Express), considering even the PCI reads seem relatively slow and you might be able to go 5x faster even within PCI bandwidth. I'm wondering if someone was just looking at raw bandwidth numbers and arriving at an estimate based on bus speeds.

This is all with the underlying assumption that mobo chipsets can support DMA'd AGP reads.

NVIDIA is reading this, so the silence is not encouraging. I know NVIDIA has strict communication policies but some are free(?) to post technical info. How about a simple description of the read bandwidth situation, bearing in mind that silence on the issue is not the most reassuring policy.

Tzupy
04-19-2004, 08:04 AM
Hi,
I doubt the AGP 8x version of the NV40 will have 4x readback bandwidth; the NV40 with PCI Express (and the HSI bridge) is supposed to achieve AGP 12x writes and 4x reads; only the native PCI Express NV45 will be 'the real thing' in terms of reads; and the performance of the Quadro FX 4000 may not be the same as the GeForce 6800 (since it will sell at about 5 times the price). But I would be very happy if the AGP 8x version of the GeForce 6800 had 4x read capability (and an affordable price).

Nutty
04-19-2004, 08:30 AM
Won, what is the bandwidth of AGP 4x and 8x?

zeckensack
04-19-2004, 08:43 AM
AGP 4x tops out at roughly 1 GB/s (32-bit QDR at 66 MHz); AGP 8x is twice that.
PCI transfers over the AGP port are equivalent to PCI-66, with a theoretical maximum of 266 MB/s.
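
If anyone wants the arithmetic behind those figures (66 MHz base clock and 32-bit bus width per the AGP spec; the rest is multiplication):

#include <stdio.h>

int main(void)
{
    const double agp1x = 66.67e6 * 4.0; /* 32-bit bus at 66 MHz: ~267 MB/s, same as PCI-66 */
    printf("AGP 4x: %.2f GB/s\n", agp1x * 4.0 / 1e9); /* ~1.07 GB/s */
    printf("AGP 8x: %.2f GB/s\n", agp1x * 8.0 / 1e9); /* ~2.13 GB/s */
    return 0;
}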

Nutty
04-19-2004, 12:21 PM
Still nowhere near good system RAM performance, never mind VRAM performance.

Like I said, it's never going to be truly fast. Never will it even compare to the internal bandwidth of a mid-range graphics card, and the divide is growing ever larger as GPUs leap ahead in performance while CPUs shuffle along.

Even if it were fast, a blocking approach is going to be bad for parallelism between CPU and GPU, so the only real solution is an asynchronous method like NV's pixel data range. Or do everything internal to the graphics card.

Adrian
04-19-2004, 01:32 PM
Originally posted by Nutty:
Still nowhere near good system RAM performance, never mind VRAM performance.

No one expects to be able to read video memory on the CPU as if it were system memory. The problem is that reading at 100 MB/s synchronous is almost useless. >2 GB/s asynchronous *is* truly fast imo.



Even if it were fast, a blocking approach is going to be bad for parallelism between CPU and GPU, so the only real solution is an asynchronous method like NV's pixel data range.
...we will have an ARB version of PDR soon.



Or do everything internal to the graphics card.
It's not always possible or practical to do everything on the gpu.
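
When that ARB version arrives, usage should look roughly like this (a sketch using buffer-object entry points that exist today; GL_PIXEL_PACK_BUFFER_EXT is from the EXT_pixel_buffer_object spec NVIDIA already ships, the final ARB names may differ, and w/h are placeholders):

GLuint pbo;
glGenBuffersARB(1, &pbo);
glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, pbo);
glBufferDataARB(GL_PIXEL_PACK_BUFFER_EXT, w * h * 4, NULL, GL_STREAM_READ_ARB);

/* with a pack buffer bound, the pointer argument is an offset into the
   buffer, and the call can return before the copy completes */
glReadPixels(0, 0, w, h, GL_BGRA_EXT, GL_UNSIGNED_BYTE, (void *)0);

/* ... other CPU work ... */

/* mapping waits for the transfer, then yields a CPU pointer to the pixels */
void *pixels = glMapBufferARB(GL_PIXEL_PACK_BUFFER_EXT, GL_READ_ONLY_ARB);
/* ... use pixels ... */
glUnmapBufferARB(GL_PIXEL_PACK_BUFFER_EXT);
glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, 0);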

Nutty
04-19-2004, 02:01 PM
No one expects to be able to read video memory on the CPU as if it were system memory. The problem is that reading at 100 MB/s synchronous is almost useless. >2 GB/s asynchronous *is* truly fast imo.

Why not? I thought that's what people in this thread were wanting to see: readback at full AGP 4x/8x speed.

But even that is slow in the world of graphics if it's synchronous. Async is a lot more useful.

Won
04-19-2004, 03:47 PM
Nutty --

I see your point. NVIDIA already has PDR/PBO, which is half the equation, and I was asking/hoping for efficient (you're right; "fast" is a vague term) AGP readbacks. I'm hearing 5x improvement (which is around AGP 4x speed), which might actually be useful to me. Anyway, depending on the application, the relative importance of the two shifts. Of course, the GPU-to-CPU channel will always lag, and over time this gap will only increase. In my particular case: I need both, soon.

Dorbie --

AFAIK, ReadPixels pretty much saturates PCI, so a 5x improvement will put it in AGP-land. The chipset I plan on using (AMD 8151) certainly supports fast ReadPixels for cards that support it. My understanding is that it is not uncommon for newer cards.

-Won

qzm
04-20-2004, 01:53 AM
Well, currently I see around 130 MB/s readback using FX 5600/5700 Ultra class cards (hey, I don't need more GPU processing power.. a 5200 would probably be fine at present..), which is just a hair short of PCI..

Unfortunately, I *need* about 4 times that.
And async reads sometimes don't help - the CPU is thumb-twiddling anyway (although hey, async is never bad!).

I guess I will just hold my hopes up for the NV40, and play the wait-and-see game.

The application is hard real time - the data going on the pages is only available as the pages are made, and the result is needed immediately. I suspect many people (including NVIDIA themselves) have quite nice non-realtime accelerated renderers using their cards, but the readback problem really kills the realtime approach..

The biggest problem is that as soon as you move up to the higher-accuracy framebuffers, you drop to a CRAWL, and let's face it, that's a good part of the reason to use these cards!

All I can say is that it's a darn pity that alpha is not available on DVI connectors ;) .. but that's a whole other story.

Regards,
Stuart.

Tzupy
04-20-2004, 05:00 AM
Hi Won,
I wondered why you need at least 4x readback bandwidth, and found a possible reason: the 3D rendering that occurs in a holographic display may be achieved by multiple 2D renderings (of the slices that make up the holographic display), then readback of those to main memory, then transfer to the holographic device.
I am just curious about the matter; it must be fascinating to work on future technology...
If it's not confidential, would you mind giving some details?

zeckensack
04-20-2004, 07:42 AM
Nutty,
I see where you're coming from. Still, it can't be denied that most consumer ReadPixels implementations are horribly inefficient. Even if a pure bandwidth improvement wouldn't be The Right Thing ultimately, it would still be better than what we have now. It's okay if you ask for asynchronous stuff, we all want that, but that's IMO no good reason to dismiss the baby steps as a waste of time.

Re the asynchronous stuff, the issue is of course that you can't read back what hasn't been rendered. Okay, let's do some bean counting.

Read a 1024x768 RGBA8888 color buffer with 266 MB/s bandwidth. Takes 11 ms (45 ms on a Radeon 9200, btw). Ignoring overdraw and complex fragment operations, an FX 5900 XT can fill that buffer in just 500 µs (800 µs for the Radeon 9200). You already know by now what I'm getting at ...

Yes, you'll have to flush and stall for old fashioned ReadPixels. But the truth is that if you compare the times, what you lose to that flush is utterly insignificant compared to what you lose on the actual readback.

If you do that once per frame, the driver won't have the opportunity to buffer (and, as a result, flush on ReadPixels) more than one frame anyway.
You can spend 22 cycles per fragment for fancy shaders before the two metrics pull even.
That's potentially 22 wasted cycles per fragment, only because ReadPixels is so dog slow.

I really think readback bandwidth needs to be fixed first.
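
The same bean counting as a throwaway program, for anyone who wants to plug in their own numbers (the 1.6 Gpix/s fill rate is my round figure chosen to reproduce the 500 µs above; the ratio lands near the 22-cycle figure, give or take rounding):

#include <stdio.h>

int main(void)
{
    const double pixels = 1024.0 * 768.0;
    const double bytes  = pixels * 4.0;   /* RGBA8888 */
    const double t_read = bytes / 266e6;  /* PCI-speed readback: ~11.8 ms */
    const double t_fill = pixels / 1.6e9; /* assumed fill rate: ~490 us */
    printf("readback %.1f ms, fill %.0f us, headroom ~%.0f cycles/fragment\n",
           t_read * 1e3, t_fill * 1e6, t_read / t_fill);
    return 0;
}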

barthold
04-20-2004, 10:07 AM
Originally posted by Won:
Thanks qzm.

Nutty --

There is no reason why ReadPixels shouldn't be fast. NV and ATI fail to take advantage of AGP transfers in that direction, having long been optimized as texturing machines. 3Dlabs does use AGP ReadPixels, which is quite fast (fast enough that the lack of async readback might not be an issue).

-Won

Won, if you need async transfers, they are available on the Wildcat 6110/6210 and 7110/7210 when using the image buffer extension. Essentially this extension lets you associate an event with the completion of a ReadPixels/DrawPixels or TexImage/TexSubImage command. These commands will now return immediately to the application while the transfer is in progress. The spec is at: http://oss.sgi.com/projects/ogl-sample/registry/I3D/wgl_image_buffer.txt
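
In outline the usage looks something like this (reconstructed from memory of that spec, so check the signatures against the link above; hdc, w and h are assumed to exist):

DWORD  size = w * h * 4;
HANDLE done = CreateEvent(NULL, FALSE, FALSE, NULL); /* auto-reset Win32 event */
LPVOID buf  = wglCreateImageBufferI3D(hdc, size, WGL_IMAGE_BUFFER_MIN_ACCESS_I3D);
wglAssociateImageBufferEventsI3D(hdc, &done, &buf, &size, 1);

glReadPixels(0, 0, w, h, GL_BGRA_EXT, GL_UNSIGNED_BYTE, buf); /* returns immediately */

/* ... overlap CPU work with the in-flight transfer ... */

WaitForSingleObject(done, INFINITE); /* signaled when the readback completes */
wglReleaseImageBufferEventsI3D(hdc, &buf, 1);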

Barthold

Won
04-21-2004, 03:41 AM
Tzupy --

Way to visit the webpage. You're going to have to figure out the rest on your own.

Barthold --

I think someone from 3Dlabs (possibly you) mentioned this in a teleconference once. The reason I cannot use the Wildcat is that fast, async readback isn't the only feature I need; it's just the one typically missing from consumer-level video cards. The P10 VPUs are probably good enough, and hopefully the newly announced P20 is better (hey kids! check out www.3Dlabs.com (http://www.3Dlabs.com) for the new whitepaper), but more options are good, particularly since 3Dlabs boards are typically more expensive than the NVIDIA/ATI options.

-Won

Adrian
04-24-2004, 02:52 PM
Originally posted by Adrian:
I did also email one of the NV40 reviewers and asked them to run a readpixels benchmark but I haven't had a reply yet.

He replied. He doesn't have the card anymore but he hopes to get it back "sooner or later" and will run the benchmark then. He also said he will run it with the new ATI card when that arrives. I will post the results when I get them.

V-man
04-24-2004, 03:57 PM
Curious that there is such an interest in glReadPixels and glDrawPixels all of a sudden.

You guys aren't expecting a huge leap for sync reads, right? Even if it is an NV40, or even on PCI-Ex with a native interface....

davepermen
04-25-2004, 09:44 PM
Originally posted by V-man:
Curious that there is such an interest in glReadPixels and glDrawPixels all of a sudden.

You guys aren't expecting a huge leap for sync reads, right? Even if it is an NV40, or even on PCI-Ex with a native interface....

actually, if you go through the forums on the web, the request for it was always there.

and, nutty, you don't get it, do you? async or sync doesn't matter if the data transfer rate is just too low. try to get decent fps if you want to read back each frame at full res. you simply can't, at any really useful res, on any card today. and why is this? for NO reason. if you _need_ to have it nonblocking, and don't have an extension that does it, use a thread. i've done that, and it works great. but the raw data transfer rate is so low that you can't get anything useful out of it.

of COURSE nonblocking async readback helps scheduling it better, but it does NOT help to gain the NEEDED bandwidth to actually get your data back. and there are many scenarios where this is not only useful but required.

but the biggest reason why it should work is simple: agp defines high-speed readback. it's in the specs. it's entirely doable. there is NOTHING preventing it. i BET it doesn't really take much more work for the vendors.

but now that pcie is underway, i don't think they really care anymore. how old is agp now? quite an age it has, and during all these years NO gaming gpu EVER supported the fast readback that was defined right from the start. depressing.

Korval
04-25-2004, 11:17 PM
but the biggest reason why it should work is simple: agp defines high-speed readback. it's in the specs. it's entirely doable. there is NOTHING preventing it. i BET it doesn't really take much more work for the vendors.

That reminds me of something that nVidia's Matt mentioned back in the early AGP 2x/4x days. He mentioned something about most motherboards not supporting AGP fully correctly, thus requiring various workarounds or generally giving poor performance. Some AGP 4x boards wouldn't get past 2x performance, even in AGP performance tests. I wonder if it is possible that this is really a motherboard hardware problem more than anything. After all, if the AGP implementations of today (like those before them) don't fully implement the spec, then it's kinda hard for graphics card vendors to make their cards use that functionality.

I'm not saying that this is the case; I don't have enough information either way. But there is some precedent for motherboards impeding the performance of various graphics operations.

Adrian
04-26-2004, 12:22 AM
Originally posted by V-man:
You guys aren't expecting a huge leap for sync reads, right? Even if it is an NV40, or even on PCI-Ex with a native interface....

What's the point in having bi-directional bandwidth if it's not going to be used?

I doubt the NV40 will show much/any improvement but I expect future cards to.

Having said that, the Quadro version of the NV40 does appear to have 5x readback performance, as I mentioned earlier.

V-man
04-26-2004, 05:11 AM
That reminds me of something that nVidia's Matt mentioned back in the early AGP 2x/4x days

That's correct, but the thread was about people installing their NVIDIA drivers and the control panel saying PCI mode instead of AGP something-x, and also about the drivers disabling write combining on VIA systems.

I think it's the same situation with ATI and perhaps others.

I don't know jack about GPU design and nearly nothing about drivers, but it's quite possible that the reason you get poor readback is neither AGP nor lazy driver writing. It could be that the GPU doesn't like a glFinish.

Who knows ...


Having said that, the Quadro version of the NV40 does appear to have 5x readback performance, as I mentioned earlier.

Could you pinpoint the page and location? I only saw game performance benchmarks.

AGP 5x? That means a little over 1 GB/s.

Keep in mind that if it's async performance, then it's not a true measure of how fast glFinish and then read back can occur.

I want to see numbers with sync reads.

Adrian
04-26-2004, 05:23 AM
I'm not sure where the article was exactly but translate this page into English and search for the word "read"
http://news.hwupgrade.it/12266.html

Async performance would be a silly thing to measure. I really don't think they are referring to that. Apart from anything else async readback is a driver feature and not specific to the FX4000.

davepermen
04-26-2004, 05:28 AM
sync or async doesn't matter. this is just an api issue. performance is a driver/hw issue. why they don't allow agp mode for readback is beyond my understanding (except the "uh, who cares about that? it costs money to create it, so forget it.. nobody needs it anyways, and if someone does, we will have some nice powerpoint presentation showing him otherwise" reason).

first, we need fast readback (delay doesn't matter for the speed of the readback, just for how long it takes until the transfer begins. but you all know that)

it's like me, running a celeron, knowing it's a full-fledged p4, and i know 90% of its power is simply there for nothing. but in my case, it's because the hw isn't there (the cache, i mean, is broken). in the case of agp readback, IT'S ALL THERE, THEY WERE JUST LAZY. that makes me rather aggressive..

a nice async api, thats a completely different issue. of course that will still be needed/useful.

vmh5
04-26-2004, 10:30 AM
First: to the people who think glReadPixels is not very useful and doesn't require much attention: there is a lot more to graphics than games these days..... especially since programmable GPUs. My particular task is 3D hw rendering with OpenGL. My biggest bottleneck: glReadPixels, and I can prove to you it's not hw, it's not me, it's a small amount of really bad code in the drivers.

A glaring issue with glReadPixels that hasn't received much attention here is data conversion. I.e. pulling 16bit per component float buffers into unsigned short images. Or even flipping the channel ordering of good old 8bit per component...
And in most of these cases you can improve on the driver performance by several orders of magnitude by pulling the data off the card in a different format and doing the conversion yourself.... which is of course absurd.
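
For illustration, the workaround for the 8-bit channel-order case looks like this (which format is the driver's fast path varies per card, so measure first; on my FX numbers below, BGRA bytes wins):

#include <stdlib.h>
#include <GL/gl.h> /* GL_BGRA_EXT may need glext.h on some platforms */

/* read via the fast path (BGRA bytes), then swizzle to RGBA ourselves */
void read_rgba8(int w, int h, unsigned char *rgba)
{
    unsigned char *bgra = malloc((size_t)w * h * 4);
    glReadPixels(0, 0, w, h, GL_BGRA_EXT, GL_UNSIGNED_BYTE, bgra);
    for (int i = 0; i < w * h; ++i) {
        rgba[4*i+0] = bgra[4*i+2]; /* R */
        rgba[4*i+1] = bgra[4*i+1]; /* G */
        rgba[4*i+2] = bgra[4*i+0]; /* B */
        rgba[4*i+3] = bgra[4*i+3]; /* A */
    }
    free(bgra);
}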

What's really absurd in this respect is that the people at ATI and NVidia have exchanged more email with me personally than it would take for a summer intern from a sub-par CS department to fix their drivers. Let me show you:

Create a simple glut app with a pbuffer and a timer and do some stuff like this:

glFinish();  /* drain pending GL work so only the readback is timed */
resetTime(); /* resetTime()/millisecondDeltaTime() are the app's own timer helpers */
glReadPixels(0, 0, tw, th, GL_BGRA_EXT, GL_UNSIGNED_SHORT, data16);
printf("%.1f ms\n", millisecondDeltaTime());

What you will see is huge drops in performance for certain data types. Here are some numbers I ran on my 2.5 GHz Athlon with a Radeon 9700 and a GeForce FX 5200 on 4-channel 720x486 buffers. Check the absurd numbers for 16bit per component on ATI... And BTW, doing any kind of conversion on the CPU on these buffers with the most simplistic method imaginable takes no more than 3.5 ms.

for a GeForce FX 5200:

time to send 8bit per component rgba to card: 3.4ms
time to send 8bit per component bgra to card: 3.8ms
time to send 8bit per component abgr to card: 3.3ms
time to send 8bit per component argb to card: 3.5ms

time to read 8bit per component rgba from card: 15.2ms :confused:
time to read 8bit per component bgra from card: 10.7ms
time to read 8bit per component abgr from card: 10.7ms
time to read 8bit per component argb from card: 10.5ms

time to read 8bit per component rgba from card as 32bit float : 11.6ms
time to read 8bit per component bgra from card as 32bit float : 26.0ms :confused:

time to send 16bit per component rgba to card as unsigned short : 19.8ms
time to send 16bit per component bgra to card as unsigned short : 21.9ms
time to send 16bit per component rgba to card as short : 19.9ms
time to send 16bit per component bgra to card as short : 21.6ms

time to send 16bit per component rgba to card as 32bit float : 13.9ms
time to send 16bit per component bgra to card as 32bit float : 19.9ms

time to read 16bit per component rgba from card as short : 20.2ms
time to read 16bit per component bgra from card as short : 22.7ms
time to read 16bit per component argb from card as unsigned short: 53.1ms :mad:
time to read 16bit per component abgr from card as unsigned short: 69.0ms

time to read 16bit per component rgba from card as 32bit float : 23.7ms
time to read 16bit per component bgra from card as 32bit float : 24.3ms

Radeon 9700

time to send 8bit per component rgba to card: 5.6ms
time to send 8bit per component bgra to card: 5.8ms
time to send 8bit per component abgr to card: 40.8ms :mad:
time to send 8bit per component argb to card: 51.9ms

time to read 8bit per component rgba from card: 23.9ms
time to read 8bit per component bgra from card: 19.3ms
time to read 8bit per component abgr from card: 18.9ms
time to read 8bit per component argb from card: 19.0ms

time to read 8bit per component rgba from card as 32bit float : 70.9ms
time to read 8bit per component bgra from card as 32bit float : 70.8ms

time to send 16bit per component rgba to card as unsigned short : 20.5ms
time to send 16bit per component bgra to card as unsigned short : 16.0ms
time to send 16bit per component rgba to card as short : 20.6ms
time to send 16bit per component bgra to card as short : 16.2ms

time to send 16bit per component rgba to card as 32bit float : 21.1ms
time to send 16bit per component bgra to card as 32bit float : 13.0ms

time to read 16bit per component rgba from card as short : 983.7ms :mad: :confused: :mad:
time to read 16bit per component bgra from card as short : 1002.3ms

time to read 16bit per component rgba from card as 32bit float : 985.2ms
time to read 16bit per component bgra from card as 32bit float : 973.3ms

Korval
04-26-2004, 11:27 AM
time to read 16bit per component rgba from card as short : 983.7ms
time to read 16bit per component bgra from card as short : 1002.3ms

time to read 16bit per component rgba from card as 32bit float : 985.2ms
time to read 16bit per component bgra from card as 32bit float : 973.3ms

That's getting a little degenerate :eek: Is there some kind of delaying for-loop going on in there? ;)

While converting between integer and floating-point is a "slow" operation, there's something more going on in that ATi driver than just data conversion, or even data download. Even at PCI bus speeds, a full second is sufficient to read a few hundred MB of data. So it isn't the transfer. And data conversion, as you pointed out, can't take that long. So it isn't data conversion. What's left? Some nonsense?

Maybe ATi has been devoting its driver development resources to other things, and the reading of more... unusual formats (you have to admit, reading 16-bit per component is a bit off the beaten path) is using very old driver code. Like pre-8500 driver code. Stuff written back in the days when ATi's drivers were really crappy.


What's really absurd in this respect is that the people at ATI and NVidia have exchanged more email with me personally than it would take for a summer intern from a sub-par CS department to fix their drivers.

I wouldn't be so sure. Code can get really ugly, especially if driver development teams have changed a few times over the years. Brutal hacks can form, and so on.

I'm not entirely sure what can be done to improve glReadPixels performance. Let me state that another way: both ATi and nVidia make their money off of gamers. As such, those features/optimizations have driver development priority. What would be needed to correct the more atrocious cases is likely a week or so of one driver programmer's time, to possibly rearchitect the read-pixel pipeline. Of course, since that time could go to game performance enhancements or supporting features for new hardware, it is likely that it won't get allocated any time soon.

The best way to get ATi and nVidia on board is to get game developers to try to use glReadPixels. Granted, because it is slow, they won't do it, so it becomes a catch-22. Either that, or one of the two sides goes ahead and allocates the time to improve performance, which would force the other side to do the same to stay competitive.

Adrian
04-29-2004, 01:49 AM
"The Quadro FX 4000 is based on the NV40GL, which is a superset of the NV40 but with additional hardware and software features"
http://www.xbitlabs.com/news/video/display/20040428145354.html

So that probably means that the readpixel performance improvement won't be seen in consumer cards based on the NV40. Unfortunately readpixel speed tends to be seen as something only useful for 'professionals'.

I'm just speculating.

Tzupy
04-29-2004, 04:51 AM
Hi,
The 'GL' suffix is typical for the Quadro chips; it doesn't mean they are so much different. Probably the gaming chips and the 'GL' chips are 100% identical at manufacture; then some features are disabled on the gaming chips, and the DRIVER makes the rest of the difference.
Here are a couple of reasons for the NV40 to have 4x readback:
1) the PCIE version will have 4x readback through the HSI; this means the NV40 chip has 4x capability, and hopefully the AGP 8x version will take advantage of it (if the drivers expose it).
2) the ATI R423 (PCIE) should have 16x capability (at least theoretically), and maybe the R420 has decent readback capability, so the NV40 could look bad if it sticks with the current AGP 1x readback.

Adrian
04-29-2004, 05:32 AM
Ok, but I'd be surprised if NV's marketing department decided not to mention it in the consumer NV40 release even though it existed. Despite fast readback being of no immediate relevance to consumers, it's still another bullet point.

Tzupy
06-01-2004, 06:10 AM
Hi,
It seems that the NV45 will have the same readback bandwidth as the NV40, since it's not a native PCIE, but has the bridge integrated.
Someone from nVidia PLEASE correct me if I'm wrong...