NV40 glReadPixels..

qzm · April 18, 2004, 6:03pm

Just to try to continue Won’s thread on this area of the NV40 with a slightly better topic…

Does anyone have any info yet on the glReadPixels performance of the new NV40? Either the current AGP implementation, or even the future PCI Express ones?

(I know they are the same chip, but it is quite possible the bridge chip will make a difference to this particular feature).

There are more than a few developers out here who are waiting for this information… I know it will increase the functionality of these cards massively for me if they can up the readout speed.

And for those people who don’t care - there ARE very valid reasons to want to get images back out of cards! These units are providing a whole new level of graphics CAPABILITY to computers… not just for display to a monitor!

Regards,
Stuart.

imported_Adrian1 · April 18, 2004, 8:22pm

I found this information about the Quadro FX4000 which I think is based on the NV40.
“…accelerated pixel read-back performance that improves graphics throughput to more than 5X the performance of previous generation graphics systems”
Source: www.nvnews.net

I did also email one of the NV40 reviewers and ask them to run a ReadPixels benchmark, but I haven’t had a reply yet.

dorbie · April 18, 2004, 8:37pm

Did you send the reviewer a benchmark?

imported_Adrian1 · April 18, 2004, 8:52pm

I sent them a link to download a benchmark.

Nutty · April 19, 2004, 3:22am

I don’t think ReadPixels is ever going to be truly fast. But I like NV’s asynchronous version, which can mask the slowness if delayed data can work for you.

Won · April 19, 2004, 4:01am

Thanks qzm.

Nutty –

There is no reason why ReadPixels shouldn’t be fast. NV and ATI fail to take advantage of AGP transfers in that direction, being long-optimized for being texturing machines. 3Dlabs does use AGP ReadPixels, which is quite fast (fast enough that the lack of async readback might not be an issue). Having both would be fantastic, particularly if I can get my hands on hardware that can do this within a month. In any case, 3Dlabs already has it, and will probably still have it for their next part. Choice/competition is always good for developers and our customers. We’ll also have to see about ATI, which certainly isn’t standing still.

I expect that by next generation hardware (the stuff not yet talked about), you will be able to effectively saturate the bus in both directions, and in the case of PCI-Express, you should be able to do both directions simultaneously.

BTW, IHVs do keep their tabs on these forums, so you do get a chance to indirectly dictate what becomes important for the IHVs assuming you suggest something sane (which is for the IHVs to decide for the most part). So the “ReadPixels will never be fast” attitude is both incorrect (it already is on some hardware) and counterproductive.

-Won

dorbie · April 19, 2004, 7:57am

Well, the issue is supporting DMA transfers from the card to system memory instead of falling back on PCI transfers, and that may not be trivial, especially when you consider the GART page map. The information available suggests that NVIDIA is now saying they’re 5 times faster than earlier generations. Good, but that’s not exactly blowing the doors off AGP (or PCI Express), considering that even the PCI reads seem relatively slow and you might be able to go 5x faster even within the PCI bandwidth. I’m wondering if someone was just looking at raw bandwidth numbers and arriving at an estimate based on bus speeds.

This is all with the underlying assumption that mobo chipsets can support DMA’d AGP reads.

NVIDIA is reading this, so the silence is not encouraging. I know NVIDIA has strict communication policies but some are free(?) to post technical info. How about a simple description of the read bandwidth situation, bearing in mind that silence on the issue is not the most reassuring policy.

Tzupy · April 19, 2004, 8:04am

Hi,
I doubt the AGP 8x version of NV40 will have 4x readback bandwidth. The NV40 with PCI Express (and HSI bridge) is supposed to achieve AGP 12x writes and 4x reads; only the native PCI Express NV45 will be ‘the real thing’ in terms of reads. And the performance of the Quadro 4000 may not be the same as the GeForce 6800 (since it will sell at about 5 times the price). But I would be very happy if the AGP 8x version of the GeForce 6800 had 4x read capability (and an affordable price).

Nutty · April 19, 2004, 8:30am

Won, what is the bandwidth of AGP 4x and 8x?

zeckensack · April 19, 2004, 8:43am

AGP 4x tops out at roughly 1 GB/s (32 bit QDR at 66MHz), AGP 8x is twice that.
PCI transfers over the AGP are equivalent to PCI-66, with a theoretical maximum of 266MB/s.
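
For anyone who wants to check those figures, here is a rough sketch of the arithmetic in Python. It assumes a 32-bit bus at the nominal 66.67 MHz base clock and counts theoretical peak only, with no protocol overhead. The function and constant names are made up for illustration.

```python
# Theoretical peak bus bandwidth; real transfers will be slower.
BASE_CLOCK_HZ = 200e6 / 3   # nominal 66.67 MHz AGP / PCI-66 clock (assumed)
BUS_WIDTH_BYTES = 4         # 32-bit bus

def peak_mb_per_s(transfers_per_clock):
    """Peak rate in MB/s (1 MB = 10**6 bytes) for a given pumping factor."""
    return BASE_CLOCK_HZ * BUS_WIDTH_BYTES * transfers_per_clock / 1e6

pci66 = peak_mb_per_s(1)   # plain PCI-style transfers over the AGP slot
agp4x = peak_mb_per_s(4)   # quad data rate
agp8x = peak_mb_per_s(8)   # twice AGP 4x

print(f"PCI-66: {pci66:.0f} MB/s, AGP 4x: {agp4x:.0f} MB/s, AGP 8x: {agp8x:.0f} MB/s")
```

The "266MB/s" and "roughly 1 GB/s" figures above are these same numbers with slightly different rounding.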

Nutty · April 19, 2004, 12:21pm

Still nowhere near good system RAM performance, never mind VRAM performance.

Like I said, it’s never going to be truly fast. It will never even compare to the internal bandwidth of a mid-range graphics card, and the divide is growing ever larger as GPUs leap ahead in performance while CPUs shuffle along.

Even if it was fast, a blocking approach is going to be bad for parallelism between CPU and GPU, so the only real solution is an asynchronous method like NV’s pixel data range. Or do everything internal to the graphics card.
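
The blocking-versus-asynchronous argument can be sketched with a toy timing model in Python. This is an idealized illustration, not a measurement: it assumes fixed per-frame render and readback costs, and that an asynchronous readback of frame N overlaps perfectly with the rendering of frame N+1. The function names and the numbers are hypothetical.

```python
def frame_time_blocking(render_ms, readback_ms):
    """Blocking ReadPixels: wait for the render, then wait for the copy."""
    return render_ms + readback_ms

def frame_time_async(render_ms, readback_ms):
    """Idealized async readback: the copy of frame N overlaps the rendering
    of frame N+1, so the slower of the two stages sets the pace."""
    return max(render_ms, readback_ms)

# Hypothetical per-frame costs, for illustration only.
render_ms, readback_ms = 5.0, 12.0

blocking = frame_time_blocking(render_ms, readback_ms)
overlapped = frame_time_async(render_ms, readback_ms)
print(f"blocking: {blocking} ms/frame, async: {overlapped} ms/frame")
```

In this model the asynchronous path improves throughput, but each result arrives a frame late, and when the readback cost dominates, overlapping only hides the render time.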

imported_Adrian1 · April 19, 2004, 1:32pm

Originally posted by Nutty:
Still nowhere near good system RAM performance, never mind VRAM performance.

No one expects to be able to read video memory on the CPU like it was system memory. The problem is that reading at 100 MB/s synchronously is almost useless. >2 GB/s asynchronous is truly fast, IMO.

Even if it was fast, a blocking approach is going to be bad for parallelism between CPU and GPU, so the only real solution is an asynchronous method like NV’s pixel data range.

…we will have an ARB version of PDR soon.

Or do everything internal to the graphics card.

It’s not always possible or practical to do everything on the gpu.

Nutty · April 19, 2004, 2:01pm

No one expects to be able to read video memory on the CPU like it was system memory. The problem is that reading at 100 MB/s synchronously is almost useless. >2 GB/s asynchronous is truly fast, IMO.
Why not? I thought that’s what people in this thread were wanting to see: readback at full AGP 4x/8x speed.

But even that is slow in the world of graphics if it’s synchronous. Async is a lot more useful.

Won · April 19, 2004, 3:47pm

Nutty –

I see your point. NVIDIA already has PDR/PBO which is half the equation and I was asking/hoping for efficient (you’re right; “fast” is a vague term) AGP readbacks. I’m hearing 5x improvement (which is around AGP 4x speed) which might actually be useful to me. Anyway, depending on the application, the two things exchange importance. Of course, the GPU-CPU channel will always lag, and over time this gap will only increase. In my particular case: I need both soon.

Dorbie –

AFAIK, ReadPixels pretty much saturates PCI, so a 5x improvement will put it in AGP-land. The chipset I plan on using (AMD 8151) certainly supports fast ReadPixels for cards that support it. My understanding is that it is not uncommon for newer cards.
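
Where a 5x improvement lands depends on which baseline is meant by "saturates PCI". The short Python check below tries two assumed baselines: 133 MB/s (classic 32-bit/33 MHz PCI) and the 266 MB/s PCI-66 figure quoted earlier in the thread. The AGP peaks are theoretical, and the helper name is made up for illustration.

```python
# Theoretical AGP peaks in MB/s, listed slowest to fastest.
AGP_PEAK_MB_S = {"AGP 1x": 266, "AGP 2x": 533, "AGP 4x": 1066, "AGP 8x": 2133}

def slowest_mode_covering(rate_mb_s):
    """Return the lowest AGP mode whose theoretical peak is >= the given rate."""
    for mode, peak in AGP_PEAK_MB_S.items():
        if peak >= rate_mb_s:
            return mode
    return None

for baseline in (133, 266):   # assumed PCI-33 / PCI-66 style baselines, MB/s
    target = 5 * baseline
    print(f"{baseline} MB/s x5 = {target} MB/s -> needs {slowest_mode_covering(target)}")
```

Under either assumption a true 5x gain cannot fit in PCI-style transfers, which is consistent with the "AGP-land" estimate.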

-Won

qzm · April 20, 2004, 1:53am

Well, currently I see around 130 MB/s readback using FX 5600/5700 Ultra class cards (hey, I don’t need more GPU processing power; a 5200 would probably be fine at present), which is just a hair short of PCI…

Unfortunately, I need about 4 times that.
And async reads sometimes don’t help - the CPU is twiddling its thumbs anyway (although hey, async is never bad!).

I guess I will just hold my hopes up for the NV40, and play the wait-and-see game.

The application is hard real time - the data going on the pages is only available as the pages are made, and the result is needed immediately. I suspect many people (including NVIDIA themselves) have quite nice non-realtime accelerated renderers using their cards, but the readback problem really kills the realtime approach…

The biggest problem is that as soon as you move up to the higher-accuracy framebuffers, you drop to a CRAWL, and let’s face it, that’s a good part of the reason to use these cards!

All I can say is that it’s a darn pity that alpha is not available on DVI connectors … but that’s a whole other story.

Regards,
Stuart.

Tzupy · April 20, 2004, 5:00am

Hi Won,
I wondered why you do need at least 4x readback bandwidth, and found a possible
reason: the 3D rendering that occurs in a holographic display may be achieved by
multiple 2D renderings (of the slices that make up the holographic display), then
the readback of those to main memory, then the transfer to the holographic device.
I am just curious about the matter, it must be fascinating to work on future technology…
If it’s not confidential, would you mind giving some details?

zeckensack · April 20, 2004, 7:42am

Nutty,
I see where you’re coming from. Still, it can’t be denied that most consumer ReadPixels implementations are horribly inefficient. Even if a pure bandwidth improvement wouldn’t be The Right Thing ultimately, it would still be better than what we have now. It’s okay if you ask for asynchronous stuff, we all want that, but that’s IMO no good reason to dismiss the baby steps as a waste of time.

Re the asynchronous stuff, the issue is of course that you can’t read back what hasn’t been rendered. Okay, let’s do some bean counting.

Read a 1024x768 RGBA8888 color buffer with 266MB/s bandwidth. Takes 11 ms (45 ms on a Radeon 9200 btw). Ignoring overdraw and complex fragment operations, an FX5900XT can fill that buffer in just 500µs (800µs for the Radeon 9200). You already know by now what I’m getting at …

Yes, you’ll have to flush and stall for old fashioned ReadPixels. But the truth is that if you compare the times, what you lose to that flush is utterly insignificant compared to what you lose on the actual readback.

If you do that once per frame, the driver won’t have the opportunity to buffer (and, as a result, flush on ReadPixels) more than one frame anyway.
You can spend 22 cycles per fragment for fancy shaders before the two metrics pull even.
That’s potentially 22 wasted cycles per fragment, only because ReadPixels is so dog slow.
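
The bean counting above can be reproduced with a few lines of Python. This sketch takes 1 MB as 10^6 bytes and uses the 266 MB/s and 500µs figures as given. The "22 cycles" reading assumes a fragment that costs N cycles takes N times the single-cycle fill time. Variable names are illustrative only.

```python
WIDTH, HEIGHT, BYTES_PER_PIXEL = 1024, 768, 4   # RGBA8888 color buffer
READBACK_MB_S = 266                             # PCI-66 theoretical peak
FILL_TIME_US = 500                              # FX5900XT fill time quoted above

buffer_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
readback_ms = buffer_bytes / (READBACK_MB_S * 1e6) * 1e3

# How many single-cycle fills fit into one readback of the same buffer.
ratio = readback_ms * 1e3 / FILL_TIME_US

print(f"buffer: {buffer_bytes} bytes")
print(f"readback: {readback_ms:.1f} ms")
print(f"readback / fill: {ratio:.1f}x")
```

Exact arithmetic gives a little under 12 ms and a ratio between 23 and 24. The 11 ms and 22 cycles in the post are the same numbers rounded down.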

I really think readback bandwidth is to be fixed first.

barthold · April 20, 2004, 10:07am

Originally posted by Won:
Thanks qzm.

Nutty –

There is no reason why ReadPixels shouldn’t be fast. NV and ATI fail to take advantage of AGP transfers in that direction, being long-optimized for being texturing machines. 3Dlabs does use AGP ReadPixels, which is quite fast (fast enough that the lack of async readback might not be an issue).

-Won
Won, if you need async transfers, they are available on the Wildcat 6110/6210 and 7110/7210 when using the image buffer extension. Essentially, this extension lets you associate an event with the completion of a ReadPixels/DrawPixels or TexImage/TexSubImage command. These commands then return immediately to the application while the transfer is in progress. The spec is at: http://oss.sgi.com/projects/ogl-sample/registry/I3D/wgl_image_buffer.txt

Barthold

Won · April 21, 2004, 3:41am

Tzupy –

Way to visit the webpage. You’re going to have to figure out the rest on your own.

Barthold –

I think someone from 3Dlabs (possibly you) mentioned this in a teleconference once. The reason I cannot use the Wildcat is that fast, async readback isn’t the only feature I need; it is just the one that is typically missing from consumer-level video cards. The P10 VPUs are probably good enough, and hopefully the newly announced P20 is better (hey kids! check out www.3Dlabs.com for the new whitepaper) but more options are good, particularly since 3Dlabs boards are typically more expensive than the NVIDIA/ATI options.

-Won

imported_Adrian1 · April 24, 2004, 2:52pm

Originally posted by Adrian:
I did also email one of the NV40 reviewers and ask them to run a ReadPixels benchmark, but I haven’t had a reply yet.
He replied. He doesn’t have the card anymore but he hopes to get it back “sooner or later” and will run the benchmark then. He also said he will run it with the new ATI card when that arrives. I will post the results when I get them.