PCI Express - Is anything going to change?

Has anyone had any experience with this? I’m curious whether PCI Express is going to change the way you do things when writing performance-critical game code in OpenGL.

Graphics cards will be using 16x PCI Express with upload/download bandwidths of 4GB a second… for getting data back from the card, that’s about 30 times faster than on the PCI bus.

Because of all the extra bandwidth, I could see drivers keeping just one copy of the data and passing it off between the card and AGP memory as needed… is this what is going to happen?

DirectX takes a more hands-on approach to card memory management, but OpenGL is deliberately more ambiguous: you can give hints, but ultimately it is up to the driver. Is this a win or a loss when combined with PCI Express?
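To be concrete about what I mean by hints: with VBOs you only tell the driver how you intend to use a buffer, and the driver decides where the memory actually lives. Something like this (GL 1.5 names; just a sketch):

```c
#include <GL/gl.h>
#include <GL/glext.h>   /* GL 1.5 tokens; entry points assumed loaded by your extension code */

/* Upload some vertex data; the usage parameter is only a hint. */
GLuint upload_mesh(const GLfloat *verts, GLsizeiptr bytes)
{
    GLuint vbo;

    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);

    /* GL_STATIC_DRAW means "written once, drawn many times". The driver,
       not the app, decides whether the data ends up in video memory,
       AGP/PCIe-visible system memory, or plain system memory. */
    glBufferData(GL_ARRAY_BUFFER, bytes, verts, GL_STATIC_DRAW);

    /* Data rewritten every frame would be hinted GL_STREAM_DRAW or
       GL_DYNAMIC_DRAW instead - still a hint, never a placement command. */
    return vbo;
}
```

Whether drivers start treating those hints differently once the bus penalty shrinks is exactly what I’m wondering about.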

I’m just curious what everyone thinks… there hasn’t really been a general discussion of the impact of PCI Express on OpenGL yet.

I don’t think much will change.

My understanding is that the increase in bandwidth is primarily in the direction from the card back to the system.
I think this is now asynchronous (?) so it should not affect uploads… Can anyone confirm this?

I can mainly see a benefit in reading screen/pixel buffers back from the card and doing CPU-side effects more easily. In short, video-editing-esque effects.

As far as I know, yes, there are 16 serial lanes at 250MB/s each leading to the graphics card, and 16 leading away from it, for a total of 4GB/s up and 4GB/s down. Up and down bandwidth is not shared in PCIe, unlike the read-back bandwidth of current cards, which is shared with the other PCI devices.

Basically, it will be much more reasonable to do render-to-texture effects and then do some CPU-based post-processing effects.
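In code, that round trip is nothing exotic: plain glReadPixels and glDrawPixels would do it. It has just been too slow over AGP to use every frame. A rough sketch (synchronous, no clever extensions; the RGBA format is an arbitrary choice):

```c
#include <GL/gl.h>
#include <stdlib.h>

/* Pull the current frame back, run a CPU-side effect over it, and push it
   back up. Over AGP the glReadPixels call is the painful part; a full-duplex
   bus is what would make this usable per frame. */
void cpu_postprocess(int width, int height)
{
    GLubyte *pixels = (GLubyte *)malloc((size_t)width * height * 4);

    if (!pixels)
        return;

    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, pixels);

    /* ... video-editing style CPU effect goes here ... */

    glRasterPos2i(0, 0);   /* assumes a pixel-aligned orthographic projection */
    glDrawPixels(width, height, GL_RGBA, GL_UNSIGNED_BYTE, pixels);

    free(pixels);
}
```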

As far as GL is concerned, the API won’t change.
Look at the VBO spec: it makes no mention of any particular technology, so it will be around forever.

The old glReadPixels and glGetTexImage will probably still perform badly on some cards.
As we all know, some drivers have an issue with these functions and the bus isn’t to blame.

In summary

bad drivers = bad performance

don’t blame the bus

Hardware occlusion querying is currently rendered almost useless by the one-way nature of the AGP bus. I’d expect it to benefit a lot from PCI Express.

Originally posted by V-man:
[b]As far as GL is concerned, the API won’t change.
Look at the VBO spec: it makes no mention of any particular technology, so it will be around forever.

The old glReadPixels and glGetTexImage will probably still perform badly on some cards.
As we all know, some drivers have an issue with these functions and the bus isn’t to blame.

In summary

bad drivers = bad performance

don’t blame the bus[/b]
The drivers may not be perfect, but the AGP bus just isn’t optimised for reading back data, and therefore the fundamental problem lies with the bus. Bearing that in mind, why would you expect driver writers to bother optimising such commands, when it would be a waste of time?

[quote]Hardware occlusion querying is currently rendered almost useless by the one-way nature of the AGP bus.[/QUOTE]

I don’t think that’s true. Hardware occlusion query only requires reading back a single word or two from the card. Read-back bandwidth just doesn’t matter for it. I don’t quite understand why you think it’s useless as-is, though?

Hardware occlusion query, just like any other read-back operation, needs for all rendering up to the read point to finish before it can determine the answer. This serialization is a correctness requirement that won’t change based on the bus type. Hardware occlusion is already nicer than, say, reading pixel data, because you can specify a point which you will later query, allowing the card to pipeline work more effectively than finishing the entire current pipeline before returning to you.
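To make that concrete: the query API separates “issue” from “fetch”, and even lets you poll whether the answer has arrived yet. A minimal sketch using the GL 1.5 names (ARB_occlusion_query is the same thing with an ARB suffix; the proxy drawing is a placeholder):

```c
#include <GL/gl.h>
#include <GL/glext.h>   /* GL 1.5 query entry points assumed loaded */

extern void draw_proxy_geometry(void);   /* placeholder: whatever you want counted */

void issue_query(GLuint query)
{
    /* Mark the point in the command stream we will ask about later. */
    glBeginQuery(GL_SAMPLES_PASSED, query);
    draw_proxy_geometry();
    glEndQuery(GL_SAMPLES_PASSED);
    /* Returns immediately; the GPU keeps pipelining. */
}

/* Returns 1 and fills *samples once the answer is in, 0 if it isn't yet. */
int fetch_query(GLuint query, GLuint *samples)
{
    GLuint ready = GL_FALSE;

    /* Poll first: asking for GL_QUERY_RESULT straight away would block
       until the GPU has rendered everything up to the glEndQuery above. */
    glGetQueryObjectuiv(query, GL_QUERY_RESULT_AVAILABLE, &ready);
    if (!ready)
        return 0;   /* come back later; do other work in the meantime */

    glGetQueryObjectuiv(query, GL_QUERY_RESULT, samples);
    return 1;
}
```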

One solution is to parallelise the tasks on the CPU and GPU at a much lower level… meaning you give a small task to the GPU, do a small task on the CPU, and can then immediately combine the results on the GPU or CPU (it doesn’t matter which) and continue. This parallelism is possible today, too, which is why a fullscreen readback, or an occlusion query, sometimes doesn’t hurt at all.

The trick is that if you fully exploit the parallelism, you will be able to use the much higher readback bandwidth for a lot of nice stuff.

Then the GPU will be more of a coprocessor than it is now, where it is mainly a “formatter” of your graphics data (with some readbacks…).

But programming parallel processors, that’s another story :smiley:

Originally posted by jwatte:
[quote]I don’t think that’s true. Hardware occlusion query only requires reading back a single word or two from the card. Read-back bandwidth just doesn’t matter for it.[/QUOTE]
It’s not so much the bandwidth as the fact that the AGP bus only allows data to travel in one direction at a time. PCI Express allows bandwidth to go both ways simultaneously; it’s full-duplex. This can only be a good thing as far as occlusion querying is concerned.

[quote]I don’t quite understand why you think it’s useless as-is, though?

Hardware occlusion query, just like any other read-back operation, needs for all rendering up to the read point to finish before it can determine the answer. This serialization is a correctness requirement that won’t change based on the bus type. Hardware occlusion is already nicer than, say, reading pixel data, because you can specify a point which you will later query, allowing the card to pipeline work more effectively than finishing the entire current pipeline before returning to you.[/QUOTE]
In practice, at least on the ATi chipsets I’ve used, even using a parallel approach is way too slow to be useful; I’ve found that the resulting frame rate is significantly lower than with brute force, even where ~80% of the geometry can be culled. I’ve subsequently found that doing the same thing in software (with SSE-enhanced rasterising) gives much better results. In my experience I’d say it isn’t usable, and a lot of others have said the same thing. There may be some very specific cases in which it can improve performance, but as a general case solution, it sucks, frankly.

Originally posted by bunny:
The drivers may not be perfect, but the AGP bus just isn’t optimised for reading back data, and therefore the fundamental problem lies with the bus.
NVidia has a >150% performance advantage over ATI with readpixels. There is also no PDR equivalent for asynchronous readback on ATI hardware. That for me is far from perfect.
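For reference, PDR-style asynchronous readback looks roughly like the following when written against a pixel-buffer-object style extension instead. This is a sketch only: it assumes the driver actually exports EXT_pixel_buffer_object (or an ARB equivalent) on top of the VBO entry points, which is exactly the kind of asynchronous path that’s missing on ATI.

```c
#include <GL/gl.h>
#include <GL/glext.h>   /* EXT_pixel_buffer_object + ARB_vertex_buffer_object assumed */

/* Kick off a readback into a buffer object rather than client memory.
   glReadPixels can return before the transfer has finished. */
void start_async_readback(GLuint pbo, int width, int height)
{
    glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, pbo);
    glBufferDataARB(GL_PIXEL_PACK_BUFFER_EXT,
                    width * height * 4, NULL, GL_STREAM_READ_ARB);
    glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, 0);
    glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, 0);
}

/* Later - ideally a frame later - map the buffer and use the pixels on the
   CPU. Mapping too early still blocks, just like plain glReadPixels would. */
const void *finish_async_readback(GLuint pbo)
{
    glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, pbo);
    return glMapBufferARB(GL_PIXEL_PACK_BUFFER_EXT, GL_READ_ONLY_ARB);
    /* caller should glUnmapBufferARB(GL_PIXEL_PACK_BUFFER_EXT) when done */
}
```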

Originally posted by Adrian:
[quote]Originally posted by bunny:
The drivers may not be perfect, but the AGP bus just isn’t optimised for reading back data, and therefore the fundamental problem lies with the bus.
NVidia has a >150% performance advantage over ATI with readpixels. There is also no PDR equivalent for asynchronous readback on ATI hardware. That for me is far from perfect.
[/QUOTE]And where in my post did I disagree with that? My point is, why should ATi bother optimising it when the maximum performance is so limited by the bus architecture? It seems pointless, especially since most game developers would avoid glReadPixels like the plague anyway.

Also, last time I checked, glReadPixels wasn’t too hot on my GeForce 2 Pro either, although admittedly it’s an old card and I haven’t updated the drivers for six months. Perhaps the situation is different with newer cards and drivers?

Originally posted by bunny:
In practice, at least on the ATi chipsets I’ve used, even using a parallel approach is way too slow to be useful; I’ve found that the resulting frame rate is significantly lower than with brute force, even where ~80% of the geometry can be culled.
I would really like to hear about what kind of app/data you used to come to this conclusion.

I’ve seen a Radeon 9700 slice through data sets with upwards of 20 million triangles like butter, with no CPU-based culling involved whatsoever. Would you like to try that with brute force, or with your SSE-enhanced rasterizer?

Occlusion queries aren’t the answer to all the world’s problems, but they work really well provided that you use them appropriately.

– Tom

Originally posted by Tom Nuydens:
[b][quote]Originally posted by bunny:
In practice, at least on the ATi chipsets I’ve used, even using a parallel approach is way too slow to be useful; I’ve found that the resulting frame rate is significantly lower than with brute force, even where ~80% of the geometry can be culled.[/QUOTE]
I would really like to hear about what kind of app/data you used to come to this conclusion.

I’ve seen a Radeon 9700 slice through data sets with upwards of 20 million triangles like butter, with no CPU-based culling involved whatsoever. Would you like to try that with brute force, or with your SSE-enhanced rasterizer?

Occlusion queries aren’t the answer to all the world’s problems, but they work really well provided that you use them appropriately.

– Tom[/b]
Ok, at the risk of derailing the thread:
The data was a scene consisting of about 500,000 triangles in total, broken into about 400 models, each of which contained a hierarchy of meshes. Each of these was rendered as an indexed VBO. I used the model and mesh objects’ OOBBs as occludees, and the scene was sorted from front to back using a radix sort.

My initial approach was to traverse the scene, rendering each object after determining its visibility. This was pretty appalling, as you might expect, so, following the guidelines in an NVIDIA paper on the subject, I tried sending the occludees through in batches, running increasingly large numbers of queries at once. This improved performance a bit, but it made the querying less accurate; brute force was still more effective. I tested this on a Radeon 9700 Pro and a Radeon Mobility 9600 Pro.
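(For the curious, the batched version was roughly the following shape. This is a sketch of the general idea rather than my actual code; the draw helpers and the batch sizing are placeholders, GL 1.5 names.)

```c
#include <GL/gl.h>
#include <GL/glext.h>   /* GL 1.5 query entry points assumed loaded */

extern void draw_obb(int i);     /* placeholder: draws an object's OOBB */
extern void draw_model(int i);   /* placeholder: draws the real geometry */

/* Issue a whole batch of bounding-box queries before reading any results,
   so the later queries have time to finish while the earlier results are
   being fetched. */
void draw_batch(const int *objects, int count, const GLuint *queries)
{
    int i;
    GLuint samples;

    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_FALSE);
    for (i = 0; i < count; ++i) {
        glBeginQuery(GL_SAMPLES_PASSED, queries[i]);
        draw_obb(objects[i]);
        glEndQuery(GL_SAMPLES_PASSED);
    }
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_TRUE);

    for (i = 0; i < count; ++i) {
        glGetQueryObjectuiv(queries[i], GL_QUERY_RESULT, &samples);
        if (samples > 0)
            draw_model(objects[i]);
    }
}
```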

The software rasteriser, OTOH, takes a few shortcuts when selecting occluders, and it works pretty well. Typically I get about a 100% increase in frame rate. The nice thing is, I can run queries in software while rendering in hardware at the same time, and every object can be occlusion-checked before rendering.

I might give hardware occlusion querying another try at some point, but I’m not desperately keen to do so given the experience so far. Still, if you have any tips on how to make it viable then I’d be interested to hear them.

Things will change: you’ll have vastly more read bandwidth (and some more write). You also won’t be restricted by the limited size of the GART for DMA transfers, so I think getting data to the card efficiently will be simpler, and larger databases will be able to operate efficiently.

From everything I’ve read, PCI Express will allow a maximum of 4GB/s each way (both up and down, 8GB/s combined). However, the real question is whether the mobo makers will all use the 16x parts. They come in 1x, 4x… and so on, each a bit longer than the last. I assume that since more pins/traces are involved, mobo makers will want to cut costs and opt for the slower parts.

I don’t see PCI-express making a big splash until at least 2006. Would be nice to have & play with though. :slight_smile:

Why are you talking about read bandwidth and occlusion culling in the same paragraphs?

Occlusion culling requires read-back of 4 bytes of data. FOUR BYTES! That’s a single bus cycle (well, two), and speeding the transfer up just won’t help at all. Even if you do 100 of those per frame, at 100 frames per second, 100 bus cycles EVEN ON PCI is drowned out in the noise, even with a 64-cycle latency timer. Read-back bandwidth has NOTHING to do with the performance of occlusion querying.

It sounds to me as if what you’re doing is drawing a proxy in your scene, then reading back the result, then deciding whether to draw the real thing. That’s not the right way to get asynchronicity for occlusion querying. To get proper asynchronicity, you should design your code so that, during the current frame, you read back the results you queued the previous frame. If you need the answer during the same frame, you’re likely to do better using some other mechanism.

If you read back occlusion query results during the same frame as you queued them, and get slow results, this is not a fault of occlusion querying, and not a fault of read-back speed; it’s a mis-use of the API. I think that’s what Tom was getting at in his reply.
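In code, the pattern I mean looks roughly like this: one query per object, issued while drawing frame N and only read back in frame N+1, so the answer is always a frame stale but never forces a stall. (A sketch with GL 1.5 names; the object count and draw helpers are placeholders.)

```c
#include <GL/gl.h>
#include <GL/glext.h>   /* GL 1.5 query entry points assumed loaded */

#define NUM_OBJECTS 400            /* illustrative */

static GLuint queries[NUM_OBJECTS];   /* filled by glGenQueries() at startup */
static int    have_results = 0;       /* nothing to read back on the first frame */

extern void draw_object(int i);        /* placeholder helpers */
extern void draw_bounding_box(int i);

void draw_frame(void)
{
    int i;
    GLuint samples;

    for (i = 0; i < NUM_OBJECTS; ++i) {
        /* Consume LAST frame's result. By now the GPU finished it long ago,
           so this read-back does not serialise the pipeline. */
        samples = 1;                   /* draw everything on the first frame */
        if (have_results)
            glGetQueryObjectuiv(queries[i], GL_QUERY_RESULT, &samples);

        if (samples > 0)
            draw_object(i);            /* verdict is one frame stale */

        /* Queue THIS frame's query for the next frame to consume: a cheap
           bounding-box proxy with colour and depth writes disabled. */
        glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
        glDepthMask(GL_FALSE);
        glBeginQuery(GL_SAMPLES_PASSED, queries[i]);
        draw_bounding_box(i);
        glEndQuery(GL_SAMPLES_PASSED);
        glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
        glDepthMask(GL_TRUE);
    }
    have_results = 1;
}
```

The cost is that a newly revealed object can pop in a frame late; that’s usually an acceptable trade for not serialising the pipeline.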

Originally posted by Elixer:
[b]From everything I’ve read, PCI Express will allow a maximum of 4GB/s each way (both up and down, 8GB/s combined). However, the real question is whether the mobo makers will all use the 16x parts. They come in 1x, 4x… and so on, each a bit longer than the last. I assume that since more pins/traces are involved, mobo makers will want to cut costs and opt for the slower parts.

I don’t see PCI-express making a big splash until at least 2006. Would be nice to have & play with though. :slight_smile: [/b]
As I guess, the cards from NVIDIA and ATI, at least the high-end ones, will only fit into 16x slots, which means, yes, there will be at least one 16x slot on every mobo that wants to call itself gamer-aware.

Oh, and about the usefulness of PCI Express: just wait and see. Wherever a bottleneck gets lifted, someone finds a way to (ab? :smiley: )use it for interesting things.

While occlusion queries don’t require significant read-back or “upstream” bandwidth, the AGP bus is still unidirectional. Switching from downstream vertex hoovering to sending a tiny four-byte occlusion query result still sounds like an inefficient stall to me. Of course, all this assumes you’ve hidden the latency of the query properly, since forcing GPU-CPU synchronisation will overshadow this comparatively tiny performance hit by far.

I’m not a hardware engineer or a driver developer though, so I’m guessing and extrapolating here.

Originally posted by jwatte:
Why are you talking about read bandwidth and occlusion culling in the same paragraphs?..
Excuse me?? I didn’t bring up bandwidth at all; you did. I was talking about the fact that the AGP bus doesn’t allow data to travel in both directions simultaneously, which means that all downstream traffic has to cease when you want to read data back.