PCI Express - Is anything going to change?



Stephen_H
03-22-2004, 07:02 PM
Has anyone had any experience with this? I'm curious: is PCI Express going to change the way you do things when writing performance-critical game code in OpenGL?

Graphics cards will be using 16x PCI Express with upload/download bandwidths of 4GB a second... for getting data back from the card, that's about 30 times faster than over the PCI bus.

Because of all the extra bandwidth, I could see drivers keeping just one copy of the data and passing it off between the card and AGP memory as needed... is this what is going to happen?

DirectX takes a more hands-on approach to card memory management, but OpenGL is deliberately more ambiguous, allowing you to give hints while ultimately leaving it up to the driver. Is this a win or a loss when combined with PCI Express?

I'm just curious what everyone thinks... there hasn't really been a general discussion of the impact of PCIE on OpenGL yet.

MojoMagic
03-22-2004, 07:17 PM
I don't think *much* will change.

My understanding is that the increase in bandwidth is primarily in the downstream direction, i.e. from the card to the system.
I think this is now asynchronous (?) so it should not affect uploads... Can anyone confirm this?

I can mainly see a benefit in reading screen/pixel buffers back from the card and doing on-the-CPU type effects more easily. In short, video-editing-esque effects.

Stephen_H
03-22-2004, 07:22 PM
As far as I know, yes, there are 16 serial lanes at 250MB/s each leading to the graphics card, and 16 leading away from it, for a combined total of 4GB/s up and 4GB/s down. Up and down bandwidth is not shared in PCIE, unlike the downstream PCI bandwidth of current cards, which is shared among the other PCI devices.

Korval
03-22-2004, 07:40 PM
Basically, it will be much more reasonable to do render-to-texture effects and then do some CPU-based post-processing effects.

V-man
03-23-2004, 05:32 AM
As far as GL is concerned, the API won't change.
Look at the VBO spec: no mention of any particular technology, therefore it will be around forever.

The old glReadPixels and glGetTexImage will probably still perform badly on some cards.
As we all know, some drivers have an issue with these functions and the bus isn't to blame.

In summary
------------------
bad drivers = bad performance

don't blame the bus

bunny
03-23-2004, 07:14 AM
Hardware occlusion querying is currently rendered almost useless by the one-way nature of the AGP bus. I'd expect it to benefit a lot from PCI Express.

bunny
03-23-2004, 07:17 AM
Originally posted by V-man:
The old glReadPixels and glGetTexImage will probably still perform badly on some cards. As we all know, some drivers have an issue with these functions and the bus isn't to blame. In summary: bad drivers = bad performance; don't blame the bus.

The drivers may not be perfect, but the AGP bus just isn't optimised for reading back data, and therefore the fundamental problem lies with the bus. Bearing that in mind, why would you expect driver writers to bother optimising such commands, when it would be a waste of time?

jwatte
03-23-2004, 07:38 AM
Originally posted by bunny:
Hardware occlusion querying is currently rendered almost useless by the one-way nature of the AGP bus.

I don't think that's true. Hardware occlusion query only requires reading back a single word or two from the card. Read-back bandwidth just doesn't matter for it. I don't quite understand why you think it's useless as-is, though?

Hardware occlusion query, just like any other read-back operation, needs for all rendering up to the read point to finish before it can determine the answer. This serialization is a correctness requirement that won't change based on the bus type. Hardware occlusion is already nicer than, say, reading pixel data, because you can specify a point which you will later query, allowing the card to pipeline work more effectively than finishing the entire current pipeline before returning to you.
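For illustration, here's a minimal sketch of that pattern, assuming a GL 1.5 context (with ARB_occlusion_query you'd use the same calls with ARB suffixes); draw_bounding_box() and draw_scene_object() are hypothetical application helpers, and the query object is assumed to have been created earlier with glGenQueries:

#include <GL/gl.h>
#include <GL/glext.h>   /* query tokens on pre-1.5 headers */

extern void draw_bounding_box(int obj);   /* hypothetical helpers */
extern void draw_scene_object(int obj);

void test_and_draw(GLuint query, int obj)
{
    /* Issue the query around a cheap proxy (no colour/depth writes). */
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_FALSE);
    glBeginQuery(GL_SAMPLES_PASSED, query);
    draw_bounding_box(obj);
    glEndQuery(GL_SAMPLES_PASSED);
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_TRUE);

    /* ... submit other rendering here so the GPU has work to chew on ... */

    /* Fetch the answer later; this is the only call that can block. */
    GLuint samples = 0;
    glGetQueryObjectuiv(query, GL_QUERY_RESULT, &samples);
    if (samples > 0)
        draw_scene_object(obj);
}

The point is that only the final glGetQueryObjectuiv can stall, and the more work you submit between issuing the query and reading it, the less that stall costs.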

davepermen
03-23-2004, 07:47 AM
one solution is to parallelize the tasks on cpu and gpu at a much lower level.. meaning you give a small task to the gpu, do a small task on the cpu, and can immediately combine the result on the gpu or cpu, doesn't matter, and then continue. this parallelism is possible today, too, which is the reason why sometimes a fullscreen readback, as well as occlusion queries, don't hurt at all.

the trick is, if you fully exploit the parallelism, you will be able to use the much higher readback bandwidth for a lot of nice stuff.

then, the gpu will be more of a coprocessor than it is now, where it is mainly a "formatter" of your graphics data (with some readbacks..).

but programming parallel processors, that's another story :D

bunny
03-23-2004, 10:41 AM
Originally posted by jwatte:
I don't think that's true. Hardware occlusion query only requires reading back a single word or two from the card. Read-back bandwidth just doesn't matter for it.
It's not so much the bandwidth as the fact that the AGP bus only allows data to travel in one direction at a time. PCI Express allows data to flow both ways simultaneously; it's full-duplex. This can only be a good thing as far as occlusion querying is concerned.



Originally posted by jwatte:
I don't quite understand why you think it's useless as-is, though?

Hardware occlusion query, just like any other read-back operation, needs for all rendering up to the read point to finish before it can determine the answer. This serialization is a correctness requirement that won't change based on the bus type. Hardware occlusion is already nicer than, say, reading pixel data, because you can specify a point which you will later query, allowing the card to pipeline work more effectively than finishing the entire current pipeline before returning to you.

In practice, at least on the ATi chipsets I've used, even using a parallel approach is way too slow to be useful; I've found that the resulting frame rate is significantly lower than with brute force, even where ~80% of the geometry can be culled. I've subsequently found that doing the same thing in software (with SSE-enhanced rasterising) gives much better results. In my experience I'd say it isn't usable, and a lot of others have said the same thing. There may be some very specific cases in which it can improve performance, but as a general-case solution, it sucks, frankly.

Adrian
03-23-2004, 11:01 AM
Originally posted by bunny:
The drivers may not be perfect, but the AGP bus just isn't optimised for reading back data, and therefore the fundamental problem lies with the bus.
NVidia has a >150% performance advantage over ATI with readpixels. There is also no PDR equivalent for asynchronous readback on ATI hardware. That for me is far from perfect.

bunny
03-23-2004, 12:17 PM
Originally posted by Adrian:

Originally posted by bunny:
The drivers may not be perfect, but the AGP bus just isn't optimised for reading back data, and therefore the fundamental problem lies with the bus.
NVidia has a >150% performance advantage over ATI with readpixels. There is also no PDR equivalent for asynchronous readback on ATI hardware. That for me is far from perfect.

And where in my post did I disagree with that? My point is, why should ATi bother optimising it when the maximum performance is so limited by the bus architecture? It seems pointless, especially since most game developers would avoid glReadPixels like the plague anyway.

Also, last time I checked, glReadPixels wasn't too hot on my geforce 2 pro either, although admittedly, it's an old card and I haven't updated the drivers for 6 months. Perhaps the situation is different with newer cards and drivers?

Tom Nuydens
03-23-2004, 12:25 PM
Originally posted by bunny:
In practice, at least on the ATi chipsets I've used, even using a parallel approach is way too slow to be useful; I've found that the resulting frame rate is significantly lower than with brute force, even where ~80% of the geometry can be culled.

I would really like to hear about what kind of app/data you used to come to this conclusion.

I've seen a Radeon 9700 slice through data sets with upwards of 20 million triangles like butter, with no CPU-based culling involved whatsoever. Would you like to try that with brute force, or with your SSE-enhanced rasterizer?

Occlusion queries aren't the answer to all the world's problems, but they work really well provided that you use them appropriately.

-- Tom

bunny
03-23-2004, 01:32 PM
Originally posted by Tom Nuydens:

I would really like to hear about what kind of app/data you used to come to this conclusion.

I've seen a Radeon 9700 slice through data sets with upwards of 20 million triangles like butter, with no CPU-based culling involved whatsoever. Would you like to try that with brute force, or with your SSE-enhanced rasterizer?

Occlusion queries aren't the answer to all the world's problems, but they work really well provided that you use them appropriately.

-- Tom

OK, at the risk of derailing the thread:
The data was a scene consisting of about 500,000 triangles in total, broken into about 400 models, each of which contained a hierarchy of meshes. Each of these was rendered as an indexed VBO. I used the model and mesh objects' OOBBs as occludees, and the scene was sorted from front to back using a radix sort.

My initial approach was to traverse the scene, rendering each object after determining its visibility. This was pretty appalling, as you might expect, so following the guidelines in an NVIDIA paper on the subject, I tried sending the occludees through in batches, running increasingly large numbers of queries at once. This improved performance a bit, but it made the querying less accurate; brute force was still more effective. I tested this on a Radeon 9700 Pro and a Radeon Mobility 9600 Pro.

The software rasteriser, OTOH, takes a few shortcuts when selecting occluders, and it works pretty well. Typically I get about a 100% increase in frame rate. The nice thing is, I can run queries in software while at the same time rendering in hardware, and every object can be occlusion-checked before rendering.

I might give hardware occlusion querying another try at some point, but I'm not desperately keen to do so given the experience so far. Still, if you have any tips on how to make it viable then I'd be interested to hear them.

dorbie
03-23-2004, 01:56 PM
Things will change: you'll have vastly more read bandwidth (and some more write). You also won't be restricted by the limited size of the GART for DMA transfers, so I think getting data to the card efficiently will be simpler, and larger databases will be able to operate efficiently.

Elixer
03-23-2004, 08:01 PM
From everything I read, PCI Express will allow a max of 8Gb each way (both up & down). However, the real question is whether the mobo makers will all use the 16x parts. There are 1x, 4x and so on, and each is a bit longer than the last. I assume that since more pins/traces are involved, mobo makers will want to cut costs and opt for the slower parts.

I don't see PCI Express making a big splash until at least 2006. Would be nice to have & play with though. :)

jwatte
03-23-2004, 08:07 PM
Why are you talking about read bandwidth and occlusion culling in the same paragraphs?

Occlusion culling requires read-back of 4 bytes of data. FOUR BYTES! That's a single bus cycle (well, two) and speeding the transfer up just won't help at all. Even if you do 100 of those per frame, at 100 frames per second, 100 bus cycles EVEN ON PCI is drowned out in the noise. Even with a 64-cycle latency timer. Read-back bandwidth has NOTHING to do with performance of occlusion querying.

It sounds to me as if what you're doing is drawing a proxy in your scene, then reading back the result, then deciding whether to draw the real thing. That's not the right way to get asynchronicity for occlusion querying. To get proper asynchronicity, you should design your usage so that during the current frame you read the results you queued the previous frame. If you need the answer during the same frame, you're likely to do better using some other mechanism.

If you read back occlusion query results during the same frame as you queued them, and get slow results, this is not a fault of occlusion querying, and not a fault of read-back speed; it's a mis-use of the API. I think that's what Tom was getting at in his reply.
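To make that concrete, a rough sketch of the frame-delayed arrangement (same GL 1.5 query calls and hypothetical helpers as the earlier sketch; NUM_OBJECTS is an assumed scene size, and the first frame, where the previous set holds no results yet, is glossed over):

#define NUM_OBJECTS 256            /* assumed scene size */

extern void draw_bounding_box(int obj);
extern void draw_scene_object(int obj);

GLuint query_set[2][NUM_OBJECTS];  /* generated once with glGenQueries */
int    current_set = 0;

void render_frame(void)
{
    int prev = 1 - current_set;

    /* Use the answers queued LAST frame to decide what to draw now;
       if a result hasn't landed yet, err on the side of drawing. */
    for (int i = 0; i < NUM_OBJECTS; ++i) {
        GLuint samples = 1;
        GLint  ready = 0;
        glGetQueryObjectiv(query_set[prev][i], GL_QUERY_RESULT_AVAILABLE, &ready);
        if (ready)
            glGetQueryObjectuiv(query_set[prev][i], GL_QUERY_RESULT, &samples);
        if (samples)
            draw_scene_object(i);
    }

    /* Queue THIS frame's queries; they get read back next frame,
       so the CPU never sits waiting on the GPU. */
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_FALSE);
    for (int i = 0; i < NUM_OBJECTS; ++i) {
        glBeginQuery(GL_SAMPLES_PASSED, query_set[current_set][i]);
        draw_bounding_box(i);
        glEndQuery(GL_SAMPLES_PASSED);
    }
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_TRUE);

    current_set = prev;
}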

davepermen
03-23-2004, 08:37 PM
Originally posted by Elixer:
From everything I read, PCI Express will allow a max of 8Gb each way (both up & down). However, the real question is whether the mobo makers will all use the 16x parts.

I don't see PCI Express making a big splash until at least 2006. Would be nice to have & play with though. :)

as i guess the cards of nvidia and ati, at least the high end cards, will only fit into 16x ports, which means, yes, there will be at least one 16x on each mobo that wants to call itself gamer-aware.

oh, and about the usefulness of pci-ex: just wait and see. wherever a bottleneck is lifted, at least one or two people find some ways to (ab? :D )use it for interesting things.

harsman
03-24-2004, 12:04 AM
While occlusion queries don't require significant read-back or "upstream" bandwidth, the AGP bus is still unidirectional. Switching from downstream vertex hoovering to sending back a tiny four-byte occlusion query result still sounds like an inefficient stall to me. Of course, all this assumes you've hidden the latency of the query properly, since forcing GPU-CPU synchronisation will overshadow this comparatively tiny performance hit by far.

I'm not a hardware engineer or a driver developer though, so I'm guessing and extrapolating here.

bunny
03-24-2004, 02:54 AM
Originally posted by jwatte:
Why are you talking about read bandwidth and occlusion culling in the same paragraphs?...

Excuse me?? I didn't bring up bandwidth *at all*; you did. I was talking about the fact that the AGP bus doesn't allow data to travel in both directions simultaneously, which means that all downstream traffic has to cease when you want to read data back.

dorbie
03-24-2004, 02:01 PM
Graphics slots will be PCI Express 16X from the very beginning (at least on the GFX card side; I'd double-check the mobo you buy). FYI, it looks like NVIDIA's first-gen cards will have a bus bridge on the GFX card to support PCI Express, probably negating many of the benefits.

It looks like the other system PCI Express slots will be 1X on consumer boards. This is still an improvement over the PCI implementations on typical consumer mobos, and each slot has bandwidth independent of the traffic to other slots, so again that's an improvement.

jwatte
03-24-2004, 07:11 PM
@bunny: The bandwidth loss of "stopping the bus" is minimal compared to the big stall of reading the query results too early. That's what you've been saying, right? Reading back on the bus stops download bandwidth from being used for the few cycles it takes to read?

If you have big performance problems with occlusion query, like you suggest you have, then it's much more likely you're querying too early (or too often) rather than the small loss caused by the AGP reading back.

If you have small-ish performance problems, then it's likely that AGP bus bubbling is a problem; then you should find another solution to your querying need. I just reacted to your saying that occlusion query is "almost useless"; I don't think that's justified at all.

V-man
03-25-2004, 07:10 AM
Today's AGP 8x is a high performer, and not many games are maxing out the bus during gameplay.
Other types of software that need to continuously send large quantities of data to the card (like video replay) might benefit from PCI-Ex.

PCI-Ex 16x is nice and all but this is something for the future IMO. It's not a miracle solution.

And hearing that there will be at most one PCI-Ex 16x slot is not an issue for me.
Not many people will have 2 (or more) dual-head cards in their system for driving 4 (or more) monitors.

In fact, on another forum I'm on, one guy wants to have 4 monitors with a PCI-Ex system.

The drivers may not be perfect, but the AGP bus just isn't optimised for reading back data, and therefore the fundamental problem lies with the bus

No, the primary problem is the drivers since reading back a dumb block of memory should *NOT* be slow when you match formats.

The same goes for writing using glDrawPixels. It should not be so slow.
Why create a texture and render a billboard when you can use DrawPixels?
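To put a concrete face on "match formats": which layout is actually the card's native one varies per vendor and per framebuffer config, so GL_BGRA with GL_UNSIGNED_BYTE below is an assumption to verify against your own driver, not a rule:

#include <stdlib.h>
#include <GL/gl.h>
#include <GL/glext.h>   /* GL_BGRA on older headers */

void read_back_frame(int width, int height)
{
    unsigned char *pixels = malloc((size_t)width * height * 4);
    if (!pixels)
        return;

    /* Matches a typical 32-bit BGRA framebuffer, so the driver can
       copy straight out with no per-pixel conversion. */
    glPixelStorei(GL_PACK_ALIGNMENT, 4);
    glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, pixels);

    /* Asking for a converted layout instead, e.g.
       glReadPixels(0, 0, width, height, GL_RGB, GL_FLOAT, ...),
       is almost guaranteed to drop onto a slow software path. */

    free(pixels);
}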

bunny
03-25-2004, 01:01 PM
Originally posted by V-man:
No, the primary problem is the drivers since reading back a dumb block of memory should *NOT* be slow when you match formats. The same goes for writing using glDrawPixels. It should not be so slow. Why create a texture and render a billboard when you can use DrawPixels?

Note the use of the word "fundamental" in my post: a bottleneck in hardware is fundamental because it can't be worked around; a problem with drivers is easily remedied. I'd be surprised if the problem doesn't improve with PCI Express.

I agree about glDrawPixels though; there seems little reason for that being slow.

Korval
03-25-2004, 01:13 PM
No, the primary problem is the drivers since reading back a dumb block of memory should *NOT* be slow when you match formats.
So, explain precisely how the driver should fix the problem of the bus across which the data is being transferred being excruciatingly slow? Not to mention the fact that said bus does not allow bidirectional data transfer, so every glReadPixels call will provoke a glFinish().

This is not a driver problem.


I agree about glDrawPixels though; there seems little reason for that being slow.
My guess with glDrawPixels is that it can be implemented in 2 ways.

One is to directly write pixels to the framebuffer. This, among other things, requires a full glFinish. Also, this probably violates the GL spec because it probably says that per-fragment and post-fragment processing happen on glDrawPixels as well as other methods of rendering.

The other is that, to do a glDrawPixels, they have to internally create a texture (memory alloc. Not fast), load it with your data (memcopy. Also not fast), change a bunch of per-vertex state so that they can draw a quad (state change), draw the quad, and then change the state back (state change).

Ultimately, glDrawPixels is just not a good idea. Hardware's designed for drawing stripped triangles, not arbitrary bitmaps from main memory.

But glReadPixels performance definitely should improve, as long as there's nothing in the hardware itself (outside of the bus) that prevents it.

bunny
03-25-2004, 01:14 PM
jwatte: Like I said, I tried a number of things to reduce the number of queries. I was only able to make small gains. It's possible that by persisting further I could have improved it more, but the frame rate hit in the scene I was rendering was so bad that it just wasn't worth it. Bear in mind that the usefulness of OC depends entirely on the type of scene. In the scene I was rendering, much of the time OC wouldn't cull much of the geometry at all. SW rast just seems like a more robust solution for what I'm doing, and I'm certainly not alone in coming to that conclusion.

evanGLizr
03-25-2004, 06:31 PM
Originally posted by Korval:
So, explain precisely how the driver should fix the problem of the bus across which the data is being transferred being excruciatingly slow? Not to mention the fact that said bus does not allow bidirectional data transfer, so every glReadPixels call will provoke a glFinish().
The reason why glReadPixels implies a "glFinish()" is unrelated to the bidirectional data transfer.
If your graphics card has proper support for glReadPixels, the only reason the driver needs to sync the card ("glFinish()") is the synchronous nature of glReadPixels in the OpenGL spec: the application must have the data available when the glReadPixels call returns.

What is really necessary is an asynchronous glReadPixels, it's of little use to have the fastest readback bus in the world if you have to wait idle until the current rendering has finished and your data has returned to the CPU.

By pipelining glReadPixels calls, you should be able to hide most of your latencies.



This is not a driver problem.


I agree about glDrawPixels though; there seems little reason for that being slow.My guess with glDrawPixels is that it can be implemented in 2 ways.

One is to directly write pixels to the framebuffer. This, among other things, requires a full glFinish. Also, this probably violates the GL spec because it probably says that per-fragment and post-fragment processing happen on glDrawPixels as well as other methods of rendering.
Well that shouldn't be a problem, because a glDrawPixels is treated as a point wrt texture sampling and color interpolation, so the whole quad gets the same color & texel values.

Anyway, directly writing things to the framebuffer is *very* bad, and that's why a function like glDrawPixels - contrary to what you think - is good, because it abstracts the app from the underlying video memory layout and its use doesn't force a pipeline flush (unlike a buffer "lock").



The other is that, to do a glDrawPixels, they have to internally create a texture (memory alloc. Not fast), load it with your data (memcopy. Also not fast), change a bunch of per-vertex state so that they can draw a quad (state change), draw the quad, and then change the state back (state change).
There's a third method, which is that the graphics card supports DrawPixels natively, where the pixel data is supplied as fragment data, in the same way a graphics card supports "texture downloads" (those seem to be "texture uploads" for non-driver people).
glDrawPixels (or glReadPixels, for that matter) has never been a priority for consumer cards, which is why you don't find "fast" implementations of those, but I'm sure you can find them in workstation-class boards (DCC applications like Maya perform tons of glDrawPixels/glCopyPixels).

On the other hand, the second method doesn't need to be slow at all. You don't need to allocate the texture every time; you can use a scratch texture, or even a pool of them if you want to be able to pipeline glDrawPixels calls. Loading the texture with the data is a data transfer that you have to do anyway (even in the native-support case), and the state juggling & drawing a quad with that texture once it's in video memory is fast.
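For illustration, a sketch of that scratch-texture path (an assumption about how it could be done, not a claim about any particular driver): scratch_tex is assumed to be a TEX_W x TEX_H RGBA texture created once at startup with non-mipmapped filtering, and an orthographic projection mapping vertices to window pixels is assumed to be in place.

#define TEX_W 1024   /* assumed scratch texture size; w,h must fit inside it */
#define TEX_H 1024

void draw_pixels_via_texture(GLuint scratch_tex, int x, int y,
                             int w, int h, const void *rgba)
{
    /* Refill the persistent scratch texture: no per-call allocation. */
    glBindTexture(GL_TEXTURE_2D, scratch_tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                    GL_RGBA, GL_UNSIGNED_BYTE, rgba);

    /* Save the state we are about to trample, draw a textured quad
       at the requested window position, then put everything back. */
    glPushAttrib(GL_ENABLE_BIT | GL_TEXTURE_BIT);
    glEnable(GL_TEXTURE_2D);
    glDisable(GL_DEPTH_TEST);
    glDisable(GL_LIGHTING);
    glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_REPLACE);

    float s = (float)w / TEX_W;
    float t = (float)h / TEX_H;
    glBegin(GL_QUADS);
        glTexCoord2f(0.0f, 0.0f); glVertex2i(x,     y);
        glTexCoord2f(s,    0.0f); glVertex2i(x + w, y);
        glTexCoord2f(s,    t   ); glVertex2i(x + w, y + h);
        glTexCoord2f(0.0f, t   ); glVertex2i(x,     y + h);
    glEnd();

    glPopAttrib();
}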



Ultimately, glDrawPixels is just not a good idea. Hardware's designed for drawing stripped triangles, not arbitrary bitmaps from main memory.
I don't agree with that; in fact I believe that glDrawPixels is a great tool to avoid having to "lock" framebuffers around and guess what format things are really stored in, or forcing the hardware vendors to implement a given memory layout.

Korval
03-25-2004, 07:43 PM
The reason why glReadPixels implies a "glFinish()" is unrelated to the bidirectional data transfer.
It's true that glReadPixels has to synchronize for other reasons, but this is one reason. Not the only one, but one.


What is really necessary is an asynchronous glReadPixels
Which is what PBO is supposed to offer.
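For what it's worth, a sketch of the intended usage with a pixel-pack buffer, assuming ARB_pixel_buffer_object (or the EXT variant) plus the usual buffer-object entry points are exposed; consume_pixels() is a hypothetical callback, and the buffer is created once at startup:

GLuint readback_pbo;

void create_readback_pbo(int width, int height)
{
    glGenBuffersARB(1, &readback_pbo);
    glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, readback_pbo);
    glBufferDataARB(GL_PIXEL_PACK_BUFFER_ARB,
                    width * height * 4, NULL, GL_STREAM_READ_ARB);
    glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, 0);
}

/* Kick off the transfer: with a pack buffer bound, glReadPixels can
   return immediately and the copy happens behind the app's back. */
void start_readback(int width, int height)
{
    glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, readback_pbo);
    glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, 0);
    glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, 0);
}

/* Collect the data later (ideally a frame later); any remaining wait
   happens here rather than at the glReadPixels call itself. */
void finish_readback(void (*consume_pixels)(const void *))
{
    glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, readback_pbo);
    void *data = glMapBufferARB(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY_ARB);
    if (data) {
        consume_pixels(data);
        glUnmapBufferARB(GL_PIXEL_PACK_BUFFER_ARB);
    }
    glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, 0);
}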


There's a third method, which is that the graphics card supports DrawPixels natively, where the pixel data is supplied as fragment data, in the same way a graphics card supports "texture downloads" (those seem to be "texture uploads" for non-driver people).
You're talking about a graphics card where "fragment data" is something more than the result of per-vertex interpolation and scan conversion. I would imagine that, for most cards, this is simply not a reasonable way to build the card. Fragments are generated from the scan converter and interpolation units directly; there's no "backdoor" that can be used to insert a fragment.


You don't need to allocate the texture every time; you can use a scratch texture, or even a pool of them if you want to be able to pipeline glDrawPixels calls.
Or, you can simply not care and simplify your driver development. The 3 applications that really want to do glDrawPixels probably don't need to do them fast.


and the state juggling & drawing a quad with that texture once it's in video memory is fast.
Fast? You've just swapped out your vertex program, as well as all of its parameters, let alone any other state that isn't supposed to affect glDrawPixels operations. Plus, you have to put it back after the operation. State changing isn't slow just because it's the application asking for it. It is slow because of the stall bubbles that it provokes in various pieces of the pipeline.


I believe that glDrawPixels is a great tool to avoid having to "lock" framebuffers around and guess what format things are really stored in, or forcing the hardware vendors to implement a given memory layout.
The third option is, of course, to simply not do it. Don't do things where you need to do something that only glDrawPixels can do. Blit-from-memory operations are bad; that is why textures live in server memory, not client memory. glDrawPixels is just a bad idea. Any drawing method that requires any form of virtually direct access to the framebuffer (effectively writing pixels to the FB) like this is a bad idea and should be avoided at all costs.

wimmer
03-26-2004, 03:00 AM
To chime in on the occlusion query debate...

I found that the calls to occlusion query themselves take up a significant amount of time if done too often (independent of when or how you read back). So any strategy that reduces the number of queries is a good one...

Michael

Adrian
03-26-2004, 05:02 AM
Originally posted by bunny:


Also, last time I checked, glReadPixels wasn't too hot on my geforce 2 pro either, although admittedly, it's an old card and I haven't updated the drivers for 6 months. Perhaps the situation is different with newer cards and drivers?
I don't think readback performance has changed much with newer graphics hardware. The PDR extension makes a big difference though, because it allows readpixels to work asynchronously, so readpixels (in some scenarios) is almost free.

If we were talking about a 10% difference in performance then I would agree with you, but the difference is very large. If only one vendor has fast readback then it won't be used in games, which is why all vendors should at least try to be close in terms of performance/capability. Even in its currently slow state readpixels is useful. PDR and NV's hardware/drivers have made it much more viable.

Faster readback with PCI Express has the potential to radically change the way graphics engines are written, IMO.

Won
03-26-2004, 05:57 AM
Since occlusion queries are not bandwidth-bound, the only difference I can see between an AGP/PCI implementation and PCI-Express is lower latency due to less bus turnaround and synchronization. The rest is in the GPU and driver.

It would be nice if occlusion queries were more abstracted so you could get arbitrary rendering statistics about whatever part of the pipeline you wanted. Self-tuning engines would be interesting. Maybe you could also have fragment-shader-writable/queriable accumulators available as well, but that's definitely going off-topic...

The status of the pixel pack path is pretty annoying. NVIDIA supports asynchronous but only 64-bit PCI writes. 3Dlabs supports 4x AGP writes but only synchronously. ATI supports neither. Who knows; the driver can be pulling the data off the card one pixel at a time. Also, these implementations typically only have their fast paths for UNSIGNED_CHAR RGBA, BGRA, and sometimes RGB and BGR. That means you have to do all the pixel formatting yourself in software or in a fragment program.

However, all of these guys know that people are interested in optimizing glReadPixels; every time I mention it to developer relations, they claim that they've been hearing it more and more. Keep clamoring. I would be surprised if there WEREN'T a fast, asynchronous glReadPixels implementation available within a year. For 3Dlabs, it really is just a matter of driver development. Maybe ATI's new PCI-Express cards will have it out of the box. Of course, that doesn't really help those of us who need it now, so we compromise.

I bet we'll have fast glReadPixels on AGP as well. Say NVIDIA decides (when they make their PCI-Express-native cards) to implement fast bidirectional transfers; when they slap on their HSI bridge, you get fast AGP writes and glReadPixels.

An interesting thing about PCI-Express is the potential to have multiple fast-bus video cards operating in tandem. Of course, AGP 3.0 has this as well, but no one seems to care yet, so we'll see. It'll get to be a bigger issue when people really start performing computation on video cards, which requires fast glReadPixels to some extent.

-Won

dorbie
03-29-2004, 03:53 PM
The problem with optimizing glDrawPixels or glTexImage etc. is that the OpenGL spec allows all sorts of data types, format conversions, alignment and swizzle operations, including memory strides, offsets, etc. Even LUTs and convolutions are there in the imaging extensions. It is not straightforward. Sure, setting this up for a simple case should be fast & easy, but getting the kind of coverage to reliably support various formats and types (internal and external) with different memory alignments etc. is probably a huge pain in the ass. So you get fast coverage for some common stuff and some fallback code path for a lot of other stuff, unless it suddenly becomes more important than figuring out how to cheat at the next Futuremark benchmark. It is improving steadily (it seems to me); it used to really suck.
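To give a flavour of the combinatorics, here is one perfectly legal client request a driver has to get right; every one of these pixel-store knobs multiplies the number of paths. It assumes a projection where the raster position maps to window pixels; the parameters are all caller-supplied.

/* Draw one sub-rectangle out of a larger client-side 16-bit image,
   with byte swapping, tight packing, and row-length/skip addressing. */
void draw_subimage(int dst_x, int dst_y, int region_w, int region_h,
                   int image_w, int src_x, int src_y,
                   const unsigned short *big_image)
{
    glPixelStorei(GL_UNPACK_ROW_LENGTH,  image_w);  /* source rows longer than width */
    glPixelStorei(GL_UNPACK_SKIP_PIXELS, src_x);    /* sub-rectangle origin          */
    glPixelStorei(GL_UNPACK_SKIP_ROWS,   src_y);
    glPixelStorei(GL_UNPACK_ALIGNMENT,   1);        /* tightly packed rows           */
    glPixelStorei(GL_UNPACK_SWAP_BYTES,  GL_TRUE);  /* endian conversion             */

    glRasterPos2i(dst_x, dst_y);
    glDrawPixels(region_w, region_h, GL_RGBA, GL_UNSIGNED_SHORT, big_image);
}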

That's life. I'm still glad I can buy the fastest most programmable graphics on the planet for $500 at "Best Buy".