Direct access to texture memory & drawing buffers

I think it would be very useful to have an OpenGL command that returns a pointer to the address of the front and back buffers, as well as to a specific texture.
This way you could alter the contents of these memory areas with DirectDraw on the PC or QuickDraw on the Mac without having to move memory blocks around.
In the end, all of this is just video memory.

This is a bad idea, as it would lock the hardware vendors into a certain memory mapping. Lost freedom => fewer optimization possibilities.
Direct memory access is going to be removed from DX too. From DX8 on it won't be supported anymore.

I beg to disagree. For instance, let's assume we want to texture-map HDTV video. With OpenGL, the standard technique is to use glTexSubImage2D to introduce the HDTV image into the texture. Unfortunately, the overhead of glTexSubImage2D can be very large and strongly reduces performance.
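
A minimal sketch of the path I mean (videoTex and frameData are just placeholder names; because of the power-of-two restriction the frame has to be sub-loaded into a larger texture):

// standard OpenGL path: re-upload every decoded frame into an existing texture
glBindTexture(GL_TEXTURE_2D, videoTex);                // e.g. a 2048x2048 GL_RGB8 texture created earlier
glTexSubImage2D(GL_TEXTURE_2D, 0,                      // level 0, no mipmaps
                0, 0, 1920, 1080,                      // replace the 1920x1080 video region
                GL_RGB, GL_UNSIGNED_BYTE, frameData);  // frameData comes from the video decoder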

I think that we definitely need a “backdoor” into the graphics memory, as it can drastically improve performance with streaming textures or very large textures. We're talking about an order of magnitude speed difference between the standard OpenGL path and direct texture access!

I'm with Humus on this one. Front buffer, MAYBE, though I suspect it'll raise merry hell with multi-frame pipelining. Backbuffer, no way. There was an interesting Carmack post on /. a while back, making a good argument for using blits for page swaps, and one of the reasons was that a linear address space was NOT necessarily the best format for a backbuffer.

I agree that a faster way of putting data from other video sources would be useful, but this should be an API call hiding the implementation, not exposing a raw pointer. I’d be very surprised if the Khronos OpenML initiative didn’t include something like this as an OpenGL ext.

I think you guys are right that this opens up an endless number of new things to take care of (memory mapping schemes, API implementations, etc.), but consider that the only thing I'm asking for is a way to know where in video memory my data is. I'm not asking for any particular ordering or structure of video memory, because I think that's the easiest part to deal with (pointer math).

This could be a feature that maybe not everybody needs, but on the other hand it's very easy to implement and isn't going to affect any other OpenGL functionality.

Just a pointer to the data, that's all I need!

In the same way that we search the internet for sample code and docs about advanced rendering techniques, we could search for info about the specific way a video card stores our data (if we need to).

Sure, this is not good for games or for other applications designed to run on a million different hardware configurations, but I'm sure there are a lot of people out there who are using OpenGL for very specific projects other than games, with very specific hardware configurations other than the standard consumer PC.

If the development of an API is always going to be tied to what the average user needs, then it is going to move very slowly.

As kz says, a “backdoor” is definitely needed.

There are a lot of very good reasons to not do this. The first and simplest is that it breaks the pipelined model for 3D – once you start writing directly to video memory, you have to sync with the chip to make sure you don’t write on top of memory it’s using.

This feature has caused so many problems in D3D that DX8 has eliminated it.

  • Matt

I think mcraighead's opinion is shared by most card vendors - unfortunately. I've had e-mail exchanges with various vendors trying to convince them of the need for direct texture access, and didn't get anywhere except with one vendor who provided the direct texture access we needed. As a result, our application runs much faster on an 18-month-old graphics card than on any consumer card currently on the market.

Of course, SGI's O2 also provided direct texture access via the dmbuffer mechanism, and I wouldn't be surprised if OpenML proposed a similar extension.

Frankly, I'm surprised at the vendors' resistance to direct texture access (“don't touch the hardware, we'll do a better job”) while practice proves otherwise. At the same time, vendors are introducing proprietary OpenGL extensions that are very low-level and address specific hardware issues (e.g. NVIDIA's vertex-in-AGP-memory and fence extensions).

Also note that direct memory access doesn't necessarily imply that vendors would be locked into a specific memory layout. Actually, it would be fairly simple to standardize a mechanism that would allow the application to inquire how the internal graphics memory layout is organized; the application could then generate the texture data in the vendor's proprietary memory layout directly, thus bypassing the terribly expensive glTexSubImage2D.

Come on, guys. Give me that memory pointer,
I’ll put it to good use.

[This message has been edited by kz (edited 11-10-2000).]

The short answer is, “no”.

The long answer…

The fact that DX8 has eliminated this feature really casts doubt on it. It has caused an absolutely huge number of problems in D3D!

I see two major areas where people might be asking for this access:

  • framebuffer access
  • texture access

For framebuffer access, use DrawPixels and ReadPixels. Both are fast on GeForce. If it’s not fast, we can optimize it; there is no theoretical reason it would need to be slow, certainly.
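
Roughly (made-up variable names, just to show the calls I mean):

// write a block of pixels straight into the current draw buffer
glRasterPos2i(x, y);                                          // window position for the block
glDrawPixels(width, height, GL_BGRA, GL_UNSIGNED_BYTE, src);  // src is ordinary app memory

// read a block back out of the current read buffer
glReadPixels(x, y, width, height, GL_BGRA, GL_UNSIGNED_BYTE, dst);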

For texture access, we are working on ways that texture downloads can be cheaper. There is no inherent limitation that causes TexSubImage to offer poorer performance than directly writing to video memory. There are Windows platform restrictions that make doing this correctly very difficult.

Once we offer video memory pointers to apps, Bad Things can happen quickly. We have to sync the hardware before any such pointer is usable, which kills performance. We have to take some kind of system-wide mutex so that apps don’t stomp on top of each other if we decide to reorganize video memory.

We cannot give you a pointer to the start of your framebuffer without taking a system-wide OS mutex. What happens if you move the window? The part of the framebuffer used by your app moves with it. That means we have to take a mutex that prevents any window events from occurring. In turn, this means that if you take the lock and never release it, the system will hang. Even if you take it for, say, a second, the system will suddenly become very unresponsive to input. NT does a better job than 9x here, but not good enough for us to trust apps. In fact, in certain ways, it is worse on NT, to the point where it may not be safe to do this at all.

Finally, direct writes to video memory are actually not fast on most PC platforms today. In fact, this is the “Fast Writes” feature that some of you have heard about. Without that feature, writing to video memory directly is much slower than writing to AGP and then pulling from AGP (which only the driver is in a position to do). Even where Fast Writes are implemented, in many cases there are motherboard and chipset bugs that break things pretty quickly. Also, they only work for sequential writes (just like AGP write combining), and apps that read from video memory directly (don't laugh, lots of old [and new] DirectX apps do this) get absolutely horrendous performance, since CPU readbacks over the AGP bus are absolutely disgustingly slow and video memory is uncached.

The reason this works for vertex array range is that video memory vertices are best reserved for static vertex data. In fact, we specifically recommend AGP instead for dynamic data. Also, the synchronization hazards for vertex data are much simpler than those for framebuffer data: vertex data is read only, we have spun off the synchronization problem to the app (NV_fence), and there is no way that vertex data can get asynchronously relocated like framebuffer memory can.
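
A rough sketch of how that combination is used (the extension entry points come from wglGetProcAddress; bufSize, vertexCount, and writeVertices are placeholders, and error handling is omitted):

// one-time setup: put the vertex array range in AGP memory
GLuint fence;
glGenFencesNV(1, &fence);
glSetFenceNV(fence, GL_ALL_COMPLETED_NV);                        // set once so the first wait returns immediately
void *agpMem = wglAllocateMemoryNV(bufSize, 0.0f, 0.0f, 0.5f);   // priority < 1.0 requests AGP, not video memory
glVertexArrayRangeNV(bufSize, agpMem);
glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);

// per frame: the app, not the driver, handles the synchronization
glFinishFenceNV(fence);                      // wait until the HW has consumed last frame's vertices
writeVertices(agpMem);                       // app writes the new dynamic vertex data
glVertexPointer(3, GL_FLOAT, 0, agpMem);
glDrawArrays(GL_TRIANGLES, 0, vertexCount);
glSetFenceNV(fence, GL_ALL_COMPLETED_NV);    // fence signals once these vertices have been read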

There are genuine problems with the current situation, and we are working to solve them, but unfortunately there are platform limitations and there’s only so much time in a day. Furthermore, none of these limitations are inherent OpenGL limitations.

Offering these pointers (either FB or texture) to apps opens up the biggest Pandora’s Box in all of graphics programming. Microsoft did it, and they regretted that decision for years. I refuse to make that mistake again with OpenGL.

  • Matt

Thanks for the response, Matt.

> The fact that DX8 has eliminated this feature really casts doubt on it.
I never used DX8, but we shouldn’t confuse “feature” and “implementation”.

> For texture access, we are working on ways that texture downloads can be cheaper.
Promises, promises :wink: At least it's good to hear that you are working on it.

> There is no inherent limitation that causes TexSubImage to offer poorer performance than directly writing to video memory. There are Windows platform restrictions that make doing this correctly very difficult.
Typically, the user would load a texture with a linear memory layout using glTexSubImage2D; glTexSubImage2D typically makes a copy of the texture and usually does some reordering of texels to match the internal memory layout. From a developer's point of view, even with reduced overhead for glTexSubImage2D it doesn't make sense to first have to stream the texture into main memory and then have OpenGL copy/reformat it again. I'd rather stream the texture in the proprietary memory layout directly onto the card.

> We have to take some kind of system-wide mutex so that apps don't stomp on top of each other if we decide to reorganize video memory.
What's wrong with a glTexLock() function, assuming a glBindTexture of a resident texture?
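
Something along these lines (purely hypothetical, of course; glTexLock/glTexUnlock don't exist, only glBindTexture is real GL here):

glBindTexture(GL_TEXTURE_2D, texId);
void *texels = glTexLock();    // hypothetical: driver syncs, pins the resident texture, returns its memory
writeTexels(texels);           // app writes texels in the card's native layout
glTexUnlock();                 // hypothetical: driver is free to move/reorganize the texture again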

> We cannot give you a pointer to the start of your framebuffer without taking a system-wide OS mutex.
Well, not my problem :wink: I'm only interested in texture.

> Finally, direct writes to video memory are actually not fast on most PC platforms today.
I get very decent performance with AGP 2x without fast writes. Of course, it required careful optimization (non-temporal writes), as direct texture access is a very low-level feature. But just like with C++, by empowering users you also give them the rope to hang themselves with…

> Offering these pointers (either FB or texture) to apps opens up the biggest Pandora's Box in all of graphics programming. Microsoft did it, and they regretted that decision for years. I refuse to make that mistake again with OpenGL.

Hmmm. I think you should pay more attention to your customers here. 30+ million triangles and fantastic fill rates simply don't mean squat to me if I can't move texture fast enough onto the card. The current glTexSubImage2D speed is too low; either optimize the driver much more, provide alternative approaches (count me in for the beta test), or give me that pointer - your competition could do it, why can't you? :wink:

Well, the number of reads and writes varies based on several factors, but the “traditional” TexSubImage data flow is as follows:

App reads data off disk
App writes data into buffer
Driver reads data out of buffer
Driver writes data into internal buffer
Driver reads data out of internal buffer
Driver sends data to HW (somehow or another)

Note that “Driver sends data to HW” may or may not involve direct writes to video memory. That’s one way to implement it, but not the only way.

That internal buffer is a result of a Windows platform limitation. We’d like to eliminate it, but it may be a long-term goal. This would eliminate a read and write.

You can already get noticeable speedups in many cases by matching your data format with the HW data format. Here are our optimal matchups for internal format, format, and type:

GL_RGB5: GL_UNSIGNED_SHORT_5_6_5/GL_RGB
GL_RGB8: GL_UNSIGNED_BYTE/GL_BGRA (avoid 3-byte data types, even at the cost of padding)
GL_RGBA4: GL_UNSIGNED_SHORT_4_4_4_4_REV/GL_BGRA
GL_RGBA8: GL_UNSIGNED_BYTE/GL_BGRA
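
For example, a GL_RGBA8 download along the fast path would look roughly like this (texId and bgraPixels are placeholder names):

// allocate once with the matching internal format
glBindTexture(GL_TEXTURE_2D, texId);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, 256, 256, 0,
             GL_BGRA, GL_UNSIGNED_BYTE, NULL);

// per update: BGRA/UNSIGNED_BYTE matches the HW layout, so no per-texel conversion
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 256, 256,
                GL_BGRA, GL_UNSIGNED_BYTE, bgraPixels);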

Another obvious speedup is texture compression, if you can use it. It helps in every step of the download process, assuming of course that you have the texture stored in compressed format on disk.
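
If the driver exposes ARB_texture_compression and EXT_texture_compression_s3tc, uploading a pre-compressed texture is just (imageSize being the size of the compressed blob from disk):

// compressed data goes to the HW as-is, so every copy in the chain is smaller
glCompressedTexImage2DARB(GL_TEXTURE_2D, 0, GL_COMPRESSED_RGBA_S3TC_DXT5_EXT,
                          256, 256, 0, imageSize, compressedData);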

Another likely speedup (I haven’t ever tried it, but it should work) is to use a file mapping and to pass us the pointer to the file mapping instead of reading from disk yourself. That saves a temporary buffer and a copy.
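
Roughly like this (Win32 calls, made-up file name, error handling omitted; the file is assumed to hold raw texels in the matching format):

HANDLE file    = CreateFile("texture.raw", GENERIC_READ, FILE_SHARE_READ,
                            NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
HANDLE mapping = CreateFileMapping(file, NULL, PAGE_READONLY, 0, 0, NULL);
void *texels   = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);

// the driver reads straight from the mapped file; no app-side staging buffer
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                GL_BGRA, GL_UNSIGNED_BYTE, texels);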

So if you did that, and we got rid of our temporary buffer, the dataflow would be:

Driver reads data off disk
Driver sends data to HW

No inefficient extra copies at all! So it can be done without any pointers at all. We could even put in prefetches that would make sure we were overlapping IDE and graphics.

Now, if only we could get graphics to DMA directly from IDE… hmmm…

[Actually, that might not be impossible, although it would require some heavy-duty kernel hacking, I’m guessing.]

  • Matt

Streaming texture from disk doesn't sound like high performance to me. Even a high-end RAID5 SCSI array will do about 100 MBytes/s, which about matches glTexSubImage2D.

For me, desired performance is at least 100 MTexels/s effective download speed on an AGP 4x system. If I can't get a pointer to local graphics memory, then at least give me a pointer to the intermediate buffer and let me write the texture in a format that matches the internal (tiled?) memory organization. Something like:

void* pMemory = glTexGetMemory();  // hypothetical call returning the intermediate buffer
myFillTexture(pMemory);            // app writes texels in the card's internal format
glBegin(GL_QUADS);                 // will initiate DMA

etc…

This simple approach might not allow for overlapping the DMA with writes to the next texture, which is desirable. Some guarantee about “active” textures not being moved around in main memory would be nice - it would allow for texture generation and texture download in different threads.

[This message has been edited by kz (edited 11-11-2000).]

Okay, so now I’m not clear what you’re trying to do. What does myFillTexture do? Where is this dynamic texture coming from?

  • Matt

Okay, let me make something totally, absolutely clear.

Writing texels directly to video memory is not necessarily the fastest way to download a texture; in general, it is NOT.

I can think of at least 4 ways to download textures, and it’s the slowest one, actually.

Second, we want to get rid of that very intermediate buffer you want to write into.

If we didn’t have the intermediate buffer, there would actually be no reasonable way for us to expose texture downloads any further than the way TexSubImage does today. To go any further would involve revealing reams of information about the internal organization of our driver and our HW.

And finally, if you're doing a video texture, the things that seem like the biggest restrictions in current OpenGL texturing are that textures must be powers of two and that you can't get a native, say, YUV (just an example, pick your favorite color space) texture.

If your dataflow model is to decode the video totally on the CPU and then to push the entire decompressed video to the HW, of course you’ll be in trouble… that’s why people do HW decoding.

Now, our HW decoding is only exposed through DirectX, but it’d probably be possible to link it up somehow. That seems like, in all likelihood, a more efficient model…

If you really want to push a video from your app, I would still just recommend using a compressed texture.

To be honest, these days we’re getting a lot more requests for fast CopyTexSubImage than for fast TexSubImage…

  • Matt

> Writing texels directly to video memory is not necessarily the fastest way to download a texture; in general, it is NOT.
On some systems it is currently…

> If we didn't have the intermediate buffer, there would actually be no reasonable way for us to expose texture downloads any further than the way TexSubImage does today. To go any further would involve revealing reams of information about the internal organization of our driver and our HW.

… which I wouldn’t mind at all :wink:

I'm not doing video; I just mentioned it as an example of why you would need fast access to texture. We're doing dynamic textures generated from user input, so there's no way we can precompute the textures (and compress them).

Even if we could reveal it, we wouldn’t because we want to be able to change our data structures and be flexible with future hardware. The last thing we want is for application code to directly lock us into specific HW features that we may want to drop so that we can save gates.

There is a fundamental problem with doing dynamic textures in real-time by writing the texels directly to video memory that I can’t see how any HW would get around. If the HW is running asynchronously, which any good HW is, then it is quite likely that 2 or 3 frames will be in the pipeline at any given time. If you overwrite the texture directly, you must wait for all those frames to finish, or else you’ll get obvious texture corruption problems.

If you’re generating the textures procedurally, as you seem to imply, why not use OpenGL to compute the textures? You can then take advantage of acceleration when computing them and then use a CopyTexSubImage to get your texture.
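
For example (drawProceduralPattern stands in for whatever GL drawing generates your texels):

// generate the texture with the GPU, then copy it from the framebuffer into the texture object
glViewport(0, 0, 256, 256);
drawProceduralPattern();                      // ordinary accelerated GL rendering into the back buffer
glBindTexture(GL_TEXTURE_2D, texId);
glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0,   // update level 0 at offset (0,0)
                    0, 0, 256, 256);          // from the lower-left 256x256 of the framebuffer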

And yes, we are actively working to make CopyTexSubImage much faster, as I’ve mentioned on the Advanced board…

  • Matt

I’ve proposed a possible solution to this problem in private email to kz, at least for the problem of dynamic textures. Hopefully something will come of it.

  • Matt

I'm just wondering: is CopyTexSubImage the right way to go, or is the DirectX model of being able to set the render target to a texture better? I'm wondering which option gives the most opportunity for optimization. CopyTexSubImage seems to guarantee that there will be a copy from the framebuffer to a texture, which could be a speed hit. DirectX's SetRenderTarget to a texture seems like it would decrease rendering speed greatly when writing to that texture (since the pipeline should be optimized for writing to the backbuffer, not a texture which might have an entirely different format in memory). So which API got it right?

Also, I heard non power of two textures mentioned. What are the costs of implementing this? The worst thing, it seems, would be the extra complexity introduced in mipmapping. Are more freely sized textures a possibility for the future?

Originally posted by timfoleysama:
Also, I heard non power of two textures mentioned. What are the costs of implementing this? The worst thing, it seems, would be the extra complexity introduced in mipmapping. Are more freely sized textures a possibility for the future?

Yes, some cards already support this, for example the Matrox G400.

I know this is starting to wander a bit OT, but does anybody know of any consumer hardware implementing texture borders? I’ve got an ugly problem texturing a sphere which can’t really be solved any other way.

Does non-power-of-2 texture support make texture borders trivial, or is it more involved than that?

Texture border support would be pretty much independent of non-power-of-two support, although if you had the right clamp modes I suppose you could probably get it to work.
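
For example, the GL 1.2 clamp-to-edge mode keeps the filter from sampling outside the image at the texture's edges:

glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);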

Non-power-of-two support in OpenGL is something we are definitely looking at. Our existing hardware supports it.

  • Matt