pBuffers vs renderbuffers (FBO) - memory usage



Mark Shaxted
02-12-2008, 04:52 AM
Would I be right in thinking that pbuffers & renderbuffers both exist in ONLY vidmem?

What I would like to do is utilise video memory, but WITHOUT affecting system ram. Is this possible?

An ideal solution would be volatile textures (which have no system ram counterpart), but this is not possible right now - all textures reside in both system & video memory. Are you listening ARB?

However, it occured to me that pBuffers & FBO renderbuffers (according to the spec) exist in pbuffer/framebuffer memory, and in the case of pbuffers ARE volatile in that they can be lost with a mode change. This implies that there is no copy kept in system memory.

Is my thinking above correct? Or am I way off the mark.



What I'm doing is basically a colour managed image editing app, which supports dual monitors. There will be an original image in memory, a series of adjustments, and a final image. The final image data then needs to be converted into each monitor's colour space before being displayed. Using GDI is simple - just convert on the fly when blitting to the screen. However, it would be ideal if I could use OpenGL to store the converted image data for each monitor - but without impacting system RAM. I could then also move the colour management to fragment shaders at a later stage.

Any thoughts or comments would be much appreciated.

Regards
Mark

zeoverlord
02-12-2008, 06:44 AM
You can never really be sure it does not affect system RAM, and OpenGL provides no apparent way to guarantee this (though I don't know about OpenGL 3.0). It may be the case that it doesn't when using renderbuffers and possibly p-buffers, though probably not texture buffers.
Either way it's always best to assume it does and work around that.
You could always write a test app.

akaTONE
02-12-2008, 12:16 PM
OSes these days have VRAM virtualized. Therefore, the drivers have to have a place to page off the pBuffer or renderbuffer when there is enough VRAM pressure to force an eviction. Drivers may implement this in different ways: one may pre-allocate the backing store, another may defer that allocation until the page-off is forced. Depending upon the design of the driver, objects may be pre-allocated if the backing store is going to be mapped into the task's address space. If there is no case where the object would benefit from, or even allow, direct access by the task, then its backing store allocation may be deferred until page-off. By having the backing store, one can also potentially get the fallback case of allocating the object in GART because VRAM is too full to allocate anything else. So, with modern PCs and OSes, there is almost no possibility for objects to ONLY exist in VRAM. Many clients may be using the GPU besides your GL app, so the driver has to be able to page off your objects to make room for the objects needed by Quake 4, modo, a window manager or whatever else.
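The deferred-backing-store scheme described above can be sketched as a toy model. Everything here (class and method names) is illustrative, not a real driver structure; the point is that the system-RAM copy only appears the first time VRAM pressure forces a page-off:

```python
class GpuObject:
    """Toy model of a driver object whose system-RAM backing store
    is allocated lazily, only when a page-off is actually forced."""

    def __init__(self, data):
        self.vram = bytearray(data)   # stand-in for the VRAM copy
        self.backing = None           # no system RAM used yet

    def page_off(self):
        """'Driver' evicts the object to make room for another client."""
        if self.backing is None:      # deferred allocation happens here
            self.backing = bytearray(len(self.vram))
        self.backing[:] = self.vram
        self.vram = None

    def page_in(self):
        """Restore the object to VRAM from its backing store."""
        self.vram = bytearray(self.backing)
```

Until the first page-off the object costs no system RAM; after it, the backing store persists, which matches the "pre-allocate vs defer" trade-off described above.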

arekkusu
02-12-2008, 12:21 PM
An ideal solution would be volatile textures (which have no system ram counterpart), but this is not possible right now - all textures reside in both system & video memory. Are you listening ARB?

We're listening. (http://developer.apple.com/graphicsimaging/opengl/extensions/apple_object_purgeability.html)

Korval
02-12-2008, 12:25 PM
OSes these days have VRAM virtualized.

Um, which OS's? Certainly not WinXP. So relying on VRAM virtualization is likely not practical for the near future (3 years or so).


We're listening.

Apple providing an extension doesn't quite amount to the ARB listening. However, last information on GL 3 suggested that they would offer a parameter for various GL objects that would allow you to select whether the object has a backing store or not.

Mark Shaxted
02-12-2008, 02:20 PM
However, last information on GL 3 suggested that they would offer a parameter for various GL objects that would allow you to select whether the object has a backing store or not.

I really hope so. I've been treading water, so to speak, with a commercial app - delaying it in the hope we'd have GL3 specs (and sample drivers) by now. It'd be very simple to specify GL3 for the optimum user experience, as opposed to GL2.1, or GL2 + extension, or GL1.x + this, that and the other. That's very difficult for a consumer to understand.

I agree about virtualising video ram too - vista is yet to gain critical mass, and even if it does you can switch off dwm (at least in fullscreen mode).

My real desire for video-memory-only objects, though, is that there are many machines out there running Vista with 1GB of RAM - and Vista takes over 600MB of it. These machines often have 512MB of video RAM, which is difficult to make use of without paying for it somewhere in the pagefile.

Mark Shaxted
02-12-2008, 02:23 PM
arekkusu... you work for apple then?

Korval
02-12-2008, 04:51 PM
you can switch off dwm

I doubt that this will turn off VRAM virtualization. That's a driver-level thing, and I don't think they force IHVs to write several versions of their drivers.

akaTONE
02-12-2008, 09:20 PM
OS X has had it since pretty much day one. Microsoft now has it as well in Vista.

V-man
02-13-2008, 09:53 AM
I don't see how they are going to keep a backup in RAM for FBOs without affecting performance. I'm pretty certain there is no backup in RAM; if the FBO is lost, the driver can recreate it, but the pixels would either be garbage or some specific values.

Probably, the only thing they keep in RAM is that "some FBO exists with so and so attachment" which probably consumes 100 bytes of RAM per FBO.

Mark Shaxted
02-13-2008, 10:55 AM
I don't see how they are going to keep a backup in RAM for FBOs without affecting performance. I'm pretty certain there is no backup in RAM; if the FBO is lost, the driver can recreate it, but the pixels would either be garbage or some specific values.

Probably, the only thing they keep in RAM is that "some FBO exists with so and so attachment" which probably consumes 100 bytes of RAM per FBO.

Well, my 'guess' is that any textures that are attached ARE stored in RAM (at least RAM is allocated to hold them if the driver needs to do some swapping), but renderbuffers (colour buffers) are not. This would tie in with pbuffers which are volatile in nature.

I think it boils down to the conceptual difference between framebuffer memory & texture memory. Textures are swappable, framebuffers aren't. Unless you virtualise VRAM...

However, all this is pure speculation on my part, and I may be well off the mark here.

As an aside, up above I mentioned turning off dwm. By doing this you ease pressure on VRAM - it's far less likely that vista will force the graphics driver to swap out MY image data for the OS's. And it's not like I expect someone to be playing quake while running my app, which is a photo management/editing package.


Mark

Mark Shaxted
02-13-2008, 10:58 AM
Just to clarify what I WANT to do. For example, part of my app allows the user to browse photo thumbnails, up to 252*168 in size. One thumbnail would take up ~165kb - multiply this by a thousand and that's a lot of memory, and system RAM will be precious. By using a volatile texture these thumbs would only impact VRAM, and take negligible CPU resources to display. Now, we know this isn't possible (yet?), but if renderbuffers are only in VRAM, I can DrawPixels into the backbuffer etc. The CPU will be doing a lot of other work, so anything I can offload will result in a significant performance increase overall.
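For reference, the arithmetic behind those thumbnail numbers, assuming 4 bytes per RGBA pixel (uncompressed):

```python
def thumbnail_bytes(width, height, bytes_per_pixel=4):
    """Uncompressed size of one RGBA thumbnail."""
    return width * height * bytes_per_pixel

one_thumb = thumbnail_bytes(252, 168)   # 169,344 bytes, ~165 KB
a_thousand = 1000 * one_thumb           # ~161 MB - a lot of system RAM
```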

arekkusu
02-13-2008, 05:24 PM
arekkusu... you work for apple then?

Yes.


Apple providing an extension doesn't quite amount to the ARB listening.

No, but we have to start somewhere. Remember the traditional path to get features promoted to the core?


I don't see how they are going to keep a backup in RAM for FBOs without affecting performance. I'm pretty certain there is no backup in RAM

On an OS with properly virtualized VRAM:

FBO attachments are just textures. They have to be virtualized like any other resource. Renderbuffers can be implemented as just another texture, albeit perhaps with different storage layout depending on the POT/NPOT stride/twiddle/mipmap capabilities of the renderer.

Surfaces in general (window, pbuffer, renderbuffer) need to be virtualized. If you dirty a surface and another process wants to use all available VRAM during its timeslice, you need to evict, then restore when the first process returns. If your performance tanks due to surface paging, your system's VRAM is overcommitted for the (set of) application(s) running on it.

"Purgeable" resources give an app the option to declare "don't page off my dirty surface, I'll recreate it as needed."
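The purgeable idea can be sketched with a toy object. The names below are illustrative (the real APPLE_object_purgeable extension exposes this through glObjectPurgeableAPPLE/glObjectUnpurgeableAPPLE): once marked volatile, the contents may be discarded under pressure instead of paged off, and on reuse the app must check whether they survived.

```python
class PurgeableSurface:
    """Toy purgeable resource: while volatile, the 'driver' may drop
    the contents under memory pressure instead of paging them off."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.volatile = False

    def mark_purgeable(self):
        """App-side: opt out of the backing-store guarantee."""
        self.volatile = True

    def memory_pressure(self):
        """Driver-side: discard volatile contents rather than evict."""
        if self.volatile:
            self.data = None

    def make_unpurgeable(self):
        """App-side: 'RETAINED' if the contents survived,
        'UNDEFINED' if the app must recreate them."""
        self.volatile = False
        return "RETAINED" if self.data is not None else "UNDEFINED"
```

If no pressure occurred while the surface was volatile, the contents come back intact; otherwise the app recreates them, which is exactly the contract Mark is after.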

Mark Shaxted
02-13-2008, 05:42 PM
arekkusu... Thanks for that. I only wish there was an ARB version - whether or not VRAM is virtualised (vista vs XP), it's the perfect solution. So I'll say again... are you listening ARB? :)

One more question. When you do page out VRAM, does it go into pre-allocated RAM, or does the OS allocate on demand? I'd guess that vista would be similar.

Sorry for asking seemingly windows specific questions, but they're not really - I'm just after some insight into driver memory management wrt fbo renderbuffers/pbuffers.

Many thanks
Mark

akaTONE
02-14-2008, 09:30 AM
It all depends on the type of object as to whether it is paged off to preallocated RAM. Mark, you say that you do not expect somebody to be playing a game while running your app, but the driver cannot be sure of this. What happens when the computer is put to sleep? Some things are turned off, and quite possibly one of those things is your video card. So, the driver will need to page off EVERYTHING so that when you open the laptop, everything pops back up the way it was. It pages everything back on and voila. For the app you are talking about, why could you not have a preallocated number of textures, enough to fill up the screen once or twice over, then TexSubImage into those as new images come into view that are not already in a texture?

The game Enemy Territory: Quake Wars uses this same concept for its whole MegaTexture system. Textures are allocated up front and then, as a new tile comes into use, a texture from the free list is given to that tile. The tile's data is then TexSubImage'd into the corresponding texture. As long as the video card is not facing too much VRAM pressure, paging will not occur. Your system memory usage is then also bounded by the number of textures that are preallocated. Just be sure to stay on the fast path for updating these textures so that any kind of orphaning does not come into play.
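That preallocated-pool idea looks something like this sketch. The names are illustrative; in real GL code each slot would be a texture object created up front and updated with glTexSubImage2D:

```python
class TexturePool:
    """Fixed pool of texture 'slots'. Tiles grab a slot from the free
    list as they come into view and give it back when they scroll out,
    so the memory footprint stays bounded by the pool size."""

    def __init__(self, num_slots):
        self.free = list(range(num_slots))
        self.tile_to_slot = {}

    def acquire(self, tile_id):
        if tile_id in self.tile_to_slot:      # tile already resident
            return self.tile_to_slot[tile_id]
        slot = self.free.pop()                # raises if pool exhausted
        # real code: glTexSubImage2D the tile's data into this slot
        self.tile_to_slot[tile_id] = slot
        return slot

    def release(self, tile_id):
        self.free.append(self.tile_to_slot.pop(tile_id))
```

A real implementation would also need a policy for choosing which resident tile to release when the free list runs dry.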

The one downside to the Apple extension is that it requires a flush every time you want to unpurge any object. This is again due to the nature of virtualized vram and nothing being definite until the kernel actually processes the commands.

V-man
02-14-2008, 01:04 PM
I don't see how they are going to keep a backup in RAM for FBOs without affecting performance. I'm pretty certain there is no backup in RAM

On an OS with properly virtualized VRAM:

FBO attachments are just textures. They have to be virtualized like any other resource. Renderbuffers can be implemented as just another texture, albeit perhaps with different storage layout depending on the POT/NPOT stride/twiddle/mipmap capabilities of the renderer.

Surfaces in general (window, pbuffer, renderbuffer) need to be virtualized. If you dirty a surface and another process wants to use all available VRAM during its timeslice, you need to evict, then restore when the first process returns. If your performance tanks due to surface paging, your system's VRAM is overcommitted for the (set of) application(s) running on it.

"Purgeable" resources allows an app the option to declare "don't page off my dirty surface, I'll recreate it as needed."



I thought virtualization meant that not all of the mipmap chain needs to be in VRAM in order to sample from some mipmap level, so I don't see the relevance to backing up an FBO to RAM.
If every time you render to your FBO and then unbind it, the driver has to copy the texture and the mipmaps back to RAM, that sounds like a performance loss.

I suppose it is the same issue with dynamic VBOs.
glMapBuffer is not likely giving you an address into VRAM.
When you unmap the buffer, the driver will later on copy to VRAM.

Korval
02-14-2008, 04:39 PM
I thought virtualization meant that not all of the mipmap chain needs to be in VRAM in order to sample from some mipmap level. So, I don't see the relevance with backing up a FBO to RAM.

The reason that you need a backing store for images and VBOs is because video memory is (on pre-Vista Windows OS's) volatile. That is, if your application no longer has input focus, you are given no guarantee that your stuff is still there.

Virtualizing VRAM means that you are guaranteed that your stuff will be there. So there's no need to keep a copy around in main memory.

skynet
02-14-2008, 04:59 PM
Virtualizing VRAM means that you are guaranteed that your stuff will be there. So there's no need to keep a copy around in main memory.

Question aside: Is there any hardware yet that offers truly virtualized vram? Hasn't this been a promise of DX10... I don't see it realized yet :-(

Korval
02-14-2008, 05:09 PM
Is there any hardware yet that offers truly virtualized vram?

What do you mean by "truly virtualized"? The kind of virtualization under discussion is an OS/Driver-level thing; it's got nothing to do with the hardware.

skynet
02-14-2008, 05:28 PM
I meant the hardware solution: the on-chip VRAM is just seen as a (small) cache while the application sees a virtual VRAM of, say, 16GB. The hardware transparently manages the fine-grained upload of missing pages (that have a small size, like 4-64KB or so), either from system RAM or disk.
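The bookkeeping that scheme implies is easy to put numbers on. As a sketch (not how any real GPU MMU is laid out), assuming the 16GB virtual space and 64KB pages mentioned above:

```python
PAGE_SIZE = 64 * 1024            # 64 KB pages (assumed)
VIRTUAL_VRAM = 16 * 1024 ** 3    # 16 GB virtual VRAM (assumed)

def page_index(address):
    """Which page a faulting address falls in."""
    return address // PAGE_SIZE

# Number of pages the hardware page table would have to track:
num_pages = VIRTUAL_VRAM // PAGE_SIZE   # 262,144 entries
```

Roughly a quarter-million page entries for residency tracking alone, before any of the fault-handling machinery akaTONE describes below.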

It would make software-side "virtual texturing" like Id's MegaTextures (and the successor system in Rage) much easier to implement, if not superfluous.

Korval
02-14-2008, 05:34 PM
I meant the hardware solution:

That's a "solution" to an entirely different problem. And neither ATi nor nVidia have "solved" it.

akaTONE
02-15-2008, 09:00 AM
In response to REAL virtualization.

Think about what happens when a page fault occurs on a CPU. The CPU has to suspend the currently running process in order to then call an interrupt procedure to load the required data from disk. Then once it has loaded the data, it can update the page table with the new address and reload the suspended process. Then continue along its merry way. A typical x86 CPU has 8-16 integer GPRs, 8-16 FP registers, and 8-16 SSE registers. This does not include all the additional control registers that need to be stored away.

For a GPU to have this same mechanism, you need a GPU that is able to suspend rendering mid-primitive because a tile is in a page with no present backing. Unlike a CPU though, the GPU has an arbitrary number of execution units performing vertex shaders, geometry shaders and pixel shaders. You also have all the fixed function stages in between them and at the ends of the pipeline, as well as the texture units fetching data and having control state of their own. What state do you save?

If you were to save all the state for say an R600, you'd have multiple megabytes worth of information that would need to be saved away and then reloaded after the data was either loaded from system memory or disk. However, you can get away with checkpointing and getting a minimal set of state to restart each portion of the pipe. This is still not a cheap operation as the GPU is not the one that can actually handle the page fault, the CPU has to handle all the behind the scenes magic since you do not want to keep your system memory locked down all the time.

With some new features in GPUs such as MEMEXPORT, just using the checkpointing system may not be possible since restarting the shader units could cause the memory ordering to be different since the shader engines have to replay everything up to the checkpoint before they start actually running new pixels, vertices, triangles, etc...

In most cases, if this is a serious app, there are not going to be that many other things needing to be processed by the GPU and so, there is not going to be another string of commands to fill this time while it is faulting. Then again, this may cause more apps to start using shared contexts so that each context is off doing its own thing so that if a page fault does occur on one context, there may be other work for the GPU to do while waiting. This may cause more of the OpenGL programmers of the world to have to start going through their own version of the multi-threaded programming transformation.

So:

1) the CPU still has to handle the page faults. Even if the GPU could transparently queue up transfers to/from system memory, the CPU has to be involved in finding the pages the GPU needs in system memory and locking them down. In the worst-case scenario, the CPU will also have to bring in the pages needed from disk. If you want the GPU to be able to do all this without CPU intervention, then you are going to have even higher system memory usage, because all objects' backing stores would need to be locked down in memory while they have commands pending. Currently, they are only locked down during the paging.

2) the amount of memory required to store the state of a GPU is much higher than that needed for a CPU.

3) what does the GPU do while waiting for the data to be paged in?

ZbuffeR
02-15-2008, 05:22 PM
Thanks AkaTONE, very interesting post.

After reading it, it seems to me that there are roughly two ways of doing useful hardware texture/buffer virtualization:
- transparently handled by the driver, expecting all buffers to be stored in CPU memory. A page fault on the GPU triggers the upload of the relevant block from RAM to VRAM.
- handled by the OpenGL application, by registering a callback for GPU page faults, so the app can upload the block from disk in parallel. The difficulty would be to split the texture into blocks suitable for each piece of hardware; it looks more complex than current texture formats (BGRA8, LUMINANCE_FP16, etc.) ...

And, in both cases, to answer your point 3) about keeping the GPU busy doing useful things while a new block is uploaded, take advantage of mipmapping. Like with clipmaps/megatexture/googleearth/etc: sample from a lower-res mipmap when the optimal mipmap block is not available.

Cache eviction would be LRU/LFU based, with a bias toward keeping low resolution mipmap levels. When fetching blocks, adjacent blocks (in 3 dimensions for 2D mipmapped textures) also become candidates for upload, to anticipate probable future needs.
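That biased eviction policy could look something like this toy cache (purely illustrative: score each resident block by recency plus a bonus for low-resolution mip levels, and evict the worst-scoring block first):

```python
class MipBiasedCache:
    """Toy LRU cache that prefers to keep low-res mipmap levels: a
    block's eviction score is its last-use time plus a bias that grows
    with mip level (higher level = lower resolution = keep longer)."""

    def __init__(self, capacity, bias=10):
        self.capacity = capacity
        self.bias = bias
        self.clock = 0
        self.blocks = {}   # (texture, mip_level, tile) -> last-use time

    def touch(self, block):
        """Record a use of the block, evicting first if the cache is full."""
        self.clock += 1
        if block not in self.blocks and len(self.blocks) >= self.capacity:
            self.evict()
        self.blocks[block] = self.clock

    def evict(self):
        # lowest score = least recently used, adjusted by the mip bias
        def score(block):
            _tex, level, _tile = block
            return self.blocks[block] + self.bias * level
        del self.blocks[min(self.blocks, key=score)]
```

With the bias in place, an old low-res block can outlive a more recently used high-res one, which is the behaviour wanted for always having *some* level to sample from.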


I am not enough into hardware to know if the above is even realistic, but at least it seems less complex than a full-featured CPU-like virtual memory.

Any comments ?

akaTONE
02-15-2008, 08:23 PM
Intriguing idea. I had never thought to register callbacks from an application to do this. However, the problem is that now you are going to have some really awkward message/signal routing mechanism so that you can get a signal that a buffer/texture/rendertarget faulted for a specific range/region/volume. Think what has to happen here. The GPU faults, saying it wanted to write to a specific address. Now the driver has to figure out what object that was associated with. Then it has to figure out what page and associated range/region/volume that address was in. This info then gets propagated from an interrupt to the application. The application then has to either procedurally generate the data, go through the file system to get the data, or copy the data from its own internal memory. Then the GPU has to queue up a transfer that will either do a straight copy OR do a blt that converts from the linear format you provide the data in to the layout the GPU uses for optimal cache reuse.

As for the idea of having the LOD being clamped by the levels that are present, that is a good idea. Now the question is, what component sets this clamping? Is it set on the texture fetch faulting? How does the sampler proceed once this clamp is set AFTER the LOD was already calculated and turned into real offsets? Is it the driver's responsibility to check that all of a texture is in memory? Since most drivers work on a coarse-grained level, this may not be so useful until fine-grained virtualization is supported.

Another issue is vertices in buffer objects. Buffer objects have no concept of level of detail. If a page from a buffer object faults, you are either going to have to wait for the data OR return some default value such as 0, 0, 0, 1. People can live with lower quality textures being used, but if you get incorrect geometry, then you start to have serious issues. You could reduce the size of the buffers needed by eventually using the tessellator in the latest ATI chips. You can get amazing levels of detail by generating the geometry on the fly ON the GPU. Other than that, there is no way to deal with that info missing.

Also, color buffers and depth buffers have no concept of level of detail either. They have no fallback but to wait for the data.

Unfortunately, the only solution besides the current per object virtualization is fine-grained per page. This is the only way you solve all the problems. But, you get the extra overhead of having to deal with page faults. If ATI/nVidia/Intel can get the cost of the page fault down, i.e. the amount of data that needs to be saved off, then this becomes more viable.

There is a HUGE AMOUNT OF WORK that goes on in the driver and hardware to make sure what you see on screen is what you want ... usually ;-)

Mark Shaxted
02-16-2008, 03:00 AM
There's an easier solution that works 'most' of the time. More VRAM :)

Currently we have:

Disk (pagefile) -> main memory -> bus -> 512mb DDR3 VRAM (for example)

Why not...

Disk -> RAM -> bus -> 4gb DDR2 VRAM -> 512mb DDR3 VRAM

Then we have the CPU responsible for virtualising between main memory & the larger 4gb visible VRAM, and the GPU responsible for virtualising between the low speed 4gb VRAM and the smaller high speed VRAM. The 4gb will be the GPU equivalent of an L3 cache. Now, you may say that cost is a factor - but today I can buy 4gb DDR2 at RETAIL prices for less than 50.

akaTONE
02-16-2008, 02:19 PM
The memory is the cheap part. Part of the reason the R600 was so big and expensive was that it had a 512-bit external bus. This required the actual chip package to be much larger than the die: you had to have a large enough area for all the extra pins to connect to the board. The successor GPUs cut the bus down to 256 bits, 128 bits and 64 bits as cost savings. The 512-bit bus was overkill anyway. Now if they implemented a second memory controller, they would be going back in the other direction in terms of pins and price.

Besides, there is no reason to do this at the card level when it could be done just as easily by allocating a gigantic chunk of system memory, wire it down and then point a range of GART at that block. Then, they could treat this area as another region of VRAM and no other driver component would need to worry about it. But, depending upon how a driver lays out the backing stores of objects, the driver could just use the backing store directly and not need this gigantic block of pseudo-VRAM.

Usually, if you have more system memory, the driver will be able to allocate more space for the GART. This extra memory, combined with the current virtualization schemes, should handle just about as many cases as re-architecting the memory hierarchy of a GPU.