NV30/r300: swap buffers by blit from float buffer?

I understand that I can’t actually scan out (display) from a floating point frame buffer. Thus, in the end, I have to render into a frame buffer in the traditional 888(8) format.

However, if I want double-buffering (or triple-), it seems weird to me that I’d need to first convert to 8888, and then copy that converted data to the front buffer, assuming I’m using block transfers for swapping (which I am, because I also use GDI child windows even in full-screen mode).

Is there a way to request that my floating point surface be used as the blit source when the time comes to blit to the front surface, and only keep the front surface in 8888 format? I can see how to do this if I don’t want to sync with vertical retrace, but it would be useful to also be able to do this while syncing, to avoid tearing in the cases where you care.

If you are willing to copy the float to the front buffer in order to SwapBuffers, then why not just request a SWAP_METHOD_EXCHANGE pixel format, blit to the back buffer, and then do a page flip? It would cost the same (except memory).
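For example, with WGL_ARB_pixel_format it might look something like this — a minimal sketch, assuming the extension is supported, that wglChoosePixelFormatARB has already been fetched via wglGetProcAddress, and that wglext.h (or your own token definitions) is available:

```c
#include <windows.h>
#include <GL/gl.h>
#include "wglext.h"   /* WGL_*_ARB tokens and PFNWGLCHOOSEPIXELFORMATARBPROC */

BOOL choose_exchange_format(HDC hdc, const PIXELFORMATDESCRIPTOR *pfd,
                            PFNWGLCHOOSEPIXELFORMATARBPROC wglChoosePixelFormatARB)
{
    int attribs[] = {
        WGL_DRAW_TO_WINDOW_ARB, GL_TRUE,
        WGL_SUPPORT_OPENGL_ARB, GL_TRUE,
        WGL_DOUBLE_BUFFER_ARB,  GL_TRUE,
        WGL_SWAP_METHOD_ARB,    WGL_SWAP_EXCHANGE_ARB, /* ask for page flipping */
        WGL_COLOR_BITS_ARB,     24,
        WGL_ALPHA_BITS_ARB,     8,
        0
    };
    int format;
    UINT count;

    /* Note: the attribute is a request, not a guarantee -- the driver
     * may still give you a copy swap. */
    if (!wglChoosePixelFormatARB(hdc, attribs, NULL, 1, &format, &count) || count == 0)
        return FALSE;
    return SetPixelFormat(hdc, format, pfd);
}
```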

Otherwise, if you are willing to deal with tearing, why not just request a single-buffered window to save memory, and then ‘blit’ the float to the front buffer?

All you really need is a way to wait for vsync, but if you do that then you are going to slow down your program a lot. Not necessarily because of the wait itself, but because you lose the ability to queue up commands while the GPU asynchronously waits to perform the swap, as you can with an ordinary SwapBuffers.

It would be nice to be able to request floating point RENDER_TO_WINDOW buffers, but I’ve never seen a pixel format with a different number of bits than the display also have the RENDER_TO_WINDOW bit set, even though it should be possible. Maybe it’s a limitation of Microsoft’s ICD mechanism.

Thanks for your suggestions, although as I already stated in my post, I cannot use flipping, because I have GDI child windows inside my OpenGL top-level window (and also often run in windowed mode). I also already stated that I don’t want to do an asynchronous blit to the front buffer, because of tearing.

It seems to me it should be perfectly possible for the driver to arrange things so that the float->int conversion happens during the swapbuffers blit, as long as I promise not to use that specific float buffer again before the blit has taken place (or the driver can just stall me until the buffer is free again).

I suppose, because nobody has replied with a “we’re doing that,” that I should send in a request to the ARB people (or at least some nVIDIA and ATI addresses).

Jon,

I’d say you actually don’t want what you are suggesting.

The basic model I’d envision for a simple “high dynamic range” app is as follows. Render to a single-buffered half_float (i.e. 64-bit) pbuffer with depth. Then set up a fancy shader that does your HDR effect, and bind that pbuffer as a texture, and draw to your window’s back buffer. Then, SwapBuffers.
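In code, that flow might look roughly like this — every helper below is a hypothetical stand-in for the WGL_ARB_pbuffer / WGL_ARB_render_texture / shader setup boilerplate, not a real API:

```c
#include <windows.h>
#include <GL/gl.h>

/* Hypothetical helpers standing in for pbuffer and shader boilerplate. */
extern void begin_pbuffer_rendering(void);   /* wglMakeCurrent to the pbuffer */
extern void end_pbuffer_rendering(void);     /* wglMakeCurrent back to the window */
extern void bind_pbuffer_as_texture(void);   /* e.g. wglBindTexImageARB */
extern void release_pbuffer_texture(void);   /* e.g. wglReleaseTexImageARB */
extern void bind_hdr_shader(void);           /* the tone-mapping fragment shader */
extern void draw_scene(void);
extern void draw_fullscreen_quad(void);

void draw_frame(HDC window_dc)
{
    /* 1. Render the scene in HDR into the single-buffered
     *    half-float (64-bit) pbuffer, with depth. */
    begin_pbuffer_rendering();
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    draw_scene();
    end_pbuffer_rendering();

    /* 2. Tone-map: draw a full-screen quad into the window's ordinary
     *    8888 back buffer, sampling the pbuffer as a texture. */
    bind_pbuffer_as_texture();
    bind_hdr_shader();
    draw_fullscreen_quad();
    release_pbuffer_texture();

    /* 3. Present. Flip or blit, no tearing either way. */
    SwapBuffers(window_dc);
}
```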

If we are flipping, this is no worse than going to the front buffer in performance, and it rules out tearing.

If we are blitting, there’s still a good reason to do this: again, tearing. Your “HDR blit” may well be a very slow operation. There is no way we can guarantee it running in less than a screen refresh.

On the other hand, with a regular blit, it is possible to guarantee this.

  • Matt

> If we are blitting, there’s still a good
> reason to do this: again, tearing.
> Your “HDR blit” may well be a very slow
> operation. There is no way we can
> guarantee it running in less than a screen
> refresh.

Scout’s honor: my HDR shader will be very efficient. So can I have this? :)

Actually, another thing that would be useful, and possibly simple to do in hardware, would be getting the max, min, average, and possibly variance of the color channel values out of a buffer you’ve drawn into (or, probably sufficient, over all fragments that have made it through all the tests). That way, you could do pretty reasonable shutter control in a very fast manner.

Now, this might not be in current hardware, but it ought to be added to new hardware (I view it as similar in complexity to the occlusion query support hardware).
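Just to make the request concrete, usage might look like this if the API were shaped like the occlusion query — every token and entry point below is invented for illustration; no such extension exists:

```c
#include <GL/gl.h>

/* Entirely hypothetical, sketched by analogy with occlusion queries. */
#define STATS_MIN_HYP  0x9001
#define STATS_MAX_HYP  0x9002
#define STATS_MEAN_HYP 0x9003

extern void glBeginFragmentStatsHYP(void);                        /* clears counters */
extern void glEndFragmentStatsHYP(void);
extern void glGetFragmentStatsHYP(GLenum stat, GLfloat rgba[4]);  /* may stall */
extern void draw_scene(void);
extern void set_shutter(const GLfloat avg[4], const GLfloat max[4]);

void render_with_shutter_control(void)
{
    GLfloat lo[4], hi[4], avg[4];

    glBeginFragmentStatsHYP();
    draw_scene();                 /* counters accumulate per surviving fragment */
    glEndFragmentStatsHYP();

    glGetFragmentStatsHYP(STATS_MIN_HYP,  lo);
    glGetFragmentStatsHYP(STATS_MAX_HYP,  hi);
    glGetFragmentStatsHYP(STATS_MEAN_HYP, avg);
    set_shutter(avg, hi);         /* feed next frame's exposure */
}
```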

It may be fast enough on hardware X, but will it be fast enough on hardware Y?

  • Matt

Min/Max is something that I really really want as well!

I was thinking something similar to mccraighead: you probably want to do your exposure pass as a last full-screen quad. This pass would also convert float to int. If you have to do this most of the time, then the ability to blit directly from a float buffer to the front buffer is not really useful (it would just convert your float to int).

I was thinking that min/max could be done with an extension of OpenGL 2’s pack/unpack language. It seemed to me that if you could have a global register that could be written to (and read back with a glGet), then you could use CopyPixels to a proxy target to do min/max.

Basically a program you can run per pixel, with a set of registers you can use to return values. The proxy target is so that you do not actually have to copy pixels somewhere to run the program (although you could if you want).

Okay, everyone seems hell-bent on forcing an extra copy pass onto the graphics pipeline. This means the only way to avoid that copy pass is to not have a back buffer and live with the tear always, justified by saying “well, sometimes there will be a tear, so let’s enforce it”. In an ideal world, there would be a tool to let me TRY to avoid the tear while still not taking another full copy hit; that’s what I’m asking for.

Nakoruru, what you’re suggesting seems very powerful, but is way more than I asked for. All I want is a counter (or a few counters) per render target, which accumulate/update for each fragment coming out the other side of the pixel shader, and which clear when you clear the render target’s color buffer, OR when you tell them to. Ideally, these are available as “global registers” to pixel shaders texturing out of that render target into some other render target. This could be done reasonably efficiently in hardware, without slowing down any pixel path (except possibly render target switching, and that not by much).

I’ve been wondering why there’s no such thing as global, or persistent read/write registers available to vertex/pixel programs.

I’m assuming that one of the reasons is that it makes life very difficult if you have multiple pixel/vertex pipelines.

Isn’t much of the increase in performance of modern hardware down to implementation of multiple independent vertex/pixel pipelines, which can operate in parallel?

If so you can’t guarantee that all pixels/vertices will pass through a given pipeline (unless you want to turn off a significant fraction of your hardware).

So you will either have multiple sets of these registers (one per pipeline), or a single set with all sorts of nice performance-obliterating synchronisation going on whenever they’re accessed.

nutball,

That is why I want global registers in a programmable pack/unpack hardware. My belief is that such hardware would not be very parallelizable to begin with.

It’s just a non-fixed-function generalization of what jwatte wants. It is what I came up with after I applied my principle that there should be few new fixed-function capabilities.

However, I think that this feature could be super useful while we wait for a general programmable method. The ability to get min/max would be very similar to the occlusion query. In fact, it could very well be called EXT_minmax_query :)

Actually, min/max would not be enough (and could be emulated by downsampling the screen, which you need for the overbright blurring afterwards anyway — two in one).

What you want is full statistics on how bright the screen ends up being… oh, you get that from the downsampling as well… hehe

dave: That’s why I wanted variance, as well.

nutball: I’m saying these globals are read-only, and reflect the status of the previous render target, or previous “counter hold” operation.

nakoruru: Yes, your suggestion is very powerful, but also very expensive. My suggestion could be handled while rendering, with no extra pass on the data necessary.

You could even think of an implementation that let you update statistics variables that are update-only (no read), and then something that locks these variables into read-only registers and clears the update-only versions. This could be built in an easily parallelizable fashion, and could even be useful for things like custom histograms.

Of course, at that point, you might get into multiple render targets, so what you do is examine the r/g/b of the outgoing fragment and add in some value at some location in a second render target (a.k.a. “table” or “histogram”). Except we don’t have blending in FP…

Well, as I need the downsamples anyway (they are very useful for fast filtering of the whole image…), I will not die if those statistics don’t make it in… I mean, most of the time (today) you don’t actually need them, so the counters would just sit there counting and counting for nothing… better to count afterwards…

BTW, how many texture samples can be done in a pixel shader again? 32 on the ATI? That’s an 8x4 box which we could sample down to one pixel… hm… the scaling down will be quite quick that way (and thanks to cache-coherent sampling it will be quick as well).

Dave,

What you mean is: render the image to a float buffer, bind it as a texture, render to a lower-res float buffer, and for each result pixel output the min/max of 8x4 pixels in the source texture, and do this until the image is small enough to be read with ReadPixels or you are at a 1x1 texture?

Michael

Yes, about that…

Sample 32 float values, sum them up, get max, average, min, variance, whatever you need. Render this into a 32x smaller buffer, and repeat till you are at 1x1 size. Then you can bind the result, and voilà, there you have the values you want…
You need three downsample passes, and then a downsample of the remaining 24 values, if you’re at 1024x768. That’s acceptable, no?
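To pin down the arithmetic, here is a CPU-side reference of one reduction step (the GPU version would do the same 32-to-1 step in a fragment program, 8x4 texels per output pixel; the Stats struct is just for illustration — at the first level each input holds min = max = sum = the pixel’s value):

```c
#include <float.h>

typedef struct { float min, max, sum; } Stats;

/* One 32:1 reduction step: each output value carries (min, max, sum)
 * of an 8x4 input block. Repeat until 1x1; then mean = sum / N.
 * 1024x768 -> 128x192 -> 16x48 -> 2x12, then one last pass over the
 * 24 leftovers. */
static void reduce_8x4(const Stats *src, int sw, int sh, Stats *dst)
{
    int dw = sw / 8, dh = sh / 4;
    for (int y = 0; y < dh; y++)
        for (int x = 0; x < dw; x++) {
            Stats s = { FLT_MAX, -FLT_MAX, 0.0f };
            for (int j = 0; j < 4; j++)
                for (int i = 0; i < 8; i++) {
                    Stats t = src[(y * 4 + j) * sw + (x * 8 + i)];
                    if (t.min < s.min) s.min = t.min;
                    if (t.max > s.max) s.max = t.max;
                    s.sum += t.sum;
                }
            dst[y * dw + x] = s;
        }
}
```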

I said min/max, but only because EXT_min_max_average_variance is a long name. Maybe EXT_fragment_stats? Get min/max/mean/variance for each fragment property (r, g, b, a, z, stencil). You should also be able to reset the values at any point, and you can use a fence to know when to read/clear the values without synchronizing.

EDIT: I did not consider overdraw. In order to get the stats for the values in the final framebuffer (not the average value drawn), you would have to be able to remove the contribution of pixels that are overwritten. That is possible for the mean (subtract the overwritten value, add the new one), and for the variance too if you also track a sum of squares; min and max are the hard cases, since removing the current extreme would require knowing the runner-up.
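For the mean and variance the bookkeeping is cheap, at least in principle. A sketch of the per-overwrite update (names are made up; this is just the arithmetic):

```c
/* Running sums that survive overdraw: when a pixel is overwritten,
 * subtract the old value's contribution and add the new one. Works
 * for mean and (via the sum of squares) variance; min/max cannot be
 * un-taken this cheaply. */
typedef struct { double sum, sum_sq; long n; } RunningStats;

void on_pixel_write(RunningStats *s, float old_val, float new_val, int had_old)
{
    if (had_old) {                       /* remove overwritten contribution */
        s->sum    -= old_val;
        s->sum_sq -= (double)old_val * old_val;
    } else {
        s->n++;                          /* first write to this pixel */
    }
    s->sum    += new_val;
    s->sum_sq += (double)new_val * new_val;
}

double mean(const RunningStats *s) { return s->sum / s->n; }

double variance(const RunningStats *s)   /* E[x^2] - E[x]^2 */
{
    double m = s->sum / s->n;
    return s->sum_sq / s->n - m * m;
}
```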

[This message has been edited by Nakoruru (edited 09-17-2002).]

Originally posted by Nakoruru:

> Maybe EXT_fragment_stats? Get min/max/mean/variance for each
> fragment property (r, g, b, a, z, stencil). […] In order to get the
> stats for the values in the final framebuffer you would have to be
> able to remove the contribution of pixels that are overwritten.

Actually, the stats have to be performed AFTER the image is rendered… that’s by far the simplest and fastest method… I would prefer a generic delayed_readback: you specify what you want and the target buffer, and can later say, now I want it…

Think about it: the way I propose is simple and quite fast to implement, as well as to perform… and since you want the stats yourself, you send the delayed readback into the buffer you will use in the next frame…

BTW, you can do the whole thing on the GPU, no need for readback… just apply the exposure function when writing into the 32-bit buffer at the end. This can be done in the pixel shader, and, well… the values you need for that are stored in a simple 1D texture, ready for sampling… the other texture you have is a simple floating-point texture, which is your original-sized rendered image…
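The final pass might look roughly like this — illustrative ARB_fragment_program-style code (the LUT here is a 3D variant of the lookup texture, indexed directly by the HDR color; on Windows, glProgramStringARB is assumed to have been fetched already):

```c
#include <string.h>
#include <GL/gl.h>
#include <GL/glext.h>   /* GL_FRAGMENT_PROGRAM_ARB etc. */

/* Texture 0 = the full-res float image, texture 1 = the exposure
 * lookup table. Sketch only. */
static const char *exposure_fp =
    "!!ARBfp1.0\n"
    "TEMP hdr;\n"
    "TEX hdr, fragment.texcoord[0], texture[0], 2D;\n"
    "TEX result.color, hdr, texture[1], 3D;\n"
    "END\n";

void load_exposure_program(void)
{
    glProgramStringARB(GL_FRAGMENT_PROGRAM_ARB, GL_PROGRAM_FORMAT_ASCII_ARB,
                       (GLsizei)strlen(exposure_fp), exposure_fp);
}
```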

What do you need the readback for? (I know that the readback can be handy if you want the information, but what do you actually want it for?)

The method I proposed would allow you to do statistics for any individual piece of geometry, so it is a little more general. If I only wanted the framebuffer, then I would just use a fragment program like you suggest.

The statistics can be used as feedback to adjust the parameters of an exposure function, just like we talked about a few weeks ago, daveperman.
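That feedback is only a couple of lines; a sketch using the classic key-value scheme (the 0.18 “key” and the adaptation rate are arbitrary picks, and avg_lum would come from whichever stats mechanism is available):

```c
#include <math.h>

/* Simple auto-exposure feedback: steer the exposure scale toward
 * key / average-luminance, smoothed over frames so the "shutter"
 * adapts rather than snaps. Constants are arbitrary. */
float update_exposure(float exposure, float avg_lum, float dt)
{
    const float key  = 0.18f;                 /* target mid-grey */
    const float rate = 2.0f;                  /* adaptation speed, 1/sec */
    float target = key / fmaxf(avg_lum, 1e-4f);
    return exposure + (target - exposure) * (1.0f - expf(-rate * dt));
}
```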

Originally posted by Nakoruru:

> The method I proposed would allow you to do statistics for any
> individual piece of geometry, so it is a little more general.

Well, if you propose statistics registers for the vertex shaders as well, yes; otherwise I don’t see much use in having them directly in the pixel shader, as most of the time you only care about the pixels that are actually visible…

My proposal solves the same problem as well: you can adjust the exposure time with my data… and mine is already possible, hehe.
(It would even work with min/max as the adjustment values on today’s hardware… but doing exposure on an 8-bit value will be, uhm, imprecise.)
(Hint: 3D texture for the exposure lookup, hehe.)

I know yours sounds good… but mine is possible now; that’s why I like it.