Placeholder for f-buffer details.

This is an invitation to post info and answer questions on f-buffer implementation and API; feel free to post here, ATI.
http://graphics.stanford.edu/projects/shading/pubs/hwws2001-fbuffer/

API?
Size?
Overflow?

Originally posted by dorbie:
This is an invitation to post info and answer questions on f-buffer implementation and API; feel free to post here, ATI.

In case it’s of any interest, the first time I heard about the F-Buffer was in the OpenGL 2.0 white papers. I think that in the first versions of those papers there was even specific provision for it in the new OGL 2 scheme of things.

We considered adding F-Buffers (The F-Buffer: A Rasterization-Order FIFO Buffer for Multi-Pass Rendering, Graphics Hardware 2001) in addition to the aux data buffers already described. These have some very nice properties, however they have an unbounded memory footprint (the paper gives suggestions for solutions) that OpenGL doesn’t have any mechanisms to address. We didn’t feel we had time to take on this as well and want to defer it to OpenGL 2.1

OpenGL Shading Language Spec 1.1, pg 14

[This message has been edited by evanGLizr (edited 03-10-2003).]

The f-buffer also seems to have some very nasty properties, but it’s better than nothing. The devil is very much in the details and the details are not forthcoming.

The approach to tackling some of the most important issues is wide open. Trying to replicate exact fragment rendering without overflow is a nightmare that looks like it requires primitive-level subdivision. Unless some really mind-bending ‘new magic’ stuff goes on, you’re left with some really nasty problems in the application or middleware domain.

Even if OpenGL happens to handle subdivision or region-based rasterization to avoid overflow, you can’t really expose the passes, and you leave OpenGL to transparently apply the multiple fragment shaders across passes. This seems inflexible, complex, and inherently something to be avoided. OTOH, if you leave the passes up to application-level support, then you have simplicity but the aforementioned nasty application problems. With an inherently region-based architecture it makes more sense, but some other undetermined constraints still seem essential.

You can always take the hit and go out to video memory, but then you often wind up wasting all that memory you were so proud of saving, and the f-buffer winds up as a fancy auxbuffer cache that limits your flexibility and might rarely save you some time.

It seems that even once you have a working f-buffer in hardware, you may be left scratching your head as to how best to expose it.

[This message has been edited by dorbie (edited 03-11-2003).]

I wanted to correct this for the record, since it has been annoying me. I had cause to think about this some more recently, and it looks like (and has been stated or strongly implied) that the ATI F-buffer implementation will transparently execute arbitrarily complex shaders by executing them in smaller groups of instructions and storing the result registers to the F-buffer between ‘passes’. The passes will not be exposed in any way; there’s no need for any additional API, and there are no nasty issues. The multipass is hidden and complex shaders just work. It should be very clean.

I assume that the full fragment programs will be stored in video memory or on chip and a FIFO overflow would simply cause the next batch of instructions in the larger fragment program to be loaded and executed for the fragments already in the FIFO (or it simply renders in groups of fragments sufficiently small that it never overflows the FIFO). This means arbitrarily complex programs with the main additional overhead being reads & writes to the FIFO and constant reloading of the program instructions as fragments are rendered in batches. The latter overhead depends on the size of the FIFO and number of registers needed between passes.
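
To put that speculation in concrete terms, here is a minimal sketch of the kind of splitting I have in mind; the struct, the 64-instruction limit and the liveness function are all made up for illustration, not anything ATI has actually described:

    #include <stdio.h>

    /* Rough sketch (all names and limits invented): a long fragment program
     * is cut into slices that each fit the hardware instruction limit, and
     * any register still live at a cut is spilled to the f-buffer at the end
     * of one pass and reloaded at the start of the next. */

    #define MAX_HW_INSTRUCTIONS 64          /* assumed per-pass instruction limit */

    typedef struct {
        int first_instr;    /* index of the first instruction in this slice     */
        int instr_count;    /* instructions executed in this pass               */
        int spilled_regs;   /* registers written to the f-buffer between passes */
    } ProgramSlice;

    /* Cut total_instrs instructions into slices; live_regs_at() stands in for
     * whatever liveness analysis the real compiler would do. */
    static int split_program(int total_instrs, int (*live_regs_at)(int),
                             ProgramSlice *out)
    {
        int n = 0;
        for (int start = 0; start < total_instrs; n++) {
            int count = total_instrs - start;
            if (count > MAX_HW_INSTRUCTIONS) count = MAX_HW_INSTRUCTIONS;
            out[n].first_instr = start;
            out[n].instr_count = count;
            /* anything live across the cut has to round-trip through the FIFO */
            out[n].spilled_regs =
                (start + count < total_instrs) ? live_regs_at(start + count) : 0;
            start += count;
        }
        return n;                           /* passes run per fragment */
    }

    static int dummy_liveness(int instr) { (void)instr; return 4; } /* pretend 4 regs */

    int main(void)
    {
        ProgramSlice slices[8];
        int passes = split_program(160, dummy_liveness, slices); /* 160-instr shader */
        printf("160 instructions -> %d passes of <= %d instructions\n",
               passes, MAX_HW_INSTRUCTIONS);                      /* 3 passes */
        return 0;
    }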

None of the nasty issues arise because this is implemented within the confines of a single fragment program using whatever state limits are already imposed by OpenGL. It is beautiful, a clear win.

Some of this is still speculation.

[This message has been edited by dorbie (edited 04-05-2003).]

Sounds reasonable. I think that most cards these days rasterize some form of tiles, blocks or “cache lines” or whatever. They probably do the overflow/spill/re-start on a per-tile/block/line/chunk basis, so it’s probably mostly using super-fast on-die SRAM, plus large chunks being burst to/from DRAM, so it should be efficient enough to actually use.

Originally posted by jwatte:
Sounds reasonable. I think that most cards these days rasterize some form of tiles, blocks or “cache lines” or whatever. They probably do the overflow/spill/re-start on a per-tile/block/line/chunk basis, so it’s probably mostly using super-fast on-die SRAM, plus large chunks being burst to/from DRAM, so it should be efficient enough to actually use.

Note that what follows is just speculation.

I don’t think the re-start is done per tile; it will be done per polygon batch, as one of the reasons to use an f-buffer is to be able to use longer programs than the ones that fit in the VPU (you can also use the f-buffer in case you don’t have enough temporaries, as the f-buffer behaves as an “unlimited” register file).
There’s not much reason for an f-buffer (instruction-length-wise) if you already have a hierarchical memory structure for program instruction storage, which is what you both seem to describe.

The scenario I see for programs longer than the maximum length is:

  1. The app sends the long shader to the driver.
  2. The app sends one or several polygon batches.
  3. The driver splits the long shader into two (or more) shader segments, such that each segment fits in the VPU. The intermediate segments read from and write to the f-buffer; the last segment writes to the framebuffer.
  4. The driver iterates through the shader segments: it sets the first segment and sends the geometry, then for each subsequent segment it sets that segment as the current program and triggers something in the VPU to generate fragments from the f-buffer and run the program on those fragments (a rough sketch of this loop follows below).
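
A rough sketch of that loop, purely hypothetical (every type and function name below is invented; none of it is a real driver API):

    #include <stdio.h>

    /* Hypothetical driver-side sketch of the scenario above. */

    typedef struct { int id; } ShaderSegment;         /* one VPU-sized program slice */

    static void bind_segment(const ShaderSegment *s)  /* load the slice into the VPU */
    { printf("bind segment %d\n", s->id); }

    static void draw_geometry(const void *batch)      /* rasterize the polygon batch */
    { (void)batch; printf("rasterize geometry, write results to f-buffer\n"); }

    static void replay_fbuffer(int to_framebuffer)    /* re-run queued fragments */
    { printf("replay f-buffer fragments -> %s\n",
             to_framebuffer ? "framebuffer" : "f-buffer"); }

    static void draw_with_long_shader(const ShaderSegment *segs, int num_segments,
                                      const void *geometry_batch)
    {
        /* First segment rasterizes the real geometry and writes its
         * intermediate results to the f-buffer in rasterization order. */
        bind_segment(&segs[0]);
        draw_geometry(geometry_batch);

        /* Later segments pull their fragments back out of the f-buffer instead
         * of re-rasterizing; only the final segment writes to the framebuffer. */
        for (int i = 1; i < num_segments; i++) {
            bind_segment(&segs[i]);
            replay_fbuffer(i == num_segments - 1);
        }
    }

    int main(void)
    {
        ShaderSegment segs[3] = { {0}, {1}, {2} };  /* a shader split into 3 passes */
        draw_with_long_shader(segs, 3, NULL);
        return 0;
    }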

Other notes: the fact that the f-buffer is not exposed at the API level doesn’t mean it is problem-free, as the driver still has to take into account the case where the f-buffer overflows (which is the problem the OpenGL 2.0 white paper talks about, and one that is difficult to predict without doing the real rendering, as the required f-buffer size depends on the number and geometry of the polygons - overlapping polygons with z-test on may require less f-buffer memory, etc.).

When the OpenGL 2.0 whitepaper talks about exposing the f-buffer, it means doing so at the shader-language level, as a sequentially accessible “unlimited” register file (hence the comparison to aux data buffers).

EDIT: Boh, removed comments on retransforming the geometry, if you have an f-buffer, the second and subsequent passes don’t use geometry but the f-buffer itself (obviously).

[This message has been edited by evanGLizr (edited 04-06-2003).]

Most of my earlier posts were on the f-buffer overflow problems and issues of transparency; however, the overflow issue is solved by the implementation being discussed.

The f-buffer is a limited resource; that’s the whole point. Overflow is excruciatingly painful (it always breaks the f-buffer at some level, and most of the implementation issues surround avoiding or handling overflow). The transparent option discussed means that the overflow can be avoided. As I said, overflow might trigger progression to the next program chunk for the fragments in the FIFO; either that, or some analysis is made of the program, the f-buffer size is divided by the maximum number of registers carried between any two passes, and the fragment count is limited to that. Since the f-buffer is a single FIFO, it seems to me that the fragment count for any given pass would have to be limited by the f-buffer size divided by the most registers sent between ANY two passes in any given shader. If not, you get overflow.
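
A back-of-the-envelope example of that limit, with every number invented for illustration:

    #include <stdio.h>

    /* If the worst segment boundary in the shader carries four float4
     * registers per fragment, a 256 KB f-buffer caps the batch size
     * as follows (all numbers assumed, not real hardware figures). */
    int main(void)
    {
        const int fbuffer_bytes   = 256 * 1024; /* assumed f-buffer size         */
        const int bytes_per_reg   = 16;         /* one float4 register           */
        const int worst_live_regs = 4;          /* most regs live across any cut */

        int max_fragments = fbuffer_bytes / (bytes_per_reg * worst_live_regs);
        printf("max fragments per batch: %d\n", max_fragments);  /* 4096 */
        return 0;
    }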

Primitive-order rasterization of fragments seems like a possibility; however, some subdivision (in whatever form, I don’t really care) is essential for large primitives. Exactly how this happens is architecture dependent, and we don’t have enough detail on such things to make even an educated guess.

The API transparency and lack of overflow come at a price: the penalty paid for transparency is that the OpenGL state accessible for sources within even a large shader (textures, colors, etc.) is still that of a single pass (although I expect multiple taps from the same textures are possible). In theory I suppose ATI could blow the doors off fixed limits like max textures and a few other resources too, and use the f-buffer to help even with that, but the viability depends on the overhead of making huge numbers of those state changes for each fragment group before overflows. They’d have to define all the multitexture targets etc., or clean up the API w.r.t. multitexture. It could be done; the only question is how quickly. The functionality alone may be worth it for some, just tell everyone else that exceeding 6 textures may be very slow, but you can use as many as 1024 textures in a single shader if you’re patient (or maybe it’s fast!). All entirely speculation, of course. I do expect that each shader instruction group will at least be able to keep reading the same textures over and over again within the confines of the usual limits, and that pretty much means unlimited taps in a single shader. Well, we’ll see.

Great stuff. I’ll only be truly impressed if we hear of some ungodly max texture count though :-), even if it’s slow and <7 is fastpath. I’ll be disappointed if we can’t tap textures until the cows come home.

P.S. as for shader trees etc., the only reason you’d be interested in this case is minimizing registers required between passes. Nothing I’m discussing requires this. We’re talking dumb low-level fragment assembly code that is, in its most basic form, just run in sequence; registers required to be stored for a later pass are simply sent to the FIFO to be pulled off when the next internal pass hits the same fragment. All that other stuff is higher level unless someone tries to get clever reordering the provided assembly; it’s possible but doesn’t much affect the discussion.