Emulating Frequency Stream Divider in OpenGL

What would be the best method to emulate DirectX’s Frequency Stream Divider in OpenGL?

The point being to stream per-vertex data from one VBO and per-primitive data from another VBO, without having to manually duplicate the per-primitive data to every vertex in the primitive.

BTW, I don’t mind using vendor specific extensions, but I couldn’t find anything for this.

There is no direct way (except manually duplicating the data :slight_smile: ). NVIDIA designed an extension for it, but it was never released because they abandoned the technique in later hardware (probably it wasn’t a big performance saver anyway).

Still, OpenGL is much more “batch friendly” than DirectX 9. The usual way to do it is something like


// one draw call per instance; the per-instance data is passed as a
// constant ("current") vertex attribute instead of an attribute array
for (int i = 0; i < instanceCount; ++i) {
    glVertexAttrib4fv(instanceAttrib, instanceData[i]);              // set the per-instance value
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, 0);  // draw the shared geometry
}

Looks like Humus has a demo of that very thing, à la pseudo-instancing, in the Domino demo…

http://www.humus.ca/index.php?page=3D

Good to know that the stream frequency divider got canned in the hardware.

I’m dynamically generating all my geometry per frame right now using transform feedback. Also I’m drawing this geometry 6 times (once per side of a cubemap) so there is a big 6x read factor (for vertex attributes) for me to consider.

Looks like I will have to do a texture buffer fetch to grab those per primitive attributes from my transform feedback pass…
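Roughly what I have in mind, for the record (just a sketch, not tested: it assumes the transform feedback output is drawn as non-indexed GL_TRIANGLES so gl_VertexID/3 identifies the primitive, uses GL 3.x-style texture buffer names rather than the EXT_gpu_shader4 / EXT_texture_buffer_object spellings, and every identifier is a placeholder):

// Per-primitive data packed one texel per triangle into a buffer texture,
// fetched in the vertex shader instead of being replicated per vertex.
const char* vsSource =
    "#version 140\n"
    "uniform samplerBuffer perPrimitiveData;  // one RGBA32F texel per triangle\n"
    "in vec4 position;\n"
    "flat out vec4 primAttrib;\n"
    "void main()\n"
    "{\n"
    "    int prim = gl_VertexID / 3;               // non-indexed triangles\n"
    "    primAttrib = texelFetch(perPrimitiveData, prim);\n"
    "    gl_Position = position;                   // real transform omitted\n"
    "}\n";

// Host side: back the buffer texture with the VBO written by transform feedback.
GLuint tbo;
glGenTextures(1, &tbo);
glBindTexture(GL_TEXTURE_BUFFER, tbo);
glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, perPrimitiveVbo);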

It just wasn’t flexible enough. Now that the latest hardware supports vertex and instance IDs as well as indexing large constant buffers, you have full control over what data the shader should fetch. In theory you could do away with attributes and vertex arrays.

What would be the best method to emulate DirectX’s Frequency Stream Divider in OpenGL?

For those that don’t track this kind of thing…

[b]NVX_instanced_arrays[/b] - the original extension NVidia introduced for vertex stream frequency divider support under OpenGL. It was around for a while in NVidia’s Linux drivers (probably Windows too); the equivalent Direct3D functionality was exposed via SM3.0 vertex shaders. VSFD support was implemented in HW on NVidia GeForce 6 cards, but in SW on GeForce 7. Not available in the latest NVidia drivers (at least on Linux with a GeForce 8).

[b]EXT_draw_instanced[/b] - the new, better way to do things that passes an instance ID into the vertex shader for instance-specific texture lookups, etc. Supported in NVidia’s Linux drivers, but (AFAIK) requires a GeForce 8.
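To make that concrete, the instanced path collapses the per-instance CPU loop into one call, and the instance index shows up in the vertex shader as gl_InstanceID (rough sketch; buffer names and counts are placeholders):

// One instanced draw replaces N individual draws; requires EXT_draw_instanced
// (GeForce 8 class). The shader reads gl_InstanceID to tell instances apart.
glBindBuffer(GL_ARRAY_BUFFER, meshVbo);            // shared per-vertex data
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, meshIbo);    // shared index data
// ... glVertexAttribPointer / glEnableVertexAttribArray setup as usual ...
glDrawElementsInstancedEXT(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT,
                           0, instanceCount);      // gl_InstanceID = 0 .. instanceCount-1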

For pre-G80 support, you have to fall back to pseudo-instancing (replicated instance data in draw calls), loops of per-instance draw calls, or similar. Don’t know about the AMD/ATI side.

But as Zengar points out, OpenGL doesn’t have the horrendous batch submission overhead that Direct3D once did, where (in Direct3D) maximizing batch sizes and minimizing batch counts was absolutely paramount to decent polygon throughput (google “marshalling opengl” for details). I gather this has since improved in D3D10 for the few that are running it, but I don’t know (or much care; thankfully, my day job is 100% OpenGL). Try it both ways and see.

An aside: if you use a texture to upload dynamic instance-specific data, be sure to use a PBO to avoid the unnecessary, non-pipelined memcpy into a driver-internal DMA-aligned buffer. It should net you closer to the ~3.2 GB/s practical rate versus a subload from user memory.
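For reference, the PBO-based subload looks roughly like this (a sketch; sizes, formats, and names are placeholders, and it assumes a 2D texture holding the per-instance data):

// Stream this frame's instance data into the texture through a pixel-unpack PBO,
// so the driver can DMA it instead of doing a synchronous copy from user memory.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, dataSize, NULL, GL_STREAM_DRAW);  // orphan old storage
void* dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
memcpy(dst, instanceData, dataSize);                                   // fill with new data
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

glBindTexture(GL_TEXTURE_2D, instanceTex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, texWidth, texHeight,
                GL_RGBA, GL_FLOAT, 0);   // last argument is an offset into the bound PBO
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);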

Indeed, I don’t have any of these extensions exposed on my GF6800, with recent drivers.

The new hardware gives gl_PrimitiveID as well.

This is what I wonder about.

I would assume that the hardware (in this case the GeForce 8 Series) would have some kind of vertex attribute fetch hardware working in parallel with the unified shader hardware. So regular vertex attributes are fetched without taking shader resources. I would also assume that the vertex attribute fetch hardware has the ability to convert packed bytes and packed half floats automatically to floats without using shader resources.

Is this correct?

Constant buffers are limited to 4096 vec4s, so in many cases a texture fetch (say, from a texture buffer object) is needed, which will also take shader resources; even if the thread is immediately context-switched, the GPU still has to schedule the texture fetch (right?).

And if you needed to fetch both floats and half floats or bytes (i.e. using compressed vertex data), you would need multiple texture fetches from different textures, or you would have to manually unpack the packed data from a float (which, from what I have been told, is emulated on G8x chips with multiple type-conversion and integer instructions; check out the CUDA PTX guide). Meaning there would be a performance cost associated with this.

So it would seem to me that there would be some inherent advantages to using the hardware vertex attribute fetch (if possible) over simply fetching your own “attributes” in a vertex shader.
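To be concrete, by “hardware vertex attribute fetch” I mean the ordinary attribute-array path, where the packed formats are declared up front and expanded to floats before the shader runs (just a sketch; the stride, offsets, and names are made up, and GL_HALF_FLOAT for vertex data needs NV_half_float / ARB_half_float_vertex or GL 3.0):

// Packed per-vertex data: the attribute fetch path converts these to floats
// with no shader instructions spent on unpacking.
glBindBuffer(GL_ARRAY_BUFFER, packedVbo);
glVertexAttribPointer(0, 3, GL_HALF_FLOAT,    GL_FALSE, stride, (void*)0);  // position as half floats
glVertexAttribPointer(1, 4, GL_UNSIGNED_BYTE, GL_TRUE,  stride, (void*)8);  // color as normalized bytes
glEnableVertexAttribArray(0);
glEnableVertexAttribArray(1);
// Pulling the same data out of a buffer texture instead would mean either one
// texture per format or manual unpacking in the shader.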

But perhaps I’m completely wrong here?

Any of you OpenGL experts know about this?

If you have a look at some R600 block diagrams (for example here, middle of the page), you’ll see that vertex fetch and texture fetch are actually performed by the same units (though with separate L2 caches).

Since you don’t need filtering for vertex or constant fetches, each “texture unit” can fetch 16 unfiltered values along with 4 bilinear filtered values per clock. And the format conversion units can be shared for textures and vertex data.

G80 may be different in this regard and attributes may be more efficient, but I don’t think the difference would be large. On the other hand, the increased flexibility allows for better data reuse in some cases, which reduces storage space and possibly bandwidth (e.g. multiple index buffers).

I would assume that the hardware (in this case the GeForce 8 Series) would have some kind of vertex attribute fetch hardware working in parallel with the unified shader hardware. So regular vertex attributes are fetched without taking shader resources. I would also assume that the vertex attribute fetch hardware has the ability to convert packed bytes and packed half floats automatically to floats without using shader resources.

That doesn’t make any sense.

Fetching vertex attributes has always been a precondition of running a shader. Whatever the cost of this operation is (memory access, pre-T&L caching, converting attributes to floats, etc.), it happens “before” the shader itself runs. So while you can’t really tell the difference between vertex shader time and vertex fetch time, you also can’t control how long it takes to fetch attributes. You generally use as few as you can get away with and accept whatever the vertex fetch cost is as the minimum performance you will get for that batch.

It’s not something you can control or optimize.

As far as I can tell, the SFD allows for attribute replication / fractional stride, and the newer instancing mechanisms do not.

Attribute replication can be useful for particle systems where the data is CPU-sourced and many of the attributes are constant across the verts of a particle while some vary. Without SFD you have to replicate that constant data for every vertex of the particle.

Restated, SFD provides for bandwidth reduction in some CPU-sourced-data cases.

There’s also a way to do it with uniforms I think, but the chips that benefit from SFD don’t have a lot of uniforms.

Well, you can put some data into textures to reduce the amount of vertex fetch. Generally this will be faster since vertex fetch and texture units are separate, and if you have the bandwidth to feed it, you can potentially double the fetch rate.
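Something along these lines, say (a sketch in GLSL 1.10-era syntax; how the extra data is packed into the texture, and all the names, are made up): position still comes through the regular attribute path, while a second chunk of per-vertex data is pulled in with a vertex texture fetch.

// Split the per-vertex data: 'position' uses the vertex fetch units,
// 'extraData' comes from the texture units via vertex texture fetch.
const char* vsSource =
    "uniform sampler2D extraData;   // extra per-vertex data packed into a 2D texture\n"
    "uniform vec2 texSize;          // texture dimensions, for index -> texcoord\n"
    "attribute vec4 position;       // fetched by the regular vertex fetch path\n"
    "attribute float vertexIndex;   // this vertex's index into extraData\n"
    "void main()\n"
    "{\n"
    "    vec2 uv = vec2(mod(vertexIndex, texSize.x) + 0.5,\n"
    "                   floor(vertexIndex / texSize.x) + 0.5) / texSize;\n"
    "    vec4 extra = texture2DLod(extraData, uv, 0.0);   // vertex texture fetch\n"
    "    gl_Position = gl_ModelViewProjectionMatrix * (position + vec4(extra.xyz, 0.0));\n"
    "}\n";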

Generally this will be faster since vertex fetch and texture units are separate

They still go across the same memory bus. And there’s no guarantee (or even a suggestion) that they will be executed in parallel.

Vertex ID, instance ID, integer operations and texture buffer objects enable everything SFD does and much more.
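For example (a sketch with GL 3.1-style names; everything here is a placeholder): the per-instance data sits in a buffer texture and the vertex shader indexes it with gl_InstanceID, which is exactly the data SFD would otherwise have replicated across the vertices of every instance.

// Per-instance data: one texel per instance, indexed by gl_InstanceID,
// drawn with glDrawElementsInstanced (or the EXT equivalent).
const char* vsSource =
    "#version 140\n"
    "uniform samplerBuffer perInstanceData;   // e.g. an xyz offset per instance\n"
    "uniform mat4 viewProj;\n"
    "in vec4 position;\n"
    "void main()\n"
    "{\n"
    "    vec4 inst = texelFetch(perInstanceData, gl_InstanceID);\n"
    "    gl_Position = viewProj * (position + vec4(inst.xyz, 0.0));\n"
    "}\n";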

I should clarify what I mean by “in parallel”.

1.) Meaning separate hardware is fetching attributes for one or many other vertices, while other vertices are currently running through the shaders.

2.) The alternative would be that the compiler simply adds hidden “vertex attribute fetches” into the code of the shader itself (working somewhat like a vertex texture fetch), or that the hardware runs an internal “shader” to fetch the attributes before running the vertex shader. It is possible that parts of the fixed-function hardware are being removed (or will be in the future) and run on the programmable shader hardware itself. I can see the ROP eventually going this route…

Anyway, sure, this seems like a stupid question now, but I’d really be stupid if (2.) were the case and I kept assuming that (1.) was correct.

Given that the ATI R600 has separate vertex attribute fetch and texture fetch hardware, option (1.) seems to be implied, and I can see the possibility that vertex texture fetch would be an option if the attribute pipe were saturated.

As for the GeForce 8, I’ve been assuming option (1.), but I’ve yet to find anything that confirms this.

Also, you can compress data in the VBOs at the expense of having to dynamically decompress it in the vertex shader.

So there is a clear trade-off between vertex attribute fetch (or memory bandwidth) and extra shader operations.

Hence my comment about having the bandwidth to feed it. If you do, it’ll be faster.

On R600 they do, don’t know about G80. I’ve noticed a decent performance increase by going from R2VB (only vertex fetch) to VTF (both vertex and texture fetch) on the R600.

On R600 they do, don’t know about G80.

Excluding the obvious fact that the R600-based hardware is not likely to be a popular card for the foreseeable future (see 8800GT), there is hardware out there that isn’t the R600. Making an optimization for a specific card like this, particularly one that requires fundamental restructuring of the basic essence of what makes something a mesh (vertex data now being stored in textures), is just not reasonable.

I’ve noticed a decent performance increase by going from R2VB (only vertex fetch) to VTF (both vertex and texture fetch) on the R600.

Well, then ATI clearly needs to work on their drivers (fat chance). If the texture unit can fetch data faster than the vertex unit, then the drivers should be using the texture unit and not the vertex unit. And if it can do vertex and texture fetches in parallel, then it should decide on its own to use both. And it should create buffer objects on its own with the knowledge that this may in fact be possible (i.e. making them such that they may be accessed from a texture unit as, presumably, a buffer-object texture).

This is not something that developers should have to do themselves. This is clearly transparent optimization territory, considering how hardware-dependent it is (if hardware doesn’t support fast vertex texturing, it’ll murder your performance).

Indeed, this is a fine argument for geometry-based display lists.

I don’t have proof, but I find it fairly likely that the G80 would also be helped by this. Of course the G7x will be slaughtered by this technique, though. Whether it’s reasonable to use is up to anyone to decide for himself; I’m presenting it here because it was asked for. But for many typically high-poly parts of many games, like the terrain and game characters, this is a relatively straightforward and intuitive approach. Depending on your application, and if you’re targeting the high end, it could perfectly well be reasonable.

This is not something that developers should have to do themselves. This is clearly transparent optimization territory

I think you’re overly optimistic about what a driver could do about this. First of all, texture units and vertex fetch are shared between all shaders (vertex, geometry, fragment). It would require application-specific knowledge to figure out the best distribution of the available resources, something the driver doesn’t have. An automatic conversion might slow down some applications because they needed the texture units in the fragment pipe instead and weren’t vertex limited in the first place.

Secondly, vertex fetch and texture fetch can’t easily just replace each other in the shader. There has been talk about doing the reverse, having the vertex fetch unit help with texturing in DX10, in particular implementing the Load() function. However, this was quickly deferred into the future and I’m not sure it’ll ever be implemented given the complexity of the issue. It might be used for some game-specific optimizations for an important title, but I doubt there will ever be a generic solution.

Vertex fetch can only access linear data. Textures are usually not linear; they are typically swizzled. Textures also have a lot more possible formats than vertices do, and texture units can only access linear data under certain alignment requirements and other limitations. The driver would have to shuffle the data around, swizzle and unswizzle, pad and align to make this work. This management alone could have larger overhead than you’d win in the end.

Why have the vertex fetch units in both Xenos and R600 been referred to as point sample texture units then?