Gaps between DX10 and OpenGL 3.2

elFarto · August 22, 2009, 8:38am

In an effort to close the gap to DX10[.1], I’m interested in collecting the current features present in DX10, but not in OpenGL 3.2.

Here’s what I have so far:
* per-instance vertex attributes: vertex attributes that change per instance, rather than per vertex
* asfloat/asint bitwise conversions, e.g. packing a 16-bit integer into a 16-bit float

Can anyone think of anything else?

Regards
elFarto
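
For readers unfamiliar with the asfloat/asint item: these are bit reinterpretations, not numeric conversions. Here is a minimal CPU-side sketch of the idea in C++, assuming 32-bit IEEE-754 floats. The helper names `as_uint` and `as_float` are made up for illustration and only mirror what the HLSL intrinsics do.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret the 32 bits of a float as an unsigned integer.
// Hypothetical helper; assumes float is a 32-bit IEEE-754 value.
inline std::uint32_t as_uint(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // copies bytes, no numeric conversion
    return bits;
}

// Reinterpret 32 integer bits as a float (the inverse of as_uint).
inline float as_float(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

Note the contrast with a cast: `static_cast<std::uint32_t>(1.0f)` yields 1, while `as_uint(1.0f)` yields the raw encoding of 1.0.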

Ilian_Dinev · August 22, 2009, 9:02am

The first one is present, as http://www.opengl.org/registry/specs/ARB/instanced_arrays.txt

elFarto · August 22, 2009, 9:34am

I was under the impression this wasn’t supported on DX10 class hardware, see here.

Regards
elFarto

Dark_Photon · August 22, 2009, 10:52am

elFarto:

I was under the impression this wasn’t supported on DX10 class hardware, see here.

Specifically, here. With the latest NVidia 3.x drivers it’s not supported on a GeForce 8 at least.

Also this instanced array thing (specifically vertex attribute frequency dividers) was first implemented in OpenGL (AFAIK) on GeForce 6 cards (DX SM3.0) as NVX_instanced_arrays. Per Michael Gold, “NVX_instanced_arrays was implemented in software for GeForce 7xxx, anticipating the SM4 functionality. In the end it proved lackluster enough that we killed it.”

So it sounds very likely that ARB_instanced_arrays might only be supported on the ancient hardware that had support for it, in the NVidia camp at least.

But who knows. Maybe it’ll make a comeback. It can certainly be more convenient to push instance data into vertex arrays than texture buffers, though that definitely caps how much you can pull in and forces how you pack it.

elFarto · August 22, 2009, 11:33am

The hardware obviously supports something like instanced_arrays, just a little bit simpler.

There’s a nice example on MSDN on how D3D10 does it, note that it uses 2 separate buffers to hold the data, one for vertex data and one for instance data.

Regards
elFarto
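
To make the two-buffer idea concrete, here is a small CPU-side model of how a per-vertex stream and a per-instance stream are indexed. It assumes the divisor semantics described in ARB_instanced_arrays (set with glVertexAttribDivisorARB). The function name `attribute_element` is hypothetical, and this is only a sketch of the indexing rule, not of any driver's implementation.

```cpp
#include <cstddef>

// Which array element an attribute reads for a given vertex and instance.
// divisor == 0: the attribute advances once per vertex (ordinary vertex data).
// divisor  > 0: the attribute advances once every `divisor` instances.
// Hypothetical helper modelling the rule; not part of any real API.
inline std::size_t attribute_element(std::size_t vertex,
                                     std::size_t instance,
                                     unsigned divisor) {
    return divisor == 0 ? vertex : instance / divisor;
}
```

With a divisor of 1, every vertex of instance 3 reads element 3 of the instance buffer, while the vertex buffer is still indexed per vertex.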

Brolingstanz · August 22, 2009, 12:11pm

By my reckoning 3.2 + extensions pretty much clinches the greater 10.1 feature set, and then some (sync, seamless cubemap, …).

If 3.3 interns the recently added extensions, which seems very probable, that should neatly wrap up SM4.1 and go a stride or two towards SM5.

Though I confess that after all the rumpus at Siggraph09 I’m already way primed for SM6 & 7.

Back to my software renderer…

Alfonse_Reinheart · August 22, 2009, 2:08pm

Brolingstanz:

If 3.3 interns the recently added extensions

What “recently added extensions” are you referring to?

elFarto · August 23, 2009, 3:11am

Found a few new ones:
* Map the contents of a texture (ID3D10Texture[123]D::Map)
* Ability to use front/back buffer as a texture. This seems to mirror the GLimage object from the original Longs Peak design.

Regards
elFarto

Brolingstanz · August 23, 2009, 3:18am

Personally I’m pretty excited about all the new Apple extensions.

bertgp · August 24, 2009, 7:11am

elFarto:

The hardware obviously supports something like instanced_arrays, just a little bit simpler.

There’s a nice example on MSDN on how D3D10 does it, note that it uses 2 separate buffers to hold the data, one for vertex data and one for instance data.

Regards
elFarto

What I don’t understand is this: why does OpenGL make the devs go through hoops to get optimal rendering performance? I mean, obviously the driver is the best place to figure out if texture buffers, per-instance vertex attributes, uniform value array, etc. should be used on any given piece of hardware so why can’t we have one API call to set per-instance stuff and let the driver figure out the most optimal way to send the data?

Conversion between a uniform array and a 1D texture buffer should be handled automatically by the driver. Why should we bang our heads trying to figure out when to use each method? Even worse, this almost always has to be done through reverse engineering of the drivers and extensive benchmarks. This is pretty easy to screw up by forgetting to take into account one of the numerous states or not replicating exactly the application’s usage pattern. Also, this is obviously not “future-proof” as the driver behavior can change.

elFarto · August 24, 2009, 10:32am

bertgp:

What I don’t understand is this: why does OpenGL make the devs go through hoops to get optimal rendering performance? I mean, obviously the driver is the best place to figure out if texture buffers, per-instance vertex attributes, uniform value array, etc. should be used on any given piece of hardware so why can’t we have one API call to set per-instance stuff and let the driver figure out the most optimal way to send the data?

Conversion between a uniform array and a 1D texture buffer should be handled automatically by the driver. Why should we bang our heads trying to figure out when to use each method? Even worse, this almost always has to be done through reverse engineering of the drivers and extensive benchmarks. This is pretty easy to screw up by forgetting to take into account one of the numerous states or not replicating exactly the application’s usage pattern. Also, this is obviously not “future-proof” as the driver behavior can change.

I think part of the problem is legacy behaviour. But the methods you specified, uniform array and 1D texture buffer, are different. While there should be one way to do things, we shouldn’t go removing things just because they’re kinda similar.

One more thing I found:

* sub-views of textures

Regards
elFarto

Alfonse_Reinheart · August 24, 2009, 10:40am

bertgp:

I mean, obviously the driver is the best place to figure out if texture buffers, per-instance vertex attributes, uniform value array, etc. should be used on any given piece of hardware so why can’t we have one API call to set per-instance stuff and let the driver figure out the most optimal way to send the data?

Because texture buffers and uniform buffer objects are not bound to the idea of per-instance data. There are many uses of these features that do not revolve around per-instance data. There is no API for setting per-instance data at all; it’s all up to the user.

bertgp · August 25, 2009, 7:07am

You’re right but I didn’t make myself clear enough. I mixed the instancing comment with the texture buffer stuff.

Fundamentally, uniform arrays and texture buffers are ways to send data to the shader. Regular textures are different because they are optimized for different memory access patterns (1). However, I can’t see any use case where one of the 2 aforementioned methods is a better fit conceptually than the other. Maybe there is one and someone can enlighten me.

In practice, one approach is faster than the other depending on the size of the data and maybe some other undocumented factors. Why is that? You can’t know without extensive benchmarks. Even with all the results, why have 2 ways of doing essentially the same operation?

As I’m writing this, I realize that this is essentially the same comment as those complaining that there are multiple ways to fill a VBO or a texture, with no indication as to which one to use to get optimal performance. Please, it needs to be easy to know which API path to use in different use cases. Take out the guesswork for the devs!

(1) By the way, I learned some time ago that textures are optimized for certain access patterns by looking into the hardware design side of GPUs and by talking to people who develop GPU drivers. How is somebody starting OpenGL supposed to know when he should use this instead of say a big uniform array? Without experience with the fixed function pipeline which had well-defined roles for textures, the multitude of options for sending data to a shader is confusing.

Alfonse_Reinheart · August 25, 2009, 11:07am

bertgp:

However, I can’t see any use case where one of the 2 aforementioned methods is a better fit conceptually than the other.

One of them sets uniforms. The other does texture accesses.

Besides the fact that they both ship constant data to the GPU, I don’t see how they’re similar at all. Maybe they’re similar if you’re looking at a uniform like:


  uniform vec4 arbitraryArray[2048];

But that’s far from standard procedure when working with uniforms.

bertgp:

I realize that this is essentially the same comment as those complaining that there are multiple ways to fill a VBO or a texture, with no indication as to which one to use to get optimal performance.

The only case where it is not understood how to get maximum performance from a buffer object involves streaming. And even that case is no longer valid, now that glMapBufferRange exists with GL_MAP_UNSYNCHRONIZED_BIT.
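
As an illustration of the streaming case, here is a sketch of the offset bookkeeping for one commonly described pattern: append into a buffer with unsynchronized maps, and invalidate the whole buffer when it wraps. This is an assumption about usage, not something spelled out in this thread. `StreamCursor` and `StreamRange` are hypothetical names, and the GL calls appear only in comments so the sketch runs without a context.

```cpp
#include <cassert>
#include <cstddef>

struct StreamRange {
    std::size_t offset;  // where to map and write within the buffer
    bool orphan;         // true: the cursor wrapped, discard old contents first
};

// Tracks the write position inside a fixed-size streaming buffer.
class StreamCursor {
public:
    explicit StreamCursor(std::size_t capacity) : capacity_(capacity), head_(0) {}

    // Reserve `size` bytes. In GL terms the caller would then map with
    //   glMapBufferRange(target, offset, size,
    //                    GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT)
    // when orphan is false, and with GL_MAP_INVALIDATE_BUFFER_BIT in place
    // of the unsynchronized bit when orphan is true.
    StreamRange reserve(std::size_t size) {
        assert(size <= capacity_);
        const bool wrap = head_ + size > capacity_;
        if (wrap) head_ = 0;  // start over at the front of the buffer
        StreamRange r{head_, wrap};
        head_ += size;
        return r;
    }

private:
    std::size_t capacity_;
    std::size_t head_;
};
```

The unsynchronized map is only safe because each reserved range is one the GPU has not been asked to read yet. The wrap case is where that guarantee ends, hence the invalidate.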

bertgp:

How is somebody starting OpenGL supposed to know when he should use this instead of say a big uniform array?

And how exactly would you tell the user this with the API? How would you define these implementation-dependent access patterns? And how would you make sure it works the same across platforms?

OpenGL is a hardware abstraction layer. It cannot dictate performance, only behavior.

bertgp · August 25, 2009, 12:08pm

Alfonse_Reinheart:

One of them sets uniforms. The other does texture accesses.

Besides the fact that they both ship constant data to the GPU, I don’t see how they’re similar at all. Maybe they’re similar if you’re looking at a uniform like:
  uniform vec4 arbitraryArray[2048];
But that’s far from standard procedure when working with uniforms.

My point is that I can’t fathom why you would rather have a texture or a uniform array in any given circumstance. What’s the point of having these 2 methods? They are both ways of sending an array of data to a shader and that’s it.

As an application developer, I shouldn’t have to care which path the driver takes internally for getting the data to the shader. There can be some cases where different paths with well defined performance characteristics make sense because the application knows how it will use its data and take advantage of this information.

But let’s face it, uniform arrays and texture buffers are solely used to send an array of values to a shader. Texture buffers can be shared across shaders but so can uniform arrays with uniform buffers.

That’s true, but to get the maximum performance to push more content on the GPU at the same framerate, you need to take this into account. It might be implementation dependent, but it makes a hell of a big performance difference if you use a texture for coherent memory access instead of a uniform array.

Anyways, all I’m saying is that it is mostly a black art when trying to find the optimal path for some OpenGL operations. IMHO, some options should be removed because they are duplicates and some others should be more documented to get the most out of them.

Alfonse_Reinheart · August 25, 2009, 4:08pm

bertgp:

My point is that I can’t fathom why you would rather have a texture or a uniform array in any given circumstance. What’s the point of having these 2 methods? They are both ways of sending an array of data to a shader and that’s it.

One of them sets uniforms. The other is a texture. They are in no way equivalent or interchangeable.

What you’re suggesting is that uniforms should never be able to be arrays at all. Or that uniform buffers should not be able to store uniform values that happen to be arrays.

Textures are textures; no particular execution of a shader stage is expected to use every pixel of each texture. With uniforms, there is an expectation that any or all uniforms will be used by each execution of the shader stage. So take your access patterns from that: if the array in question is going to be sampled randomly, use a uniform buffer. If the array sampling is fairly sequential (as would be the case in instancing), use a texture buffer.

Brolingstanz · August 25, 2009, 4:18pm

Signed normalized RGBA textures (currently EXT_texture_snorm).

Eosie · August 27, 2009, 1:24am

The performance characteristics of both methods are very different. Also, the memory used for uniform variables is pretty small (64kB on 280GTX), so there must be another way to feed shaders.
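
To put the 64kB figure in perspective, here is a quick capacity calculation. It assumes a vec4 occupies 16 bytes (four 32-bit floats); the names are hypothetical. On real hardware the limit would be queried (for uniform blocks, via GL_MAX_UNIFORM_BLOCK_SIZE) rather than hard-coded.

```cpp
#include <cstddef>

// Size of one vec4, assuming four 32-bit floats.
constexpr std::size_t kVec4Bytes = 4 * sizeof(float);

// How many vec4 values fit in a given amount of uniform storage.
constexpr std::size_t vec4_capacity(std::size_t storage_bytes) {
    return storage_bytes / kVec4Bytes;
}

// Whether an array of `vec4_count` elements fits in that storage.
constexpr bool fits(std::size_t vec4_count, std::size_t storage_bytes) {
    return vec4_count <= vec4_capacity(storage_bytes);
}
```

Under these assumptions 64kB holds 4096 vec4 values, so the 2048-element array mentioned earlier in the thread would use half of it.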

It’s the other way around, at least on G80 and later NVIDIA hardware: Textures are generally best suited for random access. Uniform buffers are as fast as reading from a register if and only if a GPU thread group (let’s say 16 consecutive vertices in a vertex shader) reads the same address. Otherwise, it still may be quite fast if there is no cache-miss (yep, the memory for uniforms is cached). However if your accesses are completely random, expect huge performance losses.

This all might not hold for ATI hardware.
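
A toy way to reason about the uniform-read behaviour described above: count how many distinct addresses one thread group touches in a single read. This is an illustrative model only. The group size of 16 comes from the example in this post, `distinct_addresses` is a hypothetical helper, and the count is a stand-in for cost, not a measurement of any hardware.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Number of distinct addresses read by the threads of one group.
// 1 means every thread reads the same address (the fast case described
// above); larger values mean the reads are more scattered.
inline std::size_t distinct_addresses(const std::vector<std::size_t>& group_reads) {
    return std::set<std::size_t>(group_reads.begin(), group_reads.end()).size();
}
```

In this model, a per-draw constant read by all 16 threads scores 1, while a lookup indexed by a per-vertex value can score up to 16.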

As for signed normalized textures: it’s already part of OpenGL 3.1.

kRogue · August 27, 2009, 3:15am

bertgp:

What I don’t understand is this: why does OpenGL make the devs go through hoops to get optimal rendering performance? I mean, obviously the driver is the best place to figure out if texture buffers, per-instance vertex attributes, uniform value array, etc. should be used on any given piece of hardware so why can’t we have one API call to set per-instance stuff and let the driver figure out the most optimal way to send the data?

Conversion between a uniform array and a 1D texture buffer should be handled automatically by the driver. Why should we bang our heads trying to figure out when to use each method? Even worse, this almost always has to be done through reverse engineering of the drivers and extensive benchmarks. This is pretty easy to screw up by forgetting to take into account one of the numerous states or not replicating exactly the application’s usage pattern. Also, this is obviously not “future-proof” as the driver behavior can change.

Not to be rude, but:

Textures for using like textures, i.e. typical use pattern of, well, textures.

Buffer objects for using like data, i.e. localized, quasi-random access.

As for a newcomer seeing which to use, actually GL makes strong hints:

texturing –> sampler1D/2D/3D –> textures!
data –> vertex buffer objects –> buffers!

It is when you get to more advanced stuff like texture buffer objects and uniform buffer objects that it gets murkier, but by then you are no longer a newcomer.

Specifying dynamic buffer data is a harder call, i.e. glMapBuffer vs glBufferSubData… but again those different APIs exist for different usage patterns… so, perhaps what you would like is some kind of companion doc for the GL spec that is for developers and implementers, a “hint specification” where it states the usage and performance expectations of different API calls?

edit:

Eosie:

It’s the other way around, at least on G80 and later NVIDIA hardware: Textures are generally best suited for random access.

Wow, that I did not know at all! I always figured that textures were for very, very sequential access; I can make sense of the uniform buffer part on caching… my knowledge of the relative speeds is basically just:

slowest to fastest:
1. texel fetch
2. uniform texture buffer object fetch
3. uniform buffer value (i.e. the bindless graphics deal) fetch

But the part where texel fetches are well suited for random access totally blew me out of the water!

bertgp · August 27, 2009, 7:18am

Yup that would be great and its absence is my biggest gripe. One can eventually get to know all the performance characteristics of different usage patterns and multiple API paths available to accomplish the same task, but a lot of time will be lost and a lot of errors are possible along the way. This is what I call “guesswork” in my previous post. I don’t really care if there are multiple API paths to do some operation as long as each one is fully described and I can know which one to use in which situation.

This is quite a good example of this from Eosie:

The performance characteristics of both methods are very different. Also, the memory used for uniform variables is pretty small (64kB on 280GTX), so there must be another way to feed shaders.

Having to search forums for posts arguing back and forth what is the best method for doing X is not, IMHO, a sign that an API is well documented. This is what I mean by “black art”.

If you ever try to build an app that can guarantee no performance glitches due to texture uploads, shader changes, state changes, etc. you will feel the pain of this “black art” :).