
View Full Version : Gaps between DX10 and OpenGL 3.2



elFarto
08-22-2009, 09:38 AM
In an effort to close the gap to DX10[.1], I'm interested in collecting the current features present in DX10, but not in OpenGL 3.2.

Here's what I have so far:
- per-instance vertex attributes: vertex attributes that change per instance, rather than per vertex
- asfloat/asint bitwise conversions, e.g. packing a 16-bit integer into a 16-bit float

Can anyone think of anything else?

Regards
elFarto

Ilian Dinev
08-22-2009, 10:02 AM
The first one is present, as http://www.opengl.org/registry/specs/ARB/instanced_arrays.txt

elFarto
08-22-2009, 10:34 AM
The first one is present, as http://www.opengl.org/registry/specs/ARB/instanced_arrays.txt
I was under the impression this wasn't supported on DX10 class hardware, see here (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=253338).

Regards
elFarto

Dark Photon
08-22-2009, 11:52 AM
The first one is present, as http://www.opengl.org/registry/specs/ARB/instanced_arrays.txt
I was under the impression this wasn't supported on DX10 class hardware, see here (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=253338)
Specifically, here (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=253345#Post253345). With the latest NVidia 3.x drivers it's not supported on a GeForce 8, at least.

Also this instanced array thing (specifically vertex attribute frequency dividers) was first implemented in OpenGL (AFAIK) on GeForce 6 cards (DX SM3.0) as NVX_instanced_arrays. Per Michael Gold, "NVX_instanced_arrays was implemented in software for GeForce 7xxx, anticipating the SM4 functionality. In the end it proved lackluster enough that we killed it."

So it sounds very likely that ARB_instanced_arrays (http://www.opengl.org/registry/specs/ARB/instanced_arrays.txt) might only be exposed on the older hardware that originally supported it, in the NVidia camp at least.

But who knows. Maybe it'll make a comeback. It can certainly be more convenient to push instance data into vertex arrays than texture buffers, though that definitely caps how much you can pull in and forces how you pack it.

elFarto
08-22-2009, 12:33 PM
But who knows. Maybe it'll make a comeback. It can certainly be more convenient to push instance data into vertex arrays than texture buffers, though that definitely caps how much you can pull in and forces how you pack it.
The hardware obviously supports something like instanced_arrays, just a little bit simpler.

There's a nice example (http://msdn.microsoft.com/en-us/library/bb205317%28VS.85%29.aspx) on MSDN showing how D3D10 does it; note that it uses two separate buffers to hold the data, one for vertex data and one for instance data.

Regards
elFarto

Brolingstanz
08-22-2009, 01:11 PM
By my reckoning 3.2 + extensions pretty much clinches the greater 10.1 feature set, and then some (sync, seamless cubemap, ...).

If 3.3 interns the recently added extensions, which seems very probable, that should neatly wrap up SM4.1 and go a stride or two towards SM5.

Though I confess that after all the rumpus at Siggraph09 I'm already way primed for SM6 & 7. :-)

Back to my software renderer...

Alfonse Reinheart
08-22-2009, 03:08 PM
If 3.3 interns the recently added extensions

What "recently added extensions" are you referring to?

elFarto
08-23-2009, 04:11 AM
Found a few new ones:
- Map the contents of a texture (ID3D10Texture[123]D::Map)
- Ability to use the front/back buffer as a texture. This seems to mirror the GLimage object from the original Longs Peak design.

Regards
elFarto

Brolingstanz
08-23-2009, 04:18 AM
If 3.3 interns the recently added extensions

What "recently added extensions" are you referring to?

Personally I'm pretty excited about all the new Apple extensions.

bertgp
08-24-2009, 08:11 AM
But who knows. Maybe it'll make a comeback. It can certainly be more convenient to push instance data into vertex arrays than texture buffers, though that definitely caps how much you can pull in and forces how you pack it.
The hardware obviously supports something like instanced_arrays, just a little bit simpler.

There's a nice example (http://msdn.microsoft.com/en-us/library/bb205317%28VS.85%29.aspx) on MSDN showing how D3D10 does it; note that it uses two separate buffers to hold the data, one for vertex data and one for instance data.

Regards
elFarto

What I don't understand is this: why does OpenGL make the devs go through hoops to get optimal rendering performance? I mean, obviously the driver is the best place to figure out if texture buffers, per-instance vertex attributes, uniform value array, etc. should be used on any given piece of hardware so why can't we have _one_ API call to set per-instance stuff and let the driver figure out the most optimal way to send the data?

Conversion between a uniform array and a 1D texture buffer should be handled automatically by the driver. Why should we bang our heads trying to figure out when to use each method? Even worse, this almost always has to be done through reverse engineering of the drivers and extensive benchmarks. This is pretty easy to screw up by forgetting to take into account one of the numerous states or not replicating exactly the application's usage pattern. Also, this is obviously not "future-proof" as the driver behavior can change.

elFarto
08-24-2009, 11:32 AM
What I don't understand is this: why does OpenGL make the devs go through hoops to get optimal rendering performance? I mean, obviously the driver is the best place to figure out if texture buffers, per-instance vertex attributes, uniform value array, etc. should be used on any given piece of hardware so why can't we have _one_ API call to set per-instance stuff and let the driver figure out the most optimal way to send the data?

Conversion between a uniform array and a 1D texture buffer should be handled automatically by the driver. Why should we bang our heads trying to figure out when to use each method? Even worse, this almost always has to be done through reverse engineering of the drivers and extensive benchmarks. This is pretty easy to screw up by forgetting to take into account one of the numerous states or not replicating exactly the application's usage pattern. Also, this is obviously not "future-proof" as the driver behavior can change.
I think part of the problem is legacy behaviour. But the methods you specified, uniform arrays and 1D texture buffers, are different. While there should be one way to do things, we shouldn't go removing things just because they're kinda similar.

One more thing I found:
- sub-views of textures
Regards
elFarto

Alfonse Reinheart
08-24-2009, 11:40 AM
I mean, obviously the driver is the best place to figure out if texture buffers, per-instance vertex attributes, uniform value array, etc. should be used on any given piece of hardware so why can't we have _one_ API call to set per-instance stuff and let the driver figure out the most optimal way to send the data?

Because texture buffers and uniform buffer objects are not bound to the idea of per-instance data. There are many uses of these features that do not revolve around per-instance data. There is no API for setting per-instance data at all; it's all up to the user.

bertgp
08-25-2009, 08:07 AM
I mean, obviously the driver is the best place to figure out if texture buffers, per-instance vertex attributes, uniform value array, etc. should be used on any given piece of hardware so why can't we have _one_ API call to set per-instance stuff and let the driver figure out the most optimal way to send the data?

Because texture buffers and uniform buffer objects are not bound to the idea of per-instance data. There are many uses of these features that do not revolve around per-instance data. There is no API for setting per-instance data at all; it's all up to the user.

You're right but I didn't make myself clear enough. I mixed the instancing comment with the texture buffer stuff.

Fundamentally, uniform arrays and texture buffers are ways to send data to the shader. Regular textures are different because they are optimized for different memory access patterns (1). However, I can't see any use case where one of the 2 aforementioned methods is a better fit conceptually than the other. Maybe there is one and someone can enlighten me.

In practice, one approach _is_ faster than the other depending on the size of the data and maybe some other undocumented factors. Why is that? You can't know without extensive benchmarks. Even with all the results, why have 2 ways of doing essentially the same operation?

As I'm writing this, I realize that this is essentially the same comment as those complaining that there are multiple ways to fill a VBO or a texture, with no indication as to which one to use to get optimal performance. It needs to be easy to know which API path to use in each use case. Take the guesswork away from the devs!

(1) By the way, I learned some time ago that textures are optimized for certain access patterns by looking into the hardware design side of GPUs and by talking to people who develop GPU drivers. How is somebody starting OpenGL supposed to know when he should use this instead of say a big uniform array? Without experience with the fixed function pipeline which had well-defined roles for textures, the multitude of options for sending data to a shader is confusing.

Alfonse Reinheart
08-25-2009, 12:07 PM
However, I can't see any use case where one of the 2 aforementioned methods is a better fit conceptually than the other.

One of them sets uniforms. The other does texture accesses.

Besides the fact that they both ship constant data to the GPU, I don't see how they're similar at all. Maybe they're similar if you're looking at a uniform like:



uniform vec4 arbitraryArray[2048];


But that's far from standard procedure when working with uniforms.


I realize that this is essentially the same comment as those complaining that there are multiple ways to fill a VBO or a texture, with no indication as to which one to use to get optimal performance.

The only case where it is not understood how to get maximum performance from a buffer object involves streaming. And even that case is no longer valid, now that MapBufferRange exists with GL_MAP_UNSYNCHRONIZED_BIT.


How is somebody starting OpenGL supposed to know when he should use this instead of say a big uniform array?

And how exactly would you tell the user this with the API? How would you define these implementation-dependent access patterns? And how would you make sure it works the same across platforms?

OpenGL is a hardware abstraction layer. It cannot dictate performance, only behavior.

bertgp
08-25-2009, 01:08 PM
One of them sets uniforms. The other does texture accesses.

Besides the fact that they both ship constant data to the GPU, I don't see how they're similar at all. Maybe they're similar if you're looking at a uniform like:



uniform vec4 arbitraryArray[2048];


But that's far from standard procedure when working with uniforms.


My point is that I can't fathom why you would rather have a texture or a uniform array in any given circumstance. What's the point of having these 2 methods? They are both ways of sending an array of data to a shader and that's it.

As an application developer, I shouldn't have to care which path the driver takes internally for getting the data to the shader. There can be some cases where different paths with well defined performance characteristics make sense because the application knows how it will use its data and take advantage of this information.

But let's face it, uniform arrays and texture buffers are solely used to send an array of values to a shader. Texture buffers can be shared across shaders but so can uniform arrays with uniform buffers.



And how exactly would you tell the user this with the API? How would you define these implementation-dependent access patterns? And how would you make sure it works the same across platforms?

OpenGL is a hardware abstraction layer. It cannot dictate performance, only behavior.


That's true, but to get the maximum performance to push more content on the GPU at the same framerate, you need to take this into account. It might be implementation dependent, but it makes a hell of a big performance difference if you use a texture for coherent memory access instead of a uniform array.

Anyways, all I'm saying is that it is mostly a black art trying to find the optimal path for some OpenGL operations. IMHO, some options should be removed because they are duplicates, and some others should be better documented so we can get the most out of them.

Alfonse Reinheart
08-25-2009, 05:08 PM
My point is that I can't fathom why you would rather have a texture or a uniform array in any given circumstance. What's the point of having these 2 methods? They are both ways of sending an array of data to a shader and that's it.

One of them sets uniforms. The other is a texture. They are in no way equivalent or interchangeable.

What you're suggesting is that uniforms should never be able to be arrays at all. Or that uniform buffers should not be able to store uniform values that happen to be arrays.

Textures are textures; no particular execution of a shader stage is expected to use every pixel of each texture. With uniforms, there is an expectation that any or all uniforms will be used by each execution of the shader stage. So take your access patterns from that: if the array in question is going to be sampled randomly, use a uniform buffer. If the array sampling is fairly sequential (as would be the case in instancing), use a texture buffer.

Brolingstanz
08-25-2009, 05:18 PM
- Signed normalized RGBA textures (currently EXT_texture_snorm).

Eosie
08-27-2009, 02:24 AM
My point is that I can't fathom why you would rather have a texture or a uniform array in any given circumstance. What's the point of having these 2 methods? They are both ways of sending an array of data to a shader and that's it.
The performance characteristics of both methods are very different. Also, the memory used for uniform variables is pretty small (64kB on 280GTX), so there must be another way to feed shaders.


So take your access patterns from that: if the array in question is going to be sampled randomly, use a uniform buffer. If the array sampling is fairly sequential (as would be the case in instancing), use a texture buffer.
It's the other way around, at least on G80 and later NVIDIA hardware: Textures are generally best suited for random access. Uniform buffers are as fast as reading from a register if and only if a GPU thread group (let's say 16 consecutive vertices in a vertex shader) reads the same address. Otherwise, it still may be quite fast if there is no cache-miss (yep, the memory for uniforms is cached). However if your accesses are completely random, expect huge performance losses.

This all might not hold for ATI hardware.



- Signed normalized RGBA textures (currently EXT_texture_snorm).
It's already part of OpenGL 3.1.

kRogue
08-27-2009, 04:15 AM
What I don't understand is this: why does OpenGL make the devs go through hoops to get optimal rendering performance? I mean, obviously the driver is the best place to figure out if texture buffers, per-instance vertex attributes, uniform value array, etc. should be used on any given piece of hardware so why can't we have _one_ API call to set per-instance stuff and let the driver figure out the most optimal way to send the data?

Conversion between a uniform array and a 1D texture buffer should be handled automatically by the driver. Why should we bang our heads trying to figure out when to use each method? Even worse, this almost always has to be done through reverse engineering of the drivers and extensive benchmarks. This is pretty easy to screw up by forgetting to take into account one of the numerous states or not replicating exactly the application's usage pattern. Also, this is obviously not "future-proof" as the driver behavior can change.


Not to be rude, but:

Textures are for texture-like use, i.e. the typical access pattern of, well, textures.

Buffer objects are for data-like use, i.e. localized, quasi-random access.

As for a newcomer deciding which to use, GL actually gives strong hints:

texturing --> sampler1D/2D/3D --> textures!
data --> vertex buffer objects --> buffers!

It is when you get to more advanced stuff like texture buffer objects and uniform buffer objects that it gets murkier, but by then you are no longer a newcomer.

The specifying of _dynamic_ buffer data is a harder call, i.e. glMapBuffer vs glBufferSubData... but again those different APIs exist for different usage patterns... so perhaps what you would like is some kind of companion doc to the GL spec, for developers and implementers alike: a "hint specification" that states the usage and performance expectations of different API calls?

edit:



It's the other way around, at least on G80 and later NVIDIA hardware: Textures are generally best suited for random access.

Wow, that I did not know at all! I always figured that textures were for very, very sequential access; I can make sense of the uniform buffer part given the caching... my knowledge of the speed state is basically just:

slowest to fastest:
1. texel fetch
2. texture buffer object fetch
3. uniform buffer value (i.e. the bindless graphics deal) fetch


But the part about texel fetches being well suited for random access totally blew me out of the water!

bertgp
08-27-2009, 08:18 AM
[...] perhaps what you would like is some kind of companion doc for the GL spec that is for developers and implementers, a "hint specification" where it states the usage and performance expectations of different API calls?

Yup, that would be great, and its absence is my biggest gripe. One can *eventually* get to know all the performance characteristics of the different usage patterns and the multiple API paths available to accomplish the same task, but a lot of time will be lost and a lot of errors are possible along the way. This is what I called "guesswork" in my previous post. I don't really care if there are multiple API paths to do some operation, as long as each one is fully described and I can know which one to use in which situation.

This is quite a good example of this from Eosie:

The performance characteristics of both methods are very different. Also, the memory used for uniform variables is pretty small (64kB on 280GTX), so there must be another way to feed shaders.

Having to search forums for posts arguing back and forth what is the best method for doing X is not, IMHO, a sign that an API is well documented. This is what I mean by "black art".

If you ever try to build an app that can guarantee no performance glitches due to texture uploads, shader changes, state changes, etc. you will feel the pain of this "black art" :).

Alfonse Reinheart
08-27-2009, 11:50 AM
a sign that an API is well documented.

API documentation dictates behavior, not performance. Performance changes based on hardware vendors and other things; it cannot be enforced.


If you ever try to build an app that can guarantee no performance glitches due to texture uploads, shader changes, state changes, etc. you will feel the pain of this "black art"

The differences in performance that are being discussed here are fairly small. Unless you're pushing the hardware to its limits (and nowadays, that's a lot of hardware to be pushing), it's generally not going to make a visual difference to the user.

kRogue
08-30-2009, 08:04 AM
API documentation dictates behavior, not performance. Performance changes based on hardware vendors and other things; it cannot be enforced.


Strictly speaking this is true, but in practice the story is much, much different.

Firstly, an API is supposed to lead the developer naturally onto the "fast" path for the hardware; this is complicated by the fact that GL is supported by several hardware vendors (for consumer hardware alone there are at least 3 such vendors for GL3: nVidia, ATI and S3). However, for a fair number of extensions, performance expectations are given; for example, texture buffer object texel fetching is supposed to be faster than a texel fetch from a texture, etc.

Also, considering that the IHV's contribute heavily to the GL3 spec, it is not unreasonable for the spec to include performance expectations or usage hints. By providing usage hints, software developers can get an idea of what they should do to hit the "fast path", and hardware vendors can potentially optimize their drivers for the expected usage patterns.

Alfonse Reinheart
08-30-2009, 12:54 PM
Do you know why streaming with buffer objects is such a minefield? Usage hints.

The definition of the usage hints tells you what to do to achieve maximum streaming performance: use one of the "stream" hints, and map the buffer to upload your data. Such a simple thing, and yet it may or may not get proper streaming performance. It all depends on the implementation.

Usage hints are a bad idea. Different implementations will implement the hints in different ways, leading to the same problem that the hints were trying to solve: you have to test on every hardware to see if you're getting the best possible performance.

kRogue
08-31-2009, 12:04 AM
Err, I was misunderstood: by usage hint, I did not mean an API entry point or additional arguments describing how an object will be used. I mean:

Usage hint: state in the API how a set of functions is most likely to be used.

On the other hand, that streaming with buffer objects is a minefield points to a communication fault in the hints used when mapping and creating buffer objects: the spec does not state sufficiently clearly how those usage hints are expected to be used, so driver implementers and software developers both need to guess. The hints for buffer object creation seem to be pretty well spelled out, but glMapBufferRange just gives some properties of the expected behaviour of mapping. What would be useful for both developers and implementers is an expected usage pattern: for example, how to do streaming well with the API. By stating how the API expects streaming to be done, implementers and developers can both see what is expected. If anything, this is an example where the spec, or a companion doc, would help.

Alfonse Reinheart
08-31-2009, 11:44 AM
The spec does not state sufficiently clearly how those usage hints are expected to be used, so driver implementers and software developers need to guess.

Oh no: the spec is very clear about what each of the 3x3 combinations of usage hints means for how the user should use the buffer. The only one that could be considered slightly unclear is DYNAMIC, and that's due to the question of when something deserves to be STREAM vs. DYNAMIC or STATIC vs. DYNAMIC.


the glMapBufferRange just gives some properties of the expected behaviour of mapping

The only thing that is unclear is whether implementations will properly utilize these values. For example, do you need to use DrawRangeElements for the invalidate range flag to work? Will the implementation even bother with invalidate range, instead just blocking until any part of the buffer is no longer in use?


What would be useful for both developers and implementers is an expected usage pattern: for example how to do streaming well with the API, by stating how the API expects to do streaming, then implementers and developers can see what is expected, if anything this is an example where in the spec or in a companion doc would help.

Driver developers will implement whatever ID does for streaming. What any such companion performance hint guide says is irrelevant next to making the next ID game run fast.

kRogue
08-31-2009, 03:18 PM
Oh no: the spec is very clear about what each of the 3x3 combination of usage hints mean for how the user should use the buffer. The only one that could be considered slightly unclear is DYNAMIC, and that's due to the question of when something deserves to be STREAM vs. DYNAMIC or STATIC vs. DYNAMIC.


um, look at what I wrote:



...
The hints for buffer object creation seem to be pretty well spelled out, but the glMapBufferRange just gives some properties of the expected behaviour of mapping
...



Also,



The only thing that is unclear is whether implementations will properly utilize these values. For example, do you need to use DrawRangeElements for the invalidate range flag to work? Will the implementation even bother with invalidate range, instead just blocking until any part of the buffer is no longer in use?


Well, if the driver does not utilize these values, that is naughty of the driver; it is _supposed_ to use them, right?



Driver developers will implement whatever ID does for streaming. What any such companion performance hint guide says is irrelevant next to making the next ID game run fast.


Ouch. So whatever ID does, the driver writers follow? I find that a bit too cynical to believe, especially since ID's games, even their upcoming RAGE, are all GL 2.1, not 3.x. Following the same logic, does Apple then optimize their GL for Blizzard games?

Jan
08-31-2009, 05:08 PM
"even their upcoming RAGE GL 2.1"

And even that is not for sure anymore.

Alfonse Reinheart
08-31-2009, 06:19 PM
Well, if the driver does not utilize these values, that is naughty of the driver, it is _supposed_ to use them, right?

The driver is supposed to make the program as a whole fast. It is quite possible that collating the data necessary to make the invalidate range flag work causes each draw call to be slower.


I find that kind of too cynical to believe, especially since ID games are all

You were not an OpenGL programmer during the days of compiled vertex arrays. For many drivers and driver revisions, there was only one compiled format that gave good performance: the one Quake 2 (or 3?) used. All others gave horrible performance.

While that particular nightmare is long over, OpenGL driver optimizations are still focused on what ID does.


Apple then optimize their GL for Blizzard games?

I'm sure Blizzard games spend as little time in AppleGL as possible.