
View Full Version : direct_state_access reloaded



oc2k1
09-17-2009, 12:56 PM
It seems that many people liked EXT_direct_state_access, but unfortunately it's gone in OpenGL 3.2, and in 3.1 with the forward-compatible bit.

The extension itself is written against 2.1. For 3.2 the question would be which parts are deprecated by design....:

Maybe leave that current version as is and start a revised version for 3.2 as ARB or anything else.

Removing the functions for deprecated stuff should be easy. More problematic are the texture functions: they still have the border argument. Because this functionality is deprecated, it should be removed. (That requires new function names; an ARB suffix would solve that.)

One interesting point is that there is nothing in it that can't be emulated by non-driver code. Sure, it is a little bit slower, but the code is easier to maintain. And if the IHVs implement it natively later, the speed may improve as well....

I think the right way to get that working would be:
1. rework the extension paper
2. write a program that writes the emulation code
3. hope that it will become a real extension written against 3.x
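The emulation in step 2 could be sketched roughly like this. The toy "driver" below is a self-contained stand-in for real OpenGL state (all names are hypothetical), but it shows the save-bind-edit-restore pattern a generated DSA emulation layer would use:

```c
#include <assert.h>
#include <string.h>

/* Toy stand-in for driver state: a binding point and per-name storage.
 * Hypothetical names; real OpenGL would be behind glBindBuffer etc. */
#define MAX_BUFFERS 8
static unsigned char storage[MAX_BUFFERS][64];
static unsigned bound_buffer;

static void bindBuffer(unsigned name) { bound_buffer = name; }

static void bufferSubData(unsigned offset, unsigned size, const void *data)
{
    memcpy(storage[bound_buffer] + offset, data, size);
}

/* The emulated DSA-style entry point: edit buffer `name` without
 * disturbing whatever binding the client currently has. */
static void namedBufferSubData(unsigned name, unsigned offset,
                               unsigned size, const void *data)
{
    unsigned previous = bound_buffer;  /* save current binding    */
    bindBuffer(name);
    bufferSubData(offset, size, data);
    bindBuffer(previous);              /* restore client binding  */
}
```

The same pattern applies to every selector-based call, which is why generating the emulation code mechanically (step 2) is plausible.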

Chris Lux
09-18-2009, 02:49 AM
This extension was the single best thing a year back, and I hope it will be picked up for core-profile OpenGL 3.x.

Groovounet
09-18-2009, 06:32 AM
3. hope that it will become a real extension written against 3.x


And maybe just OpenGL 3.2 core!
I'm really looking forward to seeing this extension updated!

elFarto
09-27-2009, 05:24 AM
It's not really worth directly converting the direct_state_access extension. You would still be asking the driver to do lots of extra int -> object lookups. It would be much better to have a variation of this extension with opaque pointer types.

Here's a mock up of a fictional direct_buffer_objects extension:


typedef void* Buffer;

Buffer CreateBuffer(int size, void *data, enum usage);

GenBuffers(int count, Buffer *buffer);
BufferData(Buffer buffer, int size, const void *data, enum usage);
BufferSubData(Buffer buffer, int offset, int size, const void *data);
DeleteBuffers(int count, Buffer *buffer);
FlushMappedBufferRange(Buffer buffer, int offset, int size);
GetBufferParameter<T>(Buffer buffer, enum param, T *data);
void* GetBufferPointer(Buffer buffer, enum param);
GetBufferSubData(Buffer buffer, int offset, int size, void *data);
void* MapBuffer(Buffer buffer, enum access);
void* MapBufferRange(Buffer buffer, int offset, int size, enum access);
UnmapBuffer(Buffer buffer);
CopyBufferSubData(Buffer src, Buffer dest, int srcoffset, int destoffset, int size);

BindBuffer(enum target, Buffer buffer); //target cannot be array_buffer, element_array_buffer, copy_{read,write}_buffer
BindBufferRange(enum target, uint index, Buffer buffer, int offset, int size);
BindBufferBase(enum target, uint index, Buffer buffer);

//vertex arrays
VertexAttrib(int index, int size, enum type, boolean normalized, Buffer buffer, int offset, int stride);
VertexAttribI(int index, int size, enum type, Buffer buffer, int offset, int stride);

ElementPointer(int index, enum type, Buffer buffer, int offset, int size); //if size == 0, use whole buffer
ElementPointerRange(int index, enum type, int start, int end, Buffer buffer, int offset, int size);

DrawElements(enum mode, int instances, int elementIndex, int baseVertex);
MultiDrawElements(enum *mode, int *instances, int *elementIndex, int *baseVertex);

//texture buffers
TexBuffer(Buffer buffer, enum internalformat);

BindBuffer is unfortunately still there due to texturing methods still using it.

The ElementPointer and [Multi]DrawElements are new, in an attempt to stem the explosion of Draw combinations.

Regards
elFarto

mfort
09-27-2009, 09:14 AM
It's not really worth directly converting the direct_state_access extension. You would still be asking the driver to do lots of extra int -> object lookups. It would be much better to have a variation of this extension with opaque pointer types.


Sorry, I do not understand all the hype about ints vs pointers.
What is the slowdown of the lookup? On a 32-bit OS the object number generated by glGen* could be a pointer (at least in OpenGL 3.x) right now. Nobody said that the number must be a monotonically increasing integer starting from 1.

What stops driver implementers from using the pointer compression methods used in the 64-bit Java VM (32-bit pointers addressing 32 GB of RAM)?
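The compressed-pointer idea can be sketched in a few lines (a rough illustration, not driver code): with 8-byte-aligned allocations the low 3 bits of every address are zero, so shifting them out lets a 32-bit name cover a 35-bit (32 GB) address space:

```c
#include <assert.h>
#include <stdint.h>

/* Java-VM-style "compressed oops": every object is 8-byte aligned, so
 * the bottom 3 address bits are always zero and can be shifted away.
 * A 32-bit compressed name then spans 2^35 bytes = 32 GB. */
static uint32_t  compress_ptr(uintptr_t addr) { return (uint32_t)(addr >> 3); }
static uintptr_t decompress_ptr(uint32_t id)  { return (uintptr_t)id << 3; }
```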

I think the ARB should focus on getting rid of the binding mechanism. Replacing ints with pointers is not the future.

Stephen A
09-27-2009, 09:43 AM
What stops driver implementers to use pointer compression methods that are used in 64bit Java VM (32bit pointers accessing 32GB RAM)?
The simple fact that OpenGL 2.1- allowed you to specify your own object IDs. You could say glBindTexture(GL_TEXTURE_2D, 0xDEADBEEF) and the driver would have to keep working.

Using opaque pointers is a huge step forward in this regard and a hypothetical ARB DSA extension is the best way to implement this (break compatibility only *once* instead of breaking it with DSA and breaking it again with opaque types).

Huge ++ to elFarto's suggestion. I'd prefer some extra type safety in there (different types for vertex buffer and texture buffer), but otherwise 100% agreed.

elFarto
09-27-2009, 09:54 AM
I think part of the problem is that you used to be able to tell OpenGL what id/name you wanted to use for your object. Perhaps the driver programmers haven't gotten round to changing it; perhaps it's a lot of work for them. Without looking at the driver's code it's impossible to say.

On your second point, nothing stops them; it's just not a very nice solution (you can't store an arbitrary 64-bit pointer in 32 bits). My 64-bit XP can use 128GB of physical memory, which needs 37 bits to address; assuming 8-byte alignment you could pack that into 34 bits, which is still 2 short.

It's not so much making them pointers as making them opaque types, to help make sure you don't pass the wrong type of object into a function; and if you're going to change the type system, you might as well make the types wide enough to hold a pointer, so no tricks are needed to store one.
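The type-safety point can be sketched like this (hypothetical names, a toy stand-in for a real API): distinct handle types let the compiler catch mix-ups that plain integer names allow to compile silently:

```c
#include <assert.h>

/* Hypothetical opaque handle types: a Buffer is not a Texture, so the
 * compiler rejects passing one where the other is expected.  With
 * plain GLuint names, that mix-up compiles without a warning. */
typedef struct BufferObj  { int size; }          BufferObj;
typedef struct TextureObj { int width, height; } TextureObj;
typedef BufferObj  *Buffer;
typedef TextureObj *Texture;

static void BufferData(Buffer b, int size)      { b->size = size; }
static void TexStorage(Texture t, int w, int h) { t->width = w; t->height = h; }

/* BufferData(someTexture, 16) would now be a compile-time error. */
```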


Huge ++ to elFarto's suggestion. I'd prefer some extra type safety in there (different types for vertex buffer and texture buffer), but otherwise 100% agreed.
You could, but they are technically interchangeable, so you can bind one as the other. R2VB is an example of when you really do want to do this: using a PBO as a VBO.

I've got some other APIs reworked as well (Image, FBO, VAO); I'll try and get round to finishing the rest of them.

Regards
elFarto

mfort
09-27-2009, 09:58 AM
The simple fact that OpenGL 2.1- allowed you to specify your own object IDs. You could say glBindTexture(GL_TEXTURE_2D, 0xDEADBEEF) and the driver would have to keep working.


Hmm, 2.1 is the most recent spec?
FYI, 3.0 deprecated application-defined object names. End of story.

Of course opaque objects are mostly better, but there are far bigger problems than this.

elFarto
09-27-2009, 10:02 AM
Of course opaque objects are mostly better, but there are far bigger problems than this.
Of course there are bigger problems (textures/samplers being one, IMHO), but that doesn't mean we don't want this fixed :)

Regards
elFarto

Alfonse Reinheart
09-27-2009, 12:18 PM
What is the slowdown of the lookup? On a 32-bit OS the object number generated by glGen* could be a pointer (at least in OpenGL 3.x) right now. Nobody said that the number must be a monotonically increasing integer starting from 1.

What is the slowdown? Have you looked at NVIDIA's bindless graphics performance statistics? NVIDIA took out most of the object-based stuff and gained substantial performance increases. Now, not all of this is due to int-to-pointer conversions. But some of it is.

Essentially, the problem is cache performance. Because your application does rendering and then other stuff every frame, by the time it gets to the render loop, the cache has lost all of the rendering information. Therefore, every graphics memory access, every one, is a cache miss. And every access of an object is a cache miss.

You might think this is insignificant overall; you're wrong. Most of the other low-hanging fruit in performance has already been picked. This is what is left.

Taking out the int-to-pointer conversion won't remove as many memory accesses as bindless graphics does. But it'll take away some of them, and that's better than nothing.

mfort
09-27-2009, 12:47 PM
What is the slowdown? Have you looked at NVIDIA's bindless graphics performance statistics? NVIDIA took out most of the object-based stuff and gained substantial performance increases. Now, not all of this is due to int-to-pointer conversions. But some of it is.


Well, I am very well aware of that paper. But this is about something totally different.
They extract the GPU address for some GL object, and then the app supplies this address in the render loop instead of the object name. So they totally skip the object name -> object data -> GPU address path.

But if you make object pointers instead of object names you won't get all this excellent speedup. You still have to do the object data -> GPU address resolution. Low-hanging fruit would be to make object names pointers on a 32-bit OS; then everybody could see what the speedup is. BTW, driver developers should do this benchmark.

So instead of making a new "pointer" API I would rather go for a new "bindless" API. If a programmer has to change their implementation then there must be a really good reason.
I do not want to rewrite the code for a pointer API and then again N months later for a bindless API.

Alfonse Reinheart
09-27-2009, 01:49 PM
But if you make object pointers instead of object names you won't get all this excellent speedup.

I know. I said, "not all of this is due to int-to-pointer conversions."

The point is that you get *something*.


I do not want to rewrite the code for pointer API and then again N months later to bindless API.

That's not going to happen. For many reasons.

Bindless graphics is a horrible API. It breaks the basic model of vertex shaders (let alone uniforms), requiring you to write shaders specifically for it. It is also incredibly low-level, which makes widespread implementation difficult if not impossible.

A better solution is to identify places where driver developers have to do lots of verification of things and eliminate them.

For example, part of bindless graphics is the ability to "lock" a buffer object. The reason is that OpenGL implementations can and will move that buffer object around; the only way to get a consistent buffer object address is to tell the implementation not to do that anymore.

Rendering with a VAO normally requires querying every attached buffer object to get their current GPU address pointer, and uploading those buffer objects if they are not in the GPU. This adds a lot onto the overhead and cache issues of rendering.

If however, you could lock a VAO, thus telling OpenGL that it shouldn't move the buffers attached to it around (as well as preventing you from being able to delete those buffers), then the implementation is free to not bother to query the location of each buffer object in the VAO at render time. Instead, it can just build a GPU-ready sequence of commands to start the rendering.

Locking the VAO would also cause it to become immutable; attempts to modify or delete it (or its attached objects) would fail.

You can even combine locking with getting pointer names. Locking a VAO would mean you get a pointer back, which would be used in pointer APIs. While the VAO is locked, you cannot bind the integer name at all; you must use the pointer API.
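A toy model of the proposed locking semantics might look like this (entirely hypothetical API; it only illustrates the lock-returns-a-pointer and locked-means-immutable rules described above):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the proposed "lockable VAO": once locked, the object
 * hands out a stable pointer name and rejects further modification.
 * All names here are hypothetical sketches, not a real GL API. */
typedef struct {
    int  attached_buffer;
    bool locked;
} Vao;

static bool vao_attach(Vao *v, int buffer)
{
    if (v->locked)
        return false;              /* immutable while locked */
    v->attached_buffer = buffer;
    return true;
}

static Vao *vao_lock(Vao *v)       /* returns the "pointer name" */
{
    v->locked = true;
    return v;
}
```

Under this model the implementation can bake a GPU-ready command sequence at lock time, since nothing attached to the VAO can move or be deleted afterwards.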

Stephen A
09-27-2009, 01:59 PM
The simple fact that OpenGL 2.1- allowed you to specify your own object IDs. You could say glBindTexture(GL_TEXTURE_2D, 0xDEADBEEF) and the driver would have to keep working.


Hmm, 2.1 is the most recent spec?
FYI, 3.0 deprecated application-defined object names. End of story.
FYI, 3.2 introduces the compatibility profile, which brings them back. End of story?

Alfonse Reinheart
09-27-2009, 02:25 PM
FYI, 3.2 introduces the compatibility profile which brings them back.

You don't have to use the compatibility profile. Implementations aren't required to even support them. The default when using the new create context is to get a core profile.

Dark Photon
09-28-2009, 06:43 AM
But if you make object pointers instead of object names you won't get all this excellent speedup. You still have to do the object data -> GPU address resolution. Low-hanging fruit would be to make object names pointers on a 32-bit OS.
Yeah, but who's still on one of those ;)

BTW, for others reading, the unstated NVidia bindless graphics assertion being discussed is here (http://developer.download.nvidia.com/opengl/tutorials/bindless_graphics.pdf), pg. 4-5.


So instead of making new "pointer" API I would rather go for new "bindless" API. If programmer has to change its implementation then there must be really good reason.
Up to 7X speedup is a pretty damn good reason, if you're selling commercial products built on OpenGL. Content sells. Only academicians can afford to adopt the purist theoretical-perfection argument, since it's about ease of learning, not performance.

Though long-term, you're probably right. If devs get widespread speed-ups through this technique, a general API retrofit would probably be best. Still, it's great that OpenGL supports adding prototype features like this through an extension mechanism, so developers can take them for a spin and provide feedback before OpenGL bets the farm on them. And kudos to NVidia for doing so. Big plus for GL over D3D.


I do not want to rewrite the code for pointer API and then again N months later to bindless API.
In a perfect world I don't either. Then again, back to (commercial) reality, where it's adapt or die.

mfort
09-28-2009, 07:09 AM
But if you make object pointers instead of object names you won't get all this excellent speedup. You still have to do the object data -> GPU address resolution. Low-hanging fruit would be to make object names pointers on a 32-bit OS.
Yeah, but who's still on one of those ;)

The vast majority. Plenty for benchmarking the potential benefit.




Up to 7X speedup (bindless api) is a pretty damn good reason, if you're selling commercial products built on OpenGL.

Yes, it is. Go for it. But I also agree with Alfonse Reinheart that the NV bindless API is a little bit too low-level. Please be aware that object pointers are not the same as a bindless API. I simply do not believe that the object-name-to-object-pointer resolution is a bottleneck due to cache misses. It can be done without extra memory indirection at all, such as with an index into a vector of objects.

elFarto
09-28-2009, 08:56 AM
It can be done without memory indirection at all, such as an index to vector of objects.
It's extremely likely that the OpenGL drivers use a hashtable, rather than an array/vector. Vectors don't make good data structures for this purpose, especially if you need to support the user specifying their own handles (which ATI and NVIDIA's drivers do).
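For illustration, a minimal name-to-object hash table of the kind a driver might use when arbitrary, user-chosen names (like 0xDEADBEEF) must be supported (toy code; real drivers are far more sophisticated). Note that each probe is a potential cache miss, which is the cost being debated here:

```c
#include <assert.h>
#include <stddef.h>

/* Toy open-addressing hash table mapping GL-style names to objects.
 * Linear probing; size and hash constant are illustrative only. */
#define TABLE_SIZE 64  /* power of two, so & works as modulo */

typedef struct { unsigned name; void *object; } Slot;
static Slot table[TABLE_SIZE];

static void insert(unsigned name, void *object)
{
    unsigned i = (name * 2654435761u) & (TABLE_SIZE - 1);
    while (table[i].name != 0)        /* 0 = empty, as in GL */
        i = (i + 1) & (TABLE_SIZE - 1);
    table[i].name = name;
    table[i].object = object;
}

static void *lookup(unsigned name)
{
    unsigned i = (name * 2654435761u) & (TABLE_SIZE - 1);
    while (table[i].name != 0) {      /* each probe risks a cache miss */
        if (table[i].name == name)
            return table[i].object;
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return NULL;
}
```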

Regards
elFarto

mfort
09-28-2009, 09:22 AM
if you need to support the user specifying their own handles (which ATI and NVIDIA's drivers do).


they don't need to (in OpenGL 3.0+)

Application-generated object names - the names of all object types, such as
buffer, query, and texture objects, must be generated using the corresponding
Gen* commands. Trying to bind an object name not returned by a Gen*
command will result in an INVALID_OPERATION error. This behavior is already
the case for framebuffer, renderbuffer, and vertex array objects.

Alfonse Reinheart
09-28-2009, 11:15 AM
they don't need to (in OpenGL 3.0+)

But since the same driver has to support both, they can't assume that.

mfort
09-28-2009, 11:32 AM
But since the same driver has to support both, they can't assume that.
They are not stupid. They can have two modes, pre-3.0 and 3.0+. A decent junior programmer must be able to do that in C++. This is typical low-hanging fruit: replacing one module to get rid of cache misses when translating object names to pointers.
Do not change an API just because you think you cannot make a better implementation of it. Otherwise you will only be rewriting your code all the time. You may think the new code is better until you realize you are in the same sh*t. Designing APIs and making compatible implementations is hard.
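The "3.0+ mode" could be sketched like this (toy code, hypothetical names): when every name comes from a Gen* call, the driver can hand out consecutive names that index a flat array, so resolution is a single load with no hashing:

```c
#include <assert.h>
#include <stddef.h>

/* Toy "3.0+ mode" name allocator: since user-chosen names are gone,
 * name = index + 1 into a flat array, and lookup is one indexed load.
 * Sizes and names are illustrative, not driver code. */
#define MAX_OBJECTS 256

static void    *objects[MAX_OBJECTS];
static unsigned next_name = 1;      /* 0 stays reserved, as in GL */

static unsigned genName(void *object)
{
    objects[next_name] = object;
    return next_name++;
}

static void *resolve(unsigned name) /* one array access, no probing */
{
    return (name > 0 && name < next_name) ? objects[name] : NULL;
}
```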

Alfonse Reinheart
09-28-2009, 11:56 AM
Do not change an API just because you think you cannot make a better implementation of it.

The easier you make it to have implementations provide more performance, the more likely it will be that implementations will actually provide that performance.

Stephen A
09-28-2009, 02:38 PM
Do not change an API just because you think you cannot make a better implementation of it. Otherwise you will only be rewriting your code all the time. You may think the new code is better until you realize you are in the same sh*t. Designing APIs and making compatible implementations is hard.
So says you. However note that DirectX changes APIs every single version and it's the better API for it.

mfort
09-28-2009, 03:03 PM
So says you. However note that DirectX changes APIs every single version and it's the better API for it.

Such a cheap argument. DirectX is not evolving; every single version is a revolution. They do not need to keep compatibility, and they don't care: their philosophy is to make a new API from scratch every time, done well, as close as possible to current HW technology. They do this on purpose, not because they have poor programmers who cannot keep compatibility. They made their choice and they are successful with it. OpenGL is different: many small, evolving steps, as compatible as possible.

Alfonse Reinheart
09-28-2009, 03:31 PM
OpenGL is different. Many small, evolving steps. As compatible as possible.

Which is why OpenGL is worse. It is more prone to driver bugs (due to the complexity of the vast API), contains innumerable ideas that made sense at the time but are ridiculous from a modern perspective (go ahead; explain how to attach a buffer object to a VAO to someone or how to attach a texture to a program), etc.

kRogue
10-14-2009, 02:15 PM
Bindless graphics is a horrible API. It breaks the basic model of vertex shaders (let alone uniforms), requiring you to write shaders specifically for it. It is also incredibly low-level, which makes widespread implementation difficult if not impossible.


I soooooooooo disagree on it being a horrible API. Also, Alfonse Reinheart has his own wishes of doing VAO locks, so beware of not-so-neutral opinions.

Here is why I like the bindless API:
1. Very, very straightforward to use: allocate the buffer, make the memory resident. That is it. If you re-allocate the buffer then you have to make it resident again. Pretty simple and straightforward in my eyes. For vertex attributes it does not require one to rewrite shaders or anything.

2. clearly breaks the glVertexAttribPointer call into two functions that
a. set format of data
b. set source of data

3. Pointers in shaders! Maybe this is what Alfonse Reinheart means about needing to rewrite your shaders? Without bindless graphics, uploading a complicated scene graph to the GPU requires a high level of trickiness, and is not something I'd want to do by hand!

As for it being too low-level: come on, people, how low-level is providing a GPU address for a block of GPU-managed memory, really? If GLint64/GLuint64 bothers you, let's just do this:

typedef uint64_t GL_buffer_object_address;

and then it will look like the bindless API is abstract (rather than calling it an address, call it a lock-binding point or something, giggles). As for being too low-level, buffer objects are already very low-level; it is not like buffer objects are allowed to compress their data or anything, it is _raw_ bytes whose memory management is done by GL (with all the horror of indirect rendering across different endiannesses one can imagine!). Additionally, UBOs have much nastier packing rules than NVIDIA's bindless API.

My 2 cents.

Alfonse Reinheart
10-14-2009, 03:59 PM
Without bindless graphics putting a complicated scene graph uploaded to the GPU requires a high level of trickiness and not something I'd want to do by hand!

If I were doing something where I needed a "complicated scene graph," I'm pretty sure I'd use OpenCL. This is why OpenCL exists; so that OpenGL's shading language can be a shading language, not an arbitrary programming language.


As for it being too low-level: come on, people, how low-level is providing a GPU address for a block of GPU-managed memory, really?

Very. When you deal with direct pointers to things, you are dealing at a low level. And pretending to hide the pointer doesn't help.

If you can achieve the same performance effects of bindless without breaking the abstraction the way it does, then you should. That's why mapping buffers is OK; it allows you to get performance that you otherwise couldn't.

I have yet to see evidence that an alternate API, aimed at the particular gains of bindless graphics but while preserving the abstraction, would be unable to achieve similar results.


As for being too low-level, buffer objects are already very low-level; it is not like buffer objects are allowed to compress their data or anything, it is _raw_ bytes whose memory management is done by GL (with all the horror of indirect rendering across different endiannesses one can imagine!).

It isn't low level, because you are unable to directly access or affect this memory. The memory has a controlled interface, which is what gives drivers freedom.


Additionally, UBO's have much nastier packing rules than nVidia's bindless API.

That's because UBOs are, get this, cross platform. Bindless graphics only has to work on NVIDIA hardware. There is a reason why this is an NV extension, and not an EXT extension like separate shader objects.

And the std140 packing rules are basically standard C. I fail to see how this is "nasty" in any way.

kRogue
10-15-2009, 01:33 PM
That's because UBOs are, get this, cross platform. Bindless graphics only has to work on NVIDIA hardware. There is a reason why this is an NV extension, and not an EXT extension like separate shader objects.

And the std140 packing rules are basically standard C. I fail to see how this is "nasty" in any way.


*cough* *hack*. UBOs have a funny packing when it comes to vec3, ivec3 and uvec3... they all take up the room of a 4-vector, i.e. simple 32-bit-aligned packing rules are not enough to describe them, nor for that matter are 64-bit packing rules. Though in truth the issue is moot: you can query GL for the offsets anyway. Additionally, the bindless graphics API does define packing rules, and they are actually 9/10 easier than UBOs'.

Buffer objects are only quasi-cross-platform; do indirect rendering between different endiannesses to see what I mean.



It isn't low level, because you are unable to directly access or affect this memory. The memory has a controlled interface, which is what gives drivers freedom.


Um, *cough* *again*. Have you really read the bindless extensions at all, or even tried to use them? Firstly, bindless graphics is only about _reading_ from buffers; there is nothing there about _writing_. The only part that is required is that GL needs to be told to make the buffer available and to give out an address for the GPU to use. Change the word address to handle and you are all set!

Just to make sure we are on the same page, let's take a look at what its interface for pointers in shaders is:



GLSL over-simple example:

uniform float **funkyness;
in ivec2 indexing;

// two dependent reads through buffer "pointers"
float funkiness_I_want = funkyness[indexing.x][indexing.y];


Nothing weird there; it looks like we can finally do lots of things we take for granted everywhere else. In fact this is much clearer than, say, packing data into several texture buffer objects once you get into more complicated structures (since a fixed texture buffer object can only ever return one type). Now for the GL side:



GLuint *funky_buffers, funkyness;
GLuint64 *funky_buffer_addresses, funkiness_address;

funky_buffers = new GLuint[dimX];
funky_buffer_addresses = new GLuint64[dimX];
glGenBuffers(dimX, funky_buffers);
glGenBuffers(1, &funkyness);

for(int i = 0; i < dimX; ++i)
{
  //allocate them
  glBindBuffer(GL_ARRAY_BUFFER, funky_buffers[i]);
  glBufferData(GL_ARRAY_BUFFER, sizeof(float)*dimY, NULL, usage_enum);

  //make the buffer resident:
  glMakeNamedBufferResidentNV(funky_buffers[i], GL_READ_ONLY);

  //get the "address"
  glGetNamedBufferParameterui64vNV(funky_buffers[i], GL_BUFFER_GPU_ADDRESS_NV, &funky_buffer_addresses[i]);
}

//fill funkyness with the "pointers" to each buffer object
glBindBuffer(GL_ARRAY_BUFFER, funkyness);
glBufferData(GL_ARRAY_BUFFER, sizeof(GLuint64)*dimX, funky_buffer_addresses, usage_enum);

glMakeNamedBufferResidentNV(funkyness, GL_READ_ONLY);
glGetNamedBufferParameterui64vNV(funkyness, GL_BUFFER_GPU_ADDRESS_NV, &funkiness_address);

//do whatever you like to fill the buffer data with
//glBufferSubData or transform feedback, or whatever;
//just don't reallocate the buffer object with glBufferData.
//Also note that you can change which buffers funkyness uses
//by just changing the values.

GLint funkyness_uniform;

funkyness_uniform = glGetUniformLocation(GLSLProgram, "funkyness");

glUniformui64NV(funkyness_uniform, funkiness_address);



How does that break the abstraction, really? Change the word address and the type GLuint64 to, say, "locked-buffer-id" and "GL_locked_buffer_id_type".

Lastly:



If I were doing something where I needed a "complicated scene graph," I'm pretty sure I'd use OpenCL. This is why OpenCL exists; so that OpenGL's shading language can be a shading language, not an arbitrary programming language.


Unfreaking believable, really. If one can send the data to the shader in a more flexible way then that is SOOO much better. Simple things like skinning are much easier with bindless than without (MD5 skinning is much easier to write with bindless than without). Bindless graphics also gets rid of something that is so _irritating_ in 3D graphics: the endless clever repacking of vertex data to fit into the simple vertex attribute model. An additional bit is this: a lot of the streaming of vertex data to the GPU is no longer needed with bindless graphics; you can do all the calculation on the GPU with much of the flexibility one takes for granted on the CPU. With bindless graphics, if you are sick enough, you can reduce rendering many different models to just one instanced draw call, not just models in different places, but models with different data sets entirely. Whether or not this is the best thing for performance is not clear, since:



5) What are the performance characteristics of buffer loads?

RESOLVED: Likely somewhere between uniforms and texture fetches,
but totally implementation-dependent. Uniforms still serve a purpose
for "program locals". Buffer loads may have different caching
behavior than either uniforms or texture fetches, but the expectation
is that they will be cached reads of memory and all the common sense
guidelines to try to maintain locality of reference apply.



One more nasty bit:


Which is why OpenGL is worse. It is more prone to driver bugs (due to the complexity of the vast API), contains innumerable ideas that made sense at the time but are ridiculous from a modern perspective (go ahead; explain how to attach a buffer object to a VAO to someone or how to attach a texture to a program), etc.


Giggles, EXT_direct_state_access handles most of that quite well; at this point the best thing to do for DSA is to just take the extension into the spec: for the compatibility profile as is, for the core profile with all the references to removed stuff taken out. On the subject of writing drivers, take a look at slide 37 of Kilgard's presentation (http://www.slideshare.net/Mark_Kilgard/opengl-32-and-more ):



Deprecation Myths
-Feature removal will result in a faster driver
-Feature removal will result in a higher quality driver
-Feature removal will result in a cleaner API
-Not removing features means OpenGL will die
-Only useless features were deprecated
----Far from true


Considering who Kilgard is, I tend to take his word.

Alfonse Reinheart
10-15-2009, 02:46 PM
UBOs have a funny packing when it comes to vec3, ivec3 and uvec3... they all take up the room of a 4-vector, i.e. simple 32-bit-aligned packing rules are not enough to describe them, nor for that matter are 64-bit packing rules.

This is not "funny" packing. It's quite common when dealing with low-level SSE-type math operations that vec3s take up the same room as vec4s.
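The layout rule under discussion can be checked with a plain C mirror of an std140 block (illustrative only): a vec3 member gets vec4 size and alignment, so explicit padding is needed:

```c
#include <assert.h>
#include <stddef.h>

/* C mirror of the std140 rule: a vec3 has the base alignment and
 * stride of a vec4, so in
 *     uniform Block { vec3 a; vec3 b; };
 * b lives at offset 16, not 12.  The pad member makes the C struct
 * match those offsets. */
typedef struct { float x, y, z, pad; } vec3_std140; /* 16 bytes, not 12 */

typedef struct {
    vec3_std140 a;   /* std140 offset 0  */
    vec3_std140 b;   /* std140 offset 16 */
} BlockMirror;
```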


Buffer objects are only quasi-cross-platform; do indirect rendering between different endiannesses to see what I mean.

How would that even be possible? Wouldn't that mean that you had a CPU with a different endianness than the GPU that you're using? That would break one of the basic assumptions of buffer objects.

And using pointers instead of buffer objects wouldn't improve this any. So I'm not really sure what your point here is.


In fact this is much clearer than say packing data into several texture buffer objects once you get into more complicated structures

I fail to see how this would not be achievable with simply a more flexible implementation of uniform buffer objects. NVIDIA is perfectly capable of, via pointers, making UBO accesses into pointer accesses behind the scenes. So why don't they?

It would be even cleaner, since you would not need access to actual pointers.


the endless clever repacking of vertex data to fit into the simple vertex attribute model.

If your vertex data is actually vertex data, where is the "repacking" coming from?

If you need a general-purpose computation API, OpenCL exists. I see no need to make GLSL into that.


Considering who Kilgard is, I tend to take his word.

Considering that Kilgard works for NVIDIA, who does not have a vested interest in making OpenGL implementations easier to write (since they already have one. They want Intel's job with Larrabee to be as hard as possible), I'll go with the empirical data: ATI's D3D implementation is more solid than their OpenGL implementation, and D3D implementations are easier to write than GL implementations.

Further, his statements make no sense. It is a verifiable fact that the least buggy code is the code that is never written. So while feature removal will not guarantee these things, not removing features certainly isn't helping.

Unfortunately, Kilgard's words can't make ATI or Intel's OpenGL implementations better. Making OpenGL implementations simpler has at least some chance of working.

kRogue
10-16-2009, 12:42 AM
This is turning into a flame war, but oh well, since you do not know what indirect rendering even is:



How would that even be possible? Wouldn't that mean that you had a CPU with a different endianness than the GPU that you're using? That would break one of the basic assumptions of buffer objects.

And using pointers instead of buffer objects wouldn't improve this any. So I'm not really sure what your point here is.


So, just so you know: indirect rendering is where the process and the GL server are on different machines. With X Windows one can launch a process on one machine and have it render on another, and this rendering includes _GL_. Now what happens when the endianness of the machine where the process is running and that of the X server don't match? All hell breaks loose with respect to buffer objects. The endianness of the data inside a buffer object is the endianness of the _server_, not the process. So when you pack data into your buffer object, you have to make sure that you pack it in the endianness of the X server where it is being rendered. This is a _big_ deal under a variety of circumstances. Before buffer objects, vertex data was considered _client_ data, and as such the transport mechanism and GL driver took care of the endianness issues; but with buffer objects it is _server_ state and must be in the endianness of the server. My point about buffer objects being server memory is that the abstraction leaks significantly anyway: you need to know the endianness of both the server and the machine running the process.



I fail to see how this would not be achievable with simply a more flexible implementation of uniform buffer objects. NVIDIA is perfectly capable of, via pointers, making UBO accesses into pointer accesses behind the scenes. So why don't they?

You are really missing some critical bits:
1. UBO's have a very, very tight limit on size.
2. There is a very hard limit on the number of UBO's available.
3. The rule of thumb for UBO's is that access is slower than a plain uniform access but faster than everything else, subject to sequential caching rules.

Let's give a simple, simple example where bindless graphics is definitely worthwhile.

We have two key frame meshes, call them A and B, separately animated, each with a different number of frames. You wish to create a mesh where some vertices are from A and some from B. Of critical importance is that some triangles have vertices from A and B. Texture co-ordinates however are not taken from A or B.

How would you do this without bindless graphics? The easiest, not to mention dumbest, thing is to create a new keyframe mesh with number_frames=number_frames(A)*number_frames(B) and proceed directly from there. Another approach is to use transform feedback, but all of these answers are actually silly; this is an example where the API is getting in the way. Bindless graphics gives you this shader:



uniform mat4 *matrixTransformations;
uniform vec4 **meshVerticesFrame0, **meshVerticesFrame1;
in ivec2 which_vertex; // .x holds which mesh, .y holds which vertex
uniform float *t;

void main(void)
{
    vec4 v;

    // bug fix vs. the original post: the second mix() argument must come
    // from frame 1, and the attribute is spelled which_vertex throughout
    v = matrixTransformations[which_vertex.x] *
        mix(meshVerticesFrame0[which_vertex.x][which_vertex.y],
            meshVerticesFrame1[which_vertex.x][which_vertex.y],
            t[which_vertex.x]);

    // whatever more...
}


Simple, easy to read, and it even supports an arbitrary number of meshes. This was just a quick, simple job, and since there are no extra setup steps at all, the GL-side code is much, much easier too.

Let's move on to MD5 skinning, OK?

For MD5 skinning, a vertex v is computed as


for(i=0, p=vec3(0,0,0); i<number_weights(v); ++i)
{
    p += weight(v,i) * matrix[which_joint(v,i)] * weight_position(v,i);
}


The typical way to map that into GL without bindless graphics is to set a hard maximum on the number of weights and then use one attribute for each possible weight. This wastes memory, since some vertices have lots of weights and some have very few. As an exercise, write it with bindless graphics and observe that less video memory is needed and the code is easier to read on both the GLSL and the GL side.
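For reference, the per-vertex loop above can be sketched on the CPU like this. Plain C; the struct layout and names are my own invention, and the joint transforms are reduced to translations to keep the sketch short:

```c
#include <stddef.h>

/* CPU sketch of the MD5 skinning loop: each vertex owns a variable-length
   run of weights in a shared weight array -- no hard cap of 4 per vertex. */
typedef struct { float x, y, z; } vec3;

typedef struct {
    size_t first_weight;  /* index of this vertex's first weight */
    size_t num_weights;   /* variable per vertex                 */
} md5_vertex;

typedef struct {
    float  bias;          /* weight(v,i)                                */
    size_t joint;         /* which_joint(v,i)                           */
    vec3   pos;           /* weight_position(v,i), in joint-local space */
} md5_weight;

/* p += bias * (joint_transform * pos); joint transforms are reduced to a
   translation here so the sketch stays short. */
static vec3 skin_vertex(const md5_vertex *v, const md5_weight *w,
                        const vec3 *joint_translation)
{
    vec3 p = {0.0f, 0.0f, 0.0f};
    for (size_t i = 0; i < v->num_weights; ++i) {
        const md5_weight *wi = &w[v->first_weight + i];
        const vec3 *j = &joint_translation[wi->joint];
        p.x += wi->bias * (j->x + wi->pos.x);
        p.y += wi->bias * (j->y + wi->pos.y);
        p.z += wi->bias * (j->z + wi->pos.z);
    }
    return p;
}
```

The point of the bindless version is exactly this layout: variable-length weight runs indexed per vertex, with no per-attribute padding.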



Considering that Kilgard works for NVIDIA, who does not have a vested interest in making OpenGL implementations easier to write (since they already have one. They want Intel's job with Larrabee to be as hard as possible), I'll go with the empirical data: ATI's D3D implementation is more solid than their OpenGL implementation, and D3D implementations are easier to write than GL implementations.

Further, his statements make no sense. It is a verifiable fact that the least buggy code is the code that is never written. So while feature removal will not guarantee these things, not removing features certainly isn't helping.


Now you are really beginning to shovel it, with such choice gems as nVidia trying to make GL harder to implement. Unbelievable. The reason ATI had poor GL drivers until a year or two ago is simple: they did not spend the man power on it, just enough to run Quake/Doom/Id games. D3D drivers are not easy to write and can be quite hairy too. Have you written D3D drivers? Have you written GL drivers?

Alfonse Reinheart
10-16-2009, 11:48 AM
1. UBO's have a very, very tight limit on size.

But they do not have to. If NVIDIA wanted, they could implement UBOs as actual pointers under the hood. Then, they could simply put a really big number on the size limit.

And if this is not possible (presumably because uniforms must have a definite size when defined in GLSL), they could simply have made an extension relaxing that limitation. That is, you could define a uniform like:



uniform mat4 myMatrixList[];


This would only be legal in a uniform block. The size then becomes whatever the user gives it. There would be specific grammar restrictions on how this can work (unbounded arrays must be the last thing in the block, etc), but there is nothing preventing this from being implemented.

This provides similar functionality while maintaining the abstraction. The only thing you lose is indirection: the ability to put a pointer inside a uniform and access it indirectly. Essentially, a uniform within a uniform.

Note: I'm not arguing against the utility of bindless. Yes, you can find uses for it. I'm arguing against the fact that it breaks a very useful abstraction. And it does so without needing to.


2. There is a very hard limit on the number of UBO's available

See above.


How would you do this without bindless graphics?

If "bindless" was implemented as above, it would work just fine. It would also be cross-platform, rather than NVIDIA-specific.


the typical way to map that into GL without bindless graphics is to set a hard maximum number of the number of weights and then for each possible weight one uses an attribute.

The typical way this is done is to limit the number of weights to 4, so that the weights all fit into 1 attribute. Yes, this does waste memory for vertices with fewer than 4 weights. But it is certainly good enough.
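As a sketch of the memory trade-off being described, here is one common layout for the fixed-4-weights scheme. Plain C; the struct is illustrative, not any particular engine's format:

```c
#include <stddef.h>

/* The fixed "4 weights per vertex" attribute layout: every vertex pays for
   4 joint-index slots and 4 weight slots whether it uses them or not.
   Indices are stored as floats so each group fits one vec4 attribute. */
typedef struct {
    float joint_index[4];  /* passed as one vec4 attribute    */
    float weight[4];       /* passed as a second vec4 attribute */
} packed_skin_attribs;

/* Bytes wasted for a vertex that actually uses 'used' weights (used <= 4):
   each unused slot burns one index float plus one weight float. */
static size_t wasted_bytes(size_t used)
{
    return (4 - used) * (sizeof(float) * 2);
}
```

So a one-weight vertex wastes 24 of its 32 skinning bytes; whether that is "certainly good enough" is exactly what the two posters disagree about.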

I would also point out that you lose something with bindless. If your mesh data is no longer using attributes to get information, then you also lose the automatic conversion to the input type. You can pass unsigned bytes normalized on [0,1] as attributes.

But if you want to do that with bindless and avoid attributes, then your shader must specifically be written to use and expect unsigned bytes, and it must specifically do the conversion (including normalization). If you have one mesh that uses unsigned bytes and one mesh that uses unsigned shorts, you must have and maintain two different shaders.

I'll take the hardcoded, efficient, and free conversion logic for attributes over that.


Now you are really beginning to shovel it with such choice gems that nVidia is trying to make GL harder to implement, unbelievable.

I never said that. I said that they do not have a vested interest in making implementations easier. That is different from saying that they are actively trying to make them harder.

Not being interested in making things easier means that they provide no support for doing so. So when an NVIDIA spokesperson comes along and says that making OpenGL less complex will not improve buggy implementations, I will take this comment with the proper skepticism based on where it comes from.


Have you written D3D drivers? Have you written GL drivers?

It is still an order-of-magnitude easier to write a D3D10 driver than a full 3.2 compatibility OpenGL implementation. Just a few of the things you have to do in GL that you don't in D3D10:

1: TexEnv.
2: Fixed-function T&L.
3: Selection.
4: The stupid form of feedback.
5: Partial fixed function interactions (when some things are shaders and others are FF).
6: GLSL compiler (D3D implementations only have to consume an assembly language).

These are not small tasks. TexEnv in particular is a highly complicated bit of shader building, as is the fixed-function T&L.

Yes, it can be done; both NVIDIA and ATI have done this. But it is still non-trivial code that has to be written and tested. D3D does not have this.

kRogue
10-16-2009, 01:24 PM
But they do not have to. If NVIDIA wanted, they could implement UBOs as actual pointers under the hood. Then, they could simply put a really big number on the size limit.


But they probably NEED to. The expected access performance for UBO's is higher than for bindless. Additionally, if you take that view, UBO's should not have any practical size limits, just as samplerBuffers don't. But the point is that each has a different expected usage pattern, and as such they are implemented differently. Even over in D3D10 land the equivalents of UBOs and texture buffer objects are different, and the MSDN article on them goes on (and on) about that too.



The typical way this is done is to limit the number of weights to 4, so that the weights all fit into 1 attribute. Yes, this does waste memory for vertices with fewer than 4 weights. But it is certainly good enough.


ROFL. Open up an MD5 mesh from Doom3, a game a generation old, and see that 4 is NOT enough.



I would also point out that you lose something with bindless. If your mesh data is no longer using attributes to get information, then you also lose the automatic conversion to the input type. You can pass unsigned bytes normalized on [0,1] as attributes.


Keep in mind that the driver does this conversion for you, and, ahem, it is not free at all. Additionally, like 99.99% of the time the form of the input data is pretty fixed, and you do not vary the form of the input data for a fixed shader, so the ability to change the format/interpretation of the data without changing the shader has a pretty weak use case. Really freaking weak.




if "bindless" was implemented as above, it would work just fine. It would also be cross-platform, rather than NVIDIA-specific.


NO. Even allowing stuff like MyData[], because you have to put it at the end of a UBO, you are not matching bindless at all, because you cannot do:



struct
{
    mat3 some3Mats[];
    mat4 some4Mats[];
};


but you can do in bindless:



struct
{
    mat3 *someMat3s;
    mat4 *someMat4s;
};


Shoot, with bindless you can do SO much more:



struct perThingy
{
    mat4 *funky1;
    float giggles;
    int foobar;
};

struct
{
    struct perThingy **holybatman;
    vec3 ***jumpyWillikirs;
    mat3 m;
};




The main issue is: what is the data type of the "pointer" over on the GL side? Here again, you can abstract it as follows:



GLbuffer_binding v;

glMakeNameBufferResident(myBuffer);
v = glGetNameBufferGLSLHandle(myBuffer);

// later:
ptr[location] = v;
glBufferSubData(blah.blah);


How does that break the abstraction? It also clearly admits something we have to see immediately: all a buffer object is, is memory managed and manipulated through GL. That is it; there is no abstraction beyond that. The only possible bitch slap you can have is that different GL implementations may handle buffer data in a really wonked-out way; maybe the "GPU address" is like 1024 bytes or something, in which case bindless is hosed, since it assumes that GLSL can quickly access the memory of a buffer object by looking at something 64 bits wide. Maybe in 2140 we'll need 1024-bit-wide addresses or something equally silly (actually, 64-bit OS's don't even use the full 64 bits as an address anyway). The natural hack fix would be to query GL for the size of the thingy so GLSL could get to the memory faster, but this is hackish and would be horribly awkward to use.

Just to keep harping on how great bindless is, consider a typical deferred shading system: you draw the usual stuff to some offscreen buffers:
1. diffuse color
2. normal
3. specular
4. positional data (typically just z)
5. material ID

where the material ID selects the deferred shader to use. Now, if you wanted to support, say, per-mesh data, then you would have to start packing that data into one (or two) common buffer objects and bind them as texture buffer objects, and naturally fetch the offsets for that pixel into those buffers with another value (or pack an offset into the g-buffer which in turn refers to another texture buffer object which in turn holds the material IDs, etc). With bindless you don't have to pack the data so awkwardly; you can do what you really want: POINT to the data, with increased readability and much easier GL code too.
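The "pack everything into one buffer plus an offset table" workaround being criticized looks roughly like this on the CPU side. Plain C sketch; the struct and names are invented for illustration:

```c
#include <stddef.h>

/* Non-bindless workaround: all per-mesh material data packed into one big
   buffer, with the G-buffer's material ID used to index an offset table. */
typedef struct {
    float shininess;
    float reflectivity;
} material_params;

/* Resolve a material ID to its record inside the packed buffer. With
   bindless you would instead store a pointer per material and skip both
   the packing and this indirection. */
static const material_params *lookup_material(const unsigned char *packed,
                                              const size_t *offsets,
                                              size_t material_id)
{
    return (const material_params *)(packed + offsets[material_id]);
}
```

Every consumer of the data has to agree on the packing and carry the offset table along; that bookkeeping is what "POINT to the data" removes.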

I am not even going to start harping on your driver comments, really I am not... must control myself... must... sighs, have to say it:

1. glTexEnv is unfortunately really two functions wrapped up into one:
a. controls multi-texturing for the fixed-function pipeline's fragment uber-"shader".
b. GL_TEXTURE_FILTER_CONTROL: affects choosing the texture level of detail.

For a., again, what we are seeing is that the fixed-function pipeline is an "uber"-shader. For b., that is awkward; I cannot even defend it.

2. Fixed-function T&L: not that it matters; if you look at it correctly it just means that the GL implementation provides default uber vertex and fragment shaders linked together (with name 0), and a variety of fixed-T&L state like GL_LIGHT, etc., is mapped to appropriate under-the-covers glUniform calls.
3. Selection: ahem, there's like a 99.99% chance this is implemented entirely on the CPU nowadays anyway.
4. GL_FEEDBACK, i.e. stupid feedback: same story, 99% chance it's in software as well.
5. Partial fixed-function interactions: enter EXT_separate_shader_objects, and the fixed-T&L vertex and fragment stages become private shaders of the driver bound to name 0.



6. GLSL compiler (D3D implementations only have to consume an assembly language).

This is almost right and worth noting, except that the assembly has to be transformed into whatever the GPU really thinks in. It would be nice if Khronos would update the asm-style shaders (rather than just nVidia updating them and giving them NV-extension status only); then we could see GLSL compilers as external programs... well, actually we already have that on the nVidia platform, it is called cgc -oglsl. But wait, there is more! Different GPU architectures, just like different CPU's, will want to schedule instructions and break them down differently. So to consume a generic assembly interface, a D3D driver most likely has a dynamic recompiler buried in it. Joy.

Alfonse Reinheart
10-16-2009, 01:58 PM
But they probably NEED to. The expected access performance for UBO's is higher than bindless. Additionally, if you take that view, the UBO's should not have any practical size limits, just as samplerBuffer's don't. But the point is that each has a different expected usage pattern and as such they are implemented differently. Even over in D3D10 land the equivalents to UBO and Texture buffer objects are different and the MSDN article on them goes on (and on) about that too.

The purpose of an abstraction is to abstract things. This frees the hardware to implement things how it wants, and exposes this functionality to the user via the abstraction.

This is part of the reason why performance is not a part of the OpenGL specification. NVIDIA is perfectly free to use pointers to implement uniform buffers. If they don't do it, it is because they choose not to.


ROFL. Open up an MD5 mesh from Doom3, a game a generation old, and see that 4 is NOT enough.

I said the typical way. Doom3 is not a typical game.


Even allowing stuff like MyData[], because you have to put it at the end of a UBO, you are not matching bindless at all, because you cannot do:

Just break the struct into two separate uniform buffers. Yes, it's not as "pretty" as the single struct of pointers, but it gets the job done. And that's what matters.


How does that break the abstraction?

Because you can have indirection in the shader.

The pointer value can be stored in a uniform. The shader can read that value, cast it to a pointer, and access it as just another pointer.

Once you have pointers in GLSL, they can go anywhere. That's why it is so important to keep them out.

Furthermore, it breaks the careful packing rules that allow UBO to be cross platform.


2. Fixed-function T&L. Not that it matters, if you look at it correctly it just means that the GL implementation provides a default shader (with name 0) and a variety of state from fixed T&L like GL_LIGHT, etc, are mapped to appropriate under-the-covers glUniform calls.
3. Selection: ahem, like 99.99% chance this is implemented all on CPU anyways nowadays.
4. GL_FEEDBACK, ie. stupid feedback, same story: 99% chance in software as well.

The default shader has to be written and debugged. And, if it is meant to be used, it must run reasonably fast. Thus, it must be optimized. A single massive monolithic shader won't run fast; you have to dynamically build it from pieces of shaders for optimal performance.

Things that run in software still have to be written. That means you now need to write, debug, and maintain a software renderer. This is not a trivial undertaking.


Different GPU architectures, just like different CPU's, will want to schedule instructions and break them down differently. So now to take a generic assembly interface a D3D driver most likely has buried in it a dynamic recompiler.

I don't know what you mean by a "dynamic recompiler", but whatever you would do for the assembly language, you would do for GLSL. And you still have to implement the compiler part, which is a non-trivial thing that the assembly version makes fairly trivial.

kRogue
10-16-2009, 02:26 PM
The purpose of an abstraction is to abstract things. This frees the hardware to implement things how it wants, and exposes this functionality to the user via the abstraction.

This is part of the reason why performance is not a part of the OpenGL specification. NVIDIA is perfectly free to use pointers to implement uniform buffers. If they don't do it, it is because they choose not so.


Sighs. The entire point of having all these different ways of accessing data is that they implicitly state how you intend to access it. GL is not just about abstraction, which is nice up to a point; it is about using the 3D hardware _well_ without needing to write for a specific GPU or understand every freaking piece of hardware's internals.



I said the typical way. Doom3 is not a typical game.


Right, Doom3 is not typical at all: completely weird architecture, nothing but corner use cases, bad model formats, etc. Give me a break.



Because you can have indirection in the shader.

The pointer value can be stored in a uniform. The shader can read that value, cast it to a pointer, and access it as just another pointer.


So what? The abstraction of a buffer object is just this: bytes managed by GL. To be an ass: then casting pointers in C is also a horrible abstraction break, right? Actually, I don't think that bindless lets you cast between pointer types at all, but this I have to check.



Furthermore, it breaks the careful packing rules that allow UBO to be cross platform.


Bindless also has strict packing rules; they are specified in the specification. Those packing rules already guarantee it is cross-platform.



The default shader has to be written and debugged. And, if it is meant to be used, it must run reasonably fast. Thus, it must be optimized. A single massive monolithic shader won't run fast; you have to dynamically built it from pieces of shaders for optimal performance.

Funny, that: quite some time ago someone, I think it was nVidia, published shader code that gave fixed functionality via a shader. And really, look at what fixed function is; writing one shader for it is NOT a big deal at all (you could make a case for requiring 8 shaders, one for each possible count of texture combiner stages). Not to be nasty, but that is far from rocket science.




Things that run in software still have to be written. That means you now need to write, debug, and maintain a software renderer. This is not a trivial undertaking.


Giggles: both old-style feedback and selection do NOT render anything; they focus entirely on the vertex processing stage! Shoot, it is not very hard to implement using transform feedback and reading the buffer data back from GL. Please.



I don't know what you mean by a "dynamic recompiler", but whatever you would do for the assembly language, you would do for GLSL. And you still have to implement the compiler part, which is a non-trivial thing that the assembly version makes fairly trivial.


ROFL. OK, you need to look up what a dynamic recompiler is. The bone-dead-simple answer is this: you feed it compiled code for one architecture and it outputs compiled code for another architecture, with the understanding that it will schedule and such for that output architecture. And guess what: there's an epically high chance that is what D3D drivers have to do to take the D3D assembly and feed something to the GPU. Do you think a texture fetch is just "one instruction" on a GPU? Between filtering and all that love, it is a set of instructions.
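As a toy illustration of the point: one assembly-level opcode lowers into several machine-level micro-ops. Plain C; the opcode names and the three-step expansion are invented for illustration, real GPUs differ:

```c
#include <stddef.h>

/* Toy "dynamic recompiler" step: lower one assembly-level op into the
   micro-ops a hypothetical GPU actually executes. */
enum asm_op { ASM_MOV, ASM_TEX };
enum uop    { UOP_MOV, UOP_ADDR_CALC, UOP_FETCH_TEXEL, UOP_FILTER };

/* Lower one assembly op into 'out'; returns the number of micro-ops. */
static size_t lower(enum asm_op op, enum uop *out)
{
    switch (op) {
    case ASM_MOV:
        out[0] = UOP_MOV;         /* 1:1 mapping */
        return 1;
    case ASM_TEX:                 /* one "instruction", several micro-ops */
        out[0] = UOP_ADDR_CALC;   /* compute texel address from coords    */
        out[1] = UOP_FETCH_TEXEL; /* the actual memory access             */
        out[2] = UOP_FILTER;      /* combine texels per the filter mode   */
        return 3;
    }
    return 0;
}
```

A real recompiler also reschedules and re-registers the result per architecture, which is the "buried in the driver" work being argued about.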

Alfonse Reinheart
10-16-2009, 04:09 PM
The abstraction that a buffer object is just this: bytes managed by GL.

No, it isn't. That's the abstraction that NV_vertex_array_range provides.

Buffer objects abstracts the location of the memory as well as allocation. Individual buffer objects have no relation to one another. They cannot refer to one another. And they do not live in any one particular place.


To be an ass, then casting pointers in C is also a horrible abstraction break right?

Which is why every competent book on C++ will tell you that having to do an explicit cast is generally a design flaw that you should avoid. C can't avoid it, but C is a low-level language used for writing general-purpose applications.


Bindless also has strict packing rules, they are specified in the specification. Those packing rules guarantee already it is cross platform.

No, they do not. These packing rules guarantee it works on all platforms that support the extension. Platforms that cannot support these packing rules would be unable to support the extension.

To have rules that guarantee true cross-platform support, you have to actually talk to people who make other platforms. The Uniform buffers packing rules are a compromise that was created so that all platforms could implement them.


you feed it compiled code for one architecture and it outputs compiled code for another architecture with the understanding that it will schedule and such for that output architecture

Does there need to be a special term for a compiler that happens to compile for an instruction set that the compiler itself is not executing on?

This is all part of the optimization stage of building programs. You cannot deny that it is harder to write a GLSL compiler+linker than it is to write one for, say, the highest version of NV assembly. They may share the optimizer underneath, but the front-end part of the compiler is much more complex in the GLSL case.


do you think a texture fetch is just "one instruction" on a GPU? Between filtering and all that love it is a set of instructions.

That assumes that filtering is not done by specialized filtering units and must instead be done implicitly by the shader. Most of the time, this is not the case. On ATI hardware, depth accesses must do the depth comparison in the shader, as they do not have dedicated hardware for that comparison. Otherwise, you can expect texture operators to be a single instruction.

Of course, them being single cycle is another matter entirely.

kRogue
10-17-2009, 01:24 AM
Sighs, the idiocy is too much, here goes, really this is the last freaking time:



No, it isn't. That's the abstraction that NV_vertex_array_range provides.

Buffer objects abstracts the location of the memory as well as allocation. Individual buffer objects have no relation to one another. They cannot refer to one another. And they do not live in any one particular place.


F'ing BS, dude. That buffer objects can't refer to each other is a missing feature, and bindless provides it. Where the buffer is really located on the card is heavily abstracted in bindless anyway: the "GPU address" is most certainly virtual, etc.



Which is why every competent book on C++ will tell you that having to do an explicit cast is generally a design flaw that you should avoid. C can't avoid it, but C is a low-level language used for writing general-purpose applications.


GLSL is sooo much more like C, and it should be, since it is executed on every fragment and vertex. But wait, it gets better! HLSL lets you cast willy-nilly too. The reason: it makes performance development easier and helps stop the API and language from getting in your way. What do you think happens when you write assembly anyway?



No, they do not. These packing rules guarantee it works on all platforms that support the extension. Platforms that cannot support these packing rules would be unable to support the extension.

To have rules that guarantee true cross-platform support, you have to actually talk to people who make other platforms. The Uniform buffers packing rules are a compromise that was created so that all platforms could implement them.


More BS: the packing rules provided by bindless provide a means for it to work on other platforms; that the UBO rules look the way they do is for those GPU's that are SSE-ish in their behavior. You can make a case for that. But guess what: that really, really does not matter. It is not exactly brain surgery to have an option for bindless to use UBO packing rules instead. But wait! Why did nVidia make those kinds of packing rules? The answer is so that structs made with 32-bit packing are almost the same as structs for bindless. That was done as an effort toward cross-platform support.

You could make this case:
UBO packing rules are what they are to allow for SSE-like behavior. All that means is that the packing of UBO's supports:
1) SSE 32-bit packing.

It is not really cross-hardware with respect to fantasy wacky hardware which cannot support that. With that in mind, if my head worked the same deficient way yours does, I would also say UBO is not cross-platform.
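For what it's worth, the alignment arithmetic underneath such packing rules boils down to rounding offsets up to a base alignment. This is a simplified std140-style sketch; the real rules in the GL spec cover many more cases (structs, matrices, vec3 padding, etc.):

```c
#include <stdint.h>

/* Round 'offset' up to the next multiple of 'alignment' (a power of two). */
static uint32_t align_up(uint32_t offset, uint32_t alignment)
{
    return (offset + alignment - 1u) & ~(alignment - 1u);
}

/* Under std140-style rules, the array stride of a scalar float rounds up
   from 4 bytes to vec4 alignment (16 bytes), so float a[N] occupies
   16*N bytes in the block -- one concrete reason the packing rules
   are worth arguing about. */
static uint32_t std140_float_array_stride(void)
{
    return align_up((uint32_t)sizeof(float), 16u);
}
```

Whichever set of rules wins (UBO's or bindless's), both sides of a platform boundary just have to compute these offsets identically.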



There needs to be a special term for a compiler that happens to compile for an instruction set that the compiler itself is not executing?

This is all part of the optimization stage of building programs. You cannot deny that it is harder to write a GLSL compiler+linker than it is to write one for, say, the highest version of NV assembly. They may share the optimizer underneath...


Sighs, here we go: make a D3D shader:

HLSL source --> D3D compiler --> D3D assembly

That D3D compiler does real work and source analysis.

Over in the driver:

D3D assembly --> Driver dynamic recompiler --> GPU instructions

Now, that middle step is needed in the D3D driver.



...but the front-end part of the compiler is much more complex in the GLSL case.


Giggles: the folks that pushed GLSL, 3DLabs, released a GLSL front end under an open-source, use-it-any-way-you-like license.



Otherwise, you can expect texture operators to be single instruction.

Giggles, ROFL. Now I know you have no clue, really no clue. That it is presented as one operation has nothing to do with what actually goes on. Do you think Intel's Larrabee will do it in one instruction? Get real.

As a side note, given that the thread topic is on direct state access, and for the last 2+ pages has been a debate on bindless graphics, I am no longer going to take the troll bait on this.

kRogue
10-19-2009, 01:56 PM
Just a quick FYI for those that read through my bile:

1. NV's bindless graphics DOES support pointer casting.

knackered
12-16-2009, 04:07 PM
that was fun - can't believe I missed it all these months. kRogue makes a good case.
I hate the whole vertex attribute/uniform bollocks.