More efficient drawing of vertex data



Gedolo2
02-02-2015, 12:35 PM
With DSA and bindless textures in place, one large piece is still missing before the API fully follows the bindless paradigm: vertex processing.
http://www.nvidia.com/object/bindless_graphics.html
I want bindless vertex processing as an ARB extension in the current 4.x line of OpenGL.

It is very useful for vertex-intensive scenes, and it is the last large missing piece in making the API bindless.
Efficient drawing of vertex data is a natural fit with the newer DSA and bindless-texture way of working.
https://www.opengl.org/registry/specs/NV/bindless_multi_draw_indirect.txt
http://www.ustream.tv/recorded/51212968
http://www.nvidia.com/object/bindless_graphics.html
Last slides of the following presentation (slide 77 and up): http://www.slideshare.net/Mark_Kilgard/opengl-45-update-for-nvidia-gpus?related=1

Gedolo2
02-03-2015, 12:51 PM
Modernizing the 4.x line is important. Even with the glNext initiative.
Adding these makes the OpenGL 4.x API compete well with current alternatives. Not coming over hopelessly ancient.
Provides a good stepping stone between older OpenGL versions and glNext.
Developers who start too early for glNext, or who are unwilling to target it, can choose OpenGL 4.x and will have an API that is somewhat modern in design.
It will also ease the porting work between glNext and OpenGL 4.x if better, more efficient vertex processing gets added.

Alfonse Reinheart
02-03-2015, 01:48 PM
Provides a good stepping stone between older OpenGL versions and glNext.
...
It will also ease the porting work between glNext and OpenGL 4.x if better, more efficient vertex processing gets added.

OK, I'll play this game.

Prove it.

Public API documentation on D3D12 and Mantle is scarce and behind NDA walls. However, Metal's docs are publicly available. And according to them, (https://developer.apple.com/library/ios/documentation/Metal/Reference/MTLRenderCommandEncoder_Ref/index.html#//apple_ref/occ/intfm/MTLRenderCommandEncoder/setVertexBuffer:offset:atIndex:) their equivalent to glVertexAttribPointer/glBindVertexBuffer/glBufferAddressRangeNV takes an MTLBuffer object reference. Not a GPU memory address like glBufferAddressRangeNV.

Given that as an example of an actual next-gen API, how exactly would switching to resident buffers and GPU memory addresses in any way help move to glNext?


Adding these makes the OpenGL 4.x API compete well with current alternatives. Not coming over hopelessly ancient.

Round 2: prove it.

How does OpenGL 4.5, by any objective measurement come across as "hopelessly ancient" compared to "current alternatives" in this regard? Neither D3D11, nor any version of D3D before it, directly uses GPU memory addresses. It requires using an actual object. Which is equivalent to binding it.

Indeed, the only difference structurally between OpenGL 4.5 and D3D on this matter is that OpenGL's VAOs are fully mutable, while D3D's equivalent objects are not.

I defy you to provide evidence for your claims. Show me one of the following:

1: A "current alternative" API, in common usage, that does what NV_vertex_buffer_unified_memory (https://www.opengl.org/registry/specs/NV/vertex_buffer_unified_memory.txt) does. This means specifically passing a GPU pointer which references GPU-allocated memory to be used as source data for vertex arrays.

2: A claim from a person (who has actual standing, not some random guy on the Internet) who claims that OpenGL 4.5 is "hopelessly ancient" (or something similar) because of its inability to get GPU addresses from buffers and use them as source data for vertex arrays.

All I'm asking for is just one of these. One or the other. And if you can't find any, then your claim is based solely on your own desires and lack of knowledge of other APIs, not on anything that anyone who actually matters believes.

kRogue
02-04-2015, 08:48 AM
I am going to put in some HW notes here. Some HW has a dedicated vertex fetch unit and some does not. For example, AMD hardware does NOT have dedicated vertex fetch, so on AMD hardware changing the vertex format means a vertex shader change. Intel hardware, according to the implementation in Mesa (and the public docs for Gen8 and below), does have a dedicated vertex fetch unit, so changing the vertex format does NOT trigger a shader change.

On the subject of bindless. The truth is that it would be faster than the current bind buffer madness. On paper, an implementation can use VAO to shrink N-name to buffer "GPU address" fetches to 1 (by packing into the VAO the "addresses" of the vertex buffers). However, that is still slower than just passing along the "GPU address" of the buffers directly. Also, a smart implementation of VAO will still need to walk the attributes, see which change, and then do the right thing. In contrast GL_NV_vertex_buffer_unified_memory lets the application do the work.
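To make the difference concrete, here is a minimal toy model of the two paths. It is not how any real driver is written: `NameTable`, `BufferRecord` and `forward_address` are hypothetical names invented for the illustration, and a hash map merely stands in for whatever structure a driver uses to turn a GLuint-style name into a buffer record.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical driver-side record for a buffer object (illustration only).
struct BufferRecord {
    std::uint64_t gpu_address;  // where the data lives in GPU virtual memory
    std::uint64_t size;         // size in bytes
};

// Hypothetical name table: GLuint-style names map to buffer records.
class NameTable {
public:
    std::uint32_t create(std::uint64_t gpu_address, std::uint64_t size) {
        assert(size > 0);
        const std::uint32_t name = next_name_++;
        records_[name] = BufferRecord{gpu_address, size};
        return name;
    }

    // Bind-style path: every call pays for one table lookup.
    // Returns 0 for an unknown name.
    std::uint64_t resolve(std::uint32_t name) {
        ++lookups_;
        const auto it = records_.find(name);
        return it == records_.end() ? 0 : it->second.gpu_address;
    }

    std::uint64_t lookups() const { return lookups_; }

private:
    std::unordered_map<std::uint32_t, BufferRecord> records_;
    std::uint32_t next_name_ = 1;
    std::uint64_t lookups_ = 0;
};

// Bindless-style path: the application fetched the address once and keeps it,
// so the "driver" has nothing to look up and simply forwards the value.
inline std::uint64_t forward_address(std::uint64_t cached_address) {
    return cached_address;
}
```

With this model, an application that resolves a name once and then reuses the cached address for many draws performs a single lookup, while the bind-style path performs one per call.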

Lastly, for Metal vs OpenGL. Metal, like NVIDIA's bindless, has a dedicated interface to specify the vertex format (type, stride, count, and so on). We do have this in GL since version 4.something, and also the ability to simply say "use this buffer starting at offset X for attribute Y" and keep the same format state... but that is wrapped into the VAO, so an application that uses VAOs will have lots of them specifying format and source completely, even if all that needs to change is the source. If the application does not use VAOs, then the implementation has to do the name lookup for each source change. So all in all, it is pick your poison rather than a good solution existing, aside from NVIDIA's extension.

Just my 2 cents.

Alfonse Reinheart
02-04-2015, 09:17 AM
On the subject of bindless. The truth is that it would be faster than the current bind buffer madness. On paper, an implementation can use VAO to shrink N-name to buffer "GPU address" fetches to 1 (by packing into the VAO the "addresses" of the vertex buffers). However, that is still slower than just passing along the "GPU address" of the buffers directly. Also, a smart implementation of VAO will still need to walk the attributes, see which change, and then do the right thing. In contrast GL_NV_vertex_buffer_unified_memory lets the application do the work.

No one is disputing whether providing GPU addresses is faster than specifying an object to fetch the GPU address from. Nor is anyone disputing how much gain there is from such a change (However, if you want to have that discussion, consider this. The NVIDIA material never actually tests vertex_attrib_binding with bindless attributes alone. Their tests involve both bindless attributes and shader_buffer_load, which replaces UBOs).


Lastly, for Metal vs OpenGL. Metal, like NVIDIA's bindless, has a dedicated interface to specify the vertex format (type, stride, count, and so on). We do have this in GL since version 4.something, and also the ability to simply say "use this buffer starting at offset X for attribute Y" and keep the same format state... but that is wrapped into the VAO, so an application that uses VAOs will have lots of them specifying format and source completely, even if all that needs to change is the source.

Any evidence to back that up? The whole point of vertex_attrib_binding is to avoid that, to have one VAO per format and just change buffer bindings, just like D3D. So which applications actually use VAB and have lots of VAOs with the same formats?


If the application does not use VAOs, then the implementation has to do the name lookup for each source change.

And it'd have to do that for Metal too. So... again, what's the difference? If using GPU addresses was that much faster and a reasonable global abstraction, surely Apple would have used them. And yet they didn't.

Maybe there's a reason for that.

kRogue
02-05-2015, 02:57 AM
Any evidence to back that up? The whole point of vertex_attrib_binding is to avoid that, to have one VAO per format and just change buffer bindings, just like D3D. So which applications actually use VAB and have lots of VAOs with the same formats?

The use case you are saying is this:

Application creates one VAO per attribute layout
Application changes buffer binding per mesh (essentially)

Doing that means that there are (usually) 2 name chases per mesh: one for the index buffer and one for the attributes. In contrast, if one has one VAO per mesh there is only one name chase. This is what I am saying in terms of picking a poison.
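The arithmetic behind the two "poisons" can be sketched as below. The function names are hypothetical, and the model assumes one attribute buffer plus one index buffer per mesh, and one extra VAO bind each time the vertex format changes; real scenes will deviate from both assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Strategy 1: one VAO per vertex format, rebinding buffers for every mesh.
// Assumed cost: 2 name chases per mesh (attribute buffer + index buffer),
// plus 1 VAO name chase each time the format changes.
constexpr std::uint64_t chases_vao_per_format(std::uint64_t meshes,
                                              std::uint64_t format_switches) {
    return 2 * meshes + format_switches;
}

// Strategy 2: one VAO per mesh. Assumed cost: 1 VAO name chase per mesh,
// at the price of respecifying format and source in every VAO.
constexpr std::uint64_t chases_vao_per_mesh(std::uint64_t meshes) {
    return meshes;
}

static_assert(chases_vao_per_format(1, 0) == 2, "two chases for a single mesh");
static_assert(chases_vao_per_mesh(1) == 1, "one chase for a single mesh");
```

For 1000 meshes drawn with 4 format switches, this model gives 2004 chases for the first strategy and 1000 for the second.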



And it'd have to do that for Metal too. So... again, what's the difference? If using GPU addresses was that much faster and a reasonable global abstraction, surely Apple would have used them. And yet they didn't.

Maybe there's a reason for that.
Some bits about Metal: Firstly, there is no name lookup associated with "MTLBuffer", but there is still the pointer chase [so half the pain]. Secondly, Metal is iOS only, which means iPhone/iPad only. The GPU is a power-limited device. Moreover, the GPU is a tile-based renderer, and the draw call does two things underneath: it issues a binning command AND adds work to another command buffer for the eventual tile walk (which is done once per Metal command buffer). These operations are non-trivial to perform and require the driver to do quite a bit of work. In contrast, on the PC the NVIDIA and AMD gfx cards are hulking beasts with their own dedicated memory, commanded over the PCI bus by the driver. For them, the driver "work" of converting API calls to command buffer bytes should be tiny and ideally trivial. The binding and the names make it less trivial and become a higher percentage of the work (whereas on iOS the name chase is just the beginning of all the work needed to issue a draw command, AND those are unified memory devices anyway, so a cache miss is almost guaranteed).

Though at the end of the day, half the pain of GL could have been avoided if it used pointers directly for its objects instead of the GLuint name thing. That alone doubles the pain.

Alfonse Reinheart
02-05-2015, 09:35 AM
The use case you are saying is this:

Application creates one VAO per attribute layout
Application changes buffer binding per mesh (essentially)

Doing that means that there are (usually) 2 name chases per mesh: one for the index buffer and one for the attributes. In contrast, if one has one VAO per mesh there is only one name chase. This is what I am saying in terms of picking a poison.

In what way do attributes involve a "name chase"? Any programmer who calls glGetAttribLocation() for every attribute they use at render time deserves the performance they get. That's the only name chase I can think of.

glBindVertexBuffers only needs to "name chase" the buffer names, since for the majority of the time, you're reusing the same VAO. Thus, VAO binds will be quite rare.

I would also point out that this is exactly how things work in the D3D world, modulo OpenGL object "name chasing". Which again, only happens to actual OpenGL objects: buffer objects and the occasional VAO.


Though at the end of the day, half the pain of GL could have been avoided if it used pointers directly for its objects instead of the GLuint name thing. That alone doubles the pain.

For all of NVIDIA's allegations of the "pain" of OpenGL name lookup performance-wise... typical use of the OpenGL API is still faster than D3D11 for the same scene. Valve discovered that one when they ported the Source engine to it, and they weren't using NVIDIA's gimmicks to get that performance win.

So there's no evidence that name lookup is a significant problem, relative to other APIs. That's not to say that performance wouldn't be better without it. But to say that OpenGL as it stands would be considered "hopelessly ancient" or something equally ridiculous for not having bindless vertex arrays is laughable.

Dark Photon
02-05-2015, 06:23 PM
For all of NVIDIA's allegations of the "pain" of OpenGL name lookup performance-wise... typical use of the OpenGL API is still faster than D3D11 for the same scene. Valve discovered that one when they ported the Source engine to it, and they weren't using NVIDIA's gimmicks to get that performance win.

So there's no evidence that name lookup is a significant problem, relative to other APIs. That's not to say that performance wouldn't be better without it. But to say that OpenGL as it stands would be considered "hopelessly ancient" or something equally ridiculous for not having bindless vertex arrays is laughable.

Probably true, but even so, what's your point? I too would welcome our new bindless overlords.

I have used bindless attributes and index lists, without shader_buffer_load (to your point above), and with and without VAOs, and I'm here to tell you it can make a big difference in GPU throughput, obviating the senseless need to spend more time optimizing when you get the performance boost you need by such a simple code change, and in so doing eliminate much needless overhead in the GL driver (doing binds, name lookups, making buffers resident ... whatever!). "What" is not important to us; that it can result in such huge increases in performance is.

Performance relative to D3D isn't forward thinking. Performance relative to what's possible? Now "that's" what's interesting!

Alfonse Reinheart
02-05-2015, 07:48 PM
Probably true, but even so, what's your point?

I was arguing against Gedolo2's position (https://www.opengl.org/discussion_boards/showthread.php/185646-More-efficeint-drawing-of-vertex-data?p=1264152&viewfull=1#post1264152) that bindless vertex arrays "makes the OpenGL 4.x API compete well with current alternatives." That it is essential to make OpenGL "Not coming over hopelessly ancient." And that bindless vertex arrays "Provides a good stepping stone between older OpenGL versions and glNext."

There is not one shred of evidence for any of those positions. OpenGL already competes just fine with "current alternatives". It does not "come over hopelessly ancient." And, given Apple Metal as an idea of what we might see from glNext, there is no reason to believe that bindless vertex arrays would be used in glNext, thus not making it a "stepping stone" towards anything.

In short, my overall point is that his point is founded on ignorance. I make no claim as to whether OpenGL 4.6 should use bindless vertex arrays, only that his reasons for doing so don't hold water.

However, if you really want to have that discussion:

The problem with bindless vertex arrays can readily be seen by the fact that they're still an NVIDIA extension. Allow me to explain that.

NVIDIA came out with this feature a bit less than 5 years ago. Since that time, it has progressed precisely nowhere in the OpenGL ecosystem. Oh, plenty of people use it. But nobody but NVIDIA implements it.

Now, you could argue that it's an NVIDIA extension; of course nobody else implements it. Except that's not true. NV_texture_barrier is widely implemented, even on Intel hardware. Hell, Apple enforces compliance (https://developer.apple.com/opengl/capabilities/) to it. Both NV_texture_barrier and NV_vertex_buffer_unified_memory were released in the same batch of extensions. So clearly, the fact that NV_vertex_buffer_unified_memory is an NVIDIA extension has not inhibited anyone from implementing it.

You could try to claim that the extension is covered by patents of some kind. Even if that's true, it is just as true for ARB_bindless_texture. Yet Khronos and NVIDIA worked out whatever legal issues were necessary to make that a widely implemented extension for hardware that supports it. So again, that doesn't jibe, unless NVIDIA wants bindless textures supported elsewhere but not bindless vertex arrays.

And if that's the case, then you lose by default, since Khronos couldn't make it core functionality even if they wanted to.

Then there's ARB_bindless_texture. NV_bindless_texture became ARB_bindless_texture in about 6-9 months. It's been almost five years since NV_vertex_buffer_unified_memory, yet here we are. ARB_vertex_attrib_binding was in fact the perfect time to introduce the feature, since it was already rearranging how VAOs and vertex format state works. Even with that perfect opportunity, the ARB didn't take it.

Lastly is OpenGL 4.5. That was rather feature light, and the most feature rich part of the set came from OpenGL ES 3.1. DSA is important to be sure. But the ARB's workload for the 4.5 release was the lowest it's been since GL 3.1. And that was a half-year cycle. Clearly, the ARB could have devoted time to bring out bindless vertex arrays, but they didn't.

Since bindless vertex arrays haven't happened by now, one of the following must be true:

1. The ARB has evaluated and rejected the concept.
2. NVIDIA wants it to remain proprietary.
3. NVIDIA and the ARB are working on the concept behind the scenes, yet haven't brought it to market after 5 years.

#3 seems rather unlikely. And both #1 and #2 mean that it's just not gonna happen.

So it's not gonna happen.

That being said, stranger things have happened. I would have expected hell to freeze over before we got explicit_uniform_locations, considering that people were asking for that before OpenGL 2.0. So it's not beyond belief that it could happen.

Just highly unlikely.

As for the merits of the technology on its own, yes, it can help save performance. But the biggest reason I suspect #1 happened is because NV_vertex_buffer_unified_memory uses GPU addresses.

Bindless textures don't really use GPU addresses. What you get there are 64-bit handles that reference the texture. They may be integers, but the value of those integers is completely opaque and meaningless. Oh sure, they probably are GPU memory addresses that reference a data structure of some kind. But you don't treat them that way; you treat them like references in C++. You can store a reference. You can return a reference. But the only thing you can do with a reference is access the object it refers to.

By contrast, unified memory addresses are pointers. You perform pointer arithmetic on them; indeed, you're required to do so by the glBufferAddressRangeNV call. It takes a 64-bit address and a size; any offsetting ought to have been done by the user.

If the functionality instead returned an opaque handle (which again could be a GPU address), and glBufferAddressRangeNV took a handle plus an offset, I think that would have gone a long way into making the functionality more acceptable. And I don't think NVIDIA would like the idea of sacrificing performance by making people do two memory fetches and an add for each buffer non-bind (one fetch for the handle, one for the offset).

Not only that, returning actual GPU addresses has to be some kind of security problem. And with all of the "robustness" stuff coming out in the last few years, I don't think the ARB is going to start sanctioning such things. At least with 64-bit opaque handles, you can introduce a simple bitshift (perhaps randomized at application startup time) to do some basic obfuscation to the handle. It's hardly the most secure thing in the world, but it's something.
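As a rough illustration of the "simple bitshift" idea, the sketch below hides an internal 64-bit value behind a rotation whose amount is picked once at startup. `HandleCodec` is a hypothetical name, a rotation is only one possible reading of "bitshift" (chosen here because it loses no bits), and as the paragraph above says, this is obfuscation rather than real security.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Rotate helpers. Callers keep the shift in 1..63, so no shift by 64 occurs.
inline std::uint64_t rotl64(std::uint64_t v, unsigned s) {
    return (v << s) | (v >> (64 - s));
}
inline std::uint64_t rotr64(std::uint64_t v, unsigned s) {
    return (v >> s) | (v << (64 - s));
}

// Hypothetical codec between an internal value (for example an address the
// driver wants to keep private) and the opaque handle given to the application.
class HandleCodec {
public:
    // Any seed is folded into the valid rotation range 1..63.
    explicit HandleCodec(unsigned seed) : shift_(1 + seed % 63) {
        assert(shift_ >= 1 && shift_ <= 63);
    }

    // Picks the rotation once, e.g. at application startup.
    static HandleCodec random_at_startup() {
        std::random_device rd;
        return HandleCodec(rd());
    }

    std::uint64_t to_handle(std::uint64_t internal) const {
        return rotl64(internal, shift_);
    }
    std::uint64_t from_handle(std::uint64_t handle) const {
        return rotr64(handle, shift_);
    }

private:
    unsigned shift_;
};
```

Decoding costs one rotate and no memory fetch, so the per-call overhead stays small compared with a name lookup.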


Performance relative to D3D isn't forward thinking.

Forward thinking is for glNext, not OpenGL. And I'd still lay odds that you won't see GPU memory addresses anywhere in glNext either. Not outside of an NVIDIA-only extension, at any rate.

kRogue
02-10-2015, 02:36 PM
There is not one shred of evidence for any of those positions. OpenGL already competes just fine with "current alternatives". It does not "come over hopelessly ancient." And, given Apple Metal as an idea of what we might see from glNext, there is no reason to believe that bindless vertex arrays would be used in glNext, thus not making it a "stepping stone" towards anything.


No; OpenGL does not compete particularly well against other 3D APIs. Its tools are much poorer than those of other APIs. Driver compliance is all over the place. Driver feature sets are also all over the place. I wish that was not the case, but it is. GL has more warts than a warthog.



By contrast, unified memory addresses are pointers. You perform pointer arithmetic on them; indeed, you're required to do so by the glBufferAddressRangeNV call. It takes a 64-bit address and a size; any offsetting ought to have been done by the user.

If the functionality instead returned an opaque handle (which again could be a GPU address), and glBufferAddressRangeNV took a handle plus an offset, I think that would have gone a long way into making the functionality more acceptable. And I don't think NVIDIA would like the idea of sacrificing performance by making people do two memory fetches and an add for each buffer non-bind (one fetch for the handle, one for the offset).

Not only that, returning actual GPU addresses has to be some kind of security problem. And with all of the "robustness" stuff coming out in the last few years, I don't think the ARB is going to start sanctioning such things. At least with 64-bit opaque handles, you can introduce a simple bitshift (perhaps randomized at application startup time) to do some basic obfuscation to the handle. It's hardly the most secure thing in the world, but it's something.


Um, the addresses are essentially per-context or per-context-share-group. For each context one wants to use a buffer, one needs to make it resident for -each- context it will be accessed. The address is not absolute and is virtual and all those magicks. (I admit I do not have hard proof, but it makes sense).




NVIDIA came out with this feature a bit less than 5 years ago. Since that time, it has progressed precisely nowhere in the OpenGL ecosystem. Oh, plenty of people use it. But nobody but NVIDIA implements it.


One of the difficulties with NVIDIA's NV_vertex_buffer_unified_memory is that it builds on NV_shader_buffer_load. The extension NV_shader_buffer_load allows pointer (read) access from shaders to arbitrary buffers in GPU memory. Most GPUs are SIMD things; NVIDIA's GPU is different (they call it SIMT). Scattered memory access is really awful, painful, hard and terribly inefficient on SIMD architectures, whereas on NVIDIA's it is not such a big disaster. On most hardware, when a fragment shader (or really any shader stage) is invoked, many elements are processed in one invocation. To be precise and specific: on Intel hardware, for example, a pixel shader can run in SIMD8 or SIMD16 mode. This means one "pixel shader invocation" does the computation/work for 8 (respectively 16) pixels. Implementing scattered reads is suicide on SIMD architectures (usually). This is why, I believe, we will not see NV_shader_buffer_load (or NV_shader_buffer_store) from anyone but NVIDIA.
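A toy model of why scattered reads hurt a SIMD8 invocation: count how many distinct cache lines the 8 lanes touch in one load. The 64-byte line size and the name `lines_touched` are assumptions for the illustration; actual memory systems on Intel, AMD and NVIDIA hardware are considerably more involved than this.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

constexpr std::size_t kLanes = 8;          // SIMD8, as in the Intel example
constexpr std::uint64_t kLineBytes = 64;   // assumed cache-line size

// Number of distinct cache lines hit when every lane loads from its own
// address within a single SIMD instruction.
inline std::size_t lines_touched(const std::array<std::uint64_t, kLanes>& addrs) {
    std::set<std::uint64_t> lines;
    for (std::uint64_t a : addrs) {
        lines.insert(a / kLineBytes);
    }
    return lines.size();
}

// Helper that builds lane addresses as base + lane * stride.
inline std::array<std::uint64_t, kLanes> strided(std::uint64_t base,
                                                 std::uint64_t stride) {
    std::array<std::uint64_t, kLanes> addrs{};
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        addrs[lane] = base + lane * stride;
    }
    return addrs;
}
```

In this model, 8 lanes reading consecutive 4-byte values from an aligned base touch a single line, while 8 lanes chasing pointers 4096 bytes apart touch 8 lines, so one instruction turns into 8 memory transactions.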


Now, the above does not mean one cannot still use the address to feed the vertex buffers directly, which would essentially be a different extension. Here the royal pain is that there is also VAO, and it just makes the whole thing a giant mess. Moreover, on some hardware GPU addresses are not 64-bit things anyway; they are... something different and potentially quite wonky.

In truth, the ultimate question about the viability of something like NV_vertex_buffer_unified_memory to just set vertex attribute sources for other archs is what kind of numbers does the GPU require to access a buffer. Is it just a single number? Can the offset be added to that number to do the right thing? Or does it need to be a special number and an offset separately? Or does accessing a buffer object require other numbers when pushed to the GPU (like a size for robust access)? These questions are highly specific to hardware in question and even the generation of that hardware. We could have a "handle" which is really just a pointer for those needed numbers, but the API implementation is still then a handle and an offset. It saves the name look up idiocy but still requires a pointer fetch.

Alfonse Reinheart
02-10-2015, 04:39 PM
No; OpenGL does not compete particularly well against other 3D APIs. Its tools are much poorer than those of other APIs. Driver compliance is all over the place. Driver feature sets are also all over the place. I wish that was not the case, but it is. GL has more warts than a warthog.

That's all very true. But you're rather missing the point.

My point is that having bindless vertex arrays would not affect OpenGL's competitiveness with other APIs. Thus, his argument is bunk.


Um, the addresses are essentially per-context or per-context-share-group. For each context one wants to use a buffer, one needs to make it resident for -each- context it will be accessed. The address is not absolute and is virtual and all those magicks. (I admit I do not have hard proof, but it makes sense).

Sure, the addresses are virtual. And they're not permanent.

But there is no guarantee in the specification that virtual addresses are directly bound to a context or share group. The spec just says that if you try, it won't work and will possibly crash. Which sounds very much like it might still work. Hence the security concern.


Now, the above does not mean one cannot still use the address to feed the vertex buffers directly, a different extension essentially. Here the royal pain is that there is also VAO and it just makes the whole thing a giant mess.

In what way do VAOs make things any more of a "giant mess" than you get in, say, Direct3D?


Moreover, on some hardware GPU addresses are not 64-bit things anyway; they are... something different and potentially quite wonky.

In a fixed version of bindless vertex arrays, the handle would work like bindless textures. It's a number, but it's completely arbitrary and specified by the API. So the number can be whatever the driver needs it to be in order to be fast. And I would say that any offset should be user-provided, ala glBindVertexBuffer(s).

It would have to be highly unconventional hardware indeed that couldn't generate some kind of value that:

* Is 64 bits (or less) in size
* Encodes whatever driver data is needed
* Has a non-pointer-accessing transformation into the actual driver data
* Has a constant-time transformation into the actual driver data
* Is not a GPU address

I can't imagine what the driver's data would have to be for that to not be possible.
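One way to picture a value that meets those bullet points is a handle that packs the driver's data straight into its bits. The fields below (a 40-bit segment id and 24 bits of flags) are purely hypothetical stand-ins for "whatever driver data is needed"; the point is only that decoding uses shifts and masks, with no memory access.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical driver data; real hardware would need different fields.
struct DriverData {
    std::uint64_t segment;  // assumed 40-bit memory-segment id
    std::uint32_t flags;    // assumed 24 bits of format/cache flags
};

constexpr unsigned kSegmentBits = 40;
constexpr std::uint64_t kSegmentMask = (std::uint64_t{1} << kSegmentBits) - 1;
constexpr std::uint32_t kFlagsMask = (1u << 24) - 1;

// Pack the driver data into one opaque 64-bit handle.
constexpr std::uint64_t encode(DriverData d) {
    return (static_cast<std::uint64_t>(d.flags & kFlagsMask) << kSegmentBits) |
           (d.segment & kSegmentMask);
}

// Constant-time decode using only shifts and masks: no pointer is followed.
constexpr DriverData decode(std::uint64_t handle) {
    return DriverData{handle & kSegmentMask,
                      static_cast<std::uint32_t>(handle >> kSegmentBits)};
}

static_assert(decode(encode(DriverData{5, 3})).segment == 5, "segment round-trips");
static_assert(decode(encode(DriverData{5, 3})).flags == 3, "flags round-trip");
```

A user-provided offset, in the style of glBindVertexBuffer(s), would be passed next to such a handle rather than added into it.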

mhagain
02-11-2015, 01:58 AM
For all of NVIDIA's allegations of the "pain" of OpenGL name lookup performance-wise... typical use of the OpenGL API is still faster than D3D11 for the same scene. Valve discovered that one when they ported the Source engine to it...
That's not quite true.

What Valve ported was D3D9 code, and other benchmarks exist (e.g the Unigine benchmarks; see http://www.g-truc.net/post-0547.html) which show the opposite.

kRogue
02-11-2015, 03:58 AM
In what way do VAOs make things any more of a "giant mess" than you get in, say, Direct3D?

The crux of the issue is the following: there would be multiple ways to specify attribute sources: VAO using the existing GL APIs, and bindless. Another question that comes up is whether bindless is part of VAO state. Which state takes precedence (the current API, bindless, or whatever was called last)? Is it OK to mix (i.e. some attributes with bindless, some with the traditional API)? None of these questions are show stoppers, but the answers are a mess.




In a fixed version of bindless vertex arrays, the handle would work like bindless textures. It's a number, but it's completely arbitrary and specified by the API. So the number can be whatever the driver needs it to be in order to be fast. And I would say that any offset should be user-provided, ala glBindVertexBuffer(s).

It would have to be highly unconventional hardware indeed that couldn't generate some kind of value that:

* Is 64 bits (or less) in size
* Encodes whatever driver data is needed
* Has a non-pointer-accessing transformation into the actual driver data
* Has a constant-time transformation into the actual driver data
* Is not a GPU address

I can't imagine what the driver's data would have to be for that to not be possible.


Essentially, a buffer object is then accessed by a handle instead of a GLuint name. That is essentially the only difference. What happens in practice is that a driver could recast the handle to a pointer type. If the handle can be used directly by a shader, then things are much more complicated. One could say, NO, I want that to be a magic number used directly by the GPU. From there it gets much more complicated and highly GPU-architecture-specific in how memory can be accessed from shaders. Once one goes there, the nature of caching and how to partition caches becomes more involved. In a sick way, bindless_texture is sort-of-easier because sampler units do their own caching (on top of some other caches too)... but raw memory is uglier. If the caching is not right (especially for devices that share memory with the CPU), then this is nasty... not to mention the entire SIMD mess.

Do not get me wrong, I love bindless attributes, I think it is great. I also think NV_shader_buffer_load is freaking awesome and sweet. However, before howling at the top of your lungs that hardware "surely works a way", I invite you to read the official docs of AMD and Intel on their hardware. I think you will find it fascinating and horrifying at the same time. They are so different in various places and memory access by GPU on shared memory systems (like Intel for example) is not simple.

Alfonse Reinheart
02-11-2015, 08:53 AM
That's not quite true.

What Valve ported was D3D9 code, and other benchmarks exist (e.g the Unigine benchmarks; see http://www.g-truc.net/post-0547.html) which show the opposite.

That benchmark includes a D3D9 test, which also appears faster than OpenGL code. So their results differ from Valve's even with the same test. Since they couldn't even reproduce Valve's results, it is entirely possible that they just wrote poor OpenGL code.

Which wouldn't be surprising, since you're comparing an engine few people use to one of the most frequently used game engines on the market today. Valve has the pick of the litter when it comes to OpenGL professionals, while Unigine Corp... just makes an engine.

Equally importantly, that test shows that D3D9 and D3D11... don't differ that much in performance. It's only a few FPS, less than 10 in most tests. Thus, assuming that the Unigine guys know how to write D3D9 and 11 code, an OpenGL engine that's as good as Valve's ought to be able to achieve performance parity.


The crux of the issue is the following: there would be multiple ways to specify attribute sources: VAO using the existing GL APIs, and bindless.

That's very true. But that ship has already sailed. We have two ways of creating textures. We already have two ways of specifying vertex data (with vertex_attrib_binding). We have two ways of using shader code (all-in-one-program vs. separate programs). We just got a new way of creating and modifying virtually every object type.

Really, who'd notice one more at this point ;)


Another question that comes up is whether bindless is part of VAO state. Which state takes precedence (the current API, bindless, or whatever was called last)? Is it OK to mix (i.e. some attributes with bindless, some with the traditional API)? None of these questions are show stoppers, but the answers are a mess.

Those questions are answered by the NVIDIA extension (yes, they're VAO state; there is a VAO enable/disable for bindless VAs; see previous). I don't see how these answers create "a mess".


However, before howling at the top of your lungs that hardware "surely works a way", I invite you to read the official docs of AMD and Intel on their hardware. I think you will find it fascinating and horrifying at the same time. They are so different in various places and memory access by GPU on shared memory systems (like Intel for example) is not simple.

Do you have any links I could look at? I'm not really sure what to Google around for.

Gedolo2
02-18-2015, 07:23 AM
@Alfonse Reinheart

I might have exaggerated a bit with my "OpenGL is ridiculously ancient" claim, this claim is a subjective judgement of some features of the API and is a personal opinion. I'm so annoyed by ancient state machines.
However, how the vertex processing is divided over the vertex units should have moved to the GPU driver instead of application logic layer a decade ago. Even before the arrival of newer GPU architectures such as GCN, putting that decision in the application logic is bad design. Very, very, very bad design!


Although specifications of DirectX 12, Metal, Mantle and glNext are scarce, there is some information about their features available.
A common theme in the next-generation APIs, however, is reducing CPU overhead.
The idea of bindless is one element in the next-gen APIs.

http://www.extremetech.com/computing/190581-nvidias-ace-in-the-hole-against-amd-maxwell-is-the-first-gpu-with-full-directx-12-support
http://www.g-truc.net/doc/Candidate%20features%20for%20OpenGL%205.pdf
http://www.anandtech.com/show/7889/microsoft-announces-directx-12-low-level-graphics-programming-comes-to-directx/2
http://www.extremetech.com/gaming/183567-apple-unveils-metal-api-for-ios-8-will-shave-off-opengl-overhead-just-like-mantle-dx12

https://www.gamingonlinux.com/articles/whats-next-in-graphics-apis.4753
https://www.slideshare.net/slideshow/embed_code/42464487?startSlide=12

Round 1:
You misunderstand.
The statement about competing with current APIs (DirectX 11.0 - 11.2) and the statement about OpenGL being ridiculously ancient are two statements that are not supposed to have anything to do with each other.

The ridiculously ancient statement is subjective and my own personal view on the API.

Round 2:
See above.

NOTE: If you know how to specify bindless vertex processing better, please do help out by commenting on how it should be done.

Alfonse Reinheart
02-18-2015, 09:09 AM
I'm so annoyed by ancient state machines.

... OK. But that has nothing to do with bindless vertex arrays. Since, you know, that functionality uses the state machine.

You seem to be ignorant on what NV_vertex_buffer_unified_memory (https://www.opengl.org/registry/specs/NV/vertex_buffer_unified_memory.txt) does, so allow me to enlighten you. It is not "pass pointers to the shader, and let it figure out how to fetch its vertex data via gl_VertexID and gl_InstanceID." The only thing this extension does is replace the "bind buffer object" call with "bind GPU pointer". The shader itself is completely unaffected. It's all about turning a heavy-weight "turn object name into GPU pointer" operation into a 0-weight operation.
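To make the "name to GPU pointer" cost concrete, here is a minimal toy model in C. It is not driver code and makes no GL calls: every `toy_` identifier is hypothetical, and the linear table scan merely stands in for whatever lookup and validation a real driver does when it is handed an object name. It only illustrates that the bindless path skips the per-draw lookup because the application queried the address once up front.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical driver-side record: application-visible name -> GPU address. */
typedef struct {
    uint32_t name;
    uint64_t gpu_address;
} ToyBuffer;

static const ToyBuffer toy_buffers[] = {
    { 1u, 0x10000000u },
    { 2u, 0x10004000u },
    { 7u, 0x20000000u },
};
enum { TOY_BUFFER_COUNT = sizeof toy_buffers / sizeof toy_buffers[0] };

/* Counts how many table entries were inspected (stand-in for lookup cost). */
static unsigned toy_table_probes = 0;

/* "Bound" path: the object name has to be resolved on every call. */
static uint64_t toy_source_from_name(uint32_t name)
{
    for (size_t i = 0; i < (size_t)TOY_BUFFER_COUNT; ++i) {
        ++toy_table_probes;
        if (toy_buffers[i].name == name)
            return toy_buffers[i].gpu_address;
    }
    return 0; /* unknown name */
}

/* "Bindless" path: the caller already holds the address; nothing to resolve. */
static uint64_t toy_source_from_address(uint64_t gpu_address)
{
    return gpu_address;
}

/* Returns the number of table probes spent setting one vertex source
 * once per draw, for `draws` draws, on either path. */
static unsigned toy_probes_for_draws(int draws, int bindless)
{
    uint64_t cached = toy_source_from_name(7u); /* one-time address query */
    assert(cached != 0);
    toy_table_probes = 0; /* do not count the one-time query */
    for (int i = 0; i < draws; ++i) {
        uint64_t src = bindless ? toy_source_from_address(cached)
                                : toy_source_from_name(7u);
        assert(src == cached); /* both paths reach the same memory */
    }
    return toy_table_probes;
}
```

Calling `toy_probes_for_draws(1000, 0)` and `toy_probes_for_draws(1000, 1)` from a `main` shows the difference in this model. Note that the shader-side view is identical in both cases; only the way the source address gets set changes.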

Bindless vertex array calls set VAO state, just like binding buffers. So they're using the state machine. So why are you not annoyed with NVIDIA's use of "ancient state machines" in that API?


However, how the vertex processing is divided over the vertex units should have moved to the GPU driver instead of application logic layer a decade ago. Even before the arrival of newer GPU architectures such as GCN, putting that decision in the application logic is bad design. Very, very, very bad design!

Why is that bad design, exactly? As kRogue pointed out, some hardware has no dedicated vertex pulling logic, and some does. So if it were all done explicitly by the driver, you'd basically be screwing over all hardware that does. And for little reason.

I for one don't want to have to tie my shaders into a specific vertex format. I don't want to have to write different shaders just to be able to use different vertex formats with the exact same programming logic as before. If a piece of hardware needs the VS to do that, it can generate that code efficiently and slip it in at the top of my shader when needed. Vertex format changes are heavyweight; that's why ARB_vertex_attrib_binding exists.
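A rough sketch of that separation, as a toy model in C rather than real GL or driver code: a format description lives outside the "shader", and a fetch step (fixed-function hardware on some GPUs, generated shader preamble on others) expands whatever is in memory into a vec4. All `toy_` names are hypothetical, and only two component types are modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { TOY_FLOAT32, TOY_UNORM8 } ToyComponentType;

/* Hypothetical vertex format description, kept outside the shader. */
typedef struct {
    ToyComponentType type;
    int    components; /* 1..4 */
    size_t offset;     /* bytes from the start of a vertex */
    size_t stride;     /* bytes between consecutive vertices */
} ToyAttribFormat;

typedef struct { float v[4]; } ToyVec4;

/* Fetch step: converts the stored data to floats and fills in the
 * usual (0, 0, 0, 1) defaults for missing components. */
static ToyVec4 toy_fetch(const unsigned char *buffer,
                         const ToyAttribFormat *fmt, size_t vertex)
{
    ToyVec4 out = { { 0.0f, 0.0f, 0.0f, 1.0f } };
    const unsigned char *p = buffer + vertex * fmt->stride + fmt->offset;
    assert(fmt->components >= 1 && fmt->components <= 4);
    for (int c = 0; c < fmt->components; ++c) {
        if (fmt->type == TOY_FLOAT32)
            memcpy(&out.v[c], p + (size_t)c * sizeof(float), sizeof(float));
        else /* TOY_UNORM8: map 0..255 to 0.0..1.0 */
            out.v[c] = (float)p[c] / 255.0f;
    }
    return out;
}

/* The "shader": it only ever sees a vec4, never the storage format. */
static float toy_shader_luma(ToyVec4 color)
{
    return 0.25f * color.v[0] + 0.5f * color.v[1] + 0.25f * color.v[2];
}
```

The same `toy_shader_luma` runs unchanged whether the colors are stored as three floats or three normalized bytes; only the `ToyAttribFormat` handed to the fetch step differs.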

Again, not that this has anything to do with bindless vertex arrays, since they still set state.


The idea of bindless being one element in the next gen API's.

Um, they are only "bindless" in the sense that they don't have a context to bind things to. The basic operations that are analogous to binding still exist in, for example, Apple Metal. When you're setting up your vertex data, you pass it MTLBuffer objects, not GPU pointers. Apple Metal doesn't even offer bindless texture support; you have to set up textures in the shader's environment, just like you do with its equivalent to UBOs.

Now, as others have mentioned, Apple Metal targets mobile GPUs only, so it's technically older. But my point is this: you're assuming what will happen without any form of evidence. The only next-gen API that we've actually seen doesn't use bindless. So why are you so convinced that glNext will?

Besides bindless textures, which is already near-core, of course.

Gedolo2
02-18-2015, 09:26 AM
It's more the overuse of state machines that bugs me.

I have not read through all the information on the nvidia extension, was in a hurry.
Looks like I made a mistake.
(I am annoyed by their use of ancient state machines in that API. How dare they!:()
The lightening of the operation is a big step in the right direction. Should be more common in graphics APIs.

It's bad design because you can't expect application developers to start optimizing for all the different graphics cards. I'm assuming non-bindless means having to manually specify which vertex unit you use for each batch of vertex operations.
The whole point of a graphics API is to allow a unified way of sending instructions to the graphics card, without having to build logic into your application to handle each and every different piece of hardware. This is good software design 101.
If the hardware doesn't have dedicated logic, write the logic in the software driver for the graphics card. The driver could check for the availability of hardware logic and prefer it over the driver's software logic. No hardware has to be screwed over. You can have the best results for all cases with no degradation in performance for the worst-case scenario.
The driver manufacturer knows best how to divide the work over the vertex processing hardware and how to optimize it for maximum performance without letting other hardware architectures get in the way. An application by definition cannot do this without starting to write code for each and every graphics card family!




Now, as others have mentioned, Apple Metal targets mobile GPUs only, so it's technically older. But my point is this: you're assuming what will happen without any form of evidence. The only next-gen API that we've actually seen doesn't use bindless. So why are you so convinced that glNext will?

Besides bindless textures, which is already near-core, of course.
Trying to predict the future sometimes results in getting it wrong.
Since the specifications are not out yet, I have to make some guesses.
I'm not a hardware specialist, unlike you.

mbentrup
02-18-2015, 11:13 AM
It's more the overuse of state machines that bugs me.

I have not read through all the information on the nvidia extension, was in a hurry.
Looks like I made a mistake.
(I am annoyed by their use of ancient state machines in that API. How dare they!:()
The lightening of the operation is a big step in the right direction. Should be more common in graphics APIs.

It's bad design because you can't expect application developers to start optimizing for all the different graphics cards. I'm assuming non-bindless means having to manually specify which vertex unit you use for each batch of vertex operations.
The whole point of a graphics API is to allow a unified way of sending instructions to the graphics card, without having to build logic into your application to handle each and every different piece of hardware. This is good software design 101.
If the hardware doesn't have dedicated logic, write the logic in the software driver for the graphics card. The driver could check for the availability of hardware logic and prefer it over the driver's software logic. No hardware has to be screwed over. You can have the best results for all cases with no degradation in performance for the worst-case scenario.
The driver manufacturer knows best how to divide the work over the vertex processing hardware and how to optimize it for maximum performance without letting other hardware architectures get in the way. An application by definition cannot do this without starting to write code for each and every graphics card family!


Trying to predict the future sometimes results in getting it wrong.
Since the specifications are not out yet, I have to make some guesses.
I'm not a hardware specialist, unlike you.

Well, actually the trend in graphics APIs (Metal, DX12, glNext) is to go away from abstraction, closer to the hardware.

Alfonse Reinheart
02-18-2015, 12:27 PM
Well, actually the trend in graphics APIs (Metal, DX12, glNext) is to go away from abstraction, closer to the hardware.

You can't really say that yet. D3D12 and glNext are (generally) unknown with regard to their level of abstraction. And Metal, no matter what Apple wants to claim, is generally speaking no closer to the hardware than OpenGL ES.

The biggest structural change for Metal compared to OpenGL is the explicit control over command queues. And that's not so much a lowering of the abstraction so much as a different abstraction. It's a sideways move, going from an immediate abstraction to a buffered abstraction. After all, it's not like you're writing command tokens into memory buffers and then telling the GPU to execute them. You're still using API calls to set things like viewports, etc. You're just calling them in a different way.
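The immediate-versus-buffered distinction can be sketched with a toy model in C. This is neither Metal nor OpenGL: all `toy_` names are hypothetical, and the "device" is just a struct. The point it illustrates is that the buffered style still goes through API calls for things like the viewport; the calls record commands that take effect at submit time instead of immediately.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device state touched by commands. */
typedef struct { int viewport_w, viewport_h, draws; } ToyDevice;

typedef enum { TOY_CMD_VIEWPORT, TOY_CMD_DRAW } ToyCmdKind;
typedef struct { ToyCmdKind kind; int a, b; } ToyCmd;

#define TOY_MAX_CMDS 16
typedef struct { ToyCmd cmds[TOY_MAX_CMDS]; size_t count; } ToyCommandBuffer;

/* Immediate abstraction: each call changes device state right away. */
static void toy_set_viewport_now(ToyDevice *d, int w, int h)
{
    d->viewport_w = w;
    d->viewport_h = h;
}

static void toy_draw_now(ToyDevice *d, int vertex_count)
{
    (void)vertex_count; /* the toy device only counts draws */
    d->draws++;
}

/* Buffered abstraction: the same operations, but recorded for later. */
static void toy_record(ToyCommandBuffer *cb, ToyCmdKind kind, int a, int b)
{
    ToyCmd cmd;
    assert(cb->count < TOY_MAX_CMDS);
    cmd.kind = kind;
    cmd.a = a;
    cmd.b = b;
    cb->cmds[cb->count++] = cmd;
}

static void toy_record_viewport(ToyCommandBuffer *cb, int w, int h)
{
    toy_record(cb, TOY_CMD_VIEWPORT, w, h);
}

static void toy_record_draw(ToyCommandBuffer *cb, int vertex_count)
{
    toy_record(cb, TOY_CMD_DRAW, vertex_count, 0);
}

/* Submission replays the recorded commands against the device in order. */
static void toy_submit(ToyDevice *d, const ToyCommandBuffer *cb)
{
    for (size_t i = 0; i < cb->count; ++i) {
        const ToyCmd *c = &cb->cmds[i];
        if (c->kind == TOY_CMD_VIEWPORT)
            toy_set_viewport_now(d, c->a, c->b);
        else
            toy_draw_now(d, c->a);
    }
}
```

In this model the application never writes command tokens itself; it calls `toy_record_viewport` and `toy_record_draw`, and the device is untouched until `toy_submit` runs.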

But the core abstractions we see in OpenGL are still there in Metal. You have state objects (admittedly immutable, but the object abstraction remains). You have resource objects. You have vertex format definitions defined by the API. And so forth.

Alfonse Reinheart
02-18-2015, 01:18 PM
I have not read through all the information on the nvidia extension, was in a hurry.
Looks like I made a mistake.

I... what? :doh:

I want to recap what has just happened in this thread, so that you can fully understand the problem.

You asked for bindless vertex arrays. You linked to various posts, articles, and videos about it. You made various claims about what it would do for OpenGL competitively and how OpenGL would be if this weren't available.

And yet... you couldn't be bothered to research it yourself. You couldn't take 10 minutes of your life to learn exactly what it was you were asking for.

In short, you have asked for something that, by your own admission, you don't even know what it is!

We're not talking about digging into the micro-details of various hardware here. We're not talking about deep knowledge of various drivers and the way hardware works. We're not talking about being "a hardware specialist". We're talking about reading and understanding a publicly available extension specification (https://www.opengl.org/registry/specs/NV/vertex_buffer_unified_memory.txt).

When it comes to asking for something from somebody else, a piece of advice: your time is not as valuable as theirs. So you should never be "in a hurry" to make a suggestion; this forum will still be here tomorrow. Research first, understand first; then bother someone once you have some understanding of the idea.

kRogue
02-20-2015, 12:36 AM
You can't really say that yet. D3D12 and glNext are (generally) unknown with regard to their level of abstraction. And Metal, no matter what Apple wants to claim, is generally speaking no closer to the hardware than OpenGL ES.

The biggest structural change for Metal compared to OpenGL is the explicit control over command queues. And that's not so much a lowering of the abstraction so much as a different abstraction. It's a sideways move, going from an immediate abstraction to a buffered abstraction. After all, it's not like you're writing command tokens into memory buffers and then telling the GPU to execute them. You're still using API calls to set things like viewports, etc. You're just calling them in a different way.

But the core abstractions we see in OpenGL are still there in Metal. You have state objects (admittedly immutable, but the object abstraction remains). You have resource objects. You have vertex format definitions defined by the API. And so forth.

I agree with Alfonse mostly here. The Metal API is also designed to be much more tile-based renderer friendly. Indeed, the render command encoder does not allow the modification of any texture data except for the actual texture target. Blitting commands are done by a separate encoder and so on. The API is quite slick in automagically avoiding the use patterns that murder tile-based renderers.

On a side note, you had asked where to dig up documentation for AMD or Intel. Intel has the 01.org website, and the public GPU documentation is at https://01.org/linuxgraphics/documentation . Additionally, the open source project Mesa has a driver for Intel's GPU family updated and maintained by Intel personnel. The source can be found in Mesa at src/mesa/drivers/dri/i965 (that driver covers Gen4, 5, 6, 7 and 8).

Also for Intel hardware there are some very detailed blogs about memory stuff: https://bwidawsk.net/blog/index.php/2014/06/the-global-gtt-part-1/ (and subsequent elements in the series) and for newer-ish hardware and kernel features: https://01.org/linuxgraphics/blogs/vivijim/2012/i915/gem-crashcourse-daniel-vetter

Alfonse Reinheart
02-20-2015, 07:14 AM
The Metal API is also designed to be much more tile-based renderer friendly. Indeed, the render command encoder does not allow the modification of any texture data except for the actual texture target. Blitting commands are done by a separate encoder and so on. The API is quite slick in automagically avoiding the use patterns that murder tile-based renderers.

I hadn't really thought about that, but that does explain some of the more oddball things in Metal. It also makes me somewhat concerned about glNext, since it is intended (from what I understand) to span both mobile and desktop platforms with a single API.

Gedolo2
02-27-2015, 03:04 PM
I... what? :doh:

I want to recap what has just happened in this thread, so that you can fully understand the problem.

You asked for bindless vertex arrays. You linked to various posts, articles, and videos about it. You made various claims about what it would do for OpenGL competitively and how OpenGL would be if this weren't available.

And yet... you couldn't be bothered to research it yourself. You couldn't take 10 minutes of your life to learn exactly what it was you were asking for.

In short, you have asked for something that, by your own admission, you don't even know what it is!

We're not talking about digging into the micro-details of various hardware here. We're not talking about deep knowledge of various drivers and the way hardware works. We're not talking about being "a hardware specialist". We're talking about reading and understanding a publicly available extension specification (https://www.opengl.org/registry/specs/NV/vertex_buffer_unified_memory.txt).

When it comes to asking for something from somebody else, a piece of advice: your time is not as valuable as theirs. So you should never be "in a hurry" to make a suggestion; this forum will still be here tomorrow. Research first, understand first; then bother someone once you have some understanding of the idea.
You're right, I should have been more patient.

Somebody suggested it to me, thus I added it to my feature request.
I read the overview section and found it would help in lowering overhead in vertex processing.
Since it was relevant to the area of graphics processing I was talking about, I wrongly assumed this extension would do the thing I wanted.
The extension does help in processing vertex data efficiently and might be useful for OpenGL.