Thread: More efficient drawing of vertex data

  1. #1
    Intern Contributor
    Join Date
    Nov 2013
    Posts
    51

    More efficient drawing of vertex data

    With DSA and bindless textures, there is one large missing part in making the API modern and fully bindless: vertex processing.
    http://www.nvidia.com/object/bindless_graphics.html
    I would like bindless vertex processing as an ARB extension in the current OpenGL 4.x line.

    It would be very useful for vertex-intensive scenes and is the last large missing piece in making the API bindless. More efficient drawing of vertex data is a natural fit with the newer DSA and bindless-textures way of working.
    https://www.opengl.org/registry/spec...w_indirect.txt
    http://www.ustream.tv/recorded/51212968
    http://www.nvidia.com/object/bindless_graphics.html
    See the last slides of the following presentation (slide 77 and up): http://www.slideshare.net/Mark_Kilga...gpus?related=1
    Last edited by Gedolo2; 02-02-2015 at 11:54 AM.

  2. #2
    Intern Contributor
    Join Date
    Nov 2013
    Posts
    51
    Modernizing the 4.x line is important, even with the glNext initiative.
    Adding these features would make the OpenGL 4.x API compete well with current alternatives instead of coming across as hopelessly ancient.
    It would provide a good stepping stone between older OpenGL versions and glNext.
    Developers who start development too early for glNext, or who are unwilling to target it, could choose OpenGL 4.x and still have an API that is somewhat modern in design.
    It would also ease porting between glNext and OpenGL 4.x if better, more efficient vertex processing were added.

  3. #3
    Senior Member OpenGL Lord
    Join Date
    May 2009
    Posts
    5,974
    It would provide a good stepping stone between older OpenGL versions and glNext.
    ...
    It would also ease porting between glNext and OpenGL 4.x if better, more efficient vertex processing were added.
    OK, I'll play this game.

    Prove it.

    Public API documentation on D3D12 and Mantle is scarce and behind NDA walls. However, Metal's docs are publicly available. And according to them, their equivalent to glVertexAttribPointer/glBindVertexBuffer/glBufferAddressRangeNV takes an MTLBuffer object reference, not a GPU memory address as glBufferAddressRangeNV does.

    Given that as an example of an actual next-gen API, how exactly would switching to resident buffers and GPU memory addresses in any way help move to glNext?

    Adding these features would make the OpenGL 4.x API compete well with current alternatives instead of coming across as hopelessly ancient.
    Round 2: prove it.

    How does OpenGL 4.5, by any objective measurement, come across as "hopelessly ancient" compared to "current alternatives" in this regard? Neither D3D11, nor any version of D3D before it, directly uses GPU memory addresses. It requires using an actual object, which is equivalent to binding it.

    Indeed, the only difference structurally between OpenGL 4.5 and D3D on this matter is that OpenGL's VAOs are fully mutable, while D3D's equivalent objects are not.

    I defy you to provide evidence for your claims. Show me one of the following:

    1: A "current alternative" API, in common usage, that does what NV_vertex_buffer_unified_memory does. This means specifically passing a GPU pointer which references GPU-allocated memory to be used as source data for vertex arrays.

    2: A claim from a person with actual standing (not some random guy on the Internet) that OpenGL 4.5 is "hopelessly ancient" (or something similar) because of its inability to get GPU addresses from buffers and use them as source data for vertex arrays.

    All I'm asking for is just one of these. One or the other. And if you can't find any, then your claim is based solely on your own desires and lack of knowledge of other APIs, not on anything that anyone who actually matters believes.

  4. #4
    Advanced Member Frequent Contributor
    Join Date
    Apr 2009
    Posts
    612
    I am going to put in some HW notes here: some hardware has dedicated vertex fetch and some does not. For example, AMD hardware does NOT have dedicated vertex fetch, so on AMD hardware changing the vertex format means a vertex shader change. Intel hardware, according to the implementation in Mesa (and the public docs for Gen8 and below), does have a dedicated vertex fetch unit, so changing the vertex format does NOT trigger a shader change.

    On the subject of bindless: the truth is that it would be faster than the current bind-buffer madness. On paper, an implementation can use a VAO to shrink N name-to-"GPU address" fetches down to one (by packing the "addresses" of the vertex buffers into the VAO). However, that is still slower than just passing along the "GPU address" of the buffers directly. Also, a smart implementation of VAOs will still need to walk the attributes, see which changed, and then do the right thing. In contrast, GL_NV_vertex_buffer_unified_memory lets the application do that work.
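
    For reference, here is a minimal sketch of what that application-side work looks like with the NVIDIA extensions. This is only an illustration under stated assumptions: the buffer already holds tightly packed vec3 positions, the NV extension entry points have been loaded, and the helper names are made up for this post.
    Code:
    #include <GL/glcorearb.h>   /* NV tokens and entry points come from glext.h / your loader */

    /* Make a buffer resident and query its GPU address (NV_shader_buffer_load). */
    static GLuint64EXT make_resident(GLuint vbo)
    {
        GLuint64EXT addr = 0;
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);
        glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &addr);
        return addr;
    }

    /* Draw using a raw GPU address range instead of a buffer bind
       (NV_vertex_buffer_unified_memory). */
    static void draw_bindless(GLuint64EXT addr, GLsizeiptr sizeBytes, GLsizei vertexCount)
    {
        /* Switch attribute fetching to "unified" (address-based) mode; typically done once at init. */
        glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);
        glEnableVertexAttribArray(0);
        /* Attribute 0: three tightly packed floats. */
        glVertexAttribFormatNV(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(GLfloat));
        /* No glBindBuffer and no name lookup: the driver gets the address directly. */
        glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0, addr, sizeBytes);
        glDrawArrays(GL_TRIANGLES, 0, vertexCount);
    }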

    Lastly, for Metal vs. OpenGL: Metal, like NVIDIA's bindless, has a dedicated interface to specify the vertex format (type, stride, count, and so on). We have had this in GL since version 4.something, along with the ability to simply say "use this buffer starting at offset X for attribute Y" while reusing the same format state... but that is wrapped into the VAO. What happens in practice is that an application that uses VAOs ends up with lots of VAOs, each specifying format and source completely, even if all that needs to change is the source. If the application does not use VAOs, then the implementation has to do the name lookup for each source change. So all in all, it is a matter of picking your poison; no good solution exists aside from NVIDIA's extension.

    Just my 2 cents.

  5. #5
    Senior Member OpenGL Lord
    Join Date
    May 2009
    Posts
    5,974
    On the subject of bindless: the truth is that it would be faster than the current bind-buffer madness. On paper, an implementation can use a VAO to shrink N name-to-"GPU address" fetches down to one (by packing the "addresses" of the vertex buffers into the VAO). However, that is still slower than just passing along the "GPU address" of the buffers directly. Also, a smart implementation of VAOs will still need to walk the attributes, see which changed, and then do the right thing. In contrast, GL_NV_vertex_buffer_unified_memory lets the application do that work.
    No one is disputing whether providing GPU addresses is faster than specifying an object to fetch the GPU address from. Nor is anyone disputing how much gain there is from such a change (However, if you want to have that discussion, consider this. The NVIDIA material never actually tests vertex_attrib_binding with bindless attributes alone. Their tests involve both bindless attributes and shader_buffer_load, which replaces UBOs).

    Lastly, for Metal vs. OpenGL: Metal, like NVIDIA's bindless, has a dedicated interface to specify the vertex format (type, stride, count, and so on). We have had this in GL since version 4.something, along with the ability to simply say "use this buffer starting at offset X for attribute Y" while reusing the same format state... but that is wrapped into the VAO. What happens in practice is that an application that uses VAOs ends up with lots of VAOs, each specifying format and source completely, even if all that needs to change is the source.
    Any evidence to back that up? The whole point of vertex_attrib_binding is to avoid that, to have one VAO per format and just change buffer bindings, just like D3D. So which applications actually use VAB and have lots of VAOs with the same formats?
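
    For concreteness, here is a minimal sketch of that pattern with ARB_vertex_attrib_binding (GL 4.3+): the format lives in one VAO that is set up once, and per mesh only the buffer bindings change. The helper names and the simple one-attribute layout are just assumptions for the example.
    Code:
    #include <GL/glcorearb.h>

    /* One-time setup: a single VAO describing the vertex format
       (attribute 0 = three floats, sourced from binding point 0). */
    static GLuint make_format_vao(void)
    {
        GLuint vao;
        glGenVertexArrays(1, &vao);
        glBindVertexArray(vao);
        glEnableVertexAttribArray(0);
        glVertexAttribFormat(0, 3, GL_FLOAT, GL_FALSE, 0);
        glVertexAttribBinding(0, 0);
        return vao;
    }

    /* Per mesh: only the buffer bindings change; the format stays put. */
    static void draw_mesh(GLuint vao, GLuint vbo, GLuint ibo, GLsizei indexCount)
    {
        glBindVertexArray(vao);
        glBindVertexBuffer(0, vbo, 0, 3 * sizeof(GLfloat));
        glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
        glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0);
    }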

    If the application does not use VAOs, then the implementation has to do the name lookup for each source change.
    And it'd have to do that for Metal too. So... again, what's the difference? If using GPU addresses was that much faster and a reasonable global abstraction, surely Apple would have used them. And yet they didn't.

    Maybe there's a reason for that.

  6. #6
    Advanced Member Frequent Contributor
    Join Date
    Apr 2009
    Posts
    612
    Quote Originally Posted by Alfonse Reinheart View Post
    Any evidence to back that up? The whole point of vertex_attrib_binding is to avoid that, to have one VAO per format and just change buffer bindings, just like D3D. So which applications actually use VAB and have lots of VAOs with the same formats?
    The use case you are describing is this:
    • Application creates one VAO per attribute layout
    • Application changes buffer bindings per mesh (essentially)

    Doing that means that there are (usually) two name chases per mesh: one for the index buffer and one for the attributes. In contrast, if one has one VAO per mesh, there is only one name chase. This is what I mean by picking your poison.

    And it'd have to do that for Metal too. So... again, what's the difference? If using GPU addresses was that much faster and a reasonable global abstraction, surely Apple would have used them. And yet they didn't.

    Maybe there's a reason for that.
    Some bits about Metal: first, there is no name lookup associated with MTLBuffer, but there is still a pointer chase (so half the pain). Second, Metal is iOS-only, which means iPad/iPod-class devices only. The GPU is a power-limited device. Moreover, the GPU is a tile-based renderer, and a draw call does two things underneath: it issues a binning command AND adds commands to another buffer for the eventual tile walk (which is done once per Metal command buffer). These operations are non-trivial to perform and require the driver to do quite a bit of work. In contrast, on the PC, NVIDIA and AMD graphics cards are hulking beasts with their own dedicated memory, commanded over the PCI bus by the driver. For them, the driver's "work" of converting from API calls to command-buffer bytes should be tiny and ideally trivial. The bindings and the names make it less trivial and become a higher percentage of the work (whereas for iOS the name chase is just the beginning of all the work needed to issue a draw command, AND those are unified-memory devices anyway, so a cache miss is almost guaranteed regardless).

    Though at the end of the day, half the pain of GL could have been avoided if it had used pointers directly for its objects instead of the GLuint name thing. That alone doubles the pain.

  7. #7
    Senior Member OpenGL Lord
    Join Date
    May 2009
    Posts
    5,974
    Quote Originally Posted by kRogue View Post
    The use case you are describing is this:
    • Application creates one VAO per attribute layout
    • Application changes buffer bindings per mesh (essentially)

    Doing that means that there are (usually) two name chases per mesh: one for the index buffer and one for the attributes. In contrast, if one has one VAO per mesh, there is only one name chase. This is what I mean by picking your poison.
    In what way do attributes involve a "name chase"? Any programmer who calls glGetAttribLocation() for every attribute they use at render time deserves the performance they get. That's the only name chase I can think of.

    glBindVertexBuffers only needs to "name chase" the buffer names, since for the majority of the time, you're reusing the same VAO. Thus, VAO binds will be quite rare.

    I would also point out that this is exactly how things work in the D3D world, modulo OpenGL object "name chasing". Which again, only happens to actual OpenGL objects: buffer objects and the occasional VAO.
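
    To illustrate that cost model, a small sketch of the per-mesh path described above: the VAO stays bound and only buffer names are handed to the driver, here via the multi-bind entry point from GL 4.4. The two-buffer layout and the parameter names are just assumptions for the example.
    Code:
    #include <GL/glcorearb.h>

    /* Rebind two vertex buffers (say positions and normals) on binding points
       0 and 1 of the currently bound VAO in a single call (GL 4.4 multi-bind). */
    static void rebind_mesh_buffers(GLuint positionsVbo, GLuint normalsVbo)
    {
        const GLuint   buffers[2] = { positionsVbo, normalsVbo };
        const GLintptr offsets[2] = { 0, 0 };
        const GLsizei  strides[2] = { 3 * sizeof(GLfloat), 3 * sizeof(GLfloat) };

        /* The only object-name lookups here are the two buffer names;
           the VAO itself is not rebound. */
        glBindVertexBuffers(0, 2, buffers, offsets, strides);
    }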

    Quote Originally Posted by kRogue View Post
    Though at the end of the day, half the pain of GL could have been avoided if it had used pointers directly for its objects instead of the GLuint name thing. That alone doubles the pain.
    For all of NVIDIA's allegations of the "pain" of OpenGL name lookup performance-wise... typical use of the OpenGL API is still faster than D3D11 for the same scene. Valve discovered that one when they ported the Source engine to it, and they weren't using NVIDIA's gimmicks to get that performance win.

    So there's no evidence that name lookup is a significant problem, relative to other APIs. That's not to say that performance wouldn't be better without it. But to say that OpenGL as it stands would be considered "hopelessly ancient" or something equally ridiculous for not having bindless vertex arrays is laughable.

  8. #8
    Senior Member OpenGL Guru Dark Photon's Avatar
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    4,321
    Quote Originally Posted by Alfonse Reinheart View Post
    For all of NVIDIA's allegations of the "pain" of OpenGL name lookup performance-wise... typical use of the OpenGL API is still faster than D3D11 for the same scene. Valve discovered that one when they ported the Source engine to it, and they weren't using NVIDIA's gimmicks to get that performance win.

    So there's no evidence that name lookup is a significant problem, relative to other APIs. That's not to say that performance wouldn't be better without it. But to say that OpenGL as it stands would be considered "hopelessly ancient" or something equally ridiculous for not having bindless vertex arrays is laughable.
    Probably true, but even so, what's your point? I too would welcome our new bindless overlords.

    I have used bindless attributes and index lists, without shader_buffer_load (to your point above), and with and without VAOs, and I'm here to tell you it can make a big difference in GPU throughput. It obviates the need to spend more time optimizing when such a simple code change gives you the performance boost you need, and in doing so it eliminates much needless overhead in the GL driver (binds, name lookups, making buffers resident... whatever!). "What" is not important to us; that it can result in such huge increases in performance is.

    Performance relative to D3D isn't forward thinking. Performance relative to what's possible? Now "that's" what's interesting!
    Last edited by Dark Photon; 02-05-2015 at 05:31 PM.

  9. #9
    Senior Member OpenGL Lord
    Join Date
    May 2009
    Posts
    5,974
    Probably true, but even so, what's your point?
    I was arguing against Gedolo2's position that bindless vertex arrays would "make the OpenGL 4.x API compete well with current alternatives," that they are essential to keep OpenGL from "coming across as hopelessly ancient," and that they would "provide a good stepping stone between older OpenGL versions and glNext."

    There is not one shred of evidence for any of those positions. OpenGL already competes just fine with "current alternatives". It does not "come across as hopelessly ancient." And, given Apple Metal as an idea of what we might see from glNext, there is no reason to believe that bindless vertex arrays would be used in glNext, thus not making it a "stepping stone" towards anything.

    In short, my overall point is that his point is founded on ignorance. I make no claim as to whether OpenGL 4.6 should use bindless vertex arrays, only that his reasons for doing so don't hold water.

    However, if you really want to have that discussion:

    The problem with bindless vertex arrays can readily be seen by the fact that they're still an NVIDIA extension. Allow me to explain that.

    NVIDIA came out with this feature a bit less than 5 years ago. Since that time, it has progressed precisely nowhere in the OpenGL ecosystem. Oh, plenty of people use it. But nobody but NVIDIA implements it.

    Now, you could argue that it's an NVIDIA extension; of course nobody else implements it. Except that's not true. NV_texture_barrier is widely implemented, even on Intel hardware. Hell, Apple enforces compliance with it. Both NV_texture_barrier and NV_vertex_buffer_unified_memory were released in the same batch of extensions. So clearly, being an NVIDIA extension is not, by itself, what has stopped anyone from implementing NV_vertex_buffer_unified_memory.

    You could try to claim that the extension is covered by patents of some kind. Even if that's true, it is just as true for ARB_bindless_texture. Yet Khronos and NVIDIA worked out whatever legal issues were necessary to make that a widely implemented extension for hardware that supports it. So again, that doesn't jibe, unless NVIDIA wants bindless textures supported elsewhere but not bindless vertex arrays.

    And if that's the case, then you lose by default, since Khronos couldn't make it core functionality even if they wanted to.

    Then there's ARB_bindless_texture. NV_bindless_texture became ARB_bindless_texture in about 6-9 months. It's been almost five years since NV_vertex_buffer_unified_memory, yet here we are. ARB_vertex_attrib_binding was in fact the perfect time to introduce the feature, since it was already rearranging how VAOs and vertex format state work. Even with that perfect opportunity, the ARB didn't take it.

    Lastly, there is OpenGL 4.5. That was a rather feature-light release, and the most feature-rich part of the set came from OpenGL ES 3.1. DSA is important, to be sure. But the ARB's workload for the 4.5 release was the lowest it has been since GL 3.1, and that was a half-year cycle. Clearly, the ARB could have devoted time to bringing out bindless vertex arrays, but they didn't.

    Since bindless vertex arrays haven't happened by now, one of the following must be true:

    1. The ARB has evaluated and rejected the concept.
    2. NVIDIA wants it to remain proprietary.
    3. NVIDIA and the ARB are working on the concept behind the scenes, yet haven't brought it to market after 5 years.

    #3 seems rather unlikely. And both #1 and #2 mean that it's just not gonna happen.

    So it's not gonna happen.

    That being said, stranger things have happened. I would have expected hell to freeze over before we got explicit_uniform_locations, considering that people were asking for that before OpenGL 2.0. So it's not beyond belief that it could happen.

    Just highly unlikely.

    As for the merits of the technology on its own, yes, it can help save performance. But the biggest reason I suspect #1 happened is because NV_vertex_buffer_unified_memory uses GPU addresses.

    Bindless textures don't really use GPU addresses. What you get there are 64-bit handles that reference the texture. They may be integers, but the value of those integers is completely opaque and meaningless. Oh sure, they probably are GPU memory addresses that reference a data structure of some kind. But you don't treat them that way; you treat them like references in C++. You can store a reference. You can return a reference. But the only thing you can do with a reference is access the object it refers to.
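
    For comparison, a minimal sketch of the ARB_bindless_texture flow, where the 64-bit value is treated purely as an opaque handle. It assumes an existing, fully specified texture, and the helper names are made up for this post.
    Code:
    #include <GL/glcorearb.h>   /* ARB_bindless_texture entry points come from glext.h / your loader */

    /* Obtain an opaque 64-bit handle for a texture and make it resident.
       The handle's bit pattern is never inspected or offset. */
    static GLuint64 make_texture_handle(GLuint tex)
    {
        GLuint64 handle = glGetTextureHandleARB(tex);
        glMakeTextureHandleResidentARB(handle);
        return handle;
    }

    /* The handle is only stored and passed along for the shader to dereference;
       no arithmetic is ever performed on it. */
    static void upload_handle(GLint uniformLocation, GLuint64 handle)
    {
        glUniformHandleui64ARB(uniformLocation, handle);
    }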

    By contrast, unified-memory addresses are pointers. You perform pointer arithmetic on them; indeed, you're required to do so by the glBufferAddressRangeNV call. It takes a 64-bit address and a size; any offsetting ought to have been done by the user.
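
    To make that contrast concrete, a sketch of the client-side pointer arithmetic the NV API expects when sourcing an attribute from a sub-range of one big buffer. The base address would be queried as in the earlier sketch; the offset and size parameters are hypothetical.
    Code:
    /* The application offsets the raw GPU address itself before handing it over. */
    static void set_attrib_subrange(GLuint64EXT baseAddr, GLintptr meshOffset,
                                    GLsizeiptr meshSize)
    {
        glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0,
                               baseAddr + (GLuint64EXT)meshOffset, meshSize);
    }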

    If the functionality instead returned an opaque handle (which again could be a GPU address), and glBufferAddressRangeNV took a handle plus an offset, I think that would have gone a long way into making the functionality more acceptable. And I don't think NVIDIA would like the idea of sacrificing performance by making people do two memory fetches and an add for each buffer non-bind (one fetch for the handle, one for the offset).

    Not only that, returning actual GPU addresses has to be some kind of security problem. And with all of the "robustness" stuff coming out in the last few years, I don't think the ARB is going to start sanctioning such things. At least with 64-bit opaque handles, you can introduce a simple bitshift (perhaps randomized at application startup time) to do some basic obfuscation to the handle. It's hardly the most secure thing in the world, but it's something.

    Performance relative to D3D isn't forward thinking.
    Forward thinking is for glNext, not OpenGL. And I'd still lay odds that you won't see GPU memory addresses anywhere in glNext either. Not outside of an NVIDIA-only extension, at any rate.

  10. #10
    Advanced Member Frequent Contributor
    Join Date
    Apr 2009
    Posts
    612
    Quote Originally Posted by Alfonse Reinheart View Post

    There is not one shred of evidence for any of those positions. OpenGL already competes just fine with "current alternatives". It does not "come across as hopelessly ancient." And, given Apple Metal as an idea of what we might see from glNext, there is no reason to believe that bindless vertex arrays would be used in glNext, thus not making it a "stepping stone" towards anything.
    No; OpenGL does not compete particularly well against other 3D APIs. The tools are much poorer than for other APIs. Driver compliance is all over the place. Driver feature sets are also all over the place. I wish that were not the case, but it is. GL has more warts than a warthog.

    By contrast, unified-memory addresses are pointers. You perform pointer arithmetic on them; indeed, you're required to do so by the glBufferAddressRangeNV call. It takes a 64-bit address and a size; any offsetting ought to have been done by the user.

    If the functionality instead returned an opaque handle (which again could be a GPU address), and glBufferAddressRangeNV took a handle plus an offset, I think that would have gone a long way into making the functionality more acceptable. And I don't think NVIDIA would like the idea of sacrificing performance by making people do two memory fetches and an add for each buffer non-bind (one fetch for the handle, one for the offset).

    Not only that, returning actual GPU addresses has to be some kind of security problem. And with all of the "robustness" stuff coming out in the last few years, I don't think the ARB is going to start sanctioning such things. At least with 64-bit opaque handles, you can introduce a simple bitshift (perhaps randomized at application startup time) to do some basic obfuscation to the handle. It's hardly the most secure thing in the world, but it's something.
    Um, the addresses are essentially per-context or per-context-share-group. For each context in which one wants to use a buffer, one needs to make it resident. The address is not absolute; it is virtual and all those magicks. (I admit I do not have hard proof, but it makes sense.)


    NVIDIA came out with this feature a bit less than 5 years ago. Since that time, it has progressed precisely nowhere in the OpenGL ecosystem. Oh, plenty of people use it. But nobody but NVIDIA implements it.
    One of the difficulties with NVIDIA's NV_vertex_buffer_unified_memory is that it builds on NV_shader_buffer_load. The extension NV_shader_buffer_load allows pointer (read) access from shaders to arbitrary buffers on the GPU. Most GPUs are SIMD things; NVIDIA's are different (they call it SIMT). Scattered memory access is really awful, painful, hard, and terribly inefficient on SIMD architectures, whereas on NVIDIA's it is not as big a disaster. On most devices, when a fragment shader (or really any shader stage) is invoked, many elements are processed in one invocation. To be precise and specific: on Intel hardware, for example, a pixel shader can run in SIMD8 or SIMD16 mode, which means one "pixel shader invocation" does the computation/work for 8 (respectively 16) pixels. Implementing scattered reads is (usually) suicide on SIMD architectures. This is why, I believe, we will not see NV_shader_buffer_load (or NV_shader_buffer_store) from anyone but NVIDIA.


    Now, the above does not mean one cannot still use the address to feed the vertex buffers directly; that is essentially a different extension. Here the royal pain is that there is also the VAO, and it just makes the whole thing a giant mess. Moreover, on some hardware the GPU addresses are not 64-bit things anyway; they are... something different and potentially quite wonky.

    In truth, the ultimate question about the viability of something like NV_vertex_buffer_unified_memory for setting vertex attribute sources on other architectures is what kind of numbers the GPU requires to access a buffer. Is it just a single number? Can an offset be added to that number to do the right thing? Or does it need to be a special number and an offset separately? Or does accessing a buffer object require other numbers when pushed to the GPU (like a size for robust access)? These questions are highly specific to the hardware in question and even to the generation of that hardware. We could have a "handle" which is really just a pointer to those needed numbers, but the API implementation is then still a handle and an offset. It saves the name-lookup idiocy but still requires a pointer fetch.
