Thread: Toward Version 4.3

  1. #41
    Junior Member Regular Contributor
    Join Date: Jan 2004
    Location: Czech Republic, EU
    Posts: 190
    NVIDIA's bindless APIs seem to be clear proof that OpenGL is flawed and cannot perform well as far as overall CPU overhead is concerned. See:

    Quote Originally Posted by GL_NV_bindless_texture
    The ability to access textures without having to bind and/or re-bind them is similar to the capability provided by the NV_shader_buffer_load extension that allows shaders to access buffer objects without binding them. In both cases, these extensions significantly reduce the amount of API and internal GL driver overhead needed to manage resource bindings.
    Such hacks would not be needed if the design of OpenGL allowed drivers to have low enough CPU overhead in the first place.
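    For concreteness, the usage pattern the extension enables looks roughly like this (a minimal sketch using the entry points named in the spec; `tex` and `location` are placeholders):

    Code :
    /* Query a 64-bit handle once, make it resident, and never bind again. */
    GLuint64 handle = glGetTextureHandleNV(tex);
    glMakeTextureHandleResidentNV(handle);
    /* Hand the handle straight to the shader; no glBindTexture per draw. */
    glUniformHandleui64NV(location, handle);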
    (usually just hobbyist) OpenGL driver developer

  2. #42
    Senior Member OpenGL Pro
    Join Date: Apr 2010
    Location: Germany
    Posts: 1,129
    Quote Originally Posted by Eosie
    OpenGL is flawed and cannot perform well as far as the overall CPU overhead is concerned.
    OpenGL is flawed. That is true. But you cannot state in general that OpenGL implementations cannot perform well in regard to CPU overhead, because the CPU overhead is highly dependent on the use case. For instance, VAOs are a great way to reduce CPU overhead, and with a clever buffer manager, instancing, and multi-draw calls (all of which are of course backed by the current API, as sketched below) you can already reduce bindings and draw calls to a minimum. It's the same for textures: if you don't use textures at all, i.e. you render everything procedurally, you don't have any CPU overhead in that regard.
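    A minimal sketch of what I mean (one VAO bind, then single calls that issue many draws or many instances; `vao`, `counts`, `offsets`, `drawCount`, `indexCount` and `instanceCount` are placeholders):

    Code :
    glBindVertexArray(vao);                      /* one binding for all attribute state */
    /* many meshes, one API call */
    glMultiDrawElements(GL_TRIANGLES, counts, GL_UNSIGNED_INT,
                        (const void* const*)offsets, drawCount);
    /* or: many copies of one mesh, one API call */
    glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0, instanceCount);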

    Quote Originally Posted by Eosie
    Such hacks would not be needed if the design of OpenGL allowed drivers to have low enough CPU overhead in the first place.
    It's not the GL that allows drivers to incur low overhead; it's the hardware. OpenGL and the drivers that implement the spec merely expose hardware features at a higher level. The extension is nothing more than a mapping of current hardware features onto the GL. That's not a hack; it's exactly what the extension mechanism has existed for all along. Of course, this 'hack' could and should be core, but as this thread shows, very few of us believe it'll happen anytime soon.

  3. #43
    Senior Member OpenGL Guru
    Join Date: May 2009
    Posts: 4,948
    NVIDIA's bindless APIs seem to be a clear proof that OpenGL is flawed and cannot perform well as far as the overall CPU overhead is concerned.
    No, NVIDIA's bindless APIs are clear proof that OpenGL can be improved with regard to CPU caching performance. "Cannot perform well as far as the overall CPU overhead is concerned" is arrant nonsense; OpenGL is significantly better than D3D in this regard, as it allows the driver to do draw-call marshalling.

    Just check out NVIDIA's 10,000 draw call PDF. Page 14 clearly shows that NVIDIA's OpenGL implementation has significantly better CPU batch behavior, to the point where batch size is clearly not the dominating factor in performance.

    Bindless is, first and foremost, about exploiting NVIDIA hardware, providing access to lower-level aspects of what their hardware can do. Obviously lower-level APIs will be faster than higher-level ones.

    Of course, this 'hack' could and should be core
    No it shouldn't.

    Bindless as far as uniforms are concerned is very hardware-specific. You're not going to be able to generalize that much beyond NVIDIA's hardware. Bindless vertex rendering might be possible, but even then, it's iffy. It's mostly about making buffers resident, and what restrictions you place on that. In NVIDIA-land, a resident buffer can still be mapped and modified; do you really want to force everyone to be able to do that?

    In either case, we should not be giving people integer "GPU addresses" that they offset themselves. That's way too low-level for an abstraction. It should simply be a name for a buffer that has been made resident.
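    For reference, this is roughly what the current NV approach looks like (a sketch using entry points from NV_shader_buffer_load and NV_vertex_buffer_unified_memory; `buf` and `size` are placeholders). Note the raw 64-bit GPU address handed back to the application:

    Code :
    GLuint64EXT addr;
    glBindBuffer(GL_ARRAY_BUFFER, buf);
    glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);
    glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &addr);
     
    glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);
    glVertexAttribFormatNV(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(GLfloat));
    /* the application does its own pointer math from here on */
    glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0, addr, size);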

  4. #44
    Junior Member Regular Contributor
    Join Date: Jan 2004
    Location: Czech Republic, EU
    Posts: 190
    Quote Originally Posted by Alfonse Reinheart View Post
    No, NVIDIA's bindless APIs is clear proof that OpenGL can be improved with regard to CPU caching performance. "cannot perform well as far as the overall CPU overhead is concerned" is errant nonsense; OpenGL is significantly better than D3D in this regard, as it allows the driver to do draw call marshalling.

    Just check out NVIDIA's 10,000 draw call PDF. Page 14 clearly shows that NVIDIA's OpenGL implementation has significantly better CPU batch behavior, to the point where batch size is clearly not the dominating factor in performance.
    Oh please, are you kidding me? It's a PDF comparing GeForce 2/3/4/FX performance on some ancient OpenGL implementation against DirectX 8 (I guess?) on some ancient version of Windows. You would be very surprised how much APIs, OS driver interfaces, drivers, and hardware have evolved since then. It's a whole new world today.

    Anyway, the point is that the existence of the "bindless" extensions shows how desperate the driver developers are. They are obviously very aware that the whole ARB won't agree unanimously on a complete OpenGL rewrite, so they had to find another way. I don't blame them; it's logical. However, if the OpenGL API could be reworked such that it achieves at least 80% of the performance increase that the bindless APIs advertise, I'd call it a huge win.
    (usually just hobbyist) OpenGL driver developer

  5. #45
    Senior Member OpenGL Guru
    Join Date: May 2009
    Posts: 4,948
    You would very surprised how much APIs, OS driver interfaces, drivers, and hardware have evolved since then. It's a whole new world today.
    And which of these changes would suddenly cause cache miss rates to increase? Obviously, if something has changed that would cause performance to degrade, you should be able to point to exactly what it was.

    Obviously, modern CPUs outstrip the speed of memory to feed them by significantly more now than they did then. Thus cache misses hurt proportionately more nowadays. But that doesn't invalidate the previous data. It simply means that there are additional concerns besides batch size.

    Batches still count, especially in D3D land.

    The fact that bindless provides a remarkable speedup is not, by itself, evidence that OpenGL performance has degraded. After all, for all you know, that level of speedup could have been possible back then too, if bindless vertex rendering had been implemented.

    Anyway, the point is the existence of the "bindless" extensions shows how desperate the driver developers are.
    By your logic, the existence of NV_path_rendering would mean that driver developers are "desperate" to get 2D rendering via OpenGL.

    The problem with your claim is the incorrect assumption that NVIDIA == driver developers. If AMD and/or Intel had their own competing "bindless" specs out there, your claim might hold some weight. But as it stands, no; the absolute best you can conclude is that NVIDIA is "desperate".

    Another point against this is that NVIDIA has shown a clear willingness to work with others on EXT extensions to expose shared hardware features, like EXT_shader_image_load_store. Indeed, EXT_separate_shader_objects was basically all NVIDIA's spec, with a bit of consulting with the outside world (the ARB version is what happens when a committee comes along and ruins something that wasn't terribly great to begin with). And yet, both of those are EXT extensions, not NV ones.

    Coupled with the possible patent on bindless textures, it's much more likely that NVIDIA is simply doing what NVIDIA does: expose hardware-specific features via extensions. That's what they've always done, and there's little likelihood that they're going to stop anytime soon. Bindless isn't some "desperate" act to kick the ARB in the rear or circumvent it. It's just NVIDIA saying "we've got cool proprietary stuff." Like they always do.

  6. #46
    Senior Member OpenGL Pro
    Join Date: Jan 2007
    Posts: 1,217
    D3D10 and 11 no longer have the old draw-call overhead; that's been dead for positively ages. D3D9 on Vista or 7 shares this performance characteristic as well.

    Every hardware vendor's marketing department would prefer you to be using their proprietary stuff. That was the entire motive behind AMD's fit of "make the API go away" a while back. It's a trifle disingenuous to make AMD look blameless in this.

    The key thing here is not NV bindless vs. some other hypothetical implementation of the same; the key thing is addressing the bind-to-modify madness that has afflicted OpenGL since day one (sketched below). There's absolutely nothing proprietary or even hardware-dependent about that: D3D doesn't have bind-to-modify, AMD hardware can do it, Intel hardware can do it.
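    (To be clear about terms: "bind-to-modify" is the pattern where you must bind an object just to change it, disturbing whatever was previously bound. A sketch, with `tex`, `w`, `h` and `pixels` as placeholders:)

    Code :
    GLint prev;
    glGetIntegerv(GL_TEXTURE_BINDING_2D, &prev);   /* save the current binding */
    glBindTexture(GL_TEXTURE_2D, tex);             /* bind purely to modify */
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, pixels);
    glBindTexture(GL_TEXTURE_2D, prev);            /* restore what was there */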

    It's not good enough to release a flawed, wonky, or flat-out-insane first revision of a spec (buffer objects, GL2) and hope to patch it with extensions later on. Even in the absence of significant new hardware features (and it is by no means safe to predict that there won't be any), future versions of GL must focus on removing barriers to productivity.

  7. #47
    Junior Member Regular Contributor
    Join Date: May 2012
    Posts: 100
    API design in general should be an abstraction driven by usability, rather than by how the hardware can support some feature or what happens to be optimal for the hardware. Hardware changes, and it's unpredictable how it will change. API designers should focus on how the API will be used rather than on how it is implemented. Usability and elegance should take first priority here.
    Anyway, we can wait until September this year and see. There will be some big changes.

  8. #48
    Senior Member OpenGL Guru
    Join Date: May 2009
    Posts: 4,948
    Every hardware vendor's marketing department would prefer you to be using their proprietary stuff. That was the entire motive behind AMD's fit of "make the API go away" a while back. It's a trifle disingenuous to make AMD look blameless in this.
    It wasn't a "fit"; it was a comment. A "fit" would have been many comments over a long period of time. And considering the fact that they didn't do anything about it, it obviously wasn't a grave concern for them. Simply stating a fact.

    Actions speak louder than words. And AMD's actions aren't saying much.

    The key important thing here is not NV bindless vs some other hypothetical implementation of same - the key important thing is addressing the bind-to-modify madness that has afflicted OpenGL since day one.
    But... that has nothing to do with the problem that NVIDIA's bindless solves.

    Bindless doesn't get its performance from removing bind-to-modify. Indeed, it doesn't remove this at all. The problem isn't bind-to-modify. The problem is that binding an object for rendering requires a lot of overhead due to the nature of the objects themselves. Objects in OpenGL are not pointers to driver-created objects. They're references to pointers to driver-created objects. That's two levels of indirection and thus more opportunities for cache misses.

    Bindless goes from 2 indirections to zero by directly using a GPU address. D3D has (possibly) one fewer indirection. Removing bind-to-modify would go from 2 indirections to... 2. Because it doesn't address the number of indirections.
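    In code terms, the lookup chain I'm describing is something like this (a purely hypothetical sketch; `ctx`, `DriverBuffer` and `hash_table_lookup` are made up for illustration, and no driver is literally written this way):

    Code :
    typedef struct {
        GLuint64 gpu_address;              /* where the storage actually lives */
        /* ... other per-object state ... */
    } DriverBuffer;
     
    DriverBuffer *obj = hash_table_lookup(ctx->buffers, name);  /* indirection 1 */
    GLuint64 addr = obj->gpu_address;                           /* indirection 2 */
    /* bindless hands the shader 'addr' directly: zero lookups at draw time */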

    To reduce the number of indirections, you have to deal with the objects themselves, not how you modify them. OpenGL objects would have to stop being numbers and start being actual pointers. You would have to be forbidden to do things like reallocate texture storage (which we already almost completely have with ARB_texture_storage), reallocate buffer object storage (ie: being able to call glBufferData more than once with a different size and usage hint) and so forth.
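    We already have a taste of this on the texture side (a sketch; `levels`, `width` and `height` are placeholders):

    Code :
    /* ARB_texture_storage: allocate immutable storage for the bound texture */
    glTexStorage2D(GL_TEXTURE_2D, levels, GL_RGBA8, width, height);
    /* A second glTexStorage2D on the same texture is an error, unlike
       glTexImage2D, which can silently reallocate. Buffer objects have no
       equivalent: glBufferData can always be called again with a new size. */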

    The fact that you'd be using glNamedBufferData instead of glBufferData to reallocate buffer object storage does nothing to resolve the underlying problem. The driver has to check to see if the buffer object has changed since the last time you talked about it. It has to resolve the buffer object into a GPU pointer, which also may mean DMA-ing it to video memory. And so forth.

    These checks have nothing to do with bind-to-modify. Getting rid of bind-to-modify will not make OpenGL implementations faster.

  9. #49
    Senior Member OpenGL Pro
    Join Date: Jan 2007
    Posts: 1,217
    Quote Originally Posted by Alfonse Reinheart View Post
    Getting rid of bind-to-modify will not make OpenGL implementations faster.
    Bind-to-modify pollutes GPU and driver caches; it pollutes state-change filtering; it affects design decisions in your program; it pollutes VAOs (see the sketch below). This isn't some theoretical "more opportunities for cache misses" thing, this is "cache misses do happen, along with all of this other unnecessary junk". An extra layer of indirection is down in the noise of any performance graph compared to this.
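    The VAO case in one concrete sketch (`vao`, `scratch`, `size` and `data` are placeholders): with a VAO bound, binding an element buffer merely to update it silently rewrites the VAO's index-buffer attachment.

    Code :
    glBindVertexArray(vao);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, scratch);      /* bind to modify... */
    glBufferSubData(GL_ELEMENT_ARRAY_BUFFER, 0, size, data);
    /* ...and 'vao' now records 'scratch' as its index buffer */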

    We're not talking about drawing teapots on planar surfaces here. We're not talking about loading everything once at startup and using static data for the entire program here. Real-world programs are very dynamic. Objects move into and out of the scene, transient effects come and go, on-screen GUI elements are updated and change, and this happens in every single frame.

    Getting rid of bind-to-modify will make GL programs go faster.

  10. #50
    Senior Member OpenGL Guru
    Join Date: May 2009
    Posts: 4,948
    Bind-to-modify pollutes GPU and driver caches
    I cannot imagine a sane implementation of OpenGL that actually causes GPU changes from the act of binding an object. Unless it's a sampler object, and even then it's kinda iffy. Driver caches being polluted is far more about the extra indirection.

    If I do this (with a previously created `buf`):

    Code :
    glBindBuffer(GL_UNIFORM_BUFFER, buf);
    glBufferSubData(GL_UNIFORM_BUFFER, ...);

    The first line will set the currently bound buffer object to refer to `buf`. That involves changing an effectively global value. There may be some memory reads to check for errors, and the actual object behind `buf` may need to be tracked down and allocated if it doesn't exist. So, that's a hash-table access to get the actual object behind `buf`, followed by some reads of the buffer's data (to see if it exists), followed by a memory write to the global value.

    That's 3 memory accesses. And let's assume they're all uncached. So 3 cache misses.

    The second line will do some uploading to the buffer. So we fetch from the global value the buffer object. We'll assume that this implementation was not written by idiots, so the global value being written was not `buf`, but a pointer to the actual internal buffer object. So we access the global pointer value, get the buffer's GPU and/or CPU address, and do whatever logic is needed to schedule an upload.

    That's 2 memory accesses (outside of the scheduling logic). However, fetching the global pointer value is guaranteed to be a cached access, since we just wrote that value in the `glBindBuffer` call. Also, we may have brought the buffer object's data into the cache when we did those bind-time validation checks. So worst-case, this is only 1 cache miss. Best case, 0 misses, but let's say 1.

    Total cache misses: 4. Total number of different cache lines accessed: 4.

    Now, let's look at this:

    Code :
    glNamedBufferSubDataEXT(buf, ...);

    So, first we need to resolve `buf` into a buffer object. That requires accessing our hash table to turn it into a pointer to our internal buffer object data. Following this, we must check to see if this is a valid object. After that, we have to get the buffer's GPU and/or CPU address, and do whatever logic is needed to schedule an upload. That's 3 memory accesses.

    Total cache misses: 3. Total number of different cache lines accessed: 3.

    So, the difference is 4:3 in both number of cache misses and how many separate lines are accessed. Fair enough.

    Now, let's look at what happens when the hash table is removed and we deal directly with opaque pointers.

    The first one goes from 4 cache misses down to 3. The second goes from 3 down to 2. So, that "extra indirection" seems to be a pretty significant thing, as removing it reduced our number of cache misses by 25% in the bind-to-modify case and by 33% in the DSA case.

    DSA alone only reduces cache misses by 25%.

    But wait; since we're dealing with "GL programs", we need to consider how often these cache misses will happen. How often will a bind point actually not be in the cache?

    Obviously the first time you use a bind point in each rendering frame, it will not be in the cache. But after that? Since so many functions use these bind points, the number of cache misses is probably not going to be huge.

    What about the cache hit/miss rate for the indirection, the hash table itself? That is in fact rather worse. Every time you use a new object (by "new", I mean unused this frame), that's a likely cache miss on the hash table. That's going to be pretty frequent.

    As you say, "Real-world programs are very dynamic." You're going to have thousands of textures used in a frame. You may have a few dozen buffer objects (depending on how you use them). You may have hundreds of programs.

    So which miss rate is going to be higher: new object within this frame? Or bind points?

    My money is on new objects. So getting rid of that indirection ultimately gets you more. So this:

    An extra layer of indirection is down in the noise of any performance graph compared to this.
    doesn't seem to bear up to scrutiny.

    Real-world programs are very dynamic. Objects move into and out of the scene, transient effects come and go, on-screen GUI elements are updated and change, and this happens in every single frame.
    This seems like a non sequitur. Objects moving out of a scene is simply a matter of which programs and buffers you bind to render with. It has nothing to do with what object state gets modified (outside of uniforms set on those programs, but I'll get to that in a bit).

    GUI elements and particle systems are a matter of buffer streaming. The predominant performance bottleneck associated with that (when you manage to find the fast path) is in the generation of the new vertex data and in the uploading of it to the GPU. The extra time for binding that buffer in order to invalidate and map it is irrelevant next to issuing a DMA. So those don't count.
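    (The fast path in question is typically something like this sketch, with `streamBuf`, `size` and `newVertices` as placeholders:)

    Code :
    glBindBuffer(GL_ARRAY_BUFFER, streamBuf);
    /* invalidate (orphan) the old storage, then write this frame's data */
    void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
                                 GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
    memcpy(ptr, newVertices, size);
    glUnmapBuffer(GL_ARRAY_BUFFER);
    /* the memcpy and the resulting DMA dominate; the bind is noise */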

    As for uniforms within programs, these are set all the time. Of course it's not that you necessarily reset all the uniforms every time you render a new object. The point is that uniform setting is a constant operation, something you do all the time while rendering.

    Indeed, uniforms are probably the object state most likely to be modified when rendering. And that last part is the key: when you change uniform state, it is only because you want to render with that program.

    It's not bind-to-modify with programs; it's bind-to-modify-and-render. Modifying uniforms without binding the program doesn't change the fact that when you change uniforms, you're also almost certainly going to immediately render with it. Which means you need to bind the program. So you've lost nothing with bind-to-modify for programs; you were going to bind it anyway.
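    Even with a DSA-style uniform API this stays true (a sketch; `prog`, `loc`, `value` and `count` are placeholders):

    Code :
    glProgramUniform1f(prog, loc, value);   /* GL 4.1: no bind needed to modify... */
    glUseProgram(prog);                     /* ...but you bind to draw anyway */
    glDrawElements(GL_TRIANGLES, count, GL_UNSIGNED_INT, 0);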

    it pollutes state-change filtering
    In what way? No OpenGL implementation would actually trigger any GPU activity based solely on binding. Except for sampler objects, and even then, probably not.

    The only time it would affect filtering is if you are in the middle of the render loop, and you do something stupid. Like bind an object, modify it, and then bind something else over top of it.

    And I can't think of a sensible reason to do this in a real program.

    Getting rid of bind-to-modify will make GL programs go faster.
    There is strong evidence that bindless has a significant effect on GL programs. Where is your evidence that bind-to-modify has a similar effect?

    Yes, it will probably have an effect. But I seriously doubt it's going to be anywhere near what you get with bindless.
